Flaky tests #731

Open

melonwater211 opened this issue Jul 21, 2021 · 0 comments
Introduction

Several tests, including test_composite_trustworthiness in umap/tests/test_composite_models.py, appear to be flaky when all seed-setting code (e.g. np.random.seed(0) or tf.random.set_seed(0)) is commented out, or when a random value is passed to a seed-setting function (e.g. sklearn.utils.check_random_state()).
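For concreteness, here is a minimal sketch of the second kind of change (illustrative only; the actual seed-setting lines vary from test to test):

```python
import numpy as np
from sklearn.utils import check_random_state

# Original, deterministic form (illustrative seed):
# rng = check_random_state(42)

# Randomized form used to expose flakiness: seed with a fresh random value.
rng = check_random_state(np.random.randint(np.iinfo(np.int32).max))
```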

For instance, at commit ae5255b, test_composite_trustworthiness failed ~12% of the time (over 500 runs), compared to 0% of the time (over 500 runs) when the seed-setting code was left intact.
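Failure rates like these can be reproduced by re-running the test many times and counting non-zero exit codes. A minimal sketch (the test path is from the repository; the runner script is ours):

```python
import subprocess

RUNS = 500
TEST = "umap/tests/test_composite_models.py::test_composite_trustworthiness"

# Run the test in a fresh process each time so no state leaks between runs.
failures = sum(
    subprocess.run(["pytest", "-q", TEST]).returncode != 0 for _ in range(RUNS)
)
print(f"failure rate: {failures / RUNS:.1%}")
```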

test_composite_trustworthiness tests the trustworthiness of combinations of UMAP models.

Motivation

Some tests can be flaky with high failure rates, but this flakiness goes undetected as long as the seeds are fixed. We are trying to stabilize such tests.

Environment

The tests were run using pytest 6.2.3 in a conda environment with Python 3.6.13. The OS used was Ubuntu 16.04.

Possible Solutions

One possible solution for reducing flakiness is to adjust the parameters used when fitting the models. We tried the following changes (a sketch of this kind of edit appears after the list).

Increasing n_epochs for both model1 and model2 from 50 to 70 reduced flakiness to ~6%.

Increasing n_epochs for both model1 and model2 from 50 to 100 reduced flakiness to ~2%.

Increasing n_epochs for only model1 from 50 to 100 reduced flakiness to ~2%.
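A sketch of what the n_epochs changes look like, assuming the models are constructed directly in the test (the surrounding code is abbreviated):

```python
from umap import UMAP

# Third variant above: raise n_epochs only for model1.
model1 = UMAP(n_epochs=100)  # was n_epochs=50
model2 = UMAP(n_epochs=50)   # unchanged
```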

Another possible solution is to relax the thresholds used in the assertions, since the current values may be unnecessarily conservative. Both assertions (i.e. line 27 and line 32 of the test file) are flaky; see the sketch after this list.

Decreasing the values checked in both assertions from .82 to .80 reduced flakiness to ~5%.

Decreasing the values checked in both assertions from .82 to .78 reduced flakiness to ~0%.

Increasing n_epochs for only model1 from 50 to 100 and decreasing the values checked in both assertions from .82 to .78 reduced flakiness to ~0%.
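And a sketch of the assertion change, assuming the test compares a trustworthiness score from sklearn.manifold.trustworthiness against a fixed threshold (the data, embedding, and variable names here are illustrative stand-ins):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness

# Stand-ins for the test's source data and the composite model's embedding:
# data with 2-D structure embedded in 10-D, recovered here by PCA.
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 10))
embedding = PCA(n_components=2).fit_transform(data)

trust = trustworthiness(data, embedding, n_neighbors=10)
assert trust >= 0.78  # was 0.82; the looser threshold removed observed flakiness
```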

These changes did not increase the runtime of the test significantly.

Please let us know if these solutions are feasible or if there are any other solutions that should be incorporated. If you are interested, we can send the details of other tests demonstrating similar behavior. We will be happy to raise a Pull Request to fix the tests and incorporate any feedback that you may have.
