Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove pytest-timeout #1317

Merged
merged 4 commits into from Jul 27, 2022
Merged

Remove pytest-timeout #1317

merged 4 commits into from Jul 27, 2022

Conversation

ravi-mosaicml
Copy link
Contributor

This PR removes pytest-timeout, all timeout markers from tests, and the --duration flag from running pytest.

The original intent of adding pytest-timeout was to a) incentivise tests to be fast, b) prevent stalled tests from hogging up system resources, and c) provide a heuristic that could be used to split in parallelize tests in the future. However, it has accomplished none of these goals.

a) Slow-running tests have just been giving longer timeouts to get it to pass ci/cd. Instead, slow-running tests should either be moved to run only on the daily build (i.e. @pytest.mark.daily), or be made faster.

b) If a test is stalling due to a distributed error, pytest-timeout doesn't help at all, since Composer will hang on a distributed barrier. Because of inconsistent CI/CD hardware that caused tests to sporadically exceed the timeout, we probably have used more CI/CD minutes re-running failed builds due to timeout failures compared to the amount of CI/CD time the timeout has saved. Instead, we should rely on the pytestTimeout variable in the Jenkinsfile -- this controls how long the k8s pod will run before dying, and it is not affected by torch.dist calls that could freeze the python process. It is set to 30 minutes right now, which is relatively short.

c) For when we want to split and parallelize tests, the junitxml reports already include the amount of time each test case took. So, we can split tests based on their historical execution time, rather than requiring an upper-bound be specified in the codebase.

Closes https://mosaicml.atlassian.net/browse/CO-769

This PR removes `pytest-timeout`, all timeout markers from tests, and the `--duration` flag from running pytest.

The original intent of adding pytest-timeout was to a) incentivise tests to be fast, b) prevent stalled tests from hogging up system resources, and c) provide a heuristic that could be used to split in parallelize tests in the future. However, it has accomplished none of these goals.

a) Slow-running tests have just been giving longer timeouts to get it to pass ci/cd. Instead, slow-running tests should either be moved to run only on the daily build (i.e. `@pytest.mark.daily`), or be made faster.

b) If a test is stalling due to a distributed error, pytest-timeout doesn't help at all, since Composer will hang on a distributed barrier. Because of inconsistent CI/CD hardware that caused tests to sporadically exceed the timeout, we probably have used _more_ CI/CD minutes re-running failed builds due to timeout failures compared to the amount of CI/CD time the timeout has saved. Instead, we shoudl rely on the `pytestTimeout` variable in the Jenkinsfile -- this controls how long the k8s pod will run before dying, and it is not affected by `torch.dist` calls that could freeze the python process. It is set to 30 minutes right now, which is relatively short.

c) For when we want to split and parallelize tests, the `junitxml` reports already include the amount of time each test case took. So, we can split tests based on their historical executation time, rather than requiring an upper-bound be specified in the codebase.

Closes https://mosaicml.atlassian.net/browse/CO-769
Copy link
Contributor

@hanlint hanlint left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. If we eventually do need to limit test durations between PRs and @daily, another common pattern is to have a marker that invokes a larger parameterized test matrix for each test.

Copy link
Member

@bandish-shah bandish-shah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as well.

At some point it would be good to add some sort of global timeout we can fallback on if something gets hung, but we can deal with that when we decide to revamp the test infra.

@ravi-mosaicml ravi-mosaicml enabled auto-merge (squash) July 27, 2022 01:07
@ravi-mosaicml ravi-mosaicml merged commit 9a7caa9 into mosaicml:dev Jul 27, 2022
@ravi-mosaicml ravi-mosaicml deleted the CO-769 branch July 27, 2022 16:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants