New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove pytest-timeout
#1317
Remove pytest-timeout
#1317
Conversation
This PR removes `pytest-timeout`, all timeout markers from tests, and the `--duration` flag from running pytest. The original intent of adding pytest-timeout was to a) incentivise tests to be fast, b) prevent stalled tests from hogging up system resources, and c) provide a heuristic that could be used to split in parallelize tests in the future. However, it has accomplished none of these goals. a) Slow-running tests have just been giving longer timeouts to get it to pass ci/cd. Instead, slow-running tests should either be moved to run only on the daily build (i.e. `@pytest.mark.daily`), or be made faster. b) If a test is stalling due to a distributed error, pytest-timeout doesn't help at all, since Composer will hang on a distributed barrier. Because of inconsistent CI/CD hardware that caused tests to sporadically exceed the timeout, we probably have used _more_ CI/CD minutes re-running failed builds due to timeout failures compared to the amount of CI/CD time the timeout has saved. Instead, we shoudl rely on the `pytestTimeout` variable in the Jenkinsfile -- this controls how long the k8s pod will run before dying, and it is not affected by `torch.dist` calls that could freeze the python process. It is set to 30 minutes right now, which is relatively short. c) For when we want to split and parallelize tests, the `junitxml` reports already include the amount of time each test case took. So, we can split tests based on their historical executation time, rather than requiring an upper-bound be specified in the codebase. Closes https://mosaicml.atlassian.net/browse/CO-769
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. If we eventually do need to limit test durations between PRs and @daily, another common pattern is to have a marker that invokes a larger parameterized test matrix for each test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM as well.
At some point it would be good to add some sort of global timeout we can fallback on if something gets hung, but we can deal with that when we decide to revamp the test infra.
This PR removes
pytest-timeout
, all timeout markers from tests, and the--duration
flag from running pytest.The original intent of adding pytest-timeout was to a) incentivise tests to be fast, b) prevent stalled tests from hogging up system resources, and c) provide a heuristic that could be used to split in parallelize tests in the future. However, it has accomplished none of these goals.
a) Slow-running tests have just been giving longer timeouts to get it to pass ci/cd. Instead, slow-running tests should either be moved to run only on the daily build (i.e.
@pytest.mark.daily
), or be made faster.b) If a test is stalling due to a distributed error, pytest-timeout doesn't help at all, since Composer will hang on a distributed barrier. Because of inconsistent CI/CD hardware that caused tests to sporadically exceed the timeout, we probably have used more CI/CD minutes re-running failed builds due to timeout failures compared to the amount of CI/CD time the timeout has saved. Instead, we should rely on the
pytestTimeout
variable in the Jenkinsfile -- this controls how long the k8s pod will run before dying, and it is not affected bytorch.dist
calls that could freeze the python process. It is set to 30 minutes right now, which is relatively short.c) For when we want to split and parallelize tests, the
junitxml
reports already include the amount of time each test case took. So, we can split tests based on their historical execution time, rather than requiring an upper-bound be specified in the codebase.Closes https://mosaicml.atlassian.net/browse/CO-769