
(Not a real PR) Diagnosing automated tests #993

Closed

Conversation

@alexander-soare (Contributor) commented Nov 24, 2021

First, get rid of all EXCLUDE_FX_FILTERS (and keep it this way unless otherwise noted)

Experiment 1 - PASSED

Do only the 3 FX tests

  • Ubuntu PASSED
  • Mac PASSED

Experiment 2 - CANCELED

Enable all tests

Ubuntu CANCELED in 2h 41m 40s ending at:

2021-11-24T23:38:20.2060742Z tests/test_models.py::test_model_forward_fx[1-resnet50_gn] PASSED        [ 69%]
2021-11-25T00:11:34.0126256Z ##[error]The operation was canceled.
2021-11-25T00:12:31.5676618Z Post job cleanup.
2021-11-25T00:12:44.5549831Z ##[error]The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
2021-11-25T00:12:45.2649767Z Cleaning up orphan processes

Interestingly, notice the gap of roughly 33 minutes between the last passed test (23:38:20) and the `[error]The operation was canceled.` line (00:11:34). Why was it canceled?

Experiment 3 - OOM

Swap the positions of all 3 FX tests with their non-FX counterparts. Everything else enabled as usual.

Ubuntu OOM in 2h 13m 26s ending at:

2021-11-25T13:48:41.2879246Z tests/test_models.py::test_model_forward[1-botnet50ts_256] PASSED        [ 61%]
2021-11-25T13:48:51.7222624Z /home/runner/work/_temp/cb4ae48b-147c-492f-b8cb-e3d64a3bb76c.sh: line 1:  3372 Killed                  pytest -vv --durations=0 ./tests
2021-11-25T13:48:51.7238701Z tests/test_models.py::test_model_forward[1-cait_m36_384] 
2021-11-25T13:48:51.7862199Z ##[error]Process completed with exit code 137.
2021-11-25T13:48:51.9396727Z Post job cleanup.

Experiment 4 - OOM

All tests enabled, in the original order. Call gc.collect() in every test.

Ubuntu OOM in 1h 36m 33s at:

2021-11-25T16:45:20.1194322Z tests/test_models.py::test_model_backward_fx[2-visformer_tiny] PASSED    [ 85%]
2021-11-25T16:45:26.8854740Z /home/runner/work/_temp/42bf6dc7-5437-4230-a0e5-8d97d5f415ad.sh: line 1:  3576 Killed                  pytest -vv --durations=0 ./tests
2021-11-25T16:45:26.8867180Z tests/test_models.py::test_model_backward_fx[2-vit_base_patch8_224] 
2021-11-25T16:45:27.0847755Z ##[error]Process completed with exit code 137.
2021-11-25T16:45:27.1740601Z Post job cleanup.
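For reference, the per-test gc.collect() from this experiment could be centralized in a conftest.py autouse fixture rather than edited into each test body. A sketch only, not the actual change made here (the fixture name is mine):

```python
# Hypothetical conftest.py snippet: force a full garbage-collection pass
# after every test instead of calling gc.collect() inside each test body.
import gc

import pytest


@pytest.fixture(autouse=True)
def force_gc():
    """Yield to the test, then collect any cyclic garbage it left behind."""
    yield
    # gc.collect() runs a full collection and returns the number of
    # unreachable objects it found.
    gc.collect()
```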

Experiment 5 - OOM

Remove the gc.collect() calls added in exp 4. Separate the FX tests into their own file.

Ubuntu OOM (well, didn't really expect that to work...) in 1h 28m 40s at:

2021-11-25T19:30:25.3113700Z tests/test_models.py::test_model_forward[1-botnet50ts_256] PASSED        [ 38%]
2021-11-25T19:30:36.8001220Z /home/runner/work/_temp/55831724-fa44-4d61-aa17-4af11218b83c.sh: line 1:  3385 Killed                  pytest -vv --durations=0 ./tests
2021-11-25T19:30:36.8006837Z tests/test_models.py::test_model_forward[1-cait_m36_384] 
2021-11-25T19:30:36.8432215Z ##[error]Process completed with exit code 137.

Experiment 6 - PASSED

Not only separate the FX tests into their own file, but also run each test file separately in the GitHub workflow.

Ubuntu - PASSED
Mac -
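The idea in this experiment, running each test file in its own pytest process so the OS reclaims memory between files, can be sketched as a driver script. Illustrative only: the actual change lives in the GitHub Actions workflow, and the `test_*.py` glob is an assumption.

```python
# Sketch: invoke pytest once per test file, each in a fresh process,
# so memory accumulated by one file cannot affect the next.
import subprocess
import sys
from pathlib import Path


def run_each_file_separately(test_dir="tests"):
    """Return the names of test files whose pytest run failed."""
    failures = []
    for test_file in sorted(Path(test_dir).glob("test_*.py")):
        result = subprocess.run(
            [sys.executable, "-m", "pytest", "-vv", str(test_file)]
        )
        if result.returncode != 0:
            failures.append(test_file.name)
    return failures
```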

@alexander-soare marked this pull request as draft November 24, 2021 18:58
@alexander-soare (Contributor, Author) commented Nov 25, 2021

@rwightman will come back to this, but just FYI that I've been able to rule the FX tests out as the culprit. See experiments 1 and 3, which confirm that the FX tests all pass when run either alone or first.

In any case, it seems that even with the FX tests disabled, this will become a problem as more models are added anyway.

@rwightman (Collaborator)

@alexander-soare interesting, thanks for the analysis. I was noticing that as I made it further into the FX tests, they started failing on smaller and smaller models, so yeah, it seems like there might be some memory fragmentation, GC, or circular-reference / leak issue.

Maybe I'll try inserting some forced GC cleanup in an upcoming PR and see what happens...

@alexander-soare (Contributor, Author)

> Maybe I'll try inserting some forced GC cleanup in an upcoming PR and see what happens...

@rwightman just note experiment 4: I inserted gc.collect() calls in all the tests. That's the extent of my gc knowledge though lol

@rwightman (Collaborator)

@alexander-soare running locally, I just tried installing pytest-xdist and running pytest with the --forked flag. It runs each test in a separate process and appears to prevent memory baggage from accumulating, regardless of the cause (fragmentation, etc.). I'll try this with my next PR.
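For reference, `--forked` comes from the pytest-forked plugin (historically pulled in alongside pytest-xdist) and is POSIX-only, since it relies on fork(). The same invocation can be sketched programmatically; the test path is illustrative:

```python
# Sketch: run the suite with every test in a forked child process, so any
# memory a test leaks or fragments dies with that process.
# Requires the pytest-forked plugin; POSIX only.
import pytest


def run_forked(test_path="./tests"):
    # pytest.main returns an exit code; 0 means all selected tests passed.
    return pytest.main(["-vv", "--durations=0", "--forked", test_path])


if __name__ == "__main__":
    raise SystemExit(run_forked())
```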

@alexander-soare (Contributor, Author)

@rwightman great, and in case that doesn't sort it out, my last experiment passed: running each test file separately, with the FX tests in their own file. It's kicking the can down the road rather than fixing the root cause, though. lmk

@rwightman closed this Mar 21, 2022