(Not a real PR) Diagnosing automated tests #993

alexander-soare · 2021-11-24T18:58:40Z

First, remove all EXCLUDE_FX_FILTERS (and keep it that way unless otherwise noted).

Experiment 1 - PASSED

Do only the 3 FX tests

Ubuntu PASSED
Mac PASSED

Experiment 2 - CANCELED

Enable all tests

Ubuntu CANCELED in 2h 41m 40s ending at:

2021-11-24T23:38:20.2060742Z tests/test_models.py::test_model_forward_fx[1-resnet50_gn] PASSED        [ 69%]
2021-11-25T00:11:34.0126256Z ##[error]The operation was canceled.
2021-11-25T00:12:31.5676618Z Post job cleanup.
2021-11-25T00:12:44.5549831Z ##[error]The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
2021-11-25T00:12:45.2649767Z Cleaning up orphan processes

Interestingly, note the gap between the last passed test and the "##[error]The operation was canceled." line: about 33 mins. Why was it canceled?
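The gap can be checked directly from the two log timestamps. A minimal sketch, assuming the timestamps are UTC with 7 fractional digits as they appear in the log above (the helper name `parse_actions_timestamp` is made up for illustration):

```python
from datetime import datetime, timezone

def parse_actions_timestamp(stamp):
    """Parse a log timestamp such as 2021-11-24T23:38:20.2060742Z."""
    whole, _, frac = stamp.rstrip("Z").partition(".")
    micros = (frac + "000000")[:6]  # strptime's %f accepts at most 6 digits
    parsed = datetime.strptime(f"{whole}.{micros}", "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)

# Timestamps copied from the Experiment 2 log excerpt.
last_pass = parse_actions_timestamp("2021-11-24T23:38:20.2060742Z")
canceled = parse_actions_timestamp("2021-11-25T00:11:34.0126256Z")

gap = canceled - last_pass
print(f"gap: {gap.total_seconds() / 60:.1f} minutes")
```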

Experiment 3 - OOM

Swap the positions of all 3 FX tests with their non FX counterparts. Everything enabled as usual.

Ubuntu OOM in 2h 13m 26s ending at:

2021-11-25T13:48:41.2879246Z tests/test_models.py::test_model_forward[1-botnet50ts_256] PASSED        [ 61%]
2021-11-25T13:48:51.7222624Z /home/runner/work/_temp/cb4ae48b-147c-492f-b8cb-e3d64a3bb76c.sh: line 1:  3372 Killed                  pytest -vv --durations=0 ./tests
2021-11-25T13:48:51.7238701Z tests/test_models.py::test_model_forward[1-cait_m36_384] 
2021-11-25T13:48:51.7862199Z ##[error]Process completed with exit code 137.
2021-11-25T13:48:51.9396727Z Post job cleanup.
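For reference, exit code 137 follows the shell convention of 128 plus a signal number, so it corresponds to signal 9 (SIGKILL), which matches the "Killed" line above. The log itself does not say who sent the signal; reading it as an OOM kill is an inference. A minimal sketch of the decoding, assuming Unix signal numbering (the helper name `describe_exit_code` is made up for illustration):

```python
import signal

def describe_exit_code(code):
    """Interpret a shell exit status; values above 128 conventionally mean
    the process was terminated by signal number (code - 128)."""
    if code > 128:
        try:
            return f"terminated by {signal.Signals(code - 128).name}"
        except ValueError:
            # Not a signal number known on this platform.
            return f"terminated by unknown signal {code - 128}"
    return f"exited with status {code}"

print(describe_exit_code(137))
```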

Experiment 4 - OOM

All tests enabled and in original order. Do gc.collect() in all of them.

Ubuntu OOM in 1h 36m 33s at:

2021-11-25T16:45:20.1194322Z tests/test_models.py::test_model_backward_fx[2-visformer_tiny] PASSED    [ 85%]
2021-11-25T16:45:26.8854740Z /home/runner/work/_temp/42bf6dc7-5437-4230-a0e5-8d97d5f415ad.sh: line 1:  3576 Killed                  pytest -vv --durations=0 ./tests
2021-11-25T16:45:26.8867180Z tests/test_models.py::test_model_backward_fx[2-vit_base_patch8_224] 
2021-11-25T16:45:27.0847755Z ##[error]Process completed with exit code 137.
2021-11-25T16:45:27.1740601Z Post job cleanup.
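As a note on what gc.collect() can and cannot do: in CPython it only reclaims objects that are unreachable but kept alive by reference cycles. It does nothing for memory that is still referenced or for allocator fragmentation, which may be why this experiment still ran out of memory. A minimal sketch of the effect (the `Node` class is a hypothetical stand-in, not something from the test suite):

```python
import gc
import weakref

class Node:
    """Hypothetical object that references itself, forming a cycle."""
    def __init__(self):
        self.me = self

def make_and_drop_cycle():
    node = Node()
    ref = weakref.ref(node)  # lets us observe the object without keeping it alive
    del node                 # refcount stays above zero because of the cycle
    return ref

gc.disable()                 # keep the automatic collector out of the demo
ref = make_and_drop_cycle()
alive_before = ref() is not None
gc.collect()                 # forced collection finds and frees the cycle
alive_after = ref() is not None
gc.enable()

print(alive_before, alive_after)
```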

Experiment 5 - OOM

Remove the gc.collect() calls added in exp 4. Separate FX tests into their own file.

Ubuntu OOM (well, didn't really expect that to work...) in 1h 28m 40s at:

2021-11-25T19:30:25.3113700Z tests/test_models.py::test_model_forward[1-botnet50ts_256] PASSED        [ 38%]
2021-11-25T19:30:36.8001220Z /home/runner/work/_temp/55831724-fa44-4d61-aa17-4af11218b83c.sh: line 1:  3385 Killed                  pytest -vv --durations=0 ./tests
2021-11-25T19:30:36.8006837Z tests/test_models.py::test_model_forward[1-cait_m36_384] 
2021-11-25T19:30:36.8432215Z ##[error]Process completed with exit code 137.

Experiment 6 - PASSED

Not only separate FX tests into their own file, but also run each test file separately in the GitHub Actions workflow.

Ubuntu - PASSED
Mac -
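The idea behind this experiment is that each test file gets a fresh process, so memory held at the end of one file cannot carry over into the next. The actual change was made in the workflow file; the following is only a sketch of the same idea as a Python driver, assuming test files match `tests/test_*.py` and reusing the `-vv --durations=0` flags seen in the logs (the function names are made up for illustration):

```python
import subprocess
import sys
from pathlib import Path

def pytest_commands(test_dir="tests"):
    """Build one pytest invocation per test file."""
    files = sorted(Path(test_dir).glob("test_*.py"))
    return [
        [sys.executable, "-m", "pytest", "-vv", "--durations=0", str(f)]
        for f in files
    ]

def run_all(test_dir="tests"):
    """Run each file in its own process; return the first non-zero exit code, or 0."""
    for cmd in pytest_commands(test_dir):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return result.returncode
    return 0

if __name__ == "__main__":
    sys.exit(run_all())
```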

alexander-soare · 2021-11-25T20:49:14Z

@rwightman will come back to this, but just FYI, I've been able to rule out the FX tests as the culprit. See experiments 1 and 3, which confirm that the FX tests all pass when run either alone or first.

In any case, it seems that even with the FX tests disabled, this is going to become a problem as more models are added.

rwightman · 2021-11-25T21:00:45Z

@alexander-soare interesting, thanks for the analysis. I was noticing as I made it further into the FX tests it was starting to fail on smaller and smaller models, so yeah, seems like there might be some memory fragmentation, GC, or circular-ref / leak issues.

Maybe I'll try inserting some forced GC cleanup in an upcoming PR and see what happens...

alexander-soare · 2021-11-25T21:02:14Z

> Maybe I'll try inserting some forced GC cleanup in an upcoming PR and see what happens...

@rwightman just note experiment 4 - I inserted gc.collect()s in all the tests. That's the extent of my gc knowledge though lol

rwightman · 2021-11-26T00:47:30Z

@alexander-soare running locally, I just tried installing pytest-xdist and running pytest with the --forked flag. It runs tests in separate processes and appears to prevent some memory baggage from accumulating, regardless of what it is (fragmentation, etc). I'll try this with my next PR.
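The general idea of running work in a forked child can be sketched as follows. This is not how the plugin is implemented, just an illustration under the assumption of a Unix system with `os.fork`; `run_forked` and `hungry` are hypothetical names:

```python
import os

def run_forked(fn):
    """Run fn() in a forked child process and return a shell-style exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: run the callable, then exit immediately without unwinding
        # back into the parent's code.
        try:
            fn()
            code = 0
        except BaseException:
            code = 1
        os._exit(code)
    # Parent: wait for the child. Whatever the child allocated is released
    # by the OS when it exits, so nothing accumulates in the parent.
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        return 128 + os.WTERMSIG(status)  # e.g. 137 if the child got SIGKILL
    return os.WEXITSTATUS(status)

def hungry():
    data = [0] * 1_000_000  # allocated only inside the child process
    assert len(data) == 1_000_000

print(run_forked(hungry))
print(run_forked(lambda: 1 / 0))
```

The trade-off is the per-test cost of forking, and that only the child's exit status comes back, so results have to be reported some other way.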

alexander-soare · 2021-11-26T12:19:22Z

@rwightman great, and in case that doesn't sort it out, my last experiment passed: running each test file separately (with the FX tests in their own file). It's kicking the can down the road though, as it doesn't fix the root cause. lmk

alexander-soare marked this pull request as draft November 24, 2021 18:58

alexander-soare added 3 commits November 25, 2021 11:35

only fx tests

269477e

enable all tests

f26fbec

swap fx tests with corresponding non-fx tests

5dc8413

alexander-soare force-pushed the test-tests branch from 1dd8969 to 5dc8413 Compare November 25, 2021 11:35

alexander-soare mentioned this pull request Nov 25, 2021

ci: Fix possible OOM error Process completed with exit code 137 Lightning-Universe/lightning-bolts#409

Closed

alexander-soare added 2 commits November 25, 2021 15:08

garbage collect

738e2fe

separate file for test fx

92f71dc

run test files separately one by one

bda0d45

rwightman closed this Mar 21, 2022
