Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix pytorch async dataloader race condition #3120

Merged
merged 1 commit into from Aug 19, 2021

Conversation

chongxiaoc
Copy link
Collaborator

Since petastorm reader can have infinite loops, main thread has to drain
the queue in order to join the asynchronous worker thread. Otherwise
queue could always be full and asynchronous worker stucks in
queue.put(None).

Also call close_async_loader() in lightning datamodule to close async
dataloaders explicitly, to avoid uncaught exceptions.

Signed-off-by: Chongxiao Cao chongxiaoc@uber.com

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

Fixes #3119 .

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

Since petastorm reader can have infinite loops, main thread has to drain
the queue in order to join the asynchronous worker thread. Otherwise
queue could always be full and asynchronous worker stucks in
queue.put(None).

Also call close_async_loader() in lightning datamodule to close async
dataloaders explicitly, to avoid uncaught exceptions.

Signed-off-by: Chongxiao Cao <chongxiaoc@uber.com>
@github-actions
Copy link

github-actions bot commented Aug 19, 2021

Unit Test Results

     714 files  ±0       714 suites  ±0   5h 59m 39s ⏱️ ±0s
     693 tests ±0       650 ✔️ ±0       43 💤 ±0  0 ❌ ±0 
15 778 runs  ±0  11 133 ✔️ ±0  4 645 💤 ±0  0 ❌ ±0 

Results for commit 24cdb21. ± Comparison against base commit 24cdb21.

♻️ This comment has been updated with latest results.

@chongxiaoc chongxiaoc merged commit 24cdb21 into horovod:master Aug 19, 2021
@github-actions
Copy link

github-actions bot commented Aug 19, 2021

Unit Test Results (with flaky tests)

     825 files  ±0       825 suites  ±0   6h 18m 54s ⏱️ ±0s
     693 tests ±0       650 ✔️ ±0       42 💤 ±0  1 ❌ ±0 
18 297 runs  ±0  12 706 ✔️ ±0  5 590 💤 ±0  1 ❌ ±0 

For more details on these failures, see this check.

Results for commit 24cdb21. ± Comparison against base commit 24cdb21.

♻️ This comment has been updated with latest results.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Spark Lightning MNIST fails on GPU
2 participants