🐛[BUG]: ERA5 DALI datapipe hangs indefinitely in multi-GPU/multi-Node setting if the datapipe size is not selected correctly. #102

Open
ktangsali opened this issue Aug 2, 2023 · 0 comments
Labels
0 - Backlog In queue waiting for assignment bug Something isn't working

Comments

@ktangsali
Collaborator

Version

0.2.0

On which installation method(s) does this occur?

Docker

Describe the issue

In most cases the hang can be avoided by setting the number of samples in the datapipe (for example here) to a value that is divisible by the number of processors/GPUs.

A long-term fix would be for the datapipe to automatically handle the failure cases where the size is not exactly divisible by the number of GPUs, instead of hanging.
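As a rough illustration of the workaround, the sample count can be rounded down to the nearest multiple of the number of GPUs before the datapipe is built. This is only a sketch: `truncate_to_world_size` is a hypothetical helper, not part of the library, and the sample counts below are made-up values.

```python
def truncate_to_world_size(num_samples: int, world_size: int) -> int:
    """Return the largest sample count <= num_samples divisible by world_size.

    Hypothetical helper for illustration only; it is not a library function.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    if num_samples < world_size:
        raise ValueError("need at least one sample per process")
    # Integer division drops the remainder that would leave ranks uneven.
    return (num_samples // world_size) * world_size


# Made-up numbers: 1460 samples on 8 GPUs leaves a remainder of 4,
# so the last 4 samples are dropped and every rank sees 182 samples.
usable = truncate_to_world_size(1460, 8)
per_rank = usable // 8
```

The trade-off is that up to `world_size - 1` samples are silently dropped; padding the dataset instead would keep every sample but repeat some of them.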

Minimum reproducible example

No response

Relevant log output

No response

Environment details

No response

@ktangsali ktangsali added the bug Something isn't working label Aug 2, 2023
@akshaysubr akshaysubr added the 0 - Backlog In queue waiting for assignment label Aug 11, 2023
ktangsali added a commit that referenced this issue Nov 3, 2023
* add makefile commands to add README to sphinx docs

* add pandoc install

* add chapters to index.rst

* fix the images that should have been LFS

* add new examples

* minor fix to readme syntax

* minor fix to indent level

* fix markdown linting errors

* fix markdown linting errors