
Conversation

@SamuelGabriel
Contributor

Hi there,

Thank you for this project. Just today I started using your code to train EfficientNets for my own purposes. I came across a tiny bug in train.py that crashes training on multiple nodes. What seems to happen is that multiple nodes write to the tmp path at once and then each try to move it one after another, yielding a

FileNotFoundError: [Errno 2] No such file or directory: 'output/timm_confs/effnet_b0.txt/tmp.pth.tar' -> 'output/timm_confs/effnet_b0.txt/last.pth.tar'

With this fix in place it works for me, as there is no more concurrency.
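For anyone hitting the same thing, here is a minimal sketch of the idea behind the fix (illustrative only; the helper names below are made up and this is not the exact train.py diff): gate checkpoint writing on the global rank rather than the local rank, so only one process across all nodes ever touches the shared tmp/last paths.

```python
import os
import torch
import torch.distributed as dist

def is_global_primary() -> bool:
    """True only for the single rank-0 process across all nodes."""
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank() == 0
    return True  # single-process (non-distributed) training

def save_last_checkpoint(state: dict, output_dir: str) -> None:
    # Only the globally 0th rank writes; local rank 0 on other nodes skips,
    # so two nodes never race on the same tmp -> last.pth.tar rename
    # when the output directory lives on a shared filesystem.
    if not is_global_primary():
        return
    os.makedirs(output_dir, exist_ok=True)
    tmp_path = os.path.join(output_dir, "tmp.pth.tar")
    last_path = os.path.join(output_dir, "last.pth.tar")
    torch.save(state, tmp_path)
    os.replace(tmp_path, last_path)  # atomic rename on the same filesystem
```

Without the global-rank check, every node's local rank 0 runs the same write-then-rename, and one node's rename can remove tmp.pth.tar before another node gets to it, producing the FileNotFoundError above.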

Best regards,

Sam

@rwightman
Collaborator

@SamuelGabriel I guess this is a better default. I'd assumed that each node would be saving to a local filesystem and that some redundancy might be appreciated as a default. Are you using a shared filesystem for the output?

rwightman merged commit b79dfd4 into huggingface:master on Jun 9, 2021
@SamuelGabriel
Contributor Author

Exactly. At least, all the clusters I have worked with had shared filesystems, but I don't know whether that was by chance.

guoriyue pushed a commit to guoriyue/pytorch-image-models that referenced this pull request on May 24, 2024:
Let only the _globally_ 0th rank write checkpoints in `train.py`