
Conversation

@SamuelGabriel
Contributor

Hi there,

Thank you for this project. Just today I started using your code to train EfficientNets for my own purposes. I came across a tiny bug in train.py that crashes training on multiple nodes. What seems to happen is that multiple nodes write to the tmp path at once and then each try to move it one after another, yielding a

FileNotFoundError: [Errno 2] No such file or directory: 'output/timm_confs/effnet_b0.txt/tmp.pth.tar' -> 'output/timm_confs/effnet_b0.txt/last.pth.tar'

With this fix in place it works for me, as there is no more concurrency.
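For anyone hitting the same thing, here is a minimal sketch of the idea behind the fix (illustrative only; the helper names below are made up and this is not the exact train.py diff): gate checkpoint writing on the global rank rather than the local rank, so only one process across all nodes ever touches the shared tmp/last paths.

```python
import os
import torch
import torch.distributed as dist

def is_global_primary() -> bool:
    """True only for the single rank-0 process across all nodes."""
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank() == 0
    return True  # single-process (non-distributed) training

def save_last_checkpoint(state: dict, output_dir: str) -> None:
    # Only the globally 0th rank writes; local rank 0 on other nodes skips,
    # so two nodes never race on the same tmp -> last.pth.tar rename
    # when the output directory lives on a shared filesystem.
    if not is_global_primary():
        return
    os.makedirs(output_dir, exist_ok=True)
    tmp_path = os.path.join(output_dir, "tmp.pth.tar")
    last_path = os.path.join(output_dir, "last.pth.tar")
    torch.save(state, tmp_path)
    os.replace(tmp_path, last_path)  # atomic rename on the same filesystem
```

Without the global-rank check, every node's local rank 0 runs the same write-then-rename, and one node's rename can remove tmp.pth.tar before another node gets to it, producing the FileNotFoundError above.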

Best regards,

Sam

@rwightman
Collaborator

@SamuelGabriel I guess this is a better default. I'd assumed that each node would be saving to a local filesystem and that some redundancy might be appreciated as a default. Are you using a shared filesystem for the output?

rwightman merged commit b79dfd4 into huggingface:master on Jun 9, 2021
@SamuelGabriel
Contributor Author

Exactly. At least, all the clusters I have worked with had shared filesystems, but I don't know whether that was by chance.

guoriyue pushed a commit to guoriyue/pytorch-image-models that referenced this pull request on May 24, 2024:
Let only the _globally_ 0th rank write checkpoints in `train.py`