Why is logging restricted to rank=[0,1] in engine.py -> _save_checkpoint()? #2067
Unanswered
dunalduck0 asked this question in Q&A
Replies: 0 comments
Link to the source code
Line 2997 only produces logging for ranks 0 and 1. In multi-node training, the local main process of a node can have a different rank. For example, with 2 nodes of 8 GPUs each, the main process of the second node normally has rank 8, and nothing is logged for that rank. I think this logging would be useful for all ranks, not just [0,1].
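To make the example concrete, here is a minimal sketch of the situation. It does not use the library's own logging code: `should_log` and `local_main_ranks` are hypothetical helpers written only for illustration, and the sketch assumes global ranks are assigned node by node, so that local rank 0 on node `n` has global rank `n * gpus_per_node`.

```python
def should_log(rank, allowed_ranks):
    """Hypothetical rank filter: a process logs only if its global rank is allowed."""
    return rank in allowed_ranks


def local_main_ranks(num_nodes, gpus_per_node):
    """Global ranks of the local main process (local rank 0) on each node.

    Assumes ranks are assigned contiguously, node by node.
    """
    return [node * gpus_per_node for node in range(num_nodes)]


num_nodes, gpus_per_node = 2, 8
world = range(num_nodes * gpus_per_node)

# A fixed filter of [0, 1]: both logging ranks sit on the first node.
logged_fixed = [r for r in world if should_log(r, [0, 1])]

# A filter built from the local main rank of every node.
logged_per_node = [
    r for r in world if should_log(r, local_main_ranks(num_nodes, gpus_per_node))
]

print(logged_fixed)     # [0, 1]
print(logged_per_node)  # [0, 8]
```

With the fixed `[0, 1]` filter, rank 8 never logs, so the second node produces no checkpoint log lines at all. Deriving the list from the node layout is only one possible alternative; logging on every rank would be another.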