
Training freezes every num_worker iterations #705

Answered by rwightman
HaniItani asked this question in Q&A

Not many other ideas, but I'm pretty sure it's an issue with the system / hardware / environment. We've run this a lot on Slurm clusters at very significant scale with high GPU power utilization and no significant epoch turnover delays.

As a hail mary, you could try the workaround from pytorch/pytorch#99625: install `llvm-openmp<16` into your Python environment with `conda install 'llvm-openmp<16'` (a pip install should work too). I've run into this issue on PT 2.0.x, and it kills performance by constraining all CPU activity onto 1-2 cores.
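
In case it helps, here is a minimal diagnostic sketch (an assumption on my part, not code from this repo; it assumes Linux and a working `torch` install) to check whether the process really is pinned to only a couple of cores, which is the symptom of that llvm-openmp regression:

```python
import os
import torch

# If the llvm-openmp regression is in play, the process ends up restricted to
# only 1-2 cores even though many more are present on the machine.
print("PyTorch version:         ", torch.__version__)
print("torch intra-op threads:  ", torch.get_num_threads())
print("cores visible to process:", len(os.sched_getaffinity(0)))  # Linux only
print("cores on the machine:    ", os.cpu_count())
```

If the number of cores visible to the process is far below the machine's core count, pinning `llvm-openmp<16` (or upgrading PyTorch) is likely the fix described above.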

Possibly worth trying an upgrade to PT 2.1 as well.

Answer selected by HaniItani

This discussion was converted from issue #703 on October 25, 2023 18:15.