My training process is frozen #107
Hello,
The training process starts without any problem, but after some time it freezes, like:
it doesn't show any change on the console,
and when I check the GPUs at that moment, GPU utilization (not memory) is at 100% while the process is frozen (I think this is a clue to the problem):
I tried changing parameters like batch_size, the worker count, etc., but it doesn't help.
Can anyone help?
My environment is on miniconda3, using CUDA 11.8, so the versions are:
PyTorch 2.0.0
PyTorch Lightning 2.0.2
The text was updated successfully, but these errors were encountered:
Comments
Did you use DDP training? If so, a spawned process may have hit an error and quit, so the main process was still waiting.
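For anyone hitting the same hang: one way to make that failure mode visible, assuming DDP runs over the NCCL backend (Lightning's default on GPU), is to turn on distributed debug output before launching. A minimal sketch, with env var names as of PyTorch 2.0:

```python
import os

# Assumption: DDP over the NCCL backend (Lightning's default on GPU).
# Set these before trainer.fit() so a dead rank surfaces as an error
# instead of leaving the other ranks hanging in a collective.
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"     # abort hung collectives
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # log collective mismatches
```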
Yeah, I did.
Then is "DDPStrategy" related to this problem? I'm new at using Lightning, so how can I fix it? |
You can use CPU training or single-GPU training mode to check whether anything bad occurs.
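A minimal sketch of that check in Lightning 2.x; `model` and `dm` are placeholders for your own LightningModule and datamodule:

```python
import pytorch_lightning as pl

# Debug run: single GPU (or accelerator="cpu"), no DDP involved.
trainer = pl.Trainer(accelerator="gpu", devices=1)

# Normal run: DDP across all visible GPUs.
# trainer = pl.Trainer(accelerator="gpu", devices=-1, strategy="ddp")

# trainer.fit(model, datamodule=dm)  # placeholders for your own objects
```

If the single-device run finishes cleanly, the hang is almost certainly in the multi-process setup rather than the model code.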
I already tried with a single GPU and had no problem.
Try setting the DataLoader worker count to 0 and see.
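Concretely, that is the `num_workers` argument of the DataLoader (not the Dataset); a quick sketch, with `my_dataset` as a placeholder:

```python
from torch.utils.data import DataLoader

# num_workers=0 loads batches in the main process, which rules out
# deadlocks or crashes in spawned data-loading worker processes.
loader = DataLoader(my_dataset, batch_size=32, shuffle=True, num_workers=0)
```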