Environment:
Framework: (TensorFlow, Keras, PyTorch, MXNet): PyTorch
Framework version: 1.6.0
Horovod version: 0.21.3
MPI version: 4.0.3
CUDA version: 10.2
NCCL version: 2.7.6
Python version: 3.6
Checklist:
Did you search issues to find if somebody asked this question before? Yes.
If your question is about a hang, did you read this doc?
If your question is about Docker, did you read this doc?
Did you check if your question is answered in the [troubleshooting guide](https://github.com/horovod/horovod/blob/master/docs/troubleshooting.rst)? Yes
Bug report:
Horovod's PyTorch backend currently uses a dict to hold references to the results of Horovod operations and releases each reference in the `synchronize` function.
This code, however, does not release the handle when an exception occurs. As a result, in an elastic scenario, handle references may still be retained in the dict for some time into the new training loop, and are not released until the handle counter reaches the same value again. When host or GPU memory is heavily used, this can lead to OOM.
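For illustration, here is a minimal, self-contained sketch of the failure mode and one possible `try`/`finally`-based fix. The names `_handle_map` and `_wait_for_completion`, and the forced failure, are assumptions made for this sketch; the actual logic lives in Horovod's `horovod/torch/mpi_ops.py` and differs in detail:

```python
# Sketch only: simplified stand-ins for Horovod's handle-tracking internals.
_handle_map = {}  # handle -> tensor references kept alive until synchronized


class HorovodInternalError(Exception):
    """Stands in for the error raised when a worker fails in elastic mode."""


def _wait_for_completion(handle):
    # Hypothetical stand-in for the native wait; fails for odd handles here
    # purely to exercise the exception path.
    if handle % 2:
        raise HorovodInternalError("worker failure on handle %d" % handle)


def synchronize_leaky(handle):
    _wait_for_completion(handle)   # may raise ...
    _handle_map.pop(handle, None)  # ... in which case this never runs


def synchronize_fixed(handle):
    try:
        _wait_for_completion(handle)
    finally:
        # Release the reference even on failure, so the tensors can be
        # garbage-collected when the elastic training loop restarts.
        _handle_map.pop(handle, None)


if __name__ == "__main__":
    _handle_map[1] = "tensor for op 1"
    try:
        synchronize_leaky(1)
    except HorovodInternalError:
        pass
    print(_handle_map)  # {1: 'tensor for op 1'} -- the reference leaked

    _handle_map.clear()
    _handle_map[3] = "tensor for op 3"
    try:
        synchronize_fixed(3)
    except HorovodInternalError:
        pass
    print(_handle_map)  # {} -- released despite the exception
```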
I am going to submit a PR to fix this.