Environment:
Framework: (TensorFlow, Keras, PyTorch, MXNet): PyTorch
Framework version: 1.6.0
Horovod version: 0.21.3
MPI version: 4.0.3
CUDA version: 10.2
NCCL version: 2.7.6
Python version: 3.6
Checklist:
Did you search issues to find if somebody asked this question before? Yes.
If your question is about a hang, did you read this doc?
If your question is about Docker, did you read this doc?
Did you check if your question is answered in the [troubleshooting guide](https://github.com/horovod/horovod/blob/master/docs/troubleshooting.rst)? Yes
Bug report:
Horovod's PyTorch backend currently uses a dict to hold references to the results of Horovod operations and releases each reference in the `synchronize` function.
This code, however, does not release the handle when an exception occurs. As a result, in an elastic scenario, handle references may still be retained in the dict for some time into the new training loop, and are not released until the handle counter reaches the same value again. When host or GPU memory is heavily used, this can lead to OOM.
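For illustration, here is a minimal, self-contained sketch of the failure mode and one possible `try`/`finally`-based fix. The names `_handle_map` and `_wait_for_completion`, and the forced failure, are assumptions made for this sketch; the actual logic lives in Horovod's `horovod/torch/mpi_ops.py` and differs in detail:

```python
# Sketch only: simplified stand-ins for Horovod's handle-tracking internals.
_handle_map = {}  # handle -> tensor references kept alive until synchronized


class HorovodInternalError(Exception):
    """Stands in for the error raised when a worker fails in elastic mode."""


def _wait_for_completion(handle):
    # Hypothetical stand-in for the native wait; fails for odd handles here
    # purely to exercise the exception path.
    if handle % 2:
        raise HorovodInternalError("worker failure on handle %d" % handle)


def synchronize_leaky(handle):
    _wait_for_completion(handle)   # may raise ...
    _handle_map.pop(handle, None)  # ... in which case this never runs


def synchronize_fixed(handle):
    try:
        _wait_for_completion(handle)
    finally:
        # Release the reference even on failure, so the tensors can be
        # garbage-collected when the elastic training loop restarts.
        _handle_map.pop(handle, None)


if __name__ == "__main__":
    _handle_map[1] = "tensor for op 1"
    try:
        synchronize_leaky(1)
    except HorovodInternalError:
        pass
    print(_handle_map)  # {1: 'tensor for op 1'} -- the reference leaked

    _handle_map.clear()
    _handle_map[3] = "tensor for op 3"
    try:
        synchronize_fixed(3)
    except HorovodInternalError:
        pass
    print(_handle_map)  # {} -- released despite the exception
```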
I am going to submit a PR to fix this.