【Elastic Horovod】torch op handles are not reclaimed right away when a new elastic loop begins #3109

Closed
woodlgz opened this issue Aug 16, 2021 · 1 comment · Fixed by #3110

woodlgz commented Aug 16, 2021

Environment:

Framework: PyTorch
Framework version: 1.6.0
Horovod version: 0.21.3
MPI version: 4.0.3
CUDA version: 10.2
NCCL version: 2.7.6
Python version: 3.6

Checklist:

Did you search issues to find if somebody asked this question before? Yes.
If your question is about hang, did you read this doc?
If your question is about docker, did you read this doc?
Did you check if your question is answered in the [troubleshooting guide](https://github.com/horovod/horovod/blob/master/docs/troubleshooting.rst)? Yes

Bug report:
Horovod's PyTorch binding currently uses a dict (_handle_map) to hold references to the results of Horovod operations, and releases each reference in the synchronize function.

def synchronize(handle):
    if handle not in _handle_map:
        return
    try:
        mpi_lib.horovod_torch_wait_and_clear(handle)
        output = _handle_map.pop(handle)[-1]
        return output
    except RuntimeError as e:
        raise HorovodInternalError(e)

This code, however, does not release the handle when an exception occurs. As a result, in the elastic scenario, handle references may be retained in the dict for some time after the new training loop begins, and are not released until the handle counter reaches the same value again. This can be a problem when host or GPU memory is already heavily used, since the retained references make the process prone to OOM.

I am going to submit a PR to fix this.

woodlgz commented Sep 2, 2021

@chongxiaoc @tgaddair can you take a look at this?
