Automatic limiting of local batchsize bounds after OOM #90

Open
wants to merge 15 commits into base: master

Conversation

@odp (Collaborator) commented Jan 12, 2021

Here we update the upper limit on the local batch size when we hit an OOM. The new upper limit is constrained to LOCAL_BSZ_CUTOFF_PCT of the current local batch size. After setting the limit we have to take a quick checkpoint and restart, because a simple retry doesn't work: the PyTorch GPU memory allocator caches allocations, so merely reducing the current batch size has little impact on the total allocated (plus cached) memory and results in subsequent OOMs.
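
Below is a minimal sketch of the bound update described above, assuming LOCAL_BSZ_CUTOFF_PCT is the fraction of the current local batch size that becomes the new cap; the placeholder value, the dataloader attributes, and the `restart` callback are illustrative stand-ins, not the actual adaptdl internals:

```python
# Name taken from this PR; the value here is only a placeholder.
LOCAL_BSZ_CUTOFF_PCT = 0.75


def limit_local_bsz_after_oom(dataloader, restart):
    """Cap the local batch size bound, then checkpoint and restart.

    ``dataloader.current_local_bsz``, ``dataloader.max_local_bsz``, and the
    ``restart`` callable are hypothetical stand-ins for the real adaptdl
    objects.
    """
    new_cap = max(1, int(dataloader.current_local_bsz * LOCAL_BSZ_CUTOFF_PCT))
    dataloader.max_local_bsz = new_cap
    # A plain retry is not enough: the PyTorch caching allocator keeps the
    # previously allocated blocks cached, so even a smaller batch can OOM
    # again. Checkpoint the training state and restart the worker so that
    # allocation starts from a clean slate.
    restart()
```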

A new retry decorator is introduced to catch the OOM exception, since it is not visible from inside the dataloader. The train function should be decorated with retry, which limits the batch size of the current dataloader and then retries the training loop (from the position saved before the restart).
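
The sketch below shows the shape of such a retry decorator, relying on the fact that PyTorch surfaces CUDA OOMs as a RuntimeError whose message contains "out of memory"; the `on_oom` callback (for example, the hypothetical `limit_local_bsz_after_oom` above) and the retry count are illustrative, not the actual adaptdl API:

```python
import functools

import torch


def retry(on_oom, max_retries=3):
    """Rerun the wrapped training function after a CUDA OOM.

    ``on_oom`` is a caller-supplied callback, e.g. one that lowers the
    dataloader's batch-size bound and checkpoints before restarting.
    """
    def decorator(train_fn):
        @functools.wraps(train_fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return train_fn(*args, **kwargs)
                except RuntimeError as err:
                    # PyTorch raises CUDA OOMs as RuntimeError with
                    # "out of memory" in the message; re-raise anything else
                    # and give up after the last allowed attempt.
                    if "out of memory" not in str(err) or attempt == max_retries:
                        raise
                    torch.cuda.empty_cache()  # release cached allocator blocks
                    on_oom()
        return wrapper
    return decorator


# Usage sketch (names are illustrative):
# @retry(on_oom=lambda: limit_local_bsz_after_oom(dataloader, restart))
# def train(dataloader, model, optimizer):
#     ...
```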

Fixes #40

@odp requested a review from aurickq on January 12, 2021 23:38
Development

Successfully merging this pull request may close these issues:
Make Elastic Training Flexible to GPU Memory
1 participant