feature/recompute #422

Merged · karpathy merged 3 commits into master on May 16, 2024

Conversation

karpathy (Owner)

Option to recompute forward activations during the backward pass.
Will be an int so that 0 = don't be fancy, and 1, 2, 3, 4, ... (in the future) recompute more and more.
This trades off VRAM for the latency of a single fwd/bwd pass: we do more computation, but we hold less in memory.
The big upside is that the VRAM savings let you crank up the batch size, which can actually end up as a net win in token throughput during training.
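
Conceptually, the recompute looks like the sketch below: the forward pass keeps only the pre-GeLU activations, and the backward pass re-runs the cheap GeLU forward into a single scratch buffer shared across layers whenever the GeLU output is needed again. This is a minimal self-contained toy, not the PR's code; the `gelu_forward`/`gelu_backward` names are only loosely modeled on llm.c's CPU reference, and the layer loop is made up for illustration.

```c
// Sketch of the recompute idea: store pre-GeLU activations per layer,
// but regenerate GeLU outputs on demand during backward into ONE shared buffer.
#include <stdio.h>
#include <math.h>

#define GELU_SCALE 0.7978845608028654f  // sqrt(2/pi)

void gelu_forward(float* out, const float* inp, int n) {
    for (int i = 0; i < n; i++) {
        float x = inp[i];
        float c = 0.044715f * x * x * x;
        out[i] = 0.5f * x * (1.0f + tanhf(GELU_SCALE * (x + c)));
    }
}

void gelu_backward(float* dinp, const float* inp, const float* dout, int n) {
    for (int i = 0; i < n; i++) {
        float x = inp[i];
        float c = 0.044715f * x * x * x;
        float t = tanhf(GELU_SCALE * (x + c));
        float sech2 = 1.0f - t * t;  // sech^2(u) = 1 - tanh^2(u)
        float dgelu = 0.5f * (1.0f + t)
                    + 0.5f * x * sech2 * GELU_SCALE * (1.0f + 3.0f * 0.044715f * x * x);
        dinp[i] = dgelu * dout[i];
    }
}

enum { LAYERS = 12, N = 8 };  // toy sizes

int main(void) {
    float pre[LAYERS][N];   // pre-GeLU activations, kept per layer as usual
    float scratch[N];       // one shared buffer for GeLU outputs (the VRAM saving)
    float dout[N], dinp[N];

    // forward: compute pre-GeLU activations; GeLU outputs are NOT kept per layer
    for (int l = 0; l < LAYERS; l++)
        for (int i = 0; i < N; i++)
            pre[l][i] = 0.1f * (float)(l + i) - 0.5f;

    // backward: regenerate each layer's GeLU output on demand (extra compute,
    // traded for not holding LAYERS-many buffers of GeLU outputs in memory)
    for (int l = LAYERS - 1; l >= 0; l--) {
        gelu_forward(scratch, pre[l], N);            // the recompute step
        for (int i = 0; i < N; i++) dout[i] = 1.0f;  // stand-in upstream gradient
        gelu_backward(dinp, pre[l], dout, N);
        // scratch would feed the projection matmul's backward here
    }
    printf("example gradient dinp[0] at layer 0: %f\n", dinp[0]);
    return 0;
}
```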

For example, on my A100 40GB, with -r 0 I can only fit batch size 10 for the biggest GPT-2 model. But with -r 1 (recompute GeLU) I can fit batch size 12, and the larger batch gives a net win in token throughput.
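
(In practice this just means passing the new flag when launching the trainer with a larger batch size, e.g. something like `./train_gpt2cu -r 1 -b 12`. The `train_gpt2cu` binary name and the `-b` batch-size flag are assumptions based on llm.c's usual CLI; only `-r` itself is what this PR adds.)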

ngc92 and others added 3 commits May 16, 2024 12:39
…r time just like ZeRO stages, as we recompute more and more of the model in the future possibly. and make it default on because it is awesome
karpathy merged commit bd7dc7a into master on May 16, 2024
8 checks passed