llama3 with torch.compile used more memory #31471
Comments
Hi @songh11 👋 If you check the documentation regarding …
Many thanks! Another question I have: why does the second generate call use more memory?
Interestingly, I didn't get a sudden memory spike after the second generation, and after 5 steps the memory remained around 16 GB 🤔. My specs are: …
@zucchini-nlp In my experience the spikes are hardware-dependent, even when two devices have the same spare memory available. @songh11 "You might also notice that the second time we run our model with torch.compile is significantly slower than the other runs, although it is much faster than the first run. This is because the "reduce-overhead" mode runs a few warm-up iterations for CUDA graphs." (source)
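For reference, here is a minimal sketch of the warm-up pattern the quoted docs describe; the model id, prompt, and generation settings are illustrative assumptions, not taken from the original report:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model id; any causal LM shows the same pattern.
model_id = "meta-llama/Meta-Llama-3-8B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# Static cache + compiled forward, as in the transformers torch.compile docs.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tok("Hello, my name is", return_tensors="pt").to("cuda")
for step in range(5):
    t0 = time.perf_counter()
    model.generate(**inputs, max_new_tokens=32, do_sample=False)
    torch.cuda.synchronize()
    # Expected: step 0 is very slow (compilation), step 1 is still slow
    # (CUDA graph warm-up in "reduce-overhead" mode), steps 2+ are fast.
    print(f"step {step}: {time.perf_counter() - t0:.2f}s")
```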
NVIDIA RTX A5000. I think the second generation is also part of the warm-up.
Thank you for your reply. I can work around it by using the default compile mode.
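If "default" above means the torch.compile mode, the workaround is presumably a one-line change to the sketch above; this is my reading, not something stated explicitly in the thread:

```python
# mode="default" skips CUDA graph capture, avoiding the warm-up memory
# spike at the cost of some per-step launch overhead during decoding.
model.forward = torch.compile(model.forward, mode="default", fullgraph=True)
```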
System Info

transformers version: 4.41.2

Who can help?

@SunMarc, @zucchini-nlp, @gante. I hope I can get your help.
Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Expected behavior
On the first generate call, nvidia-smi showed about 16 GB of memory in use, but during the second call the reported usage grew to 20 GB. However, torch.cuda.max_memory_reserved() still showed only about 16 GB. I don't know what the problem is; could you help me understand it?
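For what it's worth, the two numbers measure different things: torch.cuda.max_memory_reserved() only covers PyTorch's caching allocator, while nvidia-smi reports everything held by the process, including the CUDA context, library workspaces, and memory that CUDA graph instantiation can allocate outside the allocator. A small sketch that logs both views side by side, using pynvml (an extra dependency, not part of the original report):

```python
import torch
import pynvml  # pip install nvidia-ml-py


def report(tag: str) -> None:
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    used = pynvml.nvmlDeviceGetMemoryInfo(handle).used
    reserved = torch.cuda.max_memory_reserved()
    # "used" is the driver-level view (what nvidia-smi shows) for the whole
    # process; max_memory_reserved() only tracks PyTorch's caching allocator.
    print(f"{tag}: driver used={used / 2**30:.1f} GiB, "
          f"torch reserved (peak)={reserved / 2**30:.1f} GiB")
```

Calling report() after each generate step should show the driver-level number climbing past the allocator-level peak on the warm-up runs, which would match the nvidia-smi vs. max_memory_reserved() gap described above.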