[mm] clear the cache entry for a model that got an OOM during loading #6193
Conversation
Interesting, I consistently get …
Catching all Exceptions might swallow …
We re-raise the error immediately afterwards anyways - what's the harm in catching everything and clearing the cache in the event of any error? Probably oughta do this anyways. There could be some other kind of exception that results in a borked cache for that model.
Ok. Done in latest commit.
Thanks for tracking this down @lstein
Force-pushed from 6076b15 to 7605632
Summary
If a CUDA Out-Of-Memory (OOM) exception occurs during model loading, this commit recovers from the error by clearing the model manager's cache entry for that model. This prevents the partially-loaded model from getting "stuck" in VRAM and blocking further generations.
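For illustration, here is a minimal sketch of that recovery pattern, roughly following the "catch everything, clear the cache entry, re-raise" approach from the conversation above. The `cache`, `model_key`, and `loader` names are hypothetical stand-ins, not the actual InvokeAI model-manager API:

```python
import gc

import torch


def load_with_cache_recovery(cache: dict, model_key: str, loader):
    """Load a model, dropping its cache entry if loading fails partway.

    `cache`, `model_key`, and `loader` are illustrative stand-ins, not
    the real model-manager interfaces.
    """
    try:
        return loader(model_key)
    except Exception:
        # Any failure here (CUDA OOM included) can leave a partially-loaded
        # model pinned in the cache, holding VRAM. Drop the entry so the
        # next attempt starts clean, then re-raise the original error.
        cache.pop(model_key, None)
        gc.collect()
        if torch.cuda.is_available():
            # Optional: hand freed blocks back to the driver so nvidia-smi
            # reflects the recovery immediately.
            torch.cuda.empty_cache()
        raise
```

An OOM-only variant would catch `torch.cuda.OutOfMemoryError` instead; the broader catch reflects the point made above that other exceptions can leave the cache in the same borked state.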
Related Issues / Discussions
To reproduce the underlying issue on current `main`, be sure to use a Linux system (the Windows NVIDIA driver behaves differently).

1. Launch `invokeai` and generate an image, then check `nvidia-smi` afterward. It should show a small amount of VRAM being used by the `invokeai` process, typically about 600MB.
2. Fill the remaining VRAM from another process, for example by loading a large model into `ollama` (a stand-in sketch follows this list).
3. Try to generate again. You'll get `torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 26.00 MiB. GPU 0 has a total capacity of 11.73 GiB of which 7.12 MiB is free`. If you check `nvidia-smi` now, you'll see that a significant amount of VRAM (>1GB) is still allocated to the `invokeai` process.
4. Free up VRAM, either by killing the `ollama` server or by unloading its current LLM.
5. Try to generate again. You'll get `Error while invoking session 95917faf-1faf-4094-9dcd-86cfd1c943e4, invocation fb7a92ba-a6e8-4345-837c-ed842665fccd (denoise_latents): Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument weight in method wrapper_CUDA__native_group_norm)`.
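If `ollama` isn't handy, one possible stand-in for step 2 is a throwaway PyTorch script run from a second terminal that grabs most of the free VRAM (the file name and the 0.9 fraction are illustrative, not part of the recipe):

```python
# fill_vram.py -- run in a separate process to hold most of the free VRAM,
# standing in for the ollama server in the recipe above. Illustrative only.
import time

import torch

free_bytes, _total_bytes = torch.cuda.mem_get_info(0)
# Leave only a small sliver free so invokeai's next model load hits a
# CUDA OOM. Adjust the 0.9 fraction for your card if the allocation fails.
hog = torch.empty(int(free_bytes * 0.9), dtype=torch.uint8, device="cuda:0")
print(f"Holding {hog.numel() / 2**30:.1f} GiB on cuda:0; Ctrl-C to release")
time.sleep(3600)
```

Killing this script with Ctrl-C then plays the part of step 4 (killing the `ollama` server).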
QA Instructions
Follow the recipe given above before and after applying this PR. With this PR, when the OOM error occurs, the RAM cache is cleared of the partially-loaded model, and as soon as there is sufficient unused VRAM the model should load and generate.
Note that this is a CUDA-specific fix. If there is an equivalent problem on MPS systems, this will not fix it. However, we haven't had any reports from Mac users yet.
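To make the before/after comparison easier during QA, a small watcher script can print the `invokeai` process's VRAM usage while you run the recipe. It only relies on standard `nvidia-smi` query flags; the file name is illustrative:

```python
# vram_watch.py -- print per-process VRAM usage every few seconds so you can
# see whether the invokeai process releases memory after the OOM.
import subprocess
import time

QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader",
]

while True:
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    # Depending on how InvokeAI was launched, the process name may appear
    # as "python" rather than "invokeai"; adjust the filter if needed.
    matches = [line for line in result.stdout.splitlines() if "invokeai" in line.lower()]
    print("\n".join(matches) if matches else "no invokeai compute process found")
    time.sleep(5)
```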
Merge Plan
Merge when approved.
Checklist