Reload a model in VRAM #892
I think this is quite cool and should go up in priority - it's definitely a way to improve resource utilization with little to no configuration for the user.
Other options: check available VRAM before loading a model. If the model is larger than the available memory and another model is already loaded, unload models as needed to free up VRAM and allow the new model to be loaded. In my case, I'd like to keep the model in memory indefinitely until I select another model.
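Just to illustrate the "check free VRAM before loading" idea above, a rough sketch is below. The `nvidia-smi` query is a real command, but the required-size threshold is a made-up placeholder (in practice it would be derived from the model size and the number of offloaded layers), and LocalAI does not expose this as a setting today.

```sh
# Sketch only: decide whether an already-loaded model would need to be
# unloaded before loading a new one, based on currently free VRAM.
REQUIRED_MIB=5500   # hypothetical VRAM needed for the model to be loaded

# Free memory (MiB) on the first GPU.
FREE_MIB=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | head -n1)

if [ "$FREE_MIB" -lt "$REQUIRED_MIB" ]; then
  echo "Only ${FREE_MIB} MiB free: an already-loaded model would have to be unloaded first."
else
  echo "${FREE_MIB} MiB free: safe to load the new model."
fi
```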
JFYI, meanwhile #925 was merged, which goes in this direction: it allows keeping only "one" backend loaded at a time, so a system that cannot run models in parallel can still work by stopping the unused backends.
This is currently supported by the watchdog introduced in #1341, which keeps track of busy and idle backends.
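For anyone landing here later: with the single-active-backend option and the watchdog in place, the behaviour asked for in this issue can roughly be configured through environment variables. The exact variable names below are an assumption based on recent LocalAI releases and may differ between versions, so please check `local-ai --help` or the docs for your version.

```sh
# Illustrative only - names may vary between LocalAI releases.

# Keep at most one backend loaded at a time (see #925):
export LOCALAI_SINGLE_ACTIVE_BACKEND=true

# Enable the idle watchdog (see #1341) so backends that sit unused
# are stopped and their VRAM freed after the given timeout:
export LOCALAI_WATCHDOG_IDLE=true
export LOCALAI_WATCHDOG_IDLE_TIMEOUT=15m
```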
Hi @mudler, is this auto clear-and-replace VRAM function implemented now? I'm still getting errors where llama.cpp can't load a new model because VRAM is occupied by an already loaded one. Do I need to set some parameters in the config file or environment variables? I can see that the watchdog can be turned on, but I think a "switch to the new model" kind of operation is more practical. Last time I checked, when I used llama.cpp on its own it would switch to the newly requested model and clear the previous one if necessary, via the reset exposed by llama.cpp's own interface. Could you add a switch to automatically reset llama.cpp when a new model needs to be loaded into the GPU?
@ER-EPR
Is your feature request related to a problem? Please describe.
Currently, when using GPU offloading, I offload 35 layers (the maximum that fits in 8 GB of VRAM) to get speedy generation.
If I have, for example, one flow that uses a vicuna model and another that uses gpt-4all-j, I get the following error when one of the models has already been loaded into the GPU's VRAM:
2:58PM DBG GRPC(vicuna.bin-127.0.0.1:36471): stderr CUDA error 2 at /build/go-llama/llama.cpp/ggml-cuda.cu:5500: out of memory
Describe the solution you'd like
Would it be possible to clear the VRAM when a model has been idle for longer than, say, 5 or 10 seconds, so the VRAM is freed for another model? The amount of time could be a variable, so that we can tweak different flows with different models.
Additional context
I have tried using fewer layers, but that also makes generation slower. It was possible to use two models at the same time with a performance penalty, but I'd rather have the option to clear the VRAM on idle, or to specify when the VRAM should be cleared (for example, when reloading the page where the chatbot/API/front-end runs).