
Reload a model in VRAM #892

Closed
gregoryca opened this issue Aug 14, 2023 · 7 comments
Labels: enhancement (New feature or request), high prio

gregoryca commented Aug 14, 2023

Is your feature request related to a problem? Please describe.
Currently, when using GPU offloading, I offload 35 layers (the maximum that fits in 8 GB of VRAM) to get speedy generation.
If I have, for example, one flow that uses a vicuna model and another that uses gpt-4all-j, I get the following error once one of the models has already been loaded into the GPU's VRAM:

2:58PM DBG GRPC(vicuna.bin-127.0.0.1:36471): stderr CUDA error 2 at /build/go-llama/llama.cpp/ggml-cuda.cu:5500: out of memory

Describe the solution you'd like
Would it be possible to clear the VRAM when a model has been idle for longer than, say, 5 or 10 seconds, so that it can be freed for another model to load? The idle time could be a configurable variable so that we can tune different flows with different models.

Additional context
I have tried using fewer layers, but that also makes generation slower. It was possible to use two models at the same time with a performance penalty, but I would rather have the option to clear the VRAM on idle, or to specify when the VRAM should be cleared (for example, when reloading the page where the chatbot/API/front-end runs).

gregoryca added the enhancement (New feature or request) label Aug 14, 2023
localai-bot (Contributor) commented

⚠️⚠️⚠️⚠️⚠️

Hi! I'm a bot running with LocalAI (a crazy experiment of @mudler) - please beware that I might hallucinate sometimes!

But... I can also be funny or helpful 😸, and I can generally give good tips or point to places to look in the documentation or the code, based on what you wrote in the issue.

Don't engage in conversation with me - I don't support replying (yet)!

⚠️⚠️⚠️⚠️⚠️

Thank you for your feature request. I understand that you would like to clear the VRAM of the GPU when a model is idle for longer than a certain period of time, so that it can be loaded by another model. This would help improve the performance of LocalAI when using multiple models with limited VRAM.
To implement this feature, we could add a timer that tracks how long a model has been idle in VRAM and unloads it once that time exceeds a certain threshold. We could also expose the threshold as a variable so that it can be tuned for different flows with different models.
I will add this feature request to our TODO list and we will consider implementing it in a future release of LocalAI. In the meantime, you can try reducing the number of layers offloaded by each model to lower VRAM usage, or use fewer models in each flow to avoid running out of VRAM.
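
For illustration only, here is a minimal Go sketch of the idle-timeout mechanism described above. This is not LocalAI's actual implementation: the Backend interface, the IdleWatchdog type, and its method names are hypothetical, and a real implementation would hook into the model loader and the gRPC backends. The configurable timeout corresponds to the variable requested in the issue.

```go
package watchdog

import (
	"sync"
	"time"
)

// Backend is a hypothetical handle to a loaded model backend.
// In LocalAI the real type differs; this is for illustration only.
type Backend interface {
	Name() string
	Unload() error
}

// IdleWatchdog unloads backends that have not served a request for
// longer than the configured timeout, freeing the VRAM they hold.
type IdleWatchdog struct {
	mu       sync.Mutex
	timeout  time.Duration
	lastUsed map[string]time.Time
	backends map[string]Backend
}

func NewIdleWatchdog(timeout time.Duration) *IdleWatchdog {
	return &IdleWatchdog{
		timeout:  timeout,
		lastUsed: map[string]time.Time{},
		backends: map[string]Backend{},
	}
}

// Touch records that a backend just served a request.
func (w *IdleWatchdog) Touch(b Backend) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.backends[b.Name()] = b
	w.lastUsed[b.Name()] = time.Now()
}

// Run periodically checks for idle backends and unloads them
// until the stop channel is closed.
func (w *IdleWatchdog) Run(stop <-chan struct{}) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			w.mu.Lock()
			for name, last := range w.lastUsed {
				if time.Since(last) > w.timeout {
					_ = w.backends[name].Unload() // frees the VRAM held by this model
					delete(w.backends, name)
					delete(w.lastUsed, name)
				}
			}
			w.mu.Unlock()
		}
	}
}
```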


mudler (Owner) commented Aug 17, 2023

I think this is quite cool and should go up in priority - it's definitely a way to enhance resource utilization with little to no configuration for the user.

scott4290 (Contributor) commented

Other options: Check available VRAM before loading a model. If the model is larger than the available memory and another model is already loaded, unload models as needed to free up VRAM and allow the new model to be loaded. In my case, I'd like to keep the model in memory indefinitely until I select another model.

mudler (Owner) commented Aug 19, 2023

JFYI: meanwhile, #925 was merged as a step in this direction - it allows keeping only "one" backend loaded at a time, so a system that is not capable of running parallel models can still be used by simply stopping the unused backends.

mudler (Owner) commented Dec 1, 2023

This is currently supported by the watchdog introduced in #1341, which keeps track of busy and idle backends.

ER-EPR commented Feb 26, 2024

Other options: Check available VRAM before loading a model. If the model is larger than the available memory and another model is already loaded, unload models as needed to free up VRAM and allow the new model to be loaded. In my case, I'd like to keep the model in memory indefinitely until I select another model.

Hi @mudler, is this auto-clear-and-replace VRAM function implemented now? I'm still getting the kind of error where llama.cpp can't load a new model because the VRAM is occupied by an already loaded model. Do I need to set some parameters in the config file or environment variables? I can see that the watchdog function can be turned on, but I think a 'switch to the new model' kind of operation is more practical. Last time I checked, when I used llama.cpp alone it would switch to the newly requested model and clear the previous one if necessary, but that was done through a reset via the interface provided by llama.cpp. Can you add a switch to automatically reset llama.cpp when a new model needs to be loaded into the GPU?

naifmeh commented May 18, 2024

@ER-EPR
In case you're still stuck with this, I believe that you could set the LOCALAI_SINGLE_ACTIVE_BACKEND environment variable to true and it will unload any existing model before loading a new one.
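
As a rough illustration of that behaviour, the Go sketch below unloads every resident backend before loading a new one when the variable is set. The LOCALAI_SINGLE_ACTIVE_BACKEND name comes from the comment above; the ModelLoader type, its fields, and the unload callbacks are hypothetical and only stand in for LocalAI's real loader.

```go
package loader

import "os"

// ModelLoader is a hypothetical loader that tracks resident backends
// by mapping each model name to a function that unloads it.
type ModelLoader struct {
	loaded map[string]func() error
}

func NewModelLoader() *ModelLoader {
	return &ModelLoader{loaded: map[string]func() error{}}
}

// singleActiveBackend reports whether only one backend may be resident
// at a time, based on the environment variable mentioned above.
func singleActiveBackend() bool {
	return os.Getenv("LOCALAI_SINGLE_ACTIVE_BACKEND") == "true"
}

// Load evicts every resident backend first when single-active mode is
// on, then records the new model. The actual model loading is elided.
func (m *ModelLoader) Load(name string, unload func() error) error {
	if singleActiveBackend() {
		for other, u := range m.loaded {
			if err := u(); err != nil { // free the VRAM held by the previous model
				return err
			}
			delete(m.loaded, other)
		}
	}
	m.loaded[name] = unload
	return nil
}
```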
