Reload a model in VRAM #892
I think this is quite cool and should go up in priority - it's definitely a way to improve resource utilization with little to no configuration for the user.
Other options: check available VRAM before loading a model. If the model is larger than the available memory and another model is already loaded, unload models as needed to free up VRAM and allow the new model to be loaded. In my case, I'd like to keep the model in memory indefinitely until I select another model.
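Just to illustrate the "check free VRAM before loading" idea above, a rough sketch is below. The `nvidia-smi` query is a real command, but the required-size threshold is a made-up placeholder (in practice it would be derived from the model size and the number of offloaded layers), and LocalAI does not expose this as a setting today.

```sh
# Sketch only: decide whether an already-loaded model would need to be
# unloaded before loading a new one, based on currently free VRAM.
REQUIRED_MIB=5500   # hypothetical VRAM needed for the model to be loaded

# Free memory (MiB) on the first GPU.
FREE_MIB=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | head -n1)

if [ "$FREE_MIB" -lt "$REQUIRED_MIB" ]; then
  echo "Only ${FREE_MIB} MiB free: an already-loaded model would have to be unloaded first."
else
  echo "${FREE_MIB} MiB free: safe to load the new model."
fi
```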
JFYI, meanwhile #925 was merged, which goes in this direction: it allows keeping only "one" backend loaded at a time, so a system that cannot run models in parallel can still work by stopping the unused backends.
This is currently supported by the watchdog introduced in #1341, which keeps track of busy and idle backends.
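For anyone landing here later: with the single-active-backend option and the watchdog in place, the behaviour asked for in this issue can roughly be configured through environment variables. The exact variable names below are an assumption based on recent LocalAI releases and may differ between versions, so please check `local-ai --help` or the docs for your version.

```sh
# Illustrative only - names may vary between LocalAI releases.

# Keep at most one backend loaded at a time (see #925):
export LOCALAI_SINGLE_ACTIVE_BACKEND=true

# Enable the idle watchdog (see #1341) so backends that sit unused
# are stopped and their VRAM freed after the given timeout:
export LOCALAI_WATCHDOG_IDLE=true
export LOCALAI_WATCHDOG_IDLE_TIMEOUT=15m
```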
Hi @mudler, is this auto clear-and-replace VRAM function implemented now? I'm still getting errors where llama.cpp can't load a new model because VRAM is occupied by an already loaded one. Do I need to set some parameters in the config file or environment variables? I can see that the watchdog can be turned on, but I think a "switch to the new model" kind of operation is more practical. Last time I checked, when I used llama.cpp on its own it would switch to the newly requested model and clear the previous one if necessary, via the reset exposed by llama.cpp's own interface. Could you add a switch to automatically reset llama.cpp when a new model needs to be loaded into the GPU?
@ER-EPR
Is your feature request related to a problem? Please describe.
Currently, when using GPU offloading, I offload 35 layers (the maximum that fits in 8 GB of VRAM) to get speedy generation.
If I have, for example, one flow that uses a vicuna model and another that uses gpt-4all-j, I get the following error when one of the models has already been loaded into the GPU's VRAM:
2:58PM DBG GRPC(vicuna.bin-127.0.0.1:36471): stderr CUDA error 2 at /build/go-llama/llama.cpp/ggml-cuda.cu:5500: out of memory
Describe the solution you'd like
Would it be possible to clear the VRAM when a model has been idle for longer than, say, 5 or 10 seconds, so the VRAM is freed for another model? The amount of time could be a variable, so that we can tweak different flows with different models.
Additional context
I have tried using fewer layers, but that also makes generation slower. It was possible to use two models at the same time with a performance penalty, but I'd rather have the option to clear the VRAM on idle, or to specify when the VRAM should be cleared (for example, when reloading the page where the chatbot/API/front-end runs).