
feat: add --single-active-backend to allow only one backend active at the time #925

Merged
Merged 5 commits into master from one_backend on Aug 18, 2023

Conversation

mudler (Owner) commented on Aug 18, 2023

Description

This PR fixes #909 by adding a simple mechanism to manage single devices. It adds a single-active-backend (SINGLE_ACTIVE_BACKEND) CLI flag: when enabled, LocalAI makes sure only one backend is in use at a time, and automatically stops the currently loaded backend when a new request targets a different one, but only once it is idle (otherwise the new request waits). This allows, for instance, generating an image with one GPU and then immediately chatting with an LLM using the same GPU. This is fundamental when two consecutive requests to different backends target the same GPU; as things stand today, LocalAI would simply crash. A minimal sketch of this behaviour follows below.
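
As an illustration only, here is a minimal Python sketch of the queue-and-evict behaviour described above. The class name `SingleActiveBackend` and the `start`/`stop` callbacks are hypothetical and not part of LocalAI (whose implementation is in Go); the sketch only shows the idea of waiting for the active backend to go idle before replacing it.

```python
# Minimal sketch (hypothetical, not LocalAI code): only one backend may be
# loaded at a time; a request for a different backend waits until the active
# one is idle, stops it, and then starts the new one.
import threading


class SingleActiveBackend:
    def __init__(self):
        self._cond = threading.Condition()
        self._active = None   # name of the backend currently loaded
        self._busy = 0        # requests in flight on the active backend

    def acquire(self, backend, start, stop):
        """Ensure `backend` is the single active one before serving a request."""
        with self._cond:
            # Wait while a *different* backend is still serving requests.
            while self._active not in (None, backend) and self._busy > 0:
                self._cond.wait()
            if self._active != backend:
                if self._active is not None:
                    stop(self._active)  # free the GPU held by the previous backend
                start(backend)
                self._active = backend
            self._busy += 1

    def release(self):
        with self._cond:
            self._busy -= 1
            self._cond.notify_all()
```

In this sketch, an image-generation request followed by a chat request would acquire the image backend first and the LLM backend second; the second acquire blocks until the first backend has finished its in-flight work and has been stopped.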

In multi-GPU scenarios it is already possible to specify a CUDA device for llama, which allows fine-grained control over the devices being used; however, multi-GPU management is out of scope for this PR (it focuses only on the specific single-device case).
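
The exact LocalAI option for pinning a backend to a CUDA device is not shown in this PR; as a generic, hypothetical illustration, a similar effect can be obtained process-wide with the standard CUDA_VISIBLE_DEVICES environment variable:

```python
# Generic illustration (not a LocalAI API): restrict a backend worker process
# to GPU 1 by setting CUDA_VISIBLE_DEVICES before it initializes CUDA.
import os
import subprocess

env = dict(os.environ, CUDA_VISIBLE_DEVICES="1")
# "backend_server.py" is a hypothetical backend entry point used for illustration.
subprocess.Popen(["python", "backend_server.py"], env=env)
```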

It also lowers the gRPC server workers for the Python backends to 1, which allows only one request at a time (requests appear to be queued automatically), restoring the old behavior. I tried this with diffusers, and parallel requests did not work well there at all.
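
For context, a generic grpcio server limited to one worker thread looks like the snippet below (this is not the exact code changed in this PR): with max_workers=1, concurrent RPCs are queued on the single executor thread and handled one at a time.

```python
# Generic grpcio example: one worker thread means one request at a time;
# additional RPCs wait in the thread pool's queue until the worker is free.
from concurrent import futures
import grpc

server = grpc.server(futures.ThreadPoolExecutor(max_workers=1))
# Service registration omitted; any registered servicer is served serially.
server.add_insecure_port("[::]:50051")
server.start()
server.wait_for_termination()
```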

Notes for Reviewers

Signed commits

  • Yes, I signed my commits.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
mudler merged commit afdc0eb into master on Aug 18, 2023
14 checks passed
mudler deleted the one_backend branch on August 18, 2023 at 23:49
mudler mentioned this pull request on Aug 19, 2023
mudler added the enhancement (New feature or request) label on Aug 24, 2023
@gregoryca commented

This should improve how models are handled when idling! I'm going to test it this week to see how the interaction with different backends goes, and report back with more info.
