add option to consider a "single" backend active #909

Closed
mudler opened this issue Aug 17, 2023 · 3 comments · Fixed by #925
Assignees
mudler
Labels
enhancement New feature or request

Comments

@mudler
Owner

mudler commented Aug 17, 2023

Is your feature request related to a problem? Please describe.
In some cases we are bound to a single GPU - in situations where we scale the service horizontally, it makes sense to avoid handling multiple requests at once if we have only one GPU available.

Describe the solution you'd like
A flag to allow only one gRPC service to be active at all times (see the sketch at the end of this description).

Describe alternatives you've considered
N/A

Additional context

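For illustration, a minimal sketch of what such a flag-controlled guard could look like; the flag, package, and type names here are hypothetical, not actual LocalAI code:

package backend

import "sync"

// singleActiveBackend sketches a process-wide guard: when the
// (hypothetical) --single-active-backend flag is set, only one gRPC
// backend may serve a request at a time, so a single GPU is never
// shared between concurrently loaded models.
type singleActiveBackend struct {
	enabled bool
	mu      sync.Mutex
}

// Acquire blocks until the caller is the only active backend.
// It is a no-op when the flag is disabled.
func (s *singleActiveBackend) Acquire() {
	if s.enabled {
		s.mu.Lock()
	}
}

// Release frees the slot for the next backend request.
func (s *singleActiveBackend) Release() {
	if s.enabled {
		s.mu.Unlock()
	}
}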
@mudler mudler added the enhancement New feature or request label Aug 17, 2023
@mudler mudler self-assigned this Aug 17, 2023
@dave-gray101
Collaborator

dave-gray101 commented Aug 17, 2023

Personally, I think we should aim a little higher than a single lockout like this. As a counter-proposal, I suggest that backends intended to be automatically started by LocalAI provide a "capabilities request" object of some kind - either as a config file, or perhaps a simple Go method that returns a struct?

This request should be something like (pardon the ugly pseudocode, I'm on a phone):

{
  required: [
    { gpu: 1 },        // gpu specifications??
    { memory: "7gb" }
  ],
  optional: [
    { memory: "12gb" }
  ]
}

Where the required resources are literally required to load at all, and optional is more what the backend will use up to - perhaps taking context size into account, perhaps just being a greedy allocator.
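Expressed as a Go struct, since a Go method returning a struct was one of the options mentioned above (the type and field names are purely illustrative, not an actual LocalAI API):

package backend

// ResourceSpec describes one resource requirement; a sketch only -
// the exact fields (GPU specifications, VRAM, etc.) are still an open question.
type ResourceSpec struct {
	GPUs     int    // number of GPUs needed or usable
	MemoryMB uint64 // memory in megabytes
}

// CapabilitiesRequest is what a backend could return (or ship as a
// config file) before LocalAI decides whether to launch it.
type CapabilitiesRequest struct {
	Required ResourceSpec // must be satisfied for the backend to load at all
	Optional ResourceSpec // upper bound the backend will greedily use
}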

Then, the user could either configure a set of resource limits, or we could attempt auto-detection (gopsutil has some useful pieces there, from my current research).

Finally, when the initializer goes to launch the process, we just... don't, if any required resource exceeds the limits. Optionals are a bit harder - I foresee adjusting the provided config to use a shorter max context (or similar) in order to fit the available constraints. A rough sketch of the required-resource check is below.
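As a sketch of that launch-time gate, using gopsutil for memory auto-detection as suggested above (the canLaunch name and signature are made up for illustration, and GPU detection is left out):

package backend

import (
	"fmt"

	"github.com/shirou/gopsutil/v3/mem"
)

// canLaunch compares a backend's required memory (taken from its
// capabilities request) against what the host actually has, auto-detected
// via gopsutil, and refuses to start the process if it doesn't fit.
func canLaunch(requiredMemoryMB uint64) error {
	vm, err := mem.VirtualMemory()
	if err != nil {
		return fmt.Errorf("detecting host memory: %w", err)
	}
	if requiredMemoryMB*1024*1024 > vm.Available {
		return fmt.Errorf("backend requires %d MB but only %d MB are available",
			requiredMemoryMB, vm.Available/(1024*1024))
	}
	// A GPU count/VRAM check would slot in here once GPU detection exists.
	return nil
}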

One of the reasons I actually added the /backend/monitor endpoint in my latest PR was to have an easy way to get a sense of what kind of memory consumption backends are actually using in practice, in order to have data for that sort of config.

@mudler
Owner Author

mudler commented Aug 17, 2023

I think what you describe falls more into the bucket of a workload orchestrator, which would be out of scope here - I'd be more tempted by a simple approach that is easy to digest and scales at the same time.

There is already a feature request around idling backends (see #892), which I think makes sense - at least for me that's a more natural direction even when looking at scale: you don't really want a single API instance as a front-end, but rather to load-balance requests between multiple instances.

@dave-gray101
Collaborator

If the orchestrator role is out of scope, I think it is worth considering the role of the gallery in all this as well - it might be helpful to have some of the memory requirements I mentioned above configured at the model configuration layer instead, and used more for guiding users to models that are appropriate for their hardware.

In that case, I think some kind of "per GPU" mutex as you described is an appropriate response - but theoretically we might want to support multi-GPU machines, so perhaps it's worth investigating whether we want to support that?
