add option to consider a "single" backend active #909

Closed
mudler opened this issue Aug 17, 2023 · 3 comments · Fixed by #925
Assignees
mudler
Labels
enhancement New feature or request

Comments

@mudler
Owner

mudler commented Aug 17, 2023

Is your feature request related to a problem? Please describe.
In some cases we are bound to a single GPU - in situations where we scale the service horizontally, it makes sense to avoid handling multiple requests at once if we have only one GPU available.

Describe the solution you'd like
A flag to allow only one gRPC service to be active at all times (see the sketch at the end of this description).

Describe alternatives you've considered
N/A

Additional context

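For illustration, a minimal sketch of what such a flag-controlled guard could look like; the flag, package, and type names here are hypothetical, not actual LocalAI code:

package backend

import "sync"

// singleActiveBackend sketches a process-wide guard: when the
// (hypothetical) --single-active-backend flag is set, only one gRPC
// backend may serve a request at a time, so a single GPU is never
// shared between concurrently loaded models.
type singleActiveBackend struct {
	enabled bool
	mu      sync.Mutex
}

// Acquire blocks until the caller is the only active backend.
// It is a no-op when the flag is disabled.
func (s *singleActiveBackend) Acquire() {
	if s.enabled {
		s.mu.Lock()
	}
}

// Release frees the slot for the next backend request.
func (s *singleActiveBackend) Release() {
	if s.enabled {
		s.mu.Unlock()
	}
}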
@mudler mudler added the enhancement New feature or request label Aug 17, 2023
@mudler mudler self-assigned this Aug 17, 2023
@dave-gray101
Collaborator

dave-gray101 commented Aug 17, 2023

Personally, I think we should aim a little higher than a single lockout like this. As a counter-proposal, I suggest that backends intended to be automatically started by LocalAI provide a "capabilities request" object of some kind - either as a config file, or perhaps a simple Go method that returns a struct?

This request should be something like (pardon the ugly pseudocode, I'm on a phone):

{
  required: [
    { gpu: 1 },        // gpu specifications??
    { memory: "7gb" }
  ],
  optional: [
    { memory: "12gb" }
  ]
}

Where the required resources are literally required to load at all, and optional is more what the backend will use up to - perhaps taking context size into account, perhaps just being a greedy allocator.
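Expressed as a Go struct, since a Go method returning a struct was one of the options mentioned above (the type and field names are purely illustrative, not an actual LocalAI API):

package backend

// ResourceSpec describes one resource requirement; a sketch only -
// the exact fields (GPU specifications, VRAM, etc.) are still an open question.
type ResourceSpec struct {
	GPUs     int    // number of GPUs needed or usable
	MemoryMB uint64 // memory in megabytes
}

// CapabilitiesRequest is what a backend could return (or ship as a
// config file) before LocalAI decides whether to launch it.
type CapabilitiesRequest struct {
	Required ResourceSpec // must be satisfied for the backend to load at all
	Optional ResourceSpec // upper bound the backend will greedily use
}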

Then, the user could either configure a set of resource limits, or we could attempt auto-detection (gopsutil has some useful pieces there, from my current research).

Finally, when the initializer goes to launch the process, we just... don't, if any required resource exceeds the limits. Optionals are a bit harder - I foresee adjusting the provided config to use a shorter max context (or similar) in order to fit the available constraints. A rough sketch of the required-resource check is below.
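As a sketch of that launch-time gate, using gopsutil for memory auto-detection as suggested above (the canLaunch name and signature are made up for illustration, and GPU detection is left out):

package backend

import (
	"fmt"

	"github.com/shirou/gopsutil/v3/mem"
)

// canLaunch compares a backend's required memory (taken from its
// capabilities request) against what the host actually has, auto-detected
// via gopsutil, and refuses to start the process if it doesn't fit.
func canLaunch(requiredMemoryMB uint64) error {
	vm, err := mem.VirtualMemory()
	if err != nil {
		return fmt.Errorf("detecting host memory: %w", err)
	}
	if requiredMemoryMB*1024*1024 > vm.Available {
		return fmt.Errorf("backend requires %d MB but only %d MB are available",
			requiredMemoryMB, vm.Available/(1024*1024))
	}
	// A GPU count/VRAM check would slot in here once GPU detection exists.
	return nil
}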

One of the reasons I actually added the /backend/monitor endpoint in my latest PR was to have an easy way to get a sense of what kind of memory consumption backends are actually using in practice, in order to have data for that sort of config.

@mudler
Owner Author

mudler commented Aug 17, 2023

I think what you describe falls more into the bucket of a workload orchestrator, which would be out of scope here - I'd be more tempted by a simple approach that is easy to digest and scales at the same time.

There is already a feature request around idling backends (see #892), which I think makes sense - at least for me that's a more natural direction even when looking at scale: you don't really want a single API instance as a front-end, but rather to load-balance requests between multiple instances.

@dave-gray101
Collaborator

If the orchestrator role is out of scope, I think it is worth considering the role of the gallery in all this as well - it might be helpful to have some of the memory requirements I mentioned above configured at the model configuration layer instead, and used more for guiding users to models that are appropriate for their hardware.

In that case, I think some kind of "per GPU" mutex as you described is an appropriate response - but theoretically we might want to support multi-GPU machines, so perhaps it's worth investigating whether we want to support that?
