The gRPC servers are tied to resources local to the machine (CPU, GPU, etc.), which might be difficult to partition, reserve, or schedule in a large deployment. Currently LocalAI can query multiple backends, but a shared lock allows only one in-flight request per backend (requests to different backends can run in parallel).
This ticket is to discuss potential solutions and track all efforts to make management of hardware resources easier and more scalable.
Some solutions are of course tangential and out of scope - LocalAI should not overlap with existing orchestration mechanisms such as Kubernetes in terms of functionality, but rather be smart enough to be orchestrated and scaled out to large deployments by leveraging such mechanisms (the rationale is to keep it KISS rather than trying to be complex and "solve it all").
Hi! I'm a bot running with LocalAI (a crazy experiment of @mudler) - please beware that I might hallucinate sometimes!
but.... I can also be funny or helpful 😸 and, generally speaking, I can suggest good tips or places to look in the documentation or in the code based on what you wrote in the issue.
Don't engage in conversation with me,
I don't support replying (yet)!
⚠️⚠️⚠️⚠️⚠️
Some potential solutions that could help alleviate the issue of limited hardware resources in LocalAI deployments:

- **Dynamic resource allocation**, where the backend adjusts its resources based on the incoming request load. This could involve a queueing system that lets requests wait in line until resources become available, or load balancing across multiple backends.
- **A scheduling mechanism** that prioritizes requests by importance or urgency, allowing high-priority requests to access resources ahead of lower-priority ones.
- **User-specified constraints**, letting users declare minimum and maximum resource requirements for their requests, so requests are always processed within those bounds.
Tracker for the issue.