Description
Is your feature request related to a problem? Please describe.
When loading multiple models, it is currently hard to avoid filling the available VRAM unless the watchdog or the single active backend mode is used.
Describe the solution you'd like
Ideally, LocalAI could estimate the available VRAM and unload the least recently used model when there is not enough left. Since it is hard to predict the VRAM used across all backends, a simpler solution would be to monitor the current usage and, if loading a model fails, automatically unload one of the loaded models and retry. A rough sketch of that idea follows below.
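A minimal sketch of the evict-on-failure idea, in Go since that is what LocalAI is written in. Everything here is hypothetical: `tryLoad`, `unload`, `errOutOfVRAM`, and `loadWithEviction` are illustrative placeholders, not the actual LocalAI model loader API.

```go
// Sketch: when a model load fails due to VRAM exhaustion, evict the
// least-recently-used loaded model and retry, until the load succeeds
// or there is nothing left to evict.
package main

import (
	"errors"
	"fmt"
	"sort"
	"time"
)

// loadedModel tracks when a model was last used so we know which one to evict.
type loadedModel struct {
	name     string
	lastUsed time.Time
}

// errOutOfVRAM stands in for whatever error a backend surfaces when VRAM runs out.
var errOutOfVRAM = errors.New("out of VRAM")

// tryLoad is a placeholder for the real backend load call. A real implementation
// would start the backend and report an allocation failure (e.g. CUDA OOM) as an error.
func tryLoad(name string) error {
	return nil
}

// unload is a placeholder for stopping a backend and freeing its VRAM.
func unload(name string) {
	fmt.Println("unloading", name)
}

// loadWithEviction retries a failed load after evicting the LRU model.
func loadWithEviction(name string, loaded []loadedModel) ([]loadedModel, error) {
	for {
		err := tryLoad(name)
		if err == nil {
			return append(loaded, loadedModel{name: name, lastUsed: time.Now()}), nil
		}
		if !errors.Is(err, errOutOfVRAM) || len(loaded) == 0 {
			return loaded, err
		}
		// Sort by last use, evict the oldest, and try again.
		sort.Slice(loaded, func(i, j int) bool {
			return loaded[i].lastUsed.Before(loaded[j].lastUsed)
		})
		unload(loaded[0].name)
		loaded = loaded[1:]
	}
}

func main() {
	loaded := []loadedModel{
		{name: "llama-13b", lastUsed: time.Now().Add(-10 * time.Minute)},
		{name: "whisper", lastUsed: time.Now()},
	}
	loaded, err := loadWithEviction("mistral-7b", loaded)
	fmt.Println(len(loaded), err)
}
```

The upside of this approach is that it needs no per-backend VRAM accounting: the failure itself is the signal that memory ran out.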
Describe alternatives you've considered
Another option is to allow only a fixed number of concurrent backends. This does not really scale, because it does not take into account how much VRAM each model may consume.
Additional context