Is your feature request related to a problem? Please describe.
I would like more control over which models can load concurrently. At the moment we can use "Max Active Backends", but that only caps the count: a limit of 3 would still allow three large models to load at once, which may not fit in memory.
Describe the solution you'd like
Literally the matrix configuration from llama-swap: mostlygeek/llama-swap#643
This would allow specifying more complicated rules like "allow my zed prediction model to run alongside anything, but don't allow my two 120b models to run alongside each other".
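To make the idea concrete, here is a rough sketch of what such a configuration could look like. This is illustrative only: the key names (`concurrency`, `exclusions`) and model names are hypothetical and not llama-swap's actual syntax, which is discussed in the linked issue.

```yaml
# Hypothetical sketch, not real llama-swap syntax.
# Pairs listed under exclusions may never be resident at the same time;
# models with no entry (e.g. the small zed prediction model) can co-load freely.
concurrency:
  exclusions:
    - [large-model-a-120b, large-model-b-120b]
```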
Describe alternatives you've considered
Max Active Backends - but it has the problem described above: it limits how many models can load, not which combinations can load together.
Additional context