Model Concurrency Matrix #9659

@rhyst

Description

Is your feature request related to a problem? Please describe.
I would like more control over which models can load concurrently. At the moment we can use "Max Active Backends", but that would still allow e.g. three large models to load at once, which may not fit in memory.

Describe the solution you'd like

Literally the matrix configuration from llama-swap: mostlygeek/llama-swap#643
This would allow specifying more complicated rules like "allow my zed prediction model to run alongside anything, but don't allow my two 120b models to run alongside each other".
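To illustrate the kind of rule the matrix would express, here is a minimal sketch of a pairwise-compatibility check. The group names, model names, and the `allowed` helper are all hypothetical, not llama-swap's actual schema:

```python
# Hypothetical concurrency matrix: maps a pair of model groups to
# whether they may be loaded at the same time. Pairs not listed are
# allowed by default.
CONCURRENCY = {
    ("large-120b", "large-120b"): False,  # never two 120b models together
    ("zed-predict", "large-120b"): True,  # prediction model runs alongside anything
}

# Hypothetical mapping from configured models to their groups.
MODEL_GROUP = {
    "big-model-a": "large-120b",
    "big-model-b": "large-120b",
    "zed-edit-predict": "zed-predict",
}

def allowed(candidate: str, loaded: list[str]) -> bool:
    """Return True if `candidate` may load alongside every model in `loaded`."""
    cg = MODEL_GROUP[candidate]
    for model in loaded:
        mg = MODEL_GROUP[model]
        # Check the pair in either order; absent pairs default to allowed.
        rule = CONCURRENCY.get((cg, mg), CONCURRENCY.get((mg, cg), True))
        if not rule:
            return False
    return True

print(allowed("big-model-b", ["big-model-a"]))       # False: two 120b models
print(allowed("zed-edit-predict", ["big-model-a"]))  # True: prediction + 120b
```

A server would consult such a check before loading a requested model, and either evict a conflicting model or queue the request, depending on its swap policy.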

Describe alternatives you've considered
Max Active Backends - but it has the problem mentioned above.

Additional context

Metadata

Assignees

No one assigned

Labels

enhancement (New feature or request)

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests