Initial Demo
This fast-model-actuation (FMA) demo showcases how quickly we can switch between three LLMs served on a single shared GPU.
Watch a recording of the demo here: https://youtu.be/4ecoaz0TL2s

The 2-minute recording shows two terminal windows side by side for comparison.
The left-hand window demonstrates FMA's fast actuation; the right-hand window shows a cold start.
- Left window: In two minutes, FMA completes 10 rounds of [hot model switch + inference].
- Right window: As the baseline, a model needs the full two minutes to cold-start and finish 1 inference.
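As a rough back-of-the-envelope check, the two figures above can be turned into average per-round times. This is only a sketch based on the rounded numbers quoted for the recording (10 rounds versus 1 round in about two minutes); the variable names are illustrative and the results are not precise measurements.

```python
# Rounded figures quoted for the recording; not precise measurements.
WINDOW_S = 120      # length of the recording, in seconds
fma_rounds = 10     # hot model switch + inference rounds completed by FMA
cold_rounds = 1     # cold start + inference rounds completed by the baseline

fma_avg_s = WINDOW_S / fma_rounds    # average seconds per FMA round
cold_avg_s = WINDOW_S / cold_rounds  # average seconds per cold-start round
ratio = cold_avg_s / fma_avg_s       # how many FMA rounds fit in one cold start

print(f"FMA: ~{fma_avg_s:.0f} s per round, cold start: ~{cold_avg_s:.0f} s, ratio ~{ratio:.0f}x")
```

Under these assumptions, each FMA round takes about 12 seconds on average, roughly a tenth of the cold-start round.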
The GPU used in this demo is a single NVIDIA H100 shared by all three LLMs.
FMA is part of the llm-d ecosystem. This demo is deployed on top of:
- Inference scheduler for llm-d: https://github.com/llm-d/llm-d-inference-scheduler/
- Kubernetes Gateway API Inference Extension: https://gateway-api-inference-extension.sigs.k8s.io/
FMA uses vLLM to serve the models. The three models used in turn by this demo are:
- deepseek-ai/DeepSeek-R1-Distill-Llama-8B
- ibm-granite/granite-3.3-8b-instruct
- TechxGenus/Meta-Llama-3-70B-Instruct-GPTQ
FMA leverages vLLM's sleep mode. For each inference request in the demo, FMA does the following:
- wakes up the corresponding model in response to a user's inference request
- fulfills the inference request
- puts the model back to sleep mode
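The wake, serve, sleep cycle above can be sketched against vLLM's HTTP server. This is a minimal illustration, not FMA's actual implementation. It assumes one vLLM server per model, started with `--enable-sleep-mode` and with `VLLM_SERVER_DEV_MODE=1` set so that the development endpoints `/wake_up` and `/sleep` are exposed. The helper names `http_transport` and `serve_once`, the base URL, and the `max_tokens` value are hypothetical.

```python
import json
import urllib.request
from typing import Callable, Optional

# Transport signature: (method, url, JSON body or None) -> response body as text.
Transport = Callable[[str, str, Optional[dict]], str]


def http_transport(method: str, url: str, body: Optional[dict] = None) -> str:
    """Send one HTTP request using only the standard library."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(url, data=data, method=method)
    if data is not None:
        req.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()


def serve_once(base_url: str, model: str, prompt: str,
               send: Transport = http_transport) -> str:
    """Wake the model, run one completion, then put the model back to sleep."""
    # Assumption: the server exposes vLLM's dev-mode sleep endpoints.
    send("POST", f"{base_url}/wake_up", None)
    try:
        reply = send("POST", f"{base_url}/v1/completions",
                     {"model": model, "prompt": prompt, "max_tokens": 32})
        return json.loads(reply)["choices"][0]["text"]
    finally:
        # Sleep again even if the inference request failed.
        send("POST", f"{base_url}/sleep?level=1", None)
```

With a server for one of the demo models listening at a hypothetical `http://localhost:8000`, a call would look like `serve_once("http://localhost:8000", "ibm-granite/granite-3.3-8b-instruct", "Hello")`. The transport is injectable so the call sequence can be exercised without a GPU.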
The demo is based on Milestone 2 of the project; stay tuned for more in Milestone 3.