Initial Demo

Jun Duan edited this page May 2, 2026 · 11 revisions

This fast-model-actuation (FMA) demo showcases how quickly we can switch between three LLMs served on a single shared GPU.

Watch a recording of the demo here: https://youtu.be/4ecoaz0TL2s

FMA highlights

The recording explained

The concise 2-minute recording has two terminal windows side by side for comparison.

The left-hand window demonstrates FMA's fast actuation, while the right-hand window shows a cold start.

Left window: In two minutes, FMA completes 10 rounds of [hot model switch + inference].
Right window: As the baseline, a model needs the full two minutes to cold-start and finish 1 inference.
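
To put the two windows in proportion, the figures above can be turned into rough per-cycle averages. This is only back-of-the-envelope arithmetic on the numbers stated for the recording, assuming the window is exactly 120 seconds; it is not a measured latency breakdown, and the variable names are illustrative.

```python
# Both terminals are recorded over the same two-minute window (assumed to be exactly 120 s).
WINDOW_S = 120

fma_cycles = 10   # left window: rounds of hot model switch + inference
cold_cycles = 1   # right window: one cold start + inference

# Average wall-clock time per completed inference in each window.
fma_avg_s = WINDOW_S / fma_cycles
cold_avg_s = WINDOW_S / cold_cycles

# How many FMA cycles fit in the time of one cold-start cycle.
ratio = cold_avg_s / fma_avg_s

print(f"FMA: ~{fma_avg_s:.0f} s per cycle, cold start: ~{cold_avg_s:.0f} s, ratio: ~{ratio:.0f}x")
```

Under these assumptions, each FMA cycle averages about 12 seconds, roughly a tenth of the cold-start baseline.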

The GPU used in this demo is a single NVIDIA H100 shared by all three LLMs.

Setup

FMA is part of the llm-d ecosystem. This demo is deployed on top of:

Inference scheduler for llm-d: https://github.com/llm-d/llm-d-inference-scheduler/
Kubernetes Gateway API Inference Extension: https://gateway-api-inference-extension.sigs.k8s.io/

FMA uses vLLM to serve the models. The three models used in turn by this demo are:

deepseek-ai/DeepSeek-R1-Distill-Llama-8B
ibm-granite/granite-3.3-8b-instruct
TechxGenus/Meta-Llama-3-70B-Instruct-GPTQ

FMA leverages vLLM's sleep mode. For each inference request in the demo, FMA does the following:

wakes up the corresponding model in response to a user's inference request
fulfills the inference request
puts the model back to sleep mode
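
The per-request cycle above can be sketched in Python as follows. This is a minimal illustration, not FMA's actual implementation: `SleepableEngine`, `FakeEngine`, and `handle_request` are hypothetical names, and the fake engine only records calls so the sketch runs without a GPU. In a real deployment, an adapter would presumably forward these calls to vLLM's sleep mode (for example, the `sleep` and `wake_up` methods of an engine created with sleep mode enabled).

```python
from typing import Dict, List, Protocol


class SleepableEngine(Protocol):
    """Hypothetical interface; a real adapter would wrap a vLLM engine."""

    def wake_up(self) -> None: ...
    def generate(self, prompt: str) -> str: ...
    def sleep(self, level: int = 1) -> None: ...


class FakeEngine:
    """In-memory stand-in that logs calls instead of touching a GPU."""

    def __init__(self, name: str, log: List[str]) -> None:
        self.name = name
        self.log = log
        self.asleep = True  # every model starts out asleep

    def wake_up(self) -> None:
        self.asleep = False
        self.log.append(f"wake {self.name}")

    def generate(self, prompt: str) -> str:
        if self.asleep:
            raise RuntimeError(f"{self.name} is asleep")
        self.log.append(f"infer {self.name}")
        return f"[{self.name}] reply to: {prompt}"

    def sleep(self, level: int = 1) -> None:
        self.asleep = True
        self.log.append(f"sleep {self.name}")


def handle_request(engines: Dict[str, SleepableEngine], model: str, prompt: str) -> str:
    """Wake the requested model, serve one request, then put it back to sleep."""
    engine = engines[model]
    engine.wake_up()
    try:
        return engine.generate(prompt)
    finally:
        # Return the model to sleep even if inference fails, so the
        # shared GPU is released for the next model.
        engine.sleep(level=1)


if __name__ == "__main__":
    log: List[str] = []
    models = [
        "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        "ibm-granite/granite-3.3-8b-instruct",
    ]
    engines = {m: FakeEngine(m, log) for m in models}
    for m in models:
        print(handle_request(engines, m, "Hello"))
    print(log)
```

Putting the `sleep` call in a `finally` block is a design choice of this sketch: it keeps at most one model awake at a time, which matches the wake, serve, sleep order described above.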

More information

The demo is based on Milestone 2 of the project; stay tuned for more in Milestone 3.