Initial Demo

Jun Duan edited this page May 2, 2026 · 11 revisions

This fast-model-actuation (FMA) demo showcases how quickly we can switch between three LLMs served on a single shared GPU.

Watch a recording of the demo here: https://youtu.be/4ecoaz0TL2s

FMA highlights

The recording explained

The concise 2-minute recording has two terminal windows side by side for comparison.

The left-hand window demonstrates FMA's fast actuation; the right-hand window shows a cold start as the baseline.

  • Left window: Within two minutes, FMA completes 10 cycles, each consisting of a hot model switch followed by an inference.
  • Right window: As the baseline, a single model needs the full two minutes to cold-start and finish 1 inference.
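To make the comparison concrete, the arithmetic behind the two bullets can be sketched as below. The inputs are only the rounded figures quoted above (a 2-minute window, 10 cycles versus 1), so the results are rough averages, not measured latencies.

```python
# Rounded figures taken from the description of the recording.
WINDOW_SECONDS = 120          # length of the recording, about 2 minutes
FMA_CYCLES = 10               # hot model switch + inference cycles (left window)
COLD_START_CYCLES = 1         # cold start + inference cycles (right window)

# Average wall-clock time per cycle in each window.
fma_seconds_per_cycle = WINDOW_SECONDS / FMA_CYCLES
cold_seconds_per_cycle = WINDOW_SECONDS / COLD_START_CYCLES

# How many hot cycles fit in the time of one cold-start cycle.
ratio = cold_seconds_per_cycle / fma_seconds_per_cycle

print(f"FMA: ~{fma_seconds_per_cycle:.0f} s per switch + inference")
print(f"Cold start: ~{cold_seconds_per_cycle:.0f} s per start + inference")
print(f"Ratio: ~{ratio:.0f}x")
```

Each per-cycle figure includes the inference itself, so the switch time alone is somewhat shorter than the averaged value.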

The demo uses a single NVIDIA H100 GPU shared by all three LLMs.

Setup

FMA is part of the llm-d ecosystem. This demo is deployed on top of

FMA uses vLLM to serve the models. The three models used in turn by this demo are:

  • deepseek-ai/DeepSeek-R1-Distill-Llama-8B
  • ibm-granite/granite-3.3-8b-instruct
  • TechxGenus/Meta-Llama-3-70B-Instruct-GPTQ

FMA leverages vLLM's sleep mode. For each inference request in the demo, FMA does the following:

  1. wakes up the corresponding model in response to a user's inference request
  2. fulfills the inference request
  3. puts the model back into sleep mode
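The three steps above can be sketched against a vLLM server as follows. This is a minimal illustration, not FMA's actual implementation. It assumes one vLLM server per model, started with `--enable-sleep-mode` and with `VLLM_SERVER_DEV_MODE=1` set so that the development endpoints `/wake_up` and `/sleep` are exposed. The base URL, the helper names `http_post` and `serve_one_request`, and the `max_tokens` value are hypothetical choices made for this example.

```python
import json
import urllib.request
from typing import Callable, Optional

# Hypothetical transport type: (url, optional JSON body) -> parsed JSON reply.
Post = Callable[[str, Optional[dict]], dict]


def http_post(url: str, body: Optional[dict] = None) -> dict:
    """POST an optional JSON body and return the parsed JSON reply ({} if empty)."""
    data = json.dumps(body).encode() if body is not None else b""
    req = urllib.request.Request(
        url,
        data=data,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        raw = resp.read()
    return json.loads(raw) if raw else {}


def serve_one_request(base_url: str, model: str, prompt: str,
                      post: Post = http_post) -> dict:
    """Wake the model, run one completion, then put the model back to sleep."""
    # Step 1: wake the sleeping model (assumed vLLM dev-mode endpoint).
    post(f"{base_url}/wake_up", None)
    try:
        # Step 2: fulfill the request through the OpenAI-compatible API.
        return post(
            f"{base_url}/v1/completions",
            {"model": model, "prompt": prompt, "max_tokens": 32},
        )
    finally:
        # Step 3: return the model to sleep even if inference failed,
        # so the shared GPU memory is released for the next model.
        post(f"{base_url}/sleep?level=1", None)
```

A call might look like `serve_one_request("http://localhost:8000", "ibm-granite/granite-3.3-8b-instruct", "Hello")`, where the address is a placeholder. Passing the transport as a parameter keeps the wake, infer, sleep ordering easy to check without a running server.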

More information

The demo is based on Milestone 2 of the project. Stay tuned for more in Milestone 3.