# Running Mistral Large 3 675B Instruct with vLLM on NVIDIA GPUs

**Authors:** [Katja Sirazitdinova](https://github.com/katjasrz), [Jay Rodge](https://github.com/jayrodge), [Mitesh Patel](https://github.com/patelmiteshn), Developer Advocates @ NVIDIA

---

This notebook provides a comprehensive guide on how to run the **Mistral Large 3 675B Instruct** model using vLLM. 

Mistral Large 3 is a state-of-the-art general-purpose multimodal granular Mixture-of-Experts model with 41B active parameters and 675B total parameters.

This is the instruct post-trained version, fine-tuned for instruction following, which makes it well suited to chat, agentic, and instruction-based use cases. Designed for reliability and long-context comprehension, it is engineered for production-grade assistants, retrieval-augmented systems, scientific workloads, and complex enterprise workflows.

## Launch on NVIDIA Brev

You can simplify the environment setup by using [NVIDIA Brev](https://developer.nvidia.com/brev). Click the button below to launch this project on a Brev instance with the necessary dependencies pre-configured.

Once deployed, click on the "Open Notebook" button to get started with this guide.

[![Launch on Brev](https://brev-assets.s3.us-west-1.amazonaws.com/nv-lb-dark.svg)](https://brev.nvidia.com/launchable/deploy?launchableID=env-36IT87VH89qYpkwYrouOUVWeNSK)

## Table of contents

- [Prerequisites](#Prerequisites)
  - [Verifying your system](#Verifying-your-system)
  - [Installing vLLM](#Installing-vLLM)
- [Launch OpenAI-compatible server](#Launch-OpenAI-compatible-server)
- [Client setup](#Client-setup)
- [Testing some scenarios](#Testing-some-scenarios)
  - [Instruction following](#Instruction-following)
  - [Vision reasoning](#Vision-reasoning)
  - [Function calling](#Function-calling)
- [Conclusion and resources](#Conclusion-and-resources)

## Prerequisites

Mistral Large 3 is deployable on-premises at [FP8](https://huggingface.co/mistralai/Mistral-Large-3-675B-FP8-Instruct-2512) on a single node of B200 or H200 GPUs, with H200 having a reduced context window.

This notebook is configured by default to run on a machine with 8 GPUs and sufficient VRAM to hold the 675B parameter model. If your hardware is different, adjust `--tensor-parallel-size` (tensor parallelism) and the other resource-related flags in the server launch commands below.
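As a rough sanity check on what "sufficient VRAM" means, the weight footprint can be estimated from the parameter count and the bytes stored per parameter. The sketch below is back-of-the-envelope arithmetic only: the bytes-per-parameter values are nominal assumptions, and it ignores quantization scale factors, the KV cache, activations, and framework overhead.

```python
# Nominal storage cost per parameter (assumption: ignores scale factors and
# any layers that are kept in higher precision).
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "nvfp4": 0.5}

TOTAL_PARAMS = 675e9  # total parameter count stated in this guide


def weight_footprint_gb(precision: str, params: float = TOTAL_PARAMS) -> float:
    """Approximate weight size in GB (10^9 bytes) for a given precision."""
    return params * BYTES_PER_PARAM[precision] / 1e9


def per_gpu_weights_gb(precision: str, num_gpus: int) -> float:
    """Weights per GPU, assuming an even split across tensor-parallel ranks."""
    return weight_footprint_gb(precision) / num_gpus


for precision in BYTES_PER_PARAM:
    total = weight_footprint_gb(precision)
    share = per_gpu_weights_gb(precision, 8)
    print(f"{precision:>5}: ~{total:,.1f} GB total, ~{share:,.1f} GB per GPU on 8 GPUs")
```

Whatever memory remains after the weights are loaded is what the server can use for the KV cache and activations, so a smaller per-GPU share leaves more room for long contexts and concurrent requests.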

### Verifying your system

Let's verify your system is ready for **Mistral Large 3 675B Instruct**.

In [None]:
#!/usr/bin/env python3
import subprocess
import platform
import shutil

print("="*70)
print("SYSTEM INFORMATION")
print("="*70)
print(f"OS: {platform.system()} {platform.release()}")
print(f"Python: {platform.python_version()}")

# Check if nvidia-smi exists
if shutil.which("nvidia-smi") is None:
    print("❌ nvidia-smi not found: NVIDIA drivers are missing or not in PATH.")
    print("   Unable to detect GPU hardware.")
    exit(1)

print("="*70)
print("GPU DETAILS (nvidia-smi)")
print("="*70)

try:
    # Query GPU name + total memory
    query_cmd = [
        "nvidia-smi",
        "--query-gpu=name,memory.total",
        "--format=csv,noheader,nounits"
    ]

    output = subprocess.check_output(query_cmd, text=True)
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    total_memory_gb = 0.0

    print(f"Number of GPUs detected: {len(lines)}")

    for i, line in enumerate(lines):
        name, mem_mib = [x.strip() for x in line.split(",")]
        mem_gb = float(mem_mib) / 1024
        total_memory_gb += mem_gb

        print(f"\nGPU[{i}]:")
        print(f"  Name: {name}")
        print(f"  Total Memory: {mem_gb:.2f} GB")

        if "H200" in name:
            print("  Status: ✅ Hopper architecture - Supported")
        elif "B200" in name or "GB200" in name:
            print("  Status: ✅ Blackwell architecture - Optimal")
        else:
            print("  Status: ⚠️ Unknown/older architecture - May have limitations")

    print(f"\nTotal GPU Memory (All GPUs): {total_memory_gb:.2f} GB")

except Exception as e:
    print("❌ Failed to parse GPU info from nvidia-smi")
    print(e)
    exit(1)

print("\n" + "="*70)
print("NVLINK STATUS")
print("="*70)

try:
    nvlink = subprocess.check_output(["nvidia-smi", "nvlink", "--status"],
                                     text=True, stderr=subprocess.STDOUT)
    print("✅ NVLink detected & queryable\n")
    print(nvlink.strip())
except (subprocess.CalledProcessError, OSError):
    # A non-zero exit code or a failure to launch nvidia-smi both land here
    print("⚠️ NVLink not detected or unavailable")

print("\n" + "="*70)
print("CONFIGURATION RECOMMENDATIONS")
print("="*70)

if total_memory_gb >= 1100:
    print("✅ Enough VRAM for large models - recommended EP/DP execution")
elif total_memory_gb >= 900:
    print("⚠️ Borderline for largest models - FP8 or TP recommended")
elif total_memory_gb > 0:
    print("❌ VRAM too low for full-size models - use smaller/quantized checkpoints")
else:
    print("❌ No GPUs detected - GPU is required")

### Installing vLLM

Install the latest vLLM nightly build with the Mistral backends enabled so you get the Blackwell kernels and MoE optimizations required for this checkpoint:

In [None]:
!uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly

Doing so should automatically install [`mistral_common >= 1.8.6`](https://github.com/mistralai/mistral-common/releases/tag/v1.8.6).

To verify the version: 

In [None]:
import mistral_common; print(mistral_common.__version__)
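To go one step beyond printing the version, the check against the 1.8.6 minimum can be automated. The following is a minimal sketch using only the standard library; `release_tuple` and `meets_minimum` are illustrative helper names, and the simplified parsing only compares the numeric release segment (pre-release suffixes such as `rc1` are ignored).

```python
from importlib.metadata import PackageNotFoundError, version

MIN_MISTRAL_COMMON = "1.8.6"  # minimum version named in this guide


def release_tuple(ver: str) -> tuple:
    """Turn '1.8.6' or '1.9.0.dev1' into (1, 8, 6) or (1, 9, 0)."""
    parts = []
    for piece in ver.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric segment, e.g. 'dev1'
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = MIN_MISTRAL_COMMON) -> bool:
    """Compare numeric release segments only."""
    return release_tuple(installed) >= release_tuple(minimum)


try:
    installed = version("mistral_common")
    status = "OK" if meets_minimum(installed) else f"too old, need >= {MIN_MISTRAL_COMMON}"
    print(f"mistral_common {installed}: {status}")
except PackageNotFoundError:
    print("mistral_common is not installed in this environment")
```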

You can also use a ready-to-go Docker image, either built from the [Dockerfile](https://github.com/vllm-project/vllm/blob/main/Dockerfile) in the vLLM repository or pulled from [Docker Hub](https://hub.docker.com/layers/vllm/vllm-openai/latest/images/sha256-de9032a92ffea7b5c007dad80b38fd44aac11eddc31c435f8e52f3b7404bbf39).

## Launch OpenAI-compatible server

When launching an OpenAI-compatible server, the exact configuration you use should depend heavily on the hardware available in your cluster and the level of model quantization you choose. Different GPUs and memory budgets will favor different precision settings and kernel implementations. In particular, high-end setups with multiple B200 GPUs can push for more aggressive optimizations like FP8 to maximize throughput without sacrificing too much quality.

Below are example configurations optimized for different setups. Run them in a terminal window.

Set the `MODEL` variable to one of the official checkpoints published by Mistral so the launch commands pull the right weights:

- `mistralai/Mistral-Large-3-675B-Instruct-2512` (FP8, recommended for B200/H200)
- `mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4` (NVFP4 for H100/A100)
- `mistralai/Mistral-Large-3-675B-Instruct-2512-BF16` (full precision for fidelity testing)
- `mistralai/Mistral-Large-3-675B-Instruct-2512-Eagle` (draft model for speculative decoding)

You can also set it to a local path if you have already downloaded the model:

```bash
MODEL="/path/to/the/model"
```

#### FP8 on 8xB200

On an 8×B200 node, we can safely run the model in FP8 to squeeze out significantly higher effective throughput and better hardware utilization.

* We rely on FlashInfer kernels for both the Multi-Head Latent Attention (MLA) path and the Mixture-of-Experts (MoE) layers in FP8. These kernels are optimized for modern NVIDIA architectures and are designed to reduce latency and improve tokens-per-second, especially at larger batch sizes and longer context lengths.

* The key-value (KV) cache is also kept in FP8 format. This drastically cuts memory consumption for long-context inference and allows the model to handle more concurrent requests or longer sequences without running out of GPU memory. The trade-off in numerical precision is usually minimal for inference workloads, while the performance gain is substantial.

Overall, this FP8 + FlashInfer configuration is aimed at high-throughput, production-grade serving on 8×B200, where the priority is maximizing utilization and request throughput while still maintaining acceptable response quality.

```bash
VLLM_ATTENTION_BACKEND=FLASHINFER_MLA \
VLLM_USE_FLASHINFER_MOE_FP8=1 \
VLLM_FLASHINFER_MOE_BACKEND=latency \
vllm serve "$MODEL" \
  --load-format mistral \
  --tokenizer-mode mistral \
  --config-format mistral \
  --max_model_len 65536 \
  --max_num_seqs 128 \
  --tensor-parallel-size 8 \
  --enable-auto-tool-choice \
  --tool-call-parser mistral \
  --limit-mm-per-prompt '{"image":10}' \
  --kv-cache-dtype fp8 \
  --host 0.0.0.0 \
  --port 8000
```
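The memory effect of an FP8 KV cache can be illustrated with simple arithmetic: halving the bytes per cached value roughly doubles the number of tokens that fit in a fixed budget. The sketch below uses placeholder numbers; `budget` and `elements` are hypothetical and are not Mistral Large 3's real cache dimensions.

```python
BYTES_PER_ELEMENT = {"bf16": 2, "fp8": 1}


def cacheable_tokens(cache_budget_bytes: int, elements_per_token: int, dtype: str) -> int:
    """Number of tokens whose cached values fit in the budget for a given dtype."""
    return cache_budget_bytes // (elements_per_token * BYTES_PER_ELEMENT[dtype])


# Hypothetical values purely for illustration.
budget = 40 * 1024**3   # bytes left for the KV cache on one GPU
elements = 36_864       # cached values per token, summed over all layers

for dtype in BYTES_PER_ELEMENT:
    print(f"{dtype}: {cacheable_tokens(budget, elements, dtype):,} tokens")
```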

#### NVFP4 on 8xB200

This configuration uses FlashInfer for MLA while switching MoE layers to NVFP4, a format optimized for NVIDIA architectures that provides a tighter balance between efficiency and output quality compared to raw FP8. NVFP4 reduces memory footprint substantially, allowing high batch concurrency and long-context serving without hitting capacity ceilings.

MLA operations run on FlashInfer for fast attention kernels, and MoE experts are quantized to NVFP4. This keeps expert computation light without a critical loss in fidelity, making this option well-suited for heavy MoE workloads or cost-sensitive production environments. Keeping the KV cache in FP8 further reduces memory pressure.

The result is a configuration that hits a sweet spot between speed, memory, and response quality.

```bash
VLLM_ATTENTION_BACKEND=FLASHINFER_MLA \
VLLM_USE_FLASHINFER_MOE_FP4=1 \
VLLM_FLASHINFER_MOE_BACKEND=latency \
vllm serve "$MODEL" \
  --load-format mistral \
  --tokenizer-mode mistral \
  --config-format mistral \
  --max_model_len 65536 \
  --max_num_seqs 128 \
  --tensor-parallel-size 8 \
  --enable-auto-tool-choice \
  --tool-call-parser mistral \
  --limit-mm-per-prompt '{"image":10}' \
  --kv-cache-dtype fp8 \
  --quantization modelopt_fp4 \
  --host 0.0.0.0 \
  --port 8000
```

#### BF16 on 8xB200

Running in full BF16 precision gives the highest numeric stability and preserves model quality, but it is considerably more memory-hungry than FP8 or FP4 variants. On an 8×B200 configuration, the model fits, but just barely, so the serving parameters need to be tightened to stay within the VRAM budget.

* Reduced max context length. The maximum sequence length is lowered to avoid exhausting GPU memory during sustained or concurrent inference.

* GPU memory utilization raised to 0.95. This lets vLLM claim a larger fraction of each GPU's memory, leaving more room for the KV cache once the weights are loaded.

This mode is ideal if you want maximum output fidelity and training-like precision, and you're willing to trade off context length and system elasticity for it.

```bash
vllm serve "$MODEL" \
  --load-format mistral \
  --tokenizer-mode mistral \
  --config-format mistral \
  --max_model_len 32768 \
  --max_num_seqs 128 \
  --tensor-parallel-size 8 \
  --enable-auto-tool-choice \
  --tool-call-parser mistral \
  --limit-mm-per-prompt '{"image":10}' \
  --gpu-memory-utilization=0.95 \
  --host 0.0.0.0 \
  --port 8000
```

The first startup might take a long time.
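Rather than guessing when the server has finished loading, you can poll it until it responds. The helper below is a minimal sketch using only the standard library. It assumes the server exposes a `/health` route on the port used in the launch commands, and the 30-minute timeout and 10-second interval are arbitrary defaults, not recommendations.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(url, timeout_s=1800.0, interval_s=10.0,
                     probe=http_ok, sleep=time.sleep) -> bool:
    """Poll `probe(url)` until it succeeds or `timeout_s` seconds elapse."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Example (assumes the server launched above is listening on port 8000):
# wait_until_ready("http://127.0.0.1:8000/health")
```

The `probe` and `sleep` parameters are injectable so the loop can be exercised without a running server.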

## Client setup

Once the server is running, connect using the OpenAI Python client. The endpoint exposes an OpenAI-compatible interface, so the standard OpenAI Python client will work without any additional adapters or client-side modifications. You simply point the client to your local server URL and provide the API key expected by vLLM (it can be any non-empty string unless you explicitly enforce authentication).

In [21]:
from openai import OpenAI

# Connect to vLLM server
client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="dummy"  # vLLM doesn't require a real API key
)

# Must match the model name the server was launched with
MODEL_NAME = "mistralai/Mistral-Large-3-675B-Instruct-2512"

print(f"Connected to vLLM server at http://127.0.0.1:8000")
print(f"Using model: {MODEL_NAME}")

Connected to vLLM server at http://127.0.0.1:8000
Using model: mistralai/Mistral-Large-3-675B-Instruct-2512
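The model name sent by the client has to match a name the server actually serves. Instead of hard-coding it, you can ask the server through the OpenAI client's `models.list()` call. The sketch below shows one way to do that; `served_model_ids` and `resolve_model_name` are illustrative helper names, not part of any library.

```python
def served_model_ids(client) -> list:
    """Return the model ids the server reports via the models endpoint."""
    return [model.id for model in client.models.list().data]


def resolve_model_name(client, preferred=None) -> str:
    """Use `preferred` if the server serves it, otherwise the first served model."""
    ids = served_model_ids(client)
    if not ids:
        raise RuntimeError("The server reports no models")
    if preferred in ids:
        return preferred
    return ids[0]


# Example, using the `client` created above:
# MODEL_NAME = resolve_model_name(client)
```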


## Testing some scenarios

According to its authors, Mistral Large 3 is perfect for:

* Long document understanding
* Daily-driver AI assistants
* Agentic and tool-use capabilities
* Enterprise knowledge work
* General coding assistant

Let's test some of its features!

### Instruction following

To guide the model toward a specific behavior or response style, you can supply a system prompt that defines rules, tone, formatting expectations, and constraints.

In [69]:
def load_system_prompt(filename: str) -> str:
    with open(filename, "r") as file:
        system_prompt = file.read()
    return system_prompt

TEMP = 0.15
MAX_TOK = 100000
SYSTEM_PROMPT = load_system_prompt("SYSTEM_PROMPT.txt")

resp = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Invent a fun board game and explain the rules in under 120 words."}
    ],
    temperature=TEMP,
    max_tokens=MAX_TOK,
)
print(resp.choices[0].message.content)

Okay, let's think about this. I need to invent a fun board game and explain the rules in under 120 words. First, I should think of a theme. Maybe something adventurous, like exploring a jungle or a space mission. But to keep it simple, perhaps a treasure hunt theme would work well.

Next, I need to think about the mechanics. Maybe players move around the board collecting items or solving puzzles to find the treasure. But to make it unique, perhaps there's a twist, like having to avoid traps or compete with other players.

Let's go with a treasure hunt theme. Players start at the base camp and move around the board collecting treasure maps. Each map leads to a different part of the board where treasure is hidden. But there are traps and obstacles that can set players back.

Now, to write the rules concisely:

1. Players start at the base camp.
2. On their turn, players roll the die and move their piece.
3. Land on a treasure map space to collect a map.
4. Use maps to find treasure, but 
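The sample output above stops mid-sentence. One common cause of a cut-off reply is reaching the token limit, which an OpenAI-compatible response reports in the `finish_reason` field of each choice. The helper below is a small sketch for inspecting it; `describe_finish` is an illustrative name, and whether this explains the sample above is not something the output alone can confirm.

```python
def describe_finish(resp) -> str:
    """Summarize why generation stopped for the first choice of a chat response."""
    reason = resp.choices[0].finish_reason
    if reason == "length":
        return "truncated: hit max_tokens or the context limit"
    if reason == "tool_calls":
        return "stopped to request a tool call"
    if reason == "stop":
        return "completed normally"
    return f"finished with reason: {reason}"


# Example, using the `resp` object from the cell above:
# print(describe_finish(resp))
```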

### Vision reasoning

Vision reasoning refers to the model's ability to interpret visual inputs and apply logical inference on top of what it sees: not just recognizing objects, but understanding relationships, spatial context, cause-and-effect, and intent within an image. Instead of simply labelling elements, the model can describe scenes, infer actions, identify patterns, and answer questions that require comprehension rather than pattern-matching alone. This enables more advanced use cases such as analyzing diagrams, extracting information from charts, interpreting UI layouts, or evaluating photos for consistency and meaning. In short, vision reasoning bridges visual perception and conceptual understanding, allowing the model to think about images rather than merely see them.

In [84]:
TEMP = 0.15
MAX_TOK = 100000
SYSTEM_PROMPT = load_system_prompt("SYSTEM_PROMPT.txt")
# Feel free to replace with any image of your choice!
IMAGE_URL = "https://blogs.nvidia.com/wp-content/uploads/2020/11/marbles-at-night.jpg"

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Describe what can happen next in this scene. Be creative and think of an unusual scenario",
            },
            {"type": "image_url", 
             "image_url": {"url": IMAGE_URL}
            },
        ],
    },
]

resp = client.chat.completions.create(
    model=MODEL_NAME,
    messages=messages,
    temperature=TEMP,
    max_tokens=MAX_TOK,
)

print(resp.choices[0].message.content)

Okay, the image shows a cluttered workshop or studio filled with various tools, materials, and objects. It looks like an artist's or sculptor's workspace. There are busts, tools, jars with brushes, and lots of other items scattered around.

Now, I need to think of an unusual scenario that could happen next in this scene. Let's brainstorm some ideas:

1. **Unexpected Visitor**: Maybe a curious creature, like a small animal or even a mythical being, enters the workshop. Perhaps a raccoon starts rummaging through the materials, or a tiny dragon begins to play with the tools.

2. **Magical Transformation**: The objects in the workshop could start to come to life. The busts might start talking, the tools could begin moving on their own, and the materials could start forming into new shapes.

3. **Time Travel**: The artist might discover an old device or artifact in the workshop that allows them to travel through time. They could end up in a different era, bringing back unique items or ideas
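The request above references a public image URL. For a local file, a common approach with OpenAI-style chat APIs is to embed the image as a base64 data URL in the same `image_url` field. The sketch below assumes your server accepts data URLs; `image_to_data_url` and `image_part` are illustrative helpers.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type from {path!r}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_part(url: str) -> dict:
    """Build an `image_url` content part in the shape used by the request above."""
    return {"type": "image_url", "image_url": {"url": url}}


# Example (the file name is a placeholder):
# part = image_part(image_to_data_url("my_photo.jpg"))
```

The resulting dictionary can replace the `image_url` entry in the `messages` list; keep in mind that large images increase the request size considerably.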

### Function calling

Function calling allows the model to generate structured outputs that trigger real functions in your application, turning natural-language queries into executable actions. Instead of returning plain text, the model produces arguments in a predefined schema, enabling you to safely map intent to code paths such as querying a database, retrieving documents, sending notifications, or performing calculations. This turns the model into a reasoning layer that interprets user requests, decides when a tool should be invoked, and returns well-formed call signatures that programs can act on.

In [79]:
TEMP = 0.15
MAX_TOK = 100000
IMAGE_URL = "https://cdna.artstation.com/p/assets/images/images/050/827/584/large/rafael-chies-14.jpg?1655798602"

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_population",
            "description": "Get the up-to-date population of a given country.",
            "parameters": {
                "type": "object",
                "properties": {
                    "country": {
                        "type": "string",
                        "description": "The country to find the population of.",
                    },
                    "unit": {
                        "type": "string",
                        "description": "The unit for the population.",
                        "enum": ["millions", "thousands"],
                    },
                },
                "required": ["country", "unit"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "rewrite",
            "description": "Rewrite a given text for improved clarity",
            "parameters": {
                "type": "object",
                "properties": {
                    "text": {
                        "type": "string",
                        "description": "The input text to rewrite",
                    }
                },
            },
        },
    },
]

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Can you tell me which country is shown in this image?",
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": IMAGE_URL,
                },
            },
        ],
    },
]

resp = client.chat.completions.create(
    model=MODEL_NAME,
    messages=messages,
    temperature=TEMP,
    max_tokens=MAX_TOK,
    tools=tools,
    tool_choice="auto",
)

assistant_message = resp.choices[0].message.content
print(assistant_message, "\n")

messages.extend([
    {"role": "assistant", "content": assistant_message},
    {"role": "user", "content": "What is the population of that country in millions?"},
])

resp = client.chat.completions.create(
    model=MODEL_NAME,
    messages=messages,
    temperature=TEMP,
    max_tokens=MAX_TOK,
    tools=tools,
    tool_choice="auto",
)

print(resp.choices[0].message.tool_calls)

The image depicts the interior of a modern, well-equipped commercial kitchen, which doesn't inherently indicate a specific country. However, there are some clues that suggest it might be in **Japan**:

1. **Signage and Text**: The writing on the walls appears to be in Japanese.
2. **Design and Layout**: The style of the kitchen and the organization is consistent with what you might find in Japanese restaurants or food establishments.

These elements suggest that the image is likely from Japan. 

[ChatCompletionMessageFunctionToolCall(id='chatcmpl-tool-a50c24f1bb90c3fe', function=Function(arguments='{"country": "Japan", "unit": "millions"}', name='get_current_population'), type='function')]
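The model only returns the call signature; running the function is left to your application. The sketch below shows one way to dispatch the returned tool calls to local Python functions and package the results as `tool` role messages. The `get_current_population` body is a hypothetical stand-in that returns a placeholder rather than real data, and `TOOL_REGISTRY` and `run_tool_calls` are illustrative names.

```python
import json


def get_current_population(country: str, unit: str) -> dict:
    """Hypothetical stand-in: a real implementation would query a data source."""
    return {"country": country, "unit": unit, "population": None, "note": "placeholder"}


# Maps tool names from the `tools` schema to local callables.
TOOL_REGISTRY = {"get_current_population": get_current_population}


def run_tool_calls(tool_calls, registry=TOOL_REGISTRY) -> list:
    """Execute each requested tool and return `tool` role messages with the results."""
    results = []
    for call in tool_calls or []:
        func = registry.get(call.function.name)
        if func is None:
            payload = {"error": f"unknown tool: {call.function.name}"}
        else:
            # Arguments arrive as a JSON string and are passed as keyword arguments.
            payload = func(**json.loads(call.function.arguments))
        results.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(payload),
        })
    return results


# Example, using the `resp` object from the cell above:
# tool_messages = run_tool_calls(resp.choices[0].message.tool_calls)
```

To let the model turn the result into a final answer, you would typically append the assistant message containing the tool calls, followed by these tool messages, and send the conversation back to the server.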


## Conclusion and resources

Congratulations! You successfully deployed the **Mistral Large 3 675B Instruct** model using vLLM.

In this notebook, you have learned how to:

- Set up your environment and install vLLM.
- Launch and manage an OpenAI-compatible server to run the model.
- Perform instruction following, vision reasoning, and function calling tasks using the OpenAI client.

You can adapt tensor parallelism, ports, and sampling parameters to your hardware and application needs.

Refer to the following resources if you want to learn more.

### Documentation
- 📚 [Mistral Large 3 Model Card](https://huggingface.co/mistralai/Mistral-Large-3-675B-Instruct-2512)
- 🏗️ [NVIDIA vLLM Guide](https://docs.nvidia.com/deeplearning/frameworks/vllm-release-notes/index.html)

### Code and kernels
- 💾 [FlashInfer kernel library](https://github.com/flashinfer-ai/flashinfer)
- ⚡ [FlashMLA Implementation](https://github.com/deepseek-ai/FlashMLA)
- 🧪 [Mistral Examples](https://github.com/mistralai)

### Community
- 📧 [NVIDIA Developer Forums](https://forums.developer.nvidia.com/)

### Acknowledgments

Special thanks to the Mistral and vLLM teams for their incredible work on these technologies.