# Running Mistral Large 3 675B Instruct with SGLang on NVIDIA GPUs

This notebook is a step-by-step guide to running the **Mistral Large 3 675B Instruct** model with SGLang.

Mistral Large 3 is a state-of-the-art general-purpose multimodal granular Mixture-of-Experts model with 41B active parameters and 675B total parameters.

This model is the instruct post-trained version, fine-tuned for instruction-following tasks, which makes it well suited to chat, agentic, and instruction-based use cases. It is designed for reliability and long-context comprehension, and is engineered for production-grade assistants, retrieval-augmented systems, scientific workloads, and complex enterprise workflows.

## Launch on NVIDIA Brev

You can simplify the environment setup by using [NVIDIA Brev](https://developer.nvidia.com/brev). Click the button below to launch this project on a Brev instance with the necessary dependencies pre-configured.

Once deployed, click on the "Open Notebook" button to get started with this guide.

[![Launch on Brev](https://brev-assets.s3.us-west-1.amazonaws.com/nv-lb-dark.svg)](https://brev.nvidia.com/launchable/deploy?launchableID=env-36ITIC3pJeCMnsea4ihqV0uyU8K)

## Table of contents

- [Prerequisites](#Prerequisites)
  - [Verifying your system](#Verifying-your-system)
  - [Install SGLang and dependencies](#Install-SGLang-and-dependencies)
- [Launch SGLang server](#Launch-SGLang-server)
- [Client setup](#Client-setup)
- [Testing some scenarios](#Testing-some-scenarios)
  - [Instruction following](#Instruction-following)
  - [Vision reasoning](#Vision-reasoning)
  - [Function calling](#Function-calling)
- [Cleaning up](#Cleaning-up)
- [Conclusion and resources](#Conclusion-and-resources)

## Prerequisites

Mistral Large 3 is deployable on-premises at [FP8](https://huggingface.co/mistralai/Mistral-Large-3-675B-FP8-Instruct-2512) on a single node of B200 or H200 GPUs, with H200 having a reduced context window.

This notebook is configured by default to run on a machine with 8 GPUs and sufficient VRAM to hold the 675B parameter model. If your hardware is different, be sure to adjust the `--tensor-parallel-size` (tensor parallelism) and other resource-related flags in the server launch command.
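As a rough sanity check on these hardware requirements, the weight footprint can be estimated from the parameter count and the bytes stored per parameter. The sketch below is a back-of-the-envelope calculation only: the bytes-per-parameter values are nominal assumptions, the function name is illustrative, and the estimate ignores quantization scales, the KV cache, activations, and framework overhead.

```python
# Back-of-the-envelope estimate of weight memory for a 675B-parameter model.
# Assumption: nominal bytes per parameter. Real checkpoints also store scales,
# and serving needs extra memory for the KV cache and activations.
TOTAL_PARAMS = 675e9

BYTES_PER_PARAM = {
    "BF16": 2.0,
    "FP8": 1.0,
    "NVFP4": 0.5,
}

def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Return the approximate weight size in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9

estimates = {
    name: weight_memory_gb(TOTAL_PARAMS, size)
    for name, size in BYTES_PER_PARAM.items()
}
for name, gb in estimates.items():
    print(f"{name}: ~{gb:.0f} GB of weights")
```

Comparing these figures with the total GPU memory reported in the next section gives a first indication of how much headroom is left for the KV cache, and therefore for the context window.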

### Verifying your system

Let's verify your system is ready for **Mistral Large 3 675B Instruct**.

In [1]:
#!/usr/bin/env python3
import subprocess
import platform
import shutil

print("="*70)
print("="*70)
print(f"OS: {platform.system()} {platform.release()}")
print(f"Python: {platform.python_version()}")

# Check if nvidia-smi exists
if shutil.which("nvidia-smi") is None:
    print("❌ nvidia-smi not found - NVIDIA drivers are missing or not in PATH.")
    print("   Unable to detect GPU hardware.")
    raise SystemExit(1)  # stop here without relying on the interactive exit() helper

print("="*70)
print("GPU DETAILS (nvidia-smi)")
print("="*70)

try:
    # Query GPU name + total memory
    query_cmd = [
        "nvidia-smi",
        "--query-gpu=name,memory.total",
        "--format=csv,noheader,nounits"
    ]

    output = subprocess.check_output(query_cmd, text=True)
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    total_memory_gb = 0.0

    print(f"Number of GPUs detected: {len(lines)}")

    for i, line in enumerate(lines):
        name, mem_mib = [x.strip() for x in line.split(",")]
        mem_gb = float(mem_mib) / 1024
        total_memory_gb += mem_gb

        print(f"\nGPU[{i}]:")
        print(f"  Name: {name}")
        print(f"  Total Memory: {mem_gb:.2f} GB")

        if "H200" in name:
            print("  Status: ✅ Hopper architecture - Supported")
        elif "B200" in name or "GB200" in name:
            print("  Status: ✅ Blackwell architecture - Optimal")
        else:
            print("  Status: ⚠️ Unknown/older architecture - May have limitations")

    print(f"\nTotal GPU Memory (All GPUs): {total_memory_gb:.2f} GB")

except Exception as e:
    print("❌ Failed to parse GPU info from nvidia-smi")
    print(e)
    raise SystemExit(1)

print("\n" + "="*70)
print("NVLINK STATUS")
print("="*70)

try:
    nvlink = subprocess.check_output(["nvidia-smi", "nvlink", "--status"],
                                     text=True, stderr=subprocess.STDOUT)
    print("✅ NVLink detected & queryable\n")
    print(nvlink.strip())
except (subprocess.CalledProcessError, OSError):
    # nvidia-smi exited with an error or could not be executed
    print("⚠️ NVLink not detected or unavailable")

print("\n" + "="*70)
print("CONFIGURATION RECOMMENDATIONS")
print("="*70)

if total_memory_gb >= 1100:
    print("✅ Enough VRAM for large models - recommended EP/DP execution")
elif total_memory_gb >= 900:
    print("⚠️ Borderline for largest models - FP8 or TP recommended")
elif total_memory_gb > 0:
    print("❌ VRAM too low for full-size models - use smaller/quantized checkpoints")
else:
    print("❌ No GPUs detected - GPU is required")


OS: Linux 6.8.0-60-generic
Python: 3.11.14
GPU DETAILS (nvidia-smi)
Number of GPUs detected: 8

GPU[0]:
  Name: NVIDIA B200
  Total Memory: 179.06 GB
  Status: ✅ Blackwell architecture - Optimal

GPU[1]:
  Name: NVIDIA B200
  Total Memory: 179.06 GB
  Status: ✅ Blackwell architecture - Optimal

GPU[2]:
  Name: NVIDIA B200
  Total Memory: 179.06 GB
  Status: ✅ Blackwell architecture - Optimal

GPU[3]:
  Name: NVIDIA B200
  Total Memory: 179.06 GB
  Status: ✅ Blackwell architecture - Optimal

GPU[4]:
  Name: NVIDIA B200
  Total Memory: 179.06 GB
  Status: ✅ Blackwell architecture - Optimal

GPU[5]:
  Name: NVIDIA B200
  Total Memory: 179.06 GB
  Status: ✅ Blackwell architecture - Optimal

GPU[6]:
  Name: NVIDIA B200
  Total Memory: 179.06 GB
  Status: ✅ Blackwell architecture - Optimal

GPU[7]:
  Name: NVIDIA B200
  Total Memory: 179.06 GB
  Status: ✅ Blackwell architecture - Optimal

Total GPU Memory (All GPUs): 1432.49 GB

NVLINK STATUS
✅ NVLink detected & queryable



### Install SGLang and dependencies

Install the latest SGLang release (0.3 or newer) so you get the TensorRT-LLM MLA and FlashInfer kernels that power the Mistral Large 3 MoE stack on B200/H200 systems. The commands below use `uv` to make sure compatible wheels are resolved, but you can substitute plain `pip install` if your environment already has the right CUDA toolchain.

Method 1. Via `pip install`

This method is still a work in progress because the required update has not been released yet. Method 2 is recommended for now.

In [None]:
# %pip install --upgrade pip
# %pip install uv
# %uv pip install "sglang" --prerelease=allow --quiet
# %pip install transformers accelerate huggingface_hub --quiet

Method 2. From source (Recommended)

Clone the SGLang repository and install the Python package in editable mode so you pick up the CUDA/TensorRT plugins that ship with SGLang. The official repository is `https://github.com/sgl-project/sglang.git`; the cell below clones the `dcampora/nvfp4_support` branch of the `dcampora/sglang` fork instead.

In [None]:
# Clone the nvfp4_support branch
!git clone -b dcampora/nvfp4_support https://github.com/dcampora/sglang.git
# A `cd` inside a `!` command runs in a subshell and does not persist, so use the %cd magic
%cd sglang

# Install the Python packages
%pip install --upgrade pip
%pip install -e "python"
%pip install transformers accelerate huggingface_hub --quiet
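After installing, it can help to confirm which package versions actually ended up in the kernel's environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is illustrative, and the package list is an assumption based on the imports used in this notebook (including `openai`, which the client cells import later).

```python
# Report installed versions of the packages this notebook relies on.
# A missing package is reported as None instead of raising.
from importlib import metadata

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

report = installed_versions(
    ["sglang", "transformers", "accelerate", "huggingface_hub", "openai"]
)
for name, version in report.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If a package shows as not installed right after a `%pip install`, restarting the kernel is usually worth trying before reinstalling.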


## Launch SGLang server

We will launch an OpenAI-compatible server. It can be executed directly in this notebook or you can copy the `python3 -m sglang.launch_server` command with the parameters and execute it in a separate terminal window. Make sure to specify the parameters and adjust the values based on your setup.

Startup takes a long time because the checkpoints need to be loaded.

`MODEL_NAME` sets the name under which the model is served. Set the `ML3_MODEL` environment variable (or edit the default below) to one of the official checkpoints published by Mistral so the server downloads the correct weights:

- `mistralai/Mistral-Large-3-675B-Instruct-2512` (FP8 baseline for B200/H200)
- `mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4` (NVFP4 for H100/A100)
- `mistralai/Mistral-Large-3-675B-Instruct-2512-BF16` (full BF16 precision)
- `mistralai/Mistral-Large-3-675B-Instruct-2512-Eagle` (draft model for speculative decoding)

You can also point `ML3_MODEL` to a local path if you have already mirrored the model repository.

In [2]:
import os
from sglang.utils import launch_server_cmd, wait_for_server, terminate_process

model_path = os.environ.get("ML3_MODEL", "mistralai/Mistral-Large-3-675B-Instruct-2512")
port = int(os.environ.get("SGLANG_PORT", "30000"))
MODEL_NAME = os.environ.get("MODEL_NAME", "mistralai/Mistral-Large-3-675B-Instruct-2512")
os.environ["SGLANG_ENABLE_JIT_DEEPGEMM"] = "0"
server_cmd = f"""
python3 -m sglang.launch_server \
    --model {model_path} \
    --host 0.0.0.0 --port {port} \
    --tensor-parallel-size 8 \
    --disable-radix-cache \
    --stream-interval 20 \
    --mem-fraction-static 0.95 \
    --max-running-requests 1024 \
    --cuda-graph-max-bs 16 \
    --served-model-name {MODEL_NAME} \
    --log-level warning \
    --chat-template mistral
"""

server_process, detected_port = launch_server_cmd(server_cmd)
wait_for_server(f"http://localhost:{detected_port}")
print(f"SGLang server ready on port {detected_port} with served name '{MODEL_NAME}'")

[2025-12-09 20:10:54] INFO _client.py:1025: HTTP Request: GET https://huggingface.co/api/models/mistralai/Mistral-Large-3-675B-Instruct-2512/revision/main "HTTP/1.1 200 OK"
Fetching 7 files: 100%|██████████| 7/7 [00:00<00:00, 18384.55it/s]
[2025-12-09 20:10:54] INFO server_args.py:1047: Use trtllm_mla as attention backend on sm100 for DeepseekV3ForCausalLM
[2025-12-09 20:10:54] INFO server_args.py:1056: Enable FlashInfer AllReduce Fusion on sm100 for DeepseekV3ForCausalLM
[2025-12-09 20:10:54] INFO _client.py:1025: HTTP Request: HEAD https://huggingface.co/mistralai/Mistral-Large-3-675B-Instruct-2512/resolve/main/generation_config.json "HTTP/1.1 404 Not Found"
[2025-12-09 20:10:54] INFO model_config.py:907: Downcasting torch.float32 to torch.bfloat16.
Fetching 7 files: 100%|██████████| 7/7 [00:00<00:00, 16951.58it/s]
Fetching 7 files: 100%|██████████| 7/7 [00:00<00:00, 109963.03it/s]
Fetching 7 files: 100%|██████
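Inside the notebook, `wait_for_server` blocks until the server is up. If you start the server from a separate terminal instead, a small polling loop can play the same role. The sketch below is one possible approach, not part of SGLang: the helper names are illustrative, and it assumes the server answers `GET /v1/models` with HTTP 200 once it is ready, as OpenAI-compatible servers typically do.

```python
import time
import urllib.error
import urllib.request

def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet
        return False

def wait_until_ready(url: str, max_wait_s: float = 1800.0,
                     interval_s: float = 10.0, probe=http_ok) -> bool:
    """Poll `url` with `probe` until it succeeds or `max_wait_s` elapses."""
    deadline = time.monotonic() + max_wait_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

For example, `wait_until_ready("http://localhost:30000/v1/models")` would poll the default port used in the cell above. The generous default timeout reflects the long checkpoint loading time; adjust it to your storage and network speed.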

## Client setup

Once the server is running, connect to it with the OpenAI Python client. The endpoint exposes an OpenAI-compatible interface, so the standard client works without additional adapters or client-side modifications. Point the client to your local server URL and provide an API key; SGLang accepts any non-empty string unless you explicitly enforce authentication.

In [None]:
from openai import OpenAI

base_url = f"http://localhost:{detected_port}/v1"
api_key = "dummy"  # SGLang server doesn't require an API key by default

# Connect to SGLang server
client = OpenAI(base_url=base_url, api_key=api_key)

print(f"Connected to SGLang server at {base_url}")
# print(f"Using model: {MODEL_NAME}")

Connected to SGLang server at http://localhost:34916/v1
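Before sending chat requests, it can be useful to confirm that the server advertises the model name you intend to use, since requests must reference the served name. The helpers below are a minimal sketch with illustrative names; they rely only on the `models.list()` method of the OpenAI Python client and accept the client as a parameter.

```python
def served_model_ids(client):
    """Return the model IDs advertised by an OpenAI-compatible server."""
    return [model.id for model in client.models.list().data]

def check_served_name(client, expected_name):
    """Return `expected_name` if it is served, otherwise raise a clear error."""
    available = served_model_ids(client)
    if expected_name not in available:
        raise ValueError(
            f"Model '{expected_name}' is not served; available: {available}"
        )
    return expected_name
```

With the objects defined in this notebook, `check_served_name(client, MODEL_NAME)` should return the served name or fail early with the list of available IDs.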


## Testing some scenarios

According to its authors, Mistral Large 3 is perfect for:

* Long document understanding
* Daily-driver AI assistants
* Agentic and tool-use capabilities
* Enterprise knowledge work
* General coding assistant

Let's test some of its features!

### Instruction following

To guide the model toward a specific behavior or response style, you can supply a system prompt that defines rules, tone, formatting expectations, and constraints.

In [None]:
from huggingface_hub import hf_hub_download

def load_system_prompt(model_name: str = "mistralai/Mistral-Large-3-675B-Instruct-2512") -> str:
    """
    Download and load the system prompt from Hugging Face.
    The file is automatically cached after first download.
    """
    prompt_path = hf_hub_download(repo_id=model_name, filename="SYSTEM_PROMPT.txt")
    with open(prompt_path, "r") as file:
        return file.read()

TEMP = 0.15
MAX_TOK = 100000

# Load system prompt from Hugging Face
SYSTEM_PROMPT = load_system_prompt()

resp = client.chat.completions.create(
    model=MODEL_NAME,  # must match the name passed to --served-model-name
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Invent a fun board game and explain the rules in under 120 words."}
    ],
    temperature=TEMP,
    max_tokens=MAX_TOK,
)
print(resp.choices[0].message.content)

**Game Name: "Time Warp Trek"**

**Objective:** Be the first to travel through 3 different historical eras, collect unique artifacts, and return to the present with the most "Chrono Points"!

**Setup:** 2-4 players, a spiral board with 4 paths (one per era: Prehistoric, Medieval, Industrial, Future), each path has 10 spaces. Players start in the present (center).

**Gameplay:**
- Roll a 6-sided die to move forward in an era.
- Land on artifact spaces to collect them (e.g., dinosaur bone, knight's sword, steam engine, hologram).
- Land on "Time Rift" spaces to jump to another era (but lose 1 artifact).
- Complete an era by reaching its end, earning bonus points.
- Return to the present by rolling a 6 or using a "Time Portal" card.

**Winning:** After returning, calculate points: 1 per artifact, 5 per completed era, 2 for rare artifacts. Highest score wins! ⏳🚀
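The prompt above asks for an answer in under 120 words, so instruction following can be checked programmatically rather than by eye. The sketch below uses a simple word-counting heuristic (whitespace-separated tokens containing at least one word character), which is an assumption about what counts as a word; the helper names are illustrative.

```python
import re

def word_count(text: str) -> int:
    """Count whitespace-separated tokens that contain at least one word character."""
    return sum(1 for token in text.split() if re.search(r"\w", token))

def within_word_limit(text: str, limit: int) -> bool:
    """Return True if `text` has fewer than `limit` words."""
    return word_count(text) < limit
```

For the request above, `within_word_limit(resp.choices[0].message.content, 120)` reports whether the reply respected the limit. Markdown markers such as a lone `-` are not counted as words.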


### Vision reasoning

Vision reasoning refers to the model's ability to interpret visual inputs and apply logical inference on top of what it sees: not just recognizing objects, but understanding relationships, spatial context, cause-and-effect, and intent within an image. Instead of simply labelling elements, the model can describe scenes, infer actions, identify patterns, and answer questions that require comprehension rather than pattern-matching alone. This enables more advanced use cases such as analyzing diagrams, extracting information from charts, interpreting UI layouts, or evaluating photos for consistency and meaning. In short, vision reasoning bridges visual perception and conceptual understanding, allowing the model to think about images rather than merely see them.

In [None]:
TEMP = 0.15
MAX_TOK = 100000

# Load system prompt from Hugging Face
SYSTEM_PROMPT = load_system_prompt()

# Feel free to replace with any image of your choice!
IMAGE_URL = "https://blogs.nvidia.com/wp-content/uploads/2020/11/marbles-at-night.jpg"

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Describe what can happen next in this scene. Be creative and think of an unusual scenario",
            },
            {"type": "image_url", 
             "image_url": {"url": IMAGE_URL}
            },
        ],
    },
]

resp = client.chat.completions.create(
    model=MODEL_NAME,  # must match the name passed to --served-model-name
    messages=messages,
    temperature=TEMP,
    max_tokens=MAX_TOK,
)

print(resp.choices[0].message.content)

This scene appears to be a cluttered, old-fashioned atelier or workshop - perhaps a sculptor's studio, given the presence of busts, tools, and raw materials. Here's an unusual and creative scenario for what could happen next:

---

### **The Unfinished Busts Awaken**
As the last flicker of the warm, dim lighting settles into the room, the studio's eerie silence is broken by a faint *crack*. One of the half-finished clay busts on the right side of the image - the one with a blank face - suddenly shifts. Its hollow eyes fill with a strange, glowing mist, and the clay begins to ripple as if something beneath it is struggling to surface.

Then, another bust joins in. And another. The unfinished sculptures, long dormant, start to *breathe*. Their chests expand slightly, their fingers twitch where they've been sculpted, and the tools scattered across the workbenches vibrate in response. The studio's owner, who had stepped out for a late-night coffee, returns to find their workspace
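The example above passes a public image URL. For images stored on disk, a common approach with OpenAI-style chat APIs is to embed the file as a base64 data URL in the same `image_url` field. The sketch below shows that encoding step; the helper names are illustrative, and it assumes your SGLang build accepts data URLs for images.

```python
import base64
import mimetypes
from pathlib import Path

def image_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {path!r}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"

def image_part(path: str) -> dict:
    """Build an `image_url` content part from a local file."""
    return {"type": "image_url", "image_url": {"url": image_to_data_url(path)}}
```

The dictionary returned by `image_part("photo.jpg")` (a hypothetical file name) can replace the `{"type": "image_url", ...}` entry in the `messages` list above. Base64 encoding grows the payload by roughly a third, so very large images are better resized first.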

### Function calling

Function calling allows the model to generate structured outputs that trigger real functions in your application, turning natural-language queries into executable actions. Instead of returning plain text, the model produces arguments in a predefined schema, enabling you to safely map intent to code paths, such as querying a database, retrieving documents, sending notifications, or performing calculations. This turns the model into a reasoning layer that interprets user requests, decides when a tool should be invoked, and returns well-formed call signatures that programs can act on.

In [8]:
TEMP = 0.15
MAX_TOK = 100000
IMAGE_URL = "https://cdna.artstation.com/p/assets/images/images/050/827/584/large/rafael-chies-14.jpg?1655798602"

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_population",
            "description": "Get the up-to-date population of a given country.",
            "parameters": {
                "type": "object",
                "properties": {
                    "country": {
                        "type": "string",
                        "description": "The country to find the population of.",
                    },
                    "unit": {
                        "type": "string",
                        "description": "The unit for the population.",
                        "enum": ["millions", "thousands"],
                    },
                },
                "required": ["country", "unit"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "rewrite",
            "description": "Rewrite a given text for improved clarity",
            "parameters": {
                "type": "object",
                "properties": {
                    "text": {
                        "type": "string",
                        "description": "The input text to rewrite",
                    }
                },
            },
        },
    },
]

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Can you tell me which country is shown in this image?",
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": IMAGE_URL,
                },
            },
        ],
    },
]

resp = client.chat.completions.create(
    model=MODEL_NAME,
    messages=messages,
    temperature=TEMP,
    max_tokens=MAX_TOK,
    tools=tools,
    tool_choice="auto",
)

assistant_message = resp.choices[0].message.content
print(assistant_message, "\n")

messages.extend([
    {"role": "assistant", "content": assistant_message},
    {"role": "user", "content": "What is the population of that country in millions?"},
])

resp = client.chat.completions.create(
    model=MODEL_NAME,
    messages=messages,
    temperature=TEMP,
    max_tokens=MAX_TOK,
    tools=tools,
    tool_choice="auto",
)

print(resp.choices[0].message.tool_calls)

[2025-11-28 17:42:44] INFO _client.py:1025: HTTP Request: POST http://localhost:39763/v1/chat/completions "HTTP/1.1 200 OK"


The image appears to depict the interior of a modern, industrial-style kitchen, likely in a restaurant. The signage on the wall in the background includes Japanese characters (「いらっしゃいませ」, which translates to "Welcome" in English). This suggests that the country shown in the image is Japan. Additionally, the overall design and layout of the kitchen are consistent with those commonly found in Japanese eateries.



[2025-11-28 17:42:45] INFO _client.py:1025: HTTP Request: POST http://localhost:39763/v1/chat/completions "HTTP/1.1 200 OK"


None
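The cell above only prints `tool_calls`; acting on them is left to the application. The sketch below shows one way to dispatch tool calls to local Python functions and build the `tool` role messages for a follow-up request. Everything in it is illustrative: `run_tool_calls` and `TOOL_REGISTRY` are hypothetical names, and `get_current_population` is a stub that returns a placeholder string rather than real data.

```python
import json

def get_current_population(country: str, unit: str) -> str:
    """Hypothetical stub: returns a placeholder, not real population data."""
    return f"<population of {country} in {unit}: look up in your data source>"

# Maps tool names from the `tools` schema to local callables
TOOL_REGISTRY = {"get_current_population": get_current_population}

def run_tool_calls(message, registry=TOOL_REGISTRY):
    """Execute each tool call on an assistant message; return `tool` role messages."""
    results = []
    # tool_calls is None when the model answered in plain text
    for call in message.tool_calls or []:
        func = registry.get(call.function.name)
        if func is None:
            content = f"Unknown tool: {call.function.name}"
        else:
            # Arguments arrive as a JSON-encoded string
            arguments = json.loads(call.function.arguments)
            content = str(func(**arguments))
        results.append(
            {"role": "tool", "tool_call_id": call.id, "content": content}
        )
    return results
```

Calling `run_tool_calls(resp.choices[0].message)` returns an empty list when, as in the recorded run above, `tool_calls` is `None`. The notebook does not establish why no tool call was returned; one thing to check, as an unverified possibility, is whether the server was launched with SGLang's `--tool-call-parser` option. When tool calls are present, append the assistant message and the returned `tool` messages to `messages` and send another request so the model can produce its final answer.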


## Cleaning up

If you launched the server from this notebook, run the following cell to terminate the process.

In [2]:
if 'server_process' in globals() and server_process.poll() is None:
    server_process.kill()
    print(f"Killed instruct server PID {server_process.pid}")
else:
    print("No running server process found to terminate.")

No running server process found to terminate.
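The cell above stops the server with `kill()`, which gives it no chance to shut down cleanly. A gentler variant, sketched below with an illustrative helper name, first asks the process to terminate and escalates only if it does not exit within a grace period. It assumes `server_process` is a `subprocess.Popen` object, which is consistent with the `poll()` and `kill()` calls used above. The notebook also imports `terminate_process` from `sglang.utils`, which is another option.

```python
import subprocess

def stop_process(process: subprocess.Popen, grace_period_s: float = 30.0) -> int:
    """Ask `process` to exit, escalating to kill if it does not stop in time."""
    if process.poll() is None:          # still running
        process.terminate()             # polite request (SIGTERM on POSIX)
        try:
            process.wait(timeout=grace_period_s)
        except subprocess.TimeoutExpired:
            process.kill()              # forceful stop (SIGKILL on POSIX)
            process.wait()
    return process.returncode
```

Calling `stop_process(server_process)` returns the exit code, and calling it again on an already stopped process is harmless. Note that this only signals the launcher process; if worker processes remain afterwards, check `nvidia-smi` and stop them separately.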


## Conclusion and resources

Congratulations! You successfully deployed the **Mistral Large 3 675B Instruct** model using SGLang.

In this notebook, you have learned how to:

- Set up your environment and install SGLang.
- Launch and manage an OpenAI-compatible server to run the model.
- Perform instruction following, vision reasoning, and function calling tasks using the OpenAI client.

You can adapt tensor parallelism, ports, and sampling parameters to your hardware and application needs.

Refer to the following resources if you want to learn more:

### Documentation
- 📚 [Mistral Large 3 Model Card](https://huggingface.co/mistralai/Mistral-Large-3-675B-Instruct-2512)
- 🏗️ [NVIDIA SGLang Guide](https://docs.nvidia.com/deeplearning/frameworks/sglang-release-notes/overview.html)

### Code and kernels
- 💾 [Flashinfer kernel library](https://github.com/flashinfer-ai/flashinfer)
- ⚡ [FlashMLA Implementation](https://github.com/deepseek-ai/FlashMLA)
- 🧪 [Mistral Examples](https://github.com/mistralai)

### Community
- 📧 [NVIDIA Developer Forums](https://forums.developer.nvidia.com/)

### Acknowledgments

**Authors:** [Katja Sirazitdinova](https://github.com/katjasrz), [Jay Rodge](https://github.com/jayrodge), [Mitesh Patel](https://github.com/patelmiteshn), Developer Advocates @ NVIDIA

Special thanks to the Mistral and SGLang teams for their incredible work on these technologies.