# Running Qwen3-Next with SGLang on NVIDIA GPUs

This notebook provides a comprehensive guide on how to run models from the Qwen3-Next series using SGLang's high-performance, OpenAI-compatible server. It is divided into two parts, each demonstrating how to set up and interact with a different model variant.


#### Launch on NVIDIA Brev
You can simplify the environment setup by using [NVIDIA Brev](https://developer.nvidia.com/brev). Click the button below to launch this project on a Brev instance with the necessary dependencies pre-configured.

Once deployed, click on the "Open Notebook" button to get started with this guide.

[![Launch on Brev](https://brev-assets.s3.us-west-1.amazonaws.com/nv-lb-dark.svg)](https://brev.nvidia.com/launchable/deploy?launchableID=env-32vt7HcQjCUpafGyquLZwJdIm8F)


## Part 1: Qwen3-Next-Instruct with SGLang

This section covers the `Qwen/Qwen3-Next-80B-A3B-Instruct` model, demonstrating basic chat, streaming, and batch inference.

- Model card: [Qwen/Qwen3-Next-80B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct)
- SGLang docs: [Qwen3 usage](https://docs.sglang.ai/basic_usage/qwen3.html)



## Table of Contents
- [Part 1: Qwen3-Next-Instruct with SGLang](#Part-1:-Qwen3-Next-Instruct-with-SGLang)
  - [Install Dependencies](#Install-Dependencies)
  - [Launch SGLang server](#Launch-SGLang-server)
  - [Client setup](#Client-setup)
  - [Basic chat completion](#Basic-chat-completion)
  - [Streaming responses](#Streaming-responses)
  - [Batch inference](#Batch-inference)
  - [Cleanup](#Cleanup)
- [Part 2: Qwen3-Next-Thinking with SGLang](#Part-2:-Qwen3-Next-Thinking-with-SGLang)
  - [Launch Thinking Server](#Launch-Thinking-Server)
  - [Client Setup (Thinking)](#Client-Setup-(Thinking))
  - [Basic chat (Thinking)](#Basic-chat-(Thinking))
  - [Batch inference (Thinking)](#Batch-inference-(Thinking))
  - [Cleanup (Thinking)](#Cleanup-(Thinking))
- [Resource Notes](#Resource-Notes)


## Prerequisites

**Hardware:** This notebook is configured by default to run on a machine with **4 GPUs** (`--tp 4`) and sufficient VRAM to hold the 80B parameter model. If your hardware is different, be sure to adjust the `--tp` (tensor parallelism) and other resource-related flags in the server launch command below.



## Install Dependencies


In [1]:
%pip install -U pip
%pip install -U "sglang[all]" transformers accelerate huggingface_hub --quiet

Note: you may need to restart the kernel to use updated packages.


## Launch SGLang server

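We will launch an OpenAI-compatible server. Adjust `--tp` (tensor parallelism), `--kv-cache-dtype`, `--chunked-prefill-size`, and `--context-length` to fit your hardware. Later cells connect to the port returned by `launch_server_cmd` (`detected_port`), which may differ from the requested `port`.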

In [1]:
import os
from sglang.test.doc_patch import launch_server_cmd
from sglang.utils import wait_for_server, terminate_process

# Choose one model path. For this cookbook, default to the 80B instruct.
model_path = os.environ.get("QWEN_MODEL", "Qwen/Qwen3-Next-80B-A3B-Instruct")
# Set a custom name for the model endpoint, which the client will use.
served_name = os.environ.get("SERVED_NAME", "qwen3-next-instruct")
port = int(os.environ.get("SGLANG_PORT", "30000"))

server_cmd = f"""
python -m sglang.launch_server \
  --model-path {model_path} \
  --host 0.0.0.0 --port {port} \
  --tp 4 \
  --max-lora-rank 0 \
  --kv-cache-dtype fp8_e4m3 \
  --chunked-prefill-size 8192 \
  --context-length 262144 \
  --served-model-name {served_name} \
  --log-level warning
"""

server_process, detected_port = launch_server_cmd(server_cmd)
wait_for_server(f"http://localhost:{detected_port}")
print(f"SGLang server ready on port {detected_port} with served name '{served_name}'")


  import pynvml  # type: ignore[import]


All deep_gemm operations loaded successfully!
W0919 23:44:52.736000 90116 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W0919 23:44:52.736000 90116 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
`torch_dtype` is deprecated! Use `dtype` instead!
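
Because the launch flags are the main thing to adapt per machine, it can help to assemble the command from named parameters. The helper below is a hypothetical convenience, not part of SGLang; it only emits flags already used in the launch cell above, and the `tp=2` and `context_length=32768` values are illustrative assumptions rather than recommendations.

```python
import shlex


def build_launch_cmd(model_path, served_name, port, tp=4,
                     context_length=262144, kv_cache_dtype="fp8_e4m3",
                     chunked_prefill_size=8192):
    """Assemble the sglang.launch_server command line as one string.

    Hypothetical helper: it mirrors the flags from the launch cell and
    does not validate them against the installed SGLang version.
    """
    args = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--host", "0.0.0.0", "--port", str(port),
        "--tp", str(tp),
        "--max-lora-rank", "0",
        "--kv-cache-dtype", kv_cache_dtype,
        "--chunked-prefill-size", str(chunked_prefill_size),
        "--context-length", str(context_length),
        "--served-model-name", served_name,
        "--log-level", "warning",
    ]
    # shlex.join quotes any argument that needs it for a POSIX shell.
    return shlex.join(args)


# Illustrative values for a smaller machine (assumptions, not tuned settings).
cmd = build_launch_cmd(
    "Qwen/Qwen3-Next-80B-A3B-Instruct",
    "qwen3-next-instruct",
    30000,
    tp=2,
    context_length=32768,
)
print(cmd)
```

The resulting string can be passed to `launch_server_cmd` in place of the hand-written f-string.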

## Client setup

We'll use the OpenAI Python client to talk to SGLang's OpenAI-compatible endpoint. The client's `base_url` must point at your server address; here it is built from the port detected at launch.


In [6]:
import os
from openai import OpenAI

base_url = f"http://localhost:{detected_port}/v1"
api_key = "EMPTY"  # SGLang server doesn't require an API key by default

client = OpenAI(base_url=base_url, api_key=api_key)
print(f"OpenAI client configured to use server at: {base_url}")


OpenAI client configured to use server at: http://localhost:31745/v1
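
Before sending chat requests, it can be useful to confirm which model name the server exposes. The sketch below assumes the server answers `GET /v1/models` with the usual OpenAI-style list payload (`{"data": [{"id": ...}]}`); the function names are hypothetical, and the sample payload is made up to show the parsing step without a running server.

```python
import json
from urllib.request import urlopen


def models_url(base_url):
    """Build the model-listing URL from an OpenAI-style base URL."""
    return base_url.rstrip("/") + "/models"


def parse_model_ids(payload):
    """Extract model ids from an OpenAI-style model list payload."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(base_url, timeout=5.0):
    """Query a running server; raises URLError if it is unreachable."""
    with urlopen(models_url(base_url), timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))


# Made-up payload in the assumed response shape.
sample = {"object": "list", "data": [{"id": "qwen3-next-instruct", "object": "model"}]}
print(parse_model_ids(sample))
```

With the server from this notebook running, `list_served_models(base_url)` should include the value of `served_name` if the assumption about the endpoint holds.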


## Basic chat completion

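Send requests through the chat completions endpoint so that the server applies the model's chat template to the `messages` list. This variant supports instruct (non-thinking) mode only.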


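**Note:** The `extra_body` parameter of the OpenAI Python client forwards additional fields in the request body. Here it carries sampling parameters that SGLang accepts (such as `top_k` and `min_p`) but that the standard OpenAI `create` method does not expose as keyword arguments.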


In [7]:
resp = client.chat.completions.create(
    model=served_name,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain Qwen3-Next in one sentence."}
    ],
    temperature=0.7,  # per model card best practices
    top_p=0.8,
    max_tokens=512,
    extra_body={
        "top_k": 20,
        "min_p": 0.0,
    },
)
print(resp.choices[0].message.content)


Qwen3-Next is the latest iteration in Alibaba’s Qwen series, designed with enhanced reasoning, multilingual support, and improved efficiency for advanced AI applications.
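
To make the role of `extra_body` concrete, the sketch below builds the JSON body that such a request would carry, with the extra fields merged at the top level next to the standard ones. It is a simplified illustration of the merge, not the OpenAI client's actual implementation, and `build_chat_payload` is a hypothetical name.

```python
import json


def build_chat_payload(model, messages, extra_body=None, **standard_params):
    """Merge standard and extra sampling fields into one request body."""
    payload = {"model": model, "messages": messages, **standard_params}
    # Extra fields end up as top-level keys, not nested under "extra_body".
    payload.update(extra_body or {})
    return payload


payload = build_chat_payload(
    "qwen3-next-instruct",
    [{"role": "user", "content": "Explain Qwen3-Next in one sentence."}],
    extra_body={"top_k": 20, "min_p": 0.0},
    temperature=0.7,
    top_p=0.8,
    max_tokens=512,
)
print(json.dumps(payload, indent=2))
```

The server therefore sees `top_k` and `min_p` as ordinary request fields, which is why a server that does not recognize them may ignore or reject them.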


## Streaming responses


In [8]:
from contextlib import closing

with closing(client.chat.completions.create(
    model=served_name,
    messages=[{"role": "user", "content": "Stream a short haiku about scaling laws."}],
    stream=True,
    temperature=0.7,
    top_p=0.8,
    extra_body={
        "top_k": 20,
    },
)) as stream:
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="")
print()


Data grows, model grows—  
loss falls with power law grace,  
scale brings emergent wit.
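
If you need the full text after streaming rather than only printing it, the fragments can be accumulated as they arrive. The sketch below uses a simulated list in place of the `chunk.choices[0].delta.content` values from the loop above; `None` and empty strings stand in for chunks that carry no text, which is why the streaming cell guards against them.

```python
def collect_stream(deltas):
    """Join streamed text fragments, skipping empty or missing ones."""
    parts = []
    for delta in deltas:
        if delta:  # None or "" means the chunk carried no text
            parts.append(delta)
    return "".join(parts)


# Simulated fragments (assumption: a real stream yields similar pieces).
simulated = [None, "Data grows, ", "model grows", "", None]
text = collect_stream(simulated)
print(text)
```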


## Batch inference

Send multiple requests concurrently by issuing `chat.completions.create` calls from an `AsyncOpenAI` client and awaiting them together with `asyncio.gather`.


In [9]:
import asyncio
import time
from openai import AsyncOpenAI

# Use an async client for concurrent requests
async_client = AsyncOpenAI(base_url=base_url, api_key=api_key)

batch_prompts = [
    "Write a Python function to find the nth Fibonacci number with caching.",
    "Explain the concept of 'emergent abilities' in large language models in a single paragraph.",
    "Compose a short, rhyming poem about a cat discovering a quantum computer.",
]

async def send_request(prompt: str):
    """Sends a single chat completion request."""
    return await async_client.chat.completions.create(
        model=served_name,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.7,
        top_p=0.8,
        max_tokens=512,
        extra_body={"top_k": 20},
    )

async def main():
    """Runs all requests concurrently and prints results."""
    print("Sending batch of requests concurrently...")
    start_time = time.time()
    
    # Create and run all tasks in parallel
    tasks = [send_request(p) for p in batch_prompts]
    responses = await asyncio.gather(*tasks)
    
    end_time = time.time()
    print(f"Batch completed in {end_time - start_time:.2f} seconds.\n")

    # Print results
    for i, resp in enumerate(responses):
        print(f"--- Response for Prompt {i+1}: \"{batch_prompts[i]}\" ---")
        if resp.choices:
            print(resp.choices[0].message.content.strip())
        else:
            print("Empty response.")
        print("-" * (40 + len(batch_prompts[i])))
        print()

# Run the async main function
await main()

Sending batch of requests concurrently...
Batch completed in 6.09 seconds.

--- Response for Prompt 1: "Write a Python function to find the nth Fibonacci number with caching." ---
Here's a Python function to find the nth Fibonacci number using caching (memoization):

```python
def fibonacci(n, cache=None):
    """
    Find the nth Fibonacci number using caching (memoization).
    
    Args:
        n (int): The position in the Fibonacci sequence (non-negative integer)
        cache (dict): Optional cache dictionary to store computed values
        
    Returns:
        int: The nth Fibonacci number
        
    Raises:
        ValueError: If n is negative
    """
    if n < 0:
        raise ValueError("n must be a non-negative integer")
    
    # Initialize cache if not provided
    if cache is None:
        cache = {}
    
    # Base cases
    if n == 0:
        return 0
    if n == 1:
        return 1
    
    # Check if result is already in cache
    if n in cache:
        return c
```
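
The batch cell above sends every prompt at once, which is fine for three requests. For larger batches you may want to cap the number of in-flight requests. The sketch below shows one way to do that with `asyncio.Semaphore`; `run_bounded` and `fake_request` are hypothetical names, and `fake_request` merely simulates latency so the pattern can be tried without a server.

```python
import asyncio


async def fake_request(prompt):
    """Stand-in for a real request coroutine; sleeps briefly and echoes."""
    await asyncio.sleep(0.01)
    return f"echo: {prompt}"


async def run_bounded(prompts, request_fn, limit=2):
    """Run request_fn over prompts with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(prompt):
        async with semaphore:
            return await request_fn(prompt)

    # gather returns results in the same order as the input prompts.
    return await asyncio.gather(*(guarded(p) for p in prompts))
```

In this notebook, `await run_bounded(batch_prompts, send_request, limit=2)` would apply the same cap to the real requests; in a plain script, wrap the call in `asyncio.run(...)` instead.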

## Cleanup
If you launched the server from this notebook, run the following cell to terminate the process.

In [None]:
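# Shut down the instruct server. terminate_process (imported in the launch cell)
# stops the whole process tree rather than only the parent process.
if 'server_process' in globals() and server_process.poll() is None:
    terminate_process(server_process)
    print(f"Killed instruct server PID {server_process.pid}")
else:
    print("No running instruct server process found to terminate.")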

Killed instruct server PID 90116




---

## Part 2: Qwen3-Next-Thinking with SGLang

With the first server shut down, we now launch a new one with the `Qwen3-Next-80B-A3B-Thinking` model, which is specialized for complex reasoning tasks.

- Model card: [Qwen/Qwen3-Next-80B-A3B-Thinking](https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking)

## Launch Thinking Server

We will use a different port and served name to avoid conflicts.

In [12]:
model_path_thinking = "Qwen/Qwen3-Next-80B-A3B-Thinking"
served_name_thinking = "qwen3-next-thinking"
port_thinking = port + 1  

server_cmd_thinking = f"""
python -m sglang.launch_server \
  --model-path {model_path_thinking} \
  --host 0.0.0.0 --port {port_thinking} \
  --tp 4 \
  --max-lora-rank 0 \
  --kv-cache-dtype fp8_e4m3 \
  --chunked-prefill-size 8192 \
  --context-length 262144 \
  --served-model-name {served_name_thinking} \
  --log-level warning
"""

server_process_thinking, detected_port_thinking = launch_server_cmd(server_cmd_thinking)
wait_for_server(f"http://localhost:{detected_port_thinking}")
print(f"SGLang server for Thinking model ready on port {detected_port_thinking} with served name '{served_name_thinking}'")


  import pynvml  # type: ignore[import]
All deep_gemm operations loaded successfully!
W0919 23:48:52.678000 97734 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W0919 23:48:52.678000 97734 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
`torch_dtype` is deprecated! Use `dtype` instead!

## Client Setup (Thinking)


In [13]:
# Client setup for Thinking model
base_url_thinking = f"http://localhost:{detected_port_thinking}/v1"
client_thinking = OpenAI(base_url=base_url_thinking, api_key=api_key)

# Recommended parameters for the Thinking model, per its model card.
# https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking#best-practices
sampling_params_thinking = {
    "temperature": 0.6,
    "top_p": 0.95,
    "extra_body": {"top_k": 20, "min_p": 0.0},
}

print(f"OpenAI client configured for Thinking model at: {base_url_thinking}")

OpenAI client configured for Thinking model at: http://localhost:32502/v1


## Basic chat (Thinking)


In [14]:
resp = client_thinking.chat.completions.create(
    model=served_name_thinking,
    messages=[
        {"role": "system", "content": "You are a helpful assistant specialized in complex reasoning."},
        {"role": "user", "content": "Explain the Hybrid Attention mechanism in the Qwen3-Next model in a few sentences."}
    ],
    max_tokens=512,
    **sampling_params_thinking,
)
print(resp.choices[0].message.content)


Okay, the user is asking about the Hybrid Attention mechanism in Qwen3-Next. Hmm, I need to recall what I know about this. Wait, Qwen3-Next isn't a real model yet. The current version is Qwen3, but there's no official "Qwen3-Next" announced. Maybe the user is confused or there's a misunderstanding.

I should check if there's any information about a model called Qwen3-Next. Let me think... No, as far as I know, Qwen3 is the latest version, and there's no "Next" version released. The user might be referring to a hypothetical model or maybe a typo. Alternatively, maybe they heard about some new feature in development but it's not public yet.

But since I don't have information on Qwen3-Next, I need to clarify that. The correct response would be to point out that there's no such model as Qwen3-Next. The latest is Qwen3. Maybe the user meant Qwen3? But the question specifically says "Qwen3-Next," which doesn't exist.

Wait, perhaps the user is referring to a different model? Or maybe it's a
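
The output above is the model's reasoning, cut off by `max_tokens=512` before any final answer appears. A client-side way to separate reasoning from the answer is sketched below. It assumes the raw completion marks the end of its reasoning with a `</think>` tag (check the model card for the exact convention); `split_reasoning` is a hypothetical helper, and the sample strings are made up.

```python
def split_reasoning(text, end_tag="</think>"):
    """Split raw output into (reasoning, answer).

    Assumption: reasoning ends at `end_tag`. If the tag is missing, the
    output was likely truncated mid-reasoning, so the answer is empty.
    """
    head, sep, tail = text.partition(end_tag)
    if not sep:
        return text.strip(), ""
    # Drop an opening tag if the server happens to include one.
    reasoning = head.replace("<think>", "").strip()
    return reasoning, tail.strip()


# Made-up samples: one complete response, one truncated like the output above.
print(split_reasoning("Let me think.</think>\n\nThe answer is 4."))
print(split_reasoning("Okay, the user is asking about"))
```

Checking `resp.choices[0].finish_reason == "length"` tells you whether a response hit the token limit; raising `max_tokens` gives the model room to finish reasoning and produce an answer.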

## Batch inference (Thinking)


In [15]:
# Use an async client for concurrent requests
async_client_thinking = AsyncOpenAI(base_url=base_url_thinking, api_key=api_key)

batch_prompts_thinking = [
    "Write a Python function to find the nth Fibonacci number with caching, and explain the time complexity.",
    "Describe the difference between Gated DeltaNet and standard attention.",
    "Compose a short, rhyming poem about a Mixture-of-Experts model.",
]

print("Sending batch of requests to Thinking model concurrently...")
start_time = time.time()

tasks = [
    async_client_thinking.chat.completions.create(
        model=served_name_thinking,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": p}
        ],
        max_tokens=512,
        **sampling_params_thinking,
    ) for p in batch_prompts_thinking
]
responses = await asyncio.gather(*tasks)

end_time = time.time()
print(f"Batch completed in {end_time - start_time:.2f} seconds.\n")

for i, resp in enumerate(responses):
    print(f"--- Response for Prompt {i+1}: \"{batch_prompts_thinking[i]}\" ---")
    if resp.choices:
        print(resp.choices[0].message.content.strip())
    else:
        print("Empty response.")
    print("-" * (40 + len(batch_prompts_thinking[i])))
    print()

Sending batch of requests to Thinking model concurrently...
Batch completed in 5.41 seconds.

--- Response for Prompt 1: "Write a Python function to find the nth Fibonacci number with caching, and explain the time complexity." ---
Okay, the user wants a Python function to find the nth Fibonacci number using caching, along with an explanation of the time complexity. Let me think about how to approach this.

First, I recall that the Fibonacci sequence is defined as F(0) = 0, F(1) = 1, and F(n) = F(n-1) + F(n-2) for n > 1. The naive recursive approach has exponential time complexity because it recalculates the same values multiple times. Caching, or memoization, can optimize this by storing previously computed results.

So, the plan is to create a function that uses a cache (like a dictionary) to store Fibonacci numbers as they're computed. When the function is called with a specific n, it checks if the result is already in the cache. If yes, it returns it; if not, it computes it recursiv

## Cleanup (Thinking)


In [None]:
# Shut down the thinking server. terminate_process (imported in the first launch cell)
# stops the whole process tree rather than only the parent process.
if 'server_process_thinking' in globals() and server_process_thinking.poll() is None:
    terminate_process(server_process_thinking)
    print(f"Killed thinking server PID {server_process_thinking.pid}")
else:
    print("No running thinking server process found to terminate.")

Killed thinking server PID 97734




## Resource Notes

- **Hardware**: Qwen3-Next-80B-A3B is a large model. Multi-GPU tensor parallelism (`--tp`) is highly recommended for acceptable performance.
- **Quantization**: For environments with limited resources, consider using quantized versions of the model (e.g., AWQ, GPTQ, INT4/INT8) if available. These can significantly reduce memory usage at the cost of some accuracy. SGLang supports various quantization formats.
- **Offloading**: For development or low-throughput scenarios, you can explore smaller models from the Qwen3 family or use model offloading to run the 80B parameter model on systems with less VRAM.
- **Network**: Ensure you have sufficient network and disk bandwidth for the initial model download, as the weights are very large.
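
To put rough numbers on the hardware and quantization notes above, the sketch below estimates weight memory per GPU. It is a back-of-the-envelope calculation under stated assumptions: exactly 80 billion parameters, a uniform bit-width for all weights, and even sharding across the tensor-parallel group. It ignores the KV cache, activations, and quantization metadata, so real usage will be higher.

```python
def weight_memory_gib(num_params, bits_per_param, tp=1):
    """Approximate weight memory per GPU in GiB (weights only)."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / tp / 2**30


# Assumed parameter count and bit-widths; labels are illustrative.
NUM_PARAMS = 80e9
for label, bits in [("BF16", 16), ("INT8", 8), ("INT4", 4)]:
    per_gpu = weight_memory_gib(NUM_PARAMS, bits, tp=4)
    print(f"{label}: {per_gpu:.1f} GiB of weights per GPU at tp=4")
```

Comparing the result with your GPUs' VRAM, while leaving headroom for the KV cache, gives a first indication of whether a given `--tp` value or quantized checkpoint is worth trying.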

## Conclusion and Next Steps
Congratulations! You have successfully deployed the `Qwen3-Next` models using SGLang.

In this notebook, you have learned how to:
- Set up your environment and install the SGLang library.
- Launch and manage an OpenAI-compatible SGLang server for the Instruct model.
- Perform basic chat, streaming, and batch inference using the OpenAI client.
- Launch a second SGLang server for the Thinking model and run inference.

You can adapt tensor parallelism, ports, and sampling parameters to your hardware and application needs.
