# Running Qwen3-Next with SGLang on NVIDIA GPUs

This notebook shows how to serve models from the Qwen3-Next series with SGLang's high-performance, OpenAI-compatible server. It has two parts: Part 1 sets up and queries the Instruct variant, and Part 2 does the same for the Thinking variant.



## Part 1: Qwen3-Next-Instruct with SGLang

This section covers the `Qwen/Qwen3-Next-80B-A3B-Instruct` model, demonstrating basic chat, streaming, and batch inference.

- Model card: [Qwen/Qwen3-Next-80B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct)
- SGLang docs: [Qwen3 usage](https://docs.sglang.ai/basic_usage/qwen3.html)



## Prerequisites

**Hardware:** This notebook is configured by default to run on a machine with **4 GPUs** (`--tp 4`) and sufficient VRAM to hold the 80B parameter model. If your hardware is different, be sure to adjust the `--tp` (tensor parallelism) and other resource-related flags in the server launch command below.
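
As a rough way to judge whether the weights fit, the sketch below estimates weight memory per GPU. It assumes the nominal 80B parameter count from the model name and perfectly even sharding across `--tp` GPUs, and it ignores the KV cache, activations, and runtime overhead, so treat the result as a lower bound. Because this is a Mixture-of-Experts model, all experts must be resident in memory even though only about 3B parameters are active per token. The helper name `weight_memory_gib` is made up for this illustration.

```python
def weight_memory_gib(num_params: float, bytes_per_param: float, tp: int = 1) -> float:
    """Approximate weight memory per GPU in GiB, assuming even sharding across `tp` GPUs."""
    return num_params * bytes_per_param / tp / 2**30


# Assumption: nominal parameter count implied by the "80B" in the model name.
total_params = 80e9

for label, nbytes in [("BF16", 2), ("FP8", 1)]:
    per_gpu = weight_memory_gib(total_params, nbytes, tp=4)
    print(f"{label}: ~{per_gpu:.1f} GiB of weights per GPU at --tp 4")
```

Whatever is left on each GPU after the weights is what the server can use for the KV cache and runtime buffers.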



## Install Dependencies


In [None]:
# Run once per environment
%pip install --upgrade pip
%pip install -U sglang[all] transformers accelerate huggingface_hub

Collecting transformers
  Using cached transformers-4.56.2-py3-none-any.whl.metadata (40 kB)


## Launch SGLang server

We will launch an OpenAI-compatible server. Adjust `--tp` (tensor parallelism), `--mem-fraction-static` (the fraction of GPU memory reserved for weights and the KV cache pool), and `--context-length` to fit your hardware.

- For long context, enable YaRN by passing `--json-model-override-args`.
- For the 80B model, make sure total VRAM is sufficient and use tensor parallelism across multiple GPUs.
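
A minimal sketch of how the YaRN override could be assembled from Python. The `rope_scaling` keys and the example values (`factor=4.0`, `original_max_position_embeddings=262144`) follow the pattern I understand the Qwen model card to use; confirm them there before relying on them. The helper name `yarn_override_flag` is hypothetical.

```python
import json
import shlex


def yarn_override_flag(factor: float, original_max_position_embeddings: int) -> str:
    """Build a shell-safe `--json-model-override-args` fragment that enables YaRN."""
    override = {
        "rope_scaling": {
            "rope_type": "yarn",
            "factor": factor,
            "original_max_position_embeddings": original_max_position_embeddings,
        }
    }
    # shlex.quote wraps the JSON so the shell passes it as a single argument.
    return "--json-model-override-args " + shlex.quote(json.dumps(override))


# Assumed example values; check the model card for the recommended settings.
flag = yarn_override_flag(factor=4.0, original_max_position_embeddings=262144)
print(flag)
```

Building the JSON in a separate variable keeps its braces out of the f-string that holds the launch command. If you extend the context this way, `--context-length` needs to be raised to match.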



### Programmatic server launch (helper)

Use `launch_server_cmd` to start SGLang from Python and `wait_for_server` to block until it is ready. The command below also sets a served model name, which clients pass as `model` in their requests.


In [None]:
import os
from sglang.test.doc_patch import launch_server_cmd
from sglang.utils import wait_for_server, terminate_process

# Choose one model path. For this cookbook, default to the 80B instruct.
model_path = os.environ.get("QWEN_MODEL", "Qwen/Qwen3-Next-80B-A3B-Instruct")
# Set a custom name for the model endpoint, which the client will use.
served_name = os.environ.get("SERVED_NAME", "qwen3-next-instruct")
port = int(os.environ.get("SGLANG_PORT", "30000"))

server_cmd = f"""
python -m sglang.launch_server \
  --model-path "{model_path}" \
  --host 0.0.0.0 --port {port} \
  --tp 4 \
  --max-lora-rank 0 \
  --kv-cache-dtype fp8_e4m3 \
  --chunked-prefill-size 8192 \
  --context-length 262144 \
  --served-model-name {served_name} \
  --log-level warning
"""

server_process, detected_port = launch_server_cmd(server_cmd)
wait_for_server(f"http://localhost:{detected_port}")
print(f"SGLang server ready on port {detected_port} with served name '{served_name}'")


  import pynvml  # type: ignore[import]
All deep_gemm operations loaded successfully!
W0919 17:55:40.645000 59745 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W0919 17:55:40.645000 59745 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
`torch_dtype` is deprecated! Use `dtype` instead!

## Client setup

We'll use the OpenAI Python client to talk to SGLang. The client's `base_url` points at the server's `/v1` endpoint on the port returned by `launch_server_cmd`.


In [None]:
import os
from openai import OpenAI

base_url = f"http://localhost:{detected_port}/v1"
api_key = "EMPTY"  # SGLang server doesn't require an API key by default

client = OpenAI(base_url=base_url, api_key=api_key)
print(f"OpenAI client configured to use server at: {base_url}")


OpenAI client configured to use server at: http://localhost:35961/v1


## Basic chat completion

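The server applies the model's chat template to the `messages` list. This model supports instruct (non-thinking) mode only, so its responses contain no `<think>` blocks.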


**Note:** `extra_body` is a feature of the OpenAI Python client, not of SGLang: it adds extra fields to the JSON request body. SGLang reads sampling parameters such as `top_k` and `min_p` from those fields, since the standard `create` method has no arguments for them.
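
To make the effect concrete, the sketch below mimics the merge with plain dictionaries. It only illustrates the resulting request shape and is not the client's actual implementation; `merge_extra_body` is a made-up helper.

```python
def merge_extra_body(standard_params: dict, extra_body: dict) -> dict:
    """Illustration only: extra fields end up next to the standard ones in the JSON body."""
    payload = dict(standard_params)  # copy so the caller's dict is not mutated
    payload.update(extra_body)
    return payload


payload = merge_extra_body(
    {"model": "qwen3-next-instruct", "temperature": 0.7, "top_p": 0.8, "max_tokens": 512},
    {"top_k": 20, "min_p": 0.0},
)
print(sorted(payload))
```

The server therefore sees `top_k` and `min_p` as ordinary top-level request fields.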


In [None]:
resp = client.chat.completions.create(
    model=served_name,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain Qwen3-Next in one sentence."}
    ],
    temperature=0.7,  # per model card best practices
    top_p=0.8,
    max_tokens=512,
    extra_body={
        "top_k": 20,
        "min_p": 0.0,
    },
)
print(resp.choices[0].message.content)


Qwen3-Next is the latest iteration in Alibaba’s Qwen series of large language models, featuring enhanced reasoning, multilingual support, and improved efficiency for complex tasks and real-world applications.


## Streaming responses


In [None]:
from contextlib import closing

with closing(client.chat.completions.create(
    model=served_name,
    messages=[{"role": "user", "content": "Stream a short haiku about scaling laws."}],
    stream=True,
    temperature=0.7,
    top_p=0.8,
    extra_body={
        "top_k": 20,
    },
)) as stream:
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="")
print()


Data grows, model grows—  
loss drops with power law grace,  
scaling’s quiet law.
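
If you need the full text after streaming, for logging or post-processing, the deltas can be collected rather than printed. This is a minimal sketch: `collect_stream_text` is a hypothetical helper, and the stand-in chunks built with `SimpleNamespace` only imitate the attributes the loop above reads (`choices[0].delta.content`), so the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream_text(stream) -> str:
    """Join the text deltas of a chat-completion stream into one string."""
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue  # some chunks carry no choices
        delta = chunk.choices[0].delta
        if delta and delta.content:
            parts.append(delta.content)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in for a streamed chunk; only the fields used above are modeled."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


demo_stream = [fake_chunk("Data "), fake_chunk(None), SimpleNamespace(choices=[]), fake_chunk("grows")]
full_text = collect_stream_text(demo_stream)
print(full_text)
```

With a live server, pass the object returned by `client.chat.completions.create(..., stream=True)` in place of `demo_stream`.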


## Batch inference

Send multiple requests concurrently with `AsyncOpenAI` and `asyncio.gather`, using the same `chat.completions.create` API as above.


In [None]:
import asyncio
import time
from openai import AsyncOpenAI

# Use an async client for concurrent requests
async_client = AsyncOpenAI(base_url=base_url, api_key=api_key)

batch_prompts = [
    "Write a Python function to find the nth Fibonacci number with caching.",
    "Explain the concept of 'emergent abilities' in large language models in a single paragraph.",
    "Compose a short, rhyming poem about a cat discovering a quantum computer.",
]

async def send_request(prompt: str):
    """Sends a single chat completion request."""
    return await async_client.chat.completions.create(
        model=served_name,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.7,
        top_p=0.8,
        max_tokens=512,
        extra_body={"top_k": 20},
    )

async def main():
    """Runs all requests concurrently and prints results."""
    print("Sending batch of requests concurrently...")
    start_time = time.time()
    
    # Create and run all tasks in parallel
    tasks = [send_request(p) for p in batch_prompts]
    responses = await asyncio.gather(*tasks)
    
    end_time = time.time()
    print(f"Batch completed in {end_time - start_time:.2f} seconds.\n")

    # Print results
    for i, resp in enumerate(responses):
        print(f"--- Response for Prompt {i+1}: \"{batch_prompts[i]}\" ---")
        if resp.choices:
            print(resp.choices[0].message.content.strip())
        else:
            print("Empty response.")
        print("-" * (40 + len(batch_prompts[i])))
        print()

# Run the async main function
await main()

Sending batch of requests concurrently...
Batch completed in 5.69 seconds.

--- Response for Prompt 1: "Give me 3 bullet points on Qwen3-Next." ---
Actually, as of now, there is no official model named **Qwen3-Next** released by Alibaba’s Tongyi Lab. The latest publicly available model in the Qwen series is **Qwen3**, which was launched in May 2024.

If you’re referring to **Qwen3**, here are 3 key bullet points about it:

- **Enhanced Reasoning & Coding**: Qwen3 significantly improves logical reasoning, mathematical problem-solving, and code generation capabilities, outperforming its predecessors and competing models in benchmarks like GSM8K and HumanEval.

- **Multilingual Support**: It offers robust support for over 100 languages, making it highly effective for global applications and cross-lingual tasks, including low-resource languages.

- **Larger Context & Efficiency**: Qwen3 supports up to 32K tokens of context length and features optimized inference efficiency, enabling better
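
For larger batches, launching every request at once may exceed what the server or the client connection pool handles comfortably. One common variation, sketched below, caps the number of in-flight requests with an `asyncio.Semaphore`. The names `gather_bounded` and `fake_request` are hypothetical, and the fake request only sleeps so that the example runs without a server.

```python
import asyncio


async def gather_bounded(coro_fns, limit: int):
    """Run zero-argument coroutine functions with at most `limit` in flight.

    Results are returned in the same order as `coro_fns`.
    """
    semaphore = asyncio.Semaphore(limit)

    async def run_one(fn):
        async with semaphore:
            return await fn()

    return await asyncio.gather(*(run_one(fn) for fn in coro_fns))


async def demo():
    in_flight = 0
    peak = 0

    async def fake_request(prompt: str) -> str:
        # Stand-in for an API call; tracks how many calls overlap.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return prompt.upper()

    prompts = ["a", "b", "c", "d"]
    results = await gather_bounded([lambda p=p: fake_request(p) for p in prompts], limit=2)
    return results, peak
```

In a notebook, run it with `await demo()`; in a script, use `asyncio.run(demo())`. To apply it to the batch above, wrap each call as `lambda p=p: send_request(p)`.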

## Cleanup
If you launched the server from this notebook, run the following cell to terminate the process.

In [None]:
# Terminate the server process if it was started programmatically
server_stopped = False
if "server_process" in locals() and server_process.poll() is None:
    terminate_process(server_process)
    print(f"Terminated server process with PID: {server_process.pid}")
    server_stopped = True

if not server_stopped:
    print("No running server process found to terminate.")

Terminated server process with PID: 37934
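
If a cell fails between launch and cleanup, the server keeps holding GPU memory. One way to guard against that is a context manager that always stops the process on exit. The sketch below uses the standard library's `subprocess` as a stand-in so it runs anywhere; `managed_process` is a hypothetical helper, and in this notebook the `finally` branch would call `terminate_process(server_process)` instead.

```python
import subprocess
import sys
from contextlib import contextmanager


@contextmanager
def managed_process(cmd):
    """Start a child process and make sure it is stopped when the block exits."""
    proc = subprocess.Popen(cmd)
    try:
        yield proc
    finally:
        if proc.poll() is None:  # still running
            proc.terminate()
            try:
                proc.wait(timeout=10)
            except subprocess.TimeoutExpired:
                proc.kill()  # escalate if it ignores the polite request
                proc.wait()


# Demo with a harmless long-running child instead of a real server.
with managed_process([sys.executable, "-c", "import time; time.sleep(30)"]) as child:
    print("child running:", child.poll() is None)
print("child stopped:", child.poll() is not None)
```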


---

## Part 2: Qwen3-Next-Thinking with SGLang

With the first server stopped, we now launch a new one with the `Qwen/Qwen3-Next-80B-A3B-Thinking` model. This variant produces a reasoning trace before its final answer and is aimed at complex reasoning tasks.

- Model card: [Qwen/Qwen3-Next-80B-A3B-Thinking](https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking)

We will use a different port and served name to avoid conflicts.



In [None]:
# Launch server for the Thinking model
model_path_thinking = "Qwen/Qwen3-Next-80B-A3B-Thinking"
served_name_thinking = "qwen3-next-thinking"
port_thinking = port + 1  # Use a different port

server_cmd_thinking = f"""
python -m sglang.launch_server \
  --model-path "{model_path_thinking}" \
  --host 0.0.0.0 --port {port_thinking} \
  --tp 4 \
  --max-lora-rank 0 \
  --kv-cache-dtype fp8_e4m3 \
  --chunked-prefill-size 8192 \
  --context-length 262144 \
  --served-model-name {served_name_thinking} \
  --log-level warning
"""

server_process_thinking, detected_port_thinking = launch_server_cmd(server_cmd_thinking)
wait_for_server(f"http://localhost:{detected_port_thinking}")
print(f"SGLang server for Thinking model ready on port {detected_port_thinking} with served name '{served_name_thinking}'")



In [None]:
# Client setup for Thinking model
base_url_thinking = f"http://localhost:{detected_port_thinking}/v1"
client_thinking = OpenAI(base_url=base_url_thinking, api_key=api_key)

# Recommended parameters for the Thinking model, per its model card.
# https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking#best-practices
sampling_params_thinking = {
    "temperature": 0.6,
    "top_p": 0.95,
    "extra_body": {"top_k": 20, "min_p": 0.0},
}

print(f"OpenAI client configured for Thinking model at: {base_url_thinking}")



## Basic chat (Thinking)


In [None]:
resp = client_thinking.chat.completions.create(
    model=served_name_thinking,
    messages=[
        {"role": "system", "content": "You are a helpful assistant specialized in complex reasoning."},
        {"role": "user", "content": "Explain the Hybrid Attention mechanism in the Qwen3-Next model in a few sentences."}
    ],
    max_tokens=512,  # reasoning tokens count toward this limit; raise it if the answer is cut off
    **sampling_params_thinking,
)
print(resp.choices[0].message.content)
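
The printed content may mix the model's reasoning with its final answer. The sketch below separates the two, assuming the server was launched without a reasoning parser (as in the command above) and that the reasoning ends with a `</think>` tag. As I understand the model card, the opening `<think>` tag may be missing from the output because the chat template supplies it. `split_reasoning` is a hypothetical helper.

```python
def split_reasoning(text: str, close_tag: str = "</think>"):
    """Split model output into (reasoning, answer) at the first closing think tag.

    Returns (None, text) when the tag is absent, so the caller can decide how to
    treat it; with a thinking model this often means the output was truncated.
    """
    head, sep, tail = text.partition(close_tag)
    if not sep:
        return None, text.strip()
    # Drop an opening tag if the server happened to include one.
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


reasoning, answer = split_reasoning("Compare both layer types.\n</think>\n\nIt mixes two kinds of layers.")
print("reasoning:", reasoning)
print("answer:", answer)
```

With a live response, call `split_reasoning(resp.choices[0].message.content)`.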


## Batch inference (Thinking)


In [None]:
# Use an async client for concurrent requests
async_client_thinking = AsyncOpenAI(base_url=base_url_thinking, api_key=api_key)

batch_prompts_thinking = [
    "Write a Python function to find the nth Fibonacci number with caching, and explain the time complexity.",
    "Describe the difference between Gated DeltaNet and standard attention.",
    "Compose a short, rhyming poem about a Mixture-of-Experts model.",
]

async def send_request_thinking(prompt: str):
    """Sends a single chat completion request to the Thinking server."""
    return await async_client_thinking.chat.completions.create(
        model=served_name_thinking,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        max_tokens=512,  # reasoning tokens count toward this limit; raise it if answers are cut off
        **sampling_params_thinking,
    )

async def main_thinking():
    """Runs all requests concurrently and prints results."""
    print("Sending batch of requests to Thinking model concurrently...")
    start_time = time.time()
    
    tasks = [send_request_thinking(p) for p in batch_prompts_thinking]
    responses = await asyncio.gather(*tasks)
    
    end_time = time.time()
    print(f"Batch completed in {end_time - start_time:.2f} seconds.\n")

    for i, resp in enumerate(responses):
        print(f"--- Response for Prompt {i+1}: \"{batch_prompts_thinking[i]}\" ---")
        if resp.choices:
            print(resp.choices[0].message.content.strip())
        else:
            print("Empty response.")
        print("-" * (40 + len(batch_prompts_thinking[i])))
        print()

# Run the async main function
await main_thinking()


## Cleanup (Thinking)


In [2]:
# Terminate the server process if it was started programmatically
server_stopped = False
if "server_process_thinking" in locals() and server_process_thinking.poll() is None:
    terminate_process(server_process_thinking)
    print(f"Terminated Thinking server process with PID: {server_process_thinking.pid}")
    server_stopped = True

if not server_stopped:
    print("No running Thinking server process found to terminate.")


No running Thinking server process found to terminate.


## Resource notes and quantization options

- Qwen3-Next-80B-A3B is large; multi-GPU tensor parallel (`--tp`) is recommended.
- Consider an FP8 KV cache (`--kv-cache-dtype fp8_e4m3`) and chunked prefill (`--chunked-prefill-size`) to reduce memory use.
- Quantized variants (e.g., AWQ/GPTQ/INT4/INT8) may be available as community uploads; check the model hub.
- For CPU-only or memory-constrained hosting, consider smaller Qwen3-family models or a quantized 80B variant with a reduced context length.
- Ensure network and disk bandwidth are sufficient for first-time model downloads.
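
To see why the KV cache dtype matters at long context, the sketch below applies the usual per-token KV cache formula. The shape values are hypothetical placeholders, not read from the model's `config.json`; substitute the real numbers before drawing conclusions. Only layers that use standard attention keep a per-token KV cache, so `attention_layers` should count just those.

```python
def kv_cache_gib(tokens: int, attention_layers: int, kv_heads: int,
                 head_dim: int, bytes_per_elem: int) -> float:
    """Approximate KV cache size in GiB for one sequence of `tokens` tokens."""
    # The factor 2 accounts for storing both keys and values.
    bytes_per_token = 2 * attention_layers * kv_heads * head_dim * bytes_per_elem
    return bytes_per_token * tokens / 2**30


# Hypothetical placeholder shape; check config.json for the real values.
shape = dict(attention_layers=12, kv_heads=2, head_dim=256)

bf16_gib = kv_cache_gib(262144, bytes_per_elem=2, **shape)
fp8_gib = kv_cache_gib(262144, bytes_per_elem=1, **shape)
print(f"BF16 KV cache: {bf16_gib:.1f} GiB, FP8 KV cache: {fp8_gib:.1f} GiB")
```

Whatever the real shape is, a one-byte cache dtype halves the footprint relative to BF16, and the total scales linearly with `--context-length`.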

