# Running Qwen3-Next models with vLLM

This notebook walks through downloading and running the `Qwen3-Next` models with vLLM on NVIDIA GPUs for high-performance inference. vLLM is an open-source library for Large Language Model (LLM) inference and serving. It uses paged KV-cache memory management (PagedAttention) and continuous batching to raise throughput and cut wasted GPU memory, which in turn lowers the infrastructure cost of deploying LLMs at scale.

`Qwen3-Next` is a new model architecture that introduces several improvements over its predecessor: a hybrid attention mechanism, a highly sparse Mixture-of-Experts (MoE) structure, training-stability-friendly optimizations, and a multi-token prediction mechanism for faster inference. The model has 80 billion parameters in total but activates only about 3 billion of them per token during inference. Refer to the [model card](https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct) for more details. The `Qwen3-Next` series has two variants:

- `Qwen3-Next-80B-A3B-Instruct`
- `Qwen3-Next-80B-A3B-Thinking`

#### Launch on NVIDIA Brev
You can simplify the environment setup by using [NVIDIA Brev](https://developer.nvidia.com/brev). Click the button below to launch this project on a Brev instance with the necessary dependencies pre-configured.

Once deployed, click on the "Open Notebook" button to get started with this guide.

[![Launch on Brev](https://brev-assets.s3.us-west-1.amazonaws.com/nv-lb-dark.svg)](https://brev.nvidia.com/launchable/deploy?launchableID=env-32vt7HcQjCUpafGyquLZwJdIm8F)

## Table of Contents
- [Prerequisites](#Prerequisites)
- [Installing vLLM](#Installing-vLLM)
- [Part 1: Qwen3-Next-Instruct with vLLM](#Part-1:-Qwen3-Next-Instruct-with-vLLM)
  - [Launch Instruct Model Server](#Launch-Instruct-Model-Server)
  - [Inference using vLLM Server](#Inference-using-vLLM-Server)
  - [Inference using vLLM Python Client](#Inference-using-vLLM-Python-Client)
- [Part 2: Qwen3-Next-Thinking with vLLM](#Part-2:-Qwen3-Next-Thinking-with-vLLM)
  - [Launch Thinking Model Server](#Launch-Thinking-Model-Server)
  - [Inference against Thinking Server](#Inference-against-Thinking-Server)
  - [Batch Inference against Thinking Server](#Batch-Inference-against-Thinking-Server)
- [Conclusion and Next Steps](#Conclusion-and-Next-Steps)
- [Resource Notes](#Resource-Notes)


## Prerequisites

### Hardware
To run the `Qwen3-Next-80B-A3B` models (both `Instruct` and `Thinking`), you will need 4x A100 or 4x H100 NVIDIA GPUs.
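As a rough sanity check on this requirement, the weight footprint can be estimated from the parameter count. The sketch below assumes bfloat16 weights (2 bytes per parameter) and an even split across four tensor-parallel GPUs. It ignores the KV cache, activations, and CUDA graph memory, so treat the result as a lower bound rather than a sizing guarantee.

```python
# Back-of-the-envelope weight memory estimate.
# Assumptions: 80e9 parameters, bfloat16 (2 bytes per parameter),
# weights split evenly across 4 tensor-parallel GPUs.
TOTAL_PARAMS = 80e9
BYTES_PER_PARAM = 2
NUM_GPUS = 4
GIB = 1024 ** 3

total_gib = TOTAL_PARAMS * BYTES_PER_PARAM / GIB
per_gpu_gib = total_gib / NUM_GPUS

print(f"Total weights: {total_gib:.1f} GiB")
print(f"Per GPU (TP={NUM_GPUS}): {per_gpu_gib:.1f} GiB")
```

The per-GPU figure comes to roughly 37 GiB, which is in line with the `Model loading took 37.2152 GiB` lines in the server logs later in this notebook.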

### Software
- CUDA Toolkit 12.1 or later
- Python 3.10 or later
- vLLM (latest nightly build)

## Installing vLLM

To run `Qwen3-Next` models you will need to install the nightly build of vLLM. 

### Verify Python Environment
This notebook requires a Python 3.10+ environment. The following cell prints your kernel's Python version and executable path to confirm you are in the correct environment.

In [4]:
import sys
print(sys.version)
print(sys.executable)

3.12.11 (main, Sep 18 2025, 19:47:19) [Clang 20.1.4 ]
/home/shadeform/.venv/bin/python3


In [None]:
# If you're running this on a Brev instance, you may need to bootstrap pip first.
# `%python` is not a line magic; run the kernel's own interpreter through the shell instead.
!{sys.executable} -m ensurepip --upgrade

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
vllm 0.10.2rc3.dev279+gddc904839 requires setuptools<80,>=77.0.3; python_version > "3.11", but you have setuptools 80.9.0 which is incompatible.
Note: you may need to restart the kernel to use updated packages.


In [None]:
# Keep setuptools inside the range the vLLM nightly wheel declares (>=77.0.3,<80).
%pip install -U pip wheel "setuptools>=77.0.3,<80" uv --quiet
%pip install vllm openai aiohttp --extra-index-url https://wheels.vllm.ai/nightly --quiet

Note: you may need to restart the kernel to use updated packages.
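After installation (and a kernel restart, if prompted), it can help to confirm which package versions were actually picked up. The following is a minimal sketch using only the standard library; the helper name `installed_version` is illustrative and not part of vLLM.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string for `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the packages this notebook depends on.
for name in ("vllm", "torch", "openai", "setuptools"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```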


In [1]:
# GPU environment check
import torch
import platform

print(f"Python: {platform.python_version()}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Num GPUs: {torch.cuda.device_count()}")

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU[{i}]: {props.name} | SM count: {props.multi_processor_count} | Mem: {props.total_memory / 1e9:.2f} GB")

Python: 3.12.11
PyTorch: 2.8.0+cu128
CUDA available: True
Num GPUs: 4
GPU[0]: NVIDIA H100 PCIe | SM count: 114 | Mem: 85.03 GB
GPU[1]: NVIDIA H100 PCIe | SM count: 114 | Mem: 85.03 GB
GPU[2]: NVIDIA H100 PCIe | SM count: 114 | Mem: 85.03 GB
GPU[3]: NVIDIA H100 PCIe | SM count: 114 | Mem: 85.03 GB


## Part 1: Qwen3-Next-Instruct with vLLM

This part of the notebook demonstrates two ways to run inference with the `Instruct` model:
1. Launching a vLLM server and making HTTP requests to it.
2. Using the vLLM Python client directly for in-process inference.

### Launch Instruct Model Server

In [None]:
import subprocess
import time

serve_cmd_instruct = [
    "vllm", "serve", "Qwen/Qwen3-Next-80B-A3B-Instruct",
    "--tensor-parallel-size", "4",
    "--served-model-name", "qwen3-next",
    "--host", "0.0.0.0", "--port", "8000"
]

instruct_process = subprocess.Popen(serve_cmd_instruct)
print(f"Started vLLM instruct server, PID={instruct_process.pid}")

# Wait for the server to be ready.
print("Waiting for server to initialize... (approx. 5 minutes)")
time.sleep(300)
print("vLLM instruct server should be ready.")

Started vLLM instruct server, PID=60752
Waiting for server to initialize... (approx. 5 minutes)
INFO 09-19 23:20:27 [__init__.py:216] Automatically detected platform cuda.
(APIServer pid=60752) INFO 09-19 23:20:30 [api_server.py:1896] vLLM API server version 0.10.2
(APIServer pid=60752) INFO 09-19 23:20:30 [utils.py:328] non-default args: {'model_tag': 'Qwen/Qwen3-Next-80B-A3B-Instruct', 'host': '0.0.0.0', 'model': 'Qwen/Qwen3-Next-80B-A3B-Instruct', 'served_model_name': ['qwen3-next'], 'tensor_parallel_size': 4}
(APIServer pid=60752) INFO 09-19 23:20:37 [__init__.py:742] Resolved architecture: Qwen3NextForCausalLM
(APIServer pid=60752) INFO 09-19 23:20:37 [__init__.py:1815] Using max model len 262144
(APIServer pid=60752) INFO 09-19 23:20:37 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=60752) INFO 09-19 23:20:37 [config.py:310] Hybrid or mamba-based model det

(APIServer pid=60752) `torch_dtype` is deprecated! Use `dtype` instead!

(APIServer pid=60752) INFO 09-19 23:20:37 [config.py:390] Setting attention block size to 272 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=60752) INFO 09-19 23:20:37 [config.py:411] Padding mamba page size by 1.49% to ensure that mamba page size and attention page size are exactly equal.
INFO 09-19 23:20:41 [__init__.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=61201) INFO 09-19 23:20:43 [core.py:654] Waiting for init message from front-end.
(EngineCore_DP0 pid=61201) INFO 09-19 23:20:43 [core.py:76] Initializing a V1 LLM engine (v0.10.2) with config: model='Qwen/Qwen3-Next-80B-A3B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen3-Next-80B-A3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=4, pipel



[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
INFO 09-19 23:20:53 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 09-19 23:20:53 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 09-19 23:20:53 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 09-19 23:20:53 [pynccl.py:70] vLLM is u

(Worker_TP1 pid=61351) `torch_dtype` is deprecated! Use `dtype` instead!
(Worker_TP0 pid=61350) `torch_dtype` is deprecated! Use `dtype` instead!
(Worker_TP2 pid=61352) `torch_dtype` is deprecated! Use `dtype` instead!
(Worker_TP3 pid=61353) `torch_dtype` is deprecated! Use `dtype` instead!

(Worker_TP1 pid=61351) INFO 09-19 23:20:56 [cuda.py:362] Using Flash Attention backend on V1 engine.
(Worker_TP0 pid=61350) INFO 09-19 23:20:56 [cuda.py:362] Using Flash Attention backend on V1 engine.
(Worker_TP2 pid=61352) INFO 09-19 23:20:56 [cuda.py:362] Using Flash Attention backend on V1 engine.
(Worker_TP3 pid=61353) INFO 09-19 23:20:56 [cuda.py:362] Using Flash Attention backend on V1 engine.
(Worker_TP1 pid=61351) INFO 09-19 23:20:56 [weight_utils.py:348] Using model weights format ['*.safetensors']
(Worker_TP0 pid=61350) INFO 09-19 23:20:56 [weight_utils.py:348] Using model weights format ['*.safetensors']
(Worker_TP2 pid=61352) INFO 09-19 23:20:56 [weight_utils.py:348] Using model weights format ['*.safetensors']
(Worker_TP3 pid=61353) INFO 09-19 23:20:56 [weight_utils.py:348] Using model weights format ['*.safetensors']


Loading safetensors checkpoint shards:   0% Completed | 0/41 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   2% Completed | 1/41 [00:00<00:16,  2.44it/s]
Loading safetensors checkpoint shards:   5% Completed | 2/41 [00:00<00:16,  2.33it/s]
Loading safetensors checkpoint shards:   7% Completed | 3/41 [00:01<00:17,  2.23it/s]
Loading safetensors checkpoint shards:  10% Completed | 4/41 [00:01<00:17,  2.16it/s]
Loading safetensors checkpoint shards:  12% Completed | 5/41 [00:02<00:16,  2.13it/s]
Loading safetensors checkpoint shards:  15% Completed | 6/41 [00:02<00:17,  2.04it/s]
Loading safetensors checkpoint shards:  17% Completed | 7/41 [00:03<00:16,  2.08it/s]
Loading safetensors checkpoint shards:  20% Completed | 8/41 [00:03<00:15,  2.10it/s]
Loading safetensors checkpoint shards:  22% Completed | 9/41 [00:04<00:15,  2.08it/s]
Loading safetensors checkpoint shards:  24% Completed | 10/41 [00:04<00:13,  2.23it/s]
Loading safetensors checkpoint shards:  29% Completed | 12/41

(Worker_TP0 pid=61350) INFO 09-19 23:21:15 [default_loader.py:268] Loading weights took 19.07 seconds
(Worker_TP0 pid=61350) INFO 09-19 23:21:16 [gpu_model_runner.py:2392] Model loading took 37.2152 GiB and 19.682496 seconds
(Worker_TP1 pid=61351) INFO 09-19 23:21:16 [default_loader.py:268] Loading weights took 19.91 seconds
(Worker_TP1 pid=61351) INFO 09-19 23:21:16 [gpu_model_runner.py:2392] Model loading took 37.2152 GiB and 20.554601 seconds
(Worker_TP3 pid=61353) INFO 09-19 23:21:23 [default_loader.py:268] Loading weights took 26.76 seconds
(Worker_TP2 pid=61352) INFO 09-19 23:21:24 [default_loader.py:268] Loading weights took 27.24 seconds
(Worker_TP3 pid=61353) INFO 09-19 23:21:24 [gpu_model_runner.py:2392] Model loading took 37.2152 GiB and 27.695779 seconds
(Worker_TP2 pid=61352) INFO 09-19 23:21:24 [gpu_model_runner.py:2392] Model loading took 37.2152 GiB and 28.061191 seco

Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 67/67 [00:11<00:00,  5.66it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 67/67 [00:49<00:00,  1.37it/s]

(Worker_TP1 pid=61351) INFO 09-19 23:23:26 [gpu_model_runner.py:3118] Graph capturing finished in 62 secs, took 3.05 GiB
(Worker_TP1 pid=61351) INFO 09-19 23:23:26 [gpu_worker.py:391] Free memory on device (78.66/79.19 GiB) on startup. Desired GPU memory utilization is (0.9, 71.27 GiB). Actual usage is 37.22 GiB for weight, 5.58 GiB for peak activation, 0.59 GiB for non-torch memory, and 3.05 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with `--kv-cache-memory=26511116185` to fit into requested memory, or `--kv-cache-memory=34448928256` to fully utilize gpu memory. Current kv cache memory in use is 29942056857 bytes.
(Worker_TP3 pid=61353) INFO 09-19 23:23:26 [gpu_model_runner.py:3118] Graph capturing finished in 62 secs, took 3.05 GiB
(Worker_TP3 pid=61353) INFO 09-19 23:23:26 [gpu_worker.py:391] Free memory on device (78.66/79.19 GiB) on startup. Desired GPU memory utilization is (0.9, 71.27 GiB). Actual usage is 

(APIServer pid=60752) INFO:     Started server process [60752]
(APIServer pid=60752) INFO:     Waiting for application startup.
(APIServer pid=60752) INFO:     Application startup complete.

(Worker_TP0 pid=61350) INFO 09-19 23:24:35 [multiproc_executor.py:546] Parent process exited, terminating worker
(Worker_TP1 pid=61351) INFO 09-19 23:24:35 [multiproc_executor.py:546] Parent process exited, terminating worker
(Worker_TP2 pid=61352) INFO 09-19 23:24:35 [multiproc_executor.py:546] Parent process exited, terminating worker
(Worker_TP3 pid=61353) INFO 09-19 23:24:35 [multiproc_executor.py:546] Parent process exited, terminating worker
(APIServer pid=60752) INFO 09-19 23:24:35 [launcher.py:101] Shutting down FastAPI HTTP server.

KeyboardInterrupt: 

(APIServer pid=60752) INFO:     Shutting down
(APIServer pid=60752) INFO:     Waiting for application shutdown.
(APIServer pid=60752) INFO:     Application shutdown complete.


### Inference using vLLM Server

In [2]:
import requests

user_prompt = "What is the capital of France and why do people travel there?"

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}
data = {
    "model": "qwen3-next",
    "messages": [
        {"role": "user", "content": user_prompt}
    ]
}
response = requests.post(url, headers=headers, json=data, timeout=120)
response.raise_for_status()  # Fail early on HTTP errors instead of a KeyError below
output = response.json()
result = output['choices'][0]['message']['content']
print(result)

The capital of France is **Paris**.

People travel to Paris for a wide variety of reasons, thanks to its rich cultural, historical, and aesthetic appeal. Here are some of the main reasons:

1. **Iconic Landmarks**: Paris is home to world-famous attractions such as the Eiffel Tower, Notre-Dame Cathedral, the Louvre Museum (which houses the Mona Lisa and Venus de Milo), the Arc de Triomphe, and Montmartre with the Sacré-Cœur Basilica.

2. **Art and Culture**: Paris has long been a global center for art, fashion, and literature. It boasts over 100 museums, including the Musée d’Orsay and Centre Pompidou, and hosts major art exhibitions and fashion weeks.

3. **Cuisine**: French cuisine is celebrated worldwide, and Paris offers everything from Michelin-starred restaurants to cozy cafés and bustling markets. Visitors come to enjoy croissants, baguettes, cheese, wine, and pastries like macarons and éclairs.

4. **Romance**: Paris is often called “The City of Love,” making it a top destinatio
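The request above relies on the server's default sampling settings, which is why the reply runs long. The sketch below shows one way to set `max_tokens` and `temperature` (standard fields of the OpenAI chat completions format) and to surface server-side errors more readably. It uses only the standard library; the helper names and the default values are illustrative, not part of vLLM.

```python
import json
import urllib.request


def build_chat_payload(prompt, model="qwen3-next", max_tokens=256, temperature=0.7):
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def extract_text(response_json):
    """Return the assistant text, or raise a readable error if none is present."""
    if "choices" not in response_json:
        raise RuntimeError(f"Unexpected response: {response_json}")
    return response_json["choices"][0]["message"]["content"]


def chat(prompt, base_url="http://localhost:8000/v1", **kwargs):
    """POST a single-turn chat request and return the reply text."""
    body = json.dumps(build_chat_payload(prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return extract_text(json.load(resp))


# Example (requires the instruct server to be running on port 8000):
# print(chat("Name three landmarks in Paris.", max_tokens=100))
```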

### Close Instruct Server

In [None]:
# Shutdown of the instruct server
if 'instruct_process' in globals() and instruct_process.poll() is None:
    instruct_process.kill()
    print(f"Killed instruct server PID {instruct_process.pid}")
else:
    print("No running instruct server process found to terminate.")

(APIServer pid=38282) INFO:     Shutting down
(APIServer pid=38282) INFO:     Waiting for application shutdown.
(APIServer pid=38282) INFO:     Application shutdown complete.


### Inference using vLLM Python Client

In [4]:
from vllm import LLM, SamplingParams

MODEL_ID = "Qwen/Qwen3-Next-80B-A3B-Instruct"

llm = LLM(
    model=MODEL_ID,
    dtype="bfloat16",
    trust_remote_code=True,
    max_model_len=65536,
    gpu_memory_utilization=0.95,
    tensor_parallel_size=4,
)

print("Model ready")

  from .autonotebook import tqdm as notebook_tqdm


INFO 09-19 22:49:52 [__init__.py:216] Automatically detected platform cuda.
INFO 09-19 22:49:53 [utils.py:328] non-default args: {'trust_remote_code': True, 'dtype': 'bfloat16', 'max_model_len': 65536, 'tensor_parallel_size': 4, 'gpu_memory_utilization': 0.95, 'disable_log_stats': True, 'model': 'Qwen/Qwen3-Next-80B-A3B-Instruct'}


The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.


INFO 09-19 22:50:00 [model.py:543] Resolved architecture: Qwen3NextForCausalLM


`torch_dtype` is deprecated! Use `dtype` instead!


INFO 09-19 22:50:00 [model.py:1604] Using max model len 65536


2025-09-19 22:50:03,224	INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.


INFO 09-19 22:50:03 [scheduler.py:218] Chunked prefill is enabled with max_num_batched_tokens=16384.
INFO 09-19 22:50:03 [config.py:310] Hybrid or mamba-based model detected: disabling prefix caching since it is not yet supported.
INFO 09-19 22:50:03 [config.py:321] Hybrid or mamba-based model detected: setting cudagraph mode to FULL_AND_PIECEWISE in order to optimize performance.
INFO 09-19 22:50:04 [config.py:390] Setting attention block size to 272 tokens to ensure that attention page size is >= mamba page size.
INFO 09-19 22:50:04 [config.py:411] Padding mamba page size by 1.49% to ensure that mamba page size and attention page size are exactly equal.
INFO 09-19 22:50:07 [__init__.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=47291) INFO 09-19 22:50:09 [core.py:648] Waiting for init message from front-end.
(EngineCore_DP0 pid=47291) INFO 09-19 22:50:09 [core.py:75] Initializing a V1 LLM engine (v0.10.2rc3.dev279+gddc904839) with config:



INFO 09-19 22:50:18 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_108f115a'), local_subscribe_addr='ipc:///tmp/3cd384fe-6173-4459-854e-ab109f5d0a55', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-19 22:50:18 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_134358bd'), local_subscribe_addr='ipc:///tmp/2eb9dc23-65ff-4e73-b30e-b4f296b6e3ce', remote_subscribe_addr=None, remote_addr_ipv6=False)




[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
INFO 09-19 22:50:19 [__init__.py:1439] Found nccl from library libnccl.so.2
INFO 09-19 22:50:19 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 09-19 22:50:19 [__init__.py:1439] Found nccl from library libnccl.so.2
INFO 09-19 22:50:19 [__init__.py:1439] Found nccl from lib

(Worker_TP3 pid=47429) `torch_dtype` is deprecated! Use `dtype` instead!
(Worker_TP0 pid=47426) `torch_dtype` is deprecated! Use `dtype` instead!
(Worker_TP2 pid=47428) `torch_dtype` is deprecated! Use `dtype` instead!
(Worker_TP1 pid=47427) `torch_dtype` is deprecated! Use `dtype` instead!

(Worker_TP3 pid=47429) INFO 09-19 22:50:21 [weight_utils.py:348] Using model weights format ['*.safetensors']
(Worker_TP2 pid=47428) INFO 09-19 22:50:21 [weight_utils.py:348] Using model weights format ['*.safetensors']
(Worker_TP0 pid=47426) INFO 09-19 22:50:21 [weight_utils.py:348] Using model weights format ['*.safetensors']
(Worker_TP1 pid=47427) INFO 09-19 22:50:21 [weight_utils.py:348] Using model weights format ['*.safetensors']


Loading safetensors checkpoint shards:   0% Completed | 0/41 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   2% Completed | 1/41 [00:00<00:15,  2.54it/s]
Loading safetensors checkpoint shards:   5% Completed | 2/41 [00:00<00:16,  2.37it/s]
Loading safetensors checkpoint shards:   7% Completed | 3/41 [00:01<00:16,  2.27it/s]
Loading safetensors checkpoint shards:  10% Completed | 4/41 [00:01<00:16,  2.28it/s]
Loading safetensors checkpoint shards:  12% Completed | 5/41 [00:02<00:15,  2.31it/s]
Loading safetensors checkpoint shards:  15% Completed | 6/41 [00:02<00:15,  2.23it/s]
Loading safetensors checkpoint shards:  17% Completed | 7/41 [00:03<00:15,  2.23it/s]
Loading safetensors checkpoint shards:  20% Completed | 8/41 [00:03<00:14,  2.23it/s]
Loading safetensors checkpoint shards:  22% Completed | 9/41 [00:04<00:14,  2.20it/s]
Loading safetensors checkpoint shards:  24% Completed | 10/41 [00:04<00:13,  2.31it/s]
Loading safetensors checkpoint shards:  29% Completed | 12/41

(Worker_TP0 pid=47426) INFO 09-19 22:50:40 [default_loader.py:268] Loading weights took 18.03 seconds
(Worker_TP3 pid=47429) INFO 09-19 22:50:40 [default_loader.py:268] Loading weights took 18.37 seconds
(Worker_TP1 pid=47427) INFO 09-19 22:50:40 [default_loader.py:268] Loading weights took 18.41 seconds
(Worker_TP0 pid=47426) INFO 09-19 22:50:40 [gpu_model_runner.py:2570] Model loading took 37.2151 GiB and 18.791120 seconds
(Worker_TP3 pid=47429) INFO 09-19 22:50:40 [gpu_model_runner.py:2570] Model loading took 37.2151 GiB and 18.894837 seconds
(Worker_TP1 pid=47427) INFO 09-19 22:50:41 [gpu_model_runner.py:2570] Model loading took 37.2151 GiB and 18.992786 seconds
(Worker_TP2 pid=47428) INFO 09-19 22:50:41 [default_loader.py:268] Loading weights took 18.86 seconds
(Worker_TP2 pid=47428) INFO 09-19 22:50:41 [gpu_model_runner.py:2570] Model loading took 37.2151 GiB and 19.569748 seco

Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 67/67 [00:07<00:00,  9.07it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 67/67 [00:15<00:00,  4.22it/s]

(Worker_TP2 pid=47428) INFO 09-19 22:51:16 [gpu_model_runner.py:3370] Graph capturing finished in 24 secs, took 3.14 GiB
(Worker_TP2 pid=47428) INFO 09-19 22:51:16 [gpu_worker.py:392] Free memory on device (78.66/79.19 GiB) on startup. Desired GPU memory utilization is (0.95, 75.23 GiB). Actual usage is 37.22 GiB for weight, 5.61 GiB for peak activation, 0.59 GiB for non-torch memory, and 3.14 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with `--kv-cache-memory=30626413260` to fit into requested memory, or `--kv-cache-memory=34312793600` to fully utilize gpu memory. Current kv cache memory in use is 34160114380 bytes.
(Worker_TP1 pid=47427) INFO 09-19 22:51:16 [gpu_model_runner.py:3370] Graph capturing finished in 24 secs, took 3.14 GiB
(Worker_TP1 pid=47427) INFO 09-19 22:51:16 [gpu_worker.py:392] Free memory on device (78.66/79.19 GiB) on startup. Desired GPU memory utilization is (0.95, 75.23 GiB). Actual usage i

### Generate: single and batch

In [5]:
params = SamplingParams(temperature=0.6, max_tokens=200)

# Single prompt
single = llm.generate(["What is Nemotron Super?"], sampling_params=params)
print(single[0].outputs[0].text)

# Batch prompts
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "Explain quantum computing in simple terms:"
]
outputs = llm.generate(prompts, sampling_params=params)
for i, out in enumerate(outputs):
    print(f"\nPrompt {i+1}: {out.prompt!r}")
    print(out.outputs[0].text)

Adding requests: 100%|██████████| 1/1 [00:00<00:00, 145.94it/s]
Processed prompts: 100%|██████████| 1/1 [00:22<00:00, 22.11s/it, est. speed input: 0.27 toks/s, output: 9.05 toks/s]


 Nemotron Super is a new class of **synthetic data generation** models developed by **NVIDIA**. It is designed to generate **high-quality synthetic data** across a wide range of **modalities**, including **text, images, video, audio, and 3D**. This makes it a powerful tool for creating realistic training data to enhance the performance of **AI models**, especially in scenarios where real-world data is scarce, expensive, or sensitive.

Nemotron Super is part of NVIDIA's broader **Nemotron** family of models, which includes **Nemotron-4**, a family of **large language models (LLMs)** optimized for **inference**, **retrieval-augmented generation (RAG)**, **re-ranking**, and **embedding** tasks. While Nemotron-4 focuses on language understanding and generation, **Nemotron Super** extends this capability into **multimodal synthetic data generation**, enabling the creation of **synthetic data**


Adding requests: 100%|██████████| 3/3 [00:00<00:00, 2508.56it/s]
Processed prompts: 100%|██████████| 3/3 [00:02<00:00,  1.11it/s, est. speed input: 6.67 toks/s, output: 222.26 toks/s]


Prompt 1: 'Hello, my name is'
 <PRESIDIO_ANONYMIZED_PERSON> and I am a student in the Master's program in International Relations at the University of Vienna. I am writing to inquire about the possibility of conducting my Master's thesis research at the IAEA, under the supervision of Dr. Kornel Kleiner. As a student with a strong academic background in international relations and a deep interest in nuclear non-proliferation and disarmament, I believe that conducting my thesis research at the IAEA would provide me with unparalleled access to experts, data, and resources that would significantly enhance the quality of my work. I have attached my CV for your consideration.

I am particularly interested in exploring the role of the IAEA in the implementation of the Joint Comprehensive Plan of Action (JCPOA) and its implications for the future of nuclear non-proliferation. I believe that my research could contribute to the IAEA's ongoing efforts to strengthen nuclear safeguards and promote
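The batch prompts above are passed as raw text, so the model simply continues them, which is why `Hello, my name is` turns into a letter. For instruction-style answers, each prompt can be wrapped in chat messages so that the model's chat template is applied. The sketch below assumes that the `LLM.chat` method in your vLLM build accepts a list of conversations together with `sampling_params`; the helper names `to_conversations` and `chat_batch` are illustrative.

```python
def to_conversations(prompts, system_prompt=None):
    """Wrap each raw prompt in an OpenAI-style message list."""
    conversations = []
    for prompt in prompts:
        messages = []
        if system_prompt:
            messages.append({"role": "system", "content": system_prompt})
        messages.append({"role": "user", "content": prompt})
        conversations.append(messages)
    return conversations


def chat_batch(llm, prompts, sampling_params, system_prompt=None):
    """Run a batch of single-turn chats and return the generated texts."""
    outputs = llm.chat(
        to_conversations(prompts, system_prompt),
        sampling_params=sampling_params,
    )
    return [out.outputs[0].text for out in outputs]


# Usage with the objects defined earlier in this notebook:
# answers = chat_batch(llm, prompts, params)
# for prompt, answer in zip(prompts, answers):
#     print(f"{prompt!r} -> {answer}")
```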




In [6]:
# Cleanup: Delete the model and free GPU memory
# This is essential before moving to the next part (Thinking model)
print("Cleaning up Instruct model...")

# Delete the model object
if 'llm' in globals():
    del llm
    print("Deleted llm object")
    
# Force garbage collection
import gc
gc.collect()

# Clear GPU cache
import torch
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print("Cleared GPU cache")

print("Cleanup complete! Ready to load Thinking model.")


Cleaning up Instruct model...
(Worker_TP0 pid=47426) INFO 09-19 22:58:00 [multiproc_executor.py:558] Parent process exited, terminating worker
(Worker_TP0 pid=47426) INFO 09-19 22:58:00 [multiproc_executor.py:599] WorkerProc shutting down.
(Worker_TP1 pid=47427) INFO 09-19 22:58:00 [multiproc_executor.py:558] Parent process exited, terminating worker
(Worker_TP1 pid=47427) INFO 09-19 22:58:00 [multiproc_executor.py:599] WorkerProc shutting down.
(Worker_TP2 pid=47428) INFO 09-19 22:58:00 [multiproc_executor.py:558] Parent process exited, terminating worker
(Worker_TP3 pid=47429) INFO 09-19 22:58:00 [multiproc_executor.py:558] Parent process exited, terminating worker
Deleted llm object
Cleared GPU cache
Cleanup complete! Ready to load Thinking model.
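Because the tensor-parallel workers run in separate processes, deleting the `llm` object may not release GPU memory instantly. Before launching the next server, it can be worth confirming how much memory is actually free. The sketch below uses `torch.cuda.mem_get_info` and returns an empty list when PyTorch or CUDA is unavailable; the helper name `gpu_free_memory_gib` is illustrative.

```python
def gpu_free_memory_gib():
    """Return [(device_index, free_GiB, total_GiB), ...]; empty without PyTorch or CUDA."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    report = []
    for i in range(torch.cuda.device_count()):
        free_b, total_b = torch.cuda.mem_get_info(i)
        report.append((i, free_b / 1024**3, total_b / 1024**3))
    return report


for idx, free_gib, total_gib in gpu_free_memory_gib():
    print(f"GPU[{idx}]: {free_gib:.1f} / {total_gib:.1f} GiB free")
```

If the free figure is still far below the total, waiting a few seconds or restarting the kernel is a reasonable next step before starting the Thinking server.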


## Part 2: Qwen3-Next-Thinking with vLLM

We will now launch a separate vLLM server for the `Qwen/Qwen3-Next-80B-A3B-Thinking` model. This variant is optimized for complex reasoning tasks.

- Model card: https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking
- We will use a different port and served model name to avoid conflicts.
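When the server is launched without a reasoning parser, the Thinking model's reply typically carries its reasoning and its final answer in one string. The sketch below assumes the reasoning ends with a closing `</think>` tag and that the opening `<think>` tag may be absent because the chat template supplies it; check the model card for the exact output format. The helper name `split_thinking` and the sample string are illustrative.

```python
def split_thinking(text: str, end_tag: str = "</think>"):
    """Split raw model output into (reasoning, answer) at the last `end_tag`."""
    head, sep, tail = text.rpartition(end_tag)
    if not sep:
        # No closing tag found: treat everything as the answer.
        return "", text.strip()
    # Drop an explicit opening tag if the model emitted one.
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


# Hypothetical sample output, used only to exercise the helper.
sample = "The user asks for 2 + 2. That is 4.\n</think>\n\nThe answer is 4."
reasoning, answer = split_thinking(sample)
print("Reasoning:", reasoning)
print("Answer:", answer)
```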



### Launch Thinking Model Server


In [None]:
import subprocess, time

# Launch Thinking server on a different port
serve_cmd_thinking = [
    "vllm", "serve", "Qwen/Qwen3-Next-80B-A3B-Thinking",
    "--tensor-parallel-size", "4",
    "--served-model-name", "qwen3-next-thinking",
    "--host", "0.0.0.0", "--port", "8001"
]

thinking_process = subprocess.Popen(serve_cmd_thinking)
print(f"Started vLLM thinking server, PID={thinking_process.pid}")

# Wait for the server to be ready.
print("Waiting for server to initialize... (approx. 5 minutes)")
time.sleep(300)
print("vLLM thinking server should be ready.")

Started vLLM thinking server, PID=80693
Waiting for server to initialize... (approx. 5 minutes)
INFO 09-19 23:28:36 [__init__.py:216] Automatically detected platform cuda.
(APIServer pid=80693) INFO 09-19 23:28:38 [api_server.py:1896] vLLM API server version 0.10.2
(APIServer pid=80693) INFO 09-19 23:28:38 [utils.py:328] non-default args: {'model_tag': 'Qwen/Qwen3-Next-80B-A3B-Thinking', 'host': '0.0.0.0', 'port': 8001, 'model': 'Qwen/Qwen3-Next-80B-A3B-Thinking', 'served_model_name': ['qwen3-next-thinking'], 'tensor_parallel_size': 4}
(APIServer pid=80693) INFO 09-19 23:28:45 [__init__.py:742] Resolved architecture: Qwen3NextForCausalLM
(APIServer pid=80693) INFO 09-19 23:28:45 [__init__.py:1815] Using max model len 262144
(APIServer pid=80693) INFO 09-19 23:28:45 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=80693) INFO 09-19 23:28:45 [config.py:310] Hybrid o

(APIServer pid=80693) `torch_dtype` is deprecated! Use `dtype` instead!

(APIServer pid=80693) INFO 09-19 23:28:46 [config.py:390] Setting attention block size to 272 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=80693) INFO 09-19 23:28:46 [config.py:411] Padding mamba page size by 1.49% to ensure that mamba page size and attention page size are exactly equal.
INFO 09-19 23:28:49 [__init__.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=81038) INFO 09-19 23:28:52 [core.py:654] Waiting for init message from front-end.
(EngineCore_DP0 pid=81038) INFO 09-19 23:28:52 [core.py:76] Initializing a V1 LLM engine (v0.10.2) with config: model='Qwen/Qwen3-Next-80B-A3B-Thinking', speculative_config=None, tokenizer='Qwen/Qwen3-Next-80B-A3B-Thinking', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=4, pipel



[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3

[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
INFO 09-19 23:29:02 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 09-19 23:29:02 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 09-19 23:29:02 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 09-19 23:29:02 [pynccl.py:70] vLLM is using nccl==2.2

[1;36m(Worker_TP2 pid=81206)[0;0m `torch_dtype` is deprecated! Use `dtype` instead!
[1;36m(Worker_TP0 pid=81204)[0;0m `torch_dtype` is deprecated! Use `dtype` instead!
[1;36m(Worker_TP3 pid=81207)[0;0m `torch_dtype` is deprecated! Use `dtype` instead!
[1;36m(Worker_TP1 pid=81205)[0;0m `torch_dtype` is deprecated! Use `dtype` instead!


[1;36m(Worker_TP2 pid=81206)[0;0m INFO 09-19 23:29:04 [cuda.py:362] Using Flash Attention backend on V1 engine.
[1;36m(Worker_TP0 pid=81204)[0;0m INFO 09-19 23:29:04 [cuda.py:362] Using Flash Attention backend on V1 engine.
[1;36m(Worker_TP3 pid=81207)[0;0m INFO 09-19 23:29:04 [cuda.py:362] Using Flash Attention backend on V1 engine.
[1;36m(Worker_TP1 pid=81205)[0;0m INFO 09-19 23:29:04 [cuda.py:362] Using Flash Attention backend on V1 engine.
[1;36m(Worker_TP2 pid=81206)[0;0m INFO 09-19 23:29:05 [weight_utils.py:348] Using model weights format ['*.safetensors']
[1;36m(Worker_TP3 pid=81207)[0;0m INFO 09-19 23:29:05 [weight_utils.py:348] Using model weights format ['*.safetensors']
[1;36m(Worker_TP1 pid=81205)[0;0m INFO 09-19 23:29:05 [weight_utils.py:348] Using model weights format ['*.safetensors']
[1;36m(Worker_TP0 pid=81204)[0;0m INFO 09-19 23:29:05 [weight_utils.py:348] Using model weights format ['*.safetensors']


Loading safetensors checkpoint shards:   0% Completed | 0/41 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   2% Completed | 1/41 [00:00<00:17,  2.35it/s]
Loading safetensors checkpoint shards:   5% Completed | 2/41 [00:00<00:17,  2.25it/s]
Loading safetensors checkpoint shards:   7% Completed | 3/41 [00:01<00:17,  2.18it/s]
Loading safetensors checkpoint shards:  10% Completed | 4/41 [00:01<00:16,  2.18it/s]
Loading safetensors checkpoint shards:  12% Completed | 5/41 [00:02<00:16,  2.20it/s]
Loading safetensors checkpoint shards:  15% Completed | 6/41 [00:02<00:16,  2.12it/s]
Loading safetensors checkpoint shards:  17% Completed | 7/41 [00:03<00:15,  2.13it/s]
Loading safetensors checkpoint shards:  20% Completed | 8/41 [00:03<00:15,  2.14it/s]
Loading safetensors checkpoint shards:  22% Completed | 9/41 [00:04<00:15,  2.11it/s]
Loading safetensors checkpoint shards:  24% Completed | 10/41 [00:04<00:13,  2.24it/s]
Loading safetensors checkpoint shards:  29% Completed | 12/41

[1;36m(Worker_TP2 pid=81206)[0;0m INFO 09-19 23:29:24 [default_loader.py:268] Loading weights took 19.09 seconds
[1;36m(Worker_TP0 pid=81204)[0;0m INFO 09-19 23:29:24 [default_loader.py:268] Loading weights took 18.94 seconds
[1;36m(Worker_TP3 pid=81207)[0;0m INFO 09-19 23:29:24 [default_loader.py:268] Loading weights took 19.10 seconds
[1;36m(Worker_TP2 pid=81206)[0;0m INFO 09-19 23:29:24 [gpu_model_runner.py:2392] Model loading took 37.2152 GiB and 19.713313 seconds
[1;36m(Worker_TP0 pid=81204)[0;0m INFO 09-19 23:29:24 [gpu_model_runner.py:2392] Model loading took 37.2152 GiB and 19.567156 seconds
[1;36m(Worker_TP3 pid=81207)[0;0m INFO 09-19 23:29:25 [gpu_model_runner.py:2392] Model loading took 37.2152 GiB and 19.743312 seconds
[1;36m(Worker_TP1 pid=81205)[0;0m INFO 09-19 23:29:25 [default_loader.py:268] Loading weights took 19.94 seconds
[1;36m(Worker_TP1 pid=81205)[0;0m INFO 09-19 23:29:26 [gpu_model_runner.py:2392] Model loading took 37.2152 GiB and 20.796459 seco

Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 67/67 [00:10<00:00,  6.38it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 67/67 [00:29<00:00,  2.29it/s]


[1;36m(Worker_TP0 pid=81204)[0;0m INFO 09-19 23:30:22 [gpu_model_runner.py:3118] Graph capturing finished in 41 secs, took 3.05 GiB
[1;36m(Worker_TP0 pid=81204)[0;0m INFO 09-19 23:30:22 [gpu_worker.py:391] Free memory on device (78.66/79.19 GiB) on startup. Desired GPU memory utilization is (0.9, 71.27 GiB). Actual usage is 37.22 GiB for weight, 5.58 GiB for peak activation, 0.59 GiB for non-torch memory, and 3.05 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with `--kv-cache-memory=26511116185` to fit into requested memory, or `--kv-cache-memory=34448928256` to fully utilize gpu memory. Current kv cache memory in use is 29942056857 bytes.
[1;36m(Worker_TP1 pid=81205)[0;0m INFO 09-19 23:30:22 [gpu_model_runner.py:3118] Graph capturing finished in 41 secs, took 3.05 GiB
[1;36m(Worker_TP1 pid=81205)[0;0m INFO 09-19 23:30:22 [gpu_worker.py:391] Free memory on device (78.66/79.19 GiB) on startup. Desired GPU memory utilization is (0.9, 71.27 GiB). Actual usage is 

[1;36m(APIServer pid=80693)[0;0m INFO:     Started server process [80693]
[1;36m(APIServer pid=80693)[0;0m INFO:     Waiting for application startup.
[1;36m(APIServer pid=80693)[0;0m INFO:     Application startup complete.


[1;36m(APIServer pid=80693)[0;0m INFO:     127.0.0.1:46416 - "GET /health HTTP/1.1" 200 OK
vLLM thinking server should be ready.
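
The `GET /health` request in the log above suggests that the launch cell polls the server's health endpoint until it answers. A minimal sketch of such a readiness loop follows, assuming the `/health` route on port 8001 seen in the log. The helper names `http_ok` and `wait_until_ready` are hypothetical, and the timeout values are illustrative.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout_s=5.0):
    """Return True if a GET to `url` answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: the server is not ready yet.
        return False


def wait_until_ready(check, timeout_s=600.0, interval_s=5.0, sleep=time.sleep):
    """Call `check()` until it returns True or `timeout_s` elapses; return the final result."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

A call such as `wait_until_ready(lambda: http_ok("http://localhost:8001/health"))` would block until the server responds or ten minutes pass. Passing the check as a callable keeps the loop easy to test without a running server.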


### Inference against Thinking server

Use the OpenAI-compatible endpoint exposed by vLLM on port 8001.



In [4]:
import requests

THINKING_URL = "http://localhost:8001/v1/chat/completions"

thinking_request = {
    "model": "qwen3-next-thinking",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant specialized in complex reasoning."},
        {"role": "user", "content": "Explain the Hybrid Attention mechanism in Qwen3-Next in a few sentences."}
    ],
    "temperature": 0.6,  # per model best practices
    "top_p": 0.95,
    # With a raw HTTP request, vLLM-specific sampling parameters go at the top level of
    # the JSON body; `extra_body` is only a convention of the OpenAI Python client.
    "top_k": 20,
    "min_p": 0.0,
    "max_tokens": 512,
}

resp = requests.post(THINKING_URL, json=thinking_request, timeout=600)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])

[1;36m(APIServer pid=80693)[0;0m INFO 09-19 23:33:33 [chat_utils.py:538] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
[1;36m(APIServer pid=80693)[0;0m INFO:     127.0.0.1:60422 - "GET /v1/models HTTP/1.1" 200 OK
[1;36m(APIServer pid=80693)[0;0m INFO 09-19 23:34:04 [loggers.py:123] Engine 000: Avg prompt throughput: 8.4 tokens/s, Avg generation throughput: 30.7 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
[1;36m(APIServer pid=80693)[0;0m INFO:     127.0.0.1:46426 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Okay, the user is asking about the Hybrid Attention mechanism in Qwen3-Next. Hmm, I need to recall what I know about this. First, Qwen3-Next is a hypothetical or future version of Qwen, right? Because as of now, the latest version is Qwen3, but there's no official Qwen3-Next yet. Maybe the user is referring to a speculative or upcoming model.

Wait
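
The response above consists entirely of the model's reasoning and stops when it reaches `max_tokens=512`, before any final answer. A minimal sketch for separating reasoning from the answer follows. It assumes that the Thinking model ends its reasoning with a closing `</think>` tag, usually without an opening `<think>`; check this against the model card. `split_reasoning` is a hypothetical helper, not part of vLLM.

```python
def split_reasoning(text, end_tag="</think>"):
    """Split raw Thinking-model output into (reasoning, answer).

    If `end_tag` is absent (for example, generation hit `max_tokens` while the
    model was still reasoning), the whole text is treated as reasoning and the
    answer is an empty string.
    """
    reasoning, sep, answer = text.partition(end_tag)
    if not sep:
        return text.strip(), ""
    # Drop an opening tag if the server happened to include one (assumption).
    reasoning = reasoning.replace("<think>", "", 1)
    return reasoning.strip(), answer.strip()
```

An empty answer is a hint that the request needs a larger `max_tokens` so the model can finish reasoning and produce its reply.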

[1;36m(APIServer pid=80693)[0;0m INFO:     127.0.0.1:55666 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=80693)[0;0m INFO 09-19 23:34:14 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 71.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%


### Batch Inference against Thinking Server

In [None]:
import asyncio
import time
from openai import AsyncOpenAI

# Use an async client for concurrent requests
async_client = AsyncOpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

batch_prompts_thinking = [
    "Write a Python function to find the nth Fibonacci number with caching, and explain the time complexity.",
    "Describe the difference between Gated DeltaNet and standard attention.",
    "Compose a short, rhyming poem about a Mixture-of-Experts model.",
]

print("Sending batch of requests to Thinking model concurrently...")
start_time = time.time()

tasks = [
    async_client.chat.completions.create(
        model="qwen3-next-thinking",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": p}
        ],
        max_tokens=512,
        temperature=0.6,
        top_p=0.95,
        extra_body={"top_k": 20, "min_p": 0.0},
    ) for p in batch_prompts_thinking
]
responses = await asyncio.gather(*tasks)

end_time = time.time()
print(f"Batch completed in {end_time - start_time:.2f} seconds.\n")

for i, resp in enumerate(responses):
    print(f"--- Response for Prompt {i+1}: \"{batch_prompts_thinking[i]}\" ---")
    if resp.choices:
        print(resp.choices[0].message.content.strip())
    else:
        print("Empty response.")
    print("-" * (40 + len(batch_prompts_thinking[i])))
    print()


Sending batch of requests to Thinking model concurrently...
[1;36m(APIServer pid=80693)[0;0m INFO:     127.0.0.1:49582 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=80693)[0;0m INFO:     127.0.0.1:49598 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=80693)[0;0m INFO:     127.0.0.1:49602 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Batch completed in 7.71 seconds.

--- Response for Prompt 1: "Write a Python function to find the nth Fibonacci number with caching, and explain the time complexity." ---
Okay, the user wants a Python function to find the nth Fibonacci number using caching, and an explanation of the time complexity. Let me think about how to approach this.

First, I remember that the Fibonacci sequence is defined as F(0) = 0, F(1) = 1, and F(n) = F(n-1) + F(n-2) for n > 1. The naive recursive approach is inefficient because it recalculates the same values multiple times, leading to exponential time complexity. Caching (memoizati
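
The batch cell above sends all three requests at once with `asyncio.gather`. With a much longer prompt list it may be preferable to cap the number of requests in flight. A minimal sketch using `asyncio.Semaphore` follows. `run_limited` and `make_request` are hypothetical names, and the default limit of 8 is an arbitrary illustration rather than a tuned value.

```python
import asyncio


async def run_limited(make_request, prompts, max_concurrency=8):
    """Run `make_request(prompt)` for every prompt with at most `max_concurrency` in flight.

    Results are returned in the same order as `prompts`.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(prompt):
        # Each task waits here until one of the `max_concurrency` slots is free.
        async with semaphore:
            return await make_request(prompt)

    return await asyncio.gather(*(guarded(p) for p in prompts))
```

In the notebook, `make_request` could be a small coroutine that wraps `async_client.chat.completions.create(...)` with the same sampling parameters as the cell above. vLLM already batches concurrent requests on the server side, so the cap mainly protects the client and keeps queueing predictable.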

[1;36m(APIServer pid=80693)[0;0m INFO 09-19 23:34:24 [loggers.py:123] Engine 000: Avg prompt throughput: 11.1 tokens/s, Avg generation throughput: 153.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
[1;36m(APIServer pid=80693)[0;0m INFO 09-19 23:34:34 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%


### Close Thinking Server

In [None]:
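# Shut down the Thinking server
import subprocess

if 'thinking_process' in globals() and thinking_process.poll() is None:
    thinking_process.kill()
    thinking_process.wait()  # reap the process so it does not linger as a zombie
    print(f"Killed thinking server PID {thinking_process.pid}")
else:
    print("No running thinking server process found to terminate.")

# Fallback: kill any remaining vLLM processes by name.
# Note that this also stops unrelated vLLM servers on the same machine.
subprocess.run(['pkill', '-f', 'vllm'], check=False)
subprocess.run(['pkill', '-f', 'VLLM'], check=False)
print("Killed all vLLM processes - GPU memory should be freed")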

## Resource Notes

- **Hardware**: Qwen3-Next-80B-A3B is a large model: in the run above, the weights alone occupied about 37 GiB on each of four GPUs. Multi-GPU tensor parallelism (`--tensor-parallel-size`) is strongly recommended for acceptable performance.
- **Quantization**: In memory-constrained environments, consider a quantized version of the model (e.g., AWQ, GPTQ, INT4/INT8) if one is available. Quantization can significantly reduce memory usage at the cost of some accuracy.
- **Offloading**: For development or low-throughput scenarios, consider a smaller model from the Qwen3 family, or use model offloading to run the 80B-parameter model on systems with less VRAM, at the cost of inference speed.
- **Network**: Ensure you have sufficient network and disk bandwidth for the initial model download; the checkpoint spans 41 safetensors shards.
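
To judge whether a given GPU setup can hold the weights, a rough estimate is parameters times bytes per parameter, divided by the tensor-parallel size. The sketch below assumes exactly 80 billion parameters sharded evenly across GPUs. It ignores the KV cache, activations, CUDA graph memory, and any quantization metadata. `weight_gib_per_gpu` is a hypothetical helper.

```python
GIB = 1024 ** 3


def weight_gib_per_gpu(num_params, bytes_per_param, tensor_parallel_size):
    """Approximate weight memory per GPU in GiB, assuming weights shard evenly."""
    return num_params * bytes_per_param / tensor_parallel_size / GIB


bf16_tp4 = weight_gib_per_gpu(80e9, 2, 4)    # BF16 weights on 4 GPUs
int4_tp1 = weight_gib_per_gpu(80e9, 0.5, 1)  # hypothetical 4-bit weights on 1 GPU
print(f"BF16, TP=4: {bf16_tp4:.2f} GiB per GPU")
print(f"INT4, TP=1: {int4_tp1:.2f} GiB per GPU")
```

The BF16 estimate of about 37.25 GiB per GPU is close to the `Model loading took 37.2152 GiB` figure in the Thinking server's startup log. The remaining GPU memory is what vLLM has available for the KV cache and activations.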


## Conclusion and Next Steps
Congratulations! You have successfully deployed the `Qwen3-Next` models using vLLM.

In this notebook, you have learned how to:
- Set up your environment with the necessary dependencies.
- Launch and manage a vLLM server for the Instruct model.
- Run inference via the OpenAI-compatible HTTP API.
- Launch a second vLLM server for the Thinking model and run inference.
- Run batch inference for higher throughput.

You can adapt tensor parallelism, ports, and sampling parameters to your hardware and application needs.
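
As one way to keep those settings in a single place, the sketch below assembles a `vllm serve` command line from them. The flag values mirror the non-default arguments reported in the Thinking server's startup log. `build_serve_command` is a hypothetical helper, and your launch cell may pass additional flags.

```python
import shlex


def build_serve_command(model, served_name, port, tensor_parallel_size, host="0.0.0.0"):
    """Build the argv list for `vllm serve` from the settings that vary per deployment."""
    return [
        "vllm", "serve", model,
        "--host", host,
        "--port", str(port),
        "--served-model-name", served_name,
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]


cmd = build_serve_command(
    model="Qwen/Qwen3-Next-80B-A3B-Thinking",
    served_name="qwen3-next-thinking",
    port=8001,
    tensor_parallel_size=4,
)
print(shlex.join(cmd))
```

The resulting list can be passed directly to `subprocess.Popen(cmd)`. Building it as a list rather than a shell string avoids quoting problems when a model path contains spaces.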