# Running Qwen3-Next models with vLLM

This notebook is a step-by-step guide to downloading and running the `Qwen3-Next` model with vLLM on NVIDIA GPUs for high-performance inference. vLLM is an open-source library that makes Large Language Model (LLM) inference and serving faster and more efficient through advanced memory management and continuous batching. It significantly increases model throughput, reduces GPU memory usage, and lowers infrastructure costs, making it a key tool for deploying LLMs at scale.

`Qwen3-Next` is a new model architecture that introduces several key improvements over its predecessor: a hybrid attention mechanism, a highly sparse Mixture-of-Experts (MoE) structure, training-stability-friendly optimizations, and a multi-token prediction mechanism for faster inference. It is an 80-billion-parameter model that activates only 3 billion parameters during inference. Refer to the [model card](https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct) for more details. `Qwen3-Next` has two variants:

- `Qwen3-Next-80B-A3B-Instruct`
- `Qwen3-Next-80B-A3B-Thinking`

#### Launch on NVIDIA Brev
You can simplify the environment setup by using [NVIDIA Brev](https://developer.nvidia.com/brev). Click the button below to launch this project on a Brev instance with the necessary dependencies pre-configured.

Once deployed, click the "Open Notebook" button to get started with this guide.

[![Launch on Brev](https://brev-assets.s3.us-west-1.amazonaws.com/nv-lb-dark.svg)](https://brev.nvidia.com/launchable/deploy?launchableID=env-30i1YjHsRWT109HL6eYxLUeHIwF)

## Prerequisites

### Hardware
To run the `Qwen3-Next-80B-A3B-Instruct` model, you will need 4x A100 or 4x H200 NVIDIA GPUs.
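
As a rough sanity check on the four-GPU requirement, the weight footprint can be estimated from the parameter count. The sketch below assumes 16-bit (2-byte) weights, which is an assumption and not something stated above. It ignores the KV cache, activations, and runtime overhead, so real usage will be higher. `weight_memory_gb` is a hypothetical helper written for this illustration.

```python
def weight_memory_gb(num_params, bytes_per_param=2, num_gpus=1):
    """Approximate per-GPU memory (in GB) needed just to hold the model weights.

    Assumes the weights are split evenly across GPUs, which is only an
    approximation of what tensor parallelism does in practice.
    """
    return num_params * bytes_per_param / num_gpus / 1e9


NUM_PARAMS = 80e9  # 80 billion parameters, as stated for Qwen3-Next-80B-A3B

total_gb = weight_memory_gb(NUM_PARAMS)                # all weights on one device
per_gpu_gb = weight_memory_gb(NUM_PARAMS, num_gpus=4)  # split across 4 GPUs

print(f"Weights in total:  ~{total_gb:.0f} GB")
print(f"Weights per GPU:   ~{per_gpu_gb:.0f} GB (before KV cache and overhead)")
```

Although only about 3 billion parameters are active per token, all 80 billion still have to be resident in GPU memory, which is why the estimate uses the full parameter count.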

### Software
- CUDA Toolkit 12.8 or later
- Python 3.12 or later
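
Before installing anything, it can help to confirm that the environment meets these requirements. The following is a minimal sketch, not an official check: it compares the running Python version against 3.12 and, if `nvidia-smi` is on the `PATH`, lists the visible GPUs. The helper names (`python_ok`, `parse_gpu_list`, `list_gpus`) are hypothetical, and the sketch does not verify the CUDA Toolkit version.

```python
import shutil
import subprocess
import sys


def python_ok(version_info=sys.version_info, minimum=(3, 12)):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


def parse_gpu_list(csv_text):
    """Parse `name, memory` lines as printed by nvidia-smi in CSV mode."""
    gpus = []
    for line in csv_text.strip().splitlines():
        if "," not in line:
            continue  # skip anything that is not a `name, memory` row
        name, memory = [field.strip() for field in line.split(",", 1)]
        gpus.append((name, memory))
    return gpus


def list_gpus():
    """Return [(name, total_memory), ...], or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_gpu_list(result.stdout) if result.returncode == 0 else []


print("Python >= 3.12:", python_ok())
gpus = list_gpus()
print(f"GPUs visible: {len(gpus)}")
for name, memory in gpus:
    print(f"  {name} ({memory})")
```

If fewer than four GPUs are listed, the `--tensor-parallel-size 4` setting used later in this notebook will not work as written.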

## Installing vLLM

To run `Qwen3-Next` models you will need to install the nightly build of vLLM. 

In [None]:
```bash
# Install the vLLM nightly build via pip
pip install vllm --extra-index-url https://wheels.vllm.ai/nightly

# Alternatively, install into a uv-managed virtual environment:
# uv venv
# source .venv/bin/activate
# uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly
```
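
After the install finishes, you may want to confirm which vLLM build the current kernel actually sees, since nightly wheels change frequently. This is a small sketch using only the standard library. `installed_version` is a hypothetical helper, and the sketch only reports the version string without judging whether that build supports `Qwen3-Next`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


vllm_version = installed_version("vllm")
print("vllm:", vllm_version if vllm_version else "not installed in this environment")
```

If this prints "not installed", the notebook kernel is probably using a different Python environment from the one `pip` installed into.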

## Launch the Model with a Multi-GPU Setup

You can run the command below from the command line, or use `subprocess` to run it from this cell.
`vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \`
  `--tensor-parallel-size 4 \`
  `--served-model-name qwen3-next`

We will run it from this cell, so we wrap the command with `subprocess`.

In [None]:
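import subprocess

# Same command as above, split into arguments so no shell is needed.
cmd = [
    "vllm", "serve", "Qwen/Qwen3-Next-80B-A3B-Instruct",
    "--tensor-parallel-size", "4",        # shard the model across 4 GPUs
    "--served-model-name", "qwen3-next"   # name that clients pass as "model"
]
# Popen returns immediately; the server keeps loading in the background.
# Keep the handle so the process can be inspected or stopped later.
server = subprocess.Popen(cmd)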
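
Because `subprocess.Popen` returns immediately, the server is usually still downloading and loading weights when the next cell runs, and requests sent too early fail with a connection error. The sketch below polls until the server responds. It assumes the server listens on the default `http://localhost:8000` and exposes a `/health` endpoint, as vLLM's OpenAI-compatible server does. The helper names and the timeout values are hypothetical choices for this illustration.

```python
import time
import urllib.error
import urllib.request


def is_healthy(url="http://localhost:8000/health", timeout_s=5):
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error: not ready yet.
        return False


def wait_for_server(probe=is_healthy, max_wait_s=1800, poll_interval_s=10):
    """Call `probe` repeatedly until it returns True or `max_wait_s` elapses.

    Returns True if the server became ready, False on timeout.
    """
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(poll_interval_s)
    return False
```

Calling `wait_for_server()` in its own cell before sending any requests blocks until the model is loaded. A return value of `False` suggests checking the server output for errors such as out-of-memory failures.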

## Running inference with a simple Python function

In [None]:
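import requests

def inference(user_prompt):
    """Send a single-turn chat request to the local vLLM server and return the parsed JSON."""
    url = "http://localhost:8000/v1/chat/completions"
    headers = {"Content-Type": "application/json"}
    data = {
        "model": "qwen3-next",  # must match --served-model-name
        "messages": [
            {"role": "user", "content": user_prompt}
        ]
    }
    response = requests.post(url, headers=headers, json=data)
    response.raise_for_status()  # surface HTTP errors here rather than as a KeyError later
    return response.json()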

In [None]:
# Usage example

user_prompt = "What is the capital of France and why do people travel there?"
output = inference(user_prompt)
result = output['choices'][0]['message']['content']
print(result)
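
The request body above contains only the model name and a single user message. The chat completions format also accepts a system message and sampling fields such as `max_tokens` and `temperature`. The sketch below shows one way to build such a body and to read the reply defensively. `build_chat_payload` and `extract_text` are hypothetical helpers, and the default values are arbitrary illustrations rather than recommended settings.

```python
def build_chat_payload(user_prompt, system_prompt=None, model="qwen3-next",
                       max_tokens=256, temperature=0.7):
    """Build the JSON body for a /v1/chat/completions request."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt.strip()})
    return {
        "model": model,              # must match --served-model-name
        "messages": messages,
        "max_tokens": max_tokens,    # upper bound on generated tokens
        "temperature": temperature,  # lower values give more deterministic output
    }


def extract_text(response_json):
    """Return the first choice's text, or raise with the full body if there is none."""
    if "choices" not in response_json or not response_json["choices"]:
        raise RuntimeError(f"Unexpected response from server: {response_json}")
    return response_json["choices"][0]["message"]["content"]


payload = build_chat_payload(
    "What is the capital of France?",
    system_prompt="Answer in one short sentence.",
    max_tokens=64,
)
print(payload)
```

The resulting dictionary can be passed as the `json=` argument of `requests.post`, and `extract_text` can replace the direct `output['choices'][0]['message']['content']` lookup so that an error body from the server produces a readable exception.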

## Conclusion and Next Steps
Congratulations! You successfully deployed `Qwen3-Next` using vLLM.

In this notebook, you have learned how to:
- Set up your environment with the necessary dependencies.
- Use `vllm serve` to deploy the model.
- Run inference.
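
When you are finished, remember that a server started with `subprocess.Popen` keeps running, and keeps holding GPU memory, until it is stopped. The sketch below shows one way to shut it down, assuming you kept the handle returned by `subprocess.Popen`. `stop_server` and the 30-second grace period are hypothetical choices for this illustration.

```python
import subprocess


def stop_server(proc, grace_s=30):
    """Ask the process to exit, force-kill it if it does not, and return its exit code."""
    if proc.poll() is None:      # None means the process is still running
        proc.terminate()         # polite request to exit (SIGTERM on POSIX)
        try:
            proc.wait(timeout=grace_s)
        except subprocess.TimeoutExpired:
            proc.kill()          # did not exit in time, so force it
            proc.wait()
    return proc.returncode
```

For example, if the handle was stored as `server = subprocess.Popen(cmd)`, calling `stop_server(server)` releases the GPUs for other work.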