In [None]:
%pip install pandas datasets ipywidgets

## Speed evaluation

We will benchmark the inference speed of Qwen-3 8B on a single NVIDIA L40 using vLLM. The model will be served in four configurations: BF16, FP8, AWQ INT4 (default kernel) and AWQ INT4 (Marlin kernel).

We're going to measure three main metrics: time-to-first-token (TTFT), time-per-output-token (TPOT) and full end-to-end latency.
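To make the three metrics concrete, here is a minimal sketch of how they can be derived from per-token arrival times of one streamed request. The `RequestTrace` class and `latency_metrics` function are hypothetical names introduced for illustration; the benchmark helper used later may compute or aggregate these values differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timestamps (in seconds) recorded for one streamed request (hypothetical)."""
    sent_at: float            # when the request was sent
    token_times: List[float]  # arrival time of each generated token


def latency_metrics(trace: RequestTrace) -> dict:
    """Return TTFT, TPOT and end-to-end latency in milliseconds."""
    first, last = trace.token_times[0], trace.token_times[-1]
    n_tokens = len(trace.token_times)
    ttft = first - trace.sent_at   # wait until the first token arrives
    e2el = last - trace.sent_at    # wait until the last token arrives
    # TPOT averages the gaps between tokens after the first one.
    tpot = (last - first) / (n_tokens - 1) if n_tokens > 1 else 0.0
    return {"ttft_ms": ttft * 1e3, "tpot_ms": tpot * 1e3, "e2el_ms": e2el * 1e3}


# Made-up timestamps: first token after 100 ms, then one token every 50 ms.
trace = RequestTrace(sent_at=0.0, token_times=[0.10, 0.15, 0.20, 0.25])
print(latency_metrics(trace))
```

Under this definition, end-to-end latency is roughly TTFT plus TPOT times the number of remaining output tokens, which is why TPOT dominates for long generations.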

Both AWQ configurations use INT4 weight quantization to make the model smaller and faster; they differ only in the kernel, that is, the low-level code that runs on the GPU to do the computation. The default kernel is the original method for running AWQ-quantized models in vLLM, while Marlin is a newer, more optimized kernel designed for better speed and efficiency on modern GPUs. For this lesson, you only need to know that Marlin aims to improve performance compared to the default approach. You can find more details in the [Marlin repository](https://github.com/IST-DASLab/marlin).
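As a rough illustration of what "INT4 quantization" means for the weights, the sketch below quantizes one small group of values to 4-bit codes and restores them. It is plain round-to-nearest quantization in pure Python, not the AWQ algorithm itself (AWQ additionally rescales weights using activation statistics before quantizing), and the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_group(weights: List[float]) -> Tuple[List[int], float, float]:
    """Map one group of weights to integer codes in [0, 15] (asymmetric 4-bit)."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for a constant group
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes: List[int], scale: float, lo: float) -> List[float]:
    """Recover approximate weights; a GPU kernel does this before the matmul."""
    return [c * scale + lo for c in codes]


def pack_nibbles(codes: List[int]) -> bytes:
    """Store two 4-bit codes per byte, which is where the size saving comes from."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even number of codes
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


weights = [0.12, -0.40, 0.33, 0.05, -0.21, 0.48, -0.07, 0.26]  # made-up values
codes, scale, lo = quantize_group(weights)
restored = dequantize_group(codes, scale, lo)
print(codes, len(pack_nibbles(codes)), "bytes")
print(max(abs(a - b) for a, b in zip(weights, restored)))  # rounding error
```

The dequantization step is the part a kernel has to perform at inference time, so how often and how efficiently it runs largely determines whether the smaller weights translate into faster decoding.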

In [1]:
import os

from utils import setup_benchmark_environment, bench_single_model


INFO 06-22 14:33:10 [__init__.py:244] Automatically detected platform cuda.


In [None]:
ok = setup_benchmark_environment()
assert ok

In [5]:
print(ok)

True


In [2]:
models = [
    {
        "name": "Qwen-3 8B / BF16",
        "hf_id": "Qwen/Qwen3-8B",
        "vllm_args": [
            "--max-model-len", "2048",
            "--max-seq-len-to-capture", "2048",
            "--gpu-memory-utilization", "0.96",
        ],
    },
    {
        "name": "Qwen-3 8B / FP8",
        "hf_id": "Qwen/Qwen3-8B-FP8",
        "vllm_args": [
            "--max-model-len", "2048",
            "--max-seq-len-to-capture", "2048",
            "--gpu-memory-utilization", "0.96",
            "--quantization", "fp8",
        ],
    },
    {
        "name": "Qwen-3 8B / AWQ 4bit",
        "hf_id": "Qwen/Qwen3-8B-AWQ",
        "vllm_args": [
            "--max-model-len", "2048",
            "--max-seq-len-to-capture", "2048",
            "--gpu-memory-utilization", "0.96",
            "--quantization", "awq",
        ],
    },
    {
        "name": "Qwen-3 8B / AWQ Marlin 4bit",
        "hf_id": "Qwen/Qwen3-8B-AWQ",
        "vllm_args": [
            "--max-model-len", "2048",
            "--max-seq-len-to-capture", "2048",
            "--gpu-memory-utilization", "0.96",
            "--dtype", "half",
            "--quantization", "awq_marlin",
        ],
    }
]
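Each `vllm_args` list holds command-line flags for the vLLM server. The internals of `bench_single_model` are not shown here, so the following is only a sketch of how such an entry could be turned into a `vllm serve` command; `build_serve_command` is a hypothetical helper, not part of `utils`.

```python
import shlex
from typing import Dict, List


def build_serve_command(model: Dict, port: int = 8000) -> List[str]:
    """Assemble a `vllm serve` command line from one entry of `models`."""
    return ["vllm", "serve", model["hf_id"], "--port", str(port), *model["vllm_args"]]


# Shortened copy of the AWQ Marlin entry, for illustration only.
example = {
    "hf_id": "Qwen/Qwen3-8B-AWQ",
    "vllm_args": ["--max-model-len", "2048", "--quantization", "awq_marlin"],
}
print(shlex.join(build_serve_command(example, port=9134)))
```

Keeping the flags as a flat list of strings means they can be passed to `subprocess` unchanged, without any shell quoting.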

In [3]:
from huggingface_hub import snapshot_download

for model in models:
    snapshot_download(repo_id=model["hf_id"])


In [4]:
# `__file__` is not defined inside a notebook, so use the working directory.
current_dir = os.getcwd()

In [None]:
results = []

for model in models:
    res = bench_single_model(
        model_name=model["hf_id"],
        port=9134,
        request_rate=10,
        num_prompts=200,
        vllm_path=os.path.join(current_dir, "vllm"),
        vllm_args=model["vllm_args"],
        input_len=512,
        output_len=128
    )
    res["name"] = model["name"]
    results.append(res)

In [7]:
import pandas as pd

df = pd.DataFrame(results)
df[['name', 'throughput_tokens_per_sec', 'ttft_mean', 'ttft_p90', 'tpot_mean', 'tpot_p90', 'e2el_mean', 'e2el_p90']]

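|   | name | throughput_tokens_per_sec | ttft_mean (ms) | ttft_p90 (ms) | tpot_mean (ms) | tpot_p90 (ms) | e2el_mean (ms) | e2el_p90 (ms) |
|---|------|---------------------------|----------------|---------------|----------------|---------------|----------------|---------------|
| 0 | Qwen-3 8B / BF16 | 1052.84 | 144.20 | 224.63 | 50.84 | 56.68 | 6243.78 | 7276.14 |
| 1 | Qwen-3 8B / FP8 | 1119.78 | 100.66 | 147.52 | 33.47 | 37.64 | 4109.30 | 4872.23 |
| 2 | Qwen-3 8B / AWQ 4bit | 868.82 | 283.24 | 461.84 | 104.87 | 122.98 | 12778.73 | 15857.04 |
| 3 | Qwen-3 8B / AWQ Marlin 4bit | 1170.49 | 99.31 | 160.09 | 24.91 | 29.16 | 3032.04 | 3705.31 |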


### 📝 What These Results Show

- FP8 cuts per-token latency by about 34% versus BF16. Ada and Hopper GPUs (for example L40, H100, H200) include native FP8 tensor cores, so decoding cost drops from 50.8 ms per token (BF16) to 33.5 ms (FP8), while throughput rises by about 6% (1053 to 1120 tok/s).

- AWQ‑Marlin wins overall. Its optimized kernel moves far less data through memory and uses a mixed-precision dot product that the GPU executes efficiently, giving it the highest throughput (1170 tok/s) and the lowest TPOT (24.9 ms).

- Kernel quality beats bit‑width.

  - The default AWQ path in vLLM dequantises 4‑bit weights to FP16 every layer and every token, and promotes BF16 activations to FP16 before the matmul. In this benchmark the extra memory traffic and frequent dtype conversions more than wipe out the INT4 advantage: default AWQ is about twice as slow per token as BF16 (104.9 ms vs 50.8 ms).

  - Marlin avoids this cost by expanding each weight block once per micro‑batch and re‑using it, keeping activations in FP16.

  - INT4 alone does not guarantee speed — you need a kernel that is both compute‑ and memory‑efficient.

- **Take‑aways for practitioners**

  - Measure, don’t assume. Low precision brings speed-ups only when the workload is memory-bandwidth-bound (i.e. the slow part is transferring data to and from GPU memory, not doing the computations) and when the kernel avoids redundant conversions.

  - Use FP8 on Ada/Hopper whenever memory allows. In this benchmark it lowered TTFT, TPOT and end-to-end latency by roughly one-third relative to BF16, and it is expected to keep near‑BF16 quality.

  - Prefer AWQ‑Marlin when you need the smallest memory footprint and high throughput.
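The percentages quoted in this section can be recomputed from the mean values in the results table. The sketch below copies those numbers by hand into a plain dictionary (the short configuration labels are ours) and prints the change of each configuration relative to BF16.

```python
# Mean values copied from the results table above (tokens/s and milliseconds).
results = {
    "BF16":       {"throughput": 1052.84, "ttft": 144.20, "tpot": 50.84,  "e2el": 6243.78},
    "FP8":        {"throughput": 1119.78, "ttft": 100.66, "tpot": 33.47,  "e2el": 4109.30},
    "AWQ":        {"throughput": 868.82,  "ttft": 283.24, "tpot": 104.87, "e2el": 12778.73},
    "AWQ Marlin": {"throughput": 1170.49, "ttft": 99.31,  "tpot": 24.91,  "e2el": 3032.04},
}


def relative_change(value: float, baseline: float) -> float:
    """Percentage change of `value` with respect to `baseline`."""
    return (value - baseline) / baseline * 100


baseline = results["BF16"]
change = {
    name: {metric: relative_change(v, baseline[metric]) for metric, v in row.items()}
    for name, row in results.items()
}

for name, row in change.items():
    summary = ", ".join(f"{metric} {pct:+.1f}%" for metric, pct in row.items())
    print(f"{name:<11} {summary}")
```

Negative values are improvements for the latency metrics and regressions for throughput, so the sign has to be read per column.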

## Quality evaluation

In this part we will see how quantization affects quality.

We will load several models, including a Llama-3.1-70B variant compressed with AWQ 4-bit, and measure their performance with `lm-eval`. Thanks to 4-bit quantization, the entire 70B checkpoint fits on a single NVIDIA L40 (46 GB), whereas the 16-bit original needs ≈140 GB for the weights alone and an 8-bit version ≈70 GB.
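These sizes follow from simple arithmetic: parameter count times bits per weight, divided by 8 to get bytes. The sketch below reproduces the estimate; it counts weights only, so real 4-bit checkpoints are somewhat larger (they also store scales and zero points), and the KV cache and activations need additional memory at runtime.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS_70B = 70e9  # nominal parameter count, rounded

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: ~{weight_memory_gb(N_PARAMS_70B, bits):.0f} GB")
```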

We will also benchmark Qwen3-8B, comparing its BF16 baseline with the more compact FP8 and AWQ-INT4 variants.

For Qwen-3 the authors already [report scores](https://huggingface.co/Qwen/Qwen3-8B-AWQ#performance) on tougher leaderboard-style benchmarks (LiveBench, GPQA, etc.), showing that AWQ-INT4 loses only slightly compared with BF16 while cutting memory use by ≳50 %.

Below, the `evaluate_model` helper runs evaluation on HellaSwag, a multiple-choice commonsense reasoning benchmark designed to test a model's ability to choose the most plausible continuation of a given context. This allows you to observe the quality-vs-size trade-offs in a reproducible setup focused on everyday reasoning performance.
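To show how a multiple-choice benchmark like this is typically scored, here is a toy sketch: the model assigns a log-likelihood to each candidate ending and the highest-scoring ending is taken as the prediction. The log-likelihood values are made up, `pick_ending` and `accuracy` are hypothetical helpers, and whether `evaluate_model` returns the raw or the length-normalized accuracy is not stated in this notebook.

```python
from typing import List


def pick_ending(logprobs: List[float], endings: List[str], normalize: bool = False) -> int:
    """Return the index of the ending with the highest (optionally normalized) score."""
    scores = [
        lp / len(ending.encode("utf-8")) if normalize else lp
        for lp, ending in zip(logprobs, endings)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(predictions: List[int], gold: List[int]) -> float:
    """Fraction of examples where the predicted ending matches the label."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


endings = ["sits.", "walks the dog to the park."]
logprobs = [-4.0, -9.0]  # made-up total log-likelihoods from a model

print(pick_ending(logprobs, endings))                  # raw score favors the short ending
print(pick_ending(logprobs, endings, normalize=True))  # per-byte score favors the long one
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```

Length normalization matters because longer endings accumulate more negative log-likelihood simply by containing more tokens.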

In [None]:
%pip install "lm-eval[vllm]"

In [None]:
from utils import evaluate_model

models = [
    {
        "name": "Llama 70B / GGUF 4bit",
        # repo that contains the *.gguf file
        "hf_id": "bartowski/Meta-Llama-3.1-70B-Instruct-GGUF",
        "vllm_args": [
            "--max-model-len", "256",
            "--max-seq-len-to-capture", "256",
            "--max-num-seqs", "1",  # setting to 1 since evaluation requires logits that consume a lot of memory
            "--tokenizer", "meta-llama/Meta-Llama-3.1-70B-Instruct"
        ],
    },
    {
        "name": "Llama 70b / AWQ 4bit",
        "hf_id": "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
        "vllm_args": [
            "--max-model-len", "256",
            "--max-seq-len-to-capture", "256",
            "--max-num-seqs", "1",  # setting to 1 since evaluation requires logits that consume a lot of memory
            "--quantization", "awq_marlin",
        ],
    },
    {
        "name": "Qwen-3 8B / BF16",
        "hf_id": "Qwen/Qwen3-8B",
        "vllm_args": [
            "--max-model-len", "256",
            "--max-seq-len-to-capture", "256",
        ],
    },
    {
        "name": "Qwen-3 8B / FP8",
        "hf_id": "Qwen/Qwen3-8B-FP8",
        "vllm_args": [
            "--max-model-len", "256",
            "--max-seq-len-to-capture", "256",
            "--quantization", "fp8",
        ],
    },
    {
        "name": "Qwen-3 8B / AWQ 4bit",
        "hf_id": "Qwen/Qwen3-8B-AWQ",
        "vllm_args": [
            "--max-model-len", "256",
            "--max-seq-len-to-capture", "256",
            "--quantization", "awq",
        ],
    },
]

INFO 06-22 13:36:27 [__init__.py:244] Automatically detected platform cuda.


In [None]:
%%time

results = []

for model in models:
    hellaswag_acc = evaluate_model(
        model_name=model["hf_id"],
        vllm_args=model["vllm_args"],
        limit=500,
    )
    results.append({
        "name": model["name"],
        "acc": hellaswag_acc
    })


In [5]:
import pandas as pd

pd.DataFrame(results)

|   | name | acc |
|---|------|-----|
| 0 | Llama 70b / AWQ 4bit | 0.728 |
| 1 | Qwen-3 8B / BF16 | 0.644 |
| 2 | Qwen-3 8B / FP8 | 0.640 |
| 3 | Qwen-3 8B / AWQ 4bit | 0.634 |


The results show that quantization costs only a small amount of accuracy while offering substantial memory savings.

The Llama 70B model, even in its 4-bit AWQ form, achieves the highest HellaSwag accuracy (72.8%), beating all Qwen-3 8B variants. For Qwen-3 8B, accuracy decreases slightly from BF16 (64.4%) to FP8 (64.0%) and AWQ 4-bit (63.4%), a gap of at most 1 percentage point on this 500-example subset, in line with the usual quality-vs-efficiency trade-off.
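To judge whether gaps of this size are meaningful on 500 examples, a binomial standard error gives a rough yardstick. The sketch below treats each run as an independent sample, which is a simplifying assumption: all models are scored on the same examples, so a paired comparison would be more precise.

```python
import math


def accuracy_stderr(acc: float, n: int) -> float:
    """Binomial standard error of an accuracy measured on n examples."""
    return math.sqrt(acc * (1 - acc) / n)


N_EXAMPLES = 500  # matches limit=500 in the evaluation loop
scores = {
    "Llama 70b / AWQ 4bit": 0.728,
    "Qwen-3 8B / BF16": 0.644,
    "Qwen-3 8B / FP8": 0.640,
    "Qwen-3 8B / AWQ 4bit": 0.634,
}

for name, acc in scores.items():
    se = accuracy_stderr(acc, N_EXAMPLES)
    print(f"{name:<22} acc={acc:.3f}  stderr={se:.3f}  95% CI=±{1.96 * se:.3f}")
```

With a standard error of about 2 percentage points per run, the differences between the three Qwen-3 8B variants fall within sampling noise, while the gap to the 70B model is several standard errors wide.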

Overall, quantization proves effective for deploying high-performing models within limited hardware budgets.