<a href="https://colab.research.google.com/github/Nebius-Academy/LLM-Engineering-Essentials/blob/main/topic5/5.4_quantization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLM Engineering Essentials by Nebius Academy

Course GitHub: [link](https://github.com/Nebius-Academy/LLM-Engineering-Essentials/tree/main)

The course is in development now, with more materials coming soon.

# 5.4. Quantization

**By: [Alexey Bukhtiyarov](https://www.linkedin.com/in/leshanbog/)**

In this notebook we evaluate both the **speed** and **quality** of LLMs that have been quantized to different precisions.

* **Speed** - models are served with **vLLM**
* **Quality** - using **lm-eval-harness** we measure downstream accuracy to see how much each quantization level impacts performance.

All helper functions that spin up vLLM, drive the benchmarks, and run the evaluations live in `utils.py` - feel free to open that file for implementation details.

In [None]:
%pip install pandas datasets ipywidgets

## Speed evaluation

We will benchmark the inference speed of a single **Qwen-3 8B** model on an NVIDIA L40 using vLLM. The model will be served in four precisions:
- BF16
- FP8
- AWQ INT4 (default kernel)
- AWQ INT4 (Marlin kernel)

We're going to measure four main metrics:
- time-to-first-token (TTFT)
- time-per-output-token (TPOT)
- full end-to-end latency
- token throughput
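To make the four metrics concrete, here is a minimal sketch of how they can be derived from per-request timestamps. The `RequestTrace` record and the helper names are hypothetical and are not taken from `utils.py` or vLLM; the TPOT formula (time after the first token divided by the remaining output tokens) and the output-token throughput are assumed definitions that match common usage.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Timestamps (seconds) for one request; a hypothetical record type."""
    start: float         # request sent
    first_token: float   # first output token received
    end: float           # last output token received
    output_tokens: int   # number of generated tokens


def ttft_ms(t: RequestTrace) -> float:
    # Time-to-first-token: waiting time before any output appears.
    return (t.first_token - t.start) * 1000.0


def e2el_ms(t: RequestTrace) -> float:
    # End-to-end latency: from sending the request to the last token.
    return (t.end - t.start) * 1000.0


def tpot_ms(t: RequestTrace) -> float:
    # Time-per-output-token: average gap between tokens after the first one.
    if t.output_tokens < 2:
        return 0.0
    return (e2el_ms(t) - ttft_ms(t)) / (t.output_tokens - 1)


def throughput_tokens_per_sec(traces: list[RequestTrace]) -> float:
    # Output tokens generated per second of wall-clock time across all requests.
    duration = max(t.end for t in traces) - min(t.start for t in traces)
    return sum(t.output_tokens for t in traces) / duration


# Made-up timestamps for two overlapping requests.
traces = [
    RequestTrace(start=0.0, first_token=0.1, end=2.0, output_tokens=20),
    RequestTrace(start=0.5, first_token=0.7, end=3.0, output_tokens=24),
]

for t in traces:
    print(f"TTFT={ttft_ms(t):.1f} ms  TPOT={tpot_ms(t):.1f} ms  E2EL={e2el_ms(t):.1f} ms")
print(f"throughput={throughput_tokens_per_sec(traces):.2f} tok/s")
```

TTFT mostly reflects prompt processing (prefill) and queueing, while TPOT reflects the decoding loop, which is why the two can react differently to quantization.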

*Note*: Both AWQ (default) and AWQ Marlin run the same INT4-quantized weights, which make the model smaller and can make it faster. They differ in the kernel, that is, the low-level code that runs on the GPU to do the computation. The default kernel is vLLM's original implementation for AWQ models, while Marlin is a newer, more optimized kernel designed for better speed and efficiency on modern GPUs.

For this lesson, you only need to know that Marlin aims to improve performance compared to the default approach. You can find more details about the Marlin kernel [here](https://github.com/IST-DASLab/marlin).
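To give a feel for what "INT4 quantization" means for the weights themselves, the sketch below quantizes one small group of weights to 4-bit codes and reconstructs them. It uses plain round-to-nearest with a per-group scale and offset, which is an assumption made for illustration: AWQ additionally uses activation statistics when choosing how to scale weights, and that step is not shown here. All function names are hypothetical.

```python
def quantize_group(weights, bits=4):
    """Round-to-nearest quantization of one group of floats to `bits`-bit codes."""
    levels = (1 << bits) - 1                 # 15 distinct steps for 4 bits
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0        # fall back to 1.0 for a constant group
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Reconstruct approximate float weights from integer codes."""
    return [c * scale + lo for c in codes]


def pack_nibbles(codes):
    """Store two 4-bit codes per byte, which is where the memory saving comes from."""
    if len(codes) % 2:
        codes = codes + [0]
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))


# Made-up weights for one group (real groups are much larger).
weights = [0.12, -0.40, 0.33, 0.05, -0.18, 0.27, -0.02, 0.41]

codes, scale, lo = quantize_group(weights)
restored = dequantize_group(codes, scale, lo)
packed = pack_nibbles(codes)

max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("codes:", codes)
print("bytes used for 8 weights:", len(packed))
print(f"max reconstruction error: {max_error:.4f} (step size {scale:.4f})")
```

A kernel for such a format has to unpack the codes and apply `scale` and `lo` somewhere along the way; how and when it does that is the kind of detail in which two kernels for the same weights can differ.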

In [None]:
!curl -o utils.py https://raw.githubusercontent.com/Nebius-Academy/LLM-Engineering-Essentials/main/topic5/utils.py

In [None]:
import os

from utils import setup_benchmark_environment, bench_single_model


INFO 06-22 14:33:10 [__init__.py:244] Automatically detected platform cuda.


In [None]:
ok = setup_benchmark_environment()
assert ok

In [None]:
print(ok)

True


In [None]:
models = [
    {
        "name": "Qwen-3 8B / BF16",
        "hf_id": "Qwen/Qwen3-8B",
        "vllm_args": [
            "--max-model-len", "2048",
            "--max-seq-len-to-capture", "2048",
            "--gpu-memory-utilization", "0.96",
        ],
    },
    {
        "name": "Qwen-3 8B / FP8",
        "hf_id": "Qwen/Qwen3-8B-FP8",
        "vllm_args": [
            "--max-model-len", "2048",
            "--max-seq-len-to-capture", "2048",
            "--gpu-memory-utilization", "0.96",
            "--quantization", "fp8",
        ],
    },
    {
        "name": "Qwen-3 8B / AWQ 4bit",
        "hf_id": "Qwen/Qwen3-8B-AWQ",
        "vllm_args": [
            "--max-model-len", "2048",
            "--max-seq-len-to-capture", "2048",
            "--gpu-memory-utilization", "0.96",
            "--quantization", "awq",
        ],
    },
    {
        "name": "Qwen-3 8B / AWQ Marlin 4bit",
        "hf_id": "Qwen/Qwen3-8B-AWQ",
        "vllm_args": [
            "--max-model-len", "2048",
            "--max-seq-len-to-capture", "2048",
            "--gpu-memory-utilization", "0.96",
            "--dtype", "half",
            "--quantization", "awq_marlin",
        ],
    }
]

In [None]:
from huggingface_hub import snapshot_download

for model in models:
    snapshot_download(repo_id=model["hf_id"])

In [None]:
# Notebooks do not define __file__, so use the current working directory.
current_dir = os.getcwd()

In [None]:
results = []

for model in models:
    res = bench_single_model(
        model_name=model["hf_id"],
        port=9134,
        request_rate=10,
        num_prompts=200,
        vllm_path=os.path.join(current_dir, "vllm"),
        vllm_args=model["vllm_args"],
        input_len=512,
        output_len=128
    )
    res["name"] = model["name"]
    results.append(res)

In [None]:
import pandas as pd

df = pd.DataFrame(results)
df[['name', 'throughput_tokens_per_sec', 'ttft_mean', 'ttft_p90', 'tpot_mean', 'tpot_p90', 'e2el_mean', 'e2el_p90']]

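| | name | throughput_tokens_per_sec | ttft_mean | ttft_p90 | tpot_mean | tpot_p90 | e2el_mean | e2el_p90 |
|---|---|---|---|---|---|---|---|---|
| 0 | Qwen-3 8B / BF16 | 1052.84 | 144.2 | 224.63 | 50.84 | 56.68 | 6243.78 | 7276.14 |
| 1 | Qwen-3 8B / FP8 | 1119.78 | 100.66 | 147.52 | 33.47 | 37.64 | 4109.3 | 4872.23 |
| 2 | Qwen-3 8B / AWQ 4bit | 868.82 | 283.24 | 461.84 | 104.87 | 122.98 | 12778.73 | 15857.04 |
| 3 | Qwen-3 8B / AWQ Marlin 4bit | 1170.49 | 99.31 | 160.09 | 24.91 | 29.16 | 3032.04 | 3705.31 |

All latency columns (`ttft_*`, `tpot_*`, `e2el_*`) are in milliseconds.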
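A table like this is easiest to read as relative changes against the BF16 row. The short sketch below does that arithmetic for mean TPOT; the dictionary copies the `tpot_mean` values from the table, and `relative_change` is a helper name introduced here for illustration.

```python
# Mean time-per-output-token in milliseconds, copied from the results table.
tpot_mean_ms = {
    "BF16": 50.84,
    "FP8": 33.47,
    "AWQ 4bit": 104.87,
    "AWQ Marlin 4bit": 24.91,
}


def relative_change(value: float, baseline: float) -> float:
    """Percentage change of `value` relative to `baseline` (negative means faster)."""
    return (value - baseline) / baseline * 100.0


baseline = tpot_mean_ms["BF16"]
for name, value in tpot_mean_ms.items():
    print(f"{name:>16}: {value:7.2f} ms  ({relative_change(value, baseline):+.1f}% vs BF16)")
```

The same helper works for any other column, for example `ttft_mean` or `e2el_mean`.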


### 📝 What These Results Show

- FP8 cuts per-token latency by about 34% versus BF16. Ada and Hopper GPUs (for example L40, H100, H200) include native FP8 tensor cores, so mean TPOT drops from 50.8 ms (BF16) to 33.5 ms (FP8) while throughput rises from 1053 to 1120 tok/s.

- AWQ-Marlin wins overall. Its optimized kernel moves far less data through memory and performs a mixed-precision matrix multiply in a single fused step, giving it the best overall throughput (1170 tok/s) and the lowest TPOT (24.9 ms).

- Kernel quality beats bit‑width.

  - The default AWQ path in vLLM dequantises 4‑bit weights to FP16 every layer and every token, and promotes BF16 activations to FP16 before the matmul. The extra memory traffic and frequent dtype conversions wipe out most of the INT4 advantage.

  - Marlin avoids this cost by expanding each weight block once per micro‑batch and re‑using it, keeping activations in FP16.

  - INT4 alone does not guarantee speed — you need a kernel that is both compute‑ and memory‑efficient.

- **Take‑aways for practitioners**

  - Measure, don't assume. Low precision brings speed-ups only when the workload is memory-bandwidth-bound (i.e. the slow part is moving data to and from GPU memory, not doing the computations) and when the kernel avoids redundant conversions.

  - Use FP8 on Ada/Hopper whenever memory allows. In this benchmark it cut mean TTFT, TPOT, and end-to-end latency by roughly one-third relative to BF16, and it delivers near-BF16 quality.

  - Prefer AWQ‑Marlin when you need the smallest memory footprint and high throughput.
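The "memory-bandwidth-bound" argument can be turned into a rough back-of-the-envelope estimate: during decoding, every forward pass has to read all weights from GPU memory at least once, so weight size divided by memory bandwidth gives a lower bound on the time of one decode step. The sketch below is only that, a lower bound under stated assumptions. The weight sizes are the values reported by vLLM's "Model loading took ..." log lines in the quality-evaluation output of this notebook; the bandwidth figure of 864 GB/s is an assumption (a commonly quoted peak for the L40) and should be replaced with the number for your GPU.

```python
GIB = 1024 ** 3

# Weight sizes in GiB, taken from the "Model loading took ..." log lines
# in this notebook's quality-evaluation output.
weights_gib = {
    "Qwen-3 8B / BF16": 15.2683,
    "Qwen-3 8B / FP8": 8.8011,
    "Qwen-3 8B / AWQ INT4": 5.7073,
}

# Assumption: peak memory bandwidth in bytes per second. Replace with your GPU's value.
ASSUMED_BANDWIDTH_BYTES_PER_S = 864e9


def min_decode_step_ms(weight_gib: float, bandwidth: float = ASSUMED_BANDWIDTH_BYTES_PER_S) -> float:
    """Lower bound on one decode step: time to stream all weights from memory once."""
    return weight_gib * GIB / bandwidth * 1000.0


for name, size in weights_gib.items():
    print(f"{name:>22}: at least {min_decode_step_ms(size):.1f} ms per decode step")
```

The bound applies to one forward pass shared by all sequences in the batch, so it is not directly comparable to the TPOT measured under load. It also ignores KV-cache reads, activation traffic, and any dequantization work, which is exactly where an inefficient kernel can lose the advantage that smaller weights should bring.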

## Quality evaluation

In this part we will see how quantization affects quality.

We will load several models, including a **Llama-3.1-70B** variant compressed with AWQ 4-bit, and measure their quality with `lm-eval`. Thanks to 4-bit quantization, the entire 70B checkpoint fits on a single NVIDIA L40 (46 GB), whereas the 16-bit original needs ≈140 GB for its weights alone and an 8-bit version ≈70 GB.
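The memory figures above follow from simple arithmetic: number of parameters times bits per weight, divided by 8 to get bytes. A minimal sketch, assuming a round 70 billion parameters and counting weights only (`weight_memory_gb` is a helper name introduced here):

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring any overhead."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 70e9  # assumption: a round 70 billion parameters

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
```

Real quantized checkpoints are somewhat larger than this estimate because they also store scales and usually keep some layers in higher precision; in the logs further down, vLLM reports 37.09 GiB for the AWQ 4-bit Llama checkpoint. The KV cache and activations need additional memory on top of the weights.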

In addition to AWQ, we will also evaluate Llama-3.1-70B in [**GGUF**](https://huggingface.co/docs/hub/en/gguf) format. GGUF is a compact, portable format designed primarily for the [`llama.cpp`](https://github.com/ggml-org/llama.cpp) inference stack, which often targets CPUs and lightweight GPU setups. GGUF makes it easy to run models across a wide range of devices and supports compact quantized variants (e.g., Q4_K_M and the 2-bit IQ2_S), but it is not currently optimized for vLLM and may run much slower than AWQ models. Here we care about quality rather than speed: the goal is to show how quantized formats make it possible to serve larger models, which typically deliver better quality.

We will also benchmark **Qwen3-8B**, comparing its unquantized BF16 baseline with the more compact FP8 and AWQ-INT4 variants.

For Qwen-3 the authors already [report scores](https://huggingface.co/Qwen/Qwen3-8B-AWQ#performance) on tougher leaderboard-style benchmarks (LiveBench, GPQA, etc.), showing that AWQ-INT4 loses only slightly compared with BF16 while cutting memory use by ≳50 %.

Below, the `evaluate_model` helper runs evaluation on HellaSwag, a multiple-choice commonsense reasoning benchmark designed to test a model's ability to choose the most plausible continuation of a given context. This allows you to observe the quality-vs-size trade-offs in a reproducible setup focused on everyday reasoning performance.
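For a multiple-choice benchmark of this kind, scoring usually works by computing the log-likelihood the model assigns to each candidate continuation and picking the highest one. The sketch below illustrates that idea only; it is not code from `lm-eval` or `utils.py`, the function names are hypothetical, and the log-likelihood values are made up.

```python
def pick_ending(logliks: list[float]) -> int:
    """Index of the candidate ending with the highest log-likelihood."""
    return max(range(len(logliks)), key=lambda i: logliks[i])


def accuracy(examples: list[tuple[list[float], int]]) -> float:
    """Fraction of examples where the top-scoring ending is the gold ending."""
    correct = sum(pick_ending(logliks) == gold for logliks, gold in examples)
    return correct / len(examples)


# Each example: (log-likelihood of each of four endings, index of the correct ending).
# The numbers are invented for illustration.
examples = [
    ([-12.3, -9.8, -15.1, -11.0], 1),
    ([-20.5, -22.1, -19.9, -25.0], 2),
    ([-8.7, -8.9, -10.2, -9.5], 3),
    ([-14.0, -13.2, -13.9, -16.4], 1),
]

print(f"accuracy: {accuracy(examples):.2f}")
```

This is also why the evaluation logs show 1600 log-likelihood requests for 400 examples: each example contributes one request per candidate ending.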

In [None]:
%pip install "lm-eval[vllm]"

In [None]:
from huggingface_hub import snapshot_download

local_dir_4bit = snapshot_download(
    repo_id="bartowski/Meta-Llama-3.1-70B-Instruct-GGUF",
    allow_patterns=["Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf"],
)

local_dir_2bit = snapshot_download(
    repo_id="bartowski/Meta-Llama-3.1-70B-Instruct-GGUF",
    allow_patterns=["Meta-Llama-3.1-70B-Instruct-IQ2_S.gguf"],
)

In [None]:
from utils import evaluate_model

models = [
    {
        "name": "Llama 70B / GGUF 4bit",
        "hf_id": f"{local_dir_4bit}/Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf",
        "vllm_args": [
            "--max-model-len", "256",
            "--max-seq-len-to-capture", "256",
            "--max-num-seqs", "1",  # setting to 1 since evaluation requires logits that consume a lot of memory
            "--tokenizer", "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
            "--gpu-memory-utilization", "0.96",
        ],
    },
    {
        "name": "Llama 70B / GGUF 2bit",
        "hf_id": f"{local_dir_2bit}/Meta-Llama-3.1-70B-Instruct-IQ2_S.gguf",
        "vllm_args": [
            "--max-model-len", "256",
            "--max-seq-len-to-capture", "256",
            "--max-num-seqs", "1",  # setting to 1 since evaluation requires logits that consume a lot of memory
            "--tokenizer", "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
            "--gpu-memory-utilization", "0.96",
        ],
    },
    {
        "name": "Llama 70b / AWQ 4bit",
        "hf_id": "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
        "vllm_args": [
            "--max-model-len", "256",
            "--max-seq-len-to-capture", "256",
            "--max-num-seqs", "1",  # setting to 1 since evaluation requires logits that consume a lot of memory
            "--quantization", "awq_marlin",
        ],
    },
    {
        "name": "Qwen-3 8B / BF16",
        "hf_id": "Qwen/Qwen3-8B",
        "vllm_args": [
            "--max-model-len", "256",
            "--max-seq-len-to-capture", "256",
        ],
    },
    {
        "name": "Qwen-3 8B / FP8",
        "hf_id": "Qwen/Qwen3-8B-FP8",
        "vllm_args": [
            "--max-model-len", "256",
            "--max-seq-len-to-capture", "256",
            "--quantization", "fp8",
        ],
    },
    {
        "name": "Qwen-3 8B / AWQ 4bit",
        "hf_id": "Qwen/Qwen3-8B-AWQ",
        "vllm_args": [
            "--max-model-len", "256",
            "--max-seq-len-to-capture", "256",
            "--quantization", "awq_marlin",
        ],
    },
]

INFO 07-06 14:02:13 [__init__.py:244] Automatically detected platform cuda.


In [None]:
%%time

results = []

for model in models:
    hellaswag_acc = evaluate_model(
        model_name=model["hf_id"],
        vllm_args=model["vllm_args"],
        limit=400,
    )
    results.append({
        "name": model["name"],
        "acc": hellaswag_acc
    })


INFO 07-06 14:02:46 [config.py:823] This model supports multiple tasks: {'embed', 'classify', 'generate', 'reward', 'score'}. Defaulting to 'generate'.
ERROR 07-06 14:02:46 [config.py:114] Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home/lex/.cache/huggingface/hub/models--bartowski--Meta-Llama-3.1-70B-Instruct-GGUF/snapshots/83fb6e83d0a8aada42d499259bc929d922e9a558/Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf'. Use `repo_type` argument if needed., retrying 1 of 2
ERROR 07-06 14:02:48 [config.py:112] Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home/lex/.cache/huggingface/hub/models--bartowski--Meta-Llama-3.1-70B-Instruct-GGUF/snapshots/83fb6e83d0a8aada42d499259bc929d922e9a558/Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf'. Use `repo_type` argument if needed.
INFO 07-06 14:02:48 [config.py:3268] Downcasting torch.float32 to torch.bfloat16.
INFO 07-06 14:02:48 [config.py:2195] Chunked prefill 

Overwriting default num_fewshot of hellaswag from None to 0
100%|██████████| 400/400 [00:00<00:00, 3553.46it/s]
Running loglikelihood requests: 100%|██████████| 1600/1600 [18:53<00:00,  1.41it/s]


INFO 07-06 14:33:34 [config.py:823] This model supports multiple tasks: {'embed', 'classify', 'generate', 'reward', 'score'}. Defaulting to 'generate'.
ERROR 07-06 14:33:34 [config.py:114] Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home/lex/.cache/huggingface/hub/models--bartowski--Meta-Llama-3.1-70B-Instruct-GGUF/snapshots/83fb6e83d0a8aada42d499259bc929d922e9a558/Meta-Llama-3.1-70B-Instruct-IQ2_S.gguf'. Use `repo_type` argument if needed., retrying 1 of 2
ERROR 07-06 14:33:36 [config.py:112] Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home/lex/.cache/huggingface/hub/models--bartowski--Meta-Llama-3.1-70B-Instruct-GGUF/snapshots/83fb6e83d0a8aada42d499259bc929d922e9a558/Meta-Llama-3.1-70B-Instruct-IQ2_S.gguf'. Use `repo_type` argument if needed.
INFO 07-06 14:33:36 [config.py:3268] Downcasting torch.float32 to torch.bfloat16.
INFO 07-06 14:33:36 [config.py:2195] Chunked prefill is

Overwriting default num_fewshot of hellaswag from None to 0
100%|██████████| 400/400 [00:00<00:00, 3508.51it/s]
Running loglikelihood requests: 100%|██████████| 1600/1600 [12:48<00:00,  2.08it/s]


INFO 07-06 14:53:48 [config.py:823] This model supports multiple tasks: {'embed', 'classify', 'generate', 'reward', 'score'}. Defaulting to 'generate'.
INFO 07-06 14:53:48 [awq_marlin.py:116] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 07-06 14:53:48 [config.py:2195] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 07-06 14:53:56 [__init__.py:244] Automatically detected platform cuda.
INFO 07-06 14:53:59 [core.py:455] Waiting for init message from front-end.
INFO 07-06 14:53:59 [core.py:70] Initializing a V1 LLM engine (v0.9.1) with config: model='hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4', speculative_config=None, tokenizer='hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=256, download_dir=None, load_format=auto, tensor_parallel_size=1, pipel

Loading safetensors checkpoint shards:   0% Completed | 0/9 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  11% Completed | 1/9 [00:19<02:34, 19.29s/it]
Loading safetensors checkpoint shards:  22% Completed | 2/9 [00:33<01:52, 16.13s/it]
Loading safetensors checkpoint shards:  33% Completed | 3/9 [00:53<01:48, 18.02s/it]
Loading safetensors checkpoint shards:  44% Completed | 4/9 [01:13<01:34, 18.91s/it]
Loading safetensors checkpoint shards:  56% Completed | 5/9 [01:34<01:17, 19.40s/it]
Loading safetensors checkpoint shards:  67% Completed | 6/9 [01:54<00:59, 19.69s/it]
Loading safetensors checkpoint shards:  78% Completed | 7/9 [02:14<00:39, 19.99s/it]
Loading safetensors checkpoint shards:  89% Completed | 8/9 [02:23<00:16, 16.40s/it]
Loading safetensors checkpoint shards: 100% Completed | 9/9 [02:43<00:00, 17.61s/it]
Loading safetensors checkpoint shards: 100% Completed | 9/9 [02:43<00:00, 18.20s/it]



INFO 07-06 14:56:45 [default_loader.py:272] Loading weights took 163.93 seconds
INFO 07-06 14:56:50 [gpu_model_runner.py:1624] Model loading took 37.0909 GiB and 169.567562 seconds
INFO 07-06 14:57:11 [backends.py:462] Using cache directory: /home/lex/.cache/vllm/torch_compile_cache/0a7455f96e/rank_0_0 for vLLM's torch.compile
INFO 07-06 14:57:11 [backends.py:472] Dynamo bytecode transform time: 21.10 s
INFO 07-06 14:57:15 [backends.py:161] Cache the graph of shape None for later use
INFO 07-06 14:58:20 [backends.py:173] Compiling a graph for general shape takes 66.48 s
INFO 07-06 15:00:30 [monitor.py:34] torch.compile takes 87.58 s in total
INFO 07-06 15:00:32 [gpu_worker.py:227] Available KV cache memory: 0.99 GiB
INFO 07-06 15:00:32 [kv_cache_utils.py:715] GPU KV cache size: 3,216 tokens
INFO 07-06 15:00:32 [kv_cache_utils.py:719] Maximum concurrency for 256 tokens per request: 12.56x
INFO 07-06 15:01:24 [gpu_model_runner.py:2048] Graph capturing finished in 52 secs, took 1.25 GiB
I

Overwriting default num_fewshot of hellaswag from None to 0
100%|██████████| 400/400 [00:00<00:00, 3580.91it/s]
Running loglikelihood requests: 100%|██████████| 1600/1600 [01:52<00:00, 14.24it/s]


INFO 07-06 15:03:56 [config.py:823] This model supports multiple tasks: {'embed', 'classify', 'generate', 'reward', 'score'}. Defaulting to 'generate'.
INFO 07-06 15:03:56 [config.py:2195] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 07-06 15:04:00 [__init__.py:244] Automatically detected platform cuda.
INFO 07-06 15:04:03 [core.py:455] Waiting for init message from front-end.
INFO 07-06 15:04:03 [core.py:70] Initializing a V1 LLM engine (v0.9.1) with config: model='Qwen/Qwen3-8B', speculative_config=None, tokenizer='Qwen/Qwen3-8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=256, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fall

Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:15<01:02, 15.57s/it]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:20<00:28,  9.45s/it]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:37<00:25, 12.69s/it]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:53<00:14, 14.16s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [01:06<00:00, 13.82s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [01:06<00:00, 13.38s/it]



INFO 07-06 15:05:12 [default_loader.py:272] Loading weights took 67.00 seconds
INFO 07-06 15:05:13 [gpu_model_runner.py:1624] Model loading took 15.2683 GiB and 67.840160 seconds
INFO 07-06 15:05:22 [backends.py:462] Using cache directory: /home/lex/.cache/vllm/torch_compile_cache/8e186832ae/rank_0_0 for vLLM's torch.compile
INFO 07-06 15:05:22 [backends.py:472] Dynamo bytecode transform time: 9.22 s
INFO 07-06 15:05:30 [backends.py:135] Directly load the compiled graph(s) for shape None from the cache, took 7.903 s
INFO 07-06 15:05:32 [monitor.py:34] torch.compile takes 9.22 s in total
INFO 07-06 15:05:33 [gpu_worker.py:227] Available KV cache memory: 23.33 GiB
INFO 07-06 15:05:33 [kv_cache_utils.py:715] GPU KV cache size: 169,856 tokens
INFO 07-06 15:05:33 [kv_cache_utils.py:719] Maximum concurrency for 256 tokens per request: 663.50x
INFO 07-06 15:05:58 [gpu_model_runner.py:2048] Graph capturing finished in 24 secs, took 0.60 GiB
INFO 07-06 15:05:58 [core.py:171] init engine (profil

Overwriting default num_fewshot of hellaswag from None to 0
100%|██████████| 400/400 [00:00<00:00, 3590.18it/s]
Running loglikelihood requests: 100%|██████████| 1600/1600 [00:16<00:00, 94.59it/s] 


INFO 07-06 15:06:38 [config.py:823] This model supports multiple tasks: {'embed', 'classify', 'generate', 'reward', 'score'}. Defaulting to 'generate'.
INFO 07-06 15:06:38 [config.py:2195] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 07-06 15:06:43 [__init__.py:244] Automatically detected platform cuda.
INFO 07-06 15:06:46 [core.py:455] Waiting for init message from front-end.
INFO 07-06 15:06:46 [core.py:70] Initializing a V1 LLM engine (v0.9.1) with config: model='Qwen/Qwen3-8B-FP8', speculative_config=None, tokenizer='Qwen/Qwen3-8B-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=256, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disab

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:17<00:17, 17.59s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:38<00:00, 19.32s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:38<00:00, 19.06s/it]



INFO 07-06 15:07:26 [default_loader.py:272] Loading weights took 38.22 seconds
INFO 07-06 15:07:26 [gpu_model_runner.py:1624] Model loading took 8.8011 GiB and 39.210030 seconds
INFO 07-06 15:07:36 [backends.py:462] Using cache directory: /home/lex/.cache/vllm/torch_compile_cache/54c66b6b61/rank_0_0 for vLLM's torch.compile
INFO 07-06 15:07:36 [backends.py:472] Dynamo bytecode transform time: 9.26 s
INFO 07-06 15:07:44 [backends.py:135] Directly load the compiled graph(s) for shape None from the cache, took 7.999 s
INFO 07-06 15:07:46 [monitor.py:34] torch.compile takes 9.26 s in total
INFO 07-06 15:07:47 [gpu_worker.py:227] Available KV cache memory: 29.79 GiB
INFO 07-06 15:07:47 [kv_cache_utils.py:715] GPU KV cache size: 216,944 tokens
INFO 07-06 15:07:47 [kv_cache_utils.py:719] Maximum concurrency for 256 tokens per request: 847.44x
INFO 07-06 15:08:17 [gpu_model_runner.py:2048] Graph capturing finished in 30 secs, took 0.65 GiB
INFO 07-06 15:08:17 [core.py:171] init engine (profile

Overwriting default num_fewshot of hellaswag from None to 0
100%|██████████| 400/400 [00:00<00:00, 3536.33it/s]
Running loglikelihood requests: 100%|██████████| 1600/1600 [00:13<00:00, 119.53it/s]


INFO 07-06 15:08:54 [config.py:823] This model supports multiple tasks: {'embed', 'classify', 'generate', 'reward', 'score'}. Defaulting to 'generate'.
INFO 07-06 15:08:54 [awq_marlin.py:116] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 07-06 15:08:54 [config.py:2195] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 07-06 15:08:59 [__init__.py:244] Automatically detected platform cuda.
INFO 07-06 15:09:02 [core.py:455] Waiting for init message from front-end.
INFO 07-06 15:09:02 [core.py:70] Initializing a V1 LLM engine (v0.9.1) with config: model='Qwen/Qwen3-8B-AWQ', speculative_config=None, tokenizer='Qwen/Qwen3-8B-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=256, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=a

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:04<00:04,  4.33s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:24<00:00, 13.53s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:24<00:00, 12.15s/it]



INFO 07-06 15:09:29 [default_loader.py:272] Loading weights took 24.40 seconds
INFO 07-06 15:09:30 [gpu_model_runner.py:1624] Model loading took 5.7073 GiB and 26.045704 seconds
INFO 07-06 15:09:41 [backends.py:462] Using cache directory: /home/lex/.cache/vllm/torch_compile_cache/61e64b146d/rank_0_0 for vLLM's torch.compile
INFO 07-06 15:09:41 [backends.py:472] Dynamo bytecode transform time: 11.21 s
INFO 07-06 15:09:46 [backends.py:161] Cache the graph of shape None for later use
INFO 07-06 15:10:22 [backends.py:173] Compiling a graph for general shape takes 40.25 s
INFO 07-06 15:11:16 [monitor.py:34] torch.compile takes 51.46 s in total
INFO 07-06 15:11:17 [gpu_worker.py:227] Available KV cache memory: 32.89 GiB
INFO 07-06 15:11:17 [kv_cache_utils.py:715] GPU KV cache size: 239,456 tokens
INFO 07-06 15:11:17 [kv_cache_utils.py:719] Maximum concurrency for 256 tokens per request: 935.38x
INFO 07-06 15:11:48 [gpu_model_runner.py:2048] Graph capturing finished in 30 secs, took 0.62 GiB


Overwriting default num_fewshot of hellaswag from None to 0
100%|██████████| 400/400 [00:00<00:00, 3612.30it/s]
Running loglikelihood requests: 100%|██████████| 1600/1600 [00:12<00:00, 130.23it/s]


CPU times: user 3min 27s, sys: 0 ns, total: 3min 27s
Wall time: 1h 10min 7s


In [None]:
import pandas as pd

pd.DataFrame(results)

| | name | acc |
|---|---|---|
| 0 | Llama 70B / GGUF 4bit | 0.7075 |
| 1 | Llama 70B / GGUF 2bit | 0.2675 |
| 2 | Llama 70b / AWQ 4bit | 0.715 |
| 3 | Qwen-3 8B / BF16 | 0.6375 |
| 4 | Qwen-3 8B / FP8 | 0.6275 |
| 5 | Qwen-3 8B / AWQ 4bit | 0.625 |


The results show that 4-bit and FP8 quantization introduce only a modest drop in accuracy while offering substantial memory savings.

Both the Llama 3.1 70B GGUF 4-bit (70.8%) and AWQ 4-bit (71.5%) variants achieve the highest accuracy on HellaSwag, clearly outperforming all Qwen-3 8B variants (62.5% to 63.8%). This gap is primarily due to the much larger size of Llama 70B, and it shows how quantization makes it feasible to serve high-capacity models on limited hardware while retaining their quality advantage.

For Qwen-3 8B, accuracy declines slightly from BF16 (63.8%) to FP8 (62.8%) and AWQ 4-bit (62.5%). The direction matches the expected trade-off between efficiency and quality, but with only 400 examples per model, differences of about one percentage point are within sampling noise. Extreme quantization is a different story: GGUF 2-bit on Llama 70B falls to 26.8%, close to the 25% expected from random guessing among four endings, so aggressive compression can destroy model quality.
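Because the evaluation uses `limit=400`, each accuracy is an estimate from 400 examples and carries sampling noise. A minimal sketch of how large that noise is, using the normal approximation to the binomial distribution (an assumption that is reasonable at this sample size; the helper names are introduced here):

```python
import math


def accuracy_stderr(acc: float, n: int) -> float:
    """Standard error of an accuracy estimated from n independent examples."""
    return math.sqrt(acc * (1.0 - acc) / n)


def confidence_interval_95(acc: float, n: int) -> tuple[float, float]:
    """Approximate 95% confidence interval (normal approximation)."""
    half_width = 1.96 * accuracy_stderr(acc, n)
    return acc - half_width, acc + half_width


N_EXAMPLES = 400  # matches limit=400 in the evaluation loop

# Accuracies copied from the results table.
accuracies = {
    "Llama 70B / GGUF 4bit": 0.7075,
    "Llama 70B / GGUF 2bit": 0.2675,
    "Qwen-3 8B / BF16": 0.6375,
    "Qwen-3 8B / AWQ 4bit": 0.625,
}

for name, acc in accuracies.items():
    low, high = confidence_interval_95(acc, N_EXAMPLES)
    print(f"{name:>22}: {acc:.4f}  (95% CI roughly {low:.3f} to {high:.3f})")
```

With a standard error of a little over two percentage points, the gap between the 70B and 8B models is well outside the noise, while the differences among the three Qwen-3 8B variants are not. Raising `limit` would tighten these intervals at the cost of a longer run.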

**Overall, these findings confirm that low-bit quantization, particularly 4-bit and FP8 formats, is an effective strategy for deploying language models in resource-constrained environments. Depending on the deployment goal, quantization can be used either to speed up inference and improve throughput for a given model, or to enable serving much larger models (like Llama 70B) that deliver higher quality outputs, giving the flexibility to balance efficiency and performance.**