# Step-by-step

In this example, you can verify the accuracy of MXFP4, MXFP8, NVFP4, and uNVFP4 quantization on HPU/CUDA devices via emulation.

## Requirements

```bash
# neural-compressor-pt
pip install neural-compressor-pt==3.7
# auto-round
pip install auto-round==0.9.2
# other requirements
pip install -r requirements.txt
```

**Before the neural-compressor v3.7 and auto-round v0.9.2 releases are available, please install from source to get the latest updates:**

```bash
# neural-compressor-pt
INC_PT_ONLY=1 pip install git+https://github.com/intel/neural-compressor.git@master
# auto-round
pip install git+https://github.com/intel/auto-round.git@more-ar-ext
# other requirements
pip install -r requirements.txt
```
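To confirm which versions are actually installed, you can run a quick check. This is a generic sketch using `importlib.metadata`, not a script shipped with this example:

```python
from importlib import metadata


def installed_version(pkg: str):
    """Return the installed version string of `pkg`, or None if it is absent."""
    try:
        return metadata.version(pkg)
    except metadata.PackageNotFoundError:
        return None


for pkg in ("neural-compressor-pt", "auto-round"):
    print(pkg, installed_version(pkg) or "not installed")
```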


## Quantization

### Demo (`MXFP4`, `MXFP8`, `NVFP4`, `uNVFP4`)

```bash
CUDA_VISIBLE_DEVICES=0 python quantize.py \
--model_name_or_path facebook/opt-125m \
--quantize \
--dtype MXFP8 \
--enable_torch_compile \
--low_gpu_mem_usage \
--export_format auto_round \
--export_path OPT-125M-MXFP8 \
--accuracy \
--tasks lambada_openai \
--eval_batch_size 8
```

Notes:
- Use `--export_format auto_round` for the `MXFP4` and `MXFP8` data types, then run inference as described below.
- Use `--export_format llm_compressor` for the `NVFP4` data type, since public vLLM supports it.
- Use `--export_format fake` for the `uNVFP4` data type, since it is not fully supported yet.
- Setting `--quant_lm_head` applies `--dtype` to the lm_head layer as well.
- Setting `--iters 0` skips AutoRound tuning and uses the RTN (round-to-nearest) method instead.
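For intuition, RTN simply rounds each value to its nearest representable level without any tuning. Below is a minimal NumPy sketch of symmetric per-group RTN fake quantization; it is illustrative only, since the real MX formats use shared power-of-two block scales rather than this integer scheme:

```python
import numpy as np


def rtn_fake_quant(w, bits=4, group_size=32):
    """Round-to-nearest fake quantization: quantize then dequantize,
    so the tensor stays float but only holds representable values.
    Illustrative integer RTN, not the actual MXFP encoding."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit signed
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                         # avoid division by zero
    q = np.clip(np.round(groups / scale), -qmax - 1, qmax)
    return (q * scale).reshape(w.shape)


x = np.linspace(-1, 1, 64).astype(np.float32)
y = rtn_fake_quant(x, bits=4)                       # same shape, coarser values
```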


#### Target_bits

To achieve optimal compression ratios in mixed-precision quantization, we provide the `--target_bits` argument for automated precision configuration.

- If you pass a single float number, it will automatically generate an optimal quantization recipe to achieve that target average bit-width.
- If you pass multiple float numbers, it will generate multiple recipes for different target bit-widths, allowing you to compare trade-offs between model size and accuracy.
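To see what a target average bit-width means, the sketch below computes the parameter-weighted average bits of a hypothetical MXFP4/MXFP8 layer assignment. The layer names and sizes are made up for illustration and are not produced by the tool:

```python
# layer -> (parameter count, assigned weight bits)
layers = {
    "q_proj": (1_000_000, 4),   # MXFP4
    "k_proj": (1_000_000, 4),   # MXFP4
    "v_proj": (1_000_000, 8),   # MXFP8
    "o_proj": (1_000_000, 8),   # MXFP8
}

total_params = sum(n for n, _ in layers.values())
# Parameter-weighted average bit-width of the whole recipe
avg_bits = sum(n * b for n, b in layers.values()) / total_params
print(avg_bits)  # 6.0
```

A recipe search for `--target_bits 6.5` would pick a per-layer mix whose weighted average lands near 6.5 bits.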

Example usage:

```bash
CUDA_VISIBLE_DEVICES=0 python quantize.py \
--model_name_or_path facebook/opt-125m \
--quantize \
--dtype MXFP4 \
--target_bits 6.5 7 7.3 \
--tune_limit 100 \
--enable_torch_compile \
--low_gpu_mem_usage \
--export_format auto_round \
--export_path OPT-125m-MXFP4-MXFP8 \
--accuracy \
--tasks lambada_openai \
--eval_batch_size 8
```


### Llama3 Quantization Recipes

#### Llama 3.1 8B MXFP8

AutoRound helps improve accuracy; `iters` and `nsamples` are set higher than their defaults.
```bash
# Quantize and export AutoRound format
CUDA_VISIBLE_DEVICES=0 bash run_quant.sh --topology=Llama-3.1-8B --dtype=mxfp8 --input_model=/models/Meta-Llama-3.1-8B-Instruct --output_model=Llama-3.1-8B-MXFP8
```

#### Llama 3.1 8B MXFP4 (Mixed with MXFP8, Target_bits=7.8)

```bash
CUDA_VISIBLE_DEVICES=0 bash run_quant.sh --topology=Llama-3.1-8B --dtype=mxfp4_mixed --input_model=/models/Meta-Llama-3.1-8B-Instruct --output_model=Llama-3.1-8B-MXFP4-MXFP8
```

#### Llama 3.3 70B MXFP8

```bash
CUDA_VISIBLE_DEVICES=0 bash run_quant.sh --topology=Llama-3.3-70B --dtype=mxfp8 --input_model=/models/Llama-3.3-70B-Instruct/ --output_model=Llama-3.3-70B-MXFP8
```

#### Llama 3.3 70B MXFP4 (Mixed with MXFP8, Target_bits=5.8)
```bash
CUDA_VISIBLE_DEVICES=0 bash run_quant.sh --topology=Llama-3.3-70B --dtype=mxfp4_mixed --input_model=/models/Llama-3.3-70B-Instruct/ --output_model=Llama-3.3-70B-MXFP4-MXFP8
```

#### Llama 3.1 70B MXFP8

```bash
CUDA_VISIBLE_DEVICES=0 bash run_quant.sh --topology=Llama-3.1-70B --dtype=mxfp8 --input_model=/models/Llama-3.1-70B-Instruct/ --output_model=Llama-3.1-70B-MXFP8
```
#### Llama 3.1 70B uNVFP4

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 bash run_quant.sh --topology=Llama-3.1-70B --dtype=unvfp4 --input_model=/models/Llama-3.1-70B-Instruct/ --output_model=Llama-3.1-70B-uNVFP4
```
Note: If you hit an OOM issue, add more GPUs to `CUDA_VISIBLE_DEVICES` or reduce `eval_batch_size`.

## Inference

### MXFP4 & MXFP8

Both pure MXFP4/MXFP8 models and mixed-precision models generated via `--target_bits` are supported.

#### Prerequisite

```bash
# Install the forked vLLM
git clone -b fused-moe-ar --single-branch --quiet https://github.com/yiliu30/vllm-fork.git && cd vllm-fork
VLLM_USE_PRECOMPILED=1 pip install -e .
```

#### MXFP Benchmark Script

For convenience, we provide a benchmark script that automatically handles GPU detection and tensor parallelism configuration:

**All 5 MXFP benchmark cases:**

1. **Llama 3.1 8B MXFP8** (1 GPU):
```bash
CUDA_VISIBLE_DEVICES=0 bash run_benchmark.sh --model_path=Llama-3.1-8B-MXFP8
```

2. **Llama 3.1 8B MXFP4 Mixed** (1 GPU):
```bash
CUDA_VISIBLE_DEVICES=0 bash run_benchmark.sh --model_path=Llama-3.1-8B-MXFP4-MXFP8
```

3. **Llama 3.3 70B MXFP8** (4 GPUs):
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 bash run_benchmark.sh --model_path=Llama-3.3-70B-MXFP8
```

4. **Llama 3.3 70B MXFP4 Mixed** (4 GPUs):
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 bash run_benchmark.sh --model_path=Llama-3.3-70B-MXFP4-MXFP8
```

5. **Llama 3.1 70B MXFP8** (4 GPUs):
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 bash run_benchmark.sh --model_path=Llama-3.1-70B-MXFP8
```

The script automatically:
- Detects available GPUs from `CUDA_VISIBLE_DEVICES` and sets `tensor_parallel_size` accordingly
- Handles different `add_bos_token` settings for different tasks (GSM8K requires `False`, others use `True`)
- Runs default tasks: `piqa,hellaswag,mmlu,gsm8k` with batch size 8
- Supports custom task selection and batch size adjustment
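The GPU-detection step can be illustrated with a short sketch (the function name is hypothetical and not part of `run_benchmark.sh`, which implements the same idea in bash):

```python
import os


def tp_size_from_env() -> int:
    """Infer tensor_parallel_size from CUDA_VISIBLE_DEVICES by counting
    the listed device IDs, defaulting to 1 when the variable is empty."""
    devs = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return len([d for d in devs.split(",") if d.strip()]) or 1


os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
print(tp_size_from_env())  # 4
```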


### NVFP4
NVFP4 is already supported by public vLLM, so set the `llm_compressor` export format (`--export_format llm_compressor`) during quantization.

```bash
CUDA_VISIBLE_DEVICES=0 lm_eval --model vllm \
--model_args pretrained={nvfp4_model_path},tensor_parallel_size=1,data_parallel_size=1 \
--tasks lambada_openai \
--batch_size 4
```

### uNVFP4
uNVFP4 models are saved in fake format, and reloading them is not currently supported. To verify accuracy right after quantization, pass `--accuracy --tasks lambada_openai` on the command line.

```bash
CUDA_VISIBLE_DEVICES=0 python quantize.py \
--model_name_or_path facebook/opt-125m \
--quantize \
--dtype uNVFP4 \
--enable_torch_compile \
--low_gpu_mem_usage \
--export_format fake \
--export_path OPT-125M-uNVFP4 \
--accuracy \
--tasks lambada_openai \
--eval_batch_size 8 \
--device_map 0
```