# autotune target_bits example for llama recipe #2344
**Open** · xin3he wants to merge 10 commits into `master` from `xinhe/vllm`
Diff: +775 −8,430

**Commits (10):**
- `2e50295` autotune target_bits example for llama recipe
- `709cc71` update requirement
- `cc25af5` add run_quant run_benchmark
- `dcd69a2` update readme
- `f07ca2d` Update neural_compressor/torch/algorithms/weight_only/autoround.py
- `bca2063` Update neural_compressor/common/base_config.py
- `1d812a0` Update neural_compressor/torch/algorithms/weight_only/autoround.py
- `99b8fff` fix bug
- `3ffb650` update readme and fix CI
- `54f87bb` fix CI
**Changed file:** `...p/huggingface_models/language-modeling/quantization/auto_round/llama3/README.md` (197 additions, 0 deletions)

---

# Step-by-step

In this example, you can verify the accuracy of MXFP4, MXFP8, NVFP4, and uNVFP4 emulation on HPU/CUDA devices.

## Requirement

```bash
# neural-compressor-pt
pip install neural-compressor-pt==3.7
# auto-round
pip install auto-round==0.9.2
# other requirements
pip install -r requirements.txt
```
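If you want to confirm that the pinned versions were picked up, the installed distributions can be checked from Python. This is an illustrative helper, not part of the example; the package names are simply those used in the `pip install` commands above:

```python
from importlib.metadata import version, PackageNotFoundError

def check_versions(packages):
    """Return the installed version for each package, or None if missing."""
    report = {}
    for pkg in packages:
        try:
            report[pkg] = version(pkg)
        except PackageNotFoundError:
            report[pkg] = None
    return report

# Distribution names match the pip installs above.
print(check_versions(["neural-compressor-pt", "auto-round"]))
```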

**Before the neural-compressor v3.7 and auto-round v0.9.2 releases, please install from source to get the latest updates:**

```bash
# neural-compressor-pt
INC_PT_ONLY=1 pip install git+https://github.com/intel/neural-compressor.git@master
# auto-round
pip install git+https://github.com/intel/auto-round.git@more-ar-ext
# other requirements
pip install -r requirements.txt
```

## Quantization

### Demo (`MXFP4`, `MXFP8`, `NVFP4`, `uNVFP4`)

```bash
CUDA_VISIBLE_DEVICES=0 python quantize.py \
    --model_name_or_path facebook/opt-125m \
    --quantize \
    --dtype MXFP8 \
    --enable_torch_compile \
    --low_gpu_mem_usage \
    --export_format auto_round \
    --export_path OPT-125M-MXFP8 \
    --accuracy \
    --tasks lambada_openai \
    --eval_batch_size 8
```

Notes:
- Use `--export_format auto_round` for the `MXFP4` and `MXFP8` data types, and run inference as described below.
- Use `--export_format llm_compressor` for the `NVFP4` data type, since public vLLM supports it.
- Use `--export_format fake` for the `uNVFP4` data type, since it is not yet fully supported.
- Setting `--quant_lm_head` applies `--dtype` to the lm_head layer.
- Setting `--iters 0` skips AutoRound tuning and uses the RTN method instead.

#### Target_bits

To achieve optimal compression ratios with mixed-precision quantization, the `--target_bits` argument enables automated precision configuration.

- If you pass a single float, it automatically generates an optimal quantization recipe that achieves that target average bit-width.
- If you pass multiple floats, it generates one recipe per target bit-width, letting you compare the trade-offs between model size and accuracy.

Example usage:

```bash
CUDA_VISIBLE_DEVICES=0 python quantize.py \
    --model_name_or_path facebook/opt-125m \
    --quantize \
    --dtype MXFP4 \
    --target_bits 6.5 7 7.3 \
    --tune_limit 100 \
    --enable_torch_compile \
    --low_gpu_mem_usage \
    --export_format auto_round \
    --export_path OPT-125m-MXFP4-MXFP8 \
    --accuracy \
    --tasks lambada_openai \
    --eval_batch_size 8
```
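To build intuition for what a target average bit-width means, here is a back-of-the-envelope sketch. The layer names and element counts are made up, and the effective per-element bits assume the OCP MX layout of one shared 8-bit scale per 32-element block (roughly 4.25 bits for MXFP4 and 8.25 bits for MXFP8); the actual recipe search is done by the tool itself:

```python
def average_bits(layers):
    """Element-weighted average bits over all quantized layers.

    layers maps layer name -> (num_elements, effective_bits_per_element).
    """
    total_bits = sum(n * b for n, b in layers.values())
    total_elems = sum(n for n, _ in layers.values())
    return total_bits / total_elems

# Hypothetical 50/50 split between MXFP4 and MXFP8 layers.
layers = {
    "model.layers.0.mlp": (4_000_000, 4.25),  # quantized to MXFP4
    "model.layers.1.mlp": (4_000_000, 8.25),  # kept at MXFP8
}
print(average_bits(layers))  # 6.25
```

A recipe hitting a target such as `7.3` simply shifts more layers toward MXFP8 until the weighted average lands near the requested value.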

### Llama3 Quantization Recipes

#### Llama 3.1 8B MXFP8

AutoRound helps improve accuracy; `iters` and `nsamples` are set higher than their defaults.

```bash
# Quantize and export in AutoRound format
CUDA_VISIBLE_DEVICES=0 bash run_quant.sh --topology=Llama-3.1-8B --dtype=mxfp8 --input_model=/models/Meta-Llama-3.1-8B-Instruct --output_model=Llama-3.1-8B-MXFP8
```

#### Llama 3.1 8B MXFP4 (Mixed with MXFP8, Target_bits=7.8)

```bash
CUDA_VISIBLE_DEVICES=0 bash run_quant.sh --topology=Llama-3.1-8B --dtype=mxfp4_mixed --input_model=/models/Meta-Llama-3.1-8B-Instruct --output_model=Llama-3.1-8B-MXFP4-MXFP8
```

#### Llama 3.3 70B MXFP8

```bash
CUDA_VISIBLE_DEVICES=0 bash run_quant.sh --topology=Llama-3.3-70B --dtype=mxfp8 --input_model=/models/Llama-3.3-70B-Instruct/ --output_model=Llama-3.3-70B-MXFP8
```

#### Llama 3.3 70B MXFP4 (Mixed with MXFP8, Target_bits=5.8)

```bash
CUDA_VISIBLE_DEVICES=0 bash run_quant.sh --topology=Llama-3.3-70B --dtype=mxfp4_mixed --input_model=/models/Llama-3.3-70B-Instruct/ --output_model=Llama-3.3-70B-MXFP4-MXFP8
```

#### Llama 3.1 70B MXFP8

```bash
CUDA_VISIBLE_DEVICES=0 bash run_quant.sh --topology=Llama-3.1-70B --dtype=mxfp8 --input_model=/models/Llama-3.1-70B-Instruct/ --output_model=Llama-3.1-70B-MXFP8
```

#### Llama 3.1 70B uNVFP4

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 bash run_quant.sh --topology=Llama-3.1-70B --dtype=unvfp4 --input_model=/models/Llama-3.1-70B-Instruct/ --output_model=Llama-3.1-70B-uNVFP4
```

Note: If you hit an OOM issue, either add more GPUs to `CUDA_VISIBLE_DEVICES` or reduce `eval_batch_size`.

## Inference

### MXFP4 & MXFP8

- Both pure MXFP4/MXFP8 models and mixed-precision models generated via target bits are supported.

#### Prerequisite

```bash
# Install the forked vLLM
git clone -b fused-moe-ar --single-branch --quiet https://github.com/yiliu30/vllm-fork.git && cd vllm-fork
VLLM_USE_PRECOMPILED=1 pip install -e .
```

#### MXFP Benchmark Script

For convenience, a benchmark script is provided that automatically handles GPU detection and tensor-parallelism configuration.

**All 5 MXFP benchmark cases:**

1. **Llama 3.1 8B MXFP8** (1 GPU):
```bash
CUDA_VISIBLE_DEVICES=0 bash run_benchmark.sh --model_path=Llama-3.1-8B-MXFP8
```

2. **Llama 3.1 8B MXFP4 Mixed** (1 GPU):
```bash
CUDA_VISIBLE_DEVICES=0 bash run_benchmark.sh --model_path=Llama-3.1-8B-MXFP4-MXFP8
```

3. **Llama 3.3 70B MXFP8** (4 GPUs):
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 bash run_benchmark.sh --model_path=Llama-3.3-70B-MXFP8
```

4. **Llama 3.3 70B MXFP4 Mixed** (4 GPUs):
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 bash run_benchmark.sh --model_path=Llama-3.3-70B-MXFP4-MXFP8
```

5. **Llama 3.1 70B MXFP8** (4 GPUs):
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 bash run_benchmark.sh --model_path=Llama-3.1-70B-MXFP8
```

The script automatically:
- Detects available GPUs from `CUDA_VISIBLE_DEVICES` and sets `tensor_parallel_size` accordingly
- Handles different `add_bos_token` settings for different tasks (GSM8K requires `False`; the others use `True`)
- Runs the default tasks `piqa,hellaswag,mmlu,gsm8k` with batch size 8
- Supports custom task selection and batch-size adjustment
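The GPU-detection step amounts to counting the entries in `CUDA_VISIBLE_DEVICES`. A minimal Python sketch of that idea (the real logic lives in `run_benchmark.sh`, so the function name here is illustrative):

```python
import os

def detect_tensor_parallel_size(env=None):
    """Count devices listed in CUDA_VISIBLE_DEVICES, defaulting to 1 when unset."""
    env = os.environ if env is None else env
    devices = env.get("CUDA_VISIBLE_DEVICES", "")
    count = len([d for d in devices.split(",") if d.strip()])
    return count or 1

print(detect_tensor_parallel_size({"CUDA_VISIBLE_DEVICES": "0,1,2,3"}))  # 4
```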

### NVFP4

NVFP4 is already supported by public vLLM; set the `llm_compressor` export format during quantization.

```bash
CUDA_VISIBLE_DEVICES=0 lm_eval --model vllm \
    --model_args pretrained={nvfp4_model_path},tensor_parallel_size=1,data_parallel_size=1 \
    --tasks lambada_openai \
    --batch_size 4
```

### uNVFP4

uNVFP4 is saved in fake format, and reloading is not currently supported. To verify accuracy right after quantization, set `--accuracy --tasks lambada_openai` in the command.

```bash
CUDA_VISIBLE_DEVICES=0 python quantize.py \
    --model_name_or_path facebook/opt-125m \
    --quantize \
    --dtype uNVFP4 \
    --enable_torch_compile \
    --low_gpu_mem_usage \
    --export_format fake \
    --export_path OPT-125M-uNVFP4 \
    --accuracy \
    --tasks lambada_openai \
    --eval_batch_size 8 \
    --device_map 0
```