# autotune target_bits example for llama recipe #2344
**Open** · xin3he wants to merge 10 commits into `master` from `xinhe/vllm`
Diff: +775 −8,430

**Commits (10):**
- `2e50295` autotune target_bits example for llama recipe
- `709cc71` update requirement
- `cc25af5` add run_quant run_benchmark
- `dcd69a2` update readme
- `f07ca2d` Update neural_compressor/torch/algorithms/weight_only/autoround.py
- `bca2063` Update neural_compressor/common/base_config.py
- `1d812a0` Update neural_compressor/torch/algorithms/weight_only/autoround.py
- `99b8fff` fix bug
- `3ffb650` update readme and fix CI
- `54f87bb` fix CI
**Changed file:** `...p/huggingface_models/language-modeling/quantization/auto_round/llama3/README.md` (197 additions, 0 deletions)

---

# Step-by-step

In this example, you can verify the accuracy of MXFP4, MXFP8, NVFP4, and uNVFP4 emulation on HPU/CUDA devices.

## Requirement

```bash
# neural-compressor-pt
pip install neural-compressor-pt==3.7
# auto-round
pip install auto-round==0.9.2
# other requirements
pip install -r requirements.txt
```
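If you want to confirm that the pinned versions were picked up, the installed distributions can be checked from Python. This is an illustrative helper, not part of the example; the package names are simply those used in the `pip install` commands above:

```python
from importlib.metadata import version, PackageNotFoundError

def check_versions(packages):
    """Return the installed version for each package, or None if missing."""
    report = {}
    for pkg in packages:
        try:
            report[pkg] = version(pkg)
        except PackageNotFoundError:
            report[pkg] = None
    return report

# Distribution names match the pip installs above.
print(check_versions(["neural-compressor-pt", "auto-round"]))
```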

**Before the neural-compressor v3.7 and auto-round v0.9.2 releases, please install from source to get the latest updates:**

```bash
# neural-compressor-pt
INC_PT_ONLY=1 pip install git+https://github.com/intel/neural-compressor.git@master
# auto-round
pip install git+https://github.com/intel/auto-round.git@more-ar-ext
# other requirements
pip install -r requirements.txt
```

## Quantization

### Demo (`MXFP4`, `MXFP8`, `NVFP4`, `uNVFP4`)

```bash
CUDA_VISIBLE_DEVICES=0 python quantize.py \
    --model_name_or_path facebook/opt-125m \
    --quantize \
    --dtype MXFP8 \
    --enable_torch_compile \
    --low_gpu_mem_usage \
    --export_format auto_round \
    --export_path OPT-125M-MXFP8 \
    --accuracy \
    --tasks lambada_openai \
    --eval_batch_size 8
```

Notes:
- Use `--export_format auto_round` for the `MXFP4` and `MXFP8` data types, and run inference as described below.
- Use `--export_format llm_compressor` for the `NVFP4` data type, since public vLLM supports it.
- Use `--export_format fake` for the `uNVFP4` data type, since it is not yet fully supported.
- Setting `--quant_lm_head` applies `--dtype` to the lm_head layer.
- Setting `--iters 0` skips AutoRound tuning and uses the RTN method instead.

#### Target_bits

To achieve optimal compression ratios with mixed-precision quantization, the `--target_bits` argument enables automated precision configuration.

- If you pass a single float, it automatically generates an optimal quantization recipe that achieves that target average bit-width.
- If you pass multiple floats, it generates one recipe per target bit-width, letting you compare the trade-offs between model size and accuracy.

Example usage:

```bash
CUDA_VISIBLE_DEVICES=0 python quantize.py \
    --model_name_or_path facebook/opt-125m \
    --quantize \
    --dtype MXFP4 \
    --target_bits 6.5 7 7.3 \
    --tune_limit 100 \
    --enable_torch_compile \
    --low_gpu_mem_usage \
    --export_format auto_round \
    --export_path OPT-125m-MXFP4-MXFP8 \
    --accuracy \
    --tasks lambada_openai \
    --eval_batch_size 8
```
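To build intuition for what a target average bit-width means, here is a back-of-the-envelope sketch. The layer names and element counts are made up, and the effective per-element bits assume the OCP MX layout of one shared 8-bit scale per 32-element block (roughly 4.25 bits for MXFP4 and 8.25 bits for MXFP8); the actual recipe search is done by the tool itself:

```python
def average_bits(layers):
    """Element-weighted average bits over all quantized layers.

    layers maps layer name -> (num_elements, effective_bits_per_element).
    """
    total_bits = sum(n * b for n, b in layers.values())
    total_elems = sum(n for n, _ in layers.values())
    return total_bits / total_elems

# Hypothetical 50/50 split between MXFP4 and MXFP8 layers.
layers = {
    "model.layers.0.mlp": (4_000_000, 4.25),  # quantized to MXFP4
    "model.layers.1.mlp": (4_000_000, 8.25),  # kept at MXFP8
}
print(average_bits(layers))  # 6.25
```

A recipe hitting a target such as `7.3` simply shifts more layers toward MXFP8 until the weighted average lands near the requested value.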

### Llama3 Quantization Recipes

#### Llama 3.1 8B MXFP8

AutoRound helps improve accuracy; `iters` and `nsamples` are set higher than their defaults.

```bash
# Quantize and export in AutoRound format
CUDA_VISIBLE_DEVICES=0 bash run_quant.sh --topology=Llama-3.1-8B --dtype=mxfp8 --input_model=/models/Meta-Llama-3.1-8B-Instruct --output_model=Llama-3.1-8B-MXFP8
```

#### Llama 3.1 8B MXFP4 (Mixed with MXFP8, Target_bits=7.8)

```bash
CUDA_VISIBLE_DEVICES=0 bash run_quant.sh --topology=Llama-3.1-8B --dtype=mxfp4_mixed --input_model=/models/Meta-Llama-3.1-8B-Instruct --output_model=Llama-3.1-8B-MXFP4-MXFP8
```

#### Llama 3.3 70B MXFP8

```bash
CUDA_VISIBLE_DEVICES=0 bash run_quant.sh --topology=Llama-3.3-70B --dtype=mxfp8 --input_model=/models/Llama-3.3-70B-Instruct/ --output_model=Llama-3.3-70B-MXFP8
```

#### Llama 3.3 70B MXFP4 (Mixed with MXFP8, Target_bits=5.8)

```bash
CUDA_VISIBLE_DEVICES=0 bash run_quant.sh --topology=Llama-3.3-70B --dtype=mxfp4_mixed --input_model=/models/Llama-3.3-70B-Instruct/ --output_model=Llama-3.3-70B-MXFP4-MXFP8
```

#### Llama 3.1 70B MXFP8

```bash
CUDA_VISIBLE_DEVICES=0 bash run_quant.sh --topology=Llama-3.1-70B --dtype=mxfp8 --input_model=/models/Llama-3.1-70B-Instruct/ --output_model=Llama-3.1-70B-MXFP8
```

#### Llama 3.1 70B uNVFP4

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 bash run_quant.sh --topology=Llama-3.1-70B --dtype=unvfp4 --input_model=/models/Llama-3.1-70B-Instruct/ --output_model=Llama-3.1-70B-uNVFP4
```

Note: If you hit an OOM issue, either add more GPUs to `CUDA_VISIBLE_DEVICES` or reduce `eval_batch_size`.

## Inference

### MXFP4 & MXFP8

- Both pure MXFP4/MXFP8 models and mixed-precision models generated via target bits are supported.

#### Prerequisite

```bash
# Install the forked vLLM
git clone -b fused-moe-ar --single-branch --quiet https://github.com/yiliu30/vllm-fork.git && cd vllm-fork
VLLM_USE_PRECOMPILED=1 pip install -e .
```

#### MXFP Benchmark Script

For convenience, a benchmark script is provided that automatically handles GPU detection and tensor-parallelism configuration.

**All 5 MXFP benchmark cases:**

1. **Llama 3.1 8B MXFP8** (1 GPU):
```bash
CUDA_VISIBLE_DEVICES=0 bash run_benchmark.sh --model_path=Llama-3.1-8B-MXFP8
```

2. **Llama 3.1 8B MXFP4 Mixed** (1 GPU):
```bash
CUDA_VISIBLE_DEVICES=0 bash run_benchmark.sh --model_path=Llama-3.1-8B-MXFP4-MXFP8
```

3. **Llama 3.3 70B MXFP8** (4 GPUs):
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 bash run_benchmark.sh --model_path=Llama-3.3-70B-MXFP8
```

4. **Llama 3.3 70B MXFP4 Mixed** (4 GPUs):
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 bash run_benchmark.sh --model_path=Llama-3.3-70B-MXFP4-MXFP8
```

5. **Llama 3.1 70B MXFP8** (4 GPUs):
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 bash run_benchmark.sh --model_path=Llama-3.1-70B-MXFP8
```

The script automatically:
- Detects available GPUs from `CUDA_VISIBLE_DEVICES` and sets `tensor_parallel_size` accordingly
- Handles different `add_bos_token` settings for different tasks (GSM8K requires `False`; the others use `True`)
- Runs the default tasks `piqa,hellaswag,mmlu,gsm8k` with batch size 8
- Supports custom task selection and batch-size adjustment
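The GPU-detection step amounts to counting the entries in `CUDA_VISIBLE_DEVICES`. A minimal Python sketch of that idea (the real logic lives in `run_benchmark.sh`, so the function name here is illustrative):

```python
import os

def detect_tensor_parallel_size(env=None):
    """Count devices listed in CUDA_VISIBLE_DEVICES, defaulting to 1 when unset."""
    env = os.environ if env is None else env
    devices = env.get("CUDA_VISIBLE_DEVICES", "")
    count = len([d for d in devices.split(",") if d.strip()])
    return count or 1

print(detect_tensor_parallel_size({"CUDA_VISIBLE_DEVICES": "0,1,2,3"}))  # 4
```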

### NVFP4

NVFP4 is already supported by public vLLM; set the `llm_compressor` export format during quantization.

```bash
CUDA_VISIBLE_DEVICES=0 lm_eval --model vllm \
    --model_args pretrained={nvfp4_model_path},tensor_parallel_size=1,data_parallel_size=1 \
    --tasks lambada_openai \
    --batch_size 4
```

### uNVFP4

uNVFP4 is saved in fake format, and reloading is not currently supported. To verify accuracy right after quantization, set `--accuracy --tasks lambada_openai` in the command.

```bash
CUDA_VISIBLE_DEVICES=0 python quantize.py \
    --model_name_or_path facebook/opt-125m \
    --quantize \
    --dtype uNVFP4 \
    --enable_torch_compile \
    --low_gpu_mem_usage \
    --export_format fake \
    --export_path OPT-125M-uNVFP4 \
    --accuracy \
    --tasks lambada_openai \
    --eval_batch_size 8 \
    --device_map 0
```