Cherry pick Habana software 1.18.0 update #2025
Merged
59 commits
c7f995b
Squashed commit of the following:
xinhe3 279fcb4
fix missing
xinhe3 22dcb77
[SW-192016] fix measurements tool
Yantom1 902e8b2
[SW-192996] llama promotion pip install versions collision of INC dep…
Yantom1 6a54fd7
[SW-198294] INC to read world size and local rank from torch.distribu…
Yantom1 54835da
[SW-196480] Add pow2 scale config for more modules
ulivne 9dd31fd
[SW-198690] Import numba packing func when needed
Yi4Liu 9b963e6
[SW-180042] Change HQT HF8_143 max range
Yantom1 aaefde6
[SW-187399] Support vLLM Mixtral quantization
dudilester 74c5d24
[SW-197033][SW-197034] Separate unit scale from measurement and add H…
HolyFalafel 8a1e6db
[SW-198578] fix parsing of default fake_quant config
ulivne 671de6f
[ALGO-797] enabled quarot - modified unittest, improved scripts reada…
tgafni 02d0868
[SW-199642] Fix way of getting device type in scale calculation in INC
Yantom1 b7e07e9
[SW-194429] Workaround to cholesky accuracy issue
HolyFalafel 283cd69
[SW-200060] fixed missing __init__.py in quarot
5baf7d3
[SW-199769] Bugfix: fixing ignored types in whitelist
Danielohayon ead322c
[SW-199944] Remove pre installed transformers in release dockers
xinhe3 212e96f
[SW-174155] Revert ProcessSafeReadWriteLock implementation
Tiefen-boop 0eec282
[SW-198238] support scales as scalar for low bs perf
ulivne ac49e2d
[SW-199826] rename neural_compressor_3x_pt to neural_compressor_pt an…
xinhe3 42f8960
[SW-200369] Ignoring dump_stats_path on quant only scale methods
HolyFalafel 373d052
[SW-199866] removing PatchedMoeMatmul weight contiguous call on measure
dudilester 83c3d5f
[SW-198220] Fix bug with scale_to_scalar with dim != 0
Tiefen-boop 06d5801
Revert "[SW-194429] Workaround to cholesky accuracy issue"
HolyFalafel 598ef84
[SW-198220] Set PatchedSoftmax to work with CONST scale only
Tiefen-boop 8990ab5
[SW-201115] [performance] fallback ops from hpu to cpu
xinhe3 ae450e2
[SW-203595] Update numpy version for Python 3.12
ad6c97a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] f7a92a7
fix mismatch
xinhe3 dbfcfbd
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 964770b
fix pre-commit issues
xinhe3 54041b2
fix UT
xinhe3 b191a23
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 89bc8af
fix import issue
xinhe3 dd94b32
update habana image
XuehaoSun 76b1911
add numba
xinhe3 27e521a
fix accuracy issue on hpu
xinhe3 8e00509
add tbb
xinhe3 7c86798
fix UT
xinhe3 3ebed80
update code coverage range
xinhe3 ec7459b
fix pre-ci check temporarily
xinhe3 7676cd8
fix pre-commit
XuehaoSun a6722e9
add comment
xinhe3 a7451c3
Update PT_FP8Quant.md
xin3he a543abc
move fp8_sample to hello_world/fp8_example
xinhe3 2b802e8
[SW-203452] Fixing and temp skipping G3 unittests
HolyFalafel 9cc006f
set deepspeed to use Habana v1.18.0
xinhe3 4dd335a
update unittest
XuehaoSun 85e223e
remove deepspeed version limit in test
chensuyue a595949
improve UT coverage
xinhe3 4f4a0bf
update example requirement
xinhe3 2323a20
add __init__
xinhe3 55d19c4
change requirement per discussion
xinhe3 b072f18
update setup.py
xinhe3 cb91dff
Merge branch 'master' into cherry_pick_1.18.0
xin3he ef5d3dc
Merge branch 'master' into cherry_pick_1.18.0
xin3he 9d8a2ee
update woq example
xinhe3 4c34307
refine document per suggestion
xinhe3 9377ac6
fix CI
XuehaoSun
@@ -11,3 +11,4 @@ neural-compressor
lm_eval==0.4.3
peft
optimum-intel
intel_extension_for_pytorch
@@ -10,3 +10,4 @@ einops
neural-compressor
lm_eval==0.4.3
peft
intel_extension_for_pytorch
@@ -13,3 +13,5 @@ tiktoken #qwen
einops #qwen
auto_round
lm-eval==0.4.3
numba
tbb
@@ -13,3 +13,5 @@ einops #qwen
auto_round
lm-eval==0.4.3
huggingface_hub
numba
tbb
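The `numba` and `tbb` additions in these requirements hunks line up with the cherry-picked commit "[SW-198690] Import numba packing func when needed", which defers the numba import until weight packing actually runs. The sketch below only illustrates that lazy-import pattern; the function name and error message are hypothetical, not INC's actual code.

```python
def pack_weights(int_weights):
    """Hypothetical packing helper: import numba only when packing is requested."""
    try:
        import numba  # heavyweight optional dependency; tbb backs its threading layer
    except ImportError as exc:
        raise ImportError(
            "numba (and tbb) are required for weight packing; install them via requirements.txt"
        ) from exc
    # ... numba-accelerated packing would go here ...
    return int_weights
```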
183 changes: 50 additions & 133 deletions
...rch/nlp/huggingface_models/language-modeling/quantization/weight_only/README.md
@@ -1,179 +1,96 @@
Step-by-Step
============
This document describes the step-by-step instructions to run large language models (LLMs) on 4th Gen Intel® Xeon® Scalable Processor (codenamed Sapphire Rapids) with PyTorch and Intel® Extension for PyTorch.
Weight-only quantization
===============

The script `run_clm_no_trainer.py` supports `GPTJ`, `OPT`, `LLaMA2`, `BLOOM` and `Falcon` quantization and validates last word prediction accuracy with [lm_eval](https://github.com/EleutherAI/lm-evaluation-harness.git) now, and we are adding more models.

# Prerequisite
## 1. Create Environment
## Prerequisite
```
# Installation
pip install -r requirements.txt
```

# Run
## Support status on HPU

Here is how to run the scripts:
Below is the current support status on Intel Gaudi AI Accelerator with PyTorch.

**Causal Language Modeling (CLM)**
| woq_algo | Status |
|--------------|----------|
| GPTQ | ✔|

`run_clm_no_trainer.py` quantizes the large language models using the dataset [NeelNanda/pile-10k](https://huggingface.co/datasets/NeelNanda/pile-10k) calibration and validates `lambada_openai`, `piqa`, `winogrande`, `hellaswag` and other datasets accuracy provided by lm_eval, an example command is as follows.
### GPT-J-6b
> We validated the typical LLMs such as: `meta-llama/Llama-2-7b-hf`, `EleutherAI/gpt-j-6B`, `facebook/opt-125m`.

#### Quantization
## Support status on CPU

```bash
# "--woq_algo GPTQ" is used to enable GPTQ algorithms
# "--double_quant_type BNB_NF4" is used to enable double quant algorithms
python run_clm_no_trainer.py \
--model EleutherAI/gpt-j-6B \
--dataset NeelNanda/pile-10k \
--quantize \
--woq_algo GPTQ \
--woq_bits 4 \
--woq_scheme asym \
--woq_group_size 128 \
--gptq_max_seq_length 2048 \
--gptq_use_max_length \
--double_quant_type "BNB_NF4" \
--output_dir saved_results
Below is the current support status on Intel® Xeon® Scalable Processor with PyTorch.

# "--woq_algo RTN" is used to enable RTN algorithms
python run_clm_no_trainer.py \
--model EleutherAI/gpt-j-6B \
--dataset NeelNanda/pile-10k \
--quantize \
--woq_algo RTN \
--woq_bits 4 \
--woq_scheme asym \
--woq_group_size 128 \
--double_quant_type "BNB_NF4"
--output_dir saved_results

# "--woq_algo AWQ" is used to enable AWQ algorithms
python run_clm_no_trainer.py \
--model EleutherAI/gpt-j-6B \
--dataset NeelNanda/pile-10k \
--quantize \
--woq_algo AWQ \
--woq_bits 4 \
--woq_scheme asym \
--woq_group_size 128 \
--calib_iters 128
| woq_algo | status |
|--------------|----------|
| RTN | ✔ |
| GPTQ | ✔ |
| AutoRound| ✔ |
| AWQ | ✔ |
| TEQ | ✔ |

# "--woq_algo AutoRound" is used to enable AutoRound algorithms
python run_clm_no_trainer.py \
--model EleutherAI/gpt-j-6B \
--dataset NeelNanda/pile-10k \
--quantize \
--woq_algo AutoRound \
--woq_bits 4 \
--woq_scheme asym \
--woq_group_size 128
> We validated the typical LLMs such as: `meta-llama/Llama-2-7b-hf`, `EleutherAI/gpt-j-6B`, `facebook/opt-125m`.

# "--accuracy" for eval
python run_clm_no_trainer.py \
--model EleutherAI/gpt-j-6B \
--dataset NeelNanda/pile-10k \
--int8 \
--accuracy \
--tasks "lambada_openai" \
--output_dir saved_results
```
**Notes**: Weight-only quantization based on fake quantization is previewly supported and supports RTN, GPTQ[1], AWQ[2], TEQ algorithms. For more details, please refer to [link](https://github.com/intel/neural-compressor/blob/master/docs/source/quantization_weight_only.md). Our GPTQ API support various CLMs including GPTJ, OPTs, Blooms, Llamas, Falcons, MPTs, ChatGLMs, etc. Simply replace the "--model" argument with other models to quantize different CLMs with GPTQ.

## Run

### OPT-125m
`run_clm_no_trainer.py` quantizes the large language models using the dataset [NeelNanda/pile-10k](https://huggingface.co/datasets/NeelNanda/pile-10k) calibration and validates datasets accuracy provided by lm_eval, an example command is as follows.

#### Quantization
### Quantization

```bash
# "--woq_algo GPTQ" is used to enable GPTQ algorithms
# "--double_quant_type BNB_NF4" is used to enable double quant algorithms
python run_clm_no_trainer.py \
--model facebook/opt-125m \
--model meta-llama/Llama-2-7b-hf \
--dataset NeelNanda/pile-10k \
--quantize \
--batch_size 8 \
--woq_algo GPTQ \
--woq_bits 4 \
--woq_scheme asym \
--woq_group_size 128 \
--gptq_max_seq_length 2048 \
--gptq_use_max_length \
--double_quant_type "BNB_NF4"

# "--woq_algo RTN" is used to enable RTN algorithms
python run_clm_no_trainer.py \
--model facebook/opt-125m \
--dataset NeelNanda/pile-10k \
--quantize \
--woq_algo RTN \
--woq_bits 4 \
--woq_scheme asym \
--woq_group_size 128 \
--double_quant_type "BNB_NF4"

# "--woq_algo AWQ" is used to enable AWQ algorithms
python run_clm_no_trainer.py \
--model facebook/opt-125m \
--dataset NeelNanda/pile-10k \
--quantize \
--woq_algo AWQ \
--woq_bits 4 \
--woq_scheme asym \
--woq_group_size 128 \
--calib_iters 128
--output_dir saved_results
```
### Evaluation

# "--woq_algo AutoRound" is used to enable AutoRound algorithms
```bash
# original model
python run_clm_no_trainer.py \
--model facebook/opt-125m \
--dataset NeelNanda/pile-10k \
--quantize \
--woq_algo AutoRound \
--woq_bits 4 \
--woq_scheme asym \
--woq_group_size 128
--model meta-llama/Llama-2-7b-hf \
--accuracy \
--batch_size 8 \
--tasks "lambada_openai,wikitext" \
--output_dir saved_results

# "--accuracy" for eval
# quantized model
python run_clm_no_trainer.py \
--model facebook/opt-125m \
--dataset NeelNanda/pile-10k \
--int8 \
--model meta-llama/Llama-2-7b-hf \
--load \
--accuracy \
--tasks "lambada_openai" \
--batch_size 8 \
--tasks "lambada_openai,wikitext" \
--output_dir saved_results
```

### LLAMA2-7b/13b/70b
#### Quantization
### Benchmark

```bash
# "--double_quant_type BNB_NF4" is used to enable double quant algorithms
# "--woq_algo GPTQ" is used to enable GPTQ algorithms
# original model
python run_clm_no_trainer.py \
--model meta-llama/Llama-2-7b-hf \
--dataset NeelNanda/pile-10k \
--quantize \
--woq_algo GPTQ \
--woq_bits 4 \
--woq_scheme asym \
--woq_group_size 128 \
--gptq_max_seq_length 2048 \
--gptq_use_max_length \
--double_quant_type "BNB_NF4"
--performance \
--batch_size 8 \
--output_dir saved_results

# "--woq_algo RTN" is used to enable RTN algorithms
# quantized model
python run_clm_no_trainer.py \
--model meta-llama/Llama-2-7b-hf \
--dataset NeelNanda/pile-10k \
--quantize \
--woq_algo RTN \
--woq_bits 4 \
--woq_scheme asym \
--woq_group_size 128 \
--double_quant_type "BNB_NF4"
--load \
--performance \
--batch_size 8 \
--output_dir saved_results
```

[1]. Elias, Frantar, et al. "GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers." arXiv preprint arXiv:2210.17323 (2023).
[2]. Lin, Ji, et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." arXiv preprint arXiv:2306.00978 (2023).
For more information about parameter usage, please refer to [PT_WeightOnlyQuant.md](https://github.com/intel/neural-compressor/blob/master/docs/source/3x/PT_WeightOnlyQuant.md)
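For readers who prefer the Python API over the CLI flags shown in the README diff above, PT_WeightOnlyQuant.md describes a prepare/convert flow in the 3.x PyTorch interface. The snippet below is a minimal sketch assuming the `RTNConfig`, `prepare`, and `convert` entry points from `neural_compressor.torch.quantization`; exact argument names may differ between releases, so treat it as illustrative rather than a drop-in recipe.

```python
from transformers import AutoModelForCausalLM
from neural_compressor.torch.quantization import RTNConfig, prepare, convert

# Load the FP32 causal LM to be quantized (any model from the support tables above).
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

# RTN is data-free, so no calibration run is needed between prepare() and convert().
quant_config = RTNConfig()           # weight-only defaults (assumed; check the installed version)
model = prepare(model, quant_config)
model = convert(model)

# The converted model can then be evaluated the same way as in the CLI examples above.
```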
@@ -11,4 +11,5 @@ neural-compressor
lm_eval==0.4.3
peft
auto_round
intel_extension_for_pytorch
numba
tbb