Conversation
Signed-off-by: yuwenzho <yuwen.zhou@intel.com>
hi @chensuyue, PR is ready for extension test
Force-pushed from d0882e3 to 15c4863
@chensuyue extension test: the performance regression is caused by switching the performance dataset from a dummy dataset to a real one.
Extension test for the other examples.
examples/onnxrt/object_detection/onnx_model_zoo/tiny_yolov3/quantization/ptq/main.py (outdated review thread, resolved)
Retest: https://inteltf-jenk.sh.intel.com/job/intel-lpot-validation-top-mr-extension/3890/
Update:
Retest: https://inteltf-jenk.sh.intel.com/job/intel-lpot-validation-top-mr-extension/3909/
Passed: bert_squad_model_zoo_dynamic, mobilebert_squad_mlperf_dynamic, mobilebert_squad_mlperf_qdq, duc, BiDAF_dynamic, and the huggingface question answering models.
Failed: gpt2_lm_head_wikitext_model_zoo_dynamic and the huggingface text classification models.
Retest: https://inteltf-jenk.sh.intel.com/job/intel-lpot-validation-top-mr-extension/3913/
Passed:
* SparseLib: add VTune support; refine the profiling doc
Building on the vLLM WoQ path, this PR adds support for re-quantizing FP8 weights with per-tensor or per-channel scaling.

Co-authored-by: Yi Liu <yiliu4@habana.ai>
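For readers unfamiliar with the two scaling granularities mentioned in the PR description, the NumPy sketch below illustrates the difference. It is a simplified illustration, not the actual neural-compressor implementation: `requant_scales` and `fake_quant_fp8` are hypothetical names, and a real FP8 re-quantization rounds values onto the FP8 grid rather than merely clipping to its range.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def requant_scales(weight: np.ndarray, per_channel: bool) -> np.ndarray:
    """Compute re-quantization scales for a 2-D weight [out_ch, in_ch]."""
    if per_channel:
        # one scale per output channel (row-wise abs-max)
        absmax = np.abs(weight).max(axis=1, keepdims=True)
    else:
        # a single scale for the whole tensor
        absmax = np.abs(weight).max()
    return np.maximum(absmax, 1e-12) / FP8_E4M3_MAX

def fake_quant_fp8(weight: np.ndarray, per_channel: bool) -> np.ndarray:
    """Simulate FP8 quant/dequant: scale, clip to the FP8 range, rescale."""
    scale = requant_scales(weight, per_channel)
    q = np.clip(weight / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q * scale
```

Per-channel scaling keeps one scale per output row, so a single outlier channel does not shrink the dynamic range of the others; per-tensor scaling uses one scale for the whole weight.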
* [SW-207748] Support Auto-round on HPU (#25)
* [SW-209878] Increase threshold to avoid random error in test_layer_wise.py (#36)
* [SW-207579] Support loading vLLM-compatible FP8 models (#18): both G2 and G3, single-card and multi-card.
* [SW-207451] Implement block-wise calibration for LLM (#41)
* [SW-208986] Fix save & load bug (#40)
* [SW-207748] Add Auto-round example (#42): add the autoround HPU example and its requirements.
* [SW-197077] Fix bug (#47)
* [SW-210541] Loading for fused_sdpa requires an additional amax scale (#51)
* Fix PatchedLoRACompatibleLinear init (#65)
* Align files with v1.19.0 in the fp8_quant folder
* Fix missing SaveLoadFormat
* Align and fix config after cherry-pick
* Drop implicit relative imports
* Fix config issue blocking CI
* Remove synchronize for `pack_unpack_tensor_with_numpy` (#2070)
* Stop auto-fix of pre-commit
* Update autoround example for release test
* Fix AWQ & TEQ loading due to input scale
* Fix HQQ state_dict loading caused by [SW-195965]
* Use per_channel as default config (#2091)
* Work around transformers issue in version 4.47.0 (#2092)
* Refactor FP8 pytest script (#2089)
* Update CI scan scope
* [SW-210500] [Optimum-Habana] [Regression] [fp8] [INC] No generated text for llava models [llava-1.5-7b-hf] [llava-1.5-13b-hf] (#54)
* [SW-213236] Resolve CPU memory issue in CI (#76)
* Recover pre-commit
* [pre-commit.ci] Auto fixes from pre-commit.com hooks
* Fix `is_sharded` setting for loading quant model (#2094)
* Fix error message for different Python versions (#2099)
* Fix UT of RTN on HPU (#2098)
* Fix device issue during calibration (#2100)
* Fix WoQ example and update document for v1.19.0 (#2097)
* Refactor version import paths to common module (#2095)
* Update CI gaudi-docker to 1.19.0 (#2096)
* Fix device mapping issue of llama GPTQ (#2101)
* Doc: update readme.md (#2083)
* Update publication_list.md (#2105)
* [pre-commit.ci] pre-commit autoupdate (#2107)
* Update publication list with new blog (#2111)
* Update the License (#2108)
* Add autotune support for PT2E and disable some conv1d-related tests on HPU (#2110)
* Add intel-extension-for-pytorch to Transformers-like API requirements (#2113)
* Add VLM quantization & loading into transformers-like API (#2116)
* Fix hf_device_map setting for transformers-like API (#2122)
* Add mapping entry to 1.20 (#2126)
* Fix bug of lwq GPTQ (#2128)
* Add xfail for onnx test_layer_wise.py (#2129)
* Cherry pick Habana software v1.20.0 (#2123), which includes:
  * [SW-210525] Release HPU memory when loading neural_magic fp8 models (#48)
  * [SW-211178] Save generation_config when saving model, if it exists (#57)
  * [SW-210543] Update gitignore to simplify the git message (#50)
  * [SW-205334][SW-187731] llama70b vLLM: fix graph breaks with torch.compile; remove orig_mod from helper_modules; fix test_register_apis (#67)
  * [SW-213890] Disable test_two_step_layer_wise temporarily (#84)
  * [SW-205437] Support LM-HEAD patching (#79)
  * Enhance and rename the fix_measurements tool to postprocessing_vllm_measurements (#82)
  * [SW-214088] Fix graph break caused by PatchedMixtralMoE (#74)
  * [SW-208528] Support FP8 per-channel Q/DQ (#13): add per-channel QDQ support, merge qdq_per_channel and qdq_per_tensor into one qdq, improve the get_scale_dtype func and the QuantInput/DequantOutput init, add scale_method and improve PCQ, and fix PCQ scale_inv expanding.
  * [SW-204341] Explicit scale format for ops (#73): add a wrapper around fp8 functions that decides which flavor of the function, and which cast flavor, to call according to the scale format; helper modules call the wrapper. Also enhances SDPA for measure and quant and removes SDPA quantized ops.
  * [SW-213890] Revert "[SW-213890] Disable test_two_step_layer_wise temporarily (#84)" (#86): reverts commit 27162ae60755aa39a40ac1a1c96a9efaac48751b.
  * Revert "[SW-205334][SW-187731] llama70b vLLM fix graph breaks with torch.com…" (#87): reverts commit 01a57343d34d8d0c63935b848bdd94d9dfd40210.
  * [ALGO-809] PatchedLmHeadLinearAllreduce: replace the sharding code with the one from deepspeed-fork (#85)
  * Fix bug of FusedMoE object has no attribute w13_weight (#94)
  * [SW-208588] Add HPU fp8 dynamic MoE (#88)
  * Minor config fixes (#96): cosmetic fixes in quant_config; remove hooks.
  * [SW-196641] Fix type mismatch in linear quantization unit tests (#99): fix atol value; add hp_dtype to the fp8 config dict before parsing.
  * [SW-214785] Apply PatchedModuleBase for all existing PatchedModules (#92)
  * [SW-215319] Loosen the too-tight memory usage threshold in test_block_wise.py (#100)
  * [SW-215543] Revert "minor config fixes (#96)" (#104): reverts commit fa40142f32404691ea2ae36d13a15c348c853bde.
  * Fix RowParalleLinear func names from string to tuple (#106)
  * [SW-215615] Fix memory unreleased during loading neural_magic models on multi-cards (#105)
  * [SW-212423] Fix RuntimeError when loading the GPTQ model from HF; skip tie_word_embeddings=False (#70)
  * [SW-214785] Fix issue when self._mod_extra_config is None (#108)
  * [SW-211826] [example] Demonstrate layer-wise, block-wise and lm_eval usage to quantize LLMs with limited host & device memory (#66)
  * [SW-215295] Force a single object from quantized func wrapper classes; clear the factory object after module patching, with cleanup moved to the Quantizer object (#103)
  * [SW-216292] Minor update for lm-eval: enable lm-eval 0.4.2 and expose `add_bos_token` (#113)
  * [SW-209207] Add vLLM fp8 dynamic MoE (#116)
  * [SW-216239] Align Softmax fp8 scale calculation with configuration (#112)
  * [SW-217321] Skip auto-round tests due to CI breakage (#119) (#125)
  * [SW-207451] Implement block-wise calibration for LLM (#24). For LLMs, measurement on bf16 requires high HPU memory usage; this change helps measure bf16 llama-405b on 8 Gaudi2 cards, or llama-70b on 1 Gaudi card. Limitation: the lm_head layer cannot be measured yet; this may be enhanced later.
  * [SW-197077] Fix bug in output arbitrary scales (#45)
  * [SW-210500] [Optimum-Habana] [Regression] [fp8] [INC] No generated text for llava models [llava-1.5-7b-hf] [llava-1.5-13b-hf] (#54) (#77)
  * [SW-213236] Resolve CPU memory issue in CI (#76) (#83): cherry-pick from 1.19.
  * [SW-213368] requirements_pt.txt: allow newer pydantic versions, >= 1.10.13 (#80). Newer deepspeed uses pydantic v2, which has slightly different APIs.
  * [SW-212057] Enable scalar scale to support QDQ (#98)
  * [SW-215845] Run some unit tests from the top-level API (#109)
  * [SW-212629] Support saving weight-only quantization INT4 models in Hugging Face format (#101)
  * [SW-205970] Update state_dict to save scalar scales (#6): update the state_dict method in the save/load function.
  * Revert "[SW-205970] update state_dict to save scalar scales (#6)" (#114): reverts commit ffcb97eaf7965bf8ac942523660a6d5e9e8e2184.
  * [SW-212092] Save vLLM-compatible format (#102): add assertions, make max_file_size human-readable, default to the same layout as Hugging Face when saving, separate save functions for single device and multi devices, remove weight and scale conversion on G2, improve the convert-weight-to-vLLM-compatible function, replace print with logger, and move unit_mapping to common utils.
  * [SW-205970] Update state_dict to save scalar scales, with mixtral support (#115)
  * [SW-215009] Support loading per-channel scales (#95)
  * Refactoring scales (#22) (#122): [SW-197077] refactor maxabs scales and add arbitrary scales.
  * [SW-199696] Support dynamic quantization (#128): calculate dynamic scales using nn.Modules; code cleanup; fix a model print issue (circular dependency); remove debug code from patching_common.py.
  * [SW-217334] Enable fp8 QDQ mode using PatchedModuleBase (#129)
  * [SW-218871] Fix fp8 multi-cards not loaded correctly (#138)
  * [SW-218197] Fix bug in Mixtral unitscale (#141)
  * Update version to 3.3 for release
  * [SW-20808] Make sure the save & load format is an Enum object (#58)
  * [pre-commit.ci] Auto fixes from pre-commit.com hooks
  * Add xfail for torchvision
  * Fix ILITV-3859; work around ILITV-3858; fix sdxl_smooth_quant; fix ILITV-3854

  Signed-off-by: Xin He <xinhe3@habana.ai>, changwang <changwang@habana.ai>, changwangss <changwang@habana.ai>, Uri Livne <ulivne@habana.ai>, yuwenzho <yuwen.zhou@intel.com>, Yi Liu <yiliu4@habana.ai>, Xin <xin3.he@intel.com>, xin3he <xin3.he@intel.com>
  Co-authored-by: Xin He <xinhe3@habana.ai>, RafLit <rafal.litka@intel.com>, Rafal Litka <rlitka@habana.ai>, Dany Kiazada <141814181+kiazada@users.noreply.github.com>, Nir David <124874956+nirda7@users.noreply.github.com>, Yuwen Zhou <yuwen.zhou@intel.com>, Wang, Chang <changwang@habana.ai>, Uri Livne <ulivne@habana.ai>, Oz Abramovich <oabramovich@habana.ai>, Dudi Lester <160421192+dudilester@users.noreply.github.com>, Danny Semiat <dsemiat@habana.ai>, smarkovichgolan <smarkovich@habana.ai>, Yi Liu <yi4.liu@intel.com>, Yi Liu <yiliu4@habana.ai>, Linoy Buchnik <linoybu@gmail.com>, Nadav Elyahu <88962733+nelyahu@users.noreply.github.com>, pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>, chen, suyue <suyue.chen@intel.com>, Sun, Xuehao <xuehao.sun@intel.com>
* Doc: update fp8 accuracy test data and update docker image to 1.20.0 (#2130)
* [SW-219274] Change the quant method name in lm-head (#150) (#2132)
* Adapt ipex xpu transformers version (#2134)
* Add phi3 VLM transformers example (#2135)
* Fix saving issue for group_size=-1 (#2138)
* Add v3.3 release FAQ (#2139)
* Bump release version into 3.3 (#2140)
* Upgrade numpy from 1.23.5 to 1.26.4 (#2115)
* Update publications (#2145)
* Add transformers to align onnxruntime-extensions=1.14.0 (#2147)
* Freeze 2x package versions (#2151)
* Fix vulnerability (#2149)
* Bump into v3.3.1 (#2152)
* [SW-218939] Fix memory mapping failure in UT (#2154)
* [SW-223106] Change code with a more robust implementation (#2153)
* Update habana docker, PyTorch, and related packages to latest versions (#2158)
* Remove mxnet (#2146)
* Workaround of [SW-208658] (#2162)
* Make code compatible with transformers version 4.50 (#2159)
* Fix NotImplementedError: get_type is not implemented (#2133)
* Improve 3x UT on bf16-supported machines (#2163)
* Remove numpy version limitation (#2161)
* [pre-commit.ci] pre-commit autoupdate (#2166): isort 5.13.2 → 6.0.1, black 24.10.0 → 25.1.0, codespell v2.3.0 → v2.4.1, ruff-pre-commit v0.8.6 → v0.11.4
* Add revision for hf-internal-testing/tiny-random-gptj in UT (#2174)
* Suit transformers>=4.51 (#2171)
* Add the missed change during cherry-pick (#2175)
* Reset `accelerator` when `INC_TARGET_DEVICE` is set in code (#2168)
* Fix text-to-image example (#2176)
* Restrict lm_eval version <= 0.4.7 (#2177)
* Add `update_g_idx` flag for setting qweight & g_idx (#2143)
* [SW-219134] Dynamically remove 2x content (#2173)
* fix(pytorch): rename layer_scale parameter to avoid quantization error (#2172)
* Adapt autoround v0.5 (#2187)
* Update transformers version for VLM example (#2181)
* Update export function for torch 2.6 (#2184)
* Add mapping between v3.4 and v1.21 (#2189)
* Add continueOnError to avoid baseline stages blocking CI (#2188)
* Revert "[SW-219134] dynamically remove 2x content (#2173)" (#2183)
* Add `add_bos_token` for Llama3 evaluation (#2179)
* Fix security issue (#2194)
* Update pub list with llama4 blog (#2190)
* Fix WoQ loading for large models (#2200): fix save/load and fix check.
* Add qwen3 blog to pub list (#2201)
* Replace XPU with Intel GPU (#2198)
* Revert "remove 1x docs (#1900)" (#2205): reverts commit d3204604aad007f3db67c46dcb0575aa8f5cd584.
* Deprecate 2x Tensorflow, Keras and ONNX (#2199)
* Fix `g_idx` init in transformers-like API (#2204)
* [SW-205334][SW-187731] Reintroduce PR #67 with the PatchedKVCache fix of [SW-214296] (#97): wrap calls to original module methods, add a wrap for fetch_from_cache, fix PatchedKVCache, and switch PatchedParallelLMHead to orig_linear_apply.
* [SW-218081] Temporarily disable fp8_static_quant test; skip test block-wise measurements (#131)
* [FSW-12066] Add initial multi-device support (#91): add an abstraction for a per-device quantized func wrapper, remove HPU code from quant_dequant, add an INC device enum instead of device-specific enums, move device-specific imports to internal scopes or wrap them with try-except, add xpu_modules device for scale calculation, and adjust for dynamic MoE vLLM and the scale refactoring rebase; refine the factory class to use class methods and members only, with init/clear calls in the load API.
* [FSW-12066] Small fixes in xpu quantized func (#145)
* [SW-218197] Fix bug in Mixtral unitscale (#139)
* [ALGO-808] Add support for int4 weights + fp8 activations, phase 1 (#43): add code for quantizing only a single input to PatchedMatmul; new w4a8 kernel.
* [SW-218081] Re-enable tests (#140)
* [SW-214378] Remove creation of nc_workspace in each INC run (#151)
* [SW-219274] Fix getting an error log when lm_head in vLLM does not have measurements for scale calculation (#152)
* [SW-207602] INC to support fp8 communication in PatchedRowParallelLinear (#1). This commit changes the PatchedRowParallelLinear collective func to a custom all_reduce function that, during measurement, measures the all_reduce output and the matmul_fp8 maximum output, and during quantization, quantizes the all_gather and all_to_all ops inside the all_reduce func since they are performed in fp8. A branch in reduce_forward_quant applies the fp8 optimization only at the decode phase.
* [SW-207602] Fix bug with PatchedRowParallelLinear (#158)
* [SW-219745] Fix fp8 GaudiMixtralSparseMoeBlock graph break (#161)
* Blockwise GPTQ (#155): add block-wise quantization for GPTQ and sharded checkpoint additions.
* Update safetensors requirements file to support safetensors HPU support
* Raise an error when measuring PC without shapes (#163)
* [SW-218303] Fix incorrect bias addition point in PatchedColumnParallelLinear (#160)
* Correct `PatchedVLLMKVCache` to measure the whole input (#170)
* [SW-218484] Enhance log for saving (#134)
* [SW-222320] Optimize code for TPC fuser in dynamic quantization (#173)
* [SW-221372] Allow running the Mixtral measurement phase using torch.compile (#167)
* [SW-222366] Switch tests to lazy mode (#174)
* [SW-222366] Move env default to init (#178)
* [SW-223106] Temporarily disable mixed precision test (#180)
* fp8-aware GPTQ (hybrid GPTQ) (#154): load bias into mixed low precision; fix tests for fp8-aware quantization and hybrid re-ordering.
* [SW-222513] Fix OSError: does not appear to have a file named generation_config.json (#181)
* Revert "fp8 aware gptq (hybrid gptq) (#154)" (#184): reverts commit 050dc44424debaa659f8dd58b4a4ec6bc9af7d68.
* [SW-221589] AutoRound W4A8 quantization and loading (#110): quantize models to W4A8 using auto-round and load W4A8 models.
* [SW-214855] Set scale attributes in INC to reduce graph recompilations (#162): add scaling method IDs (starting at 1), enable the feature also for Load QuantMode, and move scale tensors to float on CPU when the feature is enabled, including in the scale handler for dynamic scaling; add a unit test.
* [SW-222220] Fix quantizing Llama 3.2 11B/90B failing with a GC error (#179): refine PatchedVLLMKVCache and move the cache out of args.
* add comments * update comment * Dummy commit for triggering CI * Dummy commit for triggering CI * [SW-216623] Restore patch module to original before convert (#185) Signed-off-by: Xin He <xinhe3@habana.ai> Co-authored-by: Xin He <xinhe3@habana.ai> * [SW-218081] move htcore.hpu_set_env() to confest (#146) * [SW-218081] move htcore.hpu_set_env() to confest Signed-off-by: Xin He <xinhe3@habana.ai> * Update conftest.py * use htcore.hpu_set_inference_env() Signed-off-by: Xin He <xinhe3@habana.ai> --------- Signed-off-by: Xin He <xinhe3@habana.ai> Co-authored-by: Xin He <xinhe3@habana.ai> * [SW-221588] Unify weights in multi-cards for HF/INC format (N->M) (#172) * Unify weights in multi-cards for HF/INC format (N->M) M cards for loading is less than N, and can divide N with an integer number -------------------------------------------------------------------------------------------- Signed-off-by: Xin He <xinhe3@habana.ai> Co-authored-by: Xin He <xinhe3@habana.ai> Signed-off-by: Xin He <xinhe3@habana.ai> * [SW-221594]Re-quantize the Official DeepSeek FP8 Model (#187) Building on the vllm WoQ path, this PR adds support for re-quantizing FP8 weights w/ per-tensor or per-channel scaling. 
--------- Co-authored-by: Yi Liu <yiliu4@habana.ai> * [SW-218277]Add support for mixtral with expert parallelism (#177) * Add support for mixtral with expert parallelism * Remove allreduce from measurement * Update PatchedVLLMKVCache for deepseek performance (#194) Co-authored-by: Linoy Buchnik <linoybu@gmail.com> * [SW-224836] disable test_mixed_precision_gptq_fp8_quant_only_nlp (#208) Co-authored-by: linoy buchnik <lbuchnik@habana.ai> Signed-off-by: Xin He <xinhe3@habana.ai> * Fix `PatchedMoeMatmul` and Get `num_experts` from Module (#202) Fix `PatchedMoeMatmul` and Get `num_experts` from Module --------- Signed-off-by: Yi Liu <yiliu4@habana.ai> Co-authored-by: Yi Liu <yiliu4@habana.ai> * add back missing changes in cherry-pick Signed-off-by: Xin He <xinhe3@habana.ai> * add docstring and fix typo Signed-off-by: Xin He <xinhe3@habana.ai> * fix CI failure caused by internal changes Signed-off-by: Xin He <xinhe3@habana.ai> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * support update config after initialization (#2191) Signed-off-by: Xin He <xinhe3@habana.ai> Co-authored-by: Xin He <xinhe3@habana.ai> * add preprocess_quant_config to collect common code (#2192) * add preprocess_quant_config to collect common code Signed-off-by: Xin He <xinhe3@habana.ai> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Xin He <xinhe3@habana.ai> Co-authored-by: Xin He <xinhe3@habana.ai> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * skip fp8 xpu path in CI Signed-off-by: Xin He <xinhe3@habana.ai> * remove htcore.hpu_inference_set_env to suit 1.20 Signed-off-by: Xin He <xinhe3@habana.ai> * fix typo Signed-off-by: Xin He <xinhe3@habana.ai> * support ComposableConfig setattr Signed-off-by: Xin He <xinhe3@habana.ai> * workaround for v1.20 missing attribution Signed-off-by: Xin He <xinhe3@habana.ai> * fix bug in 
previous UT Signed-off-by: Xin He <xinhe3@habana.ai> * remove experimental related document and code (#2207) * remove experimental Signed-off-by: Xin He <xinhe3@habana.ai> * remove docs (#2206) Signed-off-by: yiliu30 <yi4.liu@intel.com> --------- Signed-off-by: Xin He <xinhe3@habana.ai> Signed-off-by: yiliu30 <yi4.liu@intel.com> Co-authored-by: Xin He <xinhe3@habana.ai> Co-authored-by: Yi Liu <yi4.liu@intel.com> * Add add_bos_token for Llama3(Instruct) xpu evaluation (#2209) Signed-off-by: Kaihui-intel <kaihui.tang@intel.com> * Update habana docker image version and DeepSpeed dependency to 1.21.0 (#2213) Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com> Co-authored-by: Xin He <xin3.he@intel.com> * add INC_PT_ONLY and INC_TF_ONLY (#2202) * add INC_PT_ONLY and INC_TF_ONLY * compatible with previous install method --------- Signed-off-by: Xin He <xinhe3@habana.ai> Co-authored-by: Xin He <xinhe3@habana.ai> * Bump release version into v3.4 (#2214) Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com> * update torch to 2.7.0 in CI (#2216) Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com> * Fix VLM model ut (#2218) Signed-off-by: Kaihui-intel <kaihui.tang@intel.com> * Add transformers version limit for ut case (#2219) Signed-off-by: changwa1 <chang1.wang@intel.com> * Update `--int8` flag to `--optimized` flag (#2215) Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com> * Fix g_idx init for GPTQ (#2222) Signed-off-by: Kaihui-intel <kaihui.tang@intel.com> Co-authored-by: Wang, Chang <chang1.wang@intel.com> * Fix CI env(#2225) Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com> * Bump release version to v3.4.1 (#2224) Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com> * Freeze IPEX version for INT8 SQ support (#2221) Signed-off-by: Kaihui-intel <kaihui.tang@intel.com> * remove unused examples (#2230) Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com> * [pre-commit.ci] pre-commit autoupdate (#2233) Signed-off-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> 
Co-authored-by: Sun, Xuehao <xuehao.sun@intel.com> * Bump transformers (#2234) Bumps [transformers](https://github.com/huggingface/transformers) from 4.38.0 to 4.51.0. - [Release notes](https://github.com/huggingface/transformers/releases) - [Commits](https://github.com/huggingface/transformers/compare/v4.38.0...v4.51.0) --- updated-dependencies: - dependency-name: transformers dependency-version: 4.51.0 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * update pytorch int8 notebook example (#2235) * Adapt transformers v4.53.1 (#2237) Signed-off-by: Kaihui-intel <kaihui.tang@intel.com> * Support saving inc model for transformers-like api (#2231) Signed-off-by: Kaihui-intel <kaihui.tang@intel.com> * Bump transformers from 4.35 to 4.52.1 (#2236) Bumps [transformers](https://github.com/huggingface/transformers) from 4.35 to 4.52.1. - [Release notes](https://github.com/huggingface/transformers/releases) - [Commits](https://github.com/huggingface/transformers/compare/v4.35.0...v4.52.1) --- updated-dependencies: - dependency-name: transformers dependency-version: 4.52.1 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump transformers from 4.51.0 to 4.52.1 (#2243) Bumps [transformers](https://github.com/huggingface/transformers) from 4.51.0 to 4.52.1. - [Release notes](https://github.com/huggingface/transformers/releases) - [Commits](https://github.com/huggingface/transformers/compare/v4.51.0...v4.52.1) --- updated-dependencies: - dependency-name: transformers dependency-version: 4.52.1 dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Fix Mix Precision Example (#2242) Signed-off-by: Yi Liu <yiliu4@habana.ai> * Bump transformers to 4.52.1 (#2244) Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com> * [SW-207602] Fix all_reduce scale format in PatchedRowParallelLinear (#183) * [SW-199696] Added PatchedLinearBase (#192) * Added PatchedLinearBase * Fixed PatchedLinear forward_qdq * Changed quant strategy - scale to fix ci * Renamed QuantStrategy to QuantWrapper * Removed instance member from QuantWrapper * [SW-224403] Added ticket and throwing error when using row_parallel_linear_allreduce_quantization * Changed QuantWrapper to a simple method that stores scale * [SW-224538] Added ticket to TODO comment for init_linear * Pushed requires_grad to the tensor creation * Update neural_compressor/torch/algorithms/fp8_quant/_quant_common/helper_modules.py * Update neural_compressor/torch/algorithms/fp8_quant/_quant_common/helper_modules.py * Moved copy_scale functions inside PatchedLinearBase * Update helper_modules.py * Update helper_modules.py * Update neural_compressor/torch/algorithms/fp8_quant/_quant_common/helper_modules.py * Update helper_modules.py * Update helper_modules.py * Update helper_modules.py copy scale * Update neural_compressor/torch/algorithms/fp8_quant/_quant_common/helper_modules.py * [SW-199696] Implementing dynamic quantization design for linear ops (#188) * Implementing dynamic quantization design for linear ops * Using copy_ to store scale as a member, added qdq, removed dyn * Added PatchedLinearBase to support all linear modules * Testing dynamic quantization with scale compare * CR comments - calling cguid * Added PatchedLinearBase * Fixed PatchedLinear forward_qdq * Changed quant strategy - scale to fix ci * Renamed QuantStrategy to QuantWrapper * Removed instance member from QuantWrapper * [SW-224403] Added ticket and throwing error when using 
row_parallel_linear_allreduce_quantization * Changed QuantWrapper to a simple method that stores scale * [SW-224538] Added ticket to TODO comment for init_linear * Pushed requires_grad to the tensor creation * Fixed merge * Fixed load() flow - handling meta tensors with dummy scale * [SW-224609] removed non tested dynamic qdq * Update neural_compressor/torch/algorithms/fp8_quant/_quant_common/helper_modules.py * Update neural_compressor/torch/algorithms/fp8_quant/_quant_common/helper_modules.py * Moved copy_scale functions inside PatchedLinearBase * Added and fixed test cases * Increased tolerance for new test cases * Update helper_modules.py * Update helper_modules.py * Some tests/ci fixes * Update neural_compressor/torch/algorithms/fp8_quant/_quant_common/helper_modules.py * Update helper_modules.py * cr comments + cguid check change * Update helper_modules.py * Update helper_modules.py copy scale * Update neural_compressor/torch/algorithms/fp8_quant/_quant_common/helper_modules.py * Maxabs design and some structure changes * Merged MaxAbsDynamicPts To base + cguid comments * changed cguid calls to functions * Log changes * Update neural_compressor/torch/algorithms/fp8_quant/model_configs.py * Update neural_compressor/torch/algorithms/fp8_quant/model_configs.py * Re-set self.scale_input as before, value is none in dynamic * Changing back dynamic scale_input to intermediate and not member * Disabling test_linear_dynamic_quantization: not storing scale as member * Reintroduce MaxAbsDynamicPts: in dynamic we don't save scale as a member * weight to hpu comment * Fix incorrect condition in PatchedMatmul (#204) * Fixing vllm runs for dynamic quantization (#210) * Revert "Revert "fp8 aware gptq (hybrid gptq) and fix performance drop of W4A16 scheme (#190) * Revert "Revert "fp8 aware gptq (hybrid gptq) and fix performance drop in gptq test (SW-223441)" This reverts commit ba9475d7599bd1bb43fa978eef60a655b77e44dd. 
* addressing reviewer comments * Temporarily disable rel_err test until fixed * fixed pytest error --------- Co-authored-by: Asaf Karnieli <akarnieli@habana.ai> Co-authored-by: Mariusz Okroj <mariusz.okroj@intel.com> Co-authored-by: Linoy Buchnik <linoybu@gmail.com> Signed-off-by: Xin He <xinhe3@habana.ai> * enable bf16 h2d scales for dynamic scaling (#215) * [SW-197607] INC- change hard coded gaudi 2 scales for optimal weight … (#221) * [SW-197607] INC- change hard coded gaudi 2 scales for optimal weight quantization * cr fix * [SW-217321] Add autoround UTs Back (#197) * add autoround UTs back Change-Id: I0614ffd8be4f89e9787037ee99e24a60f8548b49 --------- Signed-off-by: Yi Liu <yiliu4@habana.ai> Co-authored-by: Yi Liu <yiliu4@habana.ai> Signed-off-by: Xin He <xinhe3@habana.ai> * [SW-225078] [INC][DynamicQuant] Reenable testing dynamic quantization… (#214) * [SW-225078] [INC][DynamicQuant] Reenable testing dynamic quantization scales on hpu graphs and torch.compile * CR fixes * tiny fix * cr fix * don't support running _quant_only_scale_methods with dynamic quantization * string check fix * fix test_matmul runs and atol in HW_ALIGNED_SINGLE_SCALE * string fixes * [SW-227504] disabled test (#226) * [FSW-13914] Fix gaudi specific code in common location (#224) Move Gaudi specific code to internal scopes, so it won't be imported in FS/JS env Signed-off-by: Xin He <xinhe3@habana.ai> * [SW-224874] Implement support for hp/lp dtypes in KV-cache QDQ (#222) * [SW-225858] Fix unit_scale issue with unquantized modules (#225) * [SW-223055] Cleanup fetch_from_cache (#229) * [SW-226788] Fixed handling RowParallelLinear to fix accuracy (#233) * [SW-226788] Fixed handling RowParallelLinear to fix accuracy * Changed supported ops to op types * dynamic quant check changes in quantize.py * Fixed dynamic quant check changes in quantize.py * [SW-228966] add codeowners to github (#230) * fix revision bug when loading from huggingface hub (#235) * fix revision bug when loading from 
huggingface hub --------- Signed-off-by: Xin He <xinhe3@habana.ai> Co-authored-by: Xin He <xinhe3@habana.ai> * [SW-228061] define ScaleMethod by its properties in INC FP8 (#227) * [SW-228061] define ScaleMethod in better way * bug fixes and new scale_method_config file * copilot cr * tiny fix * rebase * some cr fixes * bug+cr fixes * cr fix * cr fix Signed-off-by: Xin He <xinhe3@habana.ai> * [SW-229653] disable fakequant test (#236) * [SW-226948] Removed redundant cast to hp float in invert scale + some function commentary (#237) * [SW-218668] add use_mmap in example for llama-70b GPTQ (#234) * [SW-218668] add use_mmap in example for llama-70b GPTQ --------- Signed-off-by: Xin He <xinhe3@habana.ai> Co-authored-by: Xin He <xinhe3@habana.ai> Signed-off-by: Xin He <xinhe3@habana.ai> * [SW-229704] Disable scale rounding CGUID + using exp2 for fuser (#240) * [SW-230359] Support vector of scales on cpu (#241) and minor refactor of create_scale_tensor function * Deepseek FP8 fixes (#242) * Allign moe api, switch to local_num_experts * retrigger checks * retrigger checks Signed-off-by: Xin He <xinhe3@habana.ai> * [SW-229825] support fp32 softmax mode in fp8_fsdpa (#247) * [SW-199936] Remove collective_func usage as preparation for vLLM upstream (#249) self.collective_func is not allowed for upstream, therefore we use explicitly vLLM collective functions, while protecting from import errors and circular import * [SW-219751]improve vllm compatible save function (#217) * improve vllm compatible save to avoid OOM --------- Signed-off-by: changwangss <changwang@habana.ai> Signed-off-by: Xin He <xinhe3@habana.ai> Co-authored-by: Xin He <xinhe3@habana.ai> * [SW-232491] Fix collective functions from vLLM (#250) * [SW-228539] Support setting scale method per node in INC FP8 (#238) * [SW-228539] Support setting scale method per node in INC FP8 * scale method parser * tiny fix * apply all scale methods validations in config parsing correctly * tiny test fix * copilot cr * cr fixes * 
cr fix * python3.10 not supporting strenum * fix for enum * remove not needed imports * fix hard coded path * bug fix in new scale method test * save scale_method with provided path * small fix * retriger * retriger2 * [SW-214269] support g_idx for uint4 (#246) * support g_idx for uint4 --------- Signed-off-by: Xin He <xinhe3@habana.ai> Co-authored-by: Xin He <xinhe3@habana.ai> Co-authored-by: Sylwester Fraczek <sylwester.fraczek@intel.com> * [SW-228570] support FP8 GaudiFluxPipeline save and load (#254) * [SW-228570] support FP8 GaudiFluxPipeline save and load --------- Signed-off-by: Xin He <xinhe3@habana.ai> Co-authored-by: Xin He <xinhe3@habana.ai> Signed-off-by: Xin He <xinhe3@habana.ai> * [SW-224612] Set scale calculation to run on cguid as default in dynamic quant (#257) * [SW-224612] Set scale calculation to run on cguid as default * Update common.py * CGUID calculation only in dynamic --------- Co-authored-by: Danny <dsemiat@habana.ai> * [SW-228576] Add Dynamic Quant Support For FusedMoE (#243) Signed-off-by: Yi Liu <yiliu4@habana.ai> * [SW-230951] Save measurements according to samples counter (#251) * Added post forward hook to dump measurements according to samples counter * add support in samples counter in config * removed function in RowParllelLinear as it is removed from the vllm upstream code * currently only blocking method is operational, will complete async methods in future commit * fix CR comments * remove unused files * add reslove_input method it can't be defined in vllm due to upstream considerations, so it is copied here * fixed logging acoording to cr * fixed resolve_input and moved the hook function * [SW-230641] Remove smoothquant related scale methods (#258) * Rename model in UT to reduce CI effort (#245) * rename model in UT to reduce CI effort --------- Signed-off-by: Xin He <xinhe3@habana.ai> Co-authored-by: Xin He <xinhe3@habana.ai> * [SW-233731] Support FP8 QDQ quant on CPU (#239) supported module types: Linear, Conv2D, 
EmbeddingBag (weight-only quant) validated scheme: per-tensor, sym, E4M3 validated model: DLRM, vit --------- Signed-off-by: Mengni Wang <mewang@habana.ai> Signed-off-by: Mengni Wang <mengni.wang@intel.com> Co-authored-by: Mengni Wang <mewang@habana.ai> * [SW-233731] Use torchao op for CPU QDQ and abstract QDQ calling (#264) Abstract QDQ calling Fix QDQ model print issue Use torchao op for CPU QDQ (HPU doesn't has this accuracy issue) --------- Signed-off-by: Mengni Wang <mewang@habana.ai> Co-authored-by: Mengni Wang <mewang@habana.ai> * [SW-0] Update version to 3.5 (#269) * [SW-234066] fix performance drop due to ordered g_idx (#277) Signed-off-by: Xin He <xinhe3@habana.ai> Co-authored-by: Xin He <xinhe3@habana.ai> Co-authored-by: Linoy Buchnik <linoybu@gmail.com> * update format after cherry_pick Signed-off-by: Xin He <xinhe3@habana.ai> * [SW-230053] Merge with public Signed-off-by: Xin He <xinhe3@habana.ai> * [SW-234066] update g_idx fix to make it robust (#281) The previous fix only covered part of the flow, and the second flow is failing. The current fix moves the relevant code to a common location that's shared by both flows. 
Signed-off-by: Xin He <xinhe3@habana.ai> Co-authored-by: Xin He <xinhe3@habana.ai> Signed-off-by: Xin He <xinhe3@habana.ai> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix kwargs mismatch issue in WOQ pack Signed-off-by: Xin He <xinhe3@habana.ai> * fix cpu fp8 quant unittest (#2245) * fix cpu fp8 quant unittest Signed-off-by: Mengni Wang <mengni.wang@intel.com> --------- Signed-off-by: Mengni Wang <mengni.wang@intel.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Fix requirement and recover skipped in UT (#2248) Fix requirement and recover skipped in UT --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Xin He <xin3.he@intel.com> * Fix CPU FP8 per-tensor QDQ (#2246) Signed-off-by: Mengni Wang <mengni.wang@intel.com> Co-authored-by: Xin He <xin3.he@intel.com> * add version check for 1.22.0 Signed-off-by: Xin He <xinhe3@habana.ai> * make version check robust Signed-off-by: Xin He <xinhe3@habana.ai> * Use SafeUnpickler (#2247) Signed-off-by: yiliu30 <yi4.liu@intel.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * better solution for checking g_idx support (#2251) Signed-off-by: Xin He <xinhe3@habana.ai> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Replace all pickle load with safe load (#2252) * replace all pickle load Signed-off-by: yiliu30 <yi4.liu@intel.com> * Update neural_compressor/utils/utility.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * add docstring Signed-off-by: yiliu30 <yi4.liu@intel.com> * Update utility.py --------- Signed-off-by: yiliu30 <yi4.liu@intel.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * [SW-234750] Fix reading distributed data in quant_config (#284) (#2255) Co-authored-by: Karol Damaszke <kdamaszke@habana.ai> * add mapping entry for v3.5 (#2250) Signed-off-by: 
Huang, Tai <tai.huang@intel.com> * Fix autoround CI with amp (#2253) Signed-off-by: Kaihui-intel <kaihui.tang@intel.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * add CPU FP8 QDQ doc (#2240) Signed-off-by: Mengni Wang <mengni.wang@intel.com> * [SW-235565] Adapt the `FusedMoE` update (#286) (#2256) * fix moe * update copy properties --------- Signed-off-by: Yi Liu <yiliu4@habana.ai> Co-authored-by: Yi Liu <yiliu4@habana.ai> * Add dlrm_v2 CPU FP8 QDQ example (#2239) Signed-off-by: Mengni Wang <mengni.wang@intel.com> * fix corrupted file * fix CR * fix merge * fix cr * fix cr * fix cr * fix * fix * [SW-240561]Requant LLMC FP8 model (#301) * add VllmMixtureOfExpertsOpFP8PerChannel Change-Id: I1e28dbb1f6a6839fe90db8a38da4532bd335d69e Signed-off-by: Yi Liu <yiliu4@habana.ai> --------- Signed-off-by: yiliu30 <yi4.liu@intel.com> Signed-off-by: Yi Liu <yiliu4@habana.ai> Co-authored-by: Yi Liu <yiliu4@habana.ai> * pass extra args to moe (#302) Change-Id: I47f5259a247bbce0c6290d1d1d1bb47071bd3256 Signed-off-by: Yi Liu <yiliu4@habana.ai> Co-authored-by: Yi Liu <yiliu4@habana.ai> * fix --------- Signed-off-by: Yi Liu <yiliu4@habana.ai> Signed-off-by: Xin He <xinhe3@habana.ai> Signed-off-by: changwang <changwang@habana.ai> Signed-off-by: Xin <xin3.he@intel.com> Signed-off-by: changwangss <changwang@habana.ai> Signed-off-by: Kaihui-intel <kaihui.tang@intel.com> Signed-off-by: xin3he <xin3.he@intel.com> Signed-off-by: yiliu30 <yi4.liu@intel.com> Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com> Signed-off-by: chensuyue <suyue.chen@intel.com> Signed-off-by: fengding <feng1.ding@intel.com> Signed-off-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Huang, Tai <tai.huang@intel.com> Signed-off-by: n1ck-guo <heng.guo@intel.com> Signed-off-by: Uri Livne <ulivne@habana.ai> Signed-off-by: yuwenzho <yuwen.zhou@intel.com> Signed-off-by: Mengni Wang <mengni.wang@intel.com> Signed-off-by: changwangss 
<chang1.wang@intel.com> Signed-off-by: Daniel Socek <daniel.socek@intel.com> Signed-off-by: V-E-D <vedantthote2019@gmail.com> Signed-off-by: changwa1 <chang1.wang@intel.com> Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: Mengni Wang <mewang@habana.ai> Co-authored-by: Yi Liu <yi4.liu@intel.com> Co-authored-by: Yi Liu <yiliu4@habana.ai> Co-authored-by: Xin He <xin3.he@intel.com> Co-authored-by: Xin He <xinhe3@habana.ai> Co-authored-by: Wang, Chang <changwang@habana.ai> Co-authored-by: Uri Livne <ulivne@habana.ai> Co-authored-by: Kaihui-intel <kaihui.tang@intel.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Sun, Xuehao <xuehao.sun@intel.com> Co-authored-by: chensuyue <suyue.chen@intel.com> Co-authored-by: Wang, Chang <chang1.wang@intel.com> Co-authored-by: feng-intel <110514170+feng-intel@users.noreply.github.com> Co-authored-by: Huang, Tai <tai.huang@intel.com> Co-authored-by: Heng Guo <heng.guo@intel.com> Co-authored-by: RafLit <rafal.litka@intel.com> Co-authored-by: Rafal Litka <rlitka@habana.ai> Co-authored-by: Dany Kiazada <141814181+kiazada@users.noreply.github.com> Co-authored-by: Nir David <124874956+nirda7@users.noreply.github.com> Co-authored-by: Yuwen Zhou <yuwen.zhou@intel.com> Co-authored-by: Oz Abramovich <oabramovich@habana.ai> Co-authored-by: Dudi Lester <160421192+dudilester@users.noreply.github.com> Co-authored-by: Danny Semiat <dsemiat@habana.ai> Co-authored-by: smarkovichgolan <smarkovich@habana.ai> Co-authored-by: Nadav Elyahu <88962733+nelyahu@users.noreply.github.com> Co-authored-by: fengding <feng1.ding@intel.com> Co-authored-by: Wang, Mengni <mengni.wang@intel.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Daniel Socek <daniel.socek@intel.com> Co-authored-by: Vedant <146507396+ved1beta@users.noreply.github.com> Co-authored-by: Asaf Karnieli <akarnieli@habana.ai> Co-authored-by: Tomer Gafni <tgafni@habana.ai> 
Co-authored-by: Yan Tomsinsky <73292515+Yantom1@users.noreply.github.com> Co-authored-by: Roi Tiefenbrunn <rtiefenbrunn@habana.ai> Co-authored-by: Ivan Antonov <Iantonov@habana.ai> Co-authored-by: Tomasz Bohutyn <tbohutyn@habana.ai> Co-authored-by: Tomasz Szulist <72727299+tszulist-hbn@users.noreply.github.com> Co-authored-by: Yu-Zhou <yu.zhou@intel.com> Co-authored-by: Krzysztof Wiśniewski <kwisniewski@habana.ai> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Mariusz Okroj <mariusz.okroj@intel.com> Co-authored-by: Karol Damaszke <kdamaszke@habana.ai> Co-authored-by: Sylwester Fraczek <sylwester.fraczek@intel.com> Co-authored-by: Mengni Wang <mewang@habana.ai>

Type of Change
Example
Description
Update the ONNX Runtime (ONNXRT) examples to use the new quantization API.
JIRA ticket: ILITV-2468
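As a rough illustration of what "the new API" migration looks like in these examples, the sketch below shows the Neural Compressor 2.x post-training quantization entry point (`PostTrainingQuantConfig` plus `quantization.fit`). This is a hedged, minimal sketch, not the exact diff from this PR: the model path and the calibration dataloader are hypothetical placeholders, and the import guard is only there so the snippet degrades gracefully when the package is absent.

```python
# Minimal sketch of the INC 2.x ("new API") PTQ flow for an ONNX model.
# Assumptions: neural-compressor >= 2.0 is installed; "model.onnx" and
# `calib_dataloader` are hypothetical placeholders, not from this PR.
try:
    from neural_compressor import quantization
    from neural_compressor.config import PostTrainingQuantConfig

    # Static post-training quantization; other knobs (quant_format,
    # op_type_dict, ...) are configured on this object instead of the
    # old YAML file used by the 1.x API.
    config = PostTrainingQuantConfig(approach="static")

    # With a real model and calibration dataloader this would produce
    # the quantized model:
    # q_model = quantization.fit("model.onnx", config,
    #                            calib_dataloader=calib_dataloader)
    # q_model.save("model-int8.onnx")
    api_status = "new-api-available"
except ImportError:
    api_status = "neural-compressor-not-installed"

print(api_status)
```

The main difference from the old flow is that the quantization recipe moves from a YAML configuration file into the `PostTrainingQuantConfig` object passed directly to `quantization.fit`.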
How has this PR been tested?
Extension tests were run on the ONNX models.
Dependency Change?
No