Conversation

chensuyue (Contributor)

No description provided.

@chensuyue chensuyue merged commit c62f670 into intel:azure-pipeline Nov 15, 2022
xin3he pushed a commit that referenced this pull request Feb 14, 2025
* [SW-217334] enable fp8 qdq mode using PatchedModuleBase

* fix review comments
XuehaoSun added a commit that referenced this pull request Feb 27, 2025
* [SW-210525] release HPU memory when loading neural_magic fp8 models (#48)

Signed-off-by: Xin He <xinhe3@habana.ai>
Co-authored-by: Xin He <xinhe3@habana.ai>

* [SW-211178] save generation_config when saving model if exists (#57)

* [SW-211178] save generation_config when saving model if exists

---------

Signed-off-by: Xin He <xinhe3@habana.ai>
Co-authored-by: Xin He <xinhe3@habana.ai>
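
A minimal sketch of the save-time behavior in [SW-211178], assuming a transformers-style model; `save_with_generation_config` is a hypothetical helper name, not the actual INC entry point:

```python
def save_with_generation_config(model, save_directory: str):
    """Hypothetical helper: persist generation_config next to the model."""
    model.save_pretrained(save_directory)
    gen_config = getattr(model, "generation_config", None)
    if gen_config is not None:  # only save it if the model carries one
        gen_config.save_pretrained(save_directory)
```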

* [SW-210543] update gitignore to simplify the git message (#50)

Signed-off-by: Xin He <xinhe3@habana.ai>
Co-authored-by: Xin He <xinhe3@habana.ai>

* [SW-205334][SW-187731] llama70b vLLM fix graph breaks with torch.compile (#67)

* fix graph breaks with torch.compile

* remove orig_mod from helper_modules

* fix typos

* fix test_register_apis

---------

Co-authored-by: Rafal Litka <rlitka@habana.ai>

* [SW-213890] Disable test_two_step_layer_wise temporarily (#84)

* [SW-205437] - Support LM-HEAD patching (#79)

* [SW-205437] - Support LM-HEAD patching

* fix CR comments

* Enhance and rename fix_measurements tool to postprocessing_vllm_measurements (#82)

* [SW-214088] Fix graph break caused by PatchedMixtralMoE (#74)

* [SW-208528] Support FP8 per channel Q/DQ (#13)

* add per channel qdq support

Signed-off-by: changwang <changwang@habana.ai>

* improve ut

Signed-off-by: changwang <changwang@habana.ai>

* improve get_scale_dtype func and qdq init

Signed-off-by: changwangss <changwang@habana.ai>

* improve DequantOutput QuantInput init

Signed-off-by: changwangss <changwang@habana.ai>

* add scale_method improve PCQ

Signed-off-by: changwangss <changwang@habana.ai>

* remove scale name

Signed-off-by: changwangss <changwang@habana.ai>

* fix PCQ scale_inv expansion

Signed-off-by: changwangss <changwang@habana.ai>

* merge qdq_per_channel and qdq_per_tensor into qdq

Signed-off-by: changwangss <changwang@habana.ai>

* move scale_inv change to the QuantInput init

Signed-off-by: changwangss <changwang@habana.ai>

* remove scale_dtype list check

Signed-off-by: changwangss <changwang@habana.ai>

* fix missing axis parameter

Signed-off-by: changwangss <changwang@habana.ai>

---------

Signed-off-by: changwang <changwang@habana.ai>
Signed-off-by: changwangss <changwang@habana.ai>
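
A minimal sketch of per-channel Q/DQ (fake quantization) to the FP8 E4M3 range, assuming one scale per channel along `axis`; the names below are illustrative and do not mirror the QuantInput/DequantOutput classes:

```python
import torch

FP8_E4M3_MAX = 448.0  # max representable magnitude in float8_e4m3fn

def qdq_per_channel(weight: torch.Tensor, axis: int = 0) -> torch.Tensor:
    # Reducing over every dim except `axis` yields one scale per channel.
    reduce_dims = [d for d in range(weight.dim()) if d != axis]
    amax = weight.abs().amax(dim=reduce_dims, keepdim=True)
    scale = (amax / FP8_E4M3_MAX).clamp(min=1e-9)
    q = (weight / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)  # quantize
    return q * scale                                         # dequantize

w_qdq = qdq_per_channel(torch.randn(64, 128), axis=0)
```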

* [SW-204341] explicit scale format for ops (#73)

* [SW-204341] explicit scale format for ops

Added a wrapper around the fp8 functions. The wrapper decides which flavor of the function to call according to the scale format, and helper modules call the wrapper. Casts are dispatched to the right flavor the same way (see the sketch after this commit block).

* [SW-204341] Adjust softmax API, remove commented-out code

* [SW-204341] Fixes from CR 1

* [SW-204341] Fixed CR 2

* [SW-204341] add missing arg in fsdpa

Signed-off-by: Uri Livne <ulivne@habana.ai>

* [SW-204341] Enhance SDPA for measure and quant

* [SW-204341] remove sdpa quantized ops

* reland per-op class with more enhancements

* [SW-204341] reland specific arguments, rename class to wrapper

* added call with self in patched lm head

rebased on top of master_next
force push

* fix mistake in conflict resolution

restore MethodType fix

* another fix

* modified fp8 matmul test to test the quantized matmul func

* another fix of rebase mistake

* hopefully last rebase mistake fix

* restore backward-compatibility import protection

---------

Signed-off-by: Uri Livne <ulivne@habana.ai>
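
A sketch of the dispatch idea described above, assuming a scale-format enum; the class and method names are illustrative, not the actual op wrapper API:

```python
from enum import Enum, auto

class ScaleFormat(Enum):
    SCALAR = auto()  # scalar (host) scale
    CONST = auto()   # const tensor scale

class QuantizedFuncWrapper:
    """Picks the fp8 function flavor according to the scale format."""
    def __init__(self, scalar_impl, const_impl):
        self._impls = {ScaleFormat.SCALAR: scalar_impl,
                       ScaleFormat.CONST: const_impl}

    def __call__(self, *args, scale_format: ScaleFormat, **kwargs):
        # Helper modules call the wrapper; the wrapper selects the flavor.
        return self._impls[scale_format](*args, **kwargs)
```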

* [SW-213890] Revert "[SW-213890] Disable test_two_step_layer_wise temporarily (#84)" (#86)

This reverts commit 27162ae.

* Revert "[SW-205334][SW-187731] llama70b vLLM fix graph breaks with  torch.com…" (#87)

This reverts commit 01a5734.

Co-authored-by: Danny Semiat <dsemiat@habana.ai>

* [ALGO-809] PatchedLmHeadLinearAllreduce: replacing the sharding code with the one from deepspeed-fork (#85)

Change-Id: Icb9670cfefdd1880c1ebb9a804a97c9ba79ecdc3

Co-authored-by: smarkovichgolan <smarkovich@habana.ai>

* fix bug of FusedMoE object has no attribute w13_weight (#94)

Signed-off-by: yuwenzho <yuwen.zhou@intel.com>

* [SW-208588] Add HPU fp8 Dynamic MOE (#88)

* [SW-208588] Add HPU fp8 Dynamic MOE

* fix review comments

* fix more review comments

* fix comments

* fix tests

* minor config fixes (#96)

* [SW-0] minor cosmetic fixes in quant_config

* remove hooks

* [SW-196641] - Fix type mismatch in linear quantization unit tests (#99)

* [SW-196641] - Fix type mismatch in linear quantization unit tests

* fix atol value

* add hp_dtype to fp8 config dict before parsing

* [SW-214785] Apply PatchedModuleBase for all existing PatchedModules (#92)

* [SW-214785] Apply PatchedModuleBase for all existing PatchedModules

Signed-off-by: Xin He <xinhe3@habana.ai>

---------

Signed-off-by: Xin He <xinhe3@habana.ai>
Co-authored-by: Xin He <xinhe3@habana.ai>

* [SW-215319] threshold of memory usage in test_block_wise.py is too tight (#100)

* [SW-215543] Revert "minor config fixes (#96)" (#104)

This reverts commit fa40142.

* fix RowParallelLinear func names from string to tuple (#106)

* [SW-215615] memory is not released when loading neural_magic models on multiple cards (#105)

Signed-off-by: Xin He <xinhe3@habana.ai>
Co-authored-by: Xin He <xinhe3@habana.ai>

* [SW-212423] RuntimeError when loading the gptq model from HF (#70)

* [SW-212423] RuntimeError when loading the gptq model from HF
* skip tie_word_embeddings=False

Signed-off-by: Xin He <xinhe3@habana.ai>

---------

Signed-off-by: Xin He <xinhe3@habana.ai>
Co-authored-by: Xin He <xinhe3@habana.ai>

* [SW-214785] fix issue when self._mod_extra_config is None (#108)

* [SW-211826] [example] demonstrate layer-wise, block-wise and lm_eval usage (#66)

* [SW-211826] [example] demonstrate layer-wise & block-wise usage to quantize LLMs with limited host & device memory

Signed-off-by: Xin He <xinhe3@habana.ai>

---------

Signed-off-by: Xin He <xinhe3@habana.ai>
Co-authored-by: Xin He <xinhe3@habana.ai>

* [SW-215295] Force single object from quantized func wrapper classes (#103)

* [SW-215295] Force single object from quantized func wrapper classes

* Modify the factory object to be cleared after module patching

* Move cleanup to Quantizer object
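
A sketch of the single-instance idea in [SW-215295]: a factory hands out one object per wrapper class and is cleared once patching finishes. Names are illustrative, not the actual INC factory:

```python
class QuantizedFuncWrapperFactory:
    _instances = {}

    @classmethod
    def get(cls, wrapper_cls, *args):
        # Create at most one object per wrapper class.
        if wrapper_cls not in cls._instances:
            cls._instances[wrapper_cls] = wrapper_cls(*args)
        return cls._instances[wrapper_cls]

    @classmethod
    def clear(cls):
        # Called by the Quantizer after module patching, so cached
        # wrapper objects do not outlive the patching step.
        cls._instances.clear()
```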

* [SW-216292]Minor update for lm-eval (#113)

* Enable lm-eval 0.4.2 and expose `add_bos_token`

---------

Signed-off-by: Yi Liu <yiliu4@habana.ai>
Co-authored-by: Yi Liu <yiliu4@habana.ai>
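
A hedged sketch of running lm-eval 0.4.x with `add_bos_token` exposed; the model name and task are placeholders, and the exact flags the example exposes may differ:

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-2-7b-hf,add_bos_token=True",
    tasks=["lambada_openai"],
)
print(results["results"])
```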

* [SW-209207] add vllm fp8 dynamic MoE (#116)

* [SW-216239] Align Softmax fp8 scale calc with configuration (#112)

* [SW-217321] Skip auto round tests (#119) (#125)

* Test Commit

* [SW-217321] Skip auto round tests due to CI breakage

* remove unneeded print

* [SW-207451] Implement block-wise calibration for LLM (#24)

For LLMs, measurement in bf16 requires a large amount of HPU memory.
This change makes it possible to measure bf16 llama-405b on 8 Gaudi2 cards, or llama-70b on 1 Gaudi card.
Limitation: the lm_head layer cannot be measured; this may be enhanced later. (A conceptual sketch follows this commit's notes.)

---------

Signed-off-by: Xin <xin3.he@intel.com>
Co-authored-by: Xin He <xinhe3@habana.ai>
Signed-off-by: Xin He <xinhe3@habana.ai>
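
A conceptual sketch of the block-wise approach, assuming a decoder-style model: only one transformer block is resident on the device at a time, so peak HPU memory stays near a single block plus activations. `blocks` and the cached `block_inputs` are illustrative stand-ins:

```python
import torch

def blockwise_calibrate(blocks, block_inputs, device="hpu"):
    for block, inputs in zip(blocks, block_inputs):
        block.to(device)                 # one block resident at a time
        with torch.no_grad():
            for batch in inputs:         # replay cached block inputs
                block(batch.to(device))  # observers record max-abs stats
        block.to("cpu")                  # release device memory
```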

* [SW-197077] fix bug in output arbitrary scales (#45)

* [SW-197077] fix bug

* [SW-197077] fix bug in outputs arbitrary scales

Signed-off-by: Xin He <xinhe3@habana.ai>

* [SW-210500] [Optimum-Habana] [Regression] [fp8] [INC] No generated text for llava models [llava-1.5-7b-hf] [llava-1.5-13b-hf] (#54) (#77)

Signed-off-by: Xin He <xinhe3@habana.ai>
Co-authored-by: Xin He <xinhe3@habana.ai>

* [SW-213236] resolve CPU mem issue in CI (#76) (#83)

Cherry-pick from 1.19
Co-authored-by: Xin He <xin3.he@intel.com>

* [SW-213368] requirements_pt.txt: allow newer pydantic versions to >= 1.10.13 (#80)

* requirements_pt.txt: upgrade pydantic version to >= 2.0.0

* allow newer versions of pydantic

newer deepspeed uses pydantic v2, which has slightly different APIs (see the sketch after this commit block).

* Update requirements_pt.txt
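
A small sketch of why the looser pin matters: pydantic v2 renamed parts of the BaseModel API, so code that must run under both major versions needs a shim along these lines (the helper name is illustrative):

```python
import pydantic

PYDANTIC_V2 = pydantic.VERSION.startswith("2")

def to_dict(model: pydantic.BaseModel) -> dict:
    # v2 renamed BaseModel.dict() to model_dump().
    return model.model_dump() if PYDANTIC_V2 else model.dict()
```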

* [SW-212057] Enable scalar scale to support QDQ (#98)

* [SW-212057] Enable scalar scale to support QDQ

Change-Id: Ib5f5accd7a770675609e91c18bd04497b15937c5

* PR comment fixes

Change-Id: I01be41c29721b8d59c887f3d2b4e3cef8433331c
Signed-off-by: Xin He <xinhe3@habana.ai>

* [SW-215845] Run some unit tests from top level API (#109)

Signed-off-by: Xin He <xinhe3@habana.ai>

* [SW-212629] Support saving weight-only quantized INT4 models in Hugging Face format (#101)

Signed-off-by: Xin He <xinhe3@habana.ai>
Co-authored-by: Xin He <xinhe3@habana.ai>
Signed-off-by: Xin He <xinhe3@habana.ai>

* [SW-205970] update state_dict to save scalar scales (#6)

* update state_dict method in save/load function

---------

Signed-off-by: Xin He <xinhe3@habana.ai>
Co-authored-by: Xin He <xinhe3@habana.ai>
Signed-off-by: Xin He <xinhe3@habana.ai>

* Revert "[SW-205970] update state_dict to save scalar scales (#6)" (#114)

This reverts commit ffcb97e.

* [SW-212092] Save vllm compatible format (#102)

* save vllm compatible format

Signed-off-by: changwangss <changwang@habana.ai>

* add assertion and make max_file_size human-readable

Signed-off-by: changwangss <changwang@habana.ai>

* support default the same with huggingface when saving

Signed-off-by: changwangss <changwang@habana.ai>

* separate save functions for single device and multiple devices

Signed-off-by: changwangss <changwang@habana.ai>

* rebase

Signed-off-by: changwangss <changwang@habana.ai>

* rebase save

Signed-off-by: changwangss <changwang@habana.ai>

* remove weight and scale conversion on G2

Signed-off-by: changwangss <changwang@habana.ai>

* rebase master_next due to revert #6

Signed-off-by: changwangss <changwang@habana.ai>

* improve the function that converts weights to a vLLM-compatible format

Signed-off-by: changwangss <changwang@habana.ai>

* replace print with logger

Signed-off-by: changwangss <changwang@habana.ai>

* move unit_mapping to common utils

Signed-off-by: changwangss <changwang@habana.ai>

---------

Signed-off-by: changwangss <changwang@habana.ai>
Signed-off-by: Xin He <xinhe3@habana.ai>
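
A sketch of the human-readable `max_file_size` parsing mentioned above; the accepted units and their byte values are assumptions for illustration:

```python
import re

_UNITS = {"KB": 2**10, "MB": 2**20, "GB": 2**30}

def parse_max_file_size(size: str) -> int:
    """Turn a string like '5GB' into a byte count."""
    match = re.fullmatch(r"(\d+)\s*(KB|MB|GB)", size.strip().upper())
    assert match, f"unrecognized size string: {size!r}"
    number, unit = match.groups()
    return int(number) * _UNITS[unit]

assert parse_max_file_size("5GB") == 5 * 2**30
```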

* [SW-205970] update state_dict to save scalar scales (#115)

* [SW-205970] update state_dict to save scalar scales (#6)

* update state_dict method in save/load function

* support mixtral
---------

Signed-off-by: Xin He <xinhe3@habana.ai>
Co-authored-by: Xin He <xinhe3@habana.ai>
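
One plain-PyTorch way to get scalar scales into state_dict is to register them as buffers so save/load round-trips them with the weights; this illustrates the intent of [SW-205970], not the exact patched-module code:

```python
import torch
from torch import nn

class QuantLinear(nn.Module):
    def __init__(self, in_features, out_features, scale: float):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        # Buffers are included in state_dict automatically.
        self.register_buffer("scale", torch.tensor(scale))

m = QuantLinear(4, 4, scale=0.02)
assert "scale" in m.state_dict()
```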

* [SW-215009] support loading per-channel scales (#95)

* [SW-215009] support loading per-channel scales

Signed-off-by: Xin He <xinhe3@habana.ai>

* fix UT

Signed-off-by: Xin He <xinhe3@habana.ai>

---------

Signed-off-by: Xin He <xinhe3@habana.ai>
Co-authored-by: Xin He <xinhe3@habana.ai>

* Refactoring scales (#22) (#122)

* Refactoring scales (#22)

* [SW-197077] refactoring maxabs scales and adding arbitrary scales.

* [SW-199696] Supporting Dynamic Quantization (#128)

* Calculating dynamic scales using nn.Modules

Change-Id: I8c344ae737803b39117037edaaa3d3b9cbd09f30

* [SW-199696] Supporting Dynamic Quantization

Change-Id: Ic5d6f04ec0b5032ac305e1b3097747c47250385b

* Code cleanup

Change-Id: I213bc7438e06bd1002775066bfb0dc6f10e8a84a

* Review changes and model print issue (circular dependency fix)

Change-Id: I5c41d2f9a937416ce260f55cb045c86858dd201a

* removed debug code from patching_common.py

* Round 2 + CI import issue

Change-Id: I27dbb33de8e027fb0b726336b38156b5d23a6896
Signed-off-by: Xin He <xinhe3@habana.ai>
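
An illustrative sketch of "calculating dynamic scales using nn.Modules": the scale is derived from each input at runtime instead of from calibration. The FP8 max value and the fake-quant math are assumptions for the example:

```python
import torch
from torch import nn

FP8_MAX = 448.0  # E4M3 max representable magnitude

class DynamicQuantInput(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scale = (x.abs().amax() / FP8_MAX).clamp(min=1e-9)   # per-call scale
        return (x / scale).clamp(-FP8_MAX, FP8_MAX) * scale  # fake Q/DQ
```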

* [SW-217334] enable fp8 qdq mode using PatchedModuleBase (#129)

* [SW-217334] enable fp8 qdq mode using PatchedModuleBase

* fix review comments

* [SW-218871] fp8 multi-cards is not loaded correctly (#138)

Signed-off-by: Xin He <xinhe3@habana.ai>
Co-authored-by: Xin He <xinhe3@habana.ai>

* Fix bug in mixtral unitscale (#141)

* [SW-218197] fix bug in Mixtral unitscale

* [SW-218197] fix bug in Mixtral unitscale

* update version to 3.3 for release

Signed-off-by: Xin He <xinhe3@habana.ai>

* [SW-20808] Make sure save&load format is an Enum object (#58)

* [SW-20808] Make sure save&load format is an Enum object

Signed-off-by: Xin He <xinhe3@habana.ai>

* Update save_load_entry.py

---------

Signed-off-by: Xin He <xinhe3@habana.ai>
Co-authored-by: Xin He <xinhe3@habana.ai>
Signed-off-by: Xin He <xinhe3@habana.ai>
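
A sketch of the coercion [SW-20808] describes: accept either a string or an Enum member and always return the Enum. `SaveLoadFormat` and its values are illustrative:

```python
from enum import Enum

class SaveLoadFormat(Enum):
    DEFAULT = "default"
    HUGGINGFACE = "huggingface"

def get_format(fmt) -> SaveLoadFormat:
    if isinstance(fmt, SaveLoadFormat):
        return fmt
    return SaveLoadFormat(fmt.lower())  # raises ValueError on unknown names

assert get_format("huggingface") is SaveLoadFormat.HUGGINGFACE
```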

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add xfail for torchvision

Signed-off-by: Xin He <xinhe3@habana.ai>

* fix ILITV-3859

Signed-off-by: xin3he <xin3.he@intel.com>

* workaround for ILITV-3858

Signed-off-by: xin3he <xin3.he@intel.com>

* fix sdxl_smooth_quant

Signed-off-by: xin3he <xin3.he@intel.com>

* fix ILITV-3854

Signed-off-by: xin3he <xin3.he@intel.com>

---------

Signed-off-by: Xin He <xinhe3@habana.ai>
Signed-off-by: changwang <changwang@habana.ai>
Signed-off-by: changwangss <changwang@habana.ai>
Signed-off-by: Uri Livne <ulivne@habana.ai>
Signed-off-by: yuwenzho <yuwen.zhou@intel.com>
Signed-off-by: Yi Liu <yiliu4@habana.ai>
Signed-off-by: Xin <xin3.he@intel.com>
Signed-off-by: xin3he <xin3.he@intel.com>
Co-authored-by: Xin He <xinhe3@habana.ai>
Co-authored-by: RafLit <rafal.litka@intel.com>
Co-authored-by: Rafal Litka <rlitka@habana.ai>
Co-authored-by: Dany Kiazada <141814181+kiazada@users.noreply.github.com>
Co-authored-by: Nir David <124874956+nirda7@users.noreply.github.com>
Co-authored-by: Yuwen Zhou <yuwen.zhou@intel.com>
Co-authored-by: Wang, Chang <changwang@habana.ai>
Co-authored-by: Uri Livne <ulivne@habana.ai>
Co-authored-by: Oz Abramovich <oabramovich@habana.ai>
Co-authored-by: Dudi Lester <160421192+dudilester@users.noreply.github.com>
Co-authored-by: Danny Semiat <dsemiat@habana.ai>
Co-authored-by: smarkovichgolan <smarkovich@habana.ai>
Co-authored-by: Yi Liu <yi4.liu@intel.com>
Co-authored-by: Yi Liu <yiliu4@habana.ai>
Co-authored-by: Linoy Buchnik <linoybu@gmail.com>
Co-authored-by: Nadav Elyahu <88962733+nelyahu@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: chen, suyue <suyue.chen@intel.com>
Co-authored-by: Sun, Xuehao <xuehao.sun@intel.com>