From 127d75cecb799faab9a98dfaef444ecbc999d9ac Mon Sep 17 00:00:00 2001
From: Kaihui-intel
Date: Fri, 11 Oct 2024 17:07:14 +0800
Subject: [PATCH] Support quant procedure on XPU (#2026)

Signed-off-by: Kaihui-intel
---
 docs/source/3x/transformers_like_api.md                |  2 ++
 .../transformers/weight_only/text-generation/README.md |  4 +++-
 neural_compressor/transformers/quantization/utils.py   | 10 ++++++----
 3 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/docs/source/3x/transformers_like_api.md b/docs/source/3x/transformers_like_api.md
index 9aafeed5278..55e8d964072 100644
--- a/docs/source/3x/transformers_like_api.md
+++ b/docs/source/3x/transformers_like_api.md
@@ -208,6 +208,8 @@ python run_generation_gpu_woq.py --woq --benchmark --model save_dir
 >Note:
 > * Saving quantized model should be executed before the optimize_transformers function is called.
 > * The optimize_transformers function is designed to optimize transformer-based models within frontend Python modules, with a particular focus on Large Language Models (LLMs). It provides optimizations for both model-wise and content-generation-wise. The detail of `optimize_transformers`, please refer to [the link](https://github.com/intel/intel-extension-for-pytorch/blob/xpu-main/docs/tutorials/llm/llm_optimize_transformers.md).
+> * The quantization process is performed on the CPU by default. Users can override this by setting the environment variable `INC_TARGET_DEVICE`, e.g. `export INC_TARGET_DEVICE=xpu` in bash.
+> * On Linux systems, configure the environment variables appropriately to achieve optimal performance; for example, set `OMP_NUM_THREADS` explicitly. For processors with hybrid architecture (including both P-cores and E-cores), it is recommended to bind tasks to all P-cores using `taskset`.
 
 ## Examples
 
diff --git a/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/transformers/weight_only/text-generation/README.md b/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/transformers/weight_only/text-generation/README.md
index 1abe2633ea3..f0760cc2fe1 100644
--- a/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/transformers/weight_only/text-generation/README.md
+++ b/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/transformers/weight_only/text-generation/README.md
@@ -103,6 +103,8 @@ python run_generate_cpu_woq.py \
 > 1. default search algorithm is beam search with num_beams = 1.
 > 2. [ipex.optimize_transformers](https://github.com/intel/intel-extension-for-pytorch/blob/v2.1.10%2Bxpu/docs/tutorials/llm/llm_optimize_transformers.md) Support for the optimized inference of model types "gptj," "mistral," "qwen," and "llama" to achieve high performance and accuracy. Ensure accurate inference for other model types as well.
 > 3. We provide compression technologies `WeightOnlyQuant` with `Rtn/GPTQ/AutoRound` algorithms and `load_in_4bit` and `load_in_8bit` work on intel GPU device.
+> 4. The quantization process is performed on the CPU by default. Users can override this by setting the environment variable `INC_TARGET_DEVICE`, e.g. `export INC_TARGET_DEVICE=xpu` in bash.
+> 5. On Linux systems, configure the environment variables appropriately to achieve optimal performance; for example, set `OMP_NUM_THREADS` explicitly. For processors with hybrid architecture (including both P-cores and E-cores), it is recommended to bind tasks to all P-cores using `taskset`.
 
 ## Prerequisite​
 ### Dependencies
@@ -111,7 +113,7 @@ Intel-extension-for-pytorch dependencies are in oneapi package, before install i
 ### Create Environment​
 
 Pytorch and Intel-extension-for-pytorch version for intel GPU > 2.1 are required, python version requests equal or higher than 3.9 due to [text evaluation library](https://github.com/EleutherAI/lm-evaluation-harness/tree/master) limitation, the dependent packages are listed in requirements_GPU.txt, we recommend create environment as the following steps. For Intel-exension-for-pytorch, we should install from source code now, and Intel-extension-for-pytorch will add weight-only quantization in the next version.
->**Note**: please install transformers==4.40.2.
+>**Note**: please install transformers==4.38.1.
 
 ```bash
 pip install -r requirements_GPU.txt
diff --git a/neural_compressor/transformers/quantization/utils.py b/neural_compressor/transformers/quantization/utils.py
index e81c3295bfa..877e3be89be 100644
--- a/neural_compressor/transformers/quantization/utils.py
+++ b/neural_compressor/transformers/quantization/utils.py
@@ -351,10 +351,12 @@ def convert_to_quantized_model(model, config, device="cpu"):
         import intel_extension_for_pytorch
 
         assert hasattr(torch, "xpu") and torch.xpu.is_available(), "There is no xpu device in this system!"
-        os.environ["INC_TARGET_DEVICE"] = "cpu"
-        logger.info(
-            "Set the environment variable INC_TARGET_DEVICE='cpu' to ensure the quantization process occurs on the CPU."
-        )
+        if "INC_TARGET_DEVICE" not in os.environ:
+            os.environ["INC_TARGET_DEVICE"] = "cpu"
+            logger.info(
+                "Set the environment variable INC_TARGET_DEVICE='cpu'"
+                " to ensure the quantization process occurs on the CPU."
+            )
     orig_dtype = torch.float32
     for param in model.parameters():
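To make the override concrete, here is a minimal bash sketch of the behavior this patch enables: `convert_to_quantized_model()` now only defaults `INC_TARGET_DEVICE` to `cpu` when the variable is unset, so a value exported beforehand is respected. The `export` line and the `--woq`/`--model` flags come from the patched docs; the model path and any other flags are illustrative, not part of this patch.

```bash
# Run the weight-only quantization procedure on the XPU instead of the CPU
# default. The guard added in utils.py leaves a pre-set value untouched.
export INC_TARGET_DEVICE=xpu
python run_generation_gpu_woq.py --woq --model <fp32_model_path>
```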
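Similarly, a hedged sketch of the Linux tuning note (docs bullet / README item 5). The thread count and core list below are assumptions for a hybrid machine whose P-cores happen to be logical CPUs 0-15; check the actual topology with `lscpu` before reusing them.

```bash
# Bind the run to the P-cores and size the OpenMP thread pool to match.
# Core IDs 0-15 and OMP_NUM_THREADS=16 are example values for this sketch,
# not recommendations from the patch itself.
export OMP_NUM_THREADS=16
taskset -c 0-15 python run_generation_gpu_woq.py --woq --model <fp32_model_path>
```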