2 changes: 2 additions & 0 deletions docs/source/3x/transformers_like_api.md
@@ -208,6 +208,8 @@ python run_generation_gpu_woq.py --woq --benchmark --model save_dir
>Note:
> * Saving the quantized model should be done before the `optimize_transformers` function is called.
> * The `optimize_transformers` function is designed to optimize transformer-based models within frontend Python modules, with a particular focus on Large Language Models (LLMs). It provides both model-wise and content-generation-wise optimizations. For details of `optimize_transformers`, please refer to [the link](https://github.com/intel/intel-extension-for-pytorch/blob/xpu-main/docs/tutorials/llm/llm_optimize_transformers.md).
> * The quantization process is performed on the CPU by default. Users can override this by setting the environment variable `INC_TARGET_DEVICE`, e.g. in bash: `export INC_TARGET_DEVICE=xpu`.
> * On Linux systems, users need to configure the environment variables appropriately to achieve optimal performance, e.g. set `OMP_NUM_THREADS` explicitly. For processors with a hybrid architecture (both P-cores and E-cores), it is recommended to bind tasks to all P-cores using `taskset`, as in the sketch below.
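A minimal sketch of this environment setup in Python, assuming a hypothetical machine whose P-cores are logical CPUs 0-7 (adjust the CPU IDs and thread count for your processor):

```python
import os

# Run quantization on the XPU instead of the default CPU path.
os.environ.setdefault("INC_TARGET_DEVICE", "xpu")

# Pin the OpenMP thread count; set this before any OpenMP-backed library initializes.
os.environ.setdefault("OMP_NUM_THREADS", "8")

# Linux-only: restrict this process to the P-cores, equivalent to `taskset -c 0-7`.
os.sched_setaffinity(0, range(8))
```

The same effect can be achieved from the shell with `export OMP_NUM_THREADS=8` and `taskset -c 0-7 python run_generation_gpu_woq.py ...`.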

## Examples

@@ -103,6 +103,8 @@ python run_generate_cpu_woq.py \
> 1. The default search algorithm is beam search with `num_beams = 1`.
> 2. [ipex.optimize_transformers](https://github.com/intel/intel-extension-for-pytorch/blob/v2.1.10%2Bxpu/docs/tutorials/llm/llm_optimize_transformers.md) supports optimized inference for the model types "gptj," "mistral," "qwen," and "llama" to achieve high performance and accuracy, and it keeps inference accurate for other model types as well.
> 3. We provide the `WeightOnlyQuant` compression technology with the `Rtn`/`GPTQ`/`AutoRound` algorithms; `load_in_4bit` and `load_in_8bit` also work on the Intel GPU device (see the sketch after this list).
> 4. The quantization process is performed on the CPU by default. Users can override this by setting the environment variable `INC_TARGET_DEVICE`, e.g. in bash: `export INC_TARGET_DEVICE=xpu`.
> 5. On Linux systems, users need to configure the environment variables appropriately to achieve optimal performance, e.g. set `OMP_NUM_THREADS` explicitly. For processors with a hybrid architecture (both P-cores and E-cores), it is recommended to bind tasks to all P-cores using `taskset`.
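A minimal sketch of 4-bit weight-only quantization through this transformers-like API; the checkpoint is a placeholder, and `RtnConfig`/`AutoModelForCausalLM` are assumed to be the entry points this document describes:

```python
from transformers import AutoTokenizer

# Assumed entry points of the transformers-like API documented here.
from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig

model_name = "facebook/opt-125m"  # placeholder checkpoint
woq_config = RtnConfig(bits=4)    # round-to-nearest 4-bit weight-only quantization
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config)

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("Once upon a time", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0], skip_special_tokens=True))
```

Per note 3 above, `load_in_4bit=True` or `load_in_8bit=True` can be passed to `from_pretrained` instead of an explicit config.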

## Prerequisite
### Dependencies
@@ -111,7 +113,7 @@ Intel-extension-for-pytorch dependencies are in oneapi package, before install i
### Create Environment
PyTorch and Intel-extension-for-pytorch versions greater than 2.1 are required for the Intel GPU, and Python 3.9 or higher is required due to a [text evaluation library](https://github.com/EleutherAI/lm-evaluation-harness/tree/master) limitation. The dependent packages are listed in requirements_GPU.txt; we recommend creating the environment with the following steps. For now, Intel-extension-for-pytorch must be installed from source code; weight-only quantization will be added in its next release.

```diff
->**Note**: please install transformers==4.40.2.
+>**Note**: please install transformers==4.38.1.
```

```bash
pip install -r requirements_GPU.txt
```
10 changes: 6 additions & 4 deletions neural_compressor/transformers/quantization/utils.py
@@ -351,10 +351,12 @@ def convert_to_quantized_model(model, config, device="cpu"):
```diff
         import intel_extension_for_pytorch

         assert hasattr(torch, "xpu") and torch.xpu.is_available(), "There is no xpu device in this system!"
-        os.environ["INC_TARGET_DEVICE"] = "cpu"
-        logger.info(
-            "Set the environment variable INC_TARGET_DEVICE='cpu' to ensure the quantization process occurs on the CPU."
-        )
+        if "INC_TARGET_DEVICE" not in os.environ:
+            os.environ["INC_TARGET_DEVICE"] = "cpu"
+            logger.info(
+                "Set the environment variable INC_TARGET_DEVICE='cpu'"
+                " to ensure the quantization process occurs on the CPU."
+            )

     orig_dtype = torch.float32
     for param in model.parameters():
```
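With this guard, a value the user exports beforehand is no longer overwritten. A minimal sketch of the resulting behavior (the `xpu` value is just an example):

```python
import os

# The user opts in to XPU quantization before conversion runs...
os.environ["INC_TARGET_DEVICE"] = "xpu"

# ...so the guarded default no longer clobbers the user's choice.
if "INC_TARGET_DEVICE" not in os.environ:
    os.environ["INC_TARGET_DEVICE"] = "cpu"

assert os.environ["INC_TARGET_DEVICE"] == "xpu"
```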