Here you can find the inference benchmarking scripts for large language model (LLM) text generation. These scripts:
- Support Llama 2, GPT-J, Qwen, OPT, and Bloom model families
- Include both single instance and distributed (DeepSpeed) use cases
- Cover low-precision generation inference (FP16 AMP and weight-only quantization), selected per model for the best performance and accuracy
Currently, only Transformers 4.31.0 is supported. Support for newer versions of Transformers and more models will be available in the future.
MODEL FAMILY | Verified < MODEL ID > (Huggingface hub) | FP16 | Weight only quantization INT4 | Optimized on Intel® Data Center GPU Max Series (1550/1100) | Optimized on Intel® Arc™ A-Series Graphics (A770) |
---|---|---|---|---|---|
Llama 2 | "meta-llama/Llama-2-7b-hf", "meta-llama/Llama-2-13b-hf", "meta-llama/Llama-2-70b-hf" | ✅ | ✅ | ✅ | ✅ |
GPT-J | "EleutherAI/gpt-j-6b" | ✅ | ✅ | ✅ | ✅ |
Qwen | "Qwen/Qwen-7B" | ✅ | ✅ | ✅ | ✅ |
OPT | "facebook/opt-6.7b", "facebook/opt-30b" | ✅ | ❎ | ✅ | ❎ |
Bloom | "bigscience/bloom-7b1", "bigscience/bloom" | ✅ | ❎ | ✅ | ❎ |
Note: The verified models listed above (including other models in the same model family, like "codellama/CodeLlama-7b-hf" from the Llama family) are well supported with all optimizations, such as indirect access KV cache and fused ROPE. For other LLM families, we are actively working to implement these optimizations, which will be reflected in the expanded model list above.
* Intel® Data Center GPU Max Series (1550/1100) and Intel® Arc™ A-Series Graphics (A770): support all the models in the model list above.
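The support table above can also be expressed as a small lookup, which is handy when a wrapper script needs to skip unsupported combinations. This is only a sketch: the dictionary keys (`fp16`, `woq_int4`, `max_series`, `arc_a770`) and the helper `is_supported` are hypothetical names, not part of the benchmarking scripts, and ❎ is read here as "not supported".

```python
# Support matrix transcribed from the table above.
# Key names and the helper are illustrative, not part of the scripts.
SUPPORT = {
    "Llama 2": {"fp16": True, "woq_int4": True, "max_series": True, "arc_a770": True},
    "GPT-J": {"fp16": True, "woq_int4": True, "max_series": True, "arc_a770": True},
    "Qwen": {"fp16": True, "woq_int4": True, "max_series": True, "arc_a770": True},
    "OPT": {"fp16": True, "woq_int4": False, "max_series": True, "arc_a770": False},
    "Bloom": {"fp16": True, "woq_int4": False, "max_series": True, "arc_a770": False},
}


def is_supported(family: str, feature: str) -> bool:
    """Return True if the table marks `feature` as supported for `family`."""
    return SUPPORT.get(family, {}).get(feature, False)


if __name__ == "__main__":
    for family, features in SUPPORT.items():
        enabled = [name for name, ok in features.items() if ok]
        print(f"{family}: {', '.join(enabled)}")
```

Unknown families or features fall back to `False`, so a caller never assumes support that the table does not state.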
# Get the Intel® Extension for PyTorch* source code
git clone https://github.com/intel/intel-extension-for-pytorch.git
cd intel-extension-for-pytorch
git checkout release/xpu/2.1.30
git submodule sync
git submodule update --init --recursive
# Build an image with the provided Dockerfile by installing from Intel® Extension for PyTorch* prebuilt wheel files
docker build -f examples/gpu/inference/python/llm/Dockerfile --build-arg GID_RENDER=$(getent group render | sed -E 's,^render:[^:]*:([^:]*):.*$,\1,') -t ipex-llm:2.1.30 .
# Run the container with command below
docker run --privileged -it --rm --device /dev/dri:/dev/dri -v /dev/dri/by-path:/dev/dri/by-path \
--ipc=host --net=host --cap-add=ALL -v /lib/modules:/lib/modules --workdir /workspace \
--volume `pwd`/examples/gpu/inference/python/llm/:/workspace/llm ipex-llm:2.1.30 /bin/bash
# When the command prompt shows inside the docker container, enter llm examples directory
cd llm
# Activate environment variables
source ./tools/env_activate.sh
# Get the Intel® Extension for PyTorch* source code
git clone https://github.com/intel/intel-extension-for-pytorch.git
cd intel-extension-for-pytorch
git checkout release/xpu/2.1.30
git submodule sync
git submodule update --init --recursive
# Build an image with the provided Dockerfile by compiling Intel® Extension for PyTorch* from source
docker build -f examples/gpu/inference/python/llm/Dockerfile --build-arg GID_RENDER=$(getent group render | sed -E 's,^render:[^:]*:([^:]*):.*$,\1,') --build-arg COMPILE=ON -t ipex-llm:2.1.30 .
# Run the container with command below
docker run --privileged -it --rm --device /dev/dri:/dev/dri -v /dev/dri/by-path:/dev/dri/by-path \
--ipc=host --net=host --cap-add=ALL -v /lib/modules:/lib/modules --workdir /workspace \
--volume `pwd`/examples/gpu/inference/python/llm/:/workspace/llm ipex-llm:2.1.30 /bin/bash
# When the command prompt shows inside the docker container, enter llm examples directory
cd llm
# Activate environment variables
source ./tools/env_activate.sh
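The `GID_RENDER` build argument in the `docker build` commands above is filled by a `sed` expression that pulls the numeric group ID out of the `getent group render` output. A minimal Python sketch of the same extraction follows; the helper name `render_gid` and the sample line (GID 109, users alice and bob) are made up for illustration.

```python
import re

# Same pattern as the sed expression passed to --build-arg GID_RENDER.
_RENDER_LINE = re.compile(r"^render:[^:]*:([^:]*):.*$")


def render_gid(group_line: str) -> str:
    """Extract the GID field from a `getent group render` line.

    Like the sed substitution, a line that does not match is returned unchanged.
    """
    return _RENDER_LINE.sub(r"\1", group_line)


if __name__ == "__main__":
    # Hypothetical sample line in group-database format: name:password:GID:members
    print(render_gid("render:x:109:alice,bob"))
```

If the host has no `render` group, `getent` prints nothing and the build argument ends up empty, so it is worth checking the value before building.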
If you are not using a Docker container, make sure the driver and Base Toolkit are installed. Refer to the Installation Guide.
# Get the Intel® Extension for PyTorch* source code
git clone https://github.com/intel/intel-extension-for-pytorch.git
cd intel-extension-for-pytorch
git checkout release/xpu/2.1.30
git submodule sync
git submodule update --init --recursive
# Make sure GCC >= 11 is installed on your system.
# Create a conda environment
conda create -n llm python=3.10 -y
conda activate llm
conda install pkg-config
# Setup the environment with the provided script
cd examples/gpu/inference/python/llm
# If you want to install Intel® Extension for PyTorch* from source, use the commands below:
bash ./tools/env_setup.sh 3 <DPCPP_ROOT> <ONEMKL_ROOT> <ONECCL_ROOT> <MPI_ROOT> <AOT>
export LD_PRELOAD=$(bash ../../../../../tools/get_libstdcpp_lib.sh)
export LD_LIBRARY_PATH=${CONDA_PREFIX}/lib:${LD_LIBRARY_PATH}
source ./tools/env_activate.sh
where `AOT` is a text string to enable Ahead-Of-Time compilation for specific GPU models. Check tutorial for details.
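Before running the setup script, it can help to confirm the GCC >= 11 requirement mentioned in the setup steps above. The sketch below is one way to do that; the helper names `gcc_major` and `gcc_is_new_enough` are hypothetical, and it assumes `gcc` on `PATH` is the compiler the build will use.

```python
import re
import shutil
import subprocess

MIN_GCC_MAJOR = 11  # requirement stated in the setup steps above


def gcc_major(version_text: str) -> int:
    """Return the major version from text such as '11.4.0' or '12'."""
    match = re.match(r"\s*(\d+)", version_text)
    if match is None:
        raise ValueError(f"unrecognized GCC version: {version_text!r}")
    return int(match.group(1))


def gcc_is_new_enough(version_text: str) -> bool:
    """Check the major version against the documented minimum."""
    return gcc_major(version_text) >= MIN_GCC_MAJOR


if __name__ == "__main__":
    gcc = shutil.which("gcc")
    if gcc is None:
        print("gcc not found on PATH")
    else:
        # `gcc -dumpversion` prints the compiler version, e.g. "11" or "11.4.0".
        result = subprocess.run([gcc, "-dumpversion"], capture_output=True, text=True)
        version = result.stdout.strip()
        if result.returncode != 0 or not re.match(r"\d+", version):
            print("could not determine the GCC version")
        else:
            verdict = "OK" if gcc_is_new_enough(version) else "too old"
            print(f"GCC {version}: {verdict}")
```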
Benchmark mode | FP16 | Weight only quantization INT4 |
---|---|---|
Single instance | ✅ | ✅ |
Distributed (autotp) | ✅ | ❎ |
Note: During execution, you may need to log in to your Hugging Face account to access model files. Refer to HuggingFace Login.
Run all inference cases with the one-click bash script `run_benchmark.sh`:
bash run_benchmark.sh
Note: We only support LLM optimizations with datatype float16, so please don't change datatype to float32 or bfloat16.
# fp16 benchmark
python -u run_generation.py --benchmark -m ${model} --num-beams ${beam} --num-iter ${iter} --batch-size ${bs} --input-tokens ${input} --max-new-tokens ${output} --device xpu --ipex --dtype float16 --token-latency
Note: By default, generation uses bs = 1, input token size = 1024, output token size = 128, iteration num = 10, and beam search with beam size = 4. For beam size = 1 and other settings, export environment variables such as `beam=1`, `input=32`, `output=32`, `iter=5`.
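To make the relationship between the environment settings and the command line concrete, here is a minimal sketch that fills the fp16 benchmark command shown above from the documented defaults. The function `benchmark_command` is hypothetical, and reading the values from environment variables named `bs`, `input`, `output`, `iter`, and `beam` is an assumption based on the note, not a description of how `run_benchmark.sh` is implemented.

```python
import os

# Defaults taken from the note above. Reading them from environment
# variables mirrors the note but is an assumption about the script.
DEFAULTS = {"bs": "1", "input": "1024", "output": "128", "iter": "10", "beam": "4"}


def benchmark_command(model: str, env=None) -> list:
    """Build the fp16 run_generation.py command shown above."""
    env = os.environ if env is None else env
    cfg = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    return [
        "python", "-u", "run_generation.py", "--benchmark",
        "-m", model,
        "--num-beams", cfg["beam"],
        "--num-iter", cfg["iter"],
        "--batch-size", cfg["bs"],
        "--input-tokens", cfg["input"],
        "--max-new-tokens", cfg["output"],
        "--device", "xpu", "--ipex", "--dtype", "float16", "--token-latency",
    ]


if __name__ == "__main__":
    # Beam size 1 with short prompts, as in the example settings above.
    overrides = {"beam": "1", "input": "32", "output": "32", "iter": "5"}
    print(" ".join(benchmark_command("EleutherAI/gpt-j-6b", overrides)))
```

The datatype is hard-coded to `float16` because the note above states that it is the only supported datatype for these optimizations.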
For all distributed inference cases, run LLM with the one-click bash script `run_benchmark_ds.sh`:
bash run_benchmark_ds.sh
# fp16 benchmark
mpirun -np 2 --prepend-rank python -u run_generation_with_deepspeed.py --benchmark -m ${model} --num-beams ${beam} --num-iter ${iter} --batch-size ${bs} --input-tokens ${input} --max-new-tokens ${output} --device xpu --ipex --dtype float16 --token-latency
Note: By default, generation uses bs = 1, input token size = 1024, output token size = 128, iteration num = 10, and beam search with beam size = 4. For beam size = 1 and other settings, export environment variables such as `beam=1`, `input=32`, `output=32`, `iter=5`.
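The distributed command above differs from the single-instance one only in its `mpirun` launcher prefix and script name. The sketch below assembles it as a shell-safe string; `distributed_command` is a hypothetical helper, and the `ranks` parameter is an assumption for illustration (the documented command uses 2 ranks, and other values would need a matching number of devices).

```python
import shlex


def distributed_command(model: str, ranks: int = 2, beam: int = 4, iters: int = 10,
                        batch_size: int = 1, input_tokens: int = 1024,
                        max_new_tokens: int = 128) -> str:
    """Return the mpirun command line for the fp16 DeepSpeed benchmark."""
    args = [
        "mpirun", "-np", str(ranks), "--prepend-rank",
        "python", "-u", "run_generation_with_deepspeed.py", "--benchmark",
        "-m", model,
        "--num-beams", str(beam),
        "--num-iter", str(iters),
        "--batch-size", str(batch_size),
        "--input-tokens", str(input_tokens),
        "--max-new-tokens", str(max_new_tokens),
        "--device", "xpu", "--ipex", "--dtype", "float16", "--token-latency",
    ]
    # shlex.join quotes any argument that contains shell-special characters.
    return shlex.join(args)


if __name__ == "__main__":
    print(distributed_command("facebook/opt-30b"))
```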
For the accuracy test, choose {TASK_NAME} from the tasks listed in this [link](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/docs/task_table.md). By default, "lambada_standard" is used.
# one-click bash script
bash run_accuracy.sh
# float16
LLM_ACC_TEST=1 python -u run_generation.py -m ${model} --ipex --dtype float16 --accuracy-only --acc-tasks ${task}
# one-click bash script
bash run_accuracy_ds.sh
# float16
LLM_ACC_TEST=1 mpirun -np 2 --prepend-rank python -u run_generation_with_deepspeed.py -m ${model} --ipex --dtype float16 --accuracy-only --acc-tasks ${task} 2>&1
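Both accuracy commands above set `LLM_ACC_TEST=1` in the environment and pass the task through `--acc-tasks`. A minimal sketch of preparing that invocation from Python follows; `accuracy_run` is a hypothetical helper, and it only builds the argument list and environment rather than launching anything.

```python
import os


def accuracy_run(model: str, task: str = "lambada_standard", distributed: bool = False):
    """Return (argv, env) for the float16 accuracy test shown above."""
    script = "run_generation_with_deepspeed.py" if distributed else "run_generation.py"
    launcher = ["mpirun", "-np", "2", "--prepend-rank"] if distributed else []
    argv = launcher + [
        "python", "-u", script,
        "-m", model,
        "--ipex", "--dtype", "float16",
        "--accuracy-only", "--acc-tasks", task,
    ]
    # Copy the current environment and add the variable used by the commands above.
    env = dict(os.environ, LLM_ACC_TEST="1")
    return argv, env


if __name__ == "__main__":
    argv, env = accuracy_run("meta-llama/Llama-2-7b-hf")
    print("LLM_ACC_TEST=" + env["LLM_ACC_TEST"], " ".join(argv))
    # To launch for real: subprocess.run(argv, env=env, check=True)
```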
Using INT4 weights can further improve performance by reducing memory bandwidth. However, direct per-channel quantization of weights to INT4 may result in poor accuracy. Some algorithms modify the weights through calibration before quantizing them, to minimize the accuracy drop. With such algorithms, you can generate modified weights and quantization info (scales, zero points) for Llama 2/GPT-J/Qwen models, using a dataset for the specified tasks. We recommend Intel® Extension for Transformers to quantize the LLM model.
Check WOQ INT4 for more details.
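To illustrate what "scales" and "zero points" mean here, the following sketch applies plain round-to-nearest, asymmetric INT4 quantization to one weight channel. It is the direct per-channel baseline that the text says can lose accuracy, not one of the calibration-based algorithms, and it is not the implementation used by Intel® Extension for Transformers. The function names and sample weights are made up for illustration.

```python
def quantize_channel_int4(weights):
    """Asymmetric round-to-nearest INT4 quantization of one channel.

    Returns (codes, scale, zero_point) with every code in [0, 15].
    """
    # Extend the range to include 0.0 so the zero point fits in [0, 15].
    lo = min(min(weights), 0.0)
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for an all-zero channel
    zero_point = round(-lo / scale)
    codes = [min(15, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_channel(codes, scale, zero_point):
    """Map INT4 codes back to approximate floating-point weights."""
    return [(c - zero_point) * scale for c in codes]


if __name__ == "__main__":
    channel = [-0.62, -0.11, 0.03, 0.27, 0.91]  # made-up weights
    codes, scale, zero_point = quantize_channel_int4(channel)
    restored = dequantize_channel(codes, scale, zero_point)
    worst = max(abs(a - b) for a, b in zip(channel, restored))
    print("codes:", codes)
    print(f"scale={scale:.4f} zero_point={zero_point} max_error={worst:.4f}")
```

Each weight is stored as a 4-bit code, and the per-channel `scale` and `zero_point` are what the text calls quantization info. The rounding error is bounded by about half the scale, which is why a channel with a wide value range loses more precision.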
pip install neural-compressor
pip install intel-extension-for-transformers
pip install tiktoken einops transformers_stream_generator
bash run_benchmark_woq.sh
Note:
- Saving quantized model should be executed before the optimize_transformers function is called.
- The `optimize_transformers` function is designed to optimize transformer-based models within frontend Python modules, with a particular focus on Large Language Models (LLMs). It provides both model-wise and content-generation-wise optimizations. For details of `optimize_transformers`, refer to Transformers Optimization Frontend API.