Serving using IPEX-LLM and vLLM on Intel GPU

vLLM is a fast and easy-to-use library for LLM inference and serving. You can find detailed information on its homepage.

IPEX-LLM can be integrated into vLLM so that users can use IPEX-LLM to boost the performance of the vLLM engine on Intel GPUs (e.g., a local PC with a discrete GPU such as Arc, Flex or Max).

Currently, IPEX-LLM integrated vLLM only supports the following models:

  • Qwen series models
  • Llama series models
  • ChatGLM series models
  • Baichuan series models


Quick Start

This quickstart guide walks you through installing and running vLLM with ipex-llm.

1. Install IPEX-LLM for vLLM

IPEX-LLM's support for vLLM is currently available only on Linux.

Visit Install IPEX-LLM on Linux with Intel GPU and follow the instructions in section Install Prerequisites to install the prerequisites needed for running code on Intel GPUs.

Then, follow the instructions in section Install ipex-llm to install ipex-llm[xpu] and set up the recommended runtime configurations.

After the installation, you should have created a conda environment, named ipex-vllm for instance, for running vLLM commands with IPEX-LLM.

2. Install vLLM

Currently, we maintain a specific branch of vLLM, which only works on Intel GPUs.

Activate the ipex-vllm conda environment and install vLLM by executing the commands below.

conda activate ipex-vllm
source /opt/intel/oneapi/setvars.sh
git clone -b sycl_xpu https://github.com/analytics-zoo/vllm.git
cd vllm
pip install -r requirements-xpu.txt
pip install --no-deps xformers
VLLM_BUILD_XPU_OPS=1 pip install --no-build-isolation -v -e .
pip install outlines==0.0.34 --no-deps
pip install interegular cloudpickle diskcache joblib lark nest-asyncio numba scipy
# For Qwen model support
pip install transformers_stream_generator einops tiktoken

Now you are all set to use vLLM with IPEX-LLM.
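As a quick sanity check after the installation steps above, the following sketch reports whether the main packages can be located in the active environment. It assumes the importable module names are `ipex_llm`, `vllm` and `xformers`; the helper `check_modules` is hypothetical and not part of either project.

```python
import importlib.util


def check_modules(names):
    """Return {module_name: bool} telling whether each top-level module can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Module names are assumptions based on the packages installed above.
    for name, found in check_modules(["ipex_llm", "vllm", "xformers"]).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

This only confirms that the packages are discoverable; it does not verify that the XPU kernels were built correctly.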

3. Offline Inference/Service

Offline inference

To run offline inference using vLLM for a quick impression, use the following example.

Note

Please modify the MODEL_PATH in offline_inference.py to use your chosen model.

You can try setting load_in_low_bit to any of the values in [sym_int4, fp6, fp8, fp8_e4m3, fp16] to use a different quantization dtype.

#!/bin/bash
wget https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/example/GPU/vLLM-Serving/offline_inference.py
python offline_inference.py

For instructions on how to change the load_in_low_bit value in offline_inference.py, check the following example:

llm = LLM(model="YOUR_MODEL",
          device="xpu",
          dtype="float16",
          enforce_eager=True,
          # Simply change here for the desired load_in_low_bit value
          load_in_low_bit="sym_int4",
          tensor_parallel_size=1,
          trust_remote_code=True)
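If you switch between low-bit formats often, a small helper can catch typos before the model starts loading. The sketch below is only an illustration: `build_llm_kwargs` is a hypothetical helper, and the set of accepted values is copied from the list in the note above.

```python
# Values listed in this guide for load_in_low_bit.
SUPPORTED_LOW_BIT = ("sym_int4", "fp6", "fp8", "fp8_e4m3", "fp16")


def build_llm_kwargs(model_path, load_in_low_bit="sym_int4", tensor_parallel_size=1):
    """Build the keyword arguments used in the LLM(...) call shown above."""
    if load_in_low_bit not in SUPPORTED_LOW_BIT:
        raise ValueError(
            f"unsupported load_in_low_bit {load_in_low_bit!r}; "
            f"expected one of {SUPPORTED_LOW_BIT}"
        )
    return {
        "model": model_path,
        "device": "xpu",
        "dtype": "float16",
        "enforce_eager": True,
        "load_in_low_bit": load_in_low_bit,
        "tensor_parallel_size": tensor_parallel_size,
        "trust_remote_code": True,
    }


# Usage inside offline_inference.py, where LLM is already imported:
# llm = LLM(**build_llm_kwargs("YOUR_MODEL", "fp8"))
```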

The result of executing Baichuan2-7B-Chat model with sym_int4 low-bit format is shown as follows:

Prompt: 'Hello, my name is', Generated text: ' [Your Name] and I am a [Your Job Title] at [Your'
Prompt: 'The president of the United States is', Generated text: ' the head of state and head of government in the United States. The president leads'
Prompt: 'The capital of France is', Generated text: ' Paris.\nThe capital of France is Paris.'
Prompt: 'The future of AI is', Generated text: " bright, but it's not without challenges. As AI continues to evolve,"

Service

Note

Because kernels are JIT-compiled, we recommend sending a few warm-up requests before using the service to get the best performance.

To fully utilize vLLM's continuous batching feature, you can send requests to the service using curl or similar tools. Requests sent to the engine are batched at the token level: queries are executed in the same forward step of the LLM, and each one is removed as soon as it finishes instead of waiting for all sequences to finish.
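To make the benefit of token-level batching concrete, here is a toy simulation. It is a simplified model, not vLLM's actual scheduler: each request is reduced to the number of decoding steps it needs, and both function names are hypothetical. It counts forward steps when finished sequences leave the batch immediately, and compares that with batches that wait for their longest member.

```python
from collections import deque


def continuous_batching_steps(lengths, max_num_seqs):
    """Forward steps needed when finished sequences free their slot right away."""
    waiting = deque(lengths)
    running = []
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < max_num_seqs:
            running.append(waiting.popleft())
        # One forward step generates one token for every running sequence.
        running = [remaining - 1 for remaining in running]
        # Drop sequences that have produced all of their tokens.
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps


def static_batching_steps(lengths, max_num_seqs):
    """Forward steps needed when each batch waits for its longest sequence."""
    return sum(
        max(lengths[i:i + max_num_seqs])
        for i in range(0, len(lengths), max_num_seqs)
    )


if __name__ == "__main__":
    lengths = [2, 8, 3, 4]  # tokens to generate per request (assumed positive)
    print(continuous_batching_steps(lengths, max_num_seqs=2))  # 9
    print(static_batching_steps(lengths, max_num_seqs=2))      # 12
```

In this example the short requests no longer hold a slot while the 8-token request finishes, so the same work completes in fewer forward steps.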

For vLLM, you can start the service using the following command:

#!/bin/bash
model="YOUR_MODEL_PATH"
served_model_name="YOUR_MODEL_NAME"

 # You may need to adjust the value of
 # --max-model-len, --max-num-batched-tokens, --max-num-seqs
 # to acquire the best performance

 # Change value --load-in-low-bit to [fp6, fp8, fp8_e4m3, fp16] to use different low-bit formats
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name $served_model_name \
  --port 8000 \
  --model $model \
  --trust-remote-code \
  --gpu-memory-utilization 0.75 \
  --device xpu \
  --dtype float16 \
  --enforce-eager \
  --load-in-low-bit sym_int4 \
  --max-model-len 4096 \
  --max-num-batched-tokens 10240 \
  --max-num-seqs 12 \
  --tensor-parallel-size 1

You can tune the service using these four arguments:

  1. --gpu-memory-utilization: The fraction of GPU memory to be used for the model executor, which can range from 0 to 1. For example, a value of 0.5 would imply 50% GPU memory utilization. If unspecified, will use the default value of 0.9.
  2. --max-model-len: Model context length. If unspecified, will be automatically derived from the model config.
  3. --max-num-batched-tokens: Maximum number of batched tokens per iteration.
  4. --max-num-seqs: Maximum number of sequences per iteration. Default: 256
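As rough arithmetic for the first argument, the sketch below computes the memory budget implied by a utilization fraction and compares it with an estimate of the weight size. The numbers are assumptions for illustration only: a hypothetical 16 GiB card, a 7B-parameter model, and 4 bits per weight with quantization scales and other overhead ignored.

```python
def memory_budget_gib(total_gib, gpu_memory_utilization):
    """Memory the model executor may use, given the utilization fraction."""
    if not 0 < gpu_memory_utilization <= 1:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return total_gib * gpu_memory_utilization


def weight_size_gib(num_params, bits_per_weight):
    """Approximate weight footprint, ignoring quantization metadata."""
    return num_params * bits_per_weight / 8 / 2**30


if __name__ == "__main__":
    budget = memory_budget_gib(16, 0.75)   # hypothetical 16 GiB card
    weights = weight_size_gib(7e9, 4)      # hypothetical 7B model at 4 bits
    print(f"budget: {budget:.2f} GiB")
    print(f"weights: {weights:.2f} GiB")
    print(f"left for activations and KV cache: {budget - weights:.2f} GiB")
```

Whatever remains inside the budget after weights and activations is roughly what is available for the KV cache, which is why a lower utilization value reduces how many tokens can be held in flight.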

For longer input prompts, we suggest using --max-num-batched-tokens to restrict the service. Peak GPU memory usage occurs when generating the first token, and --max-num-batched-tokens limits the input size processed at that stage.

--max-num-seqs applies both when generating the first token and when generating subsequent tokens: the batch never holds more sequences than the value set by --max-num-seqs.
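The interaction of the two limits can be pictured with a greedy packing sketch. This is a deliberately simplified model and not vLLM's scheduler; `plan_prefill_batches` is a hypothetical helper that groups prompts for first-token generation so that no group exceeds either the token budget or the sequence limit.

```python
def plan_prefill_batches(prompt_lens, max_num_batched_tokens, max_num_seqs):
    """Greedily group prompt lengths under a token budget and a sequence cap."""
    batches, current, tokens = [], [], 0
    for n in prompt_lens:
        if n > max_num_batched_tokens:
            # Simplification: treat an oversized prompt as an error.
            raise ValueError(f"prompt of {n} tokens exceeds the token budget")
        over_tokens = tokens + n > max_num_batched_tokens
        over_seqs = len(current) >= max_num_seqs
        if current and (over_tokens or over_seqs):
            batches.append(current)
            current, tokens = [], 0
        current.append(n)
        tokens += n
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Limits taken from the example command above.
    print(plan_prefill_batches([4000, 4000, 4000, 100], 10240, 12))
    # [[4000, 4000], [4000, 100]]
```

With long prompts the token budget is the binding limit; with many short prompts the sequence cap takes over.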

When an out-of-memory error occurs, the most direct fix is to reduce --gpu-memory-utilization. Alternatively, lower --max-num-batched-tokens if the peak occurs while generating the first token, or lower --max-num-seqs if it occurs while generating subsequent tokens.

If the service has booted successfully, the console will display messages similar to the following:

After the service has booted successfully, you can send a test request using curl. Here, YOUR_MODEL should be set to the value of $served_model_name in your startup script, e.g. Qwen1.5.

curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "YOUR_MODEL",
  "prompt": "San Francisco is a",
  "max_tokens": 128,
  "temperature": 0
}' | jq '.choices[0].text'
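The same request can be sent from Python using only the standard library. The sketch below mirrors the curl command above; the helper names are hypothetical, and `warm_up` is one possible way to send the few warm-up requests recommended earlier.

```python
import json
import urllib.request


def build_completion_request(model, prompt, base_url="http://localhost:8000",
                             max_tokens=128, temperature=0):
    """Build a POST request equivalent to the curl command above."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(model, prompt, **kwargs):
    """Send the request and return the generated text (choices[0].text)."""
    request = build_completion_request(model, prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["choices"][0]["text"]


def warm_up(model, rounds=3):
    """Send a few short requests so JIT-compiled kernels are ready."""
    for _ in range(rounds):
        complete(model, "Hello", max_tokens=16)


# Usage, once the service is running:
# warm_up("YOUR_MODEL")
# print(complete("YOUR_MODEL", "San Francisco is a"))
```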

An example output using Qwen1.5-7B-Chat with the sym_int4 low-bit format is shown below:

Tip

If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionally set the following environment variable for optimal performance before starting the service:

export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1

4. About Tensor Parallel

Note

We recommend using Docker for tensor parallel deployment. Check our serving Docker image intelanalytics/ipex-llm-serving-xpu.

Tensor parallelism across multiple Intel GPU cards is also supported. To enable it, you need to install libfabric-dev in your environment. On Ubuntu, you can install it with:

sudo apt-get install libfabric-dev

To deploy your model across multiple cards, simply change the value of --tensor-parallel-size to the desired value.

For instance, if you have two Arc A770 cards in your environment, you can set this value to 2. Some oneCCL environment variables also need to be set, as in the following example:

#!/bin/bash
model="YOUR_MODEL_PATH"
served_model_name="YOUR_MODEL_NAME"

# CCL needed environment variables
export CCL_WORKER_COUNT=2
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
 # You may need to adjust the value of
 # --max-model-len, --max-num-batched-tokens, --max-num-seqs
 # to acquire the best performance

python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name $served_model_name \
  --port 8000 \
  --model $model \
  --trust-remote-code \
  --gpu-memory-utilization 0.75 \
  --device xpu \
  --dtype float16 \
  --enforce-eager \
  --load-in-low-bit sym_int4 \
  --max-model-len 4096 \
  --max-num-batched-tokens 10240 \
  --max-num-seqs 12 \
  --tensor-parallel-size 2
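If you prefer to launch the service from Python rather than a shell script, the sketch below assembles the same environment variables and arguments. The flag names and values are copied from the script above; `build_env` and `build_server_command` are hypothetical helpers, and the launch call is left commented out because it needs the full environment from the earlier steps.

```python
import os
import sys

# oneCCL settings copied verbatim from the shell script above.
CCL_ENV = {
    "CCL_WORKER_COUNT": "2",
    "FI_PROVIDER": "shm",
    "CCL_ATL_TRANSPORT": "ofi",
    "CCL_ZE_IPC_EXCHANGE": "sockets",
    "CCL_ATL_SHM": "1",
}


def build_env(base_env=None):
    """Return a copy of the environment with the oneCCL variables added."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(CCL_ENV)
    return env


def build_server_command(model, served_model_name, tensor_parallel_size=2, port=8000):
    """Assemble the api_server command line shown above as an argument list."""
    return [
        sys.executable, "-m", "ipex_llm.vllm.xpu.entrypoints.openai.api_server",
        "--served-model-name", served_model_name,
        "--port", str(port),
        "--model", model,
        "--trust-remote-code",
        "--gpu-memory-utilization", "0.75",
        "--device", "xpu",
        "--dtype", "float16",
        "--enforce-eager",
        "--load-in-low-bit", "sym_int4",
        "--max-model-len", "4096",
        "--max-num-batched-tokens", "10240",
        "--max-num-seqs", "12",
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]


# To launch (blocks until the server exits):
# import subprocess
# subprocess.run(build_server_command("YOUR_MODEL_PATH", "YOUR_MODEL_NAME"),
#                env=build_env(), check=True)
```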

If the service has booted successfully, you should see output similar to the following figure:

5. Performing Benchmark

To benchmark throughput, you can use the benchmark_throughput script originally provided by the vLLM repository.

conda activate ipex-vllm

source /opt/intel/oneapi/setvars.sh

wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

wget https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/docker/llm/serving/xpu/docker/benchmark_vllm_throughput.py -O benchmark_throughput.py

export MODEL="YOUR_MODEL"

# You can change load-in-low-bit from values in [sym_int4, fp6, fp8, fp8_e4m3, fp16]

python3 ./benchmark_throughput.py \
    --backend vllm \
    --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json \
    --model $MODEL \
    --num-prompts 1000 \
    --seed 42 \
    --trust-remote-code \
    --enforce-eager \
    --dtype float16 \
    --device xpu \
    --load-in-low-bit sym_int4 \
    --gpu-memory-utilization 0.85

The following figure shows the result of benchmarking Llama-2-7b-chat-hf using 50 prompts:

Tip

To find the best config for your workload, you may need to start the service and use tools like wrk or jmeter to perform a stress test.
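For a first impression before reaching for wrk or jmeter, a few lines of Python can generate concurrent load. The sketch below is not a substitute for those tools: `run_load` is a hypothetical helper, and `send` stands for any callable of your own that issues one request to the running service.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(send, num_requests, concurrency):
    """Call send(i) num_requests times from `concurrency` threads and time it."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(send, range(num_requests)))
    elapsed = time.perf_counter() - start
    return {
        "requests": len(results),
        "elapsed_s": elapsed,
        "requests_per_s": len(results) / elapsed if elapsed > 0 else float("inf"),
    }


if __name__ == "__main__":
    # Stand-in workload; replace with a function that calls your service.
    stats = run_load(lambda i: time.sleep(0.01), num_requests=20, concurrency=4)
    print(stats)
```

Varying `concurrency` while watching latency and memory usage gives a rough idea of where --max-num-seqs and --max-num-batched-tokens should sit for your workload.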