diff --git a/docker/llm/serving/xpu/docker/Dockerfile b/docker/llm/serving/xpu/docker/Dockerfile
index 750b9860d73..83cb1a4a2d7 100644
--- a/docker/llm/serving/xpu/docker/Dockerfile
+++ b/docker/llm/serving/xpu/docker/Dockerfile
@@ -15,7 +15,6 @@ RUN cd /llm &&\
     apt-get install -y libfabric-dev wrk && \
     pip install --pre --upgrade ipex-llm[xpu,serving] && \
     pip install transformers==4.37.0 gradio==4.19.2 && \
-    chmod +x /opt/entrypoint.sh && \
     # Install vLLM-v2 dependencies
     cd /llm && \
     git clone -b sycl_xpu https://github.com/analytics-zoo/vllm.git && \
diff --git a/docker/llm/serving/xpu/docker/README.md b/docker/llm/serving/xpu/docker/README.md
index 5e8c5c7a8b4..91951723edd 100644
--- a/docker/llm/serving/xpu/docker/README.md
+++ b/docker/llm/serving/xpu/docker/README.md
@@ -45,6 +45,119 @@ After the container is booted, you could get into the container through `docker
 
 Currently, we provide two different serving engines in the image, which are FastChat serving engine and vLLM serving engine.
 
-To run model-serving using `IPEX-LLM` as backend using FastChat, you can refer to this [document](https://github.com/intel-analytics/IPEX-LLM/tree/main/python/llm/src/ipex_llm/serving).
+#### FastChat serving engine
 
-To run vLLM engine using `IPEX-LLM` as backend, you can refer to this [document](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/vLLM-Serving#vllm-v2-experimental-support).
+To run model serving with `IPEX-LLM` as the backend using FastChat, you can refer to this [quickstart](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/fastchat_quickstart.html#).
+
+#### vLLM serving engine
+
+To run the vLLM engine with `IPEX-LLM` as the backend, you can refer to this [document](https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/GPU/vLLM-Serving/README.md).
+
+We have included multiple example files in `/llm/vllm-examples`:
+1. `offline_inference.py`: an offline inference example
+2. `benchmark_throughput.py`: a script for benchmarking throughput
+3. `payload-1024.lua`: a `wrk` payload for measuring requests per second with 1k-128 requests
+4. `start_service.sh`: a template script for starting the vLLM service
+
+##### Online benchmark through api_server
+
+We can benchmark the api_server to get an estimate of TPS (transactions per second). To do so, you need to start the service first according to the instructions in this [section](https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/GPU/vLLM-Serving/README.md#service).
+
+Inside the container, do the following:
+1. Modify `/llm/vllm-examples/payload-1024.lua` so that the "model" attribute matches your served model name (a scripted sketch for this step follows the `wrk` command below). By default, the payload uses a prompt that is roughly 1024 tokens long; you can change it if needed.
+2. Start the benchmark with `wrk` using the script below:
+
+```bash
+cd /llm/vllm-examples
+# You can change -t and -c to control the concurrency.
+# By default, we use 12 connections to benchmark the service.
+wrk -t12 -c12 -d15m -s payload-1024.lua http://localhost:8000/v1/completions --timeout 1h
+```
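+
+If you prefer to script step 1, the one-liner below is a minimal sketch: it assumes the payload stores the model name in a `"model": "..."` JSON field inside `payload-1024.lua`, so adjust the pattern if your payload differs.
+
+```bash
+# Hypothetical helper: point the wrk payload at your served model name.
+# Assumes payload-1024.lua contains a JSON body with a "model": "..." field.
+sed -i 's/"model": *"[^"]*"/"model": "YOUR_MODEL_NAME"/' /llm/vllm-examples/payload-1024.lua
+```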
+
+##### Offline benchmark through benchmark_throughput.py
+
+We have included the `benchmark_throughput.py` script provided by `vllm` in our image as `/llm/vllm-examples/benchmark_throughput.py`.
+To use the benchmark_throughput script, you will first need to download the test dataset:
+
+```bash
+wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
+```
+
+The full example looks like this:
+```bash
+cd /llm/vllm-examples
+
+wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
+
+export MODEL="YOUR_MODEL"
+
+# You can set --load-in-low-bit to one of [sym_int4, fp8, fp16]
+python3 /llm/vllm-examples/benchmark_throughput.py \
+    --backend vllm \
+    --dataset /llm/vllm-examples/ShareGPT_V3_unfiltered_cleaned_split.json \
+    --model $MODEL \
+    --num-prompts 1000 \
+    --seed 42 \
+    --trust-remote-code \
+    --enforce-eager \
+    --dtype float16 \
+    --device xpu \
+    --load-in-low-bit sym_int4 \
+    --gpu-memory-utilization 0.85
+```
+
+> Note: you can adjust `--load-in-low-bit` to use other low-bit quantization formats.
+
+You can also sweep the `--gpu-memory-utilization` rate with the following script to find the best-performing setting:
+
+```bash
+#!/bin/bash
+
+# Define the log directory
+LOG_DIR="YOUR_LOG_DIR"
+# Check if the log directory exists; if not, create it
+if [ ! -d "$LOG_DIR" ]; then
+    mkdir -p "$LOG_DIR"
+fi
+
+# Define an array of model paths
+MODELS=(
+    "YOUR TESTED MODELS"
+)
+
+# Define an array of utilization rates
+UTIL_RATES=(0.85 0.90 0.95)
+
+# Loop over each model
+for MODEL in "${MODELS[@]}"; do
+    # Loop over each utilization rate
+    for RATE in "${UTIL_RATES[@]}"; do
+        # Extract a simple model name from the path for easier identification
+        MODEL_NAME=$(basename "$MODEL")
+
+        # Define the log file name based on the model and rate
+        LOG_FILE="$LOG_DIR/${MODEL_NAME}_utilization_${RATE}.log"
+
+        # Execute the command and redirect output to the log file
+        # You may need to set --max-model-len if memory is not sufficient
+        # --load-in-low-bit accepts [sym_int4, fp8, fp16]
+        python3 /llm/vllm-examples/benchmark_throughput.py \
+            --backend vllm \
+            --dataset /llm/vllm-examples/ShareGPT_V3_unfiltered_cleaned_split.json \
+            --model $MODEL \
+            --num-prompts 1000 \
+            --seed 42 \
+            --trust-remote-code \
+            --enforce-eager \
+            --dtype float16 \
+            --load-in-low-bit sym_int4 \
+            --device xpu \
+            --gpu-memory-utilization $RATE &> "$LOG_FILE"
+    done
+done
+
+# Inform the user that the script has completed its execution
+echo "All benchmarks have been executed and logged."
+```
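+
+Once the sweep finishes, you can pull the headline numbers out of the logs. The helper below is a minimal sketch, assuming `benchmark_throughput.py` prints a result line starting with `Throughput:`; adjust the pattern if your version reports results differently.
+
+```bash
+# Summarize the sweep: print the first "Throughput:" line from each log file.
+LOG_DIR="YOUR_LOG_DIR"   # same directory used in the sweep script above
+for f in "$LOG_DIR"/*.log; do
+    echo "== $(basename "$f") =="
+    grep -m1 "Throughput:" "$f" || echo "  (no throughput line found; check the log for errors)"
+done
+```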
diff --git a/docker/llm/serving/xpu/docker/offline_inference.py b/docker/llm/serving/xpu/docker/offline_inference.py
index 480ca5e3573..1480f6accca 100644
--- a/docker/llm/serving/xpu/docker/offline_inference.py
+++ b/docker/llm/serving/xpu/docker/offline_inference.py
@@ -49,7 +49,8 @@
           device="xpu",
           dtype="float16",
           enforce_eager=True,
-          load_in_low_bit="sym_int4")
+          load_in_low_bit="sym_int4",
+          tensor_parallel_size=1)
 # Generate texts from the prompts. The output is a list of RequestOutput objects
 # that contain the prompt, generated text, and other information.
 outputs = llm.generate(prompts, sampling_params)
diff --git a/python/llm/example/GPU/vLLM-Serving/README.md b/python/llm/example/GPU/vLLM-Serving/README.md
index 6d1cfcec63c..49001691e6c 100644
--- a/python/llm/example/GPU/vLLM-Serving/README.md
+++ b/python/llm/example/GPU/vLLM-Serving/README.md
@@ -83,24 +83,93 @@ To fully utilize the continuous batching feature of the `vLLM`, you can send req
 For vLLM, you can start the service using the following command:
 ```bash
+#!/bin/bash
+model="YOUR_MODEL_PATH"
+served_model_name="YOUR_MODEL_NAME"
+
+# You may need to adjust the value of
+# --max-model-len, --max-num-batched-tokens, --max-num-seqs
+# to acquire the best performance
+
 python -m ipex_llm.vllm.entrypoints.openai.api_server \
-  --model /MODEL_PATH/Llama-2-7b-chat-hf/ --port 8000 \
-  --device xpu --dtype float16 \
-  --load-in-low-bit sym_int4 \
-  --max-num-batched-tokens 4096
+  --served-model-name $served_model_name \
+  --port 8000 \
+  --model $model \
+  --trust-remote-code \
+  --gpu-memory-utilization 0.75 \
+  --device xpu \
+  --dtype float16 \
+  --enforce-eager \
+  --load-in-low-bit sym_int4 \
+  --max-model-len 4096 \
+  --max-num-batched-tokens 10240 \
+  --max-num-seqs 12 \
+  --tensor-parallel-size 1
 ```
+
+You can tune the service using these four arguments:
+1. `--gpu-memory-utilization`: The fraction of GPU memory to be used for the model executor, which can range from 0 to 1. For example, a value of 0.5 would imply 50% GPU memory utilization. If unspecified, the default value of 0.9 is used.
+2. `--max-model-len`: Model context length. If unspecified, it will be automatically derived from the model config.
+3. `--max-num-batched-tokens`: Maximum number of batched tokens per iteration.
+4. `--max-num-seqs`: Maximum number of sequences per iteration. Defaults to 256.
 
-Then you can access the api server as follows:
-```bash
- curl http://localhost:8000/v1/completions \
-        -H "Content-Type: application/json" \
-        -d '{
-                "model": "/MODEL_PATH/Llama-2-7b-chat-hf/",
-                "prompt": "San Francisco is a",
-                "max_tokens": 128,
-                "temperature": 0
+After the service has been booted successfully, you can send a test request using `curl`. Here, `YOUR_MODEL_NAME` should be set equal to `$served_model_name` in your booting script.
+
+```bash
+curl http://localhost:8000/v1/completions \
+-H "Content-Type: application/json" \
+-d '{
+  "model": "YOUR_MODEL_NAME",
+  "prompt": "San Francisco is a",
+  "max_tokens": 128,
+  "temperature": 0
 }' &
 ```
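+
+If you are not sure which model name the server expects, you can list the served models first. This is a small sketch, assuming the api_server exposes the standard OpenAI-compatible `/v1/models` route:
+
+```bash
+# The returned "id" field should match the value passed via --served-model-name.
+curl http://localhost:8000/v1/models
+```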
+
+#### Tensor parallel
+
+> Note: We recommend using Docker for tensor parallel deployment.
+
+We also support tensor parallel across multiple XPU cards. To enable tensor parallel, you will need to install `libfabric-dev` in your environment. On Ubuntu, you can install it with:
+
+```bash
+sudo apt-get install libfabric-dev
+```
+
+To deploy your model across multiple cards, simply change the value of `--tensor-parallel-size` to the desired value.
+
+For instance, if you have two Arc A770 cards in your environment, you can set this value to 2. Some OneCCL environment variables are also needed; check the following example:
+
+```bash
+#!/bin/bash
+model="YOUR_MODEL_PATH"
+served_model_name="YOUR_MODEL_NAME"
+
+# Required CCL environment variables
+export CCL_WORKER_COUNT=2
+export FI_PROVIDER=shm
+export CCL_ATL_TRANSPORT=ofi
+export CCL_ZE_IPC_EXCHANGE=sockets
+export CCL_ATL_SHM=1
+
+# You may need to adjust the value of
+# --max-model-len, --max-num-batched-tokens, --max-num-seqs
+# to acquire the best performance
+python -m ipex_llm.vllm.entrypoints.openai.api_server \
+  --served-model-name $served_model_name \
+  --port 8000 \
+  --model $model \
+  --trust-remote-code \
+  --gpu-memory-utilization 0.75 \
+  --device xpu \
+  --dtype float16 \
+  --enforce-eager \
+  --load-in-low-bit sym_int4 \
+  --max-model-len 4096 \
+  --max-num-batched-tokens 10240 \
+  --max-num-seqs 12 \
+  --tensor-parallel-size 2
+```
diff --git a/python/llm/example/GPU/vLLM-Serving/offline_inference.py b/python/llm/example/GPU/vLLM-Serving/offline_inference.py
index 0df3631f770..d081c3f7c92 100644
--- a/python/llm/example/GPU/vLLM-Serving/offline_inference.py
+++ b/python/llm/example/GPU/vLLM-Serving/offline_inference.py
@@ -49,7 +49,8 @@
           device="xpu",
           dtype="float16",
           enforce_eager=True,
-          load_in_low_bit="sym_int4")
+          load_in_low_bit="sym_int4",
+          tensor_parallel_size=1)
 # Generate texts from the prompts. The output is a list of RequestOutput objects
 # that contain the prompt, generated text, and other information.
 outputs = llm.generate(prompts, sampling_params)