75 changes: 37 additions & 38 deletions ChatQnA/README.md
@@ -11,7 +11,7 @@ Getting started is straightforward with the official Docker container. Simply pu
docker pull ghcr.io/huggingface/tgi-gaudi:1.2.1
```

Alternatively, you can build the Docker image yourself with:
Alternatively, you can build the Docker image yourself from the latest [TGI-Gaudi](https://github.com/huggingface/tgi-gaudi) code with the following command:

```bash
bash ./serving/tgi_gaudi/build_docker.sh
@@ -44,46 +44,11 @@ The ./serving/tgi_gaudi/launch_tgi_service.sh script accepts three parameters:
- port_number: The port number assigned to the TGI Gaudi endpoint, with the default being 8080.
- model_name: The model name used for the LLM, with the default set to "Intel/neural-chat-7b-v3-3".

You have the flexibility to customize these parameters according to your specific needs. Additionally, you can set the TGI Gaudi endpoint by exporting the environment variable `TGI_ENDPOINT`:
You can customize these parameters to fit your specific needs. Additionally, set the TGI Gaudi endpoint by exporting the environment variable `TGI_LLM_ENDPOINT`:
```bash
export TGI_ENDPOINT="http://xxx.xxx.xxx.xxx:8080"
export TGI_LLM_ENDPOINT="http://xxx.xxx.xxx.xxx:8080"
```
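
You can quickly verify that the endpoint is reachable by calling TGI's `generate` API (a minimal sanity check; substitute the host and port from your `TGI_LLM_ENDPOINT`):

```bash
# Sanity check: send a small generation request to the TGI Gaudi endpoint
curl http://localhost:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 32}}'
```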

## Enable TGI Gaudi FP8 for higher throughput
The TGI Gaudi utilizes BFLOAT16 optimization as the default setting. If you aim to achieve higher throughput, you can enable FP8 quantization on the TGI Gaudi. According to our test results, FP8 quantization yields approximately a 1.8x performance gain compared to BFLOAT16. Please follow the below steps to enable FP8 quantization.

### Prepare Metadata for FP8 Quantization

Enter into the TGI Gaudi docker container, and then run the below commands:

```bash
git clone https://github.com/huggingface/optimum-habana.git
cd optimum-habana/examples/text-generation
pip install -r requirements_lm_eval.txt
QUANT_CONFIG=./quantization_config/maxabs_measure.json python ../gaudi_spawn.py run_lm_eval.py -o acc_7b_bs1_measure.txt --model_name_or_path meta-llama/Llama-2-7b-hf --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 1
QUANT_CONFIG=./quantization_config/maxabs_quant.json python ../gaudi_spawn.py run_lm_eval.py -o acc_7b_bs1_quant.txt --model_name_or_path meta-llama/Llama-2-7b-hf --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 1 --fp8
```

After finishing the above commands, the quantization metadata will be generated. Move the metadata directory ./hqt_output/ and copy the quantization JSON file to the host (under …/data). Please adapt the commands with your Docker ID and directory path.

```bash
docker cp 262e04bbe466:/usr/src/optimum-habana/examples/text-generation/hqt_output data/
docker cp 262e04bbe466:/usr/src/optimum-habana/examples/text-generation/quantization_config/maxabs_quant.json data/
```

### Restart the TGI Gaudi server within all the metadata mapped

```bash
docker run -d -p 8080:80 -e QUANT_CONFIG=/data/maxabs_quant.json -e HUGGING_FACE_HUB_TOKEN=<your HuggingFace token> -v $volume:/data --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host tgi_gaudi --model-id meta-llama/Llama-2-7b-hf
```

Now the TGI Gaudi will launch the FP8 model by default. Please note that currently only Llama2 and Mistral models support FP8 quantization.


## Launch Redis
```bash
docker pull redis/redis-stack:latest
@@ -173,3 +138,37 @@ nohup npm run dev &
```

This will initiate the frontend service and launch the application.
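
To confirm the dev server is listening, you can probe it from another shell (a hypothetical check; the port depends on your frontend configuration, with 5173 being a common Vite default):

```bash
# Hypothetical check: confirm the frontend dev server responds (adjust the port to your setup)
curl -I http://localhost:5173
```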


# Enable TGI Gaudi FP8 for higher throughput (Optional)
TGI Gaudi uses BFLOAT16 optimization by default. To achieve higher throughput, you can enable FP8 quantization on TGI Gaudi. In our tests, FP8 quantization yielded approximately a 1.8x performance gain over BFLOAT16. Follow the steps below to enable it.

## Prepare Metadata for FP8 Quantization

Enter the TGI Gaudi Docker container and run the following commands:
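
For example, to get a shell inside the running container (the container ID below is a placeholder; use `docker ps` to find yours):

```bash
# List running containers and open a shell in the TGI Gaudi one
docker ps
docker exec -it <tgi_gaudi_container_id> bash
```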

```bash
pip install git+https://github.com/huggingface/optimum-habana.git
git clone https://github.com/huggingface/optimum-habana.git
cd optimum-habana/examples/text-generation
pip install -r requirements_lm_eval.txt
# First pass: run in measurement mode to collect calibration statistics
QUANT_CONFIG=./quantization_config/maxabs_measure.json python ../gaudi_spawn.py run_lm_eval.py -o acc_7b_bs1_measure.txt --model_name_or_path Intel/neural-chat-7b-v3-3 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 1
# Second pass: quantize to FP8 using the measured statistics
QUANT_CONFIG=./quantization_config/maxabs_quant.json python ../gaudi_spawn.py run_lm_eval.py -o acc_7b_bs1_quant.txt --model_name_or_path Intel/neural-chat-7b-v3-3 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 1 --fp8
```

After the above commands finish, the quantization metadata is generated. Copy the metadata directory ./hqt_output/ and the quantization JSON file from the container to the host (under …/data), adapting the commands below to your container ID and directory path.

```bash
docker cp 262e04bbe466:/usr/src/optimum-habana/examples/text-generation/hqt_output data/
docker cp 262e04bbe466:/usr/src/optimum-habana/examples/text-generation/quantization_config/maxabs_quant.json data/
```
Then, in the `maxabs_quant.json` file, set `dump_stats_path` to `"/data/hqt_output/measure"` and `dump_stats_xlsx_path` to `"/data/hqt_output/measure/fp8stats.xlsx"`.
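
For example, both fields can be updated with `sed` (a hypothetical edit; it assumes the default key/value formatting in the copied `maxabs_quant.json`):

```bash
# Point the dump paths at the mounted /data volume (adjust if your JSON formatting differs)
sed -i 's#"dump_stats_path": *"[^"]*"#"dump_stats_path": "/data/hqt_output/measure"#' data/maxabs_quant.json
sed -i 's#"dump_stats_xlsx_path": *"[^"]*"#"dump_stats_xlsx_path": "/data/hqt_output/measure/fp8stats.xlsx"#' data/maxabs_quant.json
```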


## Restart the TGI Gaudi server with all the metadata mapped
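
The command below assumes the shell variable `volume` points to the host directory that now holds the copied quantization metadata, for example (hypothetical path):

```bash
# Host directory containing hqt_output/ and maxabs_quant.json; it is mounted into the container at /data
export volume=$PWD/data
```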

```bash
docker run -p 8080:80 -e QUANT_CONFIG=/data/maxabs_quant.json -v $volume:/data --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:1.2.1 --model-id Intel/neural-chat-7b-v3-3
```

TGI Gaudi will now serve the FP8-quantized model by default. Please note that currently only Llama2-series and Mistral-series models support FP8 quantization.
4 changes: 2 additions & 2 deletions ChatQnA/langchain/docker/qna-app/app/server.py
@@ -110,9 +110,9 @@ def handle_rag_chat(self, query: str):


upload_dir = os.getenv("RAG_UPLOAD_DIR", "./upload_dir")
tgi_endpoint = os.getenv("TGI_ENDPOINT", "http://localhost:8080")
tgi_llm_endpoint = os.getenv("TGI_LLM_ENDPOINT", "http://localhost:8080")
safety_guard_endpoint = os.getenv("SAFETY_GUARD_ENDPOINT")
router = RAGAPIRouter(upload_dir, tgi_endpoint, safety_guard_endpoint)
router = RAGAPIRouter(upload_dir, tgi_llm_endpoint, safety_guard_endpoint)


@router.post("/v1/rag/chat")
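
With the renamed variable exported before the server starts, the router reads the LLM endpoint from `TGI_LLM_ENDPOINT`. A request to the chat route might look like this (a hypothetical example; the serving port and exact payload schema depend on how the FastAPI app is launched and on `RAGAPIRouter`):

```bash
# Hypothetical request to the RAG chat route; adjust the port and payload to your deployment
curl http://localhost:8000/v1/rag/chat \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"query": "What is TGI Gaudi?"}'
```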
4 changes: 2 additions & 2 deletions ChatQnA/langchain/redis/rag_redis/chain.py
@@ -11,7 +11,7 @@
INDEX_NAME,
INDEX_SCHEMA,
REDIS_URL,
TGI_ENDPOINT,
TGI_LLM_ENDPOINT,
)

# Make this look better in the docs.
@@ -60,7 +60,7 @@ class Question(BaseModel):

# RAG Chain
model = HuggingFaceEndpoint(
endpoint_url=TGI_ENDPOINT,
endpoint_url=TGI_LLM_ENDPOINT,
max_new_tokens=512,
top_k=10,
top_p=0.95,
4 changes: 2 additions & 2 deletions ChatQnA/langchain/redis/rag_redis/config.py
@@ -77,5 +77,5 @@ def format_redis_conn_from_env():
REDIS_SCHEMA = os.getenv("REDIS_SCHEMA", "schema.yml")
schema_path = os.path.join(parent_dir, REDIS_SCHEMA)
INDEX_SCHEMA = schema_path
TGI_ENDPOINT = os.getenv("TGI_ENDPOINT", "http://localhost:8080")
TGI_ENDPOINT_NO_RAG = os.getenv("TGI_ENDPOINT_NO_RAG", "http://localhost:8081")
TGI_LLM_ENDPOINT = os.getenv("TGI_LLM_ENDPOINT", "http://localhost:8080")
TGI_LLM_ENDPOINT_NO_RAG = os.getenv("TGI_LLM_ENDPOINT_NO_RAG", "http://localhost:8081")
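
Both endpoints fall back to localhost defaults; to point the RAG and no-RAG chains at your own TGI Gaudi instances, export the variables before starting the service:

```bash
# Override the default LLM endpoints used by rag_redis/config.py
export TGI_LLM_ENDPOINT="http://xxx.xxx.xxx.xxx:8080"
export TGI_LLM_ENDPOINT_NO_RAG="http://xxx.xxx.xxx.xxx:8081"
```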