NeuralChat is a customizable chat framework for building your own chatbot within minutes on multiple architectures. This notebook demonstrates how to build a RAG (retrieval-augmented generation) application on 4th Generation Intel® Xeon® Scalable Processors (Sapphire Rapids, SPR) and Habana Gaudi processors (HPU).

# Prepare Environment

Install Intel® Extension for Transformers:

In [None]:
!pip install intel-extension-for-transformers

Clone the repository and install the NeuralChat requirements:

In [None]:
!git clone https://github.com/intel/intel-extension-for-transformers.git

In [None]:
%cd ./intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/
!pip install -r requirements.txt
%cd ../../../

In [None]:
!conda list
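
`conda list` prints every package in the environment, which can be long to scan. The following is a minimal sketch, assuming a Python 3.8+ kernel, that checks only the package installed earlier. `installed_version` is a hypothetical helper written for this notebook, not part of NeuralChat.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Distribution name as used in the `pip install` cell above.
for name in ("intel-extension-for-transformers",):
    print(f"{name}: {installed_version(name) or 'NOT INSTALLED'}")
```

If the package shows as `NOT INSTALLED`, the kernel is probably running in a different environment from the one `pip` installed into.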

# Consume RAG via NeuralChat Model

## Consume RAG with Python API

You can use the NeuralChat retrieval plugin for domain-specific chat by feeding it your own documents, as shown below:

In [None]:
%cd ./intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/
!pip install -r requirements.txt
%cd ../../../../../../

In [None]:
!mkdir docs
%cd docs
!curl -OL https://raw.githubusercontent.com/intel/intel-extension-for-transformers/main/intel_extension_for_transformers/neural_chat/assets/docs/sample.jsonl
!curl -OL https://raw.githubusercontent.com/intel/intel-extension-for-transformers/main/intel_extension_for_transformers/neural_chat/assets/docs/sample.txt
!curl -OL https://raw.githubusercontent.com/intel/intel-extension-for-transformers/main/intel_extension_for_transformers/neural_chat/assets/docs/sample.xlsx
%cd ..
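
A failed download can leave an empty or missing file that only surfaces later, when the retrieval plugin reads `./docs/`. The sketch below lists what actually landed in the folder. `summarize_docs` is a hypothetical helper for this notebook, and it assumes the current working directory is the one that contains `docs`.

```python
from pathlib import Path


def summarize_docs(folder="./docs"):
    """Map each file name in `folder` to its size in bytes ({} if the folder is missing)."""
    root = Path(folder)
    if not root.is_dir():
        return {}
    return {p.name: p.stat().st_size for p in sorted(root.iterdir()) if p.is_file()}


# Print one line per downloaded file and flag empty ones.
for name, size in summarize_docs("./docs").items():
    flag = "  <- empty, consider re-downloading" if size == 0 else ""
    print(f"{name}: {size} bytes{flag}")
```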

In [None]:
from intel_extension_for_transformers.neural_chat import PipelineConfig
from intel_extension_for_transformers.neural_chat import build_chatbot
from intel_extension_for_transformers.neural_chat import plugins
plugins.retrieval.enable=True
plugins.retrieval.args["input_path"]="./docs/"
config = PipelineConfig(plugins=plugins)
chatbot = build_chatbot(config)
response = chatbot.predict("How many cores does the Intel® Xeon® Platinum 8480+ Processor have in total?")
print(response)
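
For intuition about what a retrieval step contributes, here is a toy, self-contained sketch of the general RAG idea: rank document chunks by similarity to the question, then prepend the best match to the prompt. It is not NeuralChat's implementation. The bag-of-words cosine similarity, the helper names, and the sample chunks are all illustrative assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, top_k=1):
    """Return the `top_k` chunks most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(chunks, key=lambda c: cosine(q, Counter(tokenize(c))), reverse=True)
    return ranked[:top_k]


def build_prompt(query, chunks, top_k=1):
    """Prepend the retrieved context to the question."""
    context = "\n".join(retrieve(query, chunks, top_k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Hypothetical chunks, standing in for the content of ./docs/.
toy_chunks = [
    "The office cafeteria opens at 8 am on weekdays.",
    "Parking permits are renewed every January.",
]
print(build_prompt("When does the cafeteria open?", toy_chunks))
```

Real retrieval pipelines typically replace the word-count vectors with learned embeddings and a vector store, but the retrieve-then-prompt flow is the same.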

## Consume RAG with HTTP RESTful API

To consume the HTTP RESTful APIs, first start `neuralchat_server` with the commands below.

In [None]:
cp -r ./docs /intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/examples/deployment/rag/docs
cd /intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/examples/deployment/rag
python askdock.py

NeuralChat supports HTTP RESTful APIs that follow the OpenAI protocol. You can consume the RAG HTTP RESTful API with a cURL command like this:

In [None]:
curl -X POST localhost:8000/v1/chat/completions -H 'Content-Type: application/json' \
    -d '{"model": "Intel/neural-chat-7b-v3-1",
         "messages": "How many cores does the Intel® Xeon® Platinum 8480+ Processor have in total?"}'
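
The same request can be sent from Python using only the standard library. This is a minimal sketch, assuming the server started above is listening on `localhost:8000` and returns a JSON body. `build_chat_request` and `ask` are hypothetical helper names, and the payload mirrors the cURL command rather than any documented client API.

```python
import json
import urllib.request


def build_chat_request(question, model="Intel/neural-chat-7b-v3-1",
                       base_url="http://localhost:8000"):
    """Build a POST request carrying the same JSON body as the cURL command."""
    body = {"model": model, "messages": question}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, timeout_s=120, **kwargs):
    """Send the request and return the parsed JSON response (requires a running server)."""
    with urllib.request.urlopen(build_chat_request(question, **kwargs), timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `print(ask("How many cores does the Intel® Xeon® Platinum 8480+ Processor have in total?"))` prints the parsed response. Building the body with `json.dumps` avoids the shell-quoting pitfalls of hand-written JSON.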

# Consume RAG with TGI service

This part covers two scenarios: running the TGI (Text Generation Inference) service on SPR or on Habana Gaudi. Before consuming RAG through TGI, launch a TGI service locally or remotely as shown below.

## Launch TGI service on SPR

In [None]:
model=Intel/neural-chat-7b-v3-1
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
 
docker run --shm-size 1g -p 8080:80 -v $volume:/data -e https_proxy -e http_proxy -e HTTPS_PROXY -e HTTP_PROXY -e no_proxy -e NO_PROXY ghcr.io/huggingface/text-generation-inference:1.4 --model-id $model

## Launch TGI service on Habana Gaudi

For Habana Gaudi, first build a TGI Docker image on your server, then start the TGI service from that image.

In [None]:
git clone https://github.com/huggingface/tgi-gaudi.git
cd tgi-gaudi
docker build -t tgi_gaudi .

In [None]:
model=Intel/neural-chat-7b-v3-1
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
tgi_habana_visible_devices=${your_visible_habana_devices} # define your visible habana devices for TGI service, such as `all`, `0,1`
tgi_sharded=true # boolean, whether to do sharding on more than one card
tgi_num_shard=2 # integer, between 1 and the number of your physical gaudi cards(usually 8)

docker run -p 8080:80 --name tgi_service_gaudi -v $volume:/data -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true --runtime=habana -e HABANA_VISIBLE_DEVICES=$tgi_habana_visible_devices -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host tgi_gaudi --model-id $model  --sharded $tgi_sharded --num-shard $tgi_num_shard
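
Whichever launch path you use, the container needs time to download and load the model before it can serve requests. The sketch below polls the service until it responds. It assumes the port mapping `8080:80` from the commands above and that the image exposes TGI's `/health` route. `tgi_is_healthy` and `wait_for_tgi` are hypothetical helpers, not part of TGI or NeuralChat.

```python
import time
import urllib.request


def tgi_is_healthy(base_url="http://localhost:8080", timeout_s=5):
    """Return True if GET <base_url>/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url.rstrip('/')}/health", timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False


def wait_for_tgi(base_url="http://localhost:8080", max_wait_s=600, interval_s=10,
                 probe=tgi_is_healthy):
    """Poll `probe` until it succeeds or `max_wait_s` seconds have passed."""
    deadline = time.monotonic() + max_wait_s
    while True:
        if probe(base_url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_for_tgi()` before building the chatbot turns a confusing connection error into an explicit `False` after the timeout. The `probe` parameter exists so the loop can be exercised without a live service.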

## Consume RAG service

Once the TGI service is ready, you can use its endpoint to construct a NeuralChat chatbot. Follow the [huggingface token](https://huggingface.co/docs/hub/security-tokens) guide to get an access token, then export it as your Hugging Face API token.

In [None]:
export HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token}

In [None]:
import os
from intel_extension_for_transformers.neural_chat import PipelineConfig
from intel_extension_for_transformers.neural_chat import build_chatbot
from intel_extension_for_transformers.neural_chat import plugins

plugins.retrieval.enable=True
plugins.retrieval.args["input_path"]="./docs/"
config = PipelineConfig(
    model_name_or_path="Intel/neural-chat-7b-v3-1", 
    plugins=plugins, 
    hf_endpoint_url="http://localhost:8080/", 
    hf_access_token=os.getenv("HUGGINGFACEHUB_API_TOKEN", "")
)
chatbot = build_chatbot(config)
response = chatbot.predict("How many cores does the Intel® Xeon® Platinum 8480+ Processor have in total?")
print(response)
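
If the chatbot's answer looks wrong, it can help to query TGI directly and so separate model-serving problems from retrieval problems. This is a minimal sketch, assuming the service from the launch commands is reachable at `http://localhost:8080` and exposes TGI's `/generate` route. `build_generate_request` and `generate` are hypothetical helper names.

```python
import json
import urllib.request


def build_generate_request(prompt, base_url="http://localhost:8080", max_new_tokens=64):
    """Build a POST request for TGI's /generate route."""
    body = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, timeout_s=120, **kwargs):
    """Send the prompt to TGI and return the generated text (requires a running service)."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs), timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))["generated_text"]
```

A direct call such as `generate("What is Intel Xeon?")` bypasses the retrieval plugin entirely, so its output reflects the base model alone.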