NeuralChat is a customizable chat framework designed to let users create their own chatbot within a few minutes on multiple architectures. This notebook demonstrates how to deploy a chatbot as a service on 4th Generation Intel® Xeon® Scalable Processors (Sapphire Rapids) with Tensor Parallelism (TP).

The 4th Generation Intel® Xeon® Scalable processor provides two instruction sets, AMX_BF16 and AMX_INT8, which accelerate bfloat16 and int8 operations respectively.
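Before deploying, it can help to confirm that the host actually exposes these instruction sets. The following is a minimal sketch, assuming a Linux host where the kernel reports CPU features in `/proc/cpuinfo` under the flag names `amx_bf16` and `amx_int8`. The helper name `amx_flags` is hypothetical and not part of NeuralChat.

```python
from pathlib import Path


def amx_flags(cpuinfo_text: str) -> dict:
    """Report which AMX feature flags appear in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        # On x86 Linux, each logical CPU has a "flags : ..." line.
        if line.startswith("flags"):
            _, _, value = line.partition(":")
            flags.update(value.split())
    return {name: name in flags for name in ("amx_bf16", "amx_int8")}


cpuinfo = Path("/proc/cpuinfo")  # assumption: Linux host
if cpuinfo.exists():
    print(amx_flags(cpuinfo.read_text()))
```

If either value is `False`, the machine is probably not a Sapphire Rapids system, or the kernel is too old to report the AMX flags.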

The workflow follows this architecture:
![Architecture](https://i.imgur.com/km3x1Xv.png)

# Prepare Environment

Install Intel® Extension for Transformers* and Requirements

In [None]:
!git clone https://github.com/intel/intel-extension-for-transformers.git
%cd ./intel-extension-for-transformers
!pip install -e .
%cd ./intel_extension_for_transformers/neural_chat/
!pip install -r requirements.txt
%cd ../../../

In [None]:
!pip uninstall torch -y
!pip install torch
!sudo apt install -y numactl

Install DeepSpeed

In [None]:
!pip install deepspeed

Install OneCCL

In [None]:
!git clone -b ccl_torch2.1.0+cpu https://github.com/intel/torch-ccl.git torch-ccl-2.2.0
%cd torch-ccl-2.2.0
!git submodule sync
!git submodule update --init --recursive
!python setup.py install

# Client-Server Architecture for Performance and Scalability

## Start Local Server on Multi Sockets

Configure hostfile

In [None]:
%edit ./intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/server/config/hostfile

Add the line `localhost slots=4` to the hostfile.
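If an interactive editor is not available (for example, when the notebook runs headless), the hostfile can also be written from Python. This is a minimal sketch: `write_hostfile` and `HOSTFILE` are hypothetical names introduced here, and the path is the one used in the `%edit` cell above, which assumes the repository was cloned into the current directory.

```python
from pathlib import Path

# Path taken from the %edit cell above; assumes the clone step has been run.
HOSTFILE = Path(
    "./intel-extension-for-transformers/intel_extension_for_transformers"
    "/neural_chat/server/config/hostfile"
)


def write_hostfile(path, host="localhost", slots=4):
    """Overwrite the hostfile with a single '<host> slots=<n>' entry."""
    path = Path(path)
    path.write_text(f"{host} slots={slots}\n")
    return path
```

Calling `write_hostfile(HOSTFILE)` replaces any existing content with the single line `localhost slots=4`.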

In [None]:
!curl -OL https://raw.githubusercontent.com/intel/intel-extension-for-transformers/main/intel_extension_for_transformers/neural_chat/examples/deployment/textbot/backend/xeon/textbot.yaml

In [None]:
%edit textbot.yaml

Add these configurations to textbot.yaml:
- use_deepspeed: true
- world_size: 4

To better validate the TP performance, you can change `model_name_or_path` to `meta-llama/Llama-2-13b-chat-hf`.
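The same settings can be applied without opening an editor. The sketch below uses only the standard library and assumes that `use_deepspeed`, `world_size`, and `model_name_or_path` are top-level keys in textbot.yaml. Check the downloaded file before relying on that. `set_top_level_key` is a hypothetical helper, not a NeuralChat API, and it discards any inline comment on a line it replaces.

```python
import re
from pathlib import Path


def set_top_level_key(yaml_text: str, key: str, value: str) -> str:
    """Replace an unindented `key: ...` line, or append one if it is absent."""
    pattern = re.compile(rf"^{re.escape(key)}\s*:.*$", flags=re.MULTILINE)
    line = f"{key}: {value}"
    if pattern.search(yaml_text):
        # A callable replacement avoids backslash interpretation in `line`.
        return pattern.sub(lambda _match: line, yaml_text, count=1)
    if yaml_text and not yaml_text.endswith("\n"):
        yaml_text += "\n"
    return yaml_text + line + "\n"


def enable_tensor_parallel(config_path="textbot.yaml", world_size=4):
    """Apply the two settings listed above to the config file in place."""
    path = Path(config_path)
    text = path.read_text()
    text = set_top_level_key(text, "use_deepspeed", "true")
    text = set_top_level_key(text, "world_size", str(world_size))
    path.write_text(text)
```

Running `enable_tensor_parallel()` after the download cell edits textbot.yaml in the working directory. The model can be switched the same way by calling `set_top_level_key` with the `model_name_or_path` key.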

In [None]:
import multiprocessing
from intel_extension_for_transformers.neural_chat import NeuralChatServerExecutor
import nest_asyncio
nest_asyncio.apply()

def start_service():
    server_executor = NeuralChatServerExecutor()
    server_executor(config_file="textbot.yaml", log_file="neuralchat.log")
multiprocessing.Process(target=start_service).start()

❗ Note that the server is running in the background.
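Because the server starts in a separate process and loading a large model can take a while, a client that connects too early will fail. One way to handle this is to poll the server port until it accepts TCP connections. This is a minimal sketch, assuming the service listens on `127.0.0.1:8000`, the address used by the client cell in this notebook. `wait_for_port` is a hypothetical helper, and an open port only indicates that a listener exists, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(host="127.0.0.1", port=8000, timeout=600.0, interval=2.0):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet; wait and retry.
            time.sleep(interval)
    return False
```

A call such as `wait_for_port()` before creating the client blocks for up to ten minutes and returns `False` if the server never comes up, in which case `neuralchat.log` is the place to look.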

## Access Text Chat Service 

If you ran the code above in a command-line window, run the following code in a new terminal or session to access the text chat service.

In [None]:
from intel_extension_for_transformers.neural_chat import TextChatClientExecutor
executor = TextChatClientExecutor()
result = executor(
    prompt="Tell me about Intel Xeon Scalable Processors.",
    server_ip="127.0.0.1", # master server ip
    port=8000 # master server entry point 
    )
print(result.text)