# Generative AI: Develop and Optimize Your Own Talking Chatbot on Habana HPU

NeuralChat is a customizable chat framework that lets users create their own chatbot within a few minutes on multiple architectures. This notebook demonstrates how to build a talking chatbot on Habana's Gaudi processors (HPU).

# Prepare Environment

To streamline the process, users can build a Docker image from a Dockerfile, start the Docker container, and then run inference or finetuning inside it.

**IMPORTANT:** Habana's Gaudi processors (HPU) require a Docker environment. Users need to manually run the steps below to build the Docker image and start the Docker container for inference on Habana HPU. Start the Jupyter notebook server inside the container, then run this notebook from there.

```bash
git clone https://github.com/intel/intel-extension-for-transformers.git
cd intel-extension-for-transformers/docker/inference/
DOCKER_BUILDKIT=1 docker build --network=host --tag chatbothabana:latest  ./ -f Dockerfile  --target hpu --build-arg BASE_NAME="base-installer-ubuntu22.04" --build-arg ARTIFACTORY_URL="vault.habana.ai" --build-arg VERSION="1.10.0" --build-arg REVISION="494" --build-arg PT_VERSION="2.0.1" --build-arg OS_NUMBER="2204"
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host chatbothabana:latest
```

To run finetuning on Habana HPU, execute the steps below:

```bash
git clone https://github.com/intel/intel-extension-for-transformers.git
cd intel-extension-for-transformers/docker/finetuning/
DOCKER_BUILDKIT=1 docker build --network=host --tag chatbot_finetuning:latest ./ -f Dockerfile  --target hpu
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -v /dev/shm:/dev/shm  -v /absolute/path/to/llama2:/llama2 -v /absolute/path/to/alpaca_data.json:/dataset/alpaca_data.json --cap-add=sys_nice --net=host --ipc=host chatbot_finetuning:latest

```

# Inference 💻

## Text Chat

Given a textual instruction, NeuralChat replies with a textual response.

In [None]:
from intel_extension_for_transformers.neural_chat import build_chatbot
chatbot = build_chatbot()
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
print(response)

## Text Chat With RAG Plugin

Users can also leverage the NeuralChat RAG plugin for domain-specific chat by feeding it documents, as shown below:

In [None]:
from intel_extension_for_transformers.neural_chat import PipelineConfig
from intel_extension_for_transformers.neural_chat import build_chatbot
from intel_extension_for_transformers.neural_chat import plugins
plugins.retrieval.enable=True
plugins.retrieval.args["input_path"]="./assets/docs/"
config = PipelineConfig(plugins=plugins)
chatbot = build_chatbot(config)
response = chatbot.predict("How many cores does the Intel® Xeon® Platinum 8480+ Processor have in total?")
print(response)
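
To make the idea behind retrieval-augmented chat concrete, here is a minimal, library-free sketch: pick the document that shares the most words with the question and prepend it to the prompt. This is only an illustration of the general technique. The function names (`tokenize`, `retrieve`, `build_prompt`), the sample documents, and the word-overlap scoring are all assumptions made for this sketch and do not reflect how the NeuralChat retrieval plugin is implemented.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, top_k=1):
    """Rank documents by how many words they share with the query (toy scoring)."""
    query_words = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(query_words & tokenize(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, documents):
    """Prepend the best-matching document to the question as context."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}"

# Made-up documents, for illustration only.
docs = [
    "The cafeteria opens at 8 am on weekdays.",
    "Gaudi is an AI accelerator from Habana Labs.",
]
print(build_prompt("What is Gaudi?", docs))
```

A production retriever would typically use embeddings and a vector store rather than word overlap; the sketch only shows where the retrieved text enters the prompt.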

## Voice Chat with ASR & TTS Plugin

Voice chat supports three modes: audio input with audio output, audio input with text output, and text input with audio output.

In the Python API, select a mode by setting `audio_input=True` for audio input, `audio_output=True` for audio output, or both.

In [None]:
from intel_extension_for_transformers.neural_chat import PipelineConfig
from intel_extension_for_transformers.neural_chat import build_chatbot
config = PipelineConfig(audio_input=True, audio_output=True)
chatbot = build_chatbot(config)
result = chatbot.predict(query="./assets/audio/sample.wav")
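
The three modes described above map onto the two flags as in the following sketch. The mode names and the `voice_flags` helper are hypothetical and exist only for this illustration; only `audio_input` and `audio_output` come from the notebook.

```python
# Hypothetical lookup table: the mode names are made up for this sketch,
# the flag names are the ones passed to PipelineConfig in the cell above.
VOICE_MODES = {
    "audio_to_audio": {"audio_input": True, "audio_output": True},
    "audio_to_text": {"audio_input": True, "audio_output": False},
    "text_to_audio": {"audio_input": False, "audio_output": True},
}

def voice_flags(mode):
    """Return a copy of the audio_input/audio_output flags for a named mode."""
    try:
        return dict(VOICE_MODES[mode])
    except KeyError:
        raise ValueError(
            f"unknown mode {mode!r}; expected one of {sorted(VOICE_MODES)}"
        ) from None

flags = voice_flags("audio_to_text")
print(flags)
# Inside the container, the flags could then be forwarded, for example:
# config = PipelineConfig(**flags)
```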

# Finetuning 🔧

Finetuning a pretrained large language model (LLM) on an instruction-following dataset to create a customized chatbot is straightforward with NeuralChat.
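
Before launching a long finetuning run, it can help to sanity-check the dataset. The sketch below assumes the mounted `alpaca_data.json` follows the public Alpaca layout, a JSON list of records with `instruction`, `input`, and `output` string fields. Whether NeuralChat requires exactly these keys is an assumption here, and `check_records` is a hypothetical helper, not part of the library.

```python
import json

# Field names of the public Alpaca dataset (assumed to apply here).
REQUIRED_KEYS = ("instruction", "input", "output")

def check_records(records):
    """Return the indices of records that lack a string value for any required key."""
    bad = []
    for index, record in enumerate(records):
        ok = isinstance(record, dict) and all(
            isinstance(record.get(key), str) for key in REQUIRED_KEYS
        )
        if not ok:
            bad.append(index)
    return bad

# Tiny in-memory sample; in practice, load the mounted file with json.load().
sample = json.loads('''[
  {"instruction": "Name a CPU vendor.", "input": "", "output": "Intel"},
  {"instruction": "This record is missing fields."}
]''')
print(check_records(sample))  # [1]
```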

## Finetuning LLM

In [None]:
from intel_extension_for_transformers.neural_chat import TextGenerationFinetuningConfig
from intel_extension_for_transformers.neural_chat import finetune_model
finetune_cfg = TextGenerationFinetuningConfig()
finetuned_model = finetune_model(finetune_cfg)

## Finetuning TTS

In [None]:
from intel_extension_for_transformers.neural_chat import TTSFinetuningConfig
from intel_extension_for_transformers.neural_chat import finetune_model
finetune_cfg = TTSFinetuningConfig()
finetuned_model = finetune_model(finetune_cfg)

# Low Precision Optimization 🚀

## BF16

In [1]:
# BF16 Optimization
from intel_extension_for_transformers.neural_chat import build_chatbot
from intel_extension_for_transformers.neural_chat.config import PipelineConfig, AMPConfig
config = PipelineConfig(optimization_config=AMPConfig())
chatbot = build_chatbot(config)
response = chatbot.predict(query="Tell me about Intel Xeon Scalable Processors.")
print(response)
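
For readers unfamiliar with the format: bfloat16 keeps the sign bit, all 8 exponent bits, and the top 7 mantissa bits of a float32, so it preserves range but loses precision. The sketch below illustrates this with plain truncation of the lower 16 bits. It is an illustration of the number format only, not of what `AMPConfig` does internally, and real hardware usually rounds rather than truncates.

```python
import struct

def to_bf16_bits(x):
    """Return the upper 16 bits of x's float32 encoding (bfloat16 by truncation)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def from_bf16_bits(b):
    """Expand a 16-bit bfloat16 pattern back into a Python float."""
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x

def bf16_round_trip(x):
    """Show the value that survives a float32 -> bfloat16 -> float32 round trip."""
    return from_bf16_bits(to_bf16_bits(x))

print(hex(to_bf16_bits(1.0)))    # 0x3f80
print(bf16_round_trip(1.0))      # 1.0
print(bf16_round_trip(3.14159))  # 3.140625
```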


## Weight-Only Quantization

In [2]:
# Weight-Only Quantization
from intel_extension_for_transformers.neural_chat import build_chatbot
from intel_extension_for_transformers.neural_chat.config import PipelineConfig, WeightOnlyQuantizationConfig
config = PipelineConfig(optimization_config=WeightOnlyQuantizationConfig())
chatbot = build_chatbot(config)
response = chatbot.predict(query="Tell me about Intel Xeon Scalable Processors.")
print(response)
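
As a rough intuition for weight-only quantization: the weights are stored as low-bit integers plus a scale factor, and are expanded back to floating point when used, while activations stay in higher precision. The sketch below shows symmetric 8-bit quantization of a single weight row in plain Python. The helper names and the sample values are made up, and the algorithm and defaults used by `WeightOnlyQuantizationConfig` may well differ.

```python
def quantize_row(row, num_bits=8):
    """Symmetrically quantize one row of weights to signed integers plus one scale."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(w) for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return quantized, scale

def dequantize_row(quantized, scale):
    """Recover approximate floating-point weights from integers and the scale."""
    return [q * scale for q in quantized]

# Made-up weights, for illustration only.
weights = [0.4, -1.27, 0.0, 1.0]
quantized, scale = quantize_row(weights)
restored = dequantize_row(quantized, scale)

print(quantized)
print(max(abs(w - r) for w, r in zip(weights, restored)))  # worst-case error
```

The reconstruction error per weight is bounded by half the scale, which is why a finer grouping (one scale per row or per block instead of per tensor) tends to preserve accuracy better.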

# Client-Server Architecture for Performance and Scalability

## Quick Start Local Server

❗ Please note that the server runs in the background.

In [None]:
import time
import multiprocessing
from intel_extension_for_transformers.neural_chat import NeuralChatServerExecutor
import nest_asyncio
nest_asyncio.apply()

def start_service():
    server_executor = NeuralChatServerExecutor()
    server_executor(config_file="./server/config/neuralchat.yaml", log_file="./log/neuralchat.log")
multiprocessing.Process(target=start_service).start()
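
Because the server starts in a background process, a client call issued immediately afterwards may fail before the server is listening. One way to handle this, sketched below with the standard library only, is to poll the TCP port until it accepts a connection. `wait_for_port` is a hypothetical helper, and the address `127.0.0.1:8000` is taken from the client cells in this notebook; adjust it if your YAML config uses a different port.

```python
import socket
import time

def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Poll until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False

if wait_for_port("127.0.0.1", 8000, timeout=1.0):
    print("server is accepting connections")
else:
    print("server is not reachable yet")
```

An open port only shows that the process is listening; the model may still be loading, so the first request can take noticeably longer than later ones.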

## Access Text Chat Service 

In [None]:
from intel_extension_for_transformers.neural_chat import TextChatClientExecutor
executor = TextChatClientExecutor()
result = executor(
    prompt="Tell me about Intel Xeon Scalable Processors.",
    server_ip="127.0.0.1", # master server ip
    port=8000 # master server entry point 
    )
print(result.text)

## Access Voice Chat Service

In [None]:
from intel_extension_for_transformers.neural_chat import VoiceChatClientExecutor
executor = VoiceChatClientExecutor()
result = executor(
    audio_input_path='./assets/audio/sample.wav',
    audio_output_path='./results.wav',
    server_ip="127.0.0.1", # master server ip
    port=8000 # master server entry point 
    )


In [None]:
import IPython
# Play input audio
print("     Play Input Audio ......")
IPython.display.display(IPython.display.Audio("./assets/audio/sample.wav"))
# Play output audio
print("     Play Output Audio ......")
IPython.display.display(IPython.display.Audio("./results.wav"))


## Access Finetune Service

In [None]:
from intel_extension_for_transformers.neural_chat import FinetuningClientExecutor
executor = FinetuningClientExecutor()
tuning_status = executor(
    server_ip="127.0.0.1", # master server ip
    port=8000 # master server port (port on socket 0, if both sockets are deployed)
    )