# Generative AI: Develop and Optimize Your Own Talking Chatbot

## Intel® Neural Chat empowers 💪 you to customize your chatbot with a diverse range of plugins!

NeuralChat is a general chat framework for building your own chatbot and deploying it efficiently on Intel CPU/GPU, Habana HPU and Nvidia GPU. It is built on top of large language models (LLMs) and provides LLM fine-tuning and LLM inference, together with a rich set of plugins such as knowledge retrieval and query caching. With NeuralChat, you can quickly create a text-based or audio-based chatbot and deploy it on Intel platforms. Here is the flow of NeuralChat:

<img src="neuralchat.png" width="500" height="300">
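
To make the plugin idea concrete, here is a minimal sketch of what a query-caching plugin does conceptually: repeated questions are answered from a cache instead of calling the model again. Everything in it (`CachedChatbot`, `normalize`, `fake_model`) is hypothetical and is not NeuralChat's implementation; the model call is replaced by a stand-in function.

```python
from typing import Callable, Dict


def normalize(query: str) -> str:
    """Collapse case and whitespace so trivially different queries share a key."""
    return " ".join(query.lower().split())


class CachedChatbot:
    """Hypothetical wrapper that answers repeated queries from a dict."""

    def __init__(self, predict_fn: Callable[[str], str]) -> None:
        self._predict_fn = predict_fn
        self._cache: Dict[str, str] = {}
        self.model_calls = 0  # counts how often the underlying model ran

    def predict(self, query: str) -> str:
        key = normalize(query)
        if key not in self._cache:
            self.model_calls += 1
            self._cache[key] = self._predict_fn(query)
        return self._cache[key]


def fake_model(query: str) -> str:
    """Stand-in for an expensive LLM call (assumption for illustration)."""
    return f"echo: {query}"


bot = CachedChatbot(fake_model)
first = bot.predict("Tell me about Xeon.")
second = bot.predict("  tell me about XEON. ")  # served from the cache
```

The second call differs only in case and spacing, so it reuses the first answer and the model runs once.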

# Talking Chatbot on Intel 4th Gen Xeon Scalable Processors

Set up the conda environment (already set up for the lab).

Add \<PATH TO intel-extension-for-transformers\> to PYTHONPATH.

In [5]:
# Set up the conda environment (already set up for the lab)
import sys
sys.path.append('/home/devcloud/qungao/intel-extension-for-transformers') # PYTHONPATH=<PATH TO intel-extension-for-transformers>

pip install itrex

cd [to neural chat folder]

pip install -r requirements.txt


# Inference 💻

## Text Chat

Given a textual instruction, NeuralChat responds with text.

In [3]:
from intel_extension_for_transformers.neural_chat import build_chatbot
chatbot = build_chatbot()
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
print(response)



Package 'habana_frameworks.torch.hpu' is not installed.
create asr plugin instance...
plugin parameters:  {}
Loading model meta-llama/Llama-2-7b-hf


UnboundLocalError: local variable 'AutoModelForCausalLM' referenced before assignment

## Text Chat With RAG Plugin

In [8]:
from intel_extension_for_transformers.neural_chat import PipelineConfig
from intel_extension_for_transformers.neural_chat import build_chatbot
from intel_extension_for_transformers.neural_chat import plugins
plugins.retrieval.enable=True
plugins.retrieval.args["input_path"]="./assets/docs/"
config = PipelineConfig(plugins=plugins)
chatbot = build_chatbot(config)
response = chatbot.predict("How many cores does the Intel® Xeon® Platinum 8480+ Processor have in total?")

create asr plugin instance...
plugin parameters:  {}
create retrieval plugin instance...
plugin parameters:  {'input_path': './assets/docs/'}


load INSTRUCTOR_Transformer
max_seq_length  512
success
Loading model meta-llama/Llama-2-7b-hf


UnboundLocalError: local variable 'AutoModelForCausalLM' referenced before assignment
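
The log above shows the retrieval plugin loading an INSTRUCTOR embedding model over the documents in `input_path`. The general retrieve-then-prompt idea can be sketched without any model, assuming word-count vectors as a stand-in for the embeddings. The helper names (`embed`, `retrieve`, `build_prompt`), the prompt template, and the placeholder passages are all hypothetical; this is not NeuralChat's implementation.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a neural embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, docs: List[str]) -> str:
    """Return the passage most similar to the query."""
    q = embed(query)
    return max(docs, key=lambda d: cosine(q, embed(d)))


def build_prompt(query: str, docs: List[str]) -> str:
    """Prepend the retrieved passage to the question (hypothetical template)."""
    context = retrieve(query, docs)
    return f"Context: {context}\nQuestion: {query}\nAnswer:"


# Placeholder passages, not real product documentation.
docs = [
    "Placeholder passage about processor cores and sockets.",
    "Placeholder passage about audio file formats.",
]
prompt = build_prompt("How many cores does the processor have?", docs)
```

The prompt passed to the LLM then carries the retrieved passage, which is how a retrieval plugin lets the model answer from documents it was never trained on.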

## Voice Chat with ASR & TTS Plugin

Voice chat supports three modes: audio input with audio output, audio input with text output, and text input with audio output.

In the Python API, select the mode by setting audio_input to True for audio input and audio_output to True for audio output.

In [9]:
from intel_extension_for_transformers.neural_chat import PipelineConfig
from intel_extension_for_transformers.neural_chat import build_chatbot
config = PipelineConfig(audio_input=True, audio_output=True)
chatbot = build_chatbot(config)
result = chatbot.predict(query="./assets/audio/pat.wav")

TypeError: PipelineConfig.__init__() got an unexpected keyword argument 'audio_input'

# Finetuning 🔧

With NeuralChat, it is easy to finetune a pretrained large language model (LLM) on an instruction-following dataset to create a customized chatbot.

## Finetuning LLM

In [10]:
from intel_extension_for_transformers.neural_chat import TextGenerationFinetuningConfig
from intel_extension_for_transformers.neural_chat import finetune_model
finetune_cfg = TextGenerationFinetuningConfig()
finetuned_model = finetune_model(finetune_cfg)

TypeError: BaseFinetuningConfig.__init__() missing 4 required positional arguments: 'model_args', 'data_args', 'training_args', and 'finetune_args'

## Finetuning TTS

In [11]:
from intel_extension_for_transformers.neural_chat import TTSFinetuningConfig
from intel_extension_for_transformers.neural_chat import finetune_model
finetune_cfg = TTSFinetuningConfig()
finetuned_model = finetune_model(finetune_cfg)

TypeError: TTSFinetuningConfig.__init__() missing 5 required positional arguments: 'model_args', 'data_args', 'training_args', 'finetune_args', and 'dataset_args'

# Low Precision Optimization 🚀

## BF16

In [1]:
# BF16 Optimization
# 
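
As background for this section, BF16 (bfloat16) keeps the sign bit, the 8 exponent bits, and the top 7 mantissa bits of a float32 value. The sketch below emulates that conversion with plain truncation, using only the standard library. It is an illustration of the number format, not NeuralChat's BF16 path; real hardware and frameworks typically round rather than truncate, and the function names are hypothetical.

```python
import struct


def float_to_bf16_bits(x: float) -> int:
    """Return the 16-bit bfloat16 pattern of x by truncating its float32 bits."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    return bits32 >> 16  # keep sign, exponent, and the top 7 mantissa bits


def bf16_bits_to_float(bits16: int) -> float:
    """Expand a bfloat16 bit pattern back to a Python float."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return value


def to_bf16(x: float) -> float:
    """Round-trip x through the (truncating) bfloat16 emulation."""
    return bf16_bits_to_float(float_to_bf16_bits(x))


approx_pi = to_bf16(3.14159)  # 3.140625: same range as float32, less precision
```

Because the exponent width matches float32, BF16 halves memory traffic without shrinking the representable range, at the cost of roughly two to three significant decimal digits.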


## SmoothQuant Quantization
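
As a rough illustration of the idea behind SmoothQuant, the sketch below computes per-input-channel smoothing scales `s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)`, divides the activations by them, and multiplies the matching weight rows, so the matrix product is unchanged while activation outliers shrink. It uses plain Python lists and made-up toy values, assumes non-zero channels, and is not how NeuralChat or Intel® Neural Compressor implements it; all function names are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain list-based matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def smoothing_scales(x: Matrix, w: Matrix, alpha: float = 0.5) -> List[float]:
    """s_j = max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha) per input channel j."""
    scales = []
    for j in range(len(w)):
        act_max = max(abs(row[j]) for row in x)
        weight_max = max(abs(v) for v in w[j])
        scales.append(act_max ** alpha / weight_max ** (1 - alpha))
    return scales


def smooth(x: Matrix, w: Matrix, alpha: float = 0.5) -> Tuple[Matrix, Matrix]:
    """Move activation outliers into the weights without changing x @ w."""
    s = smoothing_scales(x, w, alpha)
    x_s = [[v / s[j] for j, v in enumerate(row)] for row in x]
    w_s = [[v * s[j] for v in w[j]] for j in range(len(w))]
    return x_s, w_s


# Toy values: activation channel 1 has an outlier (64.0).
x = [[1.0, 64.0], [-2.0, 16.0]]  # tokens x input channels
w = [[0.5, -1.0], [0.25, 0.5]]   # input channels x output channels
x_s, w_s = smooth(x, w)
```

After smoothing, the largest activation is far smaller than 64 while `matmul(x_s, w_s)` still matches `matmul(x, w)`, which makes both tensors easier to quantize to int8.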

## Weight-Only Quantization

In [2]:
# Optimize with Intel® Neural Compressor (INC)
# NeuralChat loads the int8 model optimized by INC

NeuralChat provides three quantization approaches (PostTrainingDynamic, PostTrainingStatic, QuantAwareTraining) based on Intel® Neural Compressor.
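
All three approaches rest on the same int8 mapping and differ mainly in when the scale is determined: dynamic quantization computes activation scales at run time, static quantization fixes them from calibration data, and quantization-aware training simulates the rounding during training. A minimal sketch of the shared quantize/dequantize step, assuming symmetric per-tensor scaling, is shown here. The function names and sample values are hypothetical, and this is not the Intel® Neural Compressor API.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: map [-max_abs, max_abs] to [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(v / scale) for v in values], scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate float values from int8 codes."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.0, 1.0]  # made-up sample values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the precision traded for the smaller int8 representation.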

In [None]:
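# Weight-Only Quantization
from intel_extension_for_transformers.neural_chat import PipelineConfig, build_chatbot
# OptimizationConfig and WeightOnlyQuantizationConfig must also be imported;
# their module path depends on the installed NeuralChat version.
config = PipelineConfig(
    optimization_config=OptimizationConfig(
        weight_only_quant_config=WeightOnlyQuantizationConfig()
    )
)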
chatbot = build_chatbot(config)
response = chatbot.predict(query="Tell me about Intel Xeon Scalable Processors.")
print(response)

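# Bitsandbytes Quantization
# Assumption: BitsAndBytesConfig is the class provided by Hugging Face transformers.
from transformers import BitsAndBytesConfig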
config = PipelineConfig(
    device='cuda',
    optimization_config=OptimizationConfig(
        bitsandbytes_config=BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type='nf4',
            bnb_4bit_use_double_quant=True,
            bnb_4bit_compute_dtype="bfloat16"
        )
    )
)
chatbot = build_chatbot(config)
response = chatbot.predict(query="Tell me about Intel Xeon Scalable Processors.")
print(response)


# Client-Server Architecture for Performance and Scalability

## Quick Start Local Server

❗ Note that the server runs in a separate process.

In [None]:
import time
import multiprocessing
from intel_extension_for_transformers.neural_chat import NeuralChatServerExecutor
import nest_asyncio
nest_asyncio.apply()

def start_service():
    server_executor = NeuralChatServerExecutor()
    server_executor(config_file="./server/config/neuralchat.yaml", log_file="./log/neuralchat.log")
multiprocessing.Process(target=start_service).start()

Process Process-1:
Traceback (most recent call last):
  File "/root/anaconda3/envs/qg_chat2/lib/python3.10/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/root/anaconda3/envs/qg_chat2/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/tmp/ipykernel_213593/894158279.py", line 9, in start_service
    server_executor(config_file="./server/config/neuralchat.yaml", log_file="./log/neuralchat.log")
  File "/home/devcloud/qungao/intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/server/neuralchat_server.py", line 229, in __call__
    if self.init(config):
  File "/home/devcloud/qungao/intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/server/neuralchat_server.py", line 193, in init
    pipeline_config = PipelineConfig(**params)
TypeError: PipelineConfig.__init__() got an unexpected keyword argument 'audio_input'


## Access Text Chat Service 

In [12]:
from neural_chat import TextChatClientExecutor
executor = TextChatClientExecutor()
result = executor(
    prompt="Tell me about Intel Xeon Scalable Processors.",
    server_ip="127.0.0.1", # master server ip
    port=8000 # master server entry point 
    )
print(result.text)

Package 'habana_frameworks.torch.hpu' is not installed.


ConnectionError: HTTPConnectionPool(host='127.0.0.1', port=8000): Max retries exceeded with url: /v1/chat/completions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f081927e770>: Failed to establish a new connection: [Errno 111] Connection refused'))
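
A `Connection refused` error like the one above means nothing is listening on the target port, either because the server process is still starting or because it exited. One way to tell the two apart is to poll the port before calling the client. The helper below is a minimal standard-library sketch; `wait_for_port` is a hypothetical name and is not part of NeuralChat.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means some process is accepting connections.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling `wait_for_port("127.0.0.1", 8000)` before creating the client executor would block until the server accepts connections. A `False` result suggests the server process failed to start, in which case its log file is the place to look.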

## Access Voice Chat Service

In [13]:
from neural_chat import VoiceChatClientExecutor
executor = VoiceChatClientExecutor()
result = executor(
    audio_input_path='./assets/audio/pat.wav',
    audio_output_path='./results.wav',
    server_ip="127.0.0.1", # master server ip
    port=8000 # master server entry point 
    )


ConnectionError: HTTPConnectionPool(host='127.0.0.1', port=8000): Max retries exceeded with url: /v1/voicechat/completions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0bb607eaa0>: Failed to establish a new connection: [Errno 111] Connection refused'))

In [24]:
import IPython
# Play input audio
print("     Play Input Audio ......")
IPython.display.display(IPython.display.Audio("./assets/audio/pat.wav"))
# Play output audio
print("     Play Output Audio ......")
IPython.display.display(IPython.display.Audio("./assets/audio/welcome.wav"))


     Play Input Audio ......


     Play Output Audio ......


## Access Finetune Service

In [14]:
from neural_chat import FinetuingClientExecutor
executor = FinetuingClientExecutor()
tuning_status = executor(
    server_ip="127.0.0.1", # master server ip
    port=8000 # master server port (port on socket 0, if both sockets are deployed)
    )

ImportError: cannot import name 'FinetuingClientExecutor' from 'neural_chat' (/home/devcloud/qungao/intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/../neural_chat/__init__.py)