# Generative AI: Develop and Optimize Your Own Talking Chatbot on Nvidia GPU

NeuralChat is a customizable chat framework for building your own chatbot within minutes on multiple architectures. This notebook demonstrates how to build a talking chatbot on Nvidia GPUs.

# Prepare Environment

In [5]:
%pip install intel-extension-for-transformers

# Inference 💻

## Text Chat

Give NeuralChat a text instruction and it returns a text response.

In [None]:
from intel_extension_for_transformers.neural_chat import build_chatbot
chatbot = build_chatbot()
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
print(response)

## Text Chat With RAG Plugin

You can also use the NeuralChat RAG plugin for domain-specific chat by feeding it documents, as shown below.

In [None]:
from intel_extension_for_transformers.neural_chat import PipelineConfig
from intel_extension_for_transformers.neural_chat import build_chatbot
from intel_extension_for_transformers.neural_chat import plugins
plugins.retrieval.enable=True
plugins.retrieval.args["input_path"]="./assets/docs/"
config = PipelineConfig(plugins=plugins)
chatbot = build_chatbot(config)
response = chatbot.predict("How many cores does the Intel® Xeon® Platinum 8480+ Processor have in total?")
print(response)
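
To make the retrieval step more concrete, the sketch below illustrates the general idea behind retrieval-augmented generation with a toy word-overlap ranker. It is not NeuralChat's implementation: the helpers `tokenize`, `retrieve`, and `build_prompt` and the sample documents are hypothetical and exist only for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, documents, top_k=1):
    """Return the top_k documents that share the most words with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_tokens & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(query, documents, top_k=1):
    """Prepend the retrieved context to the user question."""
    context = "\n".join(retrieve(query, documents, top_k))
    return f"Context:\n{context}\n\nQuestion: {query}"

# Hypothetical documents, for illustration only.
docs = [
    "The cafeteria opens at nine.",
    "The demo processor has 8 cores.",
]
print(build_prompt("How many cores does the demo processor have?", docs))
```

A real retrieval plugin would typically rank by embedding similarity rather than word overlap, but the flow is the same: find relevant passages, then hand them to the model together with the question.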

## Voice Chat with ASR & TTS Plugin

Voice chat supports three modes: audio input with audio output, audio input with text output, and text input with audio output.

In the Python API, select a mode by setting audio_input to True for audio input and audio_output to True for audio output.

In [None]:
from intel_extension_for_transformers.neural_chat import PipelineConfig
from intel_extension_for_transformers.neural_chat import build_chatbot
config = PipelineConfig(audio_input=True, audio_output=True)
chatbot = build_chatbot(config)
result = chatbot.predict(query="./assets/audio/sample.wav")
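
The three modes described above can be written down as flag combinations. The following is a minimal sketch in plain Python; the mode names and the `voice_chat_flags` helper are hypothetical, and only the audio_input and audio_output flag names come from this notebook.

```python
# Hypothetical mode names mapped to the audio_input/audio_output flags.
VOICE_CHAT_MODES = {
    "audio_to_audio": {"audio_input": True, "audio_output": True},
    "audio_to_text": {"audio_input": True, "audio_output": False},
    "text_to_audio": {"audio_input": False, "audio_output": True},
}

def voice_chat_flags(mode):
    """Return a copy of the keyword arguments for the requested mode."""
    try:
        return dict(VOICE_CHAT_MODES[mode])
    except KeyError:
        raise ValueError(
            f"unknown mode {mode!r}; expected one of {sorted(VOICE_CHAT_MODES)}"
        ) from None

for mode in VOICE_CHAT_MODES:
    print(mode, voice_chat_flags(mode))
```

Assuming PipelineConfig accepts both flags as in the cell above, the result could be passed along as PipelineConfig(**voice_chat_flags("audio_to_text")).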

# Finetuning 🔧

NeuralChat makes it easy to finetune a pretrained large language model (LLM) on an instruction-following dataset to create a customized chatbot.

## Finetuning LLM

In [None]:
from intel_extension_for_transformers.neural_chat import TextGenerationFinetuningConfig
from intel_extension_for_transformers.neural_chat import finetune_model
finetune_cfg = TextGenerationFinetuningConfig()
finetuned_model = finetune_model(finetune_cfg)
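
This notebook does not specify the layout of the instruction-following dataset. As an illustration only, the sketch below assumes Alpaca-style records with instruction, input, and output fields and shows how one record might be turned into a prompt and target response. The `format_example` helper and the prompt template are hypothetical, not NeuralChat's own.

```python
def format_example(record):
    """Turn one instruction-following record into a (prompt, response) pair."""
    instruction = record["instruction"].strip()
    context = record.get("input", "").strip()
    if context:
        prompt = (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n"
            "### Response:\n"
        )
    else:
        # Records without extra context skip the Input section.
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    return prompt, record["output"].strip()

# Hypothetical record, for illustration only.
sample = {
    "instruction": "Summarize the text in one sentence.",
    "input": "NeuralChat is a customizable chat framework.",
    "output": "NeuralChat is a framework for building chatbots.",
}
prompt, response = format_example(sample)
print(prompt + response)
```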

## Finetuning TTS

In [None]:
from intel_extension_for_transformers.neural_chat import TTSFinetuningConfig
from intel_extension_for_transformers.neural_chat import finetune_model
finetune_cfg = TTSFinetuningConfig()
finetuned_model = finetune_model(finetune_cfg)

# Low Precision Optimization 🚀

## FP16

In [1]:
# FP16 Optimization
from intel_extension_for_transformers.neural_chat import build_chatbot
from intel_extension_for_transformers.neural_chat.config import PipelineConfig, AMPConfig
config = PipelineConfig(optimization_config=AMPConfig(dtype="float16"))
chatbot = build_chatbot(config)
response = chatbot.predict(query="Tell me about Intel Xeon Scalable Processors.")
print(response)


## Weight-Only Quantization

In [2]:
# Weight-Only Quantization
from intel_extension_for_transformers.neural_chat import build_chatbot
from intel_extension_for_transformers.neural_chat.config import PipelineConfig, WeightOnlyQuantizationConfig
config = PipelineConfig(optimization_config=WeightOnlyQuantizationConfig())
chatbot = build_chatbot(config)
response = chatbot.predict(query="Tell me about Intel Xeon Scalable Processors.")
print(response)

## Bitsandbytes Quantization

In [None]:
# Bitsandbytes Quantization
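from intel_extension_for_transformers.neural_chat import build_chatbot
from intel_extension_for_transformers.neural_chat.config import PipelineConfig, BitsAndBytesConfig
config = PipelineConfig(
    device='cuda',
    optimization_config=BitsAndBytesConfig(
        load_in_4bit=True,                  # store weights in 4-bit precision
        bnb_4bit_quant_type='nf4',          # NormalFloat4 quantization
        bnb_4bit_use_double_quant=True,     # also quantize the quantization constants
        bnb_4bit_compute_dtype="bfloat16"   # dtype used for computation
    )
)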
chatbot = build_chatbot(config)
response = chatbot.predict(query="Tell me about Intel Xeon Scalable Processors.")
print(response)
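
To give a rough sense of why lower precision helps, the sketch below estimates the memory needed for the model weights alone at different bit widths. The 7-billion-parameter model size is a hypothetical example, and the estimate ignores activations, the KV cache, and the extra scale factors that quantized formats store.

```python
# Bits used to store one weight in each format.
BITS_PER_WEIGHT = {"float32": 32, "float16": 16, "bfloat16": 16, "int8": 8, "nf4": 4}

def weight_memory_gib(num_parameters, dtype):
    """Approximate memory for the weights alone, in GiB."""
    return num_parameters * BITS_PER_WEIGHT[dtype] / 8 / 2**30

num_parameters = 7_000_000_000  # hypothetical 7B-parameter model
for dtype in ("float32", "float16", "nf4"):
    print(f"{dtype:>8}: {weight_memory_gib(num_parameters, dtype):.1f} GiB")
```

Halving the bit width halves the weight footprint, which is why FP16 and 4-bit formats allow larger models to fit on a single GPU.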

# Client-Server Architecture for Performance and Scalability

## Quick Start Local Server

❗ Note that the server runs in the background.

In [None]:
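import multiprocessing
from intel_extension_for_transformers.neural_chat import NeuralChatServerExecutor
import nest_asyncio

# Allow nested event loops so the server can start from within the notebook.
nest_asyncio.apply()

def start_service():
    # Launch the NeuralChat server with the given configuration and log file.
    server_executor = NeuralChatServerExecutor()
    server_executor(config_file="./server/config/neuralchat.yaml", log_file="./log/neuralchat.log")

# Run the server in a separate process so the notebook stays responsive.
multiprocessing.Process(target=start_service).start()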
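
Because the server starts in a background process, a client that connects immediately may find the port not yet open. One possible way to handle this, sketched below with only the standard library, is to poll the port until it accepts a TCP connection. The `wait_for_port` helper is hypothetical and not part of NeuralChat.

```python
import socket
import time

def wait_for_port(host, port, timeout=60.0, interval=0.5):
    """Poll until host:port accepts a TCP connection or the timeout expires.

    Returns True if a connection succeeded, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or unreachable); wait and retry.
            time.sleep(interval)
    return False
```

For example, calling wait_for_port("127.0.0.1", 8000) before creating a client would block until the server is listening, assuming it uses the address and port shown in the client cells of this notebook. An open port only shows that the server is listening, not that the model has finished loading.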

## Access Text Chat Service 

In [None]:
from intel_extension_for_transformers.neural_chat import TextChatClientExecutor
executor = TextChatClientExecutor()
result = executor(
    prompt="Tell me about Intel Xeon Scalable Processors.",
    server_ip="127.0.0.1", # master server ip
    port=8000 # master server entry point 
    )
print(result.text)

## Access Voice Chat Service

In [None]:
from intel_extension_for_transformers.neural_chat import VoiceChatClientExecutor
executor = VoiceChatClientExecutor()
result = executor(
    audio_input_path='./assets/audio/sample.wav',
    audio_output_path='./results.wav',
    server_ip="127.0.0.1", # master server ip
    port=8000 # master server entry point 
    )


In [None]:
import IPython
# Play input audio
print("     Play Input Audio ......")
IPython.display.display(IPython.display.Audio("./assets/audio/sample.wav"))
# Play output audio
print("     Play Output Audio ......")
IPython.display.display(IPython.display.Audio("./results.wav"))


## Access Finetune Service

In [None]:
from intel_extension_for_transformers.neural_chat import FinetuningClientExecutor
executor = FinetuningClientExecutor()
tuning_status = executor(
    server_ip="127.0.0.1", # master server ip
    port=8000 # master server port (port on socket 0, if both sockets are deployed)
    )