# End-to-End TalkingBot on PC Client (Windows)

> Make sure you are running in a conda environment with Python 3.10.

[Intel® Extension for Transformers Neural Chat](https://github.com/intel/intel-extension-for-transformers/tree/main/intel_extension_for_transformers/neural_chat) provides plugins for a range of user scenarios. This notebook shows how to create a TalkingBot on your local laptop with an **Intel CPU** (no GPU needed).

Behind the scenes, a TalkingBot is a pipeline of three stages:
1. Recognize the user's spoken prompt and convert it to text.
2. Understand the text and answer the question with a Large Language Model (LLM).
3. Convert the answer text to speech.
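
The three stages compose as a plain function chain. The following is a minimal sketch of that flow, not NeuralChat's own implementation: `run_talkingbot` and the three stub stages are hypothetical names introduced here for illustration, standing in for the ASR, LLM, and TTS components used later in this notebook.

```python
from typing import Callable


def run_talkingbot(
    audio_path: str,
    speech_to_text: Callable[[str], str],
    generate_answer: Callable[[str], str],
    text_to_speech: Callable[[str, str], str],
    output_path: str = "output.wav",
) -> str:
    """Chain the three stages and return the path of the spoken answer."""
    prompt = speech_to_text(audio_path)         # 1. audio -> text
    answer = generate_answer(prompt)            # 2. text -> answer text
    return text_to_speech(answer, output_path)  # 3. answer text -> audio file


# Hypothetical stand-ins so the sketch runs without downloading any model.
def stub_asr(audio_path: str) -> str:
    return f"transcript of {audio_path}"


def stub_llm(prompt: str) -> str:
    return f"answer to: {prompt}"


def stub_tts(text: str, output_path: str) -> str:
    return output_path  # a real stage would write audio to this path


result = run_talkingbot("sample_2.wav", stub_asr, stub_llm, stub_tts)
print(result)
```

In the rest of the notebook, `asr.audio2text`, the LLM's `generate` call, and `tts.text2speech` fill these three roles.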

This notebook walks through each stage on a PC. Make sure that you have at least 50 GB of free disk space for loading and converting the LLM.
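
One way to check the disk requirement up front is with Python's standard `shutil.disk_usage`. This is a minimal sketch: the helper names are hypothetical, and it assumes the model files will be stored on the same drive as the current working directory.

```python
import shutil

REQUIRED_GB = 50  # figure taken from the requirement stated above


def free_disk_gb(path: str = ".") -> float:
    """Return the free space, in GiB, on the drive that contains `path`."""
    return shutil.disk_usage(path).free / 1024**3


def has_enough_disk(path: str = ".", required_gb: float = REQUIRED_GB) -> bool:
    """True if the drive containing `path` has at least `required_gb` free."""
    return free_disk_gb(path) >= required_gb


print(f"Free space: {free_disk_gb():.1f} GiB")
if not has_enough_disk():
    print(f"Warning: less than {REQUIRED_GB} GiB free; conversion may fail.")
```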

## Audio To Text

In [None]:
!curl -O https://raw.githubusercontent.com/intel/intel-extension-for-transformers/main/intel_extension_for_transformers/neural_chat/assets/audio/sample_2.wav

In [None]:
from intel_extension_for_transformers.neural_chat.pipeline.plugins.audio.asr import AudioSpeechRecognition

In [None]:
from IPython.display import Audio
Audio(r"./sample_2.wav", rate=16000)

In [None]:
asr = AudioSpeechRecognition(model_name_or_path="openai/whisper-tiny")

In [None]:
in_text = asr.audio2text(r"./sample_2.wav")
print(in_text)
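
Before transcription it can help to confirm the sample's format. The following is a minimal sketch using Python's standard `wave` module, assuming the file is an uncompressed PCM WAV (the module cannot read other encodings). `describe_wav` is a hypothetical helper and not part of NeuralChat.

```python
import os
import wave


def describe_wav(path: str) -> dict:
    """Read basic format information from a PCM WAV file header."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": rate,
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": frames / rate,
        }


# Only inspect the sample if it has already been downloaded.
if os.path.exists("sample_2.wav"):
    print(describe_wav("sample_2.wav"))
```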

## LLM

### Load an existing int4 model directly for inference

For a quick demo, we use an existing int4 model, `ne_llama_q.bin`, to generate text. If you want to convert a model to int4 yourself, refer to the next section.
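
Because the loading step expects `ne_llama_q.bin` in the working directory, a small guard can give a clearer error than a failed load. This is a sketch only: `require_model_file` is a hypothetical helper, and the default path simply mirrors the file name used in this notebook.

```python
from pathlib import Path


def require_model_file(path: str = "ne_llama_q.bin") -> Path:
    """Return the model path, or raise a descriptive error if it is missing."""
    model_file = Path(path)
    if not model_file.is_file():
        raise FileNotFoundError(
            f"{model_file} not found; run the int4 conversion step first."
        )
    return model_file


try:
    print("Found", require_model_file())
except FileNotFoundError as err:
    print(err)
```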

In [None]:
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.llm.runtime.graph import Model

prompt = in_text

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids

model = Model()
model.tokenizer = tokenizer
model.init_from_bin(model_name="llama", model_path="ne_llama_q.bin", max_new_tokens=43, do_sample=False)

streamer = TextStreamer(tokenizer)
outputs = model.generate(inputs, streamer=streamer)
output_text = tokenizer.batch_decode(outputs)[0]

### Convert a model to int4 for inference

This conversion generates the int4 model `ne_llama_q.bin` that the cell above needs.
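
As a rough illustration of why int4 weights suit a laptop, the arithmetic below compares approximate weight storage at 16 and 4 bits per weight. It assumes a round 7 billion parameters (suggested by the model name) and ignores quantization scales, any layers left unquantized, and file metadata, so the actual size of `ne_llama_q.bin` will differ.

```python
def approx_weight_size_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (10**9 bytes), ignoring all overhead."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 7e9  # assumption: round figure taken from the "7b" in the model name

fp16_gb = approx_weight_size_gb(NUM_PARAMS, 16)
int4_gb = approx_weight_size_gb(NUM_PARAMS, 4)

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f" 4-bit weights: ~{int4_gb:.1f} GB")
print(f"Reduction: {fp16_gb / int4_gb:.0f}x")
```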

In [None]:
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModel, WeightOnlyQuantConfig

model_name = "meta-llama/Llama-2-7b-chat-hf"    # Please first download the model and replace this model_name with the local path
woq_config = WeightOnlyQuantConfig(compute_type="int8", weight_dtype="int4")
prompt = "Who is Andy Grove?"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
model = AutoModel.from_pretrained(model_name, quantization_config=woq_config, trust_remote_code=True)

outputs = model.generate(inputs, streamer=streamer, max_new_tokens=20)
output_text = tokenizer.batch_decode(outputs)[0]

## Text To Speech

In [None]:
from intel_extension_for_transformers.neural_chat.pipeline.plugins.audio.tts import TextToSpeech

In [None]:
tts = TextToSpeech()

In [None]:
result_path = tts.text2speech(output_text, "output.wav")

In [None]:
from IPython.display import Audio
Audio(result_path, rate=16000)