# End-to-End TalkingBot on PC Client (Windows)

> Make sure you are running in a conda environment with Python 3.10.

[Intel® Extension for Transformers Neural Chat](https://github.com/intel/intel-extension-for-transformers/tree/main/intel_extension_for_transformers/neural_chat) provides many plugins for different usage scenarios. This notebook shows how to create a TalkingBot on a local laptop with an **Intel CPU** (no GPU needed).

Behind the scenes, a TalkingBot is a pipeline of three stages:
1. Speech recognition: convert the user's spoken prompt to text.
2. Text understanding and question answering with a Large Language Model (LLM).
3. Text to speech: convert the answer text to audio.
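The three stages can be viewed as one function built from three interchangeable parts. The sketch below is only an illustration of that structure: `make_talking_bot` and the stand-in lambdas are hypothetical names, not part of Neural Chat, and the real stages are the plugin calls used in the cells of this notebook.

```python
from typing import Callable


def make_talking_bot(
    speech_to_text: Callable[[str], str],
    answer: Callable[[str], str],
    text_to_speech: Callable[[str, str], str],
) -> Callable[[str, str], str]:
    """Chain the three stages: audio file in, audio file out."""

    def talk(input_wav: str, output_wav: str) -> str:
        question = speech_to_text(input_wav)      # 1. audio -> text
        reply = answer(question)                  # 2. text -> answer text
        return text_to_speech(reply, output_wav)  # 3. answer text -> audio file

    return talk


# Stand-in stages (assumptions) so the sketch runs without downloading models.
bot = make_talking_bot(
    speech_to_text=lambda path: "who is andy grove",
    answer=lambda text: f"You asked: {text}",
    text_to_speech=lambda text, path: path,
)
print(bot("sample_2.wav", "output.wav"))
```

Keeping each stage behind a plain callable makes it easy to swap a model, for example a larger Whisper checkpoint, without touching the other two stages.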

This notebook walks through building such a TalkingBot on a PC. Make sure you have at least 50 GB of free disk space for loading and converting the LLM.
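A quick way to check the disk requirement before starting is the standard library's `shutil.disk_usage`. This is a minimal sketch: the helper name `free_disk_gb` is made up for illustration, and it checks the drive of the current working directory, which is assumed to be where the model files will be written.

```python
import shutil

REQUIRED_GB = 50  # figure taken from the requirement stated above


def free_disk_gb(path: str = ".") -> float:
    """Return the free space, in GB (10**9 bytes), on the drive that holds `path`."""
    return shutil.disk_usage(path).free / 1e9


free = free_disk_gb()
if free < REQUIRED_GB:
    print(f"Only {free:.1f} GB free; the model conversion may run out of space.")
else:
    print(f"{free:.1f} GB free.")
```

Note that the Hugging Face download cache may live on a different drive than the notebook directory, in which case that path should be checked as well.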

## Audio To Text

In [1]:
!curl -O https://raw.githubusercontent.com/intel/intel-extension-for-transformers/main/intel_extension_for_transformers/neural_chat/assets/audio/sample_2.wav

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 38956  100 38956    0     0   186k      0 --:--:-- --:--:-- --:--:--  187k


In [2]:
from intel_extension_for_transformers.neural_chat.pipeline.plugins.audio.asr import AudioSpeechRecognition



In [3]:
from IPython.display import Audio
Audio(r"./sample_2.wav", rate=16000)

In [4]:
asr = AudioSpeechRecognition(model_name_or_path="openai/whisper-tiny")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [5]:
in_text = asr.audio2text(r"./sample_2.wav")
print(in_text)

2024-04-28 21:32:18.084443: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-04-28 21:32:18.086103: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-28 21:32:18.115652: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-28 21:32:19,308 - root - INFO - generated text in 1.535085678100586 seconds, and the result is: who is andy grove


who is andy grove
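Whisper models operate on 16 kHz audio, and the player cell above is given `rate=16000`. If recognition results look wrong for your own recordings, it can help to inspect the file header first. The sketch below uses only the standard library `wave` module, which reads uncompressed PCM WAV files; `wav_info` is a hypothetical helper, and whether the ASR plugin resamples other rates internally is not covered here.

```python
import wave
from pathlib import Path


def wav_info(path: str) -> dict:
    """Read channel count, sample rate and duration from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "duration_s": frames / rate,
        }


# Only runs if the sample from the first cell has been downloaded.
sample = Path("./sample_2.wav")
if sample.exists():
    print(wav_info(str(sample)))
```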


## LLM

### Optimize the model with quantization to do inference

This step quantizes the LLM and writes the converted model files under `runtime_outs/`. On later runs the quantized model is loaded directly, without re-quantization.
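To see whether a quantized model is already present, you can list the contents of `runtime_outs/`. This is a rough sketch: the glob pattern is an assumption based on the file name that appears in this notebook's log (`ne_llama_q_int8_bestla_cfp32_g32.bin`), `find_quantized_models` is a hypothetical helper, and the library's own lookup logic may differ.

```python
from pathlib import Path


def find_quantized_models(out_dir: str = "runtime_outs") -> list[Path]:
    """List quantized .bin files; the name pattern is assumed from this notebook's log."""
    return sorted(Path(out_dir).glob("ne_*_q_*.bin"))


cached = find_quantized_models()
if cached:
    for model_file in cached:
        print(f"{model_file} ({model_file.stat().st_size / 1e9:.2f} GB)")
else:
    print("No quantized model found yet; the first run will convert and quantize.")
```

Deleting the files in `runtime_outs/` is a simple way to reclaim disk space, at the cost of repeating the conversion on the next run.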

In [6]:
# Get the quantized model
from transformers import AutoTokenizer, TextStreamer
from neural_speed import Model
from intel_extension_for_transformers.transformers import RtnConfig
from intel_extension_for_transformers.transformers import AutoModel

model_name = "meta-llama/Llama-2-7b-chat-hf"    # You can first download the model and replace this model_name with the local path
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
prompt = in_text
inputs = tokenizer(prompt, return_tensors="pt").input_ids


woq_config = RtnConfig(bits=8, compute_type="int8", weight_dtype="int8")
streamer = TextStreamer(tokenizer)
model = AutoModel.from_pretrained(model_name, quantization_config=woq_config, trust_remote_code=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=100)   # Change the max_new_tokens here to control the output length
output_text = tokenizer.batch_decode(outputs)[0]

2024-04-28 21:32:22 [INFO] Applying Weight Only Quantization.
2024-04-28 21:32:22 [INFO] Quantize model by Neural Speed with RTN Algorithm.


cmd: ['python', PosixPath('/root/miniconda3/envs/spycsh_neuralchat/lib/python3.9/site-packages/neural_speed/convert/convert_llama.py'), '--outfile', 'runtime_outs/ne_llama_f32.bin', '--outtype', 'f32', '--model_hub', 'huggingface', 'meta-llama/Llama-2-7b-chat-hf']


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Loadding the model from HF.


Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00,  1.25s/it]


Loading model file /root/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf/snapshots/f5db02db724555f92da89c216ac04704f23d4590/model-00001-of-00002.safetensors
Loading model file /root/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf/snapshots/f5db02db724555f92da89c216ac04704f23d4590/model-00001-of-00002.safetensors
Loading model file /root/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf/snapshots/f5db02db724555f92da89c216ac04704f23d4590/model-00002-of-00002.safetensors
Loading vocab file /root/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf/snapshots/f5db02db724555f92da89c216ac04704f23d4590/tokenizer.model
Writing vocab...
[  1/291] Writing tensor tok_embeddings.weight                  | size  32000 x   4096  |                  type UnquantizedDataType(name='F32')
[  2/291] Writing tensor norm.weight                            | size   4096           |                  type UnquantizedDataType(name='F32')
[  3/291] Writing tensor out

model.cpp: loading model from runtime_outs/ne_llama_f32.bin
The number of ne_parameters is wrong.
model.cpp: saving model to runtime_outs/ne_llama_q_int8_bestla_cfp32_g32.bin


ne_ftype: 10
Loading the bin file with NE format...
load_ne_hparams  0.hparams.n_vocab = 32000                         
load_ne_hparams  1.hparams.n_embd = 4096                          
load_ne_hparams  2.hparams.n_mult = 256                           
load_ne_hparams  3.hparams.n_head = 32                            
load_ne_hparams  4.hparams.n_head_kv = 32                            
load_ne_hparams  5.hparams.n_layer = 32                            
load_ne_hparams  6.hparams.n_rot = 128                           
load_ne_hparams  7.hparams.ftype = 0                             
load_ne_hparams  8.hparams.max_seq_len = 0                             
load_ne_hparams  9.hparams.alibi_bias_max = 0.000                         
load_ne_hparams  10.hparams.clip_qkv = 0.000                         
load_ne_hparams  11.hparams.par_res = 0                             
load_ne_hparams  12.hparams.word_embed_proj_dim = 0                             
load_ne_hparams  13.hparams.do_layer_norm_

model.cpp: loading model from runtime_outs/ne_llama_q_int8_bestla_cfp32_g32.bin
The number of ne_parameters is wrong.
init: n_vocab    = 32000
init: n_ctx      = 0
init: n_embd     = 4096
init: n_mult     = 256
init: n_head     = 32
init: n_head_kv  = 32
init: n_layer    = 32
init: n_rot      = 128
init: n_ff       = 11008
init: n_parts    = 1
load: ctx size   = 7199.33 MB
load: scratch0   = 4096.00 MB
load: scratch1   = 2048.00 MB
load: scratch2   = 4096.00 MB
load: mem required  = 17439.33 MB (+ memory per state)
...................................................................................................
model_init_from_file: support_bestla_kv = 1
model_init_from_file: kv self size =  552.00 MB


Andy Grove (1936-2016) was a Hungarian-American businessman, engineer, and philanthropist, best known as the co-founder and former CEO of Intel Corporation. He was widely regarded as one of the most successful technology entrepreneurs in the history of Silicon Valley.

Early Life and Education

Andy Grove was born Andor Grof on September 29, 193
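The `load:` lines in the log above give a rough idea of how much RAM the quantized model needs. As a worked check, the context size plus the three scratch buffers add up to the reported `mem required` figure. The numbers below are copied from that log; the log itself notes that per-state memory comes on top.

```python
# Values copied from the `load:` lines of the log above, in MB.
ctx_size = 7199.33
scratch = [4096.00, 2048.00, 4096.00]  # scratch0, scratch1, scratch2

mem_required = ctx_size + sum(scratch)
print(f"{mem_required:.2f} MB")         # same figure as the logged "mem required"
print(f"{mem_required / 1024:.1f} GiB")
```

On this run that is roughly 17 GiB, so a laptop with 16 GB of RAM would likely need swap space or a smaller model.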


## Text To Speech

This step converts the output text to speech and saves the result as `output.wav`.

In [7]:
from intel_extension_for_transformers.neural_chat.pipeline.plugins.audio.tts import TextToSpeech
tts = TextToSpeech()
result_path = tts.text2speech(output_text[:290], "output.wav")  # only the first 290 characters of the output are synthesized

2024-04-28 21:33:19,076 - speechbrain.pretrained.fetching - INFO - Fetch hyperparams.yaml: Using existing file/symlink in /tmp/speechbrain/spkrec-xvect-voxceleb/hyperparams.yaml.
2024-04-28 21:33:19,077 - speechbrain.pretrained.fetching - INFO - Fetch custom.py: Delegating to Huggingface hub, source speechbrain/spkrec-xvect-voxceleb.
2024-04-28 21:33:19,220 - speechbrain.pretrained.fetching - INFO - Fetch embedding_model.ckpt: Using existing file/symlink in /tmp/speechbrain/spkrec-xvect-voxceleb/embedding_model.ckpt.
2024-04-28 21:33:19,221 - speechbrain.pretrained.fetching - INFO - Fetch mean_var_norm_emb.ckpt: Using existing file/symlink in /tmp/speechbrain/spkrec-xvect-voxceleb/mean_var_norm_emb.ckpt.
2024-04-28 21:33:19,221 - speechbrain.pretrained.fetching - INFO - Fetch classifier.ckpt: Using existing file/symlink in /tmp/speechbrain/spkrec-xvect-voxceleb/classifier.ckpt.
2024-04-28 21:33:19,221 - speechbrain.pretrained.fetching - INFO - Fetch label_encoder.txt: Using existing fi
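The slice `output_text[:290]` in the cell above can cut the answer in the middle of a word, which is audible in the synthesized speech. One possible refinement is to cut at the last sentence boundary inside the limit. The helper below is a sketch under that assumption; `truncate_at_sentence` is a hypothetical name and is not part of the TTS plugin.

```python
def truncate_at_sentence(text: str, limit: int = 290) -> str:
    """Cut `text` to at most `limit` characters, ending on a sentence if possible."""
    if len(text) <= limit:
        return text
    head = text[:limit]
    # Index just past the last sentence-ending mark inside the window (0 if none).
    cut = max(head.rfind(mark) for mark in (". ", "! ", "? ", ".\n")) + 1
    return head[:cut].rstrip() if cut > 0 else head.rstrip()


print(truncate_at_sentence("One. Two. Three is a long sentence", limit=12))
```

It could then replace the slice in the call, for example `tts.text2speech(truncate_at_sentence(output_text), "output.wav")`.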

In [8]:
from IPython.display import Audio
Audio(r"./output.wav", rate=16000)