# Voice chatbot with Deep Learning

## 1 - The problem to solve

The idea is to build a chatbot that interprets human speech and produces the conversation as text, using the <ins>best available Deep Learning architectures</ins>:

![](https://drive.google.com/uc?export=view&id=11nvnhmuHtFOn8l9rwqfgSIXMdmd8FJpz)

## 2 - Chatbot components

We will use *wav2vec2* for speech-to-text conversion and *BlenderBot* to generate the conversation:

![](https://drive.google.com/uc?export=view&id=1H2B8o6EiAO59yUXRm9SyWnJhAobMXlWd)

Both *wav2vec2* and *BlenderBot* are based on [Transformer networks](https://youtu.be/Wp8NocXW_C4):

![](https://drive.google.com/uc?export=view&id=1p1InE9NxjhXkFN3cVfQ1EEcwGqelZvhv)

## 3 - Speech-to-text conversion with *wav2vec2*

[*wav2vec2*](https://arxiv.org/pdf/2006.11477.pdf) was developed by Facebook in 2020:

![](https://drive.google.com/uc?export=view&id=1DIRZVcQYucsZriQ8hnu2jSFVoDFkJhCJ)



In [24]:
!pip install transformers # wav2vec2 and BlenderBot


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [25]:
!pip install git+https://github.com/ricardodeazambuja/colab_utils.git #mic

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/ricardodeazambuja/colab_utils.git
  Cloning https://github.com/ricardodeazambuja/colab_utils.git to /tmp/pip-req-build-ojbk30sd
  Running command git clone -q https://github.com/ricardodeazambuja/colab_utils.git /tmp/pip-req-build-ojbk30sd


In [26]:
!pip install librosa # audio preprocessing

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [27]:
# Import libraries
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from colab_utils import getAudio
import librosa
import numpy as np

w2v2 = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
w2v2_processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [36]:
# Capture audio from the mic (at 48 kHz)
audio, sr = getAudio()
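
The capture runs at 48 kHz, while wav2vec2 expects 16 kHz, so the signal has to be resampled by a factor of 1/3. The following sketch only does the arithmetic: how long a clip is and roughly how many samples it will have after resampling. The helper names and the example clip length are hypothetical, and the rounding rule (ceiling) is an assumption about how the resampler sizes its output.

```python
import math


def duration_seconds(num_samples: int, sr: int) -> float:
    """Clip duration in seconds for a given sample count and sampling rate."""
    return num_samples / sr


def resampled_length(num_samples: int, orig_sr: int, target_sr: int) -> int:
    """Approximate sample count after resampling (assumes the length is rounded up)."""
    return math.ceil(num_samples * target_sr / orig_sr)


# Hypothetical capture: 201,600 samples recorded at 48 kHz
n_48k = 201_600
print(duration_seconds(n_48k, 48_000))          # 4.2 seconds
print(resampled_length(n_48k, 48_000, 16_000))  # 67200 samples at 16 kHz
```

The duration does not change with resampling; only the number of samples used to represent it does.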

In [37]:
# Resample to 16 kHz (required by wav2vec2)
audio_float = audio.astype(np.float32)
audio_16k = librosa.resample(audio_float, orig_sr=sr, target_sr=16000)  # keyword arguments are required by recent librosa versions
print(f'Resampled audio size: {audio_16k.shape}')

# Speech to text
entrada = w2v2_processor(audio_16k, sampling_rate=16000, return_tensors="pt").input_values
print(f'wav2vec2 input size: {entrada.shape}')
probabilidades = w2v2(entrada).logits  # unnormalized scores (logits): one row per audio frame
print(f'Logits array size (wav2vec2 output): {probabilidades.shape}')
predicciones = torch.argmax(probabilidades, dim=-1)  # most likely token id for each frame
print(f'Predictions array size: {predicciones.shape}')
transcripcion = w2v2_processor.decode(predicciones[0])
print(transcripcion)

Resampled audio size: (67200,)
wav2vec2 input size: torch.Size([1, 67200])
Logits array size (wav2vec2 output): torch.Size([1, 209, 32])
Predictions array size: torch.Size([1, 209])
HA MY NAME IS GEORGE
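
Two things in this output may be worth unpacking: why 67200 samples become 209 frames, and how 209 per-frame token ids turn into a short sentence. The sketch below is a plain-Python illustration, not the library's code. It assumes the convolutional feature encoder of the base wav2vec2 model uses kernel sizes (10, 3, 3, 3, 3, 2, 2) and strides (5, 2, 2, 2, 2, 2, 2), and it uses a tiny hypothetical vocabulary in which id 0 is the CTC blank and `|` separates words.

```python
from itertools import groupby

# Assumed feature-encoder configuration of the base wav2vec2 model
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples: int) -> int:
    """Length of the frame sequence after the stack of unpadded 1-D convolutions."""
    length = num_samples
    for kernel, stride in zip(KERNELS, STRIDES):
        length = (length - kernel) // stride + 1
    return length


# Toy vocabulary (hypothetical, for illustration only)
VOCAB = ["<pad>", "|", "H", "I", "M", "Y"]
BLANK_ID = 0          # the blank token used by CTC
WORD_DELIMITER = "|"  # marks the boundary between words


def ctc_greedy_decode(frame_ids: list[int]) -> str:
    """Greedy CTC decoding: merge repeated ids, drop blanks, map ids to characters."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    chars = [VOCAB[i] for i in collapsed if i != BLANK_ID]
    return "".join(chars).replace(WORD_DELIMITER, " ").strip()


print(num_output_frames(67200))  # 209, matching the logits shape above

# One predicted id per frame, as argmax would produce
frame_ids = [0, 2, 2, 0, 3, 3, 1, 1, 0, 4, 0, 5, 5, 0]
print(ctc_greedy_decode(frame_ids))  # HI MY
```

Under these assumptions each frame covers a hop of 320 samples (20 ms at 16 kHz), which is why a 4.2-second clip yields about 209 frames.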


## 4 - *BlenderBot*



[*BlenderBot*](https://ai.facebook.com/blog/state-of-the-art-open-source-chatbot/) was also developed by Facebook in 2020, with the goal of enabling a more human and natural interaction:

![](https://drive.google.com/uc?export=view&id=1KR4du-0KfSL27abv3jERo44aSNvHwz6X)

In [14]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")
blender = AutoModelForSeq2SeqLM.from_pretrained("facebook/blenderbot-400M-distill")


In [15]:
blender.generate?

In [16]:
# Initial test
entradaBlender = tokenizer([transcripcion], return_tensors='pt')
print(f'Input sentence: {transcripcion}')
print(f'Input to BlenderBot: {entradaBlender}')
ids_respuesta = blender.generate(**entradaBlender)
print(f'BlenderBot output: {ids_respuesta}')
respuesta = tokenizer.batch_decode(ids_respuesta)
print(f'Output after the tokenizer: {respuesta}')

Input sentence: HAI I AM GEORGE WHAT IS YOUR NAME
Input to BlenderBot: {'input_ids': tensor([[3840,   48,  281, 3535,  485,   44, 2754, 6803, 3680, 1897, 2566, 2763,
           57,  432, 2982,   44,    2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
BlenderBot output: tensor([[   1,  281,  446,  342,  513,  466,  304,  366, 1362,  458,   21,  228,
         1586,  304, 1362,  458,  383,  400,  770, 1051,   38,    2]])
Output after the tokenizer: ["<s> I don't know what you are talking about.  Are you talking about me or someone else?</s>"]


In [17]:
# Remove the start-of-sentence and end-of-sentence tokens
respuesta = respuesta[0].replace('<s>','').replace('</s>','')
print(f'Output in the correct format: {respuesta}')

Output in the correct format:  I don't know what you are talking about.  Are you talking about me or someone else?
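
The cleaned string still starts with a space and keeps the double space between sentences. One possible tidy-up, sketched here with a hypothetical helper `clean_response`, strips the `<s>` and `</s>` markers and normalizes whitespace in one step. The tokenizer's `batch_decode` also accepts `skip_special_tokens=True`, which drops the markers at decoding time; the helper is shown only because it can be applied to any already-decoded string.

```python
import re

# Matches the start-of-sentence and end-of-sentence markers seen in the decoded output
SPECIAL_TOKENS = re.compile(r"</?s>")


def clean_response(decoded: str) -> str:
    """Remove <s>/</s> markers and collapse runs of whitespace into single spaces."""
    without_tokens = SPECIAL_TOKENS.sub("", decoded)
    return re.sub(r"\s+", " ", without_tokens).strip()


raw = "<s> I don't know what you are talking about.  Are you talking about me or someone else?</s>"
print(clean_response(raw))
```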


In [None]:
# Create a test chat
NFRASES = 5
nfrase = 1
while nfrase <= NFRASES:
  frase = input('-> Gabriel NR: ')
  entradaBlender = tokenizer([frase], return_tensors='pt')
  ids_respuesta = blender.generate(**entradaBlender)
  respuesta = tokenizer.batch_decode(ids_respuesta)
  respuesta = respuesta[0].replace('<s>','').replace('</s>','')
  print(f'-> BLENDERBOT: {respuesta}')

  nfrase += 1

-> Gabriel NR: Hi, my names is Gabriel. What is your name?
-> BLENDERBOT:  Hi! My name is Meg. What do you like to do in your spare time?
-> Gabriel NR: I like to train voice chatbot
-> BLENDERBOT:  That sounds like a lot of fun. Do you do it for a living or just for fun?
-> Gabriel NR: I do it for a living
-> BLENDERBOT:  What kind of work do you do, if you don't mind me asking? And how long have you been doing it?
-> Gabriel NR: I am a professor at the National University of Villa Mercedes and I have been doing this for many years.
-> BLENDERBOT:  That's awesome! I bet you have a lot of interesting stories to tell. What do you teach?
-> Gabriel NR: I teach subjects such as artificial intelligence among others
-> BLENDERBOT:  That sounds like an interesting field of study.  Do you enjoy it?  What do you do for a living?
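
In this chat each reply is generated from the current sentence alone, so the model has no memory of earlier turns (in the last exchange it asks "What do you do for a living?" after already being told). A possible remedy is to keep a short rolling history and feed the joined turns as the input text. The sketch below covers only the bookkeeping: `ChatHistory` is a hypothetical helper, and the two-space turn separator is an assumption about the input format BlenderBot expects, not something confirmed by this notebook.

```python
from collections import deque

# Assumption: dialogue turns are joined with two spaces before tokenization
TURN_SEPARATOR = "  "


class ChatHistory:
    """Keeps the most recent dialogue turns and joins them into one input string."""

    def __init__(self, max_turns: int = 4):
        # Older turns are discarded automatically so the input stays short
        self.turns = deque(maxlen=max_turns)

    def add(self, text: str) -> None:
        self.turns.append(text.strip())

    def as_prompt(self) -> str:
        return TURN_SEPARATOR.join(self.turns)


history = ChatHistory(max_turns=3)
history.add("Hi, my name is Gabriel. What is your name?")
history.add("Hi! My name is Meg. What do you like to do in your spare time?")
history.add("I like to train voice chatbots")
print(history.as_prompt())
```

Inside the loop, `history.as_prompt()` would replace `frase` as the text passed to the tokenizer, with both the user's sentence and the bot's reply added to the history on every turn. The limit on turns matters because the model accepts only a bounded input length.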


## 5 - *wav2vec2* + *BlenderBot* and chatbot test

Now we will chain audio capture -> wav2vec2 -> BlenderBot inside a loop:

In [19]:
NFRASES = 2
nfrase = 1

while nfrase <= NFRASES:
  input()     # Wait for Enter before starting the recording
  
  # Capture audio and resample it to 16 kHz
  audio, sr = getAudio()
  audio_float = audio.astype(np.float32)
  audio_16k = librosa.resample(audio_float, orig_sr=sr, target_sr=16000)

  # Speech to text
  entrada = w2v2_processor(audio_16k, sampling_rate=16000, return_tensors="pt").input_values
  probabilidades = w2v2(entrada).logits
  predicciones = torch.argmax(probabilidades, dim=-1)
  frase = w2v2_processor.decode(predicciones[0])
  
  # Print the transcription
  print(f'-> MIGUEL: {frase}')

  # BlenderBot
  entradaBlender = tokenizer([frase], return_tensors='pt')
  ids_respuesta = blender.generate(**entradaBlender)
  respuesta = tokenizer.batch_decode(ids_respuesta)
  respuesta = respuesta[0].replace('<s>','').replace('</s>','')
  print(f'-> BLENDERBOT: {respuesta}')

  nfrase += 1




-> MIGUEL: WHAT DO YOU LIKE TO DO
-> BLENDERBOT:  I don't know. I guess I just want to get out there and try something new.
-> MIGUEL: WHERE DO YOU LIVE
-> BLENDERBOT:  I live on the west coast of the united states. I love it here.
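
wav2vec2 returns its transcription in upper case without punctuation, while the typed chat in section 4 used ordinary sentence casing. Converting the transcription to sentence case before passing it to BlenderBot may give the model input closer to what it saw in training; whether this improves the replies is an untested assumption. `normalize_transcript` is a hypothetical helper, sketched as:

```python
def normalize_transcript(text: str) -> str:
    """Convert an all-caps transcription to sentence case, keeping the pronoun 'I' capitalized."""
    words = text.lower().split()
    if not words:
        return ""
    words = ["I" if word == "i" else word for word in words]
    sentence = " ".join(words)
    return sentence[0].upper() + sentence[1:]


print(normalize_transcript("WHAT DO YOU LIKE TO DO"))  # What do you like to do
print(normalize_transcript("WHERE DO YOU LIVE"))       # Where do you live
```

In the loop, this would be applied to `frase` right after decoding and before tokenizing for BlenderBot.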
