<a href="https://colab.research.google.com/github/ivangtorre/Curso_CRIDA_2022/blob/main/CRIDA_2022_Ejercicio_4_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### A GPU is required to run this notebook
### "Runtime" -> "Change runtime type", then set "Hardware accelerator" -> "GPU"
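
After switching the runtime, it can help to confirm that a GPU is actually visible. The following is a minimal sketch, assuming an NVIDIA runtime where the `nvidia-smi` tool is installed; the helper name `gpu_available` is hypothetical and not part of the original notebook.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if `nvidia-smi` is on PATH and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        # `nvidia-smi -L` prints one line per GPU, e.g. "GPU 0: ...".
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU detected:", gpu_available())
```

If this prints `False`, the runtime type has probably not been changed yet.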

----------------------------------------------------------------------
# Implementing Automatic Speech Recognition (ASR) with Wav2Vec2

Wav2Vec2 is a pretrained model for Automatic Speech Recognition (ASR) and was released in [September 2020](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/) by Alexei Baevski, Michael Auli, and Alex Conneau.

Using a novel contrastive pretraining objective, Wav2Vec2 learns powerful speech representations from more than 50,000 hours of unlabeled speech. Similar to [BERT's masked language modeling](http://jalammar.github.io/illustrated-bert/), the model learns contextualized speech representations by randomly masking feature vectors before passing them to a transformer network.
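
The random masking of feature vectors can be illustrated with a small NumPy sketch. This is not the model's actual implementation: the function name `sample_span_mask`, the toy feature matrix, and the zero vector standing in for the learned mask embedding are all assumptions made for illustration. The defaults (start probability 0.065, span length 10) follow the values reported for wav2vec 2.0, but here each frame is drawn independently as a span start, which is a simplification.

```python
import numpy as np


def sample_span_mask(num_frames, start_prob=0.065, span_length=10, seed=0):
    """Boolean mask over time steps: True marks a masked frame."""
    rng = np.random.default_rng(seed)
    # Each frame independently becomes the start of a masked span.
    starts = rng.random(num_frames) < start_prob
    mask = np.zeros(num_frames, dtype=bool)
    for start in np.flatnonzero(starts):
        # Spans may overlap and are clipped at the end of the sequence.
        mask[start:start + span_length] = True
    return mask


# Toy latent features with shape (frames, dim); real features come from the CNN encoder.
features = np.random.default_rng(1).normal(size=(200, 8))
mask = sample_span_mask(len(features))

# Stand-in for the learned mask vector shared by all masked frames.
mask_embedding = np.zeros(features.shape[1])
masked = np.where(mask[:, None], mask_embedding, features)
print(f"masked {mask.sum()} of {len(mask)} frames")
```

During pretraining, the transformer sees `masked` and is trained to identify the true representation of each masked frame among distractors.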

![wav2vec2_structure](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/wav2vec2.png)

For the first time, it has been shown that pretraining followed by fine-tuning on very little labeled speech data achieves results competitive with state-of-the-art ASR systems, even when using as little as 10 minutes of labeled data.

## First, install some dependencies and import libraries



In [2]:
%%bash
apt install -y ffmpeg
pip install torchaudio ipywebrtc notebook transformers datasets
jupyter nbextension enable --py widgetsnbextension

Reading package lists...
Building dependency tree...
Reading state information...
ffmpeg is already the newest version (7:3.4.8-0ubuntu0.2).
0 upgraded, 0 newly installed, 0 to remove and 37 not upgraded.
Collecting ipywebrtc
  Downloading ipywebrtc-0.6.0-py2.py3-none-any.whl (260 kB)
Collecting transformers
  Downloading transformers-4.15.0-py3-none-any.whl (3.4 MB)
Collecting datasets
  Downloading datasets-1.17.0-py3-none-any.whl (306 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
Collecting xxhash
  Downloading xxhash-2.0



Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK


## Now we can record audio from our microphone

In [3]:
from ipywebrtc import AudioRecorder, CameraStream
import transformers
import torchaudio
from IPython.display import Audio
from google.colab import output
output.enable_custom_widget_manager()

In [4]:
camera = CameraStream(constraints={'audio': True,'video':False})
recorder = AudioRecorder(stream=camera)
recorder

AudioRecorder(audio=Audio(value=b'', format='webm'), stream=CameraStream(constraints={'audio': True, 'video': …

Now we convert the recorded audio to a format suitable for the ASR model.

In [34]:
with open('recording.webm', 'wb') as f:
    f.write(recorder.audio.value)
!ffmpeg -i recording.webm -ac 1 -f wav file.wav -y 
#!ffmpeg -y -i filetemp.wav -ar 44100 file.wav
sig, sr = torchaudio.load("file.wav")
print(sig.shape)
Audio(data=sig, rate=sr)

ffmpeg version 3.4.8-0ubuntu0.2 Copyright (c) 2000-2020 the FFmpeg developers
  built with gcc 7 (Ubuntu 7.5.0-3ubuntu1~18.04)
  configuration: --prefix=/usr --extra-version=0ubuntu0.2 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --enable-gpl --disable-stripping --enable-avresample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librubberband --enable-librsvg --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-lib
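
The conversion above forces a single channel (`-ac 1`) but does not change the sample rate, so it is worth checking what ended up in the WAV file. Below is a minimal sketch using only the standard library `wave` module, assuming the file contains 16-bit PCM (ffmpeg's usual default for WAV). The helper `describe_wav` and the file `tone_example.wav` are hypothetical names used so the example runs without a recording.

```python
import math
import struct
import wave


def describe_wav(path):
    """Return (channels, sample_rate_hz, duration_s) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    return channels, rate, duration


# Write a 1-second, 48 kHz mono test tone as a stand-in for a real recording.
rate = 48000
samples = (int(12000 * math.sin(2 * math.pi * 440 * n / rate)) for n in range(rate))
with wave.open("tone_example.wav", "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
    wav.setframerate(rate)
    wav.writeframes(b"".join(struct.pack("<h", s) for s in samples))

print(describe_wav("tone_example.wav"))  # (1, 48000, 1.0)
```

Calling `describe_wav("file.wav")` on the converted recording should report one channel; a sample rate other than 16000 means the audio still has to be resampled before it is passed to the model.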

## Load an ASR model
 

In [35]:
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import soundfile as sf
import torch
import librosa
 
def parse_transcription(wav_file):
    # Model checkpoint to use; this can be changed
    modelhf = "facebook/wav2vec2-large-960h"

    # load pretrained model
    processor = Wav2Vec2Processor.from_pretrained(modelhf) 
    model = Wav2Vec2ForCTC.from_pretrained(modelhf) 

    # load audio
    audio_input, sample_rate = sf.read(wav_file)
    audio_input = librosa.resample(audio_input, orig_sr=sample_rate, target_sr=16000)
    sample_rate = 16000

    # pad input values and return pt tensor
    input_values = processor(audio_input, sampling_rate=sample_rate, return_tensors="pt").input_values

    # INFERENCE
    # run the model without tracking gradients, then take the argmax over the vocabulary
    with torch.no_grad():
        logits = model(input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)

    # transcribe
    transcription = processor.decode(predicted_ids[0], skip_special_tokens=True)
    print(transcription)
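
The `argmax` step above yields one token id per audio frame, and decoding then turns that frame-level sequence into text. A rough sketch of greedy CTC decoding follows: collapse consecutive repeats, drop the blank token, and map the word delimiter to a space. The vocabulary, the frame ids, and the function `ctc_greedy_collapse` are made up for illustration; the real checkpoint has its own vocabulary, and the assumption here is that the blank is the padding token with id 0 and that `|` separates words.

```python
from itertools import groupby


def ctc_greedy_collapse(ids, id_to_char, blank_id=0, word_delimiter="|"):
    """Collapse repeated ids, drop blanks, and map the remaining ids to text."""
    # groupby merges runs of identical consecutive ids into a single id.
    collapsed = [token for token, _ in groupby(ids)]
    chars = [id_to_char[t] for t in collapsed if t != blank_id]
    return "".join(chars).replace(word_delimiter, " ").strip()


# Toy vocabulary (hypothetical, not the checkpoint's real one).
vocab = {0: "<pad>", 1: "|", 2: "H", 3: "O", 4: "W", 5: "A",
         6: "R", 7: "E", 8: "Y", 9: "U"}

# One predicted id per frame, as produced by an argmax over the logits.
frame_ids = [2, 2, 0, 3, 3, 4, 0, 1, 1, 5, 6, 6, 0, 7, 1, 8, 0, 3, 9, 9, 0]
print(ctc_greedy_collapse(frame_ids, vocab))  # HOW ARE YOU
```

The blank token is what allows genuinely doubled letters: two identical ids separated by a blank survive the collapse, while two adjacent identical ids are merged into one.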



## Decode with the ASR model
The first time this function runs, and every time the model is changed, it downloads the pretrained model from the model repository.

In [36]:
parse_transcription("file.wav")

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


48000
HOW ARE YOU


## Some things to try:


*   Try other models, in both English and Spanish: https://huggingface.co/models?other=wav2vec2&sort=downloads
*   Keep reading about semi-supervised and unsupervised techniques in speech recognition. Links below:

https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/

https://ai.facebook.com/blog/wav2vec-state-of-the-art-speech-recognition-through-self-supervision/

https://ai.facebook.com/blog/wav2vec-unsupervised-speech-recognition-without-supervision/

https://ai.facebook.com/blog/covost-v2-expanding-the-largest-most-diverse-multilingual-speech-to-text-translation-data-set/

