# 🎙️ L4-B: Speech Recognition - Connectionist Temporal Classification

---

Having explored speech features already, let's proceed with end-to-end speech recognition:
- L4-A) Speech Features 🌊
- **L4-B) Speech Recognition - Connectionist Temporal Classification** 🕰️

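Connectionist Temporal Classification (CTC) is an algorithm that addresses the problem of mapping a variable-length input sequence (the audio frames) to a shorter, variable-length output sequence (the transcription) when the alignment between the two is unknown.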

For instance, if you try to transcribe someone saying "hello" by simply classifying the character at every small audio chunk, you may get "hhhheeellooo" or "heeeellllooo". CTC solves this alignment issue, and in this colab we'll illustrate step by step how this is achieved.
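
To see why this is harder than it looks, here is a minimal sketch of the most naive fix: merging every run of repeated characters. The helper name `merge_repeats` is made up for illustration and is not part of any library.

```python
from itertools import groupby


def merge_repeats(frame_labels: str) -> str:
    """Keep one character from each run of identical consecutive characters."""
    return "".join(char for char, _run in groupby(frame_labels))


for frames in ["hhhheeellooo", "heeeellllooo"]:
    # Both frame-level outputs collapse to "helo": the double "l" is lost.
    print(f"{frames} -> {merge_repeats(frames)}")
```

Both strings collapse to "helo" rather than "hello", so merging repeats alone cannot represent words with genuinely doubled letters. This is the gap that CTC has to close.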

We will learn about CTC by using one of the best-known self-supervised speech recognition models: wav2vec 2.0. It consists of a Transformer-based encoder that has been trained with a self-supervised learning method, followed by a classifier that outputs characters.

We will be using two popular Python libraries: HuggingFace Transformers for loading the ASR models, and PyTorch-Lightning for handling the data.

The lab will be carried out through the following steps:

1) Download and explore LibriSpeech test data.

2) Download HuggingFace wav2vec model.

3) Transcribe test data with wav2vec. Check the effect of CTC on the word error rate (WER) metric.

4) Repeat the previous step, but applying different levels of noise to the audio, and study how performance varies with the noise level.

5) Try the models with our own voice, including cross-lingual models.

## 📦 Installs and imports

We will use pip once again to install some required libraries for this colab. Key libraries here are:
- **pytorch-lightning** - a wrapper around the PyTorch library, which we are going to use to load and batch our LibriSpeech dataset
- **torchaudio** - PyTorch's own audio library
- **transformers** - HuggingFace's library, which contains the code and weights for the wav2vec model

In [None]:
# A few installs for audio recording and processing
!sudo apt-get install -q -y timidity libsndfile1
!pip install pydub numba jiwer music21 librosa pytorch-lightning torchaudio
!pip install -q transformers

In [None]:
# Generic imports
import IPython
import os
from tqdm import tqdm

# Math and Deep Learning imports
import numpy as np
import pytorch_lightning as pl
import torch
from torch import nn
from torch.utils.data import DataLoader
import torchaudio

# Audio imports
import librosa
import librosa.display
from base64 import b64decode
from IPython.display import Audio, Javascript
from pydub import AudioSegment
from scipy.io import wavfile

# HuggingFace imports
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer

# Word Error Rate import
from jiwer import wer

## 💾 Dataset loading with PyTorch-Lightning

This time, we're going to explore LibriSpeech, an audiobook dataset. It is well suited for training speech recognition models, because it contains many hours of speech (the readers reading) and the corresponding transcriptions (the book itself).

When we train or run inference with neural networks, it is convenient to send more than one input through the network in a single forward pass, to parallelize computation and increase efficiency.

For instance, if you want to transcribe 4 audios, you can run 4 forward passes to your network separately... or, you can group these 4 audios in the same tensor, and run a single forward pass. This is what we call **batching** - creating a single **batch** with **N** different inputs.
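
Audio clips rarely have the same length, so before they can share one tensor the shorter ones have to be padded. The snippet below is a framework-free sketch of that idea using plain Python lists. The helper `pad_batch` and the sample values are invented for illustration (the data module in this notebook relies on `nn.utils.rnn.pad_sequence` for the same purpose).

```python
def pad_batch(sequences, pad_value=0.0):
    """Right-pad every sequence with pad_value up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]


# Three toy "waveforms" of different lengths (made-up sample values).
clips = [[0.1, -0.2, 0.3], [0.5], [0.0, 0.4]]
batch = pad_batch(clips)

for row in batch:
    print(row)
print(f"Batch shape = ({len(batch)}, {len(batch[0])})")
```

Every row now has the same length, so the batch can be stored as a single N x T array and sent through the network in one forward pass.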

PyTorch-Lightning is a nice wrapper for PyTorch, which is very useful to keep your code functional, tight and clean. We use the following piece of code to add the logic that creates such batches. It is not crucial for now to understand every line of the code below; just keep in mind the high-level idea of batching.

In [None]:
class LibrispeechDataModule(pl.LightningDataModule):
    def __init__(self, data_dir: str = './data', test_batch_size: int = 2):
        super().__init__()

        self.data_dir = data_dir
        self.test_batch_size = test_batch_size
        self.max_size = 60 # maximum sample size is 60 seconds
        self.sample_rate = 16000

    def prepare_data(self):
        torchaudio.datasets.LIBRISPEECH(self.data_dir, url='test-clean', download=True)

    def setup(self, stage=None):
        self.test_dataset = torchaudio.datasets.LIBRISPEECH(self.data_dir, url='test-clean', download=False)

    def test_dataloader(self):
        return DataLoader(self.test_dataset, batch_size=self.test_batch_size, collate_fn=self.collater, shuffle=False)

    def collater(self, samples):
        waveforms = []
        labels = []
        for (waveform, _, utterance, _, _, _) in samples:
            # Skip utterances longer than max_size seconds.
            if waveform.size(1) > self.max_size * self.sample_rate:
                continue
            waveforms.append(waveform.squeeze())
            labels.append(utterance)

        waveforms = nn.utils.rnn.pad_sequence(waveforms, batch_first=True).unsqueeze(1)

        return waveforms, labels

In [None]:
# Run this cell so our LibriSpeech data module is initialized, and the dataset is downloaded.
if not os.path.isdir("./data"):
    os.makedirs("./data")

dm = LibrispeechDataModule(data_dir='./data')
dm.prepare_data()
dm.setup()

## 🎧 Audio visualization

In [None]:
# Run this to extract a batch from the test set, containing waveforms and their corresponding transcripts.
for batch in dm.test_dataloader():
    waveforms, transcripts = batch
    break

In [None]:
# Let's now inspect one of these waveforms: listening to it and checking the transcription.
index = 0

print(f"Waveform shape = {waveforms[index].shape}")
waveform = waveforms[index]
transcript = transcripts[index]

print(f"Transcript = {transcript}")
librosa.display.waveshow(waveform.squeeze().numpy(), sr=16000)
IPython.display.Audio(waveform.squeeze().numpy(), rate=16000)

Now that we have a waveform and its ground truth transcription, let's use the wav2vec model to transcribe it automatically, and illustrate how CTC works.

As we have the ground truth transcription, we can measure how good the predicted transcription is with the Word Error Rate (WER).
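
For intuition about what WER measures, here is a small from-scratch sketch: the word-level edit distance (substitutions, deletions and insertions) divided by the number of words in the reference. The function name is ours, the example sentences are made up, and the sketch assumes a non-empty reference. It is not a replacement for a tested library implementation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + substitution_cost,   # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "THE CAT SAT ON THE MAT"
hypothesis = "THE CAT SIT ON MAT"
# One substitution (SAT -> SIT) and one deletion (THE) over 6 reference words.
print(f"WER = {word_error_rate(reference, hypothesis) * 100:.1f}%")
```

Note that WER can exceed 100% when the hypothesis contains many inserted words, since the denominator only counts reference words.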

## 🤖 Model

Let's download the wav2vec 2.0 model from the HuggingFace repo. We mainly need two things:
- A **tokenizer** - which handles everything related to characters. It transforms input strings into numerical token sequences, and also converts the token IDs predicted by the main model back into readable characters.
- The **model** - which contains all the wav2vec logic required to extract features and compute output scores (logits) for each character at each time step.
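
The tokenizer's role can be sketched with a toy character-level vocabulary. Everything below (`TOY_VOCAB`, `encode`, `decode_ids`) is hypothetical and far smaller than the real vocabulary. It only mirrors the idea that every character, the word separator `|` and the `<pad>` token each get an integer ID.

```python
# Hypothetical toy vocabulary: the real tokenizer has its own, larger one.
TOY_VOCAB = ["<pad>", "|", "A", "C", "T"]
token_to_id = {token: idx for idx, token in enumerate(TOY_VOCAB)}


def encode(text: str) -> list:
    """Map each character to its ID, using '|' as the word separator."""
    return [token_to_id["|" if char == " " else char] for char in text]


def decode_ids(ids: list) -> list:
    """Map IDs back to their tokens."""
    return [TOY_VOCAB[idx] for idx in ids]


ids = encode("A CAT")
print(ids)
print(decode_ids(ids))
```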

In [None]:
# First, we will need a tokenizer to translate the model output probabilities to readable characters.
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")

# Then, download the wav2vec model itself, which will output character probabilities from raw waveforms.
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

## ✍️ Transcribing a waveform

___

Now, we'll transcribe a waveform with the downloaded model and tokenizer. You'll be implementing some of the calls, so in case of doubt, you can check the documentation here: https://huggingface.co/transformers/v4.3.3/model_doc/wav2vec2.html#wav2vec2forctc

✍️ **TASK 1-A** (+0.50/10) - Now, pass the LibriSpeech waveform seen before through the wav2vec model. What does the output look like?

In [None]:
output = None # <- implement the forward pass for wav2vec here.
print(output)

✍️ **TASK 1-B** (+0.50/10) - Extract the logits and take a look at them, especially at their shape.

In [None]:
output_logits = None # <- extract output logits here.
print(output_logits)
print(output_logits.shape)

In [None]:
# The shape should be (B, T, C).
# B = batch size (1 for now, since we're only passing one waveform)
# T = time steps (the number of output frames in the time dimension)
# C = output class number (the number of possible output tokens: the letters in the English alphabet, plus some other characters like ' or the whitespace).
tokenizer.get_vocab()
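
To get a feel for how large T is, the sketch below estimates the number of output frames from the number of input samples. It assumes the convolutional feature encoder of the published wav2vec 2.0 base configuration (seven 1D convolutions with the kernel sizes and strides listed below, no padding). For a loaded checkpoint these values should be readable from `model.config.conv_kernel` and `model.config.conv_stride`, so treat the constants here as an assumption to check rather than a given.

```python
# Assumed kernel sizes and strides of the wav2vec 2.0 base feature encoder.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples: int) -> int:
    """Apply the unpadded 1D convolution length formula layer by layer."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length


sample_rate = 16000
for seconds in (1, 5):
    frames = num_output_frames(seconds * sample_rate)
    print(f"{seconds} s of audio -> {frames} output frames")
```

The strides multiply to 320 samples, so under these assumptions the model emits roughly one frame every 20 ms of 16 kHz audio. That is many more frames than characters in a typical transcript, which is exactly the mismatch CTC deals with.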

✍️ **TASK 1-C** (+0.75/10) - OK, now that you have the output logits, translate them into characters.
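
As a side note on the "biggest logit" idea, the toy sketch below shows why we can pick the largest logit directly instead of converting to probabilities first: softmax preserves the ordering of the scores. The logits are invented numbers for 3 frames and 3 classes, and the helpers are plain-Python stand-ins, not the PyTorch calls used in the task.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=values.__getitem__)


# Made-up logits: one row per time step, one column per output class.
frame_logits = [
    [2.0, -1.0, 0.5],
    [0.1, 3.2, -0.7],
    [-1.5, 0.0, 0.9],
]

ids_from_logits = [argmax(frame) for frame in frame_logits]
ids_from_probs = [argmax(softmax(frame)) for frame in frame_logits]
print(ids_from_logits, ids_from_probs)
```

Both lists are identical, which is why greedy decoding can skip the softmax entirely.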

In [None]:
# First, for each time step T, get the index of the biggest logit, aka the one with the highest probability.
# Clue: use torch.argmax(...), and remember to torch.squeeze the output.
predicted_ids = torch.argmax(...)

# Second, pass the IDs through the tokenizer in order to convert them into tokens.
# Clue: use the convert_ids_to_tokens(...) function, in the tokenizer.
predicted_tokens = None
print(predicted_tokens)

In [None]:
# Well, that didn't look that nice, right? Let's clean up the padding tokens, and replace | tokens with whitespaces.
predicted_string = ''
for token in predicted_tokens:
    if token == '<pad>':
        pass
    elif token == '|':
        predicted_string += ' '
    else:
        predicted_string += token

predicted_string = ' '.join(predicted_string.split())

In [None]:
# Let's compare the predicted string, with the ground truth reference one.
print("Predicted transcript:")
print(predicted_string)
print("Reference transcript:")
print(transcript)

✍️ **TASK 1-D** (+0.50/10) - Almost there, right? However, can we compute a metric for the error rate of the predicted transcription relative to the reference?

In [None]:
# Use the jiwer library, in order to compute the word error rate (WER) score. It has been already installed and imported at the beginning.
# See: https://pypi.org/project/jiwer/

word_error_rate = None
print(f"WER = {word_error_rate*100.0}%")

In [None]:
# The WER is quite high, mainly because the raw frame-level output contains many repeated characters.
# A single sound usually spans several time frames, so the same letter is predicted several times in a row.
# The CTC algorithm solves this. The idea is to merge consecutive duplicated characters. "CAAAATT" -> "CAT"
# Wait! What if our target word itself has repeated characters, like "DINNER"?
# That's why the CTC algorithm introduces the "blank" token (called "<pad>" in this tokenizer). This is a "dummy" token
# that is removed after merging, and the network learns to predict it between two identical letters that must both be kept.

# Correct output
# "DIIIIN<pad>NEEER" -> "DINNER"

# Incorrect output
# "DIINNNNER" -> "DINER"

# There is a function in tokenizer that automatically applies the CTC algorithm to the output tokens, yielding the correct string transcription.
# Can you find it? Look at the source code at: https://huggingface.co/transformers/v4.3.3/_modules/transformers/models/wav2vec2/tokenization_wav2vec2.html#Wav2Vec2Tokenizer
# Or call the following function to see the tokenizer function members. One clue, the function name starts with "convert_tokens_..."
dir(tokenizer)
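
The two-step rule described above (merge repeats first, then drop blanks) can be sketched in a few lines of plain Python. The function `ctc_collapse` and its default token names are our own illustration that mirrors the `<pad>` and `|` tokens seen in this notebook. It is not the tokenizer's implementation.

```python
from itertools import groupby


def ctc_collapse(tokens, blank="<pad>", word_sep="|"):
    """Greedy CTC post-processing: merge repeats, drop blanks, join words."""
    # 1) Merge every run of identical consecutive tokens into one token.
    merged = [token for token, _run in groupby(tokens)]
    # 2) Remove the blank tokens, which only served as separators.
    kept = [token for token in merged if token != blank]
    # 3) Turn word separators into spaces and normalize whitespace.
    text = "".join(" " if token == word_sep else token for token in kept)
    return " ".join(text.split())


with_blank = list("DIIIIN") + ["<pad>"] + list("NEEER")
without_blank = list("DIINNNNER")

print(ctc_collapse(with_blank))     # DINNER
print(ctc_collapse(without_blank))  # DINER
```

The order of the steps matters: if the blanks were removed before merging, the two N's in "DINNER" would become adjacent and be merged into one.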

✍️ **TASK 1-E** (+0.75/10) - Convert the tokens to a string with the CTC algorithm.

In [None]:
predicted_string = None # <- implement the correct function for CTC algorithm here

# Now, compute again the WER, and check if it has improved or not.
print("Predicted transcript:")
print(predicted_string)
print("Reference transcript:")
print(transcript)

word_error_rate = None # <- compute the WER once again here
print(f"WER = {word_error_rate*100.0}%")

## 🥞 Transcribing batches
____

✍️ **TASK 2** (+2.00/10) - As before, we'll be transcribing waveforms, but this time in batches of two waveforms at a time.
You'll be implementing the forward pass of the waveforms through the model, plus the decoding of the output logits into text strings.

You can implement what you learned before. This time, you can directly translate the predicted IDs into text strings with tokenizer's function "batch_decode(...)".

⚠️ - The dataset contains more than 1000 batches, so transcribing everything would take too much time. We will only compute metrics over the first 10 batches, so don't worry, the process should stop automatically after 10 forward passes.



In [None]:
# Let's compute the WER of 10 batches of 2 samples from the LibriSpeech test set.
preds = []
refs = []
stop_at = 10
iteration = 0
for batch in tqdm(dm.test_dataloader()):
    waveforms, transcripts = batch
    refs += transcripts

    print(waveforms.shape)
    # TASK 2: obtain the transcript from the waveform.
    # TASK 2a: go from waveforms -> logits
    output_logits = None
    # TASK 2b: go from logits -> predicted token IDs, with torch.argmax
    predicted_ids = None
    # TASK 2c: go from predicted token IDs, to the final text string, with tokenizer's batch_decode(...) function.
    pred_transcripts = None

    preds += pred_transcripts

    iteration += 1

    if iteration == stop_at:
        break

test_wer = wer(refs, preds)
print(f"\nLibri test WER = {test_wer*100.0}%")

✍️ **TASK 3** (+2.00/10) - So far, we've worked with audio that is quite clean, but what if there is noise or severe distortion?

I want you to artificially add noise to the same audios we've transcribed in batches before. Then, you'll transcribe the noisy audios and measure how much the WER degrades.

The idea is that you repeat the same inference loop as in **TASK 2**, but increasing the noise every time. You will then plot the relationship between the level of noise and the WER.

We will control the level of noise through the **signal-to-noise ratio (SNR)**, measured in decibels (dB). A high SNR (like 50 dB) means that the speech (signal) is far stronger than the noise. At 0 dB both have the same power, and a low SNR (like -10 dB) means that there is way more noise than speech (signal).
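
To make the decibel scale concrete, the sketch below converts a few SNR values into the corresponding signal-to-noise power ratio, using the standard definition SNR_dB = 10 * log10(P_signal / P_noise). The helper names and the tiny example signals are made up for illustration.

```python
import math


def noise_power_for_snr(signal_power: float, snr_db: float) -> float:
    """Noise power needed to reach a target SNR (in dB) for a given signal power."""
    return signal_power / (10 ** (snr_db / 10))


def measured_snr_db(signal, noise) -> float:
    """SNR in dB computed from the mean squared values of signal and noise."""
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    return 10 * math.log10(signal_power / noise_power)


for snr_db in (50, 20, 0, -10):
    ratio = 10 ** (snr_db / 10)
    print(f"SNR = {snr_db:>3} dB -> signal power is {ratio:g}x the noise power")

# Toy check: a noise with 1/10 of the signal amplitude has 1/100 of its power.
toy_signal = [1.0, -1.0, 1.0, -1.0]
toy_noise = [0.1, -0.1, 0.1, -0.1]
print(f"Measured SNR = {measured_snr_db(toy_signal, toy_noise):.1f} dB")
```

Every 10 dB step changes the power ratio by a factor of 10, so going from 50 dB down to -10 dB covers six orders of magnitude of noise power.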

Step by step, you'll have to:

- 1) Implement again the same inference loop as in **TASK 2**.
- 2) But, add noise to the waveforms. You have a utility function for that below.
- 3) Get the average WER for the 10-batch test subset several times. Every time, use a different SNR to control how much noise you add.
- 4) Plot the relationship of SNR (x-axis) vs WER (y-axis).
- 5) Comment and analyze what you see.



In [None]:
# Utility function to add noise to a waveform.
def add_noise_to_waveform(waveform, snr_db):
    """
    Adds Gaussian noise to a waveform at a given SNR (in dB) using PyTorch.

    Args:
        waveform (torch.Tensor): Input waveform (1D PyTorch tensor).
        snr_db (float): Desired Signal-to-Noise Ratio in decibels (dB).

    Returns:
        torch.Tensor: Waveform with added noise at the specified SNR.
    """

    # Calculate the signal power (mean of squared waveform values)
    signal_power = torch.mean(waveform**2)

    # Convert SNR from dB to a linear scale
    snr_linear = 10**(snr_db / 10)

    # Calculate the noise power
    noise_power = signal_power / snr_linear

    # Generate Gaussian noise with the computed noise power
    # (randn_like keeps the same shape, dtype and device as the waveform)
    noise = torch.sqrt(noise_power) * torch.randn_like(waveform)

    # Add the noise to the original waveform
    noisy_waveform = waveform + noise

    return noisy_waveform

In [None]:
x_snrs = [] # Define here the values of SNRs that you want to test.
y_wers = [] # You'll be storing here the average WER value of the test set, as a
            # function of the SNR used to apply noise to the test waveforms.

# Reimplement the loop from TASK 2 here, but this time using different SNR
# values and collecting their corresponding WERs, in order to plot afterwards
# their relation.


In [None]:
# Plot here the relation between SNRs (x-axis) and WERs (y-axis)


In [None]:
# Reason here what you see about the relation between SNR and WER.
print("")

## 🤩 Try it with your own voice
---

If you want to have some fun, you can run the following cells to record and transcribe yourself (microphone required). You'll be able to try in English, and also other languages.

In [None]:
#@title [Run this] Definition of the JS code to record audio straight from the browser

RECORD = """
const sleep  = time => new Promise(resolve => setTimeout(resolve, time))
const b2text = blob => new Promise(resolve => {
  const reader = new FileReader()
  reader.onloadend = e => resolve(e.srcElement.result)
  reader.readAsDataURL(blob)
})
var record = time => new Promise(async resolve => {
  stream = await navigator.mediaDevices.getUserMedia({ audio: true })
  recorder = new MediaRecorder(stream)
  chunks = []
  recorder.ondataavailable = e => chunks.push(e.data)
  recorder.start()
  await sleep(time)
  recorder.onstop = async ()=>{
    blob = new Blob(chunks)
    text = await b2text(blob)
    resolve(text)
  }
  recorder.stop()
})
"""

def record(sec=5, out_name='recorded_audio.wav'):
  try:
    from google.colab import output
  except ImportError:
    print('Not possible to import output from google.colab')
    return ''
  else:
    print('Recording')
    display(Javascript(RECORD))
    s = output.eval_js('record(%d)' % (sec*1000))
    fname = out_name
    print('Saving to', fname)
    b = b64decode(s.split(',')[1])
    with open(fname, 'wb') as f:
      f.write(b)
    return fname

In [None]:
EXPECTED_SAMPLE_RATE = 16000

def convert_audio_for_model(user_file, output_file='converted_audio_file.wav'):
  audio = AudioSegment.from_file(user_file)
  audio = audio.set_frame_rate(EXPECTED_SAMPLE_RATE).set_channels(1)
  audio.export(output_file, format="wav")
  return output_file

def record_utterance(file_name='my_recording.wav'):
    file_name = record(5, file_name)
    converted_audio_file = convert_audio_for_model(file_name)
    input_audio, sr = torchaudio.load(converted_audio_file)
    return input_audio

Run the cell below to record yourself. The recording will stop automatically after approximately 5 seconds. Say whatever you want, in English!

In [None]:
input_audio = record_utterance('english_recording.wav')
print(input_audio.shape)
librosa.display.waveshow(input_audio.squeeze().numpy(), sr=16000)
Audio(input_audio.squeeze().numpy(), rate=16000)

In [None]:
def transcribe_audio(input_audio, asr_model, asr_tokenizer):
    output_logits = asr_model(input_audio).logits
    predicted_ids = torch.argmax(output_logits, dim=-1)
    transcription = asr_tokenizer.batch_decode(predicted_ids)
    return transcription

In [None]:
import time

In [None]:
#start_time = time.time()
transcribe_audio(input_audio, model, tokenizer)
#print(f"Elapsed time = {time.time() - start_time} s")

## 🌐 Changing languages

___

There are further wav2vec models at HuggingFace that are able to transcribe other languages, so you may want to try a language other than English. Look for the cross-lingual models, those containing the "xlsr" keyword, like "facebook/wav2vec2-large-xlsr-53-arabic".

Models are found here: https://huggingface.co/models?pipeline_tag=automatic-speech-recognition

In [None]:
# Load a tokenizer and a model from the XLSR models, in the language you want to try.
#xlsr_tokenizer =
#xlsr_model =
# Solution
xlsr_tokenizer = Wav2Vec2Tokenizer.from_pretrained("ccoreilly/wav2vec2-large-xlsr-catala")
xlsr_model = Wav2Vec2ForCTC.from_pretrained("ccoreilly/wav2vec2-large-xlsr-catala")

Now, record yourself talking in your chosen language!

In [None]:
input_audio = record_utterance('xlsr_recording.wav')
print(input_audio.shape)
librosa.display.waveshow(input_audio.squeeze().numpy(), sr=16000)
Audio(input_audio.squeeze().numpy(), rate=16000)

In [None]:
# Transcription with the English model
print(transcribe_audio(input_audio, model, tokenizer))

# Transcription with the Cross-lingual model
print(transcribe_audio(input_audio, xlsr_model, xlsr_tokenizer))