# Building Your First Automatic Speech Recognition (ASR) model

This notebook introduces you to using a pre-trained Wav2Vec 2.0 model for Automatic Speech Recognition. We will explore the basics of ASR technology and take a hands-on approach to fine-tuning the pre-trained Wav2Vec 2.0 model on a limited dataset. This session is designed to give you a practical understanding of the challenges and steps involved in developing an ASR model.

Do note that our use of such a small dataset here is just meant to give you an indication of what the process would look like; scaling up the dataset size to a more sensible and useful one is an exercise left to the participant.

#### Installation Guide

In [None]:
!pip install librosa jiwer torchaudio jsonlines datasets accelerate transformers gdown

In [None]:
# on colab, download data
!gdown -O ./data --folder https://drive.google.com/drive/folders/1iTpHzVWh8TydoCkd75RGP8TTlMk8SMZJ

<img src="https://lh3.googleusercontent.com/d/1-24xmHwUyx1OuM2qk-TcMLUutjVsMgkd" alt="drawing" width="650">

### Using a pre-trained model for ASR

Wav2Vec 2.0 is a powerful model developed by Facebook for converting speech to text. Here, we'll first use it as-is to transcribe audio, then demonstrate fine-tuning it with minimal additional training.

In [None]:
import jsonlines
import torchaudio
from datasets import Dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor, Trainer, TrainingArguments
from pathlib import Path
import torch
import librosa
import IPython.display as ipd
import jiwer

### Using Wav2Vec2 to transcribe an audio file

In [None]:
model_name = "facebook/wav2vec2-base-960h"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)

audio_file = 'data/audio_1.wav'
audio_input, sample_rate = librosa.load(audio_file, sr=16000)

input_values = processor(audio_input, sampling_rate=sample_rate, return_tensors="pt").input_values

with torch.no_grad():
    logits = model(input_values).logits

predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]

print("Transcription:", transcription)

**Wav2Vec2 Processor:**

The Wav2Vec2 processor bundles the two components needed to move between raw data and the model.
It consists of a feature extractor and a tokenizer.
The feature extractor prepares the raw audio waveform: it checks the sampling rate, normalizes the signal, and pads it when several clips are batched. It does not resample, which is why we load the audio at 16 kHz ourselves.
The tokenizer works on the text side: it converts transcripts into label ids for training and decodes the model's predicted ids back into text.
In our code, we use `Wav2Vec2Processor.from_pretrained` to instantiate the processor from the pretrained model.
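
The decoding step (`argmax` followed by `batch_decode`) can be sketched without the library. This is a minimal illustration of greedy CTC decoding, not the tokenizer's actual implementation: the toy vocabulary below is hypothetical, and it assumes the pad token doubles as the CTC blank and that `|` marks word boundaries.

```python
from itertools import groupby

# Hypothetical toy vocabulary, for illustration only; the real one ships with the checkpoint.
VOCAB = {0: "<pad>", 1: "|", 2: "A", 3: "L", 4: "C"}
BLANK_ID = 0          # assumption: the pad token is used as the CTC blank
WORD_DELIMITER = "|"  # assumption: this symbol separates words


def ctc_greedy_decode(frame_ids):
    """Collapse consecutive repeats, drop blanks, then map ids to text."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    chars = [VOCAB[token_id] for token_id in collapsed if token_id != BLANK_ID]
    return "".join(chars).replace(WORD_DELIMITER, " ").strip()


# One predicted id per audio frame; the blank between the two L's keeps the double letter.
frame_ids = [4, 4, 2, 3, 0, 3, 3, 1, 2, 2, 3, 0, 3]
print(ctc_greedy_decode(frame_ids))
```

Repeated ids are merged before blanks are removed, which is why a blank is needed between two genuinely repeated characters.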

In [None]:
# Play the audio file at its native sampling rate
audio_data, sampling_rate = librosa.load(audio_file, sr=None)
ipd.Audio(audio_data, rate=sampling_rate)

### Evaluation of ASR Systems using Word Error Rate

Word Error Rate (WER) is a crucial metric used to evaluate the performance of an Automatic Speech Recognition (ASR) system. It measures how accurately the ASR system transcribes spoken language into text by comparing the machine-generated transcription to a human-generated reference transcription.

The formula for WER is:

$$ WER = \frac{S + D + I}{N} $$

Where:
- $S$ represents the number of substitutions, which occur when a word from the reference is replaced by a different word in the hypothesis.
- $D$ represents the number of deletions, where a word from the reference is missing in the hypothesis.
- $I$ represents the number of insertions, where a word not present in the reference appears in the hypothesis.
- $N$ is the total number of words in the reference transcription.

The WER is the number of errors (substitutions, deletions, and insertions) in the hypothesis divided by the total number of words in the reference, and is usually reported as a percentage. A WER of 0% means perfect transcription, while a higher WER indicates more discrepancies between the hypothesis and the reference. Because insertions are counted, the WER can exceed 100%.
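
To make the formula concrete, here is a minimal sketch that computes WER from a word-level edit distance. It is an illustration of the definition above, not the implementation used by any particular library; the function name `word_error_rate` is our own.

```python
def word_error_rate(reference, hypothesis):
    """Return (S + D + I) / N using word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # dist[i][j] = minimum number of edits to turn ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                      # deletion
                dist[i][j - 1] + 1,                      # insertion
                dist[i - 1][j - 1] + substitution_cost,  # match or substitution
            )
    return dist[len(ref)][len(hyp)] / len(ref)


# One deleted word out of four reference words
print(word_error_rate("this is a test", "this is test"))  # 0.25
```

The denominator is always the reference length, so a hypothesis with many extra words can score above 1.0.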

In [None]:
# Example data
references = [
    "this is a test",
    "HEADING IS TWO SIX ZERO, TARGET IS BLACK, WHITE, AND YELLOW COMMERCIAL AIRCRAFT, TOOL TO DEPLOY IS SURFACE-TO-AIR MISSILES"
]

hypotheses = [
    "this is test",
    "HEADING HIS TWO STICK FERO TARGATIVE BLACK WHITE AND YELLOW COMMERCIAL AIR CRAFT TOOLED TO DEPOY IN CIRCUS AIR MISSILF"
]


In [None]:
# Function to calculate WER
def calculate_wer(references, hypotheses):
    wer_scores = []
    for ref, hyp in zip(references, hypotheses):
        wer_score = jiwer.wer(ref, hyp)
        wer_scores.append(wer_score)
    return wer_scores

# Calculate WER for each pair
wer_scores = calculate_wer(references, hypotheses)

# Display the results
for i, score in enumerate(wer_scores):
    print(f"Reference {i+1}: {references[i]}")
    print(f"Hypothesis {i+1}: {hypotheses[i]}")
    print(f"WER: {score:.2%}\n")
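
Note that the second reference contains commas and a hyphen that the hypothesis lacks. A metric that compares whitespace-separated tokens treats `ZERO,` and `ZERO` as different words unless the text is normalized first. The sketch below shows one possible normalization; the function name and the choice to turn punctuation into spaces are assumptions, and other conventions are equally valid.

```python
import re


def normalize_transcript(text):
    """Lower-case, replace punctuation with spaces, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9' ]+", " ", text)  # keep letters, digits, apostrophes
    return " ".join(text.split())


print(normalize_transcript("HEADING IS TWO SIX ZERO, TARGET IS BLACK"))
print(normalize_transcript("SURFACE-TO-AIR MISSILES"))
```

Applying the same normalization to both the reference and the hypothesis before scoring keeps punctuation differences from being counted as recognition errors.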


## Fine-tuning

#### Setup and Loading Data
We start by setting up the environment and loading our training data.

In [None]:
# Define the path to the directory
data_dir = Path("data")

# Read data from a jsonl file and reformat it
data = {'key': [], 'audio': [], 'transcript': []}
with jsonlines.open(data_dir / "asr.jsonl") as reader:
    for obj in reader:
        if len(data['key']) < 10:  # Only keep the first 10 entries
            for key, value in obj.items():
                data[key].append(value)

# Convert to a Hugging Face dataset
dataset = Dataset.from_dict(data)

# Shuffle the dataset
dataset = dataset.shuffle(seed=42)

# Split the dataset into training, validation, and test sets
train_size = int(0.8 * len(dataset))
val_size = int(0.1 * len(dataset))
test_size = len(dataset) - train_size - val_size

train_dataset = dataset.select(range(train_size))
val_dataset = dataset.select(range(train_size, train_size + val_size))
test_dataset = dataset.select(range(train_size + val_size, train_size + val_size + test_size))
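
The split arithmetic is worth checking by hand on such a small dataset, because `int()` truncates. A minimal sketch of the same calculation (the helper name `split_sizes` is our own):

```python
def split_sizes(n_examples, train_frac=0.8, val_frac=0.1):
    """Mirror the split above: truncate train/val sizes, give the rest to test."""
    train_size = int(train_frac * n_examples)
    val_size = int(val_frac * n_examples)
    test_size = n_examples - train_size - val_size
    return train_size, val_size, test_size


print(split_sizes(10))  # the 10 entries kept above
print(split_sizes(5))   # a smaller set leaves the validation split empty
```

With 10 entries the split is 8/1/1, so validation and test each rest on a single example.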


In [None]:
data

In [None]:
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

#### Preprocessing Audio and Label Data
Below, we freeze every layer except the final classification head (`lm_head`) and define a preprocessing function, `preprocess_data`, that takes a batch of examples (audio paths and transcripts) and prepares them for training. It loads each audio file, processes it with the Wav2Vec2 processor, creates an attention mask, and pads the labels to match the input length.

In [None]:
# Initially freeze all layers except the classifier layer
for param in model.parameters():
    param.requires_grad = False
for param in model.lm_head.parameters():
    param.requires_grad = True

# Function to load and preprocess audio
def preprocess_data(examples):
    input_values = []
    attention_masks = []
    labels = []

    for audio_path, transcript in zip(examples['audio'], examples['transcript']):
        speech_array, sampling_rate = torchaudio.load(data_dir / audio_path)
        # Assumes mono audio already at the sampling rate the model expects (16 kHz)
        processed = processor(speech_array.squeeze(0).numpy(), sampling_rate=sampling_rate, return_tensors="pt", padding=True)

        # Tokenize the transcript into label ids
        label = processor.tokenizer(transcript, return_tensors="pt")

        input_values.append(processed.input_values.squeeze(0))
        # Each clip is processed on its own, so nothing is padded and the mask is all ones
        attention_mask = torch.ones_like(processed.input_values, dtype=torch.long)
        attention_masks.append(attention_mask.squeeze(0))

        # Ensure labels are padded to the same length as inputs if needed
        padded_label = torch.full(processed.input_values.shape[1:], -100, dtype=torch.long)
        actual_length = label.input_ids.shape[1]
        padded_label[:actual_length] = label.input_ids.squeeze(0)
        labels.append(padded_label)

    # Concatenate all batches
    examples['input_values'] = torch.stack(input_values)
    examples['attention_mask'] = torch.stack(attention_masks)
    examples['labels'] = torch.stack(labels)

    return examples


**Padding:**

- Wav2Vec2 accepts audio of any length, but all tensors within a batch must share the same shape, so shorter sequences are padded. Here the labels (transcripts) are padded to the length of the input features so that every tensor in an example has the same length.
- Label padding is done using PyTorch's `torch.full` function, which creates a tensor of a given shape filled with a specified value. We fill it with -100, a value the loss computation ignores, and copy the actual label ids into the beginning of the tensor.
- Padding ensures that all input tensors within a batch have the same shape, allowing them to be efficiently processed in parallel.
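
As a library-free illustration of label padding, the sketch below pads a batch of label sequences with -100 and then recovers the true target lengths by counting non-negative entries. The helper names are our own, and padding to the longest label in the batch (rather than to the audio length, as in the notebook) is an assumption made to keep the example small.

```python
IGNORE_INDEX = -100  # padding value that the loss is expected to skip


def pad_labels(label_batch, pad_value=IGNORE_INDEX):
    """Right-pad every label sequence to the length of the longest one."""
    max_len = max(len(labels) for labels in label_batch)
    return [labels + [pad_value] * (max_len - len(labels)) for labels in label_batch]


def target_lengths(padded_batch):
    """Count the real (non-negative) label ids in each padded row."""
    return [sum(1 for token_id in row if token_id >= 0) for row in padded_batch]


padded = pad_labels([[5, 6, 7], [8]])
print(padded)
print(target_lengths(padded))
```

Because the padding value is negative and real label ids are not, the true lengths can always be recovered from the padded tensor.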

**Attention Masks:**

- Attention masks indicate which positions in the input the model should attend to and which it should ignore. In the case of Wav2Vec2, they are used to mask padded audio samples.
- We create attention masks of the same shape as the input tensors, initialized with ones; positions that are padding should be zero. Since each clip is processed on its own in this notebook, there is no padding, so the mask is effectively all ones.
- Masking ensures that the model does not attend to padding during training or inference, so padded samples do not influence the output. This matters most when clips of different lengths are batched together.
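
For a batch of clips with different lengths, the mask follows from each clip's original length rather than from the sample values. A minimal sketch using plain lists (the helper name `pad_with_mask` is our own, and padding with 0.0 is an assumption):

```python
def pad_with_mask(waveforms, pad_value=0.0):
    """Right-pad waveforms to a common length and build matching attention masks."""
    max_len = max(len(waveform) for waveform in waveforms)
    padded, masks = [], []
    for waveform in waveforms:
        n_pad = max_len - len(waveform)
        padded.append(list(waveform) + [pad_value] * n_pad)
        masks.append([1] * len(waveform) + [0] * n_pad)  # 1 = real sample, 0 = padding
    return padded, masks


# The first clip contains a genuine 0.0 sample, which must stay unmasked.
padded, masks = pad_with_mask([[0.1, 0.0, -0.2], [0.3]])
print(padded)
print(masks)
```

Deriving the mask from lengths avoids masking real samples that happen to equal the padding value.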

#### Training Configuration
We first apply the preprocessing function to each split, then define the training arguments for the Trainer: the output directory, evaluation strategy, learning rate, batch size, number of epochs, and other training settings.

In [None]:
# Apply preprocessing
train_dataset = train_dataset.map(preprocess_data, batched=True, batch_size=1, remove_columns=train_dataset.column_names)
val_dataset = val_dataset.map(preprocess_data, batched=True, batch_size=1, remove_columns=val_dataset.column_names)
test_dataset = test_dataset.map(preprocess_data, batched=True, batch_size=1, remove_columns=test_dataset.column_names)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="steps",  # named `eval_strategy` in newer transformers releases
    learning_rate=1e-4,
    per_device_train_batch_size=1,  # Reduce to one for simplicity
    num_train_epochs=10,
    weight_decay=0.005,
    # 8 training examples x 10 epochs = 80 steps, so these intervals must stay well below 80
    save_steps=20,
    eval_steps=20,
    logging_steps=10,
    load_best_model_at_end=True
)
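
With step-based evaluation, an interval larger than the total number of training steps means evaluation never runs. The sketch below works through the arithmetic under a few assumptions: a single device, no gradient accumulation, and the 8-example training split that results from keeping 10 entries. The helper names are our own.

```python
import math


def total_training_steps(n_train, batch_size, epochs):
    """Number of optimizer steps on one device without gradient accumulation."""
    return math.ceil(n_train / batch_size) * epochs


def evaluation_points(total_steps, eval_every):
    """Steps at which a step-based evaluation would be triggered."""
    return list(range(eval_every, total_steps + 1, eval_every))


steps = total_training_steps(n_train=8, batch_size=1, epochs=10)
print(steps)
print(evaluation_points(steps, eval_every=500))  # interval longer than the run
print(evaluation_points(steps, eval_every=20))
```

The same reasoning applies to the checkpoint interval, which has to line up with the evaluation interval when the best model is reloaded at the end of training.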

#### Training & Evaluation
Train the model for 10 epochs.

In [None]:
# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,  # Use the validation dataset for evaluation
    tokenizer=processor.feature_extractor
)

# Train the model
trainer.train()

This notebook provides a very basic example of fine-tuning a Wav2Vec2 model on a few data points.
When fine-tuning a complex model like Wav2Vec 2.0 on an extremely limited dataset, the model's performance is likely to be highly unpredictable and generally poor. However, we can still observe the validation loss decreasing over the 10 epochs.

### Conclusion
In this workshop we took our first steps into ASR. Starting with a pre-trained model, we learnt to fine-tune it on a small dataset. Given the complexity and depth of models like Wav2Vec 2.0, they require substantial data to adapt their pre-trained knowledge to new tasks or domains effectively. In a real-world scenario, one would need to manage larger datasets and more sophisticated training routines: training for more epochs, choosing which layers to freeze, tuning hyper-parameters, and applying validation and regularization guided by performance metrics.
