## Setup

### Install required libraries

The libraries should already be installed if you ran
```
pip install -r requirements.txt
```
in the project root directory. The cell below installs them explicitly so that the notebook can also be used in other environments.

In [1]:
%%capture
!pip install datasets==3.6.0
!pip install transformers==4.52.4
!pip install huggingface-hub==0.32.3
!pip install torchaudio==2.7.0
!pip install librosa==0.11.0
!pip install jiwer==3.1.0
!pip install evaluate==0.4.3

### Import required libraries

In [4]:
from datasets import Dataset, Audio, load_from_disk
import random
import evaluate
import pandas as pd
import IPython.display as ipd
import re
import json
from transformers import (Wav2Vec2ForCTC, Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor, Wav2Vec2Processor, 
                            TrainingArguments, Trainer, AutoModelForCTC)
import numpy as np
import torch
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union
import os

pd.set_option('display.max_colwidth', 100)

### Defining helper functions

## Data Preprocessing

Rather than calling ```load_dataset```, which is imported with
```
from datasets import load_dataset
```
we load the Common Voice data directly from the project directory, where it has already been copied.

### Load labels

In [5]:
dataset_split = 'cv-valid-train'
cv_csv_file = f'../datasets/common_voice/{dataset_split}.csv'
train_df = pd.read_csv(cv_csv_file)
train_df.head(5)

| | filename | text | up_votes | down_votes | age | gender | accent | duration |
|---|---|---|---|---|---|---|---|---|
| 0 | cv-valid-train/sample-000000.mp3 | learn to recognize omens and follow them the old king had said | 1 | 0 | | | | |
| 1 | cv-valid-train/sample-000001.mp3 | everything in the universe evolved he said | 1 | 0 | | | | |
| 2 | cv-valid-train/sample-000002.mp3 | you came so that you could learn about your dreams said the old woman | 1 | 0 | | | | |
| 3 | cv-valid-train/sample-000003.mp3 | so now i fear nothing because it was those omens that brought you to me | 1 | 0 | | | | |
| 4 | cv-valid-train/sample-000004.mp3 | if you start your emails with greetings let me be the first to welcome you to earth | 3 | 2 | | | | |


### Clean labels

In [6]:
cleaned_train_df = train_df.drop(['up_votes', 'down_votes', 'age', 'gender', 'accent', 'duration'], axis=1)
audio_files_directory = f'../datasets/common_voice/{dataset_split}'
cleaned_train_df["path"] = cleaned_train_df["filename"].apply(lambda x: os.path.join(audio_files_directory, x))
cleaned_train_df.head(5)

| | filename | text | path |
|---|---|---|---|
| 0 | cv-valid-train/sample-000000.mp3 | learn to recognize omens and follow them the old king had said | ../datasets/common_voice/cv-valid-train/cv-valid-train/sample-000000.mp3 |
| 1 | cv-valid-train/sample-000001.mp3 | everything in the universe evolved he said | ../datasets/common_voice/cv-valid-train/cv-valid-train/sample-000001.mp3 |
| 2 | cv-valid-train/sample-000002.mp3 | you came so that you could learn about your dreams said the old woman | ../datasets/common_voice/cv-valid-train/cv-valid-train/sample-000002.mp3 |
| 3 | cv-valid-train/sample-000003.mp3 | so now i fear nothing because it was those omens that brought you to me | ../datasets/common_voice/cv-valid-train/cv-valid-train/sample-000003.mp3 |
| 4 | cv-valid-train/sample-000004.mp3 | if you start your emails with greetings let me be the first to welcome you to earth | ../datasets/common_voice/cv-valid-train/cv-valid-train/sample-000004.mp3 |


### Inspect text

Check whether 'text' contains uppercase letters or special characters that cannot be transcribed (e.g. colons, commas, percent signs). We keep whitespace because the model has to learn to predict where a word ends; otherwise, the output would be a sequence of characters with no spacing. Apostrophes are also kept because they affect pronunciation. The output below is empty, which shows that no such characters are present and the text is already clean.

In [7]:
# This pattern matches any character that is NOT a lowercase letter, a whitespace character, or an apostrophe (').
# The filter below therefore keeps only rows that contain at least one such character.
special_chars_pattern = r"[^a-z\s']"

df_with_special_chars = cleaned_train_df[cleaned_train_df['text'].str.contains(special_chars_pattern, regex=True, na=False)]

print(df_with_special_chars.head(5))

Empty DataFrame
Columns: [filename, text, path]
Index: []
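
As a quick sanity check of the pattern itself, the sketch below applies it to a few hypothetical strings (not rows from the dataset) using only the standard ```re``` module. It is illustrative and assumes the same pattern as the cell above.

```python
import re

# Same pattern as above: matches any character that is NOT a lowercase
# letter, whitespace, or an apostrophe.
special_chars_pattern = re.compile(r"[^a-z\s']")

# Hypothetical examples, not rows from Common Voice.
samples = ["let me see it", "don't stop", "Hello world", "50% off, today"]

# Keep only the strings that contain at least one disallowed character.
flagged = [s for s in samples if special_chars_pattern.search(s)]
print(flagged)
```

Only the strings with an uppercase letter, digit, or punctuation mark are flagged; the lowercase transcripts, including the one with an apostrophe, pass.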


### Load and decode audio

In [8]:
cleaned_train_df = cleaned_train_df.rename(columns={"path": "audio"})
common_voice_train_original = Dataset.from_pandas(cleaned_train_df)
common_voice_train_temp = common_voice_train_original.cast_column("audio", Audio())

In [9]:
common_voice_train_temp[0]["audio"]

{'path': '../datasets/common_voice/cv-valid-train/cv-valid-train/sample-000000.mp3',
 'array': array([ 0.00000000e+00, -2.39675359e-13, -2.89490112e-14, ...,
         4.10622742e-04,  7.94679625e-04,  7.57523230e-04], shape=(196992,)),
 'sampling_rate': 48000}

The output above shows that the audio was decoded at a sampling rate of 48 kHz, whereas the ```wav2vec2-large-960h``` model expects 16 kHz. Hence, we have to resample the audio data.

In [10]:
common_voice_train_temp = common_voice_train_original.cast_column("audio", Audio(sampling_rate=16_000))

In [11]:
common_voice_train_temp[0]["audio"]

{'path': '../datasets/common_voice/cv-valid-train/cv-valid-train/sample-000000.mp3',
 'array': array([-4.36557457e-11,  9.09494702e-12,  4.00177669e-11, ...,
         1.25038787e-04,  7.30113825e-04,  7.36902468e-04], shape=(65664,)),
 'sampling_rate': 16000}
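
The change in array shape can be checked by hand: resampling keeps the clip duration and scales the sample count by the ratio of the two rates. The short sketch below reproduces that arithmetic with the shapes copied from the two outputs above.

```python
# Values copied from the two cell outputs above.
samples_48k = 196_992
samples_16k = 65_664

# Clip length in seconds at the original rate.
duration_s = samples_48k / 48_000

# Expected sample count after resampling from 48 kHz to 16 kHz.
expected_16k = samples_48k * 16_000 // 48_000

print(duration_s, expected_16k)
```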

In [12]:
common_voice_train_temp

Dataset({
    features: ['filename', 'text', 'audio'],
    num_rows: 195776
})

### Filter out long audio sequences

Long audio sequences require a lot of memory. As training will be done locally with limited computational resources, we filter out sequences longer than 5 seconds.

In [13]:
def filter_short_audio_sequences(data, max_seconds=5):
    # Keep only clips of at most `max_seconds`; assumes the audio was resampled to 16 kHz.
    max_samples = 16000 * max_seconds
    return len(data["audio"]["array"]) <= max_samples

In [14]:
common_voice_train = common_voice_train_temp.filter(filter_short_audio_sequences)

Filter: 100%|██████████| 195776/195776 [09:09<00:00, 356.37 examples/s]


In [15]:
common_voice_train

Dataset({
    features: ['filename', 'text', 'audio'],
    num_rows: 133647
})
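
To make the threshold concrete, the sketch below restates the filter predicate on plain sample counts and computes the share of rows that survived, using the row counts printed above. The name ```is_short_enough``` is hypothetical and only used for illustration.

```python
SAMPLING_RATE = 16_000

def is_short_enough(num_samples, max_seconds=5):
    """Return True if a clip of num_samples fits within max_seconds."""
    return num_samples <= SAMPLING_RATE * max_seconds

# The 65,664-sample clip shown earlier (about 4.1 s) is kept;
# a hypothetical 6-second clip is dropped.
kept = is_short_enough(65_664)
dropped = not is_short_enough(6 * SAMPLING_RATE)

# Fraction of rows retained, using the row counts printed above.
retained = 133_647 / 195_776
print(kept, dropped, f"{retained:.1%}")
```

With a 5-second limit, about 68% of the original rows remain.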

### Check data

In [16]:
rand_int = random.randint(0, len(common_voice_train)-1)

print("Target text:", common_voice_train[rand_int]["text"])
print("Input array shape:", common_voice_train[rand_int]["audio"]["array"].shape)
print("Sampling rate:", common_voice_train[rand_int]["audio"]["sampling_rate"])
print("\n")
ipd.Audio(data=common_voice_train[rand_int]["audio"]["array"], autoplay=True, rate=16000)

Target text: let me see it
Input array shape: (52224,)
Sampling rate: 16000




## Tokenizer

### Building vocabulary

In Wav2Vec2 ASR models, the vocabulary is typically composed of characters or subword units, not whole words.

One user recommends a character-level vocabulary in the following discussion:
[Wav2Vec2ForCTC fine-tuning best practices](https://github.com/huggingface/transformers/issues/15196)

We use a mapping function to build the vocabulary from the training data.

In [17]:
def extract_all_chars(batch):
    all_text = " ".join(batch["text"])
    vocab = list(set(all_text))
    return {"vocab": [vocab]}

In [18]:
vocab_train = common_voice_train.map(extract_all_chars, batched=True, batch_size=-1, keep_in_memory=True, remove_columns=common_voice_train.column_names)

Map: 100%|██████████| 133647/133647 [00:00<00:00, 204670.90 examples/s]


In [19]:
vocab_list = list(set(vocab_train["vocab"][0]))

In [20]:
vocab_dict = {v: k for k, v in enumerate(vocab_list)}
vocab_dict

{'c': 0,
 'b': 1,
 ' ': 2,
 'l': 3,
 'f': 4,
 'r': 5,
 'g': 6,
 'k': 7,
 'o': 8,
 'v': 9,
 'h': 10,
 'z': 11,
 "'": 12,
 's': 13,
 'i': 14,
 'p': 15,
 'm': 16,
 'n': 17,
 'y': 18,
 'q': 19,
 'x': 20,
 'j': 21,
 'u': 22,
 'a': 23,
 'e': 24,
 't': 25,
 'd': 26,
 'w': 27}
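
Note that the IDs above come from iterating over a ```set```, and Python randomizes string hashes per process by default, so the assignment can differ between runs. If reproducible IDs matter (for example, when a saved ```vocab.json``` must match a trained checkpoint), one option is to sort the characters first. The following is a minimal sketch with hypothetical transcripts, not the code used in this notebook.

```python
# Hypothetical transcripts standing in for batch["text"].
texts = ["let me see it", "don't stop"]

# Sorting the character set makes the ID assignment reproducible.
vocab_list = sorted(set(" ".join(texts)))
vocab_dict = {char: idx for idx, char in enumerate(vocab_list)}
print(vocab_dict)
```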

To make the " " token class easier to see, we replace it with the more visible character "|".

In [21]:
vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]

Finally, add unknown and padding tokens as well.

In [22]:
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)
vocab_dict

{'c': 0,
 'b': 1,
 'l': 3,
 'f': 4,
 'r': 5,
 'g': 6,
 'k': 7,
 'o': 8,
 'v': 9,
 'h': 10,
 'z': 11,
 "'": 12,
 's': 13,
 'i': 14,
 'p': 15,
 'm': 16,
 'n': 17,
 'y': 18,
 'q': 19,
 'x': 20,
 'j': 21,
 'u': 22,
 'a': 23,
 'e': 24,
 't': 25,
 'd': 26,
 'w': 27,
 '|': 2,
 '[UNK]': 28,
 '[PAD]': 29}
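
To illustrate how such a vocabulary is used, the sketch below maps a transcript to IDs by replacing spaces with "|" and looking up each character. It is a simplified illustration rather than the tokenizer's actual implementation; the ```encode``` helper is hypothetical, and the IDs are copied from the output above.

```python
# A subset of the vocabulary shown above (IDs copied from the output).
vocab = {"l": 3, "e": 24, "t": 25, "m": 16, "|": 2, "[UNK]": 28, "[PAD]": 29}

def encode(text, vocab, word_delimiter="|", unk_token="[UNK]"):
    """Map each character to its ID, using '|' in place of spaces."""
    chars = text.replace(" ", word_delimiter)
    # Characters missing from the vocabulary fall back to the unknown token.
    return [vocab.get(ch, vocab[unk_token]) for ch in chars]

print(encode("let me", vocab))
```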

Save the vocabulary file so that it can be used by the ```Wav2Vec2CTCTokenizer```.

In [23]:
with open('vocab.json', 'w') as vocab_file:
    json.dump(vocab_dict, vocab_file)

### Instantiate the tokenizer

```Wav2Vec2CTCTokenizer``` is used because:
- The Wav2Vec2 model we are using is trained with Connectionist Temporal Classification (CTC) loss.
- Our vocabulary is character-based, which matches the character-level tokenization that this tokenizer performs.
- It is compatible with ```Wav2Vec2Processor```.
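
For intuition about CTC, the sketch below shows greedy decoding on hypothetical per-frame predictions: repeated IDs are collapsed, blanks are removed, and "|" is turned back into a space. It assumes that the padding token doubles as the CTC blank, as is the convention for Wav2Vec2 CTC models in ```transformers```; the ```ctc_greedy_decode``` helper is illustrative only.

```python
from itertools import groupby

def ctc_greedy_decode(frame_ids, id_to_char, blank_id, word_delimiter="|"):
    """Collapse repeated frame predictions, drop blanks, restore spaces."""
    collapsed = [token for token, _ in groupby(frame_ids)]
    chars = [id_to_char[t] for t in collapsed if t != blank_id]
    return "".join(chars).replace(word_delimiter, " ")

# Hypothetical per-frame predictions using IDs from the vocabulary above;
# 29 ([PAD]) plays the role of the CTC blank.
id_to_char = {3: "l", 24: "e", 25: "t", 2: "|", 16: "m", 13: "s"}
frames = [3, 3, 29, 24, 25, 25, 2, 16, 24, 29, 2, 13, 24, 29, 24]
print(ctc_greedy_decode(frames, id_to_char, blank_id=29))
```

The blank between the two final "e" frames is what allows a genuine double letter to survive the collapsing step.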

In [24]:
tokenizer = Wav2Vec2CTCTokenizer("./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")

## Feature Extraction

Before we can create the ```Wav2Vec2Processor``` object, we need a ```Wav2Vec2FeatureExtractor``` in addition to the ```Wav2Vec2CTCTokenizer```.

The following parameters for the ```Wav2Vec2FeatureExtractor``` are used:
- **feature_size**: Set to 1 because the Wav2Vec2 model is trained on single-channel (mono) raw waveforms
- **sampling_rate**: Set to 16000 Hz to match the sampling rate the model expects
- **padding_value**: Set to 0.0, the conventional value for padding shorter inputs
- **do_normalize**: Set to *True* because zero-mean, unit-variance inputs help with model stability and consistency
- **return_attention_mask**: Set to *True* because Wav2Vec2 models can use the attention mask to ignore padded positions (whether a given checkpoint should be passed the mask depends on how it was pretrained)
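
The effect of ```do_normalize```, ```padding_value``` and ```return_attention_mask``` can be sketched in plain Python. This is a simplified illustration under stated assumptions (per-clip zero-mean, unit-variance scaling followed by right-padding), not the library's implementation; ```normalize``` and ```pad_batch``` are hypothetical helpers.

```python
from statistics import fmean, pvariance

def normalize(waveform, eps=1e-7):
    """Scale a waveform to zero mean and (approximately) unit variance."""
    mean = fmean(waveform)
    std = (pvariance(waveform, mu=mean) + eps) ** 0.5
    return [(x - mean) / std for x in waveform]

def pad_batch(waveforms, padding_value=0.0):
    """Right-pad to the longest waveform and build matching attention masks."""
    max_len = max(len(w) for w in waveforms)
    padded = [w + [padding_value] * (max_len - len(w)) for w in waveforms]
    masks = [[1] * len(w) + [0] * (max_len - len(w)) for w in waveforms]
    return padded, masks

# Two hypothetical mono clips of different lengths.
batch = [normalize([0.1, 0.3, 0.5, 0.7]), normalize([0.2, -0.2])]
padded, masks = pad_batch(batch)
print(masks)
```

The mask marks real samples with 1 and padded positions with 0, so downstream code can tell them apart.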

### Instantiate the feature extractor

In [25]:
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True, return_attention_mask=True)

## Wav2Vec2 Processor

### Instantiate the processor

In [26]:
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

## Data Preparation

### Split dataset

We split the filtered 'cv-valid-train' dataset further in a 70-30 ratio, keeping 30% for validation.

In [27]:
split = common_voice_train.train_test_split(test_size=0.3, seed=42)

common_voice_train_train = split['train']
common_voice_train_validation = split['test']

Inspect the training split (```common_voice_train_train```)

In [28]:
common_voice_train_train

Dataset({
    features: ['filename', 'text', 'audio'],
    num_rows: 93552
})

Inspect the validation split (```common_voice_train_validation```)

In [29]:
common_voice_train_validation

Dataset({
    features: ['filename', 'text', 'audio'],
    num_rows: 40095
})
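
The two row counts can be reproduced from the filtered dataset size. The sketch below assumes that the test split size is rounded up and the remainder goes to training, which is consistent with the numbers printed above.

```python
import math

n_rows = 133_647   # rows remaining after the length filter
test_size = 0.3

# Assumption: the test split is rounded up and the rest goes to training,
# which is consistent with the row counts printed above.
n_validation = math.ceil(n_rows * test_size)
n_train = n_rows - n_validation
print(n_train, n_validation)
```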

### Map data to be used by processor

More complex feature extraction methods can be added inside the ```prepare_dataset``` function.

In [30]:
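def prepare_dataset(batch):
    audio = batch["audio"]

    # Convert the raw waveform into normalized model inputs.
    batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]

    # Encode the transcript as character IDs. Passing `text=` replaces the
    # deprecated `processor.as_target_processor()` context manager.
    batch["labels"] = processor(text=batch["text"]).input_ids

    return batch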

Apply ```prepare_dataset``` to the training and validation datasets, then save each processed dataset to disk so that the mapping does not have to be repeated.

For a large dataset, processing it in chunks and concatenating the processed chunks before saving once can help avoid memory or kernel crashes.
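
The cells below map each split in a single call. If that exhausts memory, chunked processing could look roughly like the following sketch. The names ```chunk_ranges```, ```map_in_chunks``` and ```chunk_size``` are hypothetical and not part of this notebook; the sketch relies on ```Dataset.select```, ```Dataset.map``` and ```concatenate_datasets``` from the ```datasets``` library.

```python
def chunk_ranges(num_rows, chunk_size):
    """Yield (start, stop) index pairs that together cover range(num_rows)."""
    for start in range(0, num_rows, chunk_size):
        yield start, min(start + chunk_size, num_rows)

def map_in_chunks(dataset, fn, chunk_size=10_000):
    """Apply fn to consecutive slices of dataset and concatenate the results."""
    # Imported here so chunk_ranges stays usable without the library installed.
    from datasets import concatenate_datasets

    parts = []
    for start, stop in chunk_ranges(len(dataset), chunk_size):
        chunk = dataset.select(range(start, stop))
        parts.append(chunk.map(fn, remove_columns=dataset.column_names))
    return concatenate_datasets(parts)

# Example of the slicing alone, for a hypothetical 25-row dataset.
print(list(chunk_ranges(25, 10)))
```

A call such as ```map_in_chunks(common_voice_train_train, prepare_dataset)``` would then stand in for the single ```map``` call; the chunk size is a tuning knob, not a recommended value.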

In [31]:
training_dataset = common_voice_train_train.map(prepare_dataset, remove_columns=common_voice_train_train.column_names, batched=False)

Map: 100%|██████████| 93552/93552 [11:10<00:00, 139.50 examples/s]


In [32]:
training_dataset_cache_dir = "./caches/training"
os.makedirs(training_dataset_cache_dir, exist_ok=True)
training_dataset.save_to_disk(training_dataset_cache_dir)

Saving the dataset (42/42 shards): 100%|██████████| 93552/93552 [05:22<00:00, 290.22 examples/s]


In [None]:
validation_dataset = common_voice_train_validation.map(prepare_dataset, remove_columns=common_voice_train_validation.column_names, batched=False)

Map:  97%|█████████▋| 38997/40095 [04:46<00:06, 160.28 examples/s]

In [None]:
validation_dataset_cache_dir = "./caches/validation"
os.makedirs(validation_dataset_cache_dir, exist_ok=True)
validation_dataset.save_to_disk(validation_dataset_cache_dir)

## Training

## Validation