<a href="https://colab.research.google.com/github/kalindasiaminwe/NLP_and_ML_Projects/blob/master/XLS_R_300m_Tonga_ASR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning XLS-R for Tonga ASR with 🤗 Transformers

Wav2Vec2 is a pretrained model for Automatic Speech Recognition (ASR), released in September 2020 by Alexei Baevski, Michael Auli, and Alex Conneau. After Wav2Vec2 showed strong performance on the widely used LibriSpeech benchmark, Facebook AI introduced a multilingual version called XLSR (short for cross-lingual speech representations). XLSR learns speech representations that transfer across languages, which makes it a useful starting point for building ASR systems in many different languages. XLS-R, the checkpoint family used in this notebook, is the larger-scale successor to XLSR.

# Setup
In this notebook, we take the pre-trained checkpoint Wav2Vec2-XLS-R-300M and fine-tune it for ASR in Chitonga.

First, make sure you have access to a capable GPU, since fine-tuning a 300M-parameter model is slow without one. Such GPUs are becoming harder to get on the free version of Google Colab; with a Google Colab Pro subscription you are typically assigned a V100 or P100. The two cells below report which GPU and how much RAM the current runtime has.

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Fri Dec  9 14:12:27 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P0    25W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 27.3 gigabytes of available RAM

You are using a high-RAM runtime!


Before we begin, let's make sure we have the necessary packages installed. We'll need the datasets and transformers libraries, as well as torchaudio to load audio files and jiwer to evaluate our fine-tuned model using the word error rate (WER) metric. Once we have these packages installed, we'll be ready to start working on our model.

In [None]:
%%capture
!pip install datasets==1.18.3
!pip install transformers==4.17.0
!pip install jiwer

In [None]:
!pip install bitsandbytes

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bitsandbytes
  Downloading bitsandbytes-0.35.4-py3-none-any.whl (62.5 MB)
     |████████████████████████████████| 62.5 MB 1.4 MB/s
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.35.4


Next, we log in to the 🤗 Hub so that training checkpoints can be uploaded directly while training. The 🤗 Hub has integrated version control, so no model checkpoint gets lost during training.

We authenticate with the access token from our Hugging Face account.

In [None]:
from huggingface_hub import notebook_login

notebook_login()



Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.huggingface/token
Login successful


Then install Git-LFS, which is needed to upload the model checkpoints:

In [None]:
%%capture
!apt install git-lfs

# Prepare Data, Tokenizer, Feature Extractor

Automatic speech recognition (ASR) models are used to transcribe speech into text. This involves two key components: a feature extractor that processes the input speech signal and converts it into a format that the model can understand (e.g. a feature vector), and a tokenizer that processes the model's output and converts it into text.

In 🤗 Transformers, the XLS-R model is accompanied by a tokenizer called Wav2Vec2CTCTokenizer and a feature extractor called Wav2Vec2FeatureExtractor. These two components work together to enable the model to accurately transcribe speech into text.

To use these components, we need to create instances of the Wav2Vec2CTCTokenizer and Wav2Vec2FeatureExtractor classes. This will allow us to use the tokenizer to decode the model's predicted output classes and convert them into the final transcription.

Before we do that, we first import our data.

## Create Wav2Vec2CTCTokenizer

Because our dataset is stored on Google Drive, we mount the drive to get direct access to the data.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


We then locate the data and prepare them for usage.

**NOTE:** The data is already split into train, test and eval (validation) sets. You can find the dataset in our GitHub repo: https://github.com/unza-speech-lab/zambezi-voice

In [None]:
abs_path_to_data = "/content/drive/MyDrive/Tonga/data"
!ls {abs_path_to_data}/splits/*.csv

/content/drive/MyDrive/Tonga/data/splits/eval.csv
/content/drive/MyDrive/Tonga/data/splits/test.csv
/content/drive/MyDrive/Tonga/data/splits/train.csv


In [None]:
import pandas as pd
import numpy as np
from glob import glob
import hashlib

In [None]:
save_dir = "/content/save_dir"
import os
if not os.path.exists(save_dir):
  os.mkdir(save_dir)

In [None]:
def load_datasets(abs_path_to_data):
  # Process every split file (train/eval/test) in the same way.
  for split_path in glob(f"{abs_path_to_data}/splits/*.csv"):
    split_name = os.path.basename(split_path)[:-4]  # strip the ".csv" extension

    split_df = pd.read_csv(split_path, sep="\t")
    # Build the absolute path to each audio file from its id.
    split_df["path"] = abs_path_to_data + "/audio/" + split_df["audio_id"] + ".wav"
    # Keep only the rows whose audio file actually exists on disk.
    # (The original version flagged missing files but then filtered on the wrong column.)
    split_df = split_df[split_df["path"].apply(os.path.exists)]
    split_df = split_df.drop(columns=["audio_id"])
    split_df = split_df.rename(columns={"path": "audio"})
    # Save the cleaned split next to the others so it can be loaded with `load_dataset`.
    split_df.to_csv(f"{save_dir}/{split_name}.csv", sep="\t", index=False)
    print(f"No. of {split_name} records: {len(split_df)}")

In [None]:
load_datasets(abs_path_to_data)

No. of train records: 1107
No. of eval records: 493
No. of test records: 453


In [None]:
from datasets import load_dataset, load_metric, Audio

tonga_train = load_dataset("csv",
                                  data_files={"train": f"{save_dir}/train.csv"},
                                  delimiter="\t")["train"]
tonga_dev = load_dataset("csv",
                                data_files={"eval": f"{save_dir}/eval.csv"},
                                delimiter="\t")["eval"]
tonga_test = load_dataset("csv",
                                 data_files={"test": f"{save_dir}/test.csv"},
                                 delimiter="\t")["test"]

print(tonga_train)
print(tonga_dev)
print(tonga_test)



Dataset({
    features: ['sentence', 'audio'],
    num_rows: 1107
})
Dataset({
    features: ['sentence', 'audio'],
    num_rows: 493
})
Dataset({
    features: ['sentence', 'audio'],
    num_rows: 453
})


We write a function to display some random samples of the dataset and run it a couple of times to get a feeling for the transcriptions.

In [None]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    display(HTML(df.to_html()))

In [None]:
show_random_elements(tonga_train.remove_columns(["audio"]), num_examples=10)

Unnamed: 0,sentence
0,ndalumba ncibi kubeja nkaambo mebo inga limwi nseya kwaatola kumpali zyabukoko
1,kwiinda mukutazyiba kwangu ndakaile kuunkila mukasuko kakambizyi mulandu wakujaya lwando
2,nkabela cipanzi canyika mpaakayakide zilao zyakwe wakacuula kubana ba hamori wisi sekemu wakaula amali aakesita aali mwanda
3,zintu ziligwa mumyuunda mumwezi wa miyoba nzeezi mapopwe minsale myuungu mapusi makowa
4,ndiya kukujana kukavwu kasubila wakajokezya ntobolo yakwe ansando munkomwi yamukati aajekete
5,nkabela sekemu mwana wa hamori mu hiti mwami wacisi naakamubona wakamutizya woona awe wamubisizya
6,mafuta atununkilizyo zilabotezya moyo mbulubede lulayo lwamweenzinyoko uuyandika
7,makondo waulisya basune makumi otatwe
8,asike amunzi waatalika kufwala
9,nekuba kuti watwa mufubafuba muncili amunsi antoomwe amaila bufubafuba bwakwe tabumani pe


Next we normalize the text: we remove all characters that don't contribute to the meaning of a word and cannot be represented by an acoustic sound, change digits to their spelled-out form, and replace "hatted" characters such as `å` with their "un-hatted" equivalent, *e.g.* `a`.


In [None]:
import re

# Characters that carry no acoustic information and are stripped from the transcripts.
# 'å' and 'ã' are deliberately NOT listed here: removing them first would delete them
# before they can be mapped to 'a' below.
chars_to_remove_regex = r'[,?.!\-\'_¬;:"“%‘”\x8b¨¼�]'

# Digits that occur in the transcripts, mapped to their spelled-out form,
# and "hatted" characters mapped to their plain equivalent.
replacements = {
    '1': 'one', '2': 'two', '3': 'three', '4': 'four',
    '5': 'five', '6': 'six', '8': 'eight',
    'å': 'a', 'ã': 'a',
}

def remove_special_characters(batch):
    # Lowercase first so that upper-case variants are normalized as well.
    sentence = batch["sentence"].lower()
    sentence = re.sub(chars_to_remove_regex, '', sentence)
    for old, new in replacements.items():
        sentence = sentence.replace(old, new)
    batch["sentence"] = sentence
    return batch

In [None]:
tonga_train = tonga_train.map(remove_special_characters)
tonga_dev = tonga_dev.map(remove_special_characters)
tonga_test = tonga_test.map(remove_special_characters)



0ex [00:00, ?ex/s]

0ex [00:00, ?ex/s]

0ex [00:00, ?ex/s]

Next we extract all distinct letters of the training and validation data and build our vocabulary from this set of letters.

We write a mapping function that concatenates all transcriptions into one long transcription and then transforms the string into a set of chars. 
It is important to pass the argument `batched=True` to the `map(...)` function so that the mapping function has access to all transcriptions at once.

In [None]:
def extract_all_chars(batch):
  all_text = " ".join(batch["sentence"])
  vocab = list(set(all_text))
  return {"vocab": [vocab], "all_text": [all_text]}

In [None]:
vocab_train = tonga_train.map(extract_all_chars, batched=True, batch_size=-1, keep_in_memory=True, remove_columns=tonga_train.column_names)
vocab_dev = tonga_dev.map(extract_all_chars, batched=True, batch_size=-1, keep_in_memory=True, remove_columns=tonga_dev.column_names)
vocab_test = tonga_test.map(extract_all_chars, batched=True, batch_size=-1, keep_in_memory=True, remove_columns=tonga_test.column_names)

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

Then we create the union of all distinct letters in the training and validation datasets and convert the resulting list into an enumerated dictionary.

In [None]:
vocab_list = list(set(vocab_train["vocab"][0]) | set(vocab_dev["vocab"][0]))

In [None]:

vocab_dict = {v: k for k, v in enumerate(sorted(vocab_list))}

`" "` has its own token class, so we give it a more visible character, `|`. In addition, we add an "unknown" token so that the model can later deal with characters not encountered in our training set. We also add a padding token that corresponds to CTC's "blank token". The "blank token" is a core component of the CTC algorithm. More information can be found here: https://distill.pub/2017/ctc/
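To make the role of the blank token more concrete, here is a minimal pure-Python sketch of greedy CTC decoding. The toy vocabulary, the frame ids, and the helper name `ctc_greedy_decode` are assumptions made for illustration; in this notebook the actual decoding is handled by the tokenizer.

```python
# Toy vocabulary (an assumption for illustration):
# "|" is the word delimiter and "[PAD]" doubles as the CTC blank token.
toy_vocab = {"|": 0, "a": 1, "b": 2, "[PAD]": 3}
id_to_token = {idx: token for token, idx in toy_vocab.items()}

def ctc_greedy_decode(frame_ids, blank_id, delimiter="|"):
    """Collapse repeated ids, drop blanks, and turn the delimiter into a space."""
    collapsed = []
    previous = None
    for idx in frame_ids:
        if idx != previous:  # keep only the first id of each run of repeats
            collapsed.append(idx)
        previous = idx
    tokens = [id_to_token[idx] for idx in collapsed if idx != blank_id]
    return "".join(tokens).replace(delimiter, " ").strip()

# One predicted id per audio frame. The blank (3) between the two runs of "a"
# is what allows the double letter "aa" to survive the collapsing step.
frame_ids = [1, 1, 3, 1, 2, 2, 0, 0, 2, 3, 1]
print(ctc_greedy_decode(frame_ids, blank_id=toy_vocab["[PAD]"]))
```

Without the blank token, repeated characters such as the `aa` in many Chitonga words could not be distinguished from a single character that simply spans several frames.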

In [None]:
vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]

In [None]:
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)
len(vocab_dict)

Our vocabulary is complete and consists of 28 tokens.

In [None]:
vocab_dict

{'a': 1,
 'b': 2,
 'c': 3,
 'd': 4,
 'e': 5,
 'f': 6,
 'g': 7,
 'h': 8,
 'i': 9,
 'j': 10,
 'k': 11,
 'l': 12,
 'm': 13,
 'n': 14,
 'o': 15,
 'p': 16,
 'r': 17,
 's': 18,
 't': 19,
 'u': 20,
 'v': 21,
 'w': 22,
 'x': 23,
 'y': 24,
 'z': 25,
 '|': 0,
 '[UNK]': 26,
 '[PAD]': 27}

We can see that every letter of the Latin alphabet except `q` occurs in the dataset. Keep in mind that pre-processing is a very important step before training your model. E.g., we don't want our model to differentiate between `a` and `A`, because the difference between the two does not depend on the "sound" of the letter at all, but on grammatical rules - *e.g.* use a capitalized letter at the beginning of the sentence. So it is essential to remove the difference between capitalized and non-capitalized letters so that the model has an easier time learning to transcribe speech.
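The vocabulary construction described above can be condensed into a few lines. The following is a small, self-contained sketch on two toy sentences; the helper name `build_vocab` is hypothetical and not part of the notebook's pipeline.

```python
def build_vocab(sentences):
    """Map each distinct character to an integer id, as done for CTC above."""
    chars = sorted(set(" ".join(sentences)))
    vocab = {ch: idx for idx, ch in enumerate(chars)}
    # Make the space visible by renaming it to the word delimiter "|".
    vocab["|"] = vocab.pop(" ")
    # Append the special tokens at the end of the vocabulary.
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab

toy_vocab = build_vocab(["asike amunzi", "makondo"])
print(toy_vocab)
print(len(toy_vocab))
```

Because the space sorts before all letters, the word delimiter `|` ends up with id 0, which matches the dictionary printed earlier.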

Now save the vocabulary as a json file.

In [None]:
import json
with open('vocab.json', 'w') as vocab_file:
    json.dump(vocab_dict, vocab_file)

We set the model checkpoint and load its configuration.

In [None]:
model_checkpoint = "facebook/wav2vec2-xls-r-300m"

In [None]:
from transformers import AutoConfig
config = AutoConfig.from_pretrained(model_checkpoint)

Downloading:   0%|          | 0.00/1.53k [00:00<?, ?B/s]

In [None]:
tokenizer_type = config.model_type if config.tokenizer_class is None else None
config = config if config.tokenizer_class is not None else None

We use the json file to load the vocabulary into an instance of the `Wav2Vec2CTCTokenizer` class.

In [None]:
from transformers import Wav2Vec2CTCTokenizer

tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("./", 
                                                 unk_token="[UNK]", 
                                                 pad_token="[PAD]", 
                                                 word_delimiter_token="|")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



We derive the name of our fine-tuned model from the checkpoint name.

In [None]:
model_checkpoint_name = model_checkpoint.split('/')[-1]
repo_name = f"{model_checkpoint_name}-tonga-5hrs"

## Create Wav2Vec2FeatureExtractor

Speech is a continuous signal that must be discretized in order to be processed by computers. This process is known as sampling, and the rate at which it is performed is called the sampling rate. The higher the sampling rate, the better the approximation of the original speech signal, but it also requires more data points to be measured per second. Therefore, the choice of sampling rate is a trade-off between accuracy and data efficiency.

Before fine-tuning a pretrained checkpoint of an ASR model, it is crucial to verify that the sampling rate of the data that was used to pretrain the model matches the sampling rate of the dataset used to fine-tune the model. XLS-R was pretrained on audio data of Babel, Multilingual LibriSpeech (MLS), Common Voice, VoxPopuli, and VoxLingua107 at a sampling rate of 16kHz, so our audio needs to have the same sampling rate.

A Wav2Vec2FeatureExtractor object requires several parameters to be instantiated, including:

- feature_size: This is the size of the feature vectors that the model expects as input. For Wav2Vec2, the feature size is 1 because the model was trained on the raw speech signal.
- sampling_rate: This is the sampling rate of the audio the model was trained on.
- padding_value: This is the value used to pad shorter inputs when performing batched inference.
- do_normalize: This specifies whether the input should be zero-mean-unit-variance normalized. Speech models often perform better when the input is normalized in this way.
- return_attention_mask: This specifies whether the model should use an attention mask for batched inference. In general, XLS-R model checkpoints should always use an attention mask.
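To illustrate what `do_normalize` means, here is a minimal pure-Python sketch of zero-mean-unit-variance normalization on a toy waveform. The helper name and the small `eps` constant are assumptions for illustration; the feature extractor applies the same idea to each audio array.

```python
import math

def zero_mean_unit_variance(samples, eps=1e-7):
    """Shift a waveform to zero mean and scale it to (approximately) unit variance."""
    mean = sum(samples) / len(samples)
    variance = sum((s - mean) ** 2 for s in samples) / len(samples)
    # eps guards against division by zero for silent (constant) inputs.
    return [(s - mean) / math.sqrt(variance + eps) for s in samples]

# A made-up waveform with a handful of samples.
waveform = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
normalized = zero_mean_unit_variance(waveform)
print([round(v, 3) for v in normalized])
```

After normalization, recordings made at different volumes end up on a comparable scale, which is one reason speech models tend to train more reliably on normalized input.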





In [None]:
from transformers import AutoFeatureExtractor
feature_extractor = AutoFeatureExtractor.from_pretrained(model_checkpoint)

Downloading:   0%|          | 0.00/212 [00:00<?, ?B/s]

XLS-R's feature extraction pipeline has now been fully defined!

For improved user-friendliness, the feature extractor and tokenizer are wrapped into a single Wav2Vec2Processor class so that one only needs a model and processor object.

In [None]:
from transformers import Wav2Vec2Processor
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

## Preprocess Data

Import required libraries

In [None]:
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2Processor,
    Wav2Vec2ForCTC,
    TrainingArguments,
    Trainer
)

We set the sampling rate of the audio column to 16 kHz (16,000 samples per second).

In [None]:
tonga_train = tonga_train.cast_column("audio", Audio(sampling_rate=16_000))
tonga_dev = tonga_dev.cast_column("audio", Audio(sampling_rate=16_000))
tonga_test = tonga_test.cast_column("audio", Audio(sampling_rate=16_000))

Next we listen to a couple of audio files to better understand the dataset and verify that the audio was correctly loaded.

**Note**: You can click the following cell a couple of times to listen to different speech samples.

In [None]:
import IPython.display as ipd
import numpy as np
import random

rand_int = random.randint(0, len(tonga_train)-1)

print(tonga_train[rand_int]["sentence"])
ipd.Audio(data=tonga_train[rand_int]["audio"]["array"], autoplay=True, rate=16000)


moyo wako utafwidi basizibi ibbivwe pele kaka tila kukuyoowa jehova buzuba boonse 


Our data is correctly loaded and resampled.

We do a final check that the data is correctly prepared, by printing the shape of the speech input, its transcription, and the corresponding sampling rate.

Note: You can click the following cell a couple of times to verify multiple samples.

In [None]:
rand_int = random.randint(0, len(tonga_train)-1)

print("Target text:", tonga_train[rand_int]["sentence"])
print("Input array shape:", tonga_train[rand_int]["audio"]["array"].shape)
print("Sampling rate:", tonga_train[rand_int]["audio"]["sampling_rate"])

Target text: mwana asikulya izina lyaaanda yangu ngooyu eliezere wakudamasko alimwi abramu 
Input array shape: (178959,)
Sampling rate: 16000


The data is a 1-dimensional array, the sampling rate always corresponds to 16kHz, and the target text is normalized. This is what we wanted to achieve. 
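As a quick sanity check on the numbers printed above, the length of a clip in seconds is simply the number of samples divided by the sampling rate. The helper name below is hypothetical.

```python
def duration_seconds(num_samples, sampling_rate):
    """Length of an audio clip in seconds."""
    return num_samples / sampling_rate

# Shape and sampling rate taken from the sample output above.
print(round(duration_seconds(178_959, 16_000), 2))
```

The sample shown above is therefore a little over 11 seconds long.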



Finally, we can leverage Wav2Vec2Processor to process the data to the format expected by Wav2Vec2ForCTC for training. To do so let's make use of Dataset's map(...) function.
We first load and resample the audio data, simply by calling batch["audio"]. Then we extract the input_values from the loaded audio file; in our case, the Wav2Vec2Processor only normalizes the data. Finally, we encode the transcriptions to label ids.

In [None]:
def prepare_dataset(batch):
    audio = batch["audio"]

    # batched output is "un-batched"
    batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
    batch["input_length"] = len(batch["input_values"])

    with processor.as_target_processor():
        batch["labels"] = processor(batch["sentence"]).input_ids
    return batch

In [None]:
tonga_train = tonga_train.map(prepare_dataset, remove_columns=tonga_train.column_names, num_proc=4)
tonga_dev = tonga_dev.map(prepare_dataset, remove_columns=tonga_dev.column_names, num_proc=4)
tonga_test = tonga_test.map(prepare_dataset, remove_columns=tonga_test.column_names, num_proc=4)

In [None]:
tonga_dev

Dataset({
    features: ['input_values', 'input_length', 'labels'],
    num_rows: 493
})

# Training

Before we can set up the training pipeline for fine-tuning an XLS-R model, we need to prepare the data by doing the following:

- Define a data collator that can handle the large input sizes of XLS-R models. Since the input length is much greater than the output length, it is more efficient to pad the training batches dynamically, meaning that each sample is only padded to the length of the longest sample in its batch, rather than the overall longest sample.
- Define a function to compute the evaluation metric (e.g. word error rate) that will be used to assess the model's performance during training.
- Load a pretrained model checkpoint and configure it for training.
- Define the training configuration (e.g. learning rate, number of epochs, etc.).

Once the model has been fine-tuned, we can evaluate it on the test data to verify that it has learned to transcribe speech correctly.






We first define the data collator. The code is taken from this [example](https://github.com/huggingface/transformers/blob/7e61d56a45c19284cfda0cee8995fb552f6b1f4e/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py#L219).


In [None]:
import torch

from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union

@dataclass
class DataCollatorCTCWithPadding:
    """
    Data collator that will dynamically pad the inputs received.
    Args:
        processor (:class:`~transformers.Wav2Vec2Processor`)
            The processor used for processing the data.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
              sequence is provided).
            * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
              maximum acceptable input length for the model if that argument is not provided.
            * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
              different lengths).
    """

    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            return_tensors="pt",
        )
        with self.processor.as_target_processor():
            labels_batch = self.processor.pad(
                label_features,
                padding=self.padding,
                return_tensors="pt",
            )

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        return batch

In [None]:
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)
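To see what the collator does without any library code, here is a toy, pure-Python sketch of dynamic padding. All values and the helper name `pad_batch` are made up for illustration; the real collator delegates padding to the processor and returns PyTorch tensors.

```python
def pad_batch(sequences, pad_value):
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    # 1 marks real positions, 0 marks padding.
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask

# Toy batch: two "audio" inputs and their label ids (all values are made up).
input_values = [[0.1, -0.2, 0.3, 0.0, 0.5], [0.4, 0.1]]
labels = [[5, 1, 7], [2]]

# Inputs and labels are padded separately because their lengths differ.
padded_inputs, input_mask = pad_batch(input_values, pad_value=0.0)
padded_labels, label_mask = pad_batch(labels, pad_value=0)

# Replace label padding with -100 so that those positions are ignored by the loss.
masked_labels = [
    [token if keep else -100 for token, keep in zip(row, mask_row)]
    for row, mask_row in zip(padded_labels, label_mask)
]
print(padded_inputs)
print(masked_labels)
```

Padding only to the longest sample in each batch, rather than to the longest sample in the whole dataset, keeps batches of short clips small and saves memory and compute.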

Next, the evaluation metric is defined. As mentioned earlier, the predominant metric in ASR is the word error rate (WER), so we use it in this notebook as well. We also report the character error rate (CER).
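For intuition, both metrics are edit distances normalized by the length of the reference: WER counts word-level edits, CER counts character-level edits. The following pure-Python sketch illustrates the idea; it is not the implementation used by the metric library, which may differ in details such as whitespace handling, and the hypothesis string is invented.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(min(
                previous[j] + 1,         # delete an element of the reference
                current[j - 1] + 1,      # insert an element of the hypothesis
                previous[j - 1] + cost,  # substitute (or match at no cost)
            ))
        previous = current
    return previous[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by the reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

reference = "asike amunzi waatalika kufwala"
hypothesis = "asike amunzi watalika kufwala"  # one character dropped in the third word
print(wer(reference, hypothesis))
print(round(cer(reference, hypothesis), 3))
```

A single missing character makes one of four words wrong (WER 0.25) but changes only one of 30 characters, which is why CER is a useful complement to WER for small datasets.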

In [None]:
wer_metric = load_metric("wer")
cer_metric = load_metric("cer")

Downloading:   0%|          | 0.00/1.90k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/2.16k [00:00<?, ?B/s]

In [None]:
def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metrics
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    wer = wer_metric.compute(predictions=pred_str, references=label_str)
    cer = cer_metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer, "cer":cer}

Now, we can load the pretrained checkpoint of [Wav2Vec2-XLS-R-300M](https://huggingface.co/facebook/wav2vec2-xls-r-300m). The tokenizer's `pad_token_id` must be used to define the model's `pad_token_id`, which in the case of `Wav2Vec2ForCTC` is also CTC's *blank token*. To save GPU memory, we enable PyTorch's [gradient checkpointing](https://pytorch.org/docs/stable/checkpoint.html) and also set the loss reduction to "*mean*".

Because the dataset is quite small (~5h of training data), fine-tuning Facebook's [wav2vec2-xls-r-300m checkpoint](https://huggingface.co/facebook/wav2vec2-xls-r-300m) requires some hyper-parameter tuning. Therefore, I had to play around a bit with different values for dropout, [SpecAugment](https://arxiv.org/abs/1904.08779)'s masking dropout rate, layer dropout, and the learning rate until training seemed to be stable enough. 

**Note**: When using this notebook to train XLS-R on another language those hyper-parameter settings might not work very well. Feel free to adapt those depending on your use case. 

In [None]:
from transformers import EarlyStoppingCallback
import bitsandbytes as bnb
from transformers.trainer_pt_utils import get_parameter_names
from transformers import AutoModelForCTC
training_callbacks = EarlyStoppingCallback(3)

model = AutoModelForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    attention_dropout=0.0,
    hidden_dropout=0.1,
    feat_proj_dropout=0.05,
    mask_time_prob=0.15,
    mask_feature_prob=0.15,
    layerdrop=0.05,
    ctc_loss_reduction="mean",
    ctc_zero_infinity = True,
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)

Downloading:   0%|          | 0.00/1.18G [00:00<?, ?B/s]

Some weights of the model checkpoint at facebook/wav2vec2-xls-r-300m were not used when initializing Wav2Vec2ForCTC: ['project_hid.bias', 'quantizer.weight_proj.weight', 'quantizer.codevectors', 'project_q.bias', 'quantizer.weight_proj.bias', 'project_hid.weight', 'project_q.weight']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-xls-r-300m and are newly initialized: ['lm_head.weight', 'lm_head.bias']
You should probably TRAIN this model on a down-stream task to be able to use it 

In [None]:
model.freeze_feature_encoder()

In a final step, we define all parameters related to training.
To explain some of them:

- `learning_rate` and `num_train_epochs` were heuristically tuned until fine-tuning became stable.

For more explanations on other parameters, one can take a look at the [docs](https://huggingface.co/transformers/master/main_classes/trainer.html?highlight=trainer#trainingarguments).


**Note**: If one does not want to upload the model checkpoints to the hub, simply set `push_to_hub=False`.

In [None]:
training_args = TrainingArguments(
    output_dir=repo_name,
    group_by_length=True,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    evaluation_strategy="steps",
    num_train_epochs=40,
    gradient_checkpointing=True,
    fp16=True,
    save_steps=500,
    eval_steps=500,
    logging_steps=500,
    learning_rate=3e-5,
    load_best_model_at_end=True,
    greater_is_better=False,
    warmup_steps=500,
    save_total_limit=2,
    push_to_hub=True,
    report_to="all",
)
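
As a rough aid for reading these settings, the sketch below estimates the effective batch size and the number of optimizer updates they imply. `estimate_training_steps` is a hypothetical helper, and `num_examples=4000` is a made-up dataset size used only for illustration. The arithmetic approximates how the `Trainer` counts steps and is not taken from its source.

```python
import math

def estimate_training_steps(num_examples, per_device_batch_size, grad_accum_steps,
                            num_epochs, num_devices=1):
    """Rough estimate of effective batch size and optimizer updates."""
    # Examples contributing to a single optimizer update
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    # Forward/backward passes per epoch
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # One optimizer update every grad_accum_steps batches
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return effective_batch, updates_per_epoch, updates_per_epoch * num_epochs

# num_examples=4000 is hypothetical; the other values mirror training_args above
effective_batch, per_epoch, total = estimate_training_steps(
    num_examples=4000, per_device_batch_size=2, grad_accum_steps=4, num_epochs=40
)
print(effective_batch, per_epoch, total)  # 8 500 20000
```

Comparing the estimated updates per epoch with `warmup_steps`, `eval_steps`, and `save_steps` shows how much of training is spent in warmup and how often evaluation runs.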

In [None]:
# Apply weight decay to every parameter except biases and LayerNorm weights
decay_parameters = get_parameter_names(model, [torch.nn.LayerNorm])
decay_parameters = [name for name in decay_parameters if "bias" not in name]
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters() if n in decay_parameters],
        "weight_decay": training_args.weight_decay,
    },
    {
        "params": [p for n, p in model.named_parameters() if n not in decay_parameters],
        "weight_decay": 0.0,
    },
]
optimizer = bnb.optim.Adam8bit(
    params=optimizer_grouped_parameters,
    lr=training_args.learning_rate,
    betas=(training_args.adam_beta1, training_args.adam_beta2),
    eps=training_args.adam_epsilon,
)

# (optimizer, lr_scheduler): passing None lets Trainer create its default scheduler
optimizers = (optimizer, None)

trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=tonga_train,
    eval_dataset=tonga_dev,
    tokenizer=processor.feature_extractor,
    optimizers=optimizers,
    callbacks=[training_callbacks],  # enable the early stopping defined above
)

All of these instances have now been passed to `Trainer`, and we are ready to start training!

In [None]:
# Start fine-tuning
trainer.train()

We then upload the result of the training to the 🤗 Hub.

In [None]:
trainer.push_to_hub()

To load this model, you can run the following code:

In [None]:
from transformers import AutoModelForCTC, Wav2Vec2Processor

model = AutoModelForCTC.from_pretrained("kalisia/wav2vec2-xls-r-300m-tonga-test_v2")
processor = Wav2Vec2Processor.from_pretrained("kalisia/wav2vec2-xls-r-300m-tonga-test_v2")

# Evaluation

Let's first load the fine-tuned checkpoint.

In [None]:
from transformers import AutoModelForCTC, Wav2Vec2Processor

model = AutoModelForCTC.from_pretrained("kalisia/wav2vec2-xls-r-300m-tonga-test_v2")
processor = Wav2Vec2Processor.from_pretrained("kalisia/wav2vec2-xls-r-300m-tonga-test_v2")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


We then download `eval.py`, which calculates the WER and CER and writes the predictions and the reference transcripts of our test data to text files.

In [None]:
!wget https://raw.githubusercontent.com/csikasote/xls-r-bem-exp/main/eval.py

--2022-12-09 11:57:33--  https://raw.githubusercontent.com/csikasote/xls-r-bem-exp/main/eval.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4992 (4.9K) [text/plain]
Saving to: ‘eval.py’


2022-12-09 11:57:34 (71.7 MB/s) - ‘eval.py’ saved [4992/4992]



In [None]:
!python /content/eval.py \
  --model_id /content/wav2vec2-xls-r-300m-tonga-test_v2 \
  --dataset Tongaspeech \
  --config toi \
  --split test \
  --path /content/test.csv  \
  --log_outputs

Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-c25b872776880bd0/0.0.0/6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e...
  0% 0/1 [00:00<?, ?it/s]100% 1/1 [00:00<00:00, 9754.20it/s]
  0% 0/1 [00:00<?, ?it/s]100% 1/1 [00:00<00:00, 1235.07it/s]
Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-c25b872776880bd0/0.0.0/6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e. Subsequent calls will reuse this data.
  0% 0/1 [00:00<?, ?it/s]100% 1/1 [00:00<00:00, 971.80it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
453ex [01:09,  6.55ex/s]
Downloading: 4.48kB [00:00, 5.48MB/s]       
Downloading: 5.59kB [00:00, 6.82MB/s]       
WER: 1.0801509769094138
CER: 2.6722664045512765
453ex [00:00, 20706.86ex/s]


In CTC, the blank token (here [PAD]) allows the model to predict a word with repeated letters, such as "hello", by forcing it to insert the blank token between the two l's. A CTC-conform prediction of "hello" by our model would be [PAD] [PAD] "h" "e" "e" "l" "l" [PAD] "l" "o" "o" [PAD]. The WER and CER reported above are extremely high (both above 100%), which we attribute to the presence of the [PAD] token in the predicted text. Therefore, we opted for an alternative way to calculate our errors.
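
The CTC decoding rule in this example can be sketched in a few lines: first merge consecutive repeated tokens, then drop the blank token. `ctc_collapse` is a hypothetical helper written for illustration and is not part of `eval.py` or 🤗 Transformers.

```python
from itertools import groupby

def ctc_collapse(tokens, blank="[PAD]"):
    """Collapse a frame-level CTC prediction into text."""
    # Step 1: merge runs of identical consecutive tokens
    merged = [token for token, _ in groupby(tokens)]
    # Step 2: remove the blank token
    return "".join(token for token in merged if token != blank)

frames = ["[PAD]", "[PAD]", "h", "e", "e", "l", "l", "[PAD]", "l", "o", "o", "[PAD]"]
print(ctc_collapse(frames))  # hello
```

Without the [PAD] between the two runs of "l", step 1 would merge them into a single "l" and the output would be "helo".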

We used the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) and the [Damerau-Levenshtein distance](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance#:~:text=Informally%2C%20the%20Damerau%E2%80%93Levenshtein%20distance,one%20word%20into%20the%20other.). Before performing the calculation, we needed to remove the [PAD] and [UNK] tokens that appear in the predicted text. Then we calculated the WER and CER.
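
For reference, here is a minimal, dependency-free sketch of how WER and CER follow from the Levenshtein distance: WER divides the word-level distance by the number of reference words, and CER divides the character-level distance by the number of reference characters. The helper names and the example sentences are made up for illustration. This is not the code used to produce the numbers in this notebook.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def strip_special(text, special=("[PAD]", "[UNK]")):
    """Remove special tokens and normalize whitespace."""
    for token in special:
        text = text.replace(token, "")
    return " ".join(text.split())

def wer(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    return edit_distance(reference, hypothesis) / len(reference)

# Hypothetical example sentences
reference = "the cat sat on the mat"
hypothesis = strip_special("the cat[PAD] sat on teh mat[UNK]")
print(f"WER: {wer(reference, hypothesis):.3f}")  # WER: 0.167
print(f"CER: {cer(reference, hypothesis):.3f}")  # CER: 0.091
```

The swapped letters in "teh" cost two substitutions under the Levenshtein distance, whereas the Damerau-Levenshtein distance would count them as a single transposition.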

In [None]:
# Open the predicted text file and read the text
with open('/content/log_Tongaspeech_toi_test_predictions.txt', 'r') as f:
    predicted_text = f.read()

# Open the reference text file and read the text
with open('/content/log_Tongaspeech_toi_test_targets.txt', 'r') as f:
    reference_text = f.read()


In [None]:
# Tokenize the predicted and reference text
predicted_tokens = tokenizer.encode(predicted_text)
reference_tokens = tokenizer.encode(reference_text)


In [None]:
# Filter out the [PAD] and [UNK] tokens
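# encode() returns integer token IDs, so compare against the special-token IDs
special_ids = {processor.tokenizer.pad_token_id, processor.tokenizer.unk_token_id}
predicted_tokens = [token for token in predicted_tokens if token not in special_ids]
reference_tokens = [token for token in reference_tokens if token not in special_ids]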


In [None]:
# Import the Levenshtein distance function
from Levenshtein import distance

# Calculate the number of insertion, deletion, and substitution errors
errors = distance(predicted_tokens, reference_tokens)


In [None]:
# Calculate the WER
wer = errors / len(reference_tokens) * 100
print(wer)


10.038417954174843


In [None]:
!pip install pyxDamerauLevenshtein


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyxDamerauLevenshtein
  Downloading pyxDamerauLevenshtein-1.7.1.tar.gz (39 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Building wheels for collected packages: pyxDamerauLevenshtein
  Building wheel for pyxDamerauLevenshtein (PEP 517) ... done
  Created wheel for pyxDamerauLevenshtein: filename=pyxDamerauLevenshtein-1.7.1-cp38-cp38-linux_x86_64.whl size=73347 sha256=f1e52cd4b9ee3f78cf981b98871bf01abf6e05a9e4049bac58f39d4dcc77235d
  Stored in directory: /root/.cache/pip/wheels/da/8f/65/a5ea1a7e769ec74f616fdeba3385e17c907fe3f62bb6d6c311
Successfully built pyxDamerauLevenshtein
Installing collected packages: pyxDamerauLevenshtein
Successfully installed pyxDamerauLevenshtein-1.7.1


In [None]:
# Import the Damerau-Levenshtein distance function
from pyxdameraulevenshtein import damerau_levenshtein_distance

# Calculate the number of insertion, deletion, substitution, and transposition errors
errors = damerau_levenshtein_distance(predicted_tokens, reference_tokens)


In [None]:
wer = errors / len(reference_tokens) * 100
print(wer)

10.01077913822172
