## BEROM-ASR-Wav2Vec2
Audio preprocessing and fine-tuning of the wav2vec2-large-xlsr model for automatic speech recognition on Berom data.

This notebook explores fine-tuning an XLSR-Wav2Vec2 ASR model on Berom, a low-resource African language. We aim to achieve two goals:
> to build a novel baseline ASR model for Berom that can serve as a launch pad for further research in this area
> to evaluate XLSR's cross-lingual transfer learning and see how it fares on a low-resource African language like Berom


XLSR stands for cross-lingual speech representations and refers to XLSR-Wav2Vec2's ability to learn speech representations that are useful across multiple languages. Wav2Vec2 is a pretrained model for Automatic Speech Recognition (ASR) and was released in September 2020 by Alexei Baevski, Michael Auli, and Alex Conneau.

Like Wav2Vec2, XLSR-Wav2Vec2 learns powerful speech representations from hundreds of thousands of hours of unlabeled speech in more than 50 languages. Similar to BERT's masked language modeling, the model learns contextualized speech representations by randomly masking feature vectors before passing them to a transformer network.
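To make the masking idea concrete, here is a minimal sketch in plain Python. It is a simplified illustration, not the model's actual implementation: the function names, the span length of 10 frames, and the all-zero placeholder vector are assumptions chosen for readability (the real model replaces masked frames with a learned vector).

```python
import random

def sample_span_mask(num_frames, mask_prob=0.05, span_length=10, seed=0):
    """Mark frames as masked: each frame starts a span with probability
    `mask_prob`, and the next `span_length` frames are masked. Spans may overlap."""
    rng = random.Random(seed)
    mask = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < mask_prob:
            for i in range(start, min(start + span_length, num_frames)):
                mask[i] = True
    return mask

def apply_mask(features, mask, mask_vector):
    """Replace every masked frame with the shared placeholder vector."""
    return [mask_vector if m else frame for frame, m in zip(features, mask)]

# 100 toy feature frames with 2 dimensions each (hypothetical data)
features = [[float(t), float(t) + 0.5] for t in range(100)]
mask = sample_span_mask(len(features))
masked = apply_mask(features, mask, mask_vector=[0.0, 0.0])
print(sum(mask), "of", len(mask), "frames masked")
```

The transformer then sees `masked` instead of `features` and is trained to recover information about the hidden frames from their context.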

In [1]:
#installing dependencies
!pip install pandas==1.5.3
#!pip install datasets
#!pip install fsspec
#!pip install transformers 
!pip install jiwer
!python --version

Collecting pandas==1.5.3
  Downloading pandas-1.5.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.1 MB)
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 2.0.2
    Uninstalling pandas-2.0.2:
      Successfully uninstalled pandas-2.0.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
cuml 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
dask-cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
beatrix-jupyterlab 2023.621.222118 requires jupyter-server~=1.16, but you have jupyter-server 2.6.0 which is incompatible.
cudf 23.8.0 requires protobuf<5,>=4.21, but yo

In [2]:
import os
import wandb

# Read the API key from the WANDB_API_KEY environment variable instead of hardcoding it
wandb.login(key=os.environ.get("WANDB_API_KEY"))

wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [3]:
#importing required libraries
import os, json
import soundfile as sf
import torch
import torchaudio

import re
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.model_selection import train_test_split
from datasets import load_dataset, DatasetDict, Audio,load_metric

from transformers import Wav2Vec2CTCTokenizer,Wav2Vec2FeatureExtractor,Wav2Vec2Processor
from transformers import TrainingArguments,Trainer
from transformers import Wav2Vec2ForCTC

from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union

from datasets import ClassLabel
import random
from IPython.display import display, HTML


import warnings
warnings.filterwarnings('ignore')



In [4]:
#setting global variables
data_dir = "/kaggle/input/berom-speech-data"
file_path = os.path.join("/kaggle/input/berom-speech-data/trans/transcribe.csv")
audio_path = os.path.join("/kaggle/input/berom-speech-data/wav")
device =  torch.device("cuda" if torch.cuda.is_available() else "cpu")

We'll create utility functions to aid our exploration. These functions prepare and fetch the dataset and apply transformations to the data.

In [5]:
#utility function to fetch and preprocess speech data 
class BeromSpeechDataset(object):
    """beromSpeech dataset"""

    def __init__(self,data_dir,file_path,device):
        super().__init__()
        self.data_dir = data_dir
        self.file_path = file_path
        self.data = None
        self.device = device
        self.train_data = None
        self.eval_data = None

    def data_proc(self):
        # Read the transcription table (expects `wav_id` and `transcription` columns)
        berom_data = pd.read_csv(self.file_path)
        # Build the full path to each recording from its ID
        wav_dir = os.path.join(self.data_dir, "wav")
        berom_data['wav_path'] = wav_dir + "/" + berom_data['wav_id'] + '.wav'
        # Hold out 20% of the utterances for evaluation
        train, test = train_test_split(berom_data, test_size=0.2, random_state=0)

        # Save the `train` and `test` DataFrames to CSV files.
        train.to_csv("train.csv", encoding="utf-8", index=False)
        test.to_csv("test.csv", encoding="utf-8", index=False)

        return train, test
    
    def fetch_data(self):
        self.data_proc()
        #load_data
        berom_train = load_dataset("csv", data_files={"train": "train.csv"})["train"]
        berom_test = load_dataset("csv", data_files={"test": "test.csv"})["test"]
        #wrap data as dictionary object
        self.data = DatasetDict({k: dt for k, dt in {'train': berom_train, 'test': berom_test}.items()})

        self.train_data = self.data['train']
        self.eval_data = self.data['test']
        
        return self

    @staticmethod
    def remove_special_characters(batch):
        # Strip punctuation, lowercase, and append a trailing space as a word boundary
        chars_to_ignore_regex = '[\,\?\.\!\-\;\:\"\“\%\‘\”\�]'
        batch["transcription"] = re.sub(chars_to_ignore_regex, '', batch["transcription"]).lower() + " "
        return batch

    def extract_all_chars(self, batch):
        all_text = " ".join(batch['transcription'])
        vocab = list(set(all_text))
        return {"vocab": [vocab], "all_text": [all_text]}

    def speech_file_to_array_fn(self, batch):
        # Load the recording and resample from its native rate to 16 kHz
        speech_array, sampling_rate = torchaudio.load(batch["wav_path"])
        resampler = torchaudio.transforms.Resample(sampling_rate, 16000)
        batch["speech"] = resampler(speech_array).squeeze().numpy()
        batch["sampling_rate"] = 16000
        batch["target_text"] = batch["transcription"]
        return batch


    def prepare_dataset(self,batch):
        audio = batch["wav_path"]

        # batched output is "un-batched"
        batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
        batch["input_length"] = len(batch["input_values"])

        with processor.as_target_processor():
            batch["labels"] = processor(batch["transcription"]).input_ids
        return batch

    
    def get_vocab(self):
        #collect the set of characters that occur in each split
        vocabs = self.data.map(self.extract_all_chars,
                               batched=True,
                               batch_size=-1,
                               keep_in_memory=True,
                               remove_columns=self.data.column_names["train"])
        vocab_list = list(set(vocabs["train"]["vocab"][0]) | set(vocabs["test"]["vocab"][0]))
        vocab_dict = {v: k for k, v in enumerate(vocab_list)}
        vocab_dict["|"] = vocab_dict[" "]
        del vocab_dict[" "]
        vocab_dict["[UNK]"] = len(vocab_dict)
        vocab_dict["[PAD]"] = len(vocab_dict)

        with open('vocab.json', 'w') as vocab_file:
            json.dump(vocab_dict, vocab_file)

        return vocab_dict

In [6]:
#instantiate utility function
berom = BeromSpeechDataset(data_dir,file_path,device)

In [7]:
berom = berom.fetch_data()


In [8]:
import re
chars_to_ignore_regex = '[\,\?\.\!\-\;\:\"\“\%\‘\”\�]'

def remove_special_characters(batch):
    batch["transcription"] = re.sub(chars_to_ignore_regex, '', batch["transcription"]).lower() + " "
    return batch

berom.train_data = berom.train_data.map(remove_special_characters)
berom.eval_data = berom.eval_data.map(remove_special_characters)

  0%|          | 0/169 [00:00<?, ?ex/s]

  0%|          | 0/43 [00:00<?, ?ex/s]
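As a quick check of what the normalization step does, the sketch below applies the same idea to a single string. The helper name `normalize_transcription` and the punctuated example sentence are made up for illustration; the character class is intended to match the notebook's regex.

```python
import re

# Same punctuation set as the notebook's regex, written as a raw string;
# \ufffd is the Unicode replacement character
CHARS_TO_IGNORE = r'[,?.!\-;:"“%‘”\ufffd]'

def normalize_transcription(text):
    """Remove punctuation, lowercase, and append a trailing space."""
    return re.sub(CHARS_TO_IGNORE, "", text).lower() + " "

print(repr(normalize_transcription("Hwey hom, ha se re!")))  # 'hwey hom ha se re '
```

The trailing space acts as a word boundary, which later becomes the `|` delimiter token in the vocabulary.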

In [9]:
#getting dictionary vocabulary object
vocab = berom.get_vocab()

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]
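The vocabulary construction in `get_vocab` can be illustrated on two toy transcriptions. This is a sketch under assumptions: `build_vocab` is a hypothetical standalone helper, and it sorts the characters so the IDs are reproducible, whereas the notebook uses the arbitrary order of a Python `set`.

```python
def build_vocab(transcriptions):
    """Map every character to an integer ID, then add the CTC special tokens."""
    chars = sorted(set(" ".join(transcriptions)))
    vocab = {ch: idx for idx, ch in enumerate(chars)}
    # Use a visible word delimiter instead of the space character
    vocab["|"] = vocab.pop(" ")
    # Unknown-character token and padding token (the latter doubles as the CTC blank)
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab

toy = ["beha ba se ra fwom ", "dara a se ra fwom "]
vocab = build_vocab(toy)
print(vocab)
```

With these two sentences there are 12 distinct characters including the space, so `[UNK]` receives ID 12 and `[PAD]` receives ID 13.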

### TRAINING

In [10]:
#let's explore 5 random data samples
def show_5(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    display(HTML(df.to_html()))

show_5(berom.train_data)

Unnamed: 0,wav_id,transcription,wav_path
0,bom-184,hwey hom ha se re,/kaggle/input/berom-speech-data/wav/bom-184.wav
1,bom-123,wot e vey yunung yen e vey yunung,/kaggle/input/berom-speech-data/wav/bom-123.wav
2,bom-96,my shot hom a se ra jogo,/kaggle/input/berom-speech-data/wav/bom-96.wav
3,bom-43,beha ba se ra fwom,/kaggle/input/berom-speech-data/wav/bom-43.wav
4,bom-120,yen e vey wet wot a vey wet yen a vey wet,/kaggle/input/berom-speech-data/wav/bom-120.wav


In [11]:
berom.train_data[0]['wav_path']

'/kaggle/input/berom-speech-data/wav/bom-17.wav'

The Berom recordings are sampled at 48kHz, while XLSR-Wav2Vec2 expects 16kHz input. We therefore downsample the fine-tuning data to 16kHz in the following cell.

In [12]:
##downsampling from 48 kHz to 16 kHz
berom.train_data = berom.train_data.cast_column("wav_path", Audio(sampling_rate=16_000)) 
berom.eval_data = berom.eval_data.cast_column("wav_path",Audio(sampling_rate=16_000))

In [13]:
import IPython.display as ipd
import numpy as np
import random

rand_int = random.randint(0, len(berom.train_data)-1)

print(berom.train_data[rand_int]["transcription"])
ipd.Audio(data=berom.train_data[rand_int]["wav_path"]["array"], autoplay=True, rate=16000)

dara a se ra fwom 


In [14]:
rand_int = random.randint(0, len(berom.train_data)-1)

print("Target text:", berom.train_data[rand_int]["transcription"])
print("Input array shape:", berom.train_data[rand_int]["wav_path"]["array"].shape)
print("Sampling rate:", berom.train_data[rand_int]["wav_path"]["sampling_rate"])

Target text: bengyi ba se tele 
Input array shape: (38566,)
Sampling rate: 16000
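The relationship between array length, sampling rate, and duration is simple arithmetic, sketched below. The helper names are hypothetical, and the exact output length of a real resampler may differ by a sample or so depending on its implementation.

```python
def duration_seconds(num_samples, rate):
    """Clip duration implied by a sample count and a sampling rate."""
    return num_samples / rate

def resampled_length(num_samples, orig_rate, target_rate):
    """Approximate sample count after resampling (duration is preserved)."""
    return round(num_samples * target_rate / orig_rate)

n_16k = 38566  # array length printed above
print(duration_seconds(n_16k, 16_000))              # about 2.41 seconds
# A 48 kHz recording of the same duration has three times as many samples
print(resampled_length(n_16k * 3, 48_000, 16_000))  # 38566
```

Downsampling from 48kHz to 16kHz therefore shrinks every input array by a factor of three without changing the clip's duration.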


Now we use the generated JSON file to create a Wav2Vec2CTCTokenizer object. We also create a feature extractor, a Wav2Vec2 processor that wraps both, and a data collator, all of which are needed to train the final model.

In [15]:
tokenizer = Wav2Vec2CTCTokenizer("./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")

In [16]:
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True, return_attention_mask=True)

In [17]:
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

In [18]:

@dataclass
class DataCollatorCTCWithPadding:
    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True
    max_length: Optional[int] = None
    max_length_labels: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    pad_to_multiple_of_labels: Optional[int] = None

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        
        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        with self.processor.as_target_processor():
            labels_batch = self.processor.pad(
                label_features,
                padding=self.padding,
                max_length=self.max_length_labels,
                pad_to_multiple_of=self.pad_to_multiple_of_labels,
                return_tensors="pt",
            )

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        return batch

In [19]:
data_collator = DataCollatorCTCWithPadding(processor=processor)
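The label handling inside the collator can be sketched without any tensors. This is an illustration only: `pad_labels` is a hypothetical helper working on plain lists, and the pad ID of 13 is an arbitrary example value.

```python
IGNORE_INDEX = -100  # marker the collator uses for positions the loss should skip

def pad_labels(label_seqs, pad_id):
    """Pad label sequences to equal length, then hide the padding from the loss."""
    max_len = max(len(seq) for seq in label_seqs)
    padded, attention = [], []
    for seq in label_seqs:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)
        attention.append([1] * len(seq) + [0] * n_pad)
    # Wherever the attention mask is 0, replace the pad ID with -100
    return [
        [tok if keep else IGNORE_INDEX for tok, keep in zip(row, mask)]
        for row, mask in zip(padded, attention)
    ]

print(pad_labels([[5, 2, 7], [4]], pad_id=13))  # [[5, 2, 7], [4, -100, -100]]
```

Audio inputs and labels are padded separately because their lengths are unrelated: a clip has tens of thousands of samples but only a few dozen label characters.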

XLSR-Wav2Vec2 was pretrained on the audio data of Babel, Multilingual LibriSpeech (MLS), and Common Voice, most of which was sampled at 16kHz. The Berom audio dataset, on the other hand, is sampled at 48kHz, so the data is downsampled to 16kHz for training.

In [20]:
def prepare_dataset(batch):
    audio = batch["wav_path"]

    # batched output is "un-batched"
    batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
    batch["input_length"] = len(batch["input_values"])
    
    with processor.as_target_processor():
        batch["labels"] = processor(batch["transcription"]).input_ids
    return batch
        
berom.train_data = berom.train_data.map(prepare_dataset, remove_columns=berom.train_data.column_names)
berom.eval_data = berom.eval_data.map(prepare_dataset, remove_columns=berom.eval_data.column_names)

  0%|          | 0/169 [00:00<?, ?ex/s]

  0%|          | 0/43 [00:00<?, ?ex/s]

In [21]:
#setting a maximum audio length
max_input_length_in_sec = 5.0
berom.train_data = berom.train_data.filter(lambda x: x < max_input_length_in_sec * processor.feature_extractor.sampling_rate,
                                           input_columns=["input_length"])

  0%|          | 0/1 [00:00<?, ?ba/s]
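The length filter compares each clip's sample count with a threshold expressed in samples. A minimal sketch on a plain list of lengths, with a hypothetical helper name and made-up example values:

```python
def keep_short_clips(input_lengths, max_seconds=5.0, sampling_rate=16_000):
    """Keep only clips strictly shorter than `max_seconds`."""
    max_samples = max_seconds * sampling_rate  # 5.0 s * 16000 Hz = 80000 samples
    return [n for n in input_lengths if n < max_samples]

lengths = [38566, 79_999, 80_000, 120_000]
print(keep_short_clips(lengths))  # [38566, 79999]
```

As in the notebook, the comparison is strict, so a clip of exactly 80,000 samples is dropped. Only the training split is filtered; the evaluation split keeps all clips.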

In [22]:
wer_metric = load_metric("wer")

Downloading builder script:   0%|          | 0.00/1.90k [00:00<?, ?B/s]

In [23]:
def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metrics
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    wer = wer_metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
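The reason predictions are grouped while labels are not becomes clearer with a toy greedy CTC decoder. This is a simplified sketch, not the tokenizer's actual code: the four-entry vocabulary and the function name are assumptions made for the example.

```python
from itertools import groupby

def ctc_greedy_decode(ids, id_to_char, pad_id, group_tokens=True):
    """Collapse repeated IDs (optional), drop the blank/pad token, join characters."""
    if group_tokens:
        ids = [key for key, _ in groupby(ids)]
    chars = [id_to_char[i] for i in ids if i != pad_id]
    return "".join(chars).replace("|", " ").strip()

id_to_char = {0: "|", 1: "s", 2: "e", 3: "[PAD]"}  # hypothetical toy vocabulary

# Frame-level predictions: repeats are collapsed, blanks separate true double letters
frame_ids = [1, 1, 3, 2, 2, 3, 2, 0]
print(ctc_greedy_decode(frame_ids, id_to_char, pad_id=3))  # see

# Label IDs already contain one ID per character, so grouping would lose letters
label_ids = [1, 2, 2]
print(ctc_greedy_decode(label_ids, id_to_char, pad_id=3, group_tokens=True))   # se
print(ctc_greedy_decode(label_ids, id_to_char, pad_id=3, group_tokens=False))  # see
```

Grouping the reference labels would silently shorten any word with a doubled letter and distort the WER, which is why `group_tokens=False` is used for `label_str`.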

In [24]:
#instantiating model and setting model parameters
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    ctc_loss_reduction="mean",
    attention_dropout=0.1,
    hidden_dropout=0.1,
    feat_proj_dropout=0.0,
    mask_time_prob=0.05,
    layerdrop=0.1,
    gradient_checkpointing=True,
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer)
)

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.77k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.27G [00:00<?, ?B/s]

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-xlsr-53 and are newly initialized: ['lm_head.bias', 'lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [25]:
model.freeze_feature_extractor()

In [26]:
#clear cache
torch.cuda.empty_cache()

# Use data parallelism.
devices = [0, 1]

In [27]:
training_args = TrainingArguments(
  output_dir='/kaggle/working/',
  group_by_length=True,
  per_device_train_batch_size=12,
  gradient_accumulation_steps=4,
  evaluation_strategy="steps",
  num_train_epochs=30,
  gradient_checkpointing=True,
  fp16=False,
  save_steps=350,
  eval_steps=350,
  logging_steps=350,
  learning_rate=3e-4,
  warmup_steps=800,
  save_total_limit=2,
)
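With a dataset of this size, it is worth estimating how many optimizer steps the run will take before choosing step-based settings. The sketch below approximates the step count; the helper is hypothetical, the formula only approximates the Trainer's bookkeeping (which can vary between library versions), and 169 is the training-set size before the length filter.

```python
import math

def estimate_optimizer_steps(num_samples, per_device_batch, grad_accum,
                             epochs, num_devices=1):
    """Rough number of optimizer updates for a full training run."""
    batches_per_epoch = math.ceil(num_samples / (per_device_batch * num_devices))
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs

total = estimate_optimizer_steps(169, per_device_batch=12, grad_accum=4, epochs=30)
print(total)  # 90
```

Any estimate of this order is well below `warmup_steps=800` and `eval_steps=350`, so the learning rate would still be warming up when training ends and no intermediate evaluation would be logged. That is consistent with the empty Step/Training Loss/Validation Loss table in the training output below.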

In [28]:
#instantiating our trainer object
trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=berom.train_data,
    eval_dataset=berom.eval_data,
    tokenizer=processor.feature_extractor
)

In [29]:
trainer.train()

wandb: Currently logged in as: bmandieng. Use `wandb login --relogin` to force relogin
wandb: wandb version 0.15.12 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.15.9
wandb: Run data is saved locally in /kaggle/working/wandb/run-20231029_110732-y0wvq6qs
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run celestial-grass-15
wandb: ⭐️ View project at https://wandb.ai/bmandieng/huggingface
wandb: 🚀 View run at https://wandb.ai/bmandieng/huggingface/runs/y0wvq6qs


Step,Training Loss,Validation Loss


TrainOutput(global_step=30, training_loss=25.903082275390624, metrics={'train_runtime': 1269.1511, 'train_samples_per_second': 2.766, 'train_steps_per_second': 0.024, 'total_flos': 2.9948041399316506e+17, 'train_loss': 25.903082275390624, 'epoch': 24.0})

In [30]:
# Evaluate the model on the evaluation dataset.
evaluation_results = trainer.evaluate(berom.eval_data)

# Display the evaluation metrics.
evaluation_results

{'eval_loss': 32.71015930175781,
 'eval_wer': 1.0,
 'eval_runtime': 10.8782,
 'eval_samples_per_second': 3.953,
 'eval_steps_per_second': 0.276,
 'epoch': 24.0}
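To interpret `eval_wer`, recall that word error rate is the word-level edit distance between hypothesis and reference, divided by the number of reference words. The function below is a minimal single-sentence sketch of that definition, not the metric implementation loaded in the notebook; it assumes a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words, one row at a time
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

print(word_error_rate("beha ba se ra fwom", "beha ba se re fwom"))  # 0.2
print(word_error_rate("beha ba se ra fwom", ""))                    # 1.0
```

A WER of 1.0, as reported above, means the number of word errors equals the number of reference words. That happens, for example, when the model outputs nothing or gets every word wrong.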

In [31]:
#save fine-tuned model
model.save_pretrained("wav2vec2-large-xlsr-Berom")
processor.save_pretrained("wav2vec2-large-xlsr-Berom")
