# Speech recognition for spoken Afrikaans/isiXhosa

This notebook is based on the *XLS-R fine-tuning* [notebook](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/Fine_Tune_XLSR_Wav2Vec2_on_Turkish_ASR_with_%F0%9F%A4%97_Transformers.ipynb#scrollTo=1XZ-kjweyTy_).

Author: Lucas Meyer

## 1. Setup

### 1.1 Install Python libraries and Git Large File Storage

In [1]:
# %%capture
!pip3 install -r requirements.txt
!apt install git-lfs

### 1.2 Import libraries

#### 1.2.1 Useful library imports

In [1]:
import json
import torch
import numpy as np
import IPython.display as ipd

from dataclasses import dataclass
from typing import Dict, List, Union



#### 1.2.2 Hugging Face imports

In [None]:
import evaluate

from huggingface_hub import notebook_login
from transformers import Trainer, TrainingArguments
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Processor, Wav2Vec2ForCTC
from transformers import AutoModelForCTC, Wav2Vec2CTCTokenizer

#### 1.2.3 My own imports

In [None]:
from download_data import download_high_quality_tts
from load_data import load_and_preprocess_high_quality_tts

### 1.3 Log in to the Hugging Face Hub

Use a Hugging Face access token with **write** permissions and enter it at the login prompt.

In [1]:
notebook_login()


## 2. Data

### 2.1 Download data

Takes about 3 minutes to download ...

In [2]:
download_high_quality_tts()  # Skips the download if the data is already present

The data has already been downloaded.


### 2.2 Load and preprocess data

Takes about 1 minute to run ...

In [2]:
train_set, val_set, test_set = load_and_preprocess_high_quality_tts()

Loading datasets ...

0it [00:00, ?it/s]

2927it [00:17, 168.68it/s]
2420it [00:13, 174.78it/s]


Pre-processing datasets ...

Map:   0%|          | 0/4277 [00:00<?, ? examples/s]

Map:   0%|          | 0/535 [00:00<?, ? examples/s]

Map:   0%|          | 0/535 [00:00<?, ? examples/s]

Map:   0%|          | 0/535 [00:00<?, ? examples/s]

Datasets loaded and pre-processed successfully.


## 3. Prepare for training

### 3.1 Create tokenizer for our data

#### 3.1.1 Create vocabulary

In [6]:
def extract_all_chars(batch):
    all_text = " ".join(batch["sentence"])
    vocab = list(set(all_text))
    return {"vocab": [vocab], "all_text": [all_text]}

vocab_train = train_set.map(extract_all_chars,
                              batched=True, batch_size=-1,
                              keep_in_memory=True,
                              remove_columns=train_set.column_names)

vocab_val = val_set.map(extract_all_chars,
                          batched=True, batch_size=-1,
                          keep_in_memory=True,
                          remove_columns=val_set.column_names)

vocab_test = test_set.map(extract_all_chars,
                            batched=True, batch_size=-1,
                            keep_in_memory=True,
                            remove_columns=test_set.column_names)

# Get list for vocab of train/val/test
vocab_list = list(set(vocab_train["vocab"][0]) |
                  set(vocab_test["vocab"][0]) |
                  set(vocab_val["vocab"][0]))

# Get dict for vocab of train/val/test
vocab_dict = {v: k for k, v in enumerate(sorted(vocab_list))}
vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)

Map:   0%|          | 0/4277 [00:00<?, ? examples/s]

Map:   0%|          | 0/535 [00:00<?, ? examples/s]

Map:   0%|          | 0/535 [00:00<?, ? examples/s]
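The vocabulary construction above can be traced on a tiny example. The sketch below is a minimal, plain-Python illustration that uses two made-up transcripts (hypothetical data, not taken from the dataset) in place of the `sentence` column. It follows the same steps: collect the distinct characters, assign ids in sorted order, swap the space for `|`, then append `[UNK]` and `[PAD]`.

```python
# Toy transcripts standing in for the "sentence" column (hypothetical data).
sentences = ["dit is koud", "molo"]

# Collect every distinct character, as extract_all_chars does for each split.
all_text = " ".join(sentences)
vocab_list = sorted(set(all_text))

# Map each character to an integer id.
vocab_dict = {char: idx for idx, char in enumerate(vocab_list)}

# Use a visible word delimiter in place of the space character.
vocab_dict["|"] = vocab_dict.pop(" ")

# Append the unknown and padding tokens at the end of the vocabulary.
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)

print(vocab_dict)
```

Because the space sorts before the letters, `|` ends up with id 0 here, and the two special tokens take the last two ids.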

#### 3.1.2 Save vocabulary and create tokenizer

In [7]:

# Save vocabulary file
with open('vocab.json', 'w') as vocab_file:
    json.dump(vocab_dict, vocab_file)

tokenizer = Wav2Vec2CTCTokenizer("./vocab.json",
                                 unk_token="[UNK]",
                                 pad_token="[PAD]",
                                 word_delimiter_token="|")

repo_name = "wav2vec2-xls-r-300m-af-xh"
# repo_name = input("To what directory would you like to save your tokenizer?")
# tokenizer.push_to_hub(repo_name)

56
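To make the role of the tokenizer concrete, here is a rough sketch of what character-level encoding with a `|` word delimiter amounts to. It is not the `Wav2Vec2CTCTokenizer` implementation: `toy_vocab` and `encode_text` are hypothetical names introduced only for this illustration.

```python
# Hypothetical toy vocabulary in the same shape as vocab.json.
toy_vocab = {"|": 0, "a": 1, "d": 2, "i": 3, "n": 4, "[UNK]": 5, "[PAD]": 6}


def encode_text(text, vocab):
    """Map each character to its id; spaces become '|', unseen characters '[UNK]'."""
    ids = []
    for char in text:
        token = "|" if char == " " else char
        ids.append(vocab.get(token, vocab["[UNK]"]))
    return ids


# "x" is not in the toy vocabulary, so it maps to the [UNK] id.
label_ids = encode_text("dan in x", toy_vocab)
print(label_ids)
```

This is why the vocabulary is built from the train, validation and test splits together: a character missing from `vocab.json` can only be represented as `[UNK]`.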

### 3.2 Prepare dataset using Wav2Vec2 processor

In [1]:
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1,
                                             sampling_rate=16000,
                                             padding_value=0.0,
                                             do_normalize=True,
                                             return_attention_mask=True)

processor = Wav2Vec2Processor(feature_extractor=feature_extractor,
                              tokenizer=tokenizer)

def prepare_dataset(batch):
    audio = batch["audio"]
    batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
    batch["input_length"] = len(batch["input_values"])
    batch["labels"] = processor(text=batch["sentence"]).input_ids
    return batch

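# Keep a reference to the raw test split (audio and sentences) for inspection in section 4.6
test_set_copy = test_set

train_set = train_set.map(prepare_dataset, remove_columns=train_set.column_names)
val_set = val_set.map(prepare_dataset, remove_columns=val_set.column_names)
test_set = test_set.map(prepare_dataset, remove_columns=test_set.column_names)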

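With `do_normalize=True`, the feature extractor rescales each waveform to zero mean and unit variance before it reaches the model. The following is a minimal plain-Python sketch of that idea. The function name `normalize_waveform` and the sample values are made up for illustration, and the small `eps` term is an assumption about how division by zero is avoided.

```python
import math


def normalize_waveform(samples, eps=1e-7):
    """Rescale a waveform to zero mean and (approximately) unit variance."""
    mean = sum(samples) / len(samples)
    variance = sum((s - mean) ** 2 for s in samples) / len(samples)
    return [(s - mean) / math.sqrt(variance + eps) for s in samples]


# Hypothetical audio samples, only for illustration.
raw = [0.1, 0.3, -0.2, 0.4]
normalized = normalize_waveform(raw)

norm_mean = sum(normalized) / len(normalized)
norm_var = sum((s - norm_mean) ** 2 for s in normalized) / len(normalized)
print(norm_mean, norm_var)
```

Normalizing per utterance makes the input less sensitive to recording volume, which varies between speakers and sessions.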

### 3.3 Create collator with padding

In [17]:
@dataclass
class DataCollatorCTCWithPadding:
    """
    Data collator that will dynamically pad the inputs received.
    Args:
        processor (:class:`~transformers.Wav2Vec2Processor`)
            The processor used for processing the data.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
              sequence is provided).
            * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
              maximum acceptable input length for the model if that argument is not provided.
            * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
              different lengths).
    """

    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            return_tensors="pt",
        )
        with self.processor.as_target_processor():
            labels_batch = self.processor.pad(
                label_features,
                padding=self.padding,
                return_tensors="pt",
            )

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        return batch

data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)
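The label handling in the collator has two steps: pad every label sequence to the longest one in the batch, then replace the padded positions with -100 so the loss ignores them. A small sketch in plain Python lists (no tensors) shows both steps. `pad_and_mask_labels` is a hypothetical helper written for this illustration, not part of the `transformers` API.

```python
def pad_and_mask_labels(label_seqs, pad_id, ignore_index=-100):
    """Pad label sequences to equal length, then mask the padding with ignore_index."""
    max_len = max(len(seq) for seq in label_seqs)

    padded, attention_mask = [], []
    for seq in label_seqs:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)

    # Equivalent in spirit to masked_fill(attention_mask.ne(1), -100).
    return [
        [token if keep == 1 else ignore_index for token, keep in zip(row, mask)]
        for row, mask in zip(padded, attention_mask)
    ]


batch_labels = pad_and_mask_labels([[2, 1, 4], [3], [0, 5]], pad_id=6)
print(batch_labels)
```

Inputs and labels are padded separately because the audio frames and the character labels have different lengths and different padding values.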

## 4. Load pretrained model

### 4.1 Create and download model

In [24]:
# Download model
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    attention_dropout=0.0,
    hidden_dropout=0.0,
    feat_proj_dropout=0.0,
    mask_time_prob=0.05,
    layerdrop=0.0,
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)

# Freeze feature extraction weights
model.freeze_feature_encoder()

# Enable gradient checkpointing
model.gradient_checkpointing_enable()

repo_name = "wav2vec2-xls-r-300m-af-xh"

### 4.2 Prepare model for training

In [25]:
# Load the WER metric once instead of on every evaluation call
wer_metric = evaluate.load("wer")

def compute_metrics(pred):
    pred_logits = pred.predictions
    # Greedy decoding: take the most likely token at every frame
    pred_ids = np.argmax(pred_logits, axis=-1)

    # Restore the padding token where the labels were masked with -100
    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metrics
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    wer = wer_metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}

training_args = TrainingArguments(
    output_dir=repo_name,
    group_by_length=True,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    num_train_epochs=30,
    gradient_checkpointing=True,
    fp16=True,
    save_steps=400,
    eval_steps=400,
    logging_steps=400,
    learning_rate=3e-4,
    warmup_steps=500,
    save_total_limit=2,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_set,
    eval_dataset=val_set,  # evaluate on the validation split during training; keep test_set for final testing
    tokenizer=processor.feature_extractor,
)
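The `wer` value returned by `compute_metrics` is the word error rate: the number of word substitutions, deletions and insertions needed to turn the prediction into the reference, divided by the number of reference words. As a rough illustration of what that number means, here is a self-contained sketch. `word_error_rate` is a hypothetical helper, not the `evaluate` implementation, and the sentences are made up.

```python
def word_error_rate(references, predictions):
    """Total word-level edit distance divided by the total number of reference words."""
    total_errors = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()

        # Levenshtein distance over words, keeping only the previous row.
        prev = list(range(len(hyp_words) + 1))
        for i, ref_word in enumerate(ref_words, start=1):
            curr = [i]
            for j, hyp_word in enumerate(hyp_words, start=1):
                cost = 0 if ref_word == hyp_word else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution or match
            prev = curr

        total_errors += prev[-1]
        total_words += len(ref_words)
    return total_errors / total_words


# Hypothetical sentences: one substitution and one deletion over five reference words.
wer = word_error_rate(["dit is koud vandag", "molo"], ["dit is warm", "molo"])
print(wer)
```

A WER of 0 means every word matches; values above 1 are possible when the prediction contains many inserted words.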

### 4.4 Train

In [26]:
# trainer.train()
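To get a feel for how long training runs and how often it evaluates, the step counts implied by the training arguments can be estimated with simple arithmetic. The sketch below assumes a single GPU and assumes the training split has 4277 examples, as suggested by the first progress bar in section 2.2. The exact number the `Trainer` reports may differ slightly depending on the library version.

```python
import math

# Values taken from TrainingArguments above.
per_device_train_batch_size = 16
gradient_accumulation_steps = 2
num_train_epochs = 30
eval_steps = 400
warmup_steps = 500

# Assumptions, not stated explicitly in the notebook.
num_train_examples = 4277
num_devices = 1

effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_devices)
batches_per_epoch = math.ceil(num_train_examples
                              / (per_device_train_batch_size * num_devices))
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_updates = updates_per_epoch * num_train_epochs

print(f"effective batch size: {effective_batch_size}")
print(f"optimizer updates per epoch: {updates_per_epoch}")
print(f"total optimizer updates: {total_updates}")
print(f"evaluations during training: {total_updates // eval_steps}")
print(f"warmup share of training: {warmup_steps / total_updates:.1%}")
```

Under these assumptions the learning-rate warmup covers roughly the first eighth of training, and a checkpoint is evaluated about ten times.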

### 4.5 Load fine-tuned model

In [27]:
repo_name = "wav2vec2-xls-r-300m-af-xh"
model = AutoModelForCTC.from_pretrained(f"lucas-meyer/{repo_name}")
processor = Wav2Vec2Processor.from_pretrained(f"lucas-meyer/{repo_name}")

### 4.6 Use model for test predictions

In [41]:
for i in range(20):
    input_dict = processor(test_set[i]["input_values"],
                           sampling_rate=16000,
                           return_tensors="pt",
                           padding=True)

    # No gradients are needed for inference
    with torch.no_grad():
        logits = model(input_dict.input_values).logits

    # Greedy decoding: take the most likely token at every frame
    pred_ids = torch.argmax(logits, dim=-1)[0]

    pred = processor.decode(pred_ids)
    true = test_set_copy[i]["sentence"].lower()

    print(f"Test {i}:")
    print(f"  - pred: {pred}")
    print(f"  - true: {true}\n")

Test 0:
  - pred: dit is 37 grade celsius in ashton
  - true: dit is 37 grade celsius in ashton

Test 1:
  - pred: she was sleeping under his protection as sweetly as a child
  - true: she was sleeping under his protection as sweetly as a child

Test 2:
  - pred: in 1758 sluit sy as generaal by die vyand se weermag aan
  - true: in 1758 sluit sy as generaal by die vyand se weermag aan

Test 3:
  - pred: yintoni ekhankanywa ligama
  - true: yintoni ekhankanywa ligama

Test 4:
  - pred: i followed the line of the proposed railroad looking for chances
  - true: i followed the line of the proposed railroad looking for chances

Test 5:
  - pred: skepe treine en helikopters is spesiaal deur die president gehuur om hierdie groot onderneming uit te voer
  - true: skepe treine en helikopters is spesiaal deur die president gehuur om hierdie groot onderneming uit te voer

Test 6:
  - pred: daar sal 4 toets wedstryde tussen die sunfoil dolphins en australië gespeel word
  - true: daar sal 4 toets 
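The predictions above come from greedy CTC decoding: take the most likely token per audio frame, merge consecutive repeats, drop the blank token, and turn `|` back into spaces. In Wav2Vec2 the `[PAD]` token doubles as the CTC blank. The sketch below illustrates the idea on hand-written frame ids. `greedy_ctc_decode` and the toy vocabulary are hypothetical and only approximate what `processor.decode` does.

```python
from itertools import groupby

# Hypothetical toy vocabulary (id -> token).
id_to_token = {0: "|", 1: "a", 2: "l", 3: "m", 4: "o", 5: "[UNK]", 6: "[PAD]"}


def greedy_ctc_decode(frame_ids, id_to_token, blank_token="[PAD]", delimiter="|"):
    """Collapse repeated frame predictions, drop blanks, and restore spaces."""
    # Merge runs of identical ids: [3, 3, 6, 4] -> [3, 6, 4].
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Remove the blank token, which only separates repeated characters.
    tokens = [id_to_token[i] for i in collapsed if id_to_token[i] != blank_token]
    return "".join(tokens).replace(delimiter, " ").strip()


# Frame-level argmax ids for "molo": m m [PAD] o l l [PAD] o o
word = greedy_ctc_decode([3, 3, 6, 4, 2, 2, 6, 4, 4], id_to_token)

# A blank between two identical ids keeps a genuine double letter.
double = greedy_ctc_decode([1, 6, 1], id_to_token)

print(word, double)
```

This also explains `group_tokens=False` in `compute_metrics`: reference labels are not frame-level, so merging repeats there would wrongly remove real double letters.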

In [43]:
ipd.Audio(data=test_set_copy[15]["audio"]["array"], autoplay=False, rate=16000)