# Speech recognition for spoken Afrikaans/isiXhosa

## MLAI Research Project


This notebook is based on the *XLS-R fine-tuning* [notebook](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/Fine_Tune_XLSR_Wav2Vec2_on_Turkish_ASR_with_%F0%9F%A4%97_Transformers.ipynb#scrollTo=1XZ-kjweyTy_).

Author: Lucas Meyer

### Log in to Hugging Face using a write token

Token: `<your write token>` (enter it at the prompt; do not store it in the notebook)

In [29]:
from huggingface_hub import notebook_login

notebook_login()


### Install Python dependencies

Takes about 30 seconds to install ...

In [2]:
%%capture
!pip3 install -r requirements.txt

### Download and extract data

Takes about 3 minutes to download ...

In [3]:
from download_data import download_data

_ = download_data() # Will not download data if data is already downloaded

Downloading af_za.tar.gz ...


100%|██████████| 951M/951M [01:32<00:00, 10.2MiB/s]


File af_za.tar.gz downloaded successfully!

Downloading xh_za.tar.gz ...


100%|██████████| 907M/907M [01:29<00:00, 10.1MiB/s]


File xh_za.tar.gz downloaded successfully!

Data downloaded and extracted successfully!


### Load and preprocess data

Takes about 1 minute to run ...

In [4]:
from preprocessing import get_data

common_voice_train, _, common_voice_test, _, _, _ = get_data()

2927it [00:41, 70.31it/s] 
2420it [00:07, 311.38it/s]


### Show the first ten examples

In [5]:
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    # Despite the name, this shows the first `num_examples` rows, not a random sample.
    picks = list(range(num_examples))
    df = pd.DataFrame(dataset[picks])
    display(HTML(df.to_html()))

In [6]:
show_random_elements(common_voice_train.remove_columns(["audio", "path"]))

Unnamed: 0,sentence
0,"Om prooi te lok, kom sy ook snags uit om te weef."
1,Jy kan 'n eenvoudige elektroskoop met alledaagse items maak.
2,Die fraai mooi meisies is in daardie eeu baie eenvoudig grootgemaak.
3,Wat gebeur by die positiewe elektrode?
4,'n Akwaduk kan beskryf word as 'n kanaal waarmee water vervoer word.
5,The boy at the wheel lost his head.
6,In Sentraal-Swede is daar woorde wat deur die eeue in ander omskep is.
7,"He was just bursting with joy, joy over what."
8,"It was not a large lake, and almost round."
9,"'n Volwasse organisme word tot 90 cm lank, 12 cm breed en weeg dan tot 9 kg."


### Remove special characters (preprocessing)

TODO: Move this to ``preprocessing.py``.

In [7]:
import re
from unidecode import unidecode

chars_to_remove_regex = '[\,\?\.\!\-\;\:\"\‚Äú\%\‚Äò\‚Äù\ÔøΩ\']'

def remove_special_characters(batch):
    # batch["sentence"] = unidecode(batch["sentence"])
    batch["sentence"] = re.sub(chars_to_remove_regex, '', batch["sentence"]).lower()
    return batch

In [8]:
common_voice_train = common_voice_train.map(remove_special_characters)
common_voice_test = common_voice_test.map(remove_special_characters)

Map:   0%|          | 0/2342 [00:00<?, ? examples/s]

Map:   0%|          | 0/292 [00:00<?, ? examples/s]

In [9]:
show_random_elements(common_voice_train.remove_columns(["audio", "path"]))

Unnamed: 0,sentence
0,om prooi te lok kom sy ook snags uit om te weef
1,jy kan n eenvoudige elektroskoop met alledaagse items maak
2,die fraai mooi meisies is in daardie eeu baie eenvoudig grootgemaak
3,wat gebeur by die positiewe elektrode
4,n akwaduk kan beskryf word as n kanaal waarmee water vervoer word
5,the boy at the wheel lost his head
6,in sentraalswede is daar woorde wat deur die eeue in ander omskep is
7,he was just bursting with joy joy over what
8,it was not a large lake and almost round
9,n volwasse organisme word tot 90 cm lank 12 cm breed en weeg dan tot 9 kg


### Create vocabulary

In [10]:
def extract_all_chars(batch):
    all_text = " ".join(batch["sentence"])
    vocab = list(set(all_text))
    return {"vocab": [vocab], "all_text": [all_text]}

vocab_train = common_voice_train.map(extract_all_chars,
                                     batched=True, batch_size=-1,
                                     keep_in_memory=True,
                                     remove_columns=common_voice_train.column_names)


vocab_test = common_voice_test.map(extract_all_chars,
                                   batched=True, batch_size=-1,
                                   keep_in_memory=True,
                                   remove_columns=common_voice_test.column_names)

Map:   0%|          | 0/2342 [00:00<?, ? examples/s]

Map:   0%|          | 0/292 [00:00<?, ? examples/s]

In [11]:
vocab_list = list(set(vocab_train["vocab"][0]) | set(vocab_test["vocab"][0]))

vocab_dict = {v: k for k, v in enumerate(sorted(vocab_list))}
vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)
len(vocab_dict)

51
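
The vocabulary construction above can be tried on a toy corpus. This is a minimal sketch: `build_vocab` and the toy sentences are illustrative assumptions, and the function only mirrors the steps of the cells above (sorted characters, space replaced by `|`, then `[UNK]` and `[PAD]` appended).

```python
def build_vocab(sentences):
    """Map each character to an integer id, CTC-style."""
    chars = set(" ".join(sentences))
    vocab = {ch: idx for idx, ch in enumerate(sorted(chars))}
    # Use a visible word delimiter instead of the space character.
    # Assumes at least one space occurs in the joined text.
    vocab["|"] = vocab.pop(" ")
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab

toy_vocab = build_vocab(["ab ba", "cab"])
print(toy_vocab)
# {'a': 1, 'b': 2, 'c': 3, '|': 0, '[UNK]': 4, '[PAD]': 5}
```

Because the two special tokens are appended last, `[PAD]` always receives the highest id; with the 51-entry vocabulary above that is id 50.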

In [12]:
import json
with open('vocab.json', 'w') as vocab_file:
    json.dump(vocab_dict, vocab_file)

In [13]:
from transformers import Wav2Vec2CTCTokenizer

tokenizer = Wav2Vec2CTCTokenizer("./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")

In [14]:
from transformers import Wav2Vec2FeatureExtractor

feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True, return_attention_mask=True)
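
With `do_normalize=True`, each waveform is rescaled to zero mean and unit variance before it reaches the model. The sketch below shows that arithmetic in plain Python. The helper name `zero_mean_unit_var` and the `1e-7` stabiliser are assumptions about the library's internals, which work on NumPy arrays rather than lists.

```python
import math
from statistics import fmean, pvariance

def zero_mean_unit_var(samples, eps=1e-7):
    """Rescale a waveform to mean 0 and (population) variance close to 1."""
    mean = fmean(samples)
    var = pvariance(samples, mu=mean)
    return [(s - mean) / math.sqrt(var + eps) for s in samples]

normalised = zero_mean_unit_var([0.0, 1.0, 2.0, 3.0])
print(fmean(normalised), pvariance(normalised))
```

The normalisation removes differences in recording level between clips, which is why the raw amplitudes shown later in the notebook do not need manual scaling.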

In [15]:
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

In [16]:
common_voice_train[0]["audio"]

{'array': [0.00010495477181393653,
  1.2783108104486018e-05,
  -1.098342181649059e-07,
  7.284652383532375e-05,
  5.144427632330917e-05,
  ...

In [17]:
import IPython.display as ipd

ipd.Audio(data=common_voice_train[0]["audio"]["array"], autoplay=True, rate=16000)

In [18]:
def prepare_dataset(batch):
    audio = batch["audio"]

    # batched output is "un-batched"
    batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]

    with processor.as_target_processor():
        batch["labels"] = processor(batch["sentence"]).input_ids
    return batch

In [19]:
common_voice_train = common_voice_train.map(prepare_dataset, remove_columns=common_voice_train.column_names, num_proc=4)
common_voice_test = common_voice_test.map(prepare_dataset, remove_columns=common_voice_test.column_names, num_proc=4)

Map (num_proc=4):   0%|          | 0/2342 [00:00<?, ? examples/s]



Map (num_proc=4):   0%|          | 0/292 [00:00<?, ? examples/s]



In [20]:
import torch

from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union

@dataclass
class DataCollatorCTCWithPadding:
    """
    Data collator that will dynamically pad the inputs received.
    Args:
        processor (:class:`~transformers.Wav2Vec2Processor`)
            The processor used for processing the data.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
              sequence is provided).
            * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
              maximum acceptable input length for the model if that argument is not provided.
            * :obj:`False` or :obj:`'do_not_pad'`: No padding (i.e., can output a batch with sequences of
              different lengths).
        max_length (:obj:`int`, `optional`):
            Maximum length of the ``input_values`` of the returned list and optionally padding length (see above).
        max_length_labels (:obj:`int`, `optional`):
            Maximum length of the ``labels`` returned list and optionally padding length (see above).
        pad_to_multiple_of (:obj:`int`, `optional`):
            If set will pad the sequence to a multiple of the provided value.
            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
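            7.0 (Volta).
        pad_to_multiple_of_labels (:obj:`int`, `optional`):
            Same as ``pad_to_multiple_of``, applied to the ``labels``.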
    """

    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True
    max_length: Optional[int] = None
    max_length_labels: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    pad_to_multiple_of_labels: Optional[int] = None

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        with self.processor.as_target_processor():
            labels_batch = self.processor.pad(
                label_features,
                padding=self.padding,
                max_length=self.max_length_labels,
                pad_to_multiple_of=self.pad_to_multiple_of_labels,
                return_tensors="pt",
            )

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        return batch

In [21]:
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)
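
The label-masking step of the collator can be illustrated without tensors. The sketch below is plain Python under stated assumptions: `pad_and_mask_labels` is a hypothetical helper, and `pad_id=50` is the `[PAD]` id implied by the 51-entry vocabulary. The collator itself does the same thing with `processor.pad` and `masked_fill`.

```python
IGNORE_INDEX = -100  # label value the collator uses so padding is ignored by the loss

def pad_and_mask_labels(label_ids, pad_id):
    """Pad label sequences to equal length, then replace padding with -100."""
    max_len = max(len(ids) for ids in label_ids)
    padded = [ids + [pad_id] * (max_len - len(ids)) for ids in label_ids]
    mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in label_ids]
    return [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(row, row_mask)]
        for row, row_mask in zip(padded, mask)
    ]

labels = pad_and_mask_labels([[5, 2, 9], [7, 1]], pad_id=50)
print(labels)
# [[5, 2, 9], [7, 1, -100]]
```

Masking by attention mask rather than by token value matters: a genuine label that happened to equal the pad id would otherwise be dropped from the loss.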

In [22]:
from datasets import load_metric

wer_metric = load_metric("wer")
# TODO: Use evaluate.load instead



Downloading builder script:   0%|          | 0.00/1.90k [00:00<?, ?B/s]
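
The word error rate (WER) is the word-level edit distance between the reference and the prediction, divided by the number of reference words. The function below is a minimal single-pair sketch for intuition. `word_error_rate` is a hypothetical helper, it assumes a non-empty reference, and the loaded metric may aggregate over a whole batch differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed part of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

print(word_error_rate("the boy at the wheel lost his head",
                      "the boy at wheel lost her head"))
# 0.25
print(word_error_rate("wat gebeur by die positiewe elektrode", ""))
# 1.0
```

An empty prediction scores exactly 1.0, and predictions with many spurious words can score above 1.0, so WER is not bounded by 1.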

In [23]:
import numpy as np

def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metrics
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    wer = wer_metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
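
`compute_metrics` decodes predictions with token grouping but decodes labels with `group_tokens=False`. The sketch below shows why, using a plain-Python version of greedy CTC collapsing. `ctc_greedy_collapse` and the token ids are illustrative assumptions, with id 0 standing in for the blank (`[PAD]`) token.

```python
from itertools import groupby

def ctc_greedy_collapse(token_ids, blank_id):
    """Merge consecutive repeats, then drop blank tokens."""
    merged = [tok for tok, _ in groupby(token_ids)]
    return [tok for tok in merged if tok != blank_id]

# Frame-level argmax output: repeats separated by a blank survive as two tokens.
print(ctc_greedy_collapse([3, 3, 0, 3, 4, 4, 0, 0, 5], blank_id=0))
# [3, 3, 4, 5]

# A label sequence with a genuine double letter would be damaged by grouping.
print(ctc_greedy_collapse([7, 4, 4, 5], blank_id=0))
# [7, 4, 5]
```

Reference labels contain no blanks between repeated characters, so grouping them would silently delete real double letters; that is the reason the labels are decoded without grouping.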

In [24]:
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    attention_dropout=0.1,
    hidden_dropout=0.1,
    feat_proj_dropout=0.0,
    mask_time_prob=0.05,
    layerdrop=0.1,
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer)
)

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.77k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.27G [00:00<?, ?B/s]

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-xlsr-53 and are newly initialized: ['lm_head.weight', 'lm_head.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [25]:
# model.freeze_feature_extractor() # deprecated
model.freeze_feature_encoder()

In [26]:
model.gradient_checkpointing_enable()

In [27]:
from transformers import TrainingArguments

training_args = TrainingArguments(
  # output_dir="/content/gdrive/MyDrive/wav2vec2-large-xlsr-turkish-demo",
  output_dir="./output",
  group_by_length=True,
  per_device_train_batch_size=16,
  gradient_accumulation_steps=2,
  evaluation_strategy="steps",
  num_train_epochs=30,
  fp16=True,
  save_steps=100,
  eval_steps=100,
  logging_steps=10,
  learning_rate=3e-4,
  warmup_steps=500,
  save_total_limit=2,
)
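
To put these settings in context, the arithmetic below estimates the effective batch size and the number of optimizer steps. It is a rough sketch that assumes a single GPU and takes the 2342 training examples from the `Map` output earlier; the `Trainer`'s exact rounding may differ between library versions.

```python
import math

num_train_examples = 2342          # from the Map progress output above
per_device_train_batch_size = 16
gradient_accumulation_steps = 2
num_train_epochs = 30
warmup_steps = 500
num_devices = 1                    # assumption: single GPU

effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_devices)
batches_per_epoch = math.ceil(num_train_examples
                              / (per_device_train_batch_size * num_devices))
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_updates = updates_per_epoch * num_train_epochs
warmup_fraction = warmup_steps / total_updates

print(effective_batch_size)        # 32
print(updates_per_epoch)           # 73
print(total_updates)               # 2190
print(round(warmup_fraction, 2))   # 0.23
```

Under these assumptions the learning-rate warmup covers roughly the first quarter of training, and evaluating every 100 steps gives about 21 evaluation points over the run.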

In [30]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=common_voice_train,
    eval_dataset=common_voice_test,
    tokenizer=processor.feature_extractor,
)

In [31]:
repo_name = "wav2vec2-large-xls-r-300m-afrikaans"
tokenizer.push_to_hub(repo_name)

CommitInfo(commit_url='https://huggingface.co/lucas-meyer/wav2vec2-large-xls-r-300m-afrikaans/commit/9584609f5b0b3012f48c968b97a0f111347c5714', commit_message='Upload tokenizer', commit_description='', oid='9584609f5b0b3012f48c968b97a0f111347c5714', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
trainer.train()
trainer.push_to_hub()



Step,Training Loss,Validation Loss,Wer
100,4.0855,3.694779,1.0
200,3.0449,3.042321,1.0


