<a href="https://colab.research.google.com/github/gullogullo/Sauris-ASR/blob/main/VERO_SAURIANO.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning XLSR-Wav2Vec2 for Saurano Speech-to-Text with 🤗 Transformers

Fine-tune the [wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) checkpoint on the low-resource Saurano dataset, which contains about 1.4 hours of audio in total: roughly 68 minutes of training data, with validation and test sets of about 8.5 minutes each.

Massively pretraining an ASR model on cross-lingual unlabeled speech data, followed by language-specific fine-tuning on very little labeled data, achieves state-of-the-art results. For reference, see the official [paper](https://arxiv.org/pdf/2006.13979.pdf).

A single randomly initialized linear layer is stacked on top of the pre-trained checkpoint and trained to classify raw audio input to a sequence of letters. It does so by:

1.   extracting audio representations from the raw audio (using CNN layers),
2.   processing the sequence of audio representations with a stack of transformer layers,
3.   classifying the processed audio representations into a sequence of output letters (sequence modeling with [Connectionist Temporal Classification](https://distill.pub/2017/ctc/)).
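
To give a feel for step 1, the sketch below counts how many feature vectors the CNN front end produces for a given number of raw audio samples. The kernel widths and strides are assumed to be the defaults of the wav2vec 2.0 feature encoder; they are not stated in this notebook and should be checked against the checkpoint's configuration. The helper name `num_output_frames` is made up for illustration.

```python
# Assumed defaults of the wav2vec 2.0 convolutional feature encoder
# (kernel width and stride per layer); verify against the model config.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples: int) -> int:
    """Number of feature vectors produced for `num_samples` raw audio samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Output length of an unpadded 1-D convolution
        length = (length - kernel) // stride + 1
    return length


one_second = num_output_frames(16000)       # 16 kHz audio
four_seconds = num_output_frames(4 * 16000)
print(one_second, four_seconds)  # 49 199
```

Under these assumptions the strides multiply to 320 samples, so each output frame covers roughly 20 ms of 16 kHz audio. This is why the input sequence is so much longer than the sequence of output letters.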

We first fine-tune XLSR-Wav2Vec2 without making use of a language model.

## Model description

Wav2Vec2 is a pretrained model for Automatic Speech Recognition (ASR) and was released in [September 2020](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/) by Alexei Baevski, Michael Auli, and Alex Conneau. Soon after the superior performance of Wav2Vec2 was demonstrated on the English ASR dataset LibriSpeech, *Facebook AI* presented XLSR-Wav2Vec2 (see the [paper](https://arxiv.org/abs/2006.13979)). XLSR stands for *cross-lingual speech representations* and refers to XLSR-Wav2Vec2's ability to learn speech representations that are useful across multiple languages.

Similar to Wav2Vec2, XLSR-Wav2Vec2 learns powerful speech representations from unlabeled speech, in this case roughly 56,000 hours of audio in 53 languages. Similar to [BERT's masked language modeling](http://jalammar.github.io/illustrated-bert/), the model learns contextualized speech representations by randomly masking feature vectors before passing them to a transformer network.

![wav2vec2_structure](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/xlsr_wav2vec2.png)
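
The masking step can be illustrated with a small sketch. It assumes the span-masking scheme reported for wav2vec 2.0 (a fraction of about 0.065 of the time steps is sampled as span starts, and the 10 following frames of each start are masked, with overlaps allowed). These values and the helper `sample_span_mask` are assumptions for illustration, not taken from this notebook.

```python
import random


def sample_span_mask(num_frames, mask_prob=0.065, span=10, seed=0):
    """Return a list of booleans marking which feature vectors are masked."""
    rng = random.Random(seed)
    # Sample span start positions without replacement (at least one)
    num_starts = max(1, round(mask_prob * num_frames))
    starts = rng.sample(range(num_frames), num_starts)
    mask = [False] * num_frames
    for start in starts:
        # Mask `span` consecutive frames, clipped at the end of the sequence
        for t in range(start, min(start + span, num_frames)):
            mask[t] = True
    return mask


mask = sample_span_mask(199)
print(f"{sum(mask)} of {len(mask)} frames masked")
```

During pretraining the model has to identify the correct representation for each masked frame from its context, which is what makes the learned features contextualized.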

## Preliminary

**Mount Google Drive**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**Set root directory**



In [None]:
root_dir = 'drive/MyDrive/SAURIANO'

**Install**

*   datasets, transformers, accelerate, evaluate (🤗 APIs)
*   mpi4py (Message Passing Interface)
*   torchaudio, librosa, pydub (audio libraries)
*   jiwer (automatic speech recognition evaluation package)
*   pandas (dataframe package)
*   chardet (Character Encoding Detector package)

In [None]:
%%capture
!pip install datasets==2.11.0
!pip install transformers==4.28.0
!pip install git+https://github.com/huggingface/accelerate
!pip install evaluate
!pip install mpi4py
!pip install torchaudio
!pip install librosa
!pip install pydub
!pip install jiwer
!pip install pandas
!pip install chardet

**Runtime settings**

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Mon Sep 18 07:54:14 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P8     9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Build the dataset

### Functions to segment audio files and texts

In [None]:
import os
import chardet
import pandas as pd
import torch
import torchaudio
from pydub import AudioSegment
from pydub.silence import split_on_silence
from pydub.effects import normalize
import math


def segment_audio(audio_file_path, wav_name, data, root_dir):
  #output_folder = os.path.join(root_dir, "segments")
  output_folder = "segments"
  # ERRORS / STUMBLES (speaker disfluencies):
  # ZOOM0031 -> REMOVE ZOOM0031_20 21 22
  # ZOOM0055 -> REMOVE ZOOM0055_4 AND CORRESPONDING TEXT, ATTACH TEXTS CORRESPONDING TO ZOOM0055_16
  # ZOOM0051  -> REMOVE ZOOM0051_4, AND SEPARATE TEXT CORRESPONDING TO ZOOM0051_5
  # ZOOM0058, ZOOM0061 ???
  bad_audio = [f"{output_folder}/ZOOM0031_20.mp3", f"{output_folder}/ZOOM0031_21.mp3", f"{output_folder}/ZOOM0031_22.mp3",
               f"{output_folder}/ZOOM0055_4.mp3",
               f"{output_folder}/ZOOM0051_4.mp3",
               f"{output_folder}/ZOOM0019_7.mp3",
               f"{output_folder}/ZOOM0045_5.mp3", f"{output_folder}/ZOOM0045_11.mp3",
               f"{output_folder}/ZOOM0043_16.mp3"]
  speech = AudioSegment.from_wav(audio_file_path)
  speech = speech[2000:] # Remove the first two seconds
  speech = normalize(speech)
  speech_chunks = split_on_silence(speech, min_silence_len=4000, silence_thresh=-40,
                                   keep_silence=500)
  if not os.path.exists(output_folder):
    os.makedirs(output_folder)
  for i, chunk in enumerate(speech_chunks):
    output_file = f"{output_folder}/{wav_name[:-4]}_{i}.mp3"
    chunk.export(output_file, format="mp3")
    if output_file in bad_audio:
      pass
    else:
      data['path'].append(output_file)
  return len(data['path'])


def segment_text(text_file_path, data):
  with open(text_file_path, 'rb') as f:
    result = chardet.detect(f.read())  # or readline if the file is large
    print('txt encoding:', result['encoding'])
  with open(text_file_path, 'rb') as file:
    lines = file.readlines()
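  try:
    text = lines[0].decode(result['encoding'])
  except (UnicodeDecodeError, LookupError, TypeError):
    # Detected encoding was wrong, unknown, or None: fall back to escapes
    text = lines[0].decode('unicode_escape')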
  strings = text.split('\r')
  strings = strings[1:-2]
  num_strings = 0
  for sentence in strings:
    if len(sentence) > 0:
      if not sentence.isspace():
        # WATCH OUT !!
        # BAD ENCODING: é, ö, Õ, Ô, Ò, š, Ç, È, Ê, ð, Ğ, ~, \x9a, \x8e, \x8f, \xa0
        if sentence == 'ÇNa na, du meigest net sain geben bo i ons vunen, gebisser net. ' or sentence == 'va orma sealn, ':
          pass
        else:
          if 'š' in sentence:
            sentence = sentence.replace('š', 'ö')
          if 'Ç' in sentence:
            sentence = sentence.replace('Ç', '"')
          if 'È' in sentence:
            sentence = sentence.replace('È', '"')
          if 'Ò' in sentence:
            sentence = sentence.replace('Ò', '"')
          if 'Ó' in sentence:
            sentence = sentence.replace('Ó', '"')
          if 'Õ' in sentence:
            sentence = sentence.replace('Õ', '\'')
          if 'Ô' in sentence:
            sentence = sentence.replace('Ô', '\'')
          if 'Ê' in sentence:
            sentence = sentence.replace('Ê', '')
          if 'ð' in sentence:
            sentence = sentence.replace('ð', '-')
          if 'Ð' in sentence:
            sentence = sentence.replace('Ð', '-')
          if 'Ğ' in sentence:
            sentence = sentence.replace('Ğ', '-')
          if '~' in sentence:
            sentence = sentence.replace('~', 'ò')
          if '\x9a' in sentence:
            sentence = sentence.replace('\x9a', 'ö')
          if sentence ==  'van obegestörbn as khement in de do belt ':
            sentence = 'va orma sealn, van obegestörbn as khement in de do belt '
          if sentence == "Unt se seiber aufgean, der on gehot za geanan se bo's seint geben de vrainte van der Nina. 'S ist geben a baibele sel avour, ana olta, khrumpa. ":
            data['sentence'].append("Unt se seiber aufgean, der on gehot za geanan se bo's seint geben de vrainte van der Nina. ")
            num_strings += 1
            sentence = "'S ist geben a baibele sel avour, ana olta, khrumpa. "
          if sentence == 'Khans van ins drai zeinkelan ist mear geben instonde za khöistan ne a kriskele ne a vledle vame sel pochan".':
            data['sentence'].append('Khans van ins drai zeinkelan ist mear geben instonde za khöistan ne a kriskele ne a vledle')
            num_strings += 1
            sentence = "vame sel pochan."
          if sentence == 'Vournt inarzatretan, tueder obe in huet unt de hontschnÉ':
            sentence = 'Vournt inarzatretan, tueder obe in huet unt de hontschn'
          if sentence == "É sister, ben d'aussin geast, pakhenste de strauche!":
            sentence = "sister, ben d'aussin geast, pakhenste de strauche!"
          if '  ' in sentence:
            sentence = sentence.replace('  ', ' ')
          if "\x8f" in sentence:
            sentence = sentence.replace("\x8f", "è")
          if "\x8e" in sentence:
            sentence = sentence.replace("\x8e", "é")
          if "\xa0" in sentence:
            sentence = sentence.replace("\xa0", "")
          if 'Der Hailige Andrea pringet' in sentence:
            sentence = "Der Hailige Andrea pringet in schnea unt zame Hailign Nicolò istar schon do."
          data['sentence'].append(sentence)
          num_strings += 1
  return len(data['sentence']), num_strings


def case_insensitive_file_search(directory, filename):
    filename = filename.lower()
    list_files = os.listdir(directory)
    file_paths = []
    for f in list_files:
      if filename == f.lower():
        file_paths.append(f)
    return file_paths
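
The replacement table in `segment_text` (`š` to `ö`, `\x8e` to `é`, `\x8f` to `è`, `Ç` and `È` to quotation marks) and the `\r` line separator are consistent with transcripts saved in the classic Mac OS Roman encoding and then decoded with a different code page. This is an inference from the table, not something the notebook states. Under that assumption, a possible alternative is to decode the bytes as `mac_roman` directly; `read_classic_mac_lines` is a hypothetical helper.

```python
def read_classic_mac_lines(raw: bytes) -> list[str]:
    """Decode Mac OS Roman bytes and split on classic Mac (CR) line endings."""
    text = raw.decode("mac_roman")      # assumption: files are Mac OS Roman
    text = text.replace("\xa0", " ")    # non-breaking space -> plain space
    return [line.strip() for line in text.split("\r") if line.strip()]


# Tiny made-up sample: 0x9a is "ö" and 0xc7/0xc8 are « » in Mac OS Roman
sample = b"v\x9ala kh\x9alte\r\xc7Na na\xc8\r"
lines = read_classic_mac_lines(sample)
print(lines)  # ['völa khölte', '«Na na»']
```

If the assumption holds for every transcript, most of the manual character replacements would become unnecessary. The guillemets would still have to be stripped together with the other punctuation.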


### Dataset build-up

Dictionary structure:

1.   path: path to each audio segment
2.   sentence: sentence of each audio file
3.   speaker: speaker of each sentence

**TODO: segments with stumbles to check**

*   ZOOM0009_13 (August 1.txt)
*   ZOOM0009_19: long tail
*   ZOOM0010_13: onset
*   ZOOM0010_28: long tail
*   ZOOM0011_27: long tail
*   ZOOM0012_0: onset



In [None]:
rec_dir = os.path.join(root_dir, 'Registrazioni Sauris')
txt_dir = os.path.join(root_dir, 'SAURIS_txt')
data = {'path': [], 'sentence': [], 'speaker': []}
# Find the spreadsheet that maps each recording to its transcript
for f in os.listdir(rec_dir):
  path_f = os.path.join(rec_dir, f)
  if f[-4:] == 'xlsx':
    try:
      list_wav_names = pd.ExcelFile(path_f, engine='openpyxl')
      raw_sheet = pd.read_excel(list_wav_names, header=None)
    except FileNotFoundError:
      print('No list of files')
for n, f in enumerate(os.listdir(rec_dir)):
  path_f = os.path.join(rec_dir, f)
  print(path_f)
  if os.path.isdir(path_f):
    speaker = os.path.basename(path_f)
    speaker = speaker.split("_")[0]
    for rec in os.listdir(path_f):
      print('rec', rec)
      if rec != 'ZOOM0029.WAV':
      #if rec == 'ZOOM0051.WAV':
        # No txt for ZOOM0029 !
        txt_name = raw_sheet[raw_sheet[1] == rec][0].item() + '.txt'
        if txt_name == 'Augusti 1.txt':
          txt_true_name = 'August 1.txt'
        elif txt_name == 'l g1 (lf1bis).txt':
          txt_true_name = 'L G1.txt'
        elif txt_name == 'l g2 (lf2bis).txt':
          txt_true_name = 'L G2.txt'
        elif txt_name == 'l g3 (lf3bis).txt':
          txt_true_name = 'L G3.txt'
        elif txt_name == 'r b5 parte 1.txt':
          txt_true_name = 'R B5_1.txt'
        elif txt_name == 'r b5 parte 2.txt':
          txt_true_name = 'R B5_2.txt'
        elif txt_name == 'd c7.txt':
          txt_true_name = 'D C7_bis.txt'
        else:
          txt_true_name = case_insensitive_file_search(txt_dir, txt_name)[0]
        texts_in_data, num_strings = segment_text(os.path.join(txt_dir, txt_true_name), data)
        audios_in_data = segment_audio(os.path.join(path_f, rec), rec, data, root_dir)
        data['speaker'].extend([speaker] * num_strings)
        if texts_in_data != audios_in_data:
          print('ALERT! Bad segmentation')
          print('texts', texts_in_data)
          print('speeches', audios_in_data)
          print('strings', data['sentence'])
          print(txt_true_name)

### Save the dataset
Save the dictionary into a CSV file for subsequent runs of the notebook.

In [None]:
import csv

# Specify the CSV file name
csv_filename = 'drive/MyDrive/SAURIANO/data.csv'

# Write the data to the CSV file
with open(csv_filename, 'w', newline='') as csvfile:
    fieldnames = ['path', 'sentence', 'speaker']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

    writer.writeheader()  # Write the header row
    for i in range(len(data['path'])):
        writer.writerow({
            'path': data['path'][i],
            'sentence': data['sentence'][i],
            'speaker': data['speaker'][i]
        })

print(f'Data saved to {csv_filename}')

## Load the dataset
Load the dictionary from the saved CSV file.

In [None]:
# Read the data from the CSV file
import csv

loaded_data = {'path': [], 'sentence': [], 'speaker': []}
csv_filename = 'drive/MyDrive/SAURIANO/data.csv'

with open(csv_filename, 'r', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        loaded_data['path'].append(row['path'])
        loaded_data['sentence'].append(row['sentence'])
        loaded_data['speaker'].append(row['speaker'])

data = loaded_data


### Select a fragment of the dataset

Select just some percentage of the whole dataset.

In [None]:
import random


def select_random_elements(dataset, percentage=10):
    """Return a random subset (dict of lists) holding `percentage` % of the rows."""
    num_rows = len(dataset['path'])
    num_examples = int(num_rows * percentage / 100)
    # Sample row indices without replacement
    picks = random.sample(range(num_rows), num_examples)
    return {key: [values[i] for i in picks] for key, values in dataset.items()}

percentage = 100
data = select_random_elements(data, percentage=percentage)

## Split the dataset
Split the data into training, validation, and test sets.



In [None]:
from datasets import Dataset
from sklearn.model_selection import train_test_split

dataset = Dataset.from_dict(data)

# Split the dataset into train, validation, and test sets
train_indices, temp_indices = train_test_split(range(len(dataset)), test_size=0.2, random_state=42)
val_indices, test_indices = train_test_split(temp_indices, test_size=0.5, random_state=42)

na_train = dataset.select(train_indices)
na_valid = dataset.select(val_indices)
na_test = dataset.select(test_indices)
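
The two-stage split yields an 80/10/10 partition. The sketch below reproduces only the resulting set sizes in plain Python; it does not reproduce the exact shuffle that scikit-learn performs. The row count of 1053 is the sum of the three set sizes printed later in this notebook, the rounding rule (rounding the held-out share up) is an assumption about `train_test_split`, and `split_indices` is a hypothetical helper.

```python
import math
import random


def split_indices(num_rows: int, seed: int = 42):
    """Shuffle row indices and cut them into train/validation/test lists."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    num_held_out = math.ceil(0.2 * num_rows)   # 20 % leaves the training set
    num_test = math.ceil(0.5 * num_held_out)   # half of that becomes the test set
    num_train = num_rows - num_held_out
    train = indices[:num_train]
    valid = indices[num_train:num_rows - num_test]
    test = indices[num_rows - num_test:]
    return train, valid, test


train, valid, test = split_indices(1053)
print(len(train), len(valid), len(test))  # 842 105 106
```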

### Get audio durations

Get the total audio duration for train, validation, and test sets.


In [None]:
train_audio_duration = 0
for path in na_train['path']:
    audio = AudioSegment.from_file(os.path.join(root_dir, path))
    train_audio_duration += audio.duration_seconds
valid_audio_duration = 0
for path in na_valid['path']:
    audio = AudioSegment.from_file(os.path.join(root_dir, path))
    valid_audio_duration += audio.duration_seconds
test_audio_duration = 0
for path in na_test['path']:
    audio = AudioSegment.from_file(os.path.join(root_dir, path))
    test_audio_duration += audio.duration_seconds

print(f"Train dataset length: {len(na_train)}")
print(f"Train dataset audio duration: {train_audio_duration}")
print(f"Validation dataset length: {len(na_valid)}")
print(f"Validation dataset audio duration: {valid_audio_duration}")
print(f"Test dataset length: {len(na_test)}")
print(f"Test dataset audio duration: {test_audio_duration}")

Train dataset length: 842
Train dataset audio duration: 4077.4609070294773
Validation dataset length: 105
Validation dataset audio duration: 509.21979591836754
Test dataset length: 106
Test dataset audio duration: 507.20625850340133
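
The durations above are in seconds. A small sketch converting the printed values to minutes and hours (`describe_duration` is a helper introduced here for illustration):

```python
def describe_duration(seconds: float) -> str:
    """Format a duration given in seconds as minutes with one decimal."""
    return f"{seconds / 60:.1f} min"


# Values copied from the output above
durations = {
    "train": 4077.4609070294773,
    "validation": 509.21979591836754,
    "test": 507.20625850340133,
}
for name, seconds in durations.items():
    print(name, describe_duration(seconds))

total_hours = sum(durations.values()) / 3600
print(f"total: {total_hours:.2f} h")  # total: 1.41 h
```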


## Prepare Data, Tokenizer, Feature Extractor

1.   Feature extractor: processes speech signal to feature vector.
2.   Tokenizer: processes model's output to text.

### Preprocess dataset


1.   Convert all characters to lower case.
2.   Remove punctuation.
3.   Split each sentence into a list of characters, with `|` marking word boundaries.
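
A compact variant of these steps, using `str.translate`, is sketched below. It is not the notebook's implementation: unlike the cell that follows, it also collapses repeated spaces and never emits a trailing `|`. The names `PUNCTUATION` and `normalize_sentence` are illustrative.

```python
PUNCTUATION = ",.:;?!'\"*«»“”–-‘’…"
# Translation table that deletes every punctuation character
DROP_PUNCTUATION = str.maketrans("", "", PUNCTUATION)


def normalize_sentence(sentence: str):
    """Return the cleaned sentence and its character list with '|' between words."""
    cleaned = " ".join(sentence.lower().translate(DROP_PUNCTUATION).split())
    characters = list(cleaned.replace(" ", "|"))
    return cleaned, characters


cleaned, characters = normalize_sentence("S ist nou geben, vinster!")
print(cleaned)         # s ist nou geben vinster
print(characters[:6])  # ['s', '|', 'i', 's', 't', '|']
```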




In [None]:
from datasets import ClassLabel
import random
from IPython.display import display, HTML
import pandas as pd


PUNC_SYMBOLS = [',', '.', ':', ';', '?', '!', "'", '"', '*', '«',
                '»', '“', '”', "", "–", "-", '‘', '’', '…']

def final_text_words(batch):
    sentence = batch['sentence']
    sentence_lower = sentence.lower()
    for char in sentence_lower:
      if char in PUNC_SYMBOLS:
        sentence_lower = sentence_lower.replace(char,"")
    batch['sentence_lower'] = sentence_lower
    characters_per_sentence = []
    for char in sentence_lower:
        if char == " ":
            if characters_per_sentence and characters_per_sentence[-1] != "|":
                characters_per_sentence.append("|")
        elif char not in PUNC_SYMBOLS:
            characters_per_sentence.append(char)
    batch['characters'] = characters_per_sentence
    return batch


def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    df = pd.DataFrame(dataset[picks])
    display(HTML(df.to_html()))

na_train = na_train.map(final_text_words)
na_train = na_train.remove_columns(['sentence'])
show_random_elements(na_train.remove_columns(['path']), num_examples=5)
na_test = na_test.map(final_text_words)
na_test = na_test.remove_columns(['sentence'])
na_valid = na_valid.map(final_text_words)
na_valid = na_valid.remove_columns(['sentence'])

Unnamed: 0,speaker,sentence_lower,characters
0,Renza,ona nicht onzeleigansi unt völa khölte,"[o, n, a, |, n, i, c, h, t, |, o, n, z, e, l, e, i, g, a, n, s, i, |, u, n, t, |, v, ö, l, a, |, k, h, ö, l, t, e]"
1,Renza,der mons jörge plozzer schbaltnar ist gebeen pforar van der zahre,"[d, e, r, |, m, o, n, s, |, j, ö, r, g, e, |, p, l, o, z, z, e, r, |, s, c, h, b, a, l, t, n, a, r, |, i, s, t, |, g, e, b, e, e, n, |, p, f, o, r, a, r, |, v, a, n, |, d, e, r, |, z, a, h, r, e]"
2,Duilio,otar gesehn de do khindlpeitarin auvar khemen iber s staigele vame vraithoufe mime khinde im orbm,"[o, t, a, r, |, g, e, s, e, h, n, |, d, e, |, d, o, |, k, h, i, n, d, l, p, e, i, t, a, r, i, n, |, a, u, v, a, r, |, k, h, e, m, e, n, |, i, b, e, r, |, s, |, s, t, a, i, g, e, l, e, |, v, a, m, e, |, v, r, a, i, t, h, o, u, f, e, |, m, i, m, e, |, k, h, i, n, d, e, |, i, m, |, o, r, b, m, |]"
3,Augusto,none i onder getrogn epans guets,"[n, o, n, e, |, i, |, o, n, d, e, r, |, g, e, t, r, o, g, n, |, e, p, a, n, s, |, g, u, e, t, s, |]"
4,Paola,d ontse gemeiget gasln unt ist net migla geben za mochanse baitargean,"[d, |, o, n, t, s, e, |, g, e, m, e, i, g, e, t, |, g, a, s, l, n, |, u, n, t, |, i, s, t, |, n, e, t, |, m, i, g, l, a, |, g, e, b, e, n, |, z, a, |, m, o, c, h, a, n, s, e, |, b, a, i, t, a, r, g, e, a, n, |]"


### Create vocabulary


1.   Join all the sentences in one big sentence.
2.   Create the union of all distinct characters.
3.   Convert the resulting list into an enumerated alphabet.

Notes:


*   " " has its own token class: |.
*   "unknown" token: to deal with characters not encountered in the training set.
*   "padding" token: corresponds to CTC's "blank token" (see the "Alignment" section [here](https://distill.pub/2017/ctc/)).
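
As a minimal sketch of these steps on plain strings (not the `datasets`-based cell that follows), the hypothetical helper `build_vocab` below sorts the characters so that the ids are reproducible across runs; the notebook's own code relies on set ordering instead.

```python
def build_vocab(sentences):
    """Map every distinct character to an id, then append the special tokens."""
    characters = set("".join(sentences))
    characters.discard(" ")       # the space is represented by "|" instead
    vocab = {char: idx for idx, char in enumerate(sorted(characters))}
    vocab["|"] = len(vocab)       # word delimiter
    vocab["[UNK]"] = len(vocab)   # characters not seen when building the vocabulary
    vocab["[PAD]"] = len(vocab)   # padding, also used as the CTC blank
    return vocab


vocab = build_vocab(["ab a"])
print(vocab)  # {'a': 0, 'b': 1, '|': 2, '[UNK]': 3, '[PAD]': 4}
```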





In [None]:
def extract_all_chars(batch):
  all_text = "".join(batch["sentence_lower"])
  voc = []
  for i in batch['characters']:
    voc.append(list(set(i)))
  voc = [item for l in voc for item in l]
  vocab = list(set(voc))
  return {"vocab": [vocab], "all_text": [all_text]}

vocab_train = na_train.map(extract_all_chars, batched=True,
                           batch_size=-1, keep_in_memory=True,
                           remove_columns=na_train.column_names)
vocab_valid = na_valid.map(extract_all_chars, batched=True,
                         batch_size=-1, keep_in_memory=True,
                         remove_columns=na_valid.column_names)
#vocab_test = na_test.map(extract_all_chars, batched=True,
#                         batch_size=-1, keep_in_memory=True,
#                         remove_columns=na_test.column_names)

vocab_list = list(set(vocab_train["vocab"][0]) | set(vocab_valid["vocab"][0]))
vocab_dict = {v: k for k, v in enumerate(vocab_list)}
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)

print(vocab_dict)

{'b': 0, 'u': 1, 'c': 2, 's': 3, 'm': 4, 'k': 5, 'z': 6, 'h': 7, 'á': 8, 'n': 9, 'j': 10, 'q': 11, 'ö': 12, '3': 13, 'é': 14, '0': 15, '9': 16, 'è': 17, 'à': 18, 'r': 19, 'e': 20, 'g': 21, '|': 22, 't': 23, 'ò': 24, 'd': 25, 'f': 26, 'ä': 27, ')': 28, '1': 29, '(': 30, 'i': 31, 'o': 32, 'l': 33, 'a': 34, 'ë': 35, 'w': 36, '2': 37, '5': 38, 'v': 39, 'p': 40, '[UNK]': 41, '[PAD]': 42}


### Save the vocabulary

Save the vocabulary as a json file.

In [None]:
import json
with open('/content/drive/MyDrive/SAURIANO/vocab.json', 'w') as vocab_file:
    json.dump(vocab_dict, vocab_file)

### Create the Wav2Vec2CTCTokenizer

Instantiate an object of the Wav2Vec2CTCTokenizer class.

**Connectionist Temporal Classification** (CTC) is a way to handle the unknown alignment between the audio input $X = [x_1, \dots, x_T]$ and the string output $Y = [y_1, \dots, y_U]$. For a given input, we would like to train the model to maximize the probability it assigns to the right answer, so the loss function is the negative log-likelihood $-\log p(Y \mid X)$, and the inferred output is
$$Y^* = \underset{Y}{\mathrm{argmax}}\;p(Y \mid X).$$

For a single $(X,Y)$ pair:
$$p(Y \mid X)= \sum_{A \in A_{(X, Y)}} \prod_{t=1}^T p(a_t \mid X),$$

where $A_{(X, Y)}$ is the set of valid alignments of $Y$ to $X$, and the product is the probability of a single alignment $A = [a_1, \dots, a_T]$, computed step by step.

One heuristic is to take the most likely output at each time step. This gives the alignment with the highest probability:

$$A^* = \underset{A}{\mathrm{argmax}}\;\prod_{t=1}^T p(a_t \mid X).$$

Collapsing repeated characters in $A^*$ and removing the blank tokens then yields the predicted transcription.
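
The sum over alignments and the greedy heuristic can be checked on a toy example. The sketch below enumerates every alignment by brute force, which is only feasible for a handful of time steps; real implementations use dynamic programming. The per-frame probabilities are made up, and `_` stands for the blank token.

```python
from itertools import product

BLANK = "_"


def collapse(alignment):
    """Merge repeated symbols, then drop blanks (the CTC collapsing rule)."""
    out, prev = [], None
    for symbol in alignment:
        if symbol != prev and symbol != BLANK:
            out.append(symbol)
        prev = symbol
    return "".join(out)


def ctc_probability(probs, target):
    """Sum the probability of every alignment that collapses to `target`."""
    symbols = list(probs[0])
    total = 0.0
    for alignment in product(symbols, repeat=len(probs)):
        if collapse(alignment) == target:
            p = 1.0
            for t, symbol in enumerate(alignment):
                p *= probs[t][symbol]
            total += p
    return total


def greedy_decode(probs):
    """Pick the most likely symbol per frame, then collapse the alignment."""
    return collapse(max(frame, key=frame.get) for frame in probs)


# Made-up output distribution for three time steps
probs = [
    {"a": 0.6, "b": 0.1, BLANK: 0.3},
    {"a": 0.5, "b": 0.2, BLANK: 0.3},
    {"a": 0.1, "b": 0.7, BLANK: 0.2},
]
print(greedy_decode(probs))                    # ab
print(round(ctc_probability(probs, "ab"), 3))  # 0.549
```

Five alignments collapse to `ab` here (`aab`, `abb`, `ab_`, `a_b`, `_ab`), and their probabilities add up to 0.549. The greedy alignment `aab` alone accounts for 0.21 of that.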


In [None]:
from transformers import Wav2Vec2CTCTokenizer
tokenizer = Wav2Vec2CTCTokenizer("/content/drive/MyDrive/SAURIANO/vocab.json",
                                 unk_token="[UNK]", pad_token="[PAD]",
                                 word_delimiter_token="|")

### Create the Wav2Vec2FeatureExtractor



1.   feature_size: 1, because the model was trained on the raw speech signal.
2.   sampling_rate: the sampling rate of the audio the model was trained on (16 kHz).
3.   padding_value: the value used to pad shorter inputs for batched inference.
4.   do_normalize: whether to normalize the input to zero mean and unit variance; speech models usually perform better on normalized input.
5.   return_attention_mask: whether to return an attention_mask for batched inference.
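
Zero-mean, unit-variance normalization can be sketched in a few lines of plain Python. This only illustrates the idea; the small `eps` added for numerical stability is an assumption, and `zero_mean_unit_variance` is not a function of the library.

```python
import math


def zero_mean_unit_variance(samples, eps=1e-7):
    """Shift a waveform to zero mean and scale it to unit variance."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / n
    scale = math.sqrt(variance + eps)  # eps avoids division by zero on silence
    return [(x - mean) / scale for x in samples]


normalized = zero_mean_unit_variance([0.0, 0.5, 1.0, 1.5])
print([round(x, 3) for x in normalized])
```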

In [None]:
from transformers import Wav2Vec2FeatureExtractor

feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1,
                                             sampling_rate=16000,
                                             padding_value=0.0,
                                             do_normalize=True,
                                             return_attention_mask=True)

### Save the Wav2Vec2Processor

Feature extractor and tokenizer are wrapped into a single Wav2Vec2Processor class so that one only needs a model and processor object.

Save the processor.

The model path is:
- model _ "batch size" _ "gradient accumulation steps" _ "percentage of the whole dataset"

In [None]:
from transformers import Wav2Vec2Processor

batch_size = 8
if batch_size == 8:
  gradient_accumulation_steps = 2
elif batch_size == 16:
  gradient_accumulation_steps = 8

model_path = "/content/drive/MyDrive/SAURIANO/model_" + str(batch_size) + "_" + str(gradient_accumulation_steps) + "_" + str(100)

processor = Wav2Vec2Processor(feature_extractor=feature_extractor,
                              tokenizer=tokenizer)
processor.save_pretrained(model_path)

### Language model boosting: Expand the alphabet

Add the special tokens for the beginning and the end of a sentence ("\<s>", and "\</s>", respectively), to match the KenLM language model alphabet.

In [None]:
tokenizer.add_tokens(["<s>", "</s>"], special_tokens=True)
tokenizer.save_pretrained(model_path)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor,
                              tokenizer=tokenizer)
processor.save_pretrained(model_path)

### Preprocess audio
XLSR-Wav2Vec2 expects the audio input as a 1-dimensional array:

1.   Load mp3 audio files into the dataset object with torchaudio.
2.   Resample the audio files to 16kHz.



In [None]:
import os
import torchaudio

def speech_file_to_array_fn(batch):
    # speech_array, sampling_rate = torchaudio.load(batch["path"])
    speech_array, sampling_rate = torchaudio.load(os.path.join(root_dir, batch["path"]))
    batch["speech"] = speech_array[0].numpy()
    batch["sampling_rate"] = sampling_rate
    batch["target_text"] = batch["sentence_lower"]
    return batch

show_random_elements(na_train, num_examples=5)

Unnamed: 0,path,speaker,sentence_lower,characters
0,segments/ZOOM0059_4.mp3,Lucia,de do zbean pflonzn seint geben nopeindich net lai pan ins ober iblarume,"[d, e, |, d, o, |, z, b, e, a, n, |, p, f, l, o, n, z, n, |, s, e, i, n, t, |, g, e, b, e, n, |, n, o, p, e, i, n, d, i, c, h, |, n, e, t, |, l, a, i, |, p, a, n, |, i, n, s, |, o, b, e, r, |, i, b, l, a, r, u, m, e, |]"
1,segments/ZOOM0044_13.mp3,Duilio,s ist nou geben vinster tschmörganz vrie,"[s, |, i, s, t, |, n, o, u, |, g, e, b, e, n, |, v, i, n, s, t, e, r, |, t, s, c, h, m, ö, r, g, a, n, z, |, v, r, i, e, |]"
2,segments/ZOOM0054_1.mp3,Paola,desevant ontse de musln geströtzet geign rotz unt in kadour mitn ouksn,"[d, e, s, e, v, a, n, t, |, o, n, t, s, e, |, d, e, |, m, u, s, l, n, |, g, e, s, t, r, ö, t, z, e, t, |, g, e, i, g, n, |, r, o, t, z, |, u, n, t, |, i, n, |, k, a, d, o, u, r, |, m, i, t, n, |, o, u, k, s, n, |]"
3,segments/ZOOM0032_7.mp3,Gianni,seign as ist geben oise a seta leitzes völkh umunonder,"[s, e, i, g, n, |, a, s, |, i, s, t, |, g, e, b, e, n, |, o, i, s, e, |, a, |, s, e, t, a, |, l, e, i, t, z, e, s, |, v, ö, l, k, h, |, u, m, u, n, o, n, d, e, r, |]"
4,segments/ZOOM0041_5.mp3,Duilio,in onder tokh oban noch de gepete otarsi gerichtet zorbetan,"[i, n, |, o, n, d, e, r, |, t, o, k, h, |, o, b, a, n, |, n, o, c, h, |, d, e, |, g, e, p, e, t, e, |, o, t, a, r, s, i, |, g, e, r, i, c, h, t, e, t, |, z, o, r, b, e, t, a, n]"


In [None]:
na_train = na_train.map(speech_file_to_array_fn,
                        remove_columns=na_train.column_names)
na_valid = na_valid.map(speech_file_to_array_fn,
                      remove_columns=na_valid.column_names)
na_test = na_test.map(speech_file_to_array_fn)

In [None]:
import librosa
import numpy as np

def resample(batch):
    batch["speech"] = librosa.resample(np.asarray(batch["speech"]),
                                       orig_sr=batch["sampling_rate"],
                                       target_sr=16000)
    batch["sampling_rate"] = 16000
    return batch

In [None]:
na_train = na_train.map(resample, num_proc=3)
na_valid = na_valid.map(resample, num_proc=3)
na_test = na_test.map(resample, num_proc=3)

### Check the preprocessed dataset

Listen to some audio files to get a feel for the dataset and to verify that the audio was loaded correctly. Also check that the data is correctly prepared:

*   shape of the speech input
*   transcription
*   sample rate

In [None]:
import IPython.display as ipd
import random

# randint is inclusive on both ends, so the upper bound must be len - 1
rand_int = random.randint(0, len(na_train) - 1)

print("Target text:", na_train[rand_int]["target_text"])
print("Input array shape:", np.asarray(na_train[rand_int]["speech"]).shape)
print("Sampling rate:", na_train[rand_int]["sampling_rate"])

ipd.Audio(data=np.asarray(na_train[rand_int]["speech"]),
          autoplay=True, rate=16000)


Target text: ober der voische bölf khent er der earste 
Input array shape: (68224,)
Sampling rate: 16000


### Final dataset processing

Process the dataset to the format expected by the model for training.

1.   Check that the data samples have the same sampling rate of 16kHz.
2.   Extract the input_values from the loaded audio file.
3.   Encode the transcriptions to label ids.


In [None]:
def prepare_dataset(batch):
    # check that all files have the correct sampling rate
    assert (
        len(set(batch["sampling_rate"])) == 1
    ), f"Make sure all inputs have the same sampling rate of {processor.feature_extractor.sampling_rate}."

    batch["input_values"] = processor(batch["speech"],
                                      sampling_rate=batch["sampling_rate"][0]).input_values

    with processor.as_target_processor():
        batch["labels"] = processor(batch["target_text"]).input_ids
    return batch

In [None]:
na_train = na_train.map(prepare_dataset, remove_columns=na_train.column_names,
                        batch_size=batch_size, batched=True)
na_valid = na_valid.map(prepare_dataset, remove_columns=na_valid.column_names,
                      batch_size=batch_size, batched=True)
na_test = na_test.map(prepare_dataset, batch_size=batch_size, batched=True)

## Training

1.   Define a data collator: form a batch by using a list of dataset elements as input.
2.   XLSR-Wav2Vec2's large input length and small output length: training samples should only be padded to the longest sample in their batch and not the overall longest sample.
3.   Evaluation metrics: set as the word error rate and the character error rate, and define a compute_metrics function accordingly.
4.   Pretrained checkpoint: load it and configure it correctly for training.
5.   Define the training configuration.
6.   Fine-tune the model, evaluate it on the test data and verify that it has indeed learned to correctly transcribe speech.

### [Data collator](https://github.com/huggingface/transformers/blob/9a06b6b11bdfc42eea08fa91d0c737d1863c99e3/examples/research_projects/wav2vec2/run_asr.py#L81)
It treats the input_values and labels differently and thus applies separate padding functions to them.

Padding tokens in the labels are replaced with -100 so that they are not taken into account when computing the loss.
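
The idea can be sketched without any library: inputs and labels are padded separately, each only to the longest item in the current batch. The helper `pad_batch` and the toy features are made up for illustration and stand in for what the collator does with the processor.

```python
def pad_batch(features, input_pad=0.0, label_pad=-100):
    """Pad inputs and labels separately to the longest item in the batch."""
    max_input = max(len(f["input_values"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_values": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in, n_lab = len(f["input_values"]), len(f["labels"])
        batch["input_values"].append(
            list(f["input_values"]) + [input_pad] * (max_input - n_in))
        # 1 marks real audio samples, 0 marks padding
        batch["attention_mask"].append([1] * n_in + [0] * (max_input - n_in))
        # -100 marks label positions that the loss should skip
        batch["labels"].append(
            list(f["labels"]) + [label_pad] * (max_label - n_lab))
    return batch


features = [
    {"input_values": [0.1, 0.2, 0.3], "labels": [5, 6]},
    {"input_values": [0.4], "labels": [7]},
]
batch = pad_batch(features)
print(batch["labels"])          # [[5, 6], [7, -100]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```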

Define:

In [None]:
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union
import torch

@dataclass
class DataCollatorCTCWithPadding:
    """
    Data collator that will dynamically pad the inputs received.
    Args:
        processor (:class:`~transformers.Wav2Vec2Processor`)
            The processor used for processing the data.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
              sequence is provided).
            * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
              maximum acceptable input length for the model if that argument is not provided.
            * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
              different lengths).
        max_length (:obj:`int`, `optional`):
            Maximum length of the ``input_values`` of the returned list and optionally padding length (see above).
        max_length_labels (:obj:`int`, `optional`):
            Maximum length of the ``labels`` returned list and optionally padding length (see above).
        pad_to_multiple_of (:obj:`int`, `optional`):
            If set will pad the sequence to a multiple of the provided value.
            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
            7.0 (Volta).
    """

    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True
    max_length: Optional[int] = None
    max_length_labels: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    pad_to_multiple_of_labels: Optional[int] = None

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [
            {"input_values": feature["input_values"]} for feature in features]
        label_features = [
            {"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        with self.processor.as_target_processor():
            labels_batch = self.processor.pad(
                label_features,
                padding=self.padding,
                max_length=self.max_length_labels,
                pad_to_multiple_of=self.pad_to_multiple_of_labels,
                return_tensors="pt",
            )

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(
            labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        return batch

Instantiate:

In [None]:
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)
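
To make the two padding paths concrete, here is a minimal, library-free sketch of what the collator does to one batch. The helper name `pad_batch`, the input pad value 0.0 and the toy features are assumptions for illustration only; the real collator delegates to `processor.pad` and returns PyTorch tensors.

```python
from typing import Dict, List


def pad_batch(features: List[Dict[str, list]],
              input_pad_value: float = 0.0,
              label_ignore_index: int = -100) -> Dict[str, list]:
    """Pad inputs and labels separately, each to the longest item in this batch."""
    max_input = max(len(f["input_values"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)

    batch = {"input_values": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in, n_lab = len(f["input_values"]), len(f["labels"])
        # audio samples are padded with a neutral value and masked out
        batch["input_values"].append(
            f["input_values"] + [input_pad_value] * (max_input - n_in))
        batch["attention_mask"].append([1] * n_in + [0] * (max_input - n_in))
        # label padding uses -100 so that the loss skips those positions
        batch["labels"].append(
            f["labels"] + [label_ignore_index] * (max_label - n_lab))
    return batch


# Hypothetical toy batch: two utterances of different lengths.
toy_features = [
    {"input_values": [0.1, -0.2, 0.3, 0.0, 0.5], "labels": [4, 7]},
    {"input_values": [0.2, 0.1], "labels": [3, 3, 9]},
]
toy_batch = pad_batch(toy_features)
print(toy_batch["labels"])
print(toy_batch["attention_mask"])
```

Padding to the longest item of each batch, rather than to the longest item of the whole dataset, keeps the padded audio arrays as short as possible.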

### Metrics


1.   Word Error Rate
2.   Character Error Rate



In [None]:
from evaluate import load

wer_metric = load("wer")
cer_metric = load("cer")

The model will return a sequence of logit vectors: $\mathbf{y}_1, \dots, \mathbf{y}_m$, with $\mathbf{y}_i = f_\theta (x_1, \dots, x_n)$, and $n \gg m$.

A logit vector $\mathbf{y}_i$ contains the log-odds for each token in the vocabulary we defined earlier, so len$(\mathbf{y}_i)$ = config.vocab_size. We are interested in the most likely prediction of the model and thus take the argmax(...) of the logits. We also transform the encoded labels back to the original string by replacing -100 with the pad_token_id and decoding the ids, making sure that consecutive identical tokens are not merged into a single token, as CTC-style decoding would otherwise do.

In [None]:
def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metrics
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    wer = wer_metric.compute(predictions=pred_str, references=label_str)
    cer = cer_metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer, "cer": cer}
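
The difference between decoding predictions and decoding labels can be illustrated with a small sketch of greedy CTC decoding. The function `greedy_ctc_decode`, the four-token vocabulary and the logit values are assumptions made up for this example; the notebook itself relies on `processor.batch_decode`.

```python
from itertools import groupby
from typing import Dict, List


def greedy_ctc_decode(ids: List[int], id_to_token: Dict[int, str],
                      blank_id: int, group_tokens: bool = True) -> str:
    """Turn a sequence of token ids into text, CTC style."""
    if group_tokens:
        # collapse runs of identical ids: [1, 1, 0, 1] -> [1, 0, 1]
        ids = [key for key, _ in groupby(ids)]
    # drop the blank token, then map the word delimiter "|" to a space
    text = "".join(id_to_token[i] for i in ids if i != blank_id)
    return text.replace("|", " ")


# Toy vocabulary (an assumption); Wav2Vec2 uses the pad token as the CTC blank.
id_to_token = {0: "[PAD]", 1: "a", 2: "b", 3: "|"}

# One logit vector per frame; argmax picks the most likely token per frame.
frame_logits = [
    [0.1, 2.0, 0.3, 0.0],  # a
    [0.2, 1.5, 0.1, 0.0],  # a (repeat of the same character)
    [3.0, 0.1, 0.2, 0.0],  # blank
    [0.0, 2.2, 0.4, 0.1],  # a (a new character, separated by the blank)
    [0.0, 0.1, 0.2, 1.9],  # |
    [0.3, 0.2, 2.5, 0.0],  # b
    [0.1, 0.0, 1.7, 0.2],  # b (repeat)
]
pred_ids = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_logits]
prediction = greedy_ctc_decode(pred_ids, id_to_token, blank_id=0)

# Label ids contain no blanks, so a genuine double letter must not be collapsed.
label_ids = [1, 1, 3, 2]
label_ungrouped = greedy_ctc_decode(label_ids, id_to_token, blank_id=0, group_tokens=False)
label_grouped = greedy_ctc_decode(label_ids, id_to_token, blank_id=0, group_tokens=True)

print(pred_ids, repr(prediction), repr(label_ungrouped), repr(label_grouped))
```

With grouping enabled the reference would lose its double letter, which is why the labels are decoded with `group_tokens=False`.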

### Load the pretrained XLSR-Wav2Vec2 checkpoint

To save GPU memory, we enable PyTorch's gradient checkpointing and also set the loss reduction to "mean".


In [None]:
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    attention_dropout=0.1,
    hidden_dropout=0.1,
    feat_proj_dropout=0.0,
    mask_time_prob=0.075,
    layerdrop=0.1,
    gradient_checkpointing=True,
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer)
)

Some weights of the model checkpoint at facebook/wav2vec2-large-xlsr-53 were not used when initializing Wav2Vec2ForCTC: ['project_q.bias', 'quantizer.weight_proj.bias', 'project_hid.bias', 'project_hid.weight', 'project_q.weight', 'quantizer.codevectors', 'quantizer.weight_proj.weight']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-xlsr-53 and are newly initialized: ['lm_head.bias', 'lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to u

The first component of XLSR-Wav2Vec2 consists of a stack of CNN layers that are used to extract acoustically meaningful - but contextually independent - features from the raw speech signal. This part of the model has [already been sufficiently trained during pretraining](https://arxiv.org/pdf/2006.13979.pdf). Thus, we can set the requires_grad to False for all parameters of the feature extraction part.

In [None]:
model.freeze_feature_encoder()

### Parameters related to training

*   group_by_length makes training more efficient by grouping training samples of similar input length into one batch.

*   learning_rate and weight_decay were tuned heuristically until fine-tuning became stable.

See the [TrainingArguments documentation](https://huggingface.co/docs/transformers/main/main_classes/trainer#trainingarguments) for explanations of the other parameters.

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
  output_dir=model_path,
  group_by_length=True,
  per_device_train_batch_size=batch_size,
  gradient_accumulation_steps=gradient_accumulation_steps,
  evaluation_strategy="steps",
  num_train_epochs=60, #60
  fp16=True,
  save_steps=100,
  eval_steps=50,
  logging_steps=50,
  learning_rate=3e-4,
  warmup_steps=500,
  save_total_limit=2
  )
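
As a rough sanity check of this configuration, the arithmetic below estimates the effective batch size and the number of optimizer updates. It assumes a single GPU, the 842 training examples reported by the Map progress bar, and batch_size = 8 with gradient_accumulation_steps = 2 as set elsewhere in this notebook; the exact rounding of steps per epoch depends on the installed transformers version, so treat the totals as approximate.

```python
import math

num_train_examples = 842            # assumed: size of the training split
per_device_train_batch_size = 8     # batch_size in this notebook
gradient_accumulation_steps = 2
num_train_epochs = 60
warmup_steps = 500
eval_steps = 50

# Gradients from several small batches are accumulated before each update.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

batches_per_epoch = math.ceil(num_train_examples / per_device_train_batch_size)
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_updates = updates_per_epoch * num_train_epochs

warmup_fraction = warmup_steps / total_updates
num_evaluations = total_updates // eval_steps

print("effective batch size:", effective_batch_size)
print("optimizer updates per epoch:", updates_per_epoch)
print("total optimizer updates:", total_updates)
print("share of training spent in warmup:", round(warmup_fraction, 3))
print("number of evaluation runs:", num_evaluations)
```

Under these assumptions the 500 warmup steps cover a noticeable share of the whole run, which is worth keeping in mind when the number of epochs or the dataset percentage is reduced.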

Pass all instances to Trainer.

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=na_train,
    eval_dataset=na_valid,
    tokenizer=processor.feature_extractor,
)

### Train

In [None]:
trainer.train()
trainer.save_model()
processor.save_pretrained(model_path)
tokenizer.save_pretrained(training_args.output_dir)
feature_extractor.save_pretrained(model_path)





#### Load the fine-tuned checkpoint

In [None]:
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model = Wav2Vec2ForCTC.from_pretrained(model_path).to("cuda")
processor = Wav2Vec2Processor.from_pretrained(model_path)

#### Fast check

Take the first example of the test set, run it through the model and take the argmax(...) of the logits to retrieve the predicted token ids.

In [None]:
input_dict = processor(na_test["input_values"][0], return_tensors="pt",
                       padding=True, sampling_rate=16000)
logits = model(input_dict.input_values.to("cuda")).logits
pred_ids = torch.argmax(logits, dim=-1)[0]
print("Prediction:")
print(processor.decode(pred_ids))
print("Reference:")
print(na_test['sentence_lower'][0])

## Evaluation

In [None]:
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from jiwer import wer, cer

ref = []
pred = []
wer_list = []
cer_list = []
for i in range(len(na_test)):
    input_dict = processor(na_test["input_values"][i], return_tensors="pt", padding=True, sampling_rate=16000)
    logits = model(input_dict.input_values.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)[0]
    predicted_sentence = processor.decode(pred_ids)
    reference_sentence = na_test['sentence_lower'][i]
    pred.append(predicted_sentence)
    ref.append(reference_sentence)
    wer_score = wer(reference_sentence, predicted_sentence)
    cer_score = cer(reference_sentence, predicted_sentence)
    wer_list.append(wer_score)
    cer_list.append(cer_score)
print('Mean WER', np.mean(wer_list))
print('SD WER', np.std(wer_list))
print('Mean CER', np.mean(cer_list))
print('SD CER', np.std(cer_list))
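
The evaluation cell above averages per-utterance scores. The sketch below, which uses a plain edit-distance implementation instead of jiwer and two made-up sentence pairs, shows how WER and CER are defined and why the mean of per-utterance WERs can differ from a WER pooled over all words. The function names are hypothetical.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: substitutions, insertions and deletions all cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    ref_words = reference.split()  # assumes a non-empty reference
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference: str, hypothesis: str) -> float:
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Made-up (reference, prediction) pairs: one long and correct, one short and wrong.
pairs = [("the cat sat down", "the cat sat down"), ("yes", "no")]

per_utterance: List[float] = [word_error_rate(r, p) for r, p in pairs]
mean_wer = sum(per_utterance) / len(per_utterance)

total_errors = sum(edit_distance(r.split(), p.split()) for r, p in pairs)
total_words = sum(len(r.split()) for r, _ in pairs)
pooled_wer = total_errors / total_words

print("mean of per-utterance WER:", mean_wer)
print("pooled WER:", pooled_wer)
```

The per-utterance mean gives every utterance the same weight, so short utterances influence it as much as long ones; reporting the standard deviation alongside it, as done above, helps to put that in context.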

### Store the results in a CSV file

In [None]:
df_results = pd.DataFrame({'Reference': ref, 'Prediction': pred, 'WER': wer_list, 'CER': cer_list})
csv_path = model_path + "/results_" + str(batch_size) + "_" + str(gradient_accumulation_steps) + str(percentage) + ".csv"
df_results.to_csv(csv_path, index=False, sep='\t')

## Combination with an n-gram language model
While large language models based on the [Transformer architecture](https://jalammar.github.io/illustrated-transformer/) have become the standard in NLP, it is still very common to use an n-gram LM to boost speech recognition systems.

Looking at Table 9 of Appendix C of the [official Wav2Vec2 paper](https://arxiv.org/abs/2006.11477), it can be noticed that using a Transformer-based LM for decoding clearly yields better results than using an n-gram model, but the difference between n-gram and Transformer-based LM is much less significant than the difference between n-gram and no LM.

For example, for the large Wav2Vec2 checkpoint that was fine-tuned on only 10 minutes of data, an n-gram reduces the word error rate (WER) by about 80% compared to no LM, while a Transformer-based LM reduces the WER by only another 23% compared to the n-gram. This relative WER reduction shrinks as the acoustic model is trained on more data. For the large checkpoint, for instance, a Transformer-based LM reduces the WER by merely 8% compared to an n-gram LM, whereas the n-gram still yields a 21% WER reduction compared to no language model.

The reason why an n-gram is preferred over a Transformer-based LM is that n-grams come at a significantly smaller computational cost. For an n-gram, retrieving the probability of a word given the previous words is about as computationally expensive as querying a look-up table or tree-like data structure, i.e. it is very fast compared to modern Transformer-based language models, which require a full forward pass to retrieve the next-word probabilities.
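
The look-up character of an n-gram can be seen in a toy bigram model. This is only a sketch: it uses unsmoothed relative frequencies on three made-up sentences, whereas KenLM applies Kneser-Ney smoothing and far more compact data structures. The function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def train_bigram(sentences: List[str]) -> Tuple[Counter, Dict[str, Counter]]:
    """Count how often each word follows each context word."""
    context_counts: Counter = Counter()
    bigram_counts: Dict[str, Counter] = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[prev][word] += 1
    return context_counts, bigram_counts


def bigram_probability(prev: str, word: str,
                       context_counts: Counter,
                       bigram_counts: Dict[str, Counter]) -> float:
    """P(word | prev) as a relative frequency: two dictionary look-ups."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[prev][word] / context_counts[prev]


corpus = ["the cat sat", "the cat ran", "the dog sat"]
context_counts, bigram_counts = train_bigram(corpus)

p_cat_given_the = bigram_probability("the", "cat", context_counts, bigram_counts)
p_sat_given_cat = bigram_probability("cat", "sat", context_counts, bigram_counts)
p_unseen = bigram_probability("dog", "ran", context_counts, bigram_counts)
print(p_cat_given_the, p_sat_given_cat, p_unseen)
```

The unseen bigram receives probability zero here, which is exactly the problem that smoothing methods address.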

For more information on how n-grams function and why they are (still) so useful for speech recognition, the reader is advised to take a look at [this excellent summary](https://web.stanford.edu/~jurafsky/slp3/3.pdf) from Stanford.

We will use the popular [KenLM library](https://github.com/kpu/kenlm).


We can include the language model as a factor in the inference problem:

$$Y^* = \underset{Y}{\mathrm{argmax}}\;p(Y \mid X) \cdot p(Y)^\alpha \cdot L(Y)^\beta.$$

The function $L(Y)$ computes the length of $Y$ in terms of the language model tokens and acts as a word insertion bonus. With a word-based language model $L(Y)$ counts the number of words in $Y$. If we use a character-based language model then $L(Y)$ counts the number of characters in $Y$. The language model scores are only included when a prefix is extended by a character (or word) and not at every step of the algorithm. This causes the search to favor shorter prefixes, as measured by $L(Y)$, since they don’t include as many language model updates. The word insertion bonus helps with this. The parameters $\alpha$ and $\beta$ are usually set by cross-validation.

The language model scores and word insertion term can be included in the beam search. Whenever we propose to extend a prefix by a character, we can include the language model score for the new character given the prefix so far.
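
In practice the product above is evaluated in the log domain, where it becomes a weighted sum. The sketch below scores two hypothetical candidate transcriptions with made-up log-probabilities to show how the language model term can change the ranking; the function name, the candidates and the values of $\alpha$ and $\beta$ are assumptions, and natural logarithms are used throughout.

```python
import math


def fused_log_score(acoustic_logp: float, lm_logp: float, length: int,
                    alpha: float, beta: float) -> float:
    """log of p(Y|X) * p(Y)^alpha * L(Y)^beta."""
    return acoustic_logp + alpha * lm_logp + beta * math.log(length)


# Made-up candidates: A is slightly preferred acoustically,
# B is much more plausible according to the language model.
candidates = {
    "A": {"acoustic_logp": -4.0, "lm_logp": -9.0, "length": 3},
    "B": {"acoustic_logp": -4.5, "lm_logp": -5.0, "length": 3},
}


def best(alpha: float, beta: float) -> str:
    """Return the name of the candidate with the highest fused score."""
    return max(
        candidates,
        key=lambda name: fused_log_score(alpha=alpha, beta=beta, **candidates[name]),
    )


best_without_lm = best(alpha=0.0, beta=0.0)
best_with_lm = best(alpha=0.5, beta=1.0)
print("without LM:", best_without_lm, "| with LM:", best_with_lm)
```

Both candidates have the same length here, so the length term does not affect the ranking; it matters when the beam contains prefixes with different numbers of words.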


### Install the Ubuntu library prerequisites

In [None]:
!sudo apt install build-essential cmake libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev

### Install KenLM

Download and unpack the [KenLM](https://kheafield.com/code/kenlm/) repo.

In [None]:
!wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz

KenLM is written in C++, so we'll make use of cmake to build the binaries.

In [None]:
!mkdir kenlm/build && cd kenlm/build && cmake .. && make -j2

### Kneser–Ney smoothing

KenLM by default computes an n-gram with [Kneser-Ney smoothing](https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing). All text data used to create the n-gram is expected to be stored in a text file.

A language model that is useful for a speech recognition system should support the acoustic model, e.g. Wav2Vec2, in predicting the next word (or token, letter) and therefore model the following distribution:
$$P(w_t \mid \mathbf{w}_0^{t-1}),$$
with $w_t$ the next word and $\mathbf{w}_0^{t-1}$ the sequence of all previous words since the beginning of the utterance.

The language model should be good at modeling language that corresponds to the target transcriptions of the speech recognition system.

A dataset that is relatively clean and easy to pre-process is [europarl_bilingual](https://huggingface.co/datasets/europarl_bilingual), which is based on discussions and talks of the European Parliament. It should therefore correspond well to read-out audio data. The dataset was originally designed for machine translation and can therefore only be accessed in translation pairs. We extract only the text of the target language, German (de), from the Danish-to-German translations.

We download the dataset and save it as a .txt file.


In [None]:
target_lang="de"

from datasets import load_dataset

dataset = load_dataset("europarl_bilingual", lang1="da", lang2=target_lang, split="train")

### Alphabet matching

The alphabet of the language model should match the one of the fine-tuned acoustic checkpoints.

We can write a single map function to extract the German text and process it right away.

In [None]:
import re

chars_to_ignore_regex = r'[,?.!\-\;\:"“%‘”�—’*«»…–]'  # raw string, so the regex escapes reach re unchanged

def extract_text(batch):
  text = batch["translation"][target_lang]
  batch["text"] = re.sub(chars_to_ignore_regex, "", text.lower())
  return batch

dataset = dataset.map(extract_text, remove_columns=dataset.column_names)
with open("drive/MyDrive/SAURIANO/kenLMtext.txt", "w") as file:
  file.write(" ".join(dataset["text"]))

### Build the n-gram

Run KenLM's lmplz command to build the n-gram, called "5gram.arpa". As it's relatively common in speech recognition, we build a 5-gram by passing the -o 5 parameter. For more information on the different n-gram LM that can be built with KenLM, one can take a look at the official website of [KenLM](https://kheafield.com/code/kenlm/).

In [None]:
!kenlm/build/bin/lmplz -o 5 <"drive/MyDrive/SAURIANO/kenLMtext.txt" > "drive/MyDrive/SAURIANO/5gram.arpa"

Let's inspect the first couple of lines.

In [None]:
!head -20 drive/MyDrive/SAURIANO/5gram.arpa

### Add the special token for the end of a sentence: \</s>

The 5-gram correctly includes an "unknown" token, \<unk>, as well as a begin-of-sentence token, \<s>, but no end-of-sentence token, \</s>.

We add the end-of-sentence token by copying the begin-of-sentence line with \<s> replaced by \</s> (e.g. 0 \</s> -0.11831701), placing it directly below the begin-of-sentence line, and increasing the ngram 1 count by 1.

In [None]:
with open("drive/MyDrive/SAURIANO/5gram.arpa", "r") as read_file, open("drive/MyDrive/SAURIANO/5gram_correct.arpa", "w") as write_file:
  has_added_eos = False
  for line in read_file:
    if not has_added_eos and "ngram 1=" in line:
      count=line.strip().split("=")[-1]
      write_file.write(line.replace(f"{count}", f"{int(count)+1}"))
    elif not has_added_eos and "<s>" in line:
      write_file.write(line)
      write_file.write(line.replace("<s>", "</s>"))
      has_added_eos = True
    else:
      # every other line is copied unchanged; KenLM's lowercase <unk> entry is kept as written
      write_file.write(line)
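
The effect of this rewrite can be checked on a tiny, hand-written ARPA fragment without touching the real file. The sketch below follows the same logic on in-memory text; the function name `add_eos_to_arpa` and the toy ARPA content are assumptions for illustration and do not come from the actual 5-gram.

```python
import io


def add_eos_to_arpa(read_file, write_file) -> None:
    """Copy an ARPA file, adding a </s> unigram right below the <s> unigram."""
    has_added_eos = False
    for line in read_file:
        if not has_added_eos and "ngram 1=" in line:
            # the header announces the number of unigrams: bump it by one
            prefix, count = line.strip().split("=")
            write_file.write(f"{prefix}={int(count) + 1}\n")
        elif not has_added_eos and "<s>" in line:
            write_file.write(line)
            write_file.write(line.replace("<s>", "</s>"))
            has_added_eos = True
        else:
            write_file.write(line)


# Hypothetical miniature ARPA file with one real word.
toy_arpa = (
    "\\data\\\n"
    "ngram 1=3\n"
    "ngram 2=2\n"
    "\n"
    "\\1-grams:\n"
    "-1.0\t<unk>\t0\n"
    "0\t<s>\t-0.3\n"
    "-0.5\thallo\t-0.2\n"
    "\n"
    "\\2-grams:\n"
    "-0.1\t<s> hallo\n"
    "-0.2\thallo hallo\n"
    "\n"
    "\\end\\\n"
)

source = io.StringIO(toy_arpa)
target = io.StringIO()
add_eos_to_arpa(source, target)
fixed_lines = target.getvalue().splitlines()
print("\n".join(fixed_lines[:8]))
```

Only the first line containing \<s> is duplicated, so higher-order n-grams that start with \<s> are copied as they are.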

Let's now inspect the corrected 5-gram.

In [None]:
!head -20 drive/MyDrive/SAURIANO/5gram_correct.arpa

### Integrate the n-gram with pyctcdecode and 🤗 Transformers

Install [pyctcdecode](https://pypi.org/project/pyctcdecode/).

In [None]:
!pip install https://github.com/kpu/kenlm/archive/master.zip pyctcdecode

Extract the alphabet of the processor's tokenizer, as it represents the "labels" of pyctcdecode's BeamSearchDecoder class.

In [None]:
vocab_dict = processor.tokenizer.get_vocab()
print(vocab_dict)
sorted_vocab_dict = {k: v for k, v in sorted(vocab_dict.items(), key=lambda item: item[1])}
print(sorted_vocab_dict)


### Build the CTC decoder

The "labels" and the previously built 5gram_correct.arpa file are all that is needed to build the decoder.

In [None]:
from pyctcdecode import build_ctcdecoder

decoder = build_ctcdecoder(
    labels=list(sorted_vocab_dict.keys()),
    kenlm_model_path="drive/MyDrive/SAURIANO/5gram_correct.arpa",
)

### Create the Wav2Vec2ProcessorWithLM

Wrap the just created decoder, together with the processor's tokenizer and feature_extractor into a Wav2Vec2ProcessorWithLM class.

In [None]:
from transformers import Wav2Vec2ProcessorWithLM

processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=feature_extractor,
    tokenizer=tokenizer,
    decoder=decoder
)

### Reduce the n-gram size

The 5-gram LM is quite large: it amounts to more than 4 GB. To reduce the size of the n-gram and make loading faster, KenLM allows converting .arpa files to binary ones using the build_binary executable.

In [None]:
!kenlm/build/bin/build_binary drive/MyDrive/SAURIANO/5gram_correct.arpa drive/MyDrive/SAURIANO/5gram.bin

Remove the .arpa file.

In [None]:
!rm drive/MyDrive/SAURIANO/5gram_correct.arpa

### Decode audio data with Wav2Vec2 and a language model

In contrast to decoding the audio without a language model, the processor now directly receives the model's output logits instead of the argmax(logits) (called pred_ids above). The reason is that when decoding with a language model, the processor takes the probabilities of all possible output characters into account at each time step.




#### Decode

In [None]:
ref = []
pred = []
wer_list = []
cer_list = []
for i in range(len(na_test)):
  input_dict = processor(na_test["input_values"][i], return_tensors="pt", padding=True, sampling_rate=16000)
  logits = model(input_dict.input_values.to("cuda")).logits.detach().cpu().numpy()
  transcription = processor_with_lm.batch_decode(logits).text
  predicted_sentence = transcription[0].lower()
  reference_sentence = na_test['sentence_lower'][i]
  pred.append(predicted_sentence)
  ref.append(reference_sentence)
  wer_score = wer(reference_sentence, predicted_sentence)
  cer_score = cer(reference_sentence, predicted_sentence)
  wer_list.append(wer_score)
  cer_list.append(cer_score)
print('Mean WER', np.mean(wer_list))
print('SD WER', np.std(wer_list))
print('Mean CER', np.mean(cer_list))
print('SD CER', np.std(cer_list))

## Evaluation on noisy audio data: Phonogrammarchiv

TODO: segment the recordings properly. For now, each recording is cut into fixed five-second frames.

In [None]:
noisy_data_dir = os.path.join(root_dir, 'Phonogrammarchiv')
frame_length = 80000

phonogrammarchiv = []
transcriptions = []
for noisy_audio in os.listdir(noisy_data_dir):
  noisy_speech_array, noisy_sampling_rate = torchaudio.load(os.path.join(noisy_data_dir, noisy_audio))
  noisy_speech = noisy_speech_array[0].numpy()
  noisy_speech = librosa.resample(np.asarray(noisy_speech), orig_sr=noisy_sampling_rate, target_sr=16000)
  num_frames = len(noisy_speech) // frame_length
  for i in range(num_frames):
    start = i * frame_length
    input_dict = processor(noisy_speech[start : start + frame_length], return_tensors="pt", padding=True, sampling_rate=16000)
    logits = model(input_dict.input_values.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)[0]
    predicted_sentence = processor.decode(pred_ids)
    phonogrammarchiv.append(noisy_audio + '_' + str(i))
    transcriptions.append(predicted_sentence)
  if len(noisy_speech) - num_frames * frame_length > 16000:
    input_dict = processor(noisy_speech[num_frames * frame_length :], return_tensors="pt", padding=True, sampling_rate=16000)
    logits = model(input_dict.input_values.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)[0]
    predicted_sentence = processor.decode(pred_ids)
    phonogrammarchiv.append(noisy_audio + '_' + str(num_frames))
    transcriptions.append(predicted_sentence)

df_results = pd.DataFrame({'Five seconds Audio segment': phonogrammarchiv, 'Transcription': transcriptions})
csv_path = model_path + "/transcriptions_phonogrammarchiv_" + str(batch_size) + "_" + str(gradient_accumulation_steps) + str(percentage) + ".csv"
df_results.to_csv(csv_path, index=False, sep='\t')
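
The fixed-length segmentation used above can be isolated in a small helper, which makes the boundary behaviour easy to inspect. At 16 kHz, frame_length = 80000 corresponds to five seconds, and a trailing remainder is kept only if it is longer than 16000 samples, i.e. one second. The function name `frame_boundaries` and the example durations are assumptions for this sketch.

```python
from typing import List, Tuple


def frame_boundaries(num_samples: int, frame_length: int = 80000,
                     min_tail: int = 16000) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of the segments to transcribe."""
    num_frames = num_samples // frame_length
    bounds = [(i * frame_length, (i + 1) * frame_length) for i in range(num_frames)]
    tail_start = num_frames * frame_length
    # keep the remainder only if it is longer than min_tail samples
    if num_samples - tail_start > min_tail:
        bounds.append((tail_start, num_samples))
    return bounds


sampling_rate = 16000
long_tail = frame_boundaries(int(12.5 * sampling_rate))   # 2.5 s left over
short_tail = frame_boundaries(int(10.5 * sampling_rate))  # 0.5 s left over
print(long_tail)
print(short_tail)
```

Because the cuts ignore word boundaries, words that straddle a frame edge are split between two segments, which is one reason a proper segmentation is still needed.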

## Grid search

Because the dataset is quite small (~1.5h of training data) and because it is a bit noisy, fine-tuning Facebook's wav2vec2-large-xlsr-53 checkpoint seems to require some hyper-parameter tuning: dropout, SpecAugment's masking dropout rate, layer dropout, and learning rate.

We perform a grid search to find the optimal hyperparameters.
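
As a sketch of what such a grid looks like, the snippet below enumerates every combination of the four hyperparameters with `itertools.product`. The candidate values are the ones left as comments next to the ranges in the cell that follows; the dictionary layout and variable names are assumptions, and no training is run here.

```python
from itertools import product

# Candidate values per hyperparameter (taken from the commented-out ranges).
search_space = {
    "attention_dropout": [0.1, 0.5],
    "hidden_dropout": [0.1, 0.5],
    "mask_time_prob": [0.05, 0.1],
    "layerdrop": [0.1, 0.5],
}

names = list(search_space)
# One dict per combination; each would correspond to one full fine-tuning run.
configs = [dict(zip(names, values)) for values in product(*search_space.values())]

print(len(configs), "configurations")
for config in configs[:3]:
    print(config)
```

The number of runs grows multiplicatively with each added value, so with a 60-epoch schedule per run it can be worth fixing most ranges to a single value, as the cell below does.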

In [None]:
import csv

loaded_data = {'path': [], 'sentence': [], 'speaker': []}
csv_filename = 'drive/MyDrive/SAURIANO/data.csv'

with open(csv_filename, 'r', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        loaded_data['path'].append(row['path'])
        loaded_data['sentence'].append(row['sentence'])
        loaded_data['speaker'].append(row['speaker'])

data = loaded_data

from datasets import Dataset
from sklearn.model_selection import train_test_split

dataset = Dataset.from_dict(data)

train_indices, temp_indices = train_test_split(range(len(dataset)), test_size=0.2, random_state=42)
val_indices, test_indices = train_test_split(temp_indices, test_size=0.5, random_state=42)

import random

percentage = 100
train_num_elements_to_select = int(len(train_indices) * percentage / 100)
train_random_selection = random.sample(train_indices, train_num_elements_to_select)

# sample from the validation indices only, so the validation set stays disjoint from the test set
valid_num_elements_to_select = int(len(val_indices) * percentage / 100)
valid_random_selection = random.sample(val_indices, valid_num_elements_to_select)

na_train = dataset.select(train_random_selection)
na_valid = dataset.select(valid_random_selection)
na_test = dataset.select(test_indices)

from datasets import ClassLabel
import random
from IPython.display import display, HTML
import pandas as pd


PUNC_SYMBOLS = [',', '.', ':', ';', '?', '!', "'", '"', '*', '«',
                '»', '“', '”', "", "–", "-", '‘', '’', '…']

def final_text_words(batch):
    sentence = batch['sentence']
    sentence_lower = sentence.lower()
    for char in sentence_lower:
      if char in PUNC_SYMBOLS:
        sentence_lower = sentence_lower.replace(char,"")
    batch['sentence_lower'] = sentence_lower
    characters_per_sentence = []
    for char in sentence_lower:
        if char == " ":
            if characters_per_sentence and characters_per_sentence[-1] != "|":
                characters_per_sentence.append("|")
        elif char not in PUNC_SYMBOLS:
            characters_per_sentence.append(char)
    batch['characters'] = characters_per_sentence
    return batch


def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    df = pd.DataFrame(dataset[picks])
    display(HTML(df.to_html()))


na_train_1 = na_train.map(final_text_words)
na_train_2 = na_train_1.remove_columns(['sentence'])
na_test_1 = na_test.map(final_text_words)
na_test_2 = na_test_1.remove_columns(['sentence'])
na_valid_1 = na_valid.map(final_text_words)
na_valid_2 = na_valid_1.remove_columns(['sentence'])


def extract_all_chars(batch):
  all_text = "".join(batch["sentence_lower"])
  voc = []
  for i in batch['characters']:
    voc.append(list(set(i)))
  voc = [item for l in voc for item in l]
  vocab = list(set(voc))
  return {"vocab": [vocab], "all_text": [all_text]}


vocab_train = na_train_2.map(extract_all_chars, batched=True,
                           batch_size=-1, keep_in_memory=True,
                           remove_columns=na_train_2.column_names)
vocab_valid = na_valid_2.map(extract_all_chars, batched=True,
                         batch_size=-1, keep_in_memory=True,
                         remove_columns=na_valid_2.column_names)

vocab_list = list(set(vocab_train["vocab"][0]) | set(vocab_valid["vocab"][0]))
vocab_dict = {v: k for k, v in enumerate(vocab_list)}
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)


import json
with open('/content/drive/MyDrive/SAURIANO/vocab.json', 'w') as vocab_file:
    json.dump(vocab_dict, vocab_file)


from transformers import Wav2Vec2CTCTokenizer
tokenizer = Wav2Vec2CTCTokenizer("/content/drive/MyDrive/SAURIANO/vocab.json",
                                 unk_token="[UNK]", pad_token="[PAD]",
                                 word_delimiter_token="|")


from transformers import Wav2Vec2FeatureExtractor

feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1,
                                             sampling_rate=16000,
                                             padding_value=0.0,
                                             do_normalize=True,
                                             return_attention_mask=True)


from transformers import Wav2Vec2Processor
import os
import torchaudio

import random

rand_int = random.randint(0, len(na_train_2) - 1)  # randint is inclusive on both ends


batch_size = 8 # SET BATCH SIZE TO 8 OR 16
if batch_size == 8:
  gradient_accumulation_steps = 2
elif batch_size == 16:
  gradient_accumulation_steps = 8


from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union
import torch
import librosa
import numpy as np


@dataclass
class DataCollatorCTCWithPadding:

    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True
    max_length: Optional[int] = None
    max_length_labels: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    pad_to_multiple_of_labels: Optional[int] = None

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [
            {"input_values": feature["input_values"]} for feature in features]
        label_features = [
            {"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        with self.processor.as_target_processor():
            labels_batch = self.processor.pad(
                label_features,
                padding=self.padding,
                max_length=self.max_length_labels,
                pad_to_multiple_of=self.pad_to_multiple_of_labels,
                return_tensors="pt",
            )

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(
            labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        return batch


from evaluate import load

wer_metric = load("wer")
cer_metric = load("cer")


from transformers import TrainingArguments

from itertools import product

# Define ranges for hyperparameters
attention_dropout_range = [0.1] #[0.1, 0.5]
hidden_dropout_range = [0.1] #[0.1, 0.5]
mask_time_prob_range = [0.1] #[0.05, 0.1]
layerdrop_range = [0.1] #[0.1, 0.5]

attention_dropout = 0.1
hidden_dropout = 0.1
mask_time_prob = 0.075
layerdrop = 0.1

# Create all possible combinations of hyperparameters
hyperparameter_combinations = product(
    attention_dropout_range,
    hidden_dropout_range,
    mask_time_prob_range,
    layerdrop_range
    )


from transformers import Wav2Vec2ForCTC, Trainer, Wav2Vec2Processor
from jiwer import wer, cer
noisy_data_dir = os.path.join(root_dir, 'Phonogrammarchiv')
frame_length = 80000

hyperparameters_dict = {'attention_dropout': [], 'hidden_dropout': [],
                        'mask_time_prob': [], 'layerdrop': [], 'metrics': []}

hyperparameters_dict_eval = {'attention_dropout': [], 'hidden_dropout': [],
                             'mask_time_prob': [], 'layerdrop': [],
                             'wer_mean': [], 'cer_mean': [],
                             'wer_std': [], 'cer_std': []}

#for attention_dropout, hidden_dropout, mask_time_prob, layerdrop in hyperparameter_combinations:
#for i in range(10, 110, 10):



model_path = "/content/drive/MyDrive/SAURIANO/model_" + str(batch_size) + "_" + str(gradient_accumulation_steps) + "_" + str(percentage) + "_" + str(attention_dropout) + "_" + str(hidden_dropout) + "_" + str(mask_time_prob) + "_" + str(layerdrop)

processor = Wav2Vec2Processor(feature_extractor=feature_extractor,
                              tokenizer=tokenizer)
processor.save_pretrained(model_path)


tokenizer.add_tokens(["<s>", "</s>"], special_tokens=True)
tokenizer.save_pretrained(model_path)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor,
                              tokenizer=tokenizer)
processor.save_pretrained(model_path)

def speech_file_to_array_fn(batch):
    # speech_array, sampling_rate = torchaudio.load(batch["path"])
    speech_array, sampling_rate = torchaudio.load(os.path.join(root_dir, batch["path"]))
    batch["speech"] = speech_array[0].numpy()
    batch["sampling_rate"] = sampling_rate
    batch["target_text"] = batch["sentence_lower"]
    return batch

show_random_elements(na_train_2, num_examples=5)

na_train_3 = na_train_2.map(speech_file_to_array_fn,
                      remove_columns=na_train_2.column_names)
na_valid_3 = na_valid_2.map(speech_file_to_array_fn,
                      remove_columns=na_valid_2.column_names)
na_test_3 = na_test_2.map(speech_file_to_array_fn)


print("Target text:", na_train_3[rand_int]["target_text"])
print("Input array shape:", np.asarray(na_train_3[rand_int]["speech"]).shape)
print("Sampling rate:", na_train_3[rand_int]["sampling_rate"])


def resample(batch):
  batch["speech"] = librosa.resample(np.asarray(batch["speech"]),
                                      orig_sr=batch["sampling_rate"],
                                      target_sr=16000)
  batch["sampling_rate"] = 16000
  return batch


na_train_4 = na_train_3.map(resample, num_proc=3)
na_valid_4 = na_valid_3.map(resample, num_proc=3)
na_test_4 = na_test_3.map(resample, num_proc=3)


def prepare_dataset(batch):
    # check that all files have the correct sampling rate
    assert (
        len(set(batch["sampling_rate"])) == 1
    ), f"Make sure all inputs have the same sampling rate of {processor.feature_extractor.sampling_rate}."

    batch["input_values"] = processor(batch["speech"],
                                      sampling_rate=batch["sampling_rate"][0]).input_values

    with processor.as_target_processor():
        batch["labels"] = processor(batch["target_text"]).input_ids
    return batch


na_train_5 = na_train_4.map(prepare_dataset, remove_columns=na_train_4.column_names,
                        batch_size=batch_size, batched=True)
na_valid_5 = na_valid_4.map(prepare_dataset, remove_columns=na_valid_4.column_names,
                      batch_size=batch_size, batched=True)
na_test_5 = na_test_4.map(prepare_dataset, batch_size=batch_size, batched=True)

data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)

metrics = {"wer": [], "cer": []}

def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metrics
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    wer = wer_metric.compute(predictions=pred_str, references=label_str)
    cer = cer_metric.compute(predictions=pred_str, references=label_str)

    metrics['wer'].append(wer)
    metrics['cer'].append(cer)

    return {"wer": wer, "cer": cer}


training_args = TrainingArguments(
    output_dir=model_path,
    group_by_length=True,
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    evaluation_strategy="steps",
    num_train_epochs=60,
    fp16=True,
    save_steps=100,
    eval_steps=50,
    logging_steps=50,
    learning_rate=3e-4,
    warmup_steps=500,
    save_total_limit=2,
    # Stored on the arguments object only; Trainer does not act on it. To
    # actually resume, call trainer.train(resume_from_checkpoint=True).
    resume_from_checkpoint=True,
)

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    attention_dropout=attention_dropout,
    hidden_dropout=hidden_dropout,
    feat_proj_dropout=0.0,
    mask_time_prob=mask_time_prob,
    layerdrop=layerdrop,
    gradient_checkpointing=True,
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer)
    )

trainer = Trainer(
  model=model,
  data_collator=data_collator,
  args=training_args,
  compute_metrics=compute_metrics,
  train_dataset=na_train_5,
  eval_dataset=na_valid_5,
  tokenizer=processor.feature_extractor,
  )

trainer.train()
#trainer.train(resume_from_checkpoint=True)
trainer.save_model()
processor.save_pretrained(model_path)
tokenizer.save_pretrained(training_args.output_dir)
feature_extractor.save_pretrained(model_path)
hyperparameters_dict['attention_dropout'].append(attention_dropout)
hyperparameters_dict['hidden_dropout'].append(hidden_dropout)
hyperparameters_dict['mask_time_prob'].append(mask_time_prob)
hyperparameters_dict['layerdrop'].append(layerdrop)
hyperparameters_dict['metrics'].append(metrics)
model = Wav2Vec2ForCTC.from_pretrained(model_path).to("cuda")
processor = Wav2Vec2Processor.from_pretrained(model_path)

ref = []
pred = []
wer_list = []
cer_list = []
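for i in range(len(na_test_5)):
    # Index by row first: na_test_5["input_values"][i] can materialize the
    # whole column on every iteration.
    example = na_test_5[i]
    input_dict = processor(example["input_values"], return_tensors="pt", padding=True, sampling_rate=16000)
    # Inference only, so skip gradient tracking.
    with torch.no_grad():
        logits = model(input_dict.input_values.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)[0]
    predicted_sentence = processor.decode(pred_ids)
    reference_sentence = example["sentence_lower"]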
    pred.append(predicted_sentence)
    ref.append(reference_sentence)
    wer_score = wer(reference_sentence, predicted_sentence)
    cer_score = cer(reference_sentence, predicted_sentence)
    wer_list.append(wer_score)
    cer_list.append(cer_score)

hyperparameters_dict_eval['attention_dropout'].append(attention_dropout)
hyperparameters_dict_eval['hidden_dropout'].append(hidden_dropout)
hyperparameters_dict_eval['mask_time_prob'].append(mask_time_prob)
hyperparameters_dict_eval['layerdrop'].append(layerdrop)
hyperparameters_dict_eval['wer_mean'].append(np.mean(wer_list))
hyperparameters_dict_eval['cer_mean'].append(np.mean(cer_list))
hyperparameters_dict_eval['wer_std'].append(np.std(wer_list))
hyperparameters_dict_eval['cer_std'].append(np.std(cer_list))

df_results = pd.DataFrame({'Reference': ref, 'Prediction': pred, 'WER': wer_list, 'CER': cer_list})
csv_path = model_path + "/results_" + str(batch_size) + "_" + str(gradient_accumulation_steps) + str(percentage) + ".csv"
df_results.to_csv(csv_path, index=False, sep='\t')


phonogrammarchiv = []
transcriptions = []
for noisy_audio in os.listdir(noisy_data_dir):
  noisy_speech_array, noisy_sampling_rate = torchaudio.load(os.path.join(noisy_data_dir, noisy_audio))
  noisy_speech = noisy_speech_array[0].numpy()
  noisy_speech = librosa.resample(np.asarray(noisy_speech), orig_sr=noisy_sampling_rate, target_sr=16000)
  num_frames = len(noisy_speech) // frame_length
  for i in range(num_frames):
    start = i * frame_length
    input_dict = processor(noisy_speech[start : start + frame_length], return_tensors="pt", padding=True, sampling_rate=16000)
    with torch.no_grad():  # inference only, no gradient tracking needed
      logits = model(input_dict.input_values.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)[0]
    predicted_sentence = processor.decode(pred_ids)
    phonogrammarchiv.append(noisy_audio + '_' + str(i))
    transcriptions.append(predicted_sentence)
  # Transcribe the leftover tail only if it is longer than 1 s (16000 samples at 16 kHz).
  if len(noisy_speech) - num_frames * frame_length > 16000:
    input_dict = processor(noisy_speech[num_frames * frame_length :], return_tensors="pt", padding=True, sampling_rate=16000)
    with torch.no_grad():  # inference only, no gradient tracking needed
      logits = model(input_dict.input_values.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)[0]
    predicted_sentence = processor.decode(pred_ids)
    phonogrammarchiv.append(noisy_audio + '_' + str(num_frames))
    transcriptions.append(predicted_sentence)

df_results = pd.DataFrame({'Five seconds Audio segment': phonogrammarchiv, 'Transcription': transcriptions})
csv_path = model_path + "/transcriptions_phonogrammarchiv_" + str(batch_size) + "_" + str(gradient_accumulation_steps) + str(percentage) + ".csv"
df_results.to_csv(csv_path, index=False, sep='\t')
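
The chunking logic in the loop above (full frames of `frame_length` samples, plus a final shorter chunk only when more than one second of audio remains) can be isolated into a small helper. The sketch below is illustrative only: `segment_bounds` and its `min_tail` parameter are hypothetical names that do not appear in the notebook, and the five-second frame length is an assumption inferred from the `'Five seconds Audio segment'` column name.

```python
def segment_bounds(num_samples, frame_length, min_tail=16000):
    """Return (start, end) sample indices mirroring the chunking used above.

    Hypothetical helper, not part of the notebook.
    """
    num_frames = num_samples // frame_length
    bounds = [(i * frame_length, (i + 1) * frame_length) for i in range(num_frames)]
    tail_start = num_frames * frame_length
    # Keep the remainder only if it is strictly longer than min_tail samples.
    if num_samples - tail_start > min_tail:
        bounds.append((tail_start, num_samples))
    return bounds


# Assumed 5 s frames at 16 kHz; a 12.5 s recording has 200000 samples.
frame_length = 5 * 16000
print(segment_bounds(200_000, frame_length))
# [(0, 80000), (80000, 160000), (160000, 200000)]
```

Computing the bounds up front would let a single loop handle both the full frames and the tail, instead of repeating the inference code for the last chunk.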




Map:   0%|          | 0/842 [00:00<?, ? examples/s]

Map:   0%|          | 0/106 [00:00<?, ? examples/s]

Map:   0%|          | 0/211 [00:00<?, ? examples/s]

Map:   0%|          | 0/842 [00:00<?, ? examples/s]

Map:   0%|          | 0/211 [00:00<?, ? examples/s]

Unnamed: 0,path,speaker,sentence_lower,characters
0,segments/ZOOM0025_11.mp3,Renza,unt in onder ame hause vame peater klendar sel in kleindis,"[u, n, t, |, i, n, |, o, n, d, e, r, |, a, m, e, |, h, a, u, s, e, |, v, a, m, e, |, p, e, a, t, e, r, |, k, l, e, n, d, a, r, |, s, e, l, |, i, n, |, k, l, e, i, n, d, i, s]"
1,segments/ZOOM0044_6.mp3,Duilio,desevortn ist geben de gebounhait schie za petan in roasnkhronz,"[d, e, s, e, v, o, r, t, n, |, i, s, t, |, g, e, b, e, n, |, d, e, |, g, e, b, o, u, n, h, a, i, t, |, s, c, h, i, e, |, z, a, |, p, e, t, a, n, |, i, n, |, r, o, a, s, n, k, h, r, o, n, z, |]"
2,segments/ZOOM0032_22.mp3,Gianni,bairach van der khurche ne unt olif unt a khorzle a ma s ist geben ce fa,"[b, a, i, r, a, c, h, |, v, a, n, |, d, e, r, |, k, h, u, r, c, h, e, |, n, e, |, u, n, t, |, o, l, i, f, |, u, n, t, |, a, |, k, h, o, r, z, l, e, |, a, |, m, a, |, s, |, i, s, t, |, g, e, b, e, n, |, c, e, |, f, a]"
3,segments/ZOOM0047_9.mp3,Duilio,unt se ontse gesot,"[u, n, t, |, s, e, |, o, n, t, s, e, |, g, e, s, o, t, |]"
4,segments/ZOOM0030_17.mp3,Gianni,vaspegn as otn gehot ausgelot,"[v, a, s, p, e, g, n, |, a, s, |, o, t, n, |, g, e, h, o, t, |, a, u, s, g, e, l, o, t, |]"



Target text: unt sel istar horte khemen auvar 
Input array shape: (154218,)
Sampling rate: 44100
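
The printed example has 154218 samples at 44100 Hz, which is roughly 3.5 s of audio. As a quick sanity check on the resampling step, the expected array length at 16 kHz can be estimated with plain arithmetic. This is only a sketch: the exact length returned by a given resampler may differ by a sample depending on its rounding convention, which is assumed here to be rounding up.

```python
import math

orig_sr = 44_100       # sampling rate printed above
target_sr = 16_000     # rate expected by the XLSR-Wav2Vec2 checkpoint
num_samples = 154_218  # array length printed above

duration_s = num_samples / orig_sr
# Assumption: the resampler rounds the output length up.
expected_len = math.ceil(num_samples * target_sr / orig_sr)

print(f"duration: {duration_s:.3f} s")
print(f"expected length at 16 kHz: about {expected_len} samples")
```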



Some weights of the model checkpoint at facebook/wav2vec2-large-xlsr-53 were not used when initializing Wav2Vec2ForCTC: ['project_q.weight', 'project_hid.weight', 'project_hid.bias', 'project_q.bias', 'quantizer.codevectors', 'quantizer.weight_proj.weight', 'quantizer.weight_proj.bias']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-xlsr-53 and are newly initialized: ['lm_head.weight', 'lm_head.bias']
You should probably TRAIN this model on a down-stream task to be able to u

Step,Training Loss,Validation Loss,Wer,Cer
50,15.4571,15.444547,1.0,0.901935
100,9.4138,3.930475,1.0,0.979328
150,3.4364,2.984653,1.0,0.979328
200,2.9404,2.879167,1.0,0.979328
250,2.8868,2.870449,1.0,0.979328
300,2.8779,2.852555,1.0,0.979328
350,2.8671,2.831174,1.0,0.979328
400,2.7419,2.433332,0.99889,0.973116
450,1.8914,0.963609,0.994451,0.30336
500,0.925,0.520023,0.598779,0.156823
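
The `Wer` and `Cer` columns report word and character error rate: the edit distance between prediction and reference, counted in words or in characters, divided by the reference length. The following is a minimal pure-Python sketch of that definition for a single sentence pair. It is not the implementation behind `wer_metric` and `cer_metric`, which score the whole evaluation set at once, and the helper names are hypothetical. The reference sentence is taken from the sample table above; the prediction is made up for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, prediction):
    ref_words = reference.split()
    return edit_distance(ref_words, prediction.split()) / len(ref_words)


def char_error_rate(reference, prediction):
    return edit_distance(reference, prediction) / len(reference)


reference = "unt se ontse gesot"   # from the sample table above
prediction = "unt se onse gesot"   # made-up prediction for illustration
print(word_error_rate(reference, prediction))  # 0.25
print(char_error_rate(reference, prediction))  # 1/18, about 0.056
```

Under this definition an empty prediction scores exactly 1.0 on both measures, since every reference word or character counts as a deletion.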








In [None]:
from itertools import product

# Define ranges for hyperparameters.
# Full grid: 5 x 5 x 4 x 5 = 500 combinations.
attention_dropout_range = [0.1, 0.2, 0.3, 0.4, 0.5]
hidden_dropout_range = [0.1, 0.2, 0.3, 0.4, 0.5]
mask_time_prob_range = [0.05, 0.1, 0.15, 0.2]
layerdrop_range = [0.1, 0.2, 0.3, 0.4, 0.5]

# Reduced grid: 2 x 2 x 2 x 2 = 16 combinations. These assignments override
# the full grid above.
attention_dropout_range = [0.1, 0.5]
hidden_dropout_range = [0.1, 0.5]
mask_time_prob_range = [0.05, 0.2]
layerdrop_range = [0.1, 0.5]

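# Create all possible combinations of hyperparameters. product() returns a
# one-shot iterator, so store the result in a list to allow reuse afterwards.
hyperparameter_combinations = list(product(
    attention_dropout_range,
    hidden_dropout_range,
    mask_time_prob_range,
    layerdrop_range
))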
n = 0
for attention_dropout, hidden_dropout, mask_time_prob, layerdrop in hyperparameter_combinations:
  n += 1
  print('attention_dropout', attention_dropout)
  print('hidden_dropout', hidden_dropout)
  print('mask_time_prob', mask_time_prob)
  print('layerdrop', layerdrop)
print('combinations', n)
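
One detail worth keeping in mind with the cell above: `itertools.product` returns a one-shot iterator. If its result is looped over once just to count and print the combinations, nothing is left for a later loop unless the combinations are stored in a list or `product` is called again. The snippet below is a small standalone illustration using the reduced ranges; it also shows that the grid size can be computed from the list lengths alone, without consuming anything.

```python
from itertools import product
from math import prod

ranges = [
    [0.1, 0.5],   # attention_dropout
    [0.1, 0.5],   # hidden_dropout
    [0.05, 0.2],  # mask_time_prob
    [0.1, 0.5],   # layerdrop
]

# Grid size from the list lengths alone.
grid_size = prod(len(r) for r in ranges)

combos = product(*ranges)
first_pass = sum(1 for _ in combos)   # consumes the iterator
second_pass = sum(1 for _ in combos)  # nothing left to iterate

reusable = list(product(*ranges))     # a list can be looped over repeatedly

print(grid_size, first_pass, second_pass, len(reusable))
# 16 16 0 16
```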

attention_dropout 0.1
hidden_dropout 0.1
mask_time_prob 0.05
layerdrop 0.1
attention_dropout 0.1
hidden_dropout 0.1
mask_time_prob 0.05
layerdrop 0.2
attention_dropout 0.1
hidden_dropout 0.1
mask_time_prob 0.05
layerdrop 0.3
attention_dropout 0.1
hidden_dropout 0.1
mask_time_prob 0.05
layerdrop 0.4
attention_dropout 0.1
hidden_dropout 0.1
mask_time_prob 0.05
layerdrop 0.5
attention_dropout 0.1
hidden_dropout 0.1
mask_time_prob 0.1
layerdrop 0.1
attention_dropout 0.1
hidden_dropout 0.1
mask_time_prob 0.1
layerdrop 0.2
attention_dropout 0.1
hidden_dropout 0.1
mask_time_prob 0.1
layerdrop 0.3
attention_dropout 0.1
hidden_dropout 0.1
mask_time_prob 0.1
layerdrop 0.4
attention_dropout 0.1
hidden_dropout 0.1
mask_time_prob 0.1
layerdrop 0.5
attention_dropout 0.1
hidden_dropout 0.1
mask_time_prob 0.15
layerdrop 0.1
attention_dropout 0.1
hidden_dropout 0.1
mask_time_prob 0.15
layerdrop 0.2
attention_dropout 0.1
hidden_dropout 0.1
mask_time_prob 0.15
layerdrop 0.3
attention_dropout 0.1
hidden_d