# **Torgo Speakers spelling correction evaluation and testing script using machine translation**

### **Objective: evaluate and test spelling correction for Torgo dataset speakers, treating the correction as a machine translation task**

### **Check that a GPU and enough RAM are available (both are needed for training)**

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

/bin/bash: nvidia-smi: command not found


In [None]:
# Check that enough memory is present so that training is not interrupted
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 13.6 gigabytes of available RAM

Not using a high-RAM runtime


### **Install the libraries**

In [None]:
# Install required libraries
!pip install datasets
!pip install transformers==4.28.0
!pip install accelerate
!pip install jiwer
!pip install huggingface_hub

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)
Collecting dill<0.3.7,>=0.3.0 (from datasets)
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.14-py310-none-any.whl (134 kB)
Collec

### **Import libraries**

In [None]:
# Import libraries
import torch
from transformers import BartTokenizerFast, BartForConditionalGeneration, Seq2SeqTrainingArguments, Seq2SeqTrainer
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from jiwer import wer
from huggingface_hub import notebook_login

In [None]:
# Login to Hugging Face
notebook_login()


### **Mount the json files from Google Drive**

In [None]:
# Mount Google Drive, where the speaker JSON files are stored
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Load the dataset from the JSON files
dataset = load_dataset('json', data_files={'train': '/content/drive/MyDrive/M04_other_speakers.json',
                                           'test': '/content/drive/MyDrive/speaker_M04.json'})

Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-11e5d0d7bb4fe3ee/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-11e5d0d7bb4fe3ee/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

In [None]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['path', 'prediction', 'actual', 'speaker'],
        num_rows: 5212
    })
    test: Dataset({
        features: ['path', 'prediction', 'actual', 'speaker'],
        num_rows: 215
    })
})


In [None]:
print(dataset['train'][:5])
print(dataset['test'][:5])

{'path': ['content/downloads/Torgo/F03/Session1/wav_arrayMic/0005.wav', 'content/downloads/Torgo/F03/Session1/wav_arrayMic/0006.wav', 'content/downloads/Torgo/F03/Session1/wav_arrayMic/0007.wav', 'content/downloads/Torgo/F03/Session1/wav_arrayMic/0008.wav', 'content/downloads/Torgo/F03/Session1/wav_arrayMic/0009.wav'], 'prediction': ['beta', 'stubble', 'stubble', 'trace', 'goat'], 'actual': ['beta ', 'stubble ', 'stubble ', 'trace ', 'goat '], 'speaker': ['F03', 'F03', 'F03', 'F03', 'F03']}
{'path': ['content/downloads/Torgo/F01/Session1/wav_arrayMic/0006.wav', 'content/downloads/Torgo/F01/Session1/wav_arrayMic/0008.wav', 'content/downloads/Torgo/F01/Session1/wav_arrayMic/0009.wav', 'content/downloads/Torgo/F01/Session1/wav_arrayMic/0010.wav', 'content/downloads/Torgo/F01/Session1/wav_arrayMic/0012.wav'], 'prediction': ['snick', 'ealsein the winu we wol  orice pl', 'pat', 'up', 'nit'], 'actual': ['stick ', 'except in the winter when the ooze or snow or ice prevents ', 'pat ', 'up ', 'm

Fixing the random seed makes the split into training and validation sets identical every time the code runs, which helps when debugging, testing, and comparing runs. The seed value 42 is arbitrary; any integer works, as long as the same value is used whenever reproducibility is needed. Here, 10% of the training data is held out for validation.
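
The effect of a fixed seed can be sketched with the standard library alone. This is only an illustration of the idea, not scikit-learn's actual splitting algorithm; the helper name seeded_split and the toy rows are assumptions made for this example.

```python
import random

def seeded_split(rows, test_size=0.1, seed=42):
    """Shuffle row indices with a fixed seed, then split into train/validation."""
    indices = list(range(len(rows)))
    # A private Random instance keeps the global random state untouched
    random.Random(seed).shuffle(indices)
    n_val = max(1, round(len(rows) * test_size))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return [rows[i] for i in train_idx], [rows[i] for i in val_idx]

rows = [f"utterance_{i}" for i in range(20)]  # hypothetical stand-in data
train_a, val_a = seeded_split(rows)
train_b, val_b = seeded_split(rows)

# Same seed, same split on every call
print(train_a == train_b and val_a == val_b)
print(len(train_a), len(val_a))
```

Calling seeded_split with a different seed would generally produce a different partition of the same rows.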

In [None]:
# Split the dataset into train and validation sets
train_dataset, val_dataset = train_test_split(dataset['train'], test_size=0.1, random_state=42)

In [None]:
from datasets import Dataset

train_data = Dataset.from_dict(train_dataset)  # Convert the train data to a dataset
val_data = Dataset.from_dict(val_dataset)      # Convert the validation data to a dataset

In [None]:
print(type(train_data))
print(type(val_data))

<class 'datasets.arrow_dataset.Dataset'>
<class 'datasets.arrow_dataset.Dataset'>


In [None]:
# Check column names in train dataset
print(train_data.column_names)

# Check column names in validation dataset
print(val_data.column_names)

['path', 'prediction', 'actual', 'speaker']
['path', 'prediction', 'actual', 'speaker']


In [None]:
print(train_data['actual'][:5])
print(train_data['prediction'][:5])
#print(val_data[:5])

['at ', 'the ', 'grow ', 'i can read ', 'this is not a program of socialized medicine ']
['at', 'bill', 'grow', 'i can read', 'this is not a program of socialized medicine']


In [None]:
# Define the source and target language columns
source_lang = 'prediction'
target_lang = 'actual'

In [None]:
print(source_lang)

prediction


In [None]:
# Define the max_length for padding and truncation
max_length = 512

The preprocessing function prepares the data for the model. It prefixes each input with the name of the source column (here, prediction: ), tokenizes the inputs and the target texts with the BART tokenizer, pads and truncates them to at most max_length tokens, and returns a dictionary containing input_ids, attention_mask, and labels. Applying the same function to every split keeps the inputs consistent with what the model expects during training and evaluation.
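
What padding and the attention mask do can be sketched in plain Python, without the tokenizer. The helper pad_batch is a hypothetical name for this illustration; the token IDs are the ones this notebook prints for "prediction: at" and "prediction: i can read", and the padding ID 1 matches the padding_idx=1 shown in the model summary.

```python
PAD_ID = 1  # BART padding token ID, as seen in the padded samples in this notebook

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad token-ID lists to the longest one and build matching attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([
    [0, 37466, 26579, 35, 23, 2],             # "prediction: at"
    [0, 37466, 26579, 35, 939, 64, 1166, 2],  # "prediction: i can read"
])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The tokenizer call with padding=True behaves along these lines within each batch, which is why the padded length depends on the longest example in the batch.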

In [None]:
# Initialize the tokenizer
tokenizer = BartTokenizerFast.from_pretrained('facebook/bart-base')

# Preprocess a batch: prefix each input with the source column name,
# then tokenize inputs and targets into padded tensors
def preprocess_function(examples):
    # `examples` is a batch (dict of lists), as passed by Dataset.map(..., batched=True)
    inputs = [f'{source_lang}: {text}' for text in examples[source_lang]]
    targets = examples[target_lang]
    encoding = tokenizer(inputs, padding=True, truncation=True, return_tensors='pt', max_length=max_length)
    model_inputs = {
        'input_ids': encoding['input_ids'].squeeze(),
        'attention_mask': encoding['attention_mask'].squeeze(),
        # Targets are tokenized the same way; their token IDs become the labels
        'labels': tokenizer(targets, padding=True, truncation=True, return_tensors='pt', max_length=max_length)['input_ids'].squeeze()
    }
    return model_inputs

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.72k [00:00<?, ?B/s]

In [None]:
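# Take the first data point from the train dataset.
# Note: train_data[0] is a single row, so examples['prediction'] is a string and
# preprocess_function iterates over its characters; train_data[:1] would pass a one-row batch.
sample_data = train_data[0]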

# Call the preprocess function on the sample data
processed_data = preprocess_function(sample_data)

# Inspect the output
print(processed_data)

{'input_ids': tensor([[    0, 37466, 26579,    35,    10,     2],
        [    0, 37466, 26579,    35,   326,     2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1]]), 'labels': tensor([   0,  415, 1437,    2])}


In [None]:
#train_data = preprocess_function(train_data)
#val_data = preprocess_function(val_data)

# Apply preprocess_function to train_data and val_data
train_data = train_data.map(preprocess_function, batched=True)
val_data = val_data.map(preprocess_function, batched=True)

Map:   0%|          | 0/4298 [00:00<?, ? examples/s]

Map:   0%|          | 0/478 [00:00<?, ? examples/s]

In [None]:
# Access a few samples from train_data
for i in range(5):
    sample_input_ids = train_data['input_ids'][i]
    sample_attention_mask = train_data['attention_mask'][i]
    sample_labels = train_data['labels'][i]

    print(f"Sample {i+1}:")
    print("Input IDs:", sample_input_ids)
    print("Attention Mask:", sample_attention_mask)
    print("Labels:", sample_labels)
    print()

Sample 1:
Input IDs: [0, 37466, 26579, 35, 23, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Attention Mask: [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Labels: [0, 415, 1437, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Sample 2:
Input IDs: [0, 37466, 26579, 35, 1087, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Attention Mask: [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Labels: [0, 627, 1437, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Sample 3:
Input IDs: [0, 37466, 26579, 35, 1733, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Attention Mask: [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Labels: [0, 36058, 1437, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Sample 4:
Input IDs: [0, 37466, 26579, 35, 939, 64, 1166, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

A data loader is a component used in machine learning frameworks, such as PyTorch, to handle the loading and batching of data during the training or evaluation process. Its main purpose is to efficiently provide batches of data to the model for processing.
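
The core of what a data loader does, cutting a dataset into fixed-size batches, can be sketched in a few lines. This is a pure-Python illustration, not PyTorch's DataLoader (which also handles shuffling, collation, and worker processes); iter_batches and the toy rows are assumptions for this example.

```python
def iter_batches(rows, batch_size):
    """Yield consecutive slices of `rows` with at most `batch_size` items each."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

rows = list(range(10))  # hypothetical stand-in for 10 preprocessed examples
batches = list(iter_batches(rows, batch_size=4))

# The last batch is smaller when the dataset size is not a multiple of batch_size
print([len(batch) for batch in batches])
```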

### **Evaluate and Test the model**

In [None]:
# Load the trained model from Hugging Face
model = BartForConditionalGeneration.from_pretrained("monideep2255/spell_correction_M04")

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.74k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/558M [00:00<?, ?B/s]

In [None]:
# Move the model to the GPU if one is available, otherwise keep it on the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
#model.eval()

BartForConditionalGeneration(
  (model): BartModel(
    (shared): Embedding(50265, 768, padding_idx=1)
    (encoder): BartEncoder(
      (embed_tokens): Embedding(50265, 768, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 768)
      (layers): ModuleList(
        (0-5): 6 x BartEncoderLayer(
          (self_attn): BartAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,), eps=

In [None]:
# Keep only the rows of the test speaker (M04)
test_dataset = dataset['test'].filter(lambda example: example['speaker'] == 'M04')
print(len(test_dataset))

Filter:   0%|          | 0/651 [00:00<?, ? examples/s]

651


In [None]:
print(test_dataset)

Dataset({
    features: ['path', 'prediction', 'actual', 'speaker'],
    num_rows: 651
})


In [None]:
# Inspect column names
column_names = test_dataset.column_names
print("Column names:", column_names)

# Inspect column data types
for column in column_names:
    column_data = test_dataset[column]
    data_type = type(column_data[0]) if len(column_data) > 0 else "Unknown"
    print(f"Column name: {column}, Data type: {data_type}")

Column names: ['path', 'prediction', 'actual', 'speaker']
Column name: path, Data type: <class 'str'>
Column name: prediction, Data type: <class 'str'>
Column name: actual, Data type: <class 'str'>
Column name: speaker, Data type: <class 'str'>


In [None]:
# Verification: the unfiltered test split should contain the same rows
verify_test_dataset = dataset['test']
print(len(verify_test_dataset))
print(verify_test_dataset)

651
Dataset({
    features: ['path', 'prediction', 'actual', 'speaker'],
    num_rows: 651
})


In [None]:
# Print the samples
for sample in test_dataset:
    print("actual:", sample["actual"])
    print("Speaker ID:", sample["speaker"])
    print("prediction:", sample["prediction"])
    print()

actual: trouble 
Speaker ID: M04
prediction: trouble

actual: spark 
Speaker ID: M04
prediction: spark

actual: weed 
Speaker ID: M04
prediction: weed

actual: store 
Speaker ID: M04
prediction: store

actual: form 
Speaker ID: M04
prediction: form

actual: twice each day he plays skillfully and with zest upon our small organ 
Speaker ID: M04
prediction: was eachthey play cr lamsand wish the approus our house cal okens

actual: knew 
Speaker ID: M04
prediction: knew

actual: knee 
Speaker ID: M04
prediction: knee

actual: sip 
Speaker ID: M04
prediction: sip

actual: meat 
Speaker ID: M04
prediction: meat

actual: jacket 
Speaker ID: M04
prediction: jacket

actual: trade 
Speaker ID: M04
prediction: trade

actual: stick 
Speaker ID: M04
prediction: stick

actual: well he is nearly ninetythree years old 
Speaker ID: M04
prediction: well she i is  m  ninchree e ou

actual: goat 
Speaker ID: M04
prediction: goat

actual: beat 
Speaker ID: M04
prediction: beat

actual: fee 
Speaker ID: M04

In [None]:
# Preprocess the whole test set as a single batch so every row is padded to the same length
test_data = test_dataset.map(preprocess_function, batched=True, batch_size=len(test_dataset))

# Switch to evaluation mode (disables dropout)
model.eval()


Map:   0%|          | 0/651 [00:00<?, ? examples/s]


In the following code, we iterate over each example in the test_dataset and perform the following steps:

- Concatenate the speaker ID and prediction to form the input text.
- Tokenize the input text using the tokenizer.
- Generate the output sequence using the trained model.
- Decode the output sequence to obtain the predicted sentence.
- Append the actual reference and the predicted sentence to the references and predictions lists, respectively.
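
The structure of these steps can be sketched without loading the model. In this illustration, correct_fn is a hypothetical stand-in for the tokenize, generate, and decode steps, and echo_model is a toy function that simply returns the text after the speaker prefix; neither is part of the notebook's code.

```python
def run_correction(examples, correct_fn):
    """Collect predictions and references; `correct_fn` stands in for tokenize + generate + decode."""
    predictions, references = [], []
    for example in examples:
        # Same input format as the notebook: "<speaker>: <prediction>"
        input_text = f"{example['speaker']}: {example['prediction']}"
        predictions.append(correct_fn(input_text))
        references.append(example["actual"])
    return predictions, references

def echo_model(text):
    """Toy stand-in model: drop the speaker prefix and return the text unchanged."""
    return text.split(": ", 1)[1]

examples = [
    {"speaker": "M04", "prediction": "trouble", "actual": "trouble "},
    {"speaker": "M04", "prediction": "spark", "actual": "spark "},
]
predictions, references = run_correction(examples, echo_model)
print(predictions)
print(references)
```

Keeping the two lists aligned by index is what later allows each prediction to be scored against its own reference.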

In [None]:
predictions = []
references = []

for example in test_dataset:
    # Build the input text as "<speaker>: <prediction>", for example "M04: trouble".
    # Note: preprocess_function above uses the prefix "prediction: " instead; the prefix
    # used here should match the one the model saw during training.
    input_text = f"{example['speaker']}: {example['prediction']}"

    # This creates a context where no gradients are computed, which can improve efficiency during inference.
    with torch.no_grad():

        # This tokenizes the input_text using the tokenizer and converts it into input IDs as a PyTorch tensor.
        input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)

        # The generate method generates the output sequence based on the provided input.
        outputs = model.generate(input_ids=input_ids, max_length=max_length)

    predicted_sentence = tokenizer.decode(outputs[0], skip_special_tokens=True)

    references.append(example['actual'])
    predictions.append(predicted_sentence)

In [None]:
# Verify that the number of predictions and references are the same
if len(predictions) == len(references):
    print("Number of predictions and references are the same.")
else:
    print("Mismatch in the number of predictions and references.")

Number of predictions and references are the same.


In [None]:
# Print the number of predictions and references
print("Number of predictions:", len(predictions))
print("Number of references:", len(references))

# print the length of the dataset
print("Number of rows in dataset:", len(test_dataset))

Number of predictions: 651
Number of references: 651
Number of rows in dataset: 651


In [None]:
# verification
print("prediction:", predictions)
print("reference:", references)

prediction: ['trouble ', 'spark ', 'weed ', 'store ', 'form ', 'were each day they play skillfully and with zest upon our little organ ', 'knew ', 'knee ', 'sip ', 'meat ', 'jacket ', 'trade ', 'stick ', 'well she is nearly ninetythree years old ', 'goat ', 'beat ', 'fee ', 'rave ', 'bug ', 'floor ', 'trait ', 'share ', 'range ', 'trace ', 'stubble ', 'rake ', 'the lazy dog jumps over the lazy dog ', 'dagger ', 'one ', 'up ', 'grandfather likes to be modern in his language ', 'chair ', 'beta ', 'air ', 'storm ', 'yet she still thinks as swiftly as ever ', 'warm ', 'right ', 'swore ', 'knew ', 'he dresses himself in an ancient black frock coat ', 'i can read ', 'i can read ', 'sip ', 'sticks ', 'go ', 'gadget ', 'park ', 'chair ', 'grow ', 'except when the old guard is present ', 'he slowly takes a short walk in the open air each day ', 'rate ', 'double ', 'pat ', 'dug ', 'bat ', 'torn ', 'dark ', 'left ', 'feed ', "don't ask me to carry an oily rag like that ", 'but he always answers b

### **WER calculation for test speaker**

The cell below computes the Word Error Rate (WER) for the test speaker with jiwer. WER is the number of word-level substitutions, deletions, and insertions needed to turn the predictions into the references, divided by the total number of words in the references.
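
To make the definition concrete, here is a minimal pure-Python sketch of a corpus-level WER. It is not jiwer's implementation (jiwer applies its own text normalization), and the function names are assumptions for this example; the two sentence pairs are taken from the references and model outputs shown in this notebook.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def corpus_wer(references, hypotheses):
    """Total word edits divided by total reference words over all sentence pairs."""
    edits = sum(word_edit_distance(r.split(), h.split())
                for r, h in zip(references, hypotheses))
    ref_word_count = sum(len(r.split()) for r in references)
    return edits / ref_word_count

refs = ["well he is nearly ninetythree years old ", "goat "]
hyps = ["well she is nearly ninetythree years old ", "goat "]

# One substitution ("he" -> "she") over 8 reference words
print(f"WER: {corpus_wer(refs, hyps) * 100:.2f}%")  # WER: 12.50%
```

Because the denominator counts reference words only, swapping the two arguments can change the result whenever the predictions and references differ in length.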

In [None]:
# Calculate WER; jiwer.wer expects the references first, then the model outputs

from jiwer import wer

wer_value = wer(references, predictions)
wer_percentage = wer_value * 100

print(f"WER: {wer_percentage:.2f}%")

WER: 13.63%
