# **Torgo Speakers spelling correction evaluation and testing script using machine translation**

### **Objective: Spelling correction evaluation and testing for Torgo dataset speakers using machine translation**

### **Ensure that GPU and RAM is set up: will be needed for training purpose**

In [1]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Mon Jul 17 14:49:52 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla V100-SXM2-32GB            Off| 00000000:AF:00.0 Off |                    0 |
| N/A   54C    P0              262W / 300W|   7043MiB / 32768MiB |     71%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [2]:
# ensure enough memory present so that training does not stop
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 201.2 gigabytes of available RAM

You are using a high-RAM runtime!


### **Install the libraries**

In [3]:
# Install required libraries
!pip install datasets
!pip install transformers==4.28.0
!pip install accelerate
!pip install jiwer
!pip install huggingface_hub

[0m

### **Import libraries**

In [4]:
# Import libraries
import torch
from transformers import BartTokenizerFast, BartForConditionalGeneration, Seq2SeqTrainingArguments, Seq2SeqTrainer
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from jiwer import wer
from huggingface_hub import notebook_login

In [6]:
# Load the dataset from the JSON files
dataset = load_dataset('json', data_files={'train': '/work/van-speech-nlp/TORGO experiments/spelling correction/data preparation/language model experiments/json files/F01_other_speakers_LM.json',
                                           'test': '/work/van-speech-nlp/TORGO experiments/spelling correction/data preparation/language model experiments/json files/speaker_F01_LM.json'})

Found cached dataset json (/home/chakraborti.m/.cache/huggingface/datasets/json/default-834bbb932995c235/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96)


  0%|          | 0/2 [00:00<?, ?it/s]

In [13]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['actual', 'prediction', 'speaker', 'path'],
        num_rows: 5212
    })
    test: Dataset({
        features: ['actual', 'prediction', 'speaker', 'path'],
        num_rows: 215
    })
})


In [14]:
print(dataset['train'][:5])
print(dataset['test'][:5])

{'actual': ['beta ', 'stubble ', 'stubble ', 'trace ', 'goat '], 'prediction': ['beta', 'stubble', 'stubble', 'trace', 'goat'], 'speaker': ['F03', 'F03', 'F03', 'F03', 'F03'], 'path': ['/work/van-speech-nlp/TORGO experiments/dataset//Torgo/F03/Session1/wav_arrayMic/0005.wav', '/work/van-speech-nlp/TORGO experiments/dataset//Torgo/F03/Session1/wav_arrayMic/0006.wav', '/work/van-speech-nlp/TORGO experiments/dataset//Torgo/F03/Session1/wav_arrayMic/0007.wav', '/work/van-speech-nlp/TORGO experiments/dataset//Torgo/F03/Session1/wav_arrayMic/0008.wav', '/work/van-speech-nlp/TORGO experiments/dataset//Torgo/F03/Session1/wav_arrayMic/0009.wav']}
{'actual': ['stick ', 'except in the winter when the ooze or snow or ice prevents ', 'pat ', 'up ', 'meat '], 'prediction': ['snick', 'ealsein the winu we wol  orice pl', 'pat', 'up', 'nit'], 'speaker': ['F01', 'F01', 'F01', 'F01', 'F01'], 'path': ['/work/van-speech-nlp/TORGO experiments/dataset/Torgo/F01/Session1/wav_arrayMic/0006.wav', '/work/van-spe

In [15]:
# Define the max_length for padding and truncation
max_length = 512

The preprocessing function serves to prepare the data for training or evaluation. It uses a tokenizer to tokenize the inputs and labels, formats the inputs by adding a source language identifier, encodes the tokenized inputs and labels, and creates a dictionary of model inputs. The function ensures that the data is properly tokenized, formatted, and encoded according to the model's requirements. It helps maintain consistency and compatibility between the input data and the model during training or evaluation.

In [16]:
# Initialize the tokenizer
tokenizer = BartTokenizerFast.from_pretrained('facebook/bart-base')

# Tokenize the data
# The preprocess_function function is defined to preprocess the data by tokenizing the inputs and labels
def preprocess_function(examples):
    inputs = [f'{source_lang}: {text}' for text in examples[source_lang]]
    targets = examples[target_lang]
    encoding = tokenizer(inputs, padding=True, truncation=True, return_tensors='pt', max_length=max_length)
    model_inputs = {
        'input_ids': encoding['input_ids'].squeeze(),
        'attention_mask': encoding['attention_mask'].squeeze(),
        'labels': tokenizer(targets, padding=True, truncation=True, return_tensors='pt')['input_ids'].squeeze()
    }
    return model_inputs

### **Evaluate and Test the model**

In [18]:
# Load the trained model from Hugging Face
model = BartForConditionalGeneration.from_pretrained("monideep2255/spell_correction_F01_LM")

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.74k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/558M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/292 [00:00<?, ?B/s]

In [19]:
# Move the model to the GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
#model.eval()

BartForConditionalGeneration(
  (model): BartModel(
    (shared): Embedding(50265, 768, padding_idx=1)
    (encoder): BartEncoder(
      (embed_tokens): Embedding(50265, 768, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 768)
      (layers): ModuleList(
        (0): BartEncoderLayer(
          (self_attn): BartAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,), eps=1e-05,

In [20]:
# Test the model on speaker dataset
test_dataset = dataset['test'].filter(lambda example: example['speaker'] == 'F01')
print(len(test_dataset))

Filter:   0%|          | 0/215 [00:00<?, ? examples/s]

215


In [21]:
print(test_dataset)

Dataset({
    features: ['actual', 'prediction', 'speaker', 'path'],
    num_rows: 215
})


In [22]:
# Inspect column names
column_names = test_dataset.column_names
print("Column names:", column_names)

# Inspect column data types
for column in column_names:
    column_data = test_dataset[column]
    data_type = type(column_data[0]) if len(column_data) > 0 else "Unknown"
    print(f"Column name: {column}, Data type: {data_type}")

Column names: ['actual', 'prediction', 'speaker', 'path']
Column name: actual, Data type: <class 'str'>
Column name: prediction, Data type: <class 'str'>
Column name: speaker, Data type: <class 'str'>
Column name: path, Data type: <class 'str'>


In [23]:
# verification
# Test the model on speaker dataset
verify_test_dataset = dataset['test']
print(len(verify_test_dataset))
print((verify_test_dataset))

215
Dataset({
    features: ['actual', 'prediction', 'speaker', 'path'],
    num_rows: 215
})


In [25]:
# Define the source and target language columns
source_lang = 'prediction'
target_lang = 'actual'

In [26]:
test_data = test_dataset.map(preprocess_function, batched=True, batch_size=len(test_dataset))
model.eval()

Map:   0%|          | 0/215 [00:00<?, ? examples/s]

BartForConditionalGeneration(
  (model): BartModel(
    (shared): Embedding(50265, 768, padding_idx=1)
    (encoder): BartEncoder(
      (embed_tokens): Embedding(50265, 768, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 768)
      (layers): ModuleList(
        (0): BartEncoderLayer(
          (self_attn): BartAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,), eps=1e-05,

In the following code, we iterate over each example in the test_dataset and perform the following steps:

- Concatenate the speaker ID and prediction to form the input text.
- Tokenize the input text using the tokenizer.
- Generate the output sequence using the trained model.
- Decode the output sequence to obtain the predicted sentence.
- Append the actual reference and the predicted sentence to the references and predictions lists, respectively.

In [27]:
predictions = []
references = []
inputs = []
for example in test_dataset:
    # This creates an input text by combining the values of the 'speaker' and 'prediction' fields from the current example.
    # It assumes that the example is a dictionary-like object with keys 'speaker' and 'prediction'.
    input_text = f"{example['speaker']}: {example['prediction']}"

    # This creates a context where no gradients are computed, which can improve efficiency during inference.
    with torch.no_grad():

        # This tokenizes the input_text using the tokenizer and converts it into input IDs as a PyTorch tensor.
        input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)

        # The generate method generates the output sequence based on the provided input.
        outputs = model.generate(input_ids=input_ids, max_length=max_length)

    predicted_sentence = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    inputs.append(example['prediction'])
    references.append(example['actual'])
    predictions.append(predicted_sentence)

In [28]:
# Verify that the number of predictions and references are the same
if len(predictions) == len(references):
    print("Number of predictions and references are the same.")
else:
    print("Mismatch in the number of predictions and references.")

Number of predictions and references are the same.


In [29]:
# Print the number of predictions and references
print("Number of predictions:", len(predictions))
print("Number of references:", len(references))
print("Number of inputs:", len(inputs))
# print the length of the dataset
print("Number of rows in dataset:", len(test_dataset))

Number of predictions: 215
Number of references: 215
Number of inputs: 215
Number of rows in dataset: 215


In [30]:
# Verify the first 10 predictions and references side by side
for i in range(10):
    print("Reference:", references[i])
    print("Prediction:", predictions[i])
    print("Inputs:", inputs[i])
    print("---")

Reference: stick 
Prediction: snoop 
Inputs: snick
---
Reference: except in the winter when the ooze or snow or ice prevents 
Prediction: in the winter we play skillfully and with zest upon our small organ 
Inputs: ealsein the winu we wol  orice pl
---
Reference: pat 
Prediction: pat 
Inputs: pat
---
Reference: up 
Prediction: up 
Inputs: up
---
Reference: meat 
Prediction: nit 
Inputs: nit
---
Reference: know 
Prediction: know 
Inputs: now
---
Reference: he slowly takes a short walk in the open air each day 
Prediction: giving those who observe him a pronounced feeling of the utmost respect 
Inputs: e loly caks a f walt muopeing ar eack day
---
Reference: air 
Prediction: hem 
Inputs: hem
---
Reference: swarm 
Prediction: floor 
Inputs: floor
---
Reference: double 
Prediction: double 
Inputs: double
---


In [31]:
import csv

# Specify the CSV file path
csv_file = '/work/van-speech-nlp/TORGO experiments/spelling correction/SC_train and evaluate/language model/results-csv/F01.csv'

# Verify and write all predictions and references to the CSV file
with open(csv_file, 'w', newline='') as file:
    writer = csv.writer(file)

    # Write the header
    writer.writerow(['Reference', 'Input sentence', 'Prediction'])

    # Write the predictions and references
    for i in range(len(references)):
        reference = references[i]
        prediction = predictions[i]
        input_sentence = inputs[i]
        # Write the row to the CSV file
        writer.writerow([reference, input_sentence, prediction])

print("Results saved to", csv_file)


Results saved to /work/van-speech-nlp/TORGO experiments/spelling correction/SC_train and evaluate/language model/results-csv/F01.csv


### **WER calculation for test speaker**

The code snippet calculates the Word Error Rate (WER) for the test speaker. It iterates over the predictions and references, comparing each prediction to its corresponding reference.

In [32]:
# calculate WER

from jiwer import wer

wer_value = wer(predictions, references)
wer_percentage = wer_value * 100

print(f"WER: {wer_percentage:.2f}%")

WER: 31.23%
