###Setting up dependencies

In [None]:
!pip install --upgrade pip
!pip install --upgrade datasets[audio] transformers accelerate evaluate jiwer tensorboard gradio




In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


###Cleaning dataset.

Importing a custom dataset of over 400 audio samples from the Roblox spelling bee. The transcriptions contained both punctuation and NaN values, which need to be removed from the dataset. The cleaned data is then split into train and test sets.

In [None]:
from datasets import Dataset, DatasetDict
import pandas as pd
from sklearn.model_selection import train_test_split
import string

# Translation table that deletes all ASCII punctuation from the transcriptions.
translator = str.maketrans('', '', string.punctuation)
spelling_df = pd.read_csv('drive/MyDrive/spelling_dataset - spelling_dataset.csv')
spelling_df.dropna(inplace=True) # Dropping rows with NaN values before any string operations.
spelling_df['audio_path'] = spelling_df['audio_path'].apply(lambda x: "drive/MyDrive/" + x) # Prepending the Drive mount path to each file.
spelling_df['transcription'] = spelling_df['transcription'].apply(lambda x: x.lower())
spelling_df['transcription'] = spelling_df['transcription'].apply(lambda x: x.translate(translator))
spelling_df.rename(inplace=True, columns={'audio_path' : 'audio'})
# Creating the train/test split (80/20).
spelling_train_df, spelling_test_df = train_test_split(
    spelling_df, test_size=0.2, random_state=42)

# Putting the data into a Hugging Face DatasetDict.
spelling_dataset = DatasetDict()
spelling_dataset['train']= Dataset.from_pandas(spelling_train_df, preserve_index=False, split='train+validation')
spelling_dataset['test'] = Dataset.from_pandas(spelling_test_df, preserve_index=False, split='test')
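
To make the transcription cleaning concrete, the sketch below applies the same lowercase-and-strip-punctuation steps to a few strings using only the standard library. The sample strings and the `normalize` helper name are made up for illustration and are not rows from the actual dataset.

```python
import string

# Translation table that deletes every ASCII punctuation character,
# built the same way as `translator` in the cell above.
translator = str.maketrans('', '', string.punctuation)

def normalize(text):
    """Lowercase a transcription and strip punctuation (hypothetical helper)."""
    return text.lower().translate(translator)

# Made-up examples, not taken from the real dataset.
samples = ["Necessary.", "Can you repeat the word?", "R-H-Y-T-H-M"]
cleaned = [normalize(s) for s in samples]
print(cleaned)
```

Note that punctuation is deleted rather than replaced with a space, so a hyphenated spelling such as `R-H-Y-T-H-M` collapses into a single word.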

Importing the Whisper feature extractor and tokenizer from Hugging Face. These APIs apply the proper transformations to the input audio files and the transcribed labels. I also cast the 'audio' column to the Hugging Face `Audio` class, which decodes each audio file into an array at the specified sampling rate.

In [None]:
from transformers import WhisperFeatureExtractor, WhisperTokenizer, WhisperProcessor
from datasets import Audio

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny", language="english", task="transcribe")
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny", language="english", task="transcribe")
spelling_dataset = spelling_dataset.cast_column("audio",  Audio(sampling_rate=16000))
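
As a rough guide to what the feature extractor produces, the arithmetic below works out the expected shape of one example's `input_features`. The constants are what I understand to be the defaults for `openai/whisper-tiny` (30-second padding, 10 ms hop, 80 Mel bins); treat them as assumptions and check `preprocessor_config.json` if the exact values matter.

```python
# Assumed defaults of the Whisper feature extractor for openai/whisper-tiny.
SAMPLING_RATE = 16_000   # Hz, matches Audio(sampling_rate=16000) above
CHUNK_LENGTH_S = 30      # every clip is padded or truncated to 30 seconds
HOP_LENGTH = 160         # samples between spectrogram frames (10 ms at 16 kHz)
N_MELS = 80              # Mel frequency bins

n_samples = SAMPLING_RATE * CHUNK_LENGTH_S   # samples in one padded clip
n_frames = n_samples // HOP_LENGTH           # spectrogram frames per clip

print(f"padded clip: {n_samples} samples")
print(f"input_features shape per example: ({N_MELS}, {n_frames})")
```

Because every clip is padded to the same length, short spelling-bee recordings and longer ones yield features of identical shape, which is why only the labels need variable-length padding later.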


In [None]:
spelling_dataset['train']['audio']

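Applying the feature extractor and tokenizer to the data: each audio array becomes a set of input features, and each transcription becomes a sequence of label token IDs.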

In [None]:
def prepare_dataset(batch):
  audio = batch['audio']
  batch['input_features'] = feature_extractor(audio['array'], sampling_rate=audio['sampling_rate']).input_features[0]
  batch["labels"] = tokenizer(batch["transcription"]).input_ids
  return batch

spelling_dataset = spelling_dataset.map(prepare_dataset, remove_columns=spelling_dataset.column_names["train"], num_proc=4)

Map (num_proc=4):   0%|          | 0/328 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/82 [00:00<?, ? examples/s]

###Training + Evaluation

Loading the pretrained Whisper tiny model.

In [None]:
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
model.generation_config.language = "english"
model.generation_config.task = "transcribe"
model.generation_config.forced_decoder_ids = None


Adding the expected padding to both the input features and the label tokens. This code was adapted from https://huggingface.co/blog/fine-tune-whisper.


In [None]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any
    decoder_start_token_id: int

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if the bos token was added in the previous tokenization step,
        # cut it here, since it is added again later anyway
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

In [None]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(
    processor=processor,
    decoder_start_token_id=model.config.decoder_start_token_id,
)
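
The label handling in the collator can be illustrated without PyTorch. The sketch below pads a few label sequences to a common length and then swaps the padding for `-100`, the index that PyTorch's cross-entropy loss ignores by default. The token IDs, the pad ID of `0`, and the `pad_and_mask` helper are made up for illustration; the real collator does the same thing with tensors and the tokenizer's attention mask.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def pad_and_mask(label_seqs, pad_id):
    """Right-pad label sequences, then mark the padded positions as ignored."""
    max_len = max(len(seq) for seq in label_seqs)
    padded, masked = [], []
    for seq in label_seqs:
        n_pad = max_len - len(seq)
        # What the tokenizer's pad() would return as input_ids.
        padded.append(seq + [pad_id] * n_pad)
        # Same positions, but padding replaced so it adds nothing to the loss.
        masked.append(seq + [IGNORE_INDEX] * n_pad)
    return padded, masked

# Made-up token IDs; 0 stands in for the tokenizer's pad token.
padded, masked = pad_and_mask([[7, 8, 9], [7, 8], [7]], pad_id=0)
print(padded)
print(masked)
```

Masking by position rather than by comparing against the pad ID matters for Whisper, where the pad token and the end-of-sequence token share an ID.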

Setting up a word error rate (WER) function to be used during evaluation. This metric is commonly used for measuring the performance of an automatic speech recognition system.

In [None]:
import evaluate

metric = evaluate.load("wer")

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
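
For intuition about what the metric measures, here is a minimal pure-Python sketch of word error rate: the word-level edit distance (substitutions, deletions, insertions) summed over all examples and divided by the total number of reference words. It is an illustration only, not the `evaluate`/`jiwer` implementation, and the example sentences are made up.

```python
def word_error_rate(references, predictions):
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits, total_words = 0, 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        # Levenshtein distance over words, keeping only one previous row.
        prev = list(range(len(hyp_words) + 1))
        for i, ref_word in enumerate(ref_words, start=1):
            curr = [i]
            for j, hyp_word in enumerate(hyp_words, start=1):
                cost = 0 if ref_word == hyp_word else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution or match
            prev = curr
        total_edits += prev[-1]
        total_words += len(ref_words)
    return total_edits / total_words

# Made-up examples: one perfect transcription, one wrong word.
refs = ["spell the word rhythm", "necessary"]
preds = ["spell the word rhythm", "necessarily"]
wer_percent = 100 * word_error_rate(refs, preds)
print(f"WER: {wer_percent:.1f}%")
```

One wrong word out of five reference words gives a WER of 20%, on the same percentage scale as the `wer` value returned by `compute_metrics`.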


Setting the hyperparameters for the model.

In [None]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-spelling-bee",  # change to a repo name of your choice
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=300,
    max_steps=1000,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=4,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=100,
    eval_steps=100,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=False,
)
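
Since no `lr_scheduler_type` is set above, I am assuming the Trainer's default linear schedule applies: the learning rate ramps up from zero over `warmup_steps` and then decays linearly to zero at `max_steps`. The sketch below reproduces that curve for the values in this cell; the `linear_schedule_lr` helper is hypothetical and only approximates what the scheduler does internally.

```python
def linear_schedule_lr(step, base_lr=1e-5, warmup_steps=300, max_steps=1000):
    """Learning rate at a given step under linear warmup then linear decay."""
    if step < warmup_steps:
        # Warmup: scale linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: scale linearly from base_lr down to 0 at max_steps.
    remaining = max_steps - step
    return base_lr * max(0.0, remaining / max(1, max_steps - warmup_steps))

for step in (0, 150, 300, 650, 1000):
    print(f"step {step:>4}: lr = {linear_schedule_lr(step):.2e}")
```

With these settings the peak learning rate of `1e-5` is only reached at step 300, and the rate falls back to half of that by step 650.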



In [None]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=spelling_dataset["train"],
    eval_dataset=spelling_dataset["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)



In [None]:
trainer.train()

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
`use_cache = True` is incompatible with gradient checkpointing. Setting `use_cache = False`...


Step,Training Loss,Validation Loss,WER (%)
100,0.3727,0.210663,7.719298
200,0.0181,0.120771,4.561404
300,0.014,0.110938,3.859649
400,0.0007,0.120409,4.210526
500,0.0004,0.123265,4.210526
600,0.0003,0.123489,4.210526
700,0.0003,0.12418,4.561404
800,0.0002,0.124587,4.561404
900,0.0002,0.124763,4.561404
1000,0.0002,0.124934,4.561404


You have passed task=transcribe, but also have set `forced_decoder_ids` to [[1, 50259], [2, 50359], [3, 50363]] which creates a conflict. `forced_decoder_ids` will be ignored in favor of task=transcribe.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
There were missing keys in the checkpoint model loaded: ['proj_out.weight'].


TrainOutput(global_step=1000, training_loss=0.11039111916953698, metrics={'train_runtime': 1380.9575, 'train_samples_per_second': 5.793, 'train_steps_per_second': 0.724, 'total_flos': 1.9695108096e+17, 'train_loss': 0.11039111916953698, 'epoch': 24.390243902439025})
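
The `epoch` value in the output follows directly from the step count, the batch size, and the 328 training examples reported by the `Map` step. The conversion below assumes a single GPU, so that one optimizer step consumes exactly `per_device_train_batch_size * gradient_accumulation_steps` examples.

```python
TRAIN_EXAMPLES = 328   # size of the training split after the 80/20 split
BATCH_SIZE = 8         # per_device_train_batch_size
GRAD_ACCUM = 1         # gradient_accumulation_steps

# Optimizer steps needed to see every training example once.
steps_per_epoch = TRAIN_EXAMPLES / (BATCH_SIZE * GRAD_ACCUM)

def steps_to_epochs(step):
    """Convert an optimizer step count into (fractional) epochs."""
    return step / steps_per_epoch

print(f"steps per epoch: {steps_per_epoch}")
print(f"step 1000 is epoch {steps_to_epochs(1000):.2f}")
print(f"step  300 is epoch {steps_to_epochs(300):.2f}")
```

So the 1000 steps correspond to roughly 24 passes over the training data, and the checkpoint saved at step 300 has seen the data about 7 times.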

Looking at the evaluation results, the checkpoint from step 300 performed the best (lowest WER at 3.86 and lowest validation loss at 0.111), so that is the model I will be incorporating into my Roblox spelling bee.
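
The choice of checkpoint can be double-checked programmatically. The sketch below parses the evaluation log from the training output and picks the step with the lowest WER. The checkpoint path assumes the Trainer's usual `checkpoint-<step>` folder naming inside `output_dir`; with `load_best_model_at_end=True`, the model held by `trainer` after training should already be this checkpoint.

```python
import csv
import io

# Evaluation log copied from the training output above.
EVAL_LOG = """step,training_loss,validation_loss,wer
100,0.3727,0.210663,7.719298
200,0.0181,0.120771,4.561404
300,0.014,0.110938,3.859649
400,0.0007,0.120409,4.210526
500,0.0004,0.123265,4.210526
600,0.0003,0.123489,4.210526
700,0.0003,0.12418,4.561404
800,0.0002,0.124587,4.561404
900,0.0002,0.124763,4.561404
1000,0.0002,0.124934,4.561404
"""

rows = list(csv.DictReader(io.StringIO(EVAL_LOG)))

# Lower WER is better, matching greater_is_better=False in the training args.
best = min(rows, key=lambda row: float(row["wer"]))
best_step = int(best["step"])

# Assumed folder layout: <output_dir>/checkpoint-<step>.
checkpoint_dir = f"./whisper-spelling-bee/checkpoint-{best_step}"
print(f"best step: {best_step}, WER: {best['wer']}, path: {checkpoint_dir}")
```

After step 300 the training loss keeps shrinking while the validation loss and WER drift upward, which is consistent with the model starting to overfit the small training set.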