The original notebook is the Hugging Face Colab notebook here (https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/fine_tune_whisper.ipynb)


Make sure the following dependencies are installed in your environment:

```
pip install datasets transformers evaluate jiwer
```

The Common Voice dataset comes from here (https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0/viewer/zh-HK/train?p=84)

It is part of the Mozilla Foundation's Common Voice project.

----

We can use the same format to prepare our own dataset and fine-tune our own version of Whisper.
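
As a rough sketch of what preparing your own data in this format could look like: for fine-tuning, Common Voice boils down to an `audio` column and a `sentence` column, so a tab-separated manifest of clip paths and transcripts is enough to start from. The manifest layout, the file names, and the helper names `read_manifest` and `to_dataset` below are assumptions made for illustration; they are not part of the original notebook.

```python
import csv
import io
import os


def read_manifest(tsv_text, clips_dir):
    """Parse a TSV manifest with 'path' and 'sentence' columns (hypothetical layout)."""
    records = {"audio": [], "sentence": []}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        sentence = row["sentence"].strip()
        if not sentence:
            continue  # skip clips that have no transcript
        records["audio"].append(os.path.join(clips_dir, row["path"]))
        records["sentence"].append(sentence)
    return records


def to_dataset(records, sampling_rate=16000):
    """Turn the records into a datasets.Dataset whose audio is decoded at 16 kHz."""
    from datasets import Audio, Dataset  # imported lazily; needs `pip install datasets`

    dataset = Dataset.from_dict(records)
    return dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))


# Made-up manifest contents, for illustration only.
example_tsv = "path\tsentence\nclip_0001.mp3\t你好\nclip_0002.mp3\t\n"
example_records = read_manifest(example_tsv, "my_clips")
print(example_records["sentence"])
```

Calling `to_dataset(example_records)` on real files should give a dataset with the two columns that the preparation step further down reads (`audio` and `sentence`).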

----

To avoid downloading the audio files every time, uncomment the `cache_dir` lines below and point them at the directory where the files should be downloaded (or where they already are).

In [5]:
# import sys
# sys.path.append(datasets_dir)

from datasets import load_dataset, DatasetDict

dataset_name = "mozilla-foundation/common_voice_11_0"
language_to_train = 'yue'

common_voice = DatasetDict()
common_voice["train"] = load_dataset(
  dataset_name, language_to_train, 
  split="train+validation",
  # cache_dir="/Volumes/BACKUP/Coding/HUGGING_FACE/datasets"
  )

common_voice["test"] = load_dataset(
  dataset_name, language_to_train, 
  split="test",  
  # cache_dir="/Volumes/BACKUP/Coding/HUGGING_FACE/datasets"
  )

print(common_voice)

DatasetDict({
    train: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
        num_rows: 5296
    })
    test: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
        num_rows: 2438
    })
})


In [6]:
# !pip install "tokenizers>=0.14,<0.15"

from transformers import WhisperFeatureExtractor
feature_extractor = WhisperFeatureExtractor.from_pretrained(
  "openai/whisper-small", 
  cache_dir="/Volumes/BACKUP/Coding/HUGGING_FACE/models"
  ) # start from the whisper-small checkpoint

In [9]:
from transformers import WhisperTokenizer
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", 
language="cantonese", 
task="transcribe"
)

tokenizer_config.json: 100%|██████████| 805/805 [00:00<00:00, 765kB/s]
vocab.json: 100%|██████████| 836k/836k [00:00<00:00, 3.05MB/s]
tokenizer.json: 100%|██████████| 2.48M/2.48M [00:00<00:00, 5.95MB/s]
merges.txt: 100%|██████████| 494k/494k [00:00<00:00, 35.1MB/s]
normalizer.json: 100%|██████████| 52.7k/52.7k [00:00<00:00, 42.3MB/s]
added_tokens.json: 100%|██████████| 34.6k/34.6k [00:00<00:00, 29.3MB/s]
special_tokens_map.json: 100%|██████████| 2.08k/2.08k [00:00<00:00, 5.48MB/s]


In [10]:
from transformers import WhisperProcessor
processor = WhisperProcessor.from_pretrained("openai/whisper-small", 
language="cantonese", 
task="transcribe"
)

preprocessor_config.json: 100%|██████████| 185k/185k [00:00<00:00, 1.90MB/s]


In [11]:
# Preparing Data

print(common_voice["train"][0])

# Whisper expects audio sampled at 16 kHz, the rate used for its training data.
# Our input audio is sampled at 48 kHz, so we downsample it to 16 kHz
# before passing it to the Whisper feature extractor.
from datasets import Audio
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

print(common_voice["train"][0])

{'client_id': 'b472f8e5800db5f400e4b5858c16d68d205f19888992c65770b5a3852c5bfc0cb94a22890f49ca720dff40b3a1e8d466c2290a3480b916df03ea07322657f0d8', 'path': '/Volumes/BACKUP/Coding/HUGGING_FACE/datasets/downloads/extracted/512cb97225d04177325e321598281e08f0112d7fbb640b4e8d8f72d4962fae2e/yue_train_0/common_voice_yue_31209989.mp3', 'audio': {'path': '/Volumes/BACKUP/Coding/HUGGING_FACE/datasets/downloads/extracted/512cb97225d04177325e321598281e08f0112d7fbb640b4e8d8f72d4962fae2e/yue_train_0/common_voice_yue_31209989.mp3', 'array': array([ 0.00000000e+00, -0.00000000e+00,  0.00000000e+00, ...,
       -7.54985740e-06, -5.66183462e-06, -1.75592550e-06]), 'sampling_rate': 48000}, 'sentence': '美國都係唔考慮喇，睇下澳洲先', 'up_votes': 2, 'down_votes': 0, 'age': '', 'gender': '', 'accent': '', 'locale': 'yue', 'segment': ''}
{'client_id': 'b472f8e5800db5f400e4b5858c16d68d205f19888992c65770b5a3852c5bfc0cb94a22890f49ca720dff40b3a1e8d466c2290a3480b916df03ea07322657f0d8', 'path': '/Volumes/BACKUP/Coding/HUGGING_FA

Prepare the dataset: compute the log-Mel input features and tokenize the target text

In [12]:
def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch



common_voice = common_voice.map(prepare_dataset, 
  remove_columns=common_voice.column_names["train"], 
  num_proc=1)
print(common_voice)

Map: 100%|██████████| 5296/5296 [02:34<00:00, 34.26 examples/s]
Map: 100%|██████████| 2438/2438 [01:06<00:00, 36.50 examples/s]

DatasetDict({
    train: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 5296
    })
    test: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 2438
    })
})
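
For orientation, each `input_features` entry is a fixed-size log-Mel spectrogram, whatever the length of the original clip. The arithmetic below is a sketch that assumes the usual whisper-small feature-extractor settings (16 kHz audio, a 30-second window, hop length 160, 80 Mel bins); inspect `feature_extractor` in your own environment if the exact values matter.

```python
SAMPLING_RATE = 16_000   # Hz, after the cast_column resampling
CHUNK_SECONDS = 30       # assumed: audio is padded or truncated to 30 s
HOP_LENGTH = 160         # assumed: samples between consecutive frames
N_MELS = 80              # assumed: number of Mel bins for whisper-small


def resampled_length(num_samples, source_rate, target_rate=SAMPLING_RATE):
    """Approximate sample count after resampling (e.g. 48 kHz -> 16 kHz)."""
    return round(num_samples * target_rate / source_rate)


def feature_shape():
    """Shape of one log-Mel feature matrix: (mel bins, frames)."""
    samples_per_chunk = SAMPLING_RATE * CHUNK_SECONDS
    return (N_MELS, samples_per_chunk // HOP_LENGTH)


# A 5-second clip recorded at 48 kHz has 240,000 samples.
print(resampled_length(240_000, 48_000))  # 80000
print(feature_shape())                    # (80, 3000)
```

Under these assumptions every example takes the same amount of memory, which is why the audio side of the data collator only needs to stack tensors and never pads.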





The following cells train and evaluate the model using the Trainer provided by Hugging Face.

Evaluation metrics: during evaluation, we want to evaluate the model using the word error rate (WER) metric. We need to define a compute_metrics function that handles this computation.

Load a pre-trained checkpoint: we need to load a pre-trained checkpoint and configure it correctly for training.

Define the training configuration: this will be used by the 🤗 Trainer to define the training schedule.

In [13]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if a bos token was added in the previous tokenization step,
        # cut it here since it is added again later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

In [14]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
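
To make the label handling in the collator concrete, here is a dependency-free sketch of the same idea on plain Python lists: pad every label row to the longest one, mark the padding with -100 (the index that PyTorch's cross-entropy loss ignores by default), and drop a leading start token when every row has one. It collapses the collator's separate pad and mask steps into one, and the token ids are made up.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def pad_and_mask(label_rows, bos_token_id):
    """Pad label rows to equal length with IGNORE_INDEX, then strip a shared leading BOS."""
    max_len = max(len(row) for row in label_rows)
    padded = [row + [IGNORE_INDEX] * (max_len - len(row)) for row in label_rows]
    # Only cut the first column when every row starts with the BOS token.
    if all(row[0] == bos_token_id for row in padded):
        padded = [row[1:] for row in padded]
    return padded


# Made-up token ids; 1 stands in for the BOS token.
rows = [[1, 7, 8, 9], [1, 5]]
print(pad_and_mask(rows, bos_token_id=1))  # [[7, 8, 9], [5, -100, -100]]
```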

Evaluation uses the WER (word error rate) metric from the Hugging Face `evaluate` library

In [16]:
!pip install evaluate

Collecting evaluate
  Using cached evaluate-0.4.1-py3-none-any.whl.metadata (9.4 kB)
Collecting responses<0.19 (from evaluate)
  Using cached responses-0.18.0-py3-none-any.whl (38 kB)
Using cached evaluate-0.4.1-py3-none-any.whl (84 kB)
Installing collected packages: responses, evaluate
Successfully installed evaluate-0.4.1 responses-0.18.0


In [17]:
import evaluate

metric = evaluate.load("wer")


def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}

Downloading builder script: 100%|██████████| 4.49k/4.49k [00:00<00:00, 9.36MB/s]
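
For intuition about what the metric measures: WER is the word-level edit distance between prediction and reference, divided by the number of reference words. The sketch below is a small standalone implementation, not the `evaluate`/`jiwer` code itself. It splits words on whitespace, and that assumption matters here: written Cantonese normally has no spaces, so under whitespace splitting a whole sentence counts as one word and a single wrong character makes the sentence 100% wrong. A character-level variant (often called CER) is shown for comparison. The reference is taken from the sample sentence printed earlier; the prediction is made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def error_rate(references, predictions, tokenize):
    """Percentage of token errors relative to the number of reference tokens."""
    errors = sum(
        edit_distance(tokenize(r), tokenize(p))
        for r, p in zip(references, predictions)
    )
    total = sum(len(tokenize(r)) for r in references)
    return 100 * errors / total


refs = ["睇下澳洲先"]
preds = ["睇下美洲先"]  # made-up prediction with one wrong character
print(error_rate(refs, preds, str.split))  # 100.0 (whole sentence is one "word")
print(error_rate(refs, preds, list))       # 20.0  (one of five characters wrong)
```

This is worth keeping in mind when reading the `eval_wer` values in the training log: for unsegmented text they behave more like a sentence error rate than a per-word one.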


In [18]:
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained(
  "openai/whisper-small", 
  # cache_dir="/Volumes/BACKUP/Coding/HUGGING_FACE/models"
  )

model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

config.json: 100%|██████████| 1.97k/1.97k [00:00<00:00, 1.08MB/s]
model.safetensors: 100%|██████████| 967M/967M [00:09<00:00, 106MB/s]  
generation_config.json: 100%|██████████| 3.84k/3.84k [00:00<00:00, 1.05MB/s]


Define the training arguments

In [None]:
!pip install tensorboardx

In [24]:
from transformers import Seq2SeqTrainingArguments
import datetime

now = datetime.datetime.now().strftime("%d-%m-%Y-%H-%M")

training_args = Seq2SeqTrainingArguments(
    output_dir="model/whisper-small-cantonese_"+now,  # change to a repo name of your choice
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    gradient_checkpointing=True,
    fp16=False,  # fp16 mixed precision needs a CUDA GPU; keep False when training without one
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    report_to=["tensorboard"],  # requires tensorboardX to be installed
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=False,
)

from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
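    tokenizer=processor.feature_extractor,
    # note: Seq2SeqTrainer has no checkpoint_activations argument; activation
    # checkpointing is already enabled by gradient_checkpointing=True in training_args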
)

processor.save_pretrained(training_args.output_dir)
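
As a sanity check on these arguments, the learning rate at any step and the number of epochs covered can be worked out by hand. The sketch below assumes the Trainer's default schedule (linear warmup followed by linear decay to zero) and a single device; the helper names are illustrative.

```python
LEARNING_RATE = 1e-5
WARMUP_STEPS = 500
MAX_STEPS = 4000
BATCH_SIZE = 16           # per_device_train_batch_size
GRAD_ACCUM_STEPS = 1      # gradient_accumulation_steps
TRAIN_ROWS = 5296         # rows in common_voice["train"]


def learning_rate_at(step):
    """Linear warmup to LEARNING_RATE, then linear decay to zero at MAX_STEPS."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / WARMUP_STEPS
    return LEARNING_RATE * (MAX_STEPS - step) / (MAX_STEPS - WARMUP_STEPS)


def epochs_covered(steps):
    """Approximate passes over the training set after `steps` optimizer steps."""
    return steps * BATCH_SIZE * GRAD_ACCUM_STEPS / TRAIN_ROWS


for step in (25, 500, 525, 4000):
    print(step, learning_rate_at(step))
print(round(epochs_covered(MAX_STEPS), 2))  # 12.08
```

These values line up with the `learning_rate` and `epoch` fields in the training log: about 5e-07 at step 25, the 1e-05 peak at step 500, and roughly 12 passes over the 5,296 training rows by step 4000.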

Run the training

In [26]:
trainer.train()

  1%|          | 30/4000 [04:31<9:58:20,  9.04s/it]
  0%|          | 1/4000 [00:34<37:58:46, 34.19s/it]
  1%|          | 25/4000 [02:33<6:48:55,  6.17s/it]

{'loss': 1.7712, 'learning_rate': 5.000000000000001e-07, 'epoch': 0.08}


  1%|▏         | 50/4000 [05:09<6:54:01,  6.29s/it]

{'loss': 1.3822, 'learning_rate': 1.0000000000000002e-06, 'epoch': 0.15}


  2%|▏         | 75/4000 [07:46<6:51:39,  6.29s/it]

{'loss': 0.9192, 'learning_rate': 1.5e-06, 'epoch': 0.23}


  2%|▎         | 100/4000 [10:25<6:51:12,  6.33s/it]

{'loss': 0.3453, 'learning_rate': 2.0000000000000003e-06, 'epoch': 0.3}


  3%|▎         | 125/4000 [13:02<6:43:31,  6.25s/it]

{'loss': 0.195, 'learning_rate': 2.5e-06, 'epoch': 0.38}


  4%|▍         | 150/4000 [15:42<7:36:55,  7.12s/it]

{'loss': 0.1582, 'learning_rate': 3e-06, 'epoch': 0.45}


  4%|▍         | 175/4000 [18:23<6:47:27,  6.39s/it]

{'loss': 0.1327, 'learning_rate': 3.5e-06, 'epoch': 0.53}


  5%|▌         | 200/4000 [21:01<6:41:45,  6.34s/it]

{'loss': 0.1401, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.6}


  6%|▌         | 225/4000 [23:41<6:42:13,  6.39s/it]

{'loss': 0.1452, 'learning_rate': 4.5e-06, 'epoch': 0.68}


  6%|▋         | 250/4000 [26:22<7:15:30,  6.97s/it]

{'loss': 0.1344, 'learning_rate': 5e-06, 'epoch': 0.76}


  7%|▋         | 275/4000 [28:58<6:27:39,  6.24s/it]

{'loss': 0.128, 'learning_rate': 5.500000000000001e-06, 'epoch': 0.83}


  8%|▊         | 300/4000 [31:40<6:59:36,  6.80s/it]

{'loss': 0.1289, 'learning_rate': 6e-06, 'epoch': 0.91}


  8%|▊         | 325/4000 [34:20<6:31:17,  6.39s/it]

{'loss': 0.1131, 'learning_rate': 6.5000000000000004e-06, 'epoch': 0.98}


  9%|▉         | 350/4000 [36:58<6:27:16,  6.37s/it]

{'loss': 0.0943, 'learning_rate': 7e-06, 'epoch': 1.06}


  9%|▉         | 375/4000 [39:37<6:18:22,  6.26s/it]

{'loss': 0.0902, 'learning_rate': 7.500000000000001e-06, 'epoch': 1.13}


 10%|█         | 400/4000 [42:19<6:21:42,  6.36s/it]

{'loss': 0.0819, 'learning_rate': 8.000000000000001e-06, 'epoch': 1.21}


 11%|█         | 425/4000 [44:57<6:17:56,  6.34s/it]

{'loss': 0.0805, 'learning_rate': 8.5e-06, 'epoch': 1.28}


 11%|█▏        | 450/4000 [47:37<6:30:56,  6.61s/it]

{'loss': 0.0799, 'learning_rate': 9e-06, 'epoch': 1.36}


 12%|█▏        | 475/4000 [50:18<6:06:44,  6.24s/it]

{'loss': 0.0889, 'learning_rate': 9.5e-06, 'epoch': 1.44}


 12%|█▎        | 500/4000 [52:58<6:09:13,  6.33s/it]

{'loss': 0.086, 'learning_rate': 1e-05, 'epoch': 1.51}


 13%|█▎        | 525/4000 [55:37<6:01:14,  6.24s/it]

{'loss': 0.0841, 'learning_rate': 9.92857142857143e-06, 'epoch': 1.59}


 14%|█▍        | 550/4000 [58:16<6:13:39,  6.50s/it]

{'loss': 0.0886, 'learning_rate': 9.857142857142859e-06, 'epoch': 1.66}


 14%|█▍        | 575/4000 [1:00:55<5:56:45,  6.25s/it]

{'loss': 0.0872, 'learning_rate': 9.785714285714286e-06, 'epoch': 1.74}


 15%|█▌        | 600/4000 [1:03:31<5:54:24,  6.25s/it]

{'loss': 0.0934, 'learning_rate': 9.714285714285715e-06, 'epoch': 1.81}


 16%|█▌        | 625/4000 [1:06:07<5:51:57,  6.26s/it]

{'loss': 0.0915, 'learning_rate': 9.642857142857144e-06, 'epoch': 1.89}


 16%|█▋        | 650/4000 [1:08:43<5:48:34,  6.24s/it]

{'loss': 0.0899, 'learning_rate': 9.571428571428573e-06, 'epoch': 1.96}


 17%|█▋        | 675/4000 [1:11:23<5:44:07,  6.21s/it]

{'loss': 0.0634, 'learning_rate': 9.5e-06, 'epoch': 2.04}


 18%|█▊        | 700/4000 [1:14:05<5:42:45,  6.23s/it]

{'loss': 0.0323, 'learning_rate': 9.42857142857143e-06, 'epoch': 2.11}


 18%|█▊        | 725/4000 [1:16:43<6:27:21,  7.10s/it]

{'loss': 0.0365, 'learning_rate': 9.357142857142859e-06, 'epoch': 2.19}


 19%|█▉        | 750/4000 [1:19:19<5:36:49,  6.22s/it]

{'loss': 0.0353, 'learning_rate': 9.285714285714288e-06, 'epoch': 2.27}


 19%|█▉        | 775/4000 [1:21:55<5:33:51,  6.21s/it]

{'loss': 0.0368, 'learning_rate': 9.214285714285715e-06, 'epoch': 2.34}


 20%|██        | 800/4000 [1:24:31<5:31:36,  6.22s/it]

{'loss': 0.0413, 'learning_rate': 9.142857142857144e-06, 'epoch': 2.42}


 21%|██        | 825/4000 [1:27:06<5:29:51,  6.23s/it]

{'loss': 0.0363, 'learning_rate': 9.071428571428573e-06, 'epoch': 2.49}


 21%|██▏       | 850/4000 [1:29:42<5:26:53,  6.23s/it]

{'loss': 0.0389, 'learning_rate': 9e-06, 'epoch': 2.57}


 22%|██▏       | 875/4000 [1:32:18<5:24:02,  6.22s/it]

{'loss': 0.0377, 'learning_rate': 8.92857142857143e-06, 'epoch': 2.64}


 22%|██▎       | 900/4000 [1:34:52<4:47:41,  5.57s/it]

{'loss': 0.0343, 'learning_rate': 8.857142857142858e-06, 'epoch': 2.72}


 23%|██▎       | 925/4000 [1:37:28<5:17:36,  6.20s/it]

{'loss': 0.0345, 'learning_rate': 8.785714285714286e-06, 'epoch': 2.79}


 24%|██▍       | 950/4000 [1:40:03<5:16:15,  6.22s/it]

{'loss': 0.0372, 'learning_rate': 8.714285714285715e-06, 'epoch': 2.87}


 24%|██▍       | 975/4000 [1:42:39<5:13:34,  6.22s/it]

{'loss': 0.0412, 'learning_rate': 8.642857142857144e-06, 'epoch': 2.95}


 25%|██▌       | 1000/4000 [1:45:14<5:10:53,  6.22s/it]

{'loss': 0.036, 'learning_rate': 8.571428571428571e-06, 'epoch': 3.02}


                                                       
 25%|██▌       | 1000/4000 [1:59:15<5:10:53,  6.22s/it]

{'eval_loss': 0.1496237814426422, 'eval_wer': 69.63202587949858, 'eval_runtime': 840.2281, 'eval_samples_per_second': 2.902, 'eval_steps_per_second': 0.363, 'epoch': 3.02}


 26%|██▌       | 1025/4000 [2:02:00<5:11:32,  6.28s/it]   

{'loss': 0.0172, 'learning_rate': 8.5e-06, 'epoch': 3.1}


 26%|██▋       | 1050/4000 [2:04:36<5:05:04,  6.21s/it]

{'loss': 0.0205, 'learning_rate': 8.428571428571429e-06, 'epoch': 3.17}


 27%|██▋       | 1075/4000 [2:07:14<5:03:25,  6.22s/it]

{'loss': 0.0128, 'learning_rate': 8.357142857142858e-06, 'epoch': 3.25}


 28%|██▊       | 1100/4000 [2:09:50<5:00:38,  6.22s/it]

{'loss': 0.0146, 'learning_rate': 8.285714285714287e-06, 'epoch': 3.32}


 28%|██▊       | 1125/4000 [2:12:26<4:57:33,  6.21s/it]

{'loss': 0.0155, 'learning_rate': 8.214285714285714e-06, 'epoch': 3.4}


 29%|██▉       | 1150/4000 [2:15:04<5:18:11,  6.70s/it]

{'loss': 0.0176, 'learning_rate': 8.142857142857143e-06, 'epoch': 3.47}


 29%|██▉       | 1175/4000 [2:17:43<4:53:03,  6.22s/it]

{'loss': 0.0161, 'learning_rate': 8.071428571428572e-06, 'epoch': 3.55}


 30%|███       | 1200/4000 [2:20:21<4:50:09,  6.22s/it]

{'loss': 0.0155, 'learning_rate': 8.000000000000001e-06, 'epoch': 3.63}


 31%|███       | 1225/4000 [2:22:59<4:47:43,  6.22s/it]

{'loss': 0.0186, 'learning_rate': 7.928571428571429e-06, 'epoch': 3.7}


 31%|███▏      | 1250/4000 [2:25:35<4:46:47,  6.26s/it]

{'loss': 0.0171, 'learning_rate': 7.857142857142858e-06, 'epoch': 3.78}


 32%|███▏      | 1275/4000 [2:28:13<4:41:11,  6.19s/it]

{'loss': 0.0175, 'learning_rate': 7.785714285714287e-06, 'epoch': 3.85}


 32%|███▎      | 1300/4000 [2:30:48<4:39:55,  6.22s/it]

{'loss': 0.0217, 'learning_rate': 7.714285714285716e-06, 'epoch': 3.93}


 33%|███▎      | 1325/4000 [2:33:30<4:41:15,  6.31s/it]

{'loss': 0.0167, 'learning_rate': 7.642857142857143e-06, 'epoch': 4.0}


 34%|███▍      | 1350/4000 [2:36:07<4:37:34,  6.28s/it]

{'loss': 0.0086, 'learning_rate': 7.571428571428572e-06, 'epoch': 4.08}


 34%|███▍      | 1375/4000 [2:38:42<4:32:55,  6.24s/it]

{'loss': 0.0091, 'learning_rate': 7.500000000000001e-06, 'epoch': 4.15}


 35%|███▌      | 1400/4000 [2:41:16<4:26:30,  6.15s/it]

{'loss': 0.0083, 'learning_rate': 7.428571428571429e-06, 'epoch': 4.23}


 36%|███▌      | 1425/4000 [2:43:55<4:49:04,  6.74s/it]

{'loss': 0.0079, 'learning_rate': 7.357142857142858e-06, 'epoch': 4.31}


 36%|███▋      | 1450/4000 [2:46:29<4:23:19,  6.20s/it]

{'loss': 0.0075, 'learning_rate': 7.285714285714286e-06, 'epoch': 4.38}


 37%|███▋      | 1475/4000 [2:49:04<4:20:49,  6.20s/it]

{'loss': 0.0046, 'learning_rate': 7.2142857142857145e-06, 'epoch': 4.46}


 38%|███▊      | 1500/4000 [2:51:43<4:18:59,  6.22s/it]

{'loss': 0.008, 'learning_rate': 7.1428571428571436e-06, 'epoch': 4.53}


 38%|███▊      | 1525/4000 [2:54:18<4:16:28,  6.22s/it]

{'loss': 0.01, 'learning_rate': 7.0714285714285726e-06, 'epoch': 4.61}


 39%|███▉      | 1550/4000 [2:56:53<4:13:27,  6.21s/it]

{'loss': 0.0082, 'learning_rate': 7e-06, 'epoch': 4.68}


 39%|███▉      | 1575/4000 [2:59:29<4:11:54,  6.23s/it]

{'loss': 0.0049, 'learning_rate': 6.928571428571429e-06, 'epoch': 4.76}


 40%|████      | 1600/4000 [3:02:07<4:09:02,  6.23s/it]

{'loss': 0.0085, 'learning_rate': 6.857142857142858e-06, 'epoch': 4.83}


 41%|████      | 1625/4000 [3:04:43<4:05:59,  6.21s/it]

{'loss': 0.0091, 'learning_rate': 6.785714285714287e-06, 'epoch': 4.91}


 41%|████▏     | 1650/4000 [3:07:18<4:04:11,  6.23s/it]

{'loss': 0.0063, 'learning_rate': 6.714285714285714e-06, 'epoch': 4.98}


 42%|████▏     | 1675/4000 [3:09:54<4:01:30,  6.23s/it]

{'loss': 0.0063, 'learning_rate': 6.642857142857143e-06, 'epoch': 5.06}


 42%|████▎     | 1700/4000 [3:12:29<3:58:04,  6.21s/it]

{'loss': 0.0048, 'learning_rate': 6.571428571428572e-06, 'epoch': 5.14}


 43%|████▎     | 1725/4000 [3:15:04<3:55:38,  6.21s/it]

{'loss': 0.0042, 'learning_rate': 6.5000000000000004e-06, 'epoch': 5.21}


 44%|████▍     | 1750/4000 [3:17:39<3:52:34,  6.20s/it]

{'loss': 0.0043, 'learning_rate': 6.4285714285714295e-06, 'epoch': 5.29}


 44%|████▍     | 1775/4000 [3:20:19<3:51:40,  6.25s/it]

{'loss': 0.0027, 'learning_rate': 6.357142857142858e-06, 'epoch': 5.36}


 45%|████▌     | 1800/4000 [3:22:54<3:47:29,  6.20s/it]

{'loss': 0.0027, 'learning_rate': 6.285714285714286e-06, 'epoch': 5.44}


 46%|████▌     | 1825/4000 [3:25:29<3:44:15,  6.19s/it]

{'loss': 0.0025, 'learning_rate': 6.214285714285715e-06, 'epoch': 5.51}


 46%|████▋     | 1850/4000 [3:28:04<3:42:29,  6.21s/it]

{'loss': 0.0031, 'learning_rate': 6.142857142857144e-06, 'epoch': 5.59}


 47%|████▋     | 1875/4000 [3:30:40<3:40:03,  6.21s/it]

{'loss': 0.0027, 'learning_rate': 6.071428571428571e-06, 'epoch': 5.66}


 48%|████▊     | 1900/4000 [3:33:19<3:42:06,  6.35s/it]

{'loss': 0.0045, 'learning_rate': 6e-06, 'epoch': 5.74}


 48%|████▊     | 1925/4000 [3:35:54<3:34:36,  6.21s/it]

{'loss': 0.0042, 'learning_rate': 5.928571428571429e-06, 'epoch': 5.82}


 49%|████▉     | 1950/4000 [3:38:29<3:30:51,  6.17s/it]

{'loss': 0.0032, 'learning_rate': 5.857142857142858e-06, 'epoch': 5.89}


 49%|████▉     | 1975/4000 [3:41:04<3:30:08,  6.23s/it]

{'loss': 0.0032, 'learning_rate': 5.785714285714286e-06, 'epoch': 5.97}


 50%|█████     | 2000/4000 [3:43:41<3:26:39,  6.20s/it]

{'loss': 0.0035, 'learning_rate': 5.7142857142857145e-06, 'epoch': 6.04}


                                                       
 50%|█████     | 2000/4000 [3:57:22<3:26:39,  6.20s/it]

{'eval_loss': 0.17156465351581573, 'eval_wer': 71.20905782450465, 'eval_runtime': 820.7828, 'eval_samples_per_second': 2.97, 'eval_steps_per_second': 0.372, 'epoch': 6.04}


 51%|█████     | 2025/4000 [4:00:12<3:25:23,  6.24s/it]   

{'loss': 0.0023, 'learning_rate': 5.6428571428571435e-06, 'epoch': 6.12}


 51%|█████▏    | 2050/4000 [4:02:47<3:20:44,  6.18s/it]

{'loss': 0.0017, 'learning_rate': 5.571428571428572e-06, 'epoch': 6.19}


 52%|█████▏    | 2075/4000 [4:05:21<3:18:01,  6.17s/it]

{'loss': 0.0015, 'learning_rate': 5.500000000000001e-06, 'epoch': 6.27}


 52%|█████▎    | 2100/4000 [4:07:56<3:15:40,  6.18s/it]

{'loss': 0.003, 'learning_rate': 5.428571428571429e-06, 'epoch': 6.34}


 53%|█████▎    | 2125/4000 [4:10:34<3:14:21,  6.22s/it]

{'loss': 0.0015, 'learning_rate': 5.357142857142857e-06, 'epoch': 6.42}


 54%|█████▍    | 2150/4000 [4:13:08<3:10:04,  6.16s/it]

{'loss': 0.0018, 'learning_rate': 5.285714285714286e-06, 'epoch': 6.5}


 54%|█████▍    | 2175/4000 [4:15:46<3:07:31,  6.16s/it]

{'loss': 0.0029, 'learning_rate': 5.214285714285715e-06, 'epoch': 6.57}


 55%|█████▌    | 2200/4000 [4:18:24<3:13:19,  6.44s/it]

{'loss': 0.0033, 'learning_rate': 5.142857142857142e-06, 'epoch': 6.65}


 56%|█████▌    | 2225/4000 [4:21:01<3:02:09,  6.16s/it]

{'loss': 0.0012, 'learning_rate': 5.071428571428571e-06, 'epoch': 6.72}


 56%|█████▋    | 2250/4000 [4:23:35<3:00:06,  6.18s/it]

{'loss': 0.0011, 'learning_rate': 5e-06, 'epoch': 6.8}


 57%|█████▋    | 2275/4000 [4:26:13<2:56:54,  6.15s/it]

{'loss': 0.0019, 'learning_rate': 4.928571428571429e-06, 'epoch': 6.87}


 57%|█████▊    | 2300/4000 [4:28:50<2:57:21,  6.26s/it]

{'loss': 0.0028, 'learning_rate': 4.857142857142858e-06, 'epoch': 6.95}


 58%|█████▊    | 2325/4000 [4:31:24<2:51:15,  6.13s/it]

{'loss': 0.0013, 'learning_rate': 4.785714285714287e-06, 'epoch': 7.02}


 59%|█████▉    | 2350/4000 [4:34:01<2:49:47,  6.17s/it]

{'loss': 0.0016, 'learning_rate': 4.714285714285715e-06, 'epoch': 7.1}


 59%|█████▉    | 2375/4000 [4:36:36<2:47:28,  6.18s/it]

{'loss': 0.0008, 'learning_rate': 4.642857142857144e-06, 'epoch': 7.18}


 60%|██████    | 2400/4000 [4:39:10<2:44:32,  6.17s/it]

{'loss': 0.0012, 'learning_rate': 4.571428571428572e-06, 'epoch': 7.25}


 61%|██████    | 2425/4000 [4:41:48<2:41:45,  6.16s/it]

{'loss': 0.0011, 'learning_rate': 4.5e-06, 'epoch': 7.33}


 61%|██████▏   | 2450/4000 [4:44:23<2:39:15,  6.17s/it]

{'loss': 0.0011, 'learning_rate': 4.428571428571429e-06, 'epoch': 7.4}


 62%|██████▏   | 2475/4000 [4:47:01<2:38:56,  6.25s/it]

{'loss': 0.001, 'learning_rate': 4.357142857142857e-06, 'epoch': 7.48}


 62%|██████▎   | 2500/4000 [4:49:36<2:35:21,  6.21s/it]

{'loss': 0.0009, 'learning_rate': 4.2857142857142855e-06, 'epoch': 7.55}


 63%|██████▎   | 2525/4000 [4:52:10<2:31:59,  6.18s/it]

{'loss': 0.0006, 'learning_rate': 4.2142857142857145e-06, 'epoch': 7.63}


 64%|██████▍   | 2550/4000 [4:54:46<2:32:13,  6.30s/it]

{'loss': 0.0014, 'learning_rate': 4.1428571428571435e-06, 'epoch': 7.7}


 64%|██████▍   | 2575/4000 [4:57:20<2:26:30,  6.17s/it]

{'loss': 0.0017, 'learning_rate': 4.071428571428572e-06, 'epoch': 7.78}


 65%|██████▌   | 2600/4000 [4:59:55<2:23:42,  6.16s/it]

{'loss': 0.001, 'learning_rate': 4.000000000000001e-06, 'epoch': 7.85}


 66%|██████▌   | 2625/4000 [5:02:33<2:33:54,  6.72s/it]

{'loss': 0.0006, 'learning_rate': 3.928571428571429e-06, 'epoch': 7.93}


 66%|██████▋   | 2650/4000 [5:05:12<2:20:30,  6.24s/it]

{'loss': 0.0007, 'learning_rate': 3.857142857142858e-06, 'epoch': 8.01}


 67%|██████▋   | 2675/4000 [5:07:47<2:16:20,  6.17s/it]

{'loss': 0.0014, 'learning_rate': 3.785714285714286e-06, 'epoch': 8.08}


 68%|██████▊   | 2700/4000 [5:10:21<2:13:42,  6.17s/it]

{'loss': 0.0005, 'learning_rate': 3.7142857142857146e-06, 'epoch': 8.16}


 68%|██████▊   | 2725/4000 [5:12:58<2:10:59,  6.16s/it]

{'loss': 0.0005, 'learning_rate': 3.642857142857143e-06, 'epoch': 8.23}


 69%|██████▉   | 2750/4000 [5:15:33<2:09:05,  6.20s/it]

{'loss': 0.0007, 'learning_rate': 3.5714285714285718e-06, 'epoch': 8.31}


 69%|██████▉   | 2775/4000 [5:18:08<2:05:40,  6.16s/it]

{'loss': 0.0005, 'learning_rate': 3.5e-06, 'epoch': 8.38}


 70%|███████   | 2800/4000 [5:20:42<2:03:45,  6.19s/it]

{'loss': 0.0005, 'learning_rate': 3.428571428571429e-06, 'epoch': 8.46}


 71%|███████   | 2825/4000 [5:23:17<2:00:36,  6.16s/it]

{'loss': 0.0005, 'learning_rate': 3.357142857142857e-06, 'epoch': 8.53}


 71%|███████▏  | 2850/4000 [5:25:53<1:58:03,  6.16s/it]

{'loss': 0.0008, 'learning_rate': 3.285714285714286e-06, 'epoch': 8.61}


 72%|███████▏  | 2875/4000 [5:28:27<1:55:32,  6.16s/it]

{'loss': 0.0005, 'learning_rate': 3.2142857142857147e-06, 'epoch': 8.69}


 72%|███████▎  | 2900/4000 [5:31:02<1:53:17,  6.18s/it]

{'loss': 0.0005, 'learning_rate': 3.142857142857143e-06, 'epoch': 8.76}


 73%|███████▎  | 2925/4000 [5:33:42<1:53:02,  6.31s/it]

{'loss': 0.0005, 'learning_rate': 3.071428571428572e-06, 'epoch': 8.84}


 74%|███████▍  | 2950/4000 [5:36:16<1:47:49,  6.16s/it]

{'loss': 0.0009, 'learning_rate': 3e-06, 'epoch': 8.91}


 74%|███████▍  | 2975/4000 [5:38:55<2:04:45,  7.30s/it]

{'loss': 0.0006, 'learning_rate': 2.928571428571429e-06, 'epoch': 8.99}


 75%|███████▌  | 3000/4000 [5:41:30<1:42:59,  6.18s/it]

{'loss': 0.0005, 'learning_rate': 2.8571428571428573e-06, 'epoch': 9.06}


                                                       
 75%|███████▌  | 3000/4000 [5:55:58<1:42:59,  6.18s/it]

{'eval_loss': 0.18760216236114502, 'eval_wer': 64.33481601293974, 'eval_runtime': 868.3804, 'eval_samples_per_second': 2.808, 'eval_steps_per_second': 0.351, 'epoch': 9.06}


 76%|███████▌  | 3025/4000 [5:58:40<1:40:55,  6.21s/it]  

{'loss': 0.0006, 'learning_rate': 2.785714285714286e-06, 'epoch': 9.14}


 76%|███████▋  | 3050/4000 [6:01:14<1:37:09,  6.14s/it]

{'loss': 0.0004, 'learning_rate': 2.7142857142857144e-06, 'epoch': 9.21}


 77%|███████▋  | 3075/4000 [6:03:53<1:44:39,  6.79s/it]

{'loss': 0.0004, 'learning_rate': 2.642857142857143e-06, 'epoch': 9.29}


 78%|███████▊  | 3100/4000 [6:06:27<1:32:50,  6.19s/it]

{'loss': 0.0004, 'learning_rate': 2.571428571428571e-06, 'epoch': 9.37}


 78%|███████▊  | 3125/4000 [6:09:06<1:31:19,  6.26s/it]

{'loss': 0.0004, 'learning_rate': 2.5e-06, 'epoch': 9.44}


 79%|███████▉  | 3150/4000 [6:11:42<1:28:05,  6.22s/it]

{'loss': 0.0005, 'learning_rate': 2.428571428571429e-06, 'epoch': 9.52}


 79%|███████▉  | 3175/4000 [6:14:23<1:25:16,  6.20s/it]

{'loss': 0.0004, 'learning_rate': 2.3571428571428574e-06, 'epoch': 9.59}


 80%|████████  | 3200/4000 [6:16:58<1:22:51,  6.21s/it]

{'loss': 0.0005, 'learning_rate': 2.285714285714286e-06, 'epoch': 9.67}


 81%|████████  | 3225/4000 [6:19:36<1:21:04,  6.28s/it]

{'loss': 0.0014, 'learning_rate': 2.2142857142857146e-06, 'epoch': 9.74}


 81%|████████▏ | 3250/4000 [6:22:11<1:17:08,  6.17s/it]

{'loss': 0.0004, 'learning_rate': 2.1428571428571427e-06, 'epoch': 9.82}


 82%|████████▏ | 3275/4000 [6:24:49<1:15:48,  6.27s/it]

{'loss': 0.0004, 'learning_rate': 2.0714285714285717e-06, 'epoch': 9.89}


 82%|████████▎ | 3300/4000 [6:27:24<1:11:55,  6.17s/it]

{'loss': 0.0004, 'learning_rate': 2.0000000000000003e-06, 'epoch': 9.97}


 83%|████████▎ | 3325/4000 [6:30:02<1:09:33,  6.18s/it]

{'loss': 0.0004, 'learning_rate': 1.928571428571429e-06, 'epoch': 10.05}


 84%|████████▍ | 3350/4000 [6:32:37<1:07:05,  6.19s/it]

{'loss': 0.0004, 'learning_rate': 1.8571428571428573e-06, 'epoch': 10.12}


 84%|████████▍ | 3375/4000 [6:35:12<1:04:46,  6.22s/it]

{'loss': 0.0004, 'learning_rate': 1.7857142857142859e-06, 'epoch': 10.2}


 85%|████████▌ | 3400/4000 [6:37:46<1:01:51,  6.19s/it]

{'loss': 0.0004, 'learning_rate': 1.7142857142857145e-06, 'epoch': 10.27}


 86%|████████▌ | 3425/4000 [6:40:20<59:11,  6.18s/it]  

{'loss': 0.0006, 'learning_rate': 1.642857142857143e-06, 'epoch': 10.35}


 86%|████████▋ | 3450/4000 [6:42:55<56:44,  6.19s/it]

{'loss': 0.0004, 'learning_rate': 1.5714285714285714e-06, 'epoch': 10.42}


 87%|████████▋ | 3475/4000 [6:45:30<54:09,  6.19s/it]

{'loss': 0.0004, 'learning_rate': 1.5e-06, 'epoch': 10.5}


 88%|████████▊ | 3500/4000 [6:48:05<51:38,  6.20s/it]

{'loss': 0.0004, 'learning_rate': 1.4285714285714286e-06, 'epoch': 10.57}


 88%|████████▊ | 3525/4000 [6:50:39<49:01,  6.19s/it]

{'loss': 0.0004, 'learning_rate': 1.3571428571428572e-06, 'epoch': 10.65}


 89%|████████▉ | 3550/4000 [6:53:17<46:52,  6.25s/it]

{'loss': 0.0006, 'learning_rate': 1.2857142857142856e-06, 'epoch': 10.73}


 89%|████████▉ | 3575/4000 [6:55:51<43:40,  6.16s/it]

{'loss': 0.0004, 'learning_rate': 1.2142857142857144e-06, 'epoch': 10.8}


 90%|█████████ | 3600/4000 [6:58:29<41:08,  6.17s/it]

{'loss': 0.0004, 'learning_rate': 1.142857142857143e-06, 'epoch': 10.88}


 91%|█████████ | 3625/4000 [7:01:09<39:23,  6.30s/it]

{'loss': 0.0004, 'learning_rate': 1.0714285714285714e-06, 'epoch': 10.95}


 91%|█████████▏| 3650/4000 [7:03:46<35:54,  6.16s/it]

{'loss': 0.0004, 'learning_rate': 1.0000000000000002e-06, 'epoch': 11.03}


 92%|█████████▏| 3675/4000 [7:06:19<33:19,  6.15s/it]

{'loss': 0.0005, 'learning_rate': 9.285714285714287e-07, 'epoch': 11.1}


 92%|█████████▎| 3700/4000 [7:08:53<30:49,  6.16s/it]

{'loss': 0.0004, 'learning_rate': 8.571428571428572e-07, 'epoch': 11.18}


 93%|█████████▎| 3725/4000 [7:11:32<29:02,  6.34s/it]

{'loss': 0.0004, 'learning_rate': 7.857142857142857e-07, 'epoch': 11.25}


 94%|█████████▍| 3750/4000 [7:14:06<25:38,  6.16s/it]

{'loss': 0.0003, 'learning_rate': 7.142857142857143e-07, 'epoch': 11.33}


 94%|█████████▍| 3775/4000 [7:16:44<27:16,  7.27s/it]

{'loss': 0.0005, 'learning_rate': 6.428571428571428e-07, 'epoch': 11.4}


 95%|█████████▌| 3800/4000 [7:19:22<20:59,  6.30s/it]

{'loss': 0.0004, 'learning_rate': 5.714285714285715e-07, 'epoch': 11.48}


 96%|█████████▌| 3825/4000 [7:21:56<17:55,  6.15s/it]

{'loss': 0.0003, 'learning_rate': 5.000000000000001e-07, 'epoch': 11.56}


 96%|█████████▋| 3850/4000 [7:24:32<15:24,  6.16s/it]

{'loss': 0.0003, 'learning_rate': 4.285714285714286e-07, 'epoch': 11.63}


 97%|█████████▋| 3875/4000 [7:27:09<12:49,  6.16s/it]

{'loss': 0.0004, 'learning_rate': 3.5714285714285716e-07, 'epoch': 11.71}


 98%|█████████▊| 3900/4000 [7:29:45<10:16,  6.16s/it]

{'loss': 0.0003, 'learning_rate': 2.8571428571428575e-07, 'epoch': 11.78}


 98%|█████████▊| 3925/4000 [7:32:22<07:44,  6.19s/it]

{'loss': 0.0004, 'learning_rate': 2.142857142857143e-07, 'epoch': 11.86}


 99%|█████████▉| 3950/4000 [7:34:56<05:08,  6.17s/it]

{'loss': 0.0004, 'learning_rate': 1.4285714285714287e-07, 'epoch': 11.93}


 99%|█████████▉| 3975/4000 [7:37:30<02:35,  6.21s/it]

{'loss': 0.0004, 'learning_rate': 7.142857142857144e-08, 'epoch': 12.01}


100%|██████████| 4000/4000 [7:40:04<00:00,  6.13s/it]

{'loss': 0.0004, 'learning_rate': 0.0, 'epoch': 12.08}


                                                     
100%|██████████| 4000/4000 [7:54:19<00:00,  6.13s/it]

{'eval_loss': 0.19237440824508667, 'eval_wer': 64.4561261625556, 'eval_runtime': 854.6178, 'eval_samples_per_second': 2.853, 'eval_steps_per_second': 0.357, 'epoch': 12.08}


There were missing keys in the checkpoint model loaded: ['proj_out.weight'].
100%|██████████| 4000/4000 [7:54:30<00:00,  7.12s/it]

{'train_runtime': 28470.8418, 'train_samples_per_second': 2.248, 'train_steps_per_second': 0.14, 'train_loss': 0.04882594305602834, 'epoch': 12.08}
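
The logged `learning_rate` and `epoch` values can be sanity-checked by hand. The sketch below assumes a linear schedule with a peak learning rate of 1e-5, 500 warmup steps, and an effective batch size of 16. None of these settings appear in the log itself, so treat them as assumptions; they are chosen because they reproduce the logged numbers. The train split size (5296 rows) comes from the `load_dataset` output earlier in the notebook, and the function names are illustrative.

```python
# Assumed training settings (not shown in the log above).
PEAK_LR = 1e-5
WARMUP_STEPS = 500
MAX_STEPS = 4000
BATCH_SIZE = 16
TRAIN_ROWS = 5296  # size of the train split printed by load_dataset


def linear_lr(step):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR * max(0, MAX_STEPS - step) / (MAX_STEPS - WARMUP_STEPS)


def epoch_at(step):
    """Number of passes over the training split after `step` optimizer steps."""
    return step * BATCH_SIZE / TRAIN_ROWS


# Compare against the logged values at a few steps.
for step in (3650, 3975, 4000):
    print(step, linear_lr(step), round(epoch_at(step), 2))
```

Under these assumptions, step 3650 gives a learning rate of about 1e-06 at epoch 11.03, and step 4000 gives 0.0 at epoch 12.08, which matches the log.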





If you need to push the model to the Hugging Face Hub, install or upgrade `huggingface_hub` and then run the following blocks.

```
pip install --upgrade huggingface_hub
```

In [None]:
# Optional: logging in lets you upload the model to the Hugging Face Hub later on
from huggingface_hub import notebook_login
notebook_login()

In [None]:
# Push the fine-tuned model to the Hugging Face Hub.

# The following arguments are needed only when pushing the model to the Hugging Face Hub;
# they fill in the metadata of the generated model card.
kwargs = {
    "dataset_tags": "mozilla-foundation/common_voice_11_0",
    "dataset": "Common Voice 11.0",  # a 'pretty' name for the training dataset
    "dataset_args": "config: yue, split: test",  # matches the dataset config used for training
    "language": "yue",  # ISO 639-3 code for Cantonese, same as the dataset config
    "model_name": "[language-x-change] Custom Whisper for Cantonese",  # a 'pretty' name for our model
    "finetuned_from": "openai/whisper-small",
    "tasks": "automatic-speech-recognition",
    "tags": "hf-asr-leaderboard",
}

trainer.push_to_hub(**kwargs)

The following is needed only if you want to deploy a runnable demo of the uploaded model on Hugging Face Spaces.

In [None]:
!pip install gradio

In [None]:
from transformers import pipeline
import gradio as gr

pipe = pipeline(model="your-own-model")  # change to "your-username/the-name-you-picked"

def transcribe(audio):
    text = pipe(audio)["text"]
    return text

iface = gr.Interface(
    fn=transcribe,
    # Gradio 3.x syntax; Gradio 4.x and later expect sources=["microphone"] instead
    inputs=gr.Audio(source="microphone", type="filepath"),
    outputs="text",
    title="Whisper Small Cantonese",
    description="Realtime demo for Cantonese speech recognition using a fine-tuned Whisper small model.",
)

iface.launch()

To run inference with the model we just fine-tuned (see https://huggingface.co/docs/transformers/tasks/asr#inference):


In [9]:
from transformers import pipeline, AutoModelForSpeechSeq2Seq, AutoProcessor
import datetime
import json
import os

# The model weights are loaded from the final checkpoint; the processor
# (tokenizer + feature extractor) is loaded from the parent output directory.
path = "model/whisper-small-cantanese/checkpoint-4000"
processor_path = "model/whisper-small-cantanese"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    path,
    local_files_only=True,
)

processor = AutoProcessor.from_pretrained(processor_path)

transcriber = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    # chunk_length_s=10,
    max_new_tokens=1000,
    # batch_size=16,
    return_timestamps=True,
)
# NOTE: the return value of this call is discarded, so as written it does not
# change how the pipeline decodes.
transcriber.tokenizer.get_decoder_prompt_ids(language='cantonese', task="transcribe")
result = transcriber("source/trimmed_sample.mp3")

# Save the result to a timestamped JSON file, keeping the Chinese text readable.
now = datetime.datetime.now().strftime("%d-%m-%Y-%H-%M-%S")
os.makedirs("output", exist_ok=True)  # make sure the output directory exists
json_object = json.dumps(result, indent=4, ensure_ascii=False)
with open('output/'+now+".json", "w", encoding="utf-8") as f:
    f.write(json_object)

# The result is also printed in the output block below.
print(result)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


{'text': ' 其實都有佢嘅價值啦可能會係同身粉認同有關係啦又或者可能佢會大動到一個地方嘅文化旅游啦佢隕藏住同埋佢對於呢個社會創造緊嘅價值其實都係好重要嘅元素嚟嘅 市單定加一大嘅唐州活發工程已經完成嘅嘞 而市建共亦都話嚟緊會引入共同租住單位嘅共居模式希望底時呢度呢就可以變成一個充滿文化特色同埋活力嘅社區', 'chunks': [{'timestamp': (0.0, 1.4), 'text': ' 其實都有佢嘅價值啦'}, {'timestamp': (1.4, 4.68), 'text': '可能會係同身粉認同有關係啦'}, {'timestamp': (4.68, 8.0), 'text': '又或者可能佢會大動到一個地方嘅文化旅游啦'}, {'timestamp': (8.0, 15.04), 'text': '佢隕藏住同埋佢對於呢個社會創造緊嘅價值其實都係好重要嘅元素嚟嘅'}, {'timestamp': (19.04, 22.64), 'text': ' 市單定加一大嘅唐州活發工程已經完成嘅嘞'}, {'timestamp': (22.64, 26.88), 'text': ' 而市建共亦都話嚟緊會引入共同租住單位嘅共居模式'}, {'timestamp': (0.0, 4.32), 'text': '希望底時呢度呢就可以變成一個充滿文化特色同埋活力嘅社區'}]}
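
Note that the last chunk in the output above is stamped `(0.0, 4.32)` even though it follows a chunk ending at 26.88 s. A likely cause, stated here as an assumption rather than a confirmed diagnosis, is that Whisper predicts timestamps relative to its current 30-second window and the offset of the second window was not added back. The sketch below only flags such resets; it does not try to repair them, because the true offset cannot be recovered from the output alone. `find_timestamp_resets` is a hypothetical helper name, not part of transformers.

```python
def find_timestamp_resets(chunks):
    """Return indices of chunks that start before the previous chunk ended."""
    resets = []
    previous_end = None
    for index, chunk in enumerate(chunks):
        start, end = chunk["timestamp"]
        # A start time earlier than the previous end means the clock went backwards.
        if previous_end is not None and start is not None and start < previous_end:
            resets.append(index)
        # The end timestamp can be missing (None); keep the last known end then.
        if end is not None:
            previous_end = end
    return resets


# Timestamps copied from the output above (text omitted for brevity).
sample_chunks = [
    {"timestamp": (0.0, 1.4)},
    {"timestamp": (1.4, 4.68)},
    {"timestamp": (4.68, 8.0)},
    {"timestamp": (8.0, 15.04)},
    {"timestamp": (19.04, 22.64)},
    {"timestamp": (22.64, 26.88)},
    {"timestamp": (0.0, 4.32)},
]

print(find_timestamp_resets(sample_chunks))  # → [6]
```

In the notebook, the same check can be run directly on `result["chunks"]`.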


Install the following dependency to tabulate the results:

```
pip install pandas
```

In [90]:
import pandas as pd
import json
from IPython.display import display


# Flatten the list of chunks into a DataFrame and show it in a tabular format
df = pd.json_normalize(result, record_path=['chunks'])
display(df)



,timestamp,text
0,"(0.0, 1.4)",其實都有佢嘅價值啦
1,"(1.4, 4.68)",可能會係同身粉認同有關係啦
2,"(4.68, 8.0)",又或者可能佢會大動到一個地方嘅文化旅游啦
3,"(8.0, 15.04)",佢隕藏住同埋佢對於呢個社會創造緊嘅價值其實都係好重要嘅元素嚟嘅
4,"(19.04, 22.64)",市單定加一大嘅唐州活發工程已經完成嘅嘞
5,"(22.64, 26.88)",而市建共亦都話嚟緊會引入共同租住單位嘅共居模式
6,"(0.0, 4.32)",希望底時呢度呢就可以變成一個充滿文化特色同埋活力嘅社區
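
The `timestamp` column holds `(start, end)` tuples, which are awkward to sort or filter on. The following is a minimal sketch, assuming the `chunks` structure printed earlier, that splits each tuple into separate fields and adds a duration before tabulating. `chunks_to_rows` is an illustrative name, not part of pandas or transformers.

```python
def chunks_to_rows(chunks):
    """Turn pipeline chunks into flat dictionaries with start, end, duration, and text."""
    rows = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        # Either timestamp may be missing (None); leave the duration empty in that case.
        if start is not None and end is not None:
            duration = round(end - start, 2)
        else:
            duration = None
        rows.append({
            "start": start,
            "end": end,
            "duration": duration,
            "text": chunk["text"].strip(),  # drop the leading space some chunks carry
        })
    return rows


# Two chunks copied from the output above.
rows = chunks_to_rows([
    {"timestamp": (0.0, 1.4), "text": " 其實都有佢嘅價值啦"},
    {"timestamp": (1.4, 4.68), "text": "可能會係同身粉認同有關係啦"},
])
for row in rows:
    print(row)
```

The resulting list of dictionaries can be passed to `pd.DataFrame(rows)` in place of the `pd.json_normalize` call.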
