This notebook is adapted from the Hugging Face Colab notebook on fine-tuning Whisper (https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/fine_tune_whisper.ipynb)


Make sure the following dependencies are installed in your environment:

```
pip install datasets transformers evaluate jiwer
```

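The Common Voice dataset comes from the Hugging Face Hub (https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0/viewer/zh-HK/train?p=84). Note that this link points to the 11.0 release (`zh-HK`), while the code below loads the 15.0 release with the `yue` configuration.

It is part of the Mozilla Foundation's Common Voice project.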

----

We can prepare our own dataset in the same format to fine-tune our own version of Whisper.
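
As a rough sketch of what "the same format" could look like for your own recordings: the `datasets` library can load a folder of audio files alongside a `metadata.csv` whose `file_name` column points at each clip (the `audiofolder` loader). The folder layout, the clip names, and the `write_metadata` helper below are assumptions for illustration; the `sentence` column name is chosen to match the Common Voice column used later in this notebook.

```python
import csv
import tempfile
from pathlib import Path


def write_metadata(rows, folder):
    """Write metadata.csv (file_name, sentence) into `folder` and return its path."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    metadata_path = folder / "metadata.csv"
    with open(metadata_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["file_name", "sentence"])
        writer.writeheader()
        writer.writerows(rows)
    return metadata_path


# hypothetical clips and transcripts, for illustration only
rows = [
    {"file_name": "clip_0001.mp3", "sentence": "睇內容長短嘅"},
    {"file_name": "clip_0002.mp3", "sentence": "佢哋其實係一班居民"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = write_metadata(rows, Path(tmp) / "my_dataset" / "train")
    print(path.read_text(encoding="utf-8"))

# Once the audio files sit next to metadata.csv, something like
#   load_dataset("audiofolder", data_dir="my_dataset")
# should yield 'audio' and 'sentence' columns (not run here).
```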

----

To avoid downloading the audio files every time, the cells below pass `cache_dir`. Point it at the directory where the files should be downloaded (or where they have already been downloaded), or remove the argument to use the default cache location.
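
Since the same cache root is repeated in several cells below (with `datasets`, `feature`, `tokenizer`, `processor` and `models` subfolders), one option is to derive every path from a single place. This is only a sketch: `cache_dir_for` and the `HF_CACHE_ROOT` environment variable are hypothetical names invented for this example, not something the Hugging Face libraries read.

```python
import os
from pathlib import Path


def cache_dir_for(kind, root=None):
    """Return '<root>/<kind>' as a string, e.g. kind='datasets' or kind='models'."""
    # HF_CACHE_ROOT is a made-up variable name for this sketch;
    # fall back to a local 'hf_cache' folder when it is not set.
    base = Path(root or os.environ.get("HF_CACHE_ROOT", "hf_cache"))
    return str(base / kind)


print(cache_dir_for("datasets", root="/Volumes/BACKUP/Coding/HUGGING_FACE"))
```

The returned string can then be passed as `cache_dir=cache_dir_for("datasets")` and so on.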

In [33]:
# import sys
# sys.path.append(datasets_dir)

# Before downloading a new dataset, check whether you need to review and accept
# its terms of use on the Hugging Face Hub first; otherwise the download will fail.

from datasets import load_dataset, DatasetDict

dataset_name = "mozilla-foundation/common_voice_15_0"
language_to_train = 'yue'

common_voice = DatasetDict()
common_voice["train"] = load_dataset(
  dataset_name, language_to_train, 
  split="train+validation",
  cache_dir="/Volumes/BACKUP/Coding/HUGGING_FACE/datasets"
  )

common_voice["test"] = load_dataset(
  dataset_name, language_to_train, 
  split="test",  
  cache_dir="/Volumes/BACKUP/Coding/HUGGING_FACE/datasets"
  )

print(common_voice)

DatasetDict({
    train: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', 'variant'],
        num_rows: 5636
    })
    test: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', 'variant'],
        num_rows: 2565
    })
})


In [34]:
# !pip install "tokenizers>=0.14,<0.15"

from transformers import WhisperFeatureExtractor
feature_extractor = WhisperFeatureExtractor.from_pretrained(
  "openai/whisper-small", 
  cache_dir="/Volumes/BACKUP/Coding/HUGGING_FACE/feature"
  )  # start from the whisper-small checkpoint

In [35]:
from transformers import WhisperTokenizer
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", 
language="cantonese", 
task="transcribe",
cache_dir="/Volumes/BACKUP/Coding/HUGGING_FACE/tokenizer"
)

In [36]:
from transformers import WhisperProcessor
processor = WhisperProcessor.from_pretrained("openai/whisper-small", 
language="cantonese", 
task="transcribe",
cache_dir="/Volumes/BACKUP/Coding/HUGGING_FACE/processor"
)

In [37]:
# Preparing Data

# Whisper expects audio sampled at 16 kHz. The Common Voice audio is sampled at 48 kHz,
# so we downsample it to 16 kHz before passing it to the Whisper feature extractor.
from datasets import Audio
raw_common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

print(raw_common_voice["train"][0])

{'client_id': '2ecfe4e00a829397a04e316949bf3058c9ed72b0da9fad2686b0bc3bd98654d8a586e878cc3aa7609bf0359f56e24b3bc0f6f1ec4d1ec958e569bbaaf742560b', 'path': '/Volumes/BACKUP/Coding/HUGGING_FACE/datasets/downloads/extracted/5f8c376b62cbcec81f092e38c43f1519f67645668f8044d9b7c5a51c4297c524/yue_train_0/common_voice_yue_31210647.mp3', 'audio': {'path': '/Volumes/BACKUP/Coding/HUGGING_FACE/datasets/downloads/extracted/5f8c376b62cbcec81f092e38c43f1519f67645668f8044d9b7c5a51c4297c524/yue_train_0/common_voice_yue_31210647.mp3', 'array': array([ 1.45519152e-10,  4.36557457e-11,  4.36557457e-11, ...,
       -2.06303957e-06, -1.26592931e-06,  1.36844028e-06]), 'sampling_rate': 16000}, 'sentence': '睇內容長短嘅', 'up_votes': 4, 'down_votes': 0, 'age': 'teens', 'gender': 'male', 'accent': '香港粵語', 'locale': 'yue', 'segment': '', 'variant': ''}
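
The printed example above already reports `'sampling_rate': 16000`. A small sanity check along these lines could be run on a few examples before the (slow) feature-extraction step. `summarize_audio` is a hypothetical helper, and the 30-second limit reflects the window length that Whisper's feature extractor pads or truncates to.

```python
def summarize_audio(example, expected_rate=16_000, max_seconds=30.0):
    """Return duration info for one dataset example; raise if the sampling rate is off."""
    audio = example["audio"]
    rate = audio["sampling_rate"]
    if rate != expected_rate:
        raise ValueError(f"expected {expected_rate} Hz, got {rate} Hz")
    duration = len(audio["array"]) / rate
    return {"seconds": duration, "fits_in_window": duration <= max_seconds}


# toy example with the same structure as a Common Voice row (2 seconds of silence)
toy = {"audio": {"array": [0.0] * 32_000, "sampling_rate": 16_000}, "sentence": "睇內容長短嘅"}
print(summarize_audio(toy))
```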


Prepare the dataset: compute the log-Mel input features from the audio and encode the target text into label ids.

In [38]:
def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch



finalized_common_voice = raw_common_voice.map(prepare_dataset, 
  remove_columns=raw_common_voice.column_names["train"], 
  num_proc=2)
print(finalized_common_voice)

Map (num_proc=2): 100%|██████████| 5636/5636 [08:57<00:00, 10.49 examples/s]
Map (num_proc=2): 100%|██████████| 2565/2565 [04:05<00:00, 10.46 examples/s]

DatasetDict({
    train: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 5636
    })
    test: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 2565
    })
})
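
To get a feel for what each `input_features` entry holds, the arithmetic can be sketched as follows. The constants are assumptions based on the default `openai/whisper-small` feature-extractor settings (16 kHz audio, 30-second window, hop length of 160 samples, 80 Mel bins); check `feature_extractor` itself if you use a different checkpoint.

```python
# Assumed defaults for the whisper-small feature extractor
SAMPLING_RATE = 16_000   # samples per second
CHUNK_SECONDS = 30       # every clip is padded or truncated to this length
HOP_LENGTH = 160         # samples between consecutive spectrogram frames (10 ms)
N_MELS = 80              # number of Mel frequency bins


def input_feature_shape(sampling_rate=SAMPLING_RATE, chunk_seconds=CHUNK_SECONDS,
                        hop_length=HOP_LENGTH, n_mels=N_MELS):
    """Return (n_mels, n_frames) for one padded or truncated clip."""
    n_samples = sampling_rate * chunk_seconds
    n_frames = n_samples // hop_length
    return (n_mels, n_frames)


print(input_feature_shape())  # (80, 3000)
```

Every example ends up with the same feature shape regardless of the original clip length, which is why the data collator later only needs to convert the features to tensors.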





The following cells train and evaluate the model using the Trainer provided by Hugging Face.

Evaluation metrics: during evaluation, we want to evaluate the model using the word error rate (WER) metric. We need to define a compute_metrics function that handles this computation.

Load a pre-trained checkpoint: we need to load a pre-trained checkpoint and configure it correctly for training.

Define the training configuration: this will be used by the 🤗 Trainer to define the training schedule.

In [39]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if a bos token was added in the previous tokenization step,
        # cut it here, as it is added again later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

In [40]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
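
What the collator does to the labels can be illustrated with plain Python lists. This is a simplified sketch, not the collator itself: it assumes non-empty label lists and uses made-up token ids (`pad_id=0`, `bos_id=1`).

```python
IGNORE_INDEX = -100  # positions with this label are ignored by the loss


def pad_and_mask_labels(label_lists, pad_id, bos_id):
    """Pad label id lists to equal length, mask the padding, drop a shared leading bos."""
    max_len = max(len(ids) for ids in label_lists)
    batch = []
    for ids in label_lists:
        n_pad = max_len - len(ids)
        padded = ids + [pad_id] * n_pad
        attention_mask = [1] * len(ids) + [0] * n_pad
        # replace padded positions (mask == 0) with the ignore index
        batch.append([tok if keep else IGNORE_INDEX
                      for tok, keep in zip(padded, attention_mask)])
    # if every sequence starts with bos, cut it
    if all(row[0] == bos_id for row in batch):
        batch = [row[1:] for row in batch]
    return batch


print(pad_and_mask_labels([[1, 5, 6, 7], [1, 8]], pad_id=0, bos_id=1))
# [[5, 6, 7], [8, -100, -100]]
```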

Evaluation uses the Hugging Face `evaluate` WER (word error rate) metric.

In [16]:
!pip install evaluate

Collecting evaluate
  Using cached evaluate-0.4.1-py3-none-any.whl.metadata (9.4 kB)
Collecting responses<0.19 (from evaluate)
  Using cached responses-0.18.0-py3-none-any.whl (38 kB)
Using cached evaluate-0.4.1-py3-none-any.whl (84 kB)
Installing collected packages: responses, evaluate
Successfully installed evaluate-0.4.1 responses-0.18.0


In [41]:
import evaluate

metric = evaluate.load("wer")


def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}

Downloading builder script: 100%|██████████| 4.49k/4.49k [00:00<00:00, 12.6MB/s]
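
For intuition, WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below is a minimal stand-alone version, assuming words are separated by whitespace; it is not the implementation used by `evaluate`, and `word_error_rate` is a name chosen for this example.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat", "the dog sat"))  # one substitution in three words
```

Because this sketch splits on whitespace, a sentence written without spaces counts as a single word.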


In [43]:
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained(
  "openai/whisper-small", 
  cache_dir="/Volumes/BACKUP/Coding/HUGGING_FACE/models"
  )

model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

Training configuration

In [None]:
# A nice YouTube introduction to using TensorBoard: https://www.youtube.com/watch?v=VJW9wU-1n18&t=4s
!pip install tensorboardx

In [48]:
from transformers import Seq2SeqTrainingArguments
import datetime

now = datetime.datetime.now().strftime("%d-%m-%Y-%H%M")

training_args = Seq2SeqTrainingArguments(
    output_dir="model/whisper-small-cantonese_"+now,  # change to a repo name of your choice
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=500,
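    max_steps=4000,  # the run logged below trained for 4000 steps; with 500, training would stop before the first evaluation at step 1000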
    gradient_checkpointing=True,
    fp16=False,  # keep False unless you are training on a CUDA GPU
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    report_to=["tensorboard"],  # requires tensorboardX to be installed
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=False,
)

from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=finalized_common_voice["train"],
    eval_dataset=finalized_common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
    # checkpoint_activations=True
)

processor.save_pretrained(training_args.output_dir)

The actual Training Part

In [49]:
trainer.train()

`use_cache = True` is incompatible with gradient checkpointing. Setting `use_cache = False`...


Step,Training Loss,Validation Loss,Wer
1000,0.0331,0.157449,69.692308
2000,0.0049,0.171989,63.230769
3000,0.0005,0.181519,61.807692
4000,0.0004,0.186846,61.307692


There were missing keys in the checkpoint model loaded: ['proj_out.weight'].


TrainOutput(global_step=4000, training_loss=0.05433571213460527, metrics={'train_runtime': 29224.9012, 'train_samples_per_second': 2.19, 'train_steps_per_second': 0.137, 'total_flos': 1.843137234763776e+19, 'train_loss': 0.05433571213460527, 'epoch': 11.33})
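
The `'epoch': 11.33` in the output above can be reproduced from the batch size and the number of training rows. The sketch below assumes a single device and that the last, partially filled batch of each epoch still counts as a step; the helper names are made up for this example.

```python
import math


def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of examples that contribute to one optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def epochs_covered(max_steps, num_rows, batch_size):
    """Approximate number of passes over the training set after max_steps steps."""
    steps_per_epoch = math.ceil(num_rows / batch_size)
    return max_steps / steps_per_epoch


batch = effective_batch_size(per_device_batch_size=16, gradient_accumulation_steps=1)
print(batch)                                        # 16
print(round(epochs_covered(4000, 5636, batch), 2))  # 11.33

# Halving the per-device batch size and doubling the accumulation steps
# keeps the effective batch size unchanged.
print(effective_batch_size(8, 2))                   # 16
```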

If you need to push the model to the Hugging Face Hub, upgrade `huggingface_hub` and run the following cells:

```
pip install --upgrade huggingface_hub
```

In [None]:
# Optional: logging in lets you upload the model to the Hugging Face Hub later on
from huggingface_hub import notebook_login
notebook_login()

In [None]:
# Push the model to the Hub

# The following arguments are only needed when pushing the model to the Hugging Face Hub
kwargs = {
    "dataset_tags": "mozilla-foundation/common_voice_15_0",  # the dataset loaded above
    "dataset": "Common Voice 15.0",  # a 'pretty' name for the training dataset
    "dataset_args": "config: yue, split: test",
    "language": "Cantonese",
    "model_name": "[language-x-change] Custom Whisper for Cantonese",  # a 'pretty' name for our model
    "finetuned_from": "openai/whisper-small",
    "tasks": "automatic-speech-recognition",
    "tags": "hf-asr-leaderboard",
}

trainer.push_to_hub(**kwargs)

The following is only needed if you want to deploy a runnable demo of the uploaded model on Hugging Face Spaces.

In [None]:
!pip install gradio

In [None]:
from transformers import pipeline
import gradio as gr

pipe = pipeline(model="your-own-model")  # change to "your-username/the-name-you-picked"

def transcribe(audio):
    text = pipe(audio)["text"]
    return text

iface = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(source="microphone", type="filepath"),  # Gradio 4.x renamed this argument to sources=["microphone"]
    outputs="text",
    title="Whisper Small Cantonese",
    description="Realtime demo for Cantonese speech recognition using a fine-tuned Whisper small model.",
)

iface.launch()

To run inference with the model we just fine-tuned (https://huggingface.co/docs/transformers/tasks/asr#inference)


In [54]:
from transformers import pipeline, AutoModelForSpeechSeq2Seq, AutoProcessor
import datetime
import json

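def write_contents_to_file(content):
    # write one transcription result to a timestamped JSON file under output/
    now = datetime.datetime.now().strftime("%d-%m-%Y-%H-%M-%S")
    # serialize the argument (the original used the global `result` by mistake);
    # ensure_ascii=False keeps the Chinese characters readable in the file
    json_object = json.dumps(content, indent=4, ensure_ascii=False)
    with open('output/'+now+".json", "w", encoding="utf-8") as f:
        f.write(json_object)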

path = "model/whisper-small-cantonese_18-12-2023-22-27/checkpoint-4000"
processor_path = "model/whisper-small-cantonese_18-12-2023-22-27"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
   path, 
   local_files_only=True,
)

processor = AutoProcessor.from_pretrained(processor_path)

transcriber = pipeline("automatic-speech-recognition", 
    model=model,  
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    # chunk_length_s=20,
    max_new_tokens=500,
   #  batch_size=16,
    # return_timestamps=True
   )
# NOTE: the return value is not assigned, so this call does not change how the pipeline decodes
transcriber.tokenizer.get_decoder_prompt_ids(language='cantonese', task="transcribe")

# file_list = ["Audio1_2.mp3","Audio1_4.mp3","Audio1_5.mp3","Audio1_9.mp3","Audio1_10.mp3","Audio1_11.mp3"]
# for index, file in enumerate(file_list):
#     result = transcriber("source/"+file)
#     write_contents_to_file(result)
#     # also it will print out the result in the following output block
#     print(f'[{index}] - {result}')



num_of_chunks = 41
file_prefix = "chunk"
file_suffix = ".mp3"

for index in range(0, num_of_chunks):
    result = transcriber("source/rthk/"+file_prefix+str(index)+file_suffix)
    # write_contents_to_file(result)
    print(f'[{index}] - {result}')






Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


[0] - {'text': ' 中上環近半山一帶 除咗畀人個感覺靈性之外 呢度嘅建築物亦都係新夠交融㗎 唔講你唔知 呢度曾經出現嗰個叫做三日間嘅社區啲'}
[1] - {'text': ' 被稱為沙亞間嘅社區 目前並冇完整嘅文獻記錄資料 上傳喺十九世紀 士丹頓街同埋必烈者時間附近一帶嘅華人聚居地 所建造嘅三十間石屋而得名'}
[2] - {'text': '隨住社會變遷 石屋已經不復存在現時 方位內仍有數座 大約喺一九五零年代建成嘅唐流建築分別係喺二零一九年 確定成為二級歷史建築嘅史丹頓街八十八及九十號同埋平級有代蘋果嘅話言方西唐流建築群'}
[3] - {'text': ' Carol,其實當初三十間個起源係'}
[4] - {'text': '其實如果揾返資料的話 最早其實我哋係喺一百八零年嘅政府憲報度見到三十間呢個名嘅 因為我哋而家喺香港嘅地圖上面 其實我哋都好難可以見到三十間呢個名嘅喇 已經'}
[5] - {'text': '真係知道呢個名嘅人呢 大概都已經去到六十歲或者以上嘅人先至會識得用呢個名'}
[6] - {'text': '當時其實係呢個位置應該就係起咗大約三十間'}
[7] - {'text': '如果肉眼見到嘅痕跡其實可能只係得返'}
[8] - {'text': '即係呢一度三十間街坊儒蘭會呢個招牌'}
[9] - {'text': '可能就係唯一我哋可以反映到以前呢度真係叫到沙亞間嘅一樣嘢咁上環以前其實有好多華人最高嘅地方嚟'}
[10] - {'text': '佢哋其實係一班居民'}
[11] - {'text': '組織出來嘅一個地方嚟'}
[12] - {'text': '儒蘭性會 其實對於一個華人社會來講係非常之重要啦超到一啲孤魂嘅鬼 令到呢度嘅可能傷破 街坊可以安心啲 咁樣嘅一個傳統習俗啦'}
[13] - {'text': '喺街道佈局上面三十間社區嘅特色係點嘅'}
[14] - {'text': '其實如果想理解三十間嘅範圍其實我哋應該由下面嘅時單頓街開始計啦嗰個其實係個俗心'}
[15] - {'text': '跟住就一路打上去到上面半山嘅堅到範圍'}
[16] - {'text': '中間記憶一個範圍其實我哋都可以理解為三十間'}
[17] - {'text': '在呢度當中裏面其實都有唔少嘅地方 全部都係一啲淨係人行
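
Each chunk is transcribed separately, so the per-chunk results may need to be stitched back together afterwards. A minimal sketch, assuming each result is a dict with a `'text'` key as in the output above (`join_transcripts` is a hypothetical helper):

```python
def join_transcripts(results, separator=" "):
    """Concatenate the 'text' field of each result, skipping empty transcripts."""
    texts = (r.get("text", "").strip() for r in results)
    return separator.join(t for t in texts if t)


# toy results in the same shape as the pipeline output
toy_results = [
    {"text": " 佢哋其實係一班居民"},
    {"text": ""},
    {"text": "組織出來嘅一個地方嚟"},
]
print(join_transcripts(toy_results))
```

In the loop above, this would mean collecting each `result` into a list and calling the helper once at the end.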

Install the following dependency to display the results as a table:

```
pip install pandas
```

In [90]:
import pandas as pd
import json
from IPython.display import display


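# `result` must contain a 'chunks' list, which the pipeline only returns
# when it is called with return_timestamps=True
df = pd.json_normalize(result, record_path=['chunks'])
display(df)

# show df in a tabular format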



Unnamed: 0,timestamp,text
0,"(0.0, 1.4)",其實都有佢嘅價值啦
1,"(1.4, 4.68)",可能會係同身粉認同有關係啦
2,"(4.68, 8.0)",又或者可能佢會大動到一個地方嘅文化旅游啦
3,"(8.0, 15.04)",佢隕藏住同埋佢對於呢個社會創造緊嘅價值其實都係好重要嘅元素嚟嘅
4,"(19.04, 22.64)",市單定加一大嘅唐州活發工程已經完成嘅嘞
5,"(22.64, 26.88)",而市建共亦都話嚟緊會引入共同租住單位嘅共居模式
6,"(0.0, 4.32)",希望底時呢度呢就可以變成一個充滿文化特色同埋活力嘅社區
