## **Fine-Tune Whisper**

Leverage the extensive multilingual ASR knowledge that Whisper acquired during pre-training for our low-resource language: Singlish.

### **Load Dataset**

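Whenever changes are made to the dataset repo, clear the local dataset cache by running `Remove-Item -Recurse -Force ~/.cache/huggingface/datasets/` from a PowerShell terminal, so that the updated version is fetched on the next load.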
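The same clean-up can be sketched in Python for shells where `Remove-Item` is not available. This is a minimal sketch, not part of the original notebook: `clear_datasets_cache` and `DEFAULT_CACHE` are hypothetical names, and the path assumes the default Hugging Face datasets cache location (it can differ if `HF_DATASETS_CACHE` is set).

```python
import shutil
from pathlib import Path

# Assumption: default Hugging Face datasets cache location.
DEFAULT_CACHE = Path.home() / ".cache" / "huggingface" / "datasets"


def clear_datasets_cache(cache_dir=DEFAULT_CACHE):
    """Recursively delete cache_dir; return True if anything was removed."""
    cache_dir = Path(cache_dir)
    if not cache_dir.exists():
        return False  # nothing cached yet, nothing to do
    shutil.rmtree(cache_dir)
    return True
```

Calling `clear_datasets_cache()` with no arguments targets the default location; passing an explicit path keeps the helper safe to try on a scratch directory first.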

In [1]:
from datasets import load_dataset
from IPython.display import Audio

In [2]:
dataset_repo = "johnlohjy/imda_nsc_p3_same_closemic_train"
dataset_train = load_dataset(dataset_repo, split='train', streaming=True, trust_remote_code=True)

In [3]:
print(dataset_train)

IterableDataset({
    features: ['path', 'audio', 'sentence'],
    n_shards: 1
})


### **Prepare Dataset for Whisper**

- Feature extractor
    - Pads (with silence) or truncates audio to 30 s
    - Converts the raw audio input to log-Mel spectrogram input features

- Tokenizer
    - Encodes the target transcription into token ids (the labels used for training)
    - Maps the sequence of token ids output by the Whisper model back to the corresponding text string
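To make the pad/truncate step concrete, here is a minimal NumPy sketch of what happens to clips shorter or longer than 30 s. It is an illustration only, not the library's implementation: `pad_or_truncate` is a hypothetical helper, and the constants assume Whisper's usual settings of 16 kHz audio, 30 s chunks, and a hop of 160 samples between spectrogram frames.

```python
import numpy as np

SAMPLING_RATE = 16_000  # Hz, the rate Whisper expects
CHUNK_SECONDS = 30      # fixed input window
HOP_LENGTH = 160        # samples between spectrogram frames (10 ms)
N_SAMPLES = SAMPLING_RATE * CHUNK_SECONDS


def pad_or_truncate(audio, n_samples=N_SAMPLES):
    """Zero-pad (silence) or cut a 1-D waveform to exactly n_samples."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.shape[0] >= n_samples:
        return audio[:n_samples]
    return np.pad(audio, (0, n_samples - audio.shape[0]))


short_clip = np.ones(5 * SAMPLING_RATE, dtype=np.float32)   # 5 s clip
long_clip = np.ones(45 * SAMPLING_RATE, dtype=np.float32)   # 45 s clip

padded = pad_or_truncate(short_clip)
truncated = pad_or_truncate(long_clip)

# Number of spectrogram frames covering one 30 s window
n_frames = N_SAMPLES // HOP_LENGTH
```

Both clips end up with the same length, which is why every example reaches the model with input features of identical shape regardless of the original duration.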

In [None]:
from transformers import WhisperProcessor

In [None]:
whisper_ver = 'whisper-tiny'

In [None]:
# WhisperProcessor class provides both feature extractor and tokenizer
processor = WhisperProcessor.from_pretrained(f"openai/{whisper_ver}", language="English", task="transcribe")

In [None]:
def prepare_dataset(batch):
    # load and (possibly) resample audio data to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array 
    batch["input_features"] = processor.feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    # compute input length of audio sample in seconds
    batch["input_length"] = len(audio["array"]) / audio["sampling_rate"]
    
    # encode target text (the "sentence" column) to label ids
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch
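The `input_length` field computed above is not used yet. One plausible use, sketched here as an assumption rather than the notebook's actual next step, is to drop clips longer than the 30 s window, since their audio would be truncated while their labels would not. `MAX_INPUT_LENGTH` and `is_audio_in_length_range` are hypothetical names.

```python
# Assumption: keep only clips that fit inside Whisper's 30 s input window.
MAX_INPUT_LENGTH = 30.0  # seconds


def is_audio_in_length_range(input_length):
    """Return True for non-empty clips shorter than MAX_INPUT_LENGTH seconds."""
    return 0.0 < input_length < MAX_INPUT_LENGTH


# Quick check on a few example durations (in seconds)
durations = [0.0, 4.2, 29.9, 31.5]
kept = [d for d in durations if is_audio_in_length_range(d)]
```

With the streaming dataset, this could be chained as `dataset_train.map(prepare_dataset).filter(is_audio_in_length_range, input_columns=["input_length"])`, which processes and filters examples lazily as they are iterated.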