# Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers

Reference: https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/fine_tune_whisper.ipynb

## Prepare Environment

We can verify that we've been assigned a GPU and view its specifications:

In [1]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Thu Dec 14 01:17:11 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 517.00       Driver Version: 517.00       CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ... WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   65C    P8    14W /  N/A |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
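
The `gpu_info = !nvidia-smi` line above relies on IPython's shell syntax and only works inside a notebook. As a minimal sketch for plain Python scripts, the same check can be written with the standard library. The helper name `get_gpu_info` is hypothetical and not part of the original notebook.

```python
import shutil
import subprocess


def get_gpu_info() -> str:
    """Return the output of `nvidia-smi`, or a short message if it is unavailable."""
    # `nvidia-smi` is only on PATH when an NVIDIA driver is installed.
    if shutil.which("nvidia-smi") is None:
        return "Not connected to a GPU"
    result = subprocess.run(
        ["nvidia-smi"], capture_output=True, text=True, check=False
    )
    # A non-zero exit code usually means the driver could not reach the GPU.
    if result.returncode != 0:
        return "Not connected to a GPU"
    return result.stdout


gpu_info = get_gpu_info()
print(gpu_info)
```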

In [1]:
# !pip install "datasets>=2.6.1"
# !pip install git+https://github.com/huggingface/transformers
# !pip install librosa
# !pip install "evaluate>=0.30"
# !pip install jiwer
# !pip install gradio

In [2]:
from huggingface_hub import notebook_login

notebook_login()


## Load Dataset

In [3]:
from datasets import Dataset, DatasetDict, load_dataset

# Path to your data folder; replace it with the actual location of your data
data_folder_path = "./data/all/"

# Load the audiofolder dataset
dataset = load_dataset("audiofolder", data_dir=data_folder_path)

# Display information about the loaded dataset
# print(dataset)

# Shuffle the data with a fixed random seed so the split is reproducible
shuffled_dataset = dataset.shuffle(seed=123)

# Manually split the dataset into train and test
split_percentage = 0.8  # You can adjust the percentage as needed
split_index = int(len(shuffled_dataset["train"]) * split_percentage)

train_dataset = Dataset.from_dict(shuffled_dataset["train"][:split_index])
test_dataset = Dataset.from_dict(shuffled_dataset["train"][split_index:])

# Create a custom DatasetDict
final_dataset = DatasetDict({"train": train_dataset, "test": test_dataset})

# Display available splits in the custom DatasetDict
print(final_dataset)


DatasetDict({
    train: Dataset({
        features: ['audio', 'transcription'],
        num_rows: 2364
    })
    test: Dataset({
        features: ['audio', 'transcription'],
        num_rows: 592
    })
})
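
The split sizes printed above can be checked without loading any audio. The following is a minimal sketch, assuming a total of 2956 examples (the sum of the two printed row counts). Python's `random` module stands in for `Dataset.shuffle`, so the actual ordering of examples will differ from the notebook's.

```python
import random

total_examples = 2956  # assumption: 2364 + 592, the row counts printed above
split_percentage = 0.8

# Shuffle the example indices with a fixed seed, then cut at the 80% mark.
indices = list(range(total_examples))
random.Random(123).shuffle(indices)
split_index = int(total_examples * split_percentage)

train_indices = indices[:split_index]
test_indices = indices[split_index:]

print(len(train_indices), len(test_indices))  # 2364 592
```

Because `int()` truncates, the training split receives 2364 examples and the remaining 592 go to the test split.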


## Prepare Feature Extractor, Tokenizer and Data

### Load WhisperFeatureExtractor

In [4]:
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
print(feature_extractor)

WhisperFeatureExtractor {
  "chunk_length": 30,
  "feature_extractor_type": "WhisperFeatureExtractor",
  "feature_size": 80,
  "hop_length": 160,
  "n_fft": 400,
  "n_samples": 480000,
  "nb_max_frames": 3000,
  "padding_side": "right",
  "padding_value": 0.0,
  "processor_class": "WhisperProcessor",
  "return_attention_mask": false,
  "sampling_rate": 16000
}
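
Several values in the printed configuration follow from one another. As a small worked example using only the numbers shown above, the sample count and frame count can be derived from the chunk length, sampling rate and hop length:

```python
sampling_rate = 16_000  # Hz, from "sampling_rate"
chunk_length = 30       # seconds, from "chunk_length"
hop_length = 160        # samples between successive frames, from "hop_length"
feature_size = 80       # number of Mel bins, from "feature_size"

# Every input is padded or truncated to 30 s of 16 kHz audio.
n_samples = chunk_length * sampling_rate
# One spectrogram frame is produced every `hop_length` samples.
nb_max_frames = n_samples // hop_length
# Time between successive frames, in milliseconds.
frame_shift_ms = 1000 * hop_length / sampling_rate

print(n_samples)       # 480000
print(nb_max_frames)   # 3000
print(frame_shift_ms)  # 10.0

# Shape of a single log-Mel input as (Mel bins, frames).
input_shape = (feature_size, nb_max_frames)
```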



### Load WhisperTokenizer

In [5]:
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="Vietnamese", task="transcribe")

### Combine To Create A WhisperProcessor

In [6]:
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="Vietnamese", task="transcribe")

### Prepare Data

In [7]:
def prepare_dataset(batch):
    # load the decoded audio; no resampling happens here, so the audio must already be at the 16kHz the feature extractor expects
    audio = batch["audio"]

    # compute log-Mel input features from input audio array 
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids 
    batch["labels"] = tokenizer(batch["transcription"]).input_ids
    return batch

In [8]:
final_dataset = final_dataset.map(prepare_dataset, remove_columns=final_dataset.column_names["train"], num_proc=1)

In [9]:
print(final_dataset)

DatasetDict({
    train: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 2364
    })
    test: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 592
    })
})


## Training and Evaluation

### Define a Data Collator

In [10]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if a bos token was prepended in the previous tokenization step,
        # cut it here as it is prepended again later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

In [11]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
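
To make the label handling in the collator concrete, here is a plain-Python illustration of the same idea on toy token IDs. It is a sketch, not the collator itself: the helper `pad_and_mask_labels` and the token IDs are hypothetical, and lists replace tensors. The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, so padded positions do not contribute to the loss.

```python
from typing import List

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def pad_and_mask_labels(label_seqs: List[List[int]], bos_token_id: int) -> List[List[int]]:
    """Pad label sequences to equal length with -100 and drop a shared leading BOS."""
    max_len = max(len(seq) for seq in label_seqs)
    # Padding positions are filled with -100 so the loss ignores them.
    padded = [seq + [IGNORE_INDEX] * (max_len - len(seq)) for seq in label_seqs]
    # If every sequence starts with BOS, strip it, mirroring the collator's check.
    if all(row[0] == bos_token_id for row in padded):
        padded = [row[1:] for row in padded]
    return padded


# Toy example: 1 plays the role of the BOS token ID (hypothetical value).
labels = pad_and_mask_labels([[1, 5, 6, 7], [1, 8]], bos_token_id=1)
print(labels)  # [[5, 6, 7], [8, -100, -100]]
```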

### Evaluation Metrics

In [12]:
import evaluate

metric = evaluate.load("wer")


In [13]:
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
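
For intuition about the metric, word error rate is the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words. The sketch below scores a single sentence pair in plain Python. The helper `word_error_rate` is a hypothetical illustration, not the implementation behind `evaluate.load("wer")`, and the example sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "xin chào các bạn"
print(100 * word_error_rate(reference, "xin chào bạn"))  # 25.0
print(100 * word_error_rate(reference, "xin chào các bạn bạn bạn bạn bạn bạn"))  # 125.0
```

The second call shows that WER is not capped at 100%: when the hypothesis contains many extra words, insertions alone can outnumber the reference words.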

### Load a Pre-Trained Checkpoint

In [14]:
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

In [15]:
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

### Define the Training Configuration

In the final step, we define all the parameters related to training. For more detail on the training arguments, refer to the Seq2SeqTrainingArguments [docs](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainingArguments).

In [18]:
# ! pip install -U accelerate
# ! pip install -U transformers

In [16]:
import accelerate
import transformers

transformers.__version__, accelerate.__version__

('4.37.0.dev0', '0.25.0')

In [17]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="ZHProject23/whisper-small-vn",  
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5, # 1e-5
    warmup_steps=8,
    max_steps=148,
    gradient_checkpointing=True,
    # fp16=True,
    # fp16_full_eval=False,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=2,
    eval_steps=2, # run evaluation every 2 training steps
    logging_steps=1, # 25
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True, # set to True to upload model checkpoints to the Hugging Face Hub
)

**Note**: If you do not want to upload the model checkpoints to the Hub,
set `push_to_hub=False`.

In [18]:
print(len(final_dataset["train"]))
print(len(final_dataset["test"]))

2364
592
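
The dataset sizes above, combined with the training arguments, determine how long one epoch is and how often evaluation runs. The following is a back-of-the-envelope sketch, assuming a single GPU (`num_devices = 1` is an assumption; the other values are copied from the configuration above).

```python
import math

num_train_examples = 2364
num_eval_examples = 592
per_device_train_batch_size = 16
per_device_eval_batch_size = 8
gradient_accumulation_steps = 1
num_devices = 1  # assumption: training on a single GPU
max_steps = 148
eval_steps = 2

# Number of examples consumed per optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
# Optimizer steps needed to see every training example once.
steps_per_epoch = math.ceil(num_train_examples / effective_batch_size)
epochs = max_steps / steps_per_epoch

# How many evaluation passes run, and how many batches each one processes.
num_evaluations = max_steps // eval_steps
eval_batches_per_pass = math.ceil(num_eval_examples / per_device_eval_batch_size)

print(effective_batch_size, steps_per_epoch, epochs)  # 16 148 1.0
print(num_evaluations, eval_batches_per_pass)         # 74 74
```

With `eval_steps=2`, the full test set is decoded 74 times during a single epoch, which likely accounts for a large share of the total runtime. Raising `eval_steps` and `save_steps` would reduce that overhead.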


In [19]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=final_dataset["train"],
    eval_dataset=final_dataset["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)

In [20]:
processor.save_pretrained(training_args.output_dir)

### Training

In [21]:
trainer.train()

`use_cache = True` is incompatible with gradient checkpointing. Setting `use_cache = False`...


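| Step | Training Loss | Validation Loss | WER (%) |
|------|---------------|-----------------|---------|
| 2 | 4.5043 | 4.463892 | 33.695652 |
| 4 | 4.0539 | 3.797525 | 35.978261 |
| 6 | 3.3205 | 3.008419 | 37.210145 |
| 8 | 2.7077 | 2.55547 | 37.971014 |
| 10 | 2.2203 | 2.205124 | 63.586957 |
| 12 | 2.1151 | 1.90064 | 132.922705 |
| 14 | 2.0148 | 1.612193 | 134.722222 |
| 16 | 1.2862 | 1.324448 | 144.082126 |
| 18 | 1.207 | 1.098434 | 110.857488 |
| 20 | 1.146 | 1.019575 | 105.205314 |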


Checkpoint destination directory ZHProject23/whisper-small-vn\checkpoint-2 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ZHProject23/whisper-small-vn\checkpoint-4 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ZHProject23/whisper-small-vn\checkpoint-6 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ZHProject23/whisper-small-vn\checkpoint-8 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ZHProject23/whisper-small-vn\checkpoint-10 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ZHProject23/whisper-small-vn\checkpoint-12 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory

TrainOutput(global_step=148, training_loss=0.7472315674295297, metrics={'train_runtime': 51493.4944, 'train_samples_per_second': 0.046, 'train_steps_per_second': 0.003, 'total_flos': 6.8221588635648e+17, 'train_loss': 0.7472315674295297, 'epoch': 1.0})
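
The throughput figures in `TrainOutput` can be reproduced from the runtime and step count. This is a short worked example using the values printed above together with the per-device batch size of 16 from the training arguments.

```python
train_runtime = 51493.4944  # seconds, from TrainOutput
global_step = 148
train_batch_size = 16       # per_device_train_batch_size from the training arguments

hours = train_runtime / 3600
seconds_per_step = train_runtime / global_step
samples_per_second = global_step * train_batch_size / train_runtime
steps_per_second = global_step / train_runtime

print(round(hours, 1))               # 14.3
print(round(seconds_per_step, 1))    # 347.9
print(round(samples_per_second, 3))  # 0.046
print(round(steps_per_second, 3))    # 0.003
```

The reported runtime also covers the evaluation passes triggered every `eval_steps` steps, so the time per step is not purely training time.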

In [22]:
kwargs = {
    # "dataset_tags": "vietnamese_custom_asr_corpus",
    "dataset": "Vietnamese ASR Custom Corpus",
    "dataset_args": "config: vi, split: test",
    "language": "vi",
    "model_name": "Whisper Small Vietnamese",
    "finetuned_from": "openai/whisper-small",
    "tasks": "automatic-speech-recognition",
    "tags": "hf-asr-leaderboard",
}


In [23]:
trainer.save_model() 
trainer.push_to_hub(**kwargs) 

'https://huggingface.co/ZHProject23/whisper-small-vn/tree/main/'

## Building a Demo

In [24]:
from transformers import pipeline
import gradio as gr

pipe = pipeline(model="ZHProject23/whisper-small-vn")

def transcribe(audio):
    text = pipe(audio)["text"]
    return text

iface = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(type="filepath"),
    outputs="text",
    title="Whisper Small Vietnamese",
    description="Realtime demo for Vietnamese speech recognition using a fine-tuned Whisper Small model.",
)

# To create a public link, set `share=True` in `launch()`
iface.launch()


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Running on local URL:  http://127.0.0.1:7863

To create a public link, set `share=True` in `launch()`.




In this blog, we covered a step-by-step guide on fine-tuning Whisper for multilingual ASR
using 🤗 Datasets, Transformers and the Hugging Face Hub. For more details on the Whisper model, the Common Voice dataset and the theory behind fine-tuning, refer to the accompanying [blog post](https://huggingface.co/blog/fine-tune-whisper). If you're interested in fine-tuning other
Transformers models, both for English and multilingual ASR, be sure to check out the
example scripts at [examples/pytorch/speech-recognition](https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-recognition).