## Finetune your own Speech-to-Text Whisper model on the language of your choice on a GPU, for free!

### Set up the GPU
First, you'll need to enable GPUs for the notebook:
1. Navigate to Edit→Notebook Settings.
2. Select T4 GPU from the Hardware Accelerator section.
3. Click Save and accept.

Next, we'll confirm that we can connect to the GPU:

In [None]:
import torch

if not torch.cuda.is_available():
    print("GPU NOT available!")
else:
    print("GPU is available!")
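
Beyond a yes/no check, it can help to know which GPU you were assigned and how much memory it has, since that limits the batch size you can use later. The sketch below is one way to print that; `describe_gpu` is a hypothetical helper name, not part of the package, and the import is guarded only so the cell fails gently outside Colab.

```python
def describe_gpu() -> str:
    """Return a one-line summary of the first visible CUDA device."""
    try:
        import torch  # preinstalled on Colab; guarded for other environments
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "No CUDA GPU visible"
    # Device 0 is the only GPU on a standard Colab runtime (assumption).
    props = torch.cuda.get_device_properties(0)
    total_gib = props.total_memory / 1024**3
    return f"{props.name}: {total_gib:.1f} GiB"


print(describe_gpu())
```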

### Set up and log in to Hugging Face

The dataset we use for finetuning is Mozilla's [Common Voice](https://commonvoice.mozilla.org/).

We use the Hugging Face (HF) platform to download the Common Voice dataset, track training and evaluation metrics during finetuning, and save your final model so you can use it and share it with others later. Before starting, make sure you:
1. have an HF [account](https://huggingface.co/join)
2. have set up a [personal access token](https://huggingface.co/settings/tokens), with write access if you plan to push the model to the Hub
3. log in to Hugging Face in this notebook by running the command below and entering your token


In [None]:
!huggingface-cli login

### Download and install the speech-to-text-finetune package

In [None]:
!git clone https://github.com/mozilla-ai/speech-to-text-finetune.git

In [None]:
%cd speech-to-text-finetune/

In [None]:
!pip install --quiet -e .

***IMPORTANT:*** After installing the package, you need to restart the kernel/session (Runtime→Restart session) and then run the cells below.

In [None]:
# After restarting the session, you need to change directory again.
# The comment sits on its own line so it is not passed to %cd as part of the path.
%cd speech-to-text-finetune/

In [None]:
from speech_to_text_finetune.finetune_whisper import run_finetuning

**NOTE**: Certain "high-resource" languages like English or French have very large datasets (over 50 GB), which can fill up your disk quickly. Make sure you have enough free storage before choosing a Common Voice language and finetuning on it.
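
A quick way to check the free space before downloading is sketched below, using only the standard library. The helper names and the 50 GB threshold are illustrative assumptions based on the note above; the actual dataset size depends on the language you pick.

```python
import shutil


def free_disk_gb(path: str = ".") -> float:
    """Return free space in GB (10**9 bytes) on the filesystem containing path."""
    return shutil.disk_usage(path).free / 1e9


def has_room_for(required_gb: float, path: str = ".") -> bool:
    """Return True if at least required_gb of free space is available."""
    return free_disk_gb(path) >= required_gb


required_gb = 50  # illustrative threshold, not a measured dataset size
print(f"Free disk space: {free_disk_gb():.1f} GB")
if not has_room_for(required_gb):
    print(f"Less than {required_gb} GB free; consider a smaller language dataset.")
```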

In [None]:
# @title Finetuning configuration and hyperparameter setting
import yaml


def save_to_yaml(filename="config.yaml"):
    """Write the module-level `cfg` dict (defined further down in this cell) to a YAML file."""
    with open(filename, "w") as file:
        yaml.dump(cfg, file)


model_id = "openai/whisper-small"  # @param ["openai/whisper-tiny", "openai/whisper-small", "openai/whisper-medium","openai/whisper-large-v3"]
dataset_id = "mozilla-foundation/common_voice_17_0"  # @param {type: "string"}
dataset_source = "HF"  # @param ["HF", "local"]
language = "Hindi"  # @param {type: "string"}
repo_name = "default"  # @param {type: "string"}
push_to_hub = True  # @param {type: 'boolean'}
hub_private_repo = True  # @param {type: 'boolean'}
max_steps = 50  # @param {type: "slider", min: 1, max: 3000, step: 10}
per_device_train_batch_size = 32  # @param {type: "slider", min: 1, max: 300}
gradient_accumulation_steps = 1  # @param {type: "slider", min: 1, max: 10}
warmup_steps = 50  # @param {type: "slider", min: 0, max: 500}
gradient_checkpointing = True  # @param {type: 'boolean'}
fp16 = True  # @param {type: 'boolean'}
per_device_eval_batch_size = 8  # @param {type: "slider", min: 1, max: 200}
save_steps = 5  # @param {type: "slider", min: 1, max: 500}
logging_steps = 5  # @param {type: "slider", min: 1, max: 500}
load_best_model_at_end = True  # @param {type: 'boolean'}

cfg = {
    "model_id": model_id,
    "dataset_id": dataset_id,
    "dataset_source": dataset_source,
    "language": language,
    "repo_name": repo_name,
    "training_hp": {
        "push_to_hub": push_to_hub,
        "hub_private_repo": hub_private_repo,
        "max_steps": max_steps,
        "per_device_train_batch_size": per_device_train_batch_size,
        "gradient_accumulation_steps": gradient_accumulation_steps,
        "learning_rate": 1e-5,
        "warmup_steps": warmup_steps,
        "gradient_checkpointing": gradient_checkpointing,
        "fp16": fp16,
        "eval_strategy": "steps",
        "per_device_eval_batch_size": per_device_eval_batch_size,
        "predict_with_generate": True,
        "generation_max_length": 225,
        "save_steps": save_steps,
        "logging_steps": logging_steps,
        "load_best_model_at_end": load_best_model_at_end,
        "save_total_limit": 1,
        "metric_for_best_model": "wer",
        "greater_is_better": False,
    },
}

save_to_yaml()
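
To get a feel for what these hyperparameters imply, the sketch below derives a few quantities from the `training_hp` values: the effective batch size, the number of training samples seen, the fraction of the run spent in warmup, and how many times a checkpoint save is triggered. `summarize_schedule` is a hypothetical helper, not part of the package, and it assumes a single GPU and step-based saving.

```python
def summarize_schedule(hp: dict, num_devices: int = 1) -> dict:
    """Derive simple schedule figures from a training_hp-style dict."""
    effective_batch = (
        hp["per_device_train_batch_size"]
        * hp["gradient_accumulation_steps"]
        * num_devices
    )
    return {
        # Samples contributing to each optimizer step.
        "effective_batch_size": effective_batch,
        # Total samples processed over the whole run.
        "samples_seen": effective_batch * hp["max_steps"],
        # Share of the run spent in learning-rate warmup, capped at 1.0.
        "warmup_fraction": min(1.0, hp["warmup_steps"] / hp["max_steps"]),
        # Number of times a step-based save is triggered.
        "save_events": hp["max_steps"] // hp["save_steps"],
    }


# The default values from the configuration cell above.
example_hp = {
    "per_device_train_batch_size": 32,
    "gradient_accumulation_steps": 1,
    "max_steps": 50,
    "warmup_steps": 50,
    "save_steps": 5,
}
print(summarize_schedule(example_hp))
```

With the defaults, warmup covers the entire 50-step run, so the learning rate is still ramping up when training stops. For a longer run you may want `max_steps` to be well above `warmup_steps`.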

### Start finetuning job

Note that this might take a while: anywhere from 10 minutes to 10 hours, depending on your model choice and hyperparameter configuration.

In [None]:
run_finetuning(config_path="config.yaml")
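
The configuration selects the best checkpoint by `wer` (word error rate) with `greater_is_better` set to `False`, so lower is better. As a rough illustration of what that metric measures, here is a minimal pure-Python sketch: the word-level edit distance between a reference transcript and a model transcript, divided by the number of reference words. This is not necessarily how the package computes it (it may normalize text or scale the value differently), and `word_error_rate` is a hypothetical name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, the value can exceed 1.0 when the model outputs many extra words.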