# Fine-tuning Whisper

This notebook fine-tunes Whisper on French. Out of the box, the default multilingual Whisper model seems to perform rather poorly on French.

This notebook roughly follows [this blog post](https://huggingface.co/blog/fine-tune-whisper).

**Goal**: Fine-tune `whisper-tiny` to have medium to high performance on French-language input *without* timestamps.

In [1]:
!pip install --upgrade pip
# jiwer is used for the word error rate (WER) metric
!pip install --upgrade datasets[audio] transformers evaluate jiwer

Collecting pip
  Downloading pip-25.0.1-py3-none-any.whl.metadata (3.7 kB)
Downloading pip-25.0.1-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
Successfully installed pip-25.0.1
Collecting transformers
  Downloading transformers-4.49.0-py3-none-any.whl.metadata (44 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting jiwer
  Downloading jiwer-3.1.0-py3-none-any.whl.metadata (2.6 kB)
Collecting datasets[audio]
  Downloading datasets-3.3.2-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets[audio])
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets[audio])
  Downloading xxhash-3.5.0-cp311-cp311-manyl

In [2]:
!pip install pyspellchecker==0.8.1

Collecting pyspellchecker==0.8.1
  Downloading pyspellchecker-0.8.1-py3-none-any.whl.metadata (9.4 kB)
Downloading pyspellchecker-0.8.1-py3-none-any.whl (6.8 MB)
Installing collected packages: pyspellchecker
Successfully installed pyspellchecker-0.8.1


In [3]:
import wandb
# See https://discuss.huggingface.co/t/how-to-turn-wandb-off-in-trainer/6237/10
wandb.init(mode='disabled')

In [4]:
from pathlib import Path

checkpoint_remote_path = Path('./final-checkpoints').resolve()
def connect_to_google_drive():
    """ Connects to Google Drive and configures the notebook to upload final
        checkpoints. """
    from google.colab import drive

    drive.mount('/content/drive')
    return Path('/content/drive/My Drive') / 'whisper' / 'checkpoints'

# Optional:
#checkpoint_remote_path = connect_to_google_drive()

In [5]:
if not checkpoint_remote_path.parent.exists():
    checkpoint_remote_path.parent.mkdir(parents=True)

In [6]:
checkpoint_path = Path('./whisper/checkpoints').resolve()

In [7]:
import shutil


## Load data

The [Common Voice dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) will be used to fine-tune Whisper.

To speed up processing later on, we download the full dataset at once (`streaming=False`). The initial download may take some time.

In [8]:
from datasets import load_dataset, IterableDatasetDict

common_voice_data_raw = IterableDatasetDict()

dataset_id = 'mozilla-foundation/common_voice_11_0'
common_voice_data_raw['train'] = load_dataset(dataset_id, 'fr', split='train', streaming=False).to_iterable_dataset()
print("Loaded training data. Loading test data:")
common_voice_data_raw['test'] = load_dataset(dataset_id, 'fr', split='test', streaming=False).to_iterable_dataset()

# Preview it
common_voice_data_raw

README.md:   0%|          | 0.00/14.4k [00:00<?, ?B/s]

common_voice_11_0.py:   0%|          | 0.00/8.13k [00:00<?, ?B/s]

languages.py:   0%|          | 0.00/3.44k [00:00<?, ?B/s]

release_stats.py:   0%|          | 0.00/60.9k [00:00<?, ?B/s]

The repository for mozilla-foundation/common_voice_11_0 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/mozilla-foundation/common_voice_11_0.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


n_shards.json:   0%|          | 0.00/12.2k [00:00<?, ?B/s]

fr_train_0.tar:   0%|          | 0.00/1.66G [00:00<?, ?B/s]

fr_train_1.tar:   0%|          | 0.00/1.59G [00:00<?, ?B/s]

fr_train_2.tar:   0%|          | 0.00/1.54G [00:00<?, ?B/s]

fr_train_3.tar:   0%|          | 0.00/1.53G [00:00<?, ?B/s]

fr_train_4.tar:   0%|          | 0.00/1.48G [00:00<?, ?B/s]

fr_train_5.tar:   0%|          | 0.00/1.49G [00:00<?, ?B/s]

fr_train_6.tar:   0%|          | 0.00/1.47G [00:00<?, ?B/s]

fr_train_7.tar:   0%|          | 0.00/1.46G [00:00<?, ?B/s]

fr_train_8.tar:   0%|          | 0.00/1.44G [00:00<?, ?B/s]

fr_train_9.tar:   0%|          | 0.00/1.50G [00:00<?, ?B/s]

fr_train_10.tar:   0%|          | 0.00/1.51G [00:00<?, ?B/s]

fr_train_11.tar:   0%|          | 0.00/1.80G [00:00<?, ?B/s]

fr_train_12.tar:   0%|          | 0.00/168M [00:00<?, ?B/s]

fr_dev_0.tar:   0%|          | 0.00/702M [00:00<?, ?B/s]

fr_test_0.tar:   0%|          | 0.00/714M [00:00<?, ?B/s]

fr_other_0.tar:   0%|          | 0.00/478M [00:00<?, ?B/s]

fr_invalidated_0.tar:   0%|          | 0.00/1.80G [00:00<?, ?B/s]

fr_invalidated_1.tar:   0%|          | 0.00/652M [00:00<?, ?B/s]

train.tsv:   0%|          | 0.00/125M [00:00<?, ?B/s]

dev.tsv:   0%|          | 0.00/3.83M [00:00<?, ?B/s]

test.tsv:   0%|          | 0.00/3.81M [00:00<?, ?B/s]

other.tsv:   0%|          | 0.00/3.68M [00:00<?, ?B/s]

invalidated.tsv:   0%|          | 0.00/14.4M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]



Generating validation split: 0 examples [00:00, ? examples/s]


Reading metadata...: 16089it [00:00, 55621.53it/s]


Generating test split: 0 examples [00:00, ? examples/s]


Reading metadata...: 16089it [00:00, 86008.58it/s]


Generating other split: 0 examples [00:00, ? examples/s]


Reading metadata...: 14359it [00:00, 76923.58it/s]


Generating invalidated split: 0 examples [00:00, ? examples/s]


Reading metadata...: 57607it [00:00, 81175.27it/s]


Loaded training data. Loading test data:


IterableDatasetDict({
    train: IterableDataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
        num_shards: 1
    })
    test: IterableDataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
        num_shards: 1
    })
})

Not all data columns will be used. Let's remove the unused ones:



In [9]:
common_voice_data_raw = common_voice_data_raw.remove_columns(['accent', 'age', 'client_id', 'locale', 'segment', 'gender', 'up_votes', 'down_votes', 'path'])

common_voice_data_raw

IterableDatasetDict({
    train: IterableDataset({
        features: ['audio', 'sentence'],
        num_shards: 1
    })
    test: IterableDataset({
        features: ['audio', 'sentence'],
        num_shards: 1
    })
})

In [10]:
from datasets import Audio

common_voice_data = common_voice_data_raw.cast_column('audio', Audio(sampling_rate=16_000))

The GGML conversion script has trouble with some characters (e.g. the combining acute accent, `\u0301`). For now, replace these characters early so they won't appear in the updated vocabulary:



In [11]:
# Normalize text
import unicodedata, re

def normalize_text(text: str):
    replacements = [
        ['’', '\''],
        ['‘', '\''],
        ['́a', 'á'], # Convert from two-character á to one-character á
        ['́u', 'ú'],
        ['́e', 'é'],
        ['̀e', 'è'],
        ['̀a', 'à'],
        # Some characters don't work with the GGML conversion script:
        ['œ', 'oe'],
        ['́', '\''],
        ['̂', '\''],
        ['̀', '\''],
        ['—', '--'],
        ['…', '...'],
        ['の', ''],
    ]
    for [orig, replace] in replacements:
        text = text.replace(orig, replace)

    return text
def normalize_texts(batch):
    return { 'sentence': [
        normalize_text(text) for text in batch
    ] }
common_voice_data = common_voice_data.map(normalize_texts, batched=True, input_columns=['sentence'])

print(next(iter(common_voice_data['train'])))

{'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/4d487dff50c7da77bd8812384dcadeddf7eece27dd93d909d5c67e4752f45c01/fr_train_0/common_voice_fr_29111041.mp3', 'array': array([1.45519152e-10, 1.45519152e-10, 8.73114914e-11, ...,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00]), 'sampling_rate': 16000}, 'sentence': 'Il est dissous à Trèves.'}
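As an alternative to listing accent pairs by hand, Python's standard `unicodedata.normalize` can compose a base letter followed by a combining mark into a single precomposed character. The following is a minimal sketch and not part of the original pipeline: `compose_accents` is a hypothetical helper name, and whether its output avoids the GGML conversion problem is an assumption that has not been checked here.

```python
import unicodedata

def compose_accents(text: str) -> str:
    """Compose base letter + combining mark pairs (Unicode NFC form)."""
    return unicodedata.normalize('NFC', text)

# 'e' followed by U+0300 (combining grave accent): two code points for 'è'
decomposed = 'Tre\u0300ves'
composed = compose_accents(decomposed)

# The composed string is one code point shorter and uses U+00E8 for 'è'
print(len(decomposed), len(composed))
print(composed == 'Tr\u00e8ves')
```

Combining marks that have no precomposed form for their base letter are left in place by NFC, so a fallback like the replacement table above may still be needed for those.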


## Inspecting a sample

Let's check that the expected columns are still present in the training data:

In [12]:
sample = next(iter(common_voice_data['train']))

In [13]:
sample

{'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/4d487dff50c7da77bd8812384dcadeddf7eece27dd93d909d5c67e4752f45c01/fr_train_0/common_voice_fr_29111041.mp3',
  'array': array([1.45519152e-10, 1.45519152e-10, 8.73114914e-11, ...,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00]),
  'sampling_rate': 16000},
 'sentence': 'Il est dissous à Trèves.'}

## Create the feature extractor and tokenizer

We'll be fine-tuning the `openai/whisper-tiny` model. Here, the feature extractor and tokenizer for this model are fetched from Hugging Face:

In [14]:
from transformers import WhisperFeatureExtractor, WhisperTokenizer

feature_extractor = WhisperFeatureExtractor.from_pretrained('openai/whisper-tiny', language='french', task='transcribe')
tokenizer_original = WhisperTokenizer.from_pretrained('openai/whisper-tiny', language='french', task='transcribe')

preprocessor_config.json:   0%|          | 0.00/185k [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/283k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/836k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.48M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/494k [00:00<?, ?B/s]

normalizer.json:   0%|          | 0.00/52.7k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/34.6k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.19k [00:00<?, ?B/s]

We'll create a customized tokenizer based on `tokenizer_original` in the next section.

## Vocabulary adjustments

**Note**: Adjusting the vocabulary makes training Whisper a bit more difficult. Consider skipping this section.

At present, this notebook only supports fine-tuning languages supported by the upstream Whisper project.

It may be possible to get better accuracy by customizing the vocabulary. One way to do this might be with the (very slow) `tokenizer.train_new_from_iterator` function. For example, with something similar to the following:
```python
def sentence_data_generator():
    """ Outputs *just* the batched string data from common_voice_data """
    sentences = common_voice_data['train'].select_columns(['sentence'])
    for samples in sentences.iter(batch_size=500):
        # Yields a list of all sentences in the batch
        yield samples['sentence']

text_data = sentence_data_generator()
print(next(text_data))

# 50257 is the size of whisper-tiny's default vocabulary
tokenizer = tokenizer_original.train_new_from_iterator(text_data, 50257)
```

Changing the vocabulary like this may also increase the time needed to train the model.

For now, we demonstrate replacing unused/unwanted tokens with ones that might be more useful and reloading the tokenizer:

In [15]:
# Step 1: Save the vocabulary to a file
tokenizer_directory = Path('whisper-default-tokenizer')
tokenizer_original.save_pretrained(tokenizer_directory)


('whisper-default-tokenizer/tokenizer_config.json',
 'whisper-default-tokenizer/special_tokens_map.json',
 'whisper-default-tokenizer/vocab.json',
 'whisper-default-tokenizer/merges.txt',
 'whisper-default-tokenizer/normalizer.json',
 'whisper-default-tokenizer/added_tokens.json')

Now that the tokenizer is saved in `tokenizer_directory`, we can load `tokenizer_directory/vocab.json` and modify it:

In [16]:
# Step 2: Get vocab.json
import json

def json_from_path(path: Path):
    with open(path, 'r', encoding='utf-8') as f:
        return json.loads(f.read())

vocab = json_from_path(tokenizer_directory / 'vocab.json')

In [17]:
# Step 3: Find some words we can definitely remove
from spellchecker import SpellChecker

english_checker = SpellChecker(language='en')
french_checker = SpellChecker(language='fr')
def is_known_word(spell_checker, word: str):
    """ Returns true if the `spell_checker` thinks `word` is spelled correctly.
        Changing the `spell_checker` changes which words are considered correct.
    """
    return len(spell_checker.unknown([word.lower()])) == 0

def is_english_only_word(word: str):
    """ Returns true if `word` is an English word, but not a French word """
    is_english = is_known_word(english_checker, word)
    is_french = is_known_word(french_checker, word)
    return is_english and not is_french

print('The is_english_only_word function should return True is a word is spelled correctly in English, but not in French:', is_english_only_word('testing'))

# This character marks the beginning of a word in vocab.json
word_start_char = 'Ġ'
replacable_keys = []

def mark_english_only_words():
    """ Marks all English-only words are replacable """
    for key in vocab:
        if not key.startswith(word_start_char):
            continue

        # Skip short words, as they're more likely to be prefixes of French words, too.
        if len(key) <= 4:
            continue
        word = key[1:]
        if is_english_only_word(word):
            replacable_keys.append(key)

mark_english_only_words()
replacable_keys[0:10]

The is_english_only_word function should return True is a word is spelled correctly in English, but not in French: True


['ĠABOUT',
 'ĠAIDS',
 'ĠANNOUNCER',
 'ĠAPPLAUSE',
 'ĠAbigail',
 'ĠAboriginal',
 'ĠAbout',
 'ĠAbove',
 'ĠAbsolutely',
 'ĠAcademic']

In [18]:
# Step 4: Collect information about French words
from collections import defaultdict
import re

NONWORD_REGEX = re.compile(r'[ \t?.,;!()/\-«»]+')
def split_by_word(text: str):
    """ Splits the given `text` into words. Returns a list of those words. """
    return NONWORD_REGEX.split(text)

def build_word_counts():
    """ Builds a map from certain words to the number of times they appear.
        This map will not include all words in the training set.
    """
    # Constants: Ignore short words
    min_word_length = 3
    max_sentences_to_process = 7_000 # Don't process more than roughly this number of sentences

    # Output
    word_counts = defaultdict(lambda: 0)

    sentences = common_voice_data['train'].select_columns(['sentence'])
    sentences_processed = 0
    for column in sentences.iter(batch_size=100):
        batch_sentences = column['sentence']
        for sentence in batch_sentences:
            for word in split_by_word(sentence):
                if len(word) >= min_word_length:
                    word_counts[word.lower()] += 1
            sentences_processed += 1

        if sentences_processed > max_sentences_to_process:
            break
    return word_counts

word_counts = build_word_counts()
# Sort by occurrences
def get_val(pair):
    (key, val) = pair
    return val
most_common_words = sorted(word_counts.items(), key=get_val, reverse=True)

In [19]:
most_common_words[0:10]

[('est', 1605),
 ('les', 1125),
 ('des', 943),
 ('dans', 587),
 ('une', 586),
 ('elle', 502),
 ('par', 463),
 ('pour', 447),
 ('son', 410),
 ('sont', 367)]
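The counting step can also be written with `collections.Counter`. The sketch below is a simplified variant, assuming the sentences are already available as an in-memory list of strings; `count_words` and `toy_sentences` are hypothetical names used only for illustration.

```python
import re
from collections import Counter

# Same separator pattern as NONWORD_REGEX in the cell above
NONWORD = re.compile(r'[ \t?.,;!()/\-«»]+')

def count_words(sentences, min_word_length=3):
    """Count lowercased words that have at least `min_word_length` characters."""
    counts = Counter()
    for sentence in sentences:
        for word in NONWORD.split(sentence):
            if len(word) >= min_word_length:
                counts[word.lower()] += 1
    return counts

# Hypothetical sample sentences, for illustration only
toy_sentences = ['Il est dissous à Trèves.', 'Elle est dans une maison.']
toy_counts = count_words(toy_sentences)
print(toy_counts.most_common(1))
```

Because the separator pattern does not include the apostrophe, an elided form such as `l'histoire` is counted as a single word.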

In [20]:
# Step 5: Replace!
next_replacement_idx = 0
new_vocab = dict(vocab)
replaced_keys = set()

for key in replacable_keys:
    if next_replacement_idx >= len(most_common_words):
        # Out of words to replace with
        break
    (replacement,count) = most_common_words[next_replacement_idx]
    next_replacement_idx += 1
    new_key = word_start_char + replacement
    # Don't map multiple keys to the same token value
    if new_key in new_vocab:
        continue
    # Don't add uncommon words
    if count <= 2:
        continue

    # Replace [key] with [new_key]
    token_value = new_vocab[key]
    del new_vocab[key]
    new_vocab[new_key] = token_value
    replaced_keys.add(key)

print("Made {} replacements".format(len(replaced_keys)))

new_merges = []
with open(tokenizer_directory / 'merges.txt', 'r', encoding='utf-8') as merges:
    for line in merges.readlines():
        if len(line) == 0:
            continue
        words = split_by_word(line)
        if not (words[0] in replaced_keys):
            new_merges.append(line.strip())

Made 2158 replacements
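The key idea of the replacement step is that a token keeps its numeric ID and only its text changes. Here is a minimal sketch on a toy vocabulary. All token IDs and the `swap_keys` helper are hypothetical. Unlike the cell above, this variant keeps trying further candidate words for the same key instead of skipping the key.

```python
# Toy vocabulary: token text -> token ID (hypothetical values)
toy_vocab = {'ĠAbout': 10, 'ĠAbove': 11, 'Ġest': 12}
toy_replaceable = ['ĠAbout', 'ĠAbove']
toy_common_words = [('est', 1605), ('les', 1125)]

def swap_keys(vocab, replaceable, common_words, min_count=3):
    """Return a copy of `vocab` in which replaceable keys are renamed to
    common words, leaving every token ID unchanged."""
    new_vocab = dict(vocab)
    candidates = iter(common_words)  # shared, so each word is used at most once
    for old_key in replaceable:
        for word, count in candidates:
            new_key = 'Ġ' + word
            # Skip words that already have a token or are too rare
            if new_key not in new_vocab and count >= min_count:
                new_vocab[new_key] = new_vocab.pop(old_key)
                break
    return new_vocab

swapped = swap_keys(toy_vocab, toy_replaceable, toy_common_words)
print(swapped)
```

In this toy run, `Ġest` already exists and is skipped, so `ĠAbout` is renamed to `Ġles` with ID 10, while `ĠAbove` stays as it is because no candidate words remain.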


Great! We now have a vocabulary file in which some English-only tokens have been replaced with common French words. Let's write it to disk and load it:

In [21]:
# Write to a file
tokenizer_fr_directory = Path('updated-tokenizer')
if tokenizer_fr_directory.exists():
    shutil.rmtree(tokenizer_fr_directory)
shutil.copytree(tokenizer_directory, tokenizer_fr_directory)
with open(tokenizer_fr_directory / 'vocab.json', 'w', encoding='utf-8') as f:
    json.dump(new_vocab, f, ensure_ascii=False)


with open(tokenizer_fr_directory / 'merges.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(new_merges))

In [22]:
from transformers import WhisperTokenizer

# Use a normal WhisperTokenizer -- WhisperTokenizerFast has trouble with the updated
# vocabulary.
tokenizer = WhisperTokenizer(
    tokenizer_fr_directory / 'vocab.json',
    tokenizer_fr_directory / 'merges.txt',
    tokenizer_fr_directory / 'normalizer.json',
    bos_token='<|startoftranscript|>',
    unk_token='',
    pad_token='<|endoftext|>',
    language='french',
    task='transcribe',
)

# See https://discuss.huggingface.co/t/fine-tuning-whisper-on-my-own-dataset-with-a-customized-tokenizer/25903
tokenizer.add_special_tokens(tokenizer_original.special_tokens_map)

105

In [23]:
# For debugging, update the output directory
shutil.rmtree(tokenizer_fr_directory)
tokenizer.save_pretrained(tokenizer_fr_directory)

('updated-tokenizer/tokenizer_config.json',
 'updated-tokenizer/special_tokens_map.json',
 'updated-tokenizer/vocab.json',
 'updated-tokenizer/merges.txt',
 'updated-tokenizer/normalizer.json',
 'updated-tokenizer/added_tokens.json')

## Create the processor

Next, create a `WhisperProcessor`, which combines the feature extractor and the tokenizer.

In [24]:
from transformers import WhisperProcessor

processor = WhisperProcessor(feature_extractor, tokenizer)

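Use the processor to convert each sample into a format suitable for the model: the feature extractor turns the audio into input features, and the tokenizer turns the sentence into label token IDs: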

In [25]:
def map_sample(batch):
    audio_data = batch['audio']['array']
    audio_sample_rate = batch['audio']['sampling_rate']
    features = processor.feature_extractor(audio_data, sampling_rate=audio_sample_rate)

    batch['input_features'] = features.input_features[0]
    batch['labels'] = processor.tokenizer(batch['sentence']).input_ids
    return batch

# Remove columns no longer used
common_voice_data_original = common_voice_data # For debugging
common_voice_data = common_voice_data.map(map_sample, remove_columns=['audio', 'sentence'])
common_voice_data

IterableDatasetDict({
    train: IterableDataset({
        features: Unknown,
        num_shards: 1
    })
    test: IterableDataset({
        features: Unknown,
        num_shards: 1
    })
})

In [26]:
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained('openai/whisper-tiny')
model.generation_config.language = 'french'
model.generation_config.task = 'transcribe'
model.generation_config.forced_decoder_ids = None


config.json:   0%|          | 0.00/1.98k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/151M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/3.75k [00:00<?, ?B/s]

In [27]:
from dataclasses import dataclass
from typing import Any
import torch
# See the linked blog post and https://huggingface.co/docs/transformers/main_classes/data_collator

@dataclass
class DataCollatorWithPadding:
    ''' Converts raw data into a batch ready for the model '''
    processor: Any
    decoder_start_token_id: int

    def __call__(self, features: list) -> dict[str, torch.Tensor]:
        input_features = [{'input_features': f['input_features']} for f in features]
        label_features = [{'input_ids': f['labels']} for f in features]

        # According to the linked blog post, the input and label features need
        # to be padded separately (due to different final lengths), then
        # recombined:
        batch = self.processor.feature_extractor.pad(input_features, return_tensors='pt')

        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors='pt')

        # transformers uses -100 for masking
        labels = labels_batch['input_ids'].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # Don't double-prepend the beginning of sequence token:
        if (labels[:,0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch['labels'] = labels
        return batch

data_collator = DataCollatorWithPadding(processor=processor, decoder_start_token_id=model.config.decoder_start_token_id)
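To make the label handling in the collator easier to follow, here is a simplified sketch that uses plain Python lists instead of tensors. The token IDs are hypothetical, and `pad_and_mask` is an illustrative helper rather than part of `transformers`. The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, so padded positions do not contribute to the loss.

```python
BOS_ID = 1           # hypothetical start-of-sequence token ID
IGNORE_INDEX = -100  # positions with this label are ignored by the loss

def pad_and_mask(label_lists, bos_id=BOS_ID):
    """Pad label sequences to a common length with IGNORE_INDEX, then drop
    the leading start token if every sequence begins with it."""
    max_len = max(len(labels) for labels in label_lists)
    padded = [
        labels + [IGNORE_INDEX] * (max_len - len(labels))
        for labels in label_lists
    ]
    # The start token is prepended again later, so avoid having it twice
    if all(row[0] == bos_id for row in padded):
        padded = [row[1:] for row in padded]
    return padded

masked = pad_and_mask([[1, 5, 6, 7], [1, 8]])
print(masked)
```

The real collator pads with the tokenizer's pad token first and then uses the attention mask to overwrite those positions with `-100`. The sketch skips that intermediate step.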

# Viewing sample data

Let's look at some of the test data:

In [28]:
sample_data = next(iter(common_voice_data['test']))
sample_labels = sample_data['labels']

In [29]:
processor.decode(sample_labels)

"<|startoftranscript|><|fr|><|transcribe|><|notimestamps|>Ce dernier a évolué tout au long de l'histoire romaine.<|endoftext|>"

In [None]:
def run_on_sample_audio():
    """ Returns the (text) result of running the model on a single audio sample. """
    sample_audio = next(iter(common_voice_data_original['test']))['audio']
    inputs = processor(
        sample_audio['array'],
        sampling_rate=sample_audio['sampling_rate'],
        return_tensors='pt',
    )
    # Run generation on whichever device the model currently lives on
    # (the CPU before training, usually the GPU afterwards).
    input_features = inputs.input_features.to(model.device)
    generated_ids = model.generate(inputs=input_features)
    return processor.batch_decode(generated_ids)

In [None]:
print(run_on_sample_audio())

## Preparing an evaluation function


In [31]:
import evaluate

wer_metric = evaluate.load('wer')
cer_metric = evaluate.load('cer')

def compute_metrics(data):
    true_labels = data.label_ids
    predictions = data.predictions

    # Convert padding from HF
    true_labels[true_labels == -100] = processor.tokenizer.pad_token_id

    predicted_text = processor.batch_decode(predictions, skip_special_tokens=True)
    label_text = processor.batch_decode(true_labels, skip_special_tokens=True)

    wer = wer_metric.compute(predictions=predicted_text, references=label_text)
    cer = cer_metric.compute(predictions=predicted_text, references=label_text)
    return { 'wer': wer, 'cer': cer }


Downloading builder script:   0%|          | 0.00/4.49k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/5.60k [00:00<?, ?B/s]
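For intuition about the `wer` value returned by `compute_metrics`, the word error rate of a single sentence pair can be sketched in plain Python as the word-level edit distance divided by the number of reference words. This is an illustrative re-implementation, not the code used by `evaluate` or `jiwer`, and it assumes a non-empty reference; `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()

    # previous[j]: edit distance between the reference words processed so far
    # and the first j hypothesis words
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)

# One deleted word ('a') and one substituted word out of four reference words
example_wer = word_error_rate('ce dernier a évolué', 'ce dernier évolue')
print(example_wer)
```

The character error rate (`cer`) follows the same idea, with characters in place of words.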

## Preparing training arguments

In [59]:
from transformers import Seq2SeqTrainingArguments

# TODO: Update this if you're planning to push the custom model to
# huggingface (ignore otherwise):
hub_model_id = 'personalizedrefrigerator/whisper-tiny-fr'

def make_training_args(max_steps: int):
    return Seq2SeqTrainingArguments(
        output_dir = checkpoint_path,
        per_device_train_batch_size = 16,
        gradient_accumulation_steps = 1,
        hub_model_id=hub_model_id,
        learning_rate=1e-5,
        max_steps=max_steps,
        gradient_checkpointing=True,
        logging_first_step=True,
        fp16=True,
        eval_strategy='steps',
        per_device_eval_batch_size=8,
        generation_max_length=256,
        predict_with_generate=True,
        save_steps=3000,
        eval_steps=1000,
        logging_steps=25,
        save_total_limit=1,
    )

In [33]:
small_eval_dataset = common_voice_data['test'].shuffle(seed=11).take(128)
large_eval_dataset = common_voice_data['test'].shuffle(seed=12).take(512)

In [60]:
from transformers import Seq2SeqTrainer

def make_trainer(max_steps: int = 12_000):
    return Seq2SeqTrainer(
        args=make_training_args(max_steps),
        model=model,
        train_dataset=common_voice_data['train'],
        eval_dataset=small_eval_dataset,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
        processing_class=processor.feature_extractor,
    )

trainer = make_trainer()

## Training and evaluation

In [35]:
trainer.evaluate(large_eval_dataset)

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


{'eval_loss': 1.8288989067077637,
 'eval_model_preparation_time': 0.0039,
 'eval_wer': 0.8389513108614233,
 'eval_cer': 0.4432802740278058,
 'eval_runtime': 107.7673,
 'eval_samples_per_second': 4.751,
 'eval_steps_per_second': 0.594}

In [36]:
trainer.train()

`use_cache = True` is incompatible with gradient checkpointing. Setting `use_cache = False`...


| Step | Training Loss | Validation Loss | Model Preparation Time | Wer | Cer |
|---|---|---|---|---|---|
| 1000 | 0.8478 | 1.090238 | 0.0039 | 0.593826 | 0.338562 |
| 2000 | 0.7015 | 1.028004 | 0.0039 | 0.484159 | 0.249804 |
| 3000 | 0.6217 | 1.004471 | 0.0039 | 0.447604 | 0.239869 |
| 4000 | 0.6719 | 0.982509 | 0.0039 | 0.454102 | 0.235948 |
| 5000 | 0.6555 | 0.953396 | 0.0039 | 0.451665 | 0.239216 |
| 6000 | 0.6421 | 0.915878 | 0.0039 | 0.428107 | 0.223791 |
| 7000 | 0.5723 | 0.903678 | 0.0039 | 0.432981 | 0.23098 |
| 8000 | 0.4799 | 0.901339 | 0.0039 | 0.943948 | 0.423791 |
| 9000 | 0.5777 | 0.897266 | 0.0039 | 0.429732 | 0.23268 |
| 10000 | 0.4862 | 0.881931 | 0.0039 | 0.405361 | 0.214641 |




TrainOutput(global_step=12000, training_loss=0.5791082464754581, metrics={'train_runtime': 8700.8374, 'train_samples_per_second': 22.067, 'train_steps_per_second': 1.379, 'total_flos': 4.72682594304e+18, 'train_loss': 0.5791082464754581, 'epoch': 1.0})
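As a sanity check, the throughput figures in `TrainOutput` can be reproduced from the training arguments. The sketch below assumes a single training device; `num_devices` is an assumption and is not stated anywhere in the notebook.

```python
per_device_train_batch_size = 16
gradient_accumulation_steps = 1
num_devices = 1              # assumption: training ran on a single GPU
max_steps = 12_000
train_runtime_s = 8700.8374  # 'train_runtime' from TrainOutput

# Number of training samples consumed over the whole run
samples_seen = (per_device_train_batch_size * gradient_accumulation_steps
                * num_devices * max_steps)
samples_per_second = round(samples_seen / train_runtime_s, 3)
steps_per_second = round(max_steps / train_runtime_s, 3)

print(samples_seen)
print(samples_per_second)
print(steps_per_second)
```

Under that assumption the run consumed 192,000 samples, and the computed rates agree with the reported `train_samples_per_second` (22.067) and `train_steps_per_second` (1.379).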

In [37]:
if checkpoint_remote_path.exists():
    shutil.rmtree(checkpoint_remote_path)
shutil.copytree(checkpoint_path, checkpoint_remote_path)

PosixPath('/content/final-checkpoints')

In [38]:
trainer.evaluate(large_eval_dataset)

{'eval_loss': 0.7990443110466003,
 'eval_model_preparation_time': 0.0039,
 'eval_wer': 0.5880149812734082,
 'eval_cer': 0.25999059708509636,
 'eval_runtime': 106.6601,
 'eval_samples_per_second': 4.8,
 'eval_steps_per_second': 0.6,
 'epoch': 1.0}

In [62]:
larger_eval_dataset = common_voice_data['test'].shuffle(seed=14).take(628)
trainer.evaluate(larger_eval_dataset)

KeyboardInterrupt: 

In [None]:
model_output_dir = Path('./final-model').resolve()
trainer.save_model(model_output_dir)
tokenizer.save_pretrained(model_output_dir)

In [None]:
print(run_on_sample_audio())

# Model conversion

Next, we need to convert the model into a format usable by Joplin. This next step converts the model from PyTorch to GGML.

In [41]:
!git clone https://github.com/openai/whisper whisper-github
!git clone https://github.com/ggerganov/whisper.cpp
!cd whisper.cpp && git checkout v1.7.4

Cloning into 'whisper-github'...
remote: Enumerating objects: 828, done.
remote: Counting objects: 100% (370/370), done.
remote: Compressing objects: 100% (69/69), done.
remote: Total 828 (delta 333), reused 301 (delta 301), pack-reused 458 (from 2)
Receiving objects: 100% (828/828), 8.26 MiB | 7.30 MiB/s, done.
Resolving deltas: 100% (496/496), done.
Cloning into 'whisper.cpp'...
remote: Enumerating objects: 15214, done.
remote: Counting objects: 100% (2410/2410), done.
remote: Compressing objects: 100% (375/375), done.
remote: Total 15214 (delta 2114), reused 2038 (delta 2035), pack-reused 12804 (from 4)
Receiving objects: 100% (15214/15214), 18.46 MiB | 16.05 MiB/s, done.
Resolving deltas: 100% (10486/10486), done.
Note: switching to 'v1.7.4'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If

In [42]:
# Patch convert-h5-to-ggml to work with more recent model versions
conversion_script_path = Path('whisper.cpp/models/convert-h5-to-ggml.py')
conversion_script_content = conversion_script_path.read_text()
with open(conversion_script_path, 'w') as conversion_script:
    bad_if_statement = 'if "max_length" not in hparams:'
    replaced_if_statement = 'if "max_length" not in hparams or hparams["max_length"] == None:'
    conversion_script.write(conversion_script_content.replace(bad_if_statement, replaced_if_statement))

In [43]:
!mkdir ./ggml
!python whisper.cpp/models/convert-h5-to-ggml.py ./final-model ./whisper-github ./ggml

2025-02-26 03:23:09.709961: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1740540189.943872   46281 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1740540190.005939   46281 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
model.encoder.conv1.weight  ->  encoder.conv1.weight
encoder.conv1.weight 3 (384, 80, 3)
model.encoder.conv1.bias  ->  encoder.conv1.bias
  Reshaped variable:  encoder.conv1.bias  to shape:  (384, 1)
encoder.conv1.bias 2 (384, 1)
  Converting to float32
model.encoder.conv2.weight  ->  encoder.conv2.weight
encoder.conv2.weight 3 (384, 384, 3)
model.encoder.conv2.bias  ->  encoder.conv2.bias
  Reshaped variable:  encoder.conv2.bias  to

For smaller size and better performance, we can also quantize the GGML model:

In [44]:
!cd whisper.cpp && cmake -B build && cmake --build build --config Release
!./whisper.cpp/build/bin/quantize ./ggml/ggml-model.bin ./ggml/ggml-model-q5_0.bin q5_0

  Compatibility with CMake < 3.10 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value.  Or, use the <min>...<max> syntax
  to tell CMake that the project requires at least <min> but has been updated
  to work with policies introduced by <max> or earlier.

-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.34.1")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- CMAKE_SYSTEM_PROCES
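
To see how much space quantization actually saves, the two files can be compared directly. This is a minimal sketch: `size_report` is a hypothetical helper (not part of whisper.cpp), and the paths assume the `./ggml` output locations used in the cells above.

```python
import os


def size_report(original_path, quantized_path):
    """Return (original size in MiB, quantized size in MiB, quantized/original ratio)."""
    original = os.path.getsize(original_path)
    quantized = os.path.getsize(quantized_path)
    mib = 1024 * 1024
    return original / mib, quantized / mib, quantized / original


if __name__ == "__main__":
    # Assumed paths: the outputs of the conversion and quantization cells above
    paths = ("./ggml/ggml-model.bin", "./ggml/ggml-model-q5_0.bin")
    if all(os.path.exists(p) for p in paths):
        original_mib, quantized_mib, ratio = size_report(*paths)
        print(f"original:  {original_mib:.1f} MiB")
        print(f"quantized: {quantized_mib:.1f} MiB ({ratio:.0%} of original)")
    else:
        print("GGML files not found; run the conversion and quantization cells first.")
```
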

Now, let's make sure that the GGML model works. Start by downloading some test audio:

In [45]:
!mkdir -p ./test-audio
# Download the first chapter of Alice in Wonderland (in French)
!wget -P ./test-audio/ https://www.archive.org/download/alice_au_pays_des_merveilles_1811_librivox/aliceaupays_01_carroll_128kb.mp3
# Convert it to the format that whisper.cpp expects (16 kHz, mono, 16-bit PCM WAV):
# -t 30                 Take the first 30 s
# -i ...                Input path
# -ar 16000             Sample rate of 16000 Hz
# -ac 1                 1 audio channel (mono)
# -codec:a pcm_s16le    Audio codec (signed 16-bit little-endian PCM)
!ffmpeg -t 30 -i ./test-audio/aliceaupays_01_carroll_128kb.mp3 -ar 16000 -ac 1 -codec:a pcm_s16le ./test-audio/recording-fr.wav

--2025-02-26 03:25:39--  https://www.archive.org/download/alice_au_pays_des_merveilles_1811_librivox/aliceaupays_01_carroll_128kb.mp3
Resolving www.archive.org (www.archive.org)... 207.241.224.2
Connecting to www.archive.org (www.archive.org)|207.241.224.2|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://archive.org/download/alice_au_pays_des_merveilles_1811_librivox/aliceaupays_01_carroll_128kb.mp3 [following]
--2025-02-26 03:25:40--  https://archive.org/download/alice_au_pays_des_merveilles_1811_librivox/aliceaupays_01_carroll_128kb.mp3
Resolving archive.org (archive.org)... 207.241.224.2
Connecting to archive.org (archive.org)|207.241.224.2|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://ia903201.us.archive.org/25/items/alice_au_pays_des_merveilles_1811_librivox/aliceaupays_01_carroll_128kb.mp3 [following]
--2025-02-26 03:25:40--  https://ia903201.us.archive.org/25/items/alice_au_pays_des_mervei
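
Before transcribing, it can help to confirm that the converted file really has the sample rate, channel count, and bit depth requested from `ffmpeg`. The sketch below uses only Python's standard `wave` module. `describe_wav` and `is_whisper_cpp_ready` are hypothetical helper names, and the expected values (16000 Hz, 1 channel, 16 bits) simply mirror the `ffmpeg` flags above.

```python
import os
import wave


def describe_wav(path):
    """Return (sample_rate_hz, channels, bits_per_sample, duration_seconds)."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        bits = wav.getsampwidth() * 8
        duration = wav.getnframes() / rate
    return rate, channels, bits, duration


def is_whisper_cpp_ready(path):
    """True if the file matches the -ar 16000 -ac 1 pcm_s16le settings used above."""
    rate, channels, bits, _ = describe_wav(path)
    return (rate, channels, bits) == (16000, 1, 16)


if __name__ == "__main__":
    # Assumed path: the output of the ffmpeg command above
    wav_path = "./test-audio/recording-fr.wav"
    if os.path.exists(wav_path):
        print(describe_wav(wav_path))
        print("ready for whisper.cpp:", is_whisper_cpp_ready(wav_path))
```
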

Next, use the `whisper-cli` command to transcribe the audio using our GGML model:

In [46]:
# Transcribe the WAV file with the quantized GGML model that we built
!./whisper.cpp/build/bin/whisper-cli --language fr --no-timestamps -m ./ggml/ggml-model-q5_0.bin ./test-audio/recording-fr.wav

whisper_init_from_file_with_params_no_state: loading model from './ggml/ggml-model-q5_0.bin'
whisper_init_with_params_no_state: use gpu    = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw        = 0
whisper_init_with_params_no_state: devices    = 1
whisper_init_with_params_no_state: backends   = 1
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 8
whisper_model_load: qntvr         = 2
whisper_model_load: type          = 1 (tiny)
whisper_model_load: adding 1607 extra tokens
whisper_model
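
To work with the transcript in Python rather than reading it from the cell output, the same command can be run through `subprocess`. This is a rough sketch: `build_whisper_cli_command` and `transcribe` are hypothetical helpers, the flags are the ones used in the cell above, and it assumes that `whisper-cli` writes the transcript to stdout and its loading diagnostics to stderr.

```python
import os
import subprocess

# Assumed location of the binary built by the cmake step above
WHISPER_CLI = "./whisper.cpp/build/bin/whisper-cli"


def build_whisper_cli_command(model_path, audio_path, language="fr", cli_path=WHISPER_CLI):
    """Build the same argument list as the shell command in the cell above."""
    return [cli_path, "--language", language, "--no-timestamps", "-m", model_path, audio_path]


def transcribe(model_path, audio_path, language="fr", cli_path=WHISPER_CLI):
    """Run whisper-cli and return whatever it printed to stdout, stripped."""
    command = build_whisper_cli_command(model_path, audio_path, language, cli_path)
    # check=True raises CalledProcessError if whisper-cli exits with an error
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    # Assumption: the transcript is on stdout; model-loading logs go to stderr
    return result.stdout.strip()


if __name__ == "__main__":
    model = "./ggml/ggml-model-q5_0.bin"
    audio = "./test-audio/recording-fr.wav"
    if all(os.path.exists(p) for p in (WHISPER_CLI, model, audio)):
        print(transcribe(model, audio))
```

Returning the text as a string makes it straightforward to compare against a reference transcript later, for example with the `jiwer` package installed at the top of the notebook.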

In [None]:
from huggingface_hub import notebook_login, HfApi

# (Optional) Publish to Hugging Face: the model, processor, tokenizer, and GGML files
def push_to_hub():
    notebook_login()
    # Publish the model, processor
    trainer.push_to_hub(
        dataset_tags='mozilla-foundation/common_voice_11_0',
        dataset='Common Voice 11.0',
        language='fr',
        model_name='Whisper Tiny (Finetuned on French)',
        finetuned_from='openai/whisper-tiny',
        tasks='automatic-speech-recognition',
    )
    # Publish the tokenizer. Note: if this creates a new repo, it will be public
    tokenizer.push_to_hub(hub_model_id)
    # Publish the GGML files
    api = HfApi()
    api.upload_folder(
        folder_path='./ggml',
        repo_id=hub_model_id,
        path_in_repo='ggml/'
    )

In [None]:
# Uncomment to publish
#push_to_hub()