# Early-stop encoding

The first part of this notebook is based on the [build notebook](https://github.com/futo-org/whisper-acft/blob/main/finetune.ipynb) for `whisper-acft`. It creates a variant of Whisper more [robust to an encoder that stops early](https://github.com/futo-org/whisper-acft?tab=readme-ov-file#motive-and-explanation-for-anyone-uninitiated).

The `whisper-acft` build notebook is licensed under the MIT license:
<details><summary>MIT License</summary>

MIT License

Copyright (c) 2024 FUTO Organization

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

</details>

In [None]:
!pip install transformers torch datasets

Collecting datasets
  Downloading datasets-3.4.1-py3-none-any.whl.metadata (19 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting 

In [None]:
from datasets import load_dataset, Audio
dataset_fr = load_dataset('google/fleurs', 'fr_fr', split='train')

audio_feature = Audio(sampling_rate=16_000)
dataset_fr = dataset_fr.cast_column('audio', audio_feature)

README.md:   0%|          | 0.00/13.3k [00:00<?, ?B/s]

fleurs.py:   0%|          | 0.00/12.5k [00:00<?, ?B/s]

The repository for google/fleurs contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/google/fleurs.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


train.tar.gz:   0%|          | 0.00/1.73G [00:00<?, ?B/s]

dev.tar.gz:   0%|          | 0.00/143M [00:00<?, ?B/s]

test.tar.gz:   0%|          | 0.00/349M [00:00<?, ?B/s]

train.tsv:   0%|          | 0.00/2.06M [00:00<?, ?B/s]

dev.tsv:   0%|          | 0.00/181k [00:00<?, ?B/s]

test.tsv:   0%|          | 0.00/457k [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

## Model setup

Next, create two models: one to be trained and one to use as a reference. To keep the fine-tuned model's output as consistent as possible with the original, we'll use the output of `model_base` as the expected output of `model_train`.

In [None]:
from transformers import WhisperModel, WhisperTokenizer, WhisperProcessor

# TODO: Change model_name to match the name of the model to update (e.g. to personalizedrefrigerator/whisper-base-fr)
whisper_mode = 'base'
model_name = f'personalizedrefrigerator/whisper-{whisper_mode}-fr'
model_train = WhisperModel.from_pretrained(model_name).cuda().train()
model_base = WhisperModel.from_pretrained(model_name).cuda().eval()
processor = WhisperProcessor.from_pretrained(model_name, language='french', task='transcribe')

config.json:   0%|          | 0.00/1.39k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/290M [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/339 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/283k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/494k [00:00<?, ?B/s]

normalizer.json:   0%|          | 0.00/52.7k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/34.6k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.19k [00:00<?, ?B/s]

## Fine-tune

In [None]:
ds = dataset_fr

In [None]:
def get_sample(example):
    waveform = example['audio']['array']
    sampling_rate = example['audio']['sampling_rate']
    assert sampling_rate == 16_000

    input_features = processor(
        waveform, sampling_rate=sampling_rate, return_tensors='pt'
    ).input_features
    input_ids = processor.tokenizer.encode(example['raw_transcription'])
    return {
        'length': len(waveform) / sampling_rate,
        'input_features': input_features,
        'input_ids': input_ids
    }

# Test
[processor.tokenizer.decode(i) for i in get_sample(ds[1])['input_ids']]

['<|startoftranscript|>',
 '<|fr|>',
 '<|transcribe|>',
 '<|notimestamps|>',
 'Il',
 ' s',
 '�',
 '�',
 'agit',
 ' d',
 '�',
 '�',
 'une',
 ' ent',
 'ité',
 ' très',
 ' complex',
 'e',
 ' qui',
 ' consiste',
 ',',
 ' selon',
 ' un',
 ' modèle',
 ' de',
 ' Boh',
 'r',
 ' simpl',
 'ifi',
 'é',
 ',',
 ' en',
 ' un',
 ' no',
 'y',
 'au',
 ' central',
 ' orb',
 'ité',
 ' par',
 ' des',
 ' élect',
 'rons',
 ',',
 ' un',
 ' peu',
 ' comme',
 ' les',
 ' plan',
 'è',
 'tes',
 ' en',
 ' orb',
 'ite',
 ' autour',
 ' du',
 ' sole',
 'il',
 ' —',
 ' c',
 'f',
 '.',
 ' illustration',
 '�',
 '�',
 '1',
 '.',
 '1',
 '.',
 '<|endoftext|>']

In [None]:
import torch
from tqdm import tqdm
from torch import nn

# Note: Mostly copied from https://github.com/futo-org/whisper-acft/blob/main/finetune.ipynb
#       See above for license and other information.

def compute_partially_encoder(model, data, n_audio_ctx):
    """
        Computes hidden states for the given model with only a partial run of the encoder.

        Parameters:
        - model: The model.
        - data: Input features to the model.
        - n_audio_ctx: Number of encoder positions to compute, where one position covers 1/50 s of audio. Should be slightly larger than the recording length. Set to 1500 to use the full 30 s window. See https://github.com/futo-org/whisper-acft/issues/6#issuecomment-2290093422.
    """
    diffy = 2 * n_audio_ctx - data.shape[2]
    if diffy > 0:
        data = nn.functional.pad(data, [0, diffy, 0, 0, 0, 0], 'constant', 0.0)
    elif diffy < 0:
        data = data[:,:,:diffy]

    # Default encoding -- the full audio
    if n_audio_ctx == 1500:
        return model.encoder(data).last_hidden_state

    input_embeds = nn.functional.gelu(model.encoder.conv1(data))
    input_embeds = nn.functional.gelu(model.encoder.conv2(input_embeds))
    input_embeds = input_embeds.permute(0, 2, 1)

    embed_pos = model.encoder.embed_positions.weight[:n_audio_ctx]

    hidden_states = input_embeds + embed_pos
    hidden_states = nn.functional.dropout(hidden_states, p=model.encoder.dropout, training=model.encoder.training)

    for idx, encoder_layer in enumerate(model.encoder.layers):
        to_drop = False
        if model.encoder.training:
            dropout_probability = torch.rand([])
            if dropout_probability < model.encoder.layerdrop:
                to_drop = True

        if to_drop:
            layer_outputs = (None, None)
        else:
            if model.encoder.gradient_checkpointing and model.encoder.training:
                layer_outputs = model.encoder._gradient_checkpointing_func(
                    encoder_layer.__call__,
                    hidden_states,
                    None,
                    None,
                    False,
                )
            else:
                layer_outputs = encoder_layer(
                    hidden_states,
                    None,
                    layer_head_mask=None,
                    output_attentions=False,
                )

            hidden_states = layer_outputs[0]

    hidden_states = model.encoder.layer_norm(hidden_states)
    return hidden_states


def compute_hidden_state_loss(model_train, model_base, optimizer, criterion, example):
    optimizer.zero_grad()

    n_ctx = int(round((1500.0 / 30.0) * example["length"] ))

    extra_ctx = torch.randint(-min(64, n_ctx // 3), min(64, n_ctx // 3), (1,)).item()
    n_ctx += extra_ctx

    input_features = example["input_features"].cuda()
    input_ids = torch.tensor([example["input_ids"]], dtype=torch.long).cuda()

    encoder_hidden_states_partial = compute_partially_encoder(model_train, input_features, n_ctx)
    output_partial = model_train.decoder(
        input_ids=input_ids,
        encoder_hidden_states=encoder_hidden_states_partial,
        output_hidden_states=True
    )

    with torch.no_grad():
        encoder_hidden_states_full = compute_partially_encoder(model_base, input_features, 1500)
        output_full = model_base.decoder(
            input_ids=input_ids,
            encoder_hidden_states=encoder_hidden_states_full,
            output_hidden_states=True
        )

    loss = criterion(
        #output_partial.hidden_states[-1],
        #output_full.hidden_states[-1]
        torch.cat(output_partial.hidden_states, 0),
        torch.cat(output_full.hidden_states, 0)
    )

    loss.backward()
    optimizer.step()

    return loss
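
The context-length arithmetic in `compute_hidden_state_loss` and the padding rule in `compute_partially_encoder` can be checked without a GPU. The sketch below mirrors those two calculations with plain integers. `audio_ctx_for` and `mel_frame_adjustment` are hypothetical helper names, the random jitter is left out so the result is deterministic, and the 3000-frame input assumes the feature extractor pads every clip to 30 s.

```python
FRAMES_PER_SECOND = 1500 / 30  # 1500 encoder positions cover the 30 s window


def audio_ctx_for(length_seconds: float) -> int:
    """Encoder positions needed for a clip, before any random jitter."""
    return int(round(FRAMES_PER_SECOND * length_seconds))


def mel_frame_adjustment(n_audio_ctx: int, mel_frames: int) -> int:
    """Mirror of `diffy`: positive means pad that many frames, negative means trim."""
    # Two mel frames are consumed per encoder position.
    return 2 * n_audio_ctx - mel_frames


n_ctx = audio_ctx_for(7.5)
print(n_ctx)                              # 375
print(mel_frame_adjustment(n_ctx, 3000))  # -2250
```

A 7.5 s clip therefore keeps only the first 750 of the 3000 mel frames, which is where the encoder's time savings come from.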

Next, enter the training loop:

In [None]:
from torch.utils.tensorboard import SummaryWriter

# Note: Mostly copied from https://github.com/futo-org/whisper-acft/blob/main/finetune.ipynb
#       See above for license and other information.

criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model_train.parameters(), lr=1e-7)


writer = SummaryWriter()
writer.add_text("name", f"{model_name} v3")

num_length = 0
step = 0
for epoch in range(8):
  pbar = tqdm(ds.shuffle(seed=epoch))
  for example in pbar:
    example = get_sample(example)
    if example["length"] > 29.0: continue

    loss = compute_hidden_state_loss(model_train, model_base, optimizer, criterion, example)
    step += 1
    num_length += example["length"]

    writer.add_scalar("loss/train", loss.item(), step)
    writer.add_scalar("length/train", num_length, step)
    writer.add_scalar("epoch/train", epoch, step)

    pbar.set_description(f"Epoch {epoch}, Loss: {loss.item()}")


  0%|          | 0/3193 [00:00<?, ?it/s]Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
Epoch 0, Loss: 0.03439435735344887: 100%|██████████| 3193/3193 [06:32<00:00,  8.14it/s]
Epoch 1, Loss: 0.042581308633089066: 100%|██████████| 3193/3193 [06:26<00:00,  8.27it/s]
Epoch 2, Loss: 0.022116176784038544: 100%|██████████| 3193/3193 [06:26<00:00,  8.27it/s]
Epoch 3, Loss: 0.017987575381994247: 100%|██████████| 3193/3193 [06:25<00:00,  8.28it/s]
Epoch 4, Loss: 0.013689461164176464: 100%|██████████| 3193/3193 [06:26<00:00,  8.26it/s]
Epoch 5, Loss: 0.02007947489619255: 100%|██████████| 3193/3193 [06:27<00:00,  8.25it/s]
Epoch 6, Loss: 0.019504878669977188: 100%|██████████| 3193/3193 [06:25<00:00,  8.28it/s]
Epoch 7, Loss: 0.01682526059448719: 100%|██████████| 3193/3193 [06:26<00:00,  8.26it/s]


In [None]:
from transformers import WhisperForConditionalGeneration
model = WhisperForConditionalGeneration.from_pretrained(model_name).eval().cpu()
model.model = model_train.eval().cpu()

model.save_pretrained('final-model')
processor.tokenizer.save_pretrained('final-model')

generation_config.json:   0%|          | 0.00/3.75k [00:00<?, ?B/s]

('final-model/tokenizer_config.json',
 'final-model/special_tokens_map.json',
 'final-model/vocab.json',
 'final-model/merges.txt',
 'final-model/normalizer.json',
 'final-model/added_tokens.json')

## Testing it!

To verify that the model still works, log the model's output on the first sample in the dataset.

In [None]:
sample_data = next(iter(dataset_fr))

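# Pass the sampling rate explicitly: the audio column was cast to 16 kHz above.
input_features = processor(
    sample_data['audio']['array'], sampling_rate=16_000, return_tensors='pt'
).input_features
output_ids = model.generate(inputs=input_features)
processor.batch_decode(output_ids)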

It is strongly recommended to pass the `sampling_rate` argument to `WhisperFeatureExtractor()`. Failing to do so can result in silent errors that might be hard to debug.
You have passed task=transcribe, but also have set `forced_decoder_ids` to [[1, 50259], [2, 50359], [3, 50363]] which creates a conflict. `forced_decoder_ids` will be ignored in favor of task=transcribe.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


["Quand la capsule rentrera dans l'atmosphère terrestre, vers 5 heures du matin, par de l'Est, elle offrira un spectacle lumineux spectaculaire aux habitants du nord de la Californie, de l'Oregon, du Nevada et de l'Utah."]

# Model conversion

Next, we need to convert the model into a format usable by Joplin. This next step converts the model from PyTorch to GGML. Note that this section has been copied and modified from the Joplin `whisper_vocab_cleanup.ipynb` notebook.

In [None]:
!git clone https://github.com/openai/whisper whisper-github
!git clone https://github.com/ggerganov/whisper.cpp
!cd whisper.cpp && git checkout v1.7.4

Cloning into 'whisper-github'...
remote: Enumerating objects: 828, done.
remote: Counting objects: 100% (370/370), done.
remote: Compressing objects: 100% (69/69), done.
remote: Total 828 (delta 333), reused 301 (delta 301), pack-reused 458 (from 2)
Receiving objects: 100% (828/828), 8.26 MiB | 16.58 MiB/s, done.
Resolving deltas: 100% (496/496), done.
Cloning into 'whisper.cpp'...
remote: Enumerating objects: 16341, done.
remote: Counting objects: 100% (27/27), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 16341 (delta 6), reused 13 (delta 4), pack-reused 16314 (from 2)
Receiving objects: 100% (16341/16341), 19.39 MiB | 16.43 MiB/s, done.
Resolving deltas: 100% (11348/11348), done.
Note: switching to 'v1.7.4'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to 

`whisper.cpp` needs a patch in order to successfully convert the model:

In [None]:
from pathlib import Path
# Patch convert-h5-to-ggml to work with more recent model versions
conversion_script_path = Path('whisper.cpp/models/convert-h5-to-ggml.py')
conversion_script_content = conversion_script_path.read_text()
with open(conversion_script_path, 'w') as conversion_script:
    bad_if_statement = 'if "max_length" not in hparams:'
    replaced_if_statement = 'if "max_length" not in hparams or hparams["max_length"] is None:'
    conversion_script.write(conversion_script_content.replace(bad_if_statement, replaced_if_statement))
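
To see why the original condition is not enough, consider a config in which `max_length` is present but set to `None`. The sketch below is a stand-in rather than the conversion script itself: `resolve_max_length` is a hypothetical helper, the fallback to `max_target_positions` is an assumption about what the script does inside the `if`, and the values are illustrative.

```python
def resolve_max_length(hparams: dict, patched: bool) -> dict:
    """Apply a max_length fallback using the original or the patched condition."""
    hparams = dict(hparams)  # work on a copy
    if patched:
        needs_default = "max_length" not in hparams or hparams["max_length"] is None
    else:
        needs_default = "max_length" not in hparams
    if needs_default:
        # Assumption: the script falls back to another config field here.
        hparams["max_length"] = hparams["max_target_positions"]
    return hparams


config = {"max_length": None, "max_target_positions": 448}  # illustrative values
print(resolve_max_length(config, patched=False)["max_length"])  # None
print(resolve_max_length(config, patched=True)["max_length"])   # 448
```

With the unpatched check, the key exists, so the `None` value is carried forward and can fail later when the script needs an integer.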

Now that the patch is applied, the model can be converted:

In [None]:
!mkdir ./ggml
!python whisper.cpp/models/convert-h5-to-ggml.py ./final-model ./whisper-github ./ggml

2025-03-25 19:12:12.046308: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1742929932.071602   14503 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1742929932.078064   14503 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
model.encoder.conv1.weight  ->  encoder.conv1.weight
encoder.conv1.weight 3 (512, 80, 3)
model.encoder.conv1.bias  ->  encoder.conv1.bias
  Reshaped variable:  encoder.conv1.bias  to shape:  (512, 1)
encoder.conv1.bias 2 (512, 1)
  Converting to float32
model.encoder.conv2.weight  ->  encoder.conv2.weight
encoder.conv2.weight 3 (512, 512, 3)
model.encoder.conv2.bias  ->  encoder.conv2.bias
  Reshaped variable:  encoder.conv2.bias  to

For smaller size and better performance, we can also [quantize the GGML model](https://github.com/ggerganov/whisper.cpp/discussions/838):

In [None]:
!cd whisper.cpp && cmake -B build && cmake --build build --config Release
!./whisper.cpp/build/bin/quantize ./ggml/ggml-model.bin ./ggml/ggml-model-q8_0.bin q8_0
!./whisper.cpp/build/bin/quantize ./ggml/ggml-model.bin ./ggml/ggml-model-q5_0.bin q5_0

  Compatibility with CMake < 3.10 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value.  Or, use the <min>...<max> syntax
  to tell CMake that the project requires at least <min> but has been updated
  to work with policies introduced by <max> or earlier.

-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.34.1")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- CMAKE_SYSTEM_PROCES
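
As a rough illustration of what an 8-bit format such as `q8_0` trades away, the sketch below quantizes one block of weights to integers plus a single scale, then restores it. This is a simplified stand-in, not whisper.cpp's implementation: the block size, the rounding rule, and the helper names are assumptions made for the example.

```python
BLOCK_SIZE = 32  # assumption: values are quantized in fixed-size blocks


def quantize_block(values):
    """Map one block of floats to (scale, integers in [-127, 127])."""
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax > 0 else 1.0
    return scale, [round(v / scale) for v in values]


def dequantize_block(scale, quants):
    """Recover approximate floats from the scale and the integers."""
    return [scale * q for q in quants]


# Deterministic toy weights in the range [-0.32, 0.31]
weights = [((i * 37) % 64 - 32) / 100 for i in range(BLOCK_SIZE)]
scale, quants = quantize_block(weights)
restored = dequantize_block(scale, quants)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.6f}, worst error={worst_error:.6f}")
```

Each weight is stored in one byte instead of four, and the rounding error per weight is bounded by half the block's scale. Formats with fewer bits shrink the file further at the cost of a larger error.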

Now, let's make sure that the GGML model works. Start by downloading some test audio:

In [None]:
!mkdir ./test-audio
# Download the fourth chapter of Alice in Wonderland (in French)
!wget -P ./test-audio/ https://www.archive.org/download/alice_au_pays_des_merveilles_1811_librivox/aliceaupays_04_carroll_128kb.mp3
# Convert it to a format that's understandable by whisper.cpp:
# -t 10                 Take the first 10 s
# -i ...                Input path
# -ar 16000             Sample rate of 16000 Hz
# -ac 1                 1 audio channel
# -codec:a pcm_s16le    Audio codec (16-bit little-endian PCM)
!ffmpeg -t 10 -i ./test-audio/aliceaupays_04_carroll_128kb.mp3 -ar 16000 -ac 1 -codec:a pcm_s16le ./test-audio/recording-fr-4.wav

--2025-03-25 19:13:43--  https://www.archive.org/download/alice_au_pays_des_merveilles_1811_librivox/aliceaupays_04_carroll_128kb.mp3
Resolving www.archive.org (www.archive.org)... 207.241.224.2
Connecting to www.archive.org (www.archive.org)|207.241.224.2|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://archive.org/download/alice_au_pays_des_merveilles_1811_librivox/aliceaupays_04_carroll_128kb.mp3 [following]
--2025-03-25 19:13:45--  https://archive.org/download/alice_au_pays_des_merveilles_1811_librivox/aliceaupays_04_carroll_128kb.mp3
Resolving archive.org (archive.org)... 207.241.224.2
Connecting to archive.org (archive.org)|207.241.224.2|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://ia803201.us.archive.org/25/items/alice_au_pays_des_merveilles_1811_librivox/aliceaupays_04_carroll_128kb.mp3 [following]
--2025-03-25 19:13:46--  https://ia803201.us.archive.org/25/items/alice_au_pays_des_mervei
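
Before running the model, it can help to confirm that the converted file really has the format requested by the `ffmpeg` flags above. The sketch below uses Python's standard `wave` module. `describe_wav` is a hypothetical helper, and the demo writes its own one-second silent file so that it runs without the downloaded audio.

```python
import tempfile
import wave
from pathlib import Path


def describe_wav(path):
    """Return (sample_rate, channels, bits_per_sample) for a PCM WAV file."""
    with wave.open(str(path), 'rb') as wav:
        return wav.getframerate(), wav.getnchannels(), 8 * wav.getsampwidth()


with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / 'silence.wav'
    with wave.open(str(demo_path), 'wb') as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)      # 2 bytes = 16-bit samples
        wav.setframerate(16000)
        wav.writeframes(b'\x00\x00' * 16000)  # one second of silence
    wav_format = describe_wav(demo_path)

print(wav_format)  # (16000, 1, 16)
```

Calling `describe_wav('./test-audio/recording-fr-4.wav')` after the conversion should report the same tuple if `ffmpeg` applied `-ar 16000 -ac 1 -codec:a pcm_s16le` as intended.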

Next, use the `whisper-cli` command to transcribe the audio using our GGML model:

In [None]:
# Test converting the WAV file to text using the GGML file that we built.
# The "-np" argument causes only the recognised text to be printed: 
!./whisper.cpp/build/bin/whisper-cli --language fr -np --no-timestamps -m ./ggml/ggml-model.bin ./test-audio/recording-fr-4.wav


Capétre 4, de aventure, d'Alic, au Pays des Marseille, par l'Ouestara. Cet enregistrement et les prévoxes fait partie.


## Packaging for Joplin

In [None]:
from transformers import WhisperFeatureExtractor, WhisperTokenizer

feature_extractor = processor.feature_extractor
tokenizer_original = processor.tokenizer

We'll create a customized tokenizer based on `tokenizer_original` in the next section.

## Vocabulary adjustments

Next, we remove several unwanted tokens from the vocabulary. This section is originally from `whisper_vocab_cleanup.ipynb`:

In [None]:
# Step 1: Save the vocabulary to a file
tokenizer_directory = Path('whisper-default-tokenizer')
tokenizer_original.save_pretrained(tokenizer_directory)


('whisper-default-tokenizer/tokenizer_config.json',
 'whisper-default-tokenizer/special_tokens_map.json',
 'whisper-default-tokenizer/vocab.json',
 'whisper-default-tokenizer/merges.txt',
 'whisper-default-tokenizer/normalizer.json',
 'whisper-default-tokenizer/added_tokens.json')

Now that the tokenizer is saved in `tokenizer_directory`, we can load `tokenizer_directory/vocab.json` and modify it:

In [None]:
# Step 2: Get vocab.json
import json

def json_from_path(path: Path):
    with open(path, 'r', encoding='utf-8') as f:
        return json.loads(f.read())

vocab = json_from_path(tokenizer_directory / 'vocab.json')

In [None]:
import re
NONWORD_REGEX = re.compile(r'[ \t?.,;!()/\-«»]+')
def split_by_word(text: str):
    """ Splits the given `text` into words. Returns a list of those words. """
    return NONWORD_REGEX.split(text)


# This character marks the beginning of a word in vocab.json
word_start_char = 'Ġ'
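
The behavior of `split_by_word` is easiest to see on a few inputs. The sketch below repeats the definition so that it runs on its own. The sample strings are made up for illustration.

```python
import re

NONWORD_REGEX = re.compile(r'[ \t?.,;!()/\-«»]+')


def split_by_word(text: str):
    """Split `text` on runs of whitespace and punctuation."""
    return NONWORD_REGEX.split(text)


print(split_by_word('Ġ t'))                  # ['Ġ', 't']
print(split_by_word('Ġ t\n'))                # ['Ġ', 't\n']
print(split_by_word('Bonjour, le monde !'))  # ['Bonjour', 'le', 'monde', '']
```

Two details matter for the merge filtering later on: a newline is not a separator, so the last piece of an unstripped line keeps its `\n`, and text that ends in a separator produces a trailing empty string. Only the first element is reliable on raw lines from `merges.txt`.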

In [None]:
# Step 3: Replace!
next_replacement_idx = 0
new_vocab = {}

# Token IDs can be found by inspecting the original vocab.json. These token IDs
# are specific to the multilingual whisper-tiny, but may also work for whisper-base. Each remapping should be unique.
token_id_remappings = {
    19186: "[swearS1]", # s***
    30748: word_start_char + "[swearS2]",
    4611: word_start_char + "[swearS3]",
    19593: word_start_char + "[swearS4]", # S***
    10965: word_start_char + "[swearF1]", # F***
    26154: word_start_char + "[swearF2]", # F***
    33342: word_start_char + "[swearF3]",
    47069: word_start_char + "[swearF4]", # f****
    3275: word_start_char + "[swearF5]",
    22518: word_start_char + "[swearF6]",
    20022: word_start_char + "[swearF7]",
    5546: word_start_char + "[swearF8]",
    47069: word_start_char + "[swearM1]", # NOTE: 47069 also appears above; this later entry wins, so [swearF4] is never assigned
    29537: word_start_char + "[swearM2]",
    22676: word_start_char + "[swearB1]", # bull****
    11960: word_start_char + "[swearB2]",
    42094: word_start_char + "[swearB3]",
    40678: word_start_char + "[swearB4]"
}
replaced_keys = set()

for key in vocab:
    token_id = vocab[key]
    if token_id in token_id_remappings:
        new_key = token_id_remappings[token_id]
        new_vocab[new_key] = token_id
        replaced_keys.add(key)
    else:
        new_vocab[key] = token_id

new_merges = []
with open(tokenizer_directory / 'merges.txt', 'r', encoding='utf-8') as merges:
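    for line in merges.readlines():
        # Strip first: raw lines end in a newline, so they are never empty
        line = line.strip()
        if len(line) == 0:
            continue
        # Drop merges whose first token was replaced above
        words = split_by_word(line)
        if words[0] not in replaced_keys:
            new_merges.append(line)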
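
The same remapping logic can be tried on a toy vocabulary to see what ends up in each output. Everything in the sketch below is hypothetical: the three tokens, their IDs, the `[swearD1]` placeholder, and the merge lines. Merge lines are split on a single space here for simplicity.

```python
word_start_char = 'Ġ'

toy_vocab = {'Ġhello': 0, 'Ġdarn': 1, 'world': 2}
toy_remappings = {1: word_start_char + '[swearD1]'}

new_vocab = {}
replaced_keys = set()
for key, token_id in toy_vocab.items():
    if token_id in toy_remappings:
        # Keep the token ID, but change the text the ID decodes to
        new_vocab[toy_remappings[token_id]] = token_id
        replaced_keys.add(key)
    else:
        new_vocab[key] = token_id

toy_merges = ['Ġ h', 'Ġdarn it', 'w orld']
kept_merges = [m for m in toy_merges if m.split(' ')[0] not in replaced_keys]

print(new_vocab)    # {'Ġhello': 0, 'Ġ[swearD1]': 1, 'world': 2}
print(kept_merges)  # ['Ġ h', 'w orld']
```

Because each token ID is preserved, the model's weights stay aligned with the vocabulary. Only the decoded text changes, which lets a later post-processing step recognize and mask the placeholder.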

To check for other token IDs to replace (keeping in mind that the output should still be multilingual), we could do something like this:
```python
!pip install better_profanity==0.7.0

from better_profanity import profanity

profanity.load_censor_words()
for key in new_vocab:
    word = key
    if key.startswith(word_start_char):
        word = key[1:]
    if profanity.contains_profanity(word):
        print("Consider replacing", key, new_vocab[key])
```

In [None]:
# !pip install better_profanity==0.7.0

# from better_profanity import profanity

# profanity.load_censor_words()
# for key in new_vocab:
#     word = key
#     if key.startswith(word_start_char):
#         word = key[1:]
#     if profanity.contains_profanity(word):
#         print("Consider replacing", key, new_vocab[key])

Great! We now have an updated vocab file!

In [None]:
import shutil

# Write to a file
tokenizer_fr_directory = Path('updated-tokenizer')
if tokenizer_fr_directory.exists():
    shutil.rmtree(tokenizer_fr_directory)
shutil.copytree(tokenizer_directory, tokenizer_fr_directory)
with open(tokenizer_fr_directory / 'vocab.json', 'w', encoding='utf-8') as f:
    json.dump(new_vocab, f, ensure_ascii=False)


with open(tokenizer_fr_directory / 'merges.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(new_merges))

In [None]:
from transformers import WhisperTokenizer

# Use a normal WhisperTokenizer -- WhisperTokenizerFast has trouble with the updated
# vocabulary.
tokenizer = WhisperTokenizer(
    tokenizer_fr_directory / 'vocab.json',
    tokenizer_fr_directory / 'merges.txt',
    tokenizer_fr_directory / 'normalizer.json',
    bos_token='<|startoftranscript|>',
    unk_token='',
    pad_token='<|endoftext|>',
)

# See https://discuss.huggingface.co/t/fine-tuning-whisper-on-my-own-dataset-with-a-customized-tokenizer/25903
tokenizer.add_special_tokens(tokenizer_original.special_tokens_map)

105

In [None]:
# For debugging, update the output directory
shutil.rmtree(tokenizer_fr_directory)
tokenizer.save_pretrained(tokenizer_fr_directory)

('updated-tokenizer/tokenizer_config.json',
 'updated-tokenizer/special_tokens_map.json',
 'updated-tokenizer/vocab.json',
 'updated-tokenizer/merges.txt',
 'updated-tokenizer/normalizer.json',
 'updated-tokenizer/added_tokens.json')

Next, build the model:

In [None]:
model_output_dir = Path('./final-model').resolve()
model.save_pretrained(model_output_dir)
tokenizer.save_pretrained(model_output_dir)

('/content/final-model/tokenizer_config.json',
 '/content/final-model/special_tokens_map.json',
 '/content/final-model/vocab.json',
 '/content/final-model/merges.txt',
 '/content/final-model/normalizer.json',
 '/content/final-model/added_tokens.json')

We can now convert the model to GGML:

In [None]:
!mkdir ./ggml-updated
!python whisper.cpp/models/convert-h5-to-ggml.py ./final-model ./whisper-github ./ggml-updated

# Quantize. See https://github.com/ggerganov/whisper.cpp/discussions/838
!./whisper.cpp/build/bin/quantize ./ggml-updated/ggml-model.bin ./ggml-updated/ggml-model-q8_0.bin q8_0
!./whisper.cpp/build/bin/quantize ./ggml-updated/ggml-model.bin ./ggml-updated/ggml-model-q5_0.bin q5_0
!./whisper.cpp/build/bin/quantize ./ggml-updated/ggml-model.bin ./ggml-updated/ggml-model-q4_0.bin q4_0

2025-03-25 19:14:05.706870: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1742930045.728506   15587 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1742930045.739139   15587 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
model.encoder.conv1.weight  ->  encoder.conv1.weight
encoder.conv1.weight 3 (512, 80, 3)
model.encoder.conv1.bias  ->  encoder.conv1.bias
  Reshaped variable:  encoder.conv1.bias  to shape:  (512, 1)
encoder.conv1.bias 2 (512, 1)
  Converting to float32
model.encoder.conv2.weight  ->  encoder.conv2.weight
encoder.conv2.weight 3 (512, 512, 3)
model.encoder.conv2.bias  ->  encoder.conv2.bias
  Reshaped variable:  encoder.conv2.bias  to

# Building the Joplin-compatible model

Next, we need to convert the model to a format compatible with Joplin. A `.zip` file is created with the following structure:
```
model_name.zip/
| model_name/
| | README.md
| | model.bin
| | config.json
```
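
The layout above can be reproduced on a scratch directory to confirm how `shutil.make_archive` nests the files. This is a minimal sketch with placeholder file contents. The `root_dir` and `base_dir` arguments are what place everything under a top-level `model_name/` folder inside the archive.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    output_dir = Path(tmp)
    model_dir = output_dir / 'model_name'
    model_dir.mkdir()
    for filename in ('README.md', 'model.bin', 'config.json'):
        (model_dir / filename).write_text('placeholder')

    # Archive only the `model_name` folder, with paths relative to output_dir
    archive_path = shutil.make_archive(
        str(output_dir / 'model_name'), 'zip',
        root_dir=output_dir, base_dir='model_name',
    )
    with zipfile.ZipFile(archive_path) as archive:
        file_entries = sorted(n for n in archive.namelist() if not n.endswith('/'))

print(file_entries)
# ['model_name/README.md', 'model_name/config.json', 'model_name/model.bin']
```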

In [None]:
from pathlib import Path
import shutil, json, zipfile

def package_output(source_model: Path, output_dir: Path, output_filename: str):
    if not output_dir.exists():
        output_dir.mkdir()
    unzipped_dir = output_dir / output_filename
    if unzipped_dir.exists():
        shutil.rmtree(unzipped_dir)
    unzipped_dir.mkdir()

    shutil.copyfile(source_model, unzipped_dir / 'model.bin')
    # config.json
    config_filepath = unzipped_dir / 'config.json'
    config_filepath.write_text(json.dumps({
        'prompts': {
            # Custom prompts can improve accuracy.
            'en': 'Joplin is a note-taking application. This is a Joplin note.'
        },
        # Performance: Informs Joplin that the model supports a shortened audio context
        'shortAudioContext': True,
        'output': {
            '//': 'Each of the replacements is in the form [ original, replaceWith ]. For example, ["test", ""] replaces all instances of "test" with the empty string.',
            'stringReplacements': [
                [ '[BLANK_AUDIO]', '' ],
            ],
            'regexReplacements': [
                [ r'^\([^(),.?]+\)$', ''],
                [ r'^\[[^(),.?]+\]$', ''],
                [ r'^[.,?!]$', '' ],
                [ r'\[swearB1\]', 'BS' ],
                [ r'\[swear[A-Z][0-9]+\]', '****' ],
            ],
        }
    }, indent='\t'))
    # README.md
    readme_filepath = unzipped_dir / 'README.md'
    readme_filepath.write_text('\n'.join([
        '# {}'.format(output_filename),
        '',
        'This model is a version of `whisper-' + whisper_mode + '` with an [adjusted vocab.json](https://github.com/personalizedrefrigerator/joplin-voice-typing-test/blob/main/whisper_vocab_cleanup.ipynb) to reduce the probability of profanity when given noisy non-speech input.',
        '',
        'This model has also been [fine-tuned](https://github.com/joplin/voice-typing-models/blob/240c4de34b76aa516482f6e3155c19e14a414e37/whisper_more_efficient_encoding.ipynb) to improve efficiency.',
        '',
        '## License',
        '',
        'The Whisper model from which this is modified has the following license:',
        '''
        MIT License

        Copyright (c) 2022 OpenAI

        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:

        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.

        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        ''',
        '',
        'The fine-tuning code that helped generate this model has the following license:',
        '''
        MIT License

        Copyright (c) 2024 FUTO Organization

        Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

        The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

        '''
    ]))

    # Make the .zip file
    # See https://docs.python.org/3/library/shutil.html
    shutil.make_archive(
        output_dir / output_filename,
        'zip',
        root_dir=output_dir,
        base_dir=output_filename,
    )

package_output(
    Path('./ggml-updated/ggml-model.bin'),
    Path('./joplin-model'),
    'whisper-{}'.format(whisper_mode)
)
package_output(
    Path('./ggml-updated/ggml-model-q8_0.bin'),
    Path('./joplin-model-q8_0'),
    'whisper-{}-q8_0'.format(whisper_mode)
)
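
The replacement rules written to `config.json` can be exercised in Python to see what they do to typical model output. How Joplin applies them is not shown in this notebook, so the sketch below assumes each rule is applied in order to a single output segment, with semantics close to `str.replace` and `re.sub`. `clean_output` is a hypothetical helper and the sample sentences are made up.

```python
import re

STRING_REPLACEMENTS = [('[BLANK_AUDIO]', '')]
REGEX_REPLACEMENTS = [
    (r'^\([^(),.?]+\)$', ''),
    (r'^\[[^(),.?]+\]$', ''),
    (r'^[.,?!]$', ''),
    (r'\[swearB1\]', 'BS'),
    (r'\[swear[A-Z][0-9]+\]', '****'),
]


def clean_output(text: str) -> str:
    """Apply the string replacements, then the regex replacements, in order."""
    for original, replace_with in STRING_REPLACEMENTS:
        text = text.replace(original, replace_with)
    for pattern, replace_with in REGEX_REPLACEMENTS:
        text = re.sub(pattern, replace_with, text)
    return text


print(repr(clean_output('(musique)')))                        # ''
print(repr(clean_output('[BLANK_AUDIO]')))                    # ''
print(clean_output('Quel [swearB1], ce [swearF2] de bruit'))  # Quel BS, ce **** de bruit
```

The order of the last two rules matters: the specific `[swearB1]` rule has to run before the generic `[swear...]` rule, otherwise it would be masked as `****` as well.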

The models are now built! They're stored in the `./joplin-model` and `./joplin-model-q8_0` directories.

In [None]:

package_output(
    Path('./ggml-updated/ggml-model-q5_0.bin'),
    Path('./joplin-model-q5_0'),
    'whisper-{}-q5_0'.format(whisper_mode)
)
package_output(
    Path('./ggml-updated/ggml-model-q4_0.bin'),
    Path('./joplin-model-q4_0'),
    'whisper-{}-q4_0'.format(whisper_mode)
)

## Google Colab-specific

In [None]:
# Google Colab only: Save the files to the local machine
from google.colab import files
files.download(f'./joplin-model/whisper-{whisper_mode}.zip')
files.download(f'./joplin-model-q5_0/whisper-{whisper_mode}-q5_0.zip')
files.download(f'./joplin-model-q8_0/whisper-{whisper_mode}-q8_0.zip')
files.download(f'./joplin-model-q4_0/whisper-{whisper_mode}-q4_0.zip')
