# Updating Whisper's default vocabulary

Some of the upstream Whisper models have been observed to swear (in English) when given non-speech input. This notebook adjusts the `vocab.json` for the `whisper-tiny` model to make such swearing less likely.

More specifically, this notebook:
- Fetches the upstream `whisper-tiny` model.
- Replaces most full-swearword entries (see below).
- Packages the model as GGML.

After running this notebook, it will still be necessary to put the model in a `.zip` file in the format supported by Joplin (see the existing models for an example).

In [1]:
!pip install --upgrade pip
# jiwer is used for the word error rate (WER) metric
!pip install --upgrade datasets[audio] transformers evaluate jiwer

Collecting pip
  Downloading pip-25.0.1-py3-none-any.whl.metadata (3.7 kB)
Downloading pip-25.0.1-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
Successfully installed pip-25.0.1
Collecting transformers
  Downloading transformers-4.49.0-py3-none-any.whl.metadata (44 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting jiwer
  Downloading jiwer-3.1.0-py3-none-any.whl.metadata (2.6 kB)
Collecting datasets[audio]
  Downloading datasets-3.3.2-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets[audio])
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets[audio])
  Downloading xxhash-3.5.0-cp311-cp311-manyl

In [2]:
import wandb
# See https://discuss.huggingface.co/t/how-to-turn-wandb-off-in-trainer/6237/10
wandb.init(mode='disabled')

In [3]:
from pathlib import Path
checkpoint_path = Path('./whisper/checkpoints').resolve()

In [4]:
import shutil


## Create the feature extractor and tokenizer

Fetch the feature extractor and tokenizer for this model from Huggingface:

In [5]:
from transformers import WhisperFeatureExtractor, WhisperTokenizer

finetune_from_id = 'openai/whisper-tiny'
feature_extractor = WhisperFeatureExtractor.from_pretrained(finetune_from_id, task='transcribe')
tokenizer_original = WhisperTokenizer.from_pretrained(finetune_from_id, task='transcribe')

preprocessor_config.json:   0%|          | 0.00/185k [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/283k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/836k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.48M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/494k [00:00<?, ?B/s]

normalizer.json:   0%|          | 0.00/52.7k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/34.6k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.19k [00:00<?, ?B/s]

We'll create a customized tokenizer based on `tokenizer_original` in the next section.

## Vocabulary adjustments

Next, we replace several unwanted tokens in the vocabulary with placeholder entries:

In [6]:
# Step 1: Save the vocabulary to a file
tokenizer_directory = Path('whisper-default-tokenizer')
tokenizer_original.save_pretrained(tokenizer_directory)


('whisper-default-tokenizer/tokenizer_config.json',
 'whisper-default-tokenizer/special_tokens_map.json',
 'whisper-default-tokenizer/vocab.json',
 'whisper-default-tokenizer/merges.txt',
 'whisper-default-tokenizer/normalizer.json',
 'whisper-default-tokenizer/added_tokens.json')

Now that the tokenizer is saved in `tokenizer_directory`, we can load `tokenizer_directory/vocab.json` and modify it:

In [7]:
# Step 2: Get vocab.json
import json

def json_from_path(path: Path):
    """ Reads and parses the JSON file at `path`. """
    with open(path, 'r', encoding='utf-8') as f:
        return json.load(f)

vocab = json_from_path(tokenizer_directory / 'vocab.json')

In [8]:
import re
NONWORD_REGEX = re.compile(r'[ \t?.,;!()/\-«»]+')
def split_by_word(text: str):
    """ Splits the given `text` into words. Returns a list of those words. """
    return NONWORD_REGEX.split(text)


# This character marks the beginning of a word in vocab.json
word_start_char = 'Ġ'
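Because of the word-start marker, the same word can appear under several keys (with or without the marker, and in different capitalizations). The following is a minimal sketch of how matching entries might be looked up when hunting for token IDs. The `find_token_ids` helper and the toy vocabulary are hypothetical and use made-up IDs; the real `vocab.json` is much larger.

```python
# Toy stand-in for vocab.json: maps token text to token ID.
# These entries and IDs are invented for illustration.
WORD_START = 'Ġ'  # same marker as word_start_char

toy_vocab = {'cat': 10, 'Ġcat': 11, 'ĠCat': 12, 'Ġcats': 13, 'dog': 14}

def find_token_ids(vocab: dict[str, int], word: str) -> dict[str, int]:
    """Return every entry whose text equals `word`, ignoring case and
    the leading word-start marker."""
    target = word.lower()
    matches = {}
    for key, token_id in vocab.items():
        bare = key[len(WORD_START):] if key.startswith(WORD_START) else key
        if bare.lower() == target:
            matches[key] = token_id
    return matches

print(find_token_ids(toy_vocab, 'cat'))
# {'cat': 10, 'Ġcat': 11, 'ĠCat': 12}
```

Only exact whole-token matches are returned here, so `Ġcats` is skipped. Whether partial matches should also be reviewed is a judgment call.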

In [9]:
# Step 3: Replace!
next_replacement_idx = 0
new_vocab = {}

# Token IDs can be found by inspecting the original vocab.json. These token IDs
# are specific to the multilingual whisper-tiny. Each remapping should be unique.
token_id_remappings = {
    19186: "[swearS1]", # s***
    30748: word_start_char + "[swearS2]",
    4611: word_start_char + "[swearS3]",
    19593: word_start_char + "[swearS4]", # S***
    10965: word_start_char + "[swearF1]", # F***
    26154: word_start_char + "[swearF2]", # F***
    33342: word_start_char + "[swearF3]",
    47069: word_start_char + "[swearF4]", # f****
    3275: word_start_char + "[swearF5]",
    22518: word_start_char + "[swearF6]",
    20022: word_start_char + "[swearF7]",
    5546: word_start_char + "[swearF8]",
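    # NOTE: 47069 is also used for [swearF4] above. In a dict literal the later
    # entry wins, so only this mapping takes effect; one of the two IDs needs correcting.
    47069: word_start_char + "[swearM1]",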
    29537: word_start_char + "[swearM2]",
    22676: word_start_char + "[swearB1]", # bull****
    11960: word_start_char + "[swearB2]",
    42094: word_start_char + "[swearB3]",
    40678: word_start_char + "[swearB4]"
}
replaced_keys = set()

for key in vocab:
    token_id = vocab[key]
    if token_id in token_id_remappings:
        new_key = token_id_remappings[token_id]
        new_vocab[new_key] = token_id
        replaced_keys.add(key)
    else:
        new_vocab[key] = token_id

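new_merges = []
with open(tokenizer_directory / 'merges.txt', 'r', encoding='utf-8') as merges:
    for line in merges:
        # Lines keep their trailing newline, so strip before testing for emptiness
        if len(line.strip()) == 0:
            continue
        words = split_by_word(line)
        # Drop merge rules whose first part is one of the replaced tokens
        if words[0] not in replaced_keys:
            new_merges.append(line.strip())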
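The remapping table is meant to contain each token ID and each placeholder only once. A Python dict literal silently keeps only the last value when a key is repeated, so a duplicated token ID does not raise an error. Below is a minimal sketch of a check, assuming the remappings are first written as a list of pairs. The `find_duplicates` helper and the sample IDs are hypothetical.

```python
from collections import Counter

def find_duplicates(remappings: list[tuple[int, str]]) -> tuple[list[int], list[str]]:
    """Return (duplicated token IDs, duplicated placeholder names)."""
    id_counts = Counter(token_id for token_id, _ in remappings)
    name_counts = Counter(name for _, name in remappings)
    dup_ids = sorted(i for i, n in id_counts.items() if n > 1)
    dup_names = sorted(s for s, n in name_counts.items() if n > 1)
    return dup_ids, dup_names

# Made-up IDs and placeholder names, for illustration only.
example = [(101, 'Ġ[A1]'), (102, 'Ġ[A2]'), (101, 'Ġ[B1]'), (103, 'Ġ[A2]')]
print(find_duplicates(example))
# ([101], ['Ġ[A2]'])
```

Once the list passes the check, `dict(remappings)` gives a table in the same shape as `token_id_remappings`.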

To check for other token IDs to replace (keeping in mind that the output should still be multilingual), we could do something like this:
```python
!pip install better_profanity==0.7.0

from better_profanity import profanity

profanity.load_censor_words()
for key in new_vocab:
    word = key
    if key.startswith(word_start_char):
        word = key[1:]
    if profanity.contains_profanity(word):
        print("Consider replacing", key, new_vocab[key])
```

Great! We now have an updated vocab file!
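Before writing the file, it can be worth confirming that the replacement only renamed entries and did not add, drop, or renumber any. The sketch below assumes that is the desired invariant. The `check_remap` helper and the toy vocabularies are hypothetical.

```python
def check_remap(old_vocab: dict[str, int], new_vocab: dict[str, int],
                remapped_ids: set[int]) -> list[str]:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if len(new_vocab) != len(old_vocab):
        problems.append(f'entry count changed: {len(old_vocab)} -> {len(new_vocab)}')
    if set(new_vocab.values()) != set(old_vocab.values()):
        problems.append('the set of token IDs changed')
    # Any entry whose text changed should be one we meant to remap.
    old_key_by_id = {token_id: key for key, token_id in old_vocab.items()}
    for key, token_id in new_vocab.items():
        if old_key_by_id.get(token_id) != key and token_id not in remapped_ids:
            problems.append(f'unexpected change for token ID {token_id}: {key!r}')
    return problems

# Toy data with invented IDs.
old = {'a': 0, 'Ġbad': 1, 'b': 2}
good = {'a': 0, 'Ġ[swear1]': 1, 'b': 2}
dropped = {'a': 0, 'b': 2}  # simulates an entry that was lost during replacement

print(check_remap(old, good, {1}))
print(check_remap(old, dropped, {1}))
```

With the notebook's variables, the equivalent call would be `check_remap(vocab, new_vocab, set(token_id_remappings))`.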

In [10]:
# Write to a file
tokenizer_fr_directory = Path('updated-tokenizer')
if tokenizer_fr_directory.exists():
    shutil.rmtree(tokenizer_fr_directory)
shutil.copytree(tokenizer_directory, tokenizer_fr_directory)
with open(tokenizer_fr_directory / 'vocab.json', 'w', encoding='utf-8') as f:
    json.dump(new_vocab, f, ensure_ascii=False)


with open(tokenizer_fr_directory / 'merges.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(new_merges))

In [11]:
from transformers import WhisperTokenizer

# Use a normal WhisperTokenizer -- WhisperTokenizerFast has trouble with the updated
# vocabulary.
tokenizer = WhisperTokenizer(
    tokenizer_fr_directory / 'vocab.json',
    tokenizer_fr_directory / 'merges.txt',
    tokenizer_fr_directory / 'normalizer.json',
    bos_token='<|startoftranscript|>',
    unk_token='',
    pad_token='<|endoftext|>',
)

# See https://discuss.huggingface.co/t/fine-tuning-whisper-on-my-own-dataset-with-a-customized-tokenizer/25903
tokenizer.add_special_tokens(tokenizer_original.special_tokens_map)

105

In [12]:
# For debugging, update the output directory
shutil.rmtree(tokenizer_fr_directory)
tokenizer.save_pretrained(tokenizer_fr_directory)

('updated-tokenizer/tokenizer_config.json',
 'updated-tokenizer/special_tokens_map.json',
 'updated-tokenizer/vocab.json',
 'updated-tokenizer/merges.txt',
 'updated-tokenizer/normalizer.json',
 'updated-tokenizer/added_tokens.json')

## Create the processor

Next, load the `WhisperProcessor`, which combines a feature extractor and tokenizer.

In [13]:
from transformers import WhisperProcessor

processor = WhisperProcessor(feature_extractor, tokenizer)

Next, build the model:

In [14]:
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained(finetune_from_id)
model.generation_config.forced_decoder_ids = None


config.json:   0%|          | 0.00/1.98k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/151M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/3.75k [00:00<?, ?B/s]

In [15]:
from dataclasses import dataclass
from typing import Any
import torch
# See the linked blog post and https://huggingface.co/docs/transformers/main_classes/data_collator

@dataclass
class DataCollatorWithPadding:
    ''' Converts raw data into a batch ready for the model '''
    processor: Any
    decoder_start_token_id: int

    def __call__(self, features: list) -> dict[str, torch.Tensor]:
        input_features = [{'input_features': f['input_features']} for f in features]
        label_features = [{'input_ids': f['labels']} for f in features]

        # According to the linked blog post, the input and label features need
        # to be padded separately (due to different final lengths), then
        # recombined:
        batch = self.processor.feature_extractor.pad(input_features, return_tensors='pt')

        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors='pt')

        # transformers uses -100 for masking
        labels = labels_batch['input_ids'].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # Don't double-prepend the beginning of sequence token:
        if (labels[:,0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch['labels'] = labels
        return batch

data_collator = DataCollatorWithPadding(processor=processor, decoder_start_token_id=model.config.decoder_start_token_id)

In [16]:
model_output_dir = Path('./final-model').resolve()
model.save_pretrained(model_output_dir)
tokenizer.save_pretrained(model_output_dir)



('/content/final-model/tokenizer_config.json',
 '/content/final-model/special_tokens_map.json',
 '/content/final-model/vocab.json',
 '/content/final-model/merges.txt',
 '/content/final-model/normalizer.json',
 '/content/final-model/added_tokens.json')

## Model conversion

Next, we need to convert the model into a format usable by Joplin. This next step converts the model from PyTorch to GGML.

In [17]:
!git clone https://github.com/openai/whisper whisper-github
!git clone https://github.com/ggerganov/whisper.cpp
!cd whisper.cpp && git checkout v1.7.4

Cloning into 'whisper-github'...
remote: Enumerating objects: 828, done.
remote: Counting objects: 100% (340/340), done.
remote: Compressing objects: 100% (68/68), done.
remote: Total 828 (delta 305), reused 272 (delta 272), pack-reused 488 (from 2)
Receiving objects: 100% (828/828), 8.24 MiB | 18.96 MiB/s, done.
Resolving deltas: 100% (498/498), done.
Cloning into 'whisper.cpp'...
remote: Enumerating objects: 15791, done.
remote: Counting objects: 100% (3087/3087), done.
remote: Compressing objects: 100% (456/456), done.
remote: Total 15791 (delta 2734), reused 2631 (delta 2630), pack-reused 12704 (from 4)
Receiving objects: 100% (15791/15791), 19.84 MiB | 10.33 MiB/s, done.
Resolving deltas: 100% (10863/10863), done.
Note: switching to 'v1.7.4'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

I

In [18]:
# Patch convert-h5-to-ggml to work with more recent model versions
conversion_script_path = Path('whisper.cpp/models/convert-h5-to-ggml.py')
conversion_script_content = conversion_script_path.read_text()
with open(conversion_script_path, 'w') as conversion_script:
    bad_if_statement = 'if "max_length" not in hparams:'
    replaced_if_statement = 'if "max_length" not in hparams or hparams["max_length"] is None:'
    conversion_script.write(conversion_script_content.replace(bad_if_statement, replaced_if_statement))
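A plain `str.replace` does nothing if the target line is missing, for example if a different whisper.cpp version words it differently. The sketch below shows one way a text patch could fail loudly in that case and stay safe to re-run. The `patch_file` helper is hypothetical, and the demonstration uses a throwaway file rather than the real conversion script.

```python
from pathlib import Path
import tempfile

def patch_file(path: Path, old: str, new: str) -> bool:
    """Replace `old` with `new` in the file at `path`.

    Returns True if the file was changed and False if it already contains
    `new`. Raises ValueError if neither text is present."""
    content = path.read_text(encoding='utf-8')
    if new in content:
        return False
    if old not in content:
        raise ValueError(f'{path}: expected text not found; patch may be outdated')
    path.write_text(content.replace(old, new), encoding='utf-8')
    return True

# Demonstration on a temporary file with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / 'script.py'
    script.write_text('timeout = 10\nprint(timeout)\n', encoding='utf-8')
    first = patch_file(script, 'timeout = 10', 'timeout = 30')
    second = patch_file(script, 'timeout = 10', 'timeout = 30')
    print(first, second)  # True False
```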

In [19]:
!mkdir ./ggml
!python whisper.cpp/models/convert-h5-to-ggml.py ./final-model ./whisper-github ./ggml
!mv ./ggml/ggml-model.bin ./ggml/ggml-clean.bin

2025-02-28 23:05:03.568600: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1740783903.648093    1229 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1740783903.659216    1229 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
model.encoder.conv1.weight  ->  encoder.conv1.weight
encoder.conv1.weight 3 (384, 80, 3)
model.encoder.conv1.bias  ->  encoder.conv1.bias
  Reshaped variable:  encoder.conv1.bias  to shape:  (384, 1)
encoder.conv1.bias 2 (384, 1)
  Converting to float32
model.encoder.conv2.weight  ->  encoder.conv2.weight
encoder.conv2.weight 3 (384, 384, 3)
model.encoder.conv2.bias  ->  encoder.conv2.bias
  Reshaped variable:  encoder.conv2.bias  to

For smaller size and better performance, we can also quantize the GGML model:

In [20]:
!cd whisper.cpp && cmake -B build && cmake --build build --config Release
!./whisper.cpp/build/bin/quantize ./ggml/ggml-clean.bin ./ggml/ggml-clean-q8_0.bin q8_0

  Compatibility with CMake < 3.10 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value.  Or, use the <min>...<max> syntax
  to tell CMake that the project requires at least <min> but has been updated
  to work with policies introduced by <max> or earlier.

-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.34.1")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- CMAKE_SYSTEM_PROCES
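For intuition about why the `q8_0` file is smaller, here is a rough sketch of block-wise 8-bit quantization. It assumes weights are grouped into blocks of 32, each stored as one scale plus 32 signed 8-bit integers. This illustrates the idea only and is not whisper.cpp's actual implementation; the function names are hypothetical.

```python
BLOCK_SIZE = 32  # assumed number of weights per block

def quantize_block(values: list[float]) -> tuple[float, list[int]]:
    """Map floats to integers in [-127, 127] that share a single scale."""
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax > 0 else 1.0
    quants = [round(v / scale) for v in values]
    return scale, quants

def dequantize_block(scale: float, quants: list[int]) -> list[float]:
    """Recover approximate floats from the scale and the integers."""
    return [scale * q for q in quants]

# Made-up weights in the range [-0.32, 0.31].
block = [((i * 37) % 64 - 32) / 100 for i in range(BLOCK_SIZE)]
scale, quants = quantize_block(block)
restored = dequantize_block(scale, quants)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(f'scale={scale:.6f}, max error={max_error:.6f}')
```

The rounding error per weight is at most half of the block's scale, which is why quantization usually changes transcription quality only slightly.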

## Testing it

Now, let's make sure that the GGML model works. Start by downloading some test audio:

In [21]:
!mkdir ./test-audio
# Download the first chapter of Alice in Wonderland (in French and in English)
!wget -P ./test-audio/ https://www.archive.org/download/alice_au_pays_des_merveilles_1811_librivox/aliceaupays_01_carroll_128kb.mp3
!wget -P ./test-audio/ https://www.archive.org/download/alice_in_wonderland_librivox/wonderland_ch_01.mp3
# Convert both recordings to a format that's understandable by whisper.cpp:
# -t 30                 Take the first 30s
# -i ...                Input path
# -ar 16000             Sample rate of 16000 Hz
# -ac 1                 1 audio channel
# -codec:a pcm_s16le    Audio codec
!ffmpeg -t 30 -i ./test-audio/aliceaupays_01_carroll_128kb.mp3 -ar 16000 -ac 1 -codec:a pcm_s16le ./test-audio/recording-fr.wav
!ffmpeg -t 30 -i ./test-audio/wonderland_ch_01.mp3 -ar 16000 -ac 1 -codec:a pcm_s16le ./test-audio/recording-en.wav

--2025-02-28 23:07:21--  https://www.archive.org/download/alice_au_pays_des_merveilles_1811_librivox/aliceaupays_01_carroll_128kb.mp3
Resolving www.archive.org (www.archive.org)... 207.241.224.2
Connecting to www.archive.org (www.archive.org)|207.241.224.2|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://archive.org/download/alice_au_pays_des_merveilles_1811_librivox/aliceaupays_01_carroll_128kb.mp3 [following]
--2025-02-28 23:07:22--  https://archive.org/download/alice_au_pays_des_merveilles_1811_librivox/aliceaupays_01_carroll_128kb.mp3
Resolving archive.org (archive.org)... 207.241.224.2
Connecting to archive.org (archive.org)|207.241.224.2|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://ia903201.us.archive.org/25/items/alice_au_pays_des_merveilles_1811_librivox/aliceaupays_01_carroll_128kb.mp3 [following]
--2025-02-28 23:07:23--  https://ia903201.us.archive.org/25/items/alice_au_pays_des_mervei
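If a transcription later fails or produces gibberish, a quick check of the converted files can rule out a format problem. The sketch below uses Python's standard `wave` module to confirm the properties requested from `ffmpeg` above (16000 Hz, one channel, 16-bit samples). The helper names are hypothetical, and the demonstration writes its own one-second silent file instead of reading the recordings.

```python
import os
import tempfile
import wave

def describe_wav(path: str) -> dict:
    """Read basic format information from a PCM WAV file."""
    with wave.open(path, 'rb') as wav:
        return {
            'sample_rate': wav.getframerate(),
            'channels': wav.getnchannels(),
            'sample_width_bytes': wav.getsampwidth(),
            'duration_seconds': wav.getnframes() / wav.getframerate(),
        }

def matches_expected_format(info: dict) -> bool:
    """True for 16 kHz, mono, 16-bit audio."""
    return (info['sample_rate'] == 16000
            and info['channels'] == 1
            and info['sample_width_bytes'] == 2)

# Demonstration: write one second of silence, then inspect it.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'silence.wav')
    with wave.open(demo_path, 'wb') as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(16000)
        out.writeframes(b'\x00\x00' * 16000)
    info = describe_wav(demo_path)

print(info, matches_expected_format(info))
```

For the files created above, the call would be `describe_wav('./test-audio/recording-fr.wav')`.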

Next, use the `whisper-cli` command to transcribe the audio using our GGML model:

In [22]:
# Test converting the WAV file to text using the GGML file that we built
!./whisper.cpp/build/bin/whisper-cli --language fr --no-timestamps -m ./ggml/ggml-clean.bin ./test-audio/recording-fr.wav
!./whisper.cpp/build/bin/whisper-cli --language en --no-timestamps -m ./ggml/ggml-clean.bin ./test-audio/recording-en.wav

whisper_init_from_file_with_params_no_state: loading model from './ggml/ggml-clean.bin'
whisper_init_with_params_no_state: use gpu    = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw        = 0
whisper_init_with_params_no_state: devices    = 1
whisper_init_with_params_no_state: backends   = 1
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 1 (tiny)
whisper_model_load: adding 1607 extra tokens
whisper_model_load

In [23]:
# Compare with the upstream model
!mkdir ./ggml-upstream/
!sh ./whisper.cpp/models/download-ggml-model.sh tiny ./ggml-upstream/
!./whisper.cpp/build/bin/whisper-cli --language fr --no-timestamps -m ./ggml-upstream/ggml-tiny.bin ./test-audio/recording-fr.wav

Downloading ggml model tiny from 'https://huggingface.co/ggerganov/whisper.cpp' ...
Done! Model 'tiny' saved in './ggml-upstream//ggml-tiny.bin'
You can now use it like this:

  $ ./main -m ./ggml-upstream//ggml-tiny.bin -f samples/jfk.wav

whisper_init_from_file_with_params_no_state: loading model from './ggml-upstream/ggml-tiny.bin'
whisper_init_with_params_no_state: use gpu    = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw        = 0
whisper_init_with_params_no_state: devices    = 1
whisper_init_with_params_no_state: backends   = 1
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_mode
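The two transcripts can be compared by eye, but a number is easier to track between runs. The word error rate (WER) mentioned in the install cell is the word-level edit distance divided by the number of reference words. Below is a small pure-Python sketch of that definition. It assumes lowercasing and whitespace splitting are enough normalization, and a dedicated library may normalize differently. The sample sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError('reference must contain at least one word')
    # previous[j] is the distance between the reference words seen so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # word missing from the hypothesis
                current[j - 1] + 1,      # extra word in the hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)

reference = 'alice was beginning to get very tired'
hypothesis = 'alice was begining to get tired'
print(round(word_error_rate(reference, hypothesis), 3))  # 0.286
```

Here one word is substituted and one is missing, giving 2 errors over 7 reference words.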

## Building the Joplin-compatible model

Next, we need to convert the model to a format compatible with Joplin. A `.zip` file is created with the following structure:
```
model.zip/
| README.md
| model.bin
| config.json
```

In [33]:
from pathlib import Path
import shutil, json, zipfile

def package_output(source_model: Path, output_dir: Path, output_filename: str):
    if not output_dir.exists():
        output_dir.mkdir()
    unzipped_dir = output_dir / 'unzipped'
    if unzipped_dir.exists():
        shutil.rmtree(unzipped_dir)
    unzipped_dir.mkdir()

    shutil.copyfile(source_model, unzipped_dir / 'model.bin')
    # config.json
    config_filepath = unzipped_dir / 'config.json'
    config_filepath.write_text(json.dumps({
        'prompts': {
            # Custom prompts can improve accuracy.
            'en': 'Joplin is a note-taking application. This is a Joplin note.'
        },
        'output': {
            '//': 'Each of the replacements is in the form [ original, replaceWith ]. For example, ["test", ""] replaces all instances of "test" with the empty string.',
            'stringReplacements': [
                [ '[BLANK_AUDIO]', '' ],
            ],
            'regexReplacements': [
                [ r'^\([^(),.?]+\)$', ''],
                [ r'^\[[^(),.?]+\]$', ''],
                [ r'^[.,?!]$', '' ],
            ],
        }
    }, indent='\t'))
    # README.md
    readme_filepath = unzipped_dir / 'README.md'
    readme_filepath.write_text('\n'.join([
        '# {}'.format(output_filename),
        '',
        'This model is a version of `whisper-tiny` with an [adjusted vocab.json](https://github.com/personalizedrefrigerator/joplin-voice-typing-test/blob/main/whisper_vocab_cleanup.ipynb) to reduce the probability of profanity when given noisy non-speech input.',
        '',
        '## License',
        '',
        'The Whisper model from which this is modified has the following license:',
        '''
        MIT License

        Copyright (c) 2022 OpenAI

        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:

        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.

        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        '''
    ]))

    # Make the .zip file
    # See https://docs.python.org/3/library/shutil.html
    shutil.make_archive(output_dir / output_filename, 'zip', unzipped_dir)

package_output(
    Path('./ggml/ggml-clean.bin'),
    Path('./joplin-model'),
    'whisper-tiny'
)
package_output(
    Path('./ggml/ggml-clean-q8_0.bin'),
    Path('./joplin-model-q8_0'),
    'whisper-tiny-q8_0'
)
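To see what the `output` rules in `config.json` are meant to filter, the patterns can be tried against sample transcripts. The sketch below is an assumption about how such rules could be applied (string replacements first, then each regular expression against the trimmed text). Joplin's real implementation may differ in order, trimming, or regex flavor. `clean_transcript` is a hypothetical helper.

```python
import re

# Same rules as in the generated config.json.
STRING_REPLACEMENTS = [['[BLANK_AUDIO]', '']]
REGEX_REPLACEMENTS = [
    [r'^\([^(),.?]+\)$', ''],
    [r'^\[[^(),.?]+\]$', ''],
    [r'^[.,?!]$', ''],
]

def clean_transcript(text: str) -> str:
    """Apply the string rules, trim, then apply the regex rules (assumed order)."""
    for original, replace_with in STRING_REPLACEMENTS:
        text = text.replace(original, replace_with)
    text = text.strip()  # assumption: rules see text without surrounding whitespace
    for pattern, replace_with in REGEX_REPLACEMENTS:
        text = re.sub(pattern, replace_with, text)
    return text

samples = ['[BLANK_AUDIO]', '(wind blowing)', '[music]', '.', 'Hello (aside) there.']
for sample in samples:
    print(repr(sample), '->', repr(clean_transcript(sample)))
```

Under these assumptions, output that consists only of a bracketed annotation or a lone punctuation mark is dropped, while ordinary sentences containing parentheses are left alone.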

The models are now built! They're stored in the `./joplin-model` and `./joplin-model-q8_0` directories.
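As a final check, the archive contents can be compared with the structure listed above. This is a minimal sketch using the standard `zipfile` module. The `missing_from_archive` helper is hypothetical, and the demonstration builds small throwaway archives rather than opening the real ones.

```python
import tempfile
import zipfile
from pathlib import Path

EXPECTED_FILES = {'README.md', 'model.bin', 'config.json'}

def missing_from_archive(zip_path: Path) -> set[str]:
    """Return the expected file names that are absent from the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return EXPECTED_FILES - set(archive.namelist())

# Demonstration with two small archives containing placeholder data.
with tempfile.TemporaryDirectory() as tmp:
    complete = Path(tmp) / 'complete.zip'
    with zipfile.ZipFile(complete, 'w') as archive:
        for name in sorted(EXPECTED_FILES):
            archive.writestr(name, 'placeholder')

    incomplete = Path(tmp) / 'incomplete.zip'
    with zipfile.ZipFile(incomplete, 'w') as archive:
        archive.writestr('README.md', 'placeholder')
        archive.writestr('model.bin', 'placeholder')

    missing_complete = missing_from_archive(complete)
    missing_incomplete = missing_from_archive(incomplete)

print(missing_complete, missing_incomplete)
```

Given the `package_output` calls above, the archives to check should be `./joplin-model/whisper-tiny.zip` and `./joplin-model-q8_0/whisper-tiny-q8_0.zip`.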