# Early-stop encoding

This notebook is based on the [build notebook](https://github.com/futo-org/whisper-acft/blob/main/finetune.ipynb) for `whisper-acft`. It creates a variant of Whisper that is more [robust to an encoder that stops early](https://github.com/futo-org/whisper-acft?tab=readme-ov-file#motive-and-explanation-for-anyone-uninitiated).

The original notebook is licensed under the MIT license:
<details><summary>MIT License</summary>

MIT License

Copyright (c) 2024 FUTO Organization

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

</details>

In [1]:
!pip install transformers torch datasets


In [2]:
from datasets import load_dataset, Audio
dataset_fr = load_dataset('google/fleurs', 'fr_fr', split='train')

audio_feature = Audio(sampling_rate=16_000)
dataset_fr = dataset_fr.cast_column('audio', audio_feature)

The repository for google/fleurs contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/google/fleurs.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y



## Model setup

Next, create two models: one to be trained and one to use as a reference. To keep the output as consistent as possible, we'll use the output of `model_base` as the expected output of `model_train`.
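
The idea of training one copy against a frozen reference copy can be sketched on a toy module. This is only an illustration under stated assumptions: the small `nn.Sequential` network stands in for Whisper, and zeroing half of the input is a hypothetical stand-in for the shortened encoder context used later in this notebook.

```python
import copy

import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for a Whisper model: any module that maps features to hidden states.
reference = nn.Sequential(nn.Linear(8, 16), nn.GELU(), nn.Linear(16, 16)).eval()
student = copy.deepcopy(reference).train()

# The reference is never updated, so it does not need gradients.
for param in reference.parameters():
    param.requires_grad_(False)

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

full_input = torch.randn(4, 8)
# Hypothetical "degraded" input: the second half of each feature vector is zeroed.
partial_input = full_input.clone()
partial_input[:, 4:] = 0.0

# The reference sees the full input and provides the training target.
with torch.no_grad():
    target = reference(full_input)

initial_weights = copy.deepcopy(student.state_dict())

# One training step: pull the student's output on the degraded input
# towards the reference's output on the full input.
optimizer.zero_grad()
loss = criterion(student(partial_input), target)
loss.backward()
optimizer.step()

print(f"loss before the step: {loss.item():.4f}")
```

Only `student` changes during this step. `reference` plays the role of `model_base` below and stays fixed.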

In [3]:
from transformers import WhisperModel, WhisperTokenizer, WhisperProcessor

# TODO: Change model_name to match the name of the model to update (e.g. to personalizedrefrigerator/whisper-base-fr)
model_name = 'personalizedrefrigerator/whisper-tiny-fr'
model_train = WhisperModel.from_pretrained(model_name).cuda().train()
model_base = WhisperModel.from_pretrained(model_name).cuda().eval()
processor = WhisperProcessor.from_pretrained(model_name, language='french', task='transcribe')


## Fine-tune

In [4]:
ds = dataset_fr

In [5]:
def get_sample(example):
    waveform = example['audio']['array']
    sampling_rate = example['audio']['sampling_rate']
    assert sampling_rate == 16_000

    input_features = processor(
        waveform, sampling_rate=sampling_rate, return_tensors='pt'
    ).input_features
    input_ids = processor.tokenizer.encode(example['raw_transcription'])
    return {
        'length': len(waveform) / sampling_rate,
        'input_features': input_features,
        'input_ids': input_ids
    }

# Test
[processor.tokenizer.decode(i) for i in get_sample(ds[1])['input_ids']]

['<|startoftranscript|>',
 '<|fr|>',
 '<|transcribe|>',
 '<|notimestamps|>',
 'Il',
 ' s',
 '�',
 '�',
 'agit',
 ' d',
 '�',
 '�',
 'une',
 ' ent',
 'ité',
 ' très',
 ' complex',
 'e',
 ' qui',
 ' consiste',
 ',',
 ' selon',
 ' un',
 ' modèle',
 ' de',
 ' Boh',
 'r',
 ' simpl',
 'ifi',
 'é',
 ',',
 ' en',
 ' un',
 ' no',
 'y',
 'au',
 ' central',
 ' orb',
 'ité',
 ' par',
 ' des',
 ' élect',
 'rons',
 ',',
 ' un',
 ' peu',
 ' comme',
 ' les',
 ' plan',
 'è',
 'tes',
 ' en',
 ' orb',
 'ite',
 ' autour',
 ' du',
 ' sole',
 'il',
 ' —',
 ' c',
 'f',
 '.',
 ' illustration',
 '�',
 '�',
 '1',
 '.',
 '1',
 '.',
 '<|endoftext|>']
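
The `'�'` entries in the output above are not necessarily a sign of corrupted data. Whisper's tokenizer works on UTF-8 bytes, so a multi-byte character (likely the typographic apostrophe in "s’agit") can be split across two tokens, and decoding each token on its own then yields replacement characters. A minimal pure-Python sketch of the effect, where the exact byte split is an assumption and not taken from the tokenizer:

```python
# U+2019 (right single quotation mark) takes three bytes in UTF-8.
apostrophe = '\u2019'
encoded = apostrophe.encode('utf-8')

# Assumed split: the tokenizer may divide the bytes differently.
pieces = [encoded[:2], encoded[2:]]

# Decoding each piece alone cannot produce a valid character...
decoded_separately = [piece.decode('utf-8', errors='replace') for piece in pieces]
# ...but decoding the joined bytes recovers the original character.
decoded_together = b''.join(pieces).decode('utf-8')

print(decoded_separately)
print(decoded_together)
```

Decoding the whole ID list in a single call, for example `processor.tokenizer.decode(get_sample(ds[1])['input_ids'])`, should show these characters intact.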

In [6]:
import torch
from tqdm import tqdm
from torch import nn

# Note: Mostly copied from https://github.com/futo-org/whisper-acft/blob/main/finetune.ipynb
#       See above for license and other information.

def compute_partially_encoder(model, data, n_audio_ctx):
    """
        Computes hidden states for the given model with only a partial run of the encoder.

        Parameters:
        - model: The model.
        - data: Input features to the model.
        - n_audio_ctx: Number of encoder positions to use, normally slightly larger than the recording length (1 position = 1/50 s, i.e. 20 ms). Set to 1500 to run the full encoder over the complete 30 s window. See https://github.com/futo-org/whisper-acft/issues/6#issuecomment-2290093422.
    """
    diffy = 2 * n_audio_ctx - data.shape[2]
    if diffy > 0:
        data = nn.functional.pad(data, [0, diffy, 0, 0, 0, 0], 'constant', 0.0)
    elif diffy < 0:
        data = data[:,:,:diffy]

    # Default encoding -- the full audio
    if n_audio_ctx == 1500:
        return model.encoder(data).last_hidden_state

    input_embeds = nn.functional.gelu(model.encoder.conv1(data))
    input_embeds = nn.functional.gelu(model.encoder.conv2(input_embeds))
    input_embeds = input_embeds.permute(0, 2, 1)

    embed_pos = model.encoder.embed_positions.weight[:n_audio_ctx]

    hidden_states = input_embeds + embed_pos
    hidden_states = nn.functional.dropout(hidden_states, p=model.encoder.dropout, training=model.encoder.training)

    for idx, encoder_layer in enumerate(model.encoder.layers):
        to_drop = False
        if model.encoder.training:
            dropout_probability = torch.rand([])
            if dropout_probability < model.encoder.layerdrop:
                to_drop = True

        if to_drop:
            layer_outputs = (None, None)
        else:
            if model.encoder.gradient_checkpointing and model.encoder.training:
                layer_outputs = model.encoder._gradient_checkpointing_func(
                    encoder_layer.__call__,
                    hidden_states,
                    None,
                    None,
                    False,
                )
            else:
                layer_outputs = encoder_layer(
                    hidden_states,
                    None,
                    layer_head_mask=None,
                    output_attentions=False,
                )

            hidden_states = layer_outputs[0]

    hidden_states = model.encoder.layer_norm(hidden_states)
    return hidden_states


def compute_hidden_state_loss(model_train, model_base, optimizer, criterion, example):
    optimizer.zero_grad()

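    # 1500 encoder positions cover 30 s of audio (50 positions per second)
    n_ctx = int(round((1500.0 / 30.0) * example["length"]))

    # Randomly shift the context length by up to 64 positions (or a third of n_ctx, if smaller).
    # Cap the result at 1500: the encoder has no positional embeddings beyond that.
    jitter = min(64, n_ctx // 3)
    extra_ctx = torch.randint(-jitter, jitter, (1,)).item()
    n_ctx = min(n_ctx + extra_ctx, 1500)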

    input_features = example["input_features"].cuda()
    input_ids = torch.tensor([example["input_ids"]], dtype=torch.long).cuda()

    encoder_hidden_states_partial = compute_partially_encoder(model_train, input_features, n_ctx)
    output_partial = model_train.decoder(
        input_ids=input_ids,
        encoder_hidden_states=encoder_hidden_states_partial,
        output_hidden_states=True
    )

    with torch.no_grad():
        encoder_hidden_states_full = compute_partially_encoder(model_base, input_features, 1500)
        output_full = model_base.decoder(
            input_ids=input_ids,
            encoder_hidden_states=encoder_hidden_states_full,
            output_hidden_states=True
        )

    loss = criterion(
        #output_partial.hidden_states[-1],
        #output_full.hidden_states[-1]
        torch.cat(output_partial.hidden_states, 0),
        torch.cat(output_full.hidden_states, 0)
    )

    loss.backward()
    optimizer.step()

    return loss
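
To make the context-length arithmetic in the cell above concrete, here is a small worked example in plain Python. The helper names (`n_ctx_for_length`, `jitter_range`, `mel_frames_for`) are made up for this illustration. It relies on two facts: `torch.randint(low, high, ...)` never returns `high`, and `compute_partially_encoder` keeps `2 * n_audio_ctx` input frames because the encoder's second convolution halves the frame count.

```python
def n_ctx_for_length(length_seconds):
    """Encoder positions covering a clip: 1500 positions span 30 s (50 per second)."""
    return int(round((1500.0 / 30.0) * length_seconds))


def jitter_range(n_ctx):
    """Smallest and largest n_ctx after the random shift.

    torch.randint(low, high) never returns `high`, so the shift lies in
    [-bound, bound - 1].
    """
    bound = min(64, n_ctx // 3)
    return n_ctx - bound, n_ctx + bound - 1


def mel_frames_for(n_ctx):
    """Log-mel frames kept as encoder input (two frames per position)."""
    return 2 * n_ctx


for seconds in (3.0, 10.0, 20.0):
    n_ctx = n_ctx_for_length(seconds)
    low, high = jitter_range(n_ctx)
    print(f"{seconds:>4} s: n_ctx={n_ctx}, jittered range=[{low}, {high}], "
          f"mel frames={mel_frames_for(n_ctx)}")
```

For a 10 s recording this gives `n_ctx = 500`, a jittered value between 436 and 563, and 1000 mel frames for the unjittered case.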

Next, enter the training loop:

In [None]:
from torch.utils.tensorboard import SummaryWriter

# Note: Mostly copied from https://github.com/futo-org/whisper-acft/blob/main/finetune.ipynb
#       See above for license and other information.

criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model_train.parameters(), lr=1e-7)


writer = SummaryWriter()
writer.add_text("name", f"{model_name} v3")

num_length = 0
step = 0
for epoch in range(8):
    pbar = tqdm(ds.shuffle(seed=epoch))
    for example in pbar:
        example = get_sample(example)
        # Skip recordings that nearly fill Whisper's 30 s window
        if example["length"] > 29.0:
            continue

        loss = compute_hidden_state_loss(model_train, model_base, optimizer, criterion, example)
        step += 1
        num_length += example["length"]

        writer.add_scalar("loss/train", loss.item(), step)
        writer.add_scalar("length/train", num_length, step)
        writer.add_scalar("epoch/train", epoch, step)

        pbar.set_description(f"Epoch {epoch}, Loss: {loss.item()}")


  0%|          | 0/3193 [00:00<?, ?it/s]Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
Epoch 0, Loss: 0.17653633654117584: 100%|██████████| 3193/3193 [04:02<00:00, 13.16it/s]
Epoch 1, Loss: 0.1711929440498352: 100%|██████████| 3193/3193 [03:45<00:00, 14.15it/s]
Epoch 2, Loss: 0.0884438008069992: 100%|██████████| 3193/3193 [03:46<00:00, 14.07it/s]
Epoch 3, Loss: 0.05065027251839638:  49%|████▉     | 1569/3193 [01:59<01:50, 14.70it/s]

In [None]:
from transformers import WhisperForConditionalGeneration
model = WhisperForConditionalGeneration.from_pretrained(model_name).eval().cpu()
model.model = model_train.eval().cpu()

model.save_pretrained('final-model')
processor.tokenizer.save_pretrained('final-model')

## Testing it!

In [None]:
sample_data = next(iter(dataset_fr))

# The processor returns log-mel input features (not token IDs)
input_features = processor(
    sample_data['audio']['array'], sampling_rate=16_000, return_tensors='pt'
).input_features
output_ids = model.generate(input_features)
processor.batch_decode(output_ids)

## Model conversion

Next, we need to convert the model into a format usable by Joplin. This next step converts the model from PyTorch to GGML. Note that this section has been copied and modified from the Joplin `fine_tune_whisper_for_french.ipynb` notebook.

In [None]:
!git clone https://github.com/openai/whisper whisper-github
!git clone https://github.com/ggerganov/whisper.cpp
!cd whisper.cpp && git checkout v1.7.4

In [None]:
from pathlib import Path
# Patch convert-h5-to-ggml to work with more recent model versions
conversion_script_path = Path('whisper.cpp/models/convert-h5-to-ggml.py')
conversion_script_content = conversion_script_path.read_text()
with open(conversion_script_path, 'w') as conversion_script:
    bad_if_statement = 'if "max_length" not in hparams:'
    replaced_if_statement = 'if "max_length" not in hparams or hparams["max_length"] is None:'
    conversion_script.write(conversion_script_content.replace(bad_if_statement, replaced_if_statement))
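
A plain `str.replace` does nothing when the target text is missing, for example if a different whisper.cpp version words the `if` statement differently. One way to notice this is a helper that fails loudly. The sketch below uses a hypothetical `replace_once` helper and demonstrates it on a throwaway file, not on the real conversion script:

```python
import tempfile
from pathlib import Path


def replace_once(path, old, new):
    """Replace `old` with `new` in the file at `path`; raise if `old` is absent."""
    content = path.read_text()
    if old not in content:
        raise ValueError(f"Pattern not found in {path}: {old!r}")
    path.write_text(content.replace(old, new))


old_line = 'if "max_length" not in hparams:'
new_line = 'if "max_length" not in hparams or hparams["max_length"] is None:'

with tempfile.TemporaryDirectory() as tmp:
    demo_script = Path(tmp) / 'demo.py'
    demo_script.write_text(old_line + '\n    pass\n')

    replace_once(demo_script, old_line, new_line)
    patched = demo_script.read_text()

    # A second attempt fails, because the old text is no longer present.
    try:
        replace_once(demo_script, old_line, new_line)
        second_attempt_failed = False
    except ValueError:
        second_attempt_failed = True

print(patched)
```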

In [None]:
!mkdir ./ggml
!python whisper.cpp/models/convert-h5-to-ggml.py ./final-model ./whisper-github ./ggml

For smaller size and better performance, we can also quantize the GGML model:

In [None]:
!cd whisper.cpp && cmake -B build && cmake --build build --config Release
!./whisper.cpp/build/bin/quantize ./ggml/ggml-model.bin ./ggml/ggml-model-q8_0.bin q8_0
!./whisper.cpp/build/bin/quantize ./ggml/ggml-model.bin ./ggml/ggml-model-q5_0.bin q5_0
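
To see how much space quantization saves, the file sizes can be compared. This is a small sketch with a hypothetical `size_report` helper. It assumes the three output paths used in the cell above and skips any file that does not exist yet:

```python
from pathlib import Path


def size_report(paths):
    """Return {file name: size in MiB} for those paths that exist."""
    return {
        Path(p).name: Path(p).stat().st_size / 2**20
        for p in paths
        if Path(p).exists()
    }


candidates = [
    './ggml/ggml-model.bin',
    './ggml/ggml-model-q8_0.bin',
    './ggml/ggml-model-q5_0.bin',
]
for name, mib in size_report(candidates).items():
    print(f"{name}: {mib:.1f} MiB")
```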

Now, let's make sure that the GGML model works. Start by downloading some test audio:

In [None]:
!mkdir ./test-audio
# Download a chapter of Alice in Wonderland (in French) from LibriVox
!wget -P ./test-audio/ https://www.archive.org/download/alice_au_pays_des_merveilles_1811_librivox/aliceaupays_04_carroll_128kb.mp3
# Convert it to a format that's understandable by whisper.cpp:
# -t 10                 Take the first 10s
# -i ...                Input path
# -ar 16000             Sample rate of 16000 Hz
# -ac 1                 1 audio channel
# -codec:a pcm_s16le    Audio codec (16-bit signed little-endian PCM)
!ffmpeg -t 10 -i ./test-audio/aliceaupays_04_carroll_128kb.mp3 -ar 16000 -ac 1 -codec:a pcm_s16le ./test-audio/recording-fr-4.wav
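
Before transcribing, it can help to confirm that the converted file really has the format requested by the `ffmpeg` flags above (16 kHz, mono, 16-bit PCM). The sketch below uses Python's standard `wave` module. The `wav_format` helper is hypothetical, and the demo writes one second of silence to a temporary file so that it runs without the downloaded audio:

```python
import os
import tempfile
import wave


def wav_format(path):
    """Return (sample rate in Hz, channel count, bits per sample) of a PCM WAV file."""
    with wave.open(path, 'rb') as wav:
        return wav.getframerate(), wav.getnchannels(), 8 * wav.getsampwidth()


# The format requested by: -ar 16000 -ac 1 -codec:a pcm_s16le
EXPECTED_FORMAT = (16_000, 1, 16)

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'silence.wav')
    with wave.open(demo_path, 'wb') as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav.setframerate(16_000)
        wav.writeframes(b'\x00\x00' * 16_000)  # one second of silence
    demo_format = wav_format(demo_path)

print(demo_format == EXPECTED_FORMAT)
```

In this notebook, the same check would be `wav_format('./test-audio/recording-fr-4.wav') == EXPECTED_FORMAT`.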

Next, use the `whisper-cli` command to transcribe the audio using our GGML model:

In [None]:
# Test converting the WAV file to text using the GGML file that we built
!./whisper.cpp/build/bin/whisper-cli --language fr -np --no-timestamps -m ./ggml/ggml-model.bin ./test-audio/recording-fr-4.wav

## Google Colab-specific

In [None]:
# Google Colab only: Save the files to the local machine
from google.colab import files
files.download('./ggml/ggml-model-q8_0.bin')
files.download('./ggml/ggml-model.bin')