# Comparing Whisper models (CPU)

This notebook compares multiple Whisper models on a test dataset, using a CPU.

Note that the `large-v3-turbo` Whisper model is not included in this notebook because it takes a very long time per sample on the CPU (130 seconds per transcription in the most recent test).

## Install dependencies

In [1]:
!pip install --upgrade pip
# jiwer is used for the word error rate (WER) metric
!pip install --upgrade datasets[audio] transformers evaluate jiwer

Collecting pip
  Downloading pip-25.0.1-py3-none-any.whl.metadata (3.7 kB)
Downloading pip-25.0.1-py3-none-any.whl (1.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 24.8 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
Successfully installed pip-25.0.1
Collecting transformers
  Downloading transformers-4.51.2-py3-none-any.whl.metadata (38 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting jiwer
  Downloading jiwer-3.1.0-py3-none-any.whl.metadata (2.6 kB)
Collecting datasets[audio]
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets[audio])
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets[audio])
  Downloading xxhash-3.5.0-cp311-cp311-manyl

In [2]:
!pip install pyspellchecker==0.8.1

Collecting pyspellchecker==0.8.1
  Downloading pyspellchecker-0.8.1-py3-none-any.whl.metadata (9.4 kB)
Downloading pyspellchecker-0.8.1-py3-none-any.whl (6.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.8/6.8 MB 92.4 MB/s eta 0:00:00
Installing collected packages: pyspellchecker
Successfully installed pyspellchecker-0.8.1


In [3]:
import wandb
# See https://discuss.huggingface.co/t/how-to-turn-wandb-off-in-trainer/6237/10
wandb.init(mode='disabled')

In [4]:
import shutil


## Load data

We'll compare the models on the [`facebook/multilingual_librispeech`](https://huggingface.co/datasets/facebook/multilingual_librispeech/viewer/french?views%5B%5D=french_dev) and [`facebook/voxpopuli`](https://huggingface.co/datasets/facebook/voxpopuli) datasets.

In [5]:
from datasets import load_dataset, Audio

audio_feature = Audio(sampling_rate=16_000)
def load_librispeech(language: str):
    dataset = load_dataset('facebook/multilingual_librispeech', language, split='test', streaming=True)
    dataset = dataset.cast_column('audio', audio_feature)
    return dataset.select_columns(['transcript', 'audio'])

def load_voxpopuli(language: str):
    dataset = load_dataset('facebook/voxpopuli', language, split='test', streaming=True)

    def is_good_row(text):
        """ Avoid sentence fragments and empty data. """
        trimmed_text = text.strip()
        if trimmed_text == '':
            return False
        starts_with_uppercase = trimmed_text[0].upper() == trimmed_text[0]
        return starts_with_uppercase

    dataset = dataset.filter(is_good_row, input_columns=['raw_text'])
    dataset = dataset.rename_column('normalized_text', 'transcript')
    dataset = dataset.cast_column('audio', audio_feature)
    return dataset.select_columns(['transcript', 'audio'])

# multilingual_librispeech doesn't have English data
dataset_fr_librispeech = load_librispeech('french')
dataset_fr_voxpopuli = load_voxpopuli('fr')
# Other datasets do
dataset_en_voxpopuli = load_voxpopuli('en')

README.md:   0%|          | 0.00/18.1k [00:00<?, ?B/s]

Resolving data files:   0%|          | 0/48 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/34 [00:00<?, ?it/s]

README.md:   0%|          | 0.00/10.7k [00:00<?, ?B/s]

voxpopuli.py:   0%|          | 0.00/8.84k [00:00<?, ?B/s]

The repository for facebook/voxpopuli contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/facebook/voxpopuli.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y
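
The `is_good_row` filter used in `load_voxpopuli` can be tried on its own. The sketch below copies the function and applies it to a few made-up strings (these are illustrative inputs, not rows from the dataset). Note that a row starting with a digit or punctuation mark also passes, because `upper()` leaves such characters unchanged.

```python
def is_good_row(text: str) -> bool:
    """ Avoid sentence fragments and empty data. """
    trimmed_text = text.strip()
    if trimmed_text == '':
        return False
    # True when the first character is unchanged by upper()
    return trimmed_text[0].upper() == trimmed_text[0]

# Hypothetical example rows (not taken from VoxPopuli)
examples = [
    'Thank you, Mr President.',
    'and therefore we must act',
    '   ',
    '2019 was a difficult year',
]
results = [is_good_row(text) for text in examples]
print(results)
```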


In [6]:
sample_data_fr = next(iter(dataset_fr_librispeech))
print(sample_data_fr)
sample_data_en = next(iter(dataset_en_voxpopuli))
print(sample_data_en)

{'transcript': "pendant le second siècle je fis serment d'ouvrir tous les trésors de la terre à quiconque me mettrait en liberté mais je ne fus pas plus heureux dans le troisième je promis de faire puissant monarque mon libérateur d'être toujours près de lui en esprit", 'audio': {'path': '1406_1028_000000.opus', 'array': array([ 3.16312944e-04,  2.51584337e-04,  1.96699897e-04, ...,
        2.33122555e-04, -8.31385187e-05, -1.13553528e-04]), 'sampling_rate': 16000}}
{'transcript': 'imposition of switching off the life supporting treatment is nothing else than euthanasia.', 'audio': {'path': None, 'array': array([0.00170898, 0.00491333, 0.00326538, ..., 0.00048828, 0.00161743,
       0.00204468]), 'sampling_rate': 16000}}


## Fetch the Joplin models

The Joplin voice typing model releases can be [found on GitHub](https://github.com/joplin/voice-typing-models/releases/). For evaluation purposes, we'll download some of these:

In [7]:
import urllib.request, shutil, tempfile, zipfile, json
from pathlib import Path

def fetch_joplin_model(model_name: str):
    """ Downloads the [model_name] model from GitHub, writes it to ./ggml-models/joplin/, and returns the output path. """
    base_url = "https://github.com/joplin/voice-typing-models/releases/download/v0.2.0/"
    url = base_url + model_name + '.zip'
    output_base_path = Path('./ggml-models/joplin')
    if not output_base_path.exists():
        output_base_path.mkdir(parents=True)

    def extract_model_from_archive(archive):
        paths = archive.namelist()
        model_path = [ path for path in paths if path.endswith('model.bin') ][0]
        config_path = [ path for path in paths if path.endswith('config.json') ][0]
        with archive.open(config_path) as config:
            config = json.loads(config.read())
        if 'shortAudioContext' in config:
            output_path = output_base_path / f'{model_name}.dynamic_ctx.bin'
        else:
            output_path = output_base_path / f'{model_name}.bin'

        with archive.open(model_path) as model_obj:
            with open(output_path, 'wb') as output:
                shutil.copyfileobj(model_obj, output)
        return output_path

    # See https://docs.python.org/3/howto/urllib2.html#fetching-urls
    with tempfile.NamedTemporaryFile() as zipped_file:
        with urllib.request.urlopen(url) as response:
            shutil.copyfileobj(response, zipped_file)
        # Flush buffered writes so the archive is complete when reopened by name
        zipped_file.flush()
        # Extract
        with zipfile.ZipFile(zipped_file.name) as archive:
            output_path = extract_model_from_archive(archive)

    return output_path

joplin_model_paths = [
    fetch_joplin_model('whisper-tiny-q4_0'),
    fetch_joplin_model('whisper-tiny-q8_0'),
    fetch_joplin_model('whisper-base-q4_0'),
    fetch_joplin_model('whisper-base-q8_0'),
    fetch_joplin_model('whisper-small-q5_0'),
    fetch_joplin_model('whisper-small-q8_0'),

    # Fine-tuned French models
    fetch_joplin_model('whisper-base-q8_0.fr'),
    fetch_joplin_model('whisper-small-q8_0.fr'),
]
joplin_model_paths

[PosixPath('ggml-models/joplin/whisper-tiny-q4_0.dynamic_ctx.bin'),
 PosixPath('ggml-models/joplin/whisper-tiny-q8_0.dynamic_ctx.bin'),
 PosixPath('ggml-models/joplin/whisper-base-q4_0.dynamic_ctx.bin'),
 PosixPath('ggml-models/joplin/whisper-base-q8_0.dynamic_ctx.bin'),
 PosixPath('ggml-models/joplin/whisper-small-q5_0.dynamic_ctx.bin'),
 PosixPath('ggml-models/joplin/whisper-small-q8_0.dynamic_ctx.bin'),
 PosixPath('ggml-models/joplin/whisper-base-q8_0.fr.dynamic_ctx.bin'),
 PosixPath('ggml-models/joplin/whisper-small-q8_0.fr.dynamic_ctx.bin')]
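
The output names above follow from the `shortAudioContext` check in `extract_model_from_archive`. As a rough sketch of that rule, the code below builds a small in-memory archive and applies the same check. The archive layout, the config value, and the helper names `output_name_for` and `make_fake_archive` are assumptions made for illustration; only the key lookup mirrors the cell above.

```python
import io
import json
import zipfile

def output_name_for(archive: zipfile.ZipFile, model_name: str) -> str:
    """ Hypothetical helper: applies the same naming rule as the cell above. """
    config_path = next(p for p in archive.namelist() if p.endswith('config.json'))
    with archive.open(config_path) as config_file:
        config = json.loads(config_file.read())
    # Only the presence of the key matters, not its value
    suffix = '.dynamic_ctx.bin' if 'shortAudioContext' in config else '.bin'
    return model_name + suffix

def make_fake_archive(config: dict) -> zipfile.ZipFile:
    """ Builds an in-memory stand-in for a model release archive. """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w') as archive:
        archive.writestr('example/config.json', json.dumps(config))
        archive.writestr('example/model.bin', b'not a real model')
    buffer.seek(0)
    return zipfile.ZipFile(buffer)

dynamic_name = output_name_for(make_fake_archive({'shortAudioContext': True}), 'whisper-tiny-q8_0')
plain_name = output_name_for(make_fake_archive({}), 'whisper-tiny-q8_0')
print(dynamic_name, plain_name)
```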

## Build whisper.cpp

Joplin uses `whisper.cpp` to run Whisper models on Android. This notebook will use the same library.

In [8]:
%%shell
git clone https://github.com/ggerganov/whisper.cpp

cd whisper.cpp
git checkout v1.7.5
# GPU:
# cmake -B build -DGGML_CUDA=1
cmake -B build

Cloning into 'whisper.cpp'...
remote: Enumerating objects: 17367, done.
remote: Counting objects: 100% (444/444), done.
remote: Compressing objects: 100% (198/198), done.
remote: Total 17367 (delta 299), reused 246 (delta 246), pack-reused 16923 (from 2)
Receiving objects: 100% (17367/17367), 20.61 MiB | 19.13 MiB/s, done.
Resolving deltas: 100% (11939/11939), done.
Note: switching to 'v1.7.5'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 51c6961c release : v1.7.5
  Compatibility with CMake < 3.1



Optionally download some of the unmodified `whisper.cpp` models:

In [9]:
%%shell
# Download other upstream models for comparison
mkdir -p ./ggml-models/originals/
#bash ./whisper.cpp/models/download-ggml-model.sh tiny-q5_1 ./ggml-models/originals/
# bash ./whisper.cpp/models/download-ggml-model.sh tiny-q8_0 ./ggml-models/originals/
bash ./whisper.cpp/models/download-ggml-model.sh tiny ./ggml-models/originals/
# bash ./whisper.cpp/models/download-ggml-model.sh base-q8_0 ./ggml-models/originals/
bash ./whisper.cpp/models/download-ggml-model.sh base ./ggml-models/originals/
# bash ./whisper.cpp/models/download-ggml-model.sh small-q8_0 ./ggml-models/originals/
#bash ./whisper.cpp/models/download-ggml-model.sh small ./ggml-models/originals/
#bash ./whisper.cpp/models/download-ggml-model.sh large-v3-turbo-q8_0 ./ggml-models/originals/

Downloading ggml model tiny from 'https://huggingface.co/ggerganov/whisper.cpp' ...
Done! Model 'tiny' saved in './ggml-models/originals//ggml-tiny.bin'
You can now use it like this:

  $ ./build/bin/whisper-cli -m ./ggml-models/originals//ggml-tiny.bin -f samples/jfk.wav

Downloading ggml model base from 'https://huggingface.co/ggerganov/whisper.cpp' ...
Done! Model 'base' saved in './ggml-models/originals//ggml-base.bin'
You can now use it like this:

  $ ./build/bin/whisper-cli -m ./ggml-models/originals//ggml-base.bin -f samples/jfk.wav





### Python wrapper

Next, the `whisper.cpp` library is wrapped with a [ctypes](https://docs.python.org/3/library/ctypes.html)-based wrapper. This simplifies sending audio data to `whisper.cpp` from Python.

In [10]:
from pathlib import Path

whisper_wrapper_dir = Path('./whisper-wrapper').absolute()
whisper_wrapper_dir.mkdir(exist_ok=True)
whisper_source_dir = Path('./whisper.cpp').absolute()

library_content_h = '''
#include "whisper.h"

#ifdef __cplusplus
extern "C" {
#endif

whisper_context * openWhisperContext(char * modelPath);
char * transcribe(whisper_context * context, float * data, int dataLength, int audioCtx, char * language);
void freeWhisperContext(whisper_context * context);

#ifdef __cplusplus
}
#endif
'''

library_content_cxx = '''
#include <sstream>
#include <cstring>
#include <stdlib.h>
#include "whisper_wrapper.h"

extern "C"
whisper_context * openWhisperContext(char * modelPath) {
    whisper_context_params contextParams = whisper_context_default_params();
	return whisper_init_from_file_with_params(modelPath, contextParams);
}

extern "C"
char * transcribe(
    whisper_context * pContext,
    float * data,
    int dataLength,
    int audioCtx,
    char * language
) {
    whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);

    params.print_realtime = false;
	params.print_timestamps = false;
	params.no_timestamps = true;
	params.translate = false;
	params.offset_ms = 0;
	params.single_segment = true;

    params.audio_ctx = audioCtx;
    params.language = language;

    // Resets internal Whisper performance counter information
    whisper_reset_timings(pContext);

    int whisper_result = whisper_full(pContext, params, data, dataLength);
    if (whisper_result != 0) {
        // Error!
        return NULL;
    }

    unsigned int segmentCount = whisper_full_n_segments(pContext);

	std::stringstream results;
	for (int i = 0; i < segmentCount; i++) {
		results << " " << whisper_full_get_segment_text(pContext, i);
	}

    std::string result = results.str();
    // +1 for the null terminator
    int resultCharCount = result.length() + 1;
    char *pResult = static_cast<char *>(malloc(resultCharCount * sizeof(char)));
    strncpy(pResult, result.c_str(), resultCharCount);
	return pResult;
}

extern "C"
void freeWhisperContext(whisper_context * pContext) {
    whisper_free(pContext);
}
'''

library_build_script = f'''
cmake_minimum_required(VERSION 3.22.1)
project("whisper_wrapper")

add_library(${{CMAKE_PROJECT_NAME}} SHARED
    whisper_wrapper.cpp
)

set(WHISPER_LIB_DIR {whisper_source_dir})

# Whisper: See https://stackoverflow.com/a/76290722
add_subdirectory(${{WHISPER_LIB_DIR}} ./whisper)

# Directories for header files
target_include_directories(
	${{CMAKE_PROJECT_NAME}}
	PUBLIC
	${{WHISPER_LIB_DIR}}/include
)


# Specifies libraries CMake should link to your target library. You
# can link libraries from various origins, such as libraries defined in this
# build script, prebuilt third-party libraries, or Android system libraries.
target_link_libraries(${{CMAKE_PROJECT_NAME}}
    PRIVATE whisper
)

'''

(whisper_wrapper_dir / 'CMakeLists.txt').write_text(library_build_script)
(whisper_wrapper_dir / 'whisper_wrapper.cpp').write_text(library_content_cxx)
(whisper_wrapper_dir / 'whisper_wrapper.h').write_text(library_content_h)

307

In [11]:
%%shell
cd whisper-wrapper
# Already inside whisper-wrapper, so the build directory is ./build/
rm -rf ./build/
# GPU: Add -DGGML_CUDA=1 to the build setup command:
# cmake -B build -DGGML_CUDA=1
cmake -B build
cmake --build build --config Release

-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
  Compatibility with CMake < 3.10 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value.  Or, use the <min>...<max> syntax
  to tell CMake that the project requires at least <min> but has been updated
  to work with policies introduced by <max> or earlier.

-- Found Git: /usr/bin/git (found version "2.34.1")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- CMAKE_SYSTEM_PROCES



In [12]:
!find ./whisper-wrapper/ -name 'libwhisper_wrapper.so'
# List symbols in the shared library
!nm -gD ./whisper-wrapper/build/libwhisper_wrapper.so

./whisper-wrapper/build/libwhisper_wrapper.so
                 w __cxa_finalize@GLIBC_2.2.5
00000000000025b0 T freeWhisperContext
                 w __gmon_start__
                 U __gxx_personality_v0@CXXABI_1.3
                 w _ITM_deregisterTMCloneTable
                 w _ITM_registerTMCloneTable
                 U malloc@GLIBC_2.2.5
0000000000002540 T openWhisperContext
                 U __stack_chk_fail@GLIBC_2.4
                 U strlen@GLIBC_2.2.5
                 U strncpy@GLIBC_2.2.5
00000000000025c0 T transcribe
                 U _Unwind_Resume@GCC_3.0
                 U whisper_context_default_params
                 U whisper_free
                 U whisper_full
                 U whisper_full_default_params
                 U whisper_full_get_segment_text
                 U whisper_full_n_segments
                 U whisper_init_from_file_with_params
                 U whisper_reset_timings
                 U _ZdlPvm@CXXABI_1.3.9
                 U _ZNSt6localeC1E
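
The next cell binds the exported functions through ctypes by declaring `argtypes` and `restype` for each one. A minimal sketch of that pattern follows, using the C library's `strlen` instead of the wrapper so that it runs without the build above. It assumes a POSIX system where `ctypes.util.find_library('c')` can locate the C library (or returns `None`, in which case the symbols of the running process are used).

```python
import ctypes
import ctypes.util

# Assumption: POSIX system. find_library may return None, in which case
# CDLL(None) exposes the symbols already loaded into the process.
libc = ctypes.CDLL(ctypes.util.find_library('c'))

# Declare the signature: size_t strlen(const char *)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

length = libc.strlen(b'whisper')
print(length)
```

Declaring the types up front lets ctypes convert arguments and return values correctly, which matters most for pointer-sized values such as the context pointer returned by `openWhisperContext`.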

In [13]:
# https://github.com/ggerganov/whisper.cpp/issues/9#issuecomment-1272555209
import ctypes
from pathlib import Path
import numpy as np

libpath = Path('./whisper-wrapper/build/libwhisper_wrapper.so').absolute()
whisper = ctypes.CDLL(str(libpath))


whisper.transcribe.restype = ctypes.c_char_p
whisper.transcribe.argtypes = [
    ctypes.c_void_p, # Context pointer
    ctypes.POINTER(ctypes.c_float), # Data
    ctypes.c_int, # Data count
    ctypes.c_int, # Audio context
    ctypes.c_char_p, # Language
]

# whisper_init_from_file_with_params(path, params)->whisper_context
whisper.openWhisperContext.restype = ctypes.c_void_p
whisper.openWhisperContext.argtypes = [ctypes.c_char_p]

class WhisperCppModel:
    """ A wrapper around a `whisper.cpp` model. """
    def __init__(self, path: Path, language_code: str = 'fr', dynamic_context: bool | None = None):
        self.path = path
        path_string = str(path.absolute())
        path_bytes = ctypes.c_char_p(path_string.encode('utf-8'))
        self.ctx = whisper.openWhisperContext(path_bytes)
        self.language = language_code
        # dynamic_context: Should be true for whisper-acft-style models
        if dynamic_context is not None:
            self.dynamic_context = dynamic_context
        else:
            self.dynamic_context = path_string.endswith('.dynamic_ctx.bin')

    def transcribe(self, data: np.ndarray):
        audio_ctx = 0
        if self.dynamic_context:
            # See https://github.com/futo-org/whisper-acft/issues/6
            sample_rate = 16_000
            duration_seconds = len(data) / sample_rate
            ctx_units_per_second = 1500 / 30 # 30 seconds = 1500 units
            audio_ctx = int(ctx_units_per_second * duration_seconds + 64)
            # audio_ctx can't be longer than 30 seconds
            audio_ctx = min(1500, audio_ctx)

        float_data = data.astype('float32')
        # See https://stackoverflow.com/a/3671889 and the example Whisper.cpp
        # ctypes usage linked above
        raw_data = float_data.ctypes.data_as(ctypes.POINTER(ctypes.c_float))
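        result = whisper.transcribe(self.ctx, raw_data, len(float_data), audio_ctx, self.language.encode('utf-8'))
        if result is None:
            # The wrapper returns NULL when whisper_full reports an error
            raise RuntimeError('Transcription failed')
        return result.decode('utf-8', 'replace')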
    def free(self):
        if self.ctx is None:
            raise Exception('Already freed!')
        whisper.freeWhisperContext(ctypes.c_void_p(self.ctx))
        self.ctx = None
    def __enter__(self):
        return self
    def __exit__(self, exc_type, exc_value, traceback):
        self.free()
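
To make the dynamic audio context calculation in `transcribe` concrete, here is the same arithmetic as a standalone function. The name `compute_audio_ctx` is introduced only for this illustration; the constants are the ones used in the class above.

```python
def compute_audio_ctx(sample_count: int, sample_rate: int = 16_000) -> int:
    """ Hypothetical helper mirroring the audio_ctx logic in WhisperCppModel.transcribe. """
    duration_seconds = sample_count / sample_rate
    ctx_units_per_second = 1500 / 30  # 30 seconds = 1500 units
    audio_ctx = int(ctx_units_per_second * duration_seconds + 64)
    # audio_ctx can't be longer than 30 seconds
    return min(1500, audio_ctx)

ten_seconds = compute_audio_ctx(10 * 16_000)     # 50 * 10 + 64 = 564
thirty_seconds = compute_audio_ctx(30 * 16_000)  # 1564, clamped to 1500
print(ten_seconds, thirty_seconds)
```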

Let's test it. Start by making a list of all models that we can test with:

In [14]:
#tiny_q5_1_model_path = Path('./ggml-models/originals/ggml-tiny-q5_1.bin')
#tiny_q8_0_model_path = Path('./ggml-models/originals/ggml-tiny-q8_0.bin')
tiny_model_path = Path('./ggml-models/originals/ggml-tiny.bin')
#base_q8_0_model_path = Path('./ggml-models/originals/ggml-base-q8_0.bin')
base_model_path = Path('./ggml-models/originals/ggml-base.bin')
#small_q8_0_model_path = Path('./ggml-models/originals/ggml-small-q8_0.bin')
#small_model_path = Path('./ggml-models/originals/ggml-small.bin')
#turbo_model_path = Path('./ggml-models/originals/ggml-large-v3-turbo-q8_0.bin')

# Add a few upstream model paths to the ones downloaded from the Joplin release.
model_paths = joplin_model_paths + [
#    tiny_q5_1_model_path,
#    tiny_q8_0_model_path,
    tiny_model_path,
#    base_q8_0_model_path,
    base_model_path,
#    small_q8_0_model_path, # Exclude for now: the small models take a long time to run, so just keep one.
#    small_model_path,
#    turbo_model_path,
]


## Preparing an evaluation function

For the purposes of this evaluation, punctuation and capitalization are ignored while computing character and word error rates. This will be done using `normalize_text`:

In [23]:
import re, unicodedata

def remove_combining(text: str):
    # Moves accents to separate chars, then removes combining characters.
    # See https://stackoverflow.com/a/517974
    text = unicodedata.normalize('NFKD', text) # e.g. é -> ´ + e
    # unicodedata.combining returns 0 if c is not a combining character:
    return ''.join([ c for c in text if unicodedata.combining(c) == 0 ])

punctuation_regex = re.compile(r'[.,?:!";]')
space_regex = re.compile(r'\s+')

def normalize_text(text: str):
    """ Deduplicates spaces, removes accents, and otherwise normalizes [text].
    """
    text = text.strip()
    text = text.replace('ʼ', '\'')
    # Some of the Joplin French models represent œ as [oe]
    text = text.replace('[oe]', 'oe')
    text = remove_combining(text)
    text = text.lower()
    text = punctuation_regex.sub(' ', text)
    text = text.replace('—', ' ') # Remove dashes (but not -s)
    # Deduplicate spaces last, so spaces introduced above are also collapsed
    text = space_regex.sub(' ', text)
    return text.strip()

print('Normalized', normalize_text('Test: öéúçʼ '))

Normalized test oeuc'


`normalize_text` should cause, for example, `Tést! Testing.` and `Test, testing.` to be considered the same (rather than completely different).
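
This can be checked with a condensed stand-in for `normalize_text`. The function below, `sketch_normalize`, is a shortened copy written for this example (it omits the apostrophe, `[oe]`, and dash handling), so it is a sketch rather than the function used in the evaluation.

```python
import re
import unicodedata

def sketch_normalize(text: str) -> str:
    """ Condensed stand-in for normalize_text (illustration only). """
    # Split accents into combining characters, then drop them
    text = unicodedata.normalize('NFKD', text.strip())
    text = ''.join(c for c in text if unicodedata.combining(c) == 0)
    # Lowercase, replace punctuation with spaces, then deduplicate spaces
    text = re.sub(r'[.,?:!";]', ' ', text.lower())
    return re.sub(r'\s+', ' ', text).strip()

first = sketch_normalize('Tést! Testing.')
second = sketch_normalize('Test, testing.')
print(repr(first), repr(second))
```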

In [24]:
import evaluate, time

wer_metric = evaluate.load('wer')
cer_metric = evaluate.load('cer')

def get_word_error_rate(prediction: str, reference: str):
    return wer_metric.compute(
        predictions=[prediction], references=[reference]
    )
def get_char_error_rate(prediction: str, reference: str):
    return cer_metric.compute(
        predictions=[prediction], references=[reference]
    )

def compute_metrics(model: WhisperCppModel, data):
    """ Computes the word error rate (WER), character error rate (CER), and
        transcription time to transcribe the given sample [data] using
        [model].
        [data] should be a row of data from the test dataset.
    """
    audio = data['audio']['array']
    true_text = normalize_text(data['transcript'])
    if true_text == '':
        # Skip?
        return None

    prediction_start_time = time.monotonic()
    raw_prediction = model.transcribe(audio)
    prediction_end_time = time.monotonic()
    predicted_text = normalize_text(raw_prediction)

    return {
        'wer': get_word_error_rate(predicted_text, true_text),
        'cer': get_char_error_rate(predicted_text, true_text),
        'duration': prediction_end_time - prediction_start_time
    }


### Trying the evaluation function

Next, transcribe the same sample with each model and log the output:

In [25]:
import time
from tqdm import tqdm
from IPython.display import Audio as AudioDisplay, display
import pandas as pd


def run_models_on_sample(sample, lang: str):
    """ [lang] should be the language of the [sample] """
    transcription_times = []
    transcriptions = []
    word_error_rates = []
    character_error_rates = []
    true_label = sample['transcript']

    for path in tqdm(model_paths):
        with WhisperCppModel(path, lang) as test_model:
            # Take both timestamps within the `with` to avoid including model loading
            # and freeing in the transcription time.
            start_time = time.monotonic()
            result = test_model.transcribe(sample['audio']['array'])
            end_time = time.monotonic()
        transcribe_time = end_time - start_time

        transcription_times.append(transcribe_time)
        transcriptions.append(result)
        word_error_rates.append(get_word_error_rate(
            prediction=normalize_text(result),
            reference=normalize_text(true_label),
        ) * 100) # Convert to percent
        character_error_rates.append(get_char_error_rate(
            prediction=normalize_text(result),
            reference=normalize_text(true_label),
        ) * 100)

    print()

    display(pd.DataFrame({
        'Path': [ path.name for path in model_paths ],
        'Recognized': transcriptions,
        'Time (s)': transcription_times,
        'WER (%)': word_error_rates,
        'CER (%)': character_error_rates,
    }))
    display(AudioDisplay(sample['audio']['array'], rate=sample['audio']['sampling_rate']))

    print()
    print('True label:')
    print(' ', true_label)
    print()
    print('All recognized text (trimmed):')
    for transcription in transcriptions:
        print(' ', transcription.strip())

run_models_on_sample(sample_data_fr, 'fr')

100%|██████████| 10/10 [01:15<00:00,  7.56s/it]







| | Path | Recognized | Time (s) | WER (%) | CER (%) |
|---|---|---|---|---|---|
| 0 | whisper-tiny-q4_0.dynamic_ctx.bin | Pendant le second siècle, je suis sermendu v... | 1.470247 | 28.888889 | 16.269841 |
| 1 | whisper-tiny-q8_0.dynamic_ctx.bin | Pendant le second siècle, je fissèrement d'o... | 2.500531 | 22.222222 | 11.507937 |
| 2 | whisper-base-q4_0.dynamic_ctx.bin | Pas mal de second siècle, je fissèrement d'o... | 3.654042 | 31.111111 | 10.714286 |
| 3 | whisper-base-q8_0.dynamic_ctx.bin | Pendant le second siècle, je fissèrement d'o... | 3.654287 | 24.444444 | 10.714286 |
| 4 | whisper-small-q5_0.dynamic_ctx.bin | Pendant le second siècle, je fis serment d'o... | 17.49306 | 13.333333 | 5.555556 |
| 5 | whisper-small-q8_0.dynamic_ctx.bin | Pendant le second siècle, je fis serment d'o... | 14.750956 | 11.111111 | 4.761905 |
| 6 | whisper-base-q8_0.fr.dynamic_ctx.bin | Pendant le second siècle, je fis serment d'ou... | 3.688046 | 17.777778 | 5.952381 |
| 7 | whisper-small-q8_0.fr.dynamic_ctx.bin | Pendant le second siècle, je fis serment d'ou... | 13.580944 | 11.111111 | 4.761905 |
| 8 | ggml-tiny.bin | Pendant le second siècle, je fissèrement d'o... | 3.798114 | 22.222222 | 8.730159 |
| 9 | ggml-base.bin | Pendant le second siècle, je fissèrement d'o... | 9.332302 | 24.444444 | 11.111111 |



True label:
  pendant le second siècle je fis serment d'ouvrir tous les trésors de la terre à quiconque me mettrait en liberté mais je ne fus pas plus heureux dans le troisième je promis de faire puissant monarque mon libérateur d'être toujours près de lui en esprit

All recognized text (trimmed):
  Pendant le second siècle, je suis sermendu vers tous les trésors de la terre à qui concle mettrait en liberté. Mais je ne suis pas plus heureux. Dans notre troisième, je promis de faire 800 monarchs pour les berateurs d'être toujours très de lui en esprit.
  Pendant le second siècle, je fissèrement d'ouvrir tous les trésors de la Terre, à qui compte ne m'étrer en liberté. Mais je ne suis pas plus heureux. Dans notre troisième, je promis de faire 800 monarchs mon libérateur, d'être toujours près de lui en esprit.
  Pas mal de second siècle, je fissèrement d'ouvrir tous les trésors de la Terre à qui compte mes maitres et en liberté. Mais je ne suis pas plus heureux. Dans le troisième, je pro

Above, the "Time (s)" column contains the time in seconds that `whisper.cpp` took to transcribe the sample audio. "WER" is the [word error rate](https://huggingface.co/spaces/evaluate-metric/wer) of the transcription relative to the true label and "CER" is the [character error rate](https://huggingface.co/spaces/evaluate-metric/cer). Lower error rates are better.
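
For intuition, both error rates can be sketched as an edit distance divided by the reference length: counted over words for WER and over characters for CER. The code below is a plain-Python illustration of that standard definition, not the `evaluate`/`jiwer` implementation used in this notebook; the function names and the example strings are made up for this sketch.

```python
def edit_distance(reference, hypothesis) -> int:
    """ Levenshtein distance between two sequences. """
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1]

def sketch_wer(prediction: str, reference: str) -> float:
    reference_words = reference.split()
    return edit_distance(reference_words, prediction.split()) / len(reference_words)

def sketch_cer(prediction: str, reference: str) -> float:
    return edit_distance(reference, prediction) / len(reference)

reference = 'elle ne fut pas longtemps'
prediction = 'elle ne fut pas longtemp'
# One wrong word out of five, one missing character out of 25
print(sketch_wer(prediction, reference), sketch_cer(prediction, reference))
```

A single wrong character makes the whole word count as an error for WER, which is why WER is usually much higher than CER in the tables here.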

#### Additional examples

For comparison, run the models on a few more samples:

In [26]:
dataset_iterator_fr = iter(dataset_fr_librispeech.shuffle(seed=1235).take(3))
for sample in dataset_iterator_fr:
    run_models_on_sample(sample, 'fr')

100%|██████████| 10/10 [00:59<00:00,  5.95s/it]







| | Path | Recognized | Time (s) | WER (%) | CER (%) |
|---|---|---|---|---|---|
| 0 | whisper-tiny-q4_0.dynamic_ctx.bin | Il n'est fut pas longtemps à le trouver qui ... | 1.264121 | 33.333333 | 12.068966 |
| 1 | whisper-tiny-q8_0.dynamic_ctx.bin | Elle ne fut pas longtemps à le trouver qui d... | 1.294127 | 15.151515 | 5.747126 |
| 2 | whisper-base-q4_0.dynamic_ctx.bin | elle ne fut pas longtemps à le trouver qui d... | 3.612273 | 15.151515 | 5.172414 |
| 3 | whisper-base-q8_0.dynamic_ctx.bin | Elle ne fut pas longtemps à le trouver qui d... | 2.626268 | 12.121212 | 5.172414 |
| 4 | whisper-small-q5_0.dynamic_ctx.bin | Elle ne fut pas longtemps à le trouver, qui ... | 13.176771 | 3.030303 | 0.574713 |
| 5 | whisper-small-q8_0.dynamic_ctx.bin | Elle ne fut pas longtemps à le trouver, qui ... | 10.495454 | 3.030303 | 0.574713 |
| 6 | whisper-base-q8_0.fr.dynamic_ctx.bin | Elle ne fut pas longtemps à le trouver, qui d... | 2.644859 | 9.090909 | 5.172414 |
| 7 | whisper-small-q8_0.fr.dynamic_ctx.bin | Elle ne fut pas longtemps à le trouver, qui d... | 10.355901 | 3.030303 | 0.574713 |
| 8 | ggml-tiny.bin | Elle ne fut pas longtemps à le trouver qui d... | 3.52827 | 9.090909 | 2.873563 |
| 9 | ggml-base.bin | Elle ne fut pas longtemps à le trouver qui d... | 8.959092 | 12.121212 | 5.172414 |



True label:
  elle ne fut pas longtemps à le trouver qui dormait sur ses deux oreilles derrière le puits et qui ronflait de toutes ses forces attends brigand dit la mère chèvre tu vas voir

All recognized text (trimmed):
  Il n'est fut pas longtemps à le trouver qui dormait sur ces deux horats et derrière le puit et qui ont fait de toutes ces forces. Attends, brigant, dit la mer chèvre, tu vas voir.
  Elle ne fut pas longtemps à le trouver qui dormait sur ces deux oreilles derrière le pui et qui ont fait de toutes ses forces. Attends, brillant, dit la mère chevre, tu vas voir.
  elle ne fut pas longtemps à le trouver qui dormait sur ses deux oreilles derrière le puits et qui ronflait de toutes ses forces. Attends, brigant, d'y la mer chèvre, qu'il va voir.
  Elle ne fut pas longtemps à le trouver qui dormait sur ses deux oreilles derrière le puits et qui auront flé de toutes ses forces. "Attend, brigant, dit la mère chèvre, tu vas voir."
  Elle ne fut pas longtemps à le trouver, qui d

100%|██████████| 10/10 [01:19<00:00,  7.90s/it]







| | Path | Recognized | Time (s) | WER (%) | CER (%) |
|---|---|---|---|---|---|
| 0 | whisper-tiny-q4_0.dynamic_ctx.bin | Ben Mosellnach a collectionné tous les texte... | 1.576818 | 32.758621 | 16.0 |
| 1 | whisper-tiny-q8_0.dynamic_ctx.bin | Mène Mosellnach à collectionner tous les tex... | 2.012618 | 27.586207 | 14.0 |
| 2 | whisper-base-q4_0.dynamic_ctx.bin | Manozelna, chaque collection, est tous les t... | 3.057073 | 18.965517 | 8.333333 |
| 3 | whisper-base-q8_0.dynamic_ctx.bin | Manmosel Nash a collectionné tous les textes... | 5.319859 | 12.068966 | 6.0 |
| 4 | whisper-small-q5_0.dynamic_ctx.bin | Mademoiselle Nash a collectionné tous les te... | 17.987414 | 12.068966 | 6.0 |
| 5 | whisper-small-q8_0.dynamic_ctx.bin | Mademoiselle Nash a collectionné tous les te... | 14.210053 | 10.344828 | 5.666667 |
| 6 | whisper-base-q8_0.fr.dynamic_ctx.bin | Mademoiselle Nash a collectionné tous les tex... | 5.523364 | 22.413793 | 14.0 |
| 7 | whisper-small-q8_0.fr.dynamic_ctx.bin | Mlle.Nash a collectionné tous les textes de s... | 14.435498 | 8.62069 | 3.0 |
| 8 | ggml-tiny.bin | Mène Mosellnach à collectionner tous les tex... | 4.002108 | 20.689655 | 11.333333 |
| 9 | ggml-base.bin | Manmosel Nash a collectionné tous les textes... | 9.45018 | 12.068966 | 6.0 |



True label:
  mlle nash a collectionné tous les textes de ses sermons depuis qu'il est arrivé à hartfield je me souviens de la première fois que je l'ai vu comme je me doutais peu à ce moment-là de ce qui arriverait les deux abotts et moi nous avions couru dans le salon pour le regarder passer à travers le rideau

All recognized text (trimmed):
  Ben Mosellnach a collectionné tous les textes ces armons depuis qu'il a été arrivé à Routtefield. Je me souviens de la première fois que je les vu comme je me douteais peu ce moment-là de ce qui arriverait les deux abottes et mois nous avions courué en silence pour le regarder passer à travers leur idône.
  Mène Mosellnach à collectionner tous les textes ces armons depuis qu'il est arrivé à hautefilde. Je me souviens de la première fois que je l'ai vu. Comme je me douteais peu, c'est bon, moi-là, de ce qu'il y arriverait. Les deux à boite et moi, nous avions couru dans le salon pour le regarder passer à travers le rido.
  Manozelna, chaque co

100%|██████████| 10/10 [01:33<00:00,  9.38s/it]







| | Path | Recognized | Time (s) | WER (%) | CER (%) |
|---|---|---|---|---|---|
| 0 | whisper-tiny-q4_0.dynamic_ctx.bin | Et je n'ai pas le droit de la livre public m... | 1.977402 | 31.481481 | 14.589666 |
| 1 | whisper-tiny-q8_0.dynamic_ctx.bin | Et je n'ai pas le droit de la livraire publi... | 2.429536 | 31.481481 | 15.197568 |
| 2 | whisper-base-q4_0.dynamic_ctx.bin | et je n'ai pas le droit de la livraubublie, ... | 5.462757 | 24.074074 | 10.942249 |
| 3 | whisper-base-q8_0.dynamic_ctx.bin | Et je n'ai pas le droit de la livrape publiq... | 4.835024 | 25.925926 | 10.334347 |
| 4 | whisper-small-q5_0.dynamic_ctx.bin | — Et je n'ai pas le droit de la livrer au pu... | 23.053261 | 7.407407 | 2.431611 |
| 5 | whisper-small-q8_0.dynamic_ctx.bin | — Et je n'ai pas le droit de la livrer au pu... | 17.330055 | 5.555556 | 1.823708 |
| 6 | whisper-base-q8_0.fr.dynamic_ctx.bin | Je n'ai pas le droit de la démocratie publiqu... | 6.257263 | 20.37037 | 7.902736 |
| 7 | whisper-small-q8_0.fr.dynamic_ctx.bin | Et je n'ai pas le droit de la livrer au publi... | 17.205817 | 5.555556 | 1.823708 |
| 8 | ggml-tiny.bin | Et je n'ai pas le droit de la livraire publi... | 4.055459 | 31.481481 | 14.589666 |
| 9 | ggml-base.bin | Et je n'ai pas le droit de la livrape publiq... | 9.717048 | 22.222222 | 9.422492 |



True label:
  et je n'ai pas le droit de la livrer au public mais peut-être ne vous déplaira-t-il pas d'en prendre connaissance ce discours s'adressait plus particulièrement à emma qui ne s'en étonna pas elle comprenait que dans cette circonstance décisive m elton préférât éviter le regard d'harriet il prit congé au bout de quelques instants

All recognized text (trimmed):
  Et je n'ai pas le droit de la livre public mais peut-être nous vous déplacerà-t-il pas à d'en prendre connaissance? Ce discours s'adresser plus particulièrement a éma qui ne s'en étant n'appara. Elle comprenait que dans cette circonstance décisive monsieur Alton préférera éviter le regard arrrière. Il préconjait au bout de quelque instant.
  Et je n'ai pas le droit de la livraire publique, mais peut-être nous vous déprératez-le pas à d'en prendre connaissance ? Ce discours s'adresser plus particulièrement à Emma, qui ne s'en est en a pas, elle comprenait que dans cette circonstance décisive, monsieur Alton préfera

**Notes**:
- In Joplin (and also before computing word/character error rates below), the `[oe]`s (if any) would be replaced with `œ`s.
- `whisper-small-q5_0` is the largest of the models tested above and is hypothesized to be the most accurate. `whisper-tiny-q4_0` is the smallest of the models tested and hypothesized to be the least accurate.
- `whisper-base-q8_0.fr` is a version of the base model that has been fine-tuned on French. It is predicted to perform better than the default `whisper-base` models on French, but worse than `whisper-small-q5_0`.
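
The first note above can be illustrated with a minimal sketch. The helper name `normalize_ligatures` is hypothetical, and it assumes the models emit the literal string `[oe]` where `œ` is expected; the notebook's actual normalization code may differ.

```python
def normalize_ligatures(text: str) -> str:
    """Replace the literal placeholder '[oe]' with the ligature 'œ'.

    Hypothetical helper: the name and exact behavior are assumptions,
    based only on the note about '[oe]' being replaced with 'œ'.
    """
    return text.replace('[oe]', 'œ')


print(normalize_ligatures("un [oe]uf et une s[oe]ur"))
```

Applying the same replacement to every recognized string before scoring keeps a placeholder from being counted as a word or character error.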

For English:

In [27]:
run_models_on_sample(sample_data_en, 'en')

100%|██████████| 10/10 [00:34<00:00,  3.43s/it]







Unnamed: 0,Path,Recognized,Time (s),WER (%),CER (%)
0,whisper-tiny-q4_0.dynamic_ctx.bin,In position of switching of the life support...,0.529795,30.769231,5.617978
1,whisper-tiny-q8_0.dynamic_ctx.bin,In position of switching of the life-support...,0.576192,46.153846,6.741573
2,whisper-base-q4_0.dynamic_ctx.bin,in position of switching of the life-support...,0.837429,46.153846,6.741573
3,whisper-base-q8_0.dynamic_ctx.bin,In position of switching of the life-support...,1.248175,46.153846,6.741573
4,whisper-small-q5_0.dynamic_ctx.bin,In position of switching of the life-support...,6.910414,38.461538,4.494382
5,whisper-small-q8_0.dynamic_ctx.bin,Imposition of switching of the life-supporti...,4.127739,23.076923,2.247191
6,whisper-base-q8_0.fr.dynamic_ctx.bin,In position au suite gynop de l'iPhone suppor...,1.266594,100.0,52.808989
7,whisper-small-q8_0.fr.dynamic_ctx.bin,Imposition of switching of the life-supporti...,5.589011,23.076923,2.247191
8,ggml-tiny.bin,In position of switching of the life-support...,3.310286,46.153846,6.741573
9,ggml-base.bin,In position of switching of the life support...,8.472915,30.769231,5.617978



True label:
  imposition of switching off the life supporting treatment is nothing else than euthanasia.

All recognized text (trimmed):
  In position of switching of the life supporting treatment is nothing else than utanasia.
  In position of switching of the life-supporting treatment is nothing else than utanasia.
  in position of switching of the life-supporting treatment is nothing else than utanasia.
  In position of switching of the life-supporting treatment is nothing else than utanasia.
  In position of switching of the life-supporting treatment is nothing else than euthanasia.
  Imposition of switching of the life-supporting treatment is nothing else than euthanasia.
  In position au suite gynop de l'iPhone supporting turc, en disant Ossingels, de l'iPhone.
  Imposition of switching of the life-supporting treatment is nothing else than euthanasia.
  In position of switching of the life-supporting treatment is nothing else than utanasia.
  In position of switching of the life

### Creating the test set & evaluation loop

We're running the models on the CPU, which can be slow. To speed up testing, select only a small subset of the test set:

In [28]:
test_set_size = 128
def make_test_set(source_dataset):
    return source_dataset.take(test_set_size), test_set_size

In [29]:
from tqdm import tqdm

def evaluate_model(model: WhisperCppModel, test_dataset, test_dataset_size: int):
    """ Returns error rates and average evaluation time resulting from evaluating
        [model] on [test_dataset].
    """
    wer = 0
    cer = 0
    total_time = 0
    total = 0

    iterator = tqdm(enumerate(test_dataset), total=test_dataset_size)
    for idx,sample in iterator:
        metrics = compute_metrics(model, sample)
        if metrics:
            total += 1
            wer += metrics['wer']
            cer += metrics['cer']
            total_time += metrics['duration']
            iterator.set_postfix({ 'model': model.path.name, 'wer': wer / total * 100 })
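    iterator.close()
    if total == 0:
        # compute_metrics returned nothing for every sample; avoid dividing by zero.
        raise ValueError(f'No samples could be evaluated for {model.path.name}')
    wer /= total
    cer /= total
    time_per_transcription = total_time / total
    return { 'wer': wer, 'cer': cer, 'avg_time': time_per_transcription }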
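
Note that `evaluate_model` averages the per-sample error rates (assuming `compute_metrics` returns one rate per sample), so a short sample counts as much as a long one. The sketch below shows how this can differ from a pooled, corpus-level rate. The function names and the error counts are made up for illustration and are not part of the notebook.

```python
def mean_of_sample_rates(samples):
    """Average of per-sample error rates (the style of averaging used above)."""
    return sum(errors / ref_words for errors, ref_words in samples) / len(samples)


def pooled_rate(samples):
    """Total errors divided by total reference words across all samples."""
    total_errors = sum(errors for errors, _ in samples)
    total_words = sum(ref_words for _, ref_words in samples)
    return total_errors / total_words


# Each tuple is (errors, reference word count); illustrative numbers only.
samples = [(1, 2), (2, 40)]
print(mean_of_sample_rates(samples))  # (0.5 + 0.05) / 2
print(pooled_rate(samples))           # 3 / 42
```

With these numbers, the short sample dominates the mean of rates, while the pooled rate is much lower. Keep this in mind when comparing the WERs below against figures reported elsewhere.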



## Results

<!-- Anchor for linking to -->
<a id="results"></a>

Next, let's run the evaluation:

In [30]:
from IPython.display import HTML
def evaluate_all(base_dataset, model_paths: list[Path], language_code: str):
    model_wers = []
    model_cers = []
    model_avg_times = []
    model_names = []

    dataset,dataset_size = make_test_set(base_dataset)
    for path in model_paths:
        with WhisperCppModel(path, language_code) as test_model:
            results = evaluate_model(test_model, dataset, dataset_size)
            model_names.append(test_model.path.name)
        model_wers.append(results['wer'] * 100) # Convert to %
        model_cers.append(results['cer'] * 100)
        model_avg_times.append(results['avg_time'])

    return pd.DataFrame({
        'Path': model_names,
        'WER (%)': model_wers,
        'CER (%)': model_cers,
        'avg_time (s)': model_avg_times,
    })

def print_results(label_html, results):
    display(HTML(f'<h4>{label_html}</h4>'))
    display(results)

print_results('French (Librispeech)', evaluate_all(dataset_fr_librispeech, model_paths, 'fr'))

print()
print('---')

print()
print('---')

print_results('English (Voxpopuli)', evaluate_all(dataset_en_voxpopuli, model_paths, 'en'))


100%|██████████| 128/128 [03:40<00:00,  1.73s/it, model=whisper-tiny-q4_0.dynamic_ctx.bin, wer=44.9]
100%|██████████| 128/128 [03:59<00:00,  1.87s/it, model=whisper-tiny-q8_0.dynamic_ctx.bin, wer=38.6]
100%|██████████| 128/128 [06:25<00:00,  3.01s/it, model=whisper-base-q4_0.dynamic_ctx.bin, wer=31.1]
100%|██████████| 128/128 [08:03<00:00,  3.78s/it, model=whisper-base-q8_0.dynamic_ctx.bin, wer=27.4]
100%|██████████| 128/128 [33:20<00:00, 15.63s/it, model=whisper-small-q5_0.dynamic_ctx.bin, wer=16.5]
100%|██████████| 128/128 [25:38<00:00, 12.02s/it, model=whisper-small-q8_0.dynamic_ctx.bin, wer=15.9]
100%|██████████| 128/128 [07:59<00:00,  3.74s/it, model=whisper-base-q8_0.fr.dynamic_ctx.bin, wer=20.7]
100%|██████████| 128/128 [25:40<00:00, 12.04s/it, model=whisper-small-q8_0.fr.dynamic_ctx.bin, wer=15.1]
100%|██████████| 128/128 [08:59<00:00,  4.21s/it, model=ggml-tiny.bin, wer=37.3]
100%|██████████| 128/128 [19:03<00:00,  8.94s/it, model=ggml-base.bin, wer=26.4]


Unnamed: 0,Path,WER (%),CER (%),avg_time (s)
0,whisper-tiny-q4_0.dynamic_ctx.bin,44.90118,21.349186,1.639813
1,whisper-tiny-q8_0.dynamic_ctx.bin,38.553923,17.34801,1.782242
2,whisper-base-q4_0.dynamic_ctx.bin,31.075379,14.176596,2.926584
3,whisper-base-q8_0.dynamic_ctx.bin,27.363593,11.811963,3.692116
4,whisper-small-q5_0.dynamic_ctx.bin,16.54155,6.362238,15.541416
5,whisper-small-q8_0.dynamic_ctx.bin,15.915518,6.108343,11.938813
6,whisper-base-q8_0.fr.dynamic_ctx.bin,20.725669,8.496503,3.65578
7,whisper-small-q8_0.fr.dynamic_ctx.bin,15.120342,5.560538,11.954452
8,ggml-tiny.bin,37.349238,15.454571,4.128008
9,ggml-base.bin,26.365507,10.653699,8.853954



---

---


100%|██████████| 128/128 [01:43<00:00,  1.24it/s, model=whisper-tiny-q4_0.dynamic_ctx.bin, wer=17]
100%|██████████| 128/128 [02:10<00:00,  1.02s/it, model=whisper-tiny-q8_0.dynamic_ctx.bin, wer=15.2]
100%|██████████| 128/128 [03:24<00:00,  1.60s/it, model=whisper-base-q4_0.dynamic_ctx.bin, wer=13.2]
100%|██████████| 128/128 [04:23<00:00,  2.06s/it, model=whisper-base-q8_0.dynamic_ctx.bin, wer=12.3]
100%|██████████| 128/128 [19:41<00:00,  9.23s/it, model=whisper-small-q5_0.dynamic_ctx.bin, wer=10.7]
100%|██████████| 128/128 [14:47<00:00,  6.93s/it, model=whisper-small-q8_0.dynamic_ctx.bin, wer=10.4]
100%|██████████| 128/128 [07:50<00:00,  3.68s/it, model=whisper-base-q8_0.fr.dynamic_ctx.bin, wer=124]
100%|██████████| 128/128 [14:48<00:00,  6.94s/it, model=whisper-small-q8_0.fr.dynamic_ctx.bin, wer=10.1]
100%|██████████| 128/128 [07:52<00:00,  3.69s/it, model=ggml-tiny.bin, wer=15.1]
100%|██████████| 128/128 [17:11<00:00,  8.06s/it, model=ggml-base.bin, wer=11.4]


Unnamed: 0,Path,WER (%),CER (%),avg_time (s)
0,whisper-tiny-q4_0.dynamic_ctx.bin,16.978301,9.161407,0.755053
1,whisper-tiny-q8_0.dynamic_ctx.bin,15.232603,8.093447,0.971793
2,whisper-base-q4_0.dynamic_ctx.bin,13.18516,7.503358,1.54916
3,whisper-base-q8_0.dynamic_ctx.bin,12.270667,6.730349,2.009279
4,whisper-small-q5_0.dynamic_ctx.bin,10.697131,6.208771,9.175423
5,whisper-small-q8_0.dynamic_ctx.bin,10.397441,6.046165,6.881213
6,whisper-base-q8_0.fr.dynamic_ctx.bin,123.537591,85.850268,3.626028
7,whisper-small-q8_0.fr.dynamic_ctx.bin,10.083281,5.743634,6.895328
8,ggml-tiny.bin,15.06255,8.123391,3.637139
9,ggml-base.bin,11.369856,6.466883,8.006273


In the "English (Voxpopuli)" table, the finetuned `whisper-base-q8_0.fr` model has a very high error rate (a WER of about 124%), while the finetuned `whisper-small-q8_0.fr` model does not (a WER of about 10.1%). The finetuning process uses French-language training data, so a loss of English accuracy is expected. Similar English accuracy decreases are also present for other fine-tuning attempts. [See the previous model evaluation results for details](https://gist.github.com/personalizedrefrigerator/3413146963e6a1635cc61b889bdb2329).
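
A WER above 100% (as for `whisper-base-q8_0.fr` in the English table) is possible because inserted words count as errors while the denominator is the number of reference words. The sketch below implements the standard word-level edit-distance definition. It is not the notebook's implementation (the notebook installs `jiwer` for this metric), and the example hypothesis is made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count.

    Assumes a non-empty reference.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    # previous[j] = edit distance between the reference prefix seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "nothing else than euthanasia"
# Made-up output from a model that answers in French and adds extra words.
hypothesis = "de rien d'autre que de l'euthanasie en fait"
print(word_error_rate(reference, hypothesis))
```

Here the hypothesis has eight words and none match the four reference words, so the distance is eight edits (four substitutions plus four insertions) and the WER is 200%.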

Based on the results, the default models perform better on English than on French. For example, the default `whisper-base` model (`ggml-base.bin`) has a WER of 11.4% on the Voxpopuli-English dataset, but WERs of 26.4% and 22.8% on the Librispeech-French and Voxpopuli-French datasets, respectively.

Interestingly, the quantized (`-q8_0`) models have nearly the same word error rates as the non-quantized versions. Although this can be seen better in the [GPU evaluation notebook](https://gist.github.com/personalizedrefrigerator/3413146963e6a1635cc61b889bdb2329#file-whisper_eval_gpu-ipynb), `whisper-base-fr-v2` and `whisper-base-fr-v2-q8_0` have WERs of 20.8% and 20.38%, respectively, on the Librispeech-French dataset.

#### More French-language output

The lower word error rates in the **next** table (French Voxpopuli) for the `-fr` models **may be the result of overfitting**. The `-fr` models were finetuned on French-Voxpopuli (`train` split), which includes audio transcriptions from European Parliament meetings. Although this evaluation is on the `test` split of the data, these recordings are likely to contain a large amount of language specific to this setting (e.g. "Monsieur le Président"). As a result, the finetuned `-fr` models are more likely to understand this domain-specific vocabulary and may perform much better on French-Voxpopuli than on the other datasets.

Be aware of this if using the French-Voxpopuli table for comparing the `-fr` finetuned models with the non-finetuned models.

In [31]:
print_results('French (Voxpopuli), see note', evaluate_all(dataset_fr_voxpopuli, model_paths, 'fr'))

100%|██████████| 128/128 [02:29<00:00,  1.16s/it, model=whisper-tiny-q4_0.dynamic_ctx.bin, wer=40.8]
100%|██████████| 128/128 [02:54<00:00,  1.36s/it, model=whisper-tiny-q8_0.dynamic_ctx.bin, wer=32.2]
100%|██████████| 128/128 [04:15<00:00,  2.00s/it, model=whisper-base-q4_0.dynamic_ctx.bin, wer=26.6]
100%|██████████| 128/128 [05:40<00:00,  2.66s/it, model=whisper-base-q8_0.dynamic_ctx.bin, wer=23.6]
100%|██████████| 128/128 [23:31<00:00, 11.03s/it, model=whisper-small-q5_0.dynamic_ctx.bin, wer=15.1]
100%|██████████| 128/128 [18:04<00:00,  8.48s/it, model=whisper-small-q8_0.dynamic_ctx.bin, wer=15]
100%|██████████| 128/128 [05:27<00:00,  2.56s/it, model=whisper-base-q8_0.fr.dynamic_ctx.bin, wer=12.3]
100%|██████████| 128/128 [18:13<00:00,  8.54s/it, model=whisper-small-q8_0.fr.dynamic_ctx.bin, wer=15.2]
100%|██████████| 128/128 [08:32<00:00,  4.00s/it, model=ggml-tiny.bin, wer=31.1]
100%|██████████| 128/128 [18:34<00:00,  8.70s/it, model=ggml-base.bin, wer=22.8]


Unnamed: 0,Path,WER (%),CER (%),avg_time (s)
0,whisper-tiny-q4_0.dynamic_ctx.bin,40.796356,19.771157,1.098012
1,whisper-tiny-q8_0.dynamic_ctx.bin,32.207995,15.571725,1.309695
2,whisper-base-q4_0.dynamic_ctx.bin,26.560328,12.882937,1.946184
3,whisper-base-q8_0.dynamic_ctx.bin,23.62841,11.465192,2.611852
4,whisper-small-q5_0.dynamic_ctx.bin,15.05934,8.006197,10.976995
5,whisper-small-q8_0.dynamic_ctx.bin,14.968593,7.996346,8.422948
6,whisper-base-q8_0.fr.dynamic_ctx.bin,12.34609,6.455489,2.505554
7,whisper-small-q8_0.fr.dynamic_ctx.bin,15.207588,8.059434,8.491384
8,ggml-tiny.bin,31.10724,14.628687,3.950077
9,ggml-base.bin,22.774681,10.716909,8.656325


<!--**Important**: The lower word error rates in the second table for the `-fr` models **may be the result of overfitting**. The `-fr` finetuned models were finetuned on the `train` and parameters were selected based on the `test` set. As a result, the "French (Voxpopuli)" includes data formated similarly to the finetuning data and may even include a subset of the test data. Avoid using the "French (**Voxpopuli**)" table for comparing the `-fr` finetuned models with the non-finetuned models.-->

### About the models

Results from several different models are listed in the tables above:
- **`ggml-*-*` models** (e.g. `ggml-base.bin`): These are the upstream Whisper models [packaged for `whisper.cpp`](https://huggingface.co/ggerganov/whisper.cpp). Many of these are also available from the Joplin releases page [with a slightly modified vocabulary](https://github.com/joplin/voice-typing-models/blob/main/whisper_vocab_cleanup.ipynb). For example, the `whisper-tiny-q8_0.zip` model from Joplin's release page is mostly equivalent to the `ggml-tiny-q8_0.bin` model listed above.
- **`whisper-*-fr.bin` models** (e.g. `whisper-base-fr-q8_0.bin`): Whisper models that have been fine-tuned on French. See also: [the fine-tuning notebook for `whisper-tiny-fr-v2.bin`](https://huggingface.co/personalizedrefrigerator/whisper-tiny-fr/blob/23876a7a3325a50bd4bcdaa376221af9b9e5d3f9/whisper-finetune-fr.ipynb).
    - **Note**: `whisper-tiny-fr.bin` is an older version of `whisper-tiny-fr-v2.bin` trained with fewer iterations and a modified vocabulary.
- **`whisper-*.dynamic_ctx.bin` models** (e.g. `whisper-base-q8_0.dynamic_ctx.bin`): Whisper models fine-tuned to support more efficient processing of audio shorter than 30 seconds (see [whisper-acft](https://github.com/futo-org/whisper-acft) for details).
    - Unlike `whisper-acft`, the `.dynamic_ctx.bin` models used French-language input (rather than English) during the fine-tuning process.