If you plan to run inference on a GPU, make sure the appropriate NVIDIA drivers and CUDA 11.x are installed.

Install all required dependencies via the CLI:

```bash
sudo apt install ffmpeg
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```
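Before moving on, it can help to confirm that the external tools are actually visible from the environment. The following is a minimal sketch, assuming `ffmpeg` and `nvidia-smi` are the executables to look for; the helper name `missing_tools` is hypothetical and not part of this project.

```python
import shutil


def missing_tools(tools):
    """Return the names from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# nvidia-smi is only relevant if GPU inference is planned (assumption).
missing = missing_tools(["ffmpeg", "nvidia-smi"])
if missing:
    print(f"Not found on PATH: {missing}")
else:
    print("All expected tools were found on PATH.")
```

Finding `nvidia-smi` only shows that a driver is installed. It does not confirm that the CUDA version matches what the inference library expects.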

Choose the Whisper model identifier, the inference backend, and whether to run inference on the GPU.

In [5]:
desired_whisper_model = input("Whisper model version (e.g., base.en, large): ").strip()
inference_solution = int(
    input("Choose one using the number: (1) pywhispercpp or (2) faster-whisper: ")
)
# Any number other than 1 selects faster-whisper.
inference_solution = "cpp" if inference_solution == 1 else "faster"
# The GPU prompt is only shown for faster-whisper; pywhispercpp always runs without it here.
gpu_on = (
    "no"
    if inference_solution == "cpp"
    else input("Inference with GPU (yes or no)? ").strip().lower()
)

print(f"Desired Whisper Model: {desired_whisper_model}")
print(f"GPU inferencing: {gpu_on}")
print(f"Selected inference solution: {inference_solution}")

Desired Whisper Model: large
GPU inferencing: yes
Selected inference solution: faster


Read in all available audio files from `samples/audio`. This assumes that every audio file has a matching ground-truth transcript in `samples/truth`.

In [6]:
import os

audio_dir = "samples/audio/"
transcription_dir = "samples/transcription/"
truth_dir = "samples/truth/"

files = os.listdir(audio_dir)

filenames = []
for file in files:
    name, ext = os.path.splitext(file)
    if ext:  # skip entries that have no file extension
        filenames.append(name)

print(f"Files: {filenames}")

Files: ['0min12sec', '3min47sec', '13min56sec']
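The assumption that every audio file has a matching truth file can be checked before any transcription is run. This is a minimal sketch, assuming truth files are named `<name>.txt` as in the evaluation step; the helper `audio_without_truth` is hypothetical.

```python
from pathlib import Path


def audio_without_truth(audio_dir, truth_dir, truth_ext=".txt"):
    """Return the sorted stems of audio files that lack a matching truth file."""
    audio_stems = {
        p.stem for p in Path(audio_dir).iterdir() if p.is_file() and p.suffix
    }
    truth_stems = {p.stem for p in Path(truth_dir).glob(f"*{truth_ext}")}
    return sorted(audio_stems - truth_stems)


# Only run the check when the sample directories exist.
if Path("samples/audio").is_dir() and Path("samples/truth").is_dir():
    unmatched = audio_without_truth("samples/audio", "samples/truth")
    print(f"Audio files without a truth transcript: {unmatched}")
```

An empty list means the later evaluation loop should find a truth file for every transcript it reads.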


Use pywhispercpp, the Python binding for whisper.cpp, to run inference on Whisper GGML models. whisper.cpp is cross-platform and has bindings for several languages.

Use faster-whisper for inference accelerated by CTranslate2. Python only.

In [7]:
if inference_solution == "cpp":
    from pywhispercpp.model import Model

    model = Model(desired_whisper_model, models_dir="./models")

    for filename in filenames:
        audio_file = f"{audio_dir}{filename}.wav"
        print(f"Filename: {audio_file}")
        segments = model.transcribe(audio_file, speed_up=True)

        # Join the text of all segments into a single transcript string.
        transcript = " ".join(seg.text for seg in segments)

        transcript_file = f"{transcription_dir}{filename}-{inference_solution}-{desired_whisper_model}.txt"
        with open(transcript_file, "w") as f:
            f.write(transcript)

else:
    from faster_whisper import WhisperModel
    import time

    model = WhisperModel(
        desired_whisper_model,
        device="cuda" if gpu_on == "yes" else "cpu",
        compute_type="int8_float16" if gpu_on == "yes" else "int8",
        download_root="./models",
    )

    for filename in filenames:
        audio_file = f"{audio_dir}{filename}.wav"
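        # transcribe() returns a lazy generator; decoding runs when it is consumed.
        segments, info = model.transcribe(audio_file)

        print(f"Filename: {audio_file}")
        start = time.perf_counter()
        segments = list(segments)  # consuming the generator performs the transcription
        end = time.perf_counter()
        print(f"Transcribed in {end - start:0.4f} seconds")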

        # Join the text of all segments into a single transcript string.
        transcript = " ".join(seg.text for seg in segments)

        transcript_file = f"{transcription_dir}{filename}-{inference_solution}-{desired_whisper_model}.txt"
        with open(transcript_file, "w") as f:
            f.write(transcript)

[2023-10-19 16:11:21,085] {transcribe.py:263} INFO - Processing audio with duration 00:12.669
[2023-10-19 16:11:21,281] {transcribe.py:317} INFO - Detected language 'en' with probability 0.98
Filename: samples/audio/0min12sec.wav
Transcribed in 0.3175 seconds
[2023-10-19 16:11:21,678] {transcribe.py:263} INFO - Processing audio with duration 03:47.527
[2023-10-19 16:11:21,872] {transcribe.py:317} INFO - Detected language 'en' with probability 0.96
Filename: samples/audio/3min47sec.wav
Transcribed in 6.9232 seconds
[2023-10-19 16:11:28,895] {transcribe.py:263} INFO - Processing audio with duration 13:58.191
[2023-10-19 16:11:29,421] {transcribe.py:317} INFO - Detected language 'en' with probability 1.00
Filename: samples/audio/13min56sec.wav
Transcribed in 23.7832 seconds
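The durations and timings in the log above can be combined into a speed-up relative to real time (audio duration divided by transcription time). This is a minimal sketch, assuming the log prints durations as `MM:SS.mmm`; the helper names are hypothetical. The measured time covers only the consumption of the segment generator, so it may exclude work done earlier, such as language detection.

```python
def duration_to_seconds(stamp: str) -> float:
    """Convert an 'MM:SS.mmm' duration string to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


def speedup(audio_duration: str, elapsed_seconds: float) -> float:
    """Return how many times faster than real time the transcription ran."""
    return duration_to_seconds(audio_duration) / elapsed_seconds


# (audio duration from the log, measured transcription time in seconds)
runs = {
    "0min12sec": ("00:12.669", 0.3175),
    "3min47sec": ("03:47.527", 6.9232),
    "13min56sec": ("13:58.191", 23.7832),
}

for name, (duration, elapsed) in runs.items():
    print(f"[{name}] {speedup(duration, elapsed):.1f}x faster than real time")
```

A single run per file is a rough indication only; repeated runs would be needed for a stable comparison between backends.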


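Calculate the Word Error Rate (WER) and Word Information Loss (WIL) of each transcript against its ground truth. WER is the number of word substitutions, deletions, and insertions divided by the number of words in the ground truth. WIL is also a word-level measure: it is one minus the product of the fraction of ground-truth words and the fraction of transcript words that were matched correctly. Neither metric captures meaning, and lower is better for both.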
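To make the two metrics concrete, the sketch below computes both from word-level hit, substitution, deletion, and insertion counts obtained with a plain edit-distance alignment. The function names are hypothetical, no text normalization is applied, and ties between equally cheap alignments may be resolved differently than in jiwer, so treat it as an illustration rather than a replacement for the library.

```python
def word_error_counts(reference: str, hypothesis: str) -> dict:
    """Align two texts word by word and count hits and error types."""
    ref, hyp = reference.split(), hypothesis.split()
    # table[i][j] = (cost, -hits, subs, dels, ins) for ref[:i] vs hyp[:j].
    # Tuples compare lexicographically: lowest cost first, then most hits.
    table = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    table[0][0] = (0, 0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0, 0, i, 0)  # only deletions
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j, 0, 0, 0, j)  # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            c, h, s, d, n = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (c, h - 1, s, d, n)  # hit
            else:
                diagonal = (c + 1, h, s + 1, d, n)  # substitution
            c, h, s, d, n = table[i - 1][j]
            deletion = (c + 1, h, s, d + 1, n)
            c, h, s, d, n = table[i][j - 1]
            insertion = (c + 1, h, s, d, n + 1)
            table[i][j] = min(diagonal, deletion, insertion)
    _, neg_hits, subs, dels, ins = table[-1][-1]
    return {
        "hits": -neg_hits,
        "substitutions": subs,
        "deletions": dels,
        "insertions": ins,
    }


def wer_and_wil(reference: str, hypothesis: str) -> tuple:
    """Return (WER, WIL); assumes the reference contains at least one word."""
    counts = word_error_counts(reference, hypothesis)
    h, s = counts["hits"], counts["substitutions"]
    d, i = counts["deletions"], counts["insertions"]
    n_ref = h + s + d  # words in the reference
    n_hyp = h + s + i  # words in the hypothesis
    wer = (s + d + i) / n_ref
    wil = 1.0 if n_hyp == 0 else 1 - (h / n_ref) * (h / n_hyp)
    return wer, wil


example_wer, example_wil = wer_and_wil(
    "the cat sat on the mat", "the cat sit on mat"
)
print(f"WER: {example_wer:.4f}, WIL: {example_wil:.4f}")
```

In this example there are four hits, one substitution, and one deletion, so WER is 2/6 and WIL is 1 - (4/6)(4/5).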

Calculate the cosine similarity between the ground truth and the transcript using a sentence embedding model, as a rough measure of how close the two texts are in meaning.
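For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below shows the calculation on small hand-written vectors rather than real embeddings; the function name is hypothetical and the vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Vectors pointing in the same direction score close to 1.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
# Orthogonal vectors score 0.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Embedding models may truncate long inputs, so for long transcripts the score may reflect only the beginning of each text.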

In [8]:
from jiwer import process_words  # `wer` is not imported: the loop below uses that name for the result
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


for filename in filenames:
    truth = ""
    transcript = ""

    transcript_file = f"{transcription_dir}{filename}-{inference_solution}-{desired_whisper_model}.txt"
    with open(transcript_file, "r") as f:
        transcript = f.read()

    truth_file = f"{truth_dir}{filename}.txt"
    with open(truth_file, "r") as f:
        truth = f.read()

    output = process_words(truth, transcript)

    wer = output.wer
    wil = output.wil

    truth_embedding = model.encode(truth, convert_to_tensor=True)
    transcript_embedding = model.encode(transcript, convert_to_tensor=True)

    document_similarity = util.pytorch_cos_sim(
        truth_embedding, transcript_embedding
    ).item()

    print(f"[{filename}] Word Error Rate: {wer}")
    print(f"[{filename}] Word Information Loss: {wil}")
    print(f"[{filename}] Document Cosine-Similarity: {document_similarity}")

[2023-10-19 16:11:53,211] {SentenceTransformer.py:66} INFO - Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2
[2023-10-19 16:11:53,291] {SentenceTransformer.py:105} INFO - Use pytorch device: cuda


Batches: 100%|██████████| 1/1 [00:00<00:00, 490.16it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 547.13it/s]


[0min12sec] Word Error Rate: 0.08571428571428572
[0min12sec] Word Information Loss: 0.1394957983193278
[0min12sec] Document Cosine-Similarity: 0.9969894886016846


Batches: 100%|██████████| 1/1 [00:00<00:00, 388.79it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 399.08it/s]


[3min47sec] Word Error Rate: 0.07142857142857142
[3min47sec] Word Information Loss: 0.11541518879078305
[3min47sec] Document Cosine-Similarity: 0.9909672737121582


Batches: 100%|██████████| 1/1 [00:00<00:00, 180.62it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 210.38it/s]

[13min56sec] Word Error Rate: 0.06532222709338009
[13min56sec] Word Information Loss: 0.10532975919927212
[13min56sec] Document Cosine-Similarity: 0.9494907855987549



