# Installing Whisper

The commands below will install the Python packages needed to use Whisper models and evaluate the transcription results.

In [None]:
! pip install git+https://github.com/openai/whisper.git
! pip install jiwer

Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-f9z0vhx0
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-f9z0vhx0
  Resolved https://github.com/openai/whisper.git to commit ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tiktoken (from openai-whisper==20231117)
  Downloading tiktoken-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 14.0 MB/s eta 0:00:00
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch->openai-whisper==20231117)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==

# Loading the LibriSpeech dataset

The following will load the test-clean split of the LibriSpeech corpus using torchaudio.

In [None]:
import os
import numpy as np

try:
    import tensorflow  # required in Colab to avoid protobuf compatibility issues
except ImportError:
    pass

import torch
import pandas as pd
import whisper
import torchaudio

from tqdm.notebook import tqdm


DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

In [None]:
class LibriSpeech(torch.utils.data.Dataset):
    """
    A simple class to wrap LibriSpeech and trim/pad the audio to 30 seconds.
    It will drop the last few seconds of a very small portion of the utterances.
    """
    def __init__(self, split="test-clean", device=DEVICE):
        self.dataset = torchaudio.datasets.LIBRISPEECH(
            root=os.path.expanduser("~/.cache"),
            url=split,
            download=True,
        )
        self.device = device

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, item):
        audio, sample_rate, text, _, _, _ = self.dataset[item]
        assert sample_rate == 16000
        audio = whisper.pad_or_trim(audio.flatten()).to(self.device)
        mel = whisper.log_mel_spectrogram(audio)

        return (mel, text)
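
The trimming and padding done in `__getitem__` can be pictured with a small pure-Python sketch. It assumes the 16 kHz sample rate asserted above and the 30-second window mentioned in the docstring; `pad_or_trim_samples` is a hypothetical stand-in written for illustration, not the `whisper.pad_or_trim` implementation, which operates on tensors and arrays.

```python
SAMPLE_RATE = 16000       # samples per second, as asserted in __getitem__
CHUNK_SECONDS = 30        # window length described in the class docstring
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480000 samples per window


def pad_or_trim_samples(samples, length=N_SAMPLES):
    """Return a list of exactly `length` samples (hypothetical helper).

    Longer inputs lose their tail; shorter inputs are zero-padded at the end.
    """
    if len(samples) > length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))


short_clip = [0.1] * (5 * SAMPLE_RATE)    # a 5-second utterance
long_clip = [0.1] * (35 * SAMPLE_RATE)    # a 35-second utterance

print(len(pad_or_trim_samples(short_clip)))  # padded up to 480000
print(len(pad_or_trim_samples(long_clip)))   # last 5 seconds dropped
```

This is why the docstring warns that a small portion of utterances lose their last few seconds: anything beyond the 30-second window is cut.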

In [None]:
dataset = LibriSpeech("test-clean")
loader = torch.utils.data.DataLoader(dataset, batch_size=16)

100%|██████████| 331M/331M [00:20<00:00, 16.9MB/s]


# Running inference on the dataset using a base Whisper model

The following will take a few minutes to transcribe all utterances in the dataset.

In [None]:
model = whisper.load_model("base.en")
print(
    f"Model is {'multilingual' if model.is_multilingual else 'English-only'} "
    f"and has {sum(np.prod(p.shape) for p in model.parameters()):,} parameters."
)

100%|███████████████████████████████████████| 139M/139M [00:03<00:00, 44.9MiB/s]


Model is English-only and has 71,825,408 parameters.


In [None]:
# predict without timestamps for short-form transcription
options = whisper.DecodingOptions(language="en", without_timestamps=True)

In [None]:
hypotheses = []
references = []

for mels, texts in tqdm(loader):
    results = model.decode(mels, options)
    hypotheses.extend([result.text for result in results])
    references.extend(texts)

  0%|          | 0/164 [00:00<?, ?it/s]

In [None]:
data = pd.DataFrame(dict(hypothesis=hypotheses, reference=references))
data

| | hypothesis | reference |
|---|---|---|
| 0 | He hoped there would be stew for dinner, turni... | HE HOPED THERE WOULD BE STEW FOR DINNER TURNIP... |
| 1 | Stuffered into you, his belly counseled him. | STUFF IT INTO YOU HIS BELLY COUNSELLED HIM |
| 2 | After early nightfall the yellow lamps would l... | AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD L... |
| 3 | Hello Bertie, any good in your mind? | HELLO BERTIE ANY GOOD IN YOUR MIND |
| 4 | Number 10. Fresh Nelly is waiting on you. Good... | NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD ... |
| ... | ... | ... |
| 2615 | Oh, to shoot my soul's full meaning into futur... | OH TO SHOOT MY SOUL'S FULL MEANING INTO FUTURE... |
| 2616 | Then I, long tried by natural ills, received t... | THEN I LONG TRIED BY NATURAL ILLS RECEIVED THE... |
| 2617 | I love thee freely as men strive for right. I ... | I LOVE THEE FREELY AS MEN STRIVE FOR RIGHT I L... |
| 2618 | I love thee with the passion put to use, in my... | I LOVE THEE WITH THE PASSION PUT TO USE IN MY ... |


# Calculating the word error rate

Next, we use Whisper's English text normalizer (`EnglishTextNormalizer`) to standardize both the hypotheses and the references before calculating the WER.

In [None]:
import jiwer
from whisper.normalizers import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()

In [None]:
data["hypothesis_clean"] = [normalizer(text) for text in data["hypothesis"]]
data["reference_clean"] = [normalizer(text) for text in data["reference"]]
data

| | hypothesis | reference | hypothesis_clean | reference_clean |
|---|---|---|---|---|
| 0 | He hoped there would be stew for dinner, turni... | HE HOPED THERE WOULD BE STEW FOR DINNER TURNIP... | he hoped there would be stew for dinner turnip... | he hoped there would be stew for dinner turnip... |
| 1 | Stuffered into you, his belly counseled him. | STUFF IT INTO YOU HIS BELLY COUNSELLED HIM | stuffered into you his belly counseled him | stuff it into you his belly counseled him |
| 2 | After early nightfall the yellow lamps would l... | AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD L... | after early nightfall the yellow lamps would l... | after early nightfall the yellow lamps would l... |
| 3 | Hello Bertie, any good in your mind? | HELLO BERTIE ANY GOOD IN YOUR MIND | hello bertie any good in your mind | hello bertie any good in your mind |
| 4 | Number 10. Fresh Nelly is waiting on you. Good... | NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD ... | number 10 fresh nelly is waiting on you good n... | number 10 fresh nelly is waiting on you good n... |
| ... | ... | ... | ... | ... |
| 2615 | Oh, to shoot my soul's full meaning into futur... | OH TO SHOOT MY SOUL'S FULL MEANING INTO FUTURE... | 0 to shoot my soul is full meaning into future... | 0 to shoot my soul is full meaning into future... |
| 2616 | Then I, long tried by natural ills, received t... | THEN I LONG TRIED BY NATURAL ILLS RECEIVED THE... | then i long tried by natural ills received the... | then i long tried by natural ills received the... |
| 2617 | I love thee freely as men strive for right. I ... | I LOVE THEE FREELY AS MEN STRIVE FOR RIGHT I L... | i love thee freely as men strive for right i l... | i love thee freely as men strive for right i l... |
| 2618 | I love thee with the passion put to use, in my... | I LOVE THEE WITH THE PASSION PUT TO USE IN MY ... | i love thee with the passion put to use in my ... | i love thee with the passion put to use in my ... |
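
To see why normalization matters before scoring, here is a much-simplified stand-in that only lowercases text and strips punctuation. `simple_normalize` is a hypothetical helper written for illustration; it is not the `EnglishTextNormalizer`, which, as the table above shows, also rewrites numbers ("TEN" becomes "10"), spelling variants ("COUNSELLED" becomes "counseled"), and contractions.

```python
import re


def simple_normalize(text):
    """Lowercase and strip punctuation (hypothetical, simplified normalizer)."""
    text = text.lower()
    # Replace everything except word characters, whitespace, and apostrophes.
    text = re.sub(r"[^\w\s']", " ", text)
    # Collapse the runs of whitespace left behind by removed punctuation.
    return re.sub(r"\s+", " ", text).strip()


hypothesis = "Hello Bertie, any good in your mind?"
reference = "HELLO BERTIE ANY GOOD IN YOUR MIND"

# Without normalization, case and punctuation differences would count as errors.
print(simple_normalize(hypothesis) == simple_normalize(reference))
```

Even this crude version makes row 3 of the table compare equal; the real normalizer goes further so that formatting differences are not counted as recognition errors.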


In [None]:
wer = jiwer.wer(list(data["reference_clean"]), list(data["hypothesis_clean"]))

print(f"WER: {wer * 100:.2f} %")

WER: 4.28 %


This code calculates the word error rate (WER) between the normalized reference and hypothesis transcripts stored in the DataFrame `data`, then prints it as a percentage with two decimal places.

WER is a common evaluation metric for speech recognition and machine translation. It is the word-level edit distance between the reference and the hypothesis (the number of substituted, deleted, and inserted words) divided by the number of words in the reference. A lower WER indicates better performance, and a WER of 0% means the hypothesis matches the reference exactly.
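
As a worked example, the sketch below computes WER for a single sentence pair with a word-level edit distance. `word_error_rate` is a hypothetical helper written for illustration under the definition given above; it is not how `jiwer` is implemented internally.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1       # reference word missing from hypothesis
            insertion = current[j - 1] + 1   # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current

    return previous[-1] / len(ref_words)


# Row 1 of the normalized table above.
reference = "stuff it into you his belly counseled him"
hypothesis = "stuffered into you his belly counseled him"
print(f"WER: {word_error_rate(reference, hypothesis) * 100:.2f} %")
```

Here "stuff" is substituted by "stuffered" and "it" is deleted: two errors over eight reference words.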

# Generating SRT files with Whisper

In [None]:
#@title 1.1 Install

import locale
# Colab workaround: force UTF-8; accept the optional argument some callers pass
locale.getpreferredencoding = lambda do_setlocale=True: "UTF-8"

!pip install faster-whisper
!pip install srt requests tqdm googletrans==4.0.0rc1 httpx aiometer
# https://stackoverflow.com/a/77671445
!apt install libcublas11

Collecting faster-whisper
  Downloading faster_whisper-1.0.1-py3-none-any.whl (1.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.5/1.5 MB 12.3 MB/s eta 0:00:00
Collecting av==11.* (from faster-whisper)
  Downloading av-11.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (32.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 32.9/32.9 MB 42.4 MB/s eta 0:00:00
Collecting ctranslate2<5,>=4.0 (from faster-whisper)
  Downloading ctranslate2-4.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (36.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36.7/36.7 MB 15.9 MB/s eta 0:00:00
Collecting tokenizers<0.16,>=0.13 (from faster-whisper)
  Downloading tokenizers-0.15.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.6/3.6 MB 86.9 MB/s

In [None]:
#@title GPU Check
!nvidia-smi

Tue Apr 23 02:40:33 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   45C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

### 1.2 Config

### Whisper
- `device`: `cuda` or `cpu`. Whether to run on the GPU.
- `model_size`: Name of the model. `distil` models are faster but lower quality.
- `compute_type`: `float16` runs in FP16 (the default here); `int8_float16` runs in INT8 on GPU; `int8` runs in INT8 on CPU.
- `beam_size`: Beam width for beam-search decoding. 5 is the faster-whisper default; do not change unless you know what you are doing.
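
The `compute_type` guidance above can be captured in a small helper, sketched here. `pick_compute_type` and its `prefer_int8` flag are hypothetical names introduced for illustration; the returned strings are the three values listed above.

```python
def pick_compute_type(device, prefer_int8=False):
    """Choose a compute_type string for the given device (hypothetical helper)."""
    if device == "cuda":
        # On GPU: FP16 by default, INT8 if a smaller memory footprint is preferred.
        return "int8_float16" if prefer_int8 else "float16"
    if device == "cpu":
        # On CPU the list above recommends INT8.
        return "int8"
    raise ValueError(f"unknown device: {device!r}")


print(pick_compute_type("cuda"))
print(pick_compute_type("cuda", prefer_int8=True))
print(pick_compute_type("cpu"))
```

A helper like this avoids pairing `device = "cpu"` with `compute_type = "float16"` by accident when switching the settings below.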

### Silero VAD
- `vad_filter`: Whether to use VAD. Recommended to reduce false positives.
- `threshold`: Speech probability threshold. Audio with a speech probability above this value is treated as speech. Higher = stricter.
- `min_speech_duration_ms`: Minimum duration of a speech chunk, in milliseconds. Shorter chunks are discarded.
- `max_speech_duration_s`: Maximum duration of a single speech chunk, in seconds. Reduced from infinite to 12 s.
- `min_silence_duration_ms`: At the end of each speech chunk, wait for this much silence (in milliseconds) before splitting it off.
- `window_size_samples`: Do not change unless you know what you are doing.
- `speech_pad_ms`: Padding (in milliseconds) added to the beginning and end of each VAD chunk to reduce false negatives.
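
To build intuition for how three of these settings interact, the sketch below post-processes a list of already-detected speech intervals. It is a deliberately simplified illustration, not the Silero VAD algorithm: `postprocess_speech_intervals` is a hypothetical function, the input intervals are made up, and the real VAD works on streaming speech probabilities rather than finished intervals.

```python
def postprocess_speech_intervals(intervals, min_speech_duration_ms=250,
                                 min_silence_duration_ms=2000, speech_pad_ms=400):
    """Merge, filter, and pad (start_ms, end_ms) speech intervals (simplified)."""
    merged = []
    for start, end in sorted(intervals):
        # A pause shorter than min_silence_duration_ms does not split a chunk.
        if merged and start - merged[-1][1] < min_silence_duration_ms:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    # Chunks shorter than min_speech_duration_ms are dropped as noise.
    kept = [(s, e) for s, e in merged if e - s >= min_speech_duration_ms]

    # Pad both sides so word onsets and endings are not clipped.
    return [(max(0, s - speech_pad_ms), e + speech_pad_ms) for s, e in kept]


# Made-up intervals in milliseconds, for illustration only.
raw = [(1000, 3000), (3500, 6000), (9000, 9100), (12000, 15000)]
print(postprocess_speech_intervals(raw))
```

With the settings used in this notebook, the first two intervals merge (500 ms pause), the 100 ms blip is discarded, and the survivors are widened by 400 ms on each side.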

### SRT Generation

_This setup is very much ACICFG opinionated._

The following combination of settings should achieve two goals:

1. No single subtitle line should be so long that it cannot be shown on one line with the default font and size; AND,
2. Every subtitle line should stay on screen long enough for viewers to read it.

- `max_text_len`: Maximum characters per line, so that text does not run off-screen. Best-effort basis. See `max_segment_interval`. Addresses point 1.
- `max_segment_interval`: Maximum gap, in seconds, between the end of the current line and the start of the next word for that word to be appended to the current line. Addresses point 2.
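
The two settings combine into a single merge test, sketched below with the default values from the Settings cell. `should_merge` is a hypothetical helper name; the condition mirrors the one used later in `sentence_segments_merger`.

```python
def should_merge(current, word, max_text_len=110, max_segment_interval=1.5):
    """Decide whether `word` may be appended to the `current` subtitle line."""
    gap = word["start"] - current["end"]                     # silence between them, seconds
    merged_len = len(current["text"] + " " + word["text"])   # length after appending
    return gap < max_segment_interval and merged_len < max_text_len


line = {"text": "hello", "start": 0.0, "end": 0.5}

# Close in time and short: merged.
print(should_merge(line, {"text": "world", "start": 0.6, "end": 1.0}))
# A 2.5 s pause exceeds max_segment_interval: starts a new line.
print(should_merge(line, {"text": "world", "start": 3.0, "end": 3.4}))
```

Both conditions must hold, so a long pause always starts a new subtitle even when the text would still fit.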


In [None]:
#@title Settings

# Whisper
device = "cuda" #@param ["cuda", "cpu"]
model_size = 'large-v3' #@param ["large-v3", "distil-large-v2", "distil-medium.en"]
compute_type = "float16" #@param ["float16", "int8_float16", "int8"]
beam_size = 5 #@param {type:"integer"}
whisper_debug = True #@param {type: "boolean"}
# Silero VAD
vad_filter = True #@param {type:"boolean"}
threshold = 0.5 #@param {type:"number"}
min_speech_duration_ms = 250 #@param {type:"integer"}
max_speech_duration_s = 12 #@param {type:"number"}
min_silence_duration_ms = 2000  #@param {type:"integer"}
window_size_samples = 1024 #@param [512, 1024, 1536]
speech_pad_ms = 400 #@param {type:"integer"}
# SRT Generation
use_whisper_sentence_segment = False #@param {type: "boolean"}
max_text_len = 110 #@param {type:"integer"}
max_segment_interval = 1.5 #@param {type:"number"}
# transcription_cutoff_char = 80 #@param {type:"integer"}
# align_extend = 2 #@param {type:"integer"}
# align_from_prev = True #@param {type:"boolean"}




In [None]:
#@title 1.3 Load Model
from faster_whisper import WhisperModel

import logging

logging.basicConfig()
logging.getLogger("faster_whisper").setLevel(logging.DEBUG)

model = WhisperModel(model_size, device=device, compute_type=compute_type)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/2.39k [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/340 [00:00<?, ?B/s]

vocabulary.json:   0%|          | 0.00/1.07M [00:00<?, ?B/s]

model.bin:   0%|          | 0.00/3.09G [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.48M [00:00<?, ?B/s]

## Step 2: Transcription and Alignment

Download a sample audio file from https://audio-samples.github.io/#section-1 and save it as `audio.mp3`.

In [None]:
#@title 2.1 Setup filename

filename = "audio.mp3" #@param {type:"string"}
transcribed_srt_name = 'transcribed.srt' #@param {type:"string"}

In [None]:
#@title 2.2 Transcribe! Speed: ~10x

segments, info = model.transcribe(filename,
                                  beam_size=beam_size,
                                  word_timestamps=True,
                                  vad_filter=vad_filter,
                                  vad_parameters={'threshold': threshold,
                                                  'min_speech_duration_ms': min_speech_duration_ms,
                                                  'max_speech_duration_s': max_speech_duration_s,
                                                  'min_silence_duration_ms': min_silence_duration_ms,
                                                  'window_size_samples': window_size_samples,
                                                  'speech_pad_ms': speech_pad_ms},
                                  )
print("Detected language '%s' with probability %f" % (info.language, info.language_probability))
segments = list(segments)  # consume the generator so transcription actually runs


INFO:faster_whisper:Processing audio with duration 00:10.043
INFO:faster_whisper:VAD filter removed 00:00.000 of audio
DEBUG:faster_whisper:VAD filter kept the following audio segments: [00:00.000 -> 00:10.043]
INFO:faster_whisper:Detected language 'en' with probability 1.00
DEBUG:faster_whisper:Processing segment at 00:00.000


Detected language 'en' with probability 1.000000


In [None]:
#@title 2.3 Generate SRT


import srt
from datetime import timedelta

def sentence_segments_merger(segments, max_text_len=80, max_segment_interval=2.0):
    """
    Merge consecutive segments into one segment while the merged text stays shorter than
    max_text_len and the gap between the segments stays shorter than max_segment_interval.
    :param segments: [{"text": "Hello, World!", "start": 1.1, "end": 4.4}, {"text": "Hello, World!", "start": 4.5, "end": 7.7}]
    :type segments: list of dicts
    :param max_text_len: Max length of the merged text, in characters
    :type max_text_len: int
    :param max_segment_interval: Max gap, in seconds, between two segments that may be merged
    :type max_segment_interval: float
    :return: Segments, but with merged sentences.
    :rtype: list of dicts  [{"text": "Hello, World! Hello, World!", "start": 1.1, "end": 7.7}]
    """
    merged_segments = []
    current_segment = None  # segment being built; None until the first non-empty input

    for segment in segments:
        # remove empty lines
        segment_text = segment["text"].strip()
        if not segment_text:
            continue

        if current_segment is None:
            current_segment = {"text": segment_text, "start": segment["start"], "end": segment["end"]}
            continue

        if segment["start"] - current_segment["end"] < max_segment_interval and \
                len(current_segment["text"] + " " + segment_text) < max_text_len:
            # close enough in time and still short enough: extend the current segment
            current_segment["text"] += " " + segment_text
            current_segment["end"] = segment["end"]
        else:
            # flush the current segment and start a new one from this segment,
            # so the segment that triggered the split is not lost
            merged_segments.append(current_segment)
            current_segment = {"text": segment_text, "start": segment["start"], "end": segment["end"]}

    # flush the final segment, which the loop itself never appends
    if current_segment is not None:
        merged_segments.append(current_segment)

    return merged_segments


segments_lst = []
for i in segments:
    for j in i.words:
        if j.word.strip():  # not empty string
            segments_lst.append({"text": j.word.strip(), "start": j.start, "end": j.end})

result_merged = sentence_segments_merger(segments_lst,
                                         max_text_len=max_text_len,
                                         max_segment_interval=max_segment_interval)

result_srt_list = []

# if use_whisper_sentence_segment:
#     for i, v in enumerate(segments):
#         result_srt_list.append(srt.Subtitle(index=i,
#                                         start=timedelta(seconds=v.start),
#                                         end=timedelta(seconds=v.end),
#                                         content=v.text.strip()))
# else:
for i, v in enumerate(result_merged):
    result_srt_list.append(srt.Subtitle(index=i,
                                        start=timedelta(seconds=v['start']),
                                        end=timedelta(seconds=v['end']),
                                        content=v['text'].strip()))

composed_transcription = srt.compose(result_srt_list)

with open(transcribed_srt_name, 'w', encoding='utf-8') as f:
    f.write(composed_transcription)

1. The `sentence_segments_merger` function takes a list of segments (dictionaries with `"text"`, `"start"`, and `"end"` keys), a maximum text length in characters (`max_text_len`), and a maximum gap in seconds (`max_segment_interval`).
2. It merges consecutive segments as long as the combined text is shorter than `max_text_len` and the gap between the end of one segment and the start of the next is shorter than `max_segment_interval`.
3. The merged segments are returned as a list of dictionaries with updated `"text"`, `"start"`, and `"end"` values.
4. The code then flattens the transcription result into word-level segments: for every word in each segment's `words` attribute, it appends a dictionary with the word's text, start time, and end time to `segments_lst`.
5. `sentence_segments_merger` is called with `segments_lst`, `max_text_len`, and `max_segment_interval` to obtain the merged segments (`result_merged`).
6. Each merged segment becomes an `srt.Subtitle` object in `result_srt_list`, with an index, a start time (`timedelta(seconds=start)`), an end time (`timedelta(seconds=end)`), and the text as content.
7. `srt.compose` turns `result_srt_list` into a single string in SRT format (`composed_transcription`).
8. Finally, `composed_transcription` is written to the file named by `transcribed_srt_name`.

In summary, this code turns word-level timestamps into an SRT file by merging consecutive words into subtitle lines, subject to the text length and time gap constraints. Each subtitle in the file carries the merged text with its start and end times.

The commented-out block is a disabled alternative: with `use_whisper_sentence_segment` enabled, it would build the SRT from Whisper's original segments instead of the merged word-level segments. As written, that setting has no effect.
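
To make the output format concrete, here is a minimal sketch of how one subtitle entry is laid out in an SRT file: an index line, a `start --> end` line with `HH:MM:SS,mmm` timestamps, and the text. `format_srt_timestamp` and `format_srt_block` are hypothetical helpers written for illustration; in the notebook this work is done by `srt.compose`.

```python
def format_srt_timestamp(seconds):
    """Render a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def format_srt_block(index, start, end, text):
    """Render one subtitle entry: index line, timing line, then the text."""
    timing = f"{format_srt_timestamp(start)} --> {format_srt_timestamp(end)}"
    return f"{index}\n{timing}\n{text}\n"


# Illustrative values only.
print(format_srt_block(1, 0.0, 6.14, "example subtitle text"))
```

Note the comma before the milliseconds; SRT uses a comma rather than a period as the decimal separator.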

### You should now see an SRT file with the name you chose (`transcribed.srt` by default): right-click it and download it

In [None]:
#@title 2.4 Optional: Peek at the SRT file
print(composed_transcription)

1
00:00:00,000 --> 00:00:06,140
my thought i have nobody by a beauty and will as you've poured mr rochester is sub and that so don't find


