# Transcribing videos

Why might you transcribe videos? Because watching each and every video takes *so long!* To see a use case, check out Lam Thuy Vo's [Misinformation on TikTok: How 'Documented' Examined Hundreds of Videos in Different Languages](https://pulitzercenter.org/misinformation-tiktok-how-documented-examined-hundreds-videos-different-languages), or any of the ten thousand projects local newsrooms are doing about transcribing community meetings from YouTube.

> YouTube transcripts are awful. Terrible. So bad. Don't ever trust them.

## Our favorite (local, DIY) transcription tool: Whisper

OpenAI has released other AI tools besides ChatGPT – one of the most popular is [Whisper](https://openai.com/research/whisper), a model that can **transcribe audio**. The fact, technical name for this is "speech to text."

Unlike GPT, **you can actually download and use Whisper**. Python programmers can bop on over to [the GitHub repo](https://github.com/openai/whisper) and coding with it minutes.

Because Whisper is freely available to use and adapt, you'll see all sorts of Whisper-powered tools out there. [MacWhisper](https://goodsnooze.gumroad.com/l/macwhisper) allows you to transcribe audio from the safety of your mac - powered by Whisper! [This random website](https://whisperui.com/) allows to drag-and-drop audio files and transcribe them on the web – powered by Whisper!

And now we'll do the exact same thing right here, in Python – powered by Whisper!

## But... Whisper is actually bad!

[According to everyone](https://apnews.com/article/ai-artificial-intelligence-health-business-90020cdf5fa16c79ca2e5b6c4c9bbb14), and the excellently-named paper [Careless Whisper: Speech-to-Text Hallucination Harms](https://dl.acm.org/doi/10.1145/3630106.3658996), Whiper makes *a lot of bad mistakes.*

> In an example they uncovered, a speaker said, “He, the boy, was going to, I’m not sure exactly, take the umbrella.”
>
> But the transcription software added: “He took a big piece of a cross, a teeny, small piece ... I’m sure he didn’t have a terror knife so he killed a number of people.”

One of the biggest problems is **silence**. Like human beings, Whisper isn't very good at dealing with silence! It's trained to transcribe transcribe transcribe, so when there's silence it tends to start writing regardless of what's going on.

One way to fix this is **voice activity detection**, which cuts out silences before it transcribes.

Even though we can use [the original Whisper](https://github.com/openai/whisper) for transcription, other people have build *other* Python tools on top of it. As a result, we have great libraries like [WhisperX](https://github.com/m-bain/whisperX) which had add-ons like VAD, speaker diarization (splitting speakers!) and more. It's a little more unwieldy to use, but it's worth it.

In [1]:
%pip install --quiet --upgrade "yt-dlp[default]"
%pip install --quiet --upgrade whisperx "torch<2.6" torchaudio torchvision

[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pyppeteer 2.0.0 requires websockets<11.0,>=10.0, but you have websockets 14.2 which is incompatible.
gradio-client 0.10.1 requires websockets<12.0,>=10.0, but you have websockets 14.2 which is incompatible.[0m[31m
[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m25.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


# Downloading our video

Maybe we've talked about [yt-dlp](https://github.com/yt-dlp/yt-dlp) already, I'm covering it in like *every single session*! I usually show off the command-line version, but in this situation I'll show you what it looks like when you're using **pure Python.** We're going to automatically download the mp3 of the audio instead of the visual part of the video.

> I refuse to memorize these commands, I always [look them up with an LLM](https://chatgpt.com/share/67c73c6b-5424-800d-9cf0-ef403a9a8410). As long as you keep yt-dlp updated, it's simple to do things like "download this whole playlist" or "the most recent 5 videos on this account."

In [6]:
import yt_dlp

url = "https://www.youtube.com/watch?v=s-4yh3XY5wU"

ydl_opts = {
    "format": "bestaudio/best",
    "outtmpl": "output.mp3",
    "postprocessors": [
        {
            "key": "FFmpegExtractAudio",
            "preferredcodec": "mp3",
            "preferredquality": "192",
        }
    ],
}

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download([url])

[youtube] Extracting URL: https://www.youtube.com/watch?v=s-4yh3XY5wU
[youtube] s-4yh3XY5wU: Downloading webpage
[youtube] s-4yh3XY5wU: Downloading tv client config
[youtube] s-4yh3XY5wU: Downloading player 5ae7d525
[youtube] s-4yh3XY5wU: Downloading tv player API JSON
[youtube] s-4yh3XY5wU: Downloading ios player API JSON
[youtube] s-4yh3XY5wU: Downloading m3u8 information
[info] s-4yh3XY5wU: Downloading 1 format(s): 251
[download] output.mp3 has already been downloaded
[download] 100% of   10.00MiB
[ExtractAudio] Not converting audio output.mp3; file is already in target format mp3


## Transcribe with WhisperX

Just like any other AI thing, Whisper isn't just one piece of software - it's a *collection of models* with different sizes and names that you have to download separately.

You can see [the models here](https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages). We're going to start with `tiny.en`, an English-only model that is the smallest and fastest.

In [61]:
%%time

import whisperx
import torch

audio_file = "output.mp3"

device = "cuda" if torch.cuda.is_available() else "cpu"
batch_size = 16 if device == "cuda" else 8
compute_type = "float16" if device == "cuda" else "float32" 

model = whisperx.load_model("tiny.en", device, compute_type=compute_type)

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print("Transcribed")

model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)
print("Aligned")

Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../../../../../../../../.pyenv/versions/3.10.13/lib/python3.10/site-packages/whisperx/assets/pytorch_model.bin`


Model was trained with pyannote.audio 0.0.1, yours is 3.3.2. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.5.1. Bad things might happen unless you revert torch to 1.x.
Transcribed
Aligned
CPU times: user 3min 52s, sys: 1min 1s, total: 4min 53s
Wall time: 52.7 s


We can look at the output with timecodes...

In [62]:
import pandas as pd

df = pd.json_normalize(result['segments'])
df.head()

Unnamed: 0,start,end,text,words
0,0.271,2.054,Hello and welcome to Vancouver Carpenter.,"[{'word': 'Hello', 'start': 0.271, 'end': 0.53..."
1,2.634,7.841,"If you're new to drywall, picking the right mu...","[{'word': 'If', 'start': 2.634, 'end': 2.694, ..."
2,7.881,12.506,"We've got light mud, we've got all purpose, we...","[{'word': 'We've', 'start': 7.881, 'end': 8.06..."
3,13.087,15.09,So it's hard to know exactly which mud to choose.,"[{'word': 'So', 'start': 13.087, 'end': 13.187..."
4,15.67,18.592,So I'm going to help break that down for you ...,"[{'word': 'So', 'start': 15.67, 'end': 15.87, ..."


...or we can just grab the text.

In [63]:
tiny_en_text = ' '.join([segment['text'] for segment in result['segments']])
print(tiny_en_text)

 Hello and welcome to Vancouver Carpenter. If you're new to drywall, picking the right mud can be kind of a daunting task with so many different types. We've got light mud, we've got all purpose, we've got heavyweight, we've got topping, we've got joint. So it's hard to know exactly which mud to choose.  So I'm going to help break that down for you so you can know which one to pick. We're also going to do this in the order that we tape with. So first we're going to start with quick set mugs. So these are the powdered mugs that are bought in bag form. So one of the ones we use a lot here in Western Canada and this isn't available everywhere but it's called concrete fill. So the specific purpose of this mud right here  is actually for skimming out concrete ceilings. So it's got really good adhesion and it's got some light aggregate in it, which I believe is something like pear light. So it's squishy, not like sand.  So that's for skimming out ceilings. It has a really great floatability,

Around one minute to transcribe 7 minutes of audio. Not awful, I guess!

Whisper models go all the way up to **large-v3**, but it's pretty slow! OpenAI recommends you use their new-ish **turbo** model, which is just about as good as the large models but much much faster.

In [64]:
%%time

import whisperx
import torch

audio_file = "output.mp3"

device = "cuda" if torch.cuda.is_available() else "cpu"
batch_size = 16 if device == "cuda" else 8
compute_type = "float16" if device == "cuda" else "float32" 

model = whisperx.load_model("turbo", device, compute_type=compute_type)

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print("Transcribed")

model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)
print("Aligned")

No language specified, language will be first be detected for each audio file (increases inference time).


Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../../../../../../../../.pyenv/versions/3.10.13/lib/python3.10/site-packages/whisperx/assets/pytorch_model.bin`


Model was trained with pyannote.audio 0.0.1, yours is 3.3.2. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.5.1. Bad things might happen unless you revert torch to 1.x.
Detected language: en (1.00) in first 30s of audio...
Transcribed
Aligned
CPU times: user 7min 13s, sys: 2min, total: 9min 13s
Wall time: 2min 19s


It took two minutes on the "turbo" setting.

In [65]:
turbo_text = ' '.join([segment['text'] for segment in result['segments']])

# First 1500 characters
print(turbo_text[:1500])

 Hello and welcome to Vancouver Carpenter. If you're new to drywall, picking the right mud can be kind of a daunting task with so many different types. We've got light mud, we've got all purpose, we've got heavy weight, we've got topping, we've got joint. So it's hard to know exactly which mud to choose.  So, I'm going to help break that down for you so you can know which one to pick. We're also going to do this in the order that we tape with. So, first we're going to start with quick set muds. So, these are the powdered mugs that are bought in bag form. So, one of the ones we use a lot here in Western Canada, and this isn't available everywhere, but it's called concrete fill. So, the specific purpose of this mud, right here,  is actually for skimming out concrete ceilings. So it's got really good adhesion and it's got some light aggregate in it, which I believe is something like perlite. So it's squishy, not like sand.  So that's for skimming out ceilings. It has really great floatabi

## Comparing transcripts

There's no good comparison library in Python! I like to just throw files into [VS Code](https://code.visualstudio.com/) and do it manually, but we'll do a little DIY situation to compare here.

In [89]:
%pip install --quiet --upgrade rich diff_match_patch

^C
ERROR: Operation cancelled by user
Note: you may need to restart the kernel to use updated packages.


In [88]:
from rich.console import Console
from rich.markup import escape
from diff_match_patch import diff_match_patch

console = Console(record=True, width=100)

def pretty_diff_rich(text1, text2):
    dmp = diff_match_patch()
    diffs = dmp.diff_main(text1, text2)
    dmp.diff_cleanupSemantic(diffs)  # Clean up to make the diff more readable

    formatted_output = []
    for op, text in diffs:
        safe_text = escape(text)  # Prevent issues with brackets in Rich

        if op == -1:
            formatted_output.append(f"[black on #ffcccc]{safe_text}[/black on #ffcccc]")  # Light red background for deletions
        elif op == 1:
            formatted_output.append(f"[black on #ccffcc]{safe_text}[/black on #ccffcc]")  # Light green background for additions
        else:
            formatted_output.append(safe_text)  # Normal text

    console.print("".join(formatted_output))


pretty_diff_rich(tiny_en_text, turbo_text)


# Download and transcribe many videos

In [8]:
import pandas as pd

urls = [
    'eIK50QLHpOc',
    's-4yh3XY5wU',
    'T4g-OBXCy1k',
    'GIvmfBuAQIw',
    'CzrnOujf8YA'
]
df = pd.DataFrame({'video_id': urls})
df['url'] = 'https://www.youtube.com/watch?v=' + df['video_id']
df

Unnamed: 0,video_id,url
0,eIK50QLHpOc,https://www.youtube.com/watch?v=eIK50QLHpOc
1,s-4yh3XY5wU,https://www.youtube.com/watch?v=s-4yh3XY5wU
2,T4g-OBXCy1k,https://www.youtube.com/watch?v=T4g-OBXCy1k
3,GIvmfBuAQIw,https://www.youtube.com/watch?v=GIvmfBuAQIw
4,CzrnOujf8YA,https://www.youtube.com/watch?v=CzrnOujf8YA


In [13]:
from yt_dlp import YoutubeDL
from pathlib import Path

download_dir = Path("downloads")

video_dir = download_dir / video_dir
audio_dir = download_dir / audio_dir

video_dir.mkdir(exist_ok=True, parents=True)
audio_dir.mkdir(exist_ok=True, parents=True)

video_opts = {
    'format': 'bestvideo[height<=720]+bestaudio',
    'outtmpl': str(video_dir / '%(id)s.%(ext)s'),  # Using / operator for paths
    'quiet': True,
    'ignoreerrors': True,
    'no_warnings': False
}

audio_opts = {
    'format': 'bestaudio',
    'outtmpl': str(audio_dir / '%(id)s.%(ext)s'),
    'postprocessors': [{
        'key': 'FFmpegExtractAudio',
        'preferredcodec': 'mp3',
        'preferredquality': '192',
    }],
    'quiet': False,
    'ignoreerrors': True
}

try:
    with YoutubeDL(audio_opts) as ydl:
        ydl.download(df.url)
except Exception as e:
    print(f"Error during download: {e}")

[youtube] Extracting URL: https://www.youtube.com/watch?v=eIK50QLHpOc
[youtube] eIK50QLHpOc: Downloading webpage
[youtube] eIK50QLHpOc: Downloading tv client config
[youtube] eIK50QLHpOc: Downloading player 9c6dfc4a
[youtube] eIK50QLHpOc: Downloading tv player API JSON
[youtube] eIK50QLHpOc: Downloading ios player API JSON
[youtube] eIK50QLHpOc: Downloading m3u8 information
[info] eIK50QLHpOc: Downloading 1 format(s): 251
[download] Destination: downloads/downloads/audio/eIK50QLHpOc.webm
[download] 100% of    2.16MiB in 00:00:00 at 7.05MiB/s   
[ExtractAudio] Destination: downloads/downloads/audio/eIK50QLHpOc.mp3
Deleting original file downloads/downloads/audio/eIK50QLHpOc.webm (pass -k to keep)
[youtube] Extracting URL: https://www.youtube.com/watch?v=s-4yh3XY5wU
[youtube] s-4yh3XY5wU: Downloading webpage
[youtube] s-4yh3XY5wU: Downloading tv client config
[youtube] s-4yh3XY5wU: Downloading tv player API JSON
[youtube] s-4yh3XY5wU: Downloading ios player API JSON
[youtube] s-4yh3XY5wU

In [15]:
df['audio_path'] = "downloads/audio/" + df['video_id'] + ".mp3"
df

Unnamed: 0,video_id,url,audio_path
0,eIK50QLHpOc,https://www.youtube.com/watch?v=eIK50QLHpOc,downloads/audio/eIK50QLHpOc.mp3
1,s-4yh3XY5wU,https://www.youtube.com/watch?v=s-4yh3XY5wU,downloads/audio/s-4yh3XY5wU.mp3
2,T4g-OBXCy1k,https://www.youtube.com/watch?v=T4g-OBXCy1k,downloads/audio/T4g-OBXCy1k.mp3
3,GIvmfBuAQIw,https://www.youtube.com/watch?v=GIvmfBuAQIw,downloads/audio/GIvmfBuAQIw.mp3
4,CzrnOujf8YA,https://www.youtube.com/watch?v=CzrnOujf8YA,downloads/audio/CzrnOujf8YA.mp3


In [31]:
%%time
import whisperx
import torch

from tqdm.notebook import tqdm
tqdm.pandas()

device = "cuda" if torch.cuda.is_available() else "cpu"
batch_size = 16 if device == "cuda" else 4
compute_type = "float16" if device == "cuda" else "int8" 

model = whisperx.load_model("tiny.en", device, compute_type=compute_type)

def get_text(video_id):
    try:
        audio_file = f"downloads/audio/{video_id}.mp3"
        audio = whisperx.load_audio(audio_file)
        result = model.transcribe(audio, batch_size=batch_size)
        
        model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
        result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)
        
        text = '\n'.join([segment['text'] for segment in result['segments']])
        return text
    except Exception as e:
        print(f"Error with {video_id}: {e}")
        return None

df['text'] = df.video_id.progress_apply(get_text)

Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../../../../../../../.pyenv/versions/3.10.13/lib/python3.10/site-packages/whisperx/assets/pytorch_model.bin`


Model was trained with pyannote.audio 0.0.1, yours is 3.3.2. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.5.1. Bad things might happen unless you revert torch to 1.x.


  0%|          | 0/5 [00:00<?, ?it/s]

CPU times: user 12min 54s, sys: 2min 52s, total: 15min 46s
Wall time: 2min 32s


Unnamed: 0,video_id,url,audio_path,text
0,eIK50QLHpOc,https://www.youtube.com/watch?v=eIK50QLHpOc,downloads/audio/eIK50QLHpOc.mp3,Welcome to Vancouver Carpenter.\nSo are those...
1,s-4yh3XY5wU,https://www.youtube.com/watch?v=s-4yh3XY5wU,downloads/audio/s-4yh3XY5wU.mp3,Hello and welcome to Vancouver Carpenter.\nIf...
2,T4g-OBXCy1k,https://www.youtube.com/watch?v=T4g-OBXCy1k,downloads/audio/T4g-OBXCy1k.mp3,Welcome back to Vancouver Carpenter.\nSo I ge...
3,GIvmfBuAQIw,https://www.youtube.com/watch?v=GIvmfBuAQIw,downloads/audio/GIvmfBuAQIw.mp3,Hello and welcome to Vancouver Carpenter.\nSo...
4,CzrnOujf8YA,https://www.youtube.com/watch?v=CzrnOujf8YA,downloads/audio/CzrnOujf8YA.mp3,Welcome to Vancouver Carpenter.\nToday's vide...


In [33]:
df.head()

Unnamed: 0,video_id,url,audio_path,text
0,eIK50QLHpOc,https://www.youtube.com/watch?v=eIK50QLHpOc,downloads/audio/eIK50QLHpOc.mp3,Welcome to Vancouver Carpenter.\nSo are those...
1,s-4yh3XY5wU,https://www.youtube.com/watch?v=s-4yh3XY5wU,downloads/audio/s-4yh3XY5wU.mp3,Hello and welcome to Vancouver Carpenter.\nIf...
2,T4g-OBXCy1k,https://www.youtube.com/watch?v=T4g-OBXCy1k,downloads/audio/T4g-OBXCy1k.mp3,Welcome back to Vancouver Carpenter.\nSo I ge...
3,GIvmfBuAQIw,https://www.youtube.com/watch?v=GIvmfBuAQIw,downloads/audio/GIvmfBuAQIw.mp3,Hello and welcome to Vancouver Carpenter.\nSo...
4,CzrnOujf8YA,https://www.youtube.com/watch?v=CzrnOujf8YA,downloads/audio/CzrnOujf8YA.mp3,Welcome to Vancouver Carpenter.\nToday's vide...
