# YouTube Video Transcription with OpenAI's Whisper

[![License](https://img.shields.io/github/license/kazuki-sf/youtube-whisper)](https://github.com/kazuki-sf/youtube-whisper)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kazuki-sf/youtube-whisper/blob/main/youtube_whisper.ipynb)

## How to Use the Notebook
Run the notebook directly, or use `Copy to Drive` to save your own copy first.
1. Enter the URL of the YouTube video or Short you want to transcribe.
2. Choose the Whisper model you want to use.
3. Run the code cells (Steps 1-3) in order and wait for the transcription to complete.

## Notes
* A `T4 GPU` or better is recommended for running the notebook. To change the runtime type, go to `Runtime` -> `Change runtime type` -> `Hardware accelerator` -> `GPU`.
* Whenever you change the YouTube URL or the Whisper model, rerun `Step 1` and then `Step 3`. You can skip `Step 2` if it has already run in the current runtime.
* When you run `Step 3`, your browser may ask for permission to download multiple files.
* This project is not affiliated with OpenAI. The code provided here is for educational purposes only.
* The table below lists the Whisper models and their relative speeds. For more information, see the official GitHub page: https://github.com/openai/whisper#available-models-and-languages
---

|  Size  | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|:------:|:----------:|:------------------:|:------------------:|:-------------:|:--------------:|
|  tiny  |    39 M    |     `tiny.en`      |       `tiny`       |     ~1 GB     |      ~32x      |
|  base  |    74 M    |     `base.en`      |       `base`       |     ~1 GB     |      ~16x      |
| small  |   244 M    |     `small.en`     |      `small`       |     ~2 GB     |      ~6x       |
| medium |   769 M    |    `medium.en`     |      `medium`      |     ~5 GB     |      ~2x       |
| large  |   1550 M   |        N/A         |      `large`       |    ~10 GB     |       1x       |
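
As a rough illustration of how the table can guide the choice in Step 1, the sketch below picks the largest model whose approximate VRAM requirement fits a given budget. The figures are copied from the table; the function name and the `headroom_gb` safety margin are hypothetical and not part of Whisper.

```python
# Approximate VRAM needs in GB, copied from the table above (smallest first).
MODEL_VRAM_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large", 10),
]


def largest_model_that_fits(free_vram_gb, headroom_gb=1.0):
    """Return the largest model that fits in the VRAM budget, or None.

    `headroom_gb` is an assumed safety margin, not a Whisper setting.
    """
    budget = free_vram_gb - headroom_gb
    choice = None
    for name, needed_gb in MODEL_VRAM_GB:
        # The list is ordered by size, so the last match is the largest fit.
        if needed_gb <= budget:
            choice = name
    return choice


print(largest_model_that_fits(15))  # large
print(largest_model_that_fits(4))   # small
```

The VRAM numbers are approximate, so treat the result as a starting point and drop to a smaller model if loading fails with an out-of-memory error.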



In [None]:
# @title Step 1: Enter URL & Choose Whisper Model

# @markdown Enter the URL of the YouTube video
YouTube_URL = "https://www.youtube.com/watch?v=xQrBGcdUdVg" #@param {type:"string"}

# @markdown Choose the whisper model you want to use
whisper_model = "small" # @param ["tiny", "base", "small", "medium", "large", "large-v2", "large-v3"]

# @markdown Save the transcription as a text (.txt) file?
text = True #@param {type:"boolean"}

# @markdown Save the transcription as an SRT (.srt) file?
srt = False #@param {type:"boolean"}
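
Before running the later steps, it can help to check that the value entered for `YouTube_URL` looks like a video or Shorts link. The following is a minimal sketch using only the standard library; the helper name `extract_video_id` and the host list are assumptions made for illustration, and yt-dlp itself accepts many more URL forms than this pre-check does.

```python
from urllib.parse import urlparse, parse_qs

# Hosts accepted by this sketch (an assumption, not an exhaustive list).
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def extract_video_id(url):
    """Return the video ID from a watch, Shorts, or youtu.be URL, else None."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    if parsed.scheme not in ("http", "https") or host not in YOUTUBE_HOSTS:
        return None
    if host == "youtu.be":
        # Short links carry the ID as the first path component.
        video_id = parsed.path.lstrip("/").split("/")[0]
    elif parsed.path == "/watch":
        # Regular links carry the ID in the "v" query parameter.
        video_id = parse_qs(parsed.query).get("v", [""])[0]
    elif parsed.path.startswith("/shorts/"):
        video_id = parsed.path.split("/")[2]
    else:
        video_id = ""
    return video_id or None


print(extract_video_id("https://www.youtube.com/watch?v=xQrBGcdUdVg"))
```

A `None` result only means the URL does not match the patterns in this sketch; the download step remains the authoritative check.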


In [None]:
# Step 2: Install Dependencies (this may take about 2-3 min)

!pip install yt-dlp
!pip install -q git+https://github.com/openai/whisper.git

import os, re
import torch
from pathlib import Path
import yt_dlp
import whisper
from whisper.utils import get_writer

Collecting yt-dlp
  Downloading yt_dlp-2024.7.9-py3-none-any.whl (3.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.1/3.1 MB 14.8 MB/s eta 0:00:00
Collecting brotli (from yt-dlp)
  Downloading Brotli-1.1.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.0/3.0 MB 30.3 MB/s eta 0:00:00
Collecting mutagen (from yt-dlp)
  Downloading mutagen-1.47.0-py3-none-any.whl (194 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 194.4/194.4 kB 25.2 MB/s eta 0:00:00
Collecting pycryptodomex (from yt-dlp)
  Downloading pycryptodomex-3.20.0-cp35-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 34.6 MB/s eta 0:00:00
Collecting requests<3,>=2.32.2 (fr

In [None]:
# Step 3: Transcribe the video/audio data

device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model(whisper_model).to(device)

# Utility: lowercase a title and replace spaces and colons with underscores (currently unused)
def to_snake_case(name):
    return name.lower().replace(" ", "_").replace(":", "_").replace("__", "_")

# Download the best available audio stream from a YouTube URL
def download_audio_from_youtube(url, out_dir="."):
    print("\n==> Downloading audio...")
    ydl_opts = {
        'format': 'bestaudio/best',  # prefer an audio-only stream
        'outtmpl': os.path.join(out_dir, '%(title)s.%(ext)s'),  # name the file after the video title
        'quiet': True
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info_dict = ydl.extract_info(url, download=True)
        # Resolve the path yt-dlp wrote so the caller can open the file
        filename = ydl.prepare_filename(info_dict)
    return filename


# Transcribe the audio data with Whisper
def transcribe_audio(model, file, text, srt):
    print("\n=======================")
    print(f"\n🔗 YouTube URL: {YouTube_URL}")
    print(f"\n🤖 Whisper Model: {whisper_model}")
    print("\n=======================")

    file_path = Path(file)
    output_directory = file_path.parent

    # Run Whisper to transcribe audio
    print(f"\n==> Transcribing audio")
    result = model.transcribe(file, verbose = False)

    if text:
        print("\n==> Creating .txt file")
        txt_path = file_path.with_suffix(".txt")
        with open(txt_path, "w", encoding="utf-8") as txt:
            txt.write(result["text"])
    if srt:
        print("\n==> Creating .srt file")
        srt_writer = get_writer("srt", str(output_directory))
        # Pass the full path: the writer strips the extension itself, so passing
        # only the stem would truncate titles that contain a dot (e.g. "v1.2").
        srt_writer(result, str(file_path))

    # Download the transcribed files locally
    from google.colab import files

    # Check the exact output paths instead of globbing on the title, since
    # titles may contain glob metacharacters such as "[" or "*".
    for suffix in (".txt", ".srt"):
        out_file = file_path.with_suffix(suffix)
        if out_file.exists():
            files.download(str(out_file))

    print("\n✨ All Done!")
    print("=======================")
    return result

# Download & Transcribe the audio data
audio = download_audio_from_youtube(YouTube_URL)
result = transcribe_audio(model, audio, text, srt)

100%|███████████████████████████████████████| 461M/461M [00:12<00:00, 37.7MiB/s]



==> Downloading audio...


🔗 YouTube URL: https://www.youtube.com/watch?v=xQrBGcdUdVg

🤖 Whisper Model: small


==> Transcribing audio
Detected language: Chinese


 26%|██▋       | 91174/346107 [01:19<03:19, 1280.95frames/s]
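
The `.srt` file requested in Step 1 is built from the timed segments in `result["segments"]`, each of which carries `start`, `end` (in seconds), and `text`. The sketch below shows how such segments map onto the SRT layout. It is an illustration of the format only, not Whisper's own writer, and the sample segments are made up.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Turn a list of {"start", "end", "text"} dicts into SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        # Each cue: running number, time range, text, then a blank line.
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


# Made-up segments in the same shape as result["segments"].
sample = [
    {"start": 0.0, "end": 2.5, "text": " Hello"},
    {"start": 2.5, "end": 4.0, "text": " world"},
]
print(segments_to_srt(sample))
```

In the notebook itself, `get_writer("srt", ...)` already handles this, so the sketch is mainly useful when you want to post-process segments (for example, merging or filtering cues) before writing the file.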