<a href="https://colab.research.google.com/github/naingwinkyaw/IT123-Project_ID4288602M_/blob/main/L19/transcribe_audio_whisper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Speech to Text using Whisper

This notebook shows how to use OpenAI's Whisper to transcribe audio and audiovisual files, and how to save that transcription as a plain text file or as a VTT/SRT caption file.


# Settings

* `input_format`: The source of the audio/video file to be transcribed
  * `youtube`: A YouTube video
    * The transcribed file(s) are saved to this Colab, and will be deleted when the Colab runtime is disconnected.
  * `gdrive`: A file in your Google Drive account
    * If you select this option, you will need to allow this notebook to connect to your Google Drive account.
    * The transcribed file(s) are saved to the same folder as the original file.
  * `local`: A local file that you have uploaded to this Colab
    * If you select this option, you will need to first upload the file to the Files tab (see Step 1 [here](https://wandb.ai/wandb_fc/gentle-intros/reports/How-to-transcribe-your-audio-to-text-for-free-with-SRTs-VTTs---VmlldzozMzc1MzU3)).
    * The transcribed file(s) are saved to this Colab, and will be deleted when the Colab runtime is disconnected.
* `file`: The URL of the YouTube video or the path of the audio file to be transcribed.
  * Example: `file = "https://www.youtube.com/watch?v=AUDIO"` (transcribing a YouTube video)
  * Example: `file = "/content/drive/My Drive/AUDIO.mp3"` (transcribing a Google Drive file)
  * Example: `file = "/content/AUDIO.mp3"` (transcribing a local file)
* `plain`: Whether to save the transcription as a text file or not.
* `srt`: Whether to save the transcription as an SRT file or not.
* `vtt`: Whether to save the transcription as a VTT file or not.
* `tsv`: Whether to save the transcription as a TSV (tab-separated values) file or not.
* `download`: Whether to download the transcribed file(s) or not.
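
The relationship between `input_format` and `file` described above can be checked before any model is loaded. The following is a minimal sketch, not part of the original notebook: `validate_settings` and `VALID_INPUT_FORMATS` are hypothetical names, and the `/content/drive/` prefix is assumed from the Google Drive example path in the list.

```python
from urllib.parse import urlparse

# The three sources listed in the settings above.
VALID_INPUT_FORMATS = ("youtube", "gdrive", "local")

def validate_settings(input_format, file):
    """Return a list of problems found in the two main settings (empty if none)."""
    problems = []
    if input_format not in VALID_INPUT_FORMATS:
        problems.append(
            f"input_format must be one of {VALID_INPUT_FORMATS}, got {input_format!r}"
        )
        return problems

    # Treat anything with an http(s) scheme as a URL, everything else as a path.
    is_url = urlparse(file).scheme in ("http", "https")

    if input_format == "youtube" and not is_url:
        problems.append("youtube input expects an http(s) URL")
    if input_format == "gdrive" and not file.startswith("/content/drive/"):
        problems.append("gdrive input expects a path under /content/drive/")
    if input_format == "local" and is_url:
        problems.append("local input expects a file path, not a URL")
    return problems

print(validate_settings("youtube", "https://www.youtube.com/watch?v=AUDIO"))
print(validate_settings("gdrive", "/content/AUDIO.mp3"))
```

An empty list means the two settings are consistent with each other; it does not confirm that the video or file actually exists.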


In [None]:
# @title Change the values in this section

# @markdown Select the source of the audio/video file to be transcribed
input_format = "youtube" #@param ["youtube", "gdrive", "local"]

# @markdown Enter the URL of the YouTube video or the path of the audio file to be transcribed
file = "https://youtu.be/wqut2ZYdKfs?si=qvp3v5eVRciNakxk" #@param {type:"string"}

# @markdown Click here if you'd like to save the transcription as a plain text file
plain = True #@param {type:"boolean"}

# @markdown Click here if you'd like to save the transcription as an SRT file
srt = True #@param {type:"boolean"}

#@markdown Click here if you'd like to save the transcription as a VTT file
vtt = True #@param {type:"boolean"}

#@markdown Click here if you'd like to save the transcription as a TSV file
tsv = True #@param {type:"boolean"}

#@markdown Click here if you'd like to download the transcribed file(s) locally
download = True #@param {type:"boolean"}

# Set Up

The blocks below install the necessary Python libraries (including Whisper), configure Whisper, and define the helper functions.



## Dependencies

In [None]:
# Dependencies

!pip install -q pytubefix
!pip install -q git+https://github.com/openai/whisper.git

import os, re
import torch
from pathlib import Path
from pytubefix import YouTube

import whisper
from whisper.utils import get_writer


## Whisper configuration

This Colab uses `medium.en`, [the medium-sized, English-only](https://github.com/openai/whisper#available-models-and-languages) Whisper model.


In [None]:
# Use CUDA, if available
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load the desired model
model = whisper.load_model("medium.en").to(DEVICE)

100%|█████████████████████████████████████| 1.42G/1.42G [00:18<00:00, 83.3MiB/s]
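
To try a different model size, only the name passed to `whisper.load_model` needs to change. The sketch below builds that name; `pick_model_name` and `SIZES_WITH_EN_VARIANT` are hypothetical, and the sketch assumes the naming listed in the Whisper README, where the `tiny`, `base`, `small`, and `medium` sizes have an English-only `.en` variant and the larger models do not.

```python
# Sizes assumed to have an English-only ".en" variant (per the Whisper README).
SIZES_WITH_EN_VARIANT = {"tiny", "base", "small", "medium"}

def pick_model_name(size, english_only=True):
    """Build a Whisper model name such as "medium.en" or "medium"."""
    if english_only and size in SIZES_WITH_EN_VARIANT:
        return f"{size}.en"
    # Multilingual model, or a size with no English-only variant.
    return size

print(pick_model_name("medium"))
print(pick_model_name("small", english_only=False))
print(pick_model_name("large"))
```

The returned string would replace the hard-coded `"medium.en"` in the cell above. Smaller models load and run faster but are generally less accurate.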


## YouTube helper functions

Code for helper functions when running Whisper on a YouTube video.

In [None]:
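def to_snake_case(name):
    "Lowercase a name and replace spaces and colons with underscores"
    return name.lower().replace(" ", "_").replace(":", "_").replace("__", "_")

def download_youtube_audio(url, file_name=None, out_dir="."):
    "Download the audio from a YouTube video and return the path of the saved file"
    yt = YouTube(url)
    ys = yt.streams.get_audio_only()
    # Pass the arguments through; with file_name=None, the file is named after the video title
    return ys.download(output_path=out_dir, filename=file_name)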
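
Video titles often contain spaces and punctuation, and `to_snake_case` only handles spaces and colons. As a rough sketch of a stricter alternative, the hypothetical `to_safe_filename` below collapses every run of characters other than ASCII letters and digits into a single underscore. The `m4a` default extension is an assumption based on the audio stream downloaded in this notebook's run.

```python
import re

def to_safe_filename(title, extension="m4a"):
    """Lowercase the title and collapse each run of non-alphanumerics to "_"."""
    stem = re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")
    # Fall back to a generic name if nothing usable is left (e.g. a non-Latin title).
    return f"{stem or 'audio'}.{extension}"

print(to_safe_filename("i visited Singapore's newest MRT stations!! (TEL 4 Preview vlog)"))
```

A name produced this way could be used as the saved file's name. The trade-off is that titles written entirely in non-Latin scripts all collapse to the fallback name.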

# Transcribing with Whisper

Ultimately, calling Whisper is as easy as one line!
* `result = model.transcribe(file)`

Most of the `transcribe_file` function below handles exporting the transcription as a plain text, SRT, VTT, or TSV file, and optionally downloading those files.

In [None]:
def transcribe_file(model, file, plain, srt, vtt, tsv, download):
    """
    Runs Whisper on an audio file

    Parameters
    ----------
    model: Whisper
        The Whisper model instance.

    file: str
        The file path of the file to be transcribed.

    plain: bool
        Whether to save the transcription as a text file or not.

    srt: bool
        Whether to save the transcription as an SRT file or not.

    vtt: bool
        Whether to save the transcription as a VTT file or not.

    tsv: bool
        Whether to save the transcription as a TSV file or not.

    download: bool
        Whether to download the transcribed file(s) or not.

    Returns
    -------
    A dictionary containing the resulting text ("text"), segment-level details ("segments"), and
    the spoken language ("language"), which is "en" here because `language = "en"` is passed to
    `model.transcribe` instead of letting Whisper detect it.
    """
    file_path = Path(file)
    print(f"Transcribing file: {file_path}\n")

    output_directory = file_path.parent

    # Run Whisper
    result = model.transcribe(file, verbose = False, language = "en")

    if plain:
        txt_path = file_path.with_suffix(".txt")
        print("\nCreating text file")

        with open(txt_path, "w", encoding="utf-8") as txt:
            txt.write(result["text"])

    if srt:
        print(f"\nCreating SRT file")
        srt_writer = get_writer("srt", output_directory)
        srt_writer(result, str(file_path.stem))

    if vtt:
        print(f"\nCreating VTT file")
        vtt_writer = get_writer("vtt", output_directory)
        vtt_writer(result, str(file_path.stem))

    if tsv:
        print(f"\nCreating TSV file")

        tsv_writer = get_writer("tsv", output_directory)
        tsv_writer(result, str(file_path.stem))

    if download:
        from google.colab import files

        # Look next to the source file, where the outputs were written. That is
        # /content for YouTube and local files, but not for Google Drive files.
        stem = file_path.stem

        for output_file in output_directory.glob(f"{stem}*"):
            if output_file.suffix in [".txt", ".srt", ".vtt", ".tsv"]:
                print(f"Downloading {output_file}")
                files.download(str(output_file))

    return result
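
To make the caption export less opaque, here is a minimal sketch of how timed segments map onto the SRT layout. It is an illustration only and not Whisper's own writer: `format_srt_timestamp` and `segments_to_srt` are hypothetical helpers, and the example segments are made up. The only thing assumed from Whisper is that each entry in `result["segments"]` carries `"start"` and `"end"` times in seconds plus a `"text"` string.

```python
def format_srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm, the timestamp layout SRT uses."""
    milliseconds = round(seconds * 1000)
    hours, milliseconds = divmod(milliseconds, 3_600_000)
    minutes, milliseconds = divmod(milliseconds, 60_000)
    secs, milliseconds = divmod(milliseconds, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{milliseconds:03d}"

def segments_to_srt(segments):
    """Render segments (dicts with "start", "end", "text") as SRT text."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        start = format_srt_timestamp(segment["start"])
        end = format_srt_timestamp(segment["end"])
        # Each cue: a running number, a time range, then the caption text.
        blocks.append(f"{index}\n{start} --> {end}\n{segment['text'].strip()}\n")
    # Cues are separated by a blank line.
    return "\n".join(blocks)

# Made-up segments for illustration; real ones would come from result["segments"].
example_segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello there."},
    {"start": 2.5, "end": 4.1, "text": " Goodbye."},
]
print(segments_to_srt(example_segments))
```

VTT is similar in spirit, with a `WEBVTT` header line and a period instead of a comma before the milliseconds. In the notebook itself, `get_writer` takes care of these details.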

# Whisper it!

This block calls `transcribe_file` on the source selected in the settings above.


In [None]:
if input_format == "youtube":
    # Download the audio stream of the YouTube video
    audio = download_youtube_audio(file)
    print(f"Downloading audio stream: {audio}")

    # Run Whisper on the audio stream
    result = transcribe_file(model, audio, plain, srt, vtt, tsv, download)
elif input_format == "gdrive":
    # Authorize a connection between Google Drive and Google Colab
    from google.colab import drive
    drive.mount('/content/drive')

    # Run Whisper on the specified file
    result = transcribe_file(model, file, plain, srt, vtt, tsv, download)
elif input_format == "local":
    # Run Whisper on the specified file
    result = transcribe_file(model, file, plain, srt, vtt, tsv, download)

Downloading audio stream: /content/i visited Singapore's newest MRT stations!! (TEL 4 Preview vlog).m4a
Transcribing file: /content/i visited Singapore's newest MRT stations!! (TEL 4 Preview vlog).m4a



100%|█████████▉| 23376/23449 [00:39<00:00, 590.81frames/s]


Creating text file

Creating SRT file

Creating VTT file

Creating TSV file
Downloading /content/i visited Singapore's newest MRT stations!! (TEL 4 Preview vlog).srt
Downloading /content/i visited Singapore's newest MRT stations!! (TEL 4 Preview vlog).vtt
Downloading /content/i visited Singapore's newest MRT stations!! (TEL 4 Preview vlog).tsv
Downloading /content/i visited Singapore's newest MRT stations!! (TEL 4 Preview vlog).txt
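
Of the four outputs, the TSV file is the easiest to process further with code. The sketch below reads one back into a list of segments using only the standard library. It assumes the file has a `start`, `end`, `text` header row with times stored as whole milliseconds, so check a real file before relying on that. `read_tsv_segments` and the sample rows are hypothetical.

```python
import csv
import io

# Made-up sample in the assumed layout: header row, then times in milliseconds.
sample_tsv = "start\tend\ttext\n0\t2500\tHello there.\n2500\t4100\tGoodbye.\n"

def read_tsv_segments(handle):
    """Read tab-separated segments and convert the times to seconds."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        {
            "start": int(row["start"]) / 1000,
            "end": int(row["end"]) / 1000,
            "text": row["text"],
        }
        for row in reader
    ]

segments = read_tsv_segments(io.StringIO(sample_tsv))
for segment in segments:
    print(f'{segment["start"]:.1f}s to {segment["end"]:.1f}s: {segment["text"]}')
```

For a file on disk, `open(path, encoding="utf-8", newline="")` would take the place of `io.StringIO(sample_tsv)`.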