<a href="https://colab.research.google.com/github/ptiszai/ADuCM355_modul/blob/main/Whisper_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a target="_blank" href="https://colab.research.google.com/github/keatonkraiger/Whisper-Transcribe-and-Translate-Tutorial/blob/main/Whisper_Tutorial.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

<a target="_blank" href="https://youtu.be/i4Sgg-ptRzs">
  <img src="https://upload.wikimedia.org/wikipedia/commons/e/ef/Youtube_logo.png" alt="Tutorial Video" width="24" height="24"/>
</a>


- Click the button above to open this notebook in Google Colab.
- We recommend making a copy of the notebook in your Google Drive so you can save your work (File -> Save a copy in Drive).
- You can also watch the accompanying supplementary tutorial video [here](https://youtu.be/i4Sgg-ptRzs).

# Using OpenAI's Whisper for Transcription, Translation, and Creating Caption Files
OpenAI's Whisper is a general-purpose speech recognition model described in their 2022 [paper](https://arxiv.org/abs/2212.04356). This notebook is a practical introduction to using Whisper in Google Colab.

Before diving into Whisper, it's important to set up your environment correctly. This guide walks you through the process so that, even if you're not technically inclined, you'll be able to use Whisper effectively. Note that you don't need to run Whisper in Colab; we use it here for convenience and for access to a GPU.

## Setting Up Google Colab
Google Colab provides a convenient platform for running Python code in the cloud, with access to powerful computing resources, including GPUs. We recommend enabling GPU acceleration, which will speed up transcription, although a GPU is not strictly necessary to use Whisper. Here's how to enable it in Colab:


1.   Click on 'Runtime' in the top menu.
2.   Select 'Change runtime type'.
3.   In the dialog that appears, under 'Hardware accelerator', choose 'GPU' (the exact type doesn't matter much for now).
4.   Click 'Save'.

With the GPU enabled, your notebook will run more efficiently, especially when using the larger Whisper models.

We can also check how much VRAM our GPU has with the `!nvidia-smi` command, which tells us how large a model we can use.

In [1]:
!nvidia-smi

Sun Nov 30 09:51:13 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   51C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
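If you would rather read the VRAM size from Python than scan the table by eye, a minimal sketch is to ask `nvidia-smi` for just the total memory. It assumes `nvidia-smi` is on the PATH (as on a Colab GPU runtime) and uses its `--query-gpu=memory.total --format=csv,noheader,nounits` options, which print one value in MiB per GPU; the helper names are hypothetical.

```python
import shutil
import subprocess


def parse_total_mib(query_output: str) -> list[int]:
    """Parse `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits`
    output: one integer MiB value per line, one line per GPU."""
    return [int(line.strip()) for line in query_output.splitlines() if line.strip()]


def gpu_total_mib() -> list[int]:
    """Return the total VRAM in MiB for each visible GPU, or [] if none can be queried."""
    if shutil.which("nvidia-smi") is None:
        return []  # no NVIDIA driver tools installed (e.g. a CPU-only runtime)
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
            capture_output=True,
            text=True,
            check=True,
        )
    except subprocess.CalledProcessError:
        return []  # nvidia-smi exists but could not talk to a GPU
    return parse_total_mib(result.stdout)


print(gpu_total_mib())  # prints [] when no GPU can be queried
```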
                                                

### Important Note

In the following cells, you will often see an `!` symbol before a command. This is because Colab cells expect *Python code* by default.

By starting a line with `!`, we tell Colab that we are typing a shell command rather than Python code. Without the `!`, Colab treats the line as (Python) code, as in the cell below:

In [2]:
print('Hello world!')

Hello world!
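A `!` line hands the rest of the line to a shell. As a rough sketch of the same idea in plain Python (useful outside Colab, where `!` is not available), the standard-library `subprocess` module can run a shell command and capture its output. The `run_shell` helper is a hypothetical name.

```python
import subprocess


def run_shell(command: str) -> str:
    """Run `command` through the system shell and return its standard output.

    Similar in spirit to a Colab `!command` line, except the output is
    returned as a string instead of being printed in the cell.
    """
    result = subprocess.run(command, shell=True, capture_output=True, text=True, check=True)
    return result.stdout


# Shell counterpart of the Python print() cell above
print(run_shell("echo Hello world!"), end="")
```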


## Getting Started with Whisper

Whisper is available through OpenAI's GitHub repository. To use it, you need to install Whisper along with its dependencies, which we'll do step by step below. You may want to follow along with the GitHub page as well! To install Whisper:



1.   **Install Whisper**: Run `!pip install -U openai-whisper` in a Colab cell to install the latest release of Whisper.
2.   **Install FFmpeg**: Whisper requires FFmpeg for audio processing. Run `!apt install ffmpeg` in Colab to install it.
3.   **Additional Dependencies:** In some cases, you might need additional dependencies such as `setuptools-rust`. If you encounter installation errors, run `!pip install setuptools-rust`.



In [1]:
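# Option A: install the latest PyPI release
#!pip install -U openai-whisper
# Option B: install the latest commit from GitHub
!pip install git+https://github.com/openai/whisper.git
# Updates an existing install to the newest GitHub commit without touching dependencies
!pip install --upgrade --no-deps --force-reinstall git+https://github.com/openai/whisper.git
# FFmpeg is usually preinstalled on Colab; -y skips the confirmation prompt
!apt install -y ffmpeg
# Only needed if the install complains about a missing setuptools_rust module
!pip install setuptools-rust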

Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-36lkzarn
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-36lkzarn
  Resolved https://github.com/openai/whisper.git to commit c0d2f624c09dc18e709e37c2ad90c039a4eb72a2
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: openai-whisper
  Building wheel for openai-whisper (pyproject.toml) ... done
  Created wheel for openai-whisper: filename=openai_whisper-20250625-py3-none-any.whl size=803979 sha256=3e5e0858e3fd65abd189f2268cc558cb328003117d212b0a8e05f266a0348096
  Stored in directory: /tmp/pip-ephem-wheel-cache-4v45ysif/wheels/c3/03/25/5e0ba78bc27a3a089f137c9f1d92fdfce16d06996c071a016c
Successfully built openai-whisper
Installing collec

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
!sudo apt update && sudo apt install ffmpeg

Get:1 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:2 https://cli.github.com/packages stable InRelease [3,917 B]
Get:3 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Get:4 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Hit:5 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:6 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Get:7 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:8 https:/

In [6]:
!apt install ffmpeg

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
ffmpeg is already the newest version (7:4.4.2-0ubuntu0.22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 94 not upgraded.


## Available Models

Whisper comes in several model sizes, each offering a different trade-off between speed and accuracy, so you can choose the one that best fits your task. If you are using a GPU, the largest model you can run is limited by how much VRAM the GPU has (see the `nvidia-smi` output above).

In this notebook, we will use the *medium* model.
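To make the trade-off concrete, here is a small sketch that picks the largest Whisper model that fits in a given amount of VRAM. The approximate requirements (about 1 GB for `tiny` and `base`, 2 GB for `small`, 5 GB for `medium`, 10 GB for `large`) are the rough figures listed in the Whisper README at the time of writing and are treated here as assumptions; check the README for current values. The `pick_model` helper is hypothetical, and it leaves out the English-only `.en` variants and `turbo` to keep the ordering simple.

```python
from typing import Optional

# Approximate VRAM needs in MiB, ordered from smallest to largest model.
# Assumption: based on the rough README figures (~1, ~1, ~2, ~5, ~10 GB).
APPROX_VRAM_MIB = {
    "tiny": 1 * 1024,
    "base": 1 * 1024,
    "small": 2 * 1024,
    "medium": 5 * 1024,
    "large": 10 * 1024,
}


def pick_model(available_mib: int) -> Optional[str]:
    """Return the largest model whose approximate VRAM need fits, or None."""
    best = None
    for name, need in APPROX_VRAM_MIB.items():  # dicts preserve insertion order
        if need <= available_mib:
            best = name
    return best


print(pick_model(15360))  # a 15360 MiB T4 -> large
print(pick_model(4096))   # -> small
```

Fitting in memory is only one factor: this notebook uses `medium` on the T4 even though `large` would likely fit, since larger models are also slower.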

## Get your audio files ready!

Of course, our goal is to transcribe or translate an audio file, so we first need to upload our files to Colab. To do so:

1. Click the folder (Files) icon in the left sidebar.
2. Click the upload button, or drag and drop your file into the sidebar.
3. For this demonstration, upload the `AllStar.mp3` and `Cupid_Fifty_Fifty_Korean_Version.mp3` files from the GitHub [repository](https://github.com/keatonkraiger/Whisper-Transcribe-and-Translate-Tutorial).

**Be aware**: <ins>Files uploaded to or created in a Colab session are *not* kept after the session ends. Be sure to download any files you create before exiting.</ins>

**File Upload**: If you are working with large files, you may instead want to mount your Google Drive in Colab, either with the `drive.mount` cell above or by clicking the Mount Drive button at the top of the Files sidebar. You can then copy files from Drive into the Colab session, which is often faster than uploading them individually.
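Once Drive is mounted at `/content/drive`, copying a file into the session's local storage takes only a few lines of Python. A minimal sketch; the `copy_into_session` helper and the `MyDrive/audio/AllStar.mp3` path are hypothetical examples, so substitute wherever your file actually lives in Drive.

```python
import shutil
from pathlib import Path


def copy_into_session(src: str, dest_dir: str = "/content") -> Path:
    """Copy one file (e.g. from the mounted Drive) into `dest_dir` and return the new path."""
    source = Path(src)
    if not source.is_file():
        raise FileNotFoundError(f"No such file: {source}")
    target_dir = Path(dest_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # shutil.copy2 copies the contents and also preserves the modification time
    return Path(shutil.copy2(source, target_dir / source.name))


# Hypothetical Drive location; adjust it to your own folder layout.
# local_copy = copy_into_session("/content/drive/MyDrive/audio/AllStar.mp3")
```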

## Using Whisper

Following the GitHub documentation, we can run Whisper from the command line.

Let's first check that it works by running `whisper --help`:

In [7]:
!whisper --help

usage: whisper [-h] [--model MODEL] [--model_dir MODEL_DIR] [--device DEVICE]
               [--output_dir OUTPUT_DIR]
               [--output_format {txt,vtt,srt,tsv,json,all}]
               [--verbose VERBOSE] [--task {transcribe,translate}]
               [--language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,L

The help command above lists the command-line arguments we can use. Most are unnecessary for our purposes, while some, such as `--output_format` and `--language`, are more useful.

## Transcribing Audio

Let's now transcribe an audio file! The following command transcribes the speech in `AllStar.mp3` using these arguments:

1.   `--model medium`: specify the model size
2.   `--task transcribe`: to do transcription
3.   `--output_dir transcription`: to save the files in the directory transcription
4.   `--output_format all`: write the transcription in all formats. Alternatively, select just the one you prefer.

Some caveats:


*   If you are running this without a GPU, you should include: `--device cpu`
*   If you are interested in <ins>translating</ins> an audio file, you can use a command such as `!whisper japanese_audio_file.wav --language Japanese --task translate --model medium --output_dir translation --output_format all`


We must specify the location of the audio file, which in this case is `/content/AllStar.mp3`. You can find it by right-clicking the audio file in the Files sidebar and selecting "Copy path".
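Paths containing spaces (such as a file named `Audio File.mp3`) must be quoted on the command line, or the shell splits them into separate arguments. If you build the `!whisper` command as a string in Python, the standard library's `shlex.join` quotes each argument for you. A minimal sketch; `whisper_cmd` is a hypothetical helper whose defaults mirror the flags used in the next cell, and it accepts several files at once.

```python
import shlex


def whisper_cmd(*audio_paths: str, model: str = "medium", task: str = "transcribe",
                output_dir: str = "transcription", output_format: str = "all") -> str:
    """Build a `whisper` command line with every argument safely quoted."""
    if not audio_paths:
        raise ValueError("at least one audio path is required")
    args = ["whisper", *audio_paths,
            "--model", model,
            "--task", task,
            "--output_dir", output_dir,
            "--output_format", output_format]
    # shlex.join quotes any argument containing spaces or shell metacharacters
    return shlex.join(args)


cmd = whisper_cmd("/content/Audio File.mp3")
print(cmd)
# whisper '/content/Audio File.mp3' --model medium --task transcribe --output_dir transcription --output_format all
```

In a Colab cell, the resulting string can then be run with `!{cmd}`, because IPython substitutes `{...}` expressions in `!` lines.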

In [None]:
!whisper /content/AllStar.mp3 --model medium --device cuda --language en --task transcribe --output_dir transcription --output_format all

**Note**:
- Most of the time you won't want the `all` output format, as it generates files you probably don't need. It's best to pick the format that works for you; recall that you can list the available formats by running `whisper --help`.
- If you are running this on your own computer, the path for the file will be different. You can find the path by right-clicking the file and selecting "Copy Path".

## Translating Audio

Let's now translate an audio file! Make sure you have uploaded the `Cupid_Fifty_Fifty_Korean_Version.mp3` file to Colab. To see which languages Whisper supports, run `!whisper --help` and look at the `--language` argument.

In [None]:
!whisper --help

Suppose we have a Korean audio file and we want to translate it to English. We can use the following command:

In [None]:
!whisper /content/Cupid_Fifty_Fifty_Korean_Version.mp3 --language Korean --task translate --model medium --output_dir translation --output_format all

## Creating a Caption File (SRT)

Suppose we want to create a caption file (SRT) for an audio file. SRT files hold the subtitles displayed alongside a video and are widely supported by video editing software and by YouTube. The following command creates an SRT file:

In [None]:
!whisper /content/AllStar.mp3 --model medium --task transcribe --output_dir transcription --output_format srt
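To see what the `srt` output contains, here is a minimal sketch that renders caption segments in SRT layout: a running index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range, the caption text, and a blank line between entries. The segments are made-up sample data whose `start`/`end`/`text` keys mirror the segments returned by Whisper's Python `transcribe` function; the helper names are hypothetical, and Whisper's own SRT writer may differ in details.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 3661.5 -> '01:01:01,500'."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[dict]) -> str:
    """Render segments with 'start'/'end' (seconds) and 'text' keys as SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)  # a blank line separates consecutive entries


# Made-up example segments (not real Whisper output)
sample = [
    {"start": 0.0, "end": 2.5, "text": " First caption line."},
    {"start": 2.5, "end": 61.04, "text": " Second caption line."},
]
print(to_srt(sample))
```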

What if the audio is in a different language? We can use the `--language` argument to specify the language of the audio file. For example, for a Korean audio file, the following command produces English captions (because of `--task translate`):

In [None]:
!whisper /content/Cupid_Fifty_Fifty_Korean_Version.mp3 --language Korean --task translate --model medium --output_dir translation --output_format srt

## Saving Your Files

Once you've run all the cells you're interested in, download the output files saved in the `transcription` and `translation` directories.

Colab doesn't allow you to download entire directories, so we instead `zip` each directory and download the zip file.

In [None]:
!zip -r transcriptions.zip transcription
!zip -r translations.zip translation
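The zip files still need to reach your computer: you can right-click them in the Files sidebar and choose Download. As a minimal Python alternative, the sketch below builds the archive with the standard library and passes it to `google.colab.files.download`, which only exists inside Colab (outside Colab the download step is skipped). The `zip_and_download` helper is hypothetical, and it names each archive after its folder (e.g. `transcription.zip`).

```python
import shutil
from pathlib import Path


def zip_and_download(folder: str) -> Path:
    """Zip `folder` into `<folder>.zip` next to it and, inside Colab, download it."""
    src = Path(folder).resolve()
    # root_dir/base_dir keep the folder name as a prefix inside the archive, like `zip -r`
    archive = Path(shutil.make_archive(str(src), "zip",
                                       root_dir=str(src.parent), base_dir=src.name))
    try:
        from google.colab import files  # only available inside Colab
    except ImportError:
        return archive  # not running in Colab: just return the archive path
    files.download(str(archive))
    return archive


# for name in ("transcription", "translation"):
#     zip_and_download(name)
```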

## Helpful Commands



*   Often, you will not want the transcription in every format. It is usually better to pick the most useful one for your case and pass only that to `--output_format`.
*   You can process several audio files at once by listing them all in the same command (e.g. `!whisper audio1.mp3 audio2.mp3 ...`).
* If you are running Whisper without a GPU, include `--device cpu` in the command.
* If you want each caption segment to contain at most three words rather than full sentences, you could run something like:
```bash
  whisper audio_file.mp3 --task transcribe --output_format srt --word_timestamps True --max_words_per_line 3  # for transcription
```
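To get a feel for what `--max_words_per_line 3` does, here is a rough sketch of the idea: with word-level timestamps, consecutive words are grouped into chunks of at most N words, and each chunk runs from its first word's start to its last word's end. The word list is made-up sample data whose `word`/`start`/`end` keys are an assumption modeled on the per-word entries Whisper produces when word timestamps are enabled. This illustrates the grouping only and is not Whisper's actual implementation.

```python
def chunk_words(words: list[dict], max_words: int = 3) -> list[dict]:
    """Group timed words into caption segments of at most `max_words` words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    segments = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        segments.append({
            "start": group[0]["start"],   # first word's start time
            "end": group[-1]["end"],      # last word's end time
            "text": " ".join(w["word"].strip() for w in group),
        })
    return segments


# Made-up word timings (not real Whisper output)
sample_words = [
    {"word": " one", "start": 0.0, "end": 0.4},
    {"word": " two", "start": 0.4, "end": 0.8},
    {"word": " three", "start": 0.8, "end": 1.2},
    {"word": " four", "start": 1.2, "end": 1.6},
    {"word": " five", "start": 1.6, "end": 2.0},
]
for seg in chunk_words(sample_words, max_words=3):
    print(seg)
# {'start': 0.0, 'end': 1.2, 'text': 'one two three'}
# {'start': 1.2, 'end': 2.0, 'text': 'four five'}
```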


## Closing Remarks

Using Whisper in Colab is convenient because it gives you free access to a GPU! However, you can also run Whisper without a GPU on your local computer by following the steps in Whisper's [GitHub Setup](https://github.com/openai/whisper#setup). If you choose to do so, keep the following in mind:



1.   You will need Python to run Whisper. On a Mac, one way to get it is to first install [brew](https://docs.brew.sh/Installation) by running `/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"` in your terminal. We recommend following Homebrew's [installation instructions](https://brew.sh/) or this [guide](https://docs.python-guide.org/starting/install3/osx/).
2.   Once brew is installed, you can install Python by running `brew install python` in your terminal.
3.   With Python and brew installed, we recommend creating a directory to work in. In your terminal, move to your desktop and create one: `cd Desktop; mkdir Whisper; cd Whisper`.

Note: if you wish to work on your personal MacBook and install brew, you will also need to install the Xcode tools. If given the option, we recommend installing the `Command Line Tools` instead, as it is a much smaller package.

Finally, while these instructions are primarily for Mac users, the installation is very similar on other platforms. See Whisper's GitHub for more information.
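Whichever platform you set up locally, it can help to confirm the pieces are in place before running Whisper. A minimal sketch using only the standard library: it checks that `ffmpeg` is on your PATH and that the `whisper` package can be imported. The `check_setup` name is hypothetical.

```python
import importlib.util
import shutil


def check_setup() -> dict[str, bool]:
    """Report whether the tools this notebook relies on are available locally."""
    return {
        "ffmpeg on PATH": shutil.which("ffmpeg") is not None,
        # find_spec returns None when the package is not installed
        "whisper importable": importlib.util.find_spec("whisper") is not None,
    }


for item, ok in check_setup().items():
    print(f"{'OK     ' if ok else 'MISSING'} {item}")
```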

<img src="https://media.istockphoto.com/id/1294688589/photo/red-cat-with-blurred-the-poster-in-the-frame-with-the-words-thank-you.jpg?s=612x612&w=0&k=20&c=T84nHSu52sOQvrmnksdDNo2UByqJ7yXn1srkuodXdps=" alt="Thank you">