<a href="https://colab.research.google.com/github/nsriv/whisper-transcription/blob/main/Whisper_Transcription.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Whisper AI Transcription

# About

This Colab notebook transcribes an audio file, in any language Whisper supports, into text. It does so with the code contained in this notebook, which relies on free, open-source Python packages and machine learning models. The transcript is saved to the Google Drive account mounted in the section on "Saving." When the runtime ends, all uploaded data is permanently deleted, so user or patient privacy is not compromised.

This notebook runs in a 'runtime environment' on a virtual machine hosted on the Google Cloud Platform. This runtime is temporary and destructible, but is connected to a GPU that provides the muscle for the computation. Once the runtime is disconnected, all data except the text output is deleted. No data is sent back to Google except for the resources and running time requested by the code being executed.

# Setup

**Please read this code notebook in its entirety before the initial run.**

This will allow you to understand a few decisions made and empower you to modify them in the future.


# Runtime Connection

From the top menu (File, Edit, etc...):

* Select 'Runtime' > 'Change Runtime Type'

* From the dropdown menu under 'Hardware Accelerator' select 'GPU'

* Confirm by clicking 'Save'

You should now see, on the right side of the screen below the "Comment" button, an orange indicator that will run through "Allocating, Connecting, Initializing, Connected" before turning green and showing a green checkmark while displaying RAM and Disk usage of our runtime.

If you do not see this, or it says Reconnect, you should repeat the steps above and click Reconnect before proceeding. While it should connect the first time you run a code cell, it needs to know you want GPU connection, or else code execution will take an inordinately long time.
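As an optional check that the runtime really has a GPU attached, something like the sketch below can be run in a code cell. It assumes the NVIDIA `nvidia-smi` tool is available on GPU runtimes, which is typical for Colab but not guaranteed. The helper name `gpu_visible` is made up for this illustration.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if `nvidia-smi -L` runs successfully and lists a GPU.

    Hypothetical helper: it only looks for the NVIDIA command-line tool,
    so a False result means "no NVIDIA GPU found by this check".
    """
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # The tool is not installed, which usually means a CPU-only runtime.
        return False
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # `nvidia-smi -L` prints one line per device, each starting with "GPU".
    return result.returncode == 0 and "GPU" in result.stdout


if gpu_visible():
    print("GPU runtime detected.")
else:
    print("No GPU detected: revisit 'Runtime' > 'Change Runtime Type'.")
```

If the second message appears, repeat the steps above before running the rest of the notebook.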

# Installing Whisper (and some goodies)

Commands below will install the Python packages necessary to run Whisper models.

* Whisper.git will fetch the most up-to-date version of the [open-source Whisper package from OpenAI](https://github.com/openai/whisper)
* [JiWER](https://github.com/jitsi/jiwer) is a package used to approximate error rates for the model. It is not needed by the code as is, but is included here for future reporting use.
* ipython-autotime is an extension that will display the time of execution after each command is run.

**Click the play button to run commands. It will turn into a running indicator/Stop button while code is executing and will turn into a green checkmark upon completion.**

In [None]:
! pip install git+https://github.com/openai/whisper.git
! pip install jiwer
! pip install ipython-autotime
%load_ext autotime

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-9p60mosj
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-9p60mosj
  Resolved https://github.com/openai/whisper.git to commit 248b6cb124225dd263bb9bd32d060b6517e067f8
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tiktoken==0.3.3 (from openai-whisper==20230314)
  Downloading tiktoken-0.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 26.4 MB/s eta 0:00:00
Building wheels for collected packages: openai-whisper
  Building wheel for openai-whisper (pyproject.

# Setting Up File Workspace: Saving

Create a folder called "Whisper" in your Google Drive, then run the cell below to connect it to this code notebook.

Upon running, you should see a "Permissions" popup asking you to select and connect the Google Drive account you would like to use to store your text output.

This folder can be renamed later if necessary, but you'll need to change the file path in an upcoming line of code (see Notes at the end).

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/
time: 22 s (started: 2022-11-20 21:23:20 +00:00)
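If you would rather not create the folder by hand, a minimal sketch like the one below could create it after mounting. It assumes "My Drive" appears at `/content/drive/MyDrive`, the location used by the transcription command later in this notebook. The helper name `ensure_output_dir` is hypothetical.

```python
import os

# Assumed location of "My Drive" after drive.mount('/content/drive/').
DRIVE_ROOT = "/content/drive/MyDrive"


def ensure_output_dir(folder_name: str = "Whisper", root: str = DRIVE_ROOT) -> str:
    """Create `folder_name` under `root` if it is missing and return its path."""
    if not os.path.isdir(root):
        raise FileNotFoundError(f"{root} not found; run the drive.mount cell first.")
    path = os.path.join(root, folder_name)
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return path


# Only attempt this when Drive is actually mounted.
if os.path.isdir(DRIVE_ROOT):
    print("Transcripts will be saved to:", ensure_output_dir())
```

Calling it a second time is harmless, since an existing folder is left untouched.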


# Setting Up File Workspace: Upload

To upload a file, notice that there is a folder icon in the toolbar to the left. Click it to open the left pane.

1. Drag your audio file here. All common audio formats work: .m4a, .mp3, .wav, .flac, etc.
2. Wait for it to fully upload. This can be finicky on a spotty connection.


# Usage


1. Once upload is complete, click on the three dots to the right of the filename and select "Copy Path"
2. Paste the path in between the quotes in the cell below ("/content/your_audio_file_here") and run the cell.

In [None]:
!whisper "/content/your_audio_file_here" --model medium --output_dir /content/drive/MyDrive/Whisper
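The same command can also be assembled in Python, which may help when the path, model size, or language changes often. This is only a sketch: `build_whisper_command` is a hypothetical helper, and it uses just the `--model`, `--output_dir`, and `--language` options of the Whisper command-line tool.

```python
import shlex
from typing import List, Optional


def build_whisper_command(
    audio_path: str,
    model: str = "medium",
    output_dir: str = "/content/drive/MyDrive/Whisper",
    language: Optional[str] = None,
) -> List[str]:
    """Return the argument list for a `whisper` command-line call.

    Hypothetical helper for illustration; it only builds the command
    and does not run it.
    """
    cmd = ["whisper", audio_path, "--model", model, "--output_dir", output_dir]
    if language:
        # Naming the language skips Whisper's automatic language detection.
        cmd += ["--language", language]
    return cmd


cmd = build_whisper_command("/content/your_audio_file_here", language="Spanish")
# Print a shell-quoted version that can be pasted after "!" in a Colab cell.
print(shlex.join(cmd))
```

To run it from Python instead of pasting, the list can be passed to `subprocess.run(cmd, check=True)`.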

# Notes
* By default, Whisper writes several text files per audio file input, including a .txt file with no timestamps, plus an .srt (subtitle) file and a .vtt (web video caption) file. Both of the latter include timestamps and can be played alongside the recording in the free and open-source [VLC media player](https://www.videolan.org/vlc/). Newer releases of Whisper also write .tsv and .json files.

* OpenAI's Whisper package offers several model sizes: tiny, base, small, medium, and large. Larger models take longer to download before audio analysis can begin, but yield greater accuracy. To balance these, I've used the flag "--model medium".

* If, as mentioned earlier, you wish to save the transcribed output into a different folder, simply change "Whisper" in the file path following "--output_dir" to the name of your chosen folder.

* The command above does not specify a language, so Whisper analyzes the first 30 seconds of audio to autodetect it. If you want to very slightly speed up the process, you can add a flag such as "--language Spanish".