<a href="https://colab.research.google.com/github/meghavarshini/template-audiogram/blob/main/Subtitles_With_Whisper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center><h1>Speech-to-Text with Whisper AI</h1></center>


![](https://images.ctfassets.net/kftzwdyauwt9/18ff9c06-7853-4e3b-d849bc901978/2b49cdd19fcdf22f689f606fdf2dc8d6/asr-details-desktop.svg?w=1920&q=90)



# Some terminology
- **Speech-To-Text (STT)**: The task of taking an audio file containing speech as input and returning the spoken words and sentences as output, usually with timestamps.
- **Transcripts**: A file containing all the speech in the audio, saved in a text format.
- **(Closed) Captions**: Text that follows the audio, and may include descriptions of the audio and video content.
- **Subtitles**: Translations of captions into another language.
- **Speaker**: A tag in the file for the source of the speech.
- **Content**: A tag or title in the transcript for the transcriptions.
- **Timestamps**: May include just the start time, or both the start time and the end time. The format can be **HH:MM:SS.MS**, or a rounded version of it.
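To make the **HH:MM:SS.MS** timestamp format concrete, here is a minimal sketch that converts an offset in seconds into that form. The function name `format_timestamp` and its `decimal_marker` parameter are illustrative choices for this example, not part of any particular library.

```python
def format_timestamp(seconds: float, decimal_marker: str = ".") -> str:
    """Render an offset in seconds as HH:MM:SS<marker>mmm (milliseconds)."""
    if seconds < 0:
        raise ValueError("timestamp must be non-negative")
    # Work in whole milliseconds to avoid floating-point drift.
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_marker}{ms:03d}"


print(format_timestamp(2.5))           # 00:00:02.500
print(format_timestamp(3725.04, ","))  # 01:02:05,040
```

The second call uses a comma as the decimal marker, which is the variant used by the SRT format.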


# Transcription formats and content



1. VTT (WebVTT)
WebVTT is commonly used for displaying timed text tracks in HTML5 videos.

```
WEBVTT

00:00:00.000 --> 00:00:02.500
Hello, and welcome to today's workshop

00:00:02.500 --> 00:00:05.000
where we will discuss speech recognition.
```

2. SRT (SubRip Subtitle)
SRT is one of the most widely used subtitle formats, supported by video players, social media sites, and discs.

```
1
00:00:00,000 --> 00:00:02,500
Hello, and welcome to today's workshop

2
00:00:02,500 --> 00:00:05,000
where we will discuss speech recognition.
```
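The two formats above differ mainly in three ways: WebVTT starts with a `WEBVTT` header, SRT numbers each cue, and SRT uses a comma rather than a dot before the milliseconds. The sketch below converts one to the other under simplifying assumptions: timestamps are always written as `HH:MM:SS.mmm`, cues are separated by blank lines, and there are no styling blocks or cue settings. The name `vtt_to_srt` is hypothetical.

```python
import re

# Matches a cue timing line such as "00:00:00.000 --> 00:00:02.500".
CUE_TIMING = re.compile(
    r"(\d{2}:\d{2}:\d{2})\.(\d{3}) --> (\d{2}:\d{2}:\d{2})\.(\d{3})"
)


def vtt_to_srt(vtt_text: str) -> str:
    """Convert simple WebVTT text to SRT text."""
    blocks = [b.strip() for b in vtt_text.strip().split("\n\n") if b.strip()]
    # Keep only blocks that contain a timing line; this drops the header.
    cues = [b for b in blocks if CUE_TIMING.search(b)]
    srt_blocks = []
    for index, cue in enumerate(cues, start=1):
        # Swap the dot for a comma in the first timing line of the cue.
        cue = CUE_TIMING.sub(r"\1,\2 --> \3,\4", cue, count=1)
        srt_blocks.append(f"{index}\n{cue}")
    return "\n\n".join(srt_blocks) + "\n"


vtt = (
    "WEBVTT\n\n"
    "00:00:00.000 --> 00:00:02.500\n"
    "Hello, and welcome to today's workshop\n\n"
    "00:00:02.500 --> 00:00:05.000\n"
    "where we will discuss speech recognition.\n"
)
print(vtt_to_srt(vtt))
```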

3. JSON
JSON can be useful for storing structured data, including transcriptions with timestamps.

```
{
    "transcriptions": [
        {
            "start": "00:00:00.000",
            "end": "00:00:02.500",
            "text": "Hello, and welcome to our video."
        },
        {
            "start": "00:00:02.500",
            "end": "00:00:05.000",
            "text": "Today, we will discuss the basics of speech recognition."
        }
    ]
}
```
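Structured JSON like the example above is easy to turn into one of the caption formats. The following sketch renders it as WebVTT. It assumes the key names used in the example (`transcriptions`, `start`, `end`, `text`), which are this document's layout rather than a standard schema, and the function name `json_to_vtt` is hypothetical.

```python
import json


def json_to_vtt(json_text: str) -> str:
    """Render the example JSON layout as WebVTT text."""
    data = json.loads(json_text)
    lines = ["WEBVTT", ""]
    for item in data["transcriptions"]:
        lines.append(f'{item["start"]} --> {item["end"]}')
        lines.append(item["text"])
        lines.append("")  # blank line ends the cue
    return "\n".join(lines)


sample = """
{
    "transcriptions": [
        {"start": "00:00:00.000", "end": "00:00:02.500",
         "text": "Hello, and welcome to our video."},
        {"start": "00:00:02.500", "end": "00:00:05.000",
         "text": "Today, we will discuss the basics of speech recognition."}
    ]
}
"""
print(json_to_vtt(sample))
```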

# Web-scale Supervised Pretraining for Speech Recognition (Whisper)

<img src="https://raw.githubusercontent.com/openai/whisper/main/approach.png" width="600" />

[image source](https://raw.githubusercontent.com/openai/whisper/main/approach.png)

- A powerful audio transformer model from OpenAI.
- The model maps utterances to their transcribed form across multiple languages.
- It can be downloaded and run on your own setup (a GPU is strongly recommended) without sending data over the web.
- Its training data covers many different recording conditions: noisy and quiet environments, audio with and without speech, songs, etc.
- As a result, it performs well in both quiet and noisy environments.
- Whisper uses a **sequence-to-sequence transformer** model.
- It is trained with weak supervision (that is, not all of the training transcripts were verified, or even generated, by humans).
- It uses a 'multitask training format': a set of special tokens tells a single model which task to perform on the audio.
- It is powerful because the model has been trained on many speech processing tasks, such as multilingual speech recognition, speech translation, spoken language identification, and voice activity detection.

When we call the model to process a file, it makes predictions for the set of tasks as a whole, instead of sending the data through different stages.
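The multitask format can be pictured as a short prefix of special tokens placed before the text the model generates. The sketch below only builds that prefix as a list of strings for illustration; it does not call the Whisper tokenizer, and the function name `build_task_prompt` is hypothetical. The token spellings follow the ones used by Whisper.

```python
def build_task_prompt(language: str, task: str = "transcribe",
                      timestamps: bool = True) -> list[str]:
    """Return the special-token prefix that selects language and task."""
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Ask for plain text without timestamp tokens.
        tokens.append("<|notimestamps|>")
    return tokens


print(build_task_prompt("en"))
print(build_task_prompt("ko", task="translate", timestamps=False))
```

Changing only this prefix switches the same model between, for example, English transcription and Korean-to-English translation.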


# Whisper Pipeline for generating captions
2. Upload files using the folder icon on the left, and then the upload file icon. **All files will be deleted when the session ends, so download all necessary files before closing this notebook or deleting/disconnecting the runtime.**

3. Check that your files are visible. In the code, they can be accessed in the `/content/` folder, for example `"/content/your-file.mp3"`.
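One way to check from code that the uploads are visible is to list the audio files in the folder. This is a minimal sketch: the helper name `list_audio_files` and the set of file extensions are assumptions made for illustration.

```python
from pathlib import Path

# Extensions treated as audio in this sketch (an assumption, extend as needed).
AUDIO_SUFFIXES = {".mp3", ".wav", ".m4a", ".flac"}


def list_audio_files(folder: str = "/content") -> list[Path]:
    """Return the audio files directly inside `folder`, sorted by name."""
    root = Path(folder)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_SUFFIXES
    )


for path in list_audio_files():
    print(path)
```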


In [17]:
# Check for GPU availability:
!nvidia-smi

Thu Oct 31 01:21:12 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   76C    P0              32W /  70W |    545MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
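The same check can be made from Python, which is handy when later cells need to choose a device. The sketch below assumes PyTorch (`torch`) is installed, as it is a dependency of Whisper, and falls back to the CPU otherwise. The helper name `pick_device` is hypothetical.

```python
def pick_device() -> str:
    """Return "cuda" when a GPU is usable from PyTorch, else "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to query, so assume CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(pick_device())
```

The result can then be passed along when loading the model, for example `whisper.load_model("base", device=pick_device())`.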
                                                                    

In [1]:
# Run this cell to set up Colab and avoid encoding errors
import locale
# Show the encoding that Colab currently reports
print(locale.getpreferredencoding())
# Override the lookup so that later cells always see UTF-8
def getpreferredencoding(do_setlocale=True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

UTF-8


In [2]:
# Install Whisper from the GitHub repository:
!pip install git+https://github.com/openai/whisper.git -q

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 209.5/209.5 MB 5.5 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 31.0 MB/s eta 0:00:00
  Building wheel for openai-whisper (pyproject.toml) ... done


In [3]:
# Other tools for processing audio files:
!apt install ffmpeg
!pip install setuptools-rust

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
ffmpeg is already the newest version (7:4.4.2-0ubuntu0.22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.
Collecting setuptools-rust
  Downloading setuptools_rust-1.10.2-py3-none-any.whl.metadata (9.2 kB)
Collecting semantic-version<3,>=2.8.2 (from setuptools-rust)
  Downloading semantic_version-2.10.0-py2.py3-none-any.whl.metadata (9.7 kB)
Downloading setuptools_rust-1.10.2-py3-none-any.whl (26 kB)
Downloading semantic_version-2.10.0-py2.py3-none-any.whl (15 kB)
Installing collected packages: semantic-version, setuptools-rust
Successfully installed semantic-version-2.10.0 setuptools-rust-1.10.2


In [4]:
# Load the model
import whisper
model = whisper.load_model("base")

100%|███████████████████████████████████████| 139M/139M [00:01<00:00, 76.1MiB/s]
  checkpoint = torch.load(fp, map_location=device)
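Once the model is loaded, it can also be called from Python instead of the command line: `model.transcribe(path)` returns a dictionary with the full `"text"`, the detected `"language"`, and a list of `"segments"` that carry `"start"`, `"end"` and `"text"`. The sketch below formats such a result. To stay runnable without the model, it uses a hand-written stand-in dictionary, and the helper name `describe_segments` is hypothetical.

```python
def describe_segments(result: dict) -> list[str]:
    """Format each segment of a transcription result as one line."""
    lines = []
    for seg in result["segments"]:
        text = seg["text"].strip()
        lines.append(f'[{seg["start"]:.2f}s -> {seg["end"]:.2f}s] {text}')
    return lines


# In the notebook: result = model.transcribe("/content/mary.mp3")
# Hand-written stand-in with the same shape, for illustration only:
example_result = {
    "text": " Mary had a little lamb, its fleece was white as snow.",
    "language": "en",
    "segments": [
        {"start": 1.14, "end": 8.24,
         "text": " Mary had a little lamb, its fleece was white as snow."},
    ],
}

for line in describe_segments(example_result):
    print(line)
```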


In [5]:
# Download Sample audio for testing (English and Korean)
!wget -O mary.mp3 https://raw.githubusercontent.com/petewarden/openai-whisper-webapp/main/mary.mp3
!wget -O Cupid_Fifty_Fifty_Korean_Version.mp3 https://raw.githubusercontent.com/keatonkraiger/Whisper-Transcribe-and-Translate-Tutorial/main/Cupid_Fifty_Fifty_Korean_Version.mp3


--2024-10-31 01:16:49--  https://raw.githubusercontent.com/petewarden/openai-whisper-webapp/main/mary.mp3
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 100483 (98K) [audio/mpeg]
Saving to: ‘mary.mp3’


2024-10-31 01:16:50 (5.64 MB/s) - ‘mary.mp3’ saved [100483/100483]

--2024-10-31 01:16:50--  https://raw.githubusercontent.com/keatonkraiger/Whisper-Transcribe-and-Translate-Tutorial/main/Cupid_Fifty_Fifty_Korean_Version.mp3
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4867062 (4.6M) [audio/mpeg]
Saving to: ‘Cupid_Fifty_Fifty_K

In [10]:
# save audio name in variable
audio_file = "/content/mary.mp3"
audio_file2 = "/content/Cupid_Fifty_Fifty_Korean_Version.mp3"

In [7]:
#Player for playing the audio and checking if it works:
from IPython.display import Audio
Audio(audio_file)

In [14]:
from IPython.display import Audio
Audio(audio_file2)

Some parameters to play with:
```
 --output_format (srt, vtt)
 --max_words_per_line (4, 6, 7...)
 --language (en, hi)
```
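When trying several combinations of these parameters, it can help to assemble the command in Python. The sketch below builds the argument list for the `whisper` command using only the flags that appear in this notebook. The helper name `build_whisper_command` and its defaults are illustrative assumptions.

```python
import shlex


def build_whisper_command(audio_path: str, model: str = "medium",
                          language: str = "en", output_dir: str = "output",
                          output_format: str = "srt",
                          max_words_per_line: int = 6) -> list[str]:
    """Return the whisper CLI call as a list of arguments."""
    return [
        "whisper", audio_path,
        "--model", model,
        "--task", "transcribe",
        "--language", language,
        "--output_dir", output_dir,
        "--output_format", output_format,
        "--word_timestamps", "True",
        "--highlight_words", "True",
        "--max_words_per_line", str(max_words_per_line),
    ]


# shlex.join quotes arguments safely, e.g. paths containing spaces.
print(shlex.join(build_whisper_command("/content/mary.mp3")))
```

The list form can be passed to `subprocess.run`, and the printed string can be pasted after `!` in a notebook cell.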

In [16]:
# Sample transcription: outputs transcription for file mary.mp3,
# in English, saved to directory "output", in srt with 6 words per line
#(this code will generate captions below, and save only a .srt file)
!whisper /content/mary.mp3 --model medium --task transcribe --language en --output_dir output --output_format srt --word_timestamps True --highlight_words True --max_words_per_line 6


  checkpoint = torch.load(fp, map_location=device)
[00:01.140 --> 00:08.240]  Mary had a little lamb, its fleece was white as snow, and everywhere that Mary went, the lamb was sure to go.
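The lines printed by the command follow a regular pattern, so they can be parsed back into numbers if needed. This sketch assumes the `[MM:SS.mmm --> MM:SS.mmm]  text` layout shown in the output above (longer recordings may also include hours, which this sketch does not handle). The helper name `parse_caption_line` is hypothetical.

```python
import re

# Pattern for lines such as "[00:01.140 --> 00:08.240]  Mary had ...".
CAPTION_LINE = re.compile(
    r"\[(\d+):(\d{2})\.(\d{3}) --> (\d+):(\d{2})\.(\d{3})\]\s*(.*)"
)


def parse_caption_line(line: str):
    """Return (start_seconds, end_seconds, text), or None if no match."""
    match = CAPTION_LINE.match(line.strip())
    if match is None:
        return None
    m1, s1, ms1, m2, s2, ms2 = (int(g) for g in match.groups()[:6])
    # Sum in integer milliseconds first, then convert to seconds.
    start = (m1 * 60_000 + s1 * 1000 + ms1) / 1000
    end = (m2 * 60_000 + s2 * 1000 + ms2) / 1000
    return start, end, match.group(7).strip()


print(parse_caption_line("[00:01.140 --> 00:08.240]  Mary had a little lamb"))
```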


In [19]:
# Sample transcription: outputs transcription for file Cupid_Fifty_Fifty_Korean_Version.mp3,
# in Korean, saved to directory "output_korean", in srt with 4 words per line
#(this code will generate captions below, and save only a .srt file)
!whisper /content/Cupid_Fifty_Fifty_Korean_Version.mp3 --model medium --task transcribe --language ko  --output_dir output_korean --output_format srt --word_timestamps True --highlight_words True --max_words_per_line 4


  checkpoint = torch.load(fp, map_location=device)
[00:04.580 --> 00:29.420]  아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아 아
[00:53.100 --> 00:54.500]  아
[00:54.500 --> 00:55.920] ح
[00:55.920 --> 00:55.940]  발의
[00:53.420 --> 00:56.420]  p
[00:59.420 --> 01:03.800]  또 꿈길을 걷는 every day
[01:03.800 --> 01:06.880]  눈 뜨면 다시 더 fluent
[01:06.880 --> 01:08.980]  Waiting around in the ways
[01:08.980 --> 01:11.000]  나 솔직히 지금이 편해
[01:11.000 --> 01:14.120]  상상만큼 짜릿한 걸까
[01:14.120 --> 01:16.400]  Now I'm so lonely
[01:16.400 --> 01:20.380]  매일 꿈속에서 연습했죠 kiss me
[01:20.380 --> 01:22.960]  사실 Crying in my room
[01:22.960 --> 01:25.120]  포기할까봐
[01:25.120 --> 01:29.380]  But still I want it more, more, more
[01:29.380 --> 01:33.200]  I give a second chance to be cute, babe
[01:
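The first segment of the output above is a long run of one repeated syllable, which typically happens over music or non-speech audio. A simple way to flag such segments for manual review is to measure how much of a segment is a single repeated token. This is only a heuristic sketch: the name `looks_repetitive` and both thresholds are assumptions, not something Whisper provides.

```python
from collections import Counter


def looks_repetitive(text: str, min_tokens: int = 10,
                     threshold: float = 0.8) -> bool:
    """Flag text where one token makes up most of the segment."""
    tokens = text.split()
    if len(tokens) < min_tokens:
        # Too short to judge reliably.
        return False
    most_common_count = Counter(tokens).most_common(1)[0][1]
    return most_common_count / len(tokens) >= threshold


print(looks_repetitive("아 " * 50))                  # True
print(looks_repetitive("또 꿈길을 걷는 every day"))   # False
```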

# Uncomment the following to run your uploaded files:

In [None]:
# Uncomment the following lines to play your audio file:
# audio_file = "/content/your-audio.mp3"
# from IPython.display import Audio
# Audio(audio_file)

In [None]:
# Transcribe your mp3 file
#(this code will generate captions below, and save only a .srt file)
# !whisper /content/your-audio.mp3 --model medium --language en --task transcribe --output_format srt --word_timestamps True --highlight_words True --max_words_per_line 5


In [None]:
# Check the usage guide for more:
!whisper --help