## Exploring OpenAI's Whisper model for YouTube video transcription

In [12]:
# Dependencies
!pip install git+https://github.com/openai/whisper.git
!pip install git+https://github.com/pytube/pytube.git
!pip install pandas

Defaulting to user installation because normal site-packages is not writeable
Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-6sanp4wl
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-6sanp4wl
  Resolved https://github.com/openai/whisper.git to commit b5851c6c40e753606765ac45b85b298e3ae9e00d
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Defaulting to user installation because normal site-packages is not writeable
Collecting git+https://github.com/pytube/pytube.git
  Cloning https://github.com/pytube/pytube.git to /tmp/pip-req-build-emj5vd4n
  Running command git clone --filter=blob:none --quiet https://github.com/pytube/pytube.git /tmp/pip-req-build-emj5vd4n
  Resolved https://github.com/pytube/pytube.git to commit 96315dff218c3c5c46

In [85]:
# imports
import re
import torch
import whisper
from pytube import YouTube
import pandas as pd

### Download the YouTube video as audio
---
- Download the audio-only stream for the given `url` from YouTube and save it locally.
- This audio file will be fed to the Whisper model.
- `last()` returns the last stream in the filtered list. That is not necessarily the highest bit-rate stream: the one selected below is 48 kbps.
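
If the highest bit rate matters, it may be safer to pick it explicitly than to rely on list position. The following is a minimal sketch, assuming each stream exposes its bit rate as a string attribute `abr` such as `"48kbps"` (as in the `Stream` repr shown in this notebook). The helper names `abr_kbps` and `pick_highest_bitrate` are hypothetical, and the 128 kbps stand-in stream is illustrative rather than taken from this video.

```python
import re
from types import SimpleNamespace


def abr_kbps(stream):
    """Parse a bit-rate string such as '48kbps' into an int; 0 if missing."""
    match = re.match(r"(\d+)", stream.abr or "")
    return int(match.group(1)) if match else 0


def pick_highest_bitrate(streams):
    """Return the stream with the largest numeric bit rate."""
    return max(streams, key=abr_kbps)


# Stand-in objects so the sketch runs offline. In the notebook you would pass
# YouTube(yt_url).streams.filter(only_audio=True) instead.
candidates = [
    SimpleNamespace(itag=139, abr="48kbps"),
    SimpleNamespace(itag=140, abr="128kbps"),  # illustrative second stream
]
best = pick_highest_bitrate(candidates)
print(best.itag, best.abr)
```

Parsing the number out of `abr` avoids comparing the strings lexically, where `"48kbps"` would sort after `"128kbps"`.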

In [31]:
yt_url = "https://youtu.be/ORMx45xqWkA"
audio_input = YouTube(yt_url).streams.filter(only_audio=True).last()

In [32]:
audio_input

<Stream: itag="139" mime_type="audio/mp4" abr="48kbps" acodec="mp4a.40.5" progressive="False" type="audio">

In [33]:
audio_input = audio_input.download("audio")

In [34]:
audio_input

'/home/batman/fun/auto-annotate-YT/audio/PyTorch in 100 Seconds.mp4'

### Load the Whisper Model
---
- We will be using the `base` multilingual model.
- It has 74 M parameters.

- ***All available models***

|  Size  | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|:------:|:----------:|:------------------:|:------------------:|:-------------:|:--------------:|
|  tiny  |    39 M    |     `tiny.en`      |       `tiny`       |     ~1 GB     |      ~32x      |
|  base  |    74 M    |     `base.en`      |       `base`       |     ~1 GB     |      ~16x      |
| small  |   244 M    |     `small.en`     |      `small`       |     ~2 GB     |      ~6x       |
| medium |   769 M    |    `medium.en`     |      `medium`      |     ~5 GB     |      ~2x       |
| large  |   1550 M   |        N/A         |      `large`       |    ~10 GB     |       1x       |
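
The table can also be used programmatically to choose a model size for the available GPU memory. This is a rough sketch: the VRAM figures are the approximate values from the table, and the helper `largest_model_that_fits` is a hypothetical name, not part of the `whisper` package.

```python
# (multilingual model name, parameters in millions, approximate VRAM in GB),
# copied from the table above and ordered from smallest to largest.
WHISPER_MODELS = [
    ("tiny", 39, 1),
    ("base", 74, 1),
    ("small", 244, 2),
    ("medium", 769, 5),
    ("large", 1550, 10),
]


def largest_model_that_fits(vram_gb):
    """Return the largest model whose approximate VRAM need is <= vram_gb."""
    fitting = [name for name, _params, need in WHISPER_MODELS if need <= vram_gb]
    return fitting[-1] if fitting else None


print(largest_model_that_fits(4))
```

The returned name could then be passed to `whisper.load_model(...)` in place of the hard-coded `"base"`.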

In [35]:
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
model = whisper.load_model("base", device=device)

### Transcribe the Video

In [36]:
transcription = model.transcribe(audio_input)

In [37]:
transcription

{'text': " The Pi Torch, an open-source deep learning framework used to build some of the world's most famous artificial intelligence products. It was created at the Meta AI Research Lab in 2016, but is actually derived from the Lua-based torch library that dates back to 2002. Fundamentally, it's a library for programming with tensors, which are basically just multidimensional arrays that represent data and parameters in deep neural networks. Sounds complicated, but its focus on usability will have you training machine learning models with just a few lines of Python. In addition, it facilitates high-performance parallel computing on a GPU, thanks to Nvidia's CUDA platform. Developers love prototyping with it, because it supports a dynamic computation graph, allowing models to be optimized at runtime. It does this by constructing a directed A-cyclic graph consisting of functions that keeps track of all the executed operations on the tensors, allowing you to change the shape, size, and o

In [129]:
pd.set_option('max_colwidth', 1000)
text_segment = pd.DataFrame(transcription['segments'], columns=['id','start', 'end', 'text'])
text_segment

|   | id | start | end | text |
|:-:|:--:|:-----:|:---:|:-----|
| 0 | 0 | 0.0 | 4.56 | The Pi Torch, an open-source deep learning framework used to build some of the world's most |
| 1 | 1 | 4.56 | 9.76 | famous artificial intelligence products. It was created at the Meta AI Research Lab in 2016, |
| 2 | 2 | 9.76 | 14.48 | but is actually derived from the Lua-based torch library that dates back to 2002. |
| 3 | 3 | 14.48 | 18.48 | Fundamentally, it's a library for programming with tensors, which are basically just |
| 4 | 4 | 18.48 | 22.96 | multidimensional arrays that represent data and parameters in deep neural networks. |
| 5 | 5 | 22.96 | 27.04 | Sounds complicated, but its focus on usability will have you training machine learning models |
| 6 | 6 | 27.04 | 31.12 | with just a few lines of Python. In addition, it facilitates high-performance parallel |
| 7 | 7 | 31.12 | 36.4 | computing on a GPU, thanks to Nvidia's CUDA platform. Developers love prototyping with it, |
| 8 | 8 | 36.4 | 41.04 | because it supports a dynamic computation graph, allowing models to be optimized at runtime. |
| 9 | 9 | 41.04 | 46.32 | It does this by constructing a directed A-cyclic graph consisting of functions that keeps track of |


In [44]:
def get_fulltext(model_output):
    return model_output['text']

In [51]:
full_text = get_fulltext(transcription)

In [56]:
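def write_srt(text, name):
    # Note: this writes the raw text only, without cue numbers or timestamps.
    with open(f"{name}.srt", "w") as f:
        f.write(text)  # the with-block closes the file on exit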

In [57]:
write_srt(full_text, 'pytorch')
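
The `write_srt` helper above writes the raw transcription text. A `.srt` file normally consists of numbered cues, each with a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range followed by the cue text. Since `transcription['segments']` already carries `start`, `end` and `text`, a subtitle file could be built from it. The sketch below uses two segments copied from the table above; `srt_timestamp` and `segments_to_srt` are hypothetical helper names.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments):
    """Turn segment dicts (start, end, text) into SRT-formatted text."""
    blocks = []
    for number, seg in enumerate(segments, start=1):
        blocks.append(
            f"{number}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(blocks)


sample_segments = [
    {"start": 0.0, "end": 4.56,
     "text": " The Pi Torch, an open-source deep learning framework used to build some of the world's most"},
    {"start": 4.56, "end": 9.76,
     "text": " famous artificial intelligence products. It was created at the Meta AI Research Lab in 2016,"},
]
srt_text = segments_to_srt(sample_segments)
print(srt_text)
```

In the notebook, `segments_to_srt(transcription['segments'])` would produce the text to pass to `write_srt`.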

#### A hacky way to get the timestamp
---
- Assumption: the speaker talks at a constant rate. In practice, some people speak fast and some speak slow.

- Getting the indices where the word `CUDA` is found in the entire transcription text

In [141]:
word_findings = [[i.start(), i.end()] for i in re.finditer('CUDA', transcription['text'])]
word_findings

[[657, 661], [1276, 1280]]
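
Under the constant speech rate assumption, a character index in the full text can be mapped to an approximate time by linear interpolation: the fraction of the text consumed so far times the total duration. The sketch below shows the arithmetic with toy numbers (they are not measurements from this video); `estimate_time` is a hypothetical helper name.

```python
def estimate_time(char_index, text_length, duration_s):
    """Estimate when a character is spoken, assuming a constant speech rate."""
    if text_length <= 0:
        raise ValueError("text_length must be positive")
    return duration_s * char_index / text_length


# Toy example: a character a quarter of the way through a 200-character
# text, spoken over 100 seconds, lands at about a quarter of the duration.
approx = estimate_time(50, 200, 100.0)
print(approx)  # 25.0
```

In the notebook, `text_length` would be `len(transcription['text'])` and `duration_s` could be taken from the `end` value of the last segment. The estimate drifts wherever the speaker pauses or speeds up.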

In [160]:
text_segment.iloc[0].text.find('CUDA')

-1

In [173]:
id_len = len(text_segment)
words_to_find = ['CUDA', 'PyTorch', 'GPU', 'MKBHD']
index = []  # ids of the segments that contain any of the words

for word in words_to_find:
    # Scan every segment and record the ones that contain the word
    for ids in range(id_len):
        if word in text_segment.iloc[ids].text:
            index.append(ids)
    # Report words that never occur in the full transcription
    if word not in transcription['text']:
        print(f' {word} is not found in the transcription')

print(index)

 MKBHD is not found in the transcription
[7, 13, 33, 7, 14, 31]
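
The flat `index` list does not record which word matched which segment. A variant that keeps the words apart and returns the segment time ranges directly might look like the sketch below. It uses two segments copied from the table above; `find_word_times` is a hypothetical helper name.

```python
def find_word_times(segments, words):
    """Map each word to the (start, end) ranges of the segments containing it."""
    hits = {word: [] for word in words}
    for seg in segments:
        for word in words:
            if word in seg["text"]:
                hits[word].append((seg["start"], seg["end"]))
    return hits


sample_segments = [
    {"id": 6, "start": 27.04, "end": 31.12,
     "text": "with just a few lines of Python. In addition, it facilitates high-performance parallel"},
    {"id": 7, "start": 31.12, "end": 36.4,
     "text": "computing on a GPU, thanks to Nvidia's CUDA platform. Developers love prototyping with it,"},
]
hits = find_word_times(sample_segments, ["CUDA", "GPU", "MKBHD"])
print(hits)
```

Words that never occur keep an empty list, so the "not found" case can be checked with `not hits[word]`.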


In [92]:
def get_word_len(word):
    return len(word)

print(get_word_len('will'))

4


In [95]:
transcription['text'][484:484+4]

'will'

In [103]:
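def find_word_indices(model_output, words):
    """Return the start index of every occurrence of each word in the full text."""
    word_findings = []
    for word in words:
        # re.escape keeps regex metacharacters in a word from being interpreted
        word_findings += [i.start() for i in re.finditer(re.escape(word), model_output['text'])]

    return word_findings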

In [105]:
words = ['will', 'Thanks', 'watching']
list_ = find_word_indices(transcription, words)
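
A plain substring search also matches inside longer words and is case-sensitive, so `'will'` would hit "goodwill" but miss "Will". If that matters, a whole-word, case-insensitive search is one option. This is a sketch on a made-up sentence; `find_word_spans` is a hypothetical helper name.

```python
import re


def find_word_spans(text, word, ignore_case=True):
    """Return (start, end) spans where `word` occurs as a whole word."""
    flags = re.IGNORECASE if ignore_case else 0
    # \b anchors the match at word boundaries; re.escape treats the word literally.
    pattern = re.compile(rf"\b{re.escape(word)}\b", flags)
    return [match.span() for match in pattern.finditer(text)]


demo_text = "Goodwill will win. Will it?"  # made-up sentence for illustration
spans = find_word_spans(demo_text, "will")
print(spans)  # [(9, 13), (19, 23)]
```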