# Whisper

This is a quick experiment with one of OpenAI's multimodal capabilities: speech-to-text transcription. The speech recognition model used is called [Whisper](https://openai.com/research/whisper), and it is free and open-source.

Whisper is trained on 680,000 hours of multilingual data and can handle difficult inputs such as accents and background noise. The [technical details](https://github.com/openai/whisper) are available on GitHub.

Several model sizes are available. This is worth knowing because the model name is set in the code.

| Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|---|---|---|---|---|---|
| tiny | 39 M | tiny.en | tiny | ~1 GB | ~32x |
| base | 74 M | base.en | base | ~1 GB | ~16x |
| small | 244 M | small.en | small | ~2 GB | ~6x |
| medium | 769 M | medium.en | medium | ~5 GB | ~2x |
| large | 1550 M | N/A | large | ~10 GB | 1x |
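
To make the trade-off in the table concrete, here is a minimal sketch of choosing a model name from a memory budget. The helper `pick_model` and the `MODEL_VRAM_GB` list are hypothetical names introduced for illustration and are not part of Whisper. The memory figures are the approximate values from the table, so treat the result as a rough guide only.

```python
# Approximate VRAM needs in GB, copied from the table above, smallest model first.
# MODEL_VRAM_GB and pick_model are illustrative helpers, not part of the whisper package.
MODEL_VRAM_GB = [("tiny", 1), ("base", 1), ("small", 2), ("medium", 5), ("large", 10)]


def pick_model(available_vram_gb, english_only=False):
    """Return the name of the largest model whose approximate VRAM need fits the budget."""
    fitting = [name for name, need in MODEL_VRAM_GB if need <= available_vram_gb]
    if not fitting:
        raise ValueError("not enough memory for even the tiny model")
    name = fitting[-1]
    # "large" has no English-only variant, so it stays multilingual.
    if english_only and name != "large":
        name += ".en"
    return name


print(pick_model(4))                      # small
print(pick_model(6, english_only=True))   # medium.en
```

The returned string is what would be passed to `whisper.load_model(...)`. Larger models are slower, so the largest model that fits is not always the best choice for a quick experiment.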

For this experiment, I will first download the famous "I have a particular set of skills" speech by Liam Neeson. I strongly encourage you to check out the original film [Taken](https://en.wikipedia.org/wiki/Taken_(film)) if you haven't seen it. I will then extract only the audio from the clip and transcribe it into text using Whisper.

In [None]:
!pip install --upgrade pytube

In [1]:
import pytube

video = "https://www.youtube.com/watch?v=-LIIf7E-qFI" # I have a particular set of skills - Liam Neeson
data = pytube.YouTube(video)
audio = data.streams.get_audio_only()
audio.download(filename="sample.mp4")

'/Users/meng.lin/workspace/GenAI/OpenAI/sample.mp4'

In [None]:
!pip install git+https://github.com/openai/whisper.git -q

You may see the warning `UserWarning: FP16 is not supported on CPU; using FP32 instead` during transcription. It is a harmless nuisance: Whisper defaults to half-precision (FP16) inference, which is not supported on CPU, so it falls back to FP32. You can dig deeper into the difference between FP16 and FP32 [here](https://www.quora.com/What-is-the-difference-between-FP16-and-FP32-when-doing-deep-learning) if the topic interests you. Otherwise, pass `fp16=False` in code or `--fp16 False` on the command line to suppress the warning.
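
Rather than hard-coding the flag, one option is to derive it from the device the model runs on. The sketch below assumes a hypothetical helper `transcribe_kwargs`. It also assumes that any device other than CUDA should use FP32, which is a conservative simplification.

```python
def transcribe_kwargs(device):
    """Build keyword arguments for model.transcribe() from a device name.

    Hypothetical helper: enables FP16 only for CUDA devices and uses FP32 otherwise.
    """
    return {"fp16": device.startswith("cuda")}


print(transcribe_kwargs("cpu"))     # {'fp16': False}
print(transcribe_kwargs("cuda:0"))  # {'fp16': True}

# Usage sketch (assumes torch and whisper are installed):
# import torch, whisper
# device = "cuda" if torch.cuda.is_available() else "cpu"
# model = whisper.load_model("tiny", device=device)
# result = model.transcribe("sample.mp4", **transcribe_kwargs(device))
```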

In [2]:
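import whisper

# "tiny" is the smallest and fastest model in the table above
model = whisper.load_model("tiny")
# fp16=False avoids the FP16 warning when running on CPU
result = model.transcribe("sample.mp4", fp16=False)
result['text']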

" I don't know who you are. I don't know what you want. If you're looking for ransom, I can tell you I don't have money, but what I do have. I'm a very particular set of skills. Skills I have acquired are for a very long career. The skills that make me and I matter people like you. If you let my daughter go now, that'll be the end of it. I will not look for you. I will not pursue you. But if you don't, I will look for you. I will find you. And I will kill you. Good luck."
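
Besides `text`, the dictionary returned by `model.transcribe()` also contains a `segments` list, where each segment carries `start` and `end` times in seconds along with its own `text`. The sketch below formats those segments as timestamped lines. `format_segments` is a hypothetical helper, and `fake_result` is a hand-written stand-in with made-up timestamps, not real output from the clip.

```python
def format_segments(result):
    """Return one '[start -> end] text' line per segment of a transcription result."""
    lines = []
    for seg in result.get("segments", []):
        text = seg["text"].strip()
        lines.append(f"[{seg['start']:6.2f} -> {seg['end']:6.2f}] {text}")
    return lines


# Hypothetical stand-in for a model.transcribe() result; the timestamps are invented.
fake_result = {
    "text": " I don't know who you are. I don't know what you want.",
    "segments": [
        {"start": 0.0, "end": 2.5, "text": " I don't know who you are."},
        {"start": 2.5, "end": 5.0, "text": " I don't know what you want."},
    ],
}

for line in format_segments(fake_result):
    print(line)
```

With a real result, the same function could be called on the dictionary returned in the cell above, which is handy for building subtitles or jumping to a specific point in the audio.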