# Whisper

This is a quick tutorial showcasing one of OpenAI's multimodal capabilities: speech-to-text transcription.

The speech recognition model used is called [Whisper](https://openai.com/research/whisper), and it's completely free and open-source. It was trained on 680,000 hours of multilingual data and copes with difficult inputs such as accents and background noise. You can take a closer look at the [technical details](https://help.openai.com/en/articles/7031512-whisper-api-faq) or the [implementation](https://github.com/openai/whisper).

Note that Whisper comes in several model sizes. This is worth knowing because the model name is set in the code.

|  Size  | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|:------:|:----------:|:------------------:|:------------------:|:-------------:|:--------------:|
|  tiny  |    39 M    |     `tiny.en`      |       `tiny`       |     ~1 GB     |      ~32x      |
|  base  |    74 M    |     `base.en`      |       `base`       |     ~1 GB     |      ~16x      |
| small  |   244 M    |     `small.en`     |      `small`       |     ~2 GB     |      ~6x       |
| medium |   769 M    |    `medium.en`     |      `medium`      |     ~5 GB     |      ~2x       |
| large  |   1550 M   |        N/A         |      `large`       |    ~10 GB     |       1x       |
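As a rough illustration of how the table could drive the choice of model, here is a minimal sketch that picks the largest model fitting a given VRAM budget. The helper `pick_model` and the `MODEL_VRAM_GB` dictionary are hypothetical names made up for this example (they are not part of Whisper), and the numbers are just the approximate figures from the table.

```python
# Approximate VRAM needs in GB, copied from the table above.
# Ordered from smallest to largest model.
MODEL_VRAM_GB = {
    "tiny": 1,
    "base": 1,
    "small": 2,
    "medium": 5,
    "large": 10,
}


def pick_model(vram_gb: float, english_only: bool = False) -> str:
    """Return the largest model whose approximate VRAM need fits the budget.

    Hypothetical helper for illustration only; not part of the whisper package.
    """
    fitting = [name for name, need in MODEL_VRAM_GB.items() if need <= vram_gb]
    if not fitting:
        raise ValueError(f"No Whisper model fits in {vram_gb} GB of VRAM")
    name = fitting[-1]  # dicts keep insertion order, so the last one is the largest
    # The table lists no English-only variant for `large`, so it stays multilingual.
    if english_only and name != "large":
        name += ".en"
    return name


print(pick_model(4))                     # a 4 GB budget
print(pick_model(1, english_only=True))  # a 1 GB budget, English audio only
```

The returned string can be passed straight to `whisper.load_model`.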

## Transcription

To start with, I will ask Whisper to transcribe the famous `I have a particular set of skills` speech from `Liam Neeson`. I strongly encourage you to check out the original film [Taken](https://en.wikipedia.org/wiki/Taken_(film)) if you haven't seen it before.

I will first need to download the clip from YouTube, then extract only the audio from the clip.

In [None]:
!pip install --upgrade pytube

In [5]:
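import pytube

# Look up the video, then pick its audio-only stream.
video = "https://www.youtube.com/watch?v=-LIIf7E-qFI"
data = pytube.YouTube(video)
audio = data.streams.get_audio_only()
# Save the audio track locally; download() returns the saved file's path.
audio.download(filename="taken.mp4")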

'/Users/meng.lin/workspace/GenAI/OpenAI/taken.mp4'

Once we have the audio file, we can start the transcription process.

In [None]:
!pip install git+https://github.com/openai/whisper.git -q

If you don't pass `fp16=False`, you may see the warning `UserWarning: FP16 is not supported on CPU; using FP32 instead` during transcription. Whisper defaults to half-precision (FP16) inference, which isn't supported on CPU, so it falls back to FP32 anyway; setting the flag simply silences the warning. You can dig deeper into FP16 versus FP32 [here](https://www.quora.com/What-is-the-difference-between-FP16-and-FP32-when-doing-deep-learning) if this is the kind of topic that interests you.
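Rather than hard-coding the flag, one option is to derive it from the device the model runs on. The sketch below assumes PyTorch is installed (Whisper depends on it); `fp16_for` and `load_and_transcribe` are hypothetical helper names chosen for this example, and the `"base"` default is only there to keep the download small.

```python
def fp16_for(device: str) -> bool:
    """Use half precision only on a CUDA GPU; other devices get FP32."""
    return device.startswith("cuda")


def load_and_transcribe(path: str, model_name: str = "base") -> str:
    """Hypothetical wrapper: load a model on the best device and transcribe."""
    # Imported lazily so fp16_for() can be used without these packages installed.
    import torch
    import whisper

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = whisper.load_model(model_name, device=device)
    result = model.transcribe(path, fp16=fp16_for(device))
    return result["text"]
```

On a machine without a GPU this passes `fp16=False`, which matches what the cell below does by hand.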

In [5]:
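import whisper

# The first call downloads the model weights (about 2.87 GB for "large").
model = whisper.load_model("large")
# fp16=False avoids the FP16 warning when running on a CPU.
transcribe = model.transcribe("taken.mp4", fp16=False)
transcribe['text']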

100%|█████████████████████████████████████| 2.87G/2.87G [07:27<00:00, 6.90MiB/s]


" I don't know who you are. I don't know what you want. If you are looking for a ransom, I can tell you I don't have money. But what I do have are a very particular set of skills. Skills I have acquired over a very long career. Skills that make me a nightmare for people like you. If you let my daughter go now, that will be the end of it. I will not look for you. I will not pursue you. But if you don't, I will look for you. I will find you. And I will kill you. Good luck."

As you can see, it works pretty well.

However, there are a couple of things to point out:
- The result looks this good because the `large` model was used. Smaller models can give subpar results. For example, with the `tiny` and `base` models I got `Skills I have acquired are for a very long career` instead of `Skills I have acquired over a very long career`.
- Even though larger models produce more accurate results, they also take longer to run.
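Spotting such differences by eye gets tedious with longer clips. One way to compare two transcripts is a word-level diff using Python's standard `difflib` module. This is only a sketch: `word_diff` is a hypothetical helper, and the two strings are the sentences quoted above rather than fresh model output.

```python
import difflib


def word_diff(reference: str, candidate: str):
    """Return the word-level differences between two transcripts."""
    ref_words = reference.split()
    cand_words = candidate.split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=cand_words, autojunk=False)
    # Keep only the spans where the two transcripts disagree.
    return [
        (tag, ref_words[i1:i2], cand_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


large_text = "Skills I have acquired over a very long career."
small_text = "Skills I have acquired are for a very long career."
print(word_diff(large_text, small_text))
# [('replace', ['over'], ['are', 'for'])]
```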

## Transcription for non-English languages

Whisper does a great job transcribing English, but what about other languages? It turns out Whisper currently supports 99 different languages.

I will use a movie clip from `Les Choristes`, or `The Chorus`, and see how well Whisper does in transcribing French.
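By default Whisper detects the spoken language itself from the start of the audio. If you already know the language, you can pass it through the `language` option of `transcribe` instead. The sketch below wraps that in a hypothetical helper, `transcribe_with_language`, which takes an already loaded model as its first argument.

```python
def transcribe_with_language(model, path, language=None):
    """Transcribe `path`, optionally telling Whisper which language to expect.

    Hypothetical helper: `model` is anything with a Whisper-style
    transcribe() method, e.g. the object returned by whisper.load_model().
    """
    options = {"fp16": False}
    if language is not None:
        options["language"] = language  # e.g. "fr" for French
    result = model.transcribe(path, **options)
    # The result dict reports the language used alongside the text.
    return result["language"], result["text"].strip()
```

For the clip used here, the call would look like `transcribe_with_language(model, "the_chorus.mp4", language="fr")`.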

In [3]:
import pytube
import whisper

video = "https://www.youtube.com/watch?v=qcHpow4k4v8"
data = pytube.YouTube(video)
audio = data.streams.get_audio_only()
audio.download(filename="the_chorus.mp4")

model = whisper.load_model("large")
transcribe = model.transcribe("the_chorus.mp4", fp16 = False)
transcribe['text']

" 30 janvier, début de l'expérience. Ils en déchapperont vivre les bretons... Ils en déchapperont vivre la Bretagne, Ils en déchapperont vivre les bretons... Soprano, à gauche. Leclerc. Il est né le divin enfant, Chantons tous son avènement... Oui, c'est presque juste. Alto, à gauche. De l'air. 3 km à pied, ça use, ça use, 3 km à pied, ça use et souillez. Oui, on va recommencer, là. 4 km à pied, ça use, ça use, 4 km à pied, ça use et souillez. C'est bien ce que je pensais. Basse, à droite. Riqueur. J'ai du doux. J'ai du bon tabac dans ma tabatière, J'ai du bon tabac, tu n'en auras pas. T'es interdit de fumer. Alto, à gauche. Il ouse. L'amour est enfant de Bohème, Il n'a jamais, jamais connu de loi. Soprano, évidemment, à gauche. Viens, Pépinot. Je connais pas de chansons. C'est pas grave, je vais t'en apprendre. En attendant, je te nomme assistant chef de chœur. Voilà. Boniface. Maréchal, nous voilà ! Mais qu'est-ce qui t'a appris ça, toi ? C'est mon grand-père. T'es un peu démodé. Bon

The result looks pretty decent when compared with the subtitles in the clip.
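Checking a transcript against subtitles is easier with timestamps. Besides `text`, the dictionary returned by `transcribe` also contains a `segments` list, where each segment carries `start` and `end` times in seconds plus its own `text`. The helpers below are a hypothetical sketch for printing those segments line by line.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as MM:SS, dropping the fractional part."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def timestamped_lines(result: dict) -> list[str]:
    """Turn a Whisper result dict into one '[start - end] text' line per segment."""
    return [
        f"[{format_timestamp(seg['start'])} - {format_timestamp(seg['end'])}] "
        f"{seg['text'].strip()}"
        for seg in result["segments"]
    ]
```

With the result from the cell above, `print("\n".join(timestamped_lines(transcribe)))` lists each segment next to the time it was spoken.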

But what if you don't understand a word of French? There's another amazing capability of Whisper that can help with that.

## Translation into English

In addition to speech recognition and language identification, Whisper can also translate speech. However, the current model can only translate from non-English languages into English, for [some reason](https://github.com/openai/whisper/discussions/649).

Regardless, we will take a look at how this can be implemented and how well Whisper performs at the task.

In [4]:
french_to_english = model.transcribe("the_chorus.mp4", fp16 = False, task = 'translate')
french_to_english["text"]

" The beginning of the experiment 30 January, the beginning of the experiment They have round hats, long live the Bretons They have round hats, long live Britain They have round hats, long live the Bretons Soprano, on the left Leclerc He was born the divine child Let's all sing his advent Almost right, alto on the left Delair 3 km on foot, it's getting worse 3 km on foot, it's getting worse Start again 4 km on foot, it's getting worse 4 km on foot, it's getting worse That's what I thought, bass on the right Ricœur I've got good tobacco in my tobacco box I've got good tobacco, you won't have any You're not allowed to smoke Alto on the left Illouz Love is a child of Bohemia He has never, never known of law Soprano, of course, on the left Come on, Pépinot I don't know any songs It's OK, I'll teach you some In the meantime, I'll call you assistant choir conductor Boniface Maréchal, here we are Where did you learn that? What did you learn? It's my grandfather You're a bit out of style On th

Even though I wouldn't say it's the best result, it is still really cool. I hope your journey doesn't stop here, and that you continue to experiment and build amazing things with these features of OpenAI.