# Whisper - Speech to Text

---
**Description**

- <b>Whisper</b> is a single model that performs <i>multiple speech processing tasks across different languages</i> with high accuracy and robustness.

- The paper \[2\] shows how to train Whisper on 680,000 hours of web audio using weak supervision from the accompanying transcripts, without manual annotation.

<img src ="https://openaicom.imgix.net/d9c13138-366f-49d3-b8bd-cb3f5a973a5b/asr-summary-of-model-architecture-desktop.svg?fm=auto&auto=compress,format&fit=min&w=3840&h=3103" width = "600px" height="400px" > </img>
<br>
    Figure 1. Whisper architecture. Source: Radford, A. et al. (2022)

<br>

---
**Reference**

\[1\] Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33, 12449-12460. https://arxiv.org/pdf/2006.11477v3.pdf

\[2\] Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356. https://cdn.openai.com/papers/whisper.pdf

\[3\] "Whisper", OpenAI, accessed 2023-05-10, https://github.com/openai/whisper.



**Install Library**

---

In [2]:
 !pip install git+https://github.com/openai/whisper.git
 !sudo apt update && sudo apt install ffmpeg

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-gny6a_2h
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-gny6a_2h
  Resolved https://github.com/openai/whisper.git to commit 248b6cb124225dd263bb9bd32d060b6517e067f8
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Hit:1 http://archive.ubuntu.com/ubuntu focal InRelease
Hit:2 https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/ InRelease
Hit:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  InRelease
Hit:4 http://archive.ubuntu.com/ubuntu focal-updates InRelease
Hit:5 http://archive.ubuntu.com/ubuntu focal-backports InRelease
Hit:6 http://ppa.launchpad.
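
After the install cell finishes, it can help to confirm that both pieces are actually available before transcribing. The snippet below is a minimal sketch using only the Python standard library; `check_whisper_environment` is a hypothetical helper name, not part of Whisper.

```python
import importlib.util
import shutil


def check_whisper_environment():
    """Report whether the `whisper` package and the `ffmpeg` binary can be found."""
    return {
        # True if Python can locate an importable top-level package named "whisper"
        "whisper": importlib.util.find_spec("whisper") is not None,
        # True if an executable named "ffmpeg" is on the PATH
        "ffmpeg": shutil.which("ffmpeg") is not None,
    }


for name, found in check_whisper_environment().items():
    print(f"{name}: {'found' if found else 'missing'}")
```

If either entry is reported as missing, rerun the install cell before continuing.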

# Upload your recordings or media files to the current directory or a designated directory

> You can download the sample file from my Drive: https://drive.google.com/drive/folders/1HBzIBBgTWjxQRCVj0T5twpTOtenXKutp

# Five model sizes: Tiny, Base, Small, Medium, Large
- Larger models are generally more accurate, but they are slower and need more memory.
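
The same size choice can also be made from Python instead of the command line. The sketch below assumes the `whisper` package installed above and uses its `load_model` and `transcribe` calls; `transcribe_with_models` and its `load_model` parameter are hypothetical names introduced here so the loop can be tried without downloading weights.

```python
def transcribe_with_models(audio_path, model_names=("tiny", "base"), load_model=None):
    """Transcribe one audio file with several Whisper model sizes.

    Returns a dict mapping each model name to its transcript text.
    `load_model` is injectable (an assumption of this sketch); when it is
    omitted, `whisper.load_model` is used.
    """
    if load_model is None:
        import whisper  # imported lazily: only needed for real transcription
        load_model = whisper.load_model

    transcripts = {}
    for name in model_names:
        model = load_model(name)
        result = model.transcribe(audio_path)  # dict that includes a "text" entry
        transcripts[name] = result["text"].strip()
    return transcripts
```

A call such as `transcribe_with_models("Coldplay_Yellow.mp3", ("tiny", "base", "small"))` would return one transcript per size. The first use of each size downloads its weights, so the larger models take noticeably longer to start.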

**Speech to text in English**

In [16]:
# Original lyrics : "Coldplay - Yellow"
Yellow_Lyrics = '''
Look at the stars
Look how they shine for you
And everything you do
Yeah, they were all yellow
I came along
I wrote a song for you
And all the things you do
And it was called Yellow
So then I took my turn
Oh, what a thing to have done
And it was all yellow
Your skin, oh yeah, your skin and bones
Turn into something beautiful
And you know, you know I love you so
You know I love you so
I swam across
I jumped across for you
Oh, what a thing to do
'Cause you were all yellow
I drew a line
I drew a line for you
Oh, what a thing to do
And it was all yellow
And your skin, oh yeah, your skin and bones
Turn into something beautiful
And you know, for you, I'd bleed myself dry
For you, I'd bleed myself dry
It's true
Look how they shine for you
Look how they shine for you
Look how they shine for
Look how they shine for you
Look how they shine for you
Look how they shine
Look at the stars
Look how they shine for you
And all the things that you do
'''
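
With the reference lyrics available, each transcript can be scored rather than compared by eye. One common choice for speech recognition is word error rate (WER): the word-level edit distance between reference and transcript, divided by the number of reference words. The sketch below is one simple way to compute it; `normalize` and `word_error_rate` are hypothetical helper names, and the normalization (lowercasing, dropping punctuation) is an assumption of this example, not something Whisper prescribes.

```python
import re


def normalize(text):
    """Lowercase, drop punctuation, and split into words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Strip apostrophes at word edges so "'Cause" and "Cause" compare equal
    return [w.strip("'") for w in words if w.strip("'")]


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


reference = "Look at the stars\nLook how they shine for you"
tiny_output = "Look at the star, the car they shine for you"
print(word_error_rate(reference, tiny_output))  # 3 substitutions over 10 words: 0.3
```

Lower is better, and 0.0 means the transcript matches the reference word for word after normalization.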

In [15]:
# Tiny Model
!whisper  "/content/drive/MyDrive/Generative AI/Whisper - Speech to text/Coldplay_Yellow.mp3" --model tiny

Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:26.160]  IN
[00:26.160 --> 00:41.160]  Look at the star, the car they shine for you
[00:41.160 --> 00:47.160]  Everything you do
[00:47.160 --> 00:52.160]  Yeah they were all yellow, I came along
[00:52.160 --> 00:58.160]  I wrote a song for you
[00:58.160 --> 01:04.160]  And all the things you did
[01:04.160 --> 01:08.160]  And it was called Yellow
[01:08.160 --> 01:14.160]  So then I took my looser
[01:14.160 --> 01:20.160]  Oh, what a thing to do
[01:21.160 --> 01:27.160]  And it was all yellow
[01:29.160 --> 01:33.160]  Yeah, I was blue, yeah, I was blue
[01:33.160 --> 01:37.160]  Oh, what a thing to do
[01:37.160 --> 01:43.160]  So then I took my looser looser looser
[01:43.160 --> 01:47.160]  Oh, what a thing to do
[01:48.160 --> 01:51.160]  And all the things you did
[01:51.160 --> 01:55.160]  And all the things you did
[01:55.160 --> 02:01.1
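
Each line of the CLI output follows the pattern `[start --> end]  text`, so captured output can be turned into structured segments. The parser below is a minimal sketch; `parse_segments` is a hypothetical helper, and the optional hours field is an assumption added to cover recordings longer than an hour. Lines that do not match the pattern, including status messages and cut-off segment lines, are skipped.

```python
import re

# One complete segment line, e.g. "[00:26.160 --> 00:41.160]  Look at the star"
SEGMENT_RE = re.compile(
    r"^\[(?:(\d+):)?(\d{2}):(\d{2}\.\d{3}) --> "
    r"(?:(\d+):)?(\d{2}):(\d{2}\.\d{3})\]\s*(.*)$"
)


def _to_seconds(hours, minutes, seconds):
    return int(hours or 0) * 3600 + int(minutes) * 60 + float(seconds)


def parse_segments(cli_output):
    """Return (start_seconds, end_seconds, text) for every complete segment line."""
    segments = []
    for line in cli_output.splitlines():
        match = SEGMENT_RE.match(line.strip())
        if match is None:
            continue  # status line or incomplete segment line
        h1, m1, s1, h2, m2, s2, text = match.groups()
        segments.append((_to_seconds(h1, m1, s1), _to_seconds(h2, m2, s2), text.strip()))
    return segments
```

Joining the text fields of the parsed segments gives a plain transcript that can be compared against the original lyrics.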

In [12]:
# Base Model
!whisper "/content/drive/MyDrive/Generative AI/Whisper - Speech to text/Coldplay_Yellow.mp3" --model base

Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:03.240]  To be continued ...
[00:30.000 --> 00:40.000]  Look at the stars, look how they shine for you
[00:40.000 --> 00:50.000]  Everything you do, yeah they were all yellow
[00:50.000 --> 00:58.000]  I came alive, I wrote a song for you
[00:58.000 --> 01:09.000]  And all the things you do, and it was called yellow
[01:09.000 --> 01:20.000]  So then I took my hands and a lot of things have done
[01:20.000 --> 01:29.000]  And it was all yellow
[01:29.000 --> 01:34.000]  You're scaling, oh yeah you're scaling, oh
[01:34.000 --> 01:40.000]  You're turning to something to fall
[01:40.000 --> 01:46.000]  You know I love you so
[01:48.000 --> 01:51.000]  You know I love you so
[01:59.000 --> 02:21.000]  I swam across, I jumped across from the view
[02:21.000 --> 02:29.000]  A lot of things you do, because you are all yellow
[02:29.000 --> 02:37.000]  I d

In [18]:
# Small Model
!whisper "/content/drive/MyDrive/Generative AI/Whisper - Speech to text/Coldplay_Yellow.mp3" --model small

Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:08.780]  MUSIC
[00:08.780 --> 00:11.620]  IN Locution
[00:11.620 --> 00:21.620]  oof
[00:21.620 --> 00:39.620]  Look at the stars, look how they shine for you
[00:39.620 --> 00:47.620]  And everything you do
[00:47.620 --> 00:52.620]  Yeah, they were all yellow, I came along
[00:52.620 --> 00:58.620]  I wrote a song for you
[00:58.620 --> 01:04.620]  And all the things you do
[01:04.620 --> 01:09.620]  And it was called yellow
[01:09.620 --> 01:15.620]  So then I took my turn
[01:15.620 --> 01:20.620]  And all the things I've done
[01:20.620 --> 01:28.620]  And it was all yellow
[01:28.620 --> 01:33.620]  Your skin, oh yeah, your skin
[01:33.620 --> 01:39.620]  Oh, it's turning into something to fall
[01:39.620 --> 01:47.620]  But you know, you know I love your soul
[01:47.620 --> 01:50.620]  You know I love your soul
[02:09.620 --> 02:20.620]  I sw

In [20]:
# Medium Model (English-only variant: medium.en, so no language detection step)
!whisper "/content/drive/MyDrive/Generative AI/Whisper - Speech to text/Coldplay_Yellow.mp3" --model medium.en

[00:00.000 --> 00:07.040]  Music
[00:30.000 --> 00:40.000]  Look at the stars, look how they shine for you
[00:40.000 --> 00:50.000]  And everything you do, yeah they were all yellow
[00:50.000 --> 00:58.000]  I came along, I wrote a song for you
[00:58.000 --> 01:04.000]  And all the things you do
[01:04.000 --> 01:09.000]  And it was called Yellow
[01:09.000 --> 01:15.000]  So then I took my turn
[01:15.000 --> 01:21.000]  Oh what have things I've done
[01:21.000 --> 01:29.000]  And it was all yellow
[01:29.000 --> 01:35.000]  Your skin, oh yeah your skin goes
[01:35.000 --> 01:40.000]  Turning into something beautiful
[01:40.000 --> 01:48.000]  Do you know, you know I love you so
[01:48.000 --> 01:51.000]  You know I love you so
[02:11.000 --> 02:20.000]  I swam across, I jumped across for you
[02:20.000 --> 02:26.000]  Oh what a thing to do
[02:26.000 --> 02:29.000]  Cause you were all yellow
[02:29.000 --> 02:37.000]  I drew a line, I drew a line for you
[02:38.000 --> 02:43.000] 

In [10]:
# Large Model (large-v1 checkpoint)
!whisper "/content/drive/MyDrive/Generative AI/Whisper - Speech to text/Coldplay_Yellow.mp3" --model large-v1

100%|█████████████████████████████████████| 2.87G/2.87G [01:41<00:00, 30.4MiB/s]
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:05.980]  Don't Forget To Subscribe For More Ahio Racing Videos!
[00:30.000 --> 00:40.000]  Look at the stars, look how they shine for you
[00:40.000 --> 00:50.000]  And everything you do, yeah they were all yellow
[00:50.000 --> 00:58.000]  I came along, I wrote a song for you
[00:58.000 --> 01:09.000]  And all the things you do, it was called yellow
[01:09.000 --> 01:21.000]  So then I took my turn, oh what a thing to have done
[01:21.000 --> 01:29.000]  And it was all yellow
[01:29.000 --> 01:34.000]  Your skin, oh yeah your skin and bones
[01:34.000 --> 01:40.000]  Turn into something beautiful
[01:40.000 --> 01:46.000]  And you know, you know I love you so
[01:48.000 --> 01:51.000]  You know I love you so
[01:59.000 --> 02:20.000]  I swam across, I jumped across for

# Results:
<img src = "https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdAgeQL%2FbtsiqgkZsNZ%2Fq7qYNri5TgF5Ns5dVLpd20%2Fimg.jpg">
