**Setup**

In [None]:
!pip install git+https://github.com/openai/whisper.git
!pip install pydub
!pip install SpeechRecognition
!apt install -y ffmpeg
!pip install gradio
!pip install openai

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-b7uee669
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-b7uee669
  Resolved https://github.com/openai/whisper.git to commit 3e1780fd37686666f568be9c99f5b5e3e4f2eb92
  Preparing metadata (setup.py) ... done
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Reading package lists... Done
Building dependency tree       
Reading state information... Done
ffmpeg is already the newest version (7:4.2.7-0ubuntu0.1).
0 upgraded, 0 newly installed, 0 to remove and 22 not upgraded.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-w
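
Whisper's `load_audio` and pydub's MP3 handling both call the `ffmpeg` executable, so it can help to confirm it is on the `PATH` before going further. This is a minimal sketch using only the standard library; the helper name `ffmpeg_available` is made up for illustration.

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None


if not ffmpeg_available():
    print("ffmpeg not found; install it before loading audio")
```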

**Method 1: Simple (Error-Prone)**

In [None]:
import whisper

model = whisper.load_model('base')
result = model.transcribe('what.mp3', fp16=False)

100%|████████████████████████████████████████| 139M/139M [00:01<00:00, 109MiB/s]


In [None]:
print(result.keys())
result['text']

dict_keys(['text', 'segments', 'language'])


' What are you doing?'
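
Besides `text`, the result carries a `segments` list whose entries include `start`, `end` and `text`. The sketch below formats those entries as timestamped lines. The helper name `format_segments` and the timings in `sample` are assumptions for illustration, not output from the run above.

```python
def format_segments(result: dict) -> list[str]:
    """Turn the 'segments' of a transcription result into timestamped lines."""
    lines = []
    for seg in result.get("segments", []):
        text = seg["text"].strip()
        lines.append(f"[{seg['start']:6.2f}s -> {seg['end']:6.2f}s] {text}")
    return lines


# Hypothetical result shaped like the dict returned by model.transcribe();
# the start/end values are invented for this example.
sample = {
    "text": " What are you doing?",
    "segments": [{"start": 0.0, "end": 2.0, "text": " What are you doing?"}],
    "language": "en",
}

for line in format_segments(sample):
    print(line)
```

With a real run, `format_segments(result)` can be called directly on the dict returned by `model.transcribe(...)`.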

**Method 2: Low-Level Access**

In [None]:
import whisper

model = whisper.load_model("base")

# load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio("what.mp3")
audio = whisper.pad_or_trim(audio)

# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# decode the audio
options = whisper.DecodingOptions(fp16=False)
result = whisper.decode(model, mel, options)

# print the recognized text
print(result.text)

Detected language: en
What are you doing?
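
For a single spectrogram, `detect_language` returns a dict mapping language codes to probabilities, so it is easy to look past the top guess. A minimal sketch follows. The helper `top_languages` and the numbers in `sample_probs` are made up for illustration.

```python
def top_languages(probs: dict, k: int = 3) -> list:
    """Return the k most probable (language, probability) pairs, best first."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


# Invented probabilities standing in for the dict from model.detect_language(mel)
sample_probs = {"en": 0.97, "de": 0.01, "fr": 0.015}

for lang, p in top_languages(sample_probs, k=2):
    print(f"{lang}: {p:.3f}")
```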


**`Whisper` vs `Google speech_recognition`**

In [None]:
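import speech_recognition as sr

r = sr.Recognizer()
with sr.AudioFile('what.wav') as source:
    # record() reads the whole file; listen() may stop early at a pause
    audio_data = r.record(source)

# show_all=True returns the raw API response instead of only the best transcript
g = r.recognize_google(audio_data, show_all=True)
print(g.keys())

print(g['alternative'][0]['transcript'])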

dict_keys(['alternative', 'final'])
what are you doing
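
The two transcripts differ only in casing and punctuation. One way to compare them more systematically is to normalize both strings and compute a word error rate. This is a rough sketch, not how either library scores itself: `normalize` and `word_error_rate` are illustrative helpers, and neither transcript is a verified ground truth.

```python
import string


def normalize(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation, and split into words."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


whisper_text = " What are you doing?"
google_text = "what are you doing"
print(word_error_rate(whisper_text, google_text))
```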


--------

**For Longer Inputs**: The hosted OpenAI Whisper API rejects files larger than 25 MB, so longer recordings have to be trimmed or split first.

In [None]:
from pydub import AudioSegment

song = AudioSegment.from_mp3("good_morning.mp3")

# PyDub handles time in milliseconds
ten_minutes = 10 * 60 * 1000

first_10_minutes = song[:ten_minutes]

first_10_minutes.export("good_morning_10.mp3", format="mp3")
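
The cell above keeps only the first ten minutes. To cover a whole recording, the same slicing can be repeated over consecutive windows. The sketch below computes the window boundaries in milliseconds. `chunk_bounds` is a hypothetical helper and the 25-minute duration is an invented example. With pydub, `len(song)` gives the duration in milliseconds.

```python
def chunk_bounds(total_ms: int, chunk_ms: int) -> list[tuple[int, int]]:
    """Return (start, end) millisecond pairs covering [0, total_ms)."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    return [(start, min(start + chunk_ms, total_ms))
            for start in range(0, total_ms, chunk_ms)]


ten_minutes = 10 * 60 * 1000

# Invented duration: a 25-minute recording
bounds = chunk_bounds(25 * 60 * 1000, ten_minutes)
print(bounds)

# With pydub (assuming `song` is the AudioSegment loaded above):
# for i, (start, end) in enumerate(chunk_bounds(len(song), ten_minutes)):
#     song[start:end].export(f"good_morning_{i}.mp3", format="mp3")
```

Whether a ten-minute MP3 stays under a given size limit depends on its bitrate, so the chunk length may need adjusting.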