# Web App Demonstrating OpenAI's Whisper Speech Recognition Model

This is a Colab notebook that allows you to record or upload audio files to [OpenAI's free Whisper speech recognition model](https://openai.com/blog/whisper/). This was based on [an original notebook by @amrrs](https://github.com/amrrs/openai-whisper-webapp), with added documentation and test files by [Pete Warden](https://twitter.com/petewarden).

To use it, choose `Runtime->Run All` from the Colab menu. If you're viewing this notebook on GitHub, follow [this link](https://colab.research.google.com/github/petewarden/openai-whisper-webapp/blob/main/OpenAI_Whisper_ASR_Demo.ipynb) to open it in Colab first. After about a minute, you should see a button at the bottom of the page with a `Record from microphone` link. Click it, grant permission to access your mic when asked, and then speak for up to 30 seconds. Once you're done, press `Stop recording`, and a transcript of the first 30 seconds of your speech should soon appear in the box to the right of the recording button. To transcribe more speech, click `Clear` in the left box and start over.

You can also upload your own audio samples using the folder icon on the left of this page. That gives you access to a file system you can upload to by dragging files into it. You can see examples of how to run the transcription in a couple of the cells below.

## Install the Whisper Code

In [None]:
! pip install git+https://github.com/openai/whisper.git -q



## Load the ML Model

In [None]:
import whisper

model = whisper.load_model("base")


100%|███████████████████████████████████████| 139M/139M [00:02<00:00, 60.2MiB/s]


## Check we have a GPU

You should see the output `device(type='cuda', index=0)` below. If you don't, you may be on a CPU-only Colab instance, which will run more slowly. Go to `Runtime->Change Runtime Type` and select a GPU to fix this.

In [None]:
model.device

device(type='cuda', index=0)
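
If you would rather have the notebook fall back gracefully than rely on the runtime setting, one option is to pick the device explicitly. The helper below is a minimal sketch: `pick_device` is a hypothetical name, not part of Whisper, and it relies only on `torch.cuda.is_available()` from PyTorch, which Whisper depends on.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # installed as a Whisper dependency
    except ImportError:
        # Assumption: without PyTorch there is no GPU path at all.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
# The result could then be passed on, e.g. whisper.load_model("base", device=device)
```

On a CPU-only instance this prints `Using device: cpu`, which is a quick signal that transcription will be slower.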

## Download Test Audio Files

This repository has a couple of pre-recorded MP3s to run through the transcribe function. You can listen to them with the audio widgets displayed below.

In [None]:
!git clone https://github.com/petewarden/openai-whisper-webapp

Cloning into 'openai-whisper-webapp'...
remote: Enumerating objects: 29, done.

In [None]:
from IPython.display import Audio
Audio("/content/openai-whisper-webapp/mary.mp3")

In [None]:
from IPython.display import Audio
Audio("/content/openai-whisper-webapp/daisy_HAL_9000.mp3")

In [None]:
# file upload while using Google Colab
from google.colab import files
uploaded = files.upload()

Saving 310-calling-all-tools-for-readmes.mp3 to 310-calling-all-tools-for-readmes.mp3


## Define the Transcribe Function

Now that the code is installed and the model is loaded, we can define the function that takes an audio file path as input and returns the recognized text. It also logs the language it detects.

In [None]:
def transcribe(audio):
    
    # load audio and pad/trim it to fit 30 seconds
    audio = whisper.load_audio(audio)
    audio = whisper.pad_or_trim(audio)

    # make log-Mel spectrogram and move to the same device as the model
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    # detect the spoken language
    _, probs = model.detect_language(mel)
    print(f"Detected language: {max(probs, key=probs.get)}")

    # decode the audio
    options = whisper.DecodingOptions()
    result = whisper.decode(model, mel, options)
    return result.text
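
The language pick in `transcribe()` is simply the dictionary key with the highest probability. The sketch below shows the same idea on a made-up `probs` dictionary (the values are illustrative, not real model output) and extends it to list the top few candidates. `top_languages` is a hypothetical helper introduced for this example.

```python
def top_languages(probs: dict[str, float], k: int = 3) -> list[tuple[str, float]]:
    """Return the k most likely (language, probability) pairs, best first."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


# Illustrative values only; a real dictionary comes from model.detect_language(mel).
example_probs = {"en": 0.91, "de": 0.04, "nl": 0.03, "fr": 0.02}

best = max(example_probs, key=example_probs.get)  # same expression as in transcribe()
print(best)
print(top_languages(example_probs, k=2))
```

Looking at the runner-up probabilities can help when a short or noisy clip is assigned an unexpected language.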


## Test with Pre-Recorded Audio

Before we bring up the UI to allow you to record your own live audio, we're going to run the `transcribe()` function on a couple of MP3s we've downloaded. You should see `Mary had a little lamb, its fleece was white as snow, and everywhere that Mary went, the lamb was sure to go.` for `mary.mp3`, which I recorded as an example of clear audio. The second file is a lot harder to transcribe, with very distorted audio, but the model does a good job with `Tazy, Tazy, Tazy. Give me your answer to time after crazy all for the love of you. It won't be a stylish marriage`. You'll notice the transcript is cut off after 30 seconds, which is the default length for this notebook. It can be extended, but that's outside of the scope of this documentation.
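
One way to think about going past 30 seconds is to split the waveform into consecutive 30-second windows and decode each one in turn. The sketch below only computes the window boundaries. It assumes audio resampled to 16 kHz, which is what `whisper.load_audio` produces, and `window_bounds` is a hypothetical helper rather than a Whisper function.

```python
SAMPLE_RATE = 16_000   # samples per second (assumed 16 kHz mono input)
WINDOW_SECONDS = 30    # length of one decoding window


def window_bounds(n_samples: int) -> list[tuple[int, int]]:
    """Split n_samples into consecutive (start, end) index pairs of at most 30 s."""
    step = SAMPLE_RATE * WINDOW_SECONDS
    return [(start, min(start + step, n_samples)) for start in range(0, n_samples, step)]


# A 70-second clip yields two full windows and one 10-second remainder.
print(window_bounds(70 * SAMPLE_RATE))
```

Each `(start, end)` pair could be used to slice the loaded audio array before padding and decoding it the same way `transcribe()` does. Cutting at fixed boundaries can split a word in half, so treat this as a starting point only.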

In [None]:
!pip install pydub

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Installing collected packages: pydub
Successfully installed pydub-0.25.1


In [None]:
from pydub import AudioSegment

# Load the uploaded file by giving its path
sound = AudioSegment.from_mp3("310-calling-all-tools-for-readmes.mp3")

# Select the portion we want to cut
StrtMin = 0
StrtSec = 0
EndMin = 0
EndSec = 60

# Convert the times to milliseconds, the unit pydub slices by
StrtTime = StrtMin*60*1000 + StrtSec*1000
EndTime = EndMin*60*1000 + EndSec*1000

# Extract the selected portion of the audio
extract = sound[StrtTime:EndTime]

# Save the extract as a new MP3 file
extract.export("portion.mp3", format="mp3")

<_io.BufferedRandom name='portion.mp3'>
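
pydub slices audio by milliseconds, so the minute and second values have to be converted first. A small helper keeps that arithmetic in one place. `to_ms` is a hypothetical name introduced here, not a pydub function.

```python
def to_ms(minutes: int, seconds: int) -> int:
    """Convert a minutes/seconds timestamp to milliseconds."""
    return (minutes * 60 + seconds) * 1000


start_ms = to_ms(0, 0)
end_ms = to_ms(1, 30)
print(start_ms, end_ms)  # 0 90000
# With pydub, the slice would then be sound[start_ms:end_ms].
```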

In [None]:
from IPython.display import Audio
Audio("portion.mp3")

In [None]:
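import pprint

# Transcribe the extract first so that `result` is defined before printing
result = model.transcribe("portion.mp3")

pp = pprint.PrettyPrinter(width=80, compact=True)
pp.pprint(result["text"])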

(' Hello and welcome to Python Bites where we deliver Python news and '
 'headlines directly to your earbuds. This is episode 310 recorded November '
 "15th 2022. I'm Michael Kennedy and I am Brian Arkin and I'm Adam Hopkins. "
 "Welcome Adam. Great to have you here. Awesome. Thank you. I'm excited to be "
 'here. Yeah. People mostly know you. I would imagine through Santa your web '
 "framework and tell people. Yeah, that's correct. Well, first I just I just "
 'noticed episode 310. So two more episodes and you guys pass the Python '
 "version. So congrats. And thank you. That's a milestone. Six years. We just "
 "passed six years. Two years. Yeah, it's exciting. I remember when you guys "
 'started it. So this is this is a great resource for the community. Cool. So '
 "just to introduce myself. I'm Adam Hopkins. I am one of the developers of "
 "the Santa project. My day to day job. I'm a director of software engineering "
 'for packet fabric where we you know day in.')


In [None]:
result = model.transcribe("/content/openai-whisper-webapp/mary.mp3")
print(result["text"])

 Mary had a little lamb, its fleece was white as snow and everywhere that Mary went the lamb was sure to go.


In [None]:
easy_text = transcribe("/content/openai-whisper-webapp/mary.mp3")
print(easy_text)

hard_text = transcribe("/content/openai-whisper-webapp/daisy_HAL_9000.mp3")
print(hard_text)

Detected language: en
Mary had a little lamb, its fleece was white as snow, and everywhere that Mary went, the lamb was sure to go.
Detected language: en
Tazy, Tazy, Tazy. Give me your answer to time after crazy all for the love of you. It won't be a stylish marriage


# Question Answering

In [1]:
%%capture
!pip install transformers

In [2]:
from transformers import pipeline

In [None]:
nlp = pipeline('question-answering', model='deepset/roberta-base-squad2', tokenizer='deepset/roberta-base-squad2')

In [4]:
inp = ""  # paste the text to query here, for example a transcript such as result["text"]

In [78]:
context = inp

# Pick one question to ask; swap in any of the commented alternatives.
question = "what is the patient's medical history?"
#question = input('What is your Question:\n')
#question = "are there any recent falls?"
#question = "pains?"
#question = "swelling?"
#question = "who is your primary doctor?"
question_set = {
    'context': context,
    'question': question
}

out = nlp(question_set)
print("\nAnswer: " + out['answer'])


Answer: Xxx Yay
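
To try several questions against the same context in one go, a loop over the pipeline is enough. The sketch below is written against any callable that accepts a dictionary with `context` and `question` keys and returns a dictionary with an `answer` key, which is how `nlp` is called above. `ask_all` and the stub `fake_qa` are hypothetical and exist only so the example runs without downloading a model.

```python
from typing import Callable


def ask_all(qa: Callable[[dict], dict], context: str, questions: list[str]) -> dict[str, str]:
    """Run every question against the same context and collect the answers."""
    return {q: qa({"context": context, "question": q})["answer"] for q in questions}


def fake_qa(inputs: dict) -> dict:
    """Stand-in for the real pipeline: returns the first word of the context."""
    return {"answer": inputs["context"].split()[0]}


answers = ask_all(fake_qa, "Mary had a little lamb", ["Who had a lamb?", "Who is it about?"])
for question, answer in answers.items():
    print(f"{question} -> {answer}")
```

In the notebook, `ask_all(nlp, context, [...])` would be the corresponding call, assuming `context` holds non-empty text.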


**FUTURE**

## Install the Web UI Toolkit

We'll be using gradio to provide the widgets we need to do audio recording.

In [None]:
! pip install gradio -q


In [None]:
import gradio as gr 
import time

## Web Interface

After running this script, you should see two widgets below that you can use to record live audio and see the transcription, as described in the introduction.

In [None]:

gr.Interface(
    title='OpenAI Whisper ASR Gradio Web UI',
    fn=transcribe,
    inputs=[
        # gr.inputs is deprecated, so use the component directly.
        # In gradio 4 and later the argument is sources=["microphone"].
        gr.Audio(source="microphone", type="filepath")
    ],
    outputs=[
        "textbox"
    ],
    live=True).launch()


Hint: Set streaming=True for Audio component to use live streaming.
Colab notebook detected. To show errors in colab notebook, set `debug=True` in `launch()`
Your interface requires microphone or webcam permissions - this may cause issues in Colab. Use the External URL in case of issues.
Running on public URL: https://23121.gradio.app

This share link expires in 72 hours. For free permanent hosting, check out Spaces: https://huggingface.co/spaces


(<gradio.routes.App at 0x7f399e32da50>,
 'http://127.0.0.1:7860/',
 'https://23121.gradio.app')
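
Depending on the gradio version, a live interface may call the function with `None` when the recording is cleared. That is an assumption worth checking against your installed version. A thin wrapper can guard against it. `guard_empty` is a hypothetical helper, and the stub stands in for `transcribe()` so the sketch runs on its own.

```python
from typing import Callable, Optional


def guard_empty(fn: Callable[[str], str]) -> Callable[[Optional[str]], str]:
    """Wrap fn so that a missing audio path yields an empty transcript."""
    def wrapped(path: Optional[str]) -> str:
        if not path:
            return ""
        return fn(path)
    return wrapped


def stub_transcribe(path: str) -> str:
    """Stand-in for transcribe(); echoes the path it was given."""
    return f"transcript of {path}"


safe = guard_empty(stub_transcribe)
print(repr(safe(None)))
print(safe("clip.mp3"))
```

In the notebook, passing `fn=guard_empty(transcribe)` to `gr.Interface` would be the corresponding change.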