<a href="https://colab.research.google.com/github/nancychenxizhong/openai-whisper-webapp/blob/modify-recording%2Fsaving%2Ftranslate/OpenAI_Whisper_ASR_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web App Demonstrating OpenAI's Whisper Speech Recognition Model

This is a Colab notebook that allows you to record or upload audio files to [OpenAI's free Whisper speech recognition model](https://openai.com/blog/whisper/). It is based on [an original notebook by @amrrs](https://github.com/amrrs/openai-whisper-webapp) and its updated version by [Pete Warden](https://colab.research.google.com/github/petewarden/openai-whisper-webapp/blob/main/OpenAI_Whisper_ASR_Demo.ipynb), with modifications by [nancychenxizhong](https://github.com/nancychenxizhong).

I was not able to run the Gradio functions for recording and transcribing, so I added new functions for recording, saving, and transcribing audio. I also tested Whisper on directly transcribing Japanese audio and translating it into English text.

The added recording function is from [this gist by Korakot Chaovavanich](https://gist.github.com/korakot/c21c3476c024ad6d56d5f48b0bca92be).

## Install the Whisper Code

In [2]:
! pip install git+https://github.com/openai/whisper.git -q

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 9.7 MB/s eta 0:00:00
  Building wheel for openai-whisper (pyproject.toml) ... done
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llmx 0.0.15a0 requires cohere, which is not installed.
llmx 0.0.15a0 requires openai, which is not installed.

## Load the ML Model

In [3]:
import whisper

model = whisper.load_model("base")


100%|████████████████████████████████████████| 139M/139M [00:00<00:00, 179MiB/s]


## Check we have a GPU

You should see the output `device(type='cuda', index=0)` below. If you don't, you may be on a CPU-only Colab instance, which will run more slowly. Go to `Runtime->Change Runtime Type` and select a GPU to fix this.

In [4]:
model.device

device(type='cuda', index=0)
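
If you would rather choose the device in code than read it off the cell output, a minimal sketch along these lines may help. It assumes PyTorch (which Whisper depends on) is importable; `pick_device` is a hypothetical helper name, not part of Whisper.

```python
def pick_device() -> str:
    """Return "cuda" when a GPU is visible to PyTorch, otherwise "cpu"."""
    try:
        import torch  # installed alongside Whisper
    except ImportError:
        # Assumption for this sketch: without PyTorch, fall back to CPU
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

The resulting string can be passed to the loader, for example `whisper.load_model("base", device=device)`.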

## Download Test Audio Files

This repository has a couple of pre-recorded MP3s to run through the transcribe function. You can listen to them with the audio widgets displayed below.

In [5]:
!git clone https://github.com/petewarden/openai-whisper-webapp

Cloning into 'openai-whisper-webapp'...
remote: Enumerating objects: 33, done.

In [6]:
from IPython.display import Audio
Audio("/content/openai-whisper-webapp/mary.mp3")

In [6]:
from IPython.display import Audio
Audio("/content/openai-whisper-webapp/daisy_HAL_9000.mp3")

## Define the Transcribe Function

Now that we've loaded the model and have the code, we can define the function that takes an audio file path as input and returns the recognized text (and logs what it thinks the language is).

In [7]:
def transcribe(audio):

    # load audio and pad/trim it to fit 30 seconds
    audio = whisper.load_audio(audio)
    audio = whisper.pad_or_trim(audio)

    # make log-Mel spectrogram and move to the same device as the model
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    # detect the spoken language
    _, probs = model.detect_language(mel)
    print(f"Detected language: {max(probs, key=probs.get)}")

    # decode the audio
    options = whisper.DecodingOptions()
    result = whisper.decode(model, mel, options)
    return result.text
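
The language detection step returns a mapping from language code to probability, and the function reports the most likely entry. The sketch below illustrates that selection with made-up numbers; the `probs` values are hypothetical, not real model output.

```python
# Hypothetical probabilities, shaped like the dict returned by model.detect_language
probs = {"en": 0.91, "ja": 0.06, "de": 0.03}

# max() with key=probs.get returns the key whose value is largest
detected = max(probs, key=probs.get)
print(f"Detected language: {detected}")

# Listing the top candidates can help when the detection looks wrong
top_two = sorted(probs, key=probs.get, reverse=True)[:2]
print(top_two)
```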


## Test with Pre-Recorded Audio

Before we bring up the UI to allow you to record your own live audio, we're going to run the `transcribe()` function on a couple of MP3s we've downloaded. You should see `Mary had a little lamb, its fleece was white as snow, and everywhere that Mary went, the lamb was sure to go.` for `mary.mp3`, which I recorded as an example of clear audio. The second file is a lot harder to transcribe, with very distorted audio, but the model does a good job with `Tazy, Tazy, Tazy. Give me your answer to time after crazy all for the love of you. It won't be a stylish marriage`. You'll notice the transcript is cut off after 30 seconds, which is the default length for this notebook. It can be extended, but that's outside of the scope of this documentation.

In [8]:
easy_text = transcribe("/content/openai-whisper-webapp/mary.mp3")
print(easy_text)

hard_text = transcribe("/content/openai-whisper-webapp/daisy_HAL_9000.mp3")
print(hard_text)

Detected language: en
Mary had a little lamb, its fleece was white as snow, and everywhere that Mary went, the lamb was sure to go.
Detected language: en
Tazy, Tazy, Tazy. Give me your answer to time after crazy all for the love of you. It won't be a stylish marriage
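
The 30-second cutoff comes from `whisper.pad_or_trim`, which fits the audio to a fixed window before decoding. As a rough sketch of that idea, the plain-Python function below pads or trims a list of samples to a fixed length, assuming Whisper's 16 kHz sample rate and 30-second window. `pad_or_trim_sketch` is a hypothetical stand-in, not the library function.

```python
SAMPLE_RATE = 16_000          # samples per second Whisper resamples audio to
CHUNK_SECONDS = 30            # length of the window the model decodes at once
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS


def pad_or_trim_sketch(samples, length=N_SAMPLES):
    """Trim samples beyond `length`, or right-pad with zeros up to `length`."""
    if len(samples) >= length:
        return list(samples[:length])  # everything after the window is dropped
    return list(samples) + [0.0] * (length - len(samples))


short_clip = [0.1] * (5 * SAMPLE_RATE)    # 5 seconds of audio
long_clip = [0.1] * (45 * SAMPLE_RATE)    # 45 seconds of audio

print(len(pad_or_trim_sketch(short_clip)))
print(len(pad_or_trim_sketch(long_clip)))
```

Both clips come out the same length, which is why anything past the first 30 seconds never reaches the decoder in this notebook.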


In [9]:
# Japanese audio uploaded by the author
test_text = transcribe("/content/test_whisper.mp3")

Detected language: ja


In [10]:
print(test_text)

まじかよーおー、数は大丈夫かあああの人がみんな離れしてるの知ってるけどもさすがにもしますね


In [18]:
# ref: https://gist.github.com/korakot/c21c3476c024ad6d56d5f48b0bca92be
from IPython.display import Javascript
from google.colab import output
from base64 import b64decode
from io import BytesIO
!pip -q install pydub
from pydub import AudioSegment

RECORD = """
const sleep  = time => new Promise(resolve => setTimeout(resolve, time))
const b2text = blob => new Promise(resolve => {
  const reader = new FileReader()
  reader.onloadend = e => resolve(e.srcElement.result)
  reader.readAsDataURL(blob)
})
var record = time => new Promise(async resolve => {
  stream = await navigator.mediaDevices.getUserMedia({ audio: true })
  recorder = new MediaRecorder(stream)
  chunks = []
  recorder.ondataavailable = e => chunks.push(e.data)
  recorder.start()
  await sleep(time)
  recorder.onstop = async ()=>{
    blob = new Blob(chunks)
    text = await b2text(blob)
    resolve(text)
  }
  recorder.stop()
})
"""

def record(sec=3):
  display(Javascript(RECORD))
  s = output.eval_js('record(%d)' % (sec*1000))
  b = b64decode(s.split(',')[1])
  audio = AudioSegment.from_file(BytesIO(b))
  return audio

In [19]:
# all imports
from IPython.display import Javascript
from google.colab import output
from base64 import b64decode

RECORD = """
const sleep  = time => new Promise(resolve => setTimeout(resolve, time))
const b2text = blob => new Promise(resolve => {
  const reader = new FileReader()
  reader.onloadend = e => resolve(e.srcElement.result)
  reader.readAsDataURL(blob)
})
var record = time => new Promise(async resolve => {
  stream = await navigator.mediaDevices.getUserMedia({ audio: true })
  recorder = new MediaRecorder(stream)
  chunks = []
  recorder.ondataavailable = e => chunks.push(e.data)
  recorder.start()
  await sleep(time)
  recorder.onstop = async ()=>{
    blob = new Blob(chunks)
    text = await b2text(blob)
    resolve(text)
  }
  recorder.stop()
})
"""

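def record(sec=3):
  # Inject the recorder script, then record for `sec` seconds in the browser
  display(Javascript(RECORD))
  s = output.eval_js('record(%d)' % (sec*1000))
  # The result is a data URL; the base64 payload follows the comma
  b = b64decode(s.split(',')[1])
  # Note: browsers typically record WebM or Ogg rather than WAV, despite the file name
  with open('audio.wav','wb') as f:
    f.write(b)
  return 'audio.wav'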
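
`record()` relies on the browser returning the recording as a data URL, whose base64 payload follows the first comma. The sketch below reproduces that decoding step on a made-up data URL; the payload bytes and MIME type are placeholders rather than real audio.

```python
from base64 import b64decode, b64encode

# Placeholder bytes standing in for the recorded audio
fake_audio = b"not real audio"

# Build a data URL shaped like the FileReader.readAsDataURL result
data_url = "data:audio/webm;base64," + b64encode(fake_audio).decode("ascii")

# Split once on the first comma: header on the left, base64 payload on the right
header, payload = data_url.split(",", 1)
decoded = b64decode(payload)

print(header)
print(decoded == fake_audio)
```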

In [35]:
record(sec=10)

<IPython.core.display.Javascript object>

'audio.wav'

In [36]:
Audio("/content/audio.wav")

In [37]:
recording_text = transcribe("/content/audio.wav")

Detected language: ja


In [38]:
print(recording_text)

ナイクのー!行くぜー!お見えない!トラックネギー!さあ!アムスイティブのね!


In [42]:
!whisper /content/audio.wav --language Japanese --task translate

[00:00.000 --> 00:02.000]  Let's go!
[00:02.000 --> 00:04.000]  Let's go, you guys!
[00:04.000 --> 00:06.000]  Take it!
[00:07.000 --> 00:10.000]  Now, let's have fun!
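
The same translation can also be run from Python instead of the command line, since `model.transcribe` accepts `language` and `task` options. A minimal sketch, assuming `model` is the object returned by `whisper.load_model`; `translate_to_english` is a hypothetical helper name.

```python
def translate_to_english(model, path, language="ja"):
    """Translate speech in `path` to English text using a loaded Whisper model."""
    # task="translate" asks Whisper for English output instead of a same-language transcript
    result = model.transcribe(path, language=language, task="translate")
    # The result is a dict; "text" holds the full output, "segments" the timed pieces
    return result["text"].strip()
```

For example, `translate_to_english(model, "/content/audio.wav")` should give text comparable to the command-line output above, without the timestamps.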


## Web Interface (Not Working)

After running this script, you should see two widgets below that you can use to record live audio and see the transcription, as described in the introduction.

### Install the Web UI Toolkit

We'll be using gradio to provide the widgets we need to do audio recording.

In [None]:
!pip uninstall -y tensorflow-probability

In [None]:
!pip install typing-extensions==4.8.0



In [None]:
!pip install gradio -q

In [None]:
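# An unquoted `>` is a shell redirect, so `fastapi>4.0` would write pip's output to a file named `4.0`; install without the specifier instead
!pip install fastapi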

In [None]:
import gradio as gr
import time

In [24]:
# gr.Interface(
#     title = 'OpenAI Whisper ASR Gradio Web UI',
#     fn=transcribe,
#     inputs=[
#         gr.inputs.Audio(source="microphone", type="filepath")
#     ],
#     outputs=[
#         "textbox"
#     ],
#     live=True).launch()

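demo = gr.Interface(
    transcribe,
    # type="filepath" hands transcribe() a file path, which whisper.load_audio expects
    gr.Audio(sources=["microphone"], type="filepath"),
    "text",
)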

In [25]:
demo.launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://d943c2bf0ebc7723f7.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




In [28]:
import gradio as gr
from transformers import pipeline
import numpy as np

transcriber = pipeline("automatic-speech-recognition", model="openai/whisper-base.en")

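# Note: this redefines transcribe() from earlier in the notebook
def transcribe(stream, new_chunk):
    sr, y = new_chunk
    y = y.astype(np.float32)
    # Normalize to [-1, 1]; skip silent chunks to avoid dividing by zero
    peak = np.max(np.abs(y))
    if peak > 0:
        y /= peak

    # Append the new chunk to the audio accumulated so far
    if stream is not None:
        stream = np.concatenate([stream, y])
    else:
        stream = y
    return stream, transcriber({"sampling_rate": sr, "raw": stream})["text"]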


demo = gr.Interface(
    transcribe,
    ["state", gr.Audio(sources=["microphone"], streaming=True)],
    ["state", "text"],
    live=True,
)

demo.launch()


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://b75d4de190c0025f04.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)





