<a href="https://colab.research.google.com/github/keropudding/AudioInk/blob/main/audioink.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installing Whisper

The commands below will install the Python packages needed to use Whisper models and evaluate the transcription results.

In [1]:
! pip install git+https://github.com/openai/whisper.git
! pip install jiwer

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-nu8ljjjy
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-nu8ljjjy
  Resolved https://github.com/openai/whisper.git to commit 248b6cb124225dd263bb9bd32d060b6517e067f8
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
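
The jiwer package installed above is used to score transcripts, most commonly by word error rate (WER). As a rough illustration of what that metric measures, here is a minimal pure-Python sketch. The function name `word_error_rate` and the lowercase, whitespace-split normalization are assumptions of this example, not jiwer's implementation; in practice `jiwer.wer(reference, hypothesis)` would normally be used instead.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Hypothetical helper for illustration only; normalization here is simply
    lowercasing and splitting on whitespace.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

A score of 0.0 means the hypothesis matches the reference word for word; one substituted word out of three gives roughly 0.33.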


# Importing libraries and selecting a device

The following imports the libraries used in the rest of the notebook and selects the GPU when one is available.

In [2]:
import os
import numpy as np

try:
    import tensorflow  # required in Colab to avoid protobuf compatibility issues
except ImportError:
    pass

import torch
import pandas as pd
import whisper
import torchaudio

from tqdm.notebook import tqdm


DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Loading a base Whisper model

The following loads the English-only base.en model and sets the decoding options used for transcription later in the notebook.

In [3]:
model = whisper.load_model("base.en", device=DEVICE)
print(
    f"Model is {'multilingual' if model.is_multilingual else 'English-only'} "
    f"and has {sum(np.prod(p.shape) for p in model.parameters()):,} parameters."
)

# predict without timestamps for short-form transcription;
# use half precision only when running on a GPU
options = whisper.DecodingOptions(
    language="en", without_timestamps=True, fp16=torch.cuda.is_available()
)

Model is English-only and has 71,825,408 parameters.


# Record a snippet to feed into the model.

In [8]:
# Import the modules needed to record audio from the browser
from base64 import b64decode
import io

from google.colab import output
from IPython.display import Javascript, display

RECORD = """
const sleep  = time => new Promise(resolve => setTimeout(resolve, time))
const b2text = blob => new Promise(resolve => {
  const reader = new FileReader()
  reader.onloadend = e => resolve(e.srcElement.result)
  reader.readAsDataURL(blob)
})
var record = time => new Promise(async resolve => {
  stream = await navigator.mediaDevices.getUserMedia({ audio: true })
  recorder = new MediaRecorder(stream)
  chunks = []
  recorder.ondataavailable = e => chunks.push(e.data)
  recorder.start()
  await sleep(time)
  recorder.onstop = async ()=>{
    blob = new Blob(chunks)
    text = await b2text(blob)
    resolve(text)
  }
  recorder.stop()
})
"""

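def record(sec=10):
    """Record `sec` seconds from the browser microphone and save them to audio.wav."""
    display(Javascript(RECORD))
    # The JavaScript promise resolves to a data URL: "data:<mime>;base64,<payload>"
    s = output.eval_js('record(%d)' % (sec*1000))
    # Keep only the base64 payload after the comma and decode it to raw bytes
    b = b64decode(s.split(',')[1])
    # Note: the browser chooses the container (often WebM or Ogg rather than WAV);
    # the .wav name is kept because the later cells refer to it.
    with io.open('audio.wav', 'wb') as f:
        f.write(b)
    return 'audio.wav'  # returns a string with the file name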

print('Recording...')
# Record audio
audio_file = record(10)  # the function will return the name of the file

print('Done recording. Audio File location: ' + audio_file)

Recording...


<IPython.core.display.Javascript object>

Done recording. Audio File location: audio.wav
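
The recording cell passes audio from the browser to Python as a base64 data URL and keeps only the part after the comma. The sketch below shows that decoding step on its own, using only the standard library. The helper name `decode_data_url` and the sample payload are made up for illustration; the real payload is whatever the browser's MediaRecorder produces.

```python
import base64


def decode_data_url(data_url: str) -> tuple[str, bytes]:
    """Split a "data:<mime>;base64,<payload>" string into (mime, raw bytes).

    Hypothetical helper for illustration; it handles only base64 data URLs.
    """
    header, _, payload = data_url.partition(",")
    if not header.startswith("data:") or ";base64" not in header:
        raise ValueError("expected a base64-encoded data URL")
    # The MIME type sits between "data:" and the first ";"
    mime = header[len("data:"):].split(";")[0]
    return mime, base64.b64decode(payload)


# Made-up example payload standing in for recorded audio bytes
example = "data:audio/webm;base64," + base64.b64encode(b"fake audio bytes").decode("ascii")
mime, raw = decode_data_url(example)
print(mime, len(raw))
```

Inspecting the MIME type this way is one option for checking which container the browser actually recorded before naming the output file.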


# Listen to the audio file (optional)

In [10]:
from IPython.display import Audio

# Play the audio file
Audio('audio.wav')

# Transcribe audio to text using Whisper

In [9]:
# Load audio
waveform, sample_rate = torchaudio.load('audio.wav')

resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
waveform = resampler(waveform)

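# Downmix to mono, keeping the channel axis so that decode() always
# receives a batch and returns a list of results
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)

# Pad or trim the audio to the 30-second window Whisper expects
waveform = whisper.pad_or_trim(waveform)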

# Transform audio waveform to mel spectrogram
mel = whisper.log_mel_spectrogram(waveform)

# Move the tensor to the same device as the model
mel = mel.to(next(model.parameters()).device)

# Now run the model
results = model.decode(mel, options)

# Print the transcription
print(results[0].text)


So here's a recording that I'm going to be transcribing. So here's this is an example of how the notebook works.
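
The preprocessing in the transcription cell (downmix to mono, then pad or trim to a fixed length) can be sketched with NumPy alone, which may help when checking shapes. This is an approximation for illustration, not Whisper's own code: the names `to_mono` and `pad_or_trim_sketch` are made up here, and the only facts taken from Whisper are the 16 kHz sample rate and the 30-second window.

```python
import numpy as np

SAMPLE_RATE = 16_000                     # Whisper models expect 16 kHz audio
CHUNK_SECONDS = 30                       # length of one input window
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples per window


def to_mono(waveform: np.ndarray) -> np.ndarray:
    """Average a (channels, samples) array down to a 1-D (samples,) array."""
    return waveform.mean(axis=0)


def pad_or_trim_sketch(audio: np.ndarray, length: int = N_SAMPLES) -> np.ndarray:
    """Zero-pad or cut the last axis so that it has exactly `length` samples."""
    n = audio.shape[-1]
    if n > length:
        return audio[..., :length]
    # Pad only at the end of the last axis; leave any leading axes untouched
    pad_width = [(0, 0)] * (audio.ndim - 1) + [(0, length - n)]
    return np.pad(audio, pad_width)


# A 10-second stereo clip of ones stands in for a real recording
stereo = np.ones((2, 10 * SAMPLE_RATE), dtype=np.float32)
mono = to_mono(stereo)
padded = pad_or_trim_sketch(mono)
print(mono.shape, padded.shape)
```

A 10-second recording like the one above therefore ends up as 10 seconds of signal followed by 20 seconds of zeros before the mel spectrogram is computed.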
