# Speech Transcription on IPUs using Whisper - Inference

This notebook demonstrates speech transcription on the IPU using the [Whisper implementation in the Hugging Face Transformers library](https://huggingface.co/spaces/openai/whisper) alongside [Optimum Graphcore](https://github.com/huggingface/optimum-graphcore).

Whisper is a versatile speech recognition model that can transcribe speech as well as perform multi-lingual translation and recognition tasks.
It was trained on diverse datasets to give human-level speech recognition performance without the need for fine-tuning.

[🤗 Optimum Graphcore](https://github.com/huggingface/optimum-graphcore) is the interface between the [🤗 Transformers library](https://huggingface.co/docs/transformers/index) and [Graphcore IPUs](https://www.graphcore.ai/products/ipu).
It provides tools for loading and parallelizing models on IPUs, and for training and fine-tuning on all the tasks already supported by Transformers, while remaining compatible out of the box with the Hugging Face Hub and every model available on it.

> **Hardware requirements:** The Whisper models `whisper-tiny`, `whisper-base` and `whisper-small` can run two replicas on the smallest IPU-POD4 machine. The most capable model, `whisper-large`, will need to use either an IPU-POD16 or a Bow Pod16 machine. Please contact Graphcore if you'd like assistance running model sizes that don't work in this simple example notebook.

[![Join our Slack Community](https://img.shields.io/badge/Slack-Join%20Graphcore's%20Community-blue?style=flat-square&logo=slack)](https://www.graphcore.ai/join-community)

## Environment setup

The best way to run this demo is on Paperspace Gradient's cloud IPUs because everything is already set up for you.

To run the demo using other IPU hardware, you need to have the Poplar SDK enabled. Refer to the [Getting Started guide](https://docs.graphcore.ai/en/latest/getting-started.html#getting-started) for your system for details on how to enable the Poplar SDK. Also refer to the [Jupyter Quick Start guide](https://docs.graphcore.ai/projects/jupyter-notebook-quick-start/en/latest/index.html) for how to set up Jupyter to be able to run this notebook on a remote IPU machine.

## Dependencies

IPU Whisper runs faster with the latest features, which are available in Poplar SDK 3.3.0 and later.

In [None]:
!apt update -y
!apt install -y ffmpeg

In [None]:
import poptorch

In [None]:
import re
import warnings

sdk_version = !popc --version
# Compare the version numerically: as strings, "3.10.0" would sort below "3.3".
match = re.search(r"(\d+)\.(\d+)\.(\d+)", sdk_version[0]) if sdk_version else None
if match and tuple(int(part) for part in match.groups()) >= (3, 3, 0):
    print("SDK check passed.")
    enable_sdk_features = True
else:
    warnings.warn(
        "SDK versions lower than 3.3 do not support all the functionality in this notebook, so performance will be reduced. "
        "We recommend you relaunch the Paperspace Notebook with the PyTorch SDK 3.3 image. "
        "You can use https://hub.docker.com/r/graphcore/pytorch-early-access",
        category=Warning, stacklevel=2)
    enable_sdk_features = False

If the above cell did not pass the SDK check, you can open a runtime with our SDK 3.3.0-EA enabled by clicking the Run on Gradient button below.

[![Run on Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://ipu.dev/kC8VBy)
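
Version numbers are safer to compare as tuples of integers than as text, because text comparison is lexicographic. The sketch below illustrates the difference. The helper name `parse_sdk_version` and the sample version string are assumptions made for illustration, not real `popc` output.

```python
import re

def parse_sdk_version(text):
    """Return the first X.Y.Z version found in `text` as a tuple of ints, or None."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    return tuple(int(part) for part in match.groups()) if match else None

# Made-up example string, used only to exercise the parser.
example = "POPLAR version 3.10.0 (abc123)"

numeric_ok = parse_sdk_version(example) >= (3, 3, 0)  # integer comparison
string_ok = "3.10.0" >= "3.3"                         # character-by-character comparison

print(numeric_ok)  # True
print(string_ok)   # False
```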


Install the dependencies the notebook needs.

In [None]:
# Install Optimum Graphcore and the other packages used in this notebook
%pip install "optimum-graphcore>=0.6, <0.7"
%pip install soundfile==0.12.1 librosa==0.10.0.post2 tokenizers==0.12.1 gradio
%pip install matplotlib
%matplotlib inline

## Running Whisper on the IPU

We start by importing the required modules, some of which are needed to configure the IPU.


In [None]:
# Generic imports
from datasets import load_dataset
import matplotlib.pyplot as plt
import librosa
import IPython
import random

# IPU-specific imports
from optimum.graphcore import IPUConfig
from optimum.graphcore.modeling_utils import to_pipelined
from optimum.graphcore.models.whisper import WhisperProcessorTorch

# HF-related imports
from transformers import WhisperForConditionalGeneration

The Whisper model is available on Hugging Face in several sizes, from `whisper-tiny` with 39M parameters to `whisper-large` with 1550M parameters.

We download `whisper-tiny.en`, the English-only variant of `whisper-tiny`, and run it on two IPUs.
The [Whisper architecture](https://openai.com/research/whisper) is an encoder-decoder Transformer, with the audio split into 30-second chunks.
For simplicity, one IPU is used for the encoder part of the graph and another for the decoder part.
The `IPUConfig` object configures how the model is pipelined across the IPUs.
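
To make the 30-second chunking concrete, here is a minimal sketch that computes where each window starts and ends in a longer recording. It only illustrates the idea and is not how the Whisper processor is implemented. The helper name `chunk_bounds` is hypothetical.

```python
def chunk_bounds(n_samples, sample_rate, chunk_seconds=30):
    """Return (start, end) sample indices of consecutive fixed-length windows."""
    chunk_len = sample_rate * chunk_seconds
    return [
        (start, min(start + chunk_len, n_samples))
        for start in range(0, n_samples, chunk_len)
    ]

# A 70-second recording at 16 kHz gives two full windows and a 10-second remainder.
bounds = chunk_bounds(70 * 16000, 16000)
print(bounds)
```

Each pair can be used to slice a waveform array, for example `audio[start:end]`.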

In [None]:
model_spec = "openai/whisper-tiny.en"

# Instantiate processor and model
processor = WhisperProcessorTorch.from_pretrained(model_spec)
model = WhisperForConditionalGeneration.from_pretrained(model_spec)

# Adapt whisper-tiny to run on the IPU
ipu_config = IPUConfig(ipus_per_replica=2)
pipelined_model = to_pipelined(model, ipu_config)
pipelined_model = pipelined_model.parallelize(
    for_generation=True, 
    use_cache=True, 
    batch_size=1, 
    max_length=250,
    on_device_generation_steps=16, 
    use_encoder_output_buffer=enable_sdk_features).half()

Now we can load the dataset and process an example audio file.
If precompiled models are not available, then the first run of the model triggers two graph compilations.
This means that our first test transcription could take a minute or two to run, but subsequent runs will be much faster.

In [None]:
# load the dataset and read an example sound file
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
test_sample = ds[2]
sample_rate = test_sample['audio']['sampling_rate']

def whisper_transcribe(data, rate):
    input_features = processor(data, return_tensors="pt", sampling_rate=rate).input_features.half()

    # This triggers a compilation, unless a precompiled model is available.
    sample_output = pipelined_model.generate(
        input_features,
        use_cache=True,
        do_sample=False,
        max_length=448, 
        min_length=3)
    transcription = processor.batch_decode(sample_output, skip_special_tokens=True)[0]
    return transcription

test_transcription = whisper_transcribe(test_sample["audio"]["array"], sample_rate)

In the next cell, we compare the expected text from the dataset with the transcribed result from the model.
There will typically be some small differences, but even `whisper-tiny` does a great job! It even adds punctuation.

You can listen to the audio and compare the model result yourself using the controls below.

In [None]:
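print(f"Expected: {test_sample['text']}\n")
print(f"Transcribed: {test_transcription}")
# Audio controls for the sample, so you can compare by ear
IPython.display.Audio(test_sample["audio"]["array"], rate=sample_rate)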
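
To put a number on the "small differences" between the expected and transcribed text, one common measure is the word error rate (WER): the word-level edit distance divided by the number of reference words. The sketch below is a simple illustration and not part of the original notebook. It assumes that case and punctuation should be ignored, since the model adds punctuation that the reference text lacks.

```python
import re

def normalize(text):
    """Lowercase and drop punctuation so formatting differences are not counted as errors."""
    return re.sub(r"[^a-z0-9' ]", "", text.lower()).split()

def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / max(len(ref), 1)

print(word_error_rate("HELLO WORLD", "Hello, world."))  # 0.0
print(word_error_rate("a b c d", "a b x d"))            # 0.25
```

In the notebook this could be called as `word_error_rate(test_sample["text"], test_transcription)`.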


The model only needs to be compiled once. Subsequent inferences will be much faster.
In the cell below, we repeat the exercise but with a random example from the dataset.

You might like to re-run this next cell multiple times to get different comparisons.

In [None]:
idx = random.randint(0, ds.num_rows - 1)
data = ds[idx]["audio"]["array"]

print(f"Example #{idx}\n")
print(f"Expected: {ds[idx]['text']}\n")
print(f"Transcribed: {whisper_transcribe(data, sample_rate)}")

IPython.display.Audio(data, rate=sample_rate, autoplay=True)
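
To see the difference between the first, compiling run and later runs, a small timing helper is enough. This is a sketch only: `timed` is a hypothetical helper, and a trivial stand-in workload is used so that the block runs without an IPU.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; in the notebook you would pass whisper_transcribe,
# data and sample_rate instead.
result, seconds = timed(sum, range(1000))
print(f"result={result}, took {seconds:.6f}s")
```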

## Running Flan-T5



In [None]:
import os

In [None]:
executable_cache_dir = os.getenv("POPLAR_EXECUTABLE_CACHE_DIR", "./exe_cache/")
num_t5_ipus = 4  # 4 IPUs select flan-t5-large, 16 select flan-t5-xl (see `size` below)

from optimum.graphcore import pipeline, IPUConfig

size = {4: "large", 16: "xl"}
flan_t5 = pipeline(
    "text2text-generation",
    model=f"google/flan-t5-{size[num_t5_ipus]}",
    ipu_config=IPUConfig.from_pretrained(
        f"Graphcore/t5-{size[num_t5_ipus]}-ipu", executable_cache_dir=executable_cache_dir
    ),
    max_input_length=896,
)

questions = [
    "Solve the following equation for x: x^2 - 9 = 0",
    "At what temperature does nitrogen freeze?",
    "In order to reduce symptoms of asthma such as tightness in the chest, wheezing, and difficulty breathing, what do you recommend?",
    "Which country is home to the tallest mountain in the world?"
]
for out in flan_t5(questions):
    print(out)

## Running Stable Diffusion

In [None]:
number_of_stable_diffusion_ipus = 8
# Keep the Stable Diffusion executables in their own subdirectory of the cache
executable_cache_dir = os.path.join(
    os.getenv("POPLAR_EXECUTABLE_CACHE_DIR", "/tmp/exe_cache/"), "stablediffusion_to-image"
)


In [None]:
import torch
from diffusers import DPMSolverMultistepScheduler

from optimum.graphcore.diffusers import get_default_ipu_configs, INFERENCE_ENGINES_TO_MODEL_NAMES, IPUStableDiffusionPipeline

In [None]:
engine = "stable-diffusion-v1-5"  # maps to "runwayml/stable-diffusion-v1-5"
model_name = INFERENCE_ENGINES_TO_MODEL_NAMES[engine]
# Environment variables are strings, so cast to int before using them as image sizes
image_width = int(os.getenv("STABLE_DIFFUSION_TXT2IMG_DEFAULT_WIDTH", default=512))
image_height = int(os.getenv("STABLE_DIFFUSION_TXT2IMG_DEFAULT_HEIGHT", default=512))

unet_ipu_config, text_encoder_ipu_config, vae_ipu_config, safety_checker_ipu_config = get_default_ipu_configs(
    engine=engine,
    width=image_width,
    height=image_height,
    n_ipu=number_of_stable_diffusion_ipus,
    executable_cache_dir=executable_cache_dir,
)
pipe = IPUStableDiffusionPipeline.from_pretrained(
    model_name,
    revision="fp16", 
    torch_dtype=torch.float16,
    requires_safety_checker=False,
    unet_ipu_config=unet_ipu_config,
    text_encoder_ipu_config=text_encoder_ipu_config,
    vae_ipu_config=vae_ipu_config,
    safety_checker_ipu_config=safety_checker_ipu_config
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

In [None]:
pipe("apple", height=image_height, width=image_width, guidance_scale=7.5);

## The demo

In [None]:
!gc-monitor --no-card-info

In [None]:
idx = 58
data = ds[idx]["audio"]["array"]  # load sample `idx`; otherwise `data` is left over from an earlier cell

transcription = whisper_transcribe(data, sample_rate)
print(f"Example #{idx}\n")
print(f"Expected: {ds[idx]['text']}\n")
print(f"Transcribed: {transcription}")

IPython.display.Audio(data, rate=sample_rate, autoplay=True)

In [None]:
idx = 58
data = ds[idx]["audio"]["array"]
# Only the last expression of a cell is rendered automatically, so display the audio explicitly
IPython.display.display(IPython.display.Audio(data, rate=sample_rate, autoplay=True))
transcription = whisper_transcribe(data, sample_rate)
print(transcription)
out = pipe(transcription, height=image_height, width=image_width, guidance_scale=7.5)
out.images[0]

In [None]:
sample_rate

In [None]:
import numpy as np
audio_prompts = (10, 30)
# style_prompt = "japanese, manga, high resolution, dynamic."
style_prompt = "modern art, smooth vibes"
LLM_prompt = lambda previous_story: f"""
[Long text]: {' '.join(previous_story)}
[summary]:
"""

def sample_audio(audio_prompts):
    return np.concatenate([ds[idx]["audio"]["array"] for idx in range(*audio_prompts)])

def generate_comic(audio, style_prompt, LLM_prompt=LLM_prompt, sample_rate=sample_rate, generate_image=True):
    print("Transcribing audio")
    # Whisper works on 30-second windows, so keep only the first 30 seconds
    if len(audio) > sample_rate * 30:
        print(f"  Audio was {len(audio) // sample_rate} seconds long, truncating at 30s")
        audio = audio[:sample_rate * 30]
    transcription = whisper_transcribe(audio, sample_rate)

    # One entry per sentence, skipping the empty piece left after the final full stop
    generated_text = [f"{t.strip()}." for t in transcription.split(".") if t.strip()]
    print("Generating story")
    # Each step asks Flan-T5 to summarise the story so far and appends the answer
    for i in range(5):
        prompt = LLM_prompt(generated_text)
        out_text = flan_t5(prompt)
        generated_text.append(out_text[0]['generated_text'])
    images = []
    if not generate_image:
        return transcription, generated_text, images, audio
    print("Generating images")
    for prompt in generated_text:
        # Separate the sentence from the style keywords so they do not run together
        out = pipe(f"{prompt} {style_prompt}", height=image_height, width=image_width, guidance_scale=7.5)
        images.append(out.images[0])
    print(generated_text)
    return transcription, generated_text, images, audio

transcription, generated_text, images, data = generate_comic(sample_audio(audio_prompts), style_prompt, LLM_prompt)
print("rendering audio")
print(transcription)
IPython.display.Audio(data, rate=sample_rate, autoplay=True)
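
The story-generation step feeds the growing text back into the language model several times. The sketch below isolates that loop so it can be tried without an IPU. `build_prompt` mirrors the notebook's prompt template, while `extend_story` and `fake_generate` are hypothetical names, the latter being a stand-in for the `flan_t5` pipeline.

```python
def build_prompt(previous_story):
    """Join the story so far and ask for a summary, as in the notebook's template."""
    return f"\n[Long text]: {' '.join(previous_story)}\n[summary]:\n"

def extend_story(sentences, generate, steps=5):
    """Repeatedly pass the growing story to `generate` and append its output."""
    story = list(sentences)
    for _ in range(steps):
        story.append(generate(build_prompt(story)))
    return story

def fake_generate(prompt):
    """Stand-in for the language model: reports how many words it was given."""
    return f"({len(prompt.split())} words so far)"

story = extend_story(["A man walked home.", "It rained."], fake_generate, steps=2)
print(story)
```

Because every new sentence is appended before the next prompt is built, later steps see the earlier generated sentences as part of their input.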

In [None]:
from matplotlib import pyplot as plt
import math
import pathlib
import textwrap

def comic_book_plotter(generated_text, images, style_prompt, image_per_page=6, line_break=40, fig_size=(8,15)):
    # File-name-safe version of the style prompt
    name = style_prompt.replace(" ","_").replace(",","_").replace(".","_").strip("_")
    comic_hash = hash("\n".join(generated_text)) + sum(hash(image.tobytes()) for image in images)
    figs = []
    paths = []
    for page_num, id_start in enumerate(range(0,len(generated_text), image_per_page)):
        comic_text = generated_text[id_start:id_start+image_per_page]
        comic_images = images[id_start:id_start+image_per_page]
        # Two panels per row; round up so an odd image_per_page still fits
        fig, axs = plt.subplots(math.ceil(image_per_page / 2), 2, squeeze=False)
        figs.append(fig)
        fig.set_size_inches(*fig_size)
        for ax in axs.flatten():
            ax.axis("off")  # also hides panels left empty on the last page
        for image, prompt, ax in zip(comic_images, comic_text, axs.flatten()):
            ax.imshow(image)
            # Wrap the caption at word boundaries, at most line_break characters per line
            ax.set_title(textwrap.fill(prompt, width=line_break))
        fig.suptitle(f"Style prompt: '{style_prompt}' page {page_num+1}", y=1.0)
        fig.tight_layout()

        pathlib.Path("/storage/comics/").mkdir(parents=True, exist_ok=True)
        image_path = f"/storage/comics/whisper_to_image_{name}_{comic_hash}-page-{page_num+1}.png"
        paths.append(image_path)
        fig.savefig(image_path, dpi=150)

    return figs, paths

figs, image_paths = comic_book_plotter(generated_text, images, style_prompt)
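
The page layout comes down to two small steps: splitting the captions into pages and wrapping each caption to a fixed width. The sketch below shows both with the standard library only. The helper name `paginate` and the sample captions are made up for illustration.

```python
import math
import textwrap

def paginate(items, per_page):
    """Split a list into consecutive pages of at most per_page items."""
    return [items[i:i + per_page] for i in range(0, len(items), per_page)]

captions = [
    f"Panel {n}: a caption long enough that it needs wrapping onto several lines"
    for n in range(1, 8)
]
pages = paginate(captions, per_page=6)
rows = math.ceil(6 / 2)  # two panels per row
wrapped = textwrap.fill(captions[0], width=40)

print(len(pages), rows)
print(wrapped)
```

`textwrap.fill` breaks only at whitespace here, so no characters are lost and no line exceeds the requested width.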

In [None]:
import gradio as gr
import librosa

def transcribe_to_comic(rate_and_audio, style_prompt):
    sample_rate, audio = rate_and_audio
    target_sr = 16000  # sampling rate expected by Whisper
    # Scale the integer PCM samples from the microphone to floats in [-1, 1] before resampling
    resample_audio = librosa.resample(y=audio.astype(float)/np.iinfo(audio.dtype).max, orig_sr=sample_rate, target_sr=target_sr)
    transcription, generated_text, images, data = generate_comic(resample_audio, style_prompt, sample_rate=target_sr)
    figs, paths = comic_book_plotter(generated_text, images, style_prompt, image_per_page=2, line_break=40, fig_size=(8,8))
    return transcription, paths

gr.Interface(
    fn=transcribe_to_comic,
    inputs=[
        gr.Audio(source="microphone", type="numpy"),
        "text",
    ], 
    outputs=["text", gr.Gallery(min_width=800, preview=True)]
).launch()
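
Microphone input may not always arrive as mono integer samples. The sketch below shows one way to normalise it before resampling. It assumes that stereo audio has shape `(samples, channels)` and that float input is already in the range [-1, 1]. The helper name `to_mono_float` is hypothetical.

```python
import numpy as np

def to_mono_float(audio):
    """Convert integer PCM or float audio, mono or (samples, channels), to mono float32."""
    audio = np.asarray(audio)
    if np.issubdtype(audio.dtype, np.integer):
        # Scale integers by the largest value of their type, e.g. 32767 for int16
        audio = audio.astype(np.float32) / np.iinfo(audio.dtype).max
    else:
        audio = audio.astype(np.float32)
    if audio.ndim == 2:
        audio = audio.mean(axis=1)  # average the channels
    return audio

mono = to_mono_float(np.array([32767, -32767, 0], dtype=np.int16))
stereo = to_mono_float(np.array([[32767, 0], [0, 0]], dtype=np.int16))
print(mono, stereo)
```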

## Creating a tunnel

In [None]:
%pip install ngrok

In [None]:
# Required to start ngrok tunnel in a notebook environment
import nest_asyncio
nest_asyncio.apply()

# ngrok token
import os
# Insert your ngrok authentication token here:
os.environ['NGROK_AUTHTOKEN'] = ""
import ngrok

# Needed to run coroutines in notebooks
import asyncio
loop = asyncio.get_event_loop()

tunnel = loop.run_until_complete(ngrok.werkzeug_develop())



In [None]:
import ngrok

import os
# Insert your ngrok authentication token here:
os.environ['NGROK_AUTHTOKEN'] = ""
public_url = ngrok.connect(port = '7872')
public_url

In [None]:
# On the ngrok free tier, only one active tunnel is allowed at a time. This is a problem if the notebook times out (or keeps running in the background in a closed page), as ngrok will then fail to create a new tunnel.
# I haven't been able to find out how to kill the old tunnels.
# ngrok.disconnect("https://2ba3-38-83-162-251.ngrok-free.app/")