# Building a Simple ASR Application with Whisper 🤫💬
---

This notebook uses [Hugging Face](https://huggingface.co/docs/transformers/model_doc/whisper) and [Gradio](https://gradio.app/) to build a simple demo.

Note that this application also requires the command-line tool [`ffmpeg`](https://ffmpeg.org/) to be installed on your system:

```bash
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on macOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
```
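
Before running the rest of the notebook, it can help to confirm that `ffmpeg` is actually visible on the `PATH`. The following is a minimal sketch using only the Python standard library; the helper name `ffmpeg_available` is hypothetical and not part of any library.

```python
import shutil


def ffmpeg_available(executable: str = "ffmpeg") -> bool:
    """Return True if the given executable can be found on the PATH."""
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if ffmpeg_available():
        print("ffmpeg found at", shutil.which("ffmpeg"))
    else:
        print("ffmpeg not found; install it with one of the commands above")
```

If the check fails right after installing, restarting the shell or the notebook runtime is usually enough for the updated `PATH` to be picked up.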

You can also try the [demo](https://huggingface.co/spaces/openai/whisper) hosted on [Hugging Face Spaces](https://huggingface.co/spaces/launch).

In [1]:
# !pip install transformers
!pip install gradio



In [2]:
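import gradio as gr
import torch
from transformers import pipeline

# Use the first GPU when one is available; otherwise fall back to the CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
print("Using device", device)

# Load the multilingual Whisper checkpoint as a speech-recognition pipeline.
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large",
    device=device,
)

def transcribe(audio):
    """Transcribe the audio file at the given path and return the text."""
    return pipe(audio)["text"]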

Using device cuda:0


Device set to use cuda:0


In [None]:
iface = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(sources=["upload"], type="filepath"),  # use ["microphone"] to record from the mic instead
    outputs="text",
    title="Whisper App",
    description="Realtime demo for automatic speech recognition using a Whisper model.",
)

iface.launch(debug=True)

Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://f224425f91ad53f71f.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
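
The first warning above concerns how a multilingual checkpoint chooses between transcription and translation. One way to make that choice explicit is to pass generation options through the pipeline's `generate_kwargs` argument. The sketch below only builds the options dictionary; the helper name `whisper_generate_kwargs` and the file name `sample.wav` are hypothetical, and it assumes a Transformers version in which Whisper's `generate` accepts `language` and `task`.

```python
from typing import Optional


def whisper_generate_kwargs(language: Optional[str] = None, translate: bool = False) -> dict:
    """Build generation options for a multilingual Whisper pipeline call.

    language: spoken language of the audio (e.g. "french"); None lets the model detect it.
    translate: if True, request English translation instead of same-language transcription.
    """
    kwargs = {"task": "translate" if translate else "transcribe"}
    if language is not None:
        kwargs["language"] = language
    return kwargs


# Usage with the `pipe` object defined earlier (assumed to be a multilingual checkpoint):
# text = pipe("sample.wav", generate_kwargs=whisper_generate_kwargs("french", translate=True))["text"]
```

Leaving `language` unset keeps the default behavior described in the warning, where the language is detected first and the audio is then transcribed in that language.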


## Model Information
---

You can also check out Whisper's official [GitHub repository](https://github.com/openai/whisper) for a more comprehensive (non-Hugging Face) tutorial.

## Available models and languages

There are five model sizes, four with English-only versions, offering speed and accuracy tradeoffs. Below are the names of the available models and their approximate memory requirements and relative speed.


|  Size  | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|:------:|:----------:|:------------------:|:------------------:|:-------------:|:--------------:|
|  tiny  |    39 M    |     `tiny.en`      |       `tiny`       |     ~1 GB     |      ~32x      |
|  base  |    74 M    |     `base.en`      |       `base`       |     ~1 GB     |      ~16x      |
| small  |   244 M    |     `small.en`     |      `small`       |     ~2 GB     |      ~6x       |
| medium |   769 M    |    `medium.en`     |      `medium`      |     ~5 GB     |      ~2x       |
| large  |   1550 M   |        N/A         |      `large`       |    ~10 GB     |       1x       |

The `.en` models for English-only applications tend to perform better, especially for the `tiny.en` and `base.en` models. We observed that the difference becomes less significant for the `small.en` and `medium.en` models.
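
The table and the note on `.en` models can be turned into a small selection helper. This is only a sketch: the VRAM figures are the approximate values from the table, the function name `pick_whisper_checkpoint` is hypothetical, and it assumes the Hub IDs follow the `openai/whisper-<size>` pattern used for `openai/whisper-large` earlier in this notebook.

```python
# Approximate VRAM needs in GB, copied from the table above (smallest to largest).
MODEL_VRAM_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large", 10),
]


def pick_whisper_checkpoint(available_vram_gb: float, english_only: bool = False) -> str:
    """Return the Hub ID of the largest model whose approximate VRAM need fits."""
    fitting = [size for size, vram in MODEL_VRAM_GB if vram <= available_vram_gb]
    if english_only:
        # There is no English-only variant of the large model.
        fitting = [size for size in fitting if size != "large"]
    if not fitting:
        raise ValueError("no Whisper size fits in the given VRAM budget")
    suffix = ".en" if english_only else ""
    return f"openai/whisper-{fitting[-1]}{suffix}"


print(pick_whisper_checkpoint(6))
print(pick_whisper_checkpoint(16, english_only=True))
```

The returned string can be passed as the `model` argument of `pipeline(...)`. Real memory use also depends on audio length, batch size, and numeric precision, so treat the budget as a rough guide.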

Whisper's performance varies widely depending on the language. The figure below shows a WER (Word Error Rate) breakdown by language on the FLEURS dataset using the `large-v2` model (the smaller the number, the better the performance).

![WER breakdown by language](https://raw.githubusercontent.com/openai/whisper/main/language-breakdown.svg)
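
For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The function below is a minimal sketch of that definition; the name `word_error_rate` is hypothetical, and it applies no text normalization (casing, punctuation), which published evaluations typically do before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.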