# Transcribe audio to text

_This notebook is part of a tutorial series on [txtai](https://github.com/neuml/txtai), an AI-powered semantic search platform._

This notebook covers the transcription of audio files to text using models provided by Hugging Face.

# Install dependencies

Install `txtai` and all dependencies. Since this notebook is using optional pipelines, we need to install the pipeline extras package.

In [1]:
%%capture
!pip install git+https://github.com/neuml/txtai#egg=txtai[pipeline]

# Get test data
!wget -N https://github.com/neuml/txtai/releases/download/v3.5.0/tests.tar.gz
!tar -xvzf tests.tar.gz

# Create a Transcription instance

The `Transcription` instance is the main entry point for transcribing audio to text. The pipeline reduces transcription to a one-line call!

The pipeline executes logic to read audio files into memory, run the data through a machine learning model and output the results to text.
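To make the first of those steps concrete, here is a minimal sketch of reading audio into memory using only the Python standard library. It is not txtai's loader: `write_tone` and `read_audio` are hypothetical helpers, the audio is a synthetic tone rather than speech, and the reader assumes mono 16-bit WAV data.

```python
import io
import math
import struct
import wave


def write_tone(target, rate=16000, seconds=0.5, freq=440.0):
    """Write a mono 16-bit sine tone (a synthetic stand-in for speech)."""
    frames = int(rate * seconds)
    samples = (int(32767 * 0.3 * math.sin(2 * math.pi * freq * i / rate)) for i in range(frames))
    with wave.open(target, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(rate)
        out.writeframes(b"".join(struct.pack("<h", s) for s in samples))


def read_audio(source):
    """Read a mono 16-bit WAV into floats in [-1, 1] and return them with the sample rate."""
    with wave.open(source, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles mono 16-bit audio")
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    count = len(raw) // 2
    return [s / 32768.0 for s in struct.unpack("<%dh" % count, raw)], rate


# Round trip through an in-memory buffer instead of a file on disk
buffer = io.BytesIO()
write_tone(buffer)
buffer.seek(0)
samples, rate = read_audio(buffer)
print(rate, len(samples), len(samples) / rate)
```

A model would then receive the sample array together with the sample rate. The real pipeline handles this for us, which is why the cell that follows only needs a model name.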



In [2]:
%%capture

from txtai.pipeline import Transcription

# Create transcription model
transcribe = Transcription("facebook/wav2vec2-large-960h")

# Transcribe audio to text

The example below transcribes a list of audio files. For each file, it displays an audio player and then prints the transcribed text.

In [3]:
from IPython.display import Audio, display

files = ["Beijing_mobilises.wav", "Canadas_last_fully.wav", "Maine_man_wins_1_mil.wav", "Make_huge_profits.wav", "The_National_Park.wav", "US_tops_5_million.wav"]
files = ["txtai/%s" % x for x in files]

for x, text in enumerate(transcribe(files)):
  display(Audio(files[x]))
  print(text)
  print()


Baging mobilizes invasion craft along coast as tiwan tensions escalates



Canada's last fully intact ice shelf has suddenly collapsed forming a manhatten sized ice berg



Main man wins from lottery ticket



Make huge profits without working make up to one hundred thousand dollars a day



National park service warns against sacrificing slower friends in a bare attack



U s virus cases top a million



Overall, the results are solid. Each transcription is phonetically close to the audio, although some words are misspelled (for example, "Baging" and "tiwan").

# OpenAI Whisper

**Update**: In September 2022, [OpenAI Whisper](https://github.com/openai/whisper) was released, bringing a dramatic improvement in transcription quality.

These models will no doubt roll into the Hugging Face Transformers library and make it possible to use the standard txtai transcription pipeline. But as of now, the model is only available on GitHub. Let's give it a try. 

In [4]:
%%capture
!pip install git+https://github.com/openai/whisper.git git+https://github.com/neuml/txtai#egg=txtai[api]

Next we'll create a custom transcription pipeline using Whisper. We'll create the pipeline as a Python file so it can be used both in this Python instance and in an API instance.

In [5]:
%%writefile transcription.py
import whisper

# Create Whisper pipeline
class Transcription:
  def __init__(self, path="small"):
    self.model = whisper.load_model(path)

  def __call__(self, files):
    return [self.model.transcribe(f, fp16=False, language="english")["text"].strip() for f in files]

Writing transcription.py


Let's transcribe the files!

In [10]:
from transcription import Transcription

# Transcribe files
transcribe = Transcription()
for text in transcribe(files):
  print(text)

Beijing mobilizes invasion craft along coast as Taiwan tensions escalate.
Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan sized iceberg.
Maine Man wins from lottery ticket.
make huge profits without working, make up to $100,000 a day.
National Park Service warns against sacrificing slower friends in a bear attack.
US virus cases top a million.


The results were transcribed with near-perfect accuracy. Amazing!
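One way to put a rough number on the difference between the two models is word error rate (WER). The sketch below is a plain-Python illustration, not part of txtai or Whisper. It assumes the Whisper output can stand in as the reference text, which is only an approximation since the notebook has no ground-truth transcripts. The `words` and `wer` helpers are hypothetical names.

```python
import re


def words(text):
    """Lowercase and strip punctuation so only word differences are counted."""
    return re.sub(r"[^\w\s']", "", text.lower()).split()


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by the reference length."""
    ref, hyp = words(reference), words(hypothesis)
    # previous[j]: edit distance between the reference prefix seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        current = [i]
        for j, h in enumerate(hyp, 1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (r != h),  # substitution or match
            ))
        previous = current
    return previous[-1] / max(len(ref), 1)


# Assumption: Whisper output is treated as the reference, wav2vec2 output as the hypothesis
pairs = [
    ("Beijing mobilizes invasion craft along coast as Taiwan tensions escalate.",
     "Baging mobilizes invasion craft along coast as tiwan tensions escalates"),
    ("National Park Service warns against sacrificing slower friends in a bear attack.",
     "National park service warns against sacrificing slower friends in a bare attack"),
]

for reference, hypothesis in pairs:
    print(round(wer(reference, hypothesis), 3), hypothesis)
```

Lower is better. Lowercasing and stripping punctuation first means the score reflects wrong words rather than formatting differences between the two models.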

This can also be run as a txtai application or API instance. Let's try a full indexing workflow with a txtai application.

In [7]:
%%writefile workflow.yml
writable: true

embeddings:
  path: sentence-transformers/nli-mpnet-base-v2
  content: true

transcription.Transcription:

workflow:
  index:
    tasks:
      - transcription.Transcription
      - index

Writing workflow.yml
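The `workflow` section above lists two tasks that run in order, with the output of the first feeding the second. As a rough mental model only, and not txtai's actual implementation, that chaining can be sketched with plain callables. `CannedTranscription`, `ListIndex` and `run` are hypothetical stand-ins, and the canned text is copied from the Whisper output above.

```python
class CannedTranscription:
    """Hypothetical stand-in for a transcription pipeline: maps file paths to known text."""

    def __init__(self, lookup):
        self.lookup = lookup

    def __call__(self, files):
        # Same contract as the Whisper class: list of paths in, list of strings out
        return [self.lookup[f] for f in files]


class ListIndex:
    """Toy index that stores rows in a list and assigns positional ids."""

    def __init__(self):
        self.rows = []

    def __call__(self, texts):
        start = len(self.rows)
        self.rows.extend(texts)
        return [(str(start + i), text) for i, text in enumerate(texts)]


def run(tasks, elements):
    """Feed the output of each task into the next, in the order listed."""
    for task in tasks:
        elements = task(elements)
    return elements


fake = CannedTranscription({
    "txtai/Maine_man_wins_1_mil.wav": "Maine Man wins from lottery ticket.",
    "txtai/US_tops_5_million.wav": "US virus cases top a million.",
})
index = ListIndex()

rows = run([fake, index], ["txtai/Maine_man_wins_1_mil.wav", "txtai/US_tops_5_million.wav"])
print(rows)
```

Any object with this list-in, list-out `__call__` shape can be swapped in as a task, which is what lets the custom `transcription.Transcription` class be referenced directly from the YAML file.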


In [11]:
from txtai.app import Application

app = Application("workflow.yml")

list(app.workflow("index", files))
app.search("feel good story", 1)

[{'id': '2',
  'text': 'Maine Man wins from lottery ticket.',
  'score': 0.12858600914478302}]

This workflow transcribed the input files, loaded the transcriptions into an embeddings index and finally ran a search. The last thing we'll do is run the workflow as an API instance.

In [9]:
import os
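# Start the API in the background. os.system runs /bin/sh, so no leading "!" and POSIX redirection instead of "&>"
os.system('CONFIG=workflow.yml uvicorn "txtai.api:app" > api.log 2>&1 &')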
!sleep 30

# Run indexing workflow
!curl -s -o /dev/null \
  -X POST "http://localhost:8000/workflow" \
  -H "Content-Type: application/json" \
  -d '{"name":"index", "elements":["txtai/Beijing_mobilises.wav", "txtai/Canadas_last_fully.wav", "txtai/Maine_man_wins_1_mil.wav", "txtai/Make_huge_profits.wav", "txtai/The_National_Park.wav", "txtai/US_tops_5_million.wav"]}'

# Test API search
!curl "http://localhost:8000/search?query=feel+good+story&limit=1"

[{"id":"2","text":"Maine Man wins from lottery ticket.","score":0.12858600914478302}]

Once again, the API returns the same result as the application run in Python.
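The same two calls can also be made from Python instead of `curl`. The sketch below uses only the standard library and mirrors the endpoints and payload from the cell above. `workflow_request`, `search_url` and `call` are hypothetical helper names, and the base URL assumes the API is listening on `localhost:8000` as it is in this notebook.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE = "http://localhost:8000"  # assumption: same host and port as the curl calls


def workflow_request(name, elements, base=BASE):
    """Build the POST request for the /workflow endpoint."""
    body = json.dumps({"name": name, "elements": elements}).encode("utf-8")
    return Request(
        base + "/workflow",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search_url(query, limit=1, base=BASE):
    """Build the GET url for the /search endpoint."""
    return base + "/search?" + urlencode({"query": query, "limit": limit})


def call(request_or_url):
    """Send the request and decode the JSON response (requires the API to be running)."""
    with urlopen(request_or_url) as response:
        return json.loads(response.read().decode("utf-8"))


request = workflow_request("index", ["txtai/Maine_man_wins_1_mil.wav"])
print(request.get_method(), request.full_url)
print(search_url("feel good story"))

# With the API instance running:
# call(request)
# call(search_url("feel good story"))
```

Building the requests separately from sending them keeps the example runnable without a server. The two commented calls at the end are the ones that actually reach the API.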

# Wrapping up

There is a lot of development in the audio transcription space. Keep an eye out for when Whisper is mainstreamed as a Hugging Face model. In the meantime, it's only a couple of lines of code to create a custom pipeline and worth it given the accuracy bump!