# Tutorial: Search audio transcriptions with Elasticsearch

## What problem are we solving?

Your organization likely has a lot of unstructured data, such as audio from recorded meetings, that is difficult to search.
Tools like Zoom and Teams have audio transcription features today, but they have two major limitations:

- They are not very accurate, especially for technical terms and non-native English accents.
- They are not easily searchable outside of the meeting platform.

This tutorial will show you how to use a state-of-the-art AI model to generate accurate transcriptions from audio files and sync them to an Elasticsearch index.
You'll be able to scale this approach up to keep track of all your organization's audio data, and search it from a single place.
This is a powerful way to make an important part of your organization's knowledge base more accessible.
You'll be able to use this tutorial as a blueprint for building search experiences for other types of unstructured data, such as images, video, and text.

## What you'll learn

This tutorial will walk you through the following steps:

1. Generate transcriptions from an audio file using the OpenAI [Whisper](https://openai.com/blog/whisper/) model [API](https://platform.openai.com/docs/api-reference/audio) in Python.
2. Sync the transcriptions to an Elasticsearch index, using the official [Elasticsearch Python client](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#auth-apikey).
3. Query the index to retrieve transcriptions, using a hybrid search (vector-based semantic search + keyword search) strategy.
4. Use an Elastic Search UI to easily search the transcriptions.
5. 🎁 **Bonus**: Summarize your transcription results using the Hugging Face [BART model](https://huggingface.co/transformers/model_doc/bart.html#bartsummarizationpipeline).
   Use this to get a quick overview of the contents of your audio files, and to find the most relevant ones.
   We can then update the documents that contain transcriptions in the Elasticsearch index with a `summary` field, making the summaries searchable.

First things first: let's choose where to run the notebook.

🏃🏽‍♀️Run this notebook:

- Locally using [Jupyter](https://docs.jupyter.org/en/latest/install.html)
- Online using [Google Colab](https://colab.research.google.com/?hl=en)

## 🧰 Requirements

For this example, you will need:

- Python 3.6 or later
- An Elastic deployment
   - We'll be using [Elastic Cloud](https://www.elastic.co/guide/en/cloud/current/ec-getting-started.html) for this example (available with a [free trial](https://cloud.elastic.co/registration?elektra=en-ess-sign-up-page))
- The [Elastic Python client](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/installation.html)
- The [OpenAI Python client](https://github.com/openai/openai-python)
- An OpenAI API key
   - You can get one by [signing up for the OpenAI API](https://beta.openai.com/)
- (_Optional for bonus section_) The [`huggingface_hub` library](https://huggingface.co/docs/huggingface_hub/quick-start)


## Create Elastic Cloud deployment

If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?fromURI=%2Fhome) for a free trial.

- Go to the [Create deployment](https://cloud.elastic.co/deployments/create) page
   - Select **Create deployment**

## Install packages and import modules

To get started, we'll need to connect to our Elastic deployment using the Python client.
Because we're using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify our deployment.

First we need to `pip` install the following packages:

- `elasticsearch`
- `openai`
- `huggingface-hub` (optional, for the bonus section)


In [4]:
!pip install elasticsearch
!pip install --upgrade openai
!pip install huggingface-hub


Next, let's import the modules we'll use.

In [5]:
from elasticsearch import Elasticsearch, helpers
import openai
import huggingface_hub # optional for step 5
from urllib.request import urlopen
import getpass
import requests

## Transcribe audio file(s)

We need some sample audio files to transcribe.
We're going to use a podcast interview with Brian Kernighan available in MP3 format at this [URL](https://op3.dev/e/https://cdn.changelog.com/uploads/podcast/484/the-changelog-484.mp3). 
The interview is about 96 minutes long.
First let's download the file and save it locally.
In your organization you might have audio files stored in a cloud storage bucket, or in a database.
You can adapt the code below to read the audio file from your storage system.

In [11]:
import os # used to get the current working directory

url = "https://op3.dev/e/https://cdn.changelog.com/uploads/podcast/484/the-changelog-484.mp3"

# Download the file from the URL with the requests library.
# The file is saved in the current working directory.
pwd = os.getcwd()
print(f"Downloading file into {pwd}")

r = requests.get(url)
r.raise_for_status() # stop here if the download failed
with open("kernighan.mp3", "wb") as file:
    file.write(r.content)




Downloading file into /Users/liamthompson/notebook-tests
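
The cell above holds the entire MP3 in memory before writing it to disk. For larger recordings, a streamed download may be preferable. The following is a minimal sketch that uses only the standard library; `download_file` is a hypothetical helper name and is not part of the original notebook.

```python
import shutil
from urllib.request import urlopen


def download_file(url, dest_path, chunk_size=1024 * 1024):
    """Stream the resource at `url` to `dest_path` in fixed-size chunks.

    Hypothetical helper: copies the response body to disk without
    holding the whole file in memory.
    """
    with urlopen(url) as response, open(dest_path, "wb") as out_file:
        shutil.copyfileobj(response, out_file, chunk_size)
    return dest_path
```

With the `url` variable defined earlier, a call such as `download_file(url, "kernighan.mp3")` would save the file in the current working directory.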


## Transcribe audio file

Now that we have our sample audio file, let's transcribe it with the [Whisper](https://openai.com/blog/whisper/) model, which is available through the OpenAI API.


In [14]:
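openai.api_key = getpass.getpass("Enter your OpenAI API key: ")

# The file was saved in the current working directory; change this to the path of your audio file
audio_path = "kernighan.mp3"

# openai.Audio.transcribe is the pre-1.0 client interface (this notebook ran with openai 0.27.8)
with open(audio_path, "rb") as audio_file:
    transcription = openai.Audio.transcribe("whisper-1", audio_file)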
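
The OpenAI audio endpoint limits the size of each upload (25 MB at the time of writing, according to OpenAI's documentation), so a long recording may need to be split before it is transcribed. The sketch below covers only the splitting arithmetic. `chunk_ranges` and the 10-minute chunk length are illustrative assumptions, and the commented-out lines assume the third-party `pydub` library, which the original notebook does not use.

```python
def chunk_ranges(total_ms, chunk_ms=10 * 60 * 1000):
    """Return (start_ms, end_ms) pairs that cover a recording of `total_ms`.

    Hypothetical helper: every range is at most `chunk_ms` long and the
    last range ends exactly at `total_ms`.
    """
    if total_ms < 0 or chunk_ms <= 0:
        raise ValueError("total_ms must be >= 0 and chunk_ms must be > 0")
    return [
        (start, min(start + chunk_ms, total_ms))
        for start in range(0, total_ms, chunk_ms)
    ]


# A 25-minute recording split into 10-minute chunks
print(chunk_ranges(25 * 60 * 1000))
# [(0, 600000), (600000, 1200000), (1200000, 1500000)]

# Assumed usage with pydub (requires ffmpeg), shown for illustration only:
# from pydub import AudioSegment
# audio = AudioSegment.from_mp3("kernighan.mp3")
# for i, (start, end) in enumerate(chunk_ranges(len(audio))):
#     audio[start:end].export(f"kernighan-{i}.mp3", format="mp3")
```

Cutting at fixed offsets can split a word in two, so the transcriptions of neighbouring chunks may need a light clean-up where they join.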

Let's see what our transcription looks like:

In [19]:
print(type(transcription))

# save the transcription to a file

with open("kernighan-transcription.json", "w") as file:
    file.write(str(transcription))

<class 'openai.openai_object.OpenAIObject'>
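
The saved file can be read back later without calling the API again. The sketch below assumes the file holds a JSON object with a `text` field, which is the field the rest of this tutorial searches on; `load_transcription_text` is a hypothetical helper.

```python
import json


def load_transcription_text(path):
    """Read a saved transcription file and return its `text` field.

    Assumes the file holds a JSON object such as {"text": "..."}.
    """
    with open(path, encoding="utf-8") as file:
        data = json.load(file)
    return data["text"]
```

For example, `load_transcription_text("kernighan-transcription.json")` would return the transcript as a plain Python string.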


## Connect Elasticsearch client

Now we have our transcription!
Let's connect our Elasticsearch Python client to our Elastic deployment, so we can sync the transcription to an index.

## Initialize the Elasticsearch client

Now we can instantiate the Elasticsearch client.
First we prompt the user for their Cloud ID and password.

🔐 NOTE: `getpass` enables us to prompt for credentials without echoing them to the terminal or hard-coding them in the notebook.

Then we create a `client` object, which is an instance of the `Elasticsearch` class.

In [20]:
# Found in the 'Manage Deployment' page
CLOUD_ID = getpass.getpass('Enter Elastic Cloud ID:  ')

# Password for the 'elastic' user generated by Elasticsearch
ELASTIC_PASSWORD = getpass.getpass('Enter Elastic password:  ')

# Create the client instance
client = Elasticsearch(
    cloud_id=CLOUD_ID,
    basic_auth=("elastic", ELASTIC_PASSWORD)
)

Confirm that the client has connected with this test.

In [21]:
print(client.info())

{'name': 'instance-0000000000', 'cluster_name': '9dd1e5c0b0d64796b8cf0746cf63d734', 'cluster_uuid': 'VeYvw6JhQcC3P-Q1-L9P_w', 'version': {'number': '8.9.0-SNAPSHOT', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': 'ac7d79178c3e57c935358453331efe9e9cc5104d', 'build_date': '2023-06-21T09:08:25.219504984Z', 'build_snapshot': True, 'lucene_version': '9.7.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0', 'transport_version': '8500019'}, 'tagline': 'You Know, for Search'}


Refer to https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#connect-self-managed-new to learn how to connect to a self-managed deployment.

Read https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#auth-apikey to learn how to connect using API keys.
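
As an alternative to the `elastic` user's password, the Python client also accepts an `api_key` argument. The sketch below only assembles the keyword arguments, so it runs without a deployment; `client_kwargs` is a hypothetical helper and the values shown are placeholders.

```python
def client_kwargs(cloud_id, api_key=None, password=None, username="elastic"):
    """Build keyword arguments for `Elasticsearch(...)`.

    Hypothetical helper: prefers an API key when one is given, otherwise
    falls back to basic authentication.
    """
    if api_key:
        return {"cloud_id": cloud_id, "api_key": api_key}
    if password:
        return {"cloud_id": cloud_id, "basic_auth": (username, password)}
    raise ValueError("Provide either an API key or a password")


# Placeholder values for illustration only
kwargs = client_kwargs("my-deployment:abc123", api_key="my-api-key")
print(sorted(kwargs))  # ['api_key', 'cloud_id']

# With the real client installed, this would become:
# client = Elasticsearch(**kwargs)
```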

## Index the transcription into Elasticsearch

Now we can create an index to store our transcriptions and index our first document.

In [24]:
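# Optional: create the index up front. Without this, Elasticsearch creates it on the first write.
# client.options(ignore_status=400).indices.create(index="transcriptions")

# Index the transcription as a document with a single `text` field
client.index(index="transcriptions", id=1, document={"text": transcription["text"]})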


ObjectApiResponse({'_index': 'transcriptions', '_id': '1', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, '_seq_no': 0, '_primary_term': 1})
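
To scale beyond a single document, the `helpers` module imported earlier provides `helpers.bulk`, which accepts an iterable of actions. The following sketch builds those actions for several transcriptions. `build_actions`, the `source_file` field, and the sample file names are illustrative assumptions, not part of the original notebook.

```python
def build_actions(transcriptions, index="transcriptions"):
    """Yield one bulk action per (file name, transcription text) pair.

    Hypothetical helper: document IDs are assigned sequentially from 1.
    """
    for doc_id, (file_name, text) in enumerate(transcriptions.items(), start=1):
        yield {
            "_index": index,
            "_id": doc_id,
            "_source": {"source_file": file_name, "text": text},
        }


# Made-up sample data for illustration
sample = {
    "kernighan.mp3": "First transcription text",
    "standup.mp3": "Second transcription text",
}
actions = list(build_actions(sample))
print(len(actions))  # 2

# With a connected client, the actions could be indexed in one request:
# helpers.bulk(client, actions)
```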

## Aside: Pretty printing Elasticsearch responses

Your API calls will return hard-to-read nested JSON.
We'll create a little function called `pretty_response` to return nice, human-readable outputs from our examples.

In [40]:
def insert_newlines(string, every=64):
    # Hard-wrap a string by inserting a newline every `every` characters
    return '\n'.join(string[i:i+every] for i in range(0, len(string), every))

def pretty_response(response):
    for hit in response['hits']['hits']:
        text = hit['_source']['text']
        # Highlights are only returned when the query requests them
        highlight = hit.get('highlight', {}).get('text', [])
        pretty_output = (f"\nText: {text} \n\nHighlight: {highlight}")
        print(insert_newlines(pretty_output))
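
Note that `insert_newlines` breaks at fixed character offsets, so words can be split across lines. If you prefer wrapping at word boundaries, the standard library's `textwrap.fill` is one option. A minimal sketch follows; `wrap_words` is a hypothetical name.

```python
import textwrap


def wrap_words(string, width=64):
    """Wrap `string` at word boundaries so no line exceeds `width` characters.

    Hypothetical alternative to `insert_newlines`, built on textwrap.fill.
    """
    return textwrap.fill(string, width=width)


print(wrap_words("once in a generation kind of a software developer", width=20))
```

`textwrap.fill` treats its input as a single paragraph and collapses existing newlines, so the text and the highlight would need to be wrapped separately.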



## Query the index

Now we can query the index to search our transcription.
Let's start with a simple keyword search.


In [43]:
response = client.search(
    index="transcriptions",
    query={"match": {"text": "generation"}},
    highlight={"fields": {"text": {}}},
)
pretty_response(response)


Text: You know, is he a standout in terms of just like once in 
a generation kind of a software developer or Are there a lot of 
people that you've seen that have been just as good as he was bu
t he happened to have that Nugget, you know, he had to be the ri
ght place the right time with the right idea and the right peopl
e. I think He's a singularity. I have never seen anybody else wh
o's in the same league as him You know, I've certainly met a lot
 of programmers who are very good Yeah, and you know some of my 
students sure the people I worked with at Bell Labs very good Bu
t I can is in a different universe entirely as far as I can tell
 and it's a combination of a bunch of things I mean just being a
ble to write code very quickly that works Very very well done co
de but also this insight into solving the right problem in a The
 right way and just doing that repeatedly over all kinds of diff
erent domains I've never seen anybody remotely like that in any 
setting at all he You kno