# Transcribing and Indexing Audio and Video in Pixeltable

In this tutorial, we'll build an end-to-end workflow for creating and indexing audio transcriptions of video data. We'll demonstrate how Pixeltable can be used to:

1. Extract audio data from video files;
1. Transcribe the audio using OpenAI Whisper;
1. Build a semantic index of the transcriptions using Hugging Face `sentence_transformers` models;
1. Search this index.

The tutorial assumes you're already somewhat familiar with Pixeltable. If this is your first time using Pixeltable, the [10-Minute Tour](https://docs.pixeltable.com/overview/ten-minute-tour) tutorial is a great place to start.

<div class="alert alert-block alert-info"><!-- mdx:none -->
<b>If you are running this tutorial in Colab:</b>
In order to make the tutorial run a bit snappier, let's switch to a GPU-equipped instance for this Colab session. To do that, click on the <code>Runtime -> Change runtime type</code> menu item at the top, then select the <code>GPU</code> radio button and click on <code>Save</code>.
</div>

## Create a Table for Video Data

Let's first install the Python packages we'll need for the demo. We're going to use the popular Whisper library, running locally. Later in the demo, we'll see how to use the OpenAI API endpoints as an alternative.

In [None]:
%pip install -q pixeltable openai openai-whisper sentence-transformers spacy
!python -m spacy download en_core_web_sm -q

Now we create a Pixeltable table to hold our videos.

In [None]:
import pixeltable as pxt

pxt.drop_dir('transcription_demo', force=True)  # Ensure a clean slate for the demo
pxt.create_dir('transcription_demo')

# Create a table to store our videos and workflow
video_table = pxt.create_table(
    'transcription_demo.video_table',
    {'video': pxt.Video}
)

video_table

Next, let's insert some video files into the table. We'll be using one-minute excerpts from a Lex Fridman podcast, and we'll begin by inserting two of them into our new table. Here our videos are given as `https` links, but Pixeltable also accepts local files and S3 URLs as input.

In [None]:
videos = [
    'https://github.com/pixeltable/pixeltable/raw/release/docs/resources/audio-transcription-demo/'
    f'Lex-Fridman-Podcast-430-Excerpt-{n}.mp4'
    for n in range(3)
]

video_table.insert({'video': video} for video in videos[:2])
video_table.show()

Now we'll add another column to hold extracted audio from our videos. The new column is an example of a _computed column_: it's updated automatically based on the contents of another column (or columns). In this case, the value of the `audio` column is defined to be the audio track extracted from whatever's in the `video` column.

In [None]:
from pixeltable.functions.video import extract_audio

video_table.add_computed_column(
    audio=extract_audio(video_table.video, format='mp3')
)
video_table.show()
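
To build intuition for what a computed column does, here is a minimal pure-Python sketch. It is not how Pixeltable is implemented: `ToyTable` and `fake_extract_audio` are hypothetical names used only for illustration. The idea is that the column is defined once as a function of other columns, backfilled for rows that already exist, and then applied to every row inserted afterwards.

```python
from typing import Any, Callable


class ToyTable:
    """Hypothetical stand-in for a table with computed columns (not Pixeltable's implementation)."""

    def __init__(self) -> None:
        self.rows: list[dict[str, Any]] = []
        self.computed: dict[str, Callable[[dict[str, Any]], Any]] = {}

    def add_computed_column(self, name: str, fn: Callable[[dict[str, Any]], Any]) -> None:
        # Register the definition, then backfill rows that already exist.
        self.computed[name] = fn
        for row in self.rows:
            row[name] = fn(row)

    def insert(self, row: dict[str, Any]) -> None:
        # Stored columns come from the caller; computed columns are derived here.
        new_row = dict(row)
        for name, fn in self.computed.items():
            new_row[name] = fn(new_row)
        self.rows.append(new_row)


def fake_extract_audio(row: dict[str, Any]) -> str:
    # Placeholder for real audio extraction: just swaps the file extension.
    return row['video'].rsplit('.', 1)[0] + '.mp3'


toy = ToyTable()
toy.insert({'video': 'a.mp4'})
toy.add_computed_column('audio', fake_extract_audio)  # backfills 'a.mp4'
toy.insert({'video': 'b.mp4'})  # computed automatically on insert
print([row['audio'] for row in toy.rows])  # ['a.mp3', 'b.mp3']
```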

If we look at the structure of the video table, we see that the new column is a computed column.

In [None]:
video_table

We can also add another computed column to extract metadata from the audio streams.

In [None]:
from pixeltable.functions.audio import get_metadata

video_table.add_computed_column(
    metadata=get_metadata(video_table.audio)
)
video_table.show()

## Create Transcriptions

Now we'll add a step to create transcriptions of our videos. As mentioned above, we're going to use the Whisper library for this, running locally. Pixeltable has a built-in function, `whisper.transcribe`, that serves as an adapter for the Whisper library's transcription capability. All we have to do is add a computed column that calls this function:

In [None]:
from pixeltable.functions import whisper

video_table.add_computed_column(
    transcription=whisper.transcribe(
        audio=video_table.audio,
        model='base.en'
    )
)

video_table.select(
    video_table.video,
    video_table.transcription.text
).show()

In order to index the transcriptions, we'll first need to split them into sentences. We can do this using Pixeltable's built-in `string_splitter` iterator.

In [None]:
from pixeltable.functions.string import string_splitter

sentences_view = pxt.create_view(
    'transcription_demo.sentences_view',
    video_table,
    iterator=string_splitter(
        video_table.transcription.text,
        separators='sentence'
    )
)

The `string_splitter` iterator populates the new view with the audio transcriptions broken into individual, one-sentence chunks.

In [None]:
sentences_view.select(
    sentences_view.pos,
    sentences_view.text
).show(8)
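
As a rough illustration of what sentence splitting produces, the sketch below breaks a transcript into one row per sentence, each tagged with its position. This is an assumption-laden stand-in: `split_sentences` is a hypothetical helper based on a naive regular expression, whereas the tutorial installs spaCy, which presumably handles abbreviations and other edge cases far better. The sample text is made up.

```python
import re


def split_sentences(text: str) -> list[dict[str, object]]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    # Emit one row per non-empty sentence, with its position in the source text.
    return [{'pos': i, 'text': s} for i, s in enumerate(p for p in parts if p)]


# Made-up sample transcript, for illustration only.
sample = 'What is happiness? It depends on who you ask. Some say it is a choice!'
for row in split_sentences(sample):
    print(row['pos'], row['text'])
```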

## Add an Embedding Index

Next, let's use the Hugging Face `sentence_transformers` library to create an embedding index of our sentences, attaching it to the `text` column of our `sentences_view`.

In [None]:
from pixeltable.functions.huggingface import sentence_transformer

sentences_view.add_embedding_index(
    'text',
    embedding=sentence_transformer.using(model_id='intfloat/e5-large-v2')
)

We can do a simple lookup to test our new index. The following snippet returns the results of a nearest-neighbor search on the input "What is happiness?"

In [None]:
sim = sentences_view.text.similarity(string='What is happiness?')

(
    sentences_view
    .order_by(sim, asc=False)
    .limit(10)
    .select(sentences_view.text, similarity=sim)
    .collect()
)
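
Conceptually, a nearest-neighbor lookup scores every stored embedding against the query embedding and keeps the best matches. The sketch below shows that idea with a brute-force scan and cosine similarity. The three-dimensional vectors are made up, `cosine_similarity` and `top_k` are hypothetical helpers, and the choice of cosine as the metric is an assumption for illustration; a real embedding index typically uses high-dimensional vectors and a faster search structure.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], items: list[tuple[str, list[float]]], k: int) -> list[tuple[str, float]]:
    # Score every item, then keep the k highest-scoring ones.
    scored = [(text, cosine_similarity(query, vec)) for text, vec in items]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up toy embeddings, for illustration only.
toy_embeddings = [
    ('sentence about joy', [0.9, 0.1, 0.0]),
    ('sentence about taxes', [0.0, 0.2, 0.9]),
    ('sentence about contentment', [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.0, 0.0]

results = top_k(query_vec, toy_embeddings, k=2)
for text, score in results:
    print(f'{score:.3f}  {text}')
```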

## Incremental Updates

_Incremental updates_ are a key feature of Pixeltable. Whenever a new video is added to the original table, all downstream computed columns, views, and indexes are updated automatically. Let's demonstrate this by adding a third video to the table and seeing how the updates propagate through to the index.

In [None]:
video_table.insert([{'video': videos[2]}])

In [None]:
video_table.select(
    video_table.video,
    video_table.metadata,
    video_table.transcription.text
).show()

In [None]:
sim = sentences_view.text.similarity(string='What is happiness?')

(
    sentences_view
    .order_by(sim, asc=False)
    .limit(20)
    .select(sentences_view.text, similarity=sim)
    .collect()
)

We can see the new results showing up in `sentences_view`.
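
The key property of an incremental update is that only the new rows trigger work; rows processed earlier are left alone. The following pure-Python sketch illustrates that property under simplifying assumptions. `ToyPipeline` and `fake_transcribe` are hypothetical names, and this is not a description of Pixeltable's internals.

```python
calls: list[str] = []


def fake_transcribe(video: str) -> str:
    # Placeholder for an expensive model call; records each invocation.
    calls.append(video)
    return f'transcript of {video}'


class ToyPipeline:
    """Hypothetical sketch: derived values are computed once per new row."""

    def __init__(self) -> None:
        self.transcripts: dict[str, str] = {}  # stands in for a computed column
        self.index: list[str] = []             # stands in for downstream index entries

    def insert(self, videos: list[str]) -> None:
        for video in videos:
            if video in self.transcripts:
                continue  # already processed; nothing to recompute
            self.transcripts[video] = fake_transcribe(video)
            self.index.append(self.transcripts[video])


pipeline = ToyPipeline()
pipeline.insert(['v0.mp4', 'v1.mp4'])
calls_before = len(calls)

pipeline.insert(['v2.mp4'])  # only the new video triggers work
new_calls = calls[calls_before:]
print(new_calls)  # ['v2.mp4']
```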

## Using the OpenAI API

This concludes our tutorial using the locally installed Whisper library. Sometimes, it may be preferable to use the OpenAI API rather than a locally installed library. In this section we'll show how this can be done in Pixeltable, simply by using a different function to construct our computed columns.

Since this section relies on calling out to the OpenAI API, you'll need to have an API key, which you can enter below.

In [None]:
import os
import getpass

if 'OPENAI_API_KEY' not in os.environ:
    os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')

In [None]:
from pixeltable.functions import openai

video_table.add_computed_column(
    transcription_from_api=openai.transcriptions(
        video_table.audio,
        model='whisper-1'
    )
)

Now let's compare the results from the local model and the API side-by-side.

In [None]:
video_table.select(
    video_table.video,
    video_table.transcription.text,
    video_table.transcription_from_api.text
).show()

They look pretty similar, which isn't surprising, since the OpenAI transcriptions endpoint runs on Whisper.

One difference is that the local library returns a lot more information about the internal behavior of the model. Note that we've been selecting `video_table.transcription.text` in the preceding queries, which pulls out just the `text` field of the transcription results. The actual results are a sizable JSON structure that includes a lot of metadata. To see the full structure, we can select `video_table.transcription` instead. Here's what it looks like (we'll select just one row, since it's a lot of output):

In [None]:
video_table.select(
    video_table.transcription,
    video_table.transcription_from_api
).show(1)
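
To show how such a result can be navigated once retrieved, here is a small sketch over a hand-written dictionary. It assumes the stored JSON mirrors the shape the local Whisper library returns (a top-level `text`, a `language`, and a list of `segments` with `start` and `end` times). The sample values are made up and heavily abbreviated; real segments carry additional fields.

```python
# Made-up, abbreviated sample shaped like a local Whisper result (illustration only).
sample_transcription = {
    'text': ' Hello there. Welcome back.',
    'language': 'en',
    'segments': [
        {'id': 0, 'start': 0.0, 'end': 1.2, 'text': ' Hello there.'},
        {'id': 1, 'start': 1.2, 'end': 2.5, 'text': ' Welcome back.'},
    ],
}

# The top-level 'text' field is what the earlier queries selected.
full_text = sample_transcription['text'].strip()

# Each segment pairs a span of text with its start and end time in seconds.
timed_segments = [
    (seg['start'], seg['end'], seg['text'].strip())
    for seg in sample_transcription['segments']
]

print(full_text)
for start, end, text in timed_segments:
    print(f'[{start:.1f}s to {end:.1f}s] {text}')
```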