# Transcribe audio files with Whisper

Convert speech to text locally using OpenAI's open-source Whisper model, with no API key needed.


## Problem

You have audio or video files that need transcription. Long files are memory-intensive to process at once, so you need to split them into manageable chunks.

| File | Duration | Challenge |
|------|----------|-----------|
| podcast.mp3 | 60 min | Too long to process at once |
| interview.mp4 | 30 min | Need to extract audio first |
| meeting.wav | 2 hours | Must chunk for memory efficiency |


## Solution

**What's in this recipe:**
- Transcribe audio files locally with Whisper (no API key)
- Automatically chunk long files
- Extract and transcribe audio from videos

You create a view with `AudioSplitter` to break long files into chunks, then add a computed column that transcribes each chunk. Whisper runs locally on your machine, so no API calls are needed.


### Setup


In [17]:
%pip install -qU pixeltable openai-whisper


Note: you may need to restart the kernel to use updated packages.


In [18]:
import pixeltable as pxt
from pixeltable.iterators import AudioSplitter
from pixeltable.functions import whisper


### Load audio files


In [19]:
# Create a fresh directory
pxt.drop_dir('audio_demo', force=True)
pxt.create_dir('audio_demo')


Created directory 'audio_demo'.


<pixeltable.catalog.dir.Dir at 0x37e827f20>

In [20]:
# Create table for audio files
audio = pxt.create_table('audio_demo.files', {'audio': pxt.Audio})


Created table 'files'.


In [21]:
# Insert a sample audio file (video files also work - audio is extracted automatically)
audio.insert([
    {'audio': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/audio-transcription-demo/Lex-Fridman-Podcast-430-Excerpt-0.mp4'}
])


Inserting rows into `files`: 1 rows [00:00, 548.78 rows/s]
Inserted 1 row with 0 errors.


1 row inserted, 2 values computed.

### Split into chunks

Create a view that splits the audio into 30-second chunks with a 2-second overlap:


In [22]:
# Split audio into chunks for transcription
chunks = pxt.create_view(
    'audio_demo.chunks',
    audio,
    iterator=AudioSplitter.create(
        audio=audio.audio,
        chunk_duration_sec=30.0,  # 30-second chunks
        overlap_sec=2.0,          # 2-second overlap for context
        min_chunk_duration_sec=5.0  # Drop the final chunk if shorter than 5 seconds
    )
)


Inserting rows into `chunks`: 2 rows [00:00, 909.04 rows/s]


In [23]:
# View the chunks
print(f"Created {chunks.count()} chunks")
chunks.select(chunks.start_time_sec, chunks.end_time_sec).collect()


Created 2 chunks


| start_time_sec | end_time_sec |
|----------------|--------------|
| 0.0 | 30.0 |
| 28.003 | 58.003 |


### Transcribe with Whisper

Add a computed column that transcribes each chunk:


In [24]:
# Add transcription column (runs locally - no API key needed)
chunks.add_computed_column(
    transcription=whisper.transcribe(
        audio=chunks.audio_chunk,
        model='base.en'  # Options: tiny.en, base.en, small.en, medium.en, large
    )
)




Added 2 column values with 0 errors.


2 rows updated, 2 values computed.

In [25]:
# Extract just the text
chunks.add_computed_column(text=chunks.transcription.text)


Added 2 column values with 0 errors.


2 rows updated, 2 values computed.

In [26]:
# View transcriptions with timestamps
chunks.select(chunks.start_time_sec, chunks.end_time_sec, chunks.text).collect()


| start_time_sec | end_time_sec | text |
|----------------|--------------|------|
| 0.0 | 30.0 | of experiencing self versus remembering self. I was hoping you can give a simple answer of how we should live life. Based on the fact that our memories could be a source of happiness or could be the primary source of happiness, that an event when experienced bears its fruits the most when it's remembered over and over and over and over. |
| 28.003 | 58.003 | over and over and over and over and maybe there is some wisdom in the fact that we can control to some degree how we remember how we evolve our memory of it such that it can maximize the long-term happiness of that repeated experience. Okay, well first I'll say I wish I could take you on the road with me. That was such a great description. Can I be your opening ax? Oh my God, no, I'm going to open for you dude. Otherwise it's like, you know, everybody leaves. |
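Because the chunks overlap by 2 seconds, the end of the first transcription and the start of the second repeat the same words ("over and over and over and over"). If you need one continuous transcript, a seam-merging step can help. The following is a minimal sketch in plain Python, not part of Pixeltable or Whisper: `merge_overlapping` is a hypothetical helper that assumes both chunks transcribe the overlapping audio with the same words, compared case-insensitively and ignoring surrounding punctuation.

```python
import string


def _norm(word: str) -> str:
    """Lowercase a word and strip surrounding punctuation for comparison."""
    return word.strip(string.punctuation).lower()


def merge_overlapping(prev: str, nxt: str, max_overlap_words: int = 20) -> str:
    """Join two chunk transcripts, dropping words duplicated at the seam.

    Hypothetical helper: finds the longest run of words (up to
    max_overlap_words) that ends `prev` and also starts `nxt`,
    then keeps that run only once.
    """
    prev_words = prev.split()
    next_words = nxt.split()
    limit = min(max_overlap_words, len(prev_words), len(next_words))
    # Try the longest candidate overlap first, then shrink.
    for k in range(limit, 0, -1):
        tail = [_norm(w) for w in prev_words[-k:]]
        head = [_norm(w) for w in next_words[:k]]
        if tail == head:
            return " ".join(prev_words + next_words[k:])
    # No shared words at the seam: plain concatenation.
    return " ".join(prev_words + next_words)


merged = merge_overlapping(
    "it is remembered over and over.",
    "over and over and maybe there is wisdom",
)
print(merged)
```

This word-matching approach is a heuristic. Whisper may transcribe the overlapping audio slightly differently in each chunk, and genuinely repeated phrases can be merged too aggressively, so treat the result as a starting point and spot-check the seams.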


## Explanation

**Whisper models:**

| Model | Speed | Quality | Best for |
|-------|-------|---------|----------|
| `tiny.en` | Fastest | Basic | Quick tests |
| `base.en` | Fast | Good | General use |
| `small.en` | Medium | Better | Higher accuracy |
| `medium.en` | Slow | Great | Professional quality |
| `large` | Slowest | Best | Maximum accuracy |

Models ending in `.en` are English-only and faster. For multilingual support, drop the `.en` suffix (for example, `base` instead of `base.en`); `large` is multilingual only.

**AudioSplitter parameters:**

| Parameter | Description |
|-----------|-------------|
| `chunk_duration_sec` | Duration of each chunk in seconds |
| `overlap_sec` | Overlap between chunks (helps with word boundaries) |
| `min_chunk_duration_sec` | Drop the last chunk if shorter than this |
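To see how these three parameters interact, the sketch below computes idealized chunk boundaries in plain Python. It is an illustration under stated assumptions, not Pixeltable's implementation: `chunk_bounds` is a hypothetical function, it assumes each chunk starts `chunk_duration_sec - overlap_sec` after the previous one, and the 60-second file length is assumed for the example.

```python
from typing import List, Tuple


def chunk_bounds(
    total_sec: float,
    chunk_duration_sec: float = 30.0,
    overlap_sec: float = 2.0,
    min_chunk_duration_sec: float = 5.0,
) -> List[Tuple[float, float]]:
    """Return idealized (start, end) times for overlapping chunks.

    Hypothetical model of the splitter parameters: real boundaries
    depend on how the audio is decoded and may differ slightly.
    """
    if overlap_sec >= chunk_duration_sec:
        raise ValueError("overlap_sec must be smaller than chunk_duration_sec")
    stride = chunk_duration_sec - overlap_sec  # distance between chunk starts
    bounds = []
    start = 0.0
    while start < total_sec:
        end = min(start + chunk_duration_sec, total_sec)
        # Only a trailing chunk can be shorter than chunk_duration_sec;
        # drop it when it falls below the minimum.
        if end - start >= min_chunk_duration_sec:
            bounds.append((start, end))
        start += stride
    return bounds


# Assumed 60-second file: the third chunk (56-60 s) lasts 4 s and is dropped.
print(chunk_bounds(60.0))
```

Under these assumptions a 60-second file yields two chunks, which matches the chunk count shown earlier. The start time reported by Pixeltable (28.003) differs slightly from the idealized 28.0, so use the sketch for planning chunk counts rather than for exact timestamps.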

**Video files work too:**

When you insert a video file, Pixeltable automatically extracts the audio track.


## See also

- [Iterators documentation](https://docs.pixeltable.com/datastore/iterators)
- [Whisper library](https://github.com/openai/whisper)
