# Transcribe audio files with Whisper

Convert speech to text locally using OpenAI's open-source Whisper model—no API key needed.

## Problem

You have audio or video files that need transcription. Long files are memory-intensive to process at once, so you need to split them into manageable chunks.

| File | Duration | Challenge |
|------|----------|-----------|
| podcast.mp3 | 60 min | Too long to process at once |
| interview.mp4 | 30 min | Need to extract audio first |
| meeting.wav | 2 hours | Must chunk for memory efficiency |

## Solution

**What's in this recipe:**
- Transcribe audio files locally with Whisper (no API key)
- Automatically chunk long files
- Extract and transcribe audio from videos

Create a view with `AudioSplitter` to break long files into chunks, then add a computed column that transcribes each chunk. Whisper runs locally on your machine, so no API calls are made.

### Setup

In [None]:
%pip install -qU pixeltable openai-whisper

In [None]:
import pixeltable as pxt
from pixeltable.iterators import AudioSplitter
from pixeltable.functions import whisper

### Load audio files

In [None]:
# Create a fresh directory
pxt.drop_dir('audio_demo', force=True)
pxt.create_dir('audio_demo')

In [None]:
# Create table for audio files
audio = pxt.create_table('audio_demo.files', {'audio': pxt.Audio})

In [None]:
# Insert a sample audio file (video files also work - audio is extracted automatically)
audio.insert([
    {'audio': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/audio-transcription-demo/Lex-Fridman-Podcast-430-Excerpt-0.mp4'}
])

### Split into chunks

Create a view that splits the audio into 30-second chunks, with a 2-second overlap between consecutive chunks:

In [None]:
# Split audio into chunks for transcription
chunks = pxt.create_view(
    'audio_demo.chunks',
    audio,
    iterator=AudioSplitter.create(
        audio=audio.audio,
        chunk_duration_sec=30.0,  # 30-second chunks
        overlap_sec=2.0,          # 2-second overlap for context
        min_chunk_duration_sec=5.0  # Drop the final chunk if it is shorter than 5 seconds
    )
)

In [None]:
# View the chunks
chunks.select(chunks.start_time_sec, chunks.end_time_sec).collect()

### Transcribe with Whisper

Add a computed column that transcribes each chunk:

In [None]:
# Add transcription column (runs locally - no API key needed)
chunks.add_computed_column(
    transcription=whisper.transcribe(
        audio=chunks.audio_chunk,
        model='base.en'  # Options: tiny.en, base.en, small.en, medium.en, large
    )
)

In [None]:
# Extract just the text
chunks.add_computed_column(text=chunks.transcription.text)
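
The `text` field is only one part of a Whisper result. The upstream Whisper library also returns a `segments` list whose `start` and `end` times are relative to the audio it was given, which here would be a single chunk. Assuming the `transcription` column stores that same structure (an assumption, not verified here), a minimal sketch for shifting segment times onto the original file's timeline could look like this. The helper name `absolute_segments` and the sample dictionary are hypothetical.

```python
def absolute_segments(result, chunk_start_sec):
    """Shift segment times from chunk-relative to file-relative seconds."""
    shifted = []
    for seg in result.get('segments', []):
        shifted.append({
            'start': chunk_start_sec + seg['start'],
            'end': chunk_start_sec + seg['end'],
            'text': seg['text'].strip(),
        })
    return shifted


# Hand-written stand-in for one chunk's result (not real model output)
example_result = {
    'text': ' Hello there. How are you?',
    'language': 'en',
    'segments': [
        {'start': 0.0, 'end': 1.5, 'text': ' Hello there.'},
        {'start': 1.5, 'end': 3.0, 'text': ' How are you?'},
    ],
}

# A chunk that begins 28 seconds into the original file
segments = absolute_segments(example_result, chunk_start_sec=28.0)
```

Each returned segment then carries times that can be compared directly against `start_time_sec` and `end_time_sec` from the view.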

In [None]:
# View transcriptions with timestamps
chunks.select(chunks.start_time_sec, chunks.end_time_sec, chunks.text).collect()
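
To turn per-chunk text into a single readable transcript, the rows can be sorted by start time and joined. The sketch below is plain Python and assumes the query result has already been exported as a list of dictionaries keyed by the column names used above. The helper names `format_timestamp` and `stitch_transcript` and the sample rows are hypothetical.

```python
def format_timestamp(seconds):
    """Render seconds as MM:SS, or H:MM:SS once past one hour."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f'{hours}:{minutes:02d}:{secs:02d}'
    return f'{minutes:02d}:{secs:02d}'


def stitch_transcript(rows):
    """Join chunk text in time order, one timestamped line per chunk."""
    ordered = sorted(rows, key=lambda r: r['start_time_sec'])
    lines = []
    for row in ordered:
        text = (row['text'] or '').strip()
        if text:  # skip chunks with no recognized speech
            lines.append(f"[{format_timestamp(row['start_time_sec'])}] {text}")
    return '\n'.join(lines)


# Hypothetical rows, deliberately out of order
rows = [
    {'start_time_sec': 28.0, 'end_time_sec': 58.0, 'text': ' second chunk text'},
    {'start_time_sec': 0.0, 'end_time_sec': 30.0, 'text': ' first chunk text'},
]
print(stitch_transcript(rows))
```

Because consecutive chunks overlap, a few words near each boundary may appear twice in the joined text. This sketch does not attempt to deduplicate them.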

## Explanation

**Whisper models:**

| Model | Speed | Quality | Best for |
|-------|-------|---------|----------|
| `tiny.en` | Fastest | Basic | Quick tests |
| `base.en` | Fast | Good | General use |
| `small.en` | Medium | Better | Higher accuracy |
| `medium.en` | Slow | Great | Professional quality |
| `large` | Slowest | Best | Maximum accuracy |

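Models ending in `.en` are English-only and, per the Whisper README, tend to be more accurate on English audio, especially at the `tiny` and `base` sizes. For multilingual audio, drop the `.en` suffix (for example, `base` instead of `base.en`). `large` has no `.en` variant and is always multilingual.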
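
The naming rule can be captured in a small helper. This is an illustrative sketch only: `whisper_model_name` is a hypothetical function, not part of Pixeltable or Whisper, and it covers just the five sizes listed in the table above.

```python
# Sizes that ship with an English-only ".en" variant
ENGLISH_ONLY_SIZES = {'tiny', 'base', 'small', 'medium'}


def whisper_model_name(size, english_only=True):
    """Return the model name for a size, adding '.en' where a variant exists."""
    if english_only and size in ENGLISH_ONLY_SIZES:
        return f'{size}.en'
    return size  # multilingual model, or a size without an '.en' variant


print(whisper_model_name('base'))                      # base.en
print(whisper_model_name('base', english_only=False))  # base
print(whisper_model_name('large'))                     # large
```

The returned string is what you would pass as the `model` argument in the transcription step.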

**AudioSplitter parameters:**

| Parameter | Description |
|-----------|-------------|
| `chunk_duration_sec` | Duration of each chunk in seconds |
| `overlap_sec` | Overlap between chunks (helps with word boundaries) |
| `min_chunk_duration_sec` | Drop the last chunk if shorter than this |
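
To see how these three parameters interact, here is a plain-Python sketch of the boundary arithmetic. It is an approximation under stated assumptions, not Pixeltable's actual implementation: it assumes each chunk starts `chunk_duration_sec - overlap_sec` after the previous one and that only the final chunk can be dropped for being too short. The function name `chunk_boundaries` is hypothetical.

```python
def chunk_boundaries(total_sec, chunk_duration_sec, overlap_sec=0.0,
                     min_chunk_duration_sec=0.0):
    """Return (start, end) pairs in seconds covering one audio file."""
    if not 0 <= overlap_sec < chunk_duration_sec:
        raise ValueError('overlap_sec must be >= 0 and smaller than chunk_duration_sec')
    step = chunk_duration_sec - overlap_sec  # distance between chunk starts
    boundaries = []
    start = 0.0
    while start < total_sec:
        end = min(start + chunk_duration_sec, total_sec)
        boundaries.append((start, end))
        if end >= total_sec:
            break
        start += step
    # Assumed rule: drop the final chunk if it is shorter than the minimum
    if boundaries and boundaries[-1][1] - boundaries[-1][0] < min_chunk_duration_sec:
        boundaries.pop()
    return boundaries


# 70 seconds of audio with the settings used in this recipe
print(chunk_boundaries(70.0, 30.0, 2.0, 5.0))
# [(0.0, 30.0), (28.0, 58.0), (56.0, 70.0)]
```

With a 60-second file and the same settings, the last candidate chunk would span 56 to 60 seconds. At 4 seconds it falls below the 5-second minimum, so this sketch returns only the first two chunks.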

**Video files work too:**

When you insert a video file, Pixeltable automatically extracts the audio track.

## See also

- [Iterators documentation](https://docs.pixeltable.com/platform/iterators)
- [Whisper library](https://github.com/openai/whisper)