RunCast Intelligence

Semantic search across thousands of hours of running podcast transcripts. Ask a question in plain English and get an answer with exact episode sources and timestamps.

Live: runcast-intelligence.vercel.app · Backend: Railway · Auth: Clerk (Google login)


What it does

Most podcast knowledge is locked behind episode titles and show notes. RunCast transcribes every episode with Whisper, embeds the content into a vector database, and lets you search across everything at once using natural language.

Ask "How do elites taper for a marathon?" and get an answer synthesised from the most relevant passages across 4,000+ episodes, each cited with its exact timestamp so you can jump straight to the source.

┌─────────────────────────────────────────────────────────────────┐
│                        User asks a question                      │
└────────────────────────────┬────────────────────────────────────┘
                             │
                    Embed query (OpenAI)
                             │
                             ▼
                  ┌─────────────────────┐
                  │  Supabase pgvector  │  ← similarity search
                  │  4,151 episodes     │
                  │  chunked + embedded │
                  └──────────┬──────────┘
                             │  top-k chunks with timestamps
                             ▼
                  ┌─────────────────────┐
                  │   LLM (OpenRouter)  │  ← RAG answer synthesis
                  └──────────┬──────────┘
                             │
                             ▼
              Answer + episode sources + timestamps
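
The retrieval step in the diagram can be sketched in plain Python. This is a minimal in-memory illustration, not the project's code: in RunCast the similarity search runs inside pgvector, and the chunk fields (episode, start, text, embedding), the toy 3-dimensional vectors, and the prompt wording are all assumptions made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunks, k=3):
    """Return the k chunks most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, c["embedding"]), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


def format_timestamp(seconds):
    """Render an offset in seconds as H:MM:SS."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def build_prompt(question, chunks):
    """Assemble an LLM prompt that cites each chunk by episode and time."""
    sources = "\n".join(
        f"[{c['episode']} @ {format_timestamp(c['start'])}] {c['text']}"
        for c in chunks
    )
    return f"Answer using only these sources.\n\n{sources}\n\nQuestion: {question}"


# Invented toy data: real embeddings have far more dimensions.
TOY_CHUNKS = [
    {"episode": "Episode A", "start": 754, "text": "toy chunk about tapering",
     "embedding": [0.9, 0.1, 0.0]},
    {"episode": "Episode B", "start": 3725, "text": "toy chunk about strength work",
     "embedding": [0.0, 1.0, 0.0]},
    {"episode": "Episode C", "start": 60, "text": "toy chunk about weekly mileage",
     "embedding": [0.7, 0.3, 0.1]},
]
QUERY_VEC = [1.0, 0.0, 0.0]

if __name__ == "__main__":
    best = top_k_chunks(QUERY_VEC, TOY_CHUNKS, k=2)
    print(build_prompt("How do elites taper for a marathon?", best))
```

Keeping the start offset on every chunk is what lets the final answer link back to a specific moment in an episode.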

Architecture

RSS Feeds (9 podcasts, 4,151 episodes)
    │
    ▼
Crawler (scripts/crawl.py)
    │  stores episode metadata
    ▼
Supabase
    │
    ├── Transcription pipeline
    │       ffmpeg (split >25MB audio) → OpenAI Whisper → raw transcript
    │       stored in Supabase with chunk offsets
    │
    └── Embedding pipeline
            text-embedding-3-small → pgvector
            chunk size: 500 tokens, 50-token overlap

FastAPI (src/api/)
    ├── POST /search  →  embed query → pgvector → LLM → response
    └── GET  /health

Next.js frontend
    ├── Public homepage
    └── Search (Clerk auth required)
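
The chunking parameters in the embedding pipeline (500 tokens, 50-token overlap) amount to a sliding window that advances 450 tokens at a time. A minimal sketch follows, assuming the transcript is already tokenised; the function name and the returned fields are hypothetical, and the real pipeline's tokeniser is not specified here.

```python
def chunk_tokens(tokens, size=500, overlap=50):
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # 450 with the defaults
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]
        chunks.append({"start": start, "end": start + len(window), "tokens": window})
        if start + size >= len(tokens):
            break  # this window already reached the end of the transcript
    return chunks


if __name__ == "__main__":
    # Stand-in tokens: integers instead of real transcript tokens.
    for chunk in chunk_tokens(list(range(1200))):
        print(chunk["start"], chunk["end"])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding about 10% more tokens.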

Podcasts indexed

Podcast                          Host
The Running Explained Podcast    Elisabeth Scott
Ali on the Run Show              Ali Feller
The Strength Running Podcast     Jason Fitzgerald
The CITIUS MAG Podcast           Chris Chavez
The Morning Shakeout Podcast     Mario Fraioli
Run to the Top                   Runners Connect
Some Work, All Play              David & Megan Roche
Real Talk Running
The Planted Runner

Tech stack

Layer               Technology
Backend             Python · FastAPI
Database            Supabase (PostgreSQL + pgvector)
Transcription       OpenAI Whisper (with ffmpeg chunking for >25MB files)
Embeddings          OpenAI text-embedding-3-small
LLM                 OpenRouter
Frontend            Next.js · TypeScript · Tailwind
Auth                Clerk (Google login)
Backend hosting     Railway
Frontend hosting    Vercel

Local setup

Prerequisites

  • Python 3.11+
  • Node 20+
  • ffmpeg (brew install ffmpeg)
  • A Supabase project (free tier)
  • An OpenAI API key
  • An OpenRouter API key

1. Clone and install

git clone https://github.com/lmenta/runcast-intelligence
cd runcast-intelligence
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

2. Environment variables

cp .env.example .env

Fill in .env:

SUPABASE_URL=https://your-project.supabase.co
SUPABASE_SERVICE_KEY=your-service-role-key
OPENAI_API_KEY=sk-...
OPENROUTER_API_KEY=sk-or-...
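
Before running the pipeline it can help to confirm that all four variables are actually set. The check below is a small sketch, not part of the repo; it inspects the process environment, so it assumes .env has already been exported or loaded by whatever tooling you use.

```python
import os

REQUIRED_VARS = (
    "SUPABASE_URL",
    "SUPABASE_SERVICE_KEY",
    "OPENAI_API_KEY",
    "OPENROUTER_API_KEY",
)


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print("All required environment variables are set.")
```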

3. Set up the database

Open the Supabase SQL editor and run both migrations in order:

# Copy the contents of each file and run in Supabase SQL editor
supabase/migrations/001_initial_schema.sql
supabase/migrations/002_add_transcript.sql

This creates the episodes, chunks, and podcasts tables with pgvector enabled.

4. Seed podcasts and crawl RSS feeds

make setup        # seeds 9 podcasts and crawls all RSS feeds
make check-feeds  # verify all feeds are reachable

This populates the episodes table with metadata (title, date, audio URL) but no transcripts yet.

5. Transcribe episodes

make transcribe   # transcribes 3 episodes (~$0.15 in OpenAI credits)

For large audio files (>25MB), the pipeline automatically splits them with ffmpeg before sending to Whisper.
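
One way such a split could work is sketched below. It is an illustration under stated assumptions rather than the pipeline's actual logic: it reads 25MB as 25,000,000 bytes, assumes a roughly constant bitrate so that size scales with duration, and uses ffmpeg's segment muxer with stream copy. The function names and the 10% safety margin are hypothetical.

```python
import math

SIZE_LIMIT_BYTES = 25 * 1000 * 1000  # conservative reading of the 25MB limit


def segment_seconds(file_bytes, duration_s, limit_bytes=SIZE_LIMIT_BYTES, margin=0.9):
    """Pick a segment length so each piece should stay under the size limit.

    Returns None when the file already fits and needs no splitting.
    """
    if file_bytes <= limit_bytes:
        return None
    parts = math.ceil(file_bytes / (limit_bytes * margin))
    return math.ceil(duration_s / parts)


def ffmpeg_split_command(src, out_pattern, seconds):
    """Build an ffmpeg call that cuts `src` into fixed-length pieces without re-encoding."""
    return [
        "ffmpeg", "-i", src,
        "-f", "segment",
        "-segment_time", str(seconds),
        "-c", "copy",
        out_pattern,
    ]


if __name__ == "__main__":
    # Hypothetical 60 MB, one-hour episode.
    seconds = segment_seconds(60_000_000, 3600)
    print(" ".join(ffmpeg_split_command("episode.mp3", "part_%03d.mp3", seconds)))
```

When transcripts from the pieces are stitched back together, each piece's timestamps need the piece's start offset added so that chunk timestamps still refer to the full episode.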

6. Generate embeddings

make embed   # chunks transcripts and stores embeddings in pgvector

7. Test search

make search
# Query: how do elites taper for a marathon?

8. Run locally

make api   # FastAPI on http://localhost:8000
make dev   # Next.js on http://localhost:3000 (in a second terminal)
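
With the API running, a quick smoke test of POST /search might look like the sketch below. The JSON field name "query" is an assumption, as is the idea that the local API accepts unauthenticated requests; check the request model in src/api/ for the real schema.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # where `make api` serves FastAPI


def build_search_request(question, base_url=API_BASE):
    """Build (but do not send) a POST /search request.

    The "query" field name is an assumption, not confirmed by the repo.
    """
    body = json.dumps({"query": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    req = build_search_request("how do elites taper for a marathon?")
    print(send(req))
```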

Deployment

Backend → Railway

  1. Connect this repo to Railway
  2. Add all environment variables from .env
  3. Railway picks up railway.toml automatically — no extra config

Frontend → Vercel

  1. Connect the frontend/ directory to Vercel
  2. Add environment variables:
    • NEXT_PUBLIC_API_URL — your Railway backend URL
    • NEXT_PUBLIC_USE_MOCK=false
    • Clerk keys (NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY, CLERK_SECRET_KEY)

Transcription at scale → Modal

For processing large backlogs without keeping a local machine running:

pip install modal
modal secret create runcast \
  SUPABASE_URL=... \
  SUPABASE_SERVICE_KEY=... \
  OPENAI_API_KEY=...
modal deploy src/transcription/modal_worker.py

Deploys a serverless worker that transcribes new episodes on GPU. Cost: ~$0.05/hour of audio.


Cost estimate (low traffic)

Service                Cost
Supabase               Free tier
Railway                ~$5/month
Vercel                 Free
OpenAI (embeddings)    ~$0.02/episode
OpenAI (Whisper)       ~$0.006/minute of audio
OpenRouter (search)    ~$0.001/query
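
The per-unit rates above can be combined into a rough budget. The sketch below uses only the rates from the table; the episode count, the average episode length of 60 minutes, and the query volume are assumptions for illustration, so substitute your own numbers.

```python
WHISPER_PER_MINUTE = 0.006     # USD per minute of audio, from the table
EMBEDDING_PER_EPISODE = 0.02   # USD per episode, from the table
QUERY_COST = 0.001             # USD per search query, from the table


def indexing_cost(episodes, avg_minutes):
    """One-off cost in USD to transcribe and embed a backlog of episodes."""
    per_episode = avg_minutes * WHISPER_PER_MINUTE + EMBEDDING_PER_EPISODE
    return round(episodes * per_episode, 2)


def query_cost(queries):
    """Cost in USD of answering a given number of search queries."""
    return round(queries * QUERY_COST, 2)


if __name__ == "__main__":
    # Assumed workload: 100 episodes averaging 60 minutes, 1,000 queries.
    print(indexing_cost(100, 60))
    print(query_cost(1000))
```

Transcription dominates: at these rates a 60-minute episode costs about $0.36 to transcribe against $0.02 to embed.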
