Utilities for scraping, indexing, and querying Gil's Arena podcast transcripts from PodScripts.co.
- Python 3.12+
- uv (recommended)
```bash
uv sync
```

This installs the runtime deps (requests, beautifulsoup4, tqdm, sentence-transformers, faiss-cpu, numpy, fastapi, uvicorn, jinja2) and the default dev group (ruff, ty, pytest). Use `uv sync --all-groups` if you add other dependency groups later.
Run the scraper with Python (from the repo root):

```bash
# Quick test: few episodes, single listing page
uv run python scripts/scrape_podscripts.py --dry-run

# Full scrape (default output: gil/transcripts)
uv run python scripts/scrape_podscripts.py

# Skip episodes that already have a matching .txt in the output dir
uv run python scripts/scrape_podscripts.py --resume

# Page range and cap
uv run python scripts/scrape_podscripts.py --start-page 1 --end-page 5 --limit 10

# Debug logging
uv run python scripts/scrape_podscripts.py --verbose
```

For every option, see:

```bash
uv run python scripts/scrape_podscripts.py --help
```

Build a local FAISS index over timestamped transcript chunks:
```bash
uv run python scripts/build_transcript_index.py
```

Useful flags:

```bash
# Build from a subset of episodes
uv run python scripts/build_transcript_index.py --limit 10

# Change chunk sizing
uv run python scripts/build_transcript_index.py --lines-per-chunk 6 --line-overlap 2

# Override the embedding model or output directory
uv run python scripts/build_transcript_index.py --model multi-qa-mpnet-base-cos-v1 --output-dir data/transcript_index_mpnet
```

After building an index, search it with a free-text basketball query:
```bash
uv run python scripts/query_transcripts.py "What did they say about Giannis trade rumors?"
```

Useful flags:

```bash
uv run python scripts/query_transcripts.py "Team USA basketball" --top-k 3
uv run python scripts/query_transcripts.py "Bronny James" --index-dir data/transcript_index
uv run python scripts/query_transcripts.py "Lakers chemistry" --metadata-dir data/metadata --team Lakers
uv run python scripts/query_transcripts.py "Dwight Howard" --metadata-dir data/metadata --guest "Dwight Howard" --rerank
```

Run the tracked seed queries and compute hit rate and MRR:
```bash
uv run python scripts/evaluate_retrieval.py
uv run python scripts/evaluate_retrieval.py --metadata-dir data/metadata --rerank --json
```

Summarize transcript, index, metadata, and scrape-failure state:
```bash
uv run python scripts/report_ingestion_status.py
uv run python scripts/report_ingestion_status.py --json
```

After building the index, run the local FastAPI UI:
```bash
# Optional: copy env template and set secrets (loaded automatically from repo root)
cp .env.example .env  # then edit .env; it is gitignored
uv run uvicorn --app-dir src webapp.main:app --reload --host 127.0.0.1 --port 8000
```

You can instead export the same variables listed in .env.example; see that file for the OPENROUTER_* and GIL_INDEX_DIR settings.

Open http://127.0.0.1:8000. Toggle "Retrieval only" to skip the LLM. The web UI also supports optional metadata-aware filters, cross-encoder reranking, streaming responses, /health, /ready, and /api/ingestion/status.
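The .env file holds plain KEY=VALUE lines. If you want to export those variables manually instead, a minimal parser like the sketch below shows the format being assumed (the app's own loader may differ, and `OPENROUTER_API_KEY` is an illustrative key name, not necessarily the real one):

```python
def parse_env_file(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value
        env[key.strip()] = value.strip().strip("'\"")
    return env
```

This is only a sketch of the file format; check .env.example for the authoritative list of variables.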
Write a JSON array of successfully scraped episodes (metadata + transcript URL + path relative to --output-dir):

```bash
uv run python scripts/scrape_podscripts.py --manifest gil/transcripts/manifest.json
```

If there are no episodes to scrape, or none succeed, the file is still written as []. The manifest is not merged with previous runs; combine files yourself if you need a full index.
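Since each run overwrites the manifest, combining runs is up to you. A rough merge sketch, assuming each entry carries a transcript-URL field (the actual field name may differ; inspect a real manifest first):

```python
import json
from pathlib import Path


def merge_manifests(paths: list[Path], key: str = "transcript_url") -> list[dict]:
    """Concatenate manifest arrays, keeping the first entry seen per URL.

    The `key` field name is an assumption, not the confirmed schema.
    """
    seen: set[str] = set()
    merged: list[dict] = []
    for path in paths:
        for entry in json.loads(path.read_text()):
            url = entry.get(key)
            if url in seen:
                continue
            seen.add(url)
            merged.append(entry)
    return merged
```

Later files never override earlier entries here; reverse the path list if you want the newest run to win.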
- Transcripts: one `.txt` per episode under `--output-dir` (default `gil/transcripts`).
- Failures: append-only log at `data/scrape_failures.log` (default; see `ScraperConfig` in `scripts/scrape_podscripts.py`).
- Manifest: optional JSON when `--manifest` is set.
- Index artifacts: local FAISS bundle under `data/transcript_index` by default: `index.faiss`, `chunks.json`, `config.json`.
- Evaluation seed queries: `evals/gil_queries.json`
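Among the artifacts above, chunks.json holds the overlapping line windows the index is built from. The windowing implied by --lines-per-chunk and --line-overlap can be sketched roughly like this (a simplification with illustrative field names, not the script's actual schema):

```python
def chunk_lines(lines: list[str], lines_per_chunk: int = 6, line_overlap: int = 2) -> list[dict]:
    """Slide a window of `lines_per_chunk` lines, advancing by the chunk size minus the overlap."""
    step = lines_per_chunk - line_overlap
    chunks: list[dict] = []
    for start in range(0, len(lines), step):
        window = lines[start:start + lines_per_chunk]
        chunks.append({"start_line": start, "text": "\n".join(window)})
        if start + lines_per_chunk >= len(lines):
            break  # the last window already reaches the end of the transcript
    return chunks
```

With the defaults, consecutive chunks share their last two lines with the next chunk's first two, so a statement split across a chunk boundary still appears intact in one of them.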
The script checks robots.txt, waits between list-page and episode requests (default 1s; override with `--delay SEC`), and retries failed requests with backoff. Tune `rate_limit_seconds`, `total_pages`, and related fields in `ScraperConfig` inside `scripts/scrape_podscripts.py` if the site's policy or layout changes.
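The retry-with-backoff behavior described above follows a standard pattern. A generic sketch of that pattern (not the script's actual code, whose knobs live in ScraperConfig):

```python
import time


def fetch_with_backoff(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on network errors with exponentially growing waits."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except OSError:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))  # waits of 1s, 2s, 4s, ...
```

Injecting `sleep` as a parameter keeps the function unit-testable without real waiting.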
```bash
uv run ruff check src scripts tests
uv run ty check
uv run pytest
```

Tests import from `src/`, the repo root, and `scripts/` via `pythonpath = ["src", ".", "scripts"]` in `pyproject.toml`.