nickth3man/neural

neural

Utilities for scraping, indexing, and querying Gil's Arena podcast transcripts from PodScripts.co.

Requirements

  • Python 3.12+
  • uv (recommended)

Setup

uv sync

This installs runtime deps (requests, beautifulsoup4, tqdm, sentence-transformers, faiss-cpu, numpy, fastapi, uvicorn, jinja2) and the default dev group (ruff, ty, pytest). Use uv sync --all-groups if you add other dependency groups later.

Usage

Scrape Transcripts

Run the scraper with Python (from the repo root):

# Quick test: few episodes, single listing page
uv run python scripts/scrape_podscripts.py --dry-run

# Full scrape (default output: gil/transcripts)
uv run python scripts/scrape_podscripts.py

# Skip episodes that already have a matching .txt in the output dir
uv run python scripts/scrape_podscripts.py --resume

# Page range and cap
uv run python scripts/scrape_podscripts.py --start-page 1 --end-page 5 --limit 10

# Debug logging
uv run python scripts/scrape_podscripts.py --verbose

For every option, see:

uv run python scripts/scrape_podscripts.py --help

Build A Semantic Index

Build a local FAISS index over timestamped transcript chunks:

uv run python scripts/build_transcript_index.py

Useful flags:

# Build from a subset of episodes
uv run python scripts/build_transcript_index.py --limit 10

# Change chunk sizing
uv run python scripts/build_transcript_index.py --lines-per-chunk 6 --line-overlap 2

# Override the embedding model or output directory
uv run python scripts/build_transcript_index.py --model multi-qa-mpnet-base-cos-v1 --output-dir data/transcript_index_mpnet
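The --lines-per-chunk and --line-overlap flags suggest a sliding window over transcript lines. The sketch below shows one plausible reading of that scheme; the function name chunk_lines and its exact edge-case behavior are assumptions for illustration, not the code in scripts/build_transcript_index.py.

```python
from typing import Iterator


def chunk_lines(
    lines: list[str], lines_per_chunk: int = 6, line_overlap: int = 2
) -> Iterator[list[str]]:
    """Yield overlapping windows of transcript lines (hypothetical helper)."""
    if lines_per_chunk <= 0:
        raise ValueError("lines_per_chunk must be positive")
    if not 0 <= line_overlap < lines_per_chunk:
        raise ValueError("line_overlap must be in [0, lines_per_chunk)")

    # Each new window starts this many lines after the previous one.
    step = lines_per_chunk - line_overlap
    for start in range(0, len(lines), step):
        yield lines[start:start + lines_per_chunk]
        # Stop once a window has reached the end of the transcript,
        # so no trailing chunk is made purely of already-seen lines.
        if start + lines_per_chunk >= len(lines):
            break


if __name__ == "__main__":
    demo = [f"line {i}" for i in range(10)]
    for chunk in chunk_lines(demo, lines_per_chunk=6, line_overlap=2):
        print(chunk[0], "->", chunk[-1])
```

Under this reading, a larger overlap keeps more context shared between neighboring chunks at the cost of a bigger index.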

Query The Index

After building an index, search it with a free-text basketball query:

uv run python scripts/query_transcripts.py "What did they say about Giannis trade rumors?"

Useful flags:

uv run python scripts/query_transcripts.py "Team USA basketball" --top-k 3
uv run python scripts/query_transcripts.py "Bronny James" --index-dir data/transcript_index
uv run python scripts/query_transcripts.py "Lakers chemistry" --metadata-dir data/metadata --team Lakers
uv run python scripts/query_transcripts.py "Dwight Howard" --metadata-dir data/metadata --guest "Dwight Howard" --rerank
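Conceptually, --top-k asks for the k chunks whose embeddings are closest to the query embedding. The following pure-Python sketch illustrates that ranking with cosine similarity on toy vectors. It is an assumption-laden illustration: the real script uses FAISS and sentence-transformers, and the similarity metric and the names cosine and top_k are not taken from the repo.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: list[float], chunk_vecs: list[list[float]], k: int = 3
) -> list[tuple[int, float]]:
    """Return (chunk_index, score) pairs for the k most similar chunks."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    # Highest score first; ties broken by the lower chunk index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(idx, score) for score, idx in scored[:k]]


if __name__ == "__main__":
    toy_chunks = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]  # made-up embeddings
    print(top_k([1.0, 0.0], toy_chunks, k=2))
```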

Evaluate Retrieval

Run the tracked seed queries and compute hit rate and MRR:

uv run python scripts/evaluate_retrieval.py
uv run python scripts/evaluate_retrieval.py --metadata-dir data/metadata --rerank --json
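For reference, hit rate is the fraction of queries with at least one relevant result in the top k, and MRR (mean reciprocal rank) averages 1/rank of the first relevant result, counting 0 when none is found. A minimal sketch of both metrics follows; the function name, the input shapes, and the cutoff k are assumptions and may differ from what scripts/evaluate_retrieval.py does.

```python
def hit_rate_and_mrr(
    ranked_results: list[list[str]],
    relevant: list[set[str]],
    k: int = 5,
) -> tuple[float, float]:
    """Compute hit rate@k and MRR@k over a batch of queries.

    ranked_results[i] is the ordered list of retrieved ids for query i;
    relevant[i] is the set of ids considered correct for that query.
    """
    if len(ranked_results) != len(relevant):
        raise ValueError("ranked_results and relevant must have equal length")
    if not ranked_results:
        return 0.0, 0.0

    hits = 0
    reciprocal_rank_sum = 0.0
    for results, gold in zip(ranked_results, relevant):
        for rank, doc_id in enumerate(results[:k], start=1):
            if doc_id in gold:
                hits += 1
                reciprocal_rank_sum += 1.0 / rank
                break  # only the first relevant result counts

    total = len(ranked_results)
    return hits / total, reciprocal_rank_sum / total


if __name__ == "__main__":
    ranked = [["a", "b", "c"], ["x", "y", "z"], ["p", "q"]]
    gold = [{"b"}, {"x"}, {"missing"}]
    print(hit_rate_and_mrr(ranked, gold))
```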

Ingestion Status Report

Summarize transcript, index, metadata, and scrape-failure state:

uv run python scripts/report_ingestion_status.py
uv run python scripts/report_ingestion_status.py --json

Chat Web App (citation-first RAG)

After building the index, run the local FastAPI UI:

# Optional: copy env template and set secrets (loaded automatically from repo root)
cp .env.example .env   # then edit .env — .env is gitignored

uv run uvicorn --app-dir src webapp.main:app --reload --host 127.0.0.1 --port 8000

You can instead export the same variables listed in .env.example; see that file for OPENROUTER_* and GIL_INDEX_DIR.

Open http://127.0.0.1:8000. Toggle "Retrieval only" to skip the LLM.

The web UI also supports optional metadata-aware filters, cross-encoder reranking, streaming responses, /health, /ready, and /api/ingestion/status.

Manifest (this run only)

Write a JSON array of successfully scraped episodes (metadata + transcript URL + path relative to --output-dir):

uv run python scripts/scrape_podscripts.py --manifest gil/transcripts/manifest.json

If there are no episodes to scrape or none succeed, the file is still written as []. The manifest is not merged with previous runs; combine files yourself if you need a full index.
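Since each manifest covers a single run, a small script can combine several of them. The sketch below deduplicates entries by a key and lets later files win; the key name "url" is a hypothetical placeholder, so check an actual manifest for the real field names before using it.

```python
import json
from pathlib import Path


def merge_manifests(manifests: list[list[dict]], key: str = "url") -> list[dict]:
    """Merge manifest arrays, keeping the latest entry for each key value."""
    merged: dict[str, dict] = {}
    for entries in manifests:
        for entry in entries:
            # A later entry replaces an earlier one with the same key.
            merged[entry[key]] = entry
    return list(merged.values())


def merge_manifest_files(
    paths: list[str], out_path: str, key: str = "url"
) -> list[dict]:
    """Read several manifest JSON files and write the merged array."""
    manifests = [json.loads(Path(p).read_text(encoding="utf-8")) for p in paths]
    merged = merge_manifests(manifests, key)
    Path(out_path).write_text(json.dumps(merged, indent=2), encoding="utf-8")
    return merged


if __name__ == "__main__":
    run_a = [{"url": "u1", "path": "a.txt"}]
    run_b = [{"url": "u1", "path": "b.txt"}, {"url": "u2", "path": "c.txt"}]
    print(merge_manifests([run_a, run_b, []]))
```

Empty manifests (written as []) merge cleanly because they contribute no entries.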

Outputs

  • Transcripts: one .txt per episode under --output-dir (default gil/transcripts).
  • Failures: append-only log at data/scrape_failures.log (default; see ScraperConfig in scripts/scrape_podscripts.py).
  • Manifest: optional JSON when --manifest is set.
  • Index artifacts: local FAISS bundle under data/transcript_index by default:
    • index.faiss
    • chunks.json
    • config.json

Evaluation Seeds

  • Evaluation seed queries: evals/gil_queries.json

Etiquette

The script checks robots.txt, waits between list-page and episode requests (default 1s, override with --delay SEC), and retries with backoff. Tune rate_limit_seconds, total_pages, and related fields in ScraperConfig inside scripts/scrape_podscripts.py if the site policy or layout changes.
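The general pattern described above (robots.txt check, retries with backoff) can be sketched with the standard library alone. This is an illustration under assumptions, not the scraper's code: the helper names, the exponential backoff schedule, and the user agent string are hypothetical, and ScraperConfig may implement these steps differently.

```python
import time
from typing import Callable
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against already-downloaded robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def backoff_delays(retries: int, base: float = 1.0, factor: float = 2.0) -> list[float]:
    """Delays before each retry: base, base*factor, base*factor**2, ..."""
    return [base * factor**attempt for attempt in range(retries)]


def fetch_with_retries(
    fetch: Callable[[str], str],
    url: str,
    retries: int = 3,
    base: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url), retrying on OSError with growing pauses."""
    delays = backoff_delays(retries, base)
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except OSError:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(delays[attempt])
    raise AssertionError("unreachable")


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    print(is_allowed(rules, "example-bot", "https://example.com/private/x"))
    print(backoff_delays(3))
```

Passing sleep as a parameter keeps the retry logic testable without real waiting.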

Contributing

uv run ruff check src scripts tests
uv run ty check
uv run pytest

Tests can import from src/, the repo root, and scripts/ because pyproject.toml sets pythonpath = ["src", ".", "scripts"].
