Utilities for scraping, indexing, and querying Gil's Arena podcast transcripts from PodScripts.co.
- Python 3.12+
- uv (recommended)
```bash
uv sync
```

This installs the runtime deps (requests, beautifulsoup4, tqdm, sentence-transformers, faiss-cpu, numpy, fastapi, uvicorn, jinja2) and the default dev group (ruff, ty, pytest). Use `uv sync --all-groups` if you add other dependency groups later.
Run the scraper with Python (from the repo root):

```bash
# Quick test: few episodes, single listing page
uv run python scripts/scrape_podscripts.py --dry-run

# Full scrape (default output: gil/transcripts)
uv run python scripts/scrape_podscripts.py

# Skip episodes that already have a matching .txt in the output dir
uv run python scripts/scrape_podscripts.py --resume

# Page range and cap
uv run python scripts/scrape_podscripts.py --start-page 1 --end-page 5 --limit 10

# Debug logging
uv run python scripts/scrape_podscripts.py --verbose
```

For every option, see:

```bash
uv run python scripts/scrape_podscripts.py --help
```

Build a local FAISS index over timestamped transcript chunks:
```bash
uv run python scripts/build_transcript_index.py
```

Useful flags:

```bash
# Build from a subset of episodes
uv run python scripts/build_transcript_index.py --limit 10

# Change chunk sizing
uv run python scripts/build_transcript_index.py --lines-per-chunk 6 --line-overlap 2

# Override the embedding model or output directory
uv run python scripts/build_transcript_index.py --model multi-qa-mpnet-base-cos-v1 --output-dir data/transcript_index_mpnet
```

After building an index, search it with a free-text basketball query:
```bash
uv run python scripts/query_transcripts.py "What did they say about Giannis trade rumors?"
```

Useful flags:

```bash
uv run python scripts/query_transcripts.py "Team USA basketball" --top-k 3
uv run python scripts/query_transcripts.py "Bronny James" --index-dir data/transcript_index
uv run python scripts/query_transcripts.py "Lakers chemistry" --metadata-dir data/metadata --team Lakers
uv run python scripts/query_transcripts.py "Dwight Howard" --metadata-dir data/metadata --guest "Dwight Howard" --rerank
```

Run the tracked seed queries and compute hit rate and MRR:
```bash
uv run python scripts/evaluate_retrieval.py
uv run python scripts/evaluate_retrieval.py --metadata-dir data/metadata --rerank --json
```

Summarize transcript, index, metadata, and scrape-failure state:
```bash
uv run python scripts/report_ingestion_status.py
uv run python scripts/report_ingestion_status.py --json
```

After building the index, run the local FastAPI UI:
```bash
# Optional: copy env template and set secrets (loaded automatically from repo root)
cp .env.example .env  # then edit .env; it is gitignored
uv run uvicorn --app-dir src webapp.main:app --reload --host 127.0.0.1 --port 8000
```

You can instead export the same variables listed in .env.example; see that file for the OPENROUTER_* and GIL_INDEX_DIR settings.

Open http://127.0.0.1:8000. Toggle "Retrieval only" to skip the LLM. The web UI also supports optional metadata-aware filters, cross-encoder reranking, streaming responses, /health, /ready, and /api/ingestion/status.
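The .env file holds plain KEY=VALUE lines. If you want to export those variables manually instead, a minimal parser like the sketch below shows the format being assumed (the app's own loader may differ, and `OPENROUTER_API_KEY` is an illustrative key name, not necessarily the real one):

```python
def parse_env_file(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value
        env[key.strip()] = value.strip().strip("'\"")
    return env
```

This is only a sketch of the file format; check .env.example for the authoritative list of variables.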
Write a JSON array of successfully scraped episodes (metadata + transcript URL + path relative to --output-dir):

```bash
uv run python scripts/scrape_podscripts.py --manifest gil/transcripts/manifest.json
```

If there are no episodes to scrape, or none succeed, the file is still written as []. The manifest is not merged with previous runs; combine files yourself if you need a full index.
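Since each run overwrites the manifest, combining runs is up to you. A rough merge sketch, assuming each entry carries a transcript-URL field (the actual field name may differ; inspect a real manifest first):

```python
import json
from pathlib import Path


def merge_manifests(paths: list[Path], key: str = "transcript_url") -> list[dict]:
    """Concatenate manifest arrays, keeping the first entry seen per URL.

    The `key` field name is an assumption, not the confirmed schema.
    """
    seen: set[str] = set()
    merged: list[dict] = []
    for path in paths:
        for entry in json.loads(path.read_text()):
            url = entry.get(key)
            if url in seen:
                continue
            seen.add(url)
            merged.append(entry)
    return merged
```

Later files never override earlier entries here; reverse the path list if you want the newest run to win.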
- Transcripts: one `.txt` per episode under `--output-dir` (default `gil/transcripts`).
- Failures: append-only log at `data/scrape_failures.log` (default; see `ScraperConfig` in `scripts/scrape_podscripts.py`).
- Manifest: optional JSON when `--manifest` is set.
- Index artifacts: local FAISS bundle under `data/transcript_index` by default: `index.faiss`, `chunks.json`, `config.json`.
- Evaluation seed queries: `evals/gil_queries.json`
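Among the artifacts above, chunks.json holds the overlapping line windows the index is built from. The windowing implied by --lines-per-chunk and --line-overlap can be sketched roughly like this (a simplification with illustrative field names, not the script's actual schema):

```python
def chunk_lines(lines: list[str], lines_per_chunk: int = 6, line_overlap: int = 2) -> list[dict]:
    """Slide a window of `lines_per_chunk` lines, advancing by the chunk size minus the overlap."""
    step = lines_per_chunk - line_overlap
    chunks: list[dict] = []
    for start in range(0, len(lines), step):
        window = lines[start:start + lines_per_chunk]
        chunks.append({"start_line": start, "text": "\n".join(window)})
        if start + lines_per_chunk >= len(lines):
            break  # the last window already reaches the end of the transcript
    return chunks
```

With the defaults, consecutive chunks share their last two lines with the next chunk's first two, so a statement split across a chunk boundary still appears intact in one of them.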
The script checks robots.txt, waits between list-page and episode requests (default 1s; override with `--delay SEC`), and retries failed requests with backoff. Tune `rate_limit_seconds`, `total_pages`, and related fields in `ScraperConfig` inside `scripts/scrape_podscripts.py` if the site's policy or layout changes.
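The retry-with-backoff behavior described above follows a standard pattern. A generic sketch of that pattern (not the script's actual code, whose knobs live in ScraperConfig):

```python
import time


def fetch_with_backoff(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on network errors with exponentially growing waits."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except OSError:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))  # waits of 1s, 2s, 4s, ...
```

Injecting `sleep` as a parameter keeps the function unit-testable without real waiting.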
```bash
uv run ruff check src scripts tests
uv run ty check
uv run pytest
```

Tests import from `src/`, the repo root, and `scripts/` via `pythonpath = ["src", ".", "scripts"]` in `pyproject.toml`.