Parser — FDD Database

Buyer-facing database of franchise financial performance data, sourced from publicly filed Franchise Disclosure Documents (FDDs). Monetized via affiliate referrals (franchise consultants + SBA preferred lenders) and (later) display ads.

Live site: https://josh-max2.github.io/Parser/ (after Pages is enabled — see below)


How to enable GitHub Pages (3-click setup)

The repo and the generated docs/ folder are already in place. To make the site live:

  1. Go to https://github.com/josh-max2/Parser/settings/pages
  2. Under "Build and deployment" > Source, pick "Deploy from a branch"
  3. Set Branch to main and folder to /docs, then click Save

GitHub will build and publish within 1–2 minutes. The live URL will be:

https://josh-max2.github.io/Parser/

Once live, every git push to main rebuilds the site automatically.


What's in this repo

Parser/
├── README.md                      # This file
├── HANDOFF.md                     # Full project state, decision log, validation matrix
├── fdd_tool_build_spec.md         # Original build plan
├── fdd-tool/                      # Python pipeline (extraction + scraper + DB + site gen)
│   ├── src/
│   │   ├── pdf_utils.py           # PDF text + section finder + cover detector
│   │   ├── prompts.py             # Claude extraction prompts
│   │   ├── claude_client.py       # Anthropic SDK wrapper
│   │   ├── extract.py             # PDF -> 6 JSON files
│   │   ├── db.py                  # SQLite schema
│   │   ├── site_gen.py            # SQLite -> docs/ HTML
│   │   ├── scrapers/wisconsin.py  # WI DFI Playwright scraper
│   │   └── templates/             # Jinja2 templates (brand, index, category, about)
│   ├── scripts/                   # One-off validation + ingest scripts
│   ├── output/                    # 25 FDD extractions as JSON (committed — facts, not text)
│   └── data/                      # Source PDFs (gitignored — copyrighted)
└── docs/                          # Generated static site (served by GitHub Pages)
    ├── index.html
    ├── about/
    ├── franchise/{slug}/
    └── category/{slug}/
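
The layout above implies a flow from extracted JSON, through SQLite, to static HTML under docs/. The following is a minimal sketch of that flow, not the code in db.py or site_gen.py: the `brands` table, the `slug` and `name` JSON keys, and the function names `ingest` and `generate_site` are all hypothetical, and plain f-strings stand in for the Jinja2 templates.

```python
import json
import sqlite3
from html import escape
from pathlib import Path


def ingest(conn, json_dir):
    """Load every *.json file in json_dir into a hypothetical `brands` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS brands ("
        "slug TEXT PRIMARY KEY, name TEXT, payload TEXT)"
    )
    count = 0
    for path in sorted(Path(json_dir).glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        # `slug` and `name` are assumed keys; the real extraction output may differ.
        conn.execute(
            "INSERT OR REPLACE INTO brands (slug, name, payload) VALUES (?, ?, ?)",
            (record["slug"], record["name"], json.dumps(record)),
        )
        count += 1
    conn.commit()
    return count


def generate_site(conn, docs_dir):
    """Write docs/franchise/{slug}/index.html per row and return the paths."""
    written = []
    rows = conn.execute("SELECT slug, name FROM brands ORDER BY slug").fetchall()
    for slug, name in rows:
        page = Path(docs_dir) / "franchise" / slug / "index.html"
        page.parent.mkdir(parents=True, exist_ok=True)
        # Escape the brand name so characters like & are safe in HTML.
        page.write_text(f"<h1>{escape(name)}</h1>\n", encoding="utf-8")
        written.append(page)
    return written
```

Keeping ingestion and rendering as separate steps mirrors the two commands under Local development: the JSON files can be re-ingested without touching docs/, and the site can be regenerated without re-running extraction.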

Local development

cd fdd-tool

# Install deps
uv sync
uv run playwright install chromium  # only if running scraper

# Re-ingest existing JSONs into SQLite
uv run python scripts/ingest_outputs.py

# Re-generate the static site
uv run python -m src.site_gen

# Commit + push to rebuild the live site
git add docs/
git commit -m "Regenerate site"
git push

Status (2026-05-16)

  • Phase 0 (extraction validation) — DONE on 4 FDDs across diverse failure modes
  • Phase 1 (WI scraper + 20-brand home-services pilot) — DONE, $6.50 total API spend
  • Phase 2 (SQLite + static site) — DONE (this commit)
  • Phase 3 (live, indexed, affiliate-monetized) — pending GitHub Pages enablement + affiliate program applications

See HANDOFF.md for the full state and decision log.

Costs

  • $0.30/FDD avg API spend at current Sonnet 4.6 pricing
  • $0 hosting (GitHub Pages)
  • $0 domain (uses josh-max2.github.io/Parser/ until you wire up a custom domain)
  • Estimated $565 to ingest the full 1,879-brand Wisconsin corpus (deferred until storefront is live)
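
The full-corpus estimate can be checked with simple arithmetic. The sketch below multiplies the brand count by a per-FDD average, once with the $0.30 figure above and once with an average derived from the pilot ($6.50 over 20 brands). Treating the whole $6.50 as spent on exactly those 20 FDDs is an assumption, and `estimate_ingest_cost` is a hypothetical helper, not part of the repo.

```python
from decimal import Decimal


def estimate_ingest_cost(n_fdds: int, avg_cost_per_fdd: Decimal) -> Decimal:
    """Return n_fdds * avg_cost_per_fdd, rounded to whole cents."""
    return (avg_cost_per_fdd * n_fdds).quantize(Decimal("0.01"))


WI_CORPUS_BRANDS = 1879              # brand count stated above
STATED_AVG = Decimal("0.30")         # per-FDD average stated above
PILOT_AVG = Decimal("6.50") / 20     # assumes all pilot spend went to 20 FDDs

stated_estimate = estimate_ingest_cost(WI_CORPUS_BRANDS, STATED_AVG)
pilot_estimate = estimate_ingest_cost(WI_CORPUS_BRANDS, PILOT_AVG)

print(f"At stated average: ${stated_estimate}")
print(f"At pilot average:  ${pilot_estimate}")
```

At the stated $0.30 average the product is $563.70, which is consistent with the rounded $565 estimate. The pilot-derived average is slightly higher, so the actual spend may land somewhat above that figure.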
