A scraper toolkit, PostgreSQL graph database, and REST API for US youth soccer club data. Covers 127 league directories across all tiers of the US youth soccer pyramid — from MLS NEXT down to 54 state associations — normalizing and deduplicating clubs into a single canonical dataset.
- Scrapes club rosters from 115 publicly accessible league directories (7 Tier-1 national, 13 Tier-2 high performance, 41 Tier-3 regional, 54 Tier-4 state associations)
- Normalizes club names (stripping suffixes, fixing casing) and deduplicates across sources with fuzzy matching (RapidFuzz, threshold 88)
- Enriches clubs with website URLs — first from directory pages, then via Brave Search API for remaining clubs
- Discovers coaches by scraping staff pages (SportsEngine, LeagueApps, WordPress)
- Seeds a PostgreSQL graph database with canonical clubs, aliases, affiliations, events, and coaches
- Exposes a typed REST API on port 8080 with search, analytics, and graph traversal endpoints
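A minimal sketch of the normalize-then-deduplicate step from the list above. The suffix list, the function names, and the greedy clustering are assumptions for illustration, and `difflib` stands in for RapidFuzz's token-sort ratio so the example runs on the standard library alone; the real logic lives in normalizer.py and may differ.

```python
import re
from difflib import SequenceMatcher

# Hypothetical suffix list; the real one in normalizer.py is not shown here.
SUFFIXES = ("soccer club", "football club", "sc", "fc")
THRESHOLD = 88  # same threshold the README quotes for RapidFuzz


def normalize(name: str) -> str:
    """Build a matching key: lowercase, drop punctuation, strip one club suffix."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    cleaned = " ".join(cleaned.split())
    for suffix in SUFFIXES:
        if cleaned.endswith(" " + suffix):
            cleaned = cleaned[: -len(suffix)].strip()
            break
    return cleaned


def token_sort_score(a: str, b: str) -> float:
    """Stdlib approximation of a token-sort ratio on a 0-100 scale."""
    sorted_a = " ".join(sorted(a.split()))
    sorted_b = " ".join(sorted(b.split()))
    return 100.0 * SequenceMatcher(None, sorted_a, sorted_b).ratio()


def dedupe(names: list[str]) -> dict[str, list[str]]:
    """Greedy clustering: each name joins the first canonical key it matches."""
    clusters: dict[str, list[str]] = {}
    for raw in names:
        key = normalize(raw)
        for canonical in clusters:
            if token_sort_score(key, canonical) >= THRESHOLD:
                clusters[canonical].append(raw)
                break
        else:
            clusters[key] = [raw]
    return clusters
```

Sorting tokens before comparing is what lets "United Riverside" and "Riverside United SC" land in the same cluster despite the word order.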
cd scraper
pip install -r requirements.txt
python3 -m playwright install chromium
# Scrape all 115 leagues
python3 run.py
# Scrape by tier / filter
python3 run.py --tier 1 # Tier 1 national elite (7 leagues)
python3 run.py --scope state # All 54 USYS state associations
python3 run.py --league "ECNL" # Single league by name
python3 run.py --dry-run # Preview without writing
python3 run.py --list # Print league inventory
# Non-league scrapers
python3 run.py --source gotsport-matches --event-id 12345 \
--season 2025-26 --league-name "ECNL Boys National"
# Resolve raw team names on event_teams + matches to canonical_clubs.id.
# Run after every scrape — powers /api/events/search?club_id=N and
# matches -> club_results rollup.
python3 run.py --source link-canonical-clubs
python3 run.py --source link-canonical-clubs --dry-run --limit 100
# Derived-data rollups
python3 run.py --rollup club-results

The gotsport-matches scraper populates the matches table (Domain 5).
home_club_id / away_club_id stay NULL at scrape time — a separate
linker job resolves raw team names to canonical club FKs. The
club-results rollup reads linker-resolved matches and writes aggregate
W/L/D + GF/GA counts per (club, season, league, division, age, gender).
Full-recompute, idempotent, safe to re-run.
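The rollup described above can be sketched as a pure function over match rows. This is an illustration under assumptions: the dict keys `home_score` and `away_score`, the `Tally` class, and the in-memory input are hypothetical, and the real job reads from and writes to PostgreSQL.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Tally:
    wins: int = 0
    losses: int = 0
    draws: int = 0
    goals_for: int = 0
    goals_against: int = 0


def rollup(matches: list[dict]) -> dict[tuple, Tally]:
    """Recompute every tally from scratch so re-running gives the same result."""
    results: dict[tuple, Tally] = defaultdict(Tally)
    for m in matches:
        # Matches the linker has not resolved yet are skipped.
        if m["home_club_id"] is None or m["away_club_id"] is None:
            continue
        group = (m["season"], m["league"], m["division"], m["age"], m["gender"])
        sides = (
            (m["home_club_id"], m["home_score"], m["away_score"]),
            (m["away_club_id"], m["away_score"], m["home_score"]),
        )
        for club_id, scored, conceded in sides:
            tally = results[(club_id, *group)]
            tally.goals_for += scored
            tally.goals_against += conceded
            if scored > conceded:
                tally.wins += 1
            elif scored < conceded:
                tally.losses += 1
            else:
                tally.draws += 1
    return dict(results)
```

Because nothing is carried over between runs, calling `rollup` twice on the same rows yields identical output, which is the property that makes a full recompute safe to repeat.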
cd scraper
python3 enrich_clubs.py # Extract websites from directory pages
python3 enrich_websites.py --limit 200 # Brave Search for remaining clubs
python3 scrape_staff.py --limit 100 # Scrape staff pages for coaches

# Push schema and seed DB from master.csv
pnpm --filter @workspace/db run push
npx tsx lib/db/src/seed.ts
# Backfill coaches master from coach_discoveries (idempotent)
pnpm --filter @workspace/scripts run backfill-coaches -- --dry-run
pnpm --filter @workspace/scripts run backfill-coaches
# Start API (port 8080)
pnpm --filter @workspace/api-server run dev

127 entries total, 115 scrapeable (has_public_clubs=True).
| Tier | Count | Notable Leagues |
|---|---|---|
| 1 — National Elite | 7 | MLS NEXT, ECNL Girls, ECNL Boys, Girls Academy, NWSL Academy, USL Academy, Elite 64 |
| 2 — High Performance | 13 | ECNL RL (Boys + Girls), GA Aspire, DPL, US Club NPL, USYS National League, Pre-ECNL |
| 3 — Regional Power | 41 | EDP, NorCal Premier, SOCAL, Super Y, SincSports tournaments (14), NPL regional leagues |
| 4 — State Hubs | 54 | All 54 USYS member state/regional associations |

| Type | Count | How It Works |
|---|---|---|
| state_association_hub | 54 | GotSport event rosters or Google My Maps KML |
| homepage | 39 | BeautifulSoup HTML scraping of club directory pages |
| sincsports | 14 | Static HTML from soccer.sincsports.com/TTTeamList.aspx?tid= |
| program | 6 | Program/division-specific directory pages |
| league_page | 4 | League-owned team/club listing pages |
| athleteone_api | 4 | ECNL's AthleteOne JSON API |
| directory | 2 | Club directory index pages |
| no_source | 1 | No public club listing available |

leagues_master.csv
│
▼
run.py ──► extractors/ ──► scraper_static.py / scraper_js.py
│ (per-site custom extractors)
│
▼
normalizer.py ← RapidFuzz (threshold=88)
│
▼
storage.py ──► output/master.csv
output/leagues/<slug>.csv
│
▼
enrich_clubs.py (website URLs from directory pages)
enrich_websites.py (Brave Search API for remaining clubs)
scrape_staff.py (coach discovery via staff pages)
│
▼
lib/db/src/seed.ts ──► PostgreSQL
PostgreSQL, managed with Drizzle ORM. 26 tables after the April 2026 Path A expansion. Push schema: pnpm --filter @workspace/db run push.

| Table | Purpose |
|---|---|
| canonical_clubs | Master club records; website, status, socials, last-scraped timestamps |
| club_aliases | All scraped name variants per canonical club |
| club_affiliations | League/source associations (unique on club_id + source_name) |
| leagues_master | League directory inventory |
| league_sources | Official scrape source registry |

| Table | Purpose |
|---|---|
| coaches | Master coach records; person_hash dedup; manually_merged guard for operator curation |
| coach_discoveries | Primary coach read model; FK to coaches via coach_id; platform family + confidence |
| coach_career_history | Role+tenure records per coach across clubs |
| coach_movement_events | Hire/leave/promotion events |
| coach_scrape_snapshots | Point-in-time snapshot per scrape run |
| coach_effectiveness | Aggregated outcome metrics |

| Table | Purpose |
|---|---|
| events / event_teams | Tournaments, leagues, showcases and participating teams |
| matches / club_results | Individual games + aggregated per-club results |
| roster_diffs / tryouts | Roster change log + tryout announcements |
| club_roster_snapshots / club_site_changes | Point-in-time roster diffing + website change detection |

| Table | Purpose |
|---|---|
| colleges / college_coaches / college_roster_history | NCAA/NAIA/NJCAA dataset |

| Table | Purpose |
|---|---|
| scrape_run_logs | Per-run telemetry with failure_kind enum; written by scraper/scrape_run_logger.py |
| scrape_health | Rolling health rollups |

club_coaches was dropped after the backfill verified zero residual rows and /api/coaches/search was rewired to coach_discoveries. club_events was dropped after /api/events/search was rewired to events + event_teams.
See docs/path-a-data-model.md for the full domain-by-domain spec and CLAUDE.md for session context.
Base URL: /api — port 8080. All list endpoints are paginated (?page=1&page_size=20, max 100).
Every request under /api/* except /api/healthz requires a machine-to-machine API key (when enforcement is turned on — see bootstrap below). Pass it in either header:
X-API-Key: <key>
Authorization: Bearer <key>
Requests without a valid key return 401 { "error": "unauthorized" }. The response body is intentionally the same for missing, unknown, and revoked keys — detailed reason is logged server-side only. There are no user sessions — this is a pure M2M API.
Enforcement is gated by the API_KEY_AUTH_ENABLED env var. A fresh deploy with the flag unset accepts all /api/* traffic so you can bring the table up and mint a key before flipping it on.
- Pull, install, and push the schema (creates the api_keys table):
  pnpm install
  pnpm --filter @workspace/db run push
- Create the first key (plaintext prints once; copy it immediately into the caller's env):
  pnpm --filter @workspace/scripts run create-api-key -- --name "upshift-player-platform prod"
- Set API_KEY_AUTH_ENABLED=true in Replit Secrets.
- Restart the API server. The boot log will print [api-key-auth] enabled; from here every /api/* call requires the header.
The plaintext key is printed ONCE. Only the sha256 hash is stored in the database — a lost key cannot be recovered, only revoked and replaced.
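A minimal sketch of the hash-only storage model described above. The row shape, the function names, and the assumption that the 8-character prefix is simply the start of the plaintext key are all hypothetical; the actual create-api-key script and middleware may work differently.

```python
import hashlib
import hmac
import secrets


def create_key() -> tuple[str, dict]:
    """Return (plaintext, row); only the row would be persisted."""
    plaintext = secrets.token_urlsafe(32)
    row = {
        "prefix": plaintext[:8],  # assumed to be what --prefix matches on revoke
        "key_hash": hashlib.sha256(plaintext.encode()).hexdigest(),
        "revoked": False,
    }
    return plaintext, row


def is_authorized(presented: str, rows: list[dict]) -> bool:
    """True only for a known, non-revoked key; every other case looks the same."""
    digest = hashlib.sha256(presented.encode()).hexdigest()
    for row in rows:
        if hmac.compare_digest(row["key_hash"], digest):
            return not row["revoked"]
    return False
```

Since the stored digest cannot be reversed, a lost plaintext has no recovery path, which is why rotation is a create-then-revoke sequence rather than a lookup.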
- Create a new key with create-api-key (different --name suffix or timestamp).
- Update the caller's env var and redeploy.
- Confirm the new key works by tailing logs for 401s.
- Revoke the old key:
  pnpm --filter @workspace/scripts run revoke-api-key -- --prefix <8-char-prefix>
const res = await fetch(`${process.env.UPSHIFT_DATA_API_URL}/api/clubs`, {
headers: { "X-API-Key": process.env.UPSHIFT_DATA_API_KEY! },
});

/api/healthz remains open for Replit liveness probes.
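For Python callers, an equivalent request can be built with the standard library. This is a sketch under assumptions: `build_request` is a hypothetical helper, and clamping `page_size` on the client simply mirrors the documented maximum of 100 rather than anything the server requires.

```python
import urllib.parse
import urllib.request

MAX_PAGE_SIZE = 100  # documented cap for list endpoints


def build_request(base_url: str, path: str, api_key: str,
                  page: int = 1, page_size: int = 20,
                  **filters: str) -> urllib.request.Request:
    """Build an authenticated GET for a paginated list endpoint."""
    params = {**filters, "page": page, "page_size": min(page_size, MAX_PAGE_SIZE)}
    query = urllib.parse.urlencode(params)
    url = f"{base_url.rstrip('/')}/api{path}?{query}"
    return urllib.request.Request(url, headers={"X-API-Key": api_key})
```

Passing the result to `urllib.request.urlopen` sends the call; a missing, unknown, or revoked key surfaces as an `HTTPError` with status 401.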

| Endpoint | Description |
|---|---|
| GET /clubs | Paginated list; filter by state, tier, gender_program |
| GET /clubs/search | Advanced search: name, state, league, has_website |
| GET /clubs/:id | Single club with affiliations and aliases |
| GET /clubs/:id/related | Related clubs sharing affiliations |
| GET /clubs/:id/staff | Discovered coaches from staff pages |

| Endpoint | Description |
|---|---|
| GET /search | Fuzzy club name search |
| GET /events/search | Filter events by club_id, league, age_group, gender, season, start_date_from/to |
| GET /coaches/search | Filter coaches by club_id, name, title, min_confidence |

| Endpoint | Description |
|---|---|
| GET /leagues | All leagues in master directory |
| GET /leagues/:id/clubs | All clubs for a specific league |

| Endpoint | Description |
|---|---|
| GET /analytics/duplicates | Near-duplicate club clusters (normalized name + state); includes source labels |
| GET /analytics/coverage | Per-state and per-league club counts; flags states below min_clubs threshold |
| GET /analytics/overlap | Clubs appearing in 2+ leagues; useful for detecting wrong-source associations |

# Clubs in California with a known website
curl "http://localhost:8080/api/clubs/search?state=CA&has_website=true&page_size=10"
# Events in 2024-2025 season
curl "http://localhost:8080/api/events/search?season=2024-2025&page_size=20"
# High-confidence coaches for a club
curl "http://localhost:8080/api/coaches/search?club_id=155&min_confidence=0.8"
# Coverage report (states below 5 clubs)
curl "http://localhost:8080/api/analytics/coverage?min_clubs=5"
# Near-duplicate clubs in Texas
curl "http://localhost:8080/api/analytics/duplicates?state=TX"
# Clubs in multiple leagues (potential wrong associations)
curl "http://localhost:8080/api/analytics/overlap?min_leagues=5&page_size=20"

Custom extractors live in scraper/extractors/ and are matched by URL pattern via registry.py.

| Extractor | Leagues | Technique |
|---|---|---|
| ecnl.py | ECNL Boys + Girls, ECNL RL B+G | AthleteOne JSON API; auto-discovers all conference event IDs |
| girls_academy.py | Girls Academy, GA Aspire | <article><li> HTML structure |
| mls_next.py | MLS NEXT | Pattern A (table) + Pattern B (card grid); extracts website links |
| norcal.py | NorCal Premier Soccer | /clubs/ table |
| gotsport.py | SOCAL, MSPSP, state assocs, NPL regions | GotSport org_event/events/{id}/clubs roster pages |
| sincsports.py | 14 SincSports tournaments | TTTeamList.aspx?tid= static HTML; single response, no pagination |
| sincsports_events.py | Same 14 tournaments, Path A mode | Populates events + event_teams tables (not CSVs); invoke via python3 run.py --source sincsports-events |
| state_assoc.py | All 54 USYS state associations | GotSport events or Google Maps KML per state |
| npl_extra.py | SE NPL, Empire Soccer, Mid-Atlantic, NY Club Soccer | GotSport event IDs via _multi_event_scrape helper |
| edp.py | EDP Soccer | Wix static crawl |
| dpl.py | DPL | WordPress pages |

- Retry logic: All HTTP requests use utils/retry.py: exponential backoff (2s base, cap 60s), 3 retries. ConnectionError, Timeout, 5xx → TransientError. Playwright navigation errors are retried similarly.
- Failure reporting: run.py tracks each league scrape result. The end-of-run summary shows counts by FailureKind: timeout, network, parse_error, zero_results, unknown.
- Multi-state events: GotSport events spanning multiple states (MN, WV) set multi_state=true in config; clubs keep a blank state rather than inheriting the wrong parent state.
- Deduplication: RapidFuzz token-sort ratio at threshold 88 merges near-identical club names across sources into a single canonical record.
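The retry policy in the list above (2s base, 60s cap, 3 retries) can be sketched as follows. The signature and the injectable `sleep` parameter are assumptions made so the example is testable; the real retry_with_backoff in utils/retry.py may differ.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Raised for failures worth retrying (connection errors, timeouts, 5xx)."""


def retry_with_backoff(func: Callable[[], T], retries: int = 3,
                       base: float = 2.0, cap: float = 60.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func; on TransientError wait base * 2**attempt (capped) and retry."""
    for attempt in range(retries + 1):
        try:
            return func()
        except TransientError:
            if attempt == retries:
                raise  # retries exhausted; let the caller classify the failure
            sleep(min(cap, base * 2 ** attempt))
    raise AssertionError("unreachable")
```

With the defaults the waits are 2s, 4s, and 8s, so the 60s cap only comes into play if the retry count is raised.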
- Python 3.11+
- requests, beautifulsoup4, lxml, html5lib: static scraping
- playwright: JS-rendered pages (headless Chromium)
- pandas, rapidfuzz: normalization and fuzzy deduplication
- psycopg2-binary: direct PostgreSQL writes from staff scraper
- Monorepo: pnpm workspaces
- Runtime: Node.js 24
- API: Express 5
- Database: PostgreSQL + Drizzle ORM
- Validation: Zod v4 + drizzle-zod
- API spec: OpenAPI 3.1 → Orval codegen (Zod validators + TS types)
- Build: esbuild (single ESM bundle)
pnpm run typecheck # typecheck all packages
pnpm --filter @workspace/api-spec run codegen # regenerate Zod types from OpenAPI
pnpm --filter @workspace/db run push # sync schema to DB
pnpm --filter @workspace/api-server run dev # start API (port 8080)

.
├── scraper/ # Python data pipeline
│ ├── extractors/ # Per-site scrapers + GotSport/SincSports helpers
│ ├── utils/ # retry_with_backoff utility
│ ├── data/ # League inventory CSVs + state config JSON
│ └── output/ # Generated CSVs (gitignored)
├── lib/
│ ├── db/ # Drizzle schema, seed script, DB client
│ ├── api-spec/ # OpenAPI 3.1 YAML
│ └── api-zod/ # Generated Zod validators + TypeScript types
└── artifacts/
└── api-server/ # Express API server
└── src/
├── routes/ # clubs, events, coaches, leagues, analytics, search
└── lib/ # pagination, analytics normalization helpers