feat: MA lobbying data pipeline and dashboard charts #71

Open: nesanders wants to merge 48 commits into main from feat/ma-lobbying-data

@nesanders
Owner

@nesanders nesanders commented May 20, 2026

This PR adds a complete pipeline for collecting and analyzing Massachusetts lobbying disclosure data. The MA Secretary of State publishes semi-annual lobbying filings that document which organizations hired lobbyists, how much they paid, and which specific bills their lobbyists worked on. By pairing this with bill metadata from the MA Legislature API and semantic environmental relevance scoring via Google Gemini embeddings, we can identify which industries are spending most on environmental legislation, track whether heavily-lobbied bills are more or less likely to pass, and correlate lobbying intensity with trends in DEP enforcement and budget.

The data integrates naturally with existing AMEND analyses: lobbying spend can be plotted alongside DEP staffing and enforcement actions to surface relationships between industry influence and regulatory outcomes. All four new database tables are exposed in the AI Analysis tool's semantic context, enabling natural-language queries like "which companies spent the most lobbying against clean water bills" or "has lobbying spend on climate legislation increased since 2015." The pipeline is fully incremental — weekly CI runs exit early when no new semi-annual filings have been posted, so the added runtime cost is near-zero on most weeks.

Summary

  • Scraper (get_MA_lobbying.py): scrapes the MA SoS lobbying disclosure portal using an iPad User-Agent (bypasses Incapsula WAF without Selenium). Incremental via a disc_url set stored in MA_lobbying_summary_links.csv — weekly CI exits early when no new semi-annual filings are posted, so most runs touch only 2 search pages and exit.
  • Legislature bills (get_MA_legislature_bills.py): fetches bill metadata (title, sponsor, committee, status, passed bool) from the MA Legislature OpenAPI for every unique (bill_number, general_court) pair in the lobbying data. JSON responses cached under MA_legislature_cache/ for incremental re-runs.
  • Environmental scoring (score_lobbying_bills.py): scores each bill for environmental relevance using Gemini gemini-embedding-2 cosine similarity against 20 seed phrases (threshold 0.60). Only unscored bills are embedded per run.
  • Dashboard charts (MA_lobbying_viz.py): 4 weekly-updated charts — annual spend trend, top 15 employers, bill intensity + pass rate, lobbying spend vs. enforcement actions. Plus 2 analysis-post charts.
  • Pipeline wiring: all scripts added to update-data.yml CI, assemble_db.py (4 new tables: MA_Lobbying_Employers, MA_Lobbying_Bills, MA_Lobbying_Lobbyists, MA_Legislature_Bills), validate_data.py (lobbying tables in OPTIONAL_DATASETS — CI doesn't fail before first fetch), generate_semantic_context.py, and dashboard_charts.py.
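
The incremental behavior described for the scraper can be sketched roughly as follows. This is an illustration only, not the code in get_MA_lobbying.py: the function names, the example URLs, and the extra `year` column are assumptions; only the `disc_url` key and the idea of exiting early when nothing new is listed come from the summary above.

```python
import csv
import io


def load_seen_urls(csv_text: str) -> set[str]:
    """Collect disc_url values already recorded in the summary-links CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["disc_url"] for row in reader if row.get("disc_url")}


def new_disclosures(listed_urls: list[str], seen: set[str]) -> list[str]:
    """Return listed URLs not yet scraped, preserving listing order."""
    return [url for url in listed_urls if url not in seen]


# Hypothetical contents of MA_lobbying_summary_links.csv.
existing_csv = (
    "disc_url,year\n"
    "https://example.org/d/1,2024\n"
    "https://example.org/d/2,2024\n"
)
seen = load_seen_urls(existing_csv)

# URLs found on the first search pages of a hypothetical weekly run.
listed = ["https://example.org/d/1", "https://example.org/d/2"]
todo = new_disclosures(listed, seen)
if not todo:
    print("No new filings; exiting early.")
```

Keeping the check as a plain set lookup means a quiet week costs only the requests needed to list the most recent filings.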

Test plan

  • End-to-end test with --year 2024 --limit 10: produced 18 disclosure rows, 56 employer rows, 539 bill rows with correct entity names, compensation amounts, bill numbers, titles, and positions
  • Full 2024 fetch running in background — data CSVs will be committed as follow-up once complete
  • Run get_MA_legislature_bills.py against live API after lobbying fetch completes
  • Run score_lobbying_bills.py with real Google API key
  • Run dashboard_charts.py to verify chart generation once DB is assembled

🤖 Generated with Claude Code

Adds end-to-end pipeline for MA Secretary of State lobbying disclosures:

- get_MA_lobbying.py: scrapes SoS portal (iPad UA bypasses Incapsula WAF),
  incremental via disc_url set in summary_links CSV — weekly CI exits early
  when no new semi-annual filings are posted
- get_MA_legislature_bills.py: fetches bill metadata from MA Legislature
  OpenAPI for bills appearing in lobbying data; JSON cache under
  MA_legislature_cache/ for incremental re-runs
- score_lobbying_bills.py: scores bills for environmental relevance using
  Gemini embedding-2 cosine similarity against 20 seed phrases (threshold 0.60)
- MA_lobbying_viz.py: 4 dashboard charts (spend trend, top employers, bill
  intensity, lobbying vs enforcement) + 2 analysis-post charts
- Wires all scripts into update-data.yml CI, assemble_db.py (4 new tables),
  validate_data.py (OPTIONAL_DATASETS so CI doesn't fail before first fetch),
  generate_semantic_context.py, and dashboard_charts.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Contributor

✅ Semantic Eval Results

Each eval sends a natural-language question plus the relevant table schemas from db_semantic_context.txt to gpt-4o-mini, executes the generated SQL against AMEND.db, then scores it with a second LLM call using a per-case rubric. Hard pass = SQL ran and returned rows without hitting any known anti-patterns. Fatal = judge determined the query would return wrong or misleading results.

Metric Value
Hard pass rate 10/10 (100%)
Fatal failures 0
Mean judge score 5.0/5
P50 judge score 5/5
Model gpt-4o-mini
Semantic context hash 0b4c17034694
Per-case results
ID Hard pass Score Fatal Reason
cso_top_operator 5/5 no The query correctly aggregates total CSO discharge volume by operator for 2022, using the correct table and filters, and orders the results as required.
cso_monthly_rainfall 5/5 no The query correctly aggregates both CSO discharge volume and precipitation totals by month and joins them appropriately.
cso_by_watershed 5/5 no The query correctly joins the tables, groups by watershed, and applies the necessary filter for eventType.
enforcement_vs_budget 5/5 no The query correctly joins the enforcement actions with the budget data on the year, extracts the year from the EnforcementDate correctly, and applies the necessary filters.
staffing_trend 5/5 no The query correctly uses the MADEP_staff_Comptroller table, groups by year, and counts employees from 2005 to present.
303d_impaired_trend 5/5 no The query correctly counts listings from the EPA_303d_Impairments table, groups by reportingCycle, and orders the results correctly.
303d_named_waterbody 5/5 no The query correctly uses the EPA_303d_Impairments table, applies the maximum reportingCycle filter, and uses the LIKE pattern for the waterbody.
cso_to_impaired 5/5 no Both joins are correct and the reportingCycle filter is applied to EPA_303d_Impairments, which is appropriate.
all_caps_boston 5/5 no The query correctly uses UPPER() to filter the municipality for 'BOSTON' and checks for CSO event types using LIKE.
ecos_per_capita 5/5 no The query correctly uses the ECOS_budgets table, aggregates data for multiple states, and calculates per-capita spending accurately.
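
For readers unfamiliar with the "hard pass" criterion, here is a minimal sketch of what such a check could look like. It is not the eval harness itself: the `hard_pass` function, the toy `staff` table, and the single `SELECT *` anti-pattern are all assumptions made for illustration.

```python
import re
import sqlite3

# Hypothetical anti-pattern list; the real one lives in the eval harness.
ANTI_PATTERNS = [re.compile(r"select\s+\*", re.IGNORECASE)]


def hard_pass(conn: sqlite3.Connection, sql: str) -> bool:
    """True if the SQL matches no anti-pattern, runs, and returns at least one row."""
    if any(p.search(sql) for p in ANTI_PATTERNS):
        return False
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return False  # the query failed to run
    return len(rows) > 0


# Toy in-memory database standing in for AMEND.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (year INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?)", [(2005, "a"), (2005, "b"), (2006, "c")]
)

good_sql = "SELECT year, COUNT(*) AS n FROM staff GROUP BY year"
print(hard_pass(conn, good_sql))
```

The judge score is a separate LLM call and is not modeled here.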

…notes

- docs/data/MA_lobbying.md: new dataset page with source description,
  data tables (employers, bills, legislature bills), and download links
- docs/dashboard.md: add lobbying section with 4 chart includes and
  methodology note; add nav link
- CLAUDE.md: document Incapsula WAF bypass (iPad UA), conda run stdout
  buffering gotcha, correct Gemini SDK (google.genai not google.generativeai),
  full historical fetch timing, and REQUEST_DELAY tip for historical runs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Contributor

✅ Semantic Eval Results

Each eval sends a natural-language question plus the relevant table schemas from db_semantic_context.txt to gpt-4o-mini, executes the generated SQL against AMEND.db, then scores it with a second LLM call using a per-case rubric. Hard pass = SQL ran and returned rows without hitting any known anti-patterns. Fatal = judge determined the query would return wrong or misleading results.

Metric Value
Hard pass rate 10/10 (100%)
Fatal failures 0
Mean judge score 5.0/5
P50 judge score 5/5
Model gpt-4o-mini
Semantic context hash 0b4c17034694
Per-case results
ID Hard pass Score Fatal Reason
cso_top_operator 5/5 no The query correctly aggregates total CSO discharge volume by operator for 2022, using the correct table and filters, and includes the necessary grouping and ordering.
cso_monthly_rainfall 5/5 no The query correctly aggregates both CSO discharge volume and total monthly rainfall, using the appropriate join pattern on aggregated months.
cso_by_watershed 5/5 no The query correctly joins the tables, groups by watershed, and applies the necessary filter for eventType.
enforcement_vs_budget 5/5 no The query correctly joins the enforcement actions with the budget data on the year, extracts the year from the EnforcementDate correctly, and applies the necessary filters.
staffing_trend 5/5 no The query correctly uses the MADEP_staff_Comptroller table, groups by year, and counts employees from 2005 to present.
303d_impaired_trend 5/5 no The query correctly counts listings from the EPA_303d_Impairments table, groups by reportingCycle, and orders the results correctly.
303d_named_waterbody 5/5 no The query correctly uses the EPA_303d_Impairments table, applies the maximum reportingCycle filter, and uses the LIKE pattern for the waterbody.
cso_to_impaired 5/5 no Both joins are correct and the reportingCycle filter is applied to EPA_303d_Impairments, which is appropriate.
all_caps_boston 5/5 no The query correctly uses UPPER() to filter the municipality for 'BOSTON' and checks for CSO event types using LIKE.
ecos_per_capita 5/5 no The query correctly uses the ECOS_budgets table, aggregates data for multiple states, and calculates per-capita spending accurately.

nesanders and others added 4 commits May 20, 2026 12:05
…ill rows)

Full fetch of all 1,715 registrants for 2024. Historical years (2005–2023)
to follow in a subsequent commit once the full fetch completes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ying viz

- MA_lobbying_viz.py: entity_name/compensation (not employer_name/total_expenditure);
  dual-axis charts use yAxisID='y'/'y1' + y2nd=1 per chartjs convention
- get_MA_legislature_bills.py: use /Documents/{billId} endpoint (not /Bills/);
  construct bill ID from chamber prefix + number; fetch history via separate
  DocumentHistoryActions URL; Action field (not StatusDescription) for passed
- Add initial 2024 dashboard chart outputs (3 of 4; bill intensity pending legislature data)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- score_lobbying_bills.py: rewritten to embed bill titles directly from
  MA_lobbying_bills.csv (not legislature CSV); stores embeddings as
  MA_bill_embeddings.npy for clustering; incremental per run
- cluster_lobbying_bills.py: one-time k-means (default 15 clusters) on
  normalized embeddings + Gemini Flash labeling of each cluster; writes
  MA_bill_cluster_labels.csv and updates cluster_id in scored CSV
- MA_lobbying_viz.py: add Chart 5 — stacked bar of annual spend by topic
  cluster; gracefully skipped until cluster_lobbying_bills.py has been run
- dashboard.md: add cluster spend chart include
- requirements-ci.txt: add scikit-learn==1.8.0
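
As a rough illustration of the clustering step, the sketch below runs k-means on L2-normalized vectors. The PR uses scikit-learn for this; the version here is a dependency-free stand-in with assumed function names, toy 2-D vectors instead of real embeddings, and a deterministic "first k points" initialization chosen only to keep the example reproducible. The Gemini labeling of clusters is omitted.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)


def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: mean of members; an empty cluster keeps its old centroid.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy "embeddings": two pointing mostly along x, two mostly along y.
raw = [[1.0, 0.1], [0.1, 1.0], [1.0, 0.2], [0.2, 1.0]]
cluster_ids = kmeans([normalize(v) for v in raw], k=2)
print(cluster_ids)
```

Normalizing first makes Euclidean distance track cosine similarity, which is presumably why the commit clusters on normalized embeddings.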

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ata from DB

- assemble_db.py: coerce bill_number/general_court to Int64 in MA_Lobbying_Bills,
  MA_Legislature_Bills, MA_Lobbying_Bills_Scored; add MA_Lobbying_Bills_Scored and
  MA_Bill_Cluster_Labels as DB tables so all downstream analysis reads from DB
- MA_lobbying_viz.py: remove CSV file reads; load scored bills and cluster labels
  from DB; remove redundant numeric coercions (now guaranteed by assemble_db.py)
- cluster_lobbying_bills.py: update to gemini-2.5-flash for cluster labeling
- score_lobbying_bills.py: differential cosine scoring with example bills
- Add dash_lobbying_bill_intensity.html and dash_lobbying_spend_by_cluster.html charts
- Update semantic context with new DB tables

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Contributor

✅ Semantic Eval Results

Each eval sends a natural-language question plus the relevant table schemas from db_semantic_context.txt to gpt-4o-mini, executes the generated SQL against AMEND.db, then scores it with a second LLM call using a per-case rubric. Hard pass = SQL ran and returned rows without hitting any known anti-patterns. Fatal = judge determined the query would return wrong or misleading results.

Metric Value
Hard pass rate 10/10 (100%)
Fatal failures 0
Mean judge score 5.0/5
P50 judge score 5/5
Model gpt-4o-mini
Semantic context hash 7dc233f87aac
Per-case results
ID Hard pass Score Fatal Reason
cso_top_operator 5/5 no The query correctly aggregates total CSO discharge volume by operator for 2022, using the correct table and filters, and orders the results as required.
cso_monthly_rainfall 5/5 no The query correctly aggregates both CSO discharge volume and total monthly rainfall, using the appropriate join pattern on aggregated months.
cso_by_watershed 5/5 no The query correctly joins the tables, groups by watershed, and applies the necessary filter for eventType.
enforcement_vs_budget 5/5 no The query correctly joins the enforcement actions and budget tables on the year, extracts the year from the EnforcementDate correctly, and applies the necessary filters.
staffing_trend 5/5 no The query uses the correct table, groups by year, and counts employees accurately from 2005 to present.
303d_impaired_trend 5/5 no The query correctly counts listings from EPA_303d_Impairments grouped by reportingCycle and ordered correctly.
303d_named_waterbody 5/5 no The query correctly uses the EPA_303d_Impairments table, applies the maximum reportingCycle filter, and uses the LIKE pattern for the waterbody.
cso_to_impaired 5/5 no Both joins are correct and the reportingCycle filter is applied to EPA_303d_Impairments, which is appropriate.
all_caps_boston 5/5 no The query correctly uses UPPER() to filter the municipality as required.
ecos_per_capita 5/5 no The query correctly uses the ECOS_budgets table, aggregates spending by state, and calculates per-capita spending for comparison across states.

nesanders and others added 4 commits May 20, 2026 18:23
Both scripts now flush progress to disk frequently so an interrupt
loses at most one disclosure (lobbying) or 50 bills (legislature)
of work, rather than the entire in-progress run.

get_MA_lobbying.py:
- Load each CSV independently so a missing lobbyists file doesn't
  prevent resuming from employers/bills/links
- Flush all three CSVs to disk after every completed disclosure URL
- Print running totals with each flush for live progress monitoring

get_MA_legislature_bills.py:
- Append each bill to the combined DataFrame and flush every 50 bills
- Already had per-bill JSON cache; now the merged CSV is also safe

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- MA_lobbying.md: replace stale seed-phrase scoring description with
  accurate account of differential cosine similarity; add cluster
  summary table; add t-SNE section with lobbying_bill_tsne.html embed
- MA_lobbying_tsne.py: new script generating interactive Plotly t-SNE
  scatter of all lobbied bills coloured by cluster; env bills shown
  larger with white ring; hover shows bill title and cluster
- get_MA_lobbying.py: add exponential-backoff retry (5 attempts) on
  GET/POST timeouts and connection errors; remove unused existing_lobbyists
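
A minimal sketch of exponential-backoff retry with 5 attempts, assuming a generic callable rather than the scraper's actual request helpers. The `with_retries` name, the 1-second base delay, and the injectable `sleep` are illustrative choices. With the `requests` library one would pass `requests.exceptions.Timeout` and `requests.exceptions.ConnectionError` as `retry_on`, since those do not derive from the built-in exceptions used here.

```python
import time


def with_retries(func, attempts=5, base_delay=1.0, sleep=time.sleep,
                 retry_on=(TimeoutError, ConnectionError)):
    """Call func(); on a retryable error wait base_delay * 2**n, then try again."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demo with a fake request that fails twice, then succeeds.
calls = {"n": 0}
delays = []


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


result = with_retries(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
```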

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Explains the full MA lobbying pipeline (scripts 7–9 + cluster):
scraping strategy (iPad UA, ASP.NET viewstate, incremental disc_url
cache), modern vs. legacy HTML formats, legislature API endpoint
quirks, differential cosine embedding scoring, and k-means clustering.
Also covers credentials, CI pipeline order, and manual-only scripts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Contributor

✅ Semantic Eval Results

Each eval sends a natural-language question plus the relevant table schemas from db_semantic_context.txt to gpt-4o-mini, executes the generated SQL against AMEND.db, then scores it with a second LLM call using a per-case rubric. Hard pass = SQL ran and returned rows without hitting any known anti-patterns. Fatal = judge determined the query would return wrong or misleading results.

Metric Value
Hard pass rate 10/10 (100%)
Fatal failures 0
Mean judge score 5.0/5
P50 judge score 5/5
Model gpt-4o-mini
Semantic context hash 7dc233f87aac
Per-case results
ID Hard pass Score Fatal Reason
cso_top_operator 5/5 no The query correctly aggregates total CSO discharge volume by operator for 2022, using the correct table and filters, and includes the necessary grouping and ordering.
cso_monthly_rainfall 5/5 no The query correctly aggregates both CSO discharge volume and total monthly rainfall, using the appropriate join pattern on aggregated months.
cso_by_watershed 5/5 no The query correctly joins the tables, groups by watershed, and applies the necessary filter for eventType.
enforcement_vs_budget 5/5 no The query correctly joins the enforcement actions and budget tables on the year, extracts the year from the EnforcementDate correctly, and applies the necessary filters.
staffing_trend 5/5 no The query uses the correct table, groups by year, and counts employees accurately from 2005 to present.
303d_impaired_trend 5/5 no The query correctly counts listings from EPA_303d_Impairments grouped by reportingCycle and ordered correctly.
303d_named_waterbody 5/5 no The query correctly uses the EPA_303d_Impairments table, applies the maximum reportingCycle filter, and uses the LIKE pattern for the waterbody.
cso_to_impaired 5/5 no Both joins are correct and the reportingCycle filter is applied to EPA_303d_Impairments, which is appropriate.
all_caps_boston 5/5 no The query correctly uses UPPER() to filter the municipality as required.
ecos_per_capita 5/5 no The query correctly uses the ECOS_budgets table, aggregates data for multiple states, and calculates per-capita spending accurately.

Remove general repo overview, CI pipeline table, other scripts section,
and SODA credential reference — lobbying-only content remains.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Contributor

✅ Semantic Eval Results

Each eval sends a natural-language question plus the relevant table schemas from db_semantic_context.txt to gpt-4o-mini, executes the generated SQL against AMEND.db, then scores it with a second LLM call using a per-case rubric. Hard pass = SQL ran and returned rows without hitting any known anti-patterns. Fatal = judge determined the query would return wrong or misleading results.

Metric Value
Hard pass rate 10/10 (100%)
Fatal failures 0
Mean judge score 5.0/5
P50 judge score 5/5
Model gpt-4o-mini
Semantic context hash 7dc233f87aac
Per-case results
ID Hard pass Score Fatal Reason
cso_top_operator 5/5 no The query correctly aggregates total CSO discharge volume by operator for 2022, using the correct table and filters, and orders the results as required.
cso_monthly_rainfall 5/5 no The query correctly aggregates both CSO discharge volume and total monthly rainfall, using the appropriate join pattern and date range filters.
cso_by_watershed 5/5 no The query correctly joins the tables, groups by watershed, and applies the necessary filter for eventType.
enforcement_vs_budget 5/5 no The query correctly joins the enforcement actions and budget tables on the year, extracts the year from the EnforcementDate correctly, and applies the necessary filters.
staffing_trend 5/5 no The query uses the correct table, groups by year, and counts employees accurately from 2005 to present.
303d_impaired_trend 5/5 no The query correctly counts listings from the EPA_303d_Impairments table, groups by reportingCycle, and orders the results correctly.
303d_named_waterbody 5/5 no The query correctly uses the EPA_303d_Impairments table, applies the maximum reportingCycle filter, and uses the LIKE pattern for the waterbody.
cso_to_impaired 5/5 no Both joins are correct and the reportingCycle filter is applied to EPA_303d_Impairments, which is appropriate.
all_caps_boston 5/5 no The query correctly uses UPPER() to filter the municipality and eventType, ensuring accurate results.
ecos_per_capita 5/5 no The query correctly uses the ECOS_budgets table, aggregates spending by state, and calculates per-capita spending for comparison across states.

nesanders and others added 2 commits May 21, 2026 21:30
- Fix MA_lobbying_viz.py DB path (was looking in analysis/ not get_data/)
- Add MA_Lobbying_Bills_Scored and MA_Bill_Cluster_Labels to semantic context
  with correct join examples (is_environmental now in scored table, not legislature)
- Add HB/SB legacy chamber abbreviations to legislature bills fetcher
- Fix score_lobbying_bills.py to fill missing titles from Legislature API
  and skip empty-string texts rather than retrying against Gemini API
- Generate all 6 lobbying charts with 2009+2024 data
- Update DB and GCS with current data (2009+2024 lobbying, GC 185+192 bills)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…peline fixes

Data handling:
- Large lobbying CSVs (bills, employers, summary_links, scored, legislature)
  moved out of git and into GCS; only 100-row _sample.csv files committed.
  assemble_db.py now auto-generates samples after each DB rebuild.
- .gitignore updated to exclude all five large lobbying CSVs going forward.

Parser fixes (get_MA_lobbying.py):
- Detect 5-col vs 6-col legacy disclosure table layouts by checking for
  "Lobbyist name" in the second header cell (affects 2010-2013 entity filers).
- Broaden modern activity table regex from grdvActivitiesNew\d{4}_\d+ to
  grdvActivitiesNew(\d{4})?_\d+ to match 2014-2018 year-less table IDs.
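
The effect of the broadened pattern can be checked in isolation. The two table IDs below are hypothetical examples of the with-year and year-less shapes; real IDs on the portal may carry additional prefixes.

```python
import re

OLD = re.compile(r"grdvActivitiesNew\d{4}_\d+")
NEW = re.compile(r"grdvActivitiesNew(\d{4})?_\d+")

# Hypothetical IDs illustrating the two shapes.
with_year = "grdvActivitiesNew2024_3"
year_less = "grdvActivitiesNew_3"

matches = {
    "old": [bool(OLD.fullmatch(s)) for s in (with_year, year_less)],
    "new": [bool(NEW.fullmatch(s)) for s in (with_year, year_less)],
}
print(matches)
```

Making the year group optional keeps the existing matches and adds the year-less IDs; `group(1)` is `None` when the year is absent.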

Fix get_MA_legislature_bills.py to skip non-numeric bill_number values
instead of crashing on int() conversion.
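
One way to skip non-numeric values while building a bill ID is sketched below. The `build_bill_id` helper, the prefix mapping, and the resulting `H123`-style format are assumptions for illustration; the PR only states that the ID is built from a chamber prefix plus the number and that HB/SB legacy abbreviations are handled.

```python
# Assumed mapping from chamber abbreviations (including legacy HB/SB) to a prefix.
CHAMBER_PREFIX = {"H": "H", "HB": "H", "S": "S", "SB": "S"}


def build_bill_id(chamber, bill_number):
    """Return an ID such as 'H123', or None when the inputs cannot form one."""
    prefix = CHAMBER_PREFIX.get(str(chamber).strip().upper())
    text = str(bill_number).strip()
    if prefix is None or not (text.isascii() and text.isdigit()):
        return None  # skip the row instead of letting int() raise ValueError
    return f"{prefix}{int(text)}"


print(build_bill_id("HB", "0123"), build_bill_id("H", "N/A"))
```

Returning `None` lets the caller filter bad rows in one place. Note that this strict check also rejects float-formatted values such as "123.0".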

Analysis:
- MA_lobbying_viz.py: all aggregations switched to client_name (paying client),
  not entity_name (lobbying firm); five new post charts added (env spend by
  cluster, top env clients, spend vs DEP budget/staff, CSO operators);
  _write_facts() counts unique clients and firms.
- All 12 lobbying chart HTMLs regenerated with partial 2005-2025 scrape data.
- DRAFT analysis post: docs/_posts/2026-05-22-ma-environmental-lobbying.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Comment thread get_data/assemble_db.py
continue
out = f'../docs/data/{fname}_sample.csv'
df.head(100).to_csv(out, index=has_index)
print(f'Wrote sample: {out}')
@github-actions
Contributor

✅ Semantic Eval Results

Each eval sends a natural-language question plus the relevant table schemas from db_semantic_context.txt to gpt-4o-mini, executes the generated SQL against AMEND.db, then scores it with a second LLM call using a per-case rubric. Hard pass = SQL ran and returned rows without hitting any known anti-patterns. Fatal = judge determined the query would return wrong or misleading results.

Metric Value
Hard pass rate 10/10 (100%)
Fatal failures 0
Mean judge score 5.0/5
P50 judge score 5/5
Model gpt-4o-mini
Semantic context hash ff1a87b9b4fd
Per-case results
ID Hard pass Score Fatal Reason
cso_top_operator 5/5 no The query correctly aggregates total CSO discharge volume by operator for 2022, using the correct table and filters, and orders the results as required.
cso_monthly_rainfall 5/5 no The query correctly aggregates both CSO discharge volume and precipitation totals by month and joins them appropriately.
cso_by_watershed 5/5 no The query correctly joins the tables, groups by watershed, and applies the necessary filter for eventType.
enforcement_vs_budget 5/5 no The query correctly joins the enforcement actions and budget tables on the year, extracts the year from the EnforcementDate using strftime, and applies the necessary filters for the years 2010 and onward.
staffing_trend 5/5 no The query correctly uses the MADEP_staff_Comptroller table, groups by year, and counts the number of employees from 2005 to present.
303d_impaired_trend 5/5 no The query correctly counts listings from EPA_303d_Impairments grouped by reportingCycle and ordered correctly.
303d_named_waterbody 5/5 no The query correctly uses the EPA_303d_Impairments table, applies the maximum reportingCycle filter, and uses the LIKE pattern for the waterbody.
cso_to_impaired 5/5 no Both joins are correct and the reportingCycle filter is applied to EPA_303d_Impairments, which is appropriate.
all_caps_boston 5/5 no The query correctly uses UPPER() to filter the municipality as required.
ecos_per_capita 5/5 no The query correctly uses the ECOS_budgets table, aggregates spending by state, and calculates per-capita spending for comparison across states.

nesanders and others added 2 commits May 26, 2026 21:11
…ssemble_db

- MA_lobbying.md data tables already used _sample Jekyll data sources; updated
  download links to reference GCS paths (with pending-upload note until
  assemble_db.py runs and uploads them).
- assemble_db.py now uploads MA_lobbying_bills, MA_lobbying_employers,
  MA_lobbying_bills_scored, and MA_legislature_bills CSVs to GCS after each
  DB rebuild, so the public download links stay current automatically.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ata page

Raw portal names are preserved unchanged in GCS CSVs. assemble_db.py adds
entity_name_norm and client_name_norm columns to MA_Lobbying_Employers and
MA_Lobbying_Bills when loading into the DB, using _normalize_entity() which
strips LLC/LLP/Inc, "Law Office of", "& Associates", etc. and applies
replacements (& → AND). This makes downstream joins robust to typographical
variation across filers without modifying the source data.
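
A rough sketch of this kind of normalization is shown below. It is not the `_normalize_entity()` in assemble_db.py: the uppercasing, the exact regexes, and the punctuation stripping are assumptions, and only the listed suffixes and the & → AND replacement come from the commit message.

```python
import re


def normalize_entity(name: str) -> str:
    """Illustrative normalizer: uppercase, & -> AND, drop common legal boilerplate."""
    text = name.upper().replace("&", " AND ")
    text = re.sub(r"\bLAW\s+OFFICES?\s+OF\b", " ", text)
    text = re.sub(r"\bAND\s+ASSOCIATES\b", " ", text)
    text = re.sub(r"\b(LLC|LLP|INC)\b", " ", text)
    text = re.sub(r"[^A-Z0-9 ]", " ", text)  # drop punctuation
    return " ".join(text.split())  # collapse runs of whitespace


for raw in ["Smith & Jones, LLC", "smith and jones llp", "Law Office of Jane Doe"]:
    print(raw, "->", normalize_entity(raw))
```

Because the normalized value goes into separate `_norm` columns, the raw portal names remain available for display and auditing.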

MA_lobbying.md: use Liquid `limit:10` on all three sample table loops.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions
Contributor

✅ Semantic Eval Results

Each eval sends a natural-language question plus the relevant table schemas from db_semantic_context.txt to gpt-4o-mini, executes the generated SQL against AMEND.db, then scores it with a second LLM call using a per-case rubric. Hard pass = SQL ran and returned rows without hitting any known anti-patterns. Fatal = judge determined the query would return wrong or misleading results.

Metric Value
Hard pass rate 10/10 (100%)
Fatal failures 0
Mean judge score 5.0/5
P50 judge score 5/5
Model gpt-4o-mini
Semantic context hash ff1a87b9b4fd
Per-case results
ID Hard pass Score Fatal Reason
cso_top_operator 5/5 no The query correctly aggregates total CSO discharge volume by operator for 2022, using the correct table and filters, and includes the necessary grouping and ordering.
cso_monthly_rainfall 5/5 no The query correctly aggregates both CSO discharge volume and precipitation totals by month and joins them appropriately.
cso_by_watershed 5/5 no The query correctly joins the tables, groups by watershed, and applies the necessary filter for eventType.
enforcement_vs_budget 5/5 no The query correctly joins the enforcement actions and budget tables on the year, extracts the year from the EnforcementDate, and filters for years 2010 and later.
staffing_trend 5/5 no The query uses the correct table, groups by year, and counts employees accurately from 2005 to present.
303d_impaired_trend 5/5 no The query correctly counts listings from the EPA_303d_Impairments table, groups by reportingCycle, and orders the results correctly.
303d_named_waterbody 5/5 no The query correctly uses the EPA_303d_Impairments table, applies the maximum reportingCycle filter, and uses the LIKE pattern for the waterbody.
cso_to_impaired 5/5 no Both joins are correct and the reportingCycle filter is applied to EPA_303d_Impairments, which is appropriate.
all_caps_boston 5/5 no The query correctly uses UPPER() to filter the municipality as required.
ecos_per_capita 5/5 no The query correctly uses the ECOS_budgets table, aggregates data for multiple states, and calculates per-capita spending accurately.

… to 0.08, add --rescore flag

Spot-checked false positives: liquor licenses, medical debt, digital advertising, LGBTQ bills,
hate crime, dental loss ratio all scoring above 0.05 threshold. Root cause: non-env example set
lacked coverage for healthcare, criminal justice, housing, education, municipal licensing, digital/media.

Changes:
- NON_ENV_EXAMPLE_BILLS expanded from 20 → 42 with targeted coverage of 8 problem domains
- ENV_THRESHOLD raised 0.05 → 0.08 (drops 1280 → ~333 env bills on existing embeddings)
- --rescore flag: re-applies scoring to all existing embeddings using new examples without re-embedding
- Graceful fallback if legislature CSV is unreadable (e.g. mid-write by concurrent fetch)
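
The scoring and `--rescore` behavior can be illustrated with the sketch below. The exact formula in score_lobbying_bills.py is not shown in this PR, so the "mean similarity to env examples minus mean similarity to non-env examples" form is an assumption, as are the function names and the toy 2-D vectors. Only the 0.08 threshold and the idea of re-applying scores to stored embeddings come from the commit message.

```python
import math

ENV_THRESHOLD = 0.08  # threshold value quoted in the commit message


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def differential_score(vec, env_examples, non_env_examples):
    """ASSUMED form: mean similarity to env examples minus mean to non-env examples."""
    env = sum(cosine(vec, e) for e in env_examples) / len(env_examples)
    non_env = sum(cosine(vec, e) for e in non_env_examples) / len(non_env_examples)
    return env - non_env


def rescore(stored_embeddings, env_examples, non_env_examples, threshold=ENV_THRESHOLD):
    """Re-label every stored embedding without calling the embedding API again."""
    return {
        bill_id: differential_score(vec, env_examples, non_env_examples) > threshold
        for bill_id, vec in stored_embeddings.items()
    }


# Toy 2-D vectors standing in for real embeddings.
env_examples = [[1.0, 0.0], [0.9, 0.1]]
non_env_examples = [[0.0, 1.0], [0.1, 0.9]]
stored = {"wetlands_bill": [0.8, 0.2], "ambiguous_bill": [0.5, 0.5]}

labels = rescore(stored, env_examples, non_env_examples)
print(labels)
```

Under this form, adding non-env examples for a problem domain lowers the score of bills in that domain, which matches the motivation given for expanding NON_ENV_EXAMPLE_BILLS.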

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Contributor

✅ Semantic Eval Results

Each eval sends a natural-language question plus the relevant table schemas from db_semantic_context.txt to gpt-4o-mini, executes the generated SQL against AMEND.db, then scores it with a second LLM call using a per-case rubric. Hard pass = SQL ran and returned rows without hitting any known anti-patterns. Fatal = judge determined the query would return wrong or misleading results.

Metric Value
Hard pass rate 10/10 (100%)
Fatal failures 0
Mean judge score 5.0/5
P50 judge score 5/5
Model gpt-4o-mini
Semantic context hash ff1a87b9b4fd
Per-case results
ID Hard pass Score Fatal Reason
cso_top_operator 5/5 no The query correctly aggregates total CSO discharge volume by operator for 2022, using the correct table and filters, and includes the necessary grouping and ordering.
cso_monthly_rainfall 5/5 no The query correctly aggregates both CSO discharge volume and precipitation totals by month and joins them appropriately.
cso_by_watershed 5/5 no The query correctly joins the tables, groups by watershed, and applies the necessary filter for eventType.
enforcement_vs_budget 5/5 no The query correctly joins the enforcement actions and budget tables on the year, extracts the year from the EnforcementDate correctly, and applies the necessary filters.
staffing_trend 5/5 no The query uses the correct table, groups by year, and counts employees accurately from 2005 to present.
303d_impaired_trend 5/5 no The query correctly counts listings from the EPA_303d_Impairments table, groups by reportingCycle, and orders the results correctly.
303d_named_waterbody 5/5 no The query correctly uses the EPA_303d_Impairments table, applies the maximum reportingCycle filter, and uses the LIKE pattern for the waterbody.
cso_to_impaired 5/5 no Both joins are correct and the reportingCycle filter is applied to EPA_303d_Impairments, which is appropriate.
all_caps_boston 5/5 no The query correctly uses UPPER() to filter the municipality as required.
ecos_per_capita 5/5 no The query correctly uses the ECOS_budgets table, aggregates spending by state, and calculates per-capita spending for comparison across states.

…env examples

0.08 was too restrictive (113 bills, 0.4%). 0.06 gives 483 bills (1.9%) and
the marginal bills at that threshold are genuine environmental legislation
(invasive plants, offshore wind, pollinators, wetlands, recycling).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Contributor

✅ Semantic Eval Results

Each eval sends a natural-language question plus the relevant table schemas from db_semantic_context.txt to gpt-4o-mini, executes the generated SQL against AMEND.db, then scores it with a second LLM call using a per-case rubric. Hard pass = SQL ran and returned rows without hitting any known anti-patterns. Fatal = judge determined the query would return wrong or misleading results.

Metric Value
Hard pass rate 10/10 (100%)
Fatal failures 0
Mean judge score 5.0/5
P50 judge score 5/5
Model gpt-4o-mini
Semantic context hash ff1a87b9b4fd
Per-case results
ID Hard pass Score Fatal Reason
cso_top_operator 5/5 no The query correctly aggregates total CSO discharge volume by operator for 2022, using the correct table and filters, and includes the necessary grouping and ordering.
cso_monthly_rainfall 5/5 no The query correctly aggregates both CSO discharge volume and total monthly rainfall, using the appropriate join pattern and date formatting.
cso_by_watershed 5/5 no The query correctly joins the tables, groups by watershed, and applies the necessary filter for eventType.
enforcement_vs_budget 5/5 no The query correctly joins the enforcement actions and budget tables on the year, extracts the year from the EnforcementDate, and filters for years 2010 and later.
staffing_trend 5/5 no The query correctly uses the MADEP_staff_Comptroller table, groups by year, and counts the number of employees from 2005 to present.
303d_impaired_trend 5/5 no The query correctly counts listings from the EPA_303d_Impairments table, groups by reportingCycle, and orders the results correctly.
303d_named_waterbody 5/5 no The query correctly uses the EPA_303d_Impairments table, applies the maximum reportingCycle filter, and uses the LIKE pattern for the waterbody.
cso_to_impaired 5/5 no Both joins are correct and the reportingCycle filter is applied to EPA_303d_Impairments, which is appropriate.
all_caps_boston 5/5 no The query correctly uses UPPER() to filter the municipality as required.
ecos_per_capita 5/5 no The query correctly uses the ECOS_budgets table, aggregates spending by state, and calculates per-capita spending for comparison across states.

nesanders and others added 3 commits May 27, 2026 06:14
…buffer overflow

Status fields containing embedded newlines or long text caused the pandas C engine
to fail with "Buffer overflow caught". QUOTE_NONNUMERIC wraps all string fields
in quotes, so embedded newlines and delimiters stay inside a single quoted field.
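
A minimal sketch of the quoting behaviour this commit relies on, using the standard-library csv module instead of the pipeline's pandas calls (pandas' to_csv takes the same csv.QUOTE_NONNUMERIC constant through its quoting argument). The column names and values are made up for illustration.

```python
import csv
import io

# Hypothetical rows: one status has an embedded newline and comma,
# the other has embedded double quotes.
rows = [
    {"bill_number": 1234, "status": "Referred to committee,\nhearing scheduled"},
    {"bill_number": 5678, "status": 'Reported "ought to pass"'},
]

buf = io.StringIO()
writer = csv.DictWriter(
    buf,
    fieldnames=["bill_number", "status"],
    quoting=csv.QUOTE_NONNUMERIC,  # quote every non-numeric field
)
writer.writeheader()
writer.writerows(rows)

# Reading with the same quoting mode: quoted fields come back as str,
# unquoted fields are converted to float.
buf.seek(0)
parsed = list(csv.DictReader(buf, quoting=csv.QUOTE_NONNUMERIC))
```

The record with the embedded newline still spans two physical lines in the file, but because the field is quoted the reader returns it as a single row.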

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- 228k bill rows, 12k employer rows across all GCs (183-193, 2005-2025)
- 483 env bills at threshold=0.06 with expanded non-env example set
- New cluster labels from re-clustering 25,928 unique bills
- 31,658 legislature bill metadata rows (all GCs now covered)
- Updated sample CSVs, semantic context, facts_lobbying.yml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The github.token lacked actions:write, causing HTTP 403 on
'gh workflow run update-charts.yml --ref main' at end of data update job.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
nesanders and others added 3 commits May 27, 2026 12:33
…tion charts

Bug fixes:
- _annual_env_spend: switch from 'all compensation for any env-bill client'
  to proportional allocation (compensation × env_bills/total_bills per firm/
  client/year). Removes inflation from clients who lobbied 1 env bill among
  hundreds of others. lobbying_total_spend_latest drops from $6.6M → $1.8M.
  Now consistent with the top-env-employers chart which already used this
  methodology.
- Remove unused MA_Lobbying_Lobbyists table load from _load_data (never
  referenced downstream).
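
The proportional allocation described in the first bug fix can be sketched as below. This is an illustration under assumptions, not the code in _annual_env_spend: the row layout, the sample values, and the function name annual_env_spend are hypothetical, and only the formula (compensation × env_bills/total_bills per firm/client/year) comes from the commit message.

```python
from collections import defaultdict

# Hypothetical bill rows: (firm, client, year, bill, is_environmental)
bills = [
    ("Firm A", "Client X", 2024, "H1", True),
    ("Firm A", "Client X", 2024, "H2", False),
    ("Firm A", "Client X", 2024, "H3", False),
    ("Firm A", "Client X", 2024, "H4", False),
    ("Firm B", "Client Y", 2024, "S9", True),
]
# Hypothetical compensation per (firm, client, year)
compensation = {
    ("Firm A", "Client X", 2024): 100_000.0,
    ("Firm B", "Client Y", 2024): 20_000.0,
}


def annual_env_spend(bills, compensation):
    """Allocate each relationship's compensation by its share of env bills."""
    total = defaultdict(int)
    env = defaultdict(int)
    for firm, client, year, _bill, is_env in bills:
        key = (firm, client, year)
        total[key] += 1
        env[key] += int(is_env)

    spend_by_year = defaultdict(float)
    for key, amount in compensation.items():
        if total[key]:  # skip relationships with no bill rows
            spend_by_year[key[2]] += amount * env[key] / total[key]
    return dict(spend_by_year)


env_spend = annual_env_spend(bills, compensation)
```

With these sample rows, Client X lobbied one environmental bill out of four, so only a quarter of its compensation counts toward environmental spend. The earlier all-or-nothing approach would have counted the full amount.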

New charts (generate_post_charts):
- lobbying_env_positions: stacked bar of unique clients by Support/Oppose/
  Neutral position on env bills per year. Shows 2014 opposition spike (60
  opposing clients) and recent shift to more neutral/support engagement.
- lobbying_env_opponents: top 20 clients by unique env bills opposed across
  all years. Chemical/waste/real-estate industry coalition visible; Mass
  Audubon appears as they oppose anti-env bills the scorer also flags.
- lobbying_pass_by_position: env bill pass rate by dominant lobbying
  position (mostly-supported 5%, mostly-opposed 8%, contested 0%).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…d update lobbying analysis

GC183 started January 2003, not 2005. The off-by-one in _year_to_general_court
caused every Legislature API bill lookup to be sent to the wrong session, resulting
in ~98% title mismatches between SoS portal bills and fetched body text.

Confirmed: H3111 GC192 (wrong) = open meeting law; H3111 GC193 (correct) = DCR
skating rinks. Systematic check of 60 bills: 2%→65% title match after correction.
Residual 35% = SoS title-number prefix formatting differences + rare wrong numbers
in lobbyist filings. See get_data/NOTES_bill_embeddings.md for full analysis.

On-disk MA_lobbying_bills.csv migrated (+1 to all general_court values).
MA_legislature_bills.csv backed up and cleared; pipeline will re-fetch with
corrected GCs → score_lobbying_bills.py --reembed to follow.
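
For reference, the corrected mapping can be sketched as follows. It assumes two-year sessions anchored at GC183 = 2003-2004, as stated above; the body is an illustration and may differ from the real _year_to_general_court.

```python
FIRST_GC = 183        # General Court number of the anchor session
FIRST_GC_YEAR = 2003  # first year of that session (the bug used 2005)


def year_to_general_court(year: int) -> int:
    """Map a calendar year to its two-year General Court session number."""
    if year < FIRST_GC_YEAR:
        raise ValueError(f"year {year} predates GC{FIRST_GC}")
    return FIRST_GC + (year - FIRST_GC_YEAR) // 2
```

Anchoring the same formula at 2005 returns a session number one lower for every year, which matches the H3111 GC192 versus GC193 mix-up described above.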

Also includes:
- MA lobbying analysis scripts (viz + t-SNE subsample approach)
- New lobbying charts (env score vs clients scatter, env positions, opponents, etc.)
- get_data/NOTES_bill_embeddings.md: clustering evaluation, mismatch diagnosis
- get_data/test_concat_embeddings.py: title-only vs concat embedding comparison
- CLAUDE.md: document GC bug, correct GC183=2003-2004 note
- docs/data/MA_lobbying.md: fix GC184=2005-2006 reference

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
cluster_lobbying_bills.py now saves the fitted KMeans model and training
mean vector to GCS (MA_bill_kmeans.joblib + MA_bill_emb_mean.npy) on every
full re-cluster run.

New --incremental mode loads the saved model and assigns cluster labels to
any bill with cluster_id == -1 using nearest-centroid lookup — no re-fitting,
no Gemini API call, runs in seconds. Added to update-data.yml CI after
score_lobbying_bills.py so new bills are labelled automatically each week.

Also adds mean-centering to full clustering (training mean subtracted before
L2-norm) consistent with the sweep that showed silhouette improvement.
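
A dependency-free sketch of the incremental path: subtract the saved training mean, L2-normalise, then pick the nearest centroid. The real script loads a fitted KMeans model from MA_bill_kmeans.joblib. Here the centroids, the mean vector, and the bill records are tiny made-up values, and the function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def preprocess(vec, train_mean):
    """Apply the same transform used at fit time: mean-centre, then L2-norm."""
    return l2_normalize([x - m for x, m in zip(vec, train_mean)])


def assign_cluster(vec, train_mean, centroids):
    """Return the index of the centroid closest to the preprocessed vector."""
    z = preprocess(vec, train_mean)
    distances = [math.dist(z, c) for c in centroids]
    return distances.index(min(distances))


def assign_unlabelled(bills, train_mean, centroids):
    """Label only bills whose cluster_id is -1; existing labels are kept."""
    for bill in bills:
        if bill["cluster_id"] == -1:
            bill["cluster_id"] = assign_cluster(
                bill["embedding"], train_mean, centroids
            )
    return bills


# Made-up 2-D example; real embeddings have far more dimensions.
train_mean = [0.5, 0.5]
centroids = [[1.0, 0.0], [0.0, 1.0]]
bills = [
    {"bill_id": "H1", "embedding": [0.9, 0.4], "cluster_id": -1},
    {"bill_id": "S2", "embedding": [0.2, 0.8], "cluster_id": -1},
    {"bill_id": "H3", "embedding": [0.9, 0.4], "cluster_id": 7},
]
assign_unlabelled(bills, train_mean, centroids)
```

No model is re-fitted and no embedding API is called, so this step stays cheap as long as embeddings for the new bills already exist.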

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Contributor

✅ Semantic Eval Results

Metric Value
Hard pass rate 10/10 (100%)
Fatal failures 0
Mean judge score 5.0/5
P50 judge score 5/5
Model gpt-4o-mini
Semantic context hash 752fc928bc04
Per-case results
ID Hard pass Score Fatal Reason
cso_top_operator 5/5 no The query correctly aggregates total CSO discharge volume by operator for 2022, using the correct table and filters, and orders the results as required.
cso_monthly_rainfall 5/5 no The query correctly aggregates both CSO discharge volume and precipitation totals by month and joins them appropriately.
cso_by_watershed 5/5 no The query correctly joins the tables, groups by watershed, applies the necessary filter for eventType, and sums the volume of events.
enforcement_vs_budget 5/5 no The query correctly joins the MAEEADP_Enforcement and MassBudget_summary tables on the year, extracts the year from EnforcementDate using strftime, and filters for years 2010 and later.
staffing_trend 5/5 no The query correctly uses the MADEP_staff_Comptroller table, groups by year, and counts the number of employees from 2005 to present.
303d_impaired_trend 5/5 no The query correctly counts listings from the EPA_303d_Impairments table, groups by reportingCycle, and orders the results correctly.
303d_named_waterbody 5/5 no The query correctly uses the EPA_303d_Impairments table, applies the maximum reportingCycle filter, and uses the LIKE pattern for the waterbody.
cso_to_impaired 5/5 no Both joins are correct and the reportingCycle filter is applied to EPA_303d_Impairments, which is appropriate.
all_caps_boston 5/5 no The query correctly uses UPPER() to filter the municipality and eventType, ensuring accurate results.
ecos_per_capita 5/5 no The query correctly uses the ECOS_budgets table, aggregates spending by state, and calculates per-capita spending for comparison across states.

nesanders and others added 4 commits May 29, 2026 06:51
…eline outputs

Buffer overflow in pandas C parser when reading CSVs with long bill title
fields (cluster labels example_titles, scored CSV bill_title). Fix:
- score_lobbying_bills.py: write scored CSV with QUOTE_NONNUMERIC, cast
  is_environmental to int before writing
- cluster_lobbying_bills.py: write cluster labels with QUOTE_NONNUMERIC
- assemble_db.py: read scored CSV and cluster labels with engine='python'

Also gitignore model artifacts (MA_bill_kmeans.joblib, MA_bill_emb_mean.npy)
and the wrong-GC backup file.

Updated outputs: cluster labels (25 clusters, 654 env bills with corrected
GC body text), sample CSVs, semantic context.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
177 KB + 3 KB — small enough to version alongside the code.
Simplifies CI: no GCS dependency for --incremental cluster assignment,
and the model is versioned with the embeddings that produced it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add summarize_lobbying_bills.py: Gemini 2.5 Flash summary + MAPLE taxonomy
  tags/categories + is_env_llm per bill, with prompt caching (~46% cost saving,
  $2.76 projected for 26k bills). 495-bill pilot run complete.

- Add diagnostics_summarize.py: 6-section diagnostic suite (reference set
  recall/specificity, LLM vs embedding disagreement, tag quality, cost breakdown,
  silhouette comparison, UMAP). Results appended to NOTES_bill_embeddings.md.

- Add cluster_pilot_summaries.py: k-means on summary embeddings with Gemini
  cluster labelling + UMAP recolour. k=20 silhouette=0.060 (best so far).

- Update MA_lobbying_tsne.py: t-SNE → UMAP (n_neighbors=30, cosine). Fix labels
  CSV read to handle malformed rows (engine=python + numeric coerce filter).

- Fix MA_lobbying_viz.py: cluster_id dtype coercion (str→Int64) on merge.

- Fix assemble_db.py: clamp out-of-range general_court values before Int64 cast
  (311 malformed rows had embedding scores in that column).

- Fix cluster_lobbying_bills.py: drop example_titles before CSV write to prevent
  unquoted-comma buffer overflow in downstream readers.

- Add embedding_diagnostics.png: 4-panel diagnostic plot (score distribution,
  borderline zoom, method comparison, cluster env density).

- Update NOTES_bill_embeddings.md: pilot diagnostic results, silhouette tables,
  UMAP links, cost actuals ($0.106/1k bills, 80.8% cached tokens).

- Regenerate all lobbying charts and UMAP with post-GC-fix full corpus data.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… estimates

- summarize_lobbying_bills.py: parallelize LLM + embed with ThreadPoolExecutor
  (8 workers default), exponential backoff on all API calls, inline
  summary_embedding stored per bill, thinking token tracking + cost reporting,
  embed cost (gemini-embedding-2 $0.20/1M) added to running total

- cluster_pilot_summaries.py: parallel embed with ThreadPoolExecutor, smart
  parquet fallback (use stored summary_embedding, re-embed only gaps),
  duplicate-append guard in append_to_notes

- NOTES_bill_embeddings.md: corrected full-corpus cost estimate (~$6.93 all-in
  vs earlier $2.76 LLM-only), pricing table for all models/operations, note on
  thinking token tracking

- UMAP: regenerated with 3,497-bill summary-embedding corpus (k=20, sil=0.064)
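
The parallel-with-backoff pattern from the first bullet can be sketched as below. It is a generic illustration: with_backoff and process_all are hypothetical names, the retry parameters are assumptions, and production code would normally retry only on known transient errors instead of on every Exception.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def with_backoff(fn, *args, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call fn(*args), retrying with exponentially growing, jittered delays."""
    for attempt in range(max_attempts):
        try:
            return fn(*args)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))


def process_all(items, fn, max_workers=8, **backoff_kwargs):
    """Run fn over items in a thread pool; return {item: result}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(with_backoff, fn, item, **backoff_kwargs): item
            for item in items
        }
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

A call such as process_all(bill_ids, summarize_one) would fan the work out over eight workers, where summarize_one stands in for whatever function wraps the LLM and embedding calls.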

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Contributor

✅ Semantic Eval Results

Metric Value
Hard pass rate 10/10 (100%)
Fatal failures 0
Mean judge score 5.0/5
P50 judge score 5/5
Model gpt-4o-mini
Semantic context hash 705e9586d032
Per-case results
ID Hard pass Score Fatal Reason
cso_top_operator 5/5 no The query correctly aggregates total CSO discharge volume by operator for 2022, using the correct table and filters, and includes the necessary grouping and ordering.
cso_monthly_rainfall 5/5 no The query correctly aggregates both CSO discharge volume and precipitation totals by month and joins them appropriately.
cso_by_watershed 5/5 no The query correctly joins the tables, groups by watershed, and applies the necessary filter for eventType.
enforcement_vs_budget 5/5 no The query correctly joins the enforcement actions and budget tables on the year, extracts the year from the EnforcementDate, and filters for years 2010 and later.
staffing_trend 5/5 no The query correctly uses the MADEP_staff_Comptroller table, groups by year, and counts the number of employees from 2005 to present.
303d_impaired_trend 5/5 no The query correctly counts listings from the EPA_303d_Impairments table, groups by reportingCycle, and orders the results correctly.
303d_named_waterbody 5/5 no The query correctly uses the EPA_303d_Impairments table, applies the maximum reportingCycle filter, and uses the LIKE pattern for the waterbody.
cso_to_impaired 5/5 no Both joins are correct and the reportingCycle filter is applied to EPA_303d_Impairments, which is appropriate.
all_caps_boston 5/5 no The query correctly uses UPPER() to filter the municipality as required.
ecos_per_capita 5/5 no The query correctly uses the ECOS_budgets table, aggregates spending by state, and calculates per-capita spending for comparison across states.

… from GCP billing)

Output tokens are the dominant cost at $2.50/1M, not $0.30/1M.
Also corrects input prices to GA tier ($0.30/$0.075 uncached/cached).
Updates NOTES_bill_embeddings.md with verified billing breakdown and
revised full-corpus cost estimate (~$16-17 actual vs ~$6.93 prior estimate).
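
The arithmetic behind the correction can be sketched as follows. The three per-million-token prices are the ones quoted in this commit message. The per-bill token counts are hypothetical placeholders chosen only to show how the output price changes the total.

```python
# USD per 1M tokens, as stated in the commit message above.
PRICES_PER_MILLION = {
    "input_uncached": 0.30,
    "input_cached": 0.075,
    "output": 2.50,
}


def llm_cost_usd(uncached_in, cached_in, output, prices=PRICES_PER_MILLION):
    """Cost in USD for the given token counts."""
    total = (
        uncached_in * prices["input_uncached"]
        + cached_in * prices["input_cached"]
        + output * prices["output"]
    )
    return total / 1_000_000


# Hypothetical per-bill token counts (not taken from billing data).
per_bill = {"uncached_in": 400, "cached_in": 1600, "output": 200}

actual = llm_cost_usd(**per_bill)
# Same tokens, but with the mistaken $0.30/1M output price.
assumed = llm_cost_usd(
    **per_bill, prices={**PRICES_PER_MILLION, "output": 0.30}
)
```

Because output tokens are priced much higher than input tokens, even a modest number of output tokens per bill can dominate the total.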

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Contributor

github-actions Bot commented Jun 1, 2026

✅ Semantic Eval Results

Metric Value
Hard pass rate 10/10 (100%)
Fatal failures 0
Mean judge score 5.0/5
P50 judge score 5/5
Model gpt-4o-mini
Semantic context hash 705e9586d032
Per-case results
ID Hard pass Score Fatal Reason
cso_top_operator 5/5 no The query correctly aggregates total CSO discharge volume by operator for 2022, using the correct table and filters, and includes the necessary grouping and ordering.
cso_monthly_rainfall 5/5 no The query correctly aggregates both CSO discharge volume and total monthly rainfall, using the appropriate join pattern on aggregated months.
cso_by_watershed 5/5 no The query correctly joins the tables, groups by watershed, and applies the necessary filter for eventType.
enforcement_vs_budget 5/5 no The query correctly joins the enforcement actions and budget tables on the year, extracts the year from the EnforcementDate using strftime, and filters for years 2010 and later.
staffing_trend 5/5 no The query correctly uses the MADEP_staff_Comptroller table, groups by year, and counts the number of employees from 2005 to present.
303d_impaired_trend 5/5 no The query correctly counts listings from the EPA_303d_Impairments table, groups by reportingCycle, and orders the results correctly.
303d_named_waterbody 5/5 no The query correctly uses the EPA_303d_Impairments table, applies the maximum reportingCycle filter, and uses the LIKE pattern for the waterbody.
cso_to_impaired 5/5 no Both joins are correct and the reportingCycle filter is applied to EPA_303d_Impairments, which is appropriate.
all_caps_boston 5/5 no The query correctly uses UPPER() to filter the municipality as required.
ecos_per_capita 5/5 no The query correctly uses the ECOS_budgets table, aggregates spending by state, and calculates per-capita spending for comparison across states.

nesanders and others added 4 commits June 1, 2026 09:17
If the save block crashes inside the ThreadPoolExecutor loop, Python's
executor.shutdown(wait=True) runs all remaining futures to completion —
spending API budget while saving nothing. Wrap the checkpoint save in
try/except so errors log a warning and continue rather than propagating.

Also fixes the immediate cause: assigning a list to summary_embedding through
df.loc was treated as a 2D assignment; the code now uses df.at instead.
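
A minimal sketch of the guarded checkpoint, assuming a generic work function and save callback. run_with_checkpoints and its parameters are hypothetical names for illustration, not the ones in summarize_lobbying_bills.py.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

log = logging.getLogger("summarize")


def run_with_checkpoints(items, work, save, every=2, max_workers=4):
    """Process items in a pool, checkpointing every `every` completions."""
    done = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(work, item): item for item in items}
        for i, future in enumerate(as_completed(futures), start=1):
            done[futures[future]] = future.result()
            if i % every == 0:
                try:
                    save(dict(done))  # pass a snapshot, not the live dict
                except Exception:
                    # A failed save must not abort the loop: leaving the
                    # `with` block would still wait for every queued future.
                    log.warning("checkpoint save failed; continuing",
                                exc_info=True)
    return done
```

Errors raised by the work function itself still propagate through future.result(); only the save step is shielded here.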

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Each completed bill now logs a 'SUMMARY: <text>' line to stdout so that
if a run crashes before saving, recover_from_log.py can parse the log
and restore all fields (summary, categories, tags, is_env_llm) to the
parquet without re-running Gemini.
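
A sketch of how such a recovery pass could work. The commit only states that each bill logs a 'SUMMARY: <text>' line. The assumption that the text is a JSON object carrying bill_id, general_court, and the four restored fields is hypothetical, as is the function name recover_records, so the real recover_from_log.py may parse a different layout.

```python
import json

PREFIX = "SUMMARY: "
# Assumed payload keys; only the last four are named in the commit message.
REQUIRED = ("bill_id", "general_court", "summary",
            "categories", "tags", "is_env_llm")


def recover_records(log_lines):
    """Return {(bill_id, general_court): record} from parseable SUMMARY lines."""
    records = {}
    for line in log_lines:
        start = line.find(PREFIX)
        if start == -1:
            continue  # unrelated log line
        try:
            record = json.loads(line[start + len(PREFIX):])
        except json.JSONDecodeError:
            continue  # e.g. a line truncated by the crash
        if not isinstance(record, dict) or any(k not in record for k in REQUIRED):
            continue  # incomplete payload
        # Later lines for the same bill overwrite earlier ones.
        records[(record["bill_id"], record["general_court"])] = record
    return records
```

The recovered records could then be written back to the parquet keyed on (bill_id, general_court), skipping any bill that already has a stored summary.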

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- cluster_pilot_summaries.py: k=20 UMAP on full 25,915-bill summary embeddings
- cluster_lobbying_bills.py: re-fit k=25 clusters on original embeddings (full corpus)
- assemble_db.py: rebuilt AMEND.db with updated MA_Lobbying_Bills_Scored,
  MA_Bill_Cluster_Labels; regenerated semantic context
- MA_lobbying_viz.py + MA_lobbying_tsne.py: all charts regenerated

Also fixes:
- cluster_lobbying_bills.py: engine='python' for CSV with long text fields;
  NaN-safe general_court/bill_number key building for both full and incremental paths

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New charts in MA_lobbying_viz.py:
- lobbying_gc_trend: env bills + clients per General Court (GC186-194)
- lobbying_env_categories_by_gc: stacked bar of env bill LLM categories by GC
- lobbying_employer_env_scatter: Plotly scatter, total spend vs env fraction
- lobbying_opposition_pairs: top 15 employer pairs most often on opposite sides
- lobbying_top_env_tags: top 15 LLM tags on env bills

Post (2026-05-22-ma-environmental-lobbying.md):
- Added GC trend section with narrative (134 → 493 bills, 77 → 624 clients)
- Added category stacked bar section
- Added employer scatter section with interpretation
- Added top tags section
- Added opposition pairs section with AIM/ELM/National Grid narrative
- Updated env scoring description to cover both LLM + embedding approaches
- Updated caveats to mention GC off-by-one bug and LLM classification limits
- Updated reproducibility section

Also adds docs/data/MA_lobbying_static_site_proposal.md: detailed prompt for
building a standalone static browsable site for the lobbying dataset.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Contributor

github-actions Bot commented Jun 2, 2026

✅ Semantic Eval Results

Metric Value
Hard pass rate 10/10 (100%)
Fatal failures 0
Mean judge score 5.0/5
P50 judge score 5/5
Model gpt-4o-mini
Semantic context hash 902ce022774f
Per-case results
ID Hard pass Score Fatal Reason
cso_top_operator 5/5 no The query correctly aggregates total CSO discharge volume by operator for 2022, using the correct table and filters, and includes the necessary grouping and ordering.
cso_monthly_rainfall 5/5 no The query correctly aggregates both CSO discharge volume and total monthly rainfall, using the appropriate join pattern and date range filters.
cso_by_watershed 5/5 no The query correctly joins the tables, groups by watershed, and applies the necessary filter for eventType.
enforcement_vs_budget 5/5 no The query correctly joins the enforcement actions and budget tables on the year, extracts the year from the EnforcementDate using strftime, and applies the necessary filters for the years 2010 and onward.
staffing_trend 5/5 no The query correctly uses the MADEP_staff_Comptroller table, groups by year, and counts the number of employees from 2005 to present.
303d_impaired_trend 5/5 no The query correctly counts listings from the EPA_303d_Impairments table, groups by reportingCycle, and orders the results correctly.
303d_named_waterbody 5/5 no The query correctly uses the EPA_303d_Impairments table, applies the maximum reportingCycle filter, and uses the LIKE pattern for the waterbody.
cso_to_impaired 5/5 no Both joins are correct and the reportingCycle filter is applied to EPA_303d_Impairments, which is appropriate.
all_caps_boston 5/5 no The query correctly uses UPPER() to filter the municipality as required.
ecos_per_capita 5/5 no The query correctly uses the ECOS_budgets table, aggregates spending by state, and calculates per-capita spending for comparison across states.

nesanders and others added 9 commits June 1, 2026 21:10
- Covers all ~26k lobbied bills, not just environmental ones
- env classification is one filter, not a scope constraint
- Drops per-entity static page generation entirely
- bills.html?id=H1234&gc=194 / employers.html?name=slug pattern
  routes detail views purely client-side via URLSearchParams
- Added edges.json lazy-load strategy (fetched once, cached in module promise)
- Updated data size estimates and file structure accordingly
- Removed build_pages.py and Jinja2 dependency (no build step)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previous semantic context had wrong column names and join patterns throughout
the lobbying tables, which would cause the AI Analysis LLM to generate broken SQL:

- employer_name → entity_name (firm) / client_name (employer) everywhere
  CRITICAL: entity_name = lobbying firm, client_name = paying employer;
  all employer-level analysis must use client_name
- total_expenditure → compensation (MA_Lobbying_Employers column name)
- subject_tags → removed (column does not exist)
- MA_Lobbying_Lobbyists → removed (table does not exist)
- cluster range 0–14 → 0–24 (25 clusters, not 15)
- Join key MA_Lobbying_Bills ↔ MA_Lobbying_Employers is now correctly documented
  as (entity_name, client_name, year) — three columns, not just (employer_name, year)

Updated JOIN_RELATIONSHIPS examples:
- env spend by year: now uses proportional allocation pattern with correct cols
- top clients: uses client_name + compensation, warns NOT entity_name
- topic cluster: uses client_name not employer_name
- lobbying vs enforcement: uses compensation not total_expenditure
- new: support/oppose positions example using position column
- bill passage tier: uses client_name, client_count

Regenerated docs/assets/db_semantic_context.txt.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…semble_db

Issue 1 — MA_Lobbying_Bills (2,868 extra rows removed):
  Null-bill rows ('no specific bills' disclosures) were scraped multiple times
  from the SoS portal across filing periods, producing exact duplicates.
  Fix: sort by amount desc, drop_duplicates on
  (entity_name, client_name, year, general_court, bill_number, position),
  keeping the row with the highest amount so no spend is lost.
  Result: 228,046 → 225,178 rows.

Issue 2 — MA_Lobbying_Bills_Scored (1,457 extra rows removed):
  Multiple filers referenced the same bill_number with slightly different
  bill_title strings, so the scoring pipeline produced two scored rows for
  the same (bill_number, general_court). GC189 was the dominant case (2,730
  of 2,913 dup rows), not GC194 as the external report suggested.
  Fix: sort by env_relevance_score desc (then by bill_id presence),
  drop_duplicates on (bill_number, general_court), keeping the highest-
  scoring row so env-relevant bills are never suppressed by a lower-scoring
  duplicate title variant.
  Result: 25,932 → 24,475 rows.

Both deduplication steps run at DB assembly time; source CSVs are unchanged.
Rebuilt and uploaded AMEND.db to gs://openamend-data/amend.db.
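
Both fixes follow the same keep-the-best-row pattern, sketched here without pandas. The helper name dedupe_keep_max and the sample rows are hypothetical. The key columns are the ones listed under Issue 1, and the ranking column is assumed to be non-null.

```python
def dedupe_keep_max(rows, key_fields, rank_field):
    """Keep one row per key: the one with the largest rank_field value.

    Comparable to sorting by rank_field descending and then dropping
    duplicates on key_fields, keeping the first row of each group.
    """
    best = {}
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in best or row[rank_field] > best[key][rank_field]:
            best[key] = row
    return list(best.values())


BILL_KEY = ("entity_name", "client_name", "year",
            "general_court", "bill_number", "position")

# Hypothetical rows: the first two are the same null-bill disclosure
# scraped twice with different amounts.
rows = [
    {"entity_name": "Firm A", "client_name": "Client X", "year": 2024,
     "general_court": 193, "bill_number": None, "position": None,
     "amount": 1000.0},
    {"entity_name": "Firm A", "client_name": "Client X", "year": 2024,
     "general_court": 193, "bill_number": None, "position": None,
     "amount": 2500.0},
    {"entity_name": "Firm A", "client_name": "Client X", "year": 2024,
     "general_court": 193, "bill_number": 12, "position": "Support",
     "amount": 500.0},
]
deduped = dedupe_keep_max(rows, BILL_KEY, "amount")
```

The same helper applied with env_relevance_score as the ranking field would cover the Issue 2 case.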

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
459 rows had bill_title values that were concatenated multi-bill text blobs
(up to 24,355 chars) scraped from the SoS portal's combined display pages.

Fix: after deduplication, join scored bills to MA_Legislature_Bills on
(bill_id, general_court) and replace any bill_title > 300 chars with the
authoritative title from the Legislature API wherever the join succeeds.

Result:
  453 of 459 long titles replaced (max len: 24,355 → 740 chars)
  Remaining 52 rows > 300 chars are legitimately long procedural titles
  (Governor messages, conference committee texts, budget amendments) —
  not scraper artifacts.

Rebuilt and uploaded AMEND.db to gs://openamend-data/amend.db.
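
The title repair can be sketched as a keyed lookup. The 300-character threshold and the (bill_id, general_court) join key come from the commit message. The function name repair_long_titles and the dict-based row layout are hypothetical.

```python
MAX_TITLE_LEN = 300  # threshold stated in the commit message


def repair_long_titles(scored_rows, legislature_titles, max_len=MAX_TITLE_LEN):
    """Replace over-long titles in place; return how many were replaced.

    legislature_titles maps (bill_id, general_court) to the official title.
    Rows with no match keep their original title.
    """
    replaced = 0
    for row in scored_rows:
        if len(row["bill_title"]) <= max_len:
            continue
        official = legislature_titles.get((row["bill_id"], row["general_court"]))
        if official:
            row["bill_title"] = official
            replaced += 1
    return replaced
```

Because a replacement happens only when the join succeeds, long titles without a Legislature API match are left untouched.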

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…xplorer

- assemble_db.py: deduplicate MA_Lobbying_Bills_Scored on (bill_id, general_court)
  instead of (bill_number, general_court) — H and S bills share the same integer
  bill_number within a session and must not be merged
- assemble_db.py: derive bill_prefix and bill_id columns on MA_Lobbying_Bills
  from chamber field (e.g. 'House Bill' → 'H', bill_id = 'H1234'); 95.4% of rows
  get a non-NULL bill_id
- MA_lobbying_viz.py: add _bill_merge() helper that joins on (bill_id, gc) when
  available, falling back to (bill_number, gc) for legacy rows without bill_id;
  update all 10+ join sites to use it — fixes 68,437 cross-prefix contaminated
  rows in env bill counts, spend allocation, and opposition pairs chart
- generate_semantic_context.py / db_semantic_context.txt: document new columns,
  update all JOIN_RELATIONSHIPS SQL examples to use (bill_id, general_court)
- Regenerate all 20 lobbying charts against corrected data
- Blog post: add MA Lobbying Explorer link; link named employers (AIM, ELM,
  National Grid, Eversource, Orsted, NextEra, Bloom Energy, CLF, NECE, Vote
  Solar, BCC Solar, MMA) to explorer employer pages
- Data page: add explorer browse links at top; link bill_id, client_name, and
  entity_name in sample tables to explorer bills/employers/lobbyists pages
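
A sketch of the join-with-fallback idea behind _bill_merge, written with plain dicts. The fallback index deliberately contains only scored rows that lack a bill_id, so a legacy row cannot pick up an unrelated H or S twin. That detail, the function name bill_merge, and the sample rows are assumptions for illustration, not a description of the helper in MA_lobbying_viz.py.

```python
def bill_merge(lobby_rows, scored_rows):
    """Attach env_relevance_score to lobbying rows.

    Join on (bill_id, general_court) when the lobbying row has a bill_id,
    otherwise fall back to (bill_number, general_court).
    """
    by_id = {
        (s["bill_id"], s["general_court"]): s
        for s in scored_rows if s.get("bill_id")
    }
    by_number = {
        (s["bill_number"], s["general_court"]): s
        for s in scored_rows if not s.get("bill_id")
    }
    merged = []
    for row in lobby_rows:
        if row.get("bill_id"):
            match = by_id.get((row["bill_id"], row["general_court"]))
        else:
            match = by_number.get((row["bill_number"], row["general_court"]))
        if match:
            merged.append(
                {**row, "env_relevance_score": match["env_relevance_score"]}
            )
    return merged


# Hypothetical data: H1234 and S1234 share bill_number 1234 in GC193.
scored = [
    {"bill_id": "H1234", "bill_number": 1234, "general_court": 193,
     "env_relevance_score": 0.09},
    {"bill_id": "S1234", "bill_number": 1234, "general_court": 193,
     "env_relevance_score": 0.01},
    {"bill_id": None, "bill_number": 77, "general_court": 188,
     "env_relevance_score": 0.05},
]
lobby = [
    {"client_name": "Client X", "bill_id": "S1234", "bill_number": 1234,
     "general_court": 193},
    {"client_name": "Client Y", "bill_id": "H1234", "bill_number": 1234,
     "general_court": 193},
    {"client_name": "Client Z", "bill_id": None, "bill_number": 77,
     "general_court": 188},
]
merged = bill_merge(lobby, scored)
```

Joining these rows on (bill_number, general_court) alone would have matched each of the two GC193 lobbying rows to both scored bills.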

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace simple token-stripping with structured normalization pipeline:
  1. Strip d/b/a suffix first (prevents trade name bleeding into canonical form)
  2. Hyphen -> space (LAN-TEL == LAN TEL)
  3. Punctuation -> space (not '') so adjacent tokens don't concatenate
  4. Whole-word regex removal of legal entity type words (LLC, LLP, INC,
     INCORPORATED, CORPORATION, CORP, LTD, LIMITED, PC, PLLC)
  5. Whole-word 'THE' removal (any position, not just leading prefix)
  6. & -> AND
  7. ASSICIATES typo fix

Merges ~180 additional redundant entity groups (e.g. Partners In Democracy /
Partners in Democracy Inc, LAN-TEL Communications / Lan-Tel Communications Inc).
Reduces distinct normalized entity count from ~6,512 to ~4,790.
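
The normalization steps above can be sketched as below. Several details are assumptions: names are upper-cased first, the d/b/a pattern covers only the D/B/A and DBA spellings, and the & -> AND replacement runs before the punctuation pass so the ampersand is not erased by it. The function name normalize_entity is hypothetical.

```python
import re

LEGAL_WORDS = ("LLC", "LLP", "INC", "INCORPORATED", "CORPORATION",
               "CORP", "LTD", "LIMITED", "PC", "PLLC")
LEGAL_RE = re.compile(r"\b(?:" + "|".join(LEGAL_WORDS) + r")\b")
DBA_RE = re.compile(r"\s+D/?B/?A\b.*$")  # " D/B/A ..." or " DBA ..." to end


def normalize_entity(name: str) -> str:
    """Return a canonical form of an entity name for grouping duplicates."""
    s = name.upper()
    s = DBA_RE.sub("", s)                    # strip d/b/a suffix first
    s = s.replace("&", " AND ")              # & -> AND (before punctuation pass)
    s = s.replace("-", " ")                  # hyphen -> space
    s = re.sub(r"[^\w\s]", " ", s)           # punctuation -> space, not ''
    s = LEGAL_RE.sub(" ", s)                 # whole-word legal entity types
    s = re.sub(r"\bTHE\b", " ", s)           # 'THE' in any position
    s = re.sub(r"\bASSICIATES\b", "ASSOCIATES", s)  # known typo
    return " ".join(s.split())               # collapse whitespace
```

Under these rules, "LAN-TEL Communications" and "Lan-Tel Communications Inc" normalize to the same string, which is what allows the two records to be grouped.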

Regenerate AMEND.db, all lobbying charts, and GCS samples.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GITHUB_TOKEN cannot trigger workflow_dispatch events on other workflows
(GitHub security restriction, regardless of actions:write permission).

New structure — no cross-dispatch, each workflow is fully self-contained:
- update-weekly.yml: scheduled Monday 06:00 UTC; runs full data fetch +
  chart generation in sequence; commits data/ + charts/ in one PR
- update-data.yml: manual only; data fetch + DB assembly; commits data/
- update-charts.yml: manual only; downloads DB from GCS, regenerates
  dashboard charts; commits charts/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Contributor

github-actions Bot commented Jun 3, 2026

✅ Semantic Eval Results

Metric Value
Hard pass rate 10/10 (100%)
Fatal failures 0
Mean judge score 5.0/5
P50 judge score 5/5
Model gpt-4o-mini
Semantic context hash 36fda9ffbb67
Per-case results
ID Hard pass Score Fatal Reason
cso_top_operator 5/5 no The query correctly aggregates total CSO discharge volume by operator for 2022, using the correct table and filters, and includes the necessary grouping and ordering.
cso_monthly_rainfall 5/5 no The query correctly aggregates both CSO discharge volume and precipitation totals by month and joins them appropriately.
cso_by_watershed 5/5 no The query correctly joins the tables, groups by watershed, applies the necessary filter for eventType, and sums the volume of events.
enforcement_vs_budget 5/5 no The query correctly joins the enforcement actions and budget tables on the year, extracts the year from the EnforcementDate correctly, and applies the necessary filters.
staffing_trend 5/5 no The query correctly uses the MADEP_staff_Comptroller table, groups by year, and counts employees from 2005 to present.
303d_impaired_trend 5/5 no The query correctly counts listings from the EPA_303d_Impairments table, groups by reportingCycle, and orders the results correctly.
303d_named_waterbody 5/5 no The query correctly uses the EPA_303d_Impairments table, applies the maximum reportingCycle filter, and uses the LIKE pattern for the waterbody.
cso_to_impaired 5/5 no Both joins are correct and the reportingCycle filter is applied to EPA_303d_Impairments, which is appropriate.
all_caps_boston 5/5 no The query correctly uses UPPER() to filter the municipality as required.
ecos_per_capita 5/5 no The query correctly uses the ECOS_budgets table, aggregates data for multiple states, and calculates per-capita spending accurately.

…-ci.txt

scikit-learn==1.8.0 requires joblib>=1.3.0; the old joblib==1.2.0 pin caused
a ResolutionImpossible error in CI. Also consolidates gcsfs/pyarrow (needed
by the lobbying clustering pipeline parquet I/O) into the correct section
with proper comment.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
nesanders and others added 2 commits June 4, 2026 16:46
…h steps

score_lobbying_bills.py was deduplicating lobby bills on (bill_number, gc),
causing every S-prefix bill to be silently skipped when its H-prefix twin was
already embedded (and vice versa). H1234 and S1234 are independent bills with
independent content — they must be embedded separately.

Fix: derive bill_id from the chamber column (House Bill → H, Senate Bill → S,
etc.) BEFORE deduplication and key on (bill_id, gc) for H/S/HD/SD bills.
Bills with unmapped chamber types (Joint, Executive, etc.) retain the old
(bill_number, gc) dedup since they have no H/S collision risk. Legislature API
bill_id_map is no longer used to assign bill_id for H/S bills (that map is
keyed on bill_number and would silently overwrite the correct id with the twin).

Ran local backfill: 7,057 previously missing bills now embedded.
Parquet: 26,102 → 33,159 rows. Environmental bills: 661 → 924.
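
The corrected dedup key can be sketched as follows. Only the 'House Bill' → H and 'Senate Bill' → S mappings are stated in this PR. The docket labels in CHAMBER_PREFIX and the function names are hypothetical placeholders for the other mapped chamber types.

```python
# Chamber label -> bill prefix. The two docket labels are assumed strings.
CHAMBER_PREFIX = {
    "House Bill": "H",
    "Senate Bill": "S",
    "House Docket": "HD",
    "Senate Docket": "SD",
}


def dedup_key(row):
    """Key on (bill_id, gc) for mapped chambers, else (bill_number, gc)."""
    prefix = CHAMBER_PREFIX.get(row["chamber"])
    if prefix is None:
        # Unmapped chamber types keep the old key.
        return ("number", row["bill_number"], row["general_court"])
    return ("id", f"{prefix}{row['bill_number']}", row["general_court"])


def unique_bills(rows):
    """Return the first row seen for each dedup key, in input order."""
    seen = {}
    for row in rows:
        seen.setdefault(dedup_key(row), row)
    return list(seen.values())
```

Tagging each key with "id" or "number" keeps the two key spaces apart, so an unmapped bill can never collide with a derived bill_id.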

CI workflow changes:
- Split monolithic 'Fetch data' step into 4 named steps with timeouts
- Legislature bills step: 90 min → 20 min (incremental runs fetch ~dozens
  of new bills per week, not thousands)
- Scoring step: 30 min (sufficient for typical weekly incremental volume)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- summarize_lobbying_bills.py: 7,211 bills processed, 0 failures, $4.62 total
  Actual rate: $0.627/1k bills (output tokens at $2.50/1M dominate at 60%)
- Regenerated all lobbying charts with full tag/summary coverage
- Rebuilt AMEND.db + semantic context with updated scored bills
- Documented actual Gemini API costs in CLAUDE.md, score_lobbying_bills.py,
  summarize_lobbying_bills.py, and NOTES_bill_embeddings.md to prevent
  future underestimates (prior projections were off by ~6× due to wrong
  output token price: $0.30/1M used instead of $2.50/1M actual)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>