feat: MA lobbying data pipeline and dashboard charts #71
Adds end-to-end pipeline for MA Secretary of State lobbying disclosures:
- get_MA_lobbying.py: scrapes SoS portal (iPad UA bypasses Incapsula WAF), incremental via disc_url set in summary_links CSV; weekly CI exits early when no new semi-annual filings are posted
- get_MA_legislature_bills.py: fetches bill metadata from MA Legislature OpenAPI for bills appearing in lobbying data; JSON cache under MA_legislature_cache/ for incremental re-runs
- score_lobbying_bills.py: scores bills for environmental relevance using Gemini embedding-2 cosine similarity against 20 seed phrases (threshold 0.60)
- MA_lobbying_viz.py: 4 dashboard charts (spend trend, top employers, bill intensity, lobbying vs enforcement) + 2 analysis-post charts
- Wires all scripts into update-data.yml CI, assemble_db.py (4 new tables), validate_data.py (OPTIONAL_DATASETS so CI doesn't fail before first fetch), generate_semantic_context.py, and dashboard_charts.py
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
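The incremental behaviour described for get_MA_lobbying.py (a set of already-seen disc_url values, early exit when nothing is new) can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the helper names, the CSV layout beyond a `disc_url` column, and the example URLs are all assumptions.

```python
import csv
import io


def load_seen_urls(csv_text: str, column: str = "disc_url") -> set:
    """Collect the disclosure URLs already recorded in the summary-links CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[column] for row in reader if row.get(column)}


def new_disclosures(candidates: list, seen: set) -> list:
    """Keep only URLs that have not been scraped yet, preserving order."""
    return [url for url in candidates if url not in seen]


# Hypothetical contents of the summary-links CSV and of a fresh search page.
existing_csv = (
    "disc_url,year\n"
    "https://example.invalid/d/1,2024\n"
    "https://example.invalid/d/2,2024\n"
)
seen = load_seen_urls(existing_csv)
todo = new_disclosures(
    ["https://example.invalid/d/2", "https://example.invalid/d/3"], seen
)

if not todo:
    print("No new filings; exiting early")
else:
    print(f"{len(todo)} new disclosure(s) to fetch")
```

Keying on the disclosure URL rather than on a date means a re-run after an interrupted scrape picks up exactly where the CSV left off.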
✅ Semantic Eval ResultsEach eval sends a natural-language question plus the relevant table schemas from
Per-case results
|
…notes
- docs/data/MA_lobbying.md: new dataset page with source description, data tables (employers, bills, legislature bills), and download links
- docs/dashboard.md: add lobbying section with 4 chart includes and methodology note; add nav link
- CLAUDE.md: document Incapsula WAF bypass (iPad UA), conda run stdout buffering gotcha, correct Gemini SDK (google.genai not google.generativeai), full historical fetch timing, and REQUEST_DELAY tip for historical runs
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ill rows) Full fetch of all 1,715 registrants for 2024. Historical years (2005–2023) to follow in a subsequent commit once the full fetch completes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ying viz
- MA_lobbying_viz.py: entity_name/compensation (not employer_name/total_expenditure);
dual-axis charts use yAxisID='y'/'y1' + y2nd=1 per chartjs convention
- get_MA_legislature_bills.py: use /Documents/{billId} endpoint (not /Bills/);
construct bill ID from chamber prefix + number; fetch history via separate
DocumentHistoryActions URL; Action field (not StatusDescription) for passed
- Add initial 2024 dashboard chart outputs (3 of 4; bill intensity pending legislature data)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- score_lobbying_bills.py: rewritten to embed bill titles directly from MA_lobbying_bills.csv (not legislature CSV); stores embeddings as MA_bill_embeddings.npy for clustering; incremental per run
- cluster_lobbying_bills.py: one-time k-means (default 15 clusters) on normalized embeddings + Gemini Flash labeling of each cluster; writes MA_bill_cluster_labels.csv and updates cluster_id in scored CSV
- MA_lobbying_viz.py: add Chart 5, a stacked bar of annual spend by topic cluster; gracefully skipped until cluster_lobbying_bills.py has been run
- dashboard.md: add cluster spend chart include
- requirements-ci.txt: add scikit-learn==1.8.0
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ata from DB
- assemble_db.py: coerce bill_number/general_court to Int64 in MA_Lobbying_Bills, MA_Legislature_Bills, MA_Lobbying_Bills_Scored; add MA_Lobbying_Bills_Scored and MA_Bill_Cluster_Labels as DB tables so all downstream analysis reads from DB
- MA_lobbying_viz.py: remove CSV file reads; load scored bills and cluster labels from DB; remove redundant numeric coercions (now guaranteed by assemble_db.py)
- cluster_lobbying_bills.py: update to gemini-2.5-flash for cluster labeling
- score_lobbying_bills.py: differential cosine scoring with example bills
- Add dash_lobbying_bill_intensity.html and dash_lobbying_spend_by_cluster.html charts
- Update semantic context with new DB tables
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both scripts now flush progress to disk frequently so an interrupt loses at most one disclosure (lobbying) or 50 bills (legislature) of work, rather than the entire in-progress run.
get_MA_lobbying.py:
- Load each CSV independently so a missing lobbyists file doesn't prevent resuming from employers/bills/links
- Flush all three CSVs to disk after every completed disclosure URL
- Print running totals with each flush for live progress monitoring
get_MA_legislature_bills.py:
- Append each bill to the combined DataFrame and flush every 50 bills
- Already had per-bill JSON cache; now the merged CSV is also safe
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
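The flush-every-50-bills behaviour can be illustrated with a small stdlib sketch. The function names, the two-column layout, and the temp-file-then-rename write are assumptions made for illustration; the real script works on a pandas DataFrame.

```python
import csv
import os

FLUSH_EVERY = 50  # interval taken from the commit message above


def write_rows(path, rows, fieldnames):
    """Write all rows to a temp file, then swap it in, so an interrupt
    mid-write cannot leave a truncated CSV behind (an added assumption)."""
    tmp = path + ".tmp"
    with open(tmp, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    os.replace(tmp, path)


def fetch_all(bill_ids, fetch_one, path):
    """Fetch bills one by one, flushing the merged CSV every FLUSH_EVERY bills."""
    rows = []
    for i, bill_id in enumerate(bill_ids, start=1):
        rows.append(fetch_one(bill_id))
        if i % FLUSH_EVERY == 0:
            write_rows(path, rows, ["bill_id", "title"])
    write_rows(path, rows, ["bill_id", "title"])  # final flush for the remainder
    return rows
```

With this shape, an interrupt between flushes loses at most `FLUSH_EVERY` bills of merged output, which matches the bound stated in the commit.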
- MA_lobbying.md: replace stale seed-phrase scoring description with accurate account of differential cosine similarity; add cluster summary table; add t-SNE section with lobbying_bill_tsne.html embed
- MA_lobbying_tsne.py: new script generating interactive Plotly t-SNE scatter of all lobbied bills coloured by cluster; env bills shown larger with white ring; hover shows bill title and cluster
- get_MA_lobbying.py: add exponential-backoff retry (5 attempts) on GET/POST timeouts and connection errors; remove unused existing_lobbyists
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
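For readers unfamiliar with the retry pattern named above, here is a minimal sketch of exponential backoff with 5 attempts. The helper name, the base delay, and the use of the built-in `TimeoutError`/`ConnectionError` (standing in for whatever HTTP-client exceptions the scraper actually catches) are assumptions.

```python
import time


def with_retry(call, attempts=5, base_delay=1.0,
               retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Run call(); on a retryable error wait base_delay * 2**attempt and retry.

    The last failure is re-raised once all attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Demo: a call that fails twice, then succeeds. Delays are recorded
# instead of slept so the example runs instantly.
recorded = []
state = {"calls": 0}


def flaky_get():
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


print(with_retry(flaky_get, sleep=recorded.append), recorded)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.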
Explains the full MA lobbying pipeline (scripts 7–9 + cluster): scraping strategy (iPad UA, ASP.NET viewstate, incremental disc_url cache), modern vs. legacy HTML formats, legislature API endpoint quirks, differential cosine embedding scoring, and k-means clustering. Also covers credentials, CI pipeline order, and manual-only scripts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove general repo overview, CI pipeline table, other scripts section, and SODA credential reference — lobbying-only content remains. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix MA_lobbying_viz.py DB path (was looking in analysis/ not get_data/)
- Add MA_Lobbying_Bills_Scored and MA_Bill_Cluster_Labels to semantic context with correct join examples (is_environmental now in scored table, not legislature)
- Add HB/SB legacy chamber abbreviations to legislature bills fetcher
- Fix score_lobbying_bills.py to fill missing titles from Legislature API and skip empty-string texts rather than retrying against Gemini API
- Generate all 6 lobbying charts with 2009+2024 data
- Update DB and GCS with current data (2009+2024 lobbying, GC 185+192 bills)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…peline fixes
Data handling:
- Large lobbying CSVs (bills, employers, summary_links, scored, legislature)
moved out of git and into GCS; only 100-row _sample.csv files committed.
assemble_db.py now auto-generates samples after each DB rebuild.
- .gitignore updated to exclude all five large lobbying CSVs going forward.
Parser fixes (get_MA_lobbying.py):
- Detect 5-col vs 6-col legacy disclosure table layouts by checking for
"Lobbyist name" in the second header cell (affects 2010-2013 entity filers).
- Broaden modern activity table regex from grdvActivitiesNew\d{4}_\d+ to
grdvActivitiesNew(\d{4})?_\d+ to match 2014-2018 year-less table IDs.
Fix get_MA_legislature_bills.py to skip non-numeric bill_number values
instead of crashing on int() conversion.
Analysis:
- MA_lobbying_viz.py: all aggregations switched to client_name (paying client),
not entity_name (lobbying firm); five new post charts added (env spend by
cluster, top env clients, spend vs DEP budget/staff, CSO operators);
_write_facts() counts unique clients and firms.
- All 12 lobbying chart HTMLs regenerated with partial 2005-2025 scrape data.
- DRAFT analysis post: docs/_posts/2026-05-22-ma-environmental-lobbying.md
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
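The effect of broadening the activity-table regex in the parser fixes above can be checked directly. The two patterns are taken from the commit message; the sample table IDs are made up for illustration.

```python
import re

OLD_PATTERN = re.compile(r"grdvActivitiesNew\d{4}_\d+")
NEW_PATTERN = re.compile(r"grdvActivitiesNew(\d{4})?_\d+")

# Hypothetical table IDs: one with a year, one year-less (2014-2018 style).
table_ids = ["grdvActivitiesNew2021_0", "grdvActivitiesNew_3"]

for table_id in table_ids:
    old_hit = OLD_PATTERN.fullmatch(table_id) is not None
    new_hit = NEW_PATTERN.fullmatch(table_id) is not None
    print(f"{table_id}: old={old_hit} new={new_hit}")
```

Making the year group optional keeps every ID the old pattern accepted while also admitting the year-less form.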
            continue
        out = f'../docs/data/{fname}_sample.csv'
        df.head(100).to_csv(out, index=has_index)
        print(f'Wrote sample: {out}')
…ssemble_db
- MA_lobbying.md data tables already used _sample Jekyll data sources; updated download links to reference GCS paths (with pending-upload note until assemble_db.py runs and uploads them).
- assemble_db.py now uploads MA_lobbying_bills, MA_lobbying_employers, MA_lobbying_bills_scored, and MA_legislature_bills CSVs to GCS after each DB rebuild, so the public download links stay current automatically.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ata page Raw portal names are preserved unchanged in GCS CSVs. assemble_db.py adds entity_name_norm and client_name_norm columns to MA_Lobbying_Employers and MA_Lobbying_Bills when loading into the DB, using _normalize_entity() which strips LLC/LLP/Inc, "Law Office of", "& Associates", etc. and applies replacements (& → AND). This makes downstream joins robust to typographical variation across filers without modifying the source data. MA_lobbying.md: use Liquid `limit:10` on all three sample table loops. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… to 0.08, add --rescore flag
Spot-checked false positives: liquor licenses, medical debt, digital advertising, LGBTQ bills, hate crime, dental loss ratio all scoring above 0.05 threshold. Root cause: non-env example set lacked coverage for healthcare, criminal justice, housing, education, municipal licensing, digital/media.
Changes:
- NON_ENV_EXAMPLE_BILLS expanded from 20 → 42 with targeted coverage of 8 problem domains
- ENV_THRESHOLD raised 0.05 → 0.08 (drops 1280 → ~333 env bills on existing embeddings)
- --rescore flag: re-applies scoring to all existing embeddings using new examples without re-embedding
- Graceful fallback if legislature CSV is unreadable (e.g. mid-write by concurrent fetch)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…env examples 0.08 was too restrictive (113 bills, 0.4%). 0.06 gives 483 bills (1.9%) and the marginal bills at that threshold are genuine environmental legislation (invasive plants, offshore wind, pollinators, wetlands, recycling). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
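To make the thresholding concrete, here is one plausible reading of "differential cosine scoring": mean similarity to environmental example bills minus mean similarity to non-environmental ones, compared against the 0.06 threshold. The exact formula in score_lobbying_bills.py is not shown in this PR, so the scoring function, its name, and the toy 3-dimensional vectors are assumptions.

```python
import math

ENV_THRESHOLD = 0.06  # value settled on in the commit above


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mean_similarity(vec, examples):
    return sum(cosine(vec, ex) for ex in examples) / len(examples)


def env_relevance_score(vec, env_examples, non_env_examples):
    """Assumed differential score: env similarity minus non-env similarity."""
    return mean_similarity(vec, env_examples) - mean_similarity(vec, non_env_examples)


def rescore(embeddings, env_examples, non_env_examples, threshold=ENV_THRESHOLD):
    """Re-apply scoring to stored embeddings without re-embedding anything."""
    return {
        bill_id: env_relevance_score(vec, env_examples, non_env_examples) >= threshold
        for bill_id, vec in embeddings.items()
    }


# Toy vectors standing in for real embedding output.
env_examples = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
non_env_examples = [[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]]
stored = {"wetlands bill": [0.95, 0.05, 0.0], "liquor license bill": [0.05, 0.95, 0.0]}

print(rescore(stored, env_examples, non_env_examples))
```

Because the score depends only on stored vectors and the example sets, changing the examples or the threshold needs no new embedding calls, which is the point of a `--rescore` style flag.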
…buffer overflow Status fields containing embedded newlines or long text caused pandas C engine to fail with "Buffer overflow caught". QUOTE_NONNUMERIC wraps all string fields in quotes, keeping each row on one line regardless of field content. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- 228k bill rows, 12k employer rows across all GCs (183-193, 2005-2025)
- 483 env bills at threshold=0.06 with expanded non-env example set
- New cluster labels from re-clustering 25,928 unique bills
- 31,658 legislature bill metadata rows (all GCs now covered)
- Updated sample CSVs, semantic context, facts_lobbying.yml
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The github.token lacked actions:write, causing HTTP 403 on 'gh workflow run update-charts.yml --ref main' at end of data update job. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tion charts
Bug fixes:
- _annual_env_spend: switch from 'all compensation for any env-bill client' to proportional allocation (compensation × env_bills/total_bills per firm/client/year). Removes inflation from clients who lobbied 1 env bill among hundreds of others. lobbying_total_spend_latest drops from $6.6M → $1.8M. Now consistent with the top-env-employers chart which already used this methodology.
- Remove unused MA_Lobbying_Lobbyists table load from _load_data (never referenced downstream).
New charts (generate_post_charts):
- lobbying_env_positions: stacked bar of unique clients by Support/Oppose/Neutral position on env bills per year. Shows 2014 opposition spike (60 opposing clients) and recent shift to more neutral/support engagement.
- lobbying_env_opponents: top 20 clients by unique env bills opposed across all years. Chemical/waste/real-estate industry coalition visible; Mass Audubon appears as they oppose anti-env bills the scorer also flags.
- lobbying_pass_by_position: env bill pass rate by dominant lobbying position (mostly-supported 5%, mostly-opposed 8%, contested 0%).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
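The proportional-allocation rule (compensation × env_bills / total_bills per firm/client/year) is simple enough to show with made-up numbers. The function names and the sample rows are hypothetical; only the formula comes from the commit message.

```python
def allocated_env_spend(compensation, env_bills, total_bills):
    """Share of a firm/client/year's compensation attributed to env bills."""
    if total_bills == 0:
        return 0.0
    return compensation * env_bills / total_bills


def annual_env_spend(rows):
    """Sum allocated env spend per year over (firm, client, year) rows."""
    totals = {}
    for row in rows:
        share = allocated_env_spend(row["compensation"], row["env_bills"], row["total_bills"])
        totals[row["year"]] = totals.get(row["year"], 0.0) + share
    return totals


# Hypothetical rows: one client with a single env bill among 200 lobbied bills,
# another whose lobbying is entirely environmental.
rows = [
    {"year": 2024, "compensation": 100_000.0, "env_bills": 1, "total_bills": 200},
    {"year": 2024, "compensation": 20_000.0, "env_bills": 4, "total_bills": 4},
]
print(annual_env_spend(rows))
```

Under the old "all compensation for any env-bill client" rule these two rows would have contributed 120,000; proportional allocation attributes only 500 of the first client's spend.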
…d update lobbying analysis
GC183 started January 2003, not 2005. The off-by-one in _year_to_general_court caused every Legislature API bill lookup to be sent to the wrong session, resulting in ~98% title mismatches between SoS portal bills and fetched body text. Confirmed: H3111 GC192 (wrong) = open meeting law; H3111 GC193 (correct) = DCR skating rinks. Systematic check of 60 bills: 2%→65% title match after correction. Residual 35% = SoS title-number prefix formatting differences + rare wrong numbers in lobbyist filings. See get_data/NOTES_bill_embeddings.md for full analysis.
On-disk MA_lobbying_bills.csv migrated (+1 to all general_court values). MA_legislature_bills.csv backed up and cleared; pipeline will re-fetch with corrected GCs → score_lobbying_bills.py --reembed to follow.
Also includes:
- MA lobbying analysis scripts (viz + t-SNE subsample approach)
- New lobbying charts (env score vs clients scatter, env positions, opponents, etc.)
- get_data/NOTES_bill_embeddings.md: clustering evaluation, mismatch diagnosis
- get_data/test_concat_embeddings.py: title-only vs concat embedding comparison
- CLAUDE.md: document GC bug, correct GC183=2003-2004 note
- docs/data/MA_lobbying.md: fix GC184=2005-2006 reference
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
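The year-to-session mapping implied by this fix (two-year sessions, GC183 = 2003-2004) can be written down in a few lines. This is a sketch, not the body of `_year_to_general_court`; the "buggy" variant is a guess at the shape of the original error, anchoring GC183 at 2005 as the commit describes.

```python
def year_to_general_court(year: int) -> int:
    """Map a calendar year to its General Court, with GC183 = 2003-2004."""
    return 183 + (year - 2003) // 2


def year_to_general_court_buggy(year: int) -> int:
    """Assumed form of the old bug: GC183 anchored at 2005 instead of 2003."""
    return 183 + (year - 2005) // 2


for year in (2005, 2006, 2023, 2024, 2025):
    print(year, year_to_general_court(year), year_to_general_court_buggy(year))
```

The buggy variant is always exactly one session behind, which is consistent with the "+1 to all general_court values" migration applied to the on-disk CSV.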
cluster_lobbying_bills.py now saves the fitted KMeans model and training mean vector to GCS (MA_bill_kmeans.joblib + MA_bill_emb_mean.npy) on every full re-cluster run. New --incremental mode loads the saved model and assigns cluster labels to any bill with cluster_id == -1 using nearest-centroid lookup — no re-fitting, no Gemini API call, runs in seconds. Added to update-data.yml CI after score_lobbying_bills.py so new bills are labelled automatically each week. Also adds mean-centering to full clustering (training mean subtracted before L2-norm) consistent with the sweep that showed silhouette improvement. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
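A rough picture of the --incremental path: subtract the saved training mean, L2-normalize, then pick the nearest saved centroid for every bill still marked cluster_id == -1. The sketch below uses plain Python lists in place of the joblib model and .npy mean; the function names, record layout, and toy 2-dimensional vectors are assumptions.

```python
import math


def preprocess(vec, train_mean):
    """Mean-center with the training mean, then scale to unit L2 norm."""
    centered = [x - m for x, m in zip(vec, train_mean)]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered] if norm else centered


def nearest_centroid(vec, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))


def assign_unlabelled(bills, train_mean, centroids):
    """Label only bills with cluster_id == -1; existing labels are untouched."""
    for bill in bills:
        if bill["cluster_id"] == -1:
            bill["cluster_id"] = nearest_centroid(
                preprocess(bill["embedding"], train_mean), centroids
            )
    return bills


# Toy stand-ins for the saved mean vector and fitted centroids.
train_mean = [0.5, 0.5]
centroids = [[1.0, 0.0], [0.0, 1.0]]
bills = [
    {"id": "A", "embedding": [0.9, 0.5], "cluster_id": -1},
    {"id": "B", "embedding": [0.4, 0.9], "cluster_id": -1},
    {"id": "C", "embedding": [0.0, 0.0], "cluster_id": 7},
]
print([b["cluster_id"] for b in assign_unlabelled(bills, train_mean, centroids)])
```

Reusing the training mean, rather than recomputing it on the new bills, keeps new points in the same coordinate frame the centroids were fitted in.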
…eline outputs
Buffer overflow in pandas C parser when reading CSVs with long bill title fields (cluster labels example_titles, scored CSV bill_title).
Fix:
- score_lobbying_bills.py: write scored CSV with QUOTE_NONNUMERIC, cast is_environmental to int before writing
- cluster_lobbying_bills.py: write cluster labels with QUOTE_NONNUMERIC
- assemble_db.py: read scored CSV and cluster labels with engine='python'
Also gitignore model artifacts (MA_bill_kmeans.joblib, MA_bill_emb_mean.npy) and the wrong-GC backup file.
Updated outputs: cluster labels (25 clusters, 654 env bills with corrected GC body text), sample CSVs, semantic context.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
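What QUOTE_NONNUMERIC does to a title containing commas and a newline can be seen with the standard-library csv module, used here as a stand-in for the pandas `to_csv`/`read_csv` calls in the pipeline. The bill row is invented for illustration.

```python
import csv
import io

rows = [
    ["bill_id", "bill_title", "is_environmental"],
    # Hypothetical row: the title holds both a comma and an embedded newline.
    ["H1234", "An Act relative to wetlands,\nrivers and streams", 1],
]

buf = io.StringIO()
# Every non-numeric field is wrapped in quotes; the int flag is left bare.
csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC).writerows(rows)
print(buf.getvalue())

buf.seek(0)
# Reading with the same quoting mode turns unquoted fields back into floats.
parsed = list(csv.reader(buf, quoting=csv.QUOTE_NONNUMERIC))
print(parsed[1])
```

Casting `is_environmental` to int before writing, as the commit does, matters with this mode: a boolean or string flag would otherwise be written quoted and read back as text.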
177 KB + 3 KB — small enough to version alongside the code. Simplifies CI: no GCS dependency for --incremental cluster assignment, and the model is versioned with the embeddings that produced it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add summarize_lobbying_bills.py: Gemini 2.5 Flash summary + MAPLE taxonomy tags/categories + is_env_llm per bill, with prompt caching (~46% cost saving, $2.76 projected for 26k bills). 495-bill pilot run complete.
- Add diagnostics_summarize.py: 6-section diagnostic suite (reference set recall/specificity, LLM vs embedding disagreement, tag quality, cost breakdown, silhouette comparison, UMAP). Results appended to NOTES_bill_embeddings.md.
- Add cluster_pilot_summaries.py: k-means on summary embeddings with Gemini cluster labelling + UMAP recolour. k=20 silhouette=0.060 (best so far).
- Update MA_lobbying_tsne.py: t-SNE → UMAP (n_neighbors=30, cosine). Fix labels CSV read to handle malformed rows (engine=python + numeric coerce filter).
- Fix MA_lobbying_viz.py: cluster_id dtype coercion (str→Int64) on merge.
- Fix assemble_db.py: clamp out-of-range general_court values before Int64 cast (311 malformed rows had embedding scores in that column).
- Fix cluster_lobbying_bills.py: drop example_titles before CSV write to prevent unquoted-comma buffer overflow in downstream readers.
- Add embedding_diagnostics.png: 4-panel diagnostic plot (score distribution, borderline zoom, method comparison, cluster env density).
- Update NOTES_bill_embeddings.md: pilot diagnostic results, silhouette tables, UMAP links, cost actuals ($0.106/1k bills, 80.8% cached tokens).
- Regenerate all lobbying charts and UMAP with post-GC-fix full corpus data.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… estimates
- summarize_lobbying_bills.py: parallelize LLM + embed with ThreadPoolExecutor (8 workers default), exponential backoff on all API calls, inline summary_embedding stored per bill, thinking token tracking + cost reporting, embed cost (gemini-embedding-2 $0.20/1M) added to running total
- cluster_pilot_summaries.py: parallel embed with ThreadPoolExecutor, smart parquet fallback (use stored summary_embedding, re-embed only gaps), duplicate-append guard in append_to_notes
- NOTES_bill_embeddings.md: corrected full-corpus cost estimate (~$6.93 all-in vs earlier $2.76 LLM-only), pricing table for all models/operations, note on thinking token tracking
- UMAP: regenerated with 3,497-bill summary-embedding corpus (k=20, sil=0.064)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… from GCP billing) Output tokens are the dominant cost at $2.50/1M, not $0.30/1M. Also corrects input prices to GA tier ($0.30/$0.075 uncached/cached). Updates NOTES_bill_embeddings.md with verified billing breakdown and revised full-corpus cost estimate (~$16-17 actual vs ~$6.93 prior estimate). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
If the save block crashes inside the ThreadPoolExecutor loop, Python's executor.shutdown(wait=True) runs all remaining futures to completion — spending API budget while saving nothing. Wrap the checkpoint save in try/except so errors log a warning and continue rather than propagating. Also fixes the immediate cause: df.loc with a list value for summary_embedding was 2D (used df.at instead). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
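The failure mode and its fix can be sketched with a small ThreadPoolExecutor loop in which the checkpoint save is wrapped in try/except. Function names, the checkpoint interval, and the deliberately failing save are illustrative assumptions, not the code in summarize_lobbying_bills.py.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_all(items, work, save_checkpoint, max_workers=8, every=2):
    """Process items in parallel, checkpointing every few completions.

    A failing checkpoint is logged and skipped, so results keep accumulating
    instead of the exception escaping the loop while queued futures still run.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(work, item): item for item in items}
        for n, future in enumerate(as_completed(futures), start=1):
            results[futures[future]] = future.result()
            if n % every == 0:
                try:
                    save_checkpoint(dict(results))
                except Exception as exc:  # broad on purpose: never lose the run
                    logging.warning("checkpoint save failed, continuing: %s", exc)
    return results


def broken_save(snapshot):
    """Stand-in for a save step that crashes (e.g. a bad DataFrame assignment)."""
    raise ValueError("simulated save failure")


print(run_all(range(5), lambda x: x * x, broken_save))
```

The trade-off is that a persistently broken save goes unnoticed unless the warnings are watched, so a final save outside the loop is still worth having.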
Each completed bill now logs a 'SUMMARY: <text>' line to stdout so that if a run crashes before saving, recover_from_log.py can parse the log and restore all fields (summary, categories, tags, is_env_llm) to the parquet without re-running Gemini. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- cluster_pilot_summaries.py: k=20 UMAP on full 25,915-bill summary embeddings
- cluster_lobbying_bills.py: re-fit k=25 clusters on original embeddings (full corpus)
- assemble_db.py: rebuilt AMEND.db with updated MA_Lobbying_Bills_Scored, MA_Bill_Cluster_Labels; regenerated semantic context
- MA_lobbying_viz.py + MA_lobbying_tsne.py: all charts regenerated
Also fixes:
- cluster_lobbying_bills.py: engine='python' for CSV with long text fields; NaN-safe general_court/bill_number key building for both full and incremental paths
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New charts in MA_lobbying_viz.py:
- lobbying_gc_trend: env bills + clients per General Court (GC186-194)
- lobbying_env_categories_by_gc: stacked bar of env bill LLM categories by GC
- lobbying_employer_env_scatter: Plotly scatter, total spend vs env fraction
- lobbying_opposition_pairs: top 15 employer pairs most often on opposite sides
- lobbying_top_env_tags: top 15 LLM tags on env bills
Post (2026-05-22-ma-environmental-lobbying.md):
- Added GC trend section with narrative (134 → 493 bills, 77 → 624 clients)
- Added category stacked bar section
- Added employer scatter section with interpretation
- Added top tags section
- Added opposition pairs section with AIM/ELM/National Grid narrative
- Updated env scoring description to cover both LLM + embedding approaches
- Updated caveats to mention GC off-by-one bug and LLM classification limits
- Updated reproducibility section
Also adds docs/data/MA_lobbying_static_site_proposal.md: detailed prompt for building a standalone static browsable site for the lobbying dataset.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Covers all ~26k lobbied bills, not just environmental ones
- env classification is one filter, not a scope constraint
- Drops per-entity static page generation entirely
- bills.html?id=H1234&gc=194 / employers.html?name=slug pattern routes detail views purely client-side via URLSearchParams
- Added edges.json lazy-load strategy (fetched once, cached in module promise)
- Updated data size estimates and file structure accordingly
- Removed build_pages.py and Jinja2 dependency (no build step)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previous semantic context had wrong column names and join patterns throughout the lobbying tables, which would cause the AI Analysis LLM to generate broken SQL:
- employer_name → entity_name (firm) / client_name (employer) everywhere. CRITICAL: entity_name = lobbying firm, client_name = paying employer; all employer-level analysis must use client_name
- total_expenditure → compensation (MA_Lobbying_Employers column name)
- subject_tags → removed (column does not exist)
- MA_Lobbying_Lobbyists → removed (table does not exist)
- cluster range 0–14 → 0–24 (25 clusters, not 15)
- Join key MA_Lobbying_Bills ↔ MA_Lobbying_Employers is now correctly documented as (entity_name, client_name, year): three columns, not just (employer_name, year)
Updated JOIN_RELATIONSHIPS examples:
- env spend by year: now uses proportional allocation pattern with correct cols
- top clients: uses client_name + compensation, warns NOT entity_name
- topic cluster: uses client_name not employer_name
- lobbying vs enforcement: uses compensation not total_expenditure
- new: support/oppose positions example using position column
- bill passage tier: uses client_name, client_count
Regenerated docs/assets/db_semantic_context.txt.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…semble_db
Issue 1 — MA_Lobbying_Bills (2,868 extra rows removed):
Null-bill rows ('no specific bills' disclosures) were scraped multiple times
from the SoS portal across filing periods, producing exact duplicates.
Fix: sort by amount desc, drop_duplicates on
(entity_name, client_name, year, general_court, bill_number, position),
keeping the row with the highest amount so no spend is lost.
Result: 228,046 → 225,178 rows.
Issue 2 — MA_Lobbying_Bills_Scored (1,457 extra rows removed):
Multiple filers referenced the same bill_number with slightly different
bill_title strings, so the scoring pipeline produced two scored rows for
the same (bill_number, general_court). GC189 was the dominant case (2,730
of 2,913 dup rows), not GC194 as the external report suggested.
Fix: sort by env_relevance_score desc (then by bill_id presence),
drop_duplicates on (bill_number, general_court), keeping the highest-
scoring row so env-relevant bills are never suppressed by a lower-scoring
duplicate title variant.
Result: 25,932 → 24,475 rows.
Both deduplication steps run at DB assembly time; source CSVs are unchanged.
Rebuilt and uploaded AMEND.db to gs://openamend-data/amend.db.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
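The "sort by amount descending, then drop duplicates on the key" step from Issue 1 can be sketched without pandas. The key columns are the ones listed in the commit; the helper name and sample rows are hypothetical, and amounts are assumed to be numeric.

```python
KEY = ("entity_name", "client_name", "year", "general_court", "bill_number", "position")


def dedupe_keep_highest(rows, key=KEY, amount_field="amount"):
    """Keep one row per key tuple, preferring the row with the largest amount."""
    best = {}
    for row in sorted(rows, key=lambda r: r[amount_field], reverse=True):
        # First row seen for a key is the highest-amount one; later ones are dropped.
        best.setdefault(tuple(row[k] for k in key), row)
    return list(best.values())


# Hypothetical 'no specific bills' rows scraped twice with different amounts.
base = {"entity_name": "Firm A", "year": 2024, "general_court": 193,
        "bill_number": None, "position": None}
rows = [
    {**base, "client_name": "Client X", "amount": 100.0},
    {**base, "client_name": "Client X", "amount": 250.0},
    {**base, "client_name": "Client Y", "amount": 50.0},
]
for row in dedupe_keep_highest(rows):
    print(row["client_name"], row["amount"])
```

Sorting before dropping is what guarantees that the surviving row carries the highest amount rather than whichever copy happened to be scraped first.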
459 rows had bill_title values that were concatenated multi-bill text blobs (up to 24,355 chars) scraped from the SoS portal's combined display pages. Fix: after deduplication, join scored bills to MA_Legislature_Bills on (bill_id, general_court) and replace any bill_title > 300 chars with the authoritative title from the Legislature API wherever the join succeeds. Result: 453 of 459 long titles replaced (max len: 24,355 → 740 chars) Remaining 52 rows > 300 chars are legitimately long procedural titles (Governor messages, conference committee texts, budget amendments) — not scraper artifacts. Rebuilt and uploaded AMEND.db to gs://openamend-data/amend.db. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…xplorer
- assemble_db.py: deduplicate MA_Lobbying_Bills_Scored on (bill_id, general_court) instead of (bill_number, general_court); H and S bills share the same integer bill_number within a session and must not be merged
- assemble_db.py: derive bill_prefix and bill_id columns on MA_Lobbying_Bills from chamber field (e.g. 'House Bill' → 'H', bill_id = 'H1234'); 95.4% of rows get a non-NULL bill_id
- MA_lobbying_viz.py: add _bill_merge() helper that joins on (bill_id, gc) when available, falling back to (bill_number, gc) for legacy rows without bill_id; update all 10+ join sites to use it; fixes 68,437 cross-prefix contaminated rows in env bill counts, spend allocation, and opposition pairs chart
- generate_semantic_context.py / db_semantic_context.txt: document new columns, update all JOIN_RELATIONSHIPS SQL examples to use (bill_id, general_court)
- Regenerate all 20 lobbying charts against corrected data
- Blog post: add MA Lobbying Explorer link; link named employers (AIM, ELM, National Grid, Eversource, Orsted, NextEra, Bloom Energy, CLF, NECE, Vote Solar, BCC Solar, MMA) to explorer employer pages
- Data page: add explorer browse links at top; link bill_id, client_name, and entity_name in sample tables to explorer bills/employers/lobbyists pages
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace simple token-stripping with structured normalization pipeline:
1. Strip d/b/a suffix first (prevents trade name bleeding into canonical form)
2. Hyphen -> space (LAN-TEL == LAN TEL)
3. Punctuation -> space (not '') so adjacent tokens don't concatenate
4. Whole-word regex removal of legal entity type words (LLC, LLP, INC,
INCORPORATED, CORPORATION, CORP, LTD, LIMITED, PC, PLLC)
5. Whole-word 'THE' removal (any position, not just leading prefix)
6. & -> AND
7. ASSICIATES typo fix
Merges ~180 additional redundant entity groups (e.g. Partners In Democracy /
Partners in Democracy Inc, LAN-TEL Communications / Lan-Tel Communications Inc).
Reduces distinct normalized entity count from ~6,512 to ~4,790.
Regenerate AMEND.db, all lobbying charts, and GCS samples.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
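The normalization steps listed above can be sketched as a single function. This is not the body of `_normalize_entity()`: upper-casing the input, the d/b/a regex, and handling `&` before the punctuation pass (so it is not stripped first) are assumptions made to get a runnable example.

```python
import re

LEGAL_WORDS = ("LLC", "LLP", "INC", "INCORPORATED", "CORPORATION",
               "CORP", "LTD", "LIMITED", "PC", "PLLC")
_LEGAL_RE = re.compile(r"\b(?:%s)\b" % "|".join(LEGAL_WORDS))


def normalize_entity(name: str) -> str:
    s = name.upper()                            # assumed: compare case-insensitively
    s = re.sub(r"\s+D/?B/?A\b.*$", "", s)       # 1. strip d/b/a suffix first
    s = s.replace("&", " AND ")                 # 6. & -> AND (early, before step 3)
    s = s.replace("-", " ")                     # 2. hyphen -> space
    s = re.sub(r"[^\w\s]", " ", s)              # 3. punctuation -> space, not ''
    s = _LEGAL_RE.sub(" ", s)                   # 4. whole-word legal entity types
    s = re.sub(r"\bTHE\b", " ", s)              # 5. whole-word THE, any position
    s = s.replace("ASSICIATES", "ASSOCIATES")   # 7. known typo
    return " ".join(s.split())                  # collapse runs of whitespace


for raw in ("Partners In Democracy", "Partners in Democracy Inc",
            "LAN-TEL Communications", "Lan-Tel Communications Inc"):
    print(f"{raw!r} -> {normalize_entity(raw)!r}")
```

Replacing punctuation with a space rather than deleting it is what keeps "Inc." from fusing onto the preceding word, so the whole-word removal in step 4 can still find it.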
GITHUB_TOKEN cannot trigger workflow_dispatch events on other workflows (GitHub security restriction, regardless of actions:write permission).
New structure (no cross-dispatch, each workflow is fully self-contained):
- update-weekly.yml: scheduled Monday 06:00 UTC; runs full data fetch + chart generation in sequence; commits data/ + charts/ in one PR
- update-data.yml: manual only; data fetch + DB assembly; commits data/
- update-charts.yml: manual only; downloads DB from GCS, regenerates dashboard charts; commits charts/
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…-ci.txt scikit-learn==1.8.0 requires joblib>=1.3.0; the old joblib==1.2.0 pin caused a ResolutionImpossible error in CI. Also consolidates gcsfs/pyarrow (needed by the lobbying clustering pipeline parquet I/O) into the correct section with proper comment. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…h steps
score_lobbying_bills.py was deduplicating lobby bills on (bill_number, gc), causing every S-prefix bill to be silently skipped when its H-prefix twin was already embedded (and vice versa). H1234 and S1234 are independent bills with independent content, so they must be embedded separately.
Fix: derive bill_id from the chamber column (House Bill → H, Senate Bill → S, etc.) BEFORE deduplication and key on (bill_id, gc) for H/S/HD/SD bills. Bills with unmapped chamber types (Joint, Executive, etc.) retain the old (bill_number, gc) dedup since they have no H/S collision risk. Legislature API bill_id_map is no longer used to assign bill_id for H/S bills (that map is keyed on bill_number and would silently overwrite the correct id with the twin).
Ran local backfill: 7,057 previously missing bills now embedded. Parquet: 26,102 → 33,159 rows. Environmental bills: 661 → 924.
CI workflow changes:
- Split monolithic 'Fetch data' step into 4 named steps with timeouts
- Legislature bills step: 90 min → 20 min (incremental runs fetch ~dozens of new bills per week, not thousands)
- Scoring step: 30 min (sufficient for typical weekly incremental volume)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
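The keying change can be illustrated with a small helper that derives the dedup key from the chamber column. Only the 'House Bill' → 'H' and 'Senate Bill' → 'S' mappings are confirmed by the commit messages; the helper name is hypothetical, and the chamber strings for HD/SD dockets are not shown in this PR, so they are left out.

```python
# Confirmed mappings only; HD/SD chamber strings are unknown here and omitted.
CHAMBER_PREFIX = {"House Bill": "H", "Senate Bill": "S"}


def bill_key(chamber, bill_number, general_court):
    """Dedup key: (bill_id, gc) when the chamber maps to a prefix,
    otherwise the legacy (bill_number, gc) pair."""
    prefix = CHAMBER_PREFIX.get(chamber)
    if prefix is None:
        return (bill_number, general_court)
    return (f"{prefix}{bill_number}", general_court)


house = bill_key("House Bill", 1234, 193)
senate = bill_key("Senate Bill", 1234, 193)
other = bill_key("Joint", 1234, 193)
print(house, senate, other)
```

Deriving the prefix before deduplication is the important ordering: once rows are collapsed on the bare number, the chamber of the discarded twin can no longer be recovered.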
- summarize_lobbying_bills.py: 7,211 bills processed, 0 failures, $4.62 total. Actual rate: $0.627/1k bills (output tokens at $2.50/1M dominate at 60%)
- Regenerated all lobbying charts with full tag/summary coverage
- Rebuilt AMEND.db + semantic context with updated scored bills
- Documented actual Gemini API costs in CLAUDE.md, score_lobbying_bills.py, summarize_lobbying_bills.py, and NOTES_bill_embeddings.md to prevent future underestimates (prior projections were off by ~6× due to wrong output token price: $0.30/1M used instead of $2.50/1M actual)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This PR adds a complete pipeline for collecting and analyzing Massachusetts lobbying disclosure data. The MA Secretary of State publishes semi-annual lobbying filings that document which organizations hired lobbyists, how much they paid, and which specific bills their lobbyists worked on. By pairing this with bill metadata from the MA Legislature API and semantic environmental relevance scoring via Google Gemini embeddings, we can identify which industries are spending most on environmental legislation, track whether heavily-lobbied bills are more or less likely to pass, and correlate lobbying intensity with trends in DEP enforcement and budget.
The data integrates naturally with existing AMEND analyses: lobbying spend can be plotted alongside DEP staffing and enforcement actions to surface relationships between industry influence and regulatory outcomes. All four new database tables are exposed in the AI Analysis tool's semantic context, enabling natural-language queries like "which companies spent the most lobbying against clean water bills" or "has lobbying spend on climate legislation increased since 2015." The pipeline is fully incremental — weekly CI runs exit early when no new semi-annual filings have been posted, so the added runtime cost is near-zero on most weeks.
Summary
- `get_MA_lobbying.py`: scrapes the MA SoS lobbying disclosure portal using an iPad User-Agent (bypasses Incapsula WAF without Selenium). Incremental via a `disc_url` set stored in `MA_lobbying_summary_links.csv`; weekly CI exits early when no new semi-annual filings are posted, so most runs touch only 2 search pages and exit.
- `get_MA_legislature_bills.py`: fetches bill metadata (title, sponsor, committee, status, `passed` bool) from the MA Legislature OpenAPI for every unique `(bill_number, general_court)` pair in the lobbying data. JSON responses cached under `MA_legislature_cache/` for incremental re-runs.
- `score_lobbying_bills.py`: scores each bill for environmental relevance using Gemini `gemini-embedding-2` cosine similarity against 20 seed phrases (threshold 0.60). Only unscored bills are embedded per run.
- `MA_lobbying_viz.py`: 4 weekly-updated charts: annual spend trend, top 15 employers, bill intensity + pass rate, lobbying spend vs. enforcement actions. Plus 2 analysis-post charts.
- Wires all scripts into `update-data.yml` CI, `assemble_db.py` (4 new tables: `MA_Lobbying_Employers`, `MA_Lobbying_Bills`, `MA_Lobbying_Lobbyists`, `MA_Legislature_Bills`), `validate_data.py` (lobbying tables in `OPTIONAL_DATASETS`, so CI doesn't fail before first fetch), `generate_semantic_context.py`, and `dashboard_charts.py`.
Test plan
- `--year 2024 --limit 10`: produced 18 disclosure rows, 56 employer rows, 539 bill rows with correct entity names, compensation amounts, bill numbers, titles, and positions
- `get_MA_legislature_bills.py` against live API after lobbying fetch completes
- `score_lobbying_bills.py` with real Google API key
- `dashboard_charts.py` to verify chart generation once DB is assembled
🤖 Generated with Claude Code