tooling(scripts): nested-path support — Git Tree helper + classifier consumers by hyperpolymath · Pull Request #204 · hyperpolymath/standards

hyperpolymath · 2026-05-26T12:51:11Z

Summary

Two-commit change adding nested-path support to the sweep-classifier pipeline:

scripts/sweep-classifiers/list-workflow-paths.sh — walks gh repo list and queries each repo's Git Tree API directly. Bypasses two compounding undercounts in gh api /search/code.
All 4 classify-*.sh scripts updated to consume the helper's TSV output and emit the sweep-target path as an explicit column.
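
For readers who have not opened the script, a per-repo tree walk can be sketched roughly as below. This is not the contents of list-workflow-paths.sh: the function names `list_tree`, `match_workflow`, and `walk_org` are hypothetical, and the sketch assumes an authenticated `gh` plus branch names without slashes. Large trees may come back with `truncated: true`, which this sketch does not handle.

```shell
# Hypothetical sketch of a per-repo Git Tree walk; not the PR's actual script.

# Print "<path>\t<blob-sha>" for every blob in one repo's tree.
list_tree() {  # $1 = owner/repo, $2 = branch name
  gh api "repos/$1/git/trees/$2?recursive=1" \
    --jq '.tree[] | select(.type == "blob") | [.path, .sha] | @tsv' < /dev/null
}

# Keep rows whose path is ".github/workflows/<file>" at any depth.
# Reads "<path>\t<sha>" on stdin and prints "<repo>\t<path>\t<sha>".
match_workflow() {  # $1 = owner/repo, $2 = workflow file name
  awk -F '\t' -v repo="$1" -v file="$2" '
    BEGIN { OFS = "\t"; suffix = ".github/workflows/" file; n = length(suffix) }
    {
      p = $1
      if (p == suffix || (length(p) > n && substr(p, length(p) - n) == "/" suffix))
        print repo, p, $2
    }'
}

# Walk every repo of an owner (needs network; not run by default).
walk_org() {  # $1 = org or user, $2 = workflow file name
  gh repo list "$1" --limit 1000 --json nameWithOwner,defaultBranchRef \
    --jq '.[] | [.nameWithOwner, .defaultBranchRef.name] | @tsv' |
  while IFS="$(printf '\t')" read -r repo branch; do
    [ -n "$branch" ] || continue   # skip empty repos with no default branch
    list_tree "$repo" "$branch" | match_workflow "$repo" "$2"
  done
}
```

A call such as `walk_org <org> scorecard.yml` would then print one row per matching file; the scope column described under "Helper output" is left out of this sketch.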

Why the helper exists — 3 layers of undercount

Layer 1 — path-prefix filter: path:.github/workflows matches the path PREFIX, excluding nested <subdir>/.github/workflows/<file>.yml paths outright.
Layer 2 — org-scope truncation: even broad filename:<file>.yml org:<org> queries hit internal caps. Validated against scorecard.yml: broad query saw 152 paths (all flagged top-level); per-repo enumeration found 626 additional nested copies the broad query missed entirely.
Layer 3 — nested workflows are inert: GitHub Actions only runs workflows from .github/workflows/ at the repo root. Nested copies are vendored templates or stale leftovers. Security campaigns gain nothing from sweeping nested copies; single-source-of-truth campaigns still benefit.

Helper output

TSV, one row per matching workflow file:

<repo>\t<path>\t<blob-sha>\t<top-level|nested>
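
As an illustration of the fourth column, the scope label could be derived from the path alone. The helper names `classify_scope` and `emit_row` are hypothetical, and the rule (anything not starting with `.github/workflows/` counts as nested) is an assumption about how the script decides, not a quote from it.

```shell
# Hypothetical sketch: derive the <top-level|nested> column from a path.
classify_scope() {  # $1 = path inside the repo
  case "$1" in
    .github/workflows/*) echo top-level ;;
    *)                   echo nested ;;
  esac
}

# Print one helper-style TSV row: repo, path, blob sha, scope.
emit_row() {  # $1 = owner/repo, $2 = path, $3 = blob sha
  printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$(classify_scope "$2")"
}
```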

Cost: one Git Tree API call per repo (~300 calls), drawn from the core rate-limit bucket (5000/hr) rather than the throttled code_search bucket (10/min).
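
To put the two buckets side by side, here is a back-of-envelope calculation using only the figures quoted above. It assumes, purely for comparison, that the same number of requests were sent to each bucket; a real search sweep would issue a different number of calls. The function names are hypothetical.

```shell
# Minutes needed to send N requests at a per-minute limit (rounded up).
minutes_at_rate() {  # $1 = requests, $2 = limit per minute
  echo $(( ($1 + $2 - 1) / $2 ))
}

# Share of an hourly budget consumed by N requests, as a percentage.
percent_of_budget() {  # $1 = requests, $2 = hourly budget
  awk -v n="$1" -v b="$2" 'BEGIN { printf "%.1f\n", 100 * n / b }'
}

minutes_at_rate 300 10       # prints 30
percent_of_budget 300 5000   # prints 6.0
```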

Classifier extensions

Each classify-*.sh now auto-detects input format from the first byte:

{ → JSONL from gh /search/code (legacy path)
otherwise → TSV from list-workflow-paths.sh (preferred — handles nested)
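
A minimal sketch of first-byte detection, assuming the input is a regular file. The name `detect_format` is hypothetical and is not claimed to match what `_lib.sh` defines.

```shell
# Hypothetical sketch: choose the input format from the first byte of a file.
detect_format() {  # $1 = input file
  first=$(head -c 1 "$1")
  case "$first" in
    '{') echo jsonl ;;   # legacy gh /search/code dump
    *)   echo tsv ;;     # list-workflow-paths.sh output (also the fallback for empty files)
  esac
}
```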

Output is unified to 7 columns: repo \t path \t sha \t class \t reason \t lines \t details. The new path column carries the file's location inside the repo, so sweeps can target nested copies as first-class wrapper sites.
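
With the path in column 2 and the class in column 4, downstream filtering is a one-liner. The sketch below lists nested sites of a given class; `nested_sites` is a hypothetical name, and rows with an empty path (legacy JSONL input) are skipped.

```shell
# Hypothetical sketch: list "<repo>\t<path>" for nested rows of one class.
# Reads the 7-column classifier TSV on stdin.
nested_sites() {  # $1 = class, e.g. NEEDS_REVIEW
  awk -F '\t' -v class="$1" '
    BEGIN { OFS = "\t" }
    $4 == class && $2 != "" && $2 !~ /^\.github\/workflows\// { print $1, $2 }'
}
```

Typical use would be `classify-mirror.sh input.tsv | nested_sites NEEDS_REVIEW`, assuming the classifier writes its rows to stdout.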

Shared normalize_input extracted into _lib.sh; each classifier sources it.

Validation

Smoke-tested both input paths:

TSV (helper): classify-mirror.sh on scorecard-tuples.tsv (287 repos × top-level + nested) — fetches blobs and emits per-(repo, path) rows.
JSONL (legacy): classify-mirror.sh on mirror-full.json — 267 TRIVIAL + 22 NEEDS_REVIEW, matching prior /tmp/drift-survey/sweep-report.md.
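
Class totals like those above can be re-derived from classifier output with a short tally. This sketch assumes the 7-column layout with the class in column 4; `tally_classes` is a hypothetical name.

```shell
# Hypothetical sketch: count rows per class in 7-column classifier output.
tally_classes() {  # reads classifier TSV on stdin
  awk -F '\t' '{ n[$4]++ } END { for (c in n) printf "%s\t%d\n", c, n[c] }' | sort
}
```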

Stacked on #194

scripts/sweep-classifiers/ only exists once #194 merges. The diff against main includes #194's files transitively; once #194 lands, this PR narrows to just the helper + extensions.

Standing follow-ups

Once this lands, re-survey each candidate with the helper for ground-truth wrapper-site counts before firing any sweep.

🤖 Generated with Claude Code

…-workflow campaign

Durable tooling for the wrapper-sweep work that follows each of the foundational reusable PRs (#187 mirror, #190 secret-scanner, #192 codeql, #193 hypatia-scan).

Each classifier:
- reads a paginated `gh api /search/code` JSON dump
- fetches each unique blob SHA exactly once (cached in $BLOBS_DIR)
- emits per-repo TSV: <repo>\t<sha>\t<class>\t<reason>\t<lines>\t<details>

Classes vary per template but follow the same shape: TRIVIAL (canonical match, mechanical wrapper) vs SLIM/MISSING/OLDER (propagation lag, auto-upgrades on first run after wrapper merge) vs NEEDS_REVIEW (custom workflow body, requires per-repo diff).

Numbers produced by these classifiers across the four campaign templates:
- mirror.yml — 267/289 TRIVIAL (92.4%); 22 NEEDS_REVIEW
- secret-scanner — 273/281 missing shell-secrets (97.2%); 1 TRIVIAL (standards itself)
- codeql — 246/263 mechanical (93.5%); 17 NEEDS_REVIEW
- hypatia-scan — 249/255 safe-to-standardize-up (97.6%); 6 NEEDS_REVIEW

README documents the path-filter caveat: `gh api /search/code` with `path:.github/workflows` excludes monorepo-nested workflow files; the broader `filename:` query (no path filter) catches them. For hypatia-scan, the broader query returns 704 vs the 255 path-filtered count — the ~449 nested copies also need wrappers when sweeps fire.
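
The "fetch each unique blob SHA exactly once" behaviour can be sketched as a small on-disk cache keyed by SHA. This is not the classifiers' code: `fetch_blob` and `cached_blob` are hypothetical names, the `/tmp/blobs` default is an assumption, and the fetch relies on the raw media type of the blobs endpoint.

```shell
# Hypothetical sketch of a blob cache keyed by SHA; not the classifiers' code.
BLOBS_DIR=${BLOBS_DIR:-/tmp/blobs}   # assumed default location

# Write the raw blob content to stdout (needs network and an authenticated gh).
fetch_blob() {  # $1 = owner/repo, $2 = blob sha
  gh api -H 'Accept: application/vnd.github.raw+json' "repos/$1/git/blobs/$2"
}

# Print the path of the cached copy, fetching only on a cache miss.
cached_blob() {  # $1 = owner/repo, $2 = blob sha
  mkdir -p "$BLOBS_DIR"
  f="$BLOBS_DIR/$2"
  if [ ! -f "$f" ]; then
    # Write to a temp file first so a failed fetch never leaves a partial entry.
    fetch_blob "$1" "$2" > "$f.tmp" && mv "$f.tmp" "$f" || { rm -f "$f.tmp"; return 1; }
  fi
  printf '%s\n' "$f"
}
```

Because identical workflow files share a blob SHA across repos, keying the cache by SHA alone means a file copied into hundreds of repos is downloaded once.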

…rch undercounts

The /search/code endpoint undercounts monorepo-nested workflow files in two compounding ways:

Layer 1 (path-prefix filter) — `path:.github/workflows` matches the path prefix only; nested .github paths are excluded outright.

Layer 2 (org-scope truncation) — even broad `filename:` queries hit internal result caps and silently drop entries. Validated against scorecard.yml: broad query saw 152 paths (all top-level flagged); the Git Tree API walk found 626 nested copies in the same scope.

list-workflow-paths.sh walks `gh repo list` and queries each repo's Git Tree API (recursive=1) directly, emitting one TSV row per matching file:

<repo>\t<path>\t<blob-sha>\t<top-level|nested>

Cost: one Git Tree API call per repo (~300 calls; uses core bucket, not the 10/min code_search bucket).

README updated to document all 3 undercount layers + the Layer-3 caveat that nested workflows are inert (Actions only runs the repo root .github/workflows/ directory).

For sweep planning, this script is now the source of truth.

Refs: #194 (parent), #199 (campaign meta-doc).

Original draft claimed scorecard.yml had zero nested copies, based on the broad gh /search/code query. After #204 added the list-workflow-paths.sh helper (which bypasses the Layer-2 truncation), per-repo enumeration revealed 626 nested copies the broad query missed entirely.

Scorecard is NOT the cleanest sweep target. It now has the second-highest nested count of the 5 candidates (behind hypatia-scan). Total wrapper sites: 884 (258 top-level + 626 nested).

Layer-3 caveat still applies — nested copies are inert; disposition is single-source-of-truth cleanup, not security gain.

This correction is itself an example of the methodology lesson the doc was trying to make: every count derived from /search/code is suspect until validated against the Git Tree API.

Each classify-*.sh now auto-detects input format and emits the sweep-target path as an explicit column.

Input formats (auto-detected from first byte):
- JSONL (legacy): {repo, path?, sha} — from gh /search/code paginate
- TSV (preferred): repo\tpath\tsha\tscope — from list-workflow-paths.sh

Output (all 4 classifiers, 7 columns):

repo \t path \t sha \t class \t reason \t lines \t details

The path column carries the file's location inside the repo, so sweeps can target nested copies as first-class wrapper sites (not just root-level workflows). When path is empty (legacy JSONL without .path field), it defaults to "".

Validated:
- classify-mirror.sh /tmp/drift-survey/mirror-full.json (JSONL): 267 TRIVIAL + 22 NEEDS_REVIEW, matches prior sweep-report.md.
- classify-mirror.sh /tmp/drift-survey/scorecard-tuples.tsv (TSV): fetches blobs and emits per-(repo, path) rows correctly.

Shared `normalize_input` extracted into _lib.sh; each classifier sources it. Reduces per-script duplication.

Refs: #199 (campaign), #204 (helper script).

Walked the Git Tree API for all 5 templates via list-workflow-paths.sh from #204.

Findings:
- Top-level path-filtered queries were 1-35% undercounted across all 5 templates (worst: hypatia-scan 255 -> 344, +89 / 35%).
- Nested-copy counts were 100%+ undercounted for mirror.yml (133 reported -> 335 true).
- hypatia-scan top-level has only 3 unique blob SHAs across 344 sites -> 0.9% drift on the executing surface (vs the 11.8% drift the PR body reports for top-level+nested).

Replaced the 'Corrected estate counts' section with three tables: helper-validated totals, top-level-only drift, and initial-survey undercount summary.

Added LOC retirement table: ~275k LOC top-level across the 5 reusables, ~732k including nested copies.

Updated Layer 2 documentation to note path-filtered queries are ALSO truncated (previously the doc only said broad queries were).

Updated Standing follow-ups: marked the per-(repo,path) classifier ingestion DONE (shipped in #204); removed the 'file scorecard' item (filed as #205); added quarterly re-run suggestion.
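
The undercount percentages can be checked with a little arithmetic. Judging from the hypatia-scan figures (+89 on 255 reported, quoted as 35%), the base appears to be the originally reported count; that reading is inferred from the numbers rather than stated in the message. `undercount_pct` is a hypothetical name.

```shell
# Hypothetical sketch: undercount relative to the originally reported count.
undercount_pct() {  # $1 = reported count, $2 = helper-validated count
  awk -v r="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 100 * (t - r) / r }'
}

undercount_pct 255 344   # hypatia-scan top-level: prints 34.9
undercount_pct 133 335   # mirror.yml nested: prints 151.9
```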

hyperpolymath added 2 commits May 26, 2026 12:18

hyperpolymath enabled auto-merge (squash) May 26, 2026 12:51

hyperpolymath mentioned this pull request May 26, 2026

feat(governance): add scorecard-reusable.yml — close 5-candidate convergence set #205 (Open, 2 tasks)

hyperpolymath changed the title ~~tooling(scripts): bypass /search/code undercount with per-repo Git Tree walk~~ tooling(scripts): nested-path support — Git Tree helper + classifier consumers May 26, 2026

hyperpolymath wants to merge 3 commits into main from feat/sweep-classifiers-nested-paths
