tooling(scripts): nested-path support — Git Tree helper + classifier consumers#204
Open
hyperpolymath wants to merge 3 commits into
…-workflow campaign

Durable tooling for the wrapper-sweep work that follows each of the foundational reusable PRs (#187 mirror, #190 secret-scanner, #192 codeql, #193 hypatia-scan).

Each classifier:
- reads a paginated `gh api /search/code` JSON dump
- fetches each unique blob SHA exactly once (cached in $BLOBS_DIR)
- emits per-repo TSV: <repo>\t<sha>\t<class>\t<reason>\t<lines>\t<details>

Classes vary per template but follow the same shape: TRIVIAL (canonical match, mechanical wrapper) vs SLIM/MISSING/OLDER (propagation lag, auto-upgrades on first run after wrapper merge) vs NEEDS_REVIEW (custom workflow body, requires per-repo diff).

Numbers produced by these classifiers across the four campaign templates:
- mirror.yml: 267/289 TRIVIAL (92.4%); 22 NEEDS_REVIEW
- secret-scanner: 273/281 missing shell-secrets (97.2%); 1 TRIVIAL (standards itself)
- codeql: 246/263 mechanical (93.5%); 17 NEEDS_REVIEW
- hypatia-scan: 249/255 safe-to-standardize-up (97.6%); 6 NEEDS_REVIEW

README documents the path-filter caveat: `gh api /search/code` with `path:.github/workflows` excludes monorepo-nested workflow files; the broader `filename:` query (no path filter) catches them. For hypatia-scan, the broader query returns 704 vs the 255 path-filtered count; the ~449 nested copies also need wrappers when sweeps fire.
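The fetch-once behaviour of the classifiers can be sketched roughly as below. This is a minimal sketch and not the PR's code: the function name `fetch_blob_once` is hypothetical, and it assumes blobs are retrieved through the Git Blobs endpoint (`repos/<repo>/git/blobs/<sha>`), which returns base64-encoded content, with `base64 -d` available for decoding.

```shell
# Hypothetical helper: fetch a blob at most once, keyed by SHA, into $BLOBS_DIR.
# Assumes `gh` is authenticated and `base64 -d` is available.
fetch_blob_once() {
  repo=$1
  sha=$2
  out="$BLOBS_DIR/$sha"

  # Cache hit: identical SHAs seen in other repos reuse the same file.
  [ -s "$out" ] && return 0

  mkdir -p "$BLOBS_DIR"

  # Bail out before touching the cache if the API call itself fails.
  content=$(gh api "repos/$repo/git/blobs/$sha" --jq '.content') || return 1

  # Decode into a temp file, then rename, so readers never see a partial blob.
  if printf '%s' "$content" | base64 -d > "$out.tmp"; then
    mv "$out.tmp" "$out"
  else
    rm -f "$out.tmp"
    return 1
  fi
}
```

Keying the cache on the blob SHA alone (rather than on repo plus SHA) is what lets hundreds of byte-identical workflow copies cost a single API call.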
…rch undercounts
The /search/code endpoint undercounts monorepo-nested workflow files
in two compounding ways:

Layer 1 (path-prefix filter): `path:.github/workflows` matches the
path prefix only; nested .github paths are excluded outright.

Layer 2 (org-scope truncation): even broad `filename:` queries hit
internal result caps and silently drop entries. Validated against
scorecard.yml: the broad query saw 152 paths (all flagged top-level);
the Git Tree API walk found 626 nested copies in the same scope.

list-workflow-paths.sh walks `gh repo list` and queries each repo's
Git Tree API (recursive=1) directly, emitting one TSV row per
matching file:
<repo>\t<path>\t<blob-sha>\t<top-level|nested>
Cost: one Git Tree API call per repo (~300 calls; uses core bucket,
not the 10/min code_search bucket).
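
The top-level vs nested labelling can be illustrated with a small filter over a tree listing. This is a sketch under stated assumptions, not the contents of list-workflow-paths.sh: the function name `classify_scope` and the `<path>\t<blob-sha>` input shape are hypothetical.

```shell
# Hypothetical filter: read "<path>\t<blob-sha>" lines on stdin, keep the
# ones naming the target workflow file, and label each row's scope.
# Usage: classify_scope <repo> <workflow-filename>
classify_scope() {
  awk -F '\t' -v OFS='\t' -v repo="$1" -v name="$2" '
    BEGIN { root = ".github/workflows/" name; n = length(root) }

    # Exact match at the repo root: the only copy Actions would execute.
    $1 == root { print repo, $1, $2, "top-level"; next }

    # Same suffix below some subdirectory: a nested copy.
    length($1) > n + 1 && substr($1, length($1) - n) == "/" root {
      print repo, $1, $2, "nested"
    }
  '
}
```

In practice the input might come from something like `gh api "repos/$repo/git/trees/HEAD?recursive=1" --jq '.tree[] | select(.type == "blob") | [.path, .sha] | @tsv'`. Whether `HEAD` is the right tree reference for every repo is an assumption here, and very large trees can come back flagged `truncated`, which a real helper would need to check.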
README updated to document all 3 undercount layers + the Layer-3
caveat that nested workflows are inert (Actions only runs the repo
root .github/workflows/ directory). For sweep planning, this script
is now the source of truth.
Refs: #194 (parent), #199 (campaign meta-doc).
2 tasks
hyperpolymath added a commit that referenced this pull request May 26, 2026
Original draft claimed scorecard.yml had zero nested copies, based on the broad gh /search/code query. After #204 added the list-workflow-paths.sh helper (which bypasses the Layer-2 truncation), per-repo enumeration revealed 626 nested copies the broad query missed entirely.

Scorecard is NOT the cleanest sweep target. It now has the second-highest nested count of the 5 candidates (behind hypatia-scan). Total wrapper sites: 884 (258 top-level + 626 nested).

Layer-3 caveat still applies: nested copies are inert; disposition is single-source-of-truth cleanup, not security gain.

This correction is itself an example of the methodology lesson the doc was trying to teach: every count derived from /search/code is suspect until validated against the Git Tree API.
Each classify-*.sh now auto-detects input format and emits the
sweep-target path as an explicit column.
Input formats (auto-detected from first byte):
- JSONL (legacy): {repo, path?, sha} — from gh /search/code paginate
- TSV (preferred): repo\tpath\tsha\tscope — from list-workflow-paths.sh
Output (all 4 classifiers, 7 columns):
repo \t path \t sha \t class \t reason \t lines \t details
The path column carries the file's location inside the repo, so
sweeps can target nested copies as first-class wrapper sites (not
just root-level workflows). When the input carries no path (legacy
JSONL without a .path field), the column defaults to "".
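
First-byte detection could look roughly like this. It is a sketch only, not the `normalize_input` shipped in _lib.sh: the names `detect_format` and `to_tuples` are hypothetical, treating every non-`{` first byte as TSV is an assumption, and the JSONL branch assumes `jq` is installed.

```shell
# Hypothetical sketch: decide the input format from the first byte of a file.
detect_format() {
  first=$(head -n 1 "$1" | cut -c 1)
  case $first in
    '{') echo jsonl ;;   # JSON object per line (legacy /search/code dump)
    '')  echo empty ;;   # nothing to classify
    *)   echo tsv ;;     # assumption: anything else is helper TSV
  esac
}

# Emit "repo<TAB>path<TAB>sha" rows regardless of the input format.
to_tuples() {
  case $(detect_format "$1") in
    # A missing .path becomes the empty string, matching the default above.
    jsonl) jq -r '[.repo, (.path // ""), .sha] | @tsv' "$1" ;;
    # Helper TSV already has repo, path, sha in columns 1-3; drop scope.
    tsv)   cut -f 1-3 "$1" ;;
  esac
}
```

Normalizing both formats to the same three-column tuple up front means the classification logic downstream never has to know which source produced a row.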
Validated:
- classify-mirror.sh /tmp/drift-survey/mirror-full.json (JSONL):
267 TRIVIAL + 22 NEEDS_REVIEW, matches prior sweep-report.md.
- classify-mirror.sh /tmp/drift-survey/scorecard-tuples.tsv (TSV):
fetches blobs and emits per-(repo, path) rows correctly.
Shared `normalize_input` extracted into _lib.sh; each classifier
sources it. Reduces per-script duplication.
Refs: #199 (campaign), #204 (helper script).
hyperpolymath added a commit that referenced this pull request May 26, 2026
Walked the Git Tree API for all 5 templates via list-workflow-paths.sh from #204.

Findings:
- Top-level path-filtered queries were 1-35% undercounted across all 5 templates (worst: hypatia-scan 255 -> 344, +89 / 35%).
- Nested-copy counts were 100%+ undercounted for mirror.yml (133 reported -> 335 true).
- hypatia-scan top-level has only 3 unique blob SHAs across 344 sites -> 0.9% drift on the executing surface (vs the 11.8% drift the PR body reports for top-level+nested).

Replaced the 'Corrected estate counts' section with three tables: helper-validated totals, top-level-only drift, and initial-survey undercount summary.

Added LOC retirement table: ~275k LOC top-level across the 5 reusables, ~732k including nested copies.

Updated Layer 2 documentation to note path-filtered queries are ALSO truncated (previously the doc only said broad queries were).

Updated Standing follow-ups: marked the per-(repo,path) classifier ingestion DONE (shipped in #204); removed the 'file scorecard' item (filed as #205); added quarterly re-run suggestion.
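
The 0.9% figure is consistent with dividing unique blob SHAs by sites (3 / 344). Assuming that is how drift is defined here, a per-scope summary over the helper's TSV could be computed as sketched below; the function name `drift_summary` is hypothetical.

```shell
# Hypothetical summary for one scope: prints sites, unique blob SHAs, and
# unique-SHAs-per-site as a percentage (the assumed drift definition).
# Input rows on stdin: repo<TAB>path<TAB>sha<TAB>scope
# Usage: drift_summary <top-level|nested>
drift_summary() {
  awk -F '\t' -v scope="$1" '
    $4 == scope {
      sites++
      if (!($3 in seen)) { seen[$3] = 1; uniq++ }
    }
    END {
      if (sites == 0) { print "0\t0\t0.0"; exit }
      printf "%d\t%d\t%.1f\n", sites, uniq, 100 * uniq / sites
    }
  '
}
```

Running it once with `top-level` and once with `nested` keeps the executing surface separate from the inert copies, which is the distinction the findings above rely on.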
Summary
Two-commit change adding nested-path support to the sweep-classifier pipeline:
- `scripts/sweep-classifiers/list-workflow-paths.sh`: walks `gh repo list` and queries each repo's Git Tree API directly. Bypasses two compounding undercounts in `gh api /search/code`.
- `classify-*.sh` scripts updated to consume the helper's TSV output and emit the sweep-target path as an explicit column.

Why the helper exists: 3 layers of undercount

1. `path:.github/workflows` matches the path PREFIX, excluding nested `<subdir>/.github/workflows/<file>.yml` paths outright.
2. `filename:<file>.yml org:<org>` queries hit internal caps. Validated against `scorecard.yml`: broad query saw 152 paths (all flagged top-level); per-repo enumeration found 626 additional nested copies the broad query missed entirely.
3. Actions only runs `.github/workflows/` at the repo root. Nested copies are vendored templates or stale leftovers. Security campaigns gain nothing from sweeping nested copies; single-source-of-truth campaigns still benefit.

Helper output

TSV, one row per matching workflow file: `<repo>\t<path>\t<blob-sha>\t<top-level|nested>`

Cost: one Git Tree API call per repo (~300 calls), uses the `core` bucket (5000/hr), not the throttled `code_search` bucket (10/min).

Classifier extensions

Each `classify-*.sh` now auto-detects input format from the first byte:
- `{` → JSONL from `gh /search/code` (legacy path)
- TSV from `list-workflow-paths.sh` (preferred; handles nested)

Output is unified to 7 columns: `repo \t path \t sha \t class \t reason \t lines \t details`. The new `path` column carries the file's location inside the repo, so sweeps can target nested copies as first-class wrapper sites.

Shared `normalize_input` extracted into `_lib.sh`; each classifier sources it.

Validation

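Smoke-tested both input paths:
- JSONL: `classify-mirror.sh` on `/tmp/drift-survey/mirror-full.json` gives 267 TRIVIAL + 22 NEEDS_REVIEW, matching `/tmp/drift-survey/sweep-report.md`.
- TSV: `classify-mirror.sh` on `/tmp/drift-survey/scorecard-tuples.tsv` fetches blobs and emits per-(repo, path) rows correctly.

Stacked on #194

`scripts/sweep-classifiers/` only exists once #194 merges. The diff against `main` includes #194's files transitively; once #194 lands, this PR narrows to just the helper + extensions.

Standing follow-ups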
🤖 Generated with Claude Code