tooling(scripts): nested-path support — Git Tree helper + classifier consumers #204

Open
hyperpolymath wants to merge 3 commits into main from feat/sweep-classifiers-nested-paths

hyperpolymath (Owner) commented May 26, 2026

Summary

Two-commit change adding nested-path support to the sweep-classifier pipeline:

  1. scripts/sweep-classifiers/list-workflow-paths.sh: walks the output of gh repo list and queries each repo's Git Tree API directly, bypassing the two compounding undercounts in gh api /search/code (Layers 1 and 2 below).
  2. All 4 classify-*.sh scripts updated to consume the helper's TSV output and emit the sweep-target path as an explicit column.

Why the helper exists — 3 layers of undercount

  1. Layer 1 (path-prefix filter): path:.github/workflows matches only paths that begin with .github/workflows, so nested <subdir>/.github/workflows/<file>.yml paths are excluded outright.
  2. Layer 2 (org-scope truncation): even broad filename:<file>.yml org:<org> queries hit internal result caps and silently drop entries. Validated against scorecard.yml: the broad query saw 152 paths (all flagged top-level); per-repo enumeration found 626 additional nested copies the broad query missed entirely.
  3. Layer 3 (nested workflows are inert): GitHub Actions only runs .github/workflows/ at the repo root, so nested copies never execute; they are vendored templates or stale leftovers. Security campaigns gain nothing from sweeping nested copies; single-source-of-truth campaigns still benefit.

Helper output

TSV, one row per matching workflow file:

<repo>\t<path>\t<blob-sha>\t<top-level|nested>

Cost: one Git Tree API call per repo (~300 calls in total). These count against the core rate-limit bucket (5000/hr), not the throttled code_search bucket (10/min).
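
The walk described above can be sketched roughly as follows. This is not the contents of list-workflow-paths.sh, which this page does not show; the function names (list_tree, emit_rows, walk_org) and the --limit value are hypothetical, and the sketch assumes an authenticated gh CLI. Only emit_rows is pure; the other two functions call the network and are defined but not invoked here.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; names are hypothetical, not the real helper's internals.

# Print "<path>\t<sha>" for every blob in a repo's tree (one API call).
# Note: the Git Tree API sets "truncated": true on very large trees;
# this sketch does not handle that case.
list_tree() {
  local repo=$1 branch=$2
  gh api "repos/$repo/git/trees/$branch?recursive=1" \
    --jq '.tree[] | select(.type == "blob") | [.path, .sha] | @tsv'
}

# Read "<path>\t<sha>" rows on stdin and emit helper-style rows
# "<repo>\t<path>\t<blob-sha>\t<top-level|nested>" for one workflow file.
emit_rows() {
  local repo=$1 file=$2
  awk -F'\t' -v OFS='\t' -v repo="$repo" -v file="$file" '
    BEGIN { top = ".github/workflows/" file; suffix = "/" top }
    # Exact match at the repo root: the only location Actions executes.
    $1 == top { print repo, $1, $2, "top-level"; next }
    # Same file under <subdir>/.github/workflows/: a nested copy.
    length($1) > length(suffix) &&
      substr($1, length($1) - length(suffix) + 1) == suffix {
        print repo, $1, $2, "nested"
    }'
}

# Walk every repo in an org (assumed limit of 1000) and emit rows for one file.
walk_org() {
  local org=$1 file=$2 repo branch
  gh repo list "$org" --limit 1000 --json nameWithOwner,defaultBranchRef \
    --jq '.[] | select(.defaultBranchRef != null)
          | [.nameWithOwner, .defaultBranchRef.name] | @tsv' |
  while IFS=$'\t' read -r repo branch; do
    list_tree "$repo" "$branch" < /dev/null | emit_rows "$repo" "$file"
  done
}
```

Under these assumptions, `walk_org <org> scorecard.yml > scorecard-tuples.tsv` would produce the kind of TSV the classifiers consume. Keeping the top-level/nested decision in a pure filter makes it checkable against a saved tree listing without spending API calls.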

Classifier extensions

Each classify-*.sh now auto-detects input format from the first byte:

  • { → JSONL from gh /search/code (legacy path)
  • otherwise → TSV from list-workflow-paths.sh (preferred — handles nested)

Output is unified to 7 columns: repo \t path \t sha \t class \t reason \t lines \t details. The new path column carries the file's location inside the repo, so sweeps can target nested copies as first-class wrapper sites.

Shared normalize_input extracted into _lib.sh; each classifier sources it.
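
A minimal sketch of what first-byte detection might look like. The real normalize_input in _lib.sh is not shown on this page, so the function names below are hypothetical; the JSONL field names (.repo, .path, .sha) follow the commit message for this change, and the JSONL branch assumes jq is installed.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; not the actual _lib.sh implementation.

# Report "jsonl" if the file starts with "{", otherwise "tsv".
detect_format() {
  local first
  first=$(head -c 1 < "$1")
  if [ "$first" = "{" ]; then
    printf 'jsonl\n'
  else
    printf 'tsv\n'
  fi
}

# Normalize either input format to "<repo>\t<path>\t<sha>" rows.
normalize_input_sketch() {
  case "$(detect_format "$1")" in
    jsonl)
      # Legacy rows may lack .path; fall back to the empty string.
      jq -r '[.repo, (.path // ""), .sha] | @tsv' "$1"
      ;;
    tsv)
      # Helper rows carry a fourth scope column that classifiers do not need.
      cut -f1-3 "$1"
      ;;
  esac
}
```

One caveat for consumers of the normalized rows: a bash `read` with IFS set to a tab collapses consecutive tabs, so an empty path column would be lost. Splitting with awk -F'\t' or cut preserves empty fields.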

Validation

Smoke-tested both input paths:

  • TSV (helper): classify-mirror.sh on scorecard-tuples.tsv (287 repos × top-level + nested) — fetches blobs and emits per-(repo, path) rows.
  • JSONL (legacy): classify-mirror.sh on mirror-full.json — 267 TRIVIAL + 22 NEEDS_REVIEW, matching prior /tmp/drift-survey/sweep-report.md.

Stacked on #194

scripts/sweep-classifiers/ only exists once #194 merges. The diff against main includes #194's files transitively; once #194 lands, this PR narrows to just the helper + extensions.

Standing follow-ups

  • Once this lands, re-survey each candidate with the helper for ground-truth wrapper-site counts before firing any sweep.

🤖 Generated with Claude Code

…-workflow campaign

Durable tooling for the wrapper-sweep work that follows each of the
foundational reusable PRs (#187 mirror, #190 secret-scanner, #192
codeql, #193 hypatia-scan).

Each classifier:
- reads a paginated `gh api /search/code` JSON dump
- fetches each unique blob SHA exactly once (cached in $BLOBS_DIR)
- emits per-repo TSV: <repo>\t<sha>\t<class>\t<reason>\t<lines>\t<details>
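
The fetch-once behaviour can be sketched as a cache keyed by blob SHA. This is an assumption about the shape of the logic, not the classifiers' actual code; cached_blob and fetch_blob_remote are hypothetical names, and the fetcher is passed as a parameter so the caching can be exercised without network access.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; not the classifiers' actual code.

BLOBS_DIR=${BLOBS_DIR:-$(mktemp -d)}

# Download one blob's raw content via the Git Blobs API (needs gh auth).
fetch_blob_remote() {
  gh api "repos/$1/git/blobs/$2" -H 'Accept: application/vnd.github.raw+json'
}

# Print the cached file path for a blob, fetching only on a cache miss.
# Identical content has an identical SHA, so one fetch serves every repo
# that carries the same copy of the workflow.
cached_blob() {
  local repo=$1 sha=$2 fetcher=${3:-fetch_blob_remote}
  local dest="$BLOBS_DIR/$sha"
  if [ ! -e "$dest" ]; then
    # Write to a temp file first so a failed fetch never poisons the cache.
    if "$fetcher" "$repo" "$sha" > "$dest.tmp"; then
      mv "$dest.tmp" "$dest"
    else
      rm -f "$dest.tmp"
      return 1
    fi
  fi
  printf '%s\n' "$dest"
}
```

Because most repos carry the canonical copy of a template, a cache of this shape keeps the number of blob fetches close to the number of distinct variants rather than the number of repos.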

Classes vary per template but follow the same shape: TRIVIAL (canonical
match, mechanical wrapper) vs SLIM/MISSING/OLDER (propagation lag,
auto-upgrades on first run after wrapper merge) vs NEEDS_REVIEW
(custom workflow body, requires per-repo diff).

Numbers produced by these classifiers across the four campaign templates:
- mirror.yml      — 267/289 TRIVIAL (92.4%); 22 NEEDS_REVIEW
- secret-scanner  — 273/281 missing shell-secrets (97.2%); 1 TRIVIAL (standards itself)
- codeql          — 246/263 mechanical (93.5%); 17 NEEDS_REVIEW
- hypatia-scan    — 249/255 safe-to-standardize-up (97.6%); 6 NEEDS_REVIEW

README documents the path-filter caveat: `gh api /search/code` with
`path:.github/workflows` excludes monorepo-nested workflow files; the
broader `filename:` query (no path filter) catches them. For
hypatia-scan, the broader query returns 704 vs the 255 path-filtered
count — the ~449 nested copies also need wrappers when sweeps fire.
…rch undercounts

The /search/code endpoint undercounts monorepo-nested workflow files
in two compounding ways:

  Layer 1 (path-prefix filter)  — `path:.github/workflows` matches the
                                  path prefix only; nested .github
                                  paths are excluded outright.
  Layer 2 (org-scope truncation) — even broad `filename:` queries hit
                                  internal result caps and silently
                                  drop entries. Validated against
                                  scorecard.yml: broad query saw 152
                                  paths (all top-level flagged); the
                                  Git Tree API walk found 626 nested
                                  copies in the same scope.

list-workflow-paths.sh walks `gh repo list` and queries each repo's
Git Tree API (recursive=1) directly, emitting one TSV row per
matching file:

  <repo>\t<path>\t<blob-sha>\t<top-level|nested>

Cost: one Git Tree API call per repo (~300 calls; uses core bucket,
not the 10/min code_search bucket).

README updated to document all 3 undercount layers + the Layer-3
caveat that nested workflows are inert (Actions only runs the repo
root .github/workflows/ directory). For sweep planning, this script
is now the source of truth.

Refs: #194 (parent), #199 (campaign meta-doc).
hyperpolymath enabled auto-merge (squash) May 26, 2026 12:51
hyperpolymath added a commit that referenced this pull request May 26, 2026
Original draft claimed scorecard.yml had zero nested copies, based on
the broad gh /search/code query. After #204 added the
list-workflow-paths.sh helper (which bypasses the Layer-2
truncation), per-repo enumeration revealed 626 nested copies the
broad query missed entirely.

Scorecard is NOT the cleanest sweep target. It now has the
second-highest nested count of the 5 candidates (behind hypatia-scan).
Total wrapper sites: 884 (258 top-level + 626 nested).

Layer-3 caveat still applies — nested copies are inert; disposition
is single-source-of-truth cleanup, not security gain.

This correction is itself an example of the methodology point the
doc was trying to make: every count derived from /search/code is
suspect until validated against the Git Tree API.
Each classify-*.sh now auto-detects input format and emits the
sweep-target path as an explicit column.

Input formats (auto-detected from first byte):
  - JSONL (legacy):  {repo, path?, sha}  — from gh /search/code paginate
  - TSV (preferred): repo\tpath\tsha\tscope — from list-workflow-paths.sh

Output (all 4 classifiers, 7 columns):
  repo \t path \t sha \t class \t reason \t lines \t details

The path column carries the file's location inside the repo, so
sweeps can target nested copies as first-class wrapper sites (not
just root-level workflows). When the input carries no path (legacy
JSONL rows without a .path field), the column is emitted as the
empty string "".
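
As a sketch of how a sweep might consume the 7-column output, the filter below selects rows of one class and prints repo, path, and sha. The function name sweep_targets and the "(path unknown)" placeholder are hypothetical; the column order is the one listed above.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; sweep_targets is a hypothetical consumer.

# usage: sweep_targets CLASS < classifier-output.tsv
# Columns: repo, path, sha, class, reason, lines, details.
sweep_targets() {
  awk -F'\t' -v OFS='\t' -v want="$1" '
    # awk -F"\t" keeps empty fields, so a blank legacy path stays in column 2.
    NF >= 4 && $4 == want {
      print $1, ($2 == "" ? "(path unknown)" : $2), $3
    }'
}
```

Rows that came from legacy JSONL input surface with the placeholder, which makes it visible that they need re-surveying with the helper before a sweep can target an exact file.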

Validated:
  - classify-mirror.sh /tmp/drift-survey/mirror-full.json (JSONL):
    267 TRIVIAL + 22 NEEDS_REVIEW, matches prior sweep-report.md.
  - classify-mirror.sh /tmp/drift-survey/scorecard-tuples.tsv (TSV):
    fetches blobs and emits per-(repo, path) rows correctly.

Shared `normalize_input` extracted into _lib.sh; each classifier
sources it. Reduces per-script duplication.

Refs: #199 (campaign), #204 (helper script).
hyperpolymath changed the title from "tooling(scripts): bypass /search/code undercount with per-repo Git Tree walk" to "tooling(scripts): nested-path support — Git Tree helper + classifier consumers" on May 26, 2026
hyperpolymath added a commit that referenced this pull request May 26, 2026
Walked the Git Tree API for all 5 templates via list-workflow-paths.sh
from #204. Findings:

- Top-level path-filtered queries undercounted by 1-35% across
  all 5 templates (worst: hypatia-scan 255 -> 344, +89 / 35%).
- Nested-copy counts were undercounted by more than 100% for
  mirror.yml (133 reported -> 335 true, +202 / 152%).
- hypatia-scan top-level has only 3 unique blob SHAs across
  344 sites -> 0.9% drift on the executing surface (vs the
  11.8% drift the PR body reports for top-level+nested).
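
The top-level drift figure can be reproduced from the helper's TSV with a short filter. This sketch assumes drift means unique blob SHAs divided by sites, which is what 3 / 344 ≈ 0.9% suggests; the commit does not define the term, and drift_summary is a hypothetical name.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; the drift definition is an assumption.

# usage: drift_summary helper-output.tsv
# Input columns: repo, path, blob-sha, scope (top-level|nested).
# Prints "<sites>\t<unique-shas>\t<drift-percent>" for top-level rows only.
drift_summary() {
  awk -F'\t' '
    $4 == "top-level" {
      sites++
      if (!seen[$3]++) uniq++   # count each blob SHA the first time it appears
    }
    END {
      if (sites) printf "%d\t%d\t%.1f\n", sites, uniq, 100 * uniq / sites
    }' "$@"
}
```

Restricting the count to top-level rows matches the point made above: only the root .github/workflows/ copies execute, so that is the surface where drift matters for security campaigns.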

Replaced the 'Corrected estate counts' section with three tables:
helper-validated totals, top-level-only drift, and initial-survey
undercount summary. Added LOC retirement table: ~275k LOC top-level
across the 5 reusables, ~732k including nested copies.

Updated Layer 2 documentation to note path-filtered queries are
ALSO truncated (previously the doc only said broad queries were).

Updated Standing follow-ups: marked the per-(repo,path) classifier
ingestion DONE (shipped in #204); removed the 'file scorecard' item
(filed as #205); added quarterly re-run suggestion.