Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
103 changes: 103 additions & 0 deletions scripts/sweep-classifiers/README.adoc
Original file line number Diff line number Diff line change
@@ -0,0 +1,103 @@
= Sweep classifiers
:SPDX-License-Identifier: PMPL-1.0-or-later

Per-template classifiers used to triage the wrapper-sweep work that
follows each of the foundational reusable PRs filed against
`standards`:

[cols="2,1,3", options="header"]
|===
| Classifier | Reusable PR | Template
| `classify-mirror.sh` | #187 | `mirror.yml` (7-forge mirror bundle)
| `classify-secret-scanner.sh` | #190 | `secret-scanner.yml` (trufflehog/gitleaks/rust/shell)
| `classify-codeql.sh` | #192 | `codeql.yml` (CodeQL security analysis)
| `classify-hypatia-scan.sh` | #193 | `hypatia-scan.yml` (Hypatia neurosymbolic scan)
|===

== Pipeline

. Paginate `gh api /search/code` for the template across the estate.
. Group results by blob SHA; fetch each unique blob exactly once.
. Classify each blob (job-set match / line-count band / language matrix).
. Emit per-repo TSV: `<repo>\t<sha>\t<class>\t<reason>\t<lines>\t<details>`.

The expensive step (blob fetch) is cached in `$BLOBS_DIR`, so reruns
are fast.

== Usage

[source,bash]
----
# 1. Paginate the search (one-time per template):
gh api --paginate -X GET '/search/code' \
-f q='filename:mirror.yml path:.github/workflows org:hyperpolymath' \
--jq '.items[] | {repo: .repository.name, sha: .sha}' \
> /tmp/mirror-full.json

# 2. Run the classifier:
BLOBS_DIR=/tmp/mirror-blobs \
./classify-mirror.sh /tmp/mirror-full.json > /tmp/mirror-classification.tsv

# 3. Summarise:
awk -F'\t' '{print $3}' /tmp/mirror-classification.tsv | sort | uniq -c | sort -rn
----

== Nested-path caveat — 3 layers of undercount

`gh api /search/code` undercounts monorepo-nested workflow files in
three compounding ways:

. *Layer 1 — `path:` filter is a prefix match.* `path:.github/workflows`
excludes paths like `a2ml/bindings/deno/.github/workflows/mirror.yml`
outright.
. *Layer 2 — even broad queries are org-scope-truncated.* Removing the
`path:` filter and running `filename:codeql.yml org:hyperpolymath`
returns ~190 paths, but per-repo enumeration of just
`developer-ecosystem` finds 170 codeql.yml files. The endpoint
silently caps results once it hits internal limits. Empirically
validated against `scorecard.yml`: broad query saw 152 paths, all
top-level; per-repo enumeration revealed 626 nested copies the
broad query missed entirely.
. *Layer 3 — nested workflows do not execute.* GitHub Actions only runs
workflows from the repo-root `.github/workflows/` directory. Nested
copies are inert vendored templates or stale leftover. Security-driven
campaigns (e.g. propagating a missing guardrail) do NOT close
additional attack surface via nested wrappers. Single-source-of-truth
campaigns still benefit.

For accurate counts, use `list-workflow-paths.sh` (this directory),
which walks `gh repo list` and queries each repo's Git Tree API
directly — bypassing Layers 1 and 2.

[source,bash]
----
./list-workflow-paths.sh codeql.yml \
> /tmp/drift-survey/codeql-all-tuples.tsv

# Top-level vs nested split:
awk -F'\t' '{print $4}' /tmp/drift-survey/codeql-all-tuples.tsv \
| sort | uniq -c
----

Output format: `<repo>\t<path>\t<blob-sha>\t<top-level|nested>`.

The legacy `gh api /search/code` query is still useful as a quick first
look, but `list-workflow-paths.sh` is the source of truth for any
sweep-planning decision.

== Output classes (varies by classifier)

* `TRIVIAL` / `TRIVIAL_CURRENT` / `TRIVIAL_DEFAULT` — canonical match;
pure mechanical wrapper conversion.
* `MISSING_*` / `SLIM_*` / `OLDER_*` — propagation lag; same wrapper as
TRIVIAL upgrades the workflow body on first run after merge.
* `SINGLE_NON_DEFAULT` / `MULTI_LANGUAGE` — requires one or more input
overrides at the call site.
* `NEEDS_REVIEW` — extra jobs, line count out of any known band, or
custom workflow body. Per-repo diff required before wrapper.

== Related

* `scripts/apply-baseline.sh` — Hypatia-finding baseline filter (paired
with `scripts/tests/apply-baseline-test.sh`).
* PRs #187, #190, #192, #193 — the reusables these classifiers triage.
28 changes: 28 additions & 0 deletions scripts/sweep-classifiers/_lib.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
#!/usr/bin/env bash
# SPDX-License-Identifier: PMPL-1.0-or-later
# _lib.sh — Shared helpers for classify-*.sh scripts.
#
# Sourced by every classifier. Not directly executable.
#
# Provides: normalize_input <file>
#
# Emits one TSV row per workflow file: "<repo>\t<path>\t<sha>".
# Accepts two input shapes auto-detected from the first line:
#
# 1. JSONL from `gh api /search/code` (legacy):
# {"repo":"foo","sha":"abc123"}
# Emitted path is empty (no nested-copy info available).
#
# 2. TSV from `list-workflow-paths.sh` (preferred — handles nested):
# foo<TAB><path><TAB>abc123<TAB>top-level|nested
# Emitted path is the original `<path>` value; the scope column
# is dropped.

normalize_input() {
local file="$1"
if [[ "$(head -c1 "$file")" == "{" ]]; then
jq -r '[.repo, (.path // ""), .sha] | @tsv' "$file"
else
awk -F'\t' 'NF >= 3 { print $1 "\t" $2 "\t" $3 }' "$file"
fi
}
63 changes: 63 additions & 0 deletions scripts/sweep-classifiers/classify-codeql.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,63 @@
#!/usr/bin/env bash
# SPDX-License-Identifier: PMPL-1.0-or-later
# classify-codeql.sh — classify per-repo codeql.yml for #192 sweep.
#
# Canonical: 49 lines, single `analyze` job, matrix language=javascript-typescript
# build-mode=none.
#
# Classes:
# TRIVIAL_DEFAULT — single javascript-typescript language, default wrapper
# SINGLE_NON_DEFAULT — single rust or actions language, override wrapper
# MULTI_LANGUAGE — 2+ languages, multi-call wrapper
# NEEDS_REVIEW — NONE language matrix, or large custom workflow

set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=_lib.sh
. "$SCRIPT_DIR/_lib.sh"

INPUT="${1:-/tmp/drift-survey/codeql-full.json}"
BLOBS_DIR="${BLOBS_DIR:-/tmp/drift-survey/codeql-blobs}"
mkdir -p "$BLOBS_DIR"

classify_blob() {
local blob="$1" langs lines jobs
jobs=$(awk '/^jobs:[[:space:]]*$/{in_jobs=1; next} in_jobs && /^[A-Za-z]/{exit} in_jobs && /^ [a-z][a-z0-9_-]*:[[:space:]]*$/{sub(/^ /,""); sub(/:.*/,""); print}' "$blob" 2>/dev/null | sort -u | paste -sd, -)
langs=$(grep -E "^\s*- language:" "$blob" 2>/dev/null | sed 's/.*language: //; s/[[:space:]]*$//' | sort -u | paste -sd, -)
lines=$(wc -l < "$blob")
langs="${langs:-NONE}"

if [ "$lines" -gt 80 ]; then
echo "NEEDS_REVIEW custom_workflow_${lines}_lines $lines $langs"
return
fi
case "$langs" in
javascript-typescript) echo "TRIVIAL_DEFAULT - $lines $langs" ;;
rust|actions) echo "SINGLE_NON_DEFAULT +language=$langs $lines $langs" ;;
NONE) echo "NEEDS_REVIEW no_language_matrix $lines $langs" ;;
*,*) echo "MULTI_LANGUAGE +per-language-wrapper $lines $langs" ;;
*) echo "NEEDS_REVIEW unknown_lang:$langs $lines $langs" ;;
esac
}

echo "[fetch] retrieving unique blobs..." >&2
normalize_input "$INPUT" | awk -F'\t' '{print $3 "\t" $1}' | sort -u | while IFS=$'\t' read -r sha repo; do
[ -z "$sha" ] || [ "$sha" = "null" ] && continue
blob_file="$BLOBS_DIR/$sha.yml"
[ -s "$blob_file" ] && continue
gh api "/repos/hyperpolymath/$repo/git/blobs/$sha" --jq '.content' 2>/dev/null \
| base64 -d > "$blob_file" || echo "::warn fetch failed for $sha ($repo)" >&2
done

declare -A SHA_CLASS
echo "[classify] classifying unique blobs..." >&2
for blob in "$BLOBS_DIR"/*.yml; do
[ -s "$blob" ] || continue
sha=$(basename "$blob" .yml)
SHA_CLASS[$sha]=$(classify_blob "$blob")
done

normalize_input "$INPUT" | while IFS=$'\t' read -r repo path sha; do
[ -z "$sha" ] || [ "$sha" = "null" ] && { printf '%s\t%s\t%s\tNEEDS_REVIEW\tnull_sha\t-\t-\n' "$repo" "$path" "$sha"; continue; }
printf '%s\t%s\t%s\t%s\n' "$repo" "$path" "$sha" "${SHA_CLASS[$sha]:-NEEDS_REVIEW fetch_failed - -}"
done
62 changes: 62 additions & 0 deletions scripts/sweep-classifiers/classify-hypatia-scan.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,62 @@
#!/usr/bin/env bash
# SPDX-License-Identifier: PMPL-1.0-or-later
# classify-hypatia-scan.sh — classify per-repo hypatia-scan.yml for #193 sweep.
#
# Canonical (standards/.github/workflows/hypatia-scan.yml): 416 lines,
# single `scan` job. Every estate variant carries the same job set;
# drift is pure propagation lag, manifested as line count.
#
# Classes:
# TRIVIAL_CURRENT — 410-416 lines (current canonical-1)
# SLIM_PROPAGATION_LAG — 245-260 lines (older slimmer version)
# OLDER_SLIM — 205-235 lines (much older slim version)
# NEEDS_REVIEW — anything else (multi-job, line count out of band,
# or self-source like hypatia/standards repos)

set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=_lib.sh
. "$SCRIPT_DIR/_lib.sh"

INPUT="${1:-/tmp/drift-survey/hypatia-scan-full.json}"
BLOBS_DIR="${BLOBS_DIR:-/tmp/drift-survey/hypatia-blobs}"
mkdir -p "$BLOBS_DIR"

classify_blob() {
local blob="$1" jobs lines
jobs=$(awk '/^jobs:[[:space:]]*$/{in_jobs=1; next} in_jobs && /^[A-Za-z]/{exit} in_jobs && /^ [a-z][a-z0-9_-]*:[[:space:]]*$/{sub(/^ /,""); sub(/:.*/,""); print}' "$blob" 2>/dev/null | sort -u | paste -sd, -)
jobs="${jobs:-NONE}"
lines=$(wc -l < "$blob")
if [ "$jobs" != "scan" ]; then
echo "NEEDS_REVIEW job_set:$jobs $lines $jobs"
return
fi
case "$lines" in
41[0-9]|42[0-9]) echo "TRIVIAL_CURRENT - $lines scan" ;;
24[0-9]|25[0-9]) echo "SLIM_PROPAGATION_LAG +upgrade-to-current $lines scan" ;;
20[5-9]|21[0-9]|22[0-9]|23[0-9]) echo "OLDER_SLIM +upgrade-to-current $lines scan" ;;
*) echo "NEEDS_REVIEW line_count_oob:$lines $lines scan" ;;
esac
}

echo "[fetch] retrieving unique blobs..." >&2
normalize_input "$INPUT" | awk -F'\t' '{print $3 "\t" $1}' | sort -u | while IFS=$'\t' read -r sha repo; do
[ -z "$sha" ] || [ "$sha" = "null" ] && continue
blob_file="$BLOBS_DIR/$sha.yml"
[ -s "$blob_file" ] && continue
gh api "/repos/hyperpolymath/$repo/git/blobs/$sha" --jq '.content' 2>/dev/null \
| base64 -d > "$blob_file" || echo "::warn fetch failed for $sha ($repo)" >&2
done

declare -A SHA_CLASS
echo "[classify] classifying unique blobs..." >&2
for blob in "$BLOBS_DIR"/*.yml; do
[ -s "$blob" ] || continue
sha=$(basename "$blob" .yml)
SHA_CLASS[$sha]=$(classify_blob "$blob")
done

normalize_input "$INPUT" | while IFS=$'\t' read -r repo path sha; do
[ -z "$sha" ] || [ "$sha" = "null" ] && { printf '%s\t%s\t%s\tNEEDS_REVIEW\tnull_sha\t-\t-\n' "$repo" "$path" "$sha"; continue; }
printf '%s\t%s\t%s\t%s\n' "$repo" "$path" "$sha" "${SHA_CLASS[$sha]:-NEEDS_REVIEW fetch_failed - -}"
done
97 changes: 97 additions & 0 deletions scripts/sweep-classifiers/classify-mirror.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,97 @@
#!/usr/bin/env bash
# SPDX-License-Identifier: PMPL-1.0-or-later
# classify-mirror.sh — classify per-repo mirror.yml for the #187 sweep.
#
# Reads /tmp/drift-survey/mirror-full.json (output of `gh api --paginate
# /search/code` filtered to mirror.yml in .github/workflows). For each
# unique blob SHA, fetches the blob once (cached in $BLOBS_DIR) and
# classifies as:
#
# TRIVIAL — exactly the canonical 7 forges, line count in band,
# no extra/missing top-level jobs.
# NEEDS_REVIEW — anything else (extra forges, missing forges,
# structural diff). Reported with reason.
#
# Output: TSV to stdout, one row per repo:
# <repo>\t<sha>\t<classification>\t<reason>\t<line_count>\t<forges>

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=_lib.sh
. "$SCRIPT_DIR/_lib.sh"

INPUT="${1:-/tmp/drift-survey/mirror-full.json}"
BLOBS_DIR="${BLOBS_DIR:-/tmp/drift-survey/blobs}"
mkdir -p "$BLOBS_DIR"

# Canonical 7 forges (alphabetical for stable compare).
CANON_FORGES="bitbucket,codeberg,disroot,gitea,gitlab,radicle,sourcehut"
LINE_MIN=130
LINE_MAX=160

classify_blob() {
local blob="$1"
local forges line_count extra_jobs

# Forges: top-level jobs under `jobs:` starting with mirror-.
forges=$(awk '
/^jobs:[[:space:]]*$/ { in_jobs=1; next }
in_jobs && /^[A-Za-z]/ { exit }
in_jobs && /^ mirror-[a-z]+:[[:space:]]*$/ {
sub(/^ mirror-/, ""); sub(/:[[:space:]]*$/, ""); print
}
' "$blob" 2>/dev/null | sort -u | paste -sd, -)
forges="${forges:-NONE}"
line_count=$(wc -l < "$blob")

# Any non-mirror top-level jobs? Scope to the `jobs:` block (between
# `^jobs:$` and the next column-0 key, EOF, or comment-only/blank).
extra_jobs=$(awk '
/^jobs:[[:space:]]*$/ { in_jobs=1; next }
in_jobs && /^[A-Za-z]/ { exit } # next column-0 key ends the block
in_jobs && /^ [a-z][a-z0-9_-]*:[[:space:]]*$/ {
sub(/^ /, ""); sub(/:[[:space:]]*$/, ""); print
}
' "$blob" 2>/dev/null | grep -vE '^mirror-' | paste -sd, -)
extra_jobs="${extra_jobs:-}"

if [ "$forges" != "$CANON_FORGES" ]; then
echo "NEEDS_REVIEW forge_set:$forges $line_count $forges"
return
fi
if [ -n "$extra_jobs" ]; then
echo "NEEDS_REVIEW extra_jobs:$extra_jobs $line_count $forges"
return
fi
if [ "$line_count" -lt "$LINE_MIN" ] || [ "$line_count" -gt "$LINE_MAX" ]; then
echo "NEEDS_REVIEW line_count_oob:$line_count $line_count $forges"
return
fi
echo "TRIVIAL - $line_count $forges"
}

# 1. Fetch each unique SHA's blob (cached).
echo "[fetch] unique SHAs to retrieve..." >&2
normalize_input "$INPUT" | awk -F'\t' '{print $3 "\t" $1}' | sort -u | while IFS=$'\t' read -r sha repo; do
[ -z "$sha" ] || [ "$sha" = "null" ] && continue
blob_file="$BLOBS_DIR/$sha.yml"
[ -s "$blob_file" ] && continue
gh api "/repos/hyperpolymath/$repo/git/blobs/$sha" --jq '.content' 2>/dev/null \
| base64 -d > "$blob_file" || echo "::warn fetch failed for $sha ($repo)" >&2
done

# 2. Classify each unique SHA, cache in a map.
declare -A SHA_CLASS
echo "[classify] classifying unique blobs..." >&2
for blob in "$BLOBS_DIR"/*.yml; do
[ -s "$blob" ] || continue
sha=$(basename "$blob" .yml)
SHA_CLASS[$sha]=$(classify_blob "$blob")
done

# 3. Emit per-(repo, path) TSV: repo \t path \t sha \t class \t reason \t lines \t forges
normalize_input "$INPUT" | while IFS=$'\t' read -r repo path sha; do
[ -z "$sha" ] || [ "$sha" = "null" ] && { printf '%s\t%s\t%s\tNEEDS_REVIEW\tnull_sha\t-\t-\n' "$repo" "$path" "$sha"; continue; }
printf '%s\t%s\t%s\t%s\n' "$repo" "$path" "$sha" "${SHA_CLASS[$sha]:-NEEDS_REVIEW fetch_failed - -}"
done
Loading
Loading