RFC: Evolving daily-pr-review-health into an Org-Wide Actions Fleet Monitor #193

don-petry (Maintainer) · 2026-05-16T12:59:54Z

Context

We currently have daily-pr-review-health.yml — a Claude-powered workflow that:

Runs daily at 06:00 UTC
Fetches the last N days of pr-review.yml run history
Asks Claude to identify failure patterns
Opens a GitHub Issue when failures are detected

This is great for a single workflow, but we have a growing fleet of agentic workflows across the petry-projects org and no unified view of their health.

Problem Statement

As org admins, we lack a single place to answer:

| Question | Current answer |
| --- | --- |
| Which workflows are failing most often? | Manual: check each repo |
| Which workflows have gotten slower over the past week? | None |
| Where is our GitHub Actions spend going? | GitHub billing UI (no per-workflow breakdown) |
| Which workflows are flaky (non-deterministically failing)? | None |

Vision

A single scheduled workflow that acts as an org-wide Actions fleet monitor — scanning all repos in petry-projects, aggregating signals across the reliability / performance / cost axes, and surfacing findings directly in GitHub (step summaries + targeted Issues) with zero external infrastructure.

Target signals

Reliability: failure rate per workflow over rolling 7- and 30-day windows; flakiness score (how often consecutive runs flip between pass and fail, flagged above a threshold).
Performance: p50/p95 run duration per workflow; week-over-week duration regression detection.
Cost / usage: billed minutes per workflow and per repo (via the GitHub Billing REST API, or the `billable` field returned per run by the workflow run usage endpoint).
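The reliability and performance signals above can be sketched as a few pure functions. This is a minimal illustration in Python (the RFC itself leans toward gh / jq), and it rests on assumptions not stated in the RFC: only `success` and `failure` conclusions count toward the failure rate, percentiles use the nearest-rank method, and the function names are hypothetical.

```python
def failure_rate(conclusions):
    """Share of decided runs that failed.

    Assumption: conclusions other than "success"/"failure"
    (e.g. "cancelled", "skipped") are ignored.
    """
    decided = [c for c in conclusions if c in ("success", "failure")]
    if not decided:
        return 0.0
    return sum(c == "failure" for c in decided) / len(decided)


def percentile(durations, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not durations:
        raise ValueError("no durations to summarise")
    ordered = sorted(durations)
    # Integer ceiling of pct * n / 100, avoiding float rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def duration_regression(this_week, last_week, pct=95):
    """Relative change in the chosen percentile, week over week.

    Returns None when last week's percentile is zero, because the
    ratio is undefined in that case.
    """
    previous = percentile(last_week, pct)
    current = percentile(this_week, pct)
    if previous == 0:
        return None
    return (current - previous) / previous
```

With this definition, a return value of `0.5` from `duration_regression` corresponds to the "duration regression > 50%" style of threshold discussed under target delivery.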

Target delivery (GitHub-native only)

Step Summary — rich HTML table published on every run; visually scannable without leaving GitHub Actions.
GitHub Issue — opened (or updated via find-or-create) only when actionable thresholds are breached (e.g., failure rate > 20%, duration regression > 50%, top-5 cost consumers changed significantly).
Optional: Discussion post — weekly digest as a Discussion thread so the team can comment inline.
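As a rough sketch of the Step Summary delivery path, the snippet below renders per-workflow rows as an HTML table and appends it to the file named by the `GITHUB_STEP_SUMMARY` environment variable, which is how job summaries are written. The row keys (`repo`, `workflow`, `failure_rate`, `p95_seconds`) and both function names are assumptions made for illustration, not a schema the RFC defines.

```python
import html
import os


def render_summary_table(rows):
    """Build an HTML table from a list of per-workflow metric dicts."""
    header = (
        "<tr><th>Repo</th><th>Workflow</th>"
        "<th>Failure rate</th><th>p95 (s)</th></tr>"
    )
    body = "".join(
        "<tr><td>{}</td><td>{}</td><td>{:.0%}</td><td>{:.0f}</td></tr>".format(
            html.escape(row["repo"]),      # escape names so they cannot break the markup
            html.escape(row["workflow"]),
            row["failure_rate"],
            row["p95_seconds"],
        )
        for row in rows
    )
    return f"<table>{header}{body}</table>\n"


def publish_summary(markup):
    """Append markup to the job summary, or print it when run locally."""
    path = os.environ.get("GITHUB_STEP_SUMMARY")
    if path is None:
        print(markup)
        return
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(markup)
```

Falling back to `print` outside Actions keeps the script testable on a laptop without any extra flags.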

Existing Solutions to Evaluate (reduce custom code)

Before building more bespoke shell + Claude glue, we should investigate what already exists:

GitHub-native (no extra infra)

| Feature | What it gives us | Gaps |
| --- | --- | --- |
| Actions Usage Metrics (GA Mar 2025) | Minutes consumed per workflow/repo/OS in the GitHub UI | Not queryable via API in aggregate; no alerting |
| Actions Performance Metrics (GA Mar 2025) | Job run times, queue times, failure rates in the UI | Same: UI-only; no webhook/alerting |
| Billing REST API (`/repos/{owner}/{repo}/actions/billing`; path to be verified, since the documented Actions billing endpoints may be org- or user-scoped only) | Minutes used + cost estimate per repo | Per-repo only; no per-workflow breakdown via API |
| Workflow Run API + `billable` field | Per-OS billable time per run, with per-job durations | Must roll up manually; no org-level endpoint |
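The manual roll-up of the `billable` field might look like the sketch below. It assumes each payload has the shape I understand the "Get workflow run usage" endpoint (`GET /repos/{owner}/{repo}/actions/runs/{run_id}/timing`) to return: a `billable` object keyed by runner OS, each with a `total_ms` value. That shape should be confirmed against the current API docs. The function name is hypothetical, and the sketch sums raw milliseconds rather than reproducing GitHub's per-job rounding for invoicing.

```python
from collections import Counter


def billed_minutes_by_os(timing_payloads):
    """Sum billable milliseconds per runner OS and convert to minutes.

    Assumption: payloads look like
    {"billable": {"UBUNTU": {"total_ms": 180000, ...}, ...}}.
    """
    totals_ms = Counter()
    for payload in timing_payloads:
        for os_name, bucket in payload.get("billable", {}).items():
            totals_ms[os_name] += bucket.get("total_ms", 0)
    return {os_name: ms / 60000 for os_name, ms in totals_ms.items()}
```

Grouping the payloads by workflow ID before calling this function would give the per-workflow breakdown that the billing UI lacks.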

Open-source tooling worth evaluating

| Tool | What it does | Fit for us |
| --- | --- | --- |
| actions-usage | CLI: minutes used per workflow/OS across an org | High: could replace billing rollup logic |
| actionlint | Static analysis of workflow YAML: syntax, security, deprecated commands | Orthogonal but high-value: hygiene signal |
| Labbs/github-actions-exporter | Prometheus exporter for workflow run metrics | Lower: requires Prometheus (out of scope now) |
| github-actions-opentelemetry | OTLP traces per workflow run | Future path if we add observability infra |
| Grafana dashboard #24157 | Pre-built dashboard for Actions metrics | Requires Grafana: out of scope now |

Commercial (noted for completeness)

Depot, BuildPulse, Trunk.io, Datadog CI Visibility — all provide richer fleet-level dashboards but require an external account and ongoing cost.

Proposed Architecture (strawman)

org-fleet-monitor.yml
  schedule: daily + workflow_dispatch
  inputs:
    lookback_days: [1, 7, 30]
    repos: ["all" | comma-separated list]      # default: org-wide discovery
    signal_flags: [reliability, perf, cost]   # toggleable

  jobs:
    discover-repos
      → gh api --paginate /orgs/petry-projects/repos
      → filter: archived=false, has workflows

    collect-metrics (matrix: repos)
      → For each repo:
          - fetch workflow runs (last N days)
          - compute: failure_rate, flakiness_score, p95_duration, billed_minutes
          - emit structured JSON artifact per repo

    aggregate-and-report
      → merge per-repo JSONs
      → rank by severity (reliability > perf > cost)
      → write Step Summary (HTML table)
      → if any threshold breached:
          find-or-create Issue titled "Actions Fleet Report — <date>"
          update body with latest findings
          @mention org admin team
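The aggregate-and-report step's merge and ranking could be sketched as follows, in Python for illustration. The per-repo artifact schema used here (a `repo` name plus a list of `findings`, each with an `axis` and a numeric `score`) is purely hypothetical, since the RFC leaves the schema as a next step. The ordering implements "reliability > perf > cost", with higher scores first within an axis.

```python
import json

# Lower number sorts first: reliability > perf > cost.
AXIS_PRIORITY = {"reliability": 0, "perf": 1, "cost": 2}


def merge_reports(json_documents):
    """Flatten per-repo JSON artifacts into one list of findings."""
    findings = []
    for document in json_documents:
        report = json.loads(document)
        for finding in report["findings"]:
            # Tag each finding with the repo it came from.
            findings.append({"repo": report["repo"], **finding})
    return findings


def rank_findings(findings):
    """Order by axis priority, then by descending score within an axis."""
    return sorted(
        findings,
        key=lambda f: (AXIS_PRIORITY[f["axis"]], -f["score"]),
    )
```

Scores are only compared within an axis, so a failure rate and a minute count never need to share a scale.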

Claude's role narrows: instead of driving the entire analysis through a shell script plus a Claude prompt, Claude is called once at the end to write the narrative summary paragraph. Data collection and threshold logic become pure gh / jq.

Open Questions

Repo discovery: enumerate all org repos dynamically each run, or maintain an explicit allow/deny list? Dynamic is zero-maintenance; an explicit list enables opt-out.
Flakiness definition: how many alternating pass/fail cycles in a rolling window constitute "flaky"? Suggested: ≥ 3 alternations in 10 consecutive runs.
Threshold calibration: what failure-rate % triggers an Issue? Suggested defaults: warning at 10%, critical at 25%.
Issue strategy: new Issue per day vs. find-or-update a single open "fleet health" Issue? Find-or-update reduces noise but makes the history harder to scan.
actions-usage CLI: worth vendoring into this repo, or run via go install in the workflow step?
Actionlint hygiene scan: add a one-time actionlint sweep of all org workflow YAML as a bonus sub-report?
Auth: current workflow uses DON_PETRY_BOT_GH_PAT. For org-wide actions:read, does that PAT already have the right scopes across all repos?
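To make the suggested flakiness and threshold defaults concrete, here is a small sketch using the numbers proposed above (at least 3 alternations in the last 10 runs; warning at 10%, critical at 25%). It assumes conclusions are ordered oldest to newest and that only `success` and `failure` runs count. The function names are hypothetical and the defaults remain open for discussion.

```python
def count_alternations(conclusions):
    """Count adjacent pairs of runs whose outcome differs."""
    return sum(a != b for a, b in zip(conclusions, conclusions[1:]))


def is_flaky(conclusions, window=10, min_alternations=3):
    """Flag a workflow whose most recent runs keep flipping pass/fail.

    Assumption: conclusions are ordered oldest to newest, and only
    "success"/"failure" entries are considered.
    """
    decided = [c for c in conclusions if c in ("success", "failure")]
    recent = decided[-window:]
    return count_alternations(recent) >= min_alternations


def severity(failure_rate, warning=0.10, critical=0.25):
    """Map a failure rate (0.0 to 1.0) to "ok", "warning", or "critical"."""
    if failure_rate >= critical:
        return "critical"
    if failure_rate >= warning:
        return "warning"
    return "ok"
```

Under this definition a workflow that fails five times in a row and then recovers has one alternation and is not flaky, which separates a genuine outage from non-deterministic failures.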

Next Steps

Discuss and close the open questions above
Evaluate actions-usage CLI as the cost-data foundation
Draft org-fleet-monitor.yml + scripts/fleet_monitor.sh (replacing pr_review_health.sh)
Define structured JSON schema for per-repo metrics artifact
Design Step Summary HTML table template
Decide on actionlint hygiene sub-report
Verify PAT scopes for org-wide read

Looking forward to input on thresholds, additional signals, or OSS tools I missed.
