Context
We currently have daily-pr-review-health.yml, a Claude-powered workflow that:
Fetches the last N days of pr-review.yml run history
Asks Claude to identify failure patterns
Opens a GitHub Issue when failures are detected
This is great for a single workflow, but we have a growing fleet of agentic workflows across the petry-projects org and no unified view of their health.
Problem Statement
As org admins we lack a single place to answer:
| Question | Current answer |
| --- | --- |
| Which workflows are failing most often? | Manual (check each repo) |
| Which workflows have gotten slower over the past week? | None |
| Where is our GitHub Actions spend going? | GitHub billing UI (no per-workflow breakdown) |
| Which workflows are flaky (non-deterministically failing)? | None |
Vision
A single scheduled workflow that acts as an org-wide Actions fleet monitor — scanning all repos in petry-projects, aggregating signals across the reliability / performance / cost axes, and surfacing findings directly in GitHub (step summaries + targeted Issues) with zero external infrastructure.
Target signals
Reliability — failure rate per workflow over rolling 7 / 30 day windows; flakiness score (workflows that alternate pass/fail more than a threshold).
Performance — p50/p95 run duration per workflow; week-over-week duration regression detection.
Cost / usage — billed minutes per workflow and per repo (via the GitHub Billing REST API, or job-level billable field on run objects).
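The reliability and performance signals above come down to a few small calculations. The Python sketch below is illustrative only, since the discussion proposes gh / jq for the real collection. It assumes run conclusions and durations (in seconds) have already been pulled from the run history. The function names, and the choice to exclude cancelled or skipped runs from the failure rate, are assumptions rather than anything specified in the proposal.

```python
import math


def failure_rate(conclusions):
    """Share of failed runs among runs that ended in success or failure.

    Assumption: cancelled/skipped runs are left out of the denominator.
    """
    finished = [c for c in conclusions if c in ("success", "failure")]
    if not finished:
        return 0.0
    return sum(c == "failure" for c in finished) / len(finished)


def percentile(durations, pct):
    """Nearest-rank percentile (pct between 1 and 100) of run durations."""
    if not durations:
        raise ValueError("no durations to summarise")
    ordered = sorted(durations)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def duration_regression(previous, current):
    """Relative week-over-week change; 0.5 means the workflow is 50% slower."""
    if previous <= 0:
        return 0.0
    return (current - previous) / previous


# Hypothetical sample data for one workflow.
runs = ["success", "failure", "success", "success"]
seconds = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(failure_rate(runs), percentile(seconds, 50), percentile(seconds, 95))
```

Nearest-rank is used here because it always returns an observed duration. An interpolating percentile would work just as well if smoother week-over-week comparisons are preferred.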
Target delivery (GitHub-native only)
Step Summary — rich HTML table published on every run; visually scannable without leaving GitHub Actions.
GitHub Issue — opened (or updated via find-or-create) only when actionable thresholds are breached (e.g., failure rate > 20%, duration regression > 50%, top-5 cost consumers changed significantly).
Optional: Discussion post — weekly digest as a Discussion thread so the team can comment inline.
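As a rough illustration of the Step Summary delivery, the Python sketch below renders an HTML table and appends it to the file named by the GITHUB_STEP_SUMMARY environment variable, which GitHub Actions sets for each step. The row shape (workflow, failure rate, p95 seconds) and both function names are assumptions made for the example.

```python
import html
import os


def render_summary_table(rows):
    """Build an HTML table from (workflow, failure_rate, p95_seconds) tuples."""
    lines = [
        "<table>",
        "<tr><th>Workflow</th><th>Failure rate</th><th>p95 (s)</th></tr>",
    ]
    for workflow, rate, p95 in rows:
        lines.append(
            f"<tr><td>{html.escape(workflow)}</td>"
            f"<td>{rate:.0%}</td><td>{p95}</td></tr>"
        )
    lines.append("</table>")
    return "\n".join(lines)


def publish_summary(markup):
    """Append markup to the step summary file; return False outside Actions."""
    path = os.environ.get("GITHUB_STEP_SUMMARY")
    if not path:
        return False
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(markup + "\n")
    return True


# Hypothetical row; outside GitHub Actions this only builds the markup.
table = render_summary_table([("example-repo/pr-review.yml", 0.25, 312)])
publish_summary(table)
```

Escaping workflow names keeps a stray `<` or `&` in a name from breaking the table markup.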
Existing Solutions to Evaluate (reduce custom code)
Before building more bespoke shell + Claude glue, we should investigate what already exists:
GitHub-native (no extra infra)
GitHub REST API billing data (/repos/{owner}/{repo}/actions/billing) and the billable field on run objects.
Open-source tooling worth evaluating
actions-usage CLI for cost data; actionlint for workflow YAML hygiene.
Commercial (noted for completeness)
Depot, BuildPulse, Trunk.io, Datadog CI Visibility: all provide richer fleet-level dashboards but require an external account and ongoing cost.
Proposed Architecture (strawman)
org-fleet-monitor.yml
  schedule: daily + workflow_dispatch
  inputs:
    lookback_days: [1, 7, 30]
    repos: ["all" | comma-separated list]    # default: org-wide discovery
    signal_flags: [reliability, perf, cost]  # toggleable
  jobs:
    discover-repos
      → gh api /orgs/petry-projects/repos
      → filter: archived=false, has workflows
    collect-metrics (matrix: repos)
      → For each repo:
          - fetch workflow runs (last N days)
          - compute: failure_rate, flakiness_score, p95_duration, billed_minutes
          - emit structured JSON artifact per repo
    aggregate-and-report
      → merge per-repo JSONs
      → rank by severity (reliability > perf > cost)
      → write Step Summary (HTML table)
      → if any threshold breached:
          find-or-create Issue titled "Actions Fleet Report — <date>"
          update body with latest findings
          @mention org admin team
Claude's role narrows: instead of driving the entire analysis via shell + Claude prompt, Claude is called once at the end to write the narrative summary paragraph — the data collection and threshold logic becomes pure gh / jq.
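The aggregate-and-report step (merge per-repo JSONs, then rank reliability over perf over cost) could look roughly like the Python sketch below. The per-repo JSON shape, with a "repo" key and a "findings" list whose entries carry "axis", "workflow", and "score", is a hypothetical format invented for this example. The proposal does not define one.

```python
# Lower number sorts first: reliability > perf > cost.
AXIS_PRIORITY = {"reliability": 0, "perf": 1, "cost": 2}


def merge_reports(per_repo_reports):
    """Flatten per-repo reports into one list of findings tagged with the repo."""
    findings = []
    for report in per_repo_reports:
        for finding in report["findings"]:
            findings.append({"repo": report["repo"], **finding})
    return findings


def rank_by_severity(findings):
    """Order by axis priority first, then by score with the highest first."""
    return sorted(
        findings,
        key=lambda f: (AXIS_PRIORITY[f["axis"]], -f["score"]),
    )


# Hypothetical per-repo artifacts, as if loaded from the matrix job outputs.
reports = [
    {"repo": "repo-a", "findings": [
        {"axis": "cost", "workflow": "build.yml", "score": 0.9},
        {"axis": "reliability", "workflow": "test.yml", "score": 0.3},
    ]},
    {"repo": "repo-b", "findings": [
        {"axis": "reliability", "workflow": "pr-review.yml", "score": 0.6},
    ]},
]
for item in rank_by_severity(merge_reports(reports)):
    print(item["repo"], item["workflow"], item["axis"], item["score"])
```

Sorting on a tuple keeps the axis ordering strict, so a high-scoring cost finding never outranks a low-scoring reliability finding. A weighted score would be an alternative if strict ordering proves too blunt.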
Open Questions
Repo discovery: enumerate all org repos dynamically each run, or maintain an explicit allow/deny list? Dynamic is zero-maintenance; an explicit list enables opt-out.
Flakiness definition: how many alternating pass/fail cycles in a rolling window constitute "flaky"? Suggested: ≥ 3 alternations in 10 consecutive runs.
Threshold calibration: what failure-rate % triggers an Issue? Suggested defaults: warning at 10%, critical at 25%.
Issue strategy: new Issue per day vs. find-or-update a single open "fleet health" Issue? Find-or-update reduces noise but makes the history harder to scan.
actions-usage CLI: worth vendoring into this repo, or run via go install in the workflow step?
Actionlint hygiene scan: add a one-time actionlint sweep of all org workflow YAML as a bonus sub-report?
Auth: current workflow uses DON_PETRY_BOT_GH_PAT. For org-wide actions:read, does that PAT already have the right scopes across all repos?
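To make the suggested flakiness and threshold defaults concrete, here is a small Python sketch. It assumes conclusions are ordered oldest first, treats "at 10%" and "at 25%" as inclusive bounds, and ignores runs that did not end in success or failure. All three choices, and the function names, are assumptions that the open questions above would need to settle.

```python
def count_alternations(conclusions):
    """Count pass/fail flips between consecutive finished runs."""
    outcomes = [c for c in conclusions if c in ("success", "failure")]
    return sum(a != b for a, b in zip(outcomes, outcomes[1:]))


def is_flaky(conclusions, window=10, min_alternations=3):
    """True if the most recent `window` runs flip at least `min_alternations` times."""
    return count_alternations(conclusions[-window:]) >= min_alternations


def classify_failure_rate(rate, warning=0.10, critical=0.25):
    """Map a failure rate (0.0 to 1.0) onto ok / warning / critical."""
    if rate >= critical:
        return "critical"
    if rate >= warning:
        return "warning"
    return "ok"


# Hypothetical history, oldest run first.
history = ["success", "failure", "success", "success", "failure", "success"]
print(count_alternations(history), is_flaky(history))
print(classify_failure_rate(2 / 6))
```

A workflow that fails ten times in a row has a high failure rate but zero alternations, so the two checks stay independent: one flags broken workflows, the other flags non-deterministic ones.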
Next Steps
Discuss and close the open questions above
Evaluate actions-usage CLI as the cost-data foundation
org-fleet-monitor.yml + scripts/fleet_monitor.sh (replacing pr_review_health.sh)
actionlint hygiene sub-report
Looking forward to input on thresholds, additional signals, or OSS tools I missed.