ci(notify): file/update ci-broken GitHub issue when internal AzDO build breaks on main/release/* #17920
radical wants to merge 5 commits into
🚀 Dogfood this PR with:
curl -fsSL https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 17920
Or
iex "& { $(irm https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 17920"
0e6133b to
fe60937
PowerShell helper that files / updates / closes a ci-broken GitHub issue on microsoft/aspire for a given branch.

Two modes:
- Failure: GETs open ci-broken issues, filters by a hidden HTML-comment marker (`<!-- aspire-internal-build-broken:<branch> -->`), and either creates a new issue (with a managed failures-table region in the body) or appends a row to the existing one and posts a follow-up @-mention comment.
- Success: closes the matching open issue with a green-build comment.

Drives the gh CLI throughout, authenticated via $env:GH_TOKEN that the caller sets after minting an aspire-repo-bot installation token.

List uses the strongly-consistent /issues endpoint, not /search/issues, so near-simultaneous failures don't each file a duplicate; a post-create re-list catches the rare race past that window and closes ours as a duplicate of the older.

Always exits 0. Warnings surface via task.logissue + task.complete result=SucceededWithIssues so a silent regression (bot loses permission, label deleted, gh API shape changes) goes yellow rather than green.

-DryRun logs the would-be gh calls without mutating GitHub. In dry-run the script skips token mint entirely so the wrapper can validate pipeline plumbing without resolving aspire-repo-bot credentials.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
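The "always exits 0, but goes yellow" behavior described in this commit message can be sketched roughly as follows. This is only an illustration: the helper name `Get-WarningLoggingCommands` is hypothetical and is not taken from the PR; only the `##vso[task.logissue]` and `##vso[task.complete]` logging commands are real Azure Pipelines syntax.

```powershell
# Hypothetical helper; the name and shape are assumptions, not the PR's code.
function Get-WarningLoggingCommands {
    param([Parameter(Mandatory)][string]$Message)
    # task.logissue adds a warning annotation to the build.
    "##vso[task.logissue type=warning]$Message"
    # task.complete marks the task yellow instead of green.
    "##vso[task.complete result=SucceededWithIssues;]$Message"
}

try {
    # Stand-in for a notification step that fails.
    throw 'gh issue list failed'
}
catch {
    # Azure Pipelines parses logging commands from stdout.
    Get-WarningLoggingCommands -Message "Notification failed: $($_.Exception.Message)" |
        ForEach-Object { Write-Host $_ }
}
# The real script would finish with 'exit 0' here regardless of the outcome.
```

Keeping the command strings in a function that returns them, rather than writing them directly, is just a choice that makes the sketch easy to check in isolation.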
…ipeline

Adds two stages to azure-pipelines.yml that gate on the upstream build_sign_native / build / prepare_installers stage results and run Notify-GitHubOnBuildResult.ps1 on 1es-ubuntu-2204:
- notify_failure: fires when at least one upstream stage Failed. Composes a comma-separated -FailedStages list from dependencies.<stage>.result so the filed / updated issue body identifies which stage broke.
- notify_success: fires when all three upstream stages Succeeded / SucceededWithIssues (prepare_installers may legitimately Skip on stable GA release builds). Closes the open ci-broken issue for the branch.

Both stages mint the aspire-repo-bot installation token via Get-AspireBotInstallationToken.ps1 and export it as GH_TOKEN so the gh CLI invocations in the script authenticate as the bot.

Adds _IsNotificationBranch in common-variables.yml: exact match on refs/heads/main (NOT startsWith, the pipeline trigger's `main*` wildcard would otherwise sweep in branches like main-something) plus startsWith refs/heads/release/. Excludes internal/release/* so internal branch names don't leak into the public tracker.

Aspire-Release-Secrets variable group is imported at pipeline scope with the same non-PR + main/release/* gate, so manual feature-branch and PR runs don't pay the variable-group auth check at queue time.

A notifyOnFailureDryRun queue-time parameter logs would-be gh calls without mutating GitHub; applies to both stages so a green-build dry-run can't accidentally close real open issues.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds docs/ci/internal-build-failure-notifications.md describing the contract (labels, marker syntax, failures-table region, dedupe behavior, dry-run, manual cleanup) and a pointer from eng/pipelines/README.md so anyone reading the pipeline docs lands on the notification system.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
b3883d4 to
7870f74
Pull request overview
Adds automated public GitHub issue visibility for internal AzDO build breaks on main and release/*, so internal failures don’t remain unnoticed without someone manually checking AzDO.
Changes:
- Introduces a PowerShell notifier that files/updates a branch-deduped `ci-broken` issue on failures and closes it on the next green build.
- Adds `notify_failure`/`notify_success` stages to the internal pipeline, plus a queue-time dry-run parameter and branch gating.
- Documents the notification contract and links it from pipeline docs.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| eng/pipelines/scripts/Notify-GitHubOnBuildResult.ps1 | New notifier script that uses gh + an installation token to create/update/close ci-broken issues per branch. |
| eng/pipelines/README.md | Adds a short pointer describing the internal build-result notification behavior and links to full docs. |
| eng/pipelines/common-variables.yml | Adds _IsNotificationBranch to gate notifications to main/release/* (excluding internal/release/*). |
| eng/pipelines/azure-pipelines.yml | Adds dry-run parameter, imports secrets conditionally, and adds notify_failure/notify_success stages. |
| docs/ci/internal-build-failure-notifications.md | New contract documentation for behavior, dedupe strategy, and operational expectations. |
- Notify-GitHubOnBuildResult.ps1: register the GitHub App installation token via `##vso[task.setsecret]` instead of `##vso[task.setvariable ...;issecret=true]`. The only purpose was log masking, but task.setvariable also persists the value as a job-scoped variable that other tasks could accidentally reference via $(__notifyGhToken). task.setsecret gives us the masking without the persistence.
- azure-pipelines.yml + docs: dry-run mode logs the `gh` CLI commands it would run, not GitHub REST calls. Fix the parameter displayName, the parameter comment, and the corresponding docs paragraph.
- docs: Azure Pipelines stage results use `Skipped`, not `Skip` (the YAML stage condition correctly checks for 'Skipped'). Fix the prose.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
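As a rough illustration of the first bullet, masking a value with `task.setsecret` could look like the sketch below. The helper name `Get-MaskSecretCommand` and the placeholder token are assumptions made for the example; only the logging-command syntax comes from Azure Pipelines.

```powershell
# Hypothetical helper name; only the ##vso logging-command syntax is real.
function Get-MaskSecretCommand {
    param([Parameter(Mandatory)][string]$Secret)
    # task.setsecret masks the value in logs without creating a variable.
    return "##vso[task.setsecret]$Secret"
}

# Placeholder value, not a real token.
$token = 'ghs_example_placeholder'
Write-Host (Get-MaskSecretCommand -Secret $token)

# The form the commit moves away from also masks, but additionally defines a
# job-scoped variable that later tasks can read as $(__notifyGhToken):
#   ##vso[task.setvariable variable=__notifyGhToken;issecret=true]<value>
```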
The internal-build-failure notification script had no automated coverage, despite the established tests/Infrastructure.Tests/PowerShellScripts/ pattern and the script's highly testable pure helpers.

Add NotifyGitHubOnBuildResultTests covering:
- Test-NotifiableBranch: main (exact), release/*, and the negatives that matter (main-something, mainline, internal/release/*, feature/*)
- New-FailureTableRow: pipe-escaping, short-SHA shortening with full-SHA link, sub-7-char SHA fallback, em-dash for empty FailedStages
- Update-FailuresTableInBody: append + surrounding-prose preservation, max-rows rollover into the "earlier failures omitted" summary, omitted-count carry into the next index, and the missing-markers warning path (body left unchanged)
- -DryRun (Failure + Success): exits 0 and launches zero gh processes, verified with a recording fake gh placed ahead of the real one on PATH

The helpers are exercised by dot-sourcing the shipped script unmodified with a non-notifiable branch, so its main routine bails before any token mint or gh call and leaves the functions in scope. 'exit' inside a dot-sourced script only unwinds that script, not the test harness, so no testability hook in the production script is needed.

Repo-root lookup reuses the shared worktree-aware TestUtils.FindRepoRoot() rather than adding another private Aspire.slnx walk.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
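The branch cases listed for Test-NotifiableBranch can be illustrated with a small stand-in. This is a sketch, not the shipped function: the name `Test-NotifiableBranchSketch` is hypothetical, and it assumes the input is a full ref such as `refs/heads/main`, which the PR text does not confirm for the script itself.

```powershell
# Hypothetical stand-in; assumes full refs (refs/heads/...) as input.
function Test-NotifiableBranchSketch {
    param([Parameter(Mandatory)][string]$SourceBranch)
    # Exact, case-sensitive match for main: 'main-something' and 'mainline' must not qualify.
    if ($SourceBranch -ceq 'refs/heads/main') { return $true }
    # Prefix match for release/*; 'refs/heads/internal/release/...' has a different prefix.
    return $SourceBranch.StartsWith('refs/heads/release/', [System.StringComparison]::Ordinal)
}

# Example refs only; the branch names are illustrative.
'refs/heads/main',
'refs/heads/main-something',
'refs/heads/release/example',
'refs/heads/internal/release/example' |
    ForEach-Object { '{0} -> {1}' -f $_, (Test-NotifiableBranchSketch -SourceBranch $_) }
```

The exact match on main is the part that matters: a `StartsWith('refs/heads/main')` check would also accept the `main-something` style branches that the trigger wildcard admits.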

    # same stream so any failure message is visible in the thrown exception.
    # Auth: gh reads GH_TOKEN from the process environment (set once in the main
    # body after the bot token is minted); no token plumbing through call sites.
    function Invoke-Gh {

Nit: we're assuming gh is on the image, but nothing here probes for it — if the 1ES image ever ships without it, every call throws, the catch swallows it, and we silently stop filing issues. The release pipeline pin-installs+SHA-verifies gh 2.92 for the same reason. At minimum a gh --version check up front (with a louder warning if missing) would catch the silent-dead-feature case.
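A minimal sketch of the up-front probe suggested here, under the assumption that a warning plus a boolean result is enough for the caller to decide what to do. The function name `Test-GhAvailable` is hypothetical and does not appear in the PR.

```powershell
# Hypothetical probe; the name and warning text are assumptions.
function Test-GhAvailable {
    # With -ErrorAction SilentlyContinue, Get-Command yields nothing when gh is not on PATH.
    $gh = Get-Command -Name gh -CommandType Application -ErrorAction SilentlyContinue
    if (-not $gh) {
        Write-Host '##vso[task.logissue type=warning]gh CLI not found on PATH; ci-broken issues will not be filed.'
        return $false
    }
    # Log the first line of the version output so image changes show up in the build log.
    $version = & gh --version 2>&1 | Select-Object -First 1
    Write-Host "gh detected: $version"
    return $true
}

if (-not (Test-GhAvailable)) {
    Write-Host 'Skipping notification because gh is unavailable.'
}
```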

    dependsOn:
    - build_sign_native
    - build
    - prepare_installers

One thing to think through: if any stage upstream of these three fails (or anything that gates them), neither notify stage runs at all and no issue gets filed. The whole point of this feature is that broken builds otherwise sit unnoticed, so it is worth confirming nothing earlier in the pipeline can break in a way that bypasses this.

    )

    $all = @($json | ConvertFrom-Json)
    $matched = @($all | Where-Object { $_.body -and $_.body.Contains($Marker) })

Substring match on the marker means if anyone ever pastes this marker text into an unrelated issue (e.g. a meta-tracking issue), success-mode will happily comment and close it. An anchored match (start-of-line or start-of-body) would be a bit safer.
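One possible shape for an anchored match, as a sketch only: it requires the marker to occupy a whole line by itself, which is stricter than a substring match but still looser than a start-of-body check. The function name `Test-MarkerAnchored` is hypothetical, and whether the real issue body places the marker on its own line is an assumption.

```powershell
# Hypothetical helper; assumes the marker sits alone on a line in managed issues.
function Test-MarkerAnchored {
    param(
        [string]$Body,
        [Parameter(Mandatory)][string]$Marker
    )
    if ([string]::IsNullOrEmpty($Body)) { return $false }
    # (?m) makes ^ and $ match at line boundaries; \r? tolerates CRLF bodies.
    $pattern = '(?m)^' + [regex]::Escape($Marker) + '\r?$'
    return [regex]::IsMatch($Body, $pattern)
}

$marker = '<!-- aspire-internal-build-broken:main -->'
# Marker on its own line: treated as a managed issue.
Test-MarkerAnchored -Body "$marker`nInternal build broken on main" -Marker $marker
# Marker quoted mid-sentence in an unrelated issue: not treated as a match.
Test-MarkerAnchored -Body "Tracking issues that contain $marker in the body" -Marker $marker
```

This narrows accidental matches but does not eliminate them, since someone could still paste the marker on a line of its own.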

    # the next failure index.
    # - When the visible-row count exceeds FailuresTableMaxRows, drops oldest
    # rows and rolls them into the omitted tally.
    function Update-FailuresTableInBody {

Honest question: is the in-body failures table pulling its weight? Every failure already posts a comment with the same data, so the comments are effectively the per-failure history. Dropping the table would delete this whole function plus the row-cap/omitted-counter logic plus the documented "two near-simultaneous failures may drop a row" race. That is a sizeable simplification for what feels like a small UX gain.

    $recheck = Get-OpenBrokenIssuesForBranch -Marker $Marker
    $oldest = $recheck | Where-Object { $_.number -ne $createdNumber } | Select-Object -First 1
    if ($null -ne $oldest -and $oldest.number -lt $createdNumber) {
    Write-Step "Race detected: older open issue #$($oldest.number) found. Closing our just-created #${createdNumber} as duplicate."

Our builds are rolling, so this race should be vanishingly rare in practice. Worst case if we drop the handler: occasional duplicate issues that a human closes manually. Probably not worth the extra gh issue list round-trip on every first-failure path.

    # Defense in depth: pipeline gate is the primary filter. Pipeline
    # trigger's `main*` wildcard means we must match `main` exactly.
    function Test-NotifiableBranch {

There are now three independent places that define "notifiable branch": `_IsNotificationBranch` in common-variables.yml, the two stage `condition:` blocks, and this function. They can silently drift apart if someone ever updates the policy. A cross-reference comment pointing at the variable would help future-you.

    ASPIRE_BOT_APP_ID: $(aspire-bot-app-id)
    ASPIRE_BOT_PRIVATE_KEY: $(aspire-bot-private-key)

    - stage: notify_success

This stage is basically a copy-paste of notify_failure with a different mode and condition. Worth extracting into a small stage template under eng/pipelines/templates/ so the two can't silently drift apart?

    # instead of engaging the switch, leaving DryRun=False and
    # producing a live token mint against '-DryRun/aspire'.
    - ${{ if eq(parameters.notifyOnFailureDryRun, true) }}:
    - name: NotifyDryRunFlag

Could this whole NotifyDryRunFlag variable indirection just be replaced by interpolating the parameter directly in the pwsh script (e.g. [bool]::Parse('${{ parameters.notifyOnFailureDryRun }}'))? That sidesteps the splatting gotcha you call out below without needing the per-stage variable.
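A sketch of what that direct interpolation could look like once the template expression has been expanded. The literal `'True'` stands in for the text Azure Pipelines would substitute for the parameter, and `Invoke-NotifierSketch` is a hypothetical stand-in for the notifier script's parameter surface, not the real script.

```powershell
# Stand-in for the text substituted for '${{ parameters.notifyOnFailureDryRun }}'
# (a boolean parameter renders as True or False).
$renderedParameter = 'True'

# Parse to a real [bool]; [bool]::Parse ignores case and throws on any other text.
$dryRun = [bool]::Parse($renderedParameter)

# Hypothetical stand-in for the notifier script's parameters.
function Invoke-NotifierSketch {
    param([string]$Mode, [switch]$DryRun)
    "Mode=$Mode DryRun=$($DryRun.IsPresent)"
}

# Binding the switch with ':' passes a boolean rather than a positional string.
Invoke-NotifierSketch -Mode Failure -DryRun:$dryRun
```

Passing the switch as `-DryRun:$dryRun` avoids the case where a `-DryRun` string ends up bound positionally to another parameter.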
I'll address the feedback, and also update this to account for the recent azdo pipeline changes.
Internal AzDO build failures on `main` and `release/*` had no automated visibility on the public tracker: a broken build could sit unnoticed until someone happened to look at the AzDO build list.

What it does

On every non-PR build of `main` or `release/*`:
- On failure: files or updates a `ci-broken` issue on `microsoft/aspire`, titled `Internal build broken on <branch>`, assigned to @joperezr + @radical. The body carries a managed failures table that grows a row per subsequent failure on the same branch (build link, commit, failed stages). A follow-up comment fires on each failure so @-mentions notify.
- On success: closes the open `ci-broken` issue for that branch with a green-build comment.

One open issue per branch, deduplicated by a hidden HTML-comment marker in the body. Full contract (labels, marker syntax, dedupe behavior, dry-run, manual cleanup) in `docs/ci/internal-build-failure-notifications.md`.

How

`Notify-GitHubOnBuildResult.ps1` invoked from two new pipeline stages (`notify_failure`, `notify_success`), authenticated via an `aspire-repo-bot` GitHub App installation token, driving the `gh` CLI. Always exits 0: a flaky notification path must never red an otherwise-correct build.

A `notifyOnFailureDryRun` queue-time parameter logs the would-be `gh` calls without mutating GitHub.

Validated on AzDO

- …`-FailedStages` on Linux, exits 0 cleanly.
- …`release/aspire-internal-notify-validation` marker, then closed it in the same build via `gh issue close`.