Fix aspire stop falsely reporting failure on Unix#17612
Conversation
The stop path treated the managing CLI PID as a success condition. On Unix the CLI process can remain observable (for example as an unreaped or briefly lingering process) after the AppHost has already stopped, so aspire stop reported '\u274c Failed to stop apphost.cs' even when shutdown completed successfully. Use the AppHost PID as the success condition, and keep the CLI PID as a shutdown handle that still gets force-killed when present so we never leave a true zombie CLI running after the AppHost is gone. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
🚀 Dogfood this PR with:
curl -fsSL https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 17612Or
iex "& { $(irm https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 17612" |
The cascade path (signal launcher CLI, let it terminate 'dotnet run', and rely on that to terminate the AppHost) is racy on Unix: if the descendant walk inside 'dotnet run' misses the AppHost, the AppHost is orphaned to PID 1 and HasExited polling can't distinguish a zombie from a live process, so 'aspire stop' falsely reports failure after the full SIGTERM + SIGKILL timeout (~40s). On Unix we now: * Send SIGTERM directly to the AppHost PID so it shuts down through its own IHostApplicationLifetime path. The launcher CLI and dotnet run process exit naturally when their child exits. * Pass killEntireProcessTree:true when force-terminating on Unix. DCP is launched in a separate session there, so force-terminating the AppHost tree doesn't take DCP with it; DCP detects the parent gone and tears down its own children gracefully. Windows behavior is preserved: we keep cascading through the launcher CLI to DCP because DCP is an in-tree descendant of the AppHost on Windows, and a full tree termination would break DCP's orderly resource cleanup. Validation in the repo's Linux container (Ubuntu 24.04 ARM64): * Baseline: 5/25 failures (20%), failures hit the 40s timeout * CLI-visibility-only fix: 1/25 failures (4%) * This change: 0/30 failures, stop wall-time 0.7-2.3s Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
aspire stop falsely reporting failure on Unix
|
/backport to release/13.4 |
|
Started backporting to |
There was a problem hiding this comment.
Pull request overview
This PR fixes intermittent Unix failures where aspire stop could time out and report Failed to stop apphost.cs due to a racy shutdown cascade through the launcher CLI / dotnet run, which could leave the AppHost unreaped on Unix.
Changes:
- Split shutdown behavior by OS: on Unix, request graceful shutdown by signaling the AppHost PID directly (avoiding the CLI→
dotnet runcascade). - Add an opt-in
killEntireProcessTreeparameter toProcessSignaler.ForceKilland use it on Unix when force-killing remaining processes. - Add a Unix-specific test ensuring the launcher CLI process is treated as a shutdown handle (cleaned up, but not required for success).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| tests/Aspire.Cli.Tests/Processes/ProcessShutdownServiceTests.cs | Adds a Unix regression test and injects a TimeProvider for deterministic timing in the new test. |
| src/Shared/ProcessSignaler.cs | Extends ForceKill with optional whole-process-tree killing and improves debug logging for kill behavior. |
| src/Aspire.Cli/Processes/ProcessShutdownService.cs | Separates “success monitoring” (AppHost) from “cleanup force-kill” (AppHost + CLI handle), and routes Unix graceful shutdown directly to the AppHost PID. |
|
❓ CLI E2E Tests unknown — 107 passed, 0 failed, 2 unknown (commit View all recordings
📹 Recordings uploaded automatically from CI run #26599142125 |
|
✅ No documentation update needed. docs_optional → bug_fix_restores_documented_behavior No triggered signals (signal_count = 0). This PR fixes an internal Unix race condition where |
Why
SmokeTests.LatestCliCanStartStableChannelAppHosthas been failing intermittently onmainandrelease/13.4PRs withFailed to stop apphost.cs. Bimodal timing in the failures (success ~1s, failure ~40s) pointed at the AppHost being orphaned and unreaped on Unix during the shutdown cascade.Root cause
aspire stopwas cascading through the launcher CLI on every platform: it would SIGTERM the launcher CLI (aspire run), which would cancel itsCancellationTokenSource, which would causeProcessExecutionto call.Kill(entireProcessTree: true)on the spawneddotnet run, which would in turn try to terminate the AppHost via its own descendant walk.On Unix this cascade is racy:
dotnet runmisses the AppHost (timing, reparenting, etc.), the AppHost gets orphaned to PID 1.HasExitedfrom outside cannot distinguish from a live process.aspire stopthen waits the full SIGTERM (10s) + SIGKILL (10s) window before failing the operation — matching the ~40s failure timing observed in CI.Approach
Split the shutdown path by OS:
IHostApplicationLifetimepath, and the launcher CLI anddotnet runexit naturally when their child exits. Every process is reaped by its actual parent.killEntireProcessTree: truetoProcessSignaler.ForceKill. DCP is launched in a separate session/process group on Unix, so tree-terminating the AppHost does not take DCP with it; DCP detects the parent gone and runs its own orderly child shutdown.The earlier CLI-visibility fix (treating an unresponsive launcher CLI as success rather than blocking on it) is kept and is now belt-and-suspenders.
Validation
Reproduced and measured in the repo-local Linux container (Ubuntu 24.04 ARM64):
Full local unit-test suite (
Aspire.Cli.Tests): 3799 passed, 0 failed, 21 skipped.Notes for reviewers
ProcessSignaler.ForceKillgains akillEntireProcessTreeparameter, defaultfalse. All existing callers keep the old behavior; onlyProcessShutdownService.ForceKillRemainingProcessesopts in, and only on Unix.OperatingSystem.IsWindows()./proc/<pid>/statstate field) was considered but skipped — with the cleaner Unix path, AppHosts are reaped by their actual parent and zombies should not be produced in the normal flow.