[release/13.4] Fix aspire stop falsely reporting failure on Unix #17623
The stop path treated the managing CLI PID as a success condition. On Unix the CLI process can remain observable (for example as an unreaped or briefly lingering process) after the AppHost has already stopped, so aspire stop reported '❌ Failed to stop apphost.cs' even when shutdown completed successfully.

Use the AppHost PID as the success condition, and keep the CLI PID as a shutdown handle that still gets force-killed when present so we never leave a true zombie CLI running after the AppHost is gone.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
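Why an observable PID is a poor success condition on Unix can be shown with a small Python sketch. It illustrates general POSIX behavior and is not the Aspire CLI's C# code; the helper names `pid_is_observable` and `observe_zombie` are hypothetical. A child that has exited but has not been reaped by its parent still answers an existence check, so a PID-based liveness test reports it as present.

```python
import os
import time


def pid_is_observable(pid: int) -> bool:
    """Return True if the kernel still has a process-table entry for pid."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence and permissions
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def observe_zombie() -> tuple:
    """Fork a child that exits at once; check its PID before and after reaping."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately
    time.sleep(0.2)  # let the child exit; it stays in the table as a zombie
    before_reap = pid_is_observable(pid)
    os.waitpid(pid, 0)  # parent reaps the child, releasing the table entry
    after_reap = pid_is_observable(pid)
    return before_reap, after_reap


if __name__ == "__main__":
    before, after = observe_zombie()
    print(f"observable before reap: {before}, after reap: {after}")
```

The exited child stays observable until its parent calls `waitpid`, which is why a stop check keyed on a lingering PID can report failure even though the work has already finished.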
The cascade path (signal launcher CLI, let it terminate 'dotnet run', and rely on that to terminate the AppHost) is racy on Unix: if the descendant walk inside 'dotnet run' misses the AppHost, the AppHost is orphaned to PID 1 and HasExited polling can't distinguish a zombie from a live process, so 'aspire stop' falsely reports failure after the full SIGTERM + SIGKILL timeout (~40s).

On Unix we now:
* Send SIGTERM directly to the AppHost PID so it shuts down through its own IHostApplicationLifetime path. The launcher CLI and dotnet run process exit naturally when their child exits.
* Pass killEntireProcessTree:true when force-terminating on Unix. DCP is launched in a separate session there, so force-terminating the AppHost tree doesn't take DCP with it; DCP detects the parent gone and tears down its own children gracefully.

Windows behavior is preserved: we keep cascading through the launcher CLI to DCP because DCP is an in-tree descendant of the AppHost on Windows, and a full tree termination would break DCP's orderly resource cleanup.

Validation in the repo's Linux container (Ubuntu 24.04 ARM64):
* Baseline: 5/25 failures (20%), failures hit the 40s timeout
* CLI-visibility-only fix: 1/25 failures (4%)
* This change: 0/30 failures, stop wall-time 0.7-2.3s

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
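The Unix stop sequence described above (SIGTERM sent straight to the target process, then a forced kill if the grace period runs out) can be sketched in Python. This only illustrates the pattern and is not the Aspire CLI's C# implementation: `stop_process`, `wait_for_exit`, and the timeout values are hypothetical. The sketch also assumes the target is a direct child of the caller, so `poll()` can reap it, and it does not attempt the process-tree termination the change uses on Unix.

```python
import signal
import subprocess
import sys
import time


def wait_for_exit(proc: subprocess.Popen, timeout: float) -> bool:
    """Poll until proc has exited or the timeout elapses.

    Popen.poll() reaps the child, so an exited-but-unreaped process
    is never mistaken for a live one.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return True
        time.sleep(0.05)
    return proc.poll() is not None


def stop_process(proc: subprocess.Popen,
                 grace_seconds: float = 5.0,
                 kill_seconds: float = 5.0) -> str:
    """Signal the target directly, escalating to SIGKILL on timeout."""
    proc.send_signal(signal.SIGTERM)  # ask the target itself to shut down
    if wait_for_exit(proc, grace_seconds):
        return "graceful"
    proc.kill()  # SIGKILL on Unix; cannot be caught or ignored
    if wait_for_exit(proc, kill_seconds):
        return "forced"
    return "failed"


if __name__ == "__main__":
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(stop_process(child))
```

Signaling the target directly removes the dependency on an intermediate process finding and terminating it, which is the race the commit message describes.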
🚀 Dogfood this PR with:
curl -fsSL https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 17623
Or
iex "& { $(irm https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 17623"
❓ CLI E2E Tests unknown: 107 passed, 0 failed, 2 unknown
📹 Recordings uploaded automatically from CI run #26600615811
✅ No documentation update needed. docs_optional → No triggered signals (signal_count = 0). This is a Unix race-condition fix for
Backport of #17612 to release/13.4
/cc @danegsta
Customer Impact
Fixes a Unix/Linux aspire stop race condition that could cause us to fail to report a successful stop.
Testing
Validated via smoke tests in a Linux container.
Risk
Low. Our process cleanup strategy doesn't change much; we're just being more careful about which PIDs we watch and signal during shutdown.
Regression?
Yes, in 13.4