Fix aspire stop falsely reporting failure on Unix#17612

Merged
danegsta merged 2 commits into main from danegsta/apphost-stop-failure
May 28, 2026
Conversation

@danegsta
Member

@danegsta danegsta commented May 28, 2026

Why

SmokeTests.LatestCliCanStartStableChannelAppHost has been failing intermittently on main and release/13.4 PRs with Failed to stop apphost.cs. Bimodal timing in the failures (success ~1s, failure ~40s) pointed at the AppHost being orphaned and unreaped on Unix during the shutdown cascade.

Root cause

aspire stop was cascading through the launcher CLI on every platform: it would SIGTERM the launcher CLI (aspire run), which would cancel its CancellationTokenSource, which would cause ProcessExecution to call .Kill(entireProcessTree: true) on the spawned dotnet run, which would in turn try to terminate the AppHost via its own descendant walk.

On Unix this cascade is racy:

  1. If the descendant walk inside dotnet run misses the AppHost (timing, reparenting, etc.), the AppHost gets orphaned to PID 1.
  2. With no proper parent left to reap it, the AppHost can become a zombie that polling for HasExited from outside cannot distinguish from a live process.
  3. aspire stop then waits the full SIGTERM (10s) + SIGKILL (10s) window before failing the operation — matching the ~40s failure timing observed in CI.
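
The zombie problem in step 2 can be reproduced outside of Aspire. The following is a minimal Python sketch, not the CLI's actual code: `pid_looks_alive` and `zombie_probe_demo` are hypothetical names, and the signal-0 probe stands in for the external `HasExited` polling described above. It assumes a Linux host, where `os.waitid` with `WNOWAIT` lets the parent observe the child's exit without reaping it.

```python
import os
import subprocess
import sys


def pid_looks_alive(pid: int) -> bool:
    """Signal-0 probe: True if the PID exists, even if it is only a zombie."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The PID exists but belongs to another user.
        return True
    return True


def zombie_probe_demo() -> tuple[bool, bool]:
    """Return (looks alive while a zombie, looks alive after being reaped)."""
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    # Wait for the child to exit WITHOUT reaping it: it is now a zombie.
    os.waitid(os.P_PID, child.pid, os.WEXITED | os.WNOWAIT)
    while_zombie = pid_looks_alive(child.pid)
    # Reap the child, as its real parent is supposed to do.
    child.wait()
    after_reap = pid_looks_alive(child.pid)
    return while_zombie, after_reap


if __name__ == "__main__":
    print(zombie_probe_demo())
```

The probe keeps reporting the exited child as present until its parent reaps it, which is why an orphaned, unreaped AppHost can look alive to an outside observer.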

Approach

Split the shutdown path by OS:

  • Unix: Send SIGTERM directly to the AppHost PID. The AppHost shuts down through its normal IHostApplicationLifetime path, and the launcher CLI and dotnet run exit naturally when their child exits. Every process is reaped by its actual parent.
  • Unix force-stop: Pass killEntireProcessTree: true to ProcessSignaler.ForceKill. DCP is launched in a separate session/process group on Unix, so tree-terminating the AppHost does not take DCP with it; DCP detects the parent gone and runs its own orderly child shutdown.
  • Windows: Unchanged. The existing CLI → DCP cascade is preserved because DCP is an in-tree descendant of the AppHost on Windows, and a tree-wide termination would break DCP's orderly resource cleanup.
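
As a rough illustration of the Unix path (graceful SIGTERM to the target itself, then a forced kill only if the grace period expires), here is a minimal Python sketch. It is an analogy under stated assumptions, not the `ProcessShutdownService` implementation: `stop_directly` and `grace_seconds` are hypothetical names, the 10-second default simply mirrors the SIGTERM window mentioned above, and the sketch signals a child it started itself, whereas `aspire stop` signals an AppHost PID it did not spawn.

```python
import signal
import subprocess
import sys


def stop_directly(proc: subprocess.Popen, grace_seconds: float = 10.0) -> str:
    """Send SIGTERM to the process itself; escalate to SIGKILL on timeout.

    Returns "graceful" if SIGTERM was enough, "forced" if SIGKILL was needed.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_seconds)
        return "graceful"
    except subprocess.TimeoutExpired:
        proc.kill()   # SIGKILL cannot be caught or ignored
        proc.wait()   # reap the process so no zombie is left behind
        return "forced"


if __name__ == "__main__":
    sleeper = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(stop_directly(sleeper))
```

Because the signal goes to the target directly, no intermediate process has to find and terminate it, which is the race the cascade suffered from.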

The earlier CLI-visibility fix (treating an unresponsive launcher CLI as success rather than blocking on it) is kept and is now belt-and-suspenders.

Validation

Reproduced and measured in the repo-local Linux container (Ubuntu 24.04 ARM64):

| Build | Stops | Failures | Failure rate | Failure duration |
| --- | --- | --- | --- | --- |
| Baseline (no fix) | 25 | 5 | 20% | ~40s |
| CLI-visibility-only fix | 25 | 1 | 4% | ~40s |
| This change | 30 | 0 | 0% | n/a (all stops 0.7-2.3s) |

Full local unit-test suite (Aspire.Cli.Tests): 3799 passed, 0 failed, 21 skipped.

Notes for reviewers

  • ProcessSignaler.ForceKill gains a killEntireProcessTree parameter, default false. All existing callers keep the old behavior; only ProcessShutdownService.ForceKillRemainingProcesses opts in, and only on Unix.
  • Windows behavior is intentionally preserved verbatim. The new Unix branch runs only when OperatingSystem.IsWindows() returns false.
  • Zombie-aware liveness detection (reading /proc/<pid>/stat state field) was considered but skipped — with the cleaner Unix path, AppHosts are reaped by their actual parent and zombies should not be produced in the normal flow.
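
For reference, the zombie-aware check that was considered and skipped could look roughly like the Python sketch below. This is an illustration only and is not part of this PR: `parse_proc_stat_state` and `is_zombie` are hypothetical helpers, and the sketch assumes a Linux `/proc` filesystem. The one subtlety is that the command name in `/proc/<pid>/stat` is wrapped in parentheses and may itself contain spaces or parentheses, so the state field is located after the last `)`.

```python
def parse_proc_stat_state(stat_line: str) -> str:
    """Extract the single-letter state field from a /proc/<pid>/stat line."""
    # Field 2 (comm) is parenthesised and may contain spaces or ')', so split
    # on the LAST closing parenthesis; the state is the next field after it.
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    return after_comm.split()[0]


def is_zombie(pid: int) -> bool:
    """True if the PID exists and is in state 'Z'; False if gone or not a zombie."""
    try:
        with open(f"/proc/{pid}/stat", "r") as stat_file:
            return parse_proc_stat_state(stat_file.read()) == "Z"
    except (FileNotFoundError, ProcessLookupError):
        return False


if __name__ == "__main__":
    import os
    print(is_zombie(os.getpid()))
```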

The stop path treated the managing CLI PID as a success condition. On
Unix the CLI process can remain observable (for example as an unreaped
or briefly lingering process) after the AppHost has already stopped, so
aspire stop reported '❌ Failed to stop apphost.cs' even when shutdown
completed successfully.

Use the AppHost PID as the success condition, and keep the CLI PID as a
shutdown handle that still gets force-killed when present so we never
leave a true zombie CLI running after the AppHost is gone.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
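
The split this commit message describes (the AppHost PID decides success, the CLI PID is only a cleanup handle) can be sketched as pure logic. The Python below is a hypothetical illustration, not the code in ProcessShutdownService: `finish_stop`, `is_alive`, and `force_kill` are invented names, and the liveness and kill operations are injected so that the decision rule can be exercised without real processes.

```python
from typing import Callable, Optional


def finish_stop(
    apphost_pid: int,
    cli_pid: Optional[int],
    is_alive: Callable[[int], bool],
    force_kill: Callable[[int], None],
) -> bool:
    """Force-kill whatever is still observable, then judge success by the AppHost only."""
    # Both PIDs are cleanup handles: anything still observable gets force-killed.
    for pid in (apphost_pid, cli_pid):
        if pid is not None and is_alive(pid):
            force_kill(pid)
    # Only the AppHost decides the outcome; a lingering CLI does not fail the stop.
    return not is_alive(apphost_pid)


if __name__ == "__main__":
    still_visible = {200}  # CLI (200) lingers, AppHost (100) is already gone
    killed = []
    print(finish_stop(100, 200, lambda pid: pid in still_visible, killed.append), killed)
```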
@github-actions
Contributor

github-actions Bot commented May 28, 2026

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 17612

Or

  • Run remotely in PowerShell:
iex "& { $(irm https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 17612"

The cascade path (signal launcher CLI, let it terminate 'dotnet run',
and rely on that to terminate the AppHost) is racy on Unix: if the
descendant walk inside 'dotnet run' misses the AppHost, the AppHost
is orphaned to PID 1 and HasExited polling can't distinguish a zombie
from a live process, so 'aspire stop' falsely reports failure after
the full SIGTERM + SIGKILL timeout (~40s).

On Unix we now:
* Send SIGTERM directly to the AppHost PID so it shuts down through
  its own IHostApplicationLifetime path. The launcher CLI and dotnet
  run process exit naturally when their child exits.
* Pass killEntireProcessTree:true when force-terminating on Unix. DCP
  is launched in a separate session there, so force-terminating the
  AppHost tree doesn't take DCP with it; DCP detects the parent gone
  and tears down its own children gracefully.

Windows behavior is preserved: we keep cascading through the launcher
CLI to DCP because DCP is an in-tree descendant of the AppHost on
Windows, and a full tree termination would break DCP's orderly
resource cleanup.
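
The separate-session argument relies on a general Unix property: a signal aimed at one process group does not reach a process that lives in a different session. The Python sketch below illustrates only that property, under stated assumptions. It does not reproduce how ProcessSignaler walks the process tree or how DCP is actually launched, and the names `apphost_like`, `dcp_like`, and `kill_group_spares_other_session` are hypothetical.

```python
import os
import signal
import subprocess
import sys

SLEEPER = [sys.executable, "-c", "import time; time.sleep(60)"]


def kill_group_spares_other_session() -> tuple[bool, bool]:
    """Return (first group was killed, process in the other session survived)."""
    # start_new_session=True calls setsid() in the child, so each child leads
    # its own session and its own process group.
    apphost_like = subprocess.Popen(SLEEPER, start_new_session=True)
    dcp_like = subprocess.Popen(SLEEPER, start_new_session=True)
    try:
        # Signal every process in the first child's group, and only that group.
        os.killpg(os.getpgid(apphost_like.pid), signal.SIGKILL)
        apphost_like.wait(timeout=10)
        apphost_gone = apphost_like.returncode == -signal.SIGKILL
        dcp_still_running = dcp_like.poll() is None
        return apphost_gone, dcp_still_running
    finally:
        # Clean up the survivor so the demo leaves nothing behind.
        dcp_like.kill()
        dcp_like.wait()


if __name__ == "__main__":
    print(kill_group_spares_other_session())
```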

Validation in the repo's Linux container (Ubuntu 24.04 ARM64):
* Baseline: 5/25 failures (20%), failures hit the 40s timeout
* CLI-visibility-only fix: 1/25 failures (4%)
* This change: 0/30 failures, stop wall-time 0.7-2.3s

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@danegsta danegsta changed the title Fix aspire stop falsely reporting failure on Unix Fix aspire stop falsely reporting failure on Unix May 28, 2026
@danegsta danegsta marked this pull request as ready for review May 28, 2026 20:33
@danegsta danegsta requested a review from mitchdenny as a code owner May 28, 2026 20:33
Copilot AI review requested due to automatic review settings May 28, 2026 20:33
@danegsta
Member Author

/backport to release/13.4

@github-actions
Contributor

Started backporting to release/13.4 (link to workflow run)

Contributor

Copilot AI left a comment

Pull request overview

This PR fixes intermittent Unix failures where aspire stop could time out and report Failed to stop apphost.cs due to a racy shutdown cascade through the launcher CLI / dotnet run, which could leave the AppHost unreaped on Unix.

Changes:

  • Split shutdown behavior by OS: on Unix, request graceful shutdown by signaling the AppHost PID directly (avoiding the CLI→dotnet run cascade).
  • Add an opt-in killEntireProcessTree parameter to ProcessSignaler.ForceKill and use it on Unix when force-killing remaining processes.
  • Add a Unix-specific test ensuring the launcher CLI process is treated as a shutdown handle (cleaned up, but not required for success).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| tests/Aspire.Cli.Tests/Processes/ProcessShutdownServiceTests.cs | Adds a Unix regression test and injects a TimeProvider for deterministic timing in the new test. |
| src/Shared/ProcessSignaler.cs | Extends ForceKill with optional whole-process-tree killing and improves debug logging for kill behavior. |
| src/Aspire.Cli/Processes/ProcessShutdownService.cs | Separates “success monitoring” (AppHost) from “cleanup force-kill” (AppHost + CLI handle), and routes Unix graceful shutdown directly to the AppHost PID. |
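
The idea behind injecting a time source for deterministic timing can be shown with a small Python sketch. It is a generic illustration of the testing technique and makes no claim about how the new .NET test uses TimeProvider: `wait_until_gone`, `FakeClock`, and the 0.25-second poll interval are hypothetical.

```python
from typing import Callable


def wait_until_gone(
    is_alive: Callable[[], bool],
    timeout_seconds: float,
    now: Callable[[], float],
    sleep: Callable[[float], None],
    poll_interval: float = 0.25,
) -> bool:
    """Poll until the process is gone (True) or the timeout elapses (False)."""
    deadline = now() + timeout_seconds
    while is_alive():
        if now() >= deadline:
            return False
        sleep(poll_interval)
    return True


class FakeClock:
    """Manually advanced clock: sleeping moves time forward instantly."""

    def __init__(self) -> None:
        self.t = 0.0

    def now(self) -> float:
        return self.t

    def sleep(self, seconds: float) -> None:
        self.t += seconds


if __name__ == "__main__":
    clock = FakeClock()
    # A process that never exits: the full 10 s window "elapses" immediately.
    print(wait_until_gone(lambda: True, 10.0, clock.now, clock.sleep), clock.t)
```

With the clock injected, a test can cover the full timeout path without actually waiting for it.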

@github-actions
Contributor

CLI E2E Tests unknown — 107 passed, 0 failed, 2 unknown (commit dcba2c4)

View all recordings
Status Test Recording Job Artifacts
AddPackageInteractiveWhileAppHostRunningDetached Recording #78380747411 Logs
AddPackageWhileAppHostRunningDetached Recording #78380747411 Logs
AgentCommands_AllHelpOutputs_AreCorrect Recording #78380747495 Logs
AgentInitCommand_DefaultSelection_InstallsDefaultSkills Recording #78380747495 Logs
AgentInitCommand_MigratesDeprecatedConfig Recording #78380747495 Logs
AgentMcpListStructuredLogsReturnsLogsFromStarterApp Recording #78380747819 Logs
AgentMcpListStructuredLogsReturnsLogsFromStarterApp_DevLocalhost Recording #78380747819 Logs
AgentMcpListStructuredLogsReturnsLogsFromStarterApp_Isolated Recording #78380747819 Logs
AllPublishMethodsBuildDockerImages Recording #78380748177 Logs
AspireAddAndStartWorkAgainstLegacyAppHostTs Recording #78380747978 Logs
AspireAddPackageVersionToDirectoryPackagesProps Recording #78380747228 Logs
AspireInitSingleFileAppHostRunsViaDotnetRunAppHost Recording #78380747500 Logs
AspireInitWithExistingAppHostDirRecreatesMissingNuGetConfigAndPreservesFiles Recording #78380747892 Logs
AspireInitWithSolutionFileGeneratesAppHostThatBuildsAgainstChannelHive Recording #78380747892 Logs
AspireStartUpdatesStaleTypeScriptAppHostPath Recording #78380747121 Logs
AspireUpdateRemovesAppHostPackageVersionFromDirectoryPackagesProps Recording #78380747228 Logs
AspireUpdateRemovesOrphanAppHostPackageVersionWhenSdkAlreadyCurrent Recording #78380747228 Logs
Banner_DisplayedOnFirstRun Recording #78380747995 Logs
Banner_DisplayedWithExplicitFlag Recording #78380747995 Logs
Banner_NotDisplayedWithNoLogoFlag Recording #78380747995 Logs
CertificatesClean_RemovesCertificates Recording #78380747431 Logs
CertificatesTrust_WithNoCert_CreatesAndTrustsCertificate Recording #78380747431 Logs
CertificatesTrust_WithUntrustedCert_TrustsCertificate Recording #78380747431 Logs
ConfigSetGet_CreatesNestedJsonFormat Recording #78380747804 Logs
CreateAndRunAspireStarterProject Recording #78380747093 Logs
CreateAndRunAspireStarterProjectWithBundle Recording #78380747321 Logs
CreateAndRunEmptyAppHostProject Recording #78380748171 Logs
CreateAndRunJavaEmptyAppHostProject Recording #78380747763 Logs
CreateAndRunJsReactProject Recording #78380747246 Logs
CreateAndRunPythonReactProject Recording #78380747380 Logs
CreateAndRunTypeScriptEmptyAppHostProject Recording #78380747285 Logs
CreateAndRunTypeScriptStarterProject Recording #78380748036 Logs
CreateJavaAppHostWithViteApp Recording #78380748042 Logs
CreateTypeScriptAppHostWithViteApp_AllowsGuestAppPackageManagerToDiffer Recording #78380747671 Logs
CreateTypeScriptAppHostWithViteApp_UsesConfiguredToolchain Recording #78380747671 Logs
DashboardRunWithAgentMcpListTracesReturnsNoTraces Recording #78380747280 Logs
DashboardRunWithAgentMcpListTracesReturnsNoTraces_DevLocalhost Recording #78380747280 Logs
DashboardRunWithOtelTracesReturnsNoTraces Recording #78380747280 Logs
DashboardRunWithOtelTracesReturnsNoTraces_DevLocalhost Recording #78380747280 Logs
DeployK8sBasicApiService Recording #78380747260 Logs
DeployK8sWithExternalHelmChart Recording #78380747918 Logs
DeployK8sWithGarnet Recording #78380747534 Logs
DeployK8sWithMongoDB Recording #78380748058 Logs
DeployK8sWithMySql Recording #78380747377 Logs
DeployK8sWithPostgres Recording #78380748197 Logs
DeployK8sWithRabbitMQ Recording #78380748272 Logs
DeployK8sWithRedis Recording #78380747830 Logs
DeployK8sWithSqlServer Recording #78380747701 Logs
DeployK8sWithValkey Recording #78380747831 Logs
DeployTypeScriptAppToKubernetes Recording #78380748077 Logs
DescribeCommandResolvesReplicaNames Recording #78380748167 Logs
DescribeCommandShowsRunningResources Recording #78380748167 Logs
DetachFormatJsonProducesValidJson Recording #78380746866 Logs
DetachFormatJsonProducesValidJsonWhenRestartingExistingInstance Recording #78380746866 Logs
DoPublishAndDeployListStepsWork Recording #78380748024 Logs
DocsCommand_RendersInteractiveMarkdownFromLocalSource Recording #78380747668 Logs
DoctorCommand_DetectsDeprecatedAgentConfig Recording #78380747495 Logs
DoctorCommand_TypeScriptAppHostReportsMissingConfiguredToolchain Recording #78380748148 Logs
DoctorCommand_WithSslCertDir_ShowsTrusted Recording #78380748148 Logs
DoctorCommand_WithoutSslCertDir_ShowsPartiallyTrusted Recording #78380748148 Logs
GatewayWithoutExternalEndpoint_FailsPublishWithGuidance Recording #78380748020 Logs
GeneratedAspireDevScript_StartsWatchMode_WithConfiguredToolchain Recording #78380747671 Logs
GlobalMigration_HandlesCommentsAndTrailingCommas Recording #78380747804 Logs
GlobalMigration_HandlesMalformedLegacyJson Recording #78380747804 Logs
GlobalMigration_PreservesAllValueTypes Recording #78380747804 Logs
GlobalMigration_SkipsWhenNewConfigExists Recording #78380747804 Logs
GlobalSettings_MigratedFromLegacyFormat Recording #78380747804 Logs
IngressWithoutExternalEndpoint_FailsPublishWithGuidance Recording #78380748020 Logs
InitTypeScriptAppHost_AugmentsExistingViteRepoInWorkspaceSubdirectory Recording #78380747671 Logs
InteractiveCSharpInitCreatesExpectedFiles Recording #78380747939 Logs
InvalidAppHostPathWithComments_IsHealedOnRun Recording #78380747301 Logs
JavaScriptHostingApisRunFromTypeScriptAppHost Recording #78380748177 Logs
LatestCliCanStartStableChannelAppHost Recording #78380747093 Logs
LatestCliCanStartStableChannelTypeScriptAppHost Recording #78380747093 Logs
LegacySettingsMigration_AdjustsRelativeAppHostPath Recording #78380747121 Logs
LogsCommandShowsResourceLogs Recording #78380748175 Logs
OtelLogsReturnsStructuredLogsFromStarterApp Recording #78380748053 Logs
OtelLogsReturnsStructuredLogsFromStarterAppIsolated Recording #78380748053 Logs
PsCommandListsRunningAppHost Recording #78380747492 Logs
PsFormatJsonOutputsOnlyJsonToStdout Recording #78380747492 Logs
PublishJavaScriptPatternsGeneratesExpectedDockerComposeArtifacts Recording #78380747929 Logs
PublishWithConfigureEnvFileUpdatesEnvOutput Recording #78380747929 Logs
PublishWithDockerComposeServiceCallbackSucceeds Recording #78380747929 Logs
PublishWithoutOutputPathUsesAppHostDirectoryDefault Recording #78380747929 Logs
ResourceCommand_FailedExecution_DisplaysAppHostLogPathAndLogContainsEntries Recording #78380747213 Logs
ResourceCommand_SetAndDeleteParameterUpdatesDescribeOutput Recording #78380747213 Logs
RestoreGeneratesSdkFiles Recording #78380747200 Logs
RestoreGeneratesSdkFiles_WithConfiguredToolchain Recording #78380747638 Logs
RestoreRefreshesGeneratedSdkAfterAddingIntegration Recording #78380747638 Logs
RestoreSupportsConfigOnlyHelperPackageAndCrossPackageTypes Recording #78380748074 Logs
RunFromParentDirectory_UsesExistingConfigNearAppHost Recording #78380747477 Logs
RunReportsSyntaxErrorsForDotNetAppHost Recording #78380747410 Logs
RunReportsSyntaxErrorsForTypeScriptAppHost Recording #78380747410 Logs
SecretCrudOnDotNetAppHost Recording #78380747959 Logs
SecretCrudOnTypeScriptAppHost Recording #78380747617 Logs
StagingChannel_ConfigureAndVerifySettings_ThenSwitchChannels Recording #78380747574 Logs
StartAndWaitForTypeScriptSqlServerAppHostWithNativeAssets Recording #78380747548 Logs
StartReportsSyntaxErrorsForDotNetAppHost Recording #78380747410 Logs
StartReportsSyntaxErrorsForTypeScriptAppHost Recording #78380747410 Logs
StopAllAppHostsFromAppHostDirectory Recording #78380747770 Logs
StopJavaPolyglotAppHostUsingApphostDirectory Recording #78380747705 Logs
StopNonInteractiveSingleAppHost Recording #78380747770 Logs
StopTypeScriptPolyglotAppHostUsingApphostDirectory Recording #78380747931 Logs
StopWithNoRunningAppHostExitsSuccessfully Recording #78380747411 Logs
UnAwaitedChainsCompileWithAutoResolvePromises Recording #78380747638 Logs
UpdateProjectChannelToStable_CSharpEmptyAppHost_PreservesAspireConfigChannel Recording #78380747521 Logs
UpdateProjectChannelToStable_CSharpSingleFileInit_PreservesAspireConfigChannel Recording #78380747521 Logs
UpdateProjectChannelToStable_TypeScriptSingleFileInit_PreservesAspireConfigChannel Recording #78380747521 Logs
UpdateProjectChannelToStable_TypeScript_PreviewsStablePackagesAndPreservesChannel Recording #78380747521 Logs

📹 Recordings uploaded automatically from CI run #26599142125

@danegsta danegsta merged commit 30fb46a into main May 28, 2026
312 checks passed
@danegsta danegsta deleted the danegsta/apphost-stop-failure branch May 28, 2026 21:55
@microsoft-github-policy-service microsoft-github-policy-service Bot added this to the 13.5 milestone May 28, 2026
@aspire-repo-bot
Contributor

✅ No documentation update needed.

docs_optional → bug_fix_restores_documented_behavior

No triggered signals (signal_count = 0). This PR fixes an internal Unix race condition where aspire stop could falsely report failure (~40s timeout) due to a racy SIGTERM cascade through the launcher CLI / dotnet run that could orphan the AppHost. The fix splits the shutdown path by OS so Unix sends SIGTERM directly to the AppHost PID — restoring the already-documented behavior that aspire stop reliably stops the app host. No new CLI flags, options, APIs, commands, or user-visible configuration were introduced.

3 participants