
Serialize CLI host tests to fix sample_rate race (#17450) #17451

Open
davidfowl wants to merge 3 commits into main from davidfowl/issue-17450-failing-test-aspire-cli-tests-clismoket-f3b60f

davidfowl (Contributor) commented May 24, 2026

Description

Fixes #17450.

Aspire.Cli.Tests.CliSmokeTests.MainReturnsExpectedExitCode(args: [], expectedExitCode: 1) intermittently failed in CI with:

System.InvalidOperationException : "The collection already contains item with same key 'microsoft.sample_rate'"
   at System.Diagnostics.ActivityTagsCollection.Add(String key, Object value)
   at OpenTelemetry.Trace.TracerProviderSdk.ComputeActivitySamplingResult(...)
   at System.Diagnostics.ActivitySource.CreateActivity(...)
   at Aspire.Cli.Telemetry.AspireCliTelemetry.StartReportedActivity(...)

Root cause

Azure.Monitor.OpenTelemetry.Exporter uses RateLimitedSampler by default. Once the sampler's adaptive state has matured (about 200 ms after construction), it returns a SamplingResult containing a microsoft.sample_rate attribute. OpenTelemetry's TracerProviderSdk.ComputeActivitySamplingResult writes those attributes into ActivityCreationOptions.SamplingTags via a hard Add(key, value) rather than a TryAdd. ActivitySource.CreateActivity reuses the same ActivityCreationOptions instance across every registered listener, so when two listeners on the Aspire.Cli.Reported source both run samplers that emit microsoft.sample_rate, the second Add throws.

Aspire CLI production code only ever has one live TelemetryManager (registered as a DI singleton), so the duplicate-tag race is impossible at runtime. xUnit v3 runs test classes in parallel by default, however, which allows two in-process host-building tests to overlap: when one of them calls StartReportedActivity, both listeners fire and the second Add throws.
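
The collision can be illustrated in isolation. The following is a minimal sketch, not the OpenTelemetry SDK's code: a bare ActivityTagsCollection stands in for the shared ActivityCreationOptions.SamplingTags, and the sample-rate values are made up for illustration.

```csharp
using System;
using System.Diagnostics;

// Stand-in for the SamplingTags collection that every listener on the
// same ActivitySource sees during a single CreateActivity call.
var samplingTags = new ActivityTagsCollection();

// The first listener's sampler attributes are written with Add: this succeeds.
samplingTags.Add("microsoft.sample_rate", 100.0);

try
{
    // The second listener's sampler emits the same key, and Add rejects duplicates.
    samplingTags.Add("microsoft.sample_rate", 100.0);
}
catch (InvalidOperationException ex)
{
    Console.WriteLine($"Second Add failed: {ex.Message}");
}

// For contrast, the indexer setter overwrites an existing key instead of throwing.
samplingTags["microsoft.sample_rate"] = 50.0;
Console.WriteLine($"Tags stored: {samplingTags.Count}");
```

With a single listener only the first Add ever runs, which is consistent with the failure appearing only when two providers are alive in the same process.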

Fix

Opt the entire Aspire.Cli.Tests process out of Azure Monitor by default, and re-enable it only for the handful of tests that actually need to assert on Azure Monitor configuration.

  • New tests/Aspire.Cli.Tests/TestTelemetryDefaults.cs uses [ModuleInitializer] to set ASPIRE_CLI_TELEMETRY_OPTOUT=true for the test process before any test method runs. With opt-out enabled, TelemetryManager does not register an Azure Monitor TracerProvider, so no Azure Monitor listener ever attaches to the Aspire.Cli.Reported source from test code.
  • TelemetryConfigurationTests adds a small WithTelemetryOptIn(...) helper that injects "ASPIRE_CLI_TELEMETRY_OPTOUT": "false" into the in-memory configuration. Program.BuildApplicationAsync layers the in-memory collection after the environment variables, so the in-memory value wins. The three tests that exercise Azure Monitor (AzureMonitor_Enabled_ByDefault, OtlpExporter_WithoutProfiling_EnablesOnlyDebugDiagnostics_WhenEndpointProvided, OtlpExporter_WithProfiling_KeepsReportedTelemetryAndProfilingSeparate) opt back in via this helper. Because all Azure Monitor TracerProvider creations are now confined to a single class and xUnit runs methods within a class serially, the duplicate-Add race cannot happen.
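
The precedence rule behind the opt-in helper can be sketched with a plain ConfigurationBuilder. This is an illustration only: it assumes the Microsoft.Extensions.Configuration and Microsoft.Extensions.Configuration.EnvironmentVariables packages are referenced, and it does not reproduce how Program.BuildApplicationAsync or WithTelemetryOptIn(...) are actually written.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Extensions.Configuration;

// Simulate the process-wide default that the test assembly sets.
Environment.SetEnvironmentVariable("ASPIRE_CLI_TELEMETRY_OPTOUT", "true");

var configuration = new ConfigurationBuilder()
    // Registered first, so it has lower precedence.
    .AddEnvironmentVariables()
    // Registered last, so its values win when keys collide.
    .AddInMemoryCollection(new Dictionary<string, string?>
    {
        ["ASPIRE_CLI_TELEMETRY_OPTOUT"] = "false"
    })
    .Build();

// The lookup resolves against the most recently added provider that has the key.
Console.WriteLine(configuration["ASPIRE_CLI_TELEMETRY_OPTOUT"]);
```

Because providers added later override earlier ones, a per-test in-memory entry can re-enable telemetry without touching the process environment.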

No production code changes. No test collections, no parallelization toggles. The opt-out lives entirely outside test class code so it cannot be accidentally bypassed by adding new test classes that build a CLI host in-process.
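
A module initializer of the kind described above might look roughly like this. The file contents are not shown in this conversation, so the class shape and the method name OptOutOfTelemetry are assumptions; only the attribute, the environment variable name, and its value come from the description.

```csharp
using System;
using System.Runtime.CompilerServices;

namespace Aspire.Cli.Tests;

internal static class TestTelemetryDefaults
{
    // The runtime invokes module initializers once, when the assembly's module
    // is loaded, before any other code in that assembly (including tests) runs.
    [ModuleInitializer]
    internal static void OptOutOfTelemetry() // hypothetical method name
    {
        // Process-wide default: hosts built by tests skip Azure Monitor
        // unless a test explicitly overrides this value.
        Environment.SetEnvironmentVariable("ASPIRE_CLI_TELEMETRY_OPTOUT", "true");
    }
}
```

Since the initializer is tied to the assembly rather than to any test class, a newly added test class inherits the opt-out without further annotation.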

Why not a product-side fix

The defect is fundamentally an OpenTelemetry/Azure Monitor interaction: any sampler that writes attributes into SamplingTags will collide when more than one listener is attached to the same ActivitySource. A product-side workaround would either (a) discard the Azure Monitor default sampler and lose rate-limiting + the sample_rate metric, or (b) wrap the sampler chain to dedupe attributes - both worse trade-offs than disabling Azure Monitor in the test process where it provides no value anyway.

Verification

  • Reproduced the exact CI stack trace locally by constructing two TelemetryManager instances with Azure Monitor enabled, waiting 300ms for the samplers' adaptive state to mature, and calling StartReportedActivity. This confirmed the root cause; the repro test was removed before commit (there is no upstream issue to track, and an [ActiveIssue]-skipped repro added more noise than value).
  • Full tests/Aspire.Cli.Tests suite locally: 3,660 / 3,681 pass (21 skipped on this platform), 0 failures.
  • MainReturnsExpectedExitCode across 5 stress runs locally: all green.
  • TelemetryConfigurationTests (all 3 Azure Monitor tests): pass with the opt-in helper.

Checklist

  • Is this feature complete?
    • Yes. Ready to ship.
    • No. Follow-up changes expected.
  • Are you including unit tests for the changes and scenario tests if relevant?
    • Yes
    • No
  • Did you add public API?
    • Yes
    • No
  • Does the change make any security assumptions or guarantees?
    • Yes
    • No

Aspire.Cli.Tests.CliSmokeTests.MainReturnsExpectedExitCode(args: [],
expectedExitCode: 1) intermittently failed in CI with:

  System.InvalidOperationException : 'The collection already contains
  item with same key microsoft.sample_rate'
    at System.Diagnostics.ActivityTagsCollection.Add(String, Object)
    at OpenTelemetry.Trace.TracerProviderSdk.ComputeActivitySamplingResult(...)
    at System.Diagnostics.ActivitySource.CreateActivity(...)
    at Aspire.Cli.Telemetry.AspireCliTelemetry.StartReportedActivity(...)

Root cause: Azure.Monitor.OpenTelemetry.Exporter uses RateLimitedSampler
by default, which (once its adaptive state matures ~200ms after creation)
returns a SamplingResult with a 'microsoft.sample_rate' attribute.
OpenTelemetry's TracerProviderSdk writes those attributes into the shared
ActivityCreationOptions.SamplingTags via a hard Add (no TryAdd), and
ActivitySource.CreateActivity reuses the same options across every
registered listener. When two listeners on the 'Aspire.Cli.Reported'
source both run such samplers, the second Add throws the duplicate-key
exception.

Aspire CLI production only ever has one live TelemetryManager (DI
singleton), so the bug is invisible at runtime. xUnit v3 runs test
classes in parallel by default, however, which allows two in-process
host-building tests to race. Serialize them via a new
CliHostTestCollection (DisableParallelization = true), mirroring the
existing EnvVarMutatingTestCollection pattern. Apply [Collection] to:

  * CliSmokeTests           (in-process Program.Main([]))
  * CliBootstrapTests       (Program.BuildApplicationAsync)
  * TelemetryConfigurationTests (direct new TelemetryManager + host)

SdkDumpCommandTests only invokes Program.Main inside RemoteExecutor.Invoke
(separate process), so it does not contribute to the in-process race.
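
For reference, the collection-based serialization described in this commit message could be sketched as below. The collection name string and the example test class are hypothetical; CollectionDefinition, DisableParallelization, and Collection are existing xUnit attributes. Later commits in this PR replaced this approach.

```csharp
using Xunit;

namespace Aspire.Cli.Tests;

// Collection definition: tests in this collection do not run in parallel
// with tests from any other collection.
[CollectionDefinition(Name, DisableParallelization = true)]
public sealed class CliHostTestCollection
{
    public const string Name = "CliHostTests"; // hypothetical collection name
}

// A test class opts in by naming the collection.
[Collection(CliHostTestCollection.Name)]
public class ExampleHostBuildingTests // hypothetical test class
{
    [Fact]
    public void RunsWithoutOverlappingOtherCollections()
    {
        // Placeholder body; the real tests build a CLI host in-process.
        Assert.True(true);
    }
}
```

The trade-off is that every class that builds a host in-process has to remember the attribute, which is the gap the later process-wide opt-out closes.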

Verified locally:
  * Repro with two live TelemetryManager instances + Thread.Sleep(300ms)
    deterministically reproduces the exact CI stack trace before the fix.
  * Full Aspire.Cli.Tests suite (3,660 tests) passes after the fix.
  * MainReturnsExpectedExitCode passes across 5 stress runs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 24, 2026 21:33
github-actions bot (Contributor) commented May 24, 2026

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 17451

Or

  • Run remotely in PowerShell:
iex "& { $(irm https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 17451"

Copilot AI (Contributor) left a comment

Pull request overview

This PR addresses intermittent CLI test failures caused by an in-process OpenTelemetry/Azure Monitor listener interaction by serializing a small set of host-building/telemetry-initializing test classes.

Changes:

  • Introduces a new xUnit test collection (CliHostTestCollection) with parallelization disabled.
  • Applies the collection to CLI smoke, bootstrap, and telemetry configuration test classes that build the real CLI host or create live TelemetryManager instances in-process.
  • Leaves out tests that invoke Program.Main out-of-process via RemoteExecutor, avoiding unnecessary serialization.

| File | Description |
| --- | --- |
| tests/Aspire.Cli.Tests/CliHostTestCollection.cs | Adds a non-parallelized xUnit collection to serialize in-process CLI host/telemetry tests. |
| tests/Aspire.Cli.Tests/CliSmokeTests.cs | Places smoke tests that call Program.Main in-process into the non-parallelized collection. |
| tests/Aspire.Cli.Tests/CliBootstrapTests.cs | Places host-building bootstrap tests into the non-parallelized collection. |
| tests/Aspire.Cli.Tests/Telemetry/TelemetryConfigurationTests.cs | Places telemetry configuration tests (including TelemetryManager construction) into the non-parallelized collection. |

Copilot's findings

  • Files reviewed: 4/4 changed files
  • Comments generated: 0

JamesNK (Member) commented May 25, 2026

What about using ASPIRE_CLI_TELEMETRY_OPTOUT with these tests to avoid setting up Azure Monitor?

Per JamesNK and follow-up feedback on #17451, replace the in-code
[Collection(CliHostTestCollection.Name)] approach with a project-local
xunit.runner.json that disables test-collection parallelization for
Aspire.Cli.Tests. This matches the existing repo convention used by
Aspire.Cli.EndToEnd.Tests, Aspire.Templates.Tests, and others, and
removes the need for any C# code annotations or environment-variable
opt-out to work around the OpenTelemetry/Azure Monitor in-process
'microsoft.sample_rate' duplicate-key race (#17450).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
davidfowl (Contributor, Author) commented:

Thanks for the look @JamesNK — I went a step further and used the repo's existing xunit.runner.json convention (already used by Aspire.Cli.EndToEnd.Tests, Aspire.Templates.Tests, Aspire.Deployment.EndToEnd.Tests) to disable test-collection parallelization for the whole Aspire.Cli.Tests assembly. This:

  • Avoids touching telemetry behavior or setting ASPIRE_CLI_TELEMETRY_OPTOUT from inside the test process (which would conflict with the existing AzureMonitor_Enabled_ByDefault and OTLP-with-Azure-Monitor tests that intentionally exercise the Azure Monitor branch).
  • Removes the per-class [Collection] plumbing and the CliHostTestCollection.cs file entirely — pure config, no C# code annotations.
  • Matches the convention already used for other CLI-adjacent test projects.

Full-project run: 3,660/3,681 pass (21 platform-skipped), 0 failures, ~28s end-to-end. 5× stress run of the originally failing MainReturnsExpectedExitCode is clean. Pushed as a follow-up commit (no force-push).

The Aspire.Cli.Tests assembly now sets ASPIRE_CLI_TELEMETRY_OPTOUT=true
process-wide via a [ModuleInitializer], so each in-process CliHost built
by a test skips Azure Monitor by default. Tests that need to exercise the
Azure-Monitor-enabled branch override the env-var-derived opt-out via the
in-memory configuration passed to Program.BuildApplicationAsync (which is
layered on top of AddEnvironmentVariables and therefore wins).

This replaces the xunit.runner.json serialization workaround. With Azure
Monitor disabled by default, only the few tests in TelemetryConfigurationTests
build a TracerProvider — and those run serially within a single class — so
the duplicate "microsoft.sample_rate" Add into the shared
ActivityCreationOptions.SamplingTags can no longer occur across parallel
test classes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
davidfowl (Contributor, Author) commented:

Switched the fix to the env-var opt-out approach @JamesNK suggested.

What this commit does (replaces the previous xunit.runner.json serialization):

  • A single-purpose TestTelemetryDefaults.cs in Aspire.Cli.Tests uses [ModuleInitializer] to set ASPIRE_CLI_TELEMETRY_OPTOUT=true on the test process before any test code runs. Every CLI host built by a test is opted out of Azure Monitor by default.
  • The three tests in TelemetryConfigurationTests that need to exercise the Azure-Monitor-enabled branch opt back in by including ASPIRE_CLI_TELEMETRY_OPTOUT="false" in their in-memory configuration. Program.BuildApplicationAsync layers AddInMemoryCollection after AddEnvironmentVariables, so the per-test override wins over the assembly-wide env var.
  • All Azure Monitor TracerProvider creations are now confined to a single class (TelemetryConfigurationTests), which is internally serial — so the duplicate microsoft.sample_rate add into the shared ActivityCreationOptions.SamplingTags can no longer happen across parallel test classes.

No production code changes, no class-level [Collection] attributes, and no broad parallelization disable.

Validation (local, macOS arm64):

  • dotnet test --filter-class "*.CliSmokeTests" "*.CliBootstrapTests" "*.TelemetryConfigurationTests" → 31/31 passed.
  • 5× back-to-back runs of CliSmokeTests.MainReturnsExpectedExitCode → all green.
  • Full Aspire.Cli.Tests run → 3,660 passed, 21 skipped (quarantined), 0 failed (matches baseline).

JamesNK (Member) left a comment

Code change looks correct — the module-initializer approach cleanly prevents the microsoft.sample_rate race by keeping Azure Monitor out of the pipeline for parallel test classes, with focused opt-in only in TelemetryConfigurationTests (which runs sequentially within-class). One minor note about the PR description not matching the implemented approach.

Comment thread tests/Aspire.Cli.Tests/TestTelemetryDefaults.cs
davidfowl (Contributor, Author) commented:

Thanks for the careful read. You're right - the description was stale; I've updated it to describe the [ModuleInitializer] global opt-out plus the in-class WithTelemetryOptIn(...) helper, since that's what actually shipped.

github-actions bot (Contributor) commented:

CLI E2E Tests unknown — 96 passed, 0 failed, 5 unknown (commit f249f95)

View all recordings
Status Test Recording
AddPackageInteractiveWhileAppHostRunningDetached ▶️ View recording
AddPackageWhileAppHostRunningDetached ▶️ View recording
AgentCommands_AllHelpOutputs_AreCorrect ▶️ View recording
AgentInitCommand_DefaultSelection_InstallsDefaultSkills ▶️ View recording
AgentInitCommand_MigratesDeprecatedConfig ▶️ View recording
AgentMcpListStructuredLogsFromStarterAppCore ▶️ View recording
AllPublishMethodsBuildDockerImages ▶️ View recording
AspireAddPackageVersionToDirectoryPackagesProps ▶️ View recording
AspireInitSingleFileAppHostRunsViaDotnetRunAppHost ▶️ View recording
AspireInitWithExistingAppHostDirRecreatesMissingNuGetConfigAndPreservesFiles ▶️ View recording
AspireInitWithSolutionFileGeneratesAppHostThatBuildsAgainstChannelHive ▶️ View recording
AspireStartUpdatesStaleTypeScriptAppHostPath ▶️ View recording
AspireUpdateRemovesAppHostPackageVersionFromDirectoryPackagesProps ▶️ View recording
AspireUpdateRemovesOrphanAppHostPackageVersionWhenSdkAlreadyCurrent ▶️ View recording
Banner_DisplayedOnFirstRun ▶️ View recording
Banner_DisplayedWithExplicitFlag ▶️ View recording
Banner_NotDisplayedWithNoLogoFlag ▶️ View recording
CertificatesClean_RemovesCertificates ▶️ View recording
CertificatesTrust_WithNoCert_CreatesAndTrustsCertificate ▶️ View recording
CertificatesTrust_WithUntrustedCert_TrustsCertificate ▶️ View recording
ConfigSetGet_CreatesNestedJsonFormat ▶️ View recording
CreateAndRunAspireStarterProject ▶️ View recording
CreateAndRunAspireStarterProjectWithBundle ▶️ View recording
CreateAndRunEmptyAppHostProject ▶️ View recording
CreateAndRunJavaEmptyAppHostProject ▶️ View recording
CreateAndRunJsReactProject ▶️ View recording
CreateAndRunPythonReactProject ▶️ View recording
CreateAndRunTypeScriptEmptyAppHostProject ▶️ View recording
CreateAndRunTypeScriptStarterProject ▶️ View recording
CreateJavaAppHostWithViteApp ▶️ View recording
CreateTypeScriptAppHostWithViteApp_UsesConfiguredToolchain ▶️ View recording
DashboardRunWithAgentMcpCore ▶️ View recording
DashboardRunWithOtelTracesReturnsNoTracesCore ▶️ View recording
DeployK8sBasicApiService ▶️ View recording
DeployK8sWithExternalHelmChart ▶️ View recording
DeployK8sWithGarnet ▶️ View recording
DeployK8sWithMongoDB ▶️ View recording
DeployK8sWithMySql ▶️ View recording
DeployK8sWithPostgres ▶️ View recording
DeployK8sWithRabbitMQ ▶️ View recording
DeployK8sWithRedis ▶️ View recording
DeployK8sWithSqlServer ▶️ View recording
DeployK8sWithValkey ▶️ View recording
DeployTypeScriptAppToKubernetes ▶️ View recording
DescribeCommandResolvesReplicaNames ▶️ View recording
DescribeCommandShowsRunningResources ▶️ View recording
DetachFormatJsonProducesValidJson ▶️ View recording
DetachFormatJsonProducesValidJsonWhenRestartingExistingInstance ▶️ View recording
DoListStepsShowsPipelineSteps ▶️ View recording
DocsCommand_RendersInteractiveMarkdownFromLocalSource ▶️ View recording
DoctorCommand_DetectsDeprecatedAgentConfig ▶️ View recording
DoctorCommand_TypeScriptAppHostReportsMissingConfiguredToolchain ▶️ View recording
DoctorCommand_WithSslCertDir_ShowsTrusted ▶️ View recording
DoctorCommand_WithoutSslCertDir_ShowsPartiallyTrusted ▶️ View recording
GeneratedAspireDevScript_StartsWatchMode_WithConfiguredToolchain ▶️ View recording
GlobalMigration_HandlesCommentsAndTrailingCommas ▶️ View recording
GlobalMigration_HandlesMalformedLegacyJson ▶️ View recording
GlobalMigration_PreservesAllValueTypes ▶️ View recording
GlobalMigration_SkipsWhenNewConfigExists ▶️ View recording
GlobalSettings_MigratedFromLegacyFormat ▶️ View recording
InitTypeScriptAppHost_AugmentsExistingViteRepoAtRoot ▶️ View recording
InteractiveCSharpInitCreatesExpectedFiles ▶️ View recording
InvalidAppHostPathWithComments_IsHealedOnRun ▶️ View recording
JavaScriptHostingApisRunFromTypeScriptAppHost ▶️ View recording
LatestCliCanStartStableChannelAppHost ▶️ View recording
LatestCliCanStartStableChannelTypeScriptAppHost ▶️ View recording
LegacySettingsMigration_AdjustsRelativeAppHostPath ▶️ View recording
LogLevelTrace_ProducesTraceEntriesInCliLogFile ▶️ View recording
LogsCommandShowsResourceLogs ▶️ View recording
OtelLogsReturnsStructuredLogsFromStarterApp ▶️ View recording
OtelLogsReturnsStructuredLogsFromStarterAppIsolated ▶️ View recording
PsCommandListsRunningAppHost ▶️ View recording
PsFormatJsonOutputsOnlyJsonToStdout ▶️ View recording
PublishJavaScriptPatternsGeneratesExpectedDockerComposeArtifacts ▶️ View recording
PublishWithConfigureEnvFileUpdatesEnvOutput ▶️ View recording
PublishWithDockerComposeServiceCallbackSucceeds ▶️ View recording
PublishWithoutOutputPathUsesAppHostDirectoryDefault ▶️ View recording
ResourceCommand_FailedExecution_DisplaysAppHostLogPathAndLogContainsEntries ▶️ View recording
ResourceCommand_FailsWhenInteractionServiceIsRequired ▶️ View recording
ResourceCommand_SetAndDeleteParameterUpdatesDescribeOutput ▶️ View recording
RestoreGeneratesSdkFiles ▶️ View recording
RestoreGeneratesSdkFiles_WithConfiguredToolchain ▶️ View recording
RestoreRefreshesGeneratedSdkAfterAddingIntegration ▶️ View recording
RestoreSupportsConfigOnlyHelperPackageAndCrossPackageTypes ▶️ View recording
RunFromParentDirectory_UsesExistingConfigNearAppHost ▶️ View recording
RunPublishFailureScenarioAsync ▶️ View recording
RunReportsSyntaxErrorsForDotNetAppHost ▶️ View recording
RunReportsSyntaxErrorsForTypeScriptAppHost ▶️ View recording
SecretCrudOnDotNetAppHost ▶️ View recording
SecretCrudOnTypeScriptAppHost ▶️ View recording
StagingChannel_ConfigureAndVerifySettings_ThenSwitchChannels ▶️ View recording
StartAndWaitForTypeScriptSqlServerAppHostWithNativeAssets ▶️ View recording
StartReportsSyntaxErrorsForDotNetAppHost ▶️ View recording
StartReportsSyntaxErrorsForTypeScriptAppHost ▶️ View recording
StopAllAppHostsFromAppHostDirectory ▶️ View recording
StopJavaPolyglotAppHostUsingApphostDirectory ▶️ View recording
StopNonInteractiveSingleAppHost ▶️ View recording
StopTypeScriptPolyglotAppHostUsingApphostDirectory ▶️ View recording
StopWithNoRunningAppHostExitsSuccessfully ▶️ View recording
UnAwaitedChainsCompileWithAutoResolvePromises ▶️ View recording
UpdateProjectChannelToStable_TypeScript_PicksUpStablePackages ▶️ View recording

📹 Recordings uploaded automatically from CI run #26377911110


Development

Successfully merging this pull request may close these issues.

[Failing test]: Aspire.Cli.Tests.CliSmokeTests.MainReturnsExpectedExitCode(args: [], expectedExitCode: 1)

4 participants