Serialize CLI host tests to fix sample_rate race (#17450)#17451
Serialize CLI host tests to fix sample_rate race (#17450)#17451davidfowl wants to merge 3 commits into
Conversation
Aspire.Cli.Tests.CliSmokeTests.MainReturnsExpectedExitCode(args: [],
expectedExitCode: 1) intermittently failed in CI with:
System.InvalidOperationException : 'The collection already contains
item with same key microsoft.sample_rate'
at System.Diagnostics.ActivityTagsCollection.Add(String, Object)
at OpenTelemetry.Trace.TracerProviderSdk.ComputeActivitySamplingResult(...)
at System.Diagnostics.ActivitySource.CreateActivity(...)
at Aspire.Cli.Telemetry.AspireCliTelemetry.StartReportedActivity(...)
Root cause: Azure.Monitor.OpenTelemetry.Exporter uses RateLimitedSampler
by default, which (once its adaptive state matures ~200ms after creation)
returns a SamplingResult with a 'microsoft.sample_rate' attribute.
OpenTelemetry's TracerProviderSdk writes those attributes into the shared
ActivityCreationOptions.SamplingTags via a hard Add (no TryAdd), and
ActivitySource.CreateActivity reuses the same options across every
registered listener. When two listeners on the 'Aspire.Cli.Reported'
source both run such samplers, the second Add throws the duplicate-key
exception.
Aspire CLI production only ever has one live TelemetryManager (DI
singleton), so the bug is invisible at runtime. xUnit v3 runs test
classes in parallel by default, however, which allows two in-process
host-building tests to race. Serialize them via a new
CliHostTestCollection (DisableParallelization = true), mirroring the
existing EnvVarMutatingTestCollection pattern. Apply [Collection] to:
* CliSmokeTests (in-process Program.Main([]))
* CliBootstrapTests (Program.BuildApplicationAsync)
* TelemetryConfigurationTests (direct new TelemetryManager + host)
SdkDumpCommandTests only invokes Program.Main inside RemoteExecutor.Invoke
(separate process), so it does not contribute to the in-process race.
Verified locally:
* Repro with two live TelemetryManager instances + Thread.Sleep(300ms)
deterministically reproduces the exact CI stack trace before the fix.
* Full Aspire.Cli.Tests suite (3,660 tests) passes after the fix.
* MainReturnsExpectedExitCode passes across 5 stress runs.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
🚀 Dogfood this PR with:
curl -fsSL https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 17451Or
iex "& { $(irm https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 17451" |
There was a problem hiding this comment.
Pull request overview
This PR addresses intermittent CLI test failures caused by an in-process OpenTelemetry/Azure Monitor listener interaction by serializing a small set of host-building/telemetry-initializing test classes.
Changes:
- Introduces a new xUnit test collection (
CliHostTestCollection) with parallelization disabled. - Applies the collection to CLI smoke, bootstrap, and telemetry configuration test classes that build the real CLI host or create live
TelemetryManagerinstances in-process. - Leaves out tests that invoke
Program.Mainout-of-process viaRemoteExecutor, avoiding unnecessary serialization.
Show a summary per file
| File | Description |
|---|---|
| tests/Aspire.Cli.Tests/CliHostTestCollection.cs | Adds a non-parallelized xUnit collection to serialize in-process CLI host/telemetry tests. |
| tests/Aspire.Cli.Tests/CliSmokeTests.cs | Places smoke tests that call Program.Main in-process into the non-parallelized collection. |
| tests/Aspire.Cli.Tests/CliBootstrapTests.cs | Places host-building bootstrap tests into the non-parallelized collection. |
| tests/Aspire.Cli.Tests/Telemetry/TelemetryConfigurationTests.cs | Places telemetry configuration tests (including TelemetryManager construction) into the non-parallelized collection. |
Copilot's findings
- Files reviewed: 4/4 changed files
- Comments generated: 0
|
What about using ASPIRE_CLI_TELEMETRY_OPTOUT with these tests to avoid setting up Azure Monitor? |
Per JamesNK and follow-up feedback on #17451, replace the in-code [Collection(CliHostTestCollection.Name)] approach with a project-local xunit.runner.json that disables test-collection parallelization for Aspire.Cli.Tests. This matches the existing repo convention used by Aspire.Cli.EndToEnd.Tests, Aspire.Templates.Tests, and others, and removes the need for any C# code annotations or environment-variable opt-out to work around the OpenTelemetry/Azure Monitor in-process 'microsoft.sample_rate' duplicate-key race (#17450). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Thanks for the look @JamesNK — I went a step further and used the repo's existing
Full-project run: 3,660/3,681 pass (21 platform-skipped), 0 failures, ~28s end-to-end. 5× stress run of the originally failing |
The Aspire.Cli.Tests assembly now sets ASPIRE_CLI_TELEMETRY_OPTOUT=true process-wide via a [ModuleInitializer], so each in-process CliHost built by a test skips Azure Monitor by default. Tests that need to exercise the Azure-Monitor-enabled branch override the env-var-derived opt-out via the in-memory configuration passed to Program.BuildApplicationAsync (which is layered on top of AddEnvironmentVariables and therefore wins). This replaces the xunit.runner.json serialization workaround. With Azure Monitor disabled by default, only the few tests in TelemetryConfigurationTests build a TracerProvider — and those run serially within a single class — so the duplicate "microsoft.sample_rate" Add into the shared ActivityCreationOptions.SamplingTags can no longer occur across parallel test classes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Switched the fix to the env-var opt-out approach @JamesNK suggested. What this commit does (replaces the previous
No production code changes, no class-level Validation (local, macOS arm64):
|
JamesNK
left a comment
There was a problem hiding this comment.
Code change looks correct — the module-initializer approach cleanly prevents the microsoft.sample_rate race by keeping Azure Monitor out of the pipeline for parallel test classes, with focused opt-in only in TelemetryConfigurationTests (which runs sequentially within-class). One minor note about the PR description not matching the implemented approach.
|
Thanks for the careful read. You're right - the description was stale; I've updated it to describe the |
|
❓ CLI E2E Tests unknown — 96 passed, 0 failed, 5 unknown (commit View all recordings
📹 Recordings uploaded automatically from CI run #26377911110 |
Description
Fixes #17450.
Aspire.Cli.Tests.CliSmokeTests.MainReturnsExpectedExitCode(args: [], expectedExitCode: 1)intermittently failed in CI with:Root cause
Azure.Monitor.OpenTelemetry.ExporterusesRateLimitedSamplerby default, which (once its adaptive state has matured for ~200ms after construction) returns aSamplingResultcontaining amicrosoft.sample_rateattribute. OpenTelemetry'sTracerProviderSdk.ComputeActivitySamplingResultwrites those attributes intoActivityCreationOptions.SamplingTagsvia a hardAdd(key, value)(noTryAdd).ActivitySource.CreateActivityreuses the sameActivityCreationOptionsinstance across every registered listener, so when two listeners on theAspire.Cli.Reportedsource both run samplers that emitmicrosoft.sample_rate, the secondAddthrows.Aspire CLI production code only ever has one live
TelemetryManager(registered as a DI singleton), so the duplicate-tag race is impossible at runtime. xUnit v3 runs test classes in parallel by default, however, which allows two in-process host-building tests to overlap: when one of them callsStartReportedActivity, both listeners fire and the secondAddthrows.Fix
Opt the entire
Aspire.Cli.Testsprocess out of Azure Monitor by default, and re-enable it only for the handful of tests that actually need to assert on Azure Monitor configuration.tests/Aspire.Cli.Tests/TestTelemetryDefaults.csuses[ModuleInitializer]to setASPIRE_CLI_TELEMETRY_OPTOUT=truefor the test process before any test method runs. With opt-out enabled,TelemetryManagerdoes not register an Azure MonitorTracerProvider, so no Azure Monitor listener ever attaches to theAspire.Cli.Reportedsource from test code.TelemetryConfigurationTestsadds a smallWithTelemetryOptIn(...)helper that injects"ASPIRE_CLI_TELEMETRY_OPTOUT": "false"into the in-memory configuration.Program.BuildApplicationAsyncreads environment variables before in-memory values, so in-memory wins. The three tests that exercise Azure Monitor (AzureMonitor_Enabled_ByDefault,OtlpExporter_WithoutProfiling_EnablesOnlyDebugDiagnostics_WhenEndpointProvided,OtlpExporter_WithProfiling_KeepsReportedTelemetryAndProfilingSeparate) opt back in via this helper. Because all Azure MonitorTracerProvidercreations are now confined to a single class and xUnit runs methods within a class serially, the duplicate-Addrace cannot happen.No production code changes. No test collections, no parallelization toggles. The opt-out lives entirely outside test class code so it cannot be accidentally bypassed by adding new test classes that build a CLI host in-process.
Why not a product-side fix
The defect is fundamentally an OpenTelemetry/Azure Monitor interaction: any sampler that writes attributes into
SamplingTagswill collide when more than one listener is attached to the sameActivitySource. A product-side workaround would either (a) discard the Azure Monitor default sampler and lose rate-limiting + thesample_ratemetric, or (b) wrap the sampler chain to dedupe attributes - both worse trade-offs than disabling Azure Monitor in the test process where it provides no value anyway.Verification
TelemetryManagerinstances with Azure Monitor enabled, waiting 300ms for the samplers' adaptive state to mature, and callingStartReportedActivity. This confirmed the root cause; the repro test was removed before commit (no upstream issue to track, and a[ActiveIssue]-skipped repro added more noise than value).tests/Aspire.Cli.Testssuite locally: 3,660 / 3,681 pass (21 skipped on this platform), 0 failures.MainReturnsExpectedExitCodeacross 5 stress runs locally: all green.TelemetryConfigurationTests(all 3 Azure Monitor tests): pass with the opt-in helper.Checklist