[release/13.3] Fix unbounded collection growth in TelemetryRepository #16590

Merged
joperezr merged 5 commits into release/13.3 from
fix/telemetry-repository-unbounded-collections
May 1, 2026
JamesNK (Member) commented Apr 30, 2026

Description

Fix unbounded collection growth in TelemetryRepository and related OTLP model types that could lead to memory exhaustion in long-running dashboard sessions.

Problems fixed:

  • _logScopes, _traceScopes, _logPropertyKeys, _tracePropertyKeys, and _spanLinks were never cleaned up when logs/traces were cleared, causing indefinite accumulation.
  • _resources had no cap, allowing unbounded growth from dynamic services or high-cardinality peer addresses.
  • Per-resource _instruments, _meters, _resourceViews had no caps.
  • Per-instrument Dimensions and KnownAttributeValues had no caps, making high-cardinality metric tags a memory leak vector.
  • ClearMetrics() did not clear _meters (scopes).
  • ClearTraces() full-clear path did not clear _spanLinks (since CircularBuffer.Clear() doesn't fire ItemRemovedForCapacity).
  • Per-resource ClearTraces() path did not remove span links for removed traces.
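
The span-link problems above come down to a side cache that is only pruned from an eviction callback. The following is a minimal sketch of that pattern, not the dashboard's code: `TraceStoreSketch`, its `Queue`-based buffer, and the `(TraceId, LinkId)` tuple are hypothetical stand-ins for the real `CircularBuffer` and `_spanLinks`. It shows why a full clear has to empty the side cache explicitly when clearing the buffer does not run the eviction path.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for a trace store with a side cache of span links.
public sealed class TraceStoreSketch
{
    private readonly int _capacity; // Assumed to be >= 1.
    private readonly Queue<string> _traces = new();
    private readonly List<(string TraceId, string LinkId)> _spanLinks = new();

    public TraceStoreSketch(int capacity) => _capacity = capacity;

    public int LinkCount => _spanLinks.Count;

    public void AddTrace(string traceId, string linkId)
    {
        if (_traces.Count == _capacity)
        {
            // Capacity eviction: the only place links are pruned automatically.
            var evicted = _traces.Dequeue();
            _spanLinks.RemoveAll(link => link.TraceId == evicted);
        }

        _traces.Enqueue(traceId);
        _spanLinks.Add((traceId, linkId));
    }

    public void ClearAll()
    {
        // Clearing the buffer does not go through the eviction branch above,
        // so the side cache must be cleared here as well.
        _traces.Clear();
        _spanLinks.Clear();
    }
}
```

Without the second `Clear()` call in `ClearAll`, every link added before the clear would stay in memory with no trace left to evict it.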

Changes:

  • Add MaxResourceCount option to TelemetryLimitOptions (default 10,000). Throws InvalidOperationException on limit exceeded, handled by existing callers.
  • Clear _logScopes/_logPropertyKeys in ClearStructuredLogs and _traceScopes/_tracePropertyKeys/_spanLinks in ClearTraces on full clear.
  • Remove per-resource _logPropertyKeys/_tracePropertyKeys entries and span links on per-resource clear.
  • Clear _meters alongside _instruments in OtlpResource.ClearMetrics().
  • Add internal const limits on TelemetryRepository for resource views, instruments, dimensions, known attribute value keys, and values per key (all 10,000).
  • Enforce limits in OtlpResource.GetView, OtlpResource.AddMetrics, OtlpInstrument.FindScope, and OtlpInstrument.CreateDimensionScope.
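
As an illustration of the per-instrument caps described above, here is a hedged sketch of capping both the number of attribute keys and the number of values per key. `CappedAttributeValues` and its members are hypothetical names, not the types in the PR, and the real code may drop or report over-limit values differently.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for a per-instrument known-attribute cache.
public sealed class CappedAttributeValues
{
    private readonly int _maxKeys;
    private readonly int _maxValuesPerKey;
    private readonly Dictionary<string, List<string>> _values = new();

    public CappedAttributeValues(int maxKeys, int maxValuesPerKey)
    {
        _maxKeys = maxKeys;
        _maxValuesPerKey = maxValuesPerKey;
    }

    public int KeyCount => _values.Count;

    // Returns false when the value was dropped because a cap was reached.
    public bool TryAdd(string key, string value)
    {
        if (!_values.TryGetValue(key, out var list))
        {
            if (_values.Count >= _maxKeys)
            {
                return false; // A new key would exceed the key cap.
            }

            list = new List<string>();
            _values.Add(key, list);
        }

        if (list.Contains(value))
        {
            return true; // Already known, nothing to add.
        }

        if (list.Count >= _maxValuesPerKey)
        {
            return false; // A new value would exceed the per-key cap.
        }

        list.Add(value);
        return true;
    }
}
```

With both caps in place, a metric tag with unbounded cardinality (for example a request ID) can add at most `maxValuesPerKey` entries instead of growing for the lifetime of the session.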

- Add MaxResourceCount option to TelemetryLimitOptions (default 10,000)
  to cap _resources growth. Throws on limit exceeded.
- Clear _logScopes, _logPropertyKeys on ClearStructuredLogs (full clear)
  and remove per-resource property keys on per-resource clear.
- Clear _traceScopes, _tracePropertyKeys, _spanLinks on ClearTraces
  (full clear) and clean up span links and property keys per-resource.
- Clear _meters alongside _instruments in OtlpResource.ClearMetrics.
- Add internal const limits on TelemetryRepository for resource views
  (10,000), instruments (10,000), dimensions (10,000), known attribute
  value keys (10,000), and values per key (10,000).
- Enforce instrument limit in OtlpResource.AddMetrics.
- Enforce resource view limit in OtlpResource.GetView.
- Enforce dimension limit in OtlpInstrument.FindScope.
- Cap KnownAttributeValues keys and per-key value lists in
  OtlpInstrument.CreateDimensionScope.
- Add clarifying comments to fields describing their bounds.
Copilot AI review requested due to automatic review settings April 30, 2026 06:50
JamesNK requested a review from adamint as a code owner April 30, 2026 06:50

github-actions Bot commented Apr 30, 2026

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 16590

Or

  • Run remotely in PowerShell:
iex "& { $(irm https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 16590"

Copilot AI left a comment

Pull request overview

This PR addresses potential memory exhaustion in long-running Aspire Dashboard sessions by bounding several telemetry data structures (resources, metrics dimensions/attributes, resource views/instruments) and by cleaning up previously unbounded collections when telemetry is cleared.

Changes:

  • Introduces TelemetryLimitOptions.MaxResourceCount and enforces it when creating new OtlpResource entries.
  • Ensures full-clear and per-resource clear paths remove additional cached state (scopes/property-key caches, span-link cache).
  • Adds caps for resource views, instruments, metric dimensions, and known attribute values; and fixes ClearMetrics() to also clear meters/scopes.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/Aspire.Dashboard/Otlp/Storage/TelemetryRepository.cs | Adds resource-count limit; clears scope/property-key/link caches on clear paths; per-resource clear removes related cached entries. |
| src/Aspire.Dashboard/Otlp/Model/OtlpResource.cs | Enforces per-resource caps (resource views, instruments) and clears meters on metric clear. |
| src/Aspire.Dashboard/Otlp/Model/OtlpInstrument.cs | Enforces caps on dimension scopes and known attribute values (keys/values). |
| src/Aspire.Dashboard/Configuration/DashboardOptions.cs | Adds MaxResourceCount option to telemetry limits configuration. |
Comments suppressed due to low confidence (1)

src/Aspire.Dashboard/Otlp/Storage/TelemetryRepository.cs:264

  • The resource limit check runs before _resources.GetOrAdd(...) and can throw even when the resource was concurrently added after the initial TryGetValue fast-path (false-positive limit hit). Also, callers like GetPeerResource(...) / CalculateTraceUninstrumentedPeers(...) don’t catch this exception, so hitting MaxResourceCount can bubble out of AddTracesCore and fail the entire trace ingestion request. Consider attempting the GetOrAdd first and only enforcing the limit when newResource == true (optionally removing the just-added entry if over limit), and/or ensure the uninstrumented-peer paths handle the limit without throwing out of ingestion.
        // Check resource limit before adding a new resource.
        if (_resources.Count >= _otlpContext.Options.MaxResourceCount)
        {
            throw new InvalidOperationException($"Resource limit of {_otlpContext.Options.MaxResourceCount} reached. Resource '{key}' will not be added.");
        }

        // Slower get or add path.
        // This GetOrAdd allocates a closure, so we avoid it if possible.
        var newResource = false;
        resource = _resources.GetOrAdd(key, _ =>
        {
            newResource = true;
            return new OtlpResource(key.Name, key.InstanceId, uninstrumentedPeer, _otlpContext);
        });
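
The alternative suggested in the comment above (add first, enforce the limit only for a newly created entry, and roll back if over the limit) could look roughly like the sketch below. `BoundedRegistry` is a hypothetical stand-in using plain `object` values, not the repository's `_resources` field. Later commits in this PR document the check-before-add approach as a soft cap instead, so this only illustrates the reviewer's idea.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical stand-in for a keyed registry with a maximum entry count.
public sealed class BoundedRegistry
{
    private readonly ConcurrentDictionary<string, object> _items = new();
    private readonly int _maxCount;

    public BoundedRegistry(int maxCount) => _maxCount = maxCount;

    public int Count => _items.Count;

    public object GetOrAdd(string key)
    {
        // Fast path: existing entries are never rejected by the limit.
        if (_items.TryGetValue(key, out var existing))
        {
            return existing;
        }

        var created = new object();
        var item = _items.GetOrAdd(key, created);

        // Enforce the limit only if this call actually added the entry.
        if (ReferenceEquals(item, created) && _items.Count > _maxCount)
        {
            _items.TryRemove(key, out _);
            throw new InvalidOperationException(
                $"Limit of {_maxCount} reached. '{key}' will not be added.");
        }

        return item;
    }
}
```

This removes the false-positive case where a concurrently added key is rejected, but it is still not a hard cap: another thread can observe the entry between the add and the rollback.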

… limit

- Add TelemetryLimitTests with 5 tests for resource and instrument limits
- Add maxResourceCount parameter to CreateRepository test helper
- Wrap GetOrAddResource calls in GetPeerResource (return null), CalculateTraceUninstrumentedPeers, and OnPeerChanged with try/catch
- Fix _meters comment to not claim an unenforced bound
- Document TOCTOU soft-cap behavior on resource limit check
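
A minimal sketch of the try/catch wrapping described in this commit, assuming the limit is signalled with `InvalidOperationException` as stated in the PR description. `PeerLookup.TryResolve` and its delegate parameter are hypothetical; the real `GetPeerResource` works on dashboard types rather than strings.

```csharp
#nullable enable
using System;

// Hypothetical helper: turns a "limit reached" exception into a null result
// so that a failed peer lookup does not fail the whole ingestion request.
public static class PeerLookup
{
    public static string? TryResolve(Func<string, string> getOrAdd, string peerKey)
    {
        try
        {
            return getOrAdd(peerKey);
        }
        catch (InvalidOperationException)
        {
            // Limit reached: treat the peer as unresolved instead of throwing.
            return null;
        }
    }
}
```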
github-actions Bot commented

Re-running the failed jobs in the CI workflow for this pull request because 1 job was identified as a retry-safe transient failure in the CI run attempt.
GitHub was asked to rerun all failed jobs for that attempt, and the rerun is being tracked in the rerun attempt.
The job links below point to the failed attempt jobs that matched the retry-safe transient failure rules.

JamesNK added 2 commits April 30, 2026 17:16
- ClearTraces full-clear now resets HasTraces on all resources
- ClearStructuredLogs full-clear now resets HasLogs on all resources and clears _resourceUnviewedErrorLogs
JamesNK changed the title from "Fix unbounded collection growth in TelemetryRepository" to "[release/13.3] Fix unbounded collection growth in TelemetryRepository" Apr 30, 2026
joperezr added the Servicing-approved (Approved for servicing release) label Apr 30, 2026

joperezr (Member) left a comment

Approving modulo the inline comments — nothing blocking, but a few worth a look (esp. the O(N²) span link cleanup and the _meters growth path).

joperezr (Member) commented

@adamint can you please take a look here too?

- Add MaxScopeCount limit to TryGetOrAddScope, using TryGetValue instead
  of GetValueRefOrAddDefault to avoid add-then-remove on overflow
- Refactor instrument add in OtlpResource to use TryGetValue + count check
  before inserting, removing the add-then-remove pattern
- Fix AddLogs failure count: count log records, not scopes
- Fix AddMetrics failure count: count data points, not metrics
- Add tests for resource limit, scope limit, and correct failure counting
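
To illustrate the failure-count fix for logs, here is a hedged sketch in which a rejected scope counts every log record it carried rather than counting as a single failure. `ScopeBatch`, `FailureCounting`, and the `tryAcceptScope` delegate are hypothetical names, and the real `AddLogs` may reject records for reasons other than the scope limit.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shape of one scope's worth of incoming log records.
public sealed record ScopeBatch(string ScopeName, IReadOnlyList<string> LogRecords);

public static class FailureCounting
{
    // Returns the number of log records that could not be stored.
    public static int CountFailedRecords(
        IEnumerable<ScopeBatch> batches,
        Func<string, bool> tryAcceptScope)
    {
        var failed = 0;
        foreach (var batch in batches)
        {
            if (!tryAcceptScope(batch.ScopeName))
            {
                // Count the records lost, not the single scope that failed.
                failed += batch.LogRecords.Count;
            }
        }

        return failed;
    }
}
```

Counting scopes would report one failure for a rejected scope with three records; counting records reports three, which matches what the sender actually lost.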
github-actions Bot commented May 1, 2026

🎬 CLI E2E Test Recordings — 66 recordings uploaded (commit f328daf)

📹 Recordings uploaded automatically from CI run #25197587185

JamesNK (Member, Author) commented May 1, 2026

Test errors are on the release/13.3 branch and are unrelated.

joperezr merged commit 529f2ca into release/13.3 May 1, 2026
807 of 819 checks passed
microsoft-github-policy-service Bot added this to the 13.3 milestone May 1, 2026

aspire-repo-bot (Contributor) commented

Pull request created: #797

Generated by PR Documentation Check

aspire-repo-bot (Contributor) commented

A draft documentation PR has been opened on microsoft/aspire.dev targeting main (falling back from release/13.3 because that branch does not yet exist on aspire.dev).

Summary of documentation changes:

The dashboard [configuration docs](learn.microsoft.com/redacted) were updated to document the new Dashboard:TelemetryLimits:MaxResourceCount option (default 10,000) added by this PR. This option caps the number of resources tracked by the dashboard to prevent unbounded memory growth in long-running sessions.

File modified: src/frontend/src/content/docs/dashboard/configuration.mdx

The draft PR needs human review before merging.

Generated by PR Documentation Check for issue #16590
