[release/13.3] Fix unbounded collection growth in TelemetryRepository#16590
Conversation
- Add MaxResourceCount option to TelemetryLimitOptions (default 10,000) to cap _resources growth. Throws on limit exceeded. - Clear _logScopes, _logPropertyKeys on ClearStructuredLogs (full clear) and remove per-resource property keys on per-resource clear. - Clear _traceScopes, _tracePropertyKeys, _spanLinks on ClearTraces (full clear) and clean up span links and property keys per-resource. - Clear _meters alongside _instruments in OtlpResource.ClearMetrics. - Add internal const limits on TelemetryRepository for resource views (10,000), instruments (10,000), dimensions (10,000), known attribute value keys (10,000), and values per key (10,000). - Enforce instrument limit in OtlpResource.AddMetrics. - Enforce resource view limit in OtlpResource.GetView. - Enforce dimension limit in OtlpInstrument.FindScope. - Cap KnownAttributeValues keys and per-key value lists in OtlpInstrument.CreateDimensionScope. - Add clarifying comments to fields describing their bounds.
|
🚀 Dogfood this PR with:
curl -fsSL https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 16590Or
iex "& { $(irm https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 16590" |
There was a problem hiding this comment.
Pull request overview
This PR addresses potential memory exhaustion in long-running Aspire Dashboard sessions by bounding several telemetry data structures (resources, metrics dimensions/attributes, resource views/instruments) and by cleaning up previously unbounded collections when telemetry is cleared.
Changes:
- Introduces
TelemetryLimitOptions.MaxResourceCountand enforces it when creating newOtlpResourceentries. - Ensures full-clear and per-resource clear paths remove additional cached state (scopes/property-key caches, span-link cache).
- Adds caps for resource views, instruments, metric dimensions, and known attribute values; and fixes
ClearMetrics()to also clear meters/scopes.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/Aspire.Dashboard/Otlp/Storage/TelemetryRepository.cs | Adds resource-count limit; clears scope/property-key/link caches on clear paths; per-resource clear removes related cached entries. |
| src/Aspire.Dashboard/Otlp/Model/OtlpResource.cs | Enforces per-resource caps (resource views, instruments) and clears meters on metric clear. |
| src/Aspire.Dashboard/Otlp/Model/OtlpInstrument.cs | Enforces caps on dimension scopes and known attribute values (keys/values). |
| src/Aspire.Dashboard/Configuration/DashboardOptions.cs | Adds MaxResourceCount option to telemetry limits configuration. |
Comments suppressed due to low confidence (1)
src/Aspire.Dashboard/Otlp/Storage/TelemetryRepository.cs:264
- The resource limit check runs before
_resources.GetOrAdd(...)and can throw even when the resource was concurrently added after the initialTryGetValuefast-path (false-positive limit hit). Also, callers likeGetPeerResource(...)/CalculateTraceUninstrumentedPeers(...)don’t catch this exception, so hittingMaxResourceCountcan bubble out ofAddTracesCoreand fail the entire trace ingestion request. Consider attempting theGetOrAddfirst and only enforcing the limit whennewResource == true(optionally removing the just-added entry if over limit), and/or ensure the uninstrumented-peer paths handle the limit without throwing out of ingestion.
// Check resource limit before adding a new resource.
if (_resources.Count >= _otlpContext.Options.MaxResourceCount)
{
throw new InvalidOperationException($"Resource limit of {_otlpContext.Options.MaxResourceCount} reached. Resource '{key}' will not be added.");
}
// Slower get or add path.
// This GetOrAdd allocates a closure, so we avoid it if possible.
var newResource = false;
resource = _resources.GetOrAdd(key, _ =>
{
newResource = true;
return new OtlpResource(key.Name, key.InstanceId, uninstrumentedPeer, _otlpContext);
});
… limit - Add TelemetryLimitTests with 5 tests for resource and instrument limits - Add maxResourceCount parameter to CreateRepository test helper - Wrap GetOrAddResource calls in GetPeerResource (return null), CalculateTraceUninstrumentedPeers, and OnPeerChanged with try/catch - Fix _meters comment to not claim an unenforced bound - Document TOCTOU soft-cap behavior on resource limit check
|
Re-running the failed jobs in the CI workflow for this pull request because 1 job was identified as retry-safe transient failures in the CI run attempt.
|
- ClearTraces full-clear now resets HasTraces on all resources - ClearStructuredLogs full-clear now resets HasLogs on all resources and clears _resourceUnviewedErrorLogs
|
@adamint can you please take a look here too? |
- Add MaxScopeCount limit to TryGetOrAddScope, using TryGetValue instead of GetValueRefOrAddDefault to avoid add-then-remove on overflow - Refactor instrument add in OtlpResource to use TryGetValue + count check before inserting, removing the add-then-remove pattern - Fix AddLogs failure count: count log records, not scopes - Fix AddMetrics failure count: count data points, not metrics - Add tests for resource limit, scope limit, and correct failure counting
|
🎬 CLI E2E Test Recordings — 66 recordings uploaded (commit View all recordings
📹 Recordings uploaded automatically from CI run #25197587185 |
|
Test errors are on the release/13.3 branch and are unrelated. |
|
Pull request created: #797
|
|
A draft documentation PR has been opened on Summary of documentation changes: The dashboard [configuration docs]((learn.microsoft.com/redacted) were updated to document the new File modified: The draft PR needs human review before merging.
|
Description
Fix unbounded collection growth in
TelemetryRepositoryand related OTLP model types that could lead to memory exhaustion in long-running dashboard sessions.Problems fixed:
_logScopes,_traceScopes,_logPropertyKeys,_tracePropertyKeys, and_spanLinkswere never cleaned up when logs/traces were cleared, causing indefinite accumulation._resourceshad no cap, allowing unbounded growth from dynamic services or high-cardinality peer addresses._instruments,_meters,_resourceViewshad no caps.DimensionsandKnownAttributeValueshad no caps, making high-cardinality metric tags a memory leak vector.ClearMetrics()did not clear_meters(scopes).ClearTraces()full-clear path did not clear_spanLinks(sinceCircularBuffer.Clear()doesn't fireItemRemovedForCapacity).ClearTraces()path did not remove span links for removed traces.Changes:
MaxResourceCountoption toTelemetryLimitOptions(default 10,000). ThrowsInvalidOperationExceptionon limit exceeded, handled by existing callers._logScopes/_logPropertyKeysinClearStructuredLogsand_traceScopes/_tracePropertyKeys/_spanLinksinClearTraceson full clear._logPropertyKeys/_tracePropertyKeysentries and span links on per-resource clear._metersalongside_instrumentsinOtlpResource.ClearMetrics().internal constlimits onTelemetryRepositoryfor resource views, instruments, dimensions, known attribute value keys, and values per key (all 10,000).OtlpResource.GetView,OtlpResource.AddMetrics,OtlpInstrument.FindScope, andOtlpInstrument.CreateDimensionScope.Checklist
<remarks />and<code />elements on your triple slash comments?aspire.devissue: