Fix telemetry streaming to wait for resource to appear#17821
Conversation
When streaming spans/logs with a resource filter via follow=true, if the resource hasn't sent any telemetry yet, the stream would immediately return empty. This fixes the issue by waiting for the resource to appear via the OnNewResources subscription before starting the watcher. Uses a bounded channel (capacity 1, DropOldest) as the signaling mechanism to avoid race conditions between notifications and state checks. Fixes #17802
|
🚀 Dogfood this PR with:
curl -fsSL https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 17821Or
iex "& { $(irm https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 17821" |
There was a problem hiding this comment.
Pull request overview
This PR fixes telemetry streaming behavior in the Aspire Dashboard so that follow=true span/log streams don’t immediately end when a resource filter is specified but the resource hasn’t emitted telemetry yet. Instead, the service now waits for the resource(s) to appear and then begins streaming.
Changes:
- Updated
FollowSpansAsyncandFollowLogsAsyncto wait for resource discovery via a newWaitForResourceKeysAsynchelper rather than returning an empty, closed stream. - Added
WaitForResourceKeysAsyncwhich subscribes to resource-add notifications and blocks (cancellable) until the requested resource key(s) can be resolved. - Added new unit tests validating that streaming started before telemetry arrival will later yield data once the resource appears.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
src/Aspire.Dashboard/Api/TelemetryApiService.cs |
Adds a resource-wait mechanism before starting span/log watchers, preventing premature stream termination for filtered follow streams. |
tests/Aspire.Dashboard.Tests/TelemetryApiServiceTests.cs |
Adds tests covering “subscribe before emission” scenarios for both spans and logs. |
adamint
left a comment
There was a problem hiding this comment.
reviewed with gpt-5.5 and opus 4.8/4.7/4.6. I don't see anything high severity here, so approving.
One non-blocking issue I think is worth looking at: WaitForResourceKeysAsync waits on OnNewResources, but TelemetryRepository.GetResources() excludes uninstrumented peers. If a resource already exists only as an uninstrumented peer, later telemetry upgrades it via SetUninstrumentedPeer(false) without raising OnNewResources because the resource wasn't newly added. A follow stream for that resource can stay stuck until some unrelated resource is added. Maybe the wait path should include uninstrumented peers when resolving names, or promotion should raise the resource subscription.
I also saw the fixed-delay test comments that the bot already left. I don't think those block this PR.
|
Tested this locally:
I tried to do the dashboard Playwright smoke check too, but couldn't get a dashboard app running for a reason that looks unrelated to this PR: the generated starter app fails to compile with |
|
Is there a PR testing report? |
PR Testing ReportPR Information
CLI Version Verification
Changes AnalyzedFiles Changed
Change Categories
Test Setup
Test Scenarios ExecutedScenario 1: Follow logs for an already-running resource (regression check)Objective: Verify the fix didn't break the existing case where the resource is already known.
Scenario 2: Follow logs before the resource emits telemetry (the bug being fixed)Objective: Validate the PR's central intent — that a stream opened before a resource has emitted any telemetry waits and then delivers data once the resource appears. Direct E2E reproduction would require either restarting the AppHost while a stream is open or adding a resource to the running model — neither is straightforward without invasive rigging. The PR includes targeted unit tests for this exact path:
These exercise the Scenario 3: Follow logs for a resource that never appears (key behavior change)Objective: Validate the most important new behavior introduced by the PR — that an unresolvable resource name no longer terminates the stream immediately.
Pre-fix behavior would have returned HTTP 200 with an immediately-closed empty NDJSON body. The fix achieves "stream stays open" as designed. Important nuance for consumers: Because
Consumers (the AI assistant / MCP-style tools) that wait on response headers as a readiness signal will hang. Most HTTP clients will treat this as a request timeout rather than a successful empty stream. Scenario 4: Partial resolution — one resource exists, one doesn't (semantics preserved)Objective: Confirm the all-or-nothing semantics of
Scenario 5: Spans endpoint symmetric checkObjective: Confirm both
Summary
Overall Result✅ PR verified — fix behaves as advertised, with one notable refinement worth flagging: Recommendations / things to consider
|
3142278 to
7af8f60
Compare
…ve Task.Delay from tests
7af8f60 to
7e40583
Compare
…ogsRequest to use ResourceKeys Change ResourceKey? ResourceKey property to IReadOnlyList<ResourceKey> ResourceKeys on all telemetry request types for consistency with GetLogsContext. This pushes multi-resource filtering into the repository layer, eliminating per-resource loops and client-side filtering in TelemetryApiService.
f096320 to
331ea74
Compare
331ea74 to
70d1d5e
Compare
|
❓ CLI E2E Tests unknown — 110 passed, 0 failed, 2 unknown (commit View all recordings
📹 Recordings uploaded automatically from CI run #26805992660 |
Document that when using ?follow=true with a resource filter, the stream waits for the named resource(s) to appear in the dashboard before streaming data. This matches the behavior fixed in microsoft/aspire#17821. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Pull request created: #1182
|
|
📝 Documentation has been drafted in microsoft/aspire.dev#1182 targeting Updated Note This draft PR needs human review before merging. |
Summary
Fixes #17802
When streaming spans/logs with a resource filter via
follow=true, if the resource hasn't sent any telemetry yet, the stream would immediately return empty (HTTP 200 with closed NDJSON stream). This is a problem because the dashboard starts streaming before the resource has had a chance to emit telemetry.Changes
src/Aspire.Dashboard/Api/TelemetryApiService.csWaitForResourceKeysAsyncmethod that subscribes toOnNewResourcesnotifications and waits for the requested resource(s) to appear before starting the span/log watcherDropOldest) as the signaling mechanism to avoid race conditions between notifications and state checksFollowSpansAsyncandFollowLogsAsyncnow call this instead of immediately yielding break on unknown resourcestests/Aspire.Dashboard.Tests/TelemetryApiServiceTests.csFollowSpansAsync_WaitsForResourceToAppear_ThenStreams- verifies that streaming waits for a resource and yields data once it appearsFollowLogsAsync_WaitsForResourceToAppear_ThenStreams- same for logsDesign
The subscribe-before-check pattern ensures no notification is lost:
OnNewResourceschannel signal firstOperationCanceledExceptionpropagates naturally when the client disconnectsThe bounded channel with
DropOldestcollapses multiple rapid resource notifications into a single "wake up and re-check" signal, which is simpler and more robust than a TCS-reset approach.