impl(o11y): record telemetry attributes on LRO span#5695
Conversation
There was a problem hiding this comment.
Code Review
This pull request enhances tracing for Long Running Operations (LROs) by recording operation names and resource IDs in the aip151 and discovery modules, and updating the 'LRO Wait' span with additional metadata. Feedback was provided regarding the removal of instrumentation from the into_stream method in src/lro/src/internal/tracing.rs, which would lead to fragmented traces and a loss of parent span context for operations executed within the stream.
1471b6c to
5978295
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #5695 +/- ##
==========================================
- Coverage 97.89% 97.89% -0.01%
==========================================
Files 226 226
Lines 55471 55485 +14
==========================================
+ Hits 54304 54315 +11
- Misses 1167 1170 +3 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
5978295 to
7355448
Compare
| query, | ||
| ); | ||
| let p0 = poller.poll().await; | ||
| let p0 = poller.poll().instrument(test_span()).await; |
There was a problem hiding this comment.
Shouldn't we verify the span contents?
Aside: consider writing new tests just for tracing. Often it is quickest to use existing tests to test a new feature... but when you do that enough times the tests get complicated and lose their focus.
There was a problem hiding this comment.
ack. I added unit tests for both aip151 and discovery to verify the 2 specific attributes we populate for LROs.
7355448 to
a983c67
Compare
| span.record("gcp.longrunning.operation_name", name); | ||
| span.record("gcp.resource.destination.id", name); |
There was a problem hiding this comment.
are these ever different?
There was a problem hiding this comment.
in the final design we decided that they're always the same. we expected that downstream tools (e.g., AppHub) will use Operation ID as the default.
| async fn poll(&mut self) -> Option<PollingResult<ResponseType, MetadataType>> { | ||
| if let Some(start) = self.start.take() { | ||
| let result = start().await; | ||
| let (op, poll) = crate::details::handle_start(result); |
There was a problem hiding this comment.
Is it possible that the operation is marked done inside the start()? If so we might want to extract the name from the result above handle_start?
There was a problem hiding this comment.
That's a good point. I fixed the loops to extract the name directly from result (if Ok) before calling handle_start. Now we can make sure we don't miss these attributes.
I added a test for this case.
| let (op, poll) = crate::details::handle_start(result); | ||
| #[cfg(google_cloud_unstable_tracing)] | ||
| if let Some(ref name) = op { | ||
| let span = tracing::Span::current(); |
There was a problem hiding this comment.
if tracing is disabled, this might interfere with the customers' spans. Can we attach the span directly to the PollerImpl?
There was a problem hiding this comment.
Yeah you're right. I considered a few approaches for context propagation:
- adding
operation_name()as a trait method (impl(o11y): add operation name support toPollerinterface #5694): this pollutes our public API and intermediate decorators. - direct span mutation (what you were looking at): this might interfere with customers' spans.
tokio::task_local!(the newer commits).
We also used task-local back then for gaxi RequestRecorder, so I guess it's okay to adopt this.
There was a problem hiding this comment.
Regarding attaching the span directly to PollerImpl: I wanted to avoid adding a tracing field to PollerImpl because that forces our polling struct to actively manage and carry o11y variables even when tracing is disabled.
3b62b4a to
5b5d0da
Compare
5b5d0da to
27f2c17
Compare
…tion IDs internally
58f49ef to
e8ac197
Compare
In our previous iteration (#5694), we introduced an
operation_name(&self)query method to the publicPollertrait.To avoid polluting the public API, we utilize thread-local span propagation: because the LRO span is active in the thread context, the internal pollers (
PollerImplandDiscoveryPoller) record the LRO operation name and destination resource ID directly onto the current active span using tracing::Span::current().record().