Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add error.hint semantic convention #822

Closed
wants to merge 11 commits into from

Conversation

mwear
Copy link
Member

@mwear mwear commented Aug 18, 2020

Changes

This PR adds an error.hint semantic convention, and updates the span.RecordException API method to optionally set it.

The purpose of this attribute is to convey an instrumentation author's best guess as to whether an exception represents an error condition. This will be a flag that tracing backends can use to identify candidate spans where an error may have been observed.

Tracing backends can use this information:

  • for tail-based sampling decisions
  • to compute an application's error rate
  • as a cue to do more analysis on a span's attributes / events
  • to highlight a span in a trace

It serves a similar purpose to theerror: true tag from OpenTracing. It's called error.hint rather than error in order to impart that this is not a definitive decision and a tracing backend may override it.

If we remove the Span.status API as proposed in in #706 a similar semantic convention will be needed as Jaeger exporters from a number of projects use it as a proxy for the error = true OpenTracing semantic convention.

See:

Related issues #599 #706

@Oberon00
Copy link
Member

Oberon00 commented Aug 18, 2020

How do you see the relation of this PR to #784's exception.escaped? I'm a bit wary of adding error.hint to RecordException, as it seems to be a generally useful event attribute and I think we should find a more general solution to record non-exception-specific attributes with RecordException, see #814. Oh, it's a span attribute. Then why add it to recordException at all? It can be set with setAttribute just fine. Or if we want a dedicated API, I would make a separate one. EDIT: Still, see #814

@iNikem
Copy link
Contributor

iNikem commented Aug 18, 2020

Btw, @Oberon00 are you aware that there is a meeting of Errors WG every week? As it seems that you have great interest in this topic, care to join us there for real-time discussion?

@Oberon00
Copy link
Member

Oberon00 commented Aug 18, 2020

I copied the meeting into my calendar, thanks for the reminder 👍

Copy link
Member

@Oberon00 Oberon00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes because I think this should be separate from exceptions.

specification/trace/semantic_conventions/exceptions.md Outdated Show resolved Hide resolved
@Oberon00 Oberon00 added area:api Cross language API specification issue area:error-reporting Related to error reporting area:semantic-conventions Related to semantic conventions labels Aug 18, 2020
@Oberon00
Copy link
Member

Based on #706 (comment), I'm marking this after-ga for now.

@Oberon00 Oberon00 added the release:after-ga Not required before GA release, and not going to work on before GA label Aug 18, 2020
@mwear
Copy link
Member Author

mwear commented Aug 18, 2020

Based on #706 (comment), I'm marking this after-ga for now.

If we don't have this (or something similar) for GA, we cannot remove span status. At the very least, Jaeger exporters depend on this information as I linked in the description (repeated below):

@iNikem
Copy link
Contributor

iNikem commented Aug 18, 2020

@mwear Is this "error" tag in Jaeger somehow specified? Or is it just a de facto convention?

@mwear
Copy link
Member Author

mwear commented Aug 18, 2020

@mwear Is this "error" tag in Jaeger somehow specified? Or is it just a de facto convention?

It's a semantic convention from OpenTracing that Jaeger has built functionality around: https://github.com/opentracing/specification/blob/master/semantic_conventions.md#span-tags-table,

@Oberon00 Oberon00 added release:required-for-ga Must be resolved before GA release, or nice to have before GA and removed release:after-ga Not required before GA release, and not going to work on before GA labels Aug 18, 2020
@Oberon00
Copy link
Member

From the SIG mtg: We have Jaeger in our GA requirements and Jaeger more or less requires the ability to mark errored spans, so marking this as required for GA.

@bogdandrutu
Copy link
Member

@Oberon00 we have required to support exporting Jaeger which does not require to support a way to record error in the API. You can argue that we have OpenTracing -> OpenTelemetry bridge to support which require a solution for "error = true".

Copy link
Member

@Oberon00 Oberon00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My original complaint is resolved (#822 (comment)). Thank you!

But I'm still marking this as "Request changes" until we have resolved how to deal with ever-growing parameter count for RecordException, see my review comment in api.md #822 (comment).

## Error Attributes

A span or event can be annotated with an `error.hint` attribute to indicate that
an error condition was observed. If an [exception event](exceptions) is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please split the exception special case in a separate paragraph and ensure that it is clear that this attribute can be used without exceptions. With the way it's currently worded, it seems like this can only be used with exceptions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On similar lines, do we want to define a recommended mapping of http status code to error hint? Is that a use case for this attribute?

Copy link
Member

@Oberon00 Oberon00 Aug 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this would be a good idea. If you do that, you don't know if the error hint comes from an "intelligent" guess or is just a redundantly mapped value. Or maybe we should make error.hint an enum / add an additional attribute to indicate "this is based on another (main status) attribute" vs "this is a more intelligent guess" (e.g. a 404 is unexpected here).
I have no strong opinion on this yet though.

specification/trace/api.md Outdated Show resolved Hide resolved
@Oberon00
Copy link
Member

@carlosalberto @bogdandrutu If you think this is after-GA, please relabel. I only relabelled this because I thought this was our decision in the SIG meeting.

@tigrannajaryan
Copy link
Member

@carlosalberto @bogdandrutu If you think this is after-GA, please relabel. I only relabelled this because I thought this was our decision in the SIG meeting.

I would prefer to resolve this before GA. Especially given that there is an intent to remove Span Status. We will loose normalized ability to indicate errors in the spans, and many products need some hint to show erroneous spans in the UI.

@carlosalberto
Copy link
Contributor

carlosalberto commented Aug 23, 2020

I would prefer to resolve this before GA. Especially given that there is an intent to remove Span Status.

Agreed. For any other semantic convention, I'd be fine to postpone it, but this is a special case (also, the Jaeger exporter needs a way to report errors).

Edit: No need to re-label, as it actually stands as "required-for-ga" already ;)

Copy link
Member

@arminru arminru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mwear Please add a changelog entry.

specification/trace/api.md Outdated Show resolved Hide resolved
specification/trace/api.md Outdated Show resolved Hide resolved
specification/trace/semantic_conventions/span-general.md Outdated Show resolved Hide resolved
specification/trace/semantic_conventions/span-general.md Outdated Show resolved Hide resolved
mwear and others added 3 commits August 28, 2020 13:38
@iNikem
Copy link
Contributor

iNikem commented Aug 31, 2020

@open-telemetry/specs-approvers @open-telemetry/technical-committee Accepting and merging this PR is one of the ways to speed up open-telemetry/oteps#134. Is it possible to expedite this?

@Oberon00 Oberon00 dismissed their stale review August 31, 2020 06:53

See my comment I'm about to post

as for the `AddEvent` method). If attributes with the same name would be
generated by the method already, the additional attributes take precedence.

- A boolean error hint, as defined by the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a question I'd like to have an answer to before merging this: Why is this additional parameter in the RecordException spec instead of the AddEvent spec. Do we expect more intelligent guesses (beyond exception.escaped / #784) to be more likely for exceptions than any other event? Personally, I don't see a reason for that to be true, but there may be one.

I would thus propose to move this parameter to AddEvent. The "Note: RecordException may be seen as a variant of AddEvent with additional exception-specific parameters and all other parameters being optional" then implicitly already allows languages to add error.hint (as well as event name and timestamp) as additional optional parameters, though I would be fine with making these more explicit or even SHOULD requirements here (I'm also OK with a specific MUST requirement for error.hint for RecordException that references the actual spec wording in AddEvent.

The rationale being, that I'd like to see error.hint not unnecessarily coupled with exceptions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would thus propose to move this parameter to AddEvent.

Why? I imagine two main cases for this semantic convention:

  1. There was some exception and you want to mark it as an actual irrecoverable error, thus you call RecordException(error_hint=true)
  2. There was something wrong with some parameters, a validation failed, or some timeout happened, and you want to set the convention directly (no need for an event to happen), without an actual event, so you call SetAttribute() for it.

If users actually want to have a error_hint parameter for AddEvent() we will know for sure, and we can add it in the future. It's easier to add it than to remove it later on IMHO.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. There was some exception and you want to mark it as an actual irrecoverable error, thus you call RecordException(error_hint=true)

First, I don't think that it is usual to know whether an exception is irrecoverable or not, usually you only know if it escaped. Second, I would be OK with leaving the parameter a MUST for RecordException, I just don't want the primary definition to be associated with exceptions more than required. I fear that error.hint could become a more widely used but less consistent synonym for exception.escaped, and am happy about every bit of distance it has from exceptions.

  1. There was something wrong with some parameters, a validation failed, or some timeout happened, and you want to set the convention directly (no need for an event to happen), without an actual event, so you call SetAttribute() for it.

It's true that with the current semantic conventions, exceptions are special in that they are the only AFAIK semantic convention for events yet (besides gRPC stream messages). Yet no one is arguing to remove AddEvent. I just think that error.hint is more logical there but I agree that currently there is no use case for having it in AddEvent (other than manually building an exception event).

It's easier to add it than to remove it later on IMHO.

As far as that goes, since #874 it is possible to have any additional attributes in RecordException, so the errorHint parameter could be replaced by calling RecordException with that additional attribute and then calling SetAttribute("error.hint", true) (except for the "only set to false if not set yet" part, which would have to be substituted by "only call SetAttribute if true"). The main thing we accomplish by having this parameter, is guiding users to use the semantic attribute. And personally I am not sure if having an "intelligent guess" that goes beyond exception.escaped is common enough to have it baked into the API. To that end, I would be fine with removing the API changes without replacement (instead of moving/copying the parameter to AddEvent) too, maybe it would be even better after all to move the API changes to a separate PR.

@carlosalberto
Copy link
Contributor

@open-telemetry/specs-approvers Please review/approve this PR. As @iNikem said, this is blocking an OTEP, so your help is needed (and I don't want to play the dictator by approving/merging this based on my own judgment ;) )

@tigrannajaryan
Copy link
Member

tigrannajaryan commented Aug 31, 2020

Looks like the are still disagreements about how the API should look like. In order to make progress I suggest that we split this PR into 2 parts:

  1. Introduction of "error.hints" semantic convention. I think everybody is in agreement about this part. I do not see unresolved comments and this can be merged quickly. This will be also sufficient to unblock and move forward Remove Span.Status oteps#134
  2. Discussion around API, whether it needs to be RecordException or AddEvent and what the parameters should look like. This can take a bit more time, but will no longer block OTEP 134.

@tigrannajaryan
Copy link
Member

@Oberon00 can you please attend the Error WG meeting this Thursday 8am PT? We want to make the final call on this PR and on the error.hint (provided that is not resolved offline in this thread before that).

@mwear
Copy link
Member Author

mwear commented Aug 31, 2020

I appreciate the willingness to split this up to facilitate moving this and related PRs forward, but am conflicted in that I think it's essential that we have an API to help manage the interplay between spans and events when setting the error.hint semantic convention. My concern with splitting this PR is that we could end up with the error.hint semantic convention included in GA, without the corresponding API. The other consideration, is that we are proposing a replacement for span status. We should ensure that the replacement is actually an improvement. I'm not sure that the semantic convention alone would be an improvement.

@Oberon00
Copy link
Member

Oberon00 commented Aug 31, 2020

we could end up with the error.hint semantic convention included in GA, without the corresponding API

I don't think that's all that concerning. We also don't have an API for any other semantic convention. OpenTracing also did not have an API for setting the error attribute AFAIK. EDIT: You still can set error.hint using the additional attributes argument to RecordException plus SetAttribute on the span.

I'm not sure that the semantic convention alone would be an improvement.

Why is that?

@mwear
Copy link
Member Author

mwear commented Aug 31, 2020

I had two main objectives when opening this PR:

  • To have a span level attribute to indicate whether an error was (likely) observed during the execution of span
  • Encourage a user to think about, and set that attribute when recording an exception (via the RecordException API).

As a result of the discussions on the PR, we realized that if you also add the an error.hint attribute to the exception event, it would be possible to link exceptions that were a cause of the error.hint on the span. This added some additional complexity as it requires setting attributes in multiple places. This makes the API all that more important as it will be easy for a user to forget, or simply be unaware that the attribute should be recorded in more than one place.

To answer your questions @Oberon00:

Why is the API important?

  • The API has always been important since the beginning of this PR, even when it was just recording a span level error.hint attribute. The initial reason was to improve on the situation that existed in OpenTracing, which had the error: true semantic convention. It was common to record an error as a log (event in Otel), but forget to record the error: true span tag.

  • Since this PR evolved to setting the matching error.hint attributes on a span and their corresponding events, we've introduced additional complexity that the API shields the user from.

  • It will help backends that want to be able to build functionality on the existence of matching error.hint attributes between spans and events, and it will likely be correct, most of the time. Requiring users to a) know about and b) use multiple API methods will be error prone

Why might we want to keep span status without the API component of this PR?

Revisiting the two objectives for this PR mentioned above, one was having a span level indication of whether or not an error (likely) was observed, and the other was an API to encourage users to think about, and set that attribute when recording an exception. Without the API we are left with the span level indicator of (likely) error. A non-ok span status can be used as a proxy for that situation. One benefit is that it is more nuanced than a boolean, another benefit is that is has a first class API method on the span to help and encourage users to set it. One downside, is that the API method doesn't encourage a user to to consider setting it when recording an exception, as the current proposal does, but given the choice between an error.hint convention (without an API) and span status, span status looks more appealing.

@tedsuo
Copy link
Contributor

tedsuo commented Aug 31, 2020

Regardless of whether it is error.hint or span.status, I think we need to extend the Event/Exception API to include the option of setting this marker. It will be inevitable that users fail to set it if we don't.

Copy link
Member

@Oberon00 Oberon00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing that could convince me that the API is useful would be concrete examples of instrumentations that would set error.hint different from exception.escaped.

@iNikem
Copy link
Contributor

iNikem commented Sep 1, 2020

One thing that could convince me that the API is useful would be concrete examples of instrumentations that would set error.hint different from exception.escaped.

Easy: servlet auto-instrumentation which sets error.hint=true whenever returned http status is >500

@Oberon00
Copy link
Member

Oberon00 commented Sep 1, 2020

That makes sense, but would it call RecordException for that?

@iNikem
Copy link
Contributor

iNikem commented Sep 1, 2020

That makes sense, but would it call RecordException for that?

Oi, sorry, of course not :) I forgot for a second that we are talking about RecordException API here.

Regarding error.escaped: I have a service, which queries some data. Data query throws some network exception, I catch it, log it and throw another, service-specific exception. Now, manual instrumentation inside that service method can record data querying exception with RecordException API setting error.hint=true but error.escaped=false. If I understand correctly the meaning of error.escaped.

@Oberon00
Copy link
Member

Oberon00 commented Sep 1, 2020

It's exception.escaped, not error.escaped, but I think you described the meaning correctly. #761 has an example and a definition of "escaped".
You would probably record another exception for the service exception with exception.escaped=true

@iNikem
Copy link
Contributor

iNikem commented Sep 1, 2020

You would probably record another exception for the service exception with exception.escaped=true

Yes, I think so.

Copy link
Member

@bogdandrutu bogdandrutu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As pointed in another comment #784 (comment) we should make a clear recommendation when something is API concept vs Semantic convention. We should not allow some implementation to have an API concept that others have as semantic convention because there may be different requirements for backwards compatibility between the two for example.

@Oberon00
Copy link
Member

Oberon00 commented Sep 3, 2020

@bogdandrutu Probably it is too late for you to see this now, but maybe you can jump into today's (8 AM Pacific) error WG meeting? https://www.google.com/calendar/event?eid=MXM5Y3QydWhhdGltZjY3cDNiOHNxN2szaWJfMjAyMDA5MDNUMTUwMDAwWiBnb29nbGUuY29tX2I3OWUzZTkwajdiYnNhMm4ycDVhbjVsZjYwQGc&ctz=America/Los_Angeles

Copy link
Member

@Oberon00 Oberon00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the error WG meeting just now (see Notes on Google Docs), I think we should not merge this PR, not even just the error.hint convention without API change, currently. The reason is that it is unclear when you are supposed to set error.hint:

  • Is it a "weak" hint that is to be used by backends as a fallback when it cannot determine whether this was an error
    otherwise? I.e. is this a helper hint for backends which do not interpret http.status_code and I should set it on an HTTP span whenever I have a status >=400?
  • Is it a "strong" hint that should override the backends' decision on whether this is an error? For example I might have an "delete_user_account" or "force_abort_operation" operation that I want to treat as an error even when it has a status of 200, or I might have a low priority "is this ready" API where a 503 is not considered an error.
  • There is the third case, that @mwear pointed out, that you are tracing some operation for which there are no other semantic conventions to tell whether it is an error or not (for example I want to trace some internal function in my code). In that case, it would not matter if you have a strong or a weak hint.

It seems that jaeger/OpenTelemetry error tag was used mostly as a "weak" hint and was expected to be set whenever there was an indication of an error. But maybe we want both things. I'd say it would be best if we did not put the burden of providing a weak hint on instrumentations. And a strong hint can usually not be supplied by generic instrumentations but only by application's manual instrumentation (if at all).

One more thought: We may even think about a convention error.hint.<priority-integer/enum>. Or maybe we should record an indicator on whether this is a generic or application instrumentation in the InstrumentationLibraryInfo. Lot's of things to be considered...

@tigrannajaryan
Copy link
Member

I agree. This is not ready for merging, still too much disagreement. I believe @tedsuo is going to come up with an alternate proposal.

@tigrannajaryan
Copy link
Member

We may even think about a convention error.hint.<priority-integer/enum>.

Do I sense that this is gravitating towards my earlier proposal to use SeverityNumber? :-)

@Oberon00
Copy link
Member

Oberon00 commented Sep 3, 2020

We would have a severity linked with a success value. So you could have a success with a high "severity". More a decision weight / probability of being right. But this is just an idea I threw out, probably too complex to be useful anyway.

@github-actions
Copy link

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Sep 11, 2020
@mwear
Copy link
Member Author

mwear commented Sep 17, 2020

Closing this in favor of: #965.

@mwear mwear closed this Sep 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:api Cross language API specification issue area:error-reporting Related to error reporting area:semantic-conventions Related to semantic conventions release:required-for-ga Must be resolved before GA release, or nice to have before GA Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants