Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

datadog-exporter: don't set tags which are None #272

Closed

Conversation

alertedsnake
Copy link
Contributor

@alertedsnake alertedsnake commented Dec 22, 2020

Description

This is often None, but tags are always strings, and so things get broken when spans get passed along to other client calls.

Fixes #262

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

This passes existing tests, and I've added one to verify when the field is set to None, it's skipped.

Does This PR Require a Core Repo Change?

  • No.

Checklist:

See contributing.md for styleguide, changelog guidelines, and more.

  • Followed the style guidelines of this project
  • Changelogs have been updated
  • Unit tests have been added

@alertedsnake alertedsnake marked this pull request as ready for review December 23, 2020 03:31
@alertedsnake alertedsnake requested a review from a team as a code owner December 23, 2020 03:31
@alertedsnake alertedsnake requested review from codeboten and lzchen and removed request for a team December 23, 2020 03:31
@alertedsnake
Copy link
Contributor Author

One of the tests failed with a "performance alert" when I merged master in, not sure what that's about?

@@ -97,7 +97,7 @@ def inject(
self.SAMPLING_PRIORITY_KEY,
str(constants.AUTO_KEEP if sampled else constants.AUTO_REJECT),
)
if constants.DD_ORIGIN in span.context.trace_state:
if span.context.trace_state.get(constants.DD_ORIGIN) is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this will not happen from the latest version. Earlier TraceState was simple dict and there were no validation upon add/update key-value pair. We have fixed that in latest release and key-value pair much adhere to W3C tracestate format rules otherwise they will be discarded.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah very nice. I'll have to see if I can recreate the scenario I described in #262 with the current versions. Though, I am a little concerned that by disallowing that to be None we'll have moved the error from this code I patched, to whatever code was actually causing it in the first place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, rebased on the current main branch, and after some more testing, this PR still fixes the issue described in #262, so whatever change you're describing did not fix things.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's strange that you are still running into that error. From here every value in the tracestate must be a str type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, and I still haven't figured out where it's coming from.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you update the opentelemetry-api and propagator versions you are using? Can you provide a reproducible example?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, this is with code from the main branch of both main and contrib repositories.

I'll see if I can strip down something - so far I've only seen when making gRPC client calls to a gRPC service running the botocore instrumentation and datadog exporter, so it's a little complex to reproduce, but I might be able to simplify more. Give me a day or two.

Base automatically changed from master to main January 29, 2021 19:28
@alertedsnake alertedsnake force-pushed the datadog-propagator-bugfix branch 2 times, most recently from 926fcba to da614b6 Compare February 14, 2021 19:18
@alertedsnake
Copy link
Contributor Author

alertedsnake commented Feb 14, 2021

Aha, so I've managed to nearly re-recreate the scenario in a simplified way, and I've seen this message in the log:

WARNING:opentelemetry.trace.span:Invalid key/value pair (dd_origin, ) found.

I assume this is the thing you were talking about @lonewolf3739 - it does make my patch not necessary. So it's quite possible this is fixed, let me recreate all of my virtualenvs and pull in the latest, I must have missed something.

edit: actually what I should be doing is trying to find the source of that log message ^ - clearly that's the actual bug.

@lzchen
Copy link
Contributor

lzchen commented Feb 24, 2021

@alertedsnake
Any updates on this?

@alertedsnake
Copy link
Contributor Author

@lzchen yeah, there's some more detail in #262 - at this point I have found where the problem happens, but I don't really know the intent of the code that's setting this field to None, was asking if you guys knew more about what it's supposed to be doing.

This comment specifically: #262 (comment)

@alertedsnake
Copy link
Contributor Author

This fix no longer seems necessary as there isn't an exception thrown anymore, but the underlying bug does still seem present, see further discussion in #262 .

@alertedsnake
Copy link
Contributor Author

Hi, so I'm going to re-open this, since it does actually fix an issue I'm finding with None making it into HTTP headers being sent via botocore requests;

TypeError: expected string or bytes-like object

The header that's incorrect looks like this:

'x-datadog-origin': None,

The code in this PR does indeed fix the issue. But again, this is only fixing an incorrect issue from propagating, it's not fixing the issue from happening - see all of the previous discussion about that.

@srikanthccv
Copy link
Member

Can you share some minimal reproducible example for this problem? I don't see the possibility of tracestate value to be set to None.

@alertedsnake
Copy link
Contributor Author

As I mentioned, there's a lot of previous discussion wherein I went through this in detail, both here and in the linked issue. In this case, I've got a gRPC service that makes a query to DynamoDB, and it's when that Dynamo call gets made (with the botocore instrumentation) that one of the HTTP headers on the request is 'x-datadog-origin': None.

In this message I mention the code that causes the problem and asked for some insight from others who have worked on this - getting an answer here would help quite a bit:
#262 (comment)

There isn't really a minimal case, I could probably make one up if necessary, but I think it's pretty clear where the problem occurs from that question.

@srikanthccv
Copy link
Member

I haven't followed that discussion closely. If this indeed fixes the problem then there is a possibility that same problem exists in other propagators as well. If you could provide some reproducible example then we could find the root of problem and fix that instead of adding checks to prevent None getting propagated.

I don't think this fix changes anything by looking at theclass TraceState(typing.Mapping[str, str]): implementation unless I am missing something.

@alertedsnake
Copy link
Contributor Author

Here is my test case: https://github.com/alertedsnake/otel-test with some instructions on how to run it and how to discover the problem. Let me know if this helps.

@srikanthccv
Copy link
Member

Here is my test case: https://github.com/alertedsnake/otel-test with some instructions on how to run it and how to discover the problem. Let me know if this helps.

Thanks, I will check it.

@srikanthccv
Copy link
Member

Here is my test case: https://github.com/alertedsnake/otel-test with some instructions on how to run it and how to discover the problem. Let me know if this helps.

Looks like I need api keys/creds of various service providers otherwise I won't be able to run this example. I am fine with this getting in although I think this doesn't really fix the underlying problem. @alertedsnake Did you try pip editable install or something similar to see if this really fixed your issue?

@alertedsnake
Copy link
Contributor Author

Yeah, you'll need data dog keys, or to use one of the agent mocking Docker images, I can try to get that working.

As for the fix in this PR, yes, it does prevent the exception. But yes, it doesn't fix the actual problem - I was still hoping to get some insight from the original author or someone who knows the suspect code well.

@alertedsnake
Copy link
Contributor Author

@lonewolf3739 Ok I updated the test case to not require Datadog credentials, and added a line to the readme about the AWS credentails - apparently you don't need either to see the error happen.

@srikanthccv
Copy link
Member

@alertedsnake Thanks for sharing the updated example. I have done the troubleshooting and found the underlying problem. This line if constants.DD_ORIGIN in span.context.trace_state: evaluates to true even when there is no dd_origin key. The TraceState doesn't implement the __contains__ so the fallback as described in python docs returns true.

For user-defined classes which do not define contains() but do define iter(), x in y is True if some value z, for which the expression x is z or x == z is true, is produced while iterating over y. If an exception is raised during the iteration, it is as if in raised that exception.

Lastly, the old-style iteration protocol is tried: if a class defines getitem(), x in y is True if and only if there is a non-negative integer index i such that x is y[i] or x == y[i], and no lower integer index raises the IndexError exception. (If any other exception is raised, it is as if in raised that exception).

And this line span.context.trace_state[constants.DD_ORIGIN], returns None which is the reason for this problem.

@alertedsnake
Copy link
Contributor Author

alertedsnake commented Apr 14, 2021

This line if constants.DD_ORIGIN in span.context.trace_state: evaluates to true even when there is no dd_origin key. The TraceState doesn't implement the __contains__ so the fallback as described in python docs returns true.

This is the kind of thing I was hoping someone with more knowledge of OTEL could figure out, nice work! I guess this means my fix in this PR is the actual fix, I thought I was just fixing the symptom.

Michael Stella added 4 commits April 14, 2021 14:07
This is often `None`, but tags are always strings, and so things get
broken when spans get passed along to other client calls.

Fixes open-telemetry#262
@srikanthccv
Copy link
Member

No this is not the actual fix. We need to change tracestate to behave correctly with membership. Once we do that this is no longer needed.

@alertedsnake
Copy link
Contributor Author

Okay, that sounds even better. I'm happy to close this in favor of a ticket in the main repo.

@alertedsnake
Copy link
Contributor Author

Okay, that sounds even better. I'm happy to close this in favor of a ticket in the main repo.

In fact, I've got a fix, will make a PR shortly.

@alertedsnake
Copy link
Contributor Author

open-telemetry/opentelemetry-python#1773 - I don't think it needs anything more than this, but let me know if so.

Happy to open an issue in that repo if we think it needs one.

@lzchen
Copy link
Contributor

lzchen commented Apr 16, 2021

@lonewolf3739
Is this PR still relevant now that this is merged?

@srikanthccv
Copy link
Member

This can be closed.

@lzchen lzchen closed this Apr 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

gRPC client + botocore + datadog exporter instrumentation error
3 participants