We will be closing this issue as it is a low priority for us. It is unlikely that we'll ever get to it, and so we'd like to set expectations accordingly.
As we enter 2022 Q1, we are trimming our OSS backlog. This is so that we can focus better on areas that are more aligned with the OpenTelemetry-focused direction of telemetry ingest for Honeycomb.
If this issue is important to you, please feel free to ping here and we can discuss/re-open.
Background
I have two services instrumented in production: a Ruby service that talks to a Python service. I just spent a bunch of time debugging two issues I was getting with my distributed traces (which were actually kind of one issue), and it's a bit of a long story. For want of a better place to share the story and the questions left in its wake, I'm opening the issue here. So, strap in kids, we're going for a ride...
The whole crux of my bug is that I added a trace-level field whose value happened to be a `URI` object. How this breaks is pretty circuitous.
When I only had the Ruby service instrumented, I didn't notice this was a `URI` instead of a string. In the Honeycomb UI it showed up just fine, but that's because `Libhoney::Cleaner` calls `URI#to_s` automatically before sending the event: https://github.com/honeycombio/libhoney-rb/blob/4457c71b311f340967cc138f6e3afcf8d7f9c0f2/lib/libhoney/cleaner.rb#L41-L46

When I added the Python service, I saw two types of errors:

- When the Python traces were present, they weren't nested in the Ruby spans as expected.
- Sometimes, the request from the Ruby service to the Python service would fail with an HTTP 400.
Both of these smelled like trace header issues. So I added the `X-Honeycomb-Trace` header value and the HTTP 400 response body to the Ruby spans.

Turns out the HTTP 400s were from nginx rejecting the requests with:

<html><head><title>400 Request Header Or Cookie Too Large</title></head><body bgcolor="white"><center><h1>400 Bad Request</h1></center><center>Request Header Or Cookie Too Large</center></body></html>
Okay, I have a healthy amount of trace fields, but I didn't think I had that many. So I grab the header value from the span and go about trying to parse it. Just how much data is getting sent?
The beeline-ruby problem
I'm more familiar with the Ruby beeline, so I chose to use it to parse the header in my local console. Using a constructed example for the sake of this write-up (now that I know what the actual issue is), let's suppose the trace header looked like this:
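For illustration, here's a sketch of how such a v1 header is shaped: a version prefix, comma-separated `key=value` pairs, and a `context` segment holding base64-encoded JSON of the trace-level fields. The ids and field values below are entirely made up.

```ruby
require "base64"
require "json"

# Made-up trace fields, mirroring the kind of data I was propagating
# (including the URL-valued field at the heart of this issue).
fields  = { "app.request_uri" => "https://example.com/widgets?page=2" }
context = Base64.urlsafe_encode64(JSON.generate(fields))

# The v1 header shape: version, then comma-separated key=value pairs.
# The trace and parent ids here are invented for the example.
header = "1;trace_id=44f2ea39,parent_id=c12a45e7,context=#{context}"
puts header
```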
The context parsing apparently fails to parse the JSON in this block: `beeline-ruby/lib/honeycomb/propagation.rb`, lines 38-44 at `45587a1`.
That being the case, what does the base 64 decoded value look like?
Yikes! No wonder JSON parsing fails. Why all the binary data?
Because while the header is generated with URL-safe base 64 encoding here: `beeline-ruby/lib/honeycomb/propagation.rb`, line 55 at `45587a1`.

It's not decoded URL-safe here: `beeline-ruby/lib/honeycomb/propagation.rb`, line 38 at `45587a1`.
What happens when we decode in a URL-safe way?
There we go! That looks like it'll parse as JSON.
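The mismatch is easy to reproduce with plain `Base64`, outside the beeline entirely. The field names and values below are made up; the run of `?` characters just guarantees the encoded output contains `_`, a character only the URL-safe alphabet uses:

```ruby
require "base64"
require "json"

# Made-up trace fields; the "?" run forces a "_" into the base64 output.
fields  = { "app.marker" => "????????", "app.url" => "https://example.com/x?q=1" }
context = JSON.generate(fields)
encoded = Base64.urlsafe_encode64(context)

# Base64.decode64 is lenient: it silently skips "-" and "_", shifting
# every bit that follows -- hence the binary garbage.
garbled = Base64.decode64(encoded)
puts garbled == context  # => false
begin
  JSON.parse(garbled)
rescue StandardError => e
  puts "not JSON: #{e.class}"
end

# Decoding with the matching URL-safe variant round-trips cleanly.
puts JSON.parse(Base64.urlsafe_decode64(encoded)) == fields  # => true
```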
🔴 Problem: The Ruby beeline is inconsistent about URL-safe encoding.
The JSON problem
Indeed, it does parse as JSON. But as you can already tell, it's not as simple as the `URI#to_s`. So how was I getting this structure in my traces?
If you just load up the stock Ruby JSON library, this isn't a problem.
In my project, though, I was using both active_support and yajl.
Individually, either works as expected.
But together, their interactions can lead to something...quite different. What's more, it's load order dependent!
If you load yajl before active_support, the behavior is OK:
But when active_support is loaded before yajl, we get to the cause of the gnarly `URI` expansion:

🔴 Problem: Unrelated to Honeycomb, yajl's interaction with active_support leads to differences in JSON encoding that can cause the serialized trace fields to balloon in size.
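ActiveSupport's generic `Object#as_json` fallback serializes objects it doesn't otherwise recognize as a hash of their instance variables, and that's what a `URI` hits. A stdlib-only sketch of that mechanism (the `as_json_fallback` helper below is my simplified stand-in, not ActiveSupport's actual code):

```ruby
require "json"
require "uri"

# Simplified stand-in for ActiveSupport's generic Object#as_json
# fallback: dump the object's instance variables as a hash.
def as_json_fallback(obj)
  obj.instance_variables.each_with_object({}) do |ivar, hash|
    hash[ivar.to_s.delete("@")] = obj.instance_variable_get(ivar)
  end
end

uri = URI("https://example.com/some/path?q=1")

# Stock JSON behavior: the URI collapses to its string form.
compact = JSON.generate("app.url" => uri.to_s)
puts compact  # => {"app.url":"https://example.com/some/path?q=1"}

# The instance-variable expansion: scheme, host, port, path, query,
# and more, each as its own key -- much larger than the plain string.
expanded = JSON.generate("app.url" => as_json_fallback(uri))
puts expanded.bytesize > compact.bytesize  # => true
```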
The beeline-python problem
So what about when the header isn't too big? Why weren't the Python traces showing up nested?
Well, much like my local testing in the Ruby beeline, the URL-safe encoding breaks the URL-unsafe decoding on the Python end. The beeline parses the context on this line: https://github.com/honeycombio/beeline-python/blob/43a3f9b4f4e2844e7030b33f62695567332c73df/beeline/trace.py#L352
Feeding in the context from above in Python 2.7 (the old version I'm using in my service):
This error remains even in Python 3.x, because it's a semantic error (not a bug).
This gets swallowed by the beeline, resulting in a brand new trace: https://github.com/honeycombio/beeline-python/blob/43a3f9b4f4e2844e7030b33f62695567332c73df/beeline/middleware/flask/__init__.py#L19-L22
If we were to use the URL-safe variant, the context parsing would work:
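Both decodes can be sketched with made-up context fields; as in the Ruby example, the run of `?` characters just forces a `_` into the encoded form:

```python
import base64
import json

# Made-up trace fields; the "?" run guarantees the encoded output
# contains "_", which only the URL-safe base64 alphabet uses.
context = json.dumps({"app.marker": "????????"}).encode("utf-8")
encoded = base64.urlsafe_b64encode(context)

# Standard b64decode silently discards "-" and "_" (validate=False is
# the default), shifting every bit after them -- garbage out.
try:
    json.loads(base64.b64decode(encoded))
    print("unexpectedly parsed")
except (ValueError, UnicodeDecodeError) as err:
    print("standard decode failed:", type(err).__name__)

# The matching URL-safe variant round-trips cleanly.
print(json.loads(base64.urlsafe_b64decode(encoded)))
# => {'app.marker': '????????'}
```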
🔴 Problem: The Honeycomb beelines are sometimes inconsistent when it comes to interoperability. E.g., honeycombio/beeline-python#113. This makes distributed tracing kind of fraught.
Outro
This was quite an issue to hunt down! For now, I can skirt it by simply `URI#to_s`-ing the trace-level field before adding it. But I thought it touched on a lot of important considerations for the beelines.