Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use local service name as X-Ray subsegment name #88

Closed
wants to merge 5 commits into from

Conversation

robyf
Copy link
Contributor

@robyf robyf commented May 25, 2018

With reference to #87 this PR adds the possibility to use local service names as segmente name for span of type different than SERVER and CONSUMER.
By default "unknown" is still sent if the remote service name is null, only if the AWS_XRAY_USE_LOCAL_SPAN_FOR_MISSING_REMOTES environment variable is set to true the local service name is used.

As an example (using spring-boot 2 based Java applications) by default these spans:

[{
  "traceId": "5b07b67fed3a0b4570cbdfe07b779e32",
  "parentId": "f3d698d7b353651f",
  "id": "164d63874559fbdf",
  "kind": "CLIENT",
  "name": "get",
  "timestamp": 1527232127973431,
  "duration": 22109,
  "localEndpoint": {
    "serviceName": "security-gateway-aws",
    "ipv4": "192.168.0.171"
  },
  "tags": {
    "http.method": "GET",
    "http.path": "/test-app/v1/public"
  }
}, {
  "traceId": "5b07b67fed3a0b4570cbdfe07b779e32",
  "parentId": "02e913691d305aa0",
  "id": "f3d698d7b353651f",
  "name": "hystrix",
  "timestamp": 1527232127972476,
  "duration": 23595,
  "localEndpoint": {
    "serviceName": "security-gateway-aws",
    "ipv4": "192.168.0.171"
  }
}, {
  "traceId": "5b07b67fed3a0b4570cbdfe07b779e32",
  "parentId": "70cbdfe07b779e32",
  "id": "02e913691d305aa0",
  "kind": "CLIENT",
  "name": "get",
  "timestamp": 1527232127972240,
  "duration": 23838,
  "localEndpoint": {
    "serviceName": "security-gateway-aws",
    "ipv4": "192.168.0.171"
  },
  "tags": {
    "http.method": "GET",
    "http.path": "/test-app/v1/public"
  }
}, {
  "traceId": "5b07b67fed3a0b4570cbdfe07b779e32",
  "id": "70cbdfe07b779e32",
  "kind": "SERVER",
  "name": "get",
  "timestamp": 1527232127970111,
  "duration": 29131,
  "localEndpoint": {
    "serviceName": "security-gateway-aws",
    "ipv4": "192.168.0.171"
  },
  "remoteEndpoint": {
    "ipv4": "127.0.0.1",
    "port": 33060
  },
  "tags": {
    "http.method": "GET",
    "http.path": "/test-app/v1/public"
  }
}]

result in these X-Ray segments:

{
  "trace_id": "1-5b07b67f-ed3a0b4570cbdfe07b779e32",
  "parent_id": "f3d698d7b353651f",
  "id": "164d63874559fbdf",
  "type": "subsegment",
  "namespace": "remote",
  "name": "unknown",
  "start_time": 1.527232127973431E9,
  "end_time": 1.52723212799554E9,
  "http": {
    "request": {
      "method": "GET"
    }
  },
  "annotations": {
    "operation": "get",
    "http_path": "/test-app/v1/public"
  }
}

{
  "trace_id": "1-5b07b67f-ed3a0b4570cbdfe07b779e32",
  "parent_id": "02e913691d305aa0",
  "id": "f3d698d7b353651f",
  "type": "subsegment",
  "name": "unknown",
  "start_time": 1.527232127972476E9,
  "end_time": 1.527232127996071E9
}

{
  "trace_id": "1-5b07b67f-ed3a0b4570cbdfe07b779e32",
  "parent_id": "70cbdfe07b779e32",
  "id": "02e913691d305aa0",
  "type": "subsegment",
  "namespace": "remote",
  "name": "unknown",
  "start_time": 1.52723212797224E9,
  "end_time": 1.527232127996078E9,
  "http": {
    "request": {
      "method": "GET"
    }
  },
  "annotations": {
    "operation": "get",
    "http_path": "/test-app/v1/public"
  }
}

{
  "trace_id": "1-5b07b67f-ed3a0b4570cbdfe07b779e32",
  "id": "70cbdfe07b779e32",
  "name": "security-gateway-aws",
  "start_time": 1.527232127970111E9,
  "end_time": 1.527232127999242E9,
  "http": {
    "request": {
      "method": "GET"
    }
  },
  "annotations": {
    "operation": "get",
    "http_path": "/test-app/v1/public"
  }
}

while with the environment variable set these spans:

[{
  "traceId": "5b07c4dd8167bc728c9369499184df04",
  "parentId": "2561d4b051f72922",
  "id": "0d32b6ca15c75375",
  "kind": "CLIENT",
  "name": "get",
  "timestamp": 1527235805895055,
  "duration": 6568,
  "localEndpoint": {
    "serviceName": "security-gateway-aws",
    "ipv4": "192.168.0.171"
  },
  "tags": {
    "http.method": "GET",
    "http.path": "/test-app/v1/public"
  }
}, {
  "traceId": "5b07c4dd8167bc728c9369499184df04",
  "parentId": "8c9369499184df04",
  "id": "7b25c19865b5ee2c",
  "kind": "CLIENT",
  "name": "get",
  "timestamp": 1527235805888945,
  "duration": 19229,
  "localEndpoint": {
    "serviceName": "security-gateway-aws",
    "ipv4": "192.168.0.171"
  },
  "tags": {
    "http.method": "GET",
    "http.path": "/test-app/v1/public"
  }
}, {
  "traceId": "5b07c4dd8167bc728c9369499184df04",
  "parentId": "7b25c19865b5ee2c",
  "id": "2561d4b051f72922",
  "name": "hystrix",
  "timestamp": 1527235805890880,
  "duration": 21261,
  "localEndpoint": {
    "serviceName": "security-gateway-aws",
    "ipv4": "192.168.0.171"
  }
}, {
  "traceId": "5b07c4dd8167bc728c9369499184df04",
  "id": "8c9369499184df04",
  "kind": "SERVER",
  "name": "get",
  "timestamp": 1527235805883133,
  "duration": 29543,
  "localEndpoint": {
    "serviceName": "security-gateway-aws",
    "ipv4": "192.168.0.171"
  },
  "remoteEndpoint": {
    "ipv4": "127.0.0.1",
    "port": 42386
  },
  "tags": {
    "http.method": "GET",
    "http.path": "/test-app/v1/public"
  }
}]

result in these X-Ray segments:

{
  "trace_id": "1-5b07c4dd-8167bc728c9369499184df04",
  "parent_id": "2561d4b051f72922",
  "id": "0d32b6ca15c75375",
  "type": "subsegment",
  "namespace": "remote",
  "name": "security-gateway-aws",
  "start_time": 1.527235805895055E9,
  "end_time": 1.527235805901623E9,
  "http": {
    "request": {
      "method": "GET"
    }
  },
  "annotations": {
    "operation": "get",
    "http_path": "/test-app/v1/public"
  }
}

{
  "trace_id": "1-5b07c4dd-8167bc728c9369499184df04",
  "parent_id": "8c9369499184df04",
  "id": "7b25c19865b5ee2c",
  "type": "subsegment",
  "namespace": "remote",
  "name": "security-gateway-aws",
  "start_time": 1.527235805888945E9,
  "end_time": 1.527235805908174E9,
  "http": {
    "request": {
      "method": "GET"
    }
  },
  "annotations": {
    "operation": "get",
    "http_path": "/test-app/v1/public"
  }
}

{
  "trace_id": "1-5b07c4dd-8167bc728c9369499184df04",
  "parent_id": "7b25c19865b5ee2c",
  "id": "2561d4b051f72922",
  "type": "subsegment",
  "name": "security-gateway-aws",
  "start_time": 1.52723580589088E9,
  "end_time": 1.527235805912141E9
}

{
  "trace_id": "1-5b07c4dd-8167bc728c9369499184df04",
  "id": "8c9369499184df04",
  "name": "security-gateway-aws",
  "start_time": 1.527235805883133E9,
  "end_time": 1.527235805912676E9,
  "http": {
    "request": {
      "method": "GET"
    }
  },
  "annotations": {
    "operation": "get",
    "http_path": "/test-app/v1/public"
  }
}

@codefromthecrypt
Copy link
Member

codefromthecrypt commented May 26, 2018 via email

@robyf
Copy link
Contributor Author

robyf commented May 26, 2018

I added the test now. One thing related to that: I had to add a test dependency to be able to mock system libraries as described here: https://stackoverflow.com/a/35393623. I hope that's ok!

@robyf
Copy link
Contributor Author

robyf commented May 26, 2018

I have to admit I don't understand how that test failed :( I haven't touched anything related to SQS publishing.
Just to be on the safe side I cloned the repo in another machine and ran the build there, all tests passed (as they were passing before I pushed).

@devinsba
Copy link
Member

That SQS test has been flaking for a month or so. Neither I nor @llinder have had time to fix it it seems

// using "unknown" subsegment name will help to detect missing names but will also
// result in the service map displaying services depending on an "unknown" one.
if (span.remoteServiceName() == null) {
if ("true".equalsIgnoreCase(System.getenv("AWS_XRAY_USE_LOCAL_SPAN_FOR_MISSING_REMOTES"))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm this would be the only time we use environment variables directly in java code of a component. Let's not do that. At worst a flag, but this also seems bad semantically. Also, localServiceName can be null in odd situations (not in brave, but possible if people are writing spans themselves).

@devinsba since you use this code what do you think? Is there another option? What's the impact of the graph not having a correct remote service name?

I suppose it would be telling to see a trace where this happens, too.. When thinking about tests, I was actually more thinking that but didn't mention it. Sometimes seeing the trace data can show what options might not be obvious from looking at the code without any context.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a big fan about using environment variables in java code, but as the address of the X-Ray server is passed as AWS_XRAY_DAEMON_ADDRESS env var I used the same for consistency.

Having unknown as X-Ray subsegment name creates the issues described here: #87. I believe that unknown segments when looking at a single trace are not a big issue, but having almost all services reported as depending on unknown in the service map is a problem if the main reason for using Zipkin or X-Ray is to have a map of the service dependencies.

I have copy/pasted examples of a trace (received from a Spring Boot application that uses the Spring Cloud stack, OpenFeign and Hystrix) and the related data sent to X-Ray, will do it again here:

First span (the request is received)

{
  "traceId": "5b07b67fed3a0b4570cbdfe07b779e32",
  "id": "70cbdfe07b779e32",
  "kind": "SERVER",
  "name": "get",
  "timestamp": 1527232127970111,
  "duration": 29131,
  "localEndpoint": {
    "serviceName": "security-gateway-aws",
    "ipv4": "192.168.0.171"
  },
  "remoteEndpoint": {
    "ipv4": "127.0.0.1",
    "port": 33060
  },
  "tags": {
    "http.method": "GET",
    "http.path": "/test-app/v1/public"
  }
}

and the related X-Ray segment:

{
  "trace_id": "1-5b07b67f-ed3a0b4570cbdfe07b779e32",
  "id": "70cbdfe07b779e32",
  "name": "security-gateway-aws",
  "start_time": 1.527232127970111E9,
  "end_time": 1.527232127999242E9,
  "http": {
    "request": {
      "method": "GET"
    }
  },
  "annotations": {
    "operation": "get",
    "http_path": "/test-app/v1/public"
  }
}

No problems here.

Second span (the service invokes a remote API via OpenFeign)

{
  "traceId": "5b07b67fed3a0b4570cbdfe07b779e32",
  "parentId": "70cbdfe07b779e32",
  "id": "02e913691d305aa0",
  "kind": "CLIENT",
  "name": "get",
  "timestamp": 1527232127972240,
  "duration": 23838,
  "localEndpoint": {
    "serviceName": "security-gateway-aws",
    "ipv4": "192.168.0.171"
  },
  "tags": {
    "http.method": "GET",
    "http.path": "/test-app/v1/public"
  }
}

and the related X-Ray subsegment:

{
  "trace_id": "1-5b07b67f-ed3a0b4570cbdfe07b779e32",
  "parent_id": "70cbdfe07b779e32",
  "id": "02e913691d305aa0",
  "type": "subsegment",
  "namespace": "remote",
  "name": "unknown",
  "start_time": 1.52723212797224E9,
  "end_time": 1.527232127996078E9,
  "http": {
    "request": {
      "method": "GET"
    }
  },
  "annotations": {
    "operation": "get",
    "http_path": "/test-app/v1/public"
  }
}

Besides the conceptual issue that this is not a remote call, yet (but Zipkin cannot know this) here the first time the unknown problem appears.

Third span (Hystrix circuit breaker used by OpenFeign)

{
  "traceId": "5b07b67fed3a0b4570cbdfe07b779e32",
  "parentId": "02e913691d305aa0",
  "id": "f3d698d7b353651f",
  "name": "hystrix",
  "timestamp": 1527232127972476,
  "duration": 23595,
  "localEndpoint": {
    "serviceName": "security-gateway-aws",
    "ipv4": "192.168.0.171"
  }
}

and the related X-Ray segment:

{
  "trace_id": "1-5b07b67f-ed3a0b4570cbdfe07b779e32",
  "parent_id": "02e913691d305aa0",
  "id": "f3d698d7b353651f",
  "type": "subsegment",
  "name": "unknown",
  "start_time": 1.527232127972476E9,
  "end_time": 1.527232127996071E9
}

This is correctly not reported as remote segment but the name is unknown. Of all I believe this is the most wrong one as it should definitely use the local service name.

Fourth span (the real HTTP client)

{
  "traceId": "5b07b67fed3a0b4570cbdfe07b779e32",
  "parentId": "f3d698d7b353651f",
  "id": "164d63874559fbdf",
  "kind": "CLIENT",
  "name": "get",
  "timestamp": 1527232127973431,
  "duration": 22109,
  "localEndpoint": {
    "serviceName": "security-gateway-aws",
    "ipv4": "192.168.0.171"
  },
  "tags": {
    "http.method": "GET",
    "http.path": "/test-app/v1/public"
  }
}

and the X-Ray segment:

{
  "trace_id": "1-5b07b67f-ed3a0b4570cbdfe07b779e32",
  "parent_id": "f3d698d7b353651f",
  "id": "164d63874559fbdf",
  "type": "subsegment",
  "namespace": "remote",
  "name": "unknown",
  "start_time": 1.527232127973431E9,
  "end_time": 1.52723212799554E9,
  "http": {
    "request": {
      "method": "GET"
    }
  },
  "annotations": {
    "operation": "get",
    "http_path": "/test-app/v1/public"
  }
}

Same situation as for the second span.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering, as the AWS X-Ray documentation (https://docs.aws.amazon.com/xray/latest/devguide/xray-api-segmentdocuments.html) states that for subsegments the name should be "The logical name of the subsegment. For downstream calls, name the subsegment after the resource or service called. For custom subsegments, name the subsegment after the code that it instruments (e.g., a function name)." should the span name be used as subsegment name for spans that don't have any kind set and use the current unknown (or the local service name) for CLIENT spans only?

@devinsba
Copy link
Member

I haven't seen this on the graph yet myself actually, we haven't traced anything that we didn't have end to end. (other than AWS SDK, but that sets the remote endpoint correctly)

I honestly think I'd prefer to see unknown in the graph, over using the local service name, as that would indicate to me that I have some more code to instrument. I understand not everyone is doing greenfield development so I get huge luxury of knowing all of my applications external touchpoints

@robyf
Copy link
Contributor Author

robyf commented May 29, 2018

@devinsba That's why I left it optional, as it all depends by the context one is in. For greenfield development or when you're in full control of the production of traces and spans it makes sense to immediately see if you have missed something, but when delegating the instrumentation and report to Zipkin to libraries like Sleuth you just have to accept the way they decided to report (or try to send PRs there).

@codefromthecrypt
Copy link
Member

codefromthecrypt commented May 30, 2018 via email

@robyf
Copy link
Contributor Author

robyf commented May 31, 2018

I will go on testing and report here my findings. So far two things:

  1. X-Ray mandates segment ids inside traces to be unique while Sleuth/Zipkin expects the CLIENT span from one services and the SERVER span from the called service to share the same span id. I don't know how to easily solve this.

  2. When using zuul (we use it in our security gateway) every incoming request forwarded to services produces 2 CLIENT spans in the trace (as in the example I posted before). I believe one is the ribbon command and the second is the actual call to the service.

@codefromthecrypt
Copy link
Member

codefromthecrypt commented May 31, 2018 via email

@robyf
Copy link
Contributor Author

robyf commented May 31, 2018

Adding the supports-join=false property has improved the situation a lot. This is what I'm getting now (using my local fork of zipkin-aws but with the "use local service name as remote" disabled):
selection_009

Using span name for custom spans seems to do the trick (see the hystrix span in the screenshot).

@codefromthecrypt
Copy link
Member

I think marcin adjusted the ribbon stuff so to not create duplicate remote spans. Can you verify this? If so I'd like to keep your changes except the special-cased flag..

@robyf
Copy link
Contributor Author

robyf commented Jun 11, 2018

Yes that has been fixed and verified. There's still one thing I'd like to try for those remote calls without remote service in the span: use the span name as segment name and not mark that as remote. As the X-Ray documentation talks about using the name of the function called that would maybe make more sense than using the local service name and mark the segment as remote, but I don't know how does it play out with client responses and such, so I've to try.

The best of course would be that the spans would contain the remote service section, I don't know how many places in sleuth would have to be changed to add that to the instrumentation.

@codefromthecrypt
Copy link
Member

codefromthecrypt commented Jun 11, 2018 via email

@robyf
Copy link
Contributor Author

robyf commented Jun 11, 2018

I have contacted now the AWS Business Support explaining the issue, won't make any code change until I get a reply from them. Usually they reply in 24 hours.

@codefromthecrypt
Copy link
Member

codefromthecrypt commented Jun 11, 2018 via email

@robyf
Copy link
Contributor Author

robyf commented Jun 25, 2018

It took a while but I got this reply from AWS:

X-Ray does not have zipkin's concept of client vs server spans. It does have the concept of remote segments and inferred segments.
In the github link for the issue #88 ( #88 ), there is discussion about how to name the server spans
when the caller does not know the downstream dependency and your library is naming the span as "unknown".
Please refer to the following document for details about how the segments are named:
https://github.com/awsdocs/aws-xray-developer-guide/blob/master/doc-source/xray-sdk-java-httpclients.md

When making a downstream call to a service the segment may include a child subsegment with any name and an ID to show the Client side experience (errors, latency). Typically the SDK populates the CNAME of the downstream service in the subsegment name field. The downstream service generates a new segment with the parent_id set to the id of the previous subsegment. The name of this segment is used as the name of the service graph node.

If the downstream service does not send any segment with the parent id set to the id of the caller subsegment then an inferred segment is created for the remote subsegment with the name of the segment set to the name of the subsegment. These inferred segments may result into new nodes in the service graph with the name of the subsegment. Although since these nodes are inferred, the statistics of the node will be inferred from the edges.

In this case, please let us know why are you sending subsegments with name set to "unknown" and namespace as "remote". The namespace is correct as long as that indicates a RPC. It is not clear why the name of the subsegment is not to the correct name of the downstream service and is being referring to as "unknown".

Looking in the link: #59, that has an update by @adriancole which reads:
"Usually span name isn't null, but it could be sent as null when updating a span post factum",
please note that segments or subsegments in X-Ray cannot be updated once they have been sent. Are you trying to update the segment with more information and hence sending an "unknown" name for the same id? If so we would like to understand why it is not possible to send the correct name?

@codefromthecrypt
Copy link
Member

codefromthecrypt commented Jun 25, 2018 via email

@robyf
Copy link
Contributor Author

robyf commented Jun 25, 2018

AWS is asking to have a phone call with me, but I start feeling I'll be just the man-in-the-middle here adding little if not no value at all. Plus the timing will be inconvenient because I'm located in Finland and the AWS X-Ray team somewhere in the PST time zone. Could I suggest AWS to have some kind of video calls with you guys instead?

@codefromthecrypt
Copy link
Member

codefromthecrypt commented Jun 25, 2018 via email

@robyf
Copy link
Contributor Author

robyf commented Jun 25, 2018

I just replied them asking to have the call done between you and them directly. Hopefully they'll be ok with it!

@robyf
Copy link
Contributor Author

robyf commented Jun 27, 2018

The AWS X-Ray service team is ok to have the call with you, could you pass me your contact details? I'll forward those to the support contact.

@codefromthecrypt
Copy link
Member

codefromthecrypt commented Jun 27, 2018 via email

@codefromthecrypt
Copy link
Member

codefromthecrypt commented Jul 6, 2018

plan resulting from feedback from meeting with AWS

For the remote subsegment name, use a fallback model:

  • start with the remote service name
  • fallback to the hostname (ex "http.host" tag if available)
  • fallback to the span name (because it is already hinted as low cardinality)
  • fallback to "unknown"

When users report a problem, tell them to configure a valid remote service name (possibly by decorating a reporter if there's no other way).

Also tell users to fix or ask for fixes when there is double-instrumentation (client child of client) as this is a problem and affects statistics. Meanwhile, Amazon will look into skipping bad data when rendering the service graph on their end.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants