Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add distributed tracing via Zipkin. #354

Merged
merged 9 commits into from Mar 15, 2018
Merged

Add distributed tracing via Zipkin. #354

merged 9 commits into from Mar 15, 2018

Conversation

mdemirhan
Copy link
Contributor

No description provided.

@mdemirhan mdemirhan requested a review from a team March 13, 2018 21:25
@google-prow-robot google-prow-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Mar 13, 2018
@mdemirhan
Copy link
Contributor Author

/retest

@grantr
Copy link
Contributor

grantr commented Mar 13, 2018

@mdemirhan General question: how does Zipkin overlap with Jaeger? Will we install both?

@mdemirhan
Copy link
Contributor Author

@grantr Both Zipkin and Jaeger implement OpenTracing (http://opentracing.io/documentation/pages/supported-tracers). However; as far as I understand; OpenTracing doesn't really define the wire format or the backend contract and these different implementations aren't necessarily compatible. That being said, Jaeger is supposed to be Zipkin compatible and we should be able to use Jaeger instead of Zipkin if necessary.

Istio supports both Zipkin and Jaeger and we can replace Zipkin with Jaeger if there are features in Jaeger that you prefer over Zipkin. Let me know and I will be happy to take a look.

@@ -55,32 +61,106 @@ func main() {
flag.Parse()
glog.Info("Telemetry sample started.")

// Zipkin setup
reporter := httpreporter.NewReporter("http://zipkin.istio-system.svc.cluster.local:9411/api/v2/spans")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious about the choice to run zipkin in istio-system instead of either the monitoring or ela-system namespaces?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Istio has a hard-coded URL to zipkin and it assumes that zipkin is installed in istio-system with a service named zipkin. I will open an issue to them and ask them to provide us an override. In the meantime, I will add comments in the code and YAML mentioning that this is a workaround for the istio issue.


// Add a random delay to simulate some actual work here
time.Sleep(time.Duration(rand.Intn(1000)) * time.Millisecond)
func rootHandler(client *zipkinhttp.Client) http.HandlerFunc {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes seem complicated. I don't think it's immediately obvious which parts of the code are necessary setup vs. demonstrating the tracing capabilities. Can you add detailed comments explaining the various parts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing. I will add extra comments explaining what each section is doing.

@grantr
Copy link
Contributor

grantr commented Mar 13, 2018

True, the individual tracer implementation defines trace propagation. I have no experience with either (though I do have experience with LightStep) but even so, Jaeger seems like a better choice than Zipkin:

  • owned by the CNCF
  • active community (though https://github.com/openzipkin/zipkin seems active as well)
  • compatible with both Jaeger and Zipkin traces (presumably Zipkin is only compatible with Zipkin)
  • written in Go (its Go client, which we'd use for Elafros, is likely to be very good)

@mdemirhan
Copy link
Contributor Author

@grantr All good points. The reason I chose Zipkin over Jaeger was my impression on its higher usage. I couldn't really find a good reference that compares both to make a good informed decision. I will look into Jaeger more and see if we can replace Zipkin. Though, I would like to check this in now and do that investigation next week. Does that sound reasonable?

@grantr
Copy link
Contributor

grantr commented Mar 13, 2018

Does that sound reasonable?

Definitely, I didn't mean to stop this PR; was just curious.

@mdemirhan
Copy link
Contributor Author

/retest

@vaikas
Copy link
Contributor

vaikas commented Mar 14, 2018

I just have a clarifying question about this. It seems to me like we're instrumenting the application to talk directly to a particular tool, and I had been thinking that we should be able to get some of these traces by preserving the headers and not actually requiring the application to have to choose which of these tools is being used.
https://istio.io/docs/tasks/telemetry/distributed-tracing.html

@mdemirhan
Copy link
Contributor Author

@vaikas-google If you don't instrument within the code, you will get a one liner summary view, which is the call from Istio's point of view.
screen1

To see the whole view of the trace like:
screen2

you need to ensure that the open tracing headers are passed to all other http calls you are making. Telemetry sample shows how to do that using the go client library of zipkin. It can be done manually as well. Or if you are using something like gRPC and enable open tracing, the rpc layer will take care of getting the trace context from the request and copying them to any outgoing calls.

@mdemirhan
Copy link
Contributor Author

That being said, it seems like copying headers manually might actually be a lot more straightforward and I could change the code to just do that. The main disadvantage of that would be the loss of some data (for example, instead of the service name, you will see istio-proxy everywhere).

I will give that a shot and see if that looks better. Thanks for bringing this up.

@mdemirhan
Copy link
Contributor Author

@vaikas-google Digging deeper into this, we will need this type of instrumentation for any service that is not behind istio proxy and that is not local within our cluster. In a service mesh, if you are calling only local services behind an istio-proxy, this instrumentation is not needed. If you are going to call an external service, or if you are going to call an internal service that is not behind istio-proxy, then this instrumentation is needed. I will add a note to the beginning of the file and to the documentation stating that.

Let me know if that sounds reasonable.

@mattmoor Let me know if the updated documentation is clear and if I have your approval to check this in.

@vaikas
Copy link
Contributor

vaikas commented Mar 14, 2018

That's unfortunate especially in the case of functions that are lightweight in code and having them require to manually instrument the code. Thanks for doing the investigation!

@mdemirhan
Copy link
Contributor Author

@vaikas-google Agreed. I will look into this further and see if anything can be done to make the situation better. I will capture this investigation as a Github issue and hopefully get to it within the next couple of days.

@mdemirhan
Copy link
Contributor Author

/retest

Copy link
Member

@mattmoor mattmoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, these comments are super helpful to understand what's going on. thanks for so much detail.

I have a handful more comments.

@@ -62,6 +78,18 @@ func main() {
glog.Info("Telemetry sample started.")

// Zipkin setup
// If your service only calls services in the local cluster and if those services are
// behind istio-proxy (all revisions are behind istio-proxy), then Zipkin setup below
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "... Elafros Revisions ..." may be clearer

// requests and copy them to your outgoing requests.
// x-request-id, x-b3-traceid, x-b3-spanid, x-b3-parentspanid, x-b3-sampled, x-b3-flags, x-ot-span-context
//
// For richer instrumentation, or to instrument calls that are mad to services not behind
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/mad/made/g

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still need this.

// x-request-id, x-b3-traceid, x-b3-spanid, x-b3-parentspanid, x-b3-sampled, x-b3-flags, x-ot-span-context
//
// For richer instrumentation, or to instrument calls that are mad to services not behind
// istio-proxy, you can instrument the code using Zipkin's Go client library:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The trailing colon is weird here.

defer func(start time.Time) {
// Counters only support incrementing. Increment the count by one
// to capture the single call.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any built-in metrics that you get by simply registering the middleware?

One of the nice things about our internal systems is well-defined built-in metrics that just work. :(

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The go client library adds builtin metrics for golang (memstats, thread, go routines, ...etc) but nothing else. I will create an issue and see if we can simplify this further and add a library that can add some common metrics by default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to mention: There are set of metrics around request time, count & result sent by envoy automatically (whether you instrument your code or not). Those should be good for most use cases. This example here was to show more of a custom metric and how to emit it.

@@ -147,15 +191,23 @@ func rootHandler(client *zipkinhttp.Client) http.HandlerFunc {
}
}

// Makes an http call with distributed tracing enabled.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be a more idiomatic example if this was done by creating an http.Roundtripper that instruments outbound requests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added #374 to track this.

@mdemirhan
Copy link
Contributor Author

/retest

// requests and copy them to your outgoing requests.
// x-request-id, x-b3-traceid, x-b3-spanid, x-b3-parentspanid, x-b3-sampled, x-b3-flags, x-ot-span-context
//
// For richer instrumentation, or to instrument calls that are mad to services not behind
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still need this.

@mdemirhan
Copy link
Contributor Author

/retest

@mdemirhan mdemirhan merged commit b06988f into knative:master Mar 15, 2018
@mdemirhan mdemirhan deleted the mdemirhan/zipkin2 branch March 15, 2018 18:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants