Consider adding tracing support for OpenTracing (or similar) #26507
Xref #815
@kubernetes/sig-api-machinery
xref etcd-io/etcd#5425
+100 This would be highly useful (no matter which tracing mechanism we decide on).
Sounds like something that can & should be added to Context... (replying to Wojciech Tyczynski's email of Mon, May 30, 2016)
From the OpenTracing side, we would be glad to help... please be in touch with me or others on the OpenTracing Gitter/etc. Thanks!
gRPC will have Census, but it's a WIP. https://github.com/grpc/grpc/blob/master/src/core/ext/census/README.md
+1! I am very much in favor of this. I have been talking with some customers about tracing being a first-class concern in K8s. It could solve a world of problems in larger-scale deployments. I just spoke with @bensigelman about this. Would we need to use something like gRPC to provide a common cluster-wide method for propagating trace data? gRPC supports OT now: https://github.com/grpc-ecosystem/grpc-opentracing
@josephjacks FYI, the Go support for gRPC is in PR at the moment... it was blocked on client-side interceptors, which were only introduced into gRPC-Go last week. FWIW, integration with gRPC+OT in Go is now O(1) code per gRPC service.

Clients:

```go
var tracer opentracing.Tracer = ...
...
// Set up a connection to the server peer.
conn, err := grpc.Dial(
    address,
    ..., // other options
    grpc.WithUnaryInterceptor(
        otgrpc.OpenTracingClientInterceptor(tracer)))
// All future RPC activity involving `conn` will be automatically traced.
```

Servers:

```go
var tracer opentracing.Tracer = ...
...
// Initialize the gRPC server.
s := grpc.NewServer(
    ..., // other options
    grpc.UnaryInterceptor(
        otgrpc.OpenTracingServerInterceptor(tracer)))
// All future RPC activity involving `s` will be automatically traced.
```
@bensigelman that is awesome! Thanks for sharing this.
Because this cuts across every component, I'd like to discuss this with @kubernetes/sig-scalability-feature-requests for initial investigation in 1.7. @gmarek and I have chatted about it a number of times now.
Can we use this to avoid having to reconstruct scheduling events entirely from plain logs? If so, that helps solve some of our logging issues there (and in other places) as well. @eparis
@gmarek and I are starting on this in earnest now... I'm going to hit the scheduler first.
@timothysc @gmarek: Marek and I had a quick chat this AM. I am happy to help with modeling / conceptual issues if they crop up. Please just ping me on Gitter or via email / whatever.
@bhs thanks. Are you on kubernetes.slack.com?
@timothysc I am not, and frankly I am too swamped to pay attention to another firehose in my life. :) But I take IRQs happily.
Mucking with the client <> API is the hardest part, from what we've seen so far.
@bhs mentioned that it's probably best to start by adding support for context.Context to the things that are of interest to us, and then using it to pass span ids across process/machine boundaries. That would mean some big changes in our client-go library, which include, but are not restricted to, figuring out how we want to handle our generated protobufs. As we want to be able to trace the flow of a single object through the system, we'd probably need to add some id/context to object metadata, so we'd be able to track e.g. a Pod from creation, through scheduling, to running. This would probably require changes in ... Generally it seems like loads of fun, and requires more thought and careful design. @kubernetes/sig-api-machinery-feature-requests @kubernetes/client-go-maintainers
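As a stdlib-only sketch of the "id/context in object metadata" idea: a trace id could ride along in an object's annotations so that each component touching the object can parent its spans to the same trace. The annotation key and helper names below are made up for illustration; this is not an existing client-go or Kubernetes API.

```go
package main

import "fmt"

// traceAnnotation is a hypothetical annotation key, not a real Kubernetes convention.
const traceAnnotation = "trace.kubernetes.io/trace-id"

// InjectTraceID records a trace id on an object's annotations,
// allocating the map if the object had none.
func InjectTraceID(annotations map[string]string, traceID string) map[string]string {
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[traceAnnotation] = traceID
	return annotations
}

// ExtractTraceID recovers the trace id, or "" if none was propagated.
func ExtractTraceID(annotations map[string]string) string {
	return annotations[traceAnnotation]
}

func main() {
	// Metadata for a Pod-like object, as the apiserver might hand it to the scheduler.
	meta := InjectTraceID(nil, "abc123")
	fmt.Println(ExtractTraceID(meta)) // prints: abc123
}
```

The point is only that propagation through metadata is mechanically simple; the hard part discussed above (when a trace starts and ends) is untouched by this.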
Why not use the object uid?
Then the span would cover the whole object lifetime, and we probably don't want that either.
There was an effort to do so, and there's my old PR (#45962) that adds tracing to master-client communication, but I didn't have time to push it through. There's a fundamental mismatch between the Kubernetes model and the distributed tracing model (i.e. there's no notion of an 'operation' in Kubernetes, hence there's no single 'thing' to which a span can be attached).
The alternative to this is to fill out events and use them as the means of tracing work through the system.
Forgive my late-comer confusion, but it seems like there are two different concerns being discussed in the one ticket? Concern a) tracing the work that goes into a single RPC request, e.g. client -> apiserver -> kubelet -> docker -> last-log-of-a-container and back up the stack. Concern b) tracing the lifecycle of resources in the system, e.g. pod creation, scheduling, deletion-request, actual-free. What I don't understand is the 'Kubernetes model mismatch vs. distributed tracing' argument vis-a-vis case a): all RPCs seem pretty clearly defined to me, with the one arguable exception being watches; even there you could trace, though a single trace may generate thousands of spans over a long time period. Relatedly, wearing my operator-of-k8s hat: while figuring out why a pod was deleted or why a container was killed is a mild nuisance, figuring out why k8s has decided to suddenly go slow (case a) is really the key thing we need in this space, and having OpenTracing glue for that alone, reaching down into docker, would be very nice.
Unassigning; the watch model vs. RPC calls has proven to break down when trying to integrate. If it gets fixed at some later date, I'd be interested to see how it could be incorporated. However, there is an Events v2 API that is meant to address the issues uncovered in using tracing.
Hi all, Ted from the OpenTracing project here. Post-KubeCon, there has been renewed interest in integrating Kubernetes and OpenTracing. How can I be of service, and who should I talk to? I see there are some concerns around long-running traces, but possibly there is some low-hanging fruit that could be addressed. On a related note, there is OpenTracing support for nginx-ingress, but it's currently only enabled for Zipkin. In both ingress and control-plane tracing, there's the issue of how to package a tracer with k8s. It would be helpful for me to get up to speed with someone on the ingress side of things as well, and discuss tracing on both fronts. Cheers,
Hi @tedsuo. The question you're asking has a lot of layers. From a very high-level perspective, this issue and my investigation were about adding tracing to the Kubernetes control plane in such a way that we'd be able to analyze the system's performance and easily find scalability bottlenecks. This proved to be infeasible for a number of reasons (e.g. k8s lacks the concept of an 'operation', so it's not clear when a span for, say, Pod creation should finish, not to mention Deployment or Service creation; or the fact that passing context through etcd would require a lot of nontrivial code working dangerously close to etcd itself). As scale analysis was our main goal and we were not able to get there using tracing, the whole tracing effort was destaffed. On the other hand, you're probably more interested in a very simple tracing integration, where you'd just trace single API calls, which is obviously doable (no consensus issues, and a natural span concept). I had a ~working POC PR (#45962) for that which I never had time to finish. The main challenge here is to allow the user to inject a context into client calls (issue #46503), but no one is working on that currently AFAIK (@kubernetes/sig-api-machinery-misc).
Thanks for the rundown @gmarek. I agree that basic API tracing (both for control plane activity and ingress) is a good starting point. "Long-lived traces" are not currently supported by most tracing systems (though, ironically, scale analysis for container scheduling is what initially got me involved with tracing), so there is currently no off-the-shelf solution for tracing complex operations where the work ends up queued in various databases and potentially lasts for more than 5 minutes. You were correct to build an Events API (or log format) for that problem, which can then be dumped into various tracing/analysis systems. I'll try to find the correct SIG and see how we can provide assistance. BTW, https://github.com/orgs/kubernetes/teams/sig-api-machinery-misc seems to not exist (or be protected)?
I don't know how to refer to sigs anymore :( @smarterclayton @lavalamp @caesarxuchao @sttts @deads2k @liggitt
I'm interested in having both intra- and inter-process tracing, but the best I can do is offer review time. I've generally used this kind of information to chase bugs in multi-threaded code. Using it to track bottlenecks instead of using a metrics-gathering tool would be new for me.
@tedsuo do you see anything in the OpenTracing API that is inimical to long-lived traces? The data model seemed entirely fine for Kubernetes' tracing needs to me. We're hoping to work on this space in the medium term. (Not a forward-looking commitment, etc. etc.)
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
@fejta-bot: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I've found this; the topic is under discussion: kubernetes/enhancements#650
I've found this: containerd added tracing.
In 1.3 we took most of the fat out of the internal apiserver and client machinery. However, we have fairly complex internal flows for controllers, caches, the watches, and other distribution mechanisms that are more difficult to trace/profile. In addition, we are adding larger numbers of control loops and feedback mechanisms into the state of the system, and we are seeing more inconsistent outcomes that are hard to reason about.
For 1.4, I think it would be valuable to broaden the traces started in #8806, possibly by adding OpenTracing support. That would also give us a bit more contextual information on failures. We would also need to consider how traces could propagate across control loops, such as whether we do level-driven traces from the kubelet or elsewhere.