
Consider adding tracing support for OpenTracing (or similar) #26507

Closed
smarterclayton opened this issue May 29, 2016 · 54 comments
Labels
area/monitoring
help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
kind/feature: Categorizes issue or PR as related to a new feature.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
sig/api-machinery: Categorizes an issue or PR as relevant to SIG API Machinery.
sig/scalability: Categorizes an issue or PR as relevant to SIG Scalability.

Comments

@smarterclayton
Contributor

In 1.3 we took most of the fat out of the internal apiserver and client machinery. However, we have fairly complex internal flows for controllers, caches, watches, and other distribution mechanisms that are more difficult to trace and profile. In addition, as we add larger numbers of control loops and feedback mechanisms into the system, we are seeing more inconsistent outcomes that are hard to reason about.

For 1.4, I think it would be valuable to broaden the traces started in #8806, possibly by adding OpenTracing support. It would also give us a bit more contextual information on failures. We would also need to consider how traces could propagate across control loops, such as whether we do level-driven traces from the kubelet or elsewhere.

@smarterclayton
Contributor Author

Xref #815

@smarterclayton
Contributor Author

@kubernetes/sig-api-machinery

@smarterclayton
Contributor Author

xref etcd-io/etcd#5425

@wojtek-t
Member

+100

This would be highly useful (no matter which tracing mechanism we decide on).
We're currently using those very simple traces added in #8806, and over time we have added them in multiple other places all over the code. They are super useful in debugging performance issues.
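
For readers who haven't seen them, those traces are used roughly like this (a minimal sketch; the import path and the handler/helper names are assumptions, not the exact in-tree code):

package main

import (
	"time"

	utiltrace "k8s.io/utils/trace" // assumed import path; the original helper lived under pkg/util/trace
)

// handleRequest stands in for an apiserver handler or controller step.
func handleRequest() {
	// Start a trace; the recorded steps are only logged if the whole
	// operation takes longer than the threshold given to LogIfLong.
	trace := utiltrace.New("example operation")
	defer trace.LogIfLong(500 * time.Millisecond)

	doExpensiveWork() // hypothetical placeholder for real work
	trace.Step("finished expensive work")

	doMoreWork() // hypothetical placeholder
	trace.Step("finished remaining work")
}

func doExpensiveWork() { time.Sleep(10 * time.Millisecond) }
func doMoreWork()      { time.Sleep(10 * time.Millisecond) }

func main() { handleRequest() }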

@lavalamp
Member

Sounds like something that can & should be added to Context...
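
A minimal sketch of what that could look like with the OpenTracing Go API, where the span rides along in a context.Context (the syncPod/fetchPodStatus names here are made up for illustration):

package main

import (
	"context"

	opentracing "github.com/opentracing/opentracing-go"
)

// syncPod is a hypothetical control-loop step; its span travels in ctx.
func syncPod(ctx context.Context, podName string) {
	// Start a child of whatever span is already in ctx (or a new root
	// span if there is none) and store it back into ctx.
	span, ctx := opentracing.StartSpanFromContext(ctx, "syncPod")
	defer span.Finish()

	span.SetTag("pod.name", podName)
	fetchPodStatus(ctx, podName)
}

// fetchPodStatus is a hypothetical downstream call that continues the trace.
func fetchPodStatus(ctx context.Context, podName string) {
	span, _ := opentracing.StartSpanFromContext(ctx, "fetchPodStatus")
	defer span.Finish()
	span.SetTag("pod.name", podName)
}

func main() {
	// With no tracer registered, the global NoopTracer is used and this is a no-op.
	syncPod(context.Background(), "example-pod")
}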


@bhs

bhs commented May 31, 2016

From the OpenTracing side, we would be glad to help... please be in touch with me or others on the OT Gitter, etc. Thanks!

@mikedanese added the area/monitoring, team/none, and sig/api-machinery labels and removed the team/none label Jun 1, 2016
@mikedanese
Member

gRPC will have Census support, but it's a WIP.

https://github.com/grpc/grpc/blob/master/src/core/ext/census/README.md

@josephjacks

josephjacks commented Sep 8, 2016

+1! I am very much in favor of this. I have been talking with some customers about tracing being a first-class concern in K8s. It could solve a world of problems in larger-scale deployments.

I just spoke with @bensigelman about this. Would we need to use something like gRPC to provide a common cluster-wide method for propagating trace data?

gRPC supports OT now: https://github.com/grpc-ecosystem/grpc-opentracing

@bhs

bhs commented Sep 8, 2016

@josephjacks FYI, the Go support for gRPC is in a PR at the moment... it was blocked on client-side interceptors, which were just introduced into gRPC-Go last week. FWIW, integration with gRPC+OT in Go is now O(1) code per gRPC service...

Clients:

var tracer opentracing.Tracer = ...
...

// Set up a connection to the server peer.
conn, err := grpc.Dial(
    address,
    ... // other options
    grpc.WithUnaryInterceptor(
        otgrpc.OpenTracingClientInterceptor(tracer)))

// All future RPC activity involving `conn` will be automatically traced.

Servers:

var tracer opentracing.Tracer = ...

// Initialize the gRPC server.
s := grpc.NewServer(
    ... // other options
    grpc.UnaryInterceptor(
        otgrpc.OpenTracingServerInterceptor(tracer)))

// All future RPC activity involving `s` will be automatically traced.

@josephjacks

@bensigelman that is awesome! thanks for sharing this.

@timothysc added the sig/scalability label Mar 7, 2017
@timothysc
Member

Because this hits across every component, I'd like to discuss this on @kubernetes/sig-scalability-feature-requests for initial investigation in 1.7. @gmarek and I have chatted about it a number of times now.

@timothysc added this to the v1.7 milestone Mar 7, 2017
@jayunit100
Member

Can we use this to avoid having to reconstruct the entire sequence of scheduling events from plain logs? If so, that helps solve some of our logging issues there (and in other places) as well. @eparis

@timothysc self-assigned this Apr 20, 2017
@timothysc
Member

@gmarek and I are starting on this in earnest now... I'm going to hit the scheduler first.
/cc @kubernetes/sig-scheduling-feature-requests

@bhs

bhs commented Apr 27, 2017

@timothysc @gmarek: Marek and I had a quick chat this AM. I am happy to help with modeling / conceptual issues if they crop up. Please just ping me on gitter or via email / whatever.

@timothysc
Member

@bhs thanks. Are you on kubernetes.slack.com ?

@bhs

bhs commented Apr 27, 2017

@timothysc I am not, and frankly I am too swamped to pay attention to another firehose in my life. :) But I take IRQs happily.

@timothysc
Member

Mucking with the client <> API interaction is the hardest part, from what we've seen so far.

@gmarek
Contributor

gmarek commented Apr 28, 2017

@bhs mentioned that it's probably best to start by adding support for context.Context to the things that are of interest to us, and then using it to pass the span id across process/machine boundaries. It'd mean some big changes in our client-go library, including, but not restricted to, figuring out how we want to handle our generated protobufs.
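
For the cross-process part, a rough sketch with the plain OpenTracing Go API, assuming an HTTP hop between components (the helper names below are illustrative, not a proposed client-go API):

package example

import (
	"log"
	"net/http"

	opentracing "github.com/opentracing/opentracing-go"
)

// injectSpan writes the current span's context into an outgoing request's
// headers so the receiving process can continue the same trace.
func injectSpan(span opentracing.Span, req *http.Request) {
	err := span.Tracer().Inject(
		span.Context(),
		opentracing.HTTPHeaders,
		opentracing.HTTPHeadersCarrier(req.Header),
	)
	if err != nil {
		log.Printf("failed to inject span context: %v", err)
	}
}

// extractSpan starts a server-side span joined to the caller's trace when
// trace headers are present; most tracers fall back to a new root span otherwise.
func extractSpan(tracer opentracing.Tracer, r *http.Request, op string) opentracing.Span {
	wireCtx, _ := tracer.Extract(
		opentracing.HTTPHeaders,
		opentracing.HTTPHeadersCarrier(r.Header),
	)
	return tracer.StartSpan(op, opentracing.ChildOf(wireCtx))
}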

As we want to be able to trace the flow of a single object through the system, we would probably need to add some id/context to object metadata, so we'd be able to track e.g. a Pod from creation, through scheduling, to running. This would probably require changes to the runtime.Object interface, as it'd need to support that. This of course adds the further challenge of figuring out when we want to finish this "Pod-tracing" span.

Generally it seems like loads of fun, and requires more thought and careful design. @kubernetes/sig-api-machinery-feature-requests @kubernetes/client-go-maintainers

@0xmichalis added the kind/feature label Apr 28, 2017
@0xmichalis
Contributor

we probably would need to add some id/context to object metadata

Why not use the object uid?

@gmarek
Contributor

gmarek commented Apr 28, 2017

Then the span would cover the whole object lifetime, and we probably don't want that either.

@gmarek
Contributor

gmarek commented Aug 17, 2017

There was an effort to do so, and there's my old PR (#45962) that adds tracing to master-client communication, but I didn't have time to push it through. There's a fundamental mismatch between the Kubernetes model and the distributed tracing model (i.e. there's no notion of an 'operation' in Kubernetes, hence there's no one 'thing' to which a single span can be attached).

@timothysc added the help wanted label Aug 28, 2017
@timothysc
Member

The alternative to this is to fill out events and use them as the means of tracing work through the system.
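
A minimal sketch of that approach, using the standard client-go EventRecorder (the recordTraceStep helper and the reason string are hypothetical, not an agreed-upon convention):

package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
)

// recordTraceStep emits a Normal Event against the object being processed.
// The resulting chain of Events, ordered by timestamp, acts as a coarse
// "trace" of the object's path through the system.
func recordTraceStep(recorder record.EventRecorder, pod *corev1.Pod, step string) {
	recorder.Eventf(pod, corev1.EventTypeNormal, "TraceStep", "reached step %q", step)
}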

@rbtcollins

Forgive my latecomer confusion, but it seems like there are two different concerns being discussed in the one ticket?

Concern a) tracing the work that goes into a single RPC request: e.g. client -> apiserver -> kubelet -> docker -> lastlog-of-a-container and back up the stack.

Concern b) tracing the lifecycle of resources in the system - e.g. the pod creation, scheduling, deletion-request, actual-free

What I don't understand is the 'Kubernetes model vs. distributed tracing mismatch' vis-a-vis case a): all RPCs seem pretty clearly defined to me, with the one arguable exception being watches; you could still trace those, but a single trace might generate thousands of spans over a long time period.

Relatedly, wearing my operator-of-k8s hat: while figuring out why a pod was deleted or why a container was killed is a mild nuisance, figuring out why k8s has decided to suddenly go slow (case a) is really the key thing we need in this space, and having OpenTracing glue for that alone, reaching down into Docker, would be very nice.

@timothysc
Member

Unassigning; the watch model vs. RPC calls has proven to break down when trying to integrate. If it gets fixed at some later date I'd be interested to see how it could be incorporated. However, there is an Events v2 API that is meant to address the issues uncovered while trying to use tracing.

@tedsuo

tedsuo commented Dec 13, 2017

Hi all, Ted from the OpenTracing project here. Post-KubeCon, there has been renewed interest in integrating Kubernetes and OpenTracing. How can I be of service, and who should I talk to? I see there are some concerns around long-running traces, but possibly there is some low-hanging fruit that could be addressed.

On a related note, there is OpenTracing support for nginx-ingress, but it's currently only enabled for Zipkin. In both ingress and control-plane tracing, there's the issue of how to package a tracer with k8s. It would be helpful for me to get up to speed with someone on the ingress side of things as well, and discuss tracing on both fronts.

Cheers,
Ted

@gmarek
Contributor

gmarek commented Dec 28, 2017

Hi @tedsuo. The question you're asking has a lot of layers. From a very high-level perspective, this issue and my investigation were about adding tracing to the Kubernetes control plane in such a way that we'd be able to analyze the system's performance and easily find scalability bottlenecks. This proved to be infeasible for a number of reasons (e.g. k8s lacks the concept of an 'operation', so it's not clear when a span for, say, Pod creation should finish, not to mention Deployment or Service creation; or the fact that passing context through etcd would require a lot of nontrivial code that would work dangerously close to etcd itself). As scale analysis was our main goal and we were not able to get there using tracing, the whole tracing effort was destaffed.

On the other hand, you're probably more interested in a very simple tracing integration, where you'd just trace single API calls, which is obviously doable (no consensus issues, and a natural span concept). I had a ~working POC PR (#45962) for that which I never had time to finish. The main challenge here is to allow the user to inject a context into client calls (issue #46503), but no one is working on that currently AFAIK (@kubernetes/sig-api-machinery-misc).
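
Absent per-call context injection, one stopgap sketch is to wrap the client transport so every API request gets its own span and carries trace headers (assuming client-go's rest.Config.WrapTransport hook; the type and function names are illustrative):

package example

import (
	"net/http"

	opentracing "github.com/opentracing/opentracing-go"
	"k8s.io/client-go/rest"
)

// tracingRoundTripper starts one span per API request and injects its context
// into the outgoing headers so the server side could, in principle, join it.
type tracingRoundTripper struct {
	tracer opentracing.Tracer
	next   http.RoundTripper
}

func (t *tracingRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	span := t.tracer.StartSpan("k8s-api " + req.Method + " " + req.URL.Path)
	defer span.Finish()

	_ = t.tracer.Inject(span.Context(), opentracing.HTTPHeaders,
		opentracing.HTTPHeadersCarrier(req.Header))

	return t.next.RoundTrip(req)
}

// withTracing decorates a rest.Config so every request made through clients
// built from it goes through the tracing round tripper.
func withTracing(cfg *rest.Config, tracer opentracing.Tracer) *rest.Config {
	cfg.WrapTransport = func(rt http.RoundTripper) http.RoundTripper {
		return &tracingRoundTripper{tracer: tracer, next: rt}
	}
	return cfg
}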

@tedsuo

tedsuo commented Jan 2, 2018

Thanks for the rundown @gmarek. I agree that basic API tracing (both for control plane activity and ingress) is a good starting point.

"Long-lived traces" are not currently supported by most tracing systems (though, ironically, scale analysis for container scheduling is what initially got me involved with tracing), so there is currently no off-the-shelf solution for tracing complex operations where the work ends up queued in various databases and potentially lasts for more than 5 minutes. You were correct to build an Events API (or log format) for that problem, which can then be dumped into various tracing/analysis systems.

I'll try to find the correct SIG and see how we can provide assistance. BTW, https://github.com/orgs/kubernetes/teams/sig-api-machinery-misc seems to not exist (or be protected)?

@gmarek
Contributor

gmarek commented Jan 3, 2018

I don't know how to refer to SIGs anymore :( @smarterclayton @lavalamp @caesarxuchao @sttts @deads2k @liggitt

@deads2k
Contributor

deads2k commented Jan 4, 2018

I'm interested in having both intra- and inter-process tracing, but the best I can do is offer review time. I've generally used this kind of information to chase bugs in multi-threaded code. Using it to track down bottlenecks instead of using a metrics-gathering tool would be new for me.

@rbtcollins

@tedsuo do you see anything in the OpenTracing API that is inimical to long-lived traces? The data model seemed entirely fine for k8s's tracing needs to me. We're hoping to work on this space in the medium term. (Not a forward-looking commitment, etc.)

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Apr 9, 2018
@nikhita
Member

nikhita commented Apr 9, 2018

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Apr 9, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jul 8, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Aug 7, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@warmchang
Contributor

warmchang commented Mar 14, 2019

Is there any news about this feature? Thanks!

I've found these; the topic is under discussion:

kubernetes/enhancements#650
containernetworking/cni#561
containerd/containerd#3057

@lengrongfu
Contributor

I've found this; containerd added tracing:
containerd/containerd#5731
