
Consider adding tracing support for OpenTracing (or similar) #26507

Closed
smarterclayton opened this issue May 29, 2016 · 54 comments
Labels
area/monitoring
help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
kind/feature: Categorizes issue or PR as related to a new feature.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
sig/api-machinery: Categorizes an issue or PR as relevant to SIG API Machinery.
sig/scalability: Categorizes an issue or PR as relevant to SIG Scalability.

Comments

@smarterclayton
Contributor

In 1.3 we took most of the fat out of the internal apiserver and client machinery. However, we have fairly complex internal flows for controllers, caches, watches, and other distribution mechanisms that are more difficult to trace and profile. In addition, as we add larger numbers of control loops and feedback mechanisms into the system, we are seeing more inconsistent outcomes that are hard to reason about.

For 1.4, I think it would be valuable to broaden the traces started in #8806, possibly by adding OpenTracing support. It would also give us a bit more contextual information on failures. We would also need to consider how traces could propagate across control loops, such as whether we do level-driven traces from the kubelet or elsewhere.

@smarterclayton
Contributor Author

Xref #815

@smarterclayton
Contributor Author

@kubernetes/sig-api-machinery

@smarterclayton
Contributor Author

xref etcd-io/etcd#5425

@wojtek-t
Member

+100

This would be highly useful (no matter which tracing mechanism we decide on).
We're currently using those very simple traces added in #8806, and over time we have added them in multiple other places all over the code. They are super useful in debugging performance issues.
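
For readers who haven't seen them, those traces are used roughly like this (a minimal sketch; the import path and the handler/helper names are assumptions, not the exact in-tree code):

package main

import (
	"time"

	utiltrace "k8s.io/utils/trace" // assumed import path; the original helper lived under pkg/util/trace
)

// handleRequest stands in for an apiserver handler or controller step.
func handleRequest() {
	// Start a trace; the recorded steps are only logged if the whole
	// operation takes longer than the threshold given to LogIfLong.
	trace := utiltrace.New("example operation")
	defer trace.LogIfLong(500 * time.Millisecond)

	doExpensiveWork() // hypothetical placeholder for real work
	trace.Step("finished expensive work")

	doMoreWork() // hypothetical placeholder
	trace.Step("finished remaining work")
}

func doExpensiveWork() { time.Sleep(10 * time.Millisecond) }
func doMoreWork()      { time.Sleep(10 * time.Millisecond) }

func main() { handleRequest() }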

@lavalamp
Member

Sounds like something that can & should be added to Context...
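
A minimal sketch of what that could look like with the OpenTracing Go API, where the span rides along in a context.Context (the syncPod/fetchPodStatus names here are made up for illustration):

package main

import (
	"context"

	opentracing "github.com/opentracing/opentracing-go"
)

// syncPod is a hypothetical control-loop step; its span travels in ctx.
func syncPod(ctx context.Context, podName string) {
	// Start a child of whatever span is already in ctx (or a new root
	// span if there is none) and store it back into ctx.
	span, ctx := opentracing.StartSpanFromContext(ctx, "syncPod")
	defer span.Finish()

	span.SetTag("pod.name", podName)
	fetchPodStatus(ctx, podName)
}

// fetchPodStatus is a hypothetical downstream call that continues the trace.
func fetchPodStatus(ctx context.Context, podName string) {
	span, _ := opentracing.StartSpanFromContext(ctx, "fetchPodStatus")
	defer span.Finish()
	span.SetTag("pod.name", podName)
}

func main() {
	// With no tracer registered, the global NoopTracer is used and this is a no-op.
	syncPod(context.Background(), "example-pod")
}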


@bhs

bhs commented May 31, 2016

From the OpenTracing side, we would be glad to help... please be in touch with me or others on the OT Gitter, etc. Thanks!

@mikedanese added the area/monitoring, team/none, and sig/api-machinery labels and removed the team/none label Jun 1, 2016
@mikedanese
Member

gRPC will have Census support, but it's a WIP.

https://github.com/grpc/grpc/blob/master/src/core/ext/census/README.md

@josephjacks

josephjacks commented Sep 8, 2016

+1! I am very much in favor of this. I have been talking with some customers about tracing being a first-class concern in K8s. It could solve a world of problems in larger-scale deployments.

I just spoke with @bensigelman about this. Would we need to use something like gRPC to provide a common cluster-wide method for propagating trace data?

gRPC supports OT now: https://github.com/grpc-ecosystem/grpc-opentracing

@bhs

bhs commented Sep 8, 2016

@josephjacks FYI, the Go support for gRPC is in a PR at the moment... it was blocked on client-side interceptors, which were just introduced into gRPC-Go last week. FWIW, integration with gRPC+OT in Go is now O(1) code per gRPC service...

Clients:

var tracer opentracing.Tracer = ...
...

// Set up a connection to the server peer.
conn, err := grpc.Dial(
    address,
    ... // other options
    grpc.WithUnaryInterceptor(
        otgrpc.OpenTracingClientInterceptor(tracer)))

// All future RPC activity involving `conn` will be automatically traced.

Servers:

var tracer opentracing.Tracer = ...

// Initialize the gRPC server.
s := grpc.NewServer(
    ... // other options
    grpc.UnaryInterceptor(
        otgrpc.OpenTracingServerInterceptor(tracer)))

// All future RPC activity involving `s` will be automatically traced.

@josephjacks

@bensigelman that is awesome! thanks for sharing this.

@timothysc added the sig/scalability label Mar 7, 2017
@timothysc
Member

Because this hits across every component, I'd like to discuss this on @kubernetes/sig-scalability-feature-requests for initial investigation in 1.7. @gmarek and I have chatted about it a number of times now.

@timothysc added this to the v1.7 milestone Mar 7, 2017
@jayunit100
Member

Can we use this to avoid having to reconstruct the entire sequence of scheduling events from plain logs? If so, that helps solve some of our logging issues there (and in other places) as well. @eparis

@timothysc self-assigned this Apr 20, 2017
@timothysc
Member

@gmarek and I are starting on this in earnest now... I'm going to hit the scheduler first.
/cc @kubernetes/sig-scheduling-feature-requests

@bhs

bhs commented Apr 27, 2017

@timothysc @gmarek: Marek and I had a quick chat this AM. I am happy to help with modeling / conceptual issues if they crop up. Please just ping me on gitter or via email / whatever.

@timothysc
Member

@bhs thanks. Are you on kubernetes.slack.com ?

@bhs

bhs commented Apr 27, 2017

@timothysc I am not, and frankly I am too swamped to pay attention to another firehose in my life. :) But I take IRQs happily.

@timothysc
Member

Mucking with the client <> API interaction is the hardest part, from what we've seen so far.

@gmarek
Contributor

gmarek commented Apr 28, 2017

@bhs mentioned that it's probably best to start by adding support for context.Context to the things that are of interest to us, and then using it to pass the span id across process/machine boundaries. It'd mean some big changes in our client-go library, including, but not restricted to, figuring out how we want to handle our generated protobufs.
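
For the cross-process part, a rough sketch with the plain OpenTracing Go API, assuming an HTTP hop between components (the helper names below are illustrative, not a proposed client-go API):

package example

import (
	"log"
	"net/http"

	opentracing "github.com/opentracing/opentracing-go"
)

// injectSpan writes the current span's context into an outgoing request's
// headers so the receiving process can continue the same trace.
func injectSpan(span opentracing.Span, req *http.Request) {
	err := span.Tracer().Inject(
		span.Context(),
		opentracing.HTTPHeaders,
		opentracing.HTTPHeadersCarrier(req.Header),
	)
	if err != nil {
		log.Printf("failed to inject span context: %v", err)
	}
}

// extractSpan starts a server-side span joined to the caller's trace when
// trace headers are present; most tracers fall back to a new root span otherwise.
func extractSpan(tracer opentracing.Tracer, r *http.Request, op string) opentracing.Span {
	wireCtx, _ := tracer.Extract(
		opentracing.HTTPHeaders,
		opentracing.HTTPHeadersCarrier(r.Header),
	)
	return tracer.StartSpan(op, opentracing.ChildOf(wireCtx))
}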

As we want to be able to trace the flow of a single object through the system, we would probably need to add some id/context to object metadata, so we'd be able to track e.g. a Pod from creation, through scheduling, to running. This would probably require changes to the runtime.Object interface, as it'd need to support that. This of course adds the further challenge of figuring out when we want to finish this "Pod-tracing" span.

Generally it seems like loads of fun, and requires more thought and careful design. @kubernetes/sig-api-machinery-feature-requests @kubernetes/client-go-maintainers

@0xmichalis added the kind/feature label Apr 28, 2017
@0xmichalis
Contributor

we probably would need to add some id/context to object metadata

Why not use the object uid?

@gmarek
Contributor

gmarek commented Apr 28, 2017

Then the span would cover the whole object lifetime, and we probably don't want that either.

@gmarek
Contributor

gmarek commented Aug 17, 2017

There was an effort to do so, and there's my old PR (#45962) that adds tracing to master-client communication, but I didn't have time to push it through. There's a fundamental mismatch between the Kubernetes model and the distributed tracing model (i.e. there's no notion of an 'operation' in Kubernetes, hence there's no one 'thing' to which a single span can be attached).

@timothysc added the help wanted label Aug 28, 2017
@timothysc
Member

The alternative to this is to fill out events and use them as the means of tracing work through the system.
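
A minimal sketch of that approach, using the standard client-go EventRecorder (the recordTraceStep helper and the reason string are hypothetical, not an agreed-upon convention):

package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
)

// recordTraceStep emits a Normal Event against the object being processed.
// The resulting chain of Events, ordered by timestamp, acts as a coarse
// "trace" of the object's path through the system.
func recordTraceStep(recorder record.EventRecorder, pod *corev1.Pod, step string) {
	recorder.Eventf(pod, corev1.EventTypeNormal, "TraceStep", "reached step %q", step)
}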

@rbtcollins

Forgive my latecomer confusion, but it seems like there are two different concerns being discussed in the one ticket?

Concern a) tracing the work that goes into a single RPC request: e.g. client -> apiserver -> kubelet -> docker -> lastlog-of-a-container and back up the stack.

Concern b) tracing the lifecycle of resources in the system - e.g. the pod creation, scheduling, deletion-request, actual-free

What I don't understand is the 'Kubernetes model vs. distributed tracing mismatch' vis-a-vis case a): all RPCs seem pretty clearly defined to me, with the one arguable exception being watches; you could still trace those, but a single trace might generate thousands of spans over a long time period.

Relatedly, wearing my operator-of-k8s hat: while figuring out why a pod was deleted or why a container was killed is a mild nuisance, figuring out why k8s has decided to suddenly go slow (case a) is really the key thing we need in this space, and having OpenTracing glue for that alone, reaching down into Docker, would be very nice.

@timothysc
Member

Unassigning; the watch model vs. RPC calls has proven to break down when trying to integrate. If it gets fixed at some later date I'd be interested to see how it could be incorporated. However, there is an Events v2 API that is meant to address the issues uncovered while trying to use tracing.

@tedsuo

tedsuo commented Dec 13, 2017

Hi all, Ted from the OpenTracing project here. Post-KubeCon, there has been renewed interest in integrating Kubernetes and OpenTracing. How can I be of service, and who should I talk to? I see there are some concerns around long-running traces, but possibly there is some low-hanging fruit that could be addressed.

On a related note, there is OpenTracing support for nginx-ingress, but it's currently only enabled for Zipkin. In both ingress and control-plane tracing, there's the issue of how to package a tracer with k8s. It would be helpful for me to get up to speed with someone on the ingress side of things as well, and discuss tracing on both fronts.

Cheers,
Ted

@gmarek
Contributor

gmarek commented Dec 28, 2017

Hi @tedsuo. The question you're asking has a lot of layers. From a very high-level perspective, this issue and my investigation were about adding tracing to the Kubernetes control plane in such a way that we'd be able to analyze the system's performance and easily find scalability bottlenecks. This proved to be infeasible for a number of reasons (e.g. k8s lacks the concept of an 'operation', so it's not clear when a span for, say, Pod creation should finish, not to mention Deployment or Service creation; or the fact that passing context through etcd would require a lot of nontrivial code that would work dangerously close to etcd itself). As scale analysis was our main goal and we were not able to get there using tracing, the whole tracing effort was destaffed.

On the other hand, you're probably more interested in a very simple tracing integration, where you'd just trace single API calls, which is obviously doable (no consensus issues, and a natural span concept). I had a ~working POC PR (#45962) for that which I never had time to finish. The main challenge here is to allow the user to inject a context into client calls (issue #46503), but no one is working on that currently AFAIK (@kubernetes/sig-api-machinery-misc).
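
Absent per-call context injection, one stopgap sketch is to wrap the client transport so every API request gets its own span and carries trace headers (assuming client-go's rest.Config.WrapTransport hook; the type and function names are illustrative):

package example

import (
	"net/http"

	opentracing "github.com/opentracing/opentracing-go"
	"k8s.io/client-go/rest"
)

// tracingRoundTripper starts one span per API request and injects its context
// into the outgoing headers so the server side could, in principle, join it.
type tracingRoundTripper struct {
	tracer opentracing.Tracer
	next   http.RoundTripper
}

func (t *tracingRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	span := t.tracer.StartSpan("k8s-api " + req.Method + " " + req.URL.Path)
	defer span.Finish()

	_ = t.tracer.Inject(span.Context(), opentracing.HTTPHeaders,
		opentracing.HTTPHeadersCarrier(req.Header))

	return t.next.RoundTrip(req)
}

// withTracing decorates a rest.Config so every request made through clients
// built from it goes through the tracing round tripper.
func withTracing(cfg *rest.Config, tracer opentracing.Tracer) *rest.Config {
	cfg.WrapTransport = func(rt http.RoundTripper) http.RoundTripper {
		return &tracingRoundTripper{tracer: tracer, next: rt}
	}
	return cfg
}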

@tedsuo

tedsuo commented Jan 2, 2018

Thanks for the rundown @gmarek. I agree that basic API tracing (both for control plane activity and ingress) is a good starting point.

"Long-lived traces" are not currently supported by most tracing systems (though, ironically, scale analysis for container scheduling is what initially got me involved with tracing), so there is currently no off-the-shelf solution for tracing complex operations where the work ends up queued in various databases and potentially lasts for more than 5 minutes. You were correct to build an Events API (or log format) for that problem, which can then be dumped into various tracing/analysis systems.

I'll try to find the correct SIG and see how we can provide assistance. BTW, https://github.com/orgs/kubernetes/teams/sig-api-machinery-misc seems to not exist (or be protected)?

@gmarek
Contributor

gmarek commented Jan 3, 2018

I don't know how to refer to SIGs anymore :( @smarterclayton @lavalamp @caesarxuchao @sttts @deads2k @liggitt

@deads2k
Contributor

deads2k commented Jan 4, 2018

I'm interested in having both intra- and inter-process tracing, but the best I can do is offer review time. I've generally used this kind of information to chase bugs in multi-threaded code. Using it to track down bottlenecks instead of using a metrics-gathering tool would be new for me.

@rbtcollins

@tedsuo do you see anything in the OpenTracing API that is inimical to long-lived traces? The data model seemed entirely fine for k8s's tracing needs to me. We're hoping to work on this space in the medium term. (Not a forward-looking commitment, etc.)

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Apr 9, 2018
@nikhita
Member

nikhita commented Apr 9, 2018

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Apr 9, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jul 8, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Aug 7, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@warmchang
Contributor

warmchang commented Mar 14, 2019

Is there any news about this feature? Thanks!

I've found these; the topic is under discussion:

kubernetes/enhancements#650
containernetworking/cni#561
containerd/containerd#3057

@lengrongfu
Contributor

I've found this; containerd added tracing:
containerd/containerd#5731
