Propose KEP: Leveraging Distributed Tracing to Understand Kubernetes Object Lifecycles #650

Open · wants to merge 7 commits into base: master
Conversation

@k8s-ci-robot (Contributor) commented Dec 7, 2018

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@Monkeyanator (Author) commented Dec 7, 2018

@kubernetes/sig-instrumentation-feature-requests

@k8s-ci-robot (Contributor) commented Dec 7, 2018

@Monkeyanator: Reiterating the mentions to trigger a notification:
@kubernetes/sig-instrumentation-feature-requests

In response to this:

@kubernetes/sig-instrumentation-feature-requests

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Monkeyanator (Author) commented Dec 7, 2018

/assign @brancz

@brancz (Member) left a comment

First pass. I think before I'm comfortable deciding on this architecture, I'd like us to do some research and reflect on different possible solutions and dependencies. Generally super excited about this, though!

4 outdated review threads on keps/sig-instrumentation/0034-distributed-tracing-kep.md

Distributed tracing, on the other hand, provides a single window into latency information from across many components and plugins. Trace data is structured, and there are numerous established backends for visualizing and querying over it. This KEP would make it possible to, for instance, retrieve and visualize all pod startups that took more than 30 seconds, involved an `nginx` container, and which mounted more than two volumes.

In addition, due to the self-healing nature of Kubernetes, regressions wherein latencies are affected but the overall task is eventually accomplished are not uncommon. With our current monitoring architecture, these "soft regressions" are often difficult to observe and diagnose. Collecting structured trace data on per-object latencies would enable us to detect these long-term regressions automatically, and quickly determine their root causes.

@brancz (Member) · Dec 8, 2018

Can you be more specific in what you mean by "soft regressions" and how the monitoring architecture is not sufficient?

For this specific use case it sounds like to me that a combination of both improving the metrics instrumentation (which is indeed not good enough today) plus sampling "bad" traces would significantly improve the current debugging process.

@Monkeyanator (Author) · Dec 10, 2018

By "soft regression", I mean an issue that doesn't result in a definitive failure, but rather in degraded performance.

You are definitely correct in that even just improving the metrics and sampling bad traces would improve the current process. I think what I was trying to highlight here was that there is potential to plug into existing trace analysis tools to perform automatic root-cause-analysis.

This could make it possible to, for example, detect a latency regression in pod startup, and then attribute that regression to a change in some metadata (such as a container version, or notice that the regression shows when a pod mounts a certain volume, etc etc). Latency metrics lack the structure / context required to perform this kind of analysis.

Will clarify the KEP on this point.

@Monkeyanator (Author) · Dec 10, 2018

@dashpole on this as well, who might have a better idea on how this will fit in with existing latency metrics

@dashpole (Contributor) · Dec 10, 2018

The general point here is that in addition to identifying that a regression has occurred, tracing also helps identify the root cause of the issue.

@dashpole (Contributor) · Jan 3, 2019

I updated this section to be specific on the problems we are solving.

This KEP proposes the use of the [OpenCensus tracing framework](https://opencensus.io/) to create and export spans to configured backends. The OpenCensus framework was chosen for various reasons:

1) Provides concrete, tested implementations for creating and exporting spans to diverse backends, rather than providing an API specification, as is the case with [OpenTracing](https://opentracing.io/specification/)
2) [Provides an agent](https://github.com/census-instrumentation/opencensus-service) which enables lazy configuration for exporters, batching of spans, and other features

@brancz (Member) · Dec 8, 2018

This is likely going to need a sig-architecture discussion, as I'm not sure this heavy of a dependency is something we want to carry long term. I don't know enough about OpenCensus, is this really a required component?

@Monkeyanator (Author) · Dec 10, 2018

The OpenCensus agent is not required, but it is the solution we're leaning towards for the initial version. The attractive feature of the agent is that it allows us to configure the destination for our exported traces on-the-fly, and in an out-of-tree component (fewer in-tree changes).

The main alternative to using the agent would be to export spans from the instrumented components themselves directly to the tracing backends (which is what our current implementation work has been doing). This is a valid alternative, and I will update this section in the KEP to discuss it.

@brancz (Member) · Jan 17, 2019

I see the reason and benefit of extracting this into the sidecar, but I'm not seeing this feature ever leaving preview or alpha state without this issue being resolved. I'm ok with it at this stage, but I want to have mentioned it upfront, as I have doubts with sig-architecture approving this even as an optional feature, as it's a significant change to how Kubernetes is used/deployed/operated. The OpenCensus team even encountered problems suggesting to deploy the agent as DaemonSets. See "Open Questions" here: https://docs.google.com/document/d/1U2McyGwPIm0win_0uNQqUlPJrrQh1WH5J4m8q8KQyv4/edit#heading=h.rgbw704usq10

@dashpole (Contributor) · Jan 23, 2019

Added review from sig-instrumentation and sig-architecture on this for beta


To correlate work done between components as belonging to the same trace, we must pass span context across process boundaries. In traditional distributed systems, this context can be passed down through RPC metadata or HTTP headers. Kubernetes, however, due to its watch-based nature, requires us to attach trace context directly to the target object.

In this proposal, we choose to propagate this span context as an encoded string in an object annotation called `trace.kubernetes.io/context`. This annotation value is regenerated and replaced when an object's trace ends, to achieve the desired behavior from [section one](#trace-lifecycle).
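The annotation round-trip can be sketched in Go. The KEP does not fix a wire format here, so the hex `traceID-spanID` encoding, the `SpanContext` shape, and both helpers below are illustrative assumptions, not the proposal's actual encoding:

```go
package main

import (
	"encoding/hex"
	"fmt"
)

const traceAnnotation = "trace.kubernetes.io/context"

// SpanContext is the minimal state that must cross process boundaries:
// a 16-byte trace ID and an 8-byte parent span ID.
type SpanContext struct {
	TraceID [16]byte
	SpanID  [8]byte
}

// encodeContext serializes a span context into the annotation value.
// (Hex "traceID-spanID" is an assumed format for illustration.)
func encodeContext(sc SpanContext) string {
	return hex.EncodeToString(sc.TraceID[:]) + "-" + hex.EncodeToString(sc.SpanID[:])
}

// decodeContext recovers the span context a watcher reads back off the object.
func decodeContext(v string) (SpanContext, error) {
	var sc SpanContext
	if len(v) != 32+1+16 || v[32] != '-' {
		return sc, fmt.Errorf("malformed trace context %q", v)
	}
	t, err := hex.DecodeString(v[:32])
	if err != nil {
		return sc, err
	}
	s, err := hex.DecodeString(v[33:])
	if err != nil {
		return sc, err
	}
	copy(sc.TraceID[:], t)
	copy(sc.SpanID[:], s)
	return sc, nil
}

func main() {
	annotations := map[string]string{}
	sc := SpanContext{TraceID: [16]byte{1, 2, 3}, SpanID: [8]byte{4, 5}}
	annotations[traceAnnotation] = encodeContext(sc)
	back, _ := decodeContext(annotations[traceAnnotation])
	fmt.Println(back == sc) // true
}
```

Any component watching the object can decode the annotation and parent its own spans under the propagated context, which is what replaces RPC-header propagation in the watch-based model.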

@brancz (Member) · Dec 8, 2018

I might be missing something, but this seems like it's prone to multiple "traces being started" concurrently causing race conditions where trace contexts are concurrently overwritten.

@Monkeyanator (Author) · Dec 10, 2018

As long as we ensure that there's a single state transition that we consider the beginning of a trace, and a single state transition that marks its end, I believe we should be able to avoid any race conditions here.

@dashpole on this as well.

@dashpole (Contributor) · Dec 10, 2018

It is worth noting that updates to an object's trace annotation should only be done by a single component, usually the controller responsible for updating the status of the object. For example, the kubelet updates the annotation after updating the pod from pending -> running.

@dashpole (Contributor) · Jan 3, 2019

On further thought, I think I understand where this is coming from. Concurrent updates shouldn't be an issue, since the last update's trace context is the one used, but there could be a race between "ending a trace" by replacing the trace context and "starting a trace" from an update, for example.


This KEP proposes the use of the [OpenCensus tracing framework](https://opencensus.io/) to create and export spans to configured backends. The OpenCensus framework was chosen for various reasons:

1) Provides concrete, tested implementations for creating and exporting spans to diverse backends, rather than providing an API specification, as is the case with [OpenTracing](https://opentracing.io/specification/)

@brancz (Member) · Dec 8, 2018

I'm generally a big fan of the motivations and intentions of the OpenCensus project, but I'm a little concerned about it being a rather young project.

@Monkeyanator (Author) · Dec 10, 2018

Agreed, the OC project is still quite young. However, based on the fact that this would be an experimental, opt-in alpha feature, it might be acceptable for us to bring in for use provided we stick to its stable features (starting, ending, and exporting spans).


#### Context propagation

To correlate work done between components as belonging to the same trace, we must pass span context across process boundaries. In traditional distributed systems, this context can be passed down through RPC metadata or HTTP headers. Kubernetes, however, due to its watch-based nature, requires us to attach trace context directly to the target object.

@brancz (Member) · Dec 8, 2018

I've thought about this before, and I'm not entirely sure this is 100% true; properly solving this just sounds like a larger effort, namely making etcd context/tracing aware, where the context of any modification call to etcd is carried through etcd and published in the watch event.

@Monkeyanator (Author) · Dec 10, 2018

Since the proposal suggests attaching span context to the object metadata, as an annotation, it shouldn't introduce any additional complexity to etcd.

While some of the previous discussion around tracing has called for adding trace awareness to etcd, and hooking into writes for trace points, our proposal doesn't suggest this route. Is this what you mean by "making etcd context/tracing aware?"

@brancz (Member) · Jan 17, 2019

I meant that we should technically be able to trace everything despite Kubernetes and "its watch-based nature". Any event from a watch could carry the trace ID of the originating change made against the API.

@dashpole (Contributor) · Jan 23, 2019

added this to the KEP

@wojtek-t self-requested a review · Dec 10, 2018


* **Logs**: are fragmented, and finding out which process was the bottleneck involves digging through troves of unstructured text. In addition, logs do not offer higher-level insight into overall system behavior without an extensive background on the process of interest.
* **Events**: in Kubernetes are only kept for an hour by default, and don't integrate with visualization or analysis tools. To gain trace-like insights would require a large investment in custom tooling.
* **Latency metrics**: are gathered in some places, but these don't provide understanding into _why_ a given process was slow.

@dashpole (Contributor) · Dec 20, 2018

Part of the reason why latency metrics aren't a great way to determine why a process was slow is cardinality. You wouldn't, for example, want to attach the container ID to a hypothetical container_start_latency metric, because you would be creating a new metric stream for each container, each with only a single sample taken.

@wojtek-t (Member) left a comment

@gmarek - FYI


Kubernetes is unique in that it is constantly reconciling its actual state towards some desired state. As a result, it has no definitive concept of an "operation", which breaks the traditional model for distributed tracing. This raises the question of when to begin traces, and when to end them.

In this proposal, we choose to _only_ trace phases of an object's lifecycle wherein it's correcting from an undesired state to its desired state, and to end the trace when it enters this desired state. This means that the same object will export traces for each reconciliation it undergoes. This decision was made because:

@wojtek-t (Member) · Dec 28, 2018

What if, in the meantime, the desired state changes and we never reach the original desired state?

@dashpole (Contributor) · Jan 2, 2019

The original trace will end prematurely, and subsequent traced actions are attributed to the new desired state. Since we generally care about the slowest reconciliations, ending a trace before the process is complete should be fine.
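The rule described here, one trace per desired state with an early end when the spec changes, can be sketched as a small state machine. The `tracker` helper below is hypothetical and only illustrates the lifecycle decisions, not any proposed implementation:

```go
package main

import "fmt"

// tracker owns at most one active trace, keyed by the desired-state
// generation it is correcting towards.
type tracker struct {
	generation int64    // desired state currently being traced
	active     bool     // is a trace in flight?
	ended      []string // how each finished trace ended, for inspection
}

// observe is called on every reconcile with the object's desired-state
// generation and whether actual state now matches desired state.
func (tr *tracker) observe(generation int64, atDesired bool) {
	if tr.active && generation != tr.generation {
		// Desired state changed mid-flight: end the old trace early.
		tr.ended = append(tr.ended, "premature")
		tr.active = false
	}
	switch {
	case !tr.active && !atDesired:
		// Correcting toward a new desired state: start a trace.
		tr.generation, tr.active = generation, true
	case tr.active && atDesired:
		// Desired state reached: end the trace normally.
		tr.ended = append(tr.ended, "complete")
		tr.active = false
	}
}

func main() {
	var tr tracker
	tr.observe(1, false) // start tracing toward generation 1
	tr.observe(2, false) // spec changed: gen-1 trace ends prematurely
	tr.observe(2, true)  // gen-2 reached: trace completes
	fmt.Println(tr.ended) // [premature complete]
}
```

This captures dashpole's point: the superseded trace simply ends early, and all subsequent traced work is attributed to the new desired state.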

@wojtek-t (Member) · Jan 3, 2019

Assuming that it's the component that has the knowledge about that previous trace...
Anyway, I think any option here is potentially fine, but I would like to see it written down in the KEP to give people a chance to discuss it.

@dashpole (Contributor) · Jan 3, 2019

I added a pretty lengthy example of how this should work, and an explanation.


In the standard model for distributed tracing, there exists a span in each trace that all other spans are descendants of and which extends the length of the entire trace, called the `root span`.

The Kubernetes component that kicks off an operation might not be the same component that ends it. In this proposal, when we are at the point where we want to end a root span, we craft a span to export which acts as the root span for the trace. For example, when the kubelet updates a pod from `Pending` to `Running`, it creates a root span using the start time of the pod as the start, and the current time as the end.

@wojtek-t (Member) · Dec 28, 2018

What about cases where the component that finishes an operation doesn't know when it was started?
As an example, the action may be triggered by updating an object, and we generally don't persist information about when the object was updated (not even the last update).

@dashpole (Contributor) · Jan 2, 2019

Root spans are useful to have, but not critical for using tracing. Essentially what you get by adding a root span is being able to collapse the entire trace during visualizations, as all spans have a common parent. Tracing backends still calculate the total duration of all spans.

The current plan for alpha is to add root spans where possible (creation, deletion), and not where it isn't (update, reconcile).

@dashpole (Contributor) · Jan 3, 2019

I made a note of this.


In this proposal, we choose to propagate this span context as an encoded string in an object annotation called `trace.kubernetes.io/context`. This annotation value is regenerated and replaced when an object's trace ends, to achieve the desired behavior from [section one](#trace-lifecycle).

This proposal chooses to use annotations as a less invasive alternative to adding a field to object metadata, but as this proposal matures, adding trace context to the official API should be considered.

@wojtek-t (Member) · Dec 28, 2018

I would like to see it mentioned very clearly that adding tracing will not result in an increased number of requests to the apiserver; otherwise it may visibly impact performance of the system, which we definitely don't want.

[That also implicitly means that the "end of an operation" has to be associated with some write request to the apiserver, which I'm not 100% convinced will always be true for the cases we're interested in.]

@kubernetes/sig-scalability-api-reviews

@dashpole (Contributor) · Jan 2, 2019

That's a great point. It definitely will have an impact during alpha, as we are using annotations. We can remove the extra write, at least in theory, if we move context propagation in-tree by adding the ability to update/regenerate the trace context during a status update.

The "end of an operation" always coincides with a status update from a non-desired state to the desired state in the current proposal. This implicitly means objects without a status don't receive new trace contexts outside of creation/update/deletion (I'm not convinced tracing is applicable to such objects). Do you have a case in mind you are not sure about?
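The "no extra writes" idea is that trace-context regeneration rides along in the same status update the controller was going to make anyway. The sketch below simulates that with a write counter; `object`, `updateStatus`, and `transitionToDesired` are hypothetical stand-ins, not Kubernetes API machinery:

```go
package main

import "fmt"

// object is a minimal stand-in for a Kubernetes API object.
type object struct {
	Status      string
	Annotations map[string]string
	writes      int // counts simulated apiserver writes
}

// updateStatus performs one (simulated) apiserver write, carrying both
// the status change and any annotation changes.
func (o *object) updateStatus(status string, annotations map[string]string) {
	o.Status = status
	o.Annotations = annotations
	o.writes++
}

// transitionToDesired moves the object into its desired state and, in
// the same write, replaces the trace context to end the trace; no
// second request is issued. (newContext is a hypothetical generator.)
func transitionToDesired(o *object, newContext func() string) {
	ann := map[string]string{}
	for k, v := range o.Annotations {
		ann[k] = v
	}
	ann["trace.kubernetes.io/context"] = newContext()
	o.updateStatus("Running", ann)
}

func main() {
	o := &object{
		Status:      "Pending",
		Annotations: map[string]string{"trace.kubernetes.io/context": "old"},
	}
	transitionToDesired(o, func() string { return "new" })
	fmt.Println(o.Status, o.Annotations["trace.kubernetes.io/context"], o.writes) // Running new 1
}
```

During alpha, with annotations updated separately from status, the equivalent flow costs an extra write, which is the load concern raised above.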

@wojtek-t (Member) · Jan 3, 2019

It definitely will have an impact during alpha, as we are using annotations.

I think it depends. If e.g. we say that pod creation should start a span, then we can build that into the machinery so that it will be done automatically. So I actually don't agree it has to be the case.

And just to be clear on that: I can live with this requirement not being satisfied in Alpha state, but I'm not going to approve it for beta+ if it will generally be creating higher load on apiserver (there can be some exceptions for some rare flows, but in general it cannot cause additional writes).

@dashpole (Contributor) · Jan 3, 2019

I added a note, and a graduation requirement for this

@dashpole (Contributor) commented Jan 2, 2019

@wojtek-t I will update the KEP once @Monkeyanator gives me write access.

Responding to comments
Updated the motivation, added trace lifecycle description for traces that start before the previous one ends, update root span description, add scalability requirement.
@justaugustus (Member) left a comment

Please remove any references to NEXT_KEP_NUMBER and rename the KEP to just be the draft date and KEP title.
KEP numbers will be obsolete once #703 merges.

@k8s-ci-robot (Contributor) commented Jan 23, 2019

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Monkeyanator
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: brancz

If they are not already assigned, you can assign the PR to them by writing /assign @brancz in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

- "@Random-Liu"
- "@bogdandrutu"
approvers:
- "@brancz"

@mattfarina (Member) · Jan 25, 2019

Since @piosz is also a chair of the SIG, should @piosz be listed as an approver?

@dashpole (Contributor) · Jan 25, 2019

I added him, although I believe only one approver is required.

- "@Monkeyanator"
editors:
- "@dashpole"
owning-sig: sig-instrumentation

@mattfarina (Member) · Jan 25, 2019

Can you use `participating-sigs:` to add sig-architecture?

@dashpole (Contributor) · Jan 25, 2019

done

@MikeSpreitzer (Member) commented Feb 8, 2019

The Kubernetes control plane, and other distributed systems built like it, is indeed lacking an important form of performance observability. In systems built out of procedure calls (local and remote), this need is often addressed by a concept of "tracing" that is built around "spans", where a span corresponds to a procedure call. The Kubernetes control plane, and systems like it, are not primarily built out of procedure calls, and observability based on a concept of spans is not a good fit. This is not to deny that "latency from point A to point B" is a very relevant concept. What I am denying is that the original data should look like procedure calls with relationships among them. Rather, the original data should look like those individual points and relationships between them, because those relationships are much richer than can reasonably be captured by a collection of non-degenerate spans (by "degenerate span" I mean one of zero length, essentially representing an individual point).

In the Kubernetes control plane, work on an object is not done by a tree of procedure calls. The Kubernetes control plane is built out of controllers that monitor the state of various objects and occasionally write part of the state of certain objects. Each write is based on what was revealed by certain earlier reads --- which in turn are simply conveying what was written earlier. In short, the fundamental stuff of control plane activity is partial state writes based on other partial state writes.

For example, consider a pod. We could try to characterize what happens to a pod as a sequence of spans, where each span starts with some client requesting a change (i.e., a create, update, or delete) and ends with the implementation --- the relevant kubelet --- satisfying that request. But that is not even a good explanation of the events at the start of the life of a pod. The first major state-setting event of a pod's lifecycle is a client creating the pod API object. That initial state typically does not include a binding to a particular node. The next major event is typically a scheduler doing another state write that binds the pod to a node. The final major event in the startup of a pod is the relevant kubelet doing a state write that indicates that the pod is running.

We could try to model this with spans by building into the model the idea that a pod's startup has a sequence of two spans: one from creation of API object to node binding, another from node binding to running state. We could say that the primary performance data for pod startup is built out of these two kinds of spans.

A pod is a relatively low-level API object in Kubernetes. There are many higher level objects of interest. Analysts whose concern with pods is only about the full startup latency of a pod --- from API object create to running state --- could write queries or code that synthesizes the full startup latency out of the two constituent spans.

But it is not always that way: it is allowed for a pod to be created in a bound state. So a given pod will not necessarily have both spans. The aforementioned analysts could write more complicated queries or code to handle both scenarios.

Perhaps more likely, we could make it "the implementation's" responsibility to create the single span that represents the full startup, and identify the one or two constituent spans as children of the full startup span. What would that implementation code look like? In both OpenTracing and OpenCensus, the parent has to exist before the child is created. So a scheduler would have to create the full-startup span as well as the scheduler-work span. The kubelet would have to be prepared to create the full-startup span if it has not already been created, as well as create the kubelet-work span.

Where are those three spans stored? If the scheduler-work span and the kubelet-work span are sinks in the DAG of spans then they can simply be created when completed and emitted into the span collection framework, leaving only the full-startup span as something that needs to be stored with the pod API object. This also requires the time of the binding write (or create, whichever is appropriate for the pod at hand) to be stored in the API object, so that it is available when the kubelet opens its leaf span. So now we are also storing a state write timestamp in addition to a span.

Alternatively, we can say that as soon as the binding is determined for a pod the kubelet-work span is started. This means that we are storing two spans with the API object: the full-startup span and the kubelet-work span.

But we will not really be satisfied with requiring the scheduler-work span and the kubelet-work span to be leaves. In both the scheduler and the kubelet there may be a sequence of spans wherein a queue worker works on a given pod, and the parent of those spans (i.e., the full scheduler-work span or the full kubelet-work span) has to be stored with the API object. So we need the API object to hold onto multiple spans: at least the full-startup span plus one for scheduler work or one for kubelet work.

If every object alternated between an idle period, in which the desired state is fully implemented, and an active period, in which "the implementation" is working through a linear sequence of intermediate states (which always occur in the same order, and we allow an intermediate state to take zero time for some objects) along the lines discussed above, then we could always impose a span-based model as discussed above. If the implementation can follow a more general state machine during an active period then it gets more complicated. Each state transition could be modeled as a span, but an analyst interested in anything other than individual state transitions, or code trying to synthesize higher level spans, has a fair amount of complexity to cope with.

The idea of defining a state machine for an object is explicitly rejected as a good general design pattern. See the remarks about "phase" at https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md . Instead of a single monolithic phase it is recommended to have a more granular concept of status (and thus of state as a whole). With state divided into several independent parts, what defines the spans for an object? I think someone or something analyzing performance may query some specific intervals, but asking the code to know beforehand exactly which pairs of events will be queried is problematic in general.

Even if we could set aside the generality problems discussed above, we are not done with pods. Consider the case of a StatefulSet, which creates and deletes pods. A StatefulSet also creates and deletes PVC objects to be used in volumes of those pods. The normal scheduler does not attempt to schedule a pod until all its volumes are ready for use; this includes waiting until referenced PVCs are bound to PVs. The controller that binds PVCs to PVs does not know or care whether or not a particular PVC was created by a StatefulSet. The scheduler does not know or care whether a particular pod is a member of a StatefulSet. The relationship between the binding of such a PVC and the scheduling of such a pod is a critical part of the performance story, but this pair of events does not look like a <update desired state of an object, indicate completion of implementation of that object> pair. That is, it does not look like what we have been talking about a span representing. It is an interesting interval, so we could require the scheduler to make a span for every <PVC used in pod volume got bound, pod-got-scheduled> interval. Note that such a span is not all about a single object, which violates the mental model we started with. Note also that a PVC's status already has "conditions", which can be used to record when the PVC got bound. But not every PVC is created by a StatefulSet for one of its pods; a PVC can be created and bound independently of a pod. Even for a PVC created for a pod in a StatefulSet, the PVC could get bound before the pod API object is created. We can not generally make a <PVC got bound, pod got scheduled> span a child of the pod's scheduling span because the former may start before the latter. Similarly, we can not generally use the parent/child relationship in the other direction either. The "FollowsFrom" relation in OpenTracing has the same problem. 
Actually, I do not see an explicit absolute requirement between start times of related spans in either OpenTracing or OpenCensus, but I think that there is an intended constraint. OpenCensus also presents the additional difficulty that a given span can have at most one parent.

A more natural model would be to define a span for the PVC controller's work on binding a PVC to a PV and then ask the scheduler to establish a relationship between the PVC binding span and the scheduler's pod scheduling span. This requires the PVC controller's span to persist on the PVC object after the span is finished. This also has the problem that there is no fixed relationship in familiar tracing terms, because, again, either of the two spans in question could start before the other.

In short, the relationship between work on a PVC and work on a pod does not fit into the existing models for relationships between spans.

There are many other examples in Kubernetes of relationships between different kinds of objects. And we can not put API objects into a containment tree. For example, the pods of one ReplicaSet may also contribute to an Endpoints object --- and also that Endpoints object may draw additional content from pods not in that ReplicaSet.

As we have already seen with pods, it is not a given that an object's implementation lies entirely in one controller; even forgetting about PVCs and such, a pod's implementation is divided between scheduler and kubelet. With general granular state, it is not necessarily true that implementation work is handed off along a sequence of controllers.

With a web of relationships between objects with granular state with concurrent bits of implementation in progress, I do not see a clearly good way to model this with spans.

What I do see is that each state write done by a controller is based on some state that controller got in earlier reads (either explicit requests or watch notifications), where each part of that state was, in turn, set by an earlier such write. It is these state writes that are the primitive performance data, and the relationships just stated are the primitive relationships. In addition to drawing what is relevant to a given individual we may want --- just as in Prometheus, or in an SQL database --- to allow an analyst to make various queries against this primitive data and its relationships.

@dashpole (Contributor) commented Feb 13, 2019

@MikeSpreitzer thanks for the feedback. I've had time to digest it, and think I understand your perspective slightly better now.

> The Kubernetes control plane, and other distributed systems built like it, is indeed lacking an important form of performance observability. In systems built out of procedure calls (local and remote), this need is often addressed by a concept of "tracing" that is built around "spans", where a span corresponds to a procedure call. The Kubernetes control plane, and systems like it, are not primarily built out of procedure calls, and observability based on a concept of spans is not a good fit. This is not to deny that "latency from point A to point B" is a very relevant concept. What I am denying is that the original data should look like procedure calls with relationships among them. Rather, the original data should look like those individual points and relationships between them, because those relationships are much richer than can reasonably be captured by a collection of non-degenerate spans (by "degenerate span" I mean one of zero length, essentially representing an individual point).

Distributed tracing is context-aware, structured, distributed latency logging. Though it is mainly used with procedure calls, it isn't limited to them. The only requirement I can see for using tracing in any system is being able to attach a context to a description of user intent, and to propagate it to all components that act on that intent. That is fundamentally why associating a given trace context with an object's desired state is a good way to adapt the tracing model to the watch-based k8s model.

> For example, consider a pod. We could try to characterize what happens to a pod as a sequence of spans, where each span starts with some client requesting a change (i.e., a create, update, or delete) and ends with the implementation --- the relevant kubelet --- satisfying that request. But that is not even a good explanation of the events at the start of the life of a pod. The first major state-setting event of a pod's lifecycle is a client creating the pod API object. That initial state typically does not include a binding to a particular node. The next major event is typically a scheduler doing another state write that binds the pod to a node. The final major event in the startup of a pod is the relevant kubelet doing a state write that indicates that the pod is running.

> We could try to model this with spans by building into the model the idea that a pod's startup has a sequence of two spans: one from creation of API object to node binding, another from node binding to running state. We could say that the primary performance data for pod startup is built out of these two kinds of spans.

> A pod is a relatively low-level API object in Kubernetes. There are many higher level objects of interest. Analysts whose concern with pods is only about the full startup latency of a pod --- from API object create to running state --- could write queries or code that synthesizes the full startup latency out of the two constituent spans.

> But it is not always that way: it is allowed for a pod to be created in a bound state. So a given pod will not necessarily have both spans. The aforementioned analysts could write more complicated queries or code to handle both scenarios.

Tracing tools already handle absent spans gracefully. When viewing a single trace, the span would simply be absent. Analysis tools aggregate spans by span name. So if we had a parent span k8s.CreatePod, and child spans scheduler.SchedulePod and kubelet.StartPod, we can already query over any of the three, regardless of whether scheduler.SchedulePod is present in all traces.

> Perhaps more likely, we could make it "the implementation's" responsibility to create the single span that represents the full startup, and identify the one or two constituent spans as children of the full startup span. What would that implementation code look like? In both OpenTracing and OpenCensus, the parent has to exist before the child is created. So a scheduler would have to create the full-startup span as well as the scheduler-work span. The kubelet would have to be prepared to create the full-startup span if it has not already been created, as well as create the kubelet-work span.

> Where are those three spans stored? If the scheduler-work span and the kubelet-work span are sinks in the DAG of spans then they can simply be created when completed and emitted into the span collection framework, leaving only the full-startup span as something that needs to be stored with the pod API object. This also requires the time of the binding write (or create, whichever is appropriate for the pod at hand) to be stored in the API object, so that it is available when the kubelet opens its leaf span. So now we are also storing a state write timestamp in addition to a span. Alternatively, we can say that as soon as the binding is determined for a pod the kubelet-work span is started. This means that we are storing two spans with the API object: the full-startup span and the kubelet-work span. But we will not really be satisfied with requiring the scheduler-work span and the kubelet-work span to be leaves. In both the scheduler and the kubelet there may be a sequence of spans wherein a queue worker works on a given pod, and the parent of those spans (i.e., the full scheduler-work span or the full kubelet-work span) has to be stored with the API object. So we need the API object to hold onto multiple spans: at least the full-startup span plus one for scheduler work or one for kubelet work.

We don't actually have to store any spans with the API object to accomplish this. As long as you have the timestamp of the start of a process, you can retroactively construct the parent span. You are correct that, by storing a few more timestamps, we could get a few more traces to wrap, for example, all of the kubelet work in a single span. But the nice thing for now is that we can just skip adding those spans when we don't have the start time, and add them in if/when we add those timestamps. Tracing tools still function even when we are missing parent spans and just have a collection of child spans. For example, we can have scheduler.SchedulePod and kubelet.StartPod, but not have k8s.CreatePod, and things work just fine. You just can't answer queries about the distribution of k8s.CreatePod latencies.

> If every object alternated between an idle period, in which the desired state is fully implemented, and an active period, in which "the implementation" is working through a linear sequence of intermediate states (which always occur in the same order, and we allow an intermediate state to take zero time for some objects) along the lines discussed above, then we could always impose a span-based model as discussed above. If the implementation can follow a more general state machine during an active period then it gets more complicated. Each state transition could be modeled as a span, but an analyst interested in anything other than individual state transitions, or code trying to synthesize higher level spans, has a fair amount of complexity to cope with.

When a kubernetes controller attempts to reconcile desired and actual state for an object, it does at least two steps:

  1. Take some action(s)
  2. Update state

For example, the scheduler does these two steps:

  1. Run algorithm to find the node on which it can place the pod.
  2. Bind the pod to the node.

While (2) is an important part of the reconciliation process, as you point out, it isn't that interesting on its own. Wrapping (1) in a span is far more interesting and useful. As we would expect from the example, the scheduler folks care immensely about how long the schedule-pod algorithm takes, and not at all about how long (2) takes.

> The idea of defining a state machine for an object is explicitly rejected as a good general design pattern. See the remarks about "phase" at https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md . Instead of a single monolithic phase it is recommended to have a more granular concept of status (and thus state as a whole). With state divided into several independent parts, what defines the spans for an object? I think someone or something analyzing performance may query some specific intervals, but asking the code to know beforehand exactly which pairs of events will be queried is problematic in general.

I am not suggesting we model anything after a state machine. Let me know if something in the proposal led you to think that, and I can update it to make it clearer.

The spans for an object are any operation which advances the actual state toward the desired state. It doesn't need to be linear, and often many steps happen in parallel. This includes actions like:

  • Running an algorithm (e.g. schedule pod algorithm)
  • Calling out to another service (e.g. create a Persistent Disk from a cloud provider or calling the container runtime to create a container).
  • Updating the actual state
  • Creating/updating another object

> Even if we could set aside the generality problems discussed above, we are not done with pods. Consider the case of a StatefulSet, which creates and deletes pods. A StatefulSet also creates and deletes PVC objects to be used in volumes of those pods. The normal scheduler does not attempt to schedule a pod until all its volumes are ready for use; this includes waiting until referenced PVCs are bound to PVs. The controller that binds PVCs to PVs does not know or care whether or not a particular PVC was created by a StatefulSet. The scheduler does not know or care whether a particular pod is a member of a StatefulSet. The relationship between the binding of such a PVC and the scheduling of such a pod is a critical part of the performance story, but this pair of events does not look like a <update desired state of an object, indicate completion of implementation of that object> pair. That is, it does not look like what we have been talking about a span representing. It is an interesting interval, so we could require the scheduler to make a span for every <PVC used in pod volume got bound, pod-got-scheduled> interval. Note that such a span is not all about a single object, which violates the mental model we started with. Note also that a PVC's status already has "conditions", which can be used to record when the PVC got bound. But not every PVC is created by a StatefulSet for one of its pods; a PVC can be created and bound independently of a pod. Even for a PVC created for a pod in a StatefulSet, the PVC could get bound before the pod API object is created. We can not generally make a <PVC got bound, pod got scheduled> span a child of the pod's scheduling span because the former may start before the latter. Similarly, we can not generally use the parent/child relationship in the other direction either. The "FollowsFrom" relation in OpenTracing has the same problem. Actually, I do not see an explicit absolute requirement between start times of related spans in either OpenTracing or OpenCensus, but I think that there is an intended constraint. OpenCensus also presents the additional difficulty that a given span can have at most one parent.

> A more natural model would be to define a span for the PVC controller's work on binding a PVC to a PV and then ask the scheduler to establish a relationship between the PVC binding span and the scheduler's pod scheduling span. This requires the PVC controller's span to persist on the PVC object after the span is finished. This also has the problem that there is no fixed relationship in familiar tracing terms, because, again, either of the two spans in question could start before the other.

> In short, the relationship between work on a PVC and work on a pod does not fit into the existing models for relationships between spans.

Ok, I think I owe you at least a hypothetical way we could handle hierarchies in kubernetes... I haven't implemented this, but I hope it shows that it is possible to handle such object relationships relatively elegantly.
There are a couple of key observations I want to start out with:

  • Our goal is to attach the context to user intent, not necessarily a specific object's spec per se.
  • While a controller is reconciling the desired and actual state of object A, creating or updating object B is an expression of the same user intent as object A.
    • For example, when the StatefulSet controller creates a pod object, that pod represents the same user intent as the StatefulSet.

Therefore, I propose that when a controller, acting in the context of object A, modifies the desired state of object B, it should propagate that context to object B. This means each user-initiated object creation results in a single trace, since all objects created as a result of it have the trace context propagated to them. This captures the relationship between multiple objects created on behalf of a higher-level object, such as a StatefulSet, which creates both PVCs and Pods; they are connected by the fact that they are both associated with the same StatefulSet. Parent-child relationships in the trace mirror Kubernetes object owner relationships, though the two are not identical: the owner of an object never changes, whereas the context of a given object is determined by the last controller to modify it, which may not be the same one that created it.

> There are many other examples in Kubernetes of relationships between different kinds of objects. And we can not put API objects into a containment tree. For example, the pods of one ReplicaSet may also contribute to an Endpoints object --- and also that Endpoints object may draw additional content from pods not in that ReplicaSet.

The Endpoints object does not have a desired and actual state to reconcile. It is simply a statement of fact.

There is generally a class of "selector" objects, such as a Service, which do not "own" other objects, but rather select over them. We have started moving to a model where such objects, rather than having their own state, inject their state into other objects. See the Pod Ready++ KEP, where the "readiness" of other objects, such as endpoints, is included in the pod status rather than in a separate service status, for example. In this case, setting up a service or setting endpoints actually becomes part of the process of reconciling the pod's actual state, and thus the action of setting up a service or endpoint should use the context of the pod it is acting on when performing the required actions/updates.

> As we have already seen with pods, it is not a given that an object's implementation lies entirely in one controller; even forgetting about PVCs and such, a pod's implementation is divided between scheduler and kubelet. With general granular state, it is not necessarily true that implementation work is handed off along a sequence of controllers.

> With a web of relationships between objects with granular state with concurrent bits of implementation in progress, I do not see a clearly good way to model this with spans.

I have done my best to answer this above.

> What I do see is that each state write done by a controller is based on some state that controller got in earlier reads (either explicit requests or watch notifications), where each part of that state was, in turn, set by an earlier such write. It is these state writes that are the primitive performance data, and the relationships just stated are the primitive relationships. In addition to drawing what is relevant to a given individual we may want --- just as in Prometheus, or in an SQL database --- to allow an analyst to make various queries against this primitive data and its relationships.

I think we should be just as interested in the actual work done by components as the status updates that reflect this work.

owning-sig: sig-instrumentation
participating-sigs:
- sig-architecture
- sig-node

@lavalamp (Member) Feb 13, 2019

If this proposes changes to api calls (parameters, headers, etc) or api objects (storing new interesting things) then api machinery probably needs to be involved...

@dashpole (Contributor) Feb 13, 2019

It doesn't currently, as I plan to use annotations for the alpha stage, but it will if it moves beyond that stage. I'll add api-machinery.

@lavalamp (Member) commented Feb 13, 2019

> Therefore, I propose that when a controller, acting in the context of object A, modifies the desired state of object B, it should propagate that context to object B. This means each user-initiated object creation results in a single trace, since all objects created as a result of this have the trace context propagated to them. This captures the relationship between multiple objects created on behalf of a higher-level object, such as a StatefulSet, which creates both PVCs and Pods, as they are connected by the fact that they both are associated with the same StatefulSet. Parent-child relationships mirror kubernetes object owner relationships.

(disclaimer: I haven't read anything but the prior comment)

Kubernetes doesn't make a clear distinction between users and system components.

If the information you want really does form trees, then what is missing from the existing owner references? (Also note that they are not guaranteed to be trees!)

If the information does not form trees (as I expect) then I think it is not a good idea to propagate everything. I do think it would be useful and interesting to store exactly one level of this information (e.g., list the immediate objects that caused the update, but NOT the full context that caused those objects to be last updated).

This was discussed a small amount in today's api machinery SIG. (which I haven't uploaded yet, sorry)

@dashpole (Contributor) commented Feb 13, 2019

> Kubernetes doesn't make a clear distinction between users and system components.

Agreed. The behavior would be that if you set the trace context on an API call (currently by setting an annotation), then it uses that context. Otherwise, it generates a new one (done today via an admission controller). If we assume controllers are always acting on the desired state of an object, which has a context, and always create/update other objects using that context, then actions taken by controllers never set new contexts. Only "users", or anything other than k8s components, will generate a new trace context, and thus a trace.

> If the information you want really does form trees, then what is missing from the existing owner references? (Also note that they are not guaranteed to be trees!)

There isn't anything missing exactly, except that we wouldn't want to actually have to follow the owner reference(s) to record a span.

Edit: Actually, I lied. The context can change through the lifecycle of an object. It might be created with one context, but then updated with a different one. For example, if I create a Pod it will have one context, but then if I update the pod it will get a different context. So owner references won't align the context with user intent the way I am hoping to.

> If the information does not form trees (as I expect) then I think it is not a good idea to propagate everything. I do think it would be useful and interesting to store exactly one level of this information (e.g., list the immediate objects that caused the update, but NOT the full context that caused those objects to be last updated).

I don't think forming a tree structure is a hard requirement for what I suggested above. Cycles would just continuously propagate the same context around to all actions taken. I would certainly be interested in hearing more about why you don't think it is a good idea, as it isn't immediately obvious to me.

> This was discussed a small amount in today's api machinery SIG. (which I haven't uploaded yet, sorry)

I look forward to seeing it! I would also be interested in joining future discussions about the topic, as some of the details are still a bit fuzzy to me (as you can tell).
