Propose KEP: Leveraging Distributed Tracing to Understand Kubernetes Object Lifecycles #650
The Kubernetes control plane, and other distributed systems built like it, is indeed lacking an important form of performance observability. In systems built out of procedure calls (local and remote), this need is often addressed by a concept of "tracing" that is built around "spans", where a span corresponds to a procedure call. The Kubernetes control plane, and systems like it, are not primarily built out of procedure calls, and observability based on a concept of spans is not a good fit. This is not to deny that "latency from point A to point B" is a very relevant concept. What I am denying is that the original data should look like procedure calls with relationships among them. Rather, the original data should look like those individual points and relationships between them, because those relationships are much richer than can reasonably be captured by a collection of non-degenerate spans (by "degenerate span" I mean one of zero length, essentially representing an individual point).
In the Kubernetes control plane, work on an object is not done by a tree of procedure calls. The Kubernetes control plane is built out of controllers that monitor the state of various objects and occasionally write part of the state of certain objects. Each write is based on what was revealed by certain earlier reads --- which in turn are simply conveying what was written earlier. In short, the fundamental stuff of control plane activity is partial state writes based on other partial state writes.
For example, consider a pod. We could try to characterize what happens to a pod as a sequence of spans, where each span starts with some client requesting a change (i.e., a create, update, or delete) and ends with the implementation --- the relevant kubelet --- satisfying that request. But that is not even a good explanation of the events at the start of the life of a pod. The first major state-setting event of a pod's lifecycle is a client creating the pod API object. That initial state typically does not include a binding to a particular node. The next major event is typically a scheduler doing another state write that binds the pod to a node. The final major event in the startup of a pod is the relevant kubelet doing a state write that indicates that the pod is running.
We could try to model this with spans by building into the model the idea that a pod's startup has a sequence of two spans: one from creation of API object to node binding, another from node binding to running state. We could say that the primary performance data for pod startup is built out of these two kinds of spans.
A pod is a relatively low-level API object in Kubernetes. There are many higher level objects of interest. Analysts whose concern with pods is only about the full startup latency of a pod --- from API object create to running state --- could write queries or code that synthesizes the full startup latency out of the two constituent spans.
But it is not always that way: it is allowed for a pod to be created in a bound state. So a given pod will not necessarily have both spans. The aforementioned analysts could write more complicated queries or code to handle both scenarios.
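To make the two scenarios concrete, here is a minimal sketch of the kind of code such an analyst might write. The span names, the `Span` type, and the timestamps are all made up for illustration; nothing here comes from an existing tracing library.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    name: str
    start: float  # seconds since an arbitrary epoch
    end: float


def pod_startup_spans(created: float, bound: Optional[float], running: float) -> List[Span]:
    """Build the constituent startup spans for one pod.

    `bound` is None for a pod that was created already bound to a node,
    in which case there is no create-to-bind span at all.
    """
    if bound is None:
        return [Span("bind-to-running", created, running)]
    return [
        Span("create-to-bind", created, bound),
        Span("bind-to-running", bound, running),
    ]


def full_startup_latency(spans: List[Span]) -> float:
    """Synthesize create-to-running latency from whichever spans exist."""
    return max(s.end for s in spans) - min(s.start for s in spans)


scheduled = pod_startup_spans(created=0.0, bound=2.0, running=5.0)
prebound = pod_startup_spans(created=0.0, bound=None, running=3.0)
print(len(scheduled), full_startup_latency(scheduled))  # 2 5.0
print(len(prebound), full_startup_latency(prebound))    # 1 3.0
```

Even in this toy form, the query has to know that one of the spans is optional, and that knowledge has to be repeated in every consumer of the data.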
Perhaps more likely, we could make it "the implementation's" responsibility to create the single span that represents the full startup, and identify the one or two constituent spans as children of the full startup span. What would that implementation code look like? In both OpenTracing and OpenCensus, the parent has to exist before the child is created. So a scheduler would have to create the full-startup span as well as the scheduler-work span. The kubelet would have to be prepared to create the full-startup span if it has not already been created, as well as create the kubelet-work span.
Where are those three spans stored? If the scheduler-work span and the kubelet-work span are sinks in the DAG of spans then they can simply be created when completed and emitted into the span collection framework, leaving only the full-startup span as something that needs to be stored with the pod API object. This also requires the time of the binding write (or create, whichever is appropriate for the pod at hand) to be stored in the API object, so that it is available when the kubelet opens its leaf span. So now we are also storing a state write timestamp in addition to a span. Alternatively, we can say that as soon as the binding is determined for a pod the kubelet-work span is started. This means that we are storing two spans with the API object: the full-startup span and the kubelet-work span. But we will not really be satisfied with requiring the scheduler-work span and the kubelet-work span to be leaves. In both the scheduler and the kubelet there may be a sequence of spans wherein a queue worker works on a given pod, and the parent of those spans (i.e., the full scheduler-work span or the full kubelet-work span) has to be stored with the API object. So we need the API object to hold onto multiple spans: at least the full-startup span plus one for scheduler work or one for kubelet work.
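A rough sketch of the get-or-create burden this places on each component is below. It deliberately avoids any real tracing API: the annotation key, the span ID format, and both helper functions are hypothetical, and a plain dict stands in for the API object's annotations.

```python
import itertools
from typing import Dict

_ids = itertools.count(1)

# Hypothetical annotation key; nothing in the proposal defines it.
FULL_STARTUP_KEY = "example.invalid/full-startup-span-id"


def ensure_full_startup_span(annotations: Dict[str, str]) -> str:
    """Return the parent span ID stored on the object, creating it if absent."""
    if FULL_STARTUP_KEY not in annotations:
        annotations[FULL_STARTUP_KEY] = f"span-{next(_ids)}"
    return annotations[FULL_STARTUP_KEY]


def start_child_span(annotations: Dict[str, str], name: str) -> Dict[str, str]:
    """Open a child span; the parent must exist first, so ensure it does."""
    parent_id = ensure_full_startup_span(annotations)
    return {"id": f"span-{next(_ids)}", "name": name, "parent": parent_id}


# A pod that goes through the scheduler: the scheduler creates the parent.
pod_a: Dict[str, str] = {}
sched_a = start_child_span(pod_a, "scheduler-work")
kubelet_a = start_child_span(pod_a, "kubelet-work")

# A pod created already bound: the kubelet has to create the parent itself.
pod_b: Dict[str, str] = {}
kubelet_b = start_child_span(pod_b, "kubelet-work")
```

Both the scheduler and the kubelet end up carrying the same "create the parent if nobody has yet" logic, and the parent's identity has to live on the API object between the two.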
If every object alternated between an idle period, in which the desired state is fully implemented, and an active period, in which "the implementation" is working through a linear sequence of intermediate states (which always occur in the same order, and we allow an intermediate state to take zero time for some objects) along the lines discussed above, then we could always impose a span-based model as discussed above. If the implementation can follow a more general state machine during an active period then it gets more complicated. Each state transition could be modeled as a span, but an analyst interested in anything other than individual state transitions, or code trying to synthesize higher level spans, has a fair amount of complexity to cope with.
The idea of defining a state machine for an object is explicitly rejected as a good general design pattern. See the remarks about "phase" at https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md . Instead of a single monolithic phase it is recommended to have a more granular concept of status (and thus state as a whole). With state divided into several independent parts, what defines the spans for an object? I think someone or something analyzing performance may query some specific intervals, but asking the code to know beforehand exactly which pairs of events will be queried is problematic in general.
Even if we could set aside the generality problems discussed above, we are not done with pods. Consider the case of a StatefulSet, which creates and deletes pods. A StatefulSet also creates and deletes PVC objects to be used in volumes of those pods. The normal scheduler does not attempt to schedule a pod until all its volumes are ready for use; this includes waiting until referenced PVCs are bound to PVs. The controller that binds PVCs to PVs does not know or care whether or not a particular PVC was created by a StatefulSet. The scheduler does not know or care whether a particular pod is a member of a StatefulSet. The relationship between the binding of such a PVC and the scheduling of such a pod is a critical part of the performance story, but this pair of events does not look like a <update desired state of an object, indicate completion of implementation of that object> pair. That is, it does not look like what we have been saying a span represents. It is an interesting interval, so we could require the scheduler to make a span for every <PVC used in pod volume got bound, pod got scheduled> interval. Note that such a span is not all about a single object, which violates the mental model we started with. Note also that a PVC's status already has "conditions", which can be used to record when the PVC got bound. But not every PVC is created by a StatefulSet for one of its pods; a PVC can be created and bound independently of a pod. Even for a PVC created for a pod in a StatefulSet, the PVC could get bound before the pod API object is created. We can not generally make a <PVC got bound, pod got scheduled> span a child of the pod's scheduling span because the former may start before the latter. Similarly, we can not generally use the parent/child relationship in the other direction either. The "FollowsFrom" relation in OpenTracing has the same problem.
Actually, I do not see an explicit absolute requirement between start times of related spans in either OpenTracing or OpenCensus, but I think that there is an intended constraint. OpenCensus also presents the additional difficulty that a given span can have at most one parent.
A more natural model would be to define a span for the PVC controller's work on binding a PVC to a PV and then ask the scheduler to establish a relationship between the PVC binding span and the scheduler's pod scheduling span. This requires the PVC controller's span to persist on the PVC object after the span is finished. This also has the problem that there is no fixed relationship in familiar tracing terms, because, again, either of the two spans in question could start before the other.
In short, the relationship between work on a PVC and work on a pod does not fit into the existing models for relationships between spans.
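The ordering problem can be shown with two toy timelines. The sketch below assumes the constraint that a parent span starts no later than its child; as noted above, that is my reading of the intended constraint rather than something either specification states outright. All timestamps are invented.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    start: float
    end: float


def can_be_parent(parent: Interval, child: Interval) -> bool:
    """Assumed constraint: a parent span must start no later than its child."""
    return parent.start <= child.start


# Case 1: the pod API object exists first, then its PVC gets bound.
pod_sched_1 = Interval(start=0.0, end=10.0)  # pod created -> pod scheduled
pvc_wait_1 = Interval(start=4.0, end=10.0)   # PVC bound   -> pod scheduled

# Case 2: the PVC was bound before the pod API object was created.
pvc_wait_2 = Interval(start=0.0, end=10.0)
pod_sched_2 = Interval(start=6.0, end=10.0)

# "Pod scheduling span is the parent" works in case 1 but not in case 2.
print(can_be_parent(pod_sched_1, pvc_wait_1))  # True
print(can_be_parent(pod_sched_2, pvc_wait_2))  # False

# "PVC interval is the parent" fails the other way around.
print(can_be_parent(pvc_wait_1, pod_sched_1))  # False
print(can_be_parent(pvc_wait_2, pod_sched_2))  # True
```

Whichever direction is fixed in the instrumentation, one of the two legitimate orderings violates it.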
There are many other examples in Kubernetes of relationships between different kinds of objects. And we can not put API objects into a containment tree. For example, the pods of one ReplicaSet may also contribute to an Endpoints object --- and also that Endpoints object may draw additional content from pods not in that ReplicaSet.
As we have already seen with pods, it is not a given that an object's implementation lies entirely in one controller; even forgetting about PVCs and such, a pod's implementation is divided between scheduler and kubelet. With general granular state, it is not necessarily true that implementation work is handed off along a sequence of controllers.
With a web of relationships between objects with granular state with concurrent bits of implementation in progress, I do not see a clearly good way to model this with spans.
What I do see is that each state write done by a controller is based on some state that controller got in earlier reads (either explicit requests or watch notifications), where each part of that state was, in turn, set by an earlier such write. It is these state writes that are the primitive performance data, and the relationships just stated are the primitive relationships. In addition to drawing what is relevant to a given individual we may want --- just as in Prometheus, or in an SQL database --- to allow an analyst to make various queries against this primitive data and its relationships.
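As a rough illustration of what that primitive data could look like, here is a small sketch. The `StateWrite` record, its field names, the writer labels, and the sample log are all hypothetical; the point is only that each write names the earlier writes it was based on, and that intervals are computed by queries afterward rather than fixed in the instrumentation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class StateWrite:
    id: str
    obj: str     # object written, e.g. "pod/web-0"
    part: str    # which part of the state was written
    writer: str  # controller that did the write
    time: float  # seconds since an arbitrary epoch
    based_on: List[str] = field(default_factory=list)  # IDs of earlier writes that were read


def latency(log: Dict[str, StateWrite], a: str, b: str) -> float:
    """Latency from point A to point B, for any pair the analyst picks."""
    return log[b].time - log[a].time


def ancestors(log: Dict[str, StateWrite], write_id: str) -> Set[str]:
    """All earlier writes that a given write transitively depended on."""
    seen: Set[str] = set()
    stack = list(log[write_id].based_on)
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(log[cur].based_on)
    return seen


writes = [
    StateWrite("w1", "pvc/data-web-0", "spec", "statefulset-controller", 0.0),
    StateWrite("w2", "pod/web-0", "spec", "statefulset-controller", 0.5),
    StateWrite("w3", "pvc/data-web-0", "status (bound)", "pv-controller", 3.0, ["w1"]),
    StateWrite("w4", "pod/web-0", "node binding", "scheduler", 3.5, ["w2", "w3"]),
    StateWrite("w5", "pod/web-0", "status (running)", "kubelet", 7.0, ["w4"]),
]
log = {w.id: w for w in writes}

print(latency(log, "w3", "w4"))       # 0.5  PVC bound -> pod scheduled
print(latency(log, "w2", "w5"))       # 6.5  pod created -> pod running
print(sorted(ancestors(log, "w5")))   # ['w1', 'w2', 'w3', 'w4']
```

Note that the scheduler's write `w4` depends on writes to two different objects, which is exactly the relationship that was awkward to express as a parent/child span.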
@MikeSpreitzer thanks for the feedback. I've had time to digest it, and think I understand your perspective slightly better now.
Distributed tracing is context-aware, structured, distributed latency logging. Though it is mainly used with procedure calls, it isn't limited to procedure calls. The only requirements I can see for using tracing in any system are being able to attach a context to a description of user intent, and being able to propagate it to all components that act on that intent. That is fundamentally why associating a given trace context with an object's desired state is a good way to adapt the tracing model to the watch-based k8s model.
Tracing tools already handle absent spans gracefully. For viewing the single trace, the span would simply be absent. Analysis tools aggregate spans with a single span name. So if we had a parent span
We don't actually have to store any spans with the API object to accomplish this. As long as you have the timestamp of the start of a process, you can retroactively construct the parent span. You are correct, that by storing a few more timestamps, we could get a few more traces to wrap, for example, all of the kubelet work in a single span. But the nice thing for now is that we can just skip adding those spans when we don't have the start time, and add them in if/when we add those timestamps. Tracing tools still function even when we are missing parent spans, and just have a collection of child spans. For example, we can have
When a kubernetes controller attempts to reconcile desired and actual state for an object, it does at least two steps:
For example, the scheduler does these two steps:
While (2) is an important part of the reconciliation process, as you point out, it isn't that interesting on its own. Wrapping (1) in a span is far more interesting and useful. As we would expect from the example, the scheduler folks care immensely about how long the pod-scheduling algorithm takes, and not at all about how long (2) takes.
I am not suggesting we model anything after a state machine. Let me know if something in the proposal led you to think that, and I can update it to make it clearer.
A span for an object is any operation that advances the actual state toward the desired state. The spans don't need to be linear, and often many steps happen in parallel. This includes actions like:
Ok, I think I owe you at least a hypothetical way we could handle hierarchies in kubernetes... I haven't implemented this, but I hope it shows that it is possible to handle such object relationships relatively elegantly.
Therefore, I propose that when a controller, acting in the context of object A, modifies the desired state of object B, it should propagate that context to object B. This means each user-initiated object creation results in a single trace, since all objects created as a result of this have the trace context propagated to them. This captures the relationship between multiple objects created on behalf of a higher-level object, such as a StatefulSet, which creates both PVCs and Pods, as they are connected by the fact that they both are associated with the same StatefulSet.
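Here is a minimal sketch of that propagation rule, to show the shape of it rather than a real implementation. The annotation key and the helper functions are hypothetical, and plain dicts stand in for API objects.

```python
import uuid
from typing import Any, Dict

# Hypothetical annotation key used to carry the trace context on an object.
TRACE_KEY = "example.invalid/trace-context"


def user_create(name: str) -> Dict[str, Any]:
    """A user-initiated create starts a new trace."""
    return {"name": name, "annotations": {TRACE_KEY: uuid.uuid4().hex}}


def controller_create(acting_for: Dict[str, Any], name: str) -> Dict[str, Any]:
    """A controller acting in the context of object A copies A's context to B."""
    context = acting_for["annotations"][TRACE_KEY]
    return {"name": name, "annotations": {TRACE_KEY: context}}


def trace_of(obj: Dict[str, Any]) -> str:
    return obj["annotations"][TRACE_KEY]


# The StatefulSet controller creates a PVC and a pod on behalf of the StatefulSet.
sts = user_create("statefulset/web")
pvc = controller_create(sts, "pvc/data-web-0")
pod = controller_create(sts, "pod/web-0")

# An unrelated user action gets its own trace.
other = user_create("deployment/api")

print(trace_of(pvc) == trace_of(pod) == trace_of(sts))  # True
print(trace_of(other) == trace_of(sts))                 # False
```

The PVC and the pod never reference each other, yet spans emitted while working on either one land in the same trace because both inherited the StatefulSet's context.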
There is generally a class of "selector" objects, such as a
I have done my best to answer this above.
I think we should be just as interested in the actual work done by components as the status updates that reflect this work.
(disclaimer: I haven't read anything but the prior comment)
Kubernetes doesn't make a clear distinction between users and system components.
If the information you want really does form trees, then what is missing from the existing owner references? (Also note that they are not guaranteed to be trees!)
If the information does not form trees (as I expect) then I think it is not a good idea to propagate everything. I do think it would be useful and interesting to store exactly one level of this information (e.g., list the immediate objects that caused the update, but NOT the full context that caused those objects to be last updated).
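A sketch of what "exactly one level" could mean in practice is below. The `causedBy` field, the `version` number (a stand-in for something like a resource version), and the helper are all hypothetical.

```python
from typing import Any, Dict, List


def write_with_causes(name: str, version: int, causes: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Record only references to the immediate causes, never their own causes."""
    return {
        "name": name,
        "version": version,
        "causedBy": [{"name": c["name"], "version": c["version"]} for c in causes],
    }


sts = write_with_causes("statefulset/web", 7, [])
pvc = write_with_causes("pvc/data-web-0", 3, [sts])
pod = write_with_causes("pod/web-0", 12, [sts, pvc])

# The pod lists its two direct causes; the entries carry no nested history.
print(pod["causedBy"])
# [{'name': 'statefulset/web', 'version': 7}, {'name': 'pvc/data-web-0', 'version': 3}]
```

The stored data stays bounded per object, and a tool that wants the full chain can follow the references one hop at a time.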
This was discussed a small amount in today's api machinery SIG. (which I haven't uploaded yet, sorry)
Yeah I'm happy with merging this in provisional state, I think there are still somewhat contentious points, but we're in agreement that we want this.
I'm still not entirely convinced that what @soltysh mentioned is alleviated (as in concurrent actions on objects creating new contexts racing with "old in progress" ones), but I think if not then that will show in the implementation.
[APPROVALNOTIFIER] This PR is APPROVED