-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP 34: Instance health #1690
KEP 34: Instance health #1690
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm.
I'm not sure if this is part of the What or the How, but: We probably have to think about "what resources are included here". Generally, all resources from the deploy plan, but what about resources from a different plan that deploys additional resources?
keps/0034-instance-health.md
Outdated
* [Goals](#goals) | ||
* [Non-Goals](#non-goals) | ||
|
||
[Tools for generating]: https://github.com/ekalinin/github-markdown-toc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: There's a script in the hack
folder, gh-md-toc.sh
, that's what I usually use. Maybe we should add that to the KEP-Template...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, that would be great!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good editor plugins for Markdown can also create/update TOCs.
@ANeumann82 My current thinking is that it's those resources that have owner instance and are of the types I've mentioned there |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added some thoughts which add some clarity...
I would also add references to other keps or docs that provide more context.
Regardless... it looks good to merge currently or with some mods based on suggestions. It is easy to see where it is going. nice work!
keps/0034-instance-health.md
Outdated
|
||
## Motivation | ||
|
||
Added health monitoring for `Instance` CRD will help people answer question "is my operator working at this point in time" without querying all underlying resources. KUDO will expose this heuristic as part of `Status` field for everyone to query. This could be used as a signal for a monitoring tool. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added health monitoring for `Instance` CRD will help people answer question "is my operator working at this point in time" without querying all underlying resources. KUDO will expose this heuristic as part of `Status` field for everyone to query. This could be used as a signal for a monitoring tool. | |
Added health monitoring for `Instance` CR will help people answer question "is my operator working at this point in time" without querying all underlying resources. KUDO will expose this heuristic as part of `Status` field for everyone to query. This could be used as a signal for a monitoring tool. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
health is a tricky word :).
I really like that it is labeled here as Instance health... this does not represent the health of the underlying service... because of that... I would encourage a different question here.
I'm question this question --"is my operator working at this point in time"--
The question is more is the control plane involved with my operator? or is my operator control plane satisfied?
I guess it depends on what is meant by "operator working" (which would be worth defining for clarity)
keps/0034-instance-health.md
Outdated
|
||
## Summary | ||
|
||
KUDO helps people implement their operators and it's focus is day 2 operations. Part of day 2 is also monitoring your workload health after deployment. To help with that, KUDO will expose "health" computed as a heuristic based on health and readiness of the underlying resources. In the first iteration, health will be just a simple heuristic computed from Pods, StatefulSets, Deployments, ReplicaSets, DaemonSets and Services (let's call them *health phase 1 resources*). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the heuristic of "Service"... assuming "Services" written here means kubernetes service.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's a kubernetes resource of type Service
keps/0034-instance-health.md
Outdated
### Non-Goals | ||
|
||
Drift detection (detecting that resource was deleted or changed manually) | ||
Including other types of resources than *health phase 1 resources* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should add that it is a non-goal to establish health of the underlying service. it is a non-goal to:
- determine if the underlying service is functional
- determine if the underlying service is reachable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used the word application instead of service. I know you don't like the word application, that said we use it in KUDO - e.g. we call the underlying version appVersion so I am sticking with that
|
||
Expose health heuristic in `Status` field of `Instance` | ||
Compute health by evaluating health and readiness of *health phase 1 resources* | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it... or is it not a goal... to determine the defined states for the status?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it will be a boolean right now, but that's implementation detail. So I actually think it's a goal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think it should be a goal - and Probably not a boolean? To make it extendable? I think we should at least mirror the Deployment/Statefulset status of "NOT_READY, DEGRADED, RUNNING" or something similar...
But this is implementation detail, and shouldn't come in at this stage of the KEP, I agree :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I am going with boolean for now for some reasons that will be explained in the next part of KEP, let's discuss there
might be worth adding as a goal or non-goal: defining mechanism / linkage for an instance to define ownership to components |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using Readiness makes sense. I kind of like health a bit more, but using the k8s vocabulary is more important.
|
||
## Motivation | ||
|
||
Added readiness monitoring for `Instance` CR will help people answer question "is my operator running and ready at this point in time" (considering all the available information exposed by k8s resource) without querying all underlying resources. KUDO will expose this heuristic as part of `Status` field for everyone to query. This could be used as a signal for a monitoring tool. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it makes sense to compare that to Deployments vs. Pods?
A Deployment-Status aggregates the readiness of the owned Pods, and an Instance-Status aggregates the readiness of the owned resources?
|
||
Expose health heuristic in `Status` field of `Instance` | ||
Compute health by evaluating health and readiness of *health phase 1 resources* | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think it should be a goal - and Probably not a boolean? To make it extendable? I think we should at least mirror the Deployment/Statefulset status of "NOT_READY, DEGRADED, RUNNING" or something similar...
But this is implementation detail, and shouldn't come in at this stage of the KEP, I agree :)
Signed-off-by: Alena Varkockova <varkockova.a@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks really good to me!
added some thoughts / comments... but lets merge!
|
||
## Summary | ||
|
||
KUDO helps people implement their operators and it's focus is day 2 operations. Part of day 2 is also monitoring your workload readiness after deployment. To help with that, KUDO will expose readiness computed as a heuristic based on readiness of the underlying resources. In the first iteration, readiness will be just a simple heuristic computed from Pods, StatefulSets, Deployments, ReplicaSets, DaemonSets and Services (let's call them *readiness phase 1 resources*). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the reason I ask about service as a readiness phase 1 resources... is I don't know what it means to know a service is ready? is there a way to reflect that service is ready?
|
||
Added readiness monitoring for `Instance` CR will help people answer question "is my operator running and ready at this point in time" (considering all the available information exposed by k8s resource) without querying all underlying resources. KUDO will expose this heuristic as part of `Status` field for everyone to query. This could be used as a signal for a monitoring tool. | ||
|
||
The idea here is very similar to the relation between `Deployment` and `Pod` core k8s types. Pods contain very low-level information about their readiness and state they are in while `Deployment` tries to compute an aggregated and higher-level state from all the underlying owned resources. The same goal now applies to `Instance`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👏
Signed-off-by: Alena Varkockova varkockova.a@gmail.com
What this PR does / why we need it:
First iteration of the KEP for health.