Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP 34: Instance health #1690

Merged
merged 2 commits into from
Sep 24, 2020
Merged

KEP 34: Instance health #1690

merged 2 commits into from
Sep 24, 2020

Conversation

alenkacz
Copy link
Contributor

Signed-off-by: Alena Varkockova varkockova.a@gmail.com

What this PR does / why we need it:
First iteration of the KEP for health.

keps/0034-instance-health.md Outdated Show resolved Hide resolved
Copy link
Member

@ANeumann82 ANeumann82 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm.

I'm not sure if this is part of the What or the How, but: We probably have to think about "what resources are included here". Generally, all resources from the deploy plan, but what about resources from a different plan that deploys additional resources?

* [Goals](#goals)
* [Non-Goals](#non-goals)

[Tools for generating]: https://github.com/ekalinin/github-markdown-toc
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: There's a script in the hack folder, gh-md-toc.sh, that's what I usually use. Maybe we should add that to the KEP-Template...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that would be great!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good editor plugins for Markdown can also create/update TOCs.

keps/0034-instance-health.md Outdated Show resolved Hide resolved
@alenkacz
Copy link
Contributor Author

@ANeumann82 My current thinking is that it's those resources that have owner instance and are of the types I've mentioned there

Copy link
Member

@kensipe kensipe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added some thoughts which add some clarity...
I would also add references to other keps or docs that provide more context.

Regardless... it looks good to merge currently or with some mods based on suggestions. It is easy to see where it is going. nice work!


## Motivation

Added health monitoring for `Instance` CRD will help people answer question "is my operator working at this point in time" without querying all underlying resources. KUDO will expose this heuristic as part of `Status` field for everyone to query. This could be used as a signal for a monitoring tool.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Added health monitoring for `Instance` CRD will help people answer question "is my operator working at this point in time" without querying all underlying resources. KUDO will expose this heuristic as part of `Status` field for everyone to query. This could be used as a signal for a monitoring tool.
Added health monitoring for `Instance` CR will help people answer question "is my operator working at this point in time" without querying all underlying resources. KUDO will expose this heuristic as part of `Status` field for everyone to query. This could be used as a signal for a monitoring tool.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

health is a tricky word :).

I really like that it is labeled here as Instance health... this does not represent the health of the underlying service... because of that... I would encourage a different question here.

I'm question this question --"is my operator working at this point in time"--
The question is more is the control plane involved with my operator? or is my operator control plane satisfied?
I guess it depends on what is meant by "operator working" (which would be worth defining for clarity)


## Summary

KUDO helps people implement their operators and it's focus is day 2 operations. Part of day 2 is also monitoring your workload health after deployment. To help with that, KUDO will expose "health" computed as a heuristic based on health and readiness of the underlying resources. In the first iteration, health will be just a simple heuristic computed from Pods, StatefulSets, Deployments, ReplicaSets, DaemonSets and Services (let's call them *health phase 1 resources*).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the heuristic of "Service"... assuming "Services" written here means kubernetes service.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a kubernetes resource of type Service

### Non-Goals

Drift detection (detecting that resource was deleted or changed manually)
Including other types of resources than *health phase 1 resources*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should add that it is a non-goal to establish health of the underlying service. it is a non-goal to:

  • determine if the underlying service is functional
  • determine if the underlying service is reachable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used the word application instead of service. I know you don't like the word application, that said we use it in KUDO - e.g. we call the underlying version appVersion so I am sticking with that


Expose health heuristic in `Status` field of `Instance`
Compute health by evaluating health and readiness of *health phase 1 resources*

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it... or is it not a goal... to determine the defined states for the status?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it will be a boolean right now, but that's implementation detail. So I actually think it's a goal

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think it should be a goal - and Probably not a boolean? To make it extendable? I think we should at least mirror the Deployment/Statefulset status of "NOT_READY, DEGRADED, RUNNING" or something similar...

But this is implementation detail, and shouldn't come in at this stage of the KEP, I agree :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I am going with boolean for now for some reasons that will be explained in the next part of KEP, let's discuss there

@kensipe
Copy link
Member

kensipe commented Sep 23, 2020

might be worth adding as a goal or non-goal: defining mechanism / linkage for an instance to define ownership to components

Copy link
Member

@ANeumann82 ANeumann82 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using Readiness makes sense. I kind of like health a bit more, but using the k8s vocabulary is more important.


## Motivation

Added readiness monitoring for `Instance` CR will help people answer question "is my operator running and ready at this point in time" (considering all the available information exposed by k8s resource) without querying all underlying resources. KUDO will expose this heuristic as part of `Status` field for everyone to query. This could be used as a signal for a monitoring tool.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it makes sense to compare that to Deployments vs. Pods?

A Deployment-Status aggregates the readiness of the owned Pods, and an Instance-Status aggregates the readiness of the owned resources?


Expose health heuristic in `Status` field of `Instance`
Compute health by evaluating health and readiness of *health phase 1 resources*

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think it should be a goal - and Probably not a boolean? To make it extendable? I think we should at least mirror the Deployment/Statefulset status of "NOT_READY, DEGRADED, RUNNING" or something similar...

But this is implementation detail, and shouldn't come in at this stage of the KEP, I agree :)

Signed-off-by: Alena Varkockova <varkockova.a@gmail.com>
Copy link
Member

@kensipe kensipe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks really good to me!

added some thoughts / comments... but lets merge!


## Summary

KUDO helps people implement their operators and it's focus is day 2 operations. Part of day 2 is also monitoring your workload readiness after deployment. To help with that, KUDO will expose readiness computed as a heuristic based on readiness of the underlying resources. In the first iteration, readiness will be just a simple heuristic computed from Pods, StatefulSets, Deployments, ReplicaSets, DaemonSets and Services (let's call them *readiness phase 1 resources*).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the reason I ask about service as a readiness phase 1 resources... is I don't know what it means to know a service is ready? is there a way to reflect that service is ready?


Added readiness monitoring for `Instance` CR will help people answer question "is my operator running and ready at this point in time" (considering all the available information exposed by k8s resource) without querying all underlying resources. KUDO will expose this heuristic as part of `Status` field for everyone to query. This could be used as a signal for a monitoring tool.

The idea here is very similar to the relation between `Deployment` and `Pod` core k8s types. Pods contain very low-level information about their readiness and state they are in while `Deployment` tries to compute an aggregated and higher-level state from all the underlying owned resources. The same goal now applies to `Instance`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👏

Signed-off-by: Alena Varkockova <varkockova.a@gmail.com>
@alenkacz alenkacz merged commit 963d38f into main Sep 24, 2020
@alenkacz alenkacz deleted the av/kep-34 branch September 24, 2020 12:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants