KEP 34: Instance health #1690

alenkacz · 2020-09-23T11:50:30Z

Signed-off-by: Alena Varkockova <varkockova.a@gmail.com>

What this PR does / why we need it:
First iteration of the KEP for health.

keps/0034-instance-health.md

ANeumann82

lgtm.

I'm not sure if this is part of the What or the How, but: We probably have to think about "what resources are included here". Generally, all resources from the deploy plan, but what about resources from a different plan that deploys additional resources?

ANeumann82 · 2020-09-23T12:00:31Z

keps/0034-instance-health.md

+    * [Goals](#goals)
+    * [Non-Goals](#non-goals)
+
+[Tools for generating]: https://github.com/ekalinin/github-markdown-toc


FYI: There's a script in the hack folder, gh-md-toc.sh, that's what I usually use. Maybe we should add that to the KEP-Template...

yeah, that would be great!

Good editor plugins for Markdown can also create/update TOCs.

keps/0034-instance-health.md

alenkacz · 2020-09-23T12:08:32Z

@ANeumann82 My current thinking is that it's the resources that have the instance as their owner and are of the types I've mentioned there.
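
To make this ownership rule concrete, here is a minimal Python sketch, not KUDO's implementation. It assumes resources are plain dicts shaped like Kubernetes manifests and looks only at direct `metadata.ownerReferences` entries. The function names, the `PHASE_1_KINDS` constant, and the UID-based match are hypothetical.

```python
# Hypothetical sketch: resources are plain dicts shaped like Kubernetes manifests.
# The kinds below are the ones listed in the KEP summary.
PHASE_1_KINDS = {"Pod", "StatefulSet", "Deployment", "ReplicaSet", "DaemonSet", "Service"}


def is_owned_by(resource, instance_uid):
    """Return True if any direct ownerReference points at the given Instance UID."""
    owners = resource.get("metadata", {}).get("ownerReferences", [])
    return any(o.get("kind") == "Instance" and o.get("uid") == instance_uid for o in owners)


def select_tracked_resources(resources, instance_uid):
    """Keep only resources of a phase 1 kind that are owned by the Instance."""
    return [
        r for r in resources
        if r.get("kind") in PHASE_1_KINDS and is_owned_by(r, instance_uid)
    ]
```

Pods created by a Deployment are owned by a ReplicaSet rather than by the Instance, so a direct-owner check like this one would skip them. Whether to follow the owner chain is left open here.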

kensipe

added some thoughts which add some clarity...
I would also add references to other keps or docs that provide more context.

Regardless... it looks good to merge currently or with some mods based on suggestions. It is easy to see where it is going. nice work!

kensipe · 2020-09-23T15:46:35Z

keps/0034-instance-health.md

+
+## Motivation
+
+Added health monitoring for `Instance` CRD will help people answer question "is my operator working at this point in time" without querying all underlying resources. KUDO will expose this heuristic as part of `Status` field for everyone to query. This could be used as a signal for a monitoring tool.


Suggested change

-Added health monitoring for `Instance` CRD will help people answer question "is my operator working at this point in time" without querying all underlying resources. KUDO will expose this heuristic as part of `Status` field for everyone to query. This could be used as a signal for a monitoring tool.

+Added health monitoring for `Instance` CR will help people answer question "is my operator working at this point in time" without querying all underlying resources. KUDO will expose this heuristic as part of `Status` field for everyone to query. This could be used as a signal for a monitoring tool.

health is a tricky word :).

I really like that it is labeled here as Instance health... this does not represent the health of the underlying service... because of that... I would encourage a different question here.

I'm questioning this question: "is my operator working at this point in time".
The question is more: is the control plane involved with my operator? Or: is my operator's control plane satisfied?
I guess it depends on what is meant by "operator working" (which would be worth defining for clarity).

kensipe · 2020-09-23T15:57:11Z

keps/0034-instance-health.md

+
+## Summary
+
+KUDO helps people implement their operators and it's focus is day 2 operations. Part of day 2 is also monitoring your workload health after deployment. To help with that, KUDO will expose "health" computed as a heuristic based on health and readiness of the underlying resources. In the first iteration, health will be just a simple heuristic computed from Pods, StatefulSets, Deployments, ReplicaSets, DaemonSets and Services (let's call them *health phase 1 resources*).


I don't understand the heuristic for "Service"... assuming "Services" written here means the Kubernetes Service.

It's a Kubernetes resource of type Service.

kensipe · 2020-09-23T15:59:51Z

keps/0034-instance-health.md

+### Non-Goals
+
+Drift detection (detecting that resource was deleted or changed manually)
+Including other types of resources than *health phase 1 resources*


We should add that it is a non-goal to establish health of the underlying service. It is a non-goal to:

* determine if the underlying service is functional
* determine if the underlying service is reachable

I used the word application instead of service. I know you don't like the word application. That said, we use it in KUDO (e.g. we call the underlying version `appVersion`), so I am sticking with that.

kensipe · 2020-09-23T16:00:33Z

keps/0034-instance-health.md

+
+Expose health heuristic in `Status` field of `Instance`
+Compute health by evaluating health and readiness of *health phase 1 resources*
+


is it... or is it not a goal... to determine the defined states for the status?

I think it will be a boolean right now, but that's an implementation detail. So I actually think it's a goal.

Yeah, I think it should be a goal, and probably not a boolean, to make it extensible? I think we should at least mirror the Deployment/StatefulSet status of "NOT_READY, DEGRADED, RUNNING" or something similar...

But this is an implementation detail, and shouldn't come in at this stage of the KEP, I agree :)

Yeah, I am going with a boolean for now, for reasons that will be explained in the next part of the KEP. Let's discuss there.
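
For illustration only, the two options discussed in this thread could look like the following Python sketch. The state names come from the comment above. How `DEGRADED` is derived (some but not all resources ready), the treatment of an empty resource list, and all class and function names are assumptions, not decisions made in this PR.

```python
from enum import Enum


class InstanceStatus(Enum):
    """Three-state variant suggested in the thread (names taken from the comment)."""
    NOT_READY = "NOT_READY"
    DEGRADED = "DEGRADED"
    RUNNING = "RUNNING"


def summarize(ready_flags):
    """Collapse per-resource readiness booleans into one status value.

    Assumption: DEGRADED means some, but not all, resources are ready,
    and an empty list counts as NOT_READY.
    """
    flags = list(ready_flags)
    if flags and all(flags):
        return InstanceStatus.RUNNING
    if any(flags):
        return InstanceStatus.DEGRADED
    return InstanceStatus.NOT_READY


def as_boolean(status):
    """Boolean variant: only a fully RUNNING instance counts as ready."""
    return status is InstanceStatus.RUNNING
```

A multi-state value can always be reduced to a boolean as shown, while the reverse is not possible, which is the extensibility argument made above.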

kensipe · 2020-09-23T16:22:21Z

Might be worth adding as a goal or non-goal: defining the mechanism / linkage by which an instance establishes ownership of its components.

ANeumann82

Using Readiness makes sense. I kind of like health a bit more, but using the k8s vocabulary is more important.

ANeumann82 · 2020-09-24T08:49:02Z

keps/0034-instance-health.md

+
+## Motivation
+
+Added readiness monitoring for `Instance` CR will help people answer question "is my operator running and ready at this point in time" (considering all the available information exposed by k8s resource) without querying all underlying resources. KUDO will expose this heuristic as part of `Status` field for everyone to query. This could be used as a signal for a monitoring tool.


Maybe it makes sense to compare that to Deployments vs. Pods?

A Deployment-Status aggregates the readiness of the owned Pods, and an Instance-Status aggregates the readiness of the owned resources?
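
The Deployment/Pod analogy can be sketched as follows in Python. This is an illustration under stated assumptions, not the KEP's algorithm: resources are plain dicts shaped like Kubernetes objects, only Pod and Deployment checks are shown, and kinds without a check are ignored. The helper names are hypothetical; the fields read (`status.conditions`, `spec.replicas`, `status.readyReplicas`) are standard Kubernetes fields.

```python
def pod_ready(pod):
    """A Pod counts as ready when its Ready condition has status "True"."""
    conditions = pod.get("status", {}).get("conditions", [])
    return any(c.get("type") == "Ready" and c.get("status") == "True" for c in conditions)


def deployment_ready(deployment):
    """A Deployment counts as ready when readyReplicas covers spec.replicas."""
    desired = deployment.get("spec", {}).get("replicas", 1)  # Kubernetes defaults replicas to 1
    ready = deployment.get("status", {}).get("readyReplicas", 0)  # field is omitted when zero
    return ready >= desired


# Assumption: kinds without an entry here are skipped rather than failing the Instance.
READINESS_CHECKS = {"Pod": pod_ready, "Deployment": deployment_ready}


def instance_ready(owned_resources):
    """Aggregate: the Instance is ready only if every checked resource is ready."""
    checked = [r for r in owned_resources if r.get("kind") in READINESS_CHECKS]
    return all(READINESS_CHECKS[r["kind"]](r) for r in checked)
```

The aggregate here is a plain conjunction, so a single unready Pod or Deployment marks the whole Instance as not ready.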

ANeumann82 · 2020-09-24T08:54:39Z

keps/0034-instance-health.md

+
+Expose health heuristic in `Status` field of `Instance`
+Compute health by evaluating health and readiness of *health phase 1 resources*
+


Yeah, I think it should be a goal, and probably not a boolean, to make it extensible? I think we should at least mirror the Deployment/StatefulSet status of "NOT_READY, DEGRADED, RUNNING" or something similar...

But this is an implementation detail, and shouldn't come in at this stage of the KEP, I agree :)

kensipe

looks really good to me!

added some thoughts / comments... but let's merge!

kensipe · 2020-09-24T12:25:28Z

keps/0034-instance-health.md

+
+## Summary
+
+KUDO helps people implement their operators and it's focus is day 2 operations. Part of day 2 is also monitoring your workload readiness after deployment. To help with that, KUDO will expose readiness computed as a heuristic based on readiness of the underlying resources. In the first iteration, readiness will be just a simple heuristic computed from Pods, StatefulSets, Deployments, ReplicaSets, DaemonSets and Services (let's call them *readiness phase 1 resources*).


The reason I ask about Service as a readiness phase 1 resource is that I don't know what it means for a Service to be ready. Is there a way to reflect that a Service is ready?

kensipe · 2020-09-24T12:26:21Z

keps/0034-instance-health.md

+
+Added readiness monitoring for `Instance` CR will help people answer question "is my operator running and ready at this point in time" (considering all the available information exposed by k8s resource) without querying all underlying resources. KUDO will expose this heuristic as part of `Status` field for everyone to query. This could be used as a signal for a monitoring tool.
+
+The idea here is very similar to the relation between `Deployment` and `Pod` core k8s types. Pods contain very low-level information about their readiness and state they are in while `Deployment` tries to compute an aggregated and higher-level state from all the underlying owned resources. The same goal now applies to `Instance`.


alenkacz requested review from gerred, kensipe, nfnt and zen-dog as code owners September 23, 2020 11:50

alenkacz force-pushed the av/kep-34 branch from d7d2e67 to 97bb68e Compare September 23, 2020 11:58

nfnt reviewed Sep 23, 2020

alenkacz force-pushed the av/kep-34 branch from 97bb68e to 604388b Compare September 23, 2020 12:03

ANeumann82 approved these changes Sep 23, 2020

alenkacz force-pushed the av/kep-34 branch from 604388b to 4e85ac2 Compare September 23, 2020 12:09

kensipe approved these changes Sep 23, 2020

alenkacz force-pushed the av/kep-34 branch from 4e85ac2 to 8aaa4ca Compare September 24, 2020 07:44

alenkacz requested review from ANeumann82 and kensipe September 24, 2020 07:45

ANeumann82 approved these changes Sep 24, 2020

KEP 34: Instance health

36c69b8

Signed-off-by: Alena Varkockova <varkockova.a@gmail.com>

alenkacz force-pushed the av/kep-34 branch from 8aaa4ca to 36c69b8 Compare September 24, 2020 10:31

kensipe approved these changes Sep 24, 2020

Merge

e86f970

Signed-off-by: Alena Varkockova <varkockova.a@gmail.com>

alenkacz merged commit 963d38f into main Sep 24, 2020

alenkacz deleted the av/kep-34 branch September 24, 2020 12:49

