# Proposal for instance readiness
- Drift detection (detecting that a resource was deleted or changed manually)
- Including resource types other than the *readiness phase 1 resources*
- Determining if the underlying application is functional
- Determining if the underlying application is reachable

## Proposal

Readiness of an `Instance` will be communicated via a newly introduced `Status.Conditions` field. The schema of this field will conform to the schema recommended by the Kubernetes API machinery. The `Type` of the newly added condition will be `Ready`. The condition can take three values, with the following meanings:
- `True` - the last observation of readiness on this instance is that it's ready
- `False` - the last observation of readiness on this instance is that it's NOT ready
- `Unknown` - a deploy/upgrade/update plan is running
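
For illustration, a minimal sketch in Go of how the controller could record this condition, assuming the `Instance` status gains a `Conditions []metav1.Condition` field and the apimachinery condition helpers are used (the function name is hypothetical, not the final implementation):

```go
package readiness

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setReadyCondition records the latest readiness observation on the
// Instance status. SetStatusCondition only updates LastTransitionTime
// when the condition status actually changes.
func setReadyCondition(conditions *[]metav1.Condition, status metav1.ConditionStatus, reason, message string) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:    "Ready",
		Status:  status, // ConditionTrue, ConditionFalse, or ConditionUnknown
		Reason:  reason,
		Message: message,
	})
}
```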

`Ready` is expected to be an oscillating condition: it indicates that the object was believed to be fully operational at the time it was last verified. The expectation is that the value won't be stale for more than one minute (within a minute of a state change, the status should reflect it).

The `Unknown` state will be set whenever a deploy/upgrade/update plan is running. A plan run should be an atomic operation moving from one stable state to another, evaluating the health of all resources involved in the plan as part of its execution, so it would be redundant to also check readiness on the side.
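
A sketch of the resulting decision logic (both input flags are hypothetical; the real controller would derive them from the plan status and resource health):

```go
package readiness

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// readyStatus maps the controller's observations to a condition value.
func readyStatus(planInProgress, resourcesReady bool) metav1.ConditionStatus {
	if planInProgress {
		// A plan moves atomically between stable states and evaluates
		// resource health itself, so readiness is not re-checked here.
		return metav1.ConditionUnknown
	}
	if resourcesReady {
		return metav1.ConditionTrue
	}
	return metav1.ConditionFalse
}
```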

The main reason for using the `Conditions` field rather than introducing a separate field is that it is becoming the established pattern in the Kubernetes world. Conditions also come with a standard set of metadata fields that are useful for our case (namely `Reason` and `Message`). Having a `Ready` condition is also encouraged by sig-architecture (excerpt from that document: *Although they are not a consistent standard, the `Ready` and `Succeeded` condition types may be used by API designers for long-running and bounded-execution objects, respectively.*).

Example of Instance status being ready:
```
apiVersion: kudo.dev/v1beta1
kind: Instance
status:
  conditions:
  - type: Ready
    status: "True"
    lastTransitionTime: "2018-01-01T00:00:00Z"
```

Example of Instance status being NOT ready:
```
apiVersion: kudo.dev/v1beta1
kind: Instance
status:
  conditions:
  - type: Ready
    status: "False"
    lastTransitionTime: "2018-01-01T00:00:00Z"
    reason: ResourceNotReady
    message: Deployment 'xxx' is not ready, Deployment 'yyy' is not ready
```

### Implementation Details/Notes/Constraints

Setting the `Ready` condition will be the responsibility of the existing `instance_controller`. It will watch all the types in *readiness phase 1 resources* (it already watches most of them, so there is no significant additional performance penalty) and trigger a reconcile of the owning `Instance`.
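
A sketch of what the watch setup could look like, assuming the controller is built on controller-runtime; the types listed are an illustrative subset of the *readiness phase 1 resources*, and the KUDO import path is an assumption:

```go
package readiness

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	kudov1beta1 "github.com/kudobuilder/kudo/pkg/apis/kudo/v1beta1" // assumed path
)

// setupWatches wires watches so that events on owned resources trigger
// a reconcile of the owning Instance.
func setupWatches(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&kudov1beta1.Instance{}).
		Owns(&appsv1.Deployment{}). // already watched today
		Owns(&corev1.Pod{}).        // illustrative subset of phase 1 resources
		Complete(r)
}
```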

The controller will have to pull all the *readiness phase 1 resources* (this means N additional requests to the API server, where N is the number of resource types), filtering by the labels `heritage: kudo` and `kudo.dev/instance=<instance-name>`.
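
A sketch of one such call, assuming a controller-runtime client (the helper name is hypothetical); the same pattern repeats for each resource type:

```go
package readiness

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// listOwnedDeployments pulls the Deployments belonging to one Instance.
func listOwnedDeployments(ctx context.Context, c client.Client, namespace, instance string) (*appsv1.DeploymentList, error) {
	list := &appsv1.DeploymentList{}
	err := c.List(ctx, list,
		client.InNamespace(namespace),
		client.MatchingLabels{
			"heritage":          "kudo",
			"kudo.dev/instance": instance,
		})
	return list, err
}
```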

From all these resources it will compute readiness based on `Condition: Ready`, `Condition: Available`, or other fields, depending on the convention of that particular type.
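
For example, a sketch of the conventional check for a `Deployment`, which signals readiness through the `Available` condition in its status:

```go
package readiness

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// deploymentReady reports whether a Deployment's Available condition is True.
func deploymentReady(d *appsv1.Deployment) bool {
	for _, c := range d.Status.Conditions {
		if c.Type == appsv1.DeploymentAvailable {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}
```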

For operators installing their application via e.g. a `Deployment`, there's no need at this point to also check the status of the underlying Pods, because the Deployment `Status` already mirrors the readiness of the underlying Pods. The same holds for all higher-level types like `StatefulSet`.

### Risks and Mitigations

On a big cluster with a huge number of Pods and Deployments, the controller might run into scaling issues because of the number of items it needs to process.

This is mitigated by the fact that the event handlers filter only for events that belong to KUDO, which should limit the scope to a very small number of events (the expectation is that the majority of Deployments do not come from KUDO).
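
A sketch of such a filter, assuming a recent controller-runtime; it could be attached to the controller builder via `WithEventFilter`:

```go
package readiness

import (
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// kudoOwnedOnly drops events for objects not created by KUDO, so
// non-KUDO Deployments and Pods never reach the work queue.
var kudoOwnedOnly = predicate.NewPredicateFuncs(func(obj client.Object) bool {
	return obj.GetLabels()["heritage"] == "kudo"
})
```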

Also, this feature does not make the situation any worse than it is right now: the KUDO controller already watches Pods and Deployments, so we are not introducing new overhead.
