
Add liveness/readiness probes #602

Merged
merged 2 commits
Jun 23, 2020

Conversation

Contributor

@Danil-Grigorev Danil-Grigorev commented May 27, 2020

OCPCLOUD-785 - health checks for all machine API controllers

This introduces support for readinessProbe and livenessProbe [1] on the owned machine controllers deployment, for its machineSet, MHC, and machine controller containers.
This lets the kubelet track the lifecycle of these containers, which in turn lets us more robustly signal operator degradation on the clusterOperator status.

This PR needs [2], [3] and [4] to work and pass CI, so that the probes included in the container spec here can get a 200 from the machine controllers. Additionally, [5], [6] and [7] must do the same, or the probes included in the container spec here will fail and result in the containers getting restarted.

This also reverts an accidental rebase and puts back the syncPeriod, which was dropped by [8].

[1] https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes
[2] openshift/cluster-api-provider-aws#329
[3] openshift/cluster-api-provider-azure#139
[4] openshift/cluster-api-provider-gcp#96
[5] openshift/cluster-api-provider-baremetal#79
[6] openshift/cluster-api-provider-ovirt#52
[7] openshift/cluster-api-provider-openstack#105
[8] https://github.com/openshift/machine-api-operator/pull/590/files#diff-7417e4bc31a1bacc1a431704bee56978L41
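For reference, the shape of the probes being added here looks roughly like the following. This is a sketch only: the port and failureThreshold values come from this PR's constants, while the `/healthz` and `/readyz` handler paths are assumed defaults, not confirmed by this description.

```yaml
livenessProbe:
  httpGet:
    path: /healthz   # assumed handler path
    port: 9440       # defaultMachineHealthPort in this PR
  failureThreshold: 10
readinessProbe:
  httpGet:
    path: /readyz    # assumed handler path
    port: 9440
  failureThreshold: 10
```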

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 27, 2020
@openshift-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

go.mod Outdated
@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 4, 2020
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 10, 2020
@Danil-Grigorev
Contributor Author

/hold cancel

@Danil-Grigorev Danil-Grigorev marked this pull request as ready for review June 10, 2020 11:18
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 10, 2020
@enxebre
Member

enxebre commented Jun 17, 2020

This will fail the readinessProbe/livenessProbe on the machine controllers unless they are listening on the given port. If that's already the case, can you please link all the relevant PRs in the description?
Any reason why we are not health checking the controllers living in this repo, i.e. machineSet, MHC...?

@@ -52,6 +52,17 @@ spec:
- "--images-json=/etc/machine-api-operator-config/images/images.json"
- "--alsologtostderr"
- "--v=3"
ports:
- containerPort: 9440
Member

Why this? The operator binary is not the one exposing the health check, is it?

@enxebre
Member

enxebre commented Jun 18, 2020

@Danil-Grigorev can you please coordinate with openstack/metal3/ovirt providers so they are aware of this?
Can we also please link the counterpart PRs in the description?

@@ -34,9 +41,11 @@ func main() {
}

cfg := config.GetConfigOrDie()

syncPeriod := 10 * time.Minute
Contributor Author

At some point this fix was lost.

Member

@enxebre enxebre Jun 18, 2020

can you please link the specific commit where this was lost and put it back in its own commit?
Let's keep the bar high for atomic commits, meaningful messages, and small PRs. The more we do that, the more sustainable the repos become and the easier they are to engage with.

Contributor Author

@enxebre fixed

Member

@enxebre enxebre Jun 18, 2020

Thanks a lot for splitting the commits! FWIW, I didn't mean that you necessarily had to cherry-pick back the syncPeriod commit, but rather that you add a link in the description pointing to the commit which dropped it by accident.

Can you please link the counterpart PRs for the actuators in the PR description, elaborate a bit on the motivation behind this change (OCPCLOUD-785 is not something public), and at minimum explain which other providers will be affected by this, e.g. openstack/rhv/metal3.

That, along with the commits as they are broken down now, would have dramatically reduced the friction and time to review this PR in the first place. It would also make it much easier for people getting here with less context to understand the reasoning behind the change. People with less context includes ourselves a month from now, or when context switching from other repos.

Usually we elaborate the reasoning behind a change in the commit message body (git ci -m "subject" -m "reasoning here") so it's not only recorded in GH but also in the git history. GH recognises that and uses it automatically as the PR description.

Contributor Author

They won't be affected immediately, even if this gets merged; I'll open issues in the other repos.

Member

Hmm, wouldn't they break as soon as this gets merged, since the health check will fail for them when the MAO runs?

Contributor Author

I see, let me open a couple of issues.

@Danil-Grigorev
Contributor Author

/retest

Comment on lines +34 to +36
defaultMachineHealthPort = 9440
defaultMachineSetHealthPort = 9441
defaultMachineHealthCheckHealthPort = 9442
Contributor

Might be being a bit pedantic, but would it be a pain to make these the same as the metrics ports but +1000, for consistency? WDYT?

Suggested change
defaultMachineHealthPort = 9440
defaultMachineSetHealthPort = 9441
defaultMachineHealthCheckHealthPort = 9442
defaultMachineHealthPort = 9441
defaultMachineSetHealthPort = 9442
defaultMachineHealthCheckHealthPort = 9444

Contributor Author

That would mean breaking PRs across 5 repos 😁

Contributor

Oh really? 😓 We really should be using constants for this, but I'll let that be a future improvement.

defaultMachineHealthPort = 9440
defaultMachineSetHealthPort = 9441
defaultMachineHealthCheckHealthPort = 9442
healthFailureThreshold = 10
Member

where is this magic number 10 coming from?

Contributor Author

I saw that the default of 3 retries was not enough; the container failed a couple of times before becoming ready, so I just decided to give it more time. Not that important.

Member

So why was it not enough? Can we add a comment explaining why we chose 10?

Danil-Grigorev and others added 2 commits June 22, 2020 10:17
OCPCLOUD-785 - health checks for all machine API controllers

This introduces support for readinessProbe and livenessProbe [1] on the owned machine controllers deployment, for its machineSet, MHC, and machine controller containers.
This lets the kubelet track the lifecycle of these containers, which in turn lets us more robustly signal operator degradation on the clusterOperator status.

This PR needs [2], [3] and [4] to work and pass CI, so that the probes included in the container spec here can get a 200 from the machine controllers. Additionally, [5], [6] and [7] must do the same, or the probes included in the container spec here will fail and result in the containers getting restarted.

This also reverts an accidental rebase and puts back the syncPeriod, which was dropped by [8].

[1] https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes
[2] openshift/cluster-api-provider-aws#329
[3] openshift/cluster-api-provider-azure#139
[4] openshift/cluster-api-provider-gcp#96
[5] https://github.com/openshift/cluster-api-provider-openstack
[6] https://github.com/openshift/cluster-api-provider-ovirt
[7] https://github.com/metal3-io/cluster-api-provider-metal3
[8] https://github.com/openshift/machine-api-operator/pull/590/files#diff-7417e4bc31a1bacc1a431704bee56978L41
- Reintroducing a fix dropped in 4c9abf9
@enxebre
Member

enxebre commented Jun 22, 2020

/retest
/approve
/hold
Thanks for addressing all the comments @Danil-Grigorev! This PR might break CI jobs which are not blocking in this repo. Feel free to verify the results and unhold before merging.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 22, 2020
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 22, 2020
@JoelSpeed
Contributor

/test e2e-azure-operator

@Danil-Grigorev
Contributor Author

/retest

@enxebre
Member

enxebre commented Jun 23, 2020

/test e2e-azure-operator

@enxebre
Member

enxebre commented Jun 23, 2020

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 23, 2020
Contributor

@elmiko elmiko left a comment

this looks reasonable to me, i agree with the comments about coming back to make the ports into constants.
/lgtm


healthAddr := flag.String(
"health-addr",
":9442",
Contributor

+1, i think making this a constant would be a good improvement for a followup

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 23, 2020
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

5 similar comments

@openshift-ci-robot
Contributor

openshift-ci-robot commented Jun 23, 2020

@Danil-Grigorev: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-azure-operator 64261c8 link /test e2e-azure-operator

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 1e858a4 into openshift:master Jun 23, 2020
Prashanth684 added a commit to Prashanth684/cluster-api-provider-libvirt that referenced this pull request Jul 6, 2020
The machine-api-operator recently added liveness and readiness checks (openshift/machine-api-operator#602).
With this change the controller will respond with a 200.

Closes: openshift#197
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

7 participants