
Add liveness/readiness probes #602

Merged
merged 2 commits
Jun 23, 2020

Conversation

Contributor

@Danil-Grigorev Danil-Grigorev commented May 27, 2020

OCPCLOUD-785 - health checks for all machine API controllers

This introduces support for readinessProbe and livenessProbe [1] on the owned machine controllers deployment, for its machineSet, MHC, and machine controller containers.
This lets the kubelet track the lifecycle of these containers, which in turn lets us more robustly signal operator degradation on the clusterOperator status.

This PR needs [2], [3] and [4] to work and pass CI, so that the probes included in the container spec here can get a 200 from the machine controllers. Additionally, [5], [6] and [7] must do the same, or the probes included in the container spec here will fail and result in the containers getting restarted.

This also reverts an accidental rebase and puts back the syncPeriod, which was dropped by [8].

[1] https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes
[2] openshift/cluster-api-provider-aws#329
[3] openshift/cluster-api-provider-azure#139
[4] openshift/cluster-api-provider-gcp#96
[5] openshift/cluster-api-provider-baremetal#79
[6] openshift/cluster-api-provider-ovirt#52
[7] openshift/cluster-api-provider-openstack#105
[8] https://github.com/openshift/machine-api-operator/pull/590/files#diff-7417e4bc31a1bacc1a431704bee56978L41
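For reference, the shape of the probes being added here looks roughly like the following. This is a sketch only: the port and failureThreshold values come from this PR's constants, while the `/healthz` and `/readyz` handler paths are assumed defaults, not confirmed by this description.

```yaml
livenessProbe:
  httpGet:
    path: /healthz   # assumed handler path
    port: 9440       # defaultMachineHealthPort in this PR
  failureThreshold: 10
readinessProbe:
  httpGet:
    path: /readyz    # assumed handler path
    port: 9440
  failureThreshold: 10
```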

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 27, 2020
@openshift-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

go.mod Outdated
@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 4, 2020
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 10, 2020
@Danil-Grigorev
Contributor Author

/hold cancel

@Danil-Grigorev Danil-Grigorev marked this pull request as ready for review June 10, 2020 11:18
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 10, 2020
@enxebre
Member

enxebre commented Jun 17, 2020

This will fail the readinessProbe/livenessProbe on the machine controllers unless they are listening on the given port. If that's already the case, can you please link all the relevant PRs in the description?
Any reason why we are not health checking the controllers living in this repo, i.e. machineSet, MHC...?

@@ -52,6 +52,17 @@ spec:
- "--images-json=/etc/machine-api-operator-config/images/images.json"
- "--alsologtostderr"
- "--v=3"
ports:
- containerPort: 9440
Member

Why this? The operator binary is not the one exposing the health check, is it?

@enxebre
Member

enxebre commented Jun 18, 2020

@Danil-Grigorev can you please coordinate with openstack/metal3/ovirt providers so they are aware of this?
Can we also please link the counterpart PRs in the description?

@@ -34,9 +41,11 @@ func main() {
}

cfg := config.GetConfigOrDie()

syncPeriod := 10 * time.Minute
Contributor Author

At some point this fix was lost.

Member

@enxebre enxebre Jun 18, 2020

can you please link the specific commit where this was lost and put it back in its own commit?
Let's keep the bar high for atomic commits, meaningful messages, and small PRs. The more we do that, the more sustainable the repos become and the easier they are to engage with.

Contributor Author

@enxebre fixed

Member

@enxebre enxebre Jun 18, 2020

Thanks a lot for splitting the commits! FWIW, I didn't mean that you necessarily had to cherry-pick back the syncPeriod commit, but rather that you add a link in the description pointing to the commit which dropped it by accident.

Can you please link the counterpart PRs for the actuators in the PR description, elaborate a bit on the motivation behind this change (OCPCLOUD-785 is not something public), and at minimum explain which other providers will be affected by this, e.g. openstack/rhv/metal3.

That, along with the commits as they are broken down now, would have dramatically reduced the friction and time to review this PR in the first place. It would also make it much easier for people getting here with less context to understand the reasoning behind the change. People with less context includes ourselves a month from now, or when context switching from other repos.

Usually we elaborate the reasoning behind a change in the commit message body (git ci -m "subject" -m "reasoning here") so it's not only recorded in GH but also in the git history. GH recognises that and uses it automatically as the PR description.

Contributor Author

They won't be affected immediately, even if this gets merged; I'll open issues in the other repos.

Member

Hmm, wouldn't they break as soon as this gets merged, since the health check will fail for them when the MAO runs?

Contributor Author

I see, let me open a couple of issues.

@Danil-Grigorev
Contributor Author

/retest

Comment on lines +34 to +36
defaultMachineHealthPort = 9440
defaultMachineSetHealthPort = 9441
defaultMachineHealthCheckHealthPort = 9442
Contributor

Might be being a bit pedantic, but would it be a pain to make these the same as the metrics ports but +1000, for consistency? WDYT?

Suggested change
defaultMachineHealthPort = 9440
defaultMachineSetHealthPort = 9441
defaultMachineHealthCheckHealthPort = 9442
defaultMachineHealthPort = 9441
defaultMachineSetHealthPort = 9442
defaultMachineHealthCheckHealthPort = 9444

Contributor Author

That would mean breaking PRs across 5 repos 😁

Contributor

Oh really? 😓 We really should be using constants for this, but I'll let that be a future improvement.

defaultMachineHealthPort = 9440
defaultMachineSetHealthPort = 9441
defaultMachineHealthCheckHealthPort = 9442
healthFailureThreshold = 10
Member

where is this magic number 10 coming from?

Contributor Author

I saw that the default of 3 retries was not enough; the container failed a couple of times before becoming ready, so I just decided to give it more time. Not that important.

Member

So why was it not enough? Can we add a comment explaining why we chose 10?

Danil-Grigorev and others added 2 commits June 22, 2020 10:17
OCPCLOUD-785 - health checks for all machine API controllers

This introduces support for readinessProbe and livenessProbe [1] on the owned machine controllers deployment, for its machineSet, MHC, and machine controller containers.
This lets the kubelet track the lifecycle of these containers, which in turn lets us more robustly signal operator degradation on the clusterOperator status.

This PR needs [2], [3] and [4] to work and pass CI, so that the probes included in the container spec here can get a 200 from the machine controllers. Additionally, [5], [6] and [7] must do the same, or the probes included in the container spec here will fail and result in the containers getting restarted.

This also reverts an accidental rebase and puts back the syncPeriod, which was dropped by [8].

[1] https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes
[2] openshift/cluster-api-provider-aws#329
[3] openshift/cluster-api-provider-azure#139
[4] openshift/cluster-api-provider-gcp#96
[5] https://github.com/openshift/cluster-api-provider-openstack
[6] https://github.com/openshift/cluster-api-provider-ovirt
[7] https://github.com/metal3-io/cluster-api-provider-metal3
[8] https://github.com/openshift/machine-api-operator/pull/590/files#diff-7417e4bc31a1bacc1a431704bee56978L41
- Reintroducing a fix dropped in 4c9abf9
@enxebre
Member

enxebre commented Jun 22, 2020

/retest
/approve
/hold
Thanks for addressing all the comments @Danil-Grigorev! This PR might break CI jobs which are not blocking in this repo. Feel free to verify the results and unhold before merging.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 22, 2020
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 22, 2020
@JoelSpeed
Contributor

/test e2e-azure-operator

@Danil-Grigorev
Contributor Author

/retest

@enxebre
Member

enxebre commented Jun 23, 2020

/test e2e-azure-operator

@enxebre
Member

enxebre commented Jun 23, 2020

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 23, 2020
Contributor

@elmiko elmiko left a comment

this looks reasonable to me, i agree with the comments about coming back to make the ports into constants.
/lgtm


healthAddr := flag.String(
"health-addr",
":9442",
Contributor

+1, i think making this a constant would be a good improvement for a followup

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 23, 2020
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

5 similar comments

@openshift-ci-robot
Contributor

openshift-ci-robot commented Jun 23, 2020

@Danil-Grigorev: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-azure-operator 64261c8 link /test e2e-azure-operator

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 1e858a4 into openshift:master Jun 23, 2020
Prashanth684 added a commit to Prashanth684/cluster-api-provider-libvirt that referenced this pull request Jul 6, 2020
The machine-api-operator recently added liveness and readiness checks (openshift/machine-api-operator#602).
With this change the controller will respond with a 200.

Closes: openshift#197
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

7 participants