Add liveness/readiness probes #602
Conversation
Skipping CI for Draft Pull Request.
/hold cancel
Force-pushed d6f25e9 to 2717f3e
This will fail the readinessProbe/livenessProbe on the machine controllers unless they are listening on the given port. If that's already the case, can you please link all the relevant PRs in the description?
@@ -52,6 +52,17 @@ spec:
        - "--images-json=/etc/machine-api-operator-config/images/images.json"
        - "--alsologtostderr"
        - "--v=3"
        ports:
        - containerPort: 9440
Why this? The operator binary isn't the one exposing the health check, is it?
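For reference, the probe stanza that the containerPort in the hunk above backs might look roughly like the following. This is a hedged sketch: only port 9440 and the failureThreshold of 10 appear elsewhere in this PR; the endpoint paths are assumptions.

```yaml
# Hypothetical probe spec for a machine controller container.
# Only port 9440 and failureThreshold 10 come from this PR;
# the /healthz and /readyz paths are assumptions.
livenessProbe:
  httpGet:
    path: /healthz
    port: 9440
  failureThreshold: 10
readinessProbe:
  httpGet:
    path: /readyz
    port: 9440
```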
Force-pushed 59ad63c to dd5d4db
@Danil-Grigorev can you please coordinate with the openstack/metal3/ovirt providers so they are aware of this?
@@ -34,9 +41,11 @@ func main() {
	}

	cfg := config.GetConfigOrDie()

	syncPeriod := 10 * time.Minute
At some point this fix was lost.
Can you please link the specific commit where this was lost and put it back in its own commit?
Let's keep the bar high for atomic commits, meaningful messages and small PRs. The more we do that, the more sustainable the repos become and the easier they are to engage with.
@enxebre fixed
Thanks a lot for splitting the commits! FWIW, I didn't mean that you necessarily had to cherry-pick the syncPeriod commit back, but rather that you should add a link in the description pointing to the commit which dropped it by accident.
Can you please link the counterpart PRs for the actuators in the PR description and elaborate a bit on the motivation behind this change (OCPCLOUD-785 is not something public)? Also, at minimum, explain which other providers will be affected by this, e.g. openstack/rhv/metal3.
That, along with the commits as they are broken down now, would have dramatically reduced the friction and time to review this PR in the first place. It would also make it much easier for people getting here with less context to understand the reasoning behind the change. People with less context include ourselves a month from now, or when context switching from other repos.
Usually we elaborate the reasoning behind a change in `git ci -m "" -m "here"` so it's not only recorded in GH but also in the git history. GH recognises and uses that automatically as the PR description.
They won't be affected immediately, even if it gets merged, I'll open issues in other repos.
Mm, wouldn't they break as soon as this gets merged, since the health check will fail for them when the MAO runs?
I see, let me open a couple of issues.
/retest
defaultMachineHealthPort            = 9440
defaultMachineSetHealthPort         = 9441
defaultMachineHealthCheckHealthPort = 9442
Might be being a bit pedantic, but would it be a pain to make these the same as the metrics ports but +1000, for consistency? WDYT?
Suggested change:
-defaultMachineHealthPort            = 9440
-defaultMachineSetHealthPort         = 9441
-defaultMachineHealthCheckHealthPort = 9442
+defaultMachineHealthPort            = 9441
+defaultMachineSetHealthPort         = 9442
+defaultMachineHealthCheckHealthPort = 9444
That would mean breaking PRs across 5 repos 😁
Oh really? 😓 We should be using constants for this really, but I'll let that be a future improvement
pkg/operator/sync.go
Outdated
defaultMachineHealthPort            = 9440
defaultMachineSetHealthPort         = 9441
defaultMachineHealthCheckHealthPort = 9442
healthFailureThreshold              = 10
where is this magic number 10 coming from?
I saw that the default 3 retries was not enough, and the container failed a couple of times before becoming ready. I just decided to give it more time. Not that important.
So why was it not enough? Can we add a comment explaining why we chose 10?
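For context on the chosen threshold: the kubelet multiplies failureThreshold by the probe's periodSeconds (10 seconds by default), so raising the threshold from 3 to 10 extends the tolerance from roughly 30s to roughly 100s of consecutive failed checks before the container is restarted. A hedged sketch (the path is an assumption):

```yaml
livenessProbe:
  httpGet:
    path: /healthz      # assumed path, not confirmed by this PR
    port: 9440
  periodSeconds: 10     # kubelet default when unset
  failureThreshold: 10  # ~100s of failed checks tolerated before restart
```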
Force-pushed 79aecd1 to 5776f6c
OCPCLOUD-785 - health checks for all machine API controllers
- Reintroducing a fix dropped in 4c9abf9
/retest
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: enxebre. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/test e2e-azure-operator
/retest
/test e2e-azure-operator
/hold cancel
this looks reasonable to me, i agree with the comments about coming back to make the ports into constants.
/lgtm
healthAddr := flag.String(
	"health-addr",
	":9442",
+1, i think making this a constant would be a good improvement for a followup
/retest Please review the full test history for this PR and help us cut down flakes.
5 similar comments
@Danil-Grigorev: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/retest Please review the full test history for this PR and help us cut down flakes.
The machine-api-operator recently added liveness and readiness checks (openshift/machine-api-operator#602). With this change the controller will respond with a 200. Closes: openshift#197
OCPCLOUD-785 - health checks for all machine API controllers
This introduces support for readinessProbe and livenessProbe [1] for the owned machine controllers deployment and its machineSet, MHC and machine controller containers.
This lets the kubelet better track the lifecycle of these containers, which in turn lets us signal operator degradation more robustly on the clusterOperator status.
This PR needs [2], [3] and [4] to land and pass CI so that the probes included in the container spec here can get a 200 from the machine controllers. Additionally, [5], [6] and [7] must do the same, or the probes included in the container spec here will fail and result in those containers getting restarted.
This also reverts an accidental rebase and puts back the syncPeriod which was dropped by [8].

[1] https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes
[2] openshift/cluster-api-provider-aws#329
[3] openshift/cluster-api-provider-azure#139
[4] openshift/cluster-api-provider-gcp#96
[5] openshift/cluster-api-provider-baremetal#79
[6] openshift/cluster-api-provider-ovirt#52
[7] openshift/cluster-api-provider-openstack#105
[8] https://github.com/openshift/machine-api-operator/pull/590/files#diff-7417e4bc31a1bacc1a431704bee56978L41