Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[OCPCLOUD-907] Add remediationsAllowed field to MHC status #652

Merged

Conversation

JoelSpeed
Copy link
Contributor

@JoelSpeed JoelSpeed commented Jul 21, 2020

This adds an remediationsAllowed field to the MHC status that tells a user how many more remediations will be allowed before the controller starts short circuiting remediations.

IE if the value of the field is 1, 1 more machine can be remediated before maxUnhealthy blocks the remediation. If the value is 0, no more remediations are allowed, the limit of maxUnhealthy has been reached.

This also adds a condition that shows whether remediation is currently considered to be allowed or not

This is the openshift counterpart to kubernetes-sigs/cluster-api#3372
/hold

Want to make sure we agree on the naming upstream before merging this

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 21, 2020
@openshift-ci-robot
Copy link
Contributor

@JoelSpeed: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-aws-upgrade f7ada32 link /test e2e-aws-upgrade
ci/prow/e2e-gcp-operator f7ada32 link /test e2e-gcp-operator
ci/prow/e2e-azure-operator f7ada32 link /test e2e-azure-operator

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 30, 2020
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 19, 2020
@JoelSpeed JoelSpeed changed the title [OCPCLOUD-907] Add allowedRemediations field to MHC status [OCPCLOUD-907] Add remediationsAllowed field to MHC status Oct 19, 2020
@JoelSpeed
Copy link
Contributor Author

/hold cancel

I think things have stabilised on this feature upstream, this PR now matches what is being done upstream and as such I think is good to go into 4.7

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 19, 2020
@JoelSpeed
Copy link
Contributor Author

/retest

Copy link
Contributor

@elmiko elmiko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this generally makes sense to me, i just had a minor question


// ANCHOR: Condition

// Condition defines an observation of a Cluster API resource operational state.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i know this shares some ancestry with upstream, should we say "Machine API" in our version here?


// ANCHOR: Conditions

// Conditions provide observations of the operational state of a Cluster API resource.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same thing here, re: Machine API

@JoelSpeed
Copy link
Contributor Author

@elmiko I fixed your comments and a couple of other issues, please take another look

Copy link
Contributor

@elmiko elmiko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks Joel, i went over the code again and just had a question about something that didn't make immediate sense to me.

)

// Remediation not allowed, the number of not started or unhealthy machines exceeds maxUnhealthy
mhc.Status.RemediationsAllowed = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this get updated on the object in etcd at some point?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I've moved the reconcile status so it happens for all code paths, ideally I would spend time to do a bigger refactor of this logic so it's more obvious what's going on here

I will make sure to update the tests before we merge this

@JoelSpeed
Copy link
Contributor Author

/retest

Copy link
Contributor

@elmiko elmiko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this continues to make sense for me, i just had a small question

// unhealthyMachineCount calculates the number of presently unhealthy or missing machines
// ie the delta between the expected number of machines and the current number deemed healthy
func unhealthyMachineCount(mhc *mapiv1.MachineHealthCheck) int {
return derefInt(mhc.Status.ExpectedMachines) - derefInt(mhc.Status.CurrentHealthy)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible that either of these derefInt calls could fail (eg the value does not exist)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it will default to 0 if the deref fails (I think that's the point of the function), but no matter the behaviour, this hasn't actually changed from before this PR, just moved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok perfect. thanks!

@elmiko
Copy link
Contributor

elmiko commented Oct 30, 2020

/approve

@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 30, 2020
@JoelSpeed
Copy link
Contributor Author

/retest

Copy link
Contributor

@alexander-demicev alexander-demicev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 13, 2020
@JoelSpeed
Copy link
Contributor Author

/retest

2 similar comments
@JoelSpeed
Copy link
Contributor Author

/retest

@JoelSpeed
Copy link
Contributor Author

/retest

@openshift-merge-robot openshift-merge-robot merged commit 7750896 into openshift:master Nov 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants