
[OCPCLOUD-803] Run Spot Termination Handler from Machine API Operator #535

Merged
merged 5 commits into openshift:master from run-termination on Mar 26, 2020

Conversation

JoelSpeed
Contributor

This PR enables the Machine API Operator to run the Termination Handler added to the AWS cloud provider in openshift/cluster-api-provider-aws#308, by running a DaemonSet on nodes labelled as interruptible.
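For context, a minimal sketch of how such a DaemonSet targets interruptible nodes via a nodeSelector (the wrapper shape and namespace field are assumptions; the PodSpec values match the fragment reviewed later in this thread):

// Sketch only: pods land exclusively on nodes carrying the
// interruptible-instance label set by the machine controller.
// Selector and template labels omitted for brevity.
ds := &appsv1.DaemonSet{
	ObjectMeta: metav1.ObjectMeta{
		Name:      "machine-api-termination-handler",
		Namespace: config.TargetNamespace, // assumed field name
	},
	Spec: appsv1.DaemonSetSpec{
		Template: corev1.PodTemplateSpec{
			Spec: corev1.PodSpec{
				Containers:         containers,
				PriorityClassName:  "system-node-critical",
				NodeSelector:       map[string]string{machinecontroller.MachineInterruptibleInstanceLabelName: ""},
				ServiceAccountName: "machine-api-termination-handler",
			},
		},
	},
}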

Need to test on a real cluster before this merges

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 24, 2020
@JoelSpeed JoelSpeed force-pushed the run-termination branch 2 times, most recently from 5f4f6c8 to 92bdd71 on March 25, 2020 12:04
@JoelSpeed
Contributor Author

/unhold

I've verified this deploys the DaemonSet as expected.

@openshift-ci-robot openshift-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Mar 25, 2020
@@ -41,7 +43,16 @@ func (optr *Operator) syncAll(config *OperatorConfig) error {
	}

	if config.Controllers.TerminationHandler != clusterAPIControllerNoOp {
		if err := optr.syncTerminationHandler(config); err != nil {
Member

Can we make this call within syncClusterAPIController, so that we wrap all the logic only once and report status appropriately based on whatever errors are reported from the inner calls?

Member

Also, within there, maybe invert the logic to: 1) error earlier, 2) log the no-op scenario, i.e.:

if config.Controllers.Provider == config.Controllers.TerminationHandler != clusterAPIControllerNoOp {
    log mao will no op
    return nil
}

logic goes here

Contributor Author

Can do the return earlier with NoOp, that makes sense, though the comparison above doesn't tell us that the Provider or the TerminationHandler aren't NoOp; it would just tell us that one of them isn't, right? We'd still have to check later?

I'd be tempted to just check whether the Provider is NoOp and return early, otherwise continue, and then check later whether the TerminationHandler is NoOp and run it appropriately, something like the sketch below:
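(A sketch of that flow; the structure is an assumption, identifiers taken from the diffs in this thread.)

if config.Controllers.Provider == clusterAPIControllerNoOp {
	// Nothing to reconcile for a NoOp provider.
	return nil
}
// ... sync the machine-api controller components ...
if config.Controllers.TerminationHandler != clusterAPIControllerNoOp {
	if err := optr.syncTerminationHandler(config); err != nil {
		return err
	}
}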

@enxebre (Member) commented Mar 26, 2020

Sorry, that was a typo. I meant:
if config.Controllers.TerminationHandler == clusterAPIControllerNoOp then return

We should probably do the same for line 32, but there's no need for that to go in this PR.
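In Go, that early return would look something like this (a sketch; the function shape and log message are assumptions):

func (optr *Operator) syncTerminationHandler(config *OperatorConfig) error {
	if config.Controllers.TerminationHandler == clusterAPIControllerNoOp {
		glog.V(3).Info("Termination handler is NoOp, nothing to sync")
		return nil
	}
	// ... DaemonSet sync logic goes here ...
	return nil
}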

@@ -39,6 +39,10 @@ func (optr *Operator) syncAll(config *OperatorConfig) error {
		glog.V(3).Info("Synced up all machine-api-controller components")
	}

	if config.Controllers.TerminationHandler != clusterAPIControllerNoOp {
@enxebre (Member) commented Mar 26, 2020

Should we do this within syncClusterAPIController, so it's all covered by the wrapping there and we don't have to duplicate the status* logic?

Contributor Author

Which status logic are you referring to here? Do you mean the waiting for the rollouts? I don't think there's any duplication of status logic apart from the waiting, and that has to differ because the status fields are different for DaemonSets vs Deployments.

As far as I can tell, the status is only updated once this method returns.
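To illustrate why the waiting differs: a rollout-complete check reads different status fields on each type (a sketch using k8s.io/api/apps/v1 field names; the helper names are hypothetical):

// DaemonSets report per-node scheduling counts...
func daemonSetRolledOut(ds *appsv1.DaemonSet) bool {
	return ds.Status.UpdatedNumberScheduled == ds.Status.DesiredNumberScheduled &&
		ds.Status.NumberAvailable == ds.Status.DesiredNumberScheduled &&
		ds.Generation <= ds.Status.ObservedGeneration
}

// ...while Deployments report replica counts against spec.replicas.
func deploymentRolledOut(d *appsv1.Deployment) bool {
	return d.Status.UpdatedReplicas == *d.Spec.Replicas &&
		d.Status.AvailableReplicas == *d.Spec.Replicas &&
		d.Generation <= d.Status.ObservedGeneration
}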

Containers:         containers,
PriorityClassName:  "system-node-critical",
NodeSelector:       map[string]string{machinecontroller.MachineInterruptibleInstanceLabelName: ""},
ServiceAccountName: "machine-api-termination-handler",
Member

const for machine-api-termination-handler?
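Something like this, presumably (constant name hypothetical):

// Hypothetical name for the suggested constant.
const machineAPITerminationHandler = "machine-api-termination-handler"

	ServiceAccountName: machineAPITerminationHandler,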

@enxebre
Member

enxebre commented Mar 26, 2020

Looks great. I dropped some comments suggesting that all the reconciliation work be covered by a single status* workflow. Needs a rebase after #536 got in.

@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 26, 2020
@enxebre
Member

enxebre commented Mar 26, 2020

/approve

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 26, 2020
@elmiko (Contributor) left a comment

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 26, 2020
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

13 similar comments

@openshift-merge-robot openshift-merge-robot merged commit 4ab7f78 into openshift:master Mar 26, 2020
@openshift-ci-robot
Contributor

@JoelSpeed: The following test failed, say /retest to rerun all failed tests:

Test name: ci/prow/e2e-azure
Commit: c11a68d
Rerun command: /test e2e-azure

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

enxebre added a commit to enxebre/machine-api-operator that referenced this pull request Jul 20, 2020
…tion handler

This openshift#535 introduced support to manage a DaemonSet which runs the termination handler for spot instances.
As an event handler is not passed to the DaemonSet informer, changes to the resource won't trigger a reconcile.
This PR fixes that by passing the event handler to the DaemonSet namespaced informer.
This will be e2e tested by openshift/cluster-api-actuator-pkg#177
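The fix amounts to registering an event handler on the DaemonSet informer so that resource changes enqueue a resync; a sketch with client-go (the queue key and informer wiring are assumptions, see openshift/machine-api-operator#648 for the actual change):

daemonSetInformer := kubeInformers.Apps().V1().DaemonSets().Informer()
daemonSetInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
	// Any change to the managed DaemonSet re-enqueues the sync key.
	AddFunc:    func(obj interface{}) { optr.queue.Add(workQueueKey) },
	UpdateFunc: func(old, new interface{}) { optr.queue.Add(workQueueKey) },
	DeleteFunc: func(obj interface{}) { optr.queue.Add(workQueueKey) },
})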
enxebre added a commit to enxebre/cluster-api-actuator-pkg that referenced this pull request Jul 20, 2020
This openshift/machine-api-operator#535 introduced support to manage a DaemonSet which runs the termination handler for spot instances.
As an event handler is not passed to the DaemonSet informer, changes to the resource won't trigger a reconcile.
PR openshift/machine-api-operator#648 fixes that by passing the event handler to the DaemonSet namespaced informer.
This PR covers that with an e2e test.