
Add reconcile termination handler daemonSet validations #177

Closed

Conversation

@enxebre (Member) commented Jul 20, 2020

openshift/machine-api-operator#535 introduced support for managing a daemonSet which runs the termination handler for spot instances.
Because no event handler is passed to the daemonSet informer, changes to the resource won't trigger a reconcile.
openshift/machine-api-operator#648 fixes that by passing the event handler to the daemonSet namespaced informer.
This PR covers that behaviour with e2e tests.
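For context, here is a minimal sketch of the wiring the fix refers to, using plain client-go rather than the operator's actual code; the namespace, resync period, and log message are illustrative assumptions. The point is that without the AddEventHandler call, edits to the daemonSet are only picked up on periodic resync, so the operator does not reconcile them promptly.

package main

import (
	"fmt"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (illustrative only).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// A namespaced informer factory, scoped the way the operator scopes its own.
	factory := informers.NewSharedInformerFactoryWithOptions(
		clientset, 10*time.Minute,
		informers.WithNamespace("openshift-machine-api"),
	)
	dsInformer := factory.Apps().V1().DaemonSets().Informer()

	// This is the piece that was missing: without an event handler on the
	// informer, daemonSet changes never trigger a reconcile.
	dsInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			ds := newObj.(*appsv1.DaemonSet)
			fmt.Printf("daemonSet %s/%s changed, would enqueue a reconcile\n", ds.Namespace, ds.Name)
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	cache.WaitForCacheSync(stop, dsInformer.HasSynced)
	<-stop
}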

enxebre added a commit to enxebre/machine-api-operator that referenced this pull request Jul 20, 2020
…tion handler

openshift#535 introduced support for managing a daemonSet which runs the termination handler for spot instances.
Because no event handler is passed to the daemonSet informer, changes to the resource won't trigger a reconcile.
This PR fixes that by passing the event handler to the daemonSet namespaced informer.
This will be e2e tested by openshift/cluster-api-actuator-pkg#177.
@enxebre enxebre force-pushed the termination-handler-sync-coverage branch from 396c87a to 1405cc5 on July 20, 2020 at 11:39
@enxebre (Member, Author) commented Jul 20, 2020

/hold
to run this manually on my cluster

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 20, 2020
@enxebre enxebre force-pushed the termination-handler-sync-coverage branch from 1405cc5 to 260730b on July 20, 2020 at 11:42
@enxebre enxebre force-pushed the termination-handler-sync-coverage branch from 260730b to f7192cc on July 20, 2020 at 14:27
// DeleteDaemonSet deletes the specified daemonSet
func DeleteDaemonSet(c client.Client, ds *kappsapi.DaemonSet) error {
	return wait.PollImmediate(RetryShort, WaitShort, func() (bool, error) {
		if err := c.Delete(context.TODO(), ds); err != nil {
			return false, nil // swallow the error and retry until timeout
		}
		return true, nil
	})
}
Contributor commented:
Probably need to account for situations when the DaemonSet was not found, as it was already removed.
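A sketch of the kind of handling being suggested, inside the poll function above (assuming apierrors is k8s.io/apimachinery/pkg/api/errors; this is not the merged fix):

if err := c.Delete(context.TODO(), ds); err != nil {
	// The daemonSet may already be gone; treat NotFound as success
	// so the poll exits instead of retrying until timeout.
	if apierrors.IsNotFound(err) {
		return true, nil
	}
	return false, nil // transient error, retry
}
return true, nil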

By(fmt.Sprintf("checking got daemonSet spec matches the initial one"))
Expect(framework.IsDaemonSetSynced(client, initialDaemonSet, terminationHandlerDaemonSet, framework.MachineAPINamespace)).To(BeTrue())

By(fmt.Sprintf("updating got daemonSet spec"))
Contributor commented:
Nit: could give it a separate It to increase test robustness
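What that suggestion could look like, sketched from the excerpt above (the It descriptions are hypothetical):

It("keeps the daemonSet spec synced with the initial one", func() {
	Expect(framework.IsDaemonSetSynced(client, initialDaemonSet, terminationHandlerDaemonSet, framework.MachineAPINamespace)).To(BeTrue())
})

It("reconciles manual updates to the daemonSet spec", func() {
	// update the daemonSet spec here, then assert the operator restores it
})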

Comment on lines +77 to +78
By(fmt.Sprintf("checking daemonSet is available"))
Expect(framework.IsDaemonSetAvailable(client, terminationHandlerDaemonSet, framework.MachineAPINamespace)).To(BeTrue())
Contributor commented:
What does available here mean? Does it mean that all replicas are running? If so, on a default cluster the daemonset should always be available by virtue of it having no replicas. I think we need to simulate somewhere in the test suite that the daemonset is available and has more than 1 replica. I'm not really sure what this is testing over the daemonset just existing.

Perhaps we do that in https://github.com/openshift/cluster-api-actuator-pkg/blob/master/pkg/infra/spot.go#L73? Could follow up later.

@enxebre (Member, Author) replied Jul 21, 2020:
This is validating that the operator does its job and also that the expectation of having no unavailable replicas is satisfied: https://github.com/openshift/cluster-api-actuator-pkg/pull/177/files#diff-a8166de82f0b6261e02122357a0c6096R40.
On a default cluster the expected available count happens to be zero. That's circumstantial; this test covers that scenario and any other possible one. If the default ever changes, or if this runs in parallel with any spot instance, the test must still remain green. This lets us introduce changes while staying confident we are not breaking the expectation.
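For reference, a check along these lines can be written against the daemonSet status (a sketch only; the real framework.IsDaemonSetAvailable helper may differ, and appsv1 is k8s.io/api/apps/v1):

func isDaemonSetAvailable(ds *appsv1.DaemonSet) bool {
	s := ds.Status
	// All desired replicas are scheduled, updated, and available. On a
	// default cluster DesiredNumberScheduled is zero, so this holds
	// vacuously, which is the "expected available happens to be zero" case.
	return s.UpdatedNumberScheduled == s.DesiredNumberScheduled &&
		s.NumberAvailable == s.DesiredNumberScheduled &&
		s.NumberUnavailable == 0
}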

I think we need to simulate somewhere in the test suite that the daemonset is available and has more than 1 replica

Yes, I'll follow up with PRs to make the operator go degraded if the pod crashloops, and with a test for it.

Contributor replied:
Ack, SGTM

@JoelSpeed (Contributor) commented:

Should probably address this #177 (comment), but otherwise I'm happy with this PR

/approve

@openshift-ci-robot commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 21, 2020
@enxebre (Member, Author) commented Jul 23, 2020

/retest

@openshift-ci-robot commented:

@enxebre: The following test failed, say /retest to rerun all failed tests:

Test name: ci/prow/e2e-azure-operator
Commit: f7192cc
Rerun command: /test e2e-azure-operator

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-ci-robot commented:

@enxebre: PR needs rebase.


@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 26, 2020
@openshift-bot commented:

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 31, 2020
@openshift-bot commented:

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 30, 2020
@openshift-merge-robot (Contributor) commented:

@enxebre: The following test failed, say /retest to rerun all failed tests:

Test name: ci/prow/e2e-vsphere-operator
Commit: f7192cc
Rerun command: /test e2e-vsphere-operator

Full PR test history. Your PR dashboard.


@openshift-bot commented:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot commented:

@openshift-bot: Closed this PR.

In response to this:

/close
