OCPBUGS-18554: Fix "error removing HNS network when cleaning up BYOH proxy nodes" #1774

Merged

saifshaikh48
Contributor

This PR reorders deconfiguration so that the step removing files and HNS networks
runs after the instance has finished rebooting. This way WMCO is guaranteed
to only interact with the instance when it is reachable via SSH.

Previously, we were hitting timing issues where, after WICD cleanup was run,
WMCO's configmap and node controllers would race, resulting in WMCO issuing
SSH commands while the node was still rebooting. These commands would fail and cause
deconfiguration to reconcile again, leading to timeouts in CI for our deletion suite.

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Sep 8, 2023
@openshift-ci-robot

@saifshaikh48: This pull request references Jira Issue OCPBUGS-18554, which is invalid:

  • expected the bug to target the "4.15.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 8, 2023
@openshift-ci
Contributor

openshift-ci bot commented Sep 8, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@saifshaikh48
Contributor Author

/hold

blocked by #1770

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 8, 2023
@saifshaikh48
Contributor Author

/test vsphere-proxy-e2e-operator

@saifshaikh48
Contributor Author

/cherry-pick release-4.14

@openshift-cherrypick-robot

@saifshaikh48: once the present PR merges, I will cherry-pick it on top of release-4.14 in a new PR and assign it to you.

@mtnbikenc
Member

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Sep 28, 2023
@openshift-ci-robot

@mtnbikenc: This pull request references Jira Issue OCPBUGS-18554, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.15.0) matches configured target version for branch (4.15.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @rrasouli

@saifshaikh48
Contributor Author

/unhold

#1770 merged

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 4, 2023
@openshift-ci-robot

@saifshaikh48: This pull request references Jira Issue OCPBUGS-18554, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.15.0) matches configured target version for branch (4.15.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @rrasouli

@saifshaikh48 saifshaikh48 changed the title [WIP] OCPBUGS-18554: Fix "error removing HNS network when cleaning up BYOH proxy nodes" OCPBUGS-18554: Fix "error removing HNS network when cleaning up BYOH proxy nodes" Oct 4, 2023
@saifshaikh48 saifshaikh48 force-pushed the fix-proxy-byoh-deletion-flake branch 2 times, most recently from b5bc8f2 to adfd87b Compare October 4, 2023 17:27
@saifshaikh48
Contributor Author

/test vsphere-proxy-e2e-operator
/test aws-upgrade-e2e

- // Deconfigure removes all files and networks created by WMCO and runs the WICD cleanup command.
- Deconfigure(string, string) error
+ // Deconfigure removes all files and networks created by WMCO
+ Deconfigure() error
Contributor

This isn't really Deconfigure at this point. It's no longer undoing changes done by ConfigureWICD.
It may be more apt to name it RemoveFilesAndNetworks.

Contributor

I agree with this given that we no longer have Windows.Configure(). In some ways you are doing the opposite of windows.Bootstrap().
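
A minimal sketch of what the suggested rename could look like. Only the method name `RemoveFilesAndNetworks` comes from the suggestion above; `instanceCleaner` and `recordingCleaner` are hypothetical, trimmed-down stand-ins rather than WMCO's actual types, and the removal order inside the method is arbitrary.

```go
package main

import "fmt"

// instanceCleaner is a hypothetical, trimmed-down interface for illustration.
type instanceCleaner interface {
	// RemoveFilesAndNetworks removes all files and networks created by WMCO.
	// Callers should only invoke it while the instance is reachable via SSH.
	RemoveFilesAndNetworks() error
}

// recordingCleaner is a test double that records what it was asked to remove.
type recordingCleaner struct {
	files    []string
	networks []string
	removed  []string
}

// RemoveFilesAndNetworks records every network and file, then clears both lists.
func (r *recordingCleaner) RemoveFilesAndNetworks() error {
	for _, n := range r.networks {
		r.removed = append(r.removed, "network:"+n)
	}
	for _, f := range r.files {
		r.removed = append(r.removed, "file:"+f)
	}
	r.files, r.networks = nil, nil
	return nil
}

func main() {
	rc := &recordingCleaner{
		files:    []string{`C:\example\file`},
		networks: []string{"example-network"},
	}
	var c instanceCleaner = rc
	if err := c.RemoveFilesAndNetworks(); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("removed:", rc.removed)
}
```

Naming the method after what it does, rather than as the inverse of a configure step that no longer exists, keeps the interface comment and the behavior in agreement.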

Comment on lines 524 to 526
if err := nc.Windows.RunWICDCleanup(nc.wmcoNamespace, wicdKC); err != nil {
	return fmt.Errorf("unable to cleanup the Windows instance: %w", err)
}

// Wait for reboot annotation removal. This prevents deleting the node until the node no longer needs reboot.
if err := metadata.WaitForRebootAnnotationRemoval(context.TODO(), nc.client, nc.node.Name); err != nil {
	return err
}
if err := nc.Windows.Deconfigure(); err != nil {
	return fmt.Errorf("error deconfiguring instance: %w", err)
}
Contributor

There's an ordering requirement that's not clear here.
If it's solely between RunWICDCleanup and WaitForRebootAnnotationRemoval, it would be best to move them to a separate documented function.

Contributor

At this point even nc.generateWICDKubeconfig() needs to move into the separate documented function.

Contributor Author

Makes sense, thanks for pointing this out

This commit reorders deconfiguration so that files and HNS networks are removed
after the instance has finished rebooting. This way WMCO is guaranteed
to only interact with the instance when it is reachable via SSH.
Previously, we were hitting timing issues where, after WICD cleanup was run,
WMCO's configmap and node controllers would race, resulting in WMCO issuing
SSH commands while the node was still rebooting, which would fail.

With this change, a log message was also moved to better signal the start of node
deconfiguration.
@openshift-ci
Contributor

openshift-ci bot commented Oct 5, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aravindhp, saifshaikh48

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 5, 2023
@openshift-ci-robot

@saifshaikh48: This pull request references Jira Issue OCPBUGS-18554, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.15.0) matches configured target version for branch (4.15.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @rrasouli

The bug has been updated to refer to the pull request using the external bug tracker.

Contributor

@alinaryan alinaryan left a comment

Does this need any testing? Otherwise lgtm

@saifshaikh48
Contributor Author

@alinaryan the testing I did was

  • making sure the proxy job passes e2e
  • running the deletion tests a bunch of times to make sure the BYOH removal flake wasn't seen anymore

@saifshaikh48
Contributor Author

/cherry-pick release-4.14

@openshift-cherrypick-robot

@saifshaikh48: once the present PR merges, I will cherry-pick it on top of release-4.14 in a new PR and assign it to you.

@alinaryan
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 5, 2023
@saifshaikh48 saifshaikh48 marked this pull request as ready for review October 5, 2023 22:18
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 5, 2023
@sebsoto
Contributor

sebsoto commented Oct 6, 2023

/override ci/prow/vsphere-e2e-upgrade

@openshift-ci
Contributor

openshift-ci bot commented Oct 6, 2023

@sebsoto: Overrode contexts on behalf of sebsoto: ci/prow/vsphere-e2e-upgrade

@openshift-ci
Contributor

openshift-ci bot commented Oct 6, 2023

@saifshaikh48: all tests passed!

Full PR test history. Your PR dashboard.

@openshift-ci openshift-ci bot merged commit 50132d3 into openshift:master Oct 6, 2023
16 checks passed
@openshift-ci-robot

@saifshaikh48: Jira Issue OCPBUGS-18554: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-18554 has been moved to the MODIFIED state.

@openshift-cherrypick-robot

@saifshaikh48: new pull request created: #1874

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
7 participants