-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPBUGS-18554: Fix "error removing HNS network when cleaning up BYOH proxy nodes" #1774
OCPBUGS-18554: Fix "error removing HNS network when cleaning up BYOH proxy nodes" #1774
Conversation
@saifshaikh48: This pull request references Jira Issue OCPBUGS-18554, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Skipping CI for Draft Pull Request. |
/hold blocked by #1770 |
/test vsphere-proxy-e2e-operator |
/cherry-pick release-4.14 |
@saifshaikh48: once the present PR merges, I will cherry-pick it on top of release-4.14 in a new PR and assign it to you. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/jira refresh |
@mtnbikenc: This pull request references Jira Issue OCPBUGS-18554, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/unhold #1770 merged |
@saifshaikh48: This pull request references Jira Issue OCPBUGS-18554, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
242be9a
to
be323d0
Compare
b5bc8f2
to
adfd87b
Compare
/test vsphere-proxy-e2e-operator |
pkg/windows/windows.go
Outdated
// Deconfigure removes all files and networks created by WMCO and runs the WICD cleanup command. | ||
Deconfigure(string, string) error | ||
// Deconfigure removes all files and networks created by WMCO | ||
Deconfigure() error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't really Deconfigure
at this point. Its no longer undoing changes done by ConfigureWICD
It may be more apt to name it RemoveFilesAndNetworks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with this given that we no longer have Windows.Configure()
. In some ways you are doing the opposite of windows.Bootstrap()
.
pkg/nodeconfig/nodeconfig.go
Outdated
if err := nc.Windows.RunWICDCleanup(nc.wmcoNamespace, wicdKC); err != nil { | ||
return fmt.Errorf("unable to cleanup the Windows instance: %w", err) | ||
} | ||
|
||
// Wait for reboot annotation removal. This prevents deleting the node until the node no longer needs reboot. | ||
if err := metadata.WaitForRebootAnnotationRemoval(context.TODO(), nc.client, nc.node.Name); err != nil { | ||
return err | ||
} | ||
if err := nc.Windows.Deconfigure(); err != nil { | ||
return fmt.Errorf("error deconfiguring instance: %w", err) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Theres an ordering requirement thats not clear here.
If its solely between RunWICDCleanup
and WaitForRebootAnnotationRemoval
it would be best to move them to a separate documented function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this point even nc.generateWICDKubeconfig()
needs to move into the separate documented function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, thanks for pointing this out
pkg/windows/windows.go
Outdated
// Deconfigure removes all files and networks created by WMCO and runs the WICD cleanup command. | ||
Deconfigure(string, string) error | ||
// Deconfigure removes all files and networks created by WMCO | ||
Deconfigure() error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with this given that we no longer have Windows.Configure()
. In some ways you are doing the opposite of windows.Bootstrap()
.
pkg/nodeconfig/nodeconfig.go
Outdated
if err := nc.Windows.RunWICDCleanup(nc.wmcoNamespace, wicdKC); err != nil { | ||
return fmt.Errorf("unable to cleanup the Windows instance: %w", err) | ||
} | ||
|
||
// Wait for reboot annotation removal. This prevents deleting the node until the node no longer needs reboot. | ||
if err := metadata.WaitForRebootAnnotationRemoval(context.TODO(), nc.client, nc.node.Name); err != nil { | ||
return err | ||
} | ||
if err := nc.Windows.Deconfigure(); err != nil { | ||
return fmt.Errorf("error deconfiguring instance: %w", err) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this point even nc.generateWICDKubeconfig()
needs to move into the separate documented function.
This commit re-orders a deconfiguration step to remove files and HNS networks after the instance has finished rebooting. This way WMCO is guaranteed to only interact with the instance when it is reachable via SSH. Previously, we were hitting timing issues where, after WICD cleanup is ran, WMCO's configmap and node controllers would race, resulting in WMCO issuing SSH commands while the node is still rebooting, which would fail. With this change, a log message was moved to better signal the start of node deconfiguration.
adfd87b
to
445c14c
Compare
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: aravindhp, saifshaikh48 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@saifshaikh48: This pull request references Jira Issue OCPBUGS-18554, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need any testing? Otherwise lgtm
@alinaryan the testing I did was
|
/cherry-pick release-4.14 |
@saifshaikh48: once the present PR merges, I will cherry-pick it on top of release-4.14 in a new PR and assign it to you. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/lgtm |
/override ci/prow/vsphere-e2e-upgrade |
@sebsoto: Overrode contexts on behalf of sebsoto: ci/prow/vsphere-e2e-upgrade In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@saifshaikh48: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
@saifshaikh48: Jira Issue OCPBUGS-18554: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-18554 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@saifshaikh48: new pull request created: #1874 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
This PR re-orders a deconfiguration step removing files and HNS networks
to be after the instance has finished rebooting. This way WMCO is guaranteed
to only interact with the instance when it is reachable via SSH.
Previously, we were hitting timing issues where, after WICD cleanup is ran,
WMCO's configmap and node controllers would race, resulting in WMCO issuing
SSH commands while the node is still rebooting. These would fail and cause
deconfiguration to reconcile again, leading to timeouts in CI for our deletion suite.