New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add node check to vSphere cloud provider #117243
Conversation
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the The Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/assign @gnufied @divyenpatel |
/priority important-soon |
@lubronzhan Can you help review this PR? |
7d7741b
to
3b25e0a
Compare
3b25e0a
to
0e2b76f
Compare
At KCM startup, vSphere cloud provider builds a cache from NodeAdded events from an informer. But these events are asynchronous, the cloud provider may need information from the cache earlier, for example when detaching a volume from a node at KCM startup. If a node is missing in the cache, the cloud provider treats such a detach as successful, which is wrong. Such a volume will be attached to the node forever. To prevent this issue: 1. Try nodeLister before declaring a node as not found. A/D controller starts after its node informer has been synced. 2. Read API server before declaring a node as not found. Just in case the informer has stale data.
0e2b76f
to
8d19c00
Compare
/lgtm |
LGTM label has been added. Git tree hash: f44ce987190d7c201b6638febfac85e189f8a8fa
|
/assign @cheftako @andrewsykim |
/assign @nckturner |
/approve bug fix |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jsafrane, nckturner The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
…7243-upstream-release-1.27 Automated cherry pick of #117243: Add node check to vSphere cloud provider
…7243-upstream-release-1.26 Automated cherry pick of #117243: Add node check to vSphere cloud provider
…7243-upstream-release-1.25 Automated cherry pick of #117243: Add node check to vSphere cloud provider
What type of PR is this?
/kind bug
What this PR does / why we need it:
At KCM startup, vSphere cloud provider builds a cache from
NodeAdded()
events from an informer. But these events are asynchronous, the cloud provider may need information from the cache earlier, for example when detaching a volume from a node at KCM startup. If a node is missing in the cache, the cloud provider treats such a detach as successful, which is wrong. Such a volume will be attached to the node forever.To prevent this issue:
Try
nodeLister
from the informer before declaring a node as not found. A/D controller starts after its node informer has been synced.Read API server before declaring a node as not found. To be 100% sure a node has been really deleted and the informer does not have stale data.
Which issue(s) this PR fixes:
Fixes #117091
Special notes for your reviewer:
Since vSphere CSI migration is GA, the vSphere cloud provider is used only to detach volumes from 1.25 and older nodes (unless the migration is disabled).
I've tested this with hacked errors in the cloud provider to fail Detach requests + restarted KCM - with this PR, I can see in logs:
Which means the node is not in NodeManager's cache and the old cloud provider would treat all volumes used by this node as detached. With this PR, the cloud provider actually detaches volumes from the node.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: