Rework volume reconstruction #108180
Conversation
@jsafrane: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Some test failures may be genuine.
Hmm, kind runs kube-apiserver as a static pod, i.e. there is no API server when kubelet starts. With this PR, kubelet needs to populate the desired state of the world before starting any pod, i.e. it needs the API server.
Force-pushed from bc76e7c to 6324226
I reworked the PR a bit, now both DSW populator, ASW reconstruction and …
if !podExists || podObj.volumeMountStateForPod == operationexecutor.VolumeMountUncertain {
	// Add new mountedPod or update existing uncertain one - the new markVolumeOpts may
	// have updated information. Especially reconstructed volumes (marked as uncertain
	// during reconstruction) need update.
	podObj = mountedPod{
Does not doing this result in real breakage? (Sorry, just curious; I suspected it would, but I don't know for sure.)
I don't know; you told me that we can't trust reconstructed volumes, and I think we should not trust Uncertain volumes either, so I update everything.
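In other words, a minimal sketch of the overwrite rule; mountedPod and volumeMountStateForPod are from the diff above, while updateEntry is a hypothetical helper standing in for the assignment:

```go
switch {
case !podExists:
	// First time this pod shows up for the volume: record it.
	updateEntry(markVolumeOpts)
case podObj.volumeMountStateForPod == operationexecutor.VolumeMountUncertain:
	// The entry came from reconstruction or an interrupted mount; it may
	// hold guessed values, so overwrite it with the fresh markVolumeOpts.
	updateEntry(markVolumeOpts)
default:
	// operationexecutor.VolumeMounted: a successful MountVolume operation
	// recorded this entry, so keep it as-is.
}
```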
rc.sync()
	}
}
go rc.sync()
So basically reconcile can run in parallel with reconstruction, and we do not wait for DSW to be populated before doing reconstruction? This means that for any volume type where reconstruction fails, the volume/mount point may leak.
I see, reconstruction needs DSW populated here to force unmount volumes where the reconstruction itself failed:
kubernetes/pkg/kubelet/volumemanager/reconciler/reconciler.go
Lines 377 to 390 in f0d5ea1
volumeInDSW := rc.desiredStateOfWorld.VolumeExistsWithSpecName(volume.podName, volume.volumeSpecName)
reconstructedVolume, err := rc.reconstructVolume(volume)
if err != nil {
	if volumeInDSW {
		// Some pod needs the volume, don't clean it up and hope that
		// reconcile() calls SetUp and reconstructs the volume in ASW.
		klog.V(4).InfoS("Volume exists in desired state, skip cleaning up mounts", "podName", volume.podName, "volumeSpecName", volume.volumeSpecName)
		continue
	}
	// No pod needs the volume.
	klog.InfoS("Could not construct volume information, cleaning up mounts", "podName", volume.podName, "volumeSpecName", volume.volumeSpecName, "err", err)
	rc.cleanupMounts(volume)
	continue
So, back to the drawing board. I was thinking about adding just an "uncertain tombstone" to ASW, where the reconstruction sync would just mark that there is a dir for the volume in /var/lib/kubelet/pods and the actual reconstruction would happen in UnmountVolume / MountVolume operations. We could try reconstruction a couple of times and, if it fails for too long, force-unmount. IMO it would be cleaner than today's approach (try once), but it's also more complicated.
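A hypothetical sketch of that idea; none of these names exist in kubelet, they only illustrate the proposed flow:

```go
// uncertainTombstone records a pod directory whose volume could not be
// reconstructed yet.
type uncertainTombstone struct {
	podDir   string // directory found under /var/lib/kubelet/pods
	attempts int    // how many reconstruction attempts have failed so far
}

const maxReconstructionAttempts = 5 // arbitrary retry budget

func (rc *reconciler) handleTombstone(t *uncertainTombstone) {
	if err := rc.tryReconstruct(t.podDir); err != nil {
		t.attempts++
		if t.attempts >= maxReconstructionAttempts {
			// Reconstruction kept failing; give up and force-unmount
			// whatever is still mounted under the pod directory.
			rc.forceUnmount(t.podDir)
		}
		return // otherwise retry on the next sync
	}
	// Reconstruction succeeded; the volume is now a regular ASW entry.
}
```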
I was thinking about adding just an "uncertain tombstone" to ASW, where the reconstruction sync would just mark that there is a dir for the volume in /var/lib/kubelet/pods and the actual reconstruction would happen in UnmountVolume / MountVolume operations
Tried that, failed fast. To mark anything as uncertain, kubelet needs a UniqueVolumeID, and that is not available before reconstruction completes.
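A fragment showing where the idea breaks down; util.GetUniqueVolumeNameFromSpec is the real helper in pkg/volume/util, the remaining names are simplified:

```go
// ASW entries are keyed by v1.UniqueVolumeName, and that name can only be
// derived from a volume spec + plugin, i.e. after reconstruction succeeded.
spec, plugin, err := reconstructSpec(podDir) // simplified; this is exactly the step that fails
if err != nil {
	// No spec, no plugin, no UniqueVolumeName - there is nothing to key
	// an "uncertain" ASW entry by.
	return err
}
uniqueName, err := util.GetUniqueVolumeNameFromSpec(plugin, spec)
// uniqueName would key the uncertain ASW entry - but we never get here.
```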
Are the disruptive tests we discussed passing with this PR, btw? I don't think they always run.
/assign @jingxu97
/hold
It always returns nil anyway.
To be able to mark volumes in DSW as "in use" in node.status, the whole DSW needs to be populated first.
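Roughly, the ordering constraint looks like this; HasAddedPods and GetVolumesInUse exist on kubelet's DSW populator and volume manager, while the surrounding flow is simplified:

```go
// Report node.status.volumesInUse only after the DSW populator has seen
// every pod; otherwise volumes of not-yet-processed pods would be missing
// from the report and could be detached while still mounted.
if !dswPopulator.HasAddedPods() {
	return // try again on the next node status update
}
node.Status.VolumesInUse = volumeManager.GetVolumesInUse()
```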
I tested reboot with OpenShift 4.10 (= Kubernetes 1.23.3) and with in-tree AWS EBS volumes. If I left Pods in the API server during the reboot, the new kubelet started all of them. This PR did not have any visible effect.
/retest
Force-pushed from 4ca9ba3 to a541d12
With the new reconstruction, ASW.MarkVolumeAsMounted will update the outer spec name with the correct value from the Pod.
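Roughly what that means, assuming the operationexecutor.MarkVolumeOpts fields; the values shown are illustrative:

```go
// When reconcile() mounts the volume for a real pod, the MarkVolumeOpts
// carry the authoritative outer spec name from the Pod object, replacing
// whatever name was guessed for the reconstructed (uncertain) ASW entry.
opts := operationexecutor.MarkVolumeOpts{
	PodName:             podName,
	VolumeName:          volumeName,
	OuterVolumeSpecName: pod.Spec.Volumes[i].Name, // from the Pod, not from disk
	VolumeMountState:    operationexecutor.VolumeMounted,
}
err := actualStateOfWorld.MarkVolumeAsMounted(opts)
```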
Force-pushed from a541d12 to 88348d7
/milestone v1.25
@jsafrane: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
Hi @jsafrane
Just checking in to see if this is still on track for k8s 1.25.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
Hello 👋, 1.25 Release Lead here. Unfortunately, this enhancement did not meet the code freeze criteria because there are still unmerged k/k code PRs. If you still wish to progress this enhancement in v1.25, please file an exception request. Thank you so much! /milestone clear
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close
@k8s-triage-robot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What type of PR is this?
Somewhere between feature and cleanup - it should not change kubelet behavior (too much), and it's necessary for an upcoming feature.
/kind cleanup
/kind feature
What this PR does / why we need it:
Right now, kubelet reconstructs volumes that are mounted on the host only after kubelet's desired state of the world (DSW) has been populated. And it reconstructs only volumes that are not in the DSW (because volumes that are in the DSW will be "fixed" by mounting them in the usual way).
Split volume reconstruction into three distinct steps:
Reconstruct all mounted volumes on the host right when kubelet starts.
After DesiredStateOfWorld is fully populated and kubelet can actually check what volumes are needed, force clean up volumes that failed reconstruction and are not needed by any pods. (This happens even without this PR.)
After DesiredStateOfWorld is fully populated, try to update devicePaths of reconstructed volumes from node.status.volumesAttached, because devicePaths obtained from reconstruction may not be accurate. (A condensed sketch of the reworked flow appears below.)

I renamed a few functions / variables on the way to better match their purpose and hid all changed functionality in reconstruct_new.go behind the SELinuxMountReadWriteOncePod feature gate. This is a prerequisite of https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1710-selinux-relabeling, where we need to know the SELinux label of a volume mount in ASW and compare it with the desired SELinux label from the desired state of the world (DSW). The current reconstruction (which puts only volumes that need to be unmounted into ASW) will not populate the SELinux label from existing mounts.
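For reviewers, a condensed sketch of the three steps above; the function and channel names here are illustrative, not the actual reconstruct_new.go API:

```go
// Illustrative only - the real names in reconstruct_new.go differ.
func (rc *reconciler) runNewReconstruction() {
	// 1. Reconstruct all volumes mounted under /var/lib/kubelet/pods right
	//    at kubelet start, before DSW exists. Everything is added to ASW
	//    as "uncertain".
	rc.reconstructAllMountedVolumes()

	// reconcile() can already run in parallel from here on.

	// 2. Once DSW is fully populated, force-unmount volumes whose
	//    reconstruction failed and that no pod needs.
	<-rc.dswPopulatedCh
	rc.cleanupOrphanedMounts()

	// 3. Fix up devicePaths of reconstructed volumes from
	//    node.status.volumesAttached; paths guessed during reconstruction
	//    may be inaccurate.
	rc.updateReconstructedDevicePaths()
}
```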
TODO:
Special notes for your reviewer:
Tested with in-tree AWS EBS, in-tree iSCSI and CSI AWS EBS.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: