
Rework volume reconstruction #108180

Closed
jsafrane wants to merge 9 commits

Conversation

jsafrane
Member

@jsafrane commented Feb 17, 2022

What type of PR is this?

Somewhere between feature and cleanup: it should not change kubelet behavior (too much), and it is necessary for an upcoming feature.

/kind cleanup
/kind feature

What this PR does / why we need it:

Right now, kubelet reconstructs volumes that are mounted on the host only after its desired state of world (DSW) has been populated, and it reconstructs only volumes that are not in the DSW (volumes that are in the DSW will be "fixed" by mounting them in the usual way).

Split volume reconstruction into three distinct steps (a conceptual sketch follows the list):

  1. Reconstruct all mounted volumes on the host right when kubelet starts.

    • Reconstruct only information that does not need access to the API server.
    • Add these volumes to the actual state of world (ASW) as uncertain.
    • Keep a record of the volumes that failed reconstruction and of all reconstructed volumes.
  2. After the DesiredStateOfWorld is fully populated and kubelet can actually check which volumes are needed, force clean up volumes that failed reconstruction and are not needed by any pods. (This is also done without this PR.)

  3. After the DesiredStateOfWorld is fully populated, try to update the devicePaths of reconstructed volumes from node.status.volumesAttached, because the devicePaths obtained from reconstruction may not be accurate.

    • This requires a connection to the API server, which may be established long after kubelet starts.
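
The intended ordering can be summarized with a small, self-contained Go sketch. This is not the actual kubelet code; the reconciler type, channel, and method names below (reconstructAllVolumes, cleanupOrphanedMounts, updateDevicePathsFromNode, populatorHasAddedPodsCh, reconstructedVolume) are hypothetical stand-ins used only to illustrate the three phases.

package main

import "fmt"

// Hypothetical stand-ins for kubelet's reconciler state; the real types and
// method names differ. This only illustrates the ordering of the three phases.
type reconstructedVolume struct{ volumeName, devicePath string }

type reconciler struct {
	populatorHasAddedPodsCh chan struct{} // closed once the DSW populator has finished
}

// Phase 1: scan /var/lib/kubelet/pods right after kubelet start and add every
// mount found there to the actual state of world (ASW) as "uncertain". Only
// local information is used; no API server access is needed here.
func (rc *reconciler) reconstructAllVolumes() (ok []reconstructedVolume, failed []string) {
	return nil, nil
}

// Phase 2: after the DSW is populated, force clean up volumes that failed
// reconstruction and are not wanted by any pod.
func (rc *reconciler) cleanupOrphanedMounts(failed []string) {}

// Phase 3: refresh devicePaths from node.status.volumesAttached, since the
// paths obtained during reconstruction may be inaccurate.
func (rc *reconciler) updateDevicePathsFromNode(vols []reconstructedVolume) {}

func (rc *reconciler) runReconstruction() {
	ok, failed := rc.reconstructAllVolumes()
	<-rc.populatorHasAddedPodsCh // phases 2 and 3 need the DSW (and thus the API server)
	rc.cleanupOrphanedMounts(failed)
	rc.updateDevicePathsFromNode(ok)
	fmt.Printf("reconstructed %d volumes, %d failed\n", len(ok), len(failed))
}

func main() {
	rc := &reconciler{populatorHasAddedPodsCh: make(chan struct{})}
	close(rc.populatorHasAddedPodsCh) // pretend the DSW populator has already finished
	rc.runReconstruction()
}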

I renamed a few functions/variables along the way to better match their purpose and hid all changed functionality in reconstruct_new.go, behind the SELinuxMountReadWriteOncePod feature gate. This is a prerequisite of https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1710-selinux-relabeling, where we need to know the SELinux label of a volume mount in the ASW and compare it with the desired SELinux label from the desired state of world (DSW). The current reconstruction (which puts into the ASW only volumes that need to be unmounted) does not populate the SELinux label from existing mounts.
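
As a rough illustration of the feature-gate wiring only: the reconciler stub and the runOld/runNew methods below are hypothetical, and the snippet assumes it is built inside the kubernetes repo module; only the feature-gate check itself uses real Kubernetes packages, with the gate name taken from the PR description above.

package main

import (
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/kubernetes/pkg/features"
)

type reconciler struct{}

func (rc *reconciler) runOld(stopCh <-chan struct{}) {} // pre-existing reconstruction path
func (rc *reconciler) runNew(stopCh <-chan struct{}) {} // new path from reconstruct_new.go

// Run picks the reconstruction path based on the SELinuxMountReadWriteOncePod gate,
// so the default behavior stays unchanged while the gate is off.
func (rc *reconciler) Run(stopCh <-chan struct{}) {
	if utilfeature.DefaultFeatureGate.Enabled(features.SELinuxMountReadWriteOncePod) {
		rc.runNew(stopCh)
		return
	}
	rc.runOld(stopCh)
}

func main() {
	(&reconciler{}).Run(make(chan struct{}))
}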

TODO:

  • Exponential backoff for retrieving the Node object from the API server.

Special notes for your reviewer:

Tested with in-tree AWS EBS, in-tree iSCSI and CSI AWS EBS.

Does this PR introduce a user-facing change?

Kubelet now reconstructs its full cache of mounted volumes after restart; previously it reconstructed only volumes that were not used by any pod to be able to unmount them.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

[KEP]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1710-selinux-relabeling

@k8s-ci-robot added the release-note, size/L, kind/cleanup, kind/feature, do-not-merge/needs-kind, cncf-cla: yes, do-not-merge/needs-sig and needs-triage labels and removed the do-not-merge/needs-kind label Feb 17, 2022
@k8s-ci-robot
Contributor

@jsafrane: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-priority label Feb 17, 2022
@k8s-ci-robot added the area/kubelet, sig/node and approved labels and removed the do-not-merge/needs-sig label Feb 17, 2022
@jsafrane
Member Author

cc @jingxu97 @gnufied, trying to rework volume reconstruction as we discussed yesterday. To me it looks like it's working; tested only in local-up-cluster and with a limited number of volumes (in-tree AWS EBS, CSI AWS EBS and in-tree iSCSI).

@jsafrane
Member Author

Some test failures may be genuine

[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
...
Unfortunately, an error has occurred:
timed out waiting for the condition

@jsafrane
Member Author

Hmm, kind runs kube-apiserver as a static pod, i.e. there is no API server when kubelet starts. With this PR, kubelet needs to populate the desired state of world before starting any pod, i.e. it needs the API server.

@jsafrane
Member Author

I reworked the PR a bit: the DSW populator, ASW reconstruction, and reconcile() now all run in parallel; however, any unmount is blocked until the DSW is fully populated.
Works with in-tree iSCSI and kind.
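
To make the blocking concrete, here is a toy, self-contained Go sketch of the idea: reconcile() can run right away, but it skips unmounts until the DSW populator signals completion. All names here (dswPopulated, dswFullyPopulated) are invented for this illustration and do not match the real reconciler.

package main

import "fmt"

type reconciler struct {
	dswPopulated chan struct{} // closed by the DSW populator once all existing pods were added
}

// dswFullyPopulated reports whether the populator has finished, without blocking.
func (rc *reconciler) dswFullyPopulated() bool {
	select {
	case <-rc.dswPopulated:
		return true
	default:
		return false
	}
}

func (rc *reconciler) reconcile() {
	// Mounting can proceed immediately; a volume in the DSW is always wanted.
	fmt.Println("mounting volumes that are in DSW but not yet in ASW")

	// Unmounting before the DSW is complete could tear down volumes whose pods
	// simply have not been added to the DSW yet, so it is blocked.
	if !rc.dswFullyPopulated() {
		fmt.Println("skipping unmounts, DSW not fully populated yet")
		return
	}
	fmt.Println("unmounting volumes that are in ASW but not in DSW")
}

func main() {
	rc := &reconciler{dswPopulated: make(chan struct{})}
	rc.reconcile()         // before the populator finishes: no unmounts
	close(rc.dswPopulated) // DSW populator is done
	rc.reconcile()         // now unmounts are allowed
}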

if !podExists || podObj.volumeMountStateForPod == operationexecutor.VolumeMountUncertain {
	// Add new mountedPod or update existing uncertain one - the new markVolumeOpts may
	// have updated information. Especially reconstructed volumes (marked as uncertain
	// during reconstruction) need update.
	podObj = mountedPod{
Member

Does not doing this result in real breakage? (Sorry, just curious; I suspected it would, but I don't know for sure.)

Member Author

I don't know; you told me that we can't trust reconstructed volumes, and I think we should not trust Uncertain volumes either, so I update everything.

rc.sync()
}
}
go rc.sync()
Member

@gnufied Feb 17, 2022

So basically reconcile can run in parallel with reconstruction, and we do not wait for the DSW to be populated before doing reconstruction? This means that for any volume type where reconstruction fails, the volume/mount point may leak.

Member Author

I see; reconstruction needs the DSW populated here to force-unmount volumes where the reconstruction itself failed:

volumeInDSW := rc.desiredStateOfWorld.VolumeExistsWithSpecName(volume.podName, volume.volumeSpecName)
reconstructedVolume, err := rc.reconstructVolume(volume)
if err != nil {
	if volumeInDSW {
		// Some pod needs the volume, don't clean it up and hope that
		// reconcile() calls SetUp and reconstructs the volume in ASW.
		klog.V(4).InfoS("Volume exists in desired state, skip cleaning up mounts", "podName", volume.podName, "volumeSpecName", volume.volumeSpecName)
		continue
	}
	// No pod needs the volume.
	klog.InfoS("Could not construct volume information, cleaning up mounts", "podName", volume.podName, "volumeSpecName", volume.volumeSpecName, "err", err)
	rc.cleanupMounts(volume)
	continue

So, back to the drawing board. I was thinking about adding just an "uncertain tombstone" to ASW, where the reconstruction sync would just mark that there is a dir for the volume in /var/lib/kubelet/pods and the actual reconstruction would happen in UnmountVolume / MountVolume operations. We could try reconstruction a couple of times, and if it fails for too long, then force unmount. IMO it would be cleaner than today's approach (try once), but it's also more complicated.

Member Author

I was thinking about adding just an "uncertain tombstone" to ASW, where the reconstruction sync would just mark that there is a dir for the volume in /var/lib/kubelet/pods and the actual reconstruction would happen in UnmountVolume / MountVolume operations

Tried that, failed fast. To mark anything as uncertain, kubelet needs a UniqueVolumeID, and that is not available before reconstruction completes.

@gnufied
Member

gnufied commented Feb 17, 2022

Are the disruptive tests we discussed passing with this PR, btw? I don't think they always run.

@msau42
Member

msau42 commented Feb 18, 2022

/assign @jingxu97

@jsafrane
Member Author

/hold
This PR does not work

  • Reconstruction still needs the DSW populated, see Rework volume reconstruction #108180 (review). That is hard to fix.
  • Since reconstruction can happen after the reconciler finishes mounting a volume, it may overwrite a fully mounted volume as uncertain. That can be fixed relatively easily (see the sketch below).
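
A minimal sketch of that second fix, under the assumption that it is just a guard when writing reconstructed volumes into the ASW, mirroring the check shown in the diff earlier in this thread; the asw, mountState and markUncertain names are simplified stand-ins, not the real cache types.

package main

import "fmt"

type mountState string

const (
	volumeMounted        mountState = "Mounted"
	volumeMountUncertain mountState = "Uncertain"
)

// asw is a toy stand-in for the actual state of world cache, keyed by pod/volume.
type asw struct {
	state map[string]mountState
}

// markUncertain records a reconstructed volume as uncertain, unless the
// reconciler already finished mounting it, in which case the record is kept.
func (a *asw) markUncertain(key string) {
	if a.state[key] == volumeMounted {
		fmt.Println("keeping", key, "as Mounted, not downgrading to Uncertain")
		return
	}
	a.state[key] = volumeMountUncertain
}

func main() {
	a := &asw{state: map[string]mountState{"pod1/vol1": volumeMounted}}
	a.markUncertain("pod1/vol1") // already mounted by reconcile(): kept as Mounted
	a.markUncertain("pod2/vol2") // only reconstructed so far: recorded as Uncertain
	fmt.Println(a.state)
}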

@k8s-ci-robot added the do-not-merge/hold label Feb 21, 2022
@jsafrane
Member Author

I tested reboot with OpenShift 4.10 (= Kubernetes 1.23.3) and with in-tree AWS EBS volumes.
Both with and without this PR (and with the feature enabled), volume reconstruction failed after reboot - the EBS volume plugin needs the volume mounted to be able to find the global mount and the AWS volume ID from it.

If I left Pods in the API server during the reboot, the new kubelet started all of them.
If I removed Pods from the API server during the reboot, the pod-local directory got removed in both cases by our fallback cleaner, and the global dir was not removed (because reconstruction failed).

This PR did not have any visible effect.

@jsafrane
Member Author

/retest

With the new reconstruction, ASW.MarkVolumeAsMounted will update the outer spec name with the correct value from the Pod.
@ehashman
Member

/milestone v1.25

@k8s-ci-robot added this to the v1.25 milestone Mar 24, 2022
@k8s-ci-robot
Contributor

@jsafrane: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ehashman moved this from Triage to Waiting on Author in SIG Node PR Triage Mar 24, 2022
@k8s-ci-robot added the needs-rebase label Mar 24, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jun 22, 2022
@hosseinsalahi

Hi @jsafrane
Bug triage team here!
It looks like this PR has the following linked PRs:

Just checking in to see if this is still on track for k8s 1.25.

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jul 29, 2022
@cici37
Contributor

cici37 commented Aug 3, 2022

Hello 👋, 1.25 Release Lead here.

Unfortunately, this enhancement did not meet the code freeze criteria because there are still unmerged k/k code PRs.

If you still wish to progress this enhancement in v1.25, please file an exception request. Thank you so much!

/milestone clear

@k8s-ci-robot removed this from the v1.25 milestone Aug 3, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

SIG Node PR Triage automation moved this from Waiting on Author to Done Sep 2, 2022