
Rework volume reconstruction #108180

Closed
jsafrane wants to merge 9 commits

Conversation

jsafrane
Member

@jsafrane commented Feb 17, 2022

What type of PR is this?

Somewhere between feature and cleanup: it should not change kubelet behavior (too much), and it is necessary for an upcoming feature.

/kind cleanup
/kind feature

What this PR does / why we need it:

Right now, kubelet reconstructs volumes that are mounted on the host only after its desired state of world (DSW) has been populated, and it reconstructs only volumes that are not in the DSW (volumes that are in the DSW will be "fixed" by mounting them in the usual way).

Split volume reconstruction into three distinct steps (a conceptual sketch follows the list):

  1. Reconstruct all mounted volumes on the host right when kubelet starts.

    • Reconstruct only information that does not need access to the API server.
    • Add these volumes to the actual state of world (ASW) as uncertain.
    • Keep a record of the volumes that failed reconstruction and of all reconstructed volumes.
  2. After the DesiredStateOfWorld is fully populated and kubelet can actually check which volumes are needed, force clean up volumes that failed reconstruction and are not needed by any pods. (This is also done without this PR.)

  3. After the DesiredStateOfWorld is fully populated, try to update the devicePaths of reconstructed volumes from node.status.volumesAttached, because the devicePaths obtained from reconstruction may not be accurate.

    • This requires a connection to the API server, which may be established long after kubelet starts.
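
The intended ordering can be summarized with a small, self-contained Go sketch. This is not the actual kubelet code; the reconciler type, channel, and method names below (reconstructAllVolumes, cleanupOrphanedMounts, updateDevicePathsFromNode, populatorHasAddedPodsCh, reconstructedVolume) are hypothetical stand-ins used only to illustrate the three phases.

package main

import "fmt"

// Hypothetical stand-ins for kubelet's reconciler state; the real types and
// method names differ. This only illustrates the ordering of the three phases.
type reconstructedVolume struct{ volumeName, devicePath string }

type reconciler struct {
	populatorHasAddedPodsCh chan struct{} // closed once the DSW populator has finished
}

// Phase 1: scan /var/lib/kubelet/pods right after kubelet start and add every
// mount found there to the actual state of world (ASW) as "uncertain". Only
// local information is used; no API server access is needed here.
func (rc *reconciler) reconstructAllVolumes() (ok []reconstructedVolume, failed []string) {
	return nil, nil
}

// Phase 2: after the DSW is populated, force clean up volumes that failed
// reconstruction and are not wanted by any pod.
func (rc *reconciler) cleanupOrphanedMounts(failed []string) {}

// Phase 3: refresh devicePaths from node.status.volumesAttached, since the
// paths obtained during reconstruction may be inaccurate.
func (rc *reconciler) updateDevicePathsFromNode(vols []reconstructedVolume) {}

func (rc *reconciler) runReconstruction() {
	ok, failed := rc.reconstructAllVolumes()
	<-rc.populatorHasAddedPodsCh // phases 2 and 3 need the DSW (and thus the API server)
	rc.cleanupOrphanedMounts(failed)
	rc.updateDevicePathsFromNode(ok)
	fmt.Printf("reconstructed %d volumes, %d failed\n", len(ok), len(failed))
}

func main() {
	rc := &reconciler{populatorHasAddedPodsCh: make(chan struct{})}
	close(rc.populatorHasAddedPodsCh) // pretend the DSW populator has already finished
	rc.runReconstruction()
}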

I renamed a few functions/variables along the way to better match their purpose and hid all changed functionality in reconstruct_new.go, behind the SELinuxMountReadWriteOncePod feature gate. This is a prerequisite of https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1710-selinux-relabeling, where we need to know the SELinux label of a volume mount in the ASW and compare it with the desired SELinux label from the desired state of world (DSW). The current reconstruction (which puts into the ASW only volumes that need to be unmounted) does not populate the SELinux label from existing mounts.
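
As a rough illustration of the feature-gate wiring only: the reconciler stub and the runOld/runNew methods below are hypothetical, and the snippet assumes it is built inside the kubernetes repo module; only the feature-gate check itself uses real Kubernetes packages, with the gate name taken from the PR description above.

package main

import (
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/kubernetes/pkg/features"
)

type reconciler struct{}

func (rc *reconciler) runOld(stopCh <-chan struct{}) {} // pre-existing reconstruction path
func (rc *reconciler) runNew(stopCh <-chan struct{}) {} // new path from reconstruct_new.go

// Run picks the reconstruction path based on the SELinuxMountReadWriteOncePod gate,
// so the default behavior stays unchanged while the gate is off.
func (rc *reconciler) Run(stopCh <-chan struct{}) {
	if utilfeature.DefaultFeatureGate.Enabled(features.SELinuxMountReadWriteOncePod) {
		rc.runNew(stopCh)
		return
	}
	rc.runOld(stopCh)
}

func main() {
	(&reconciler{}).Run(make(chan struct{}))
}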

TODO:

  • Exponential backoff for retrieving the Node object from the API server.

Special notes for your reviewer:

Tested with in-tree AWS EBS, in-tree iSCSI and CSI AWS EBS.

Does this PR introduce a user-facing change?

Kubelet now reconstructs its full cache of mounted volumes after restart; previously it reconstructed only volumes that were not used by any pod to be able to unmount them.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

[KEP]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1710-selinux-relabeling

@k8s-ci-robot added the release-note, size/L, kind/cleanup, kind/feature, do-not-merge/needs-kind, cncf-cla: yes, do-not-merge/needs-sig and needs-triage labels and removed the do-not-merge/needs-kind label Feb 17, 2022
@k8s-ci-robot
Contributor

@jsafrane: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-priority label Feb 17, 2022
@k8s-ci-robot added the area/kubelet, sig/node and approved labels and removed the do-not-merge/needs-sig label Feb 17, 2022
@jsafrane
Member Author

cc @jingxu97 @gnufied, trying to rework volume reconstruction as we discussed yesterday. To me it looks like it's working; tested only in local-up-cluster and with a limited number of volumes (in-tree AWS EBS, CSI AWS EBS and in-tree iSCSI).

@jsafrane
Member Author

Some test failures may be genuine

[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
...
Unfortunately, an error has occurred:
timed out waiting for the condition

@jsafrane
Member Author

Hmm, kind runs kube-apiserver as a static pod, i.e. there is no API server when kubelet starts. With this PR, kubelet needs to populate the desired state of world before starting any pod, i.e. it needs the API server.

@jsafrane
Member Author

I reworked the PR a bit: the DSW populator, ASW reconstruction, and reconcile() now all run in parallel; however, any unmount is blocked until the DSW is fully populated.
Works with in-tree iSCSI and kind.
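
To make the blocking concrete, here is a toy, self-contained Go sketch of the idea: reconcile() can run right away, but it skips unmounts until the DSW populator signals completion. All names here (dswPopulated, dswFullyPopulated) are invented for this illustration and do not match the real reconciler.

package main

import "fmt"

type reconciler struct {
	dswPopulated chan struct{} // closed by the DSW populator once all existing pods were added
}

// dswFullyPopulated reports whether the populator has finished, without blocking.
func (rc *reconciler) dswFullyPopulated() bool {
	select {
	case <-rc.dswPopulated:
		return true
	default:
		return false
	}
}

func (rc *reconciler) reconcile() {
	// Mounting can proceed immediately; a volume in the DSW is always wanted.
	fmt.Println("mounting volumes that are in DSW but not yet in ASW")

	// Unmounting before the DSW is complete could tear down volumes whose pods
	// simply have not been added to the DSW yet, so it is blocked.
	if !rc.dswFullyPopulated() {
		fmt.Println("skipping unmounts, DSW not fully populated yet")
		return
	}
	fmt.Println("unmounting volumes that are in ASW but not in DSW")
}

func main() {
	rc := &reconciler{dswPopulated: make(chan struct{})}
	rc.reconcile()         // before the populator finishes: no unmounts
	close(rc.dswPopulated) // DSW populator is done
	rc.reconcile()         // now unmounts are allowed
}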

if !podExists || podObj.volumeMountStateForPod == operationexecutor.VolumeMountUncertain {
	// Add new mountedPod or update existing uncertain one - the new markVolumeOpts may
	// have updated information. Especially reconstructed volumes (marked as uncertain
	// during reconstruction) need update.
	podObj = mountedPod{
Member

Does not doing this result in real breakage? (Sorry, just curious; I suspected it would, but I don't know for sure.)

Member Author

I don't know; you told me that we can't trust reconstructed volumes, and I think we should not trust Uncertain volumes either, so I update everything.

rc.sync()
}
}
go rc.sync()
Member

@gnufied Feb 17, 2022

So basically reconcile can run in parallel with reconstruction, and we do not wait for the DSW to be populated before doing reconstruction? This means that for any volume type where reconstruction fails, the volume/mount point may leak.

Member Author

I see; reconstruction needs the DSW populated here to force-unmount volumes where the reconstruction itself failed:

volumeInDSW := rc.desiredStateOfWorld.VolumeExistsWithSpecName(volume.podName, volume.volumeSpecName)
reconstructedVolume, err := rc.reconstructVolume(volume)
if err != nil {
	if volumeInDSW {
		// Some pod needs the volume, don't clean it up and hope that
		// reconcile() calls SetUp and reconstructs the volume in ASW.
		klog.V(4).InfoS("Volume exists in desired state, skip cleaning up mounts", "podName", volume.podName, "volumeSpecName", volume.volumeSpecName)
		continue
	}
	// No pod needs the volume.
	klog.InfoS("Could not construct volume information, cleaning up mounts", "podName", volume.podName, "volumeSpecName", volume.volumeSpecName, "err", err)
	rc.cleanupMounts(volume)
	continue

So, back to the drawing board. I was thinking about adding just an "uncertain tombstone" to ASW, where the reconstruction sync would just mark that there is a dir for the volume in /var/lib/kubelet/pods and the actual reconstruction would happen in UnmountVolume / MountVolume operations. We could try reconstruction a couple of times, and if it fails for too long, then force unmount. IMO it would be cleaner than today's approach (try once), but it's also more complicated.

Member Author

I was thinking about adding just an "uncertain tombstone" to ASW, where the reconstruction sync would just mark that there is a dir for the volume in /var/lib/kubelet/pods and the actual reconstruction would happen in UnmountVolume / MountVolume operations

Tried that, failed fast. To mark anything as uncertain, kubelet needs a UniqueVolumeID, and that is not available before reconstruction completes.

@gnufied
Member

gnufied commented Feb 17, 2022

Are the disruptive tests we discussed passing with this PR, btw? I don't think they always run.

@msau42
Member

msau42 commented Feb 18, 2022

/assign @jingxu97

@jsafrane
Member Author

/hold
This PR does not work

  • Reconstruction still needs the DSW populated, see Rework volume reconstruction #108180 (review). That is hard to fix.
  • Since reconstruction can happen after the reconciler finishes mounting a volume, it may overwrite a fully mounted volume as uncertain. That can be fixed relatively easily (see the sketch below).
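
A minimal sketch of that second fix, under the assumption that it is just a guard when writing reconstructed volumes into the ASW, mirroring the check shown in the diff earlier in this thread; the asw, mountState and markUncertain names are simplified stand-ins, not the real cache types.

package main

import "fmt"

type mountState string

const (
	volumeMounted        mountState = "Mounted"
	volumeMountUncertain mountState = "Uncertain"
)

// asw is a toy stand-in for the actual state of world cache, keyed by pod/volume.
type asw struct {
	state map[string]mountState
}

// markUncertain records a reconstructed volume as uncertain, unless the
// reconciler already finished mounting it, in which case the record is kept.
func (a *asw) markUncertain(key string) {
	if a.state[key] == volumeMounted {
		fmt.Println("keeping", key, "as Mounted, not downgrading to Uncertain")
		return
	}
	a.state[key] = volumeMountUncertain
}

func main() {
	a := &asw{state: map[string]mountState{"pod1/vol1": volumeMounted}}
	a.markUncertain("pod1/vol1") // already mounted by reconcile(): kept as Mounted
	a.markUncertain("pod2/vol2") // only reconstructed so far: recorded as Uncertain
	fmt.Println(a.state)
}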

@k8s-ci-robot added the do-not-merge/hold label Feb 21, 2022
@jsafrane
Member Author

I tested reboot with OpenShift 4.10 (= Kubernetes 1.23.3) and with in-tree AWS EBS volumes.
Both with and without this PR (and with the feature enabled), volume reconstruction failed after reboot - the EBS volume plugin needs the volume mounted to be able to find the global mount and the AWS volume ID from it.

If I left Pods in the API server during the reboot, the new kubelet started all of them.
If I removed Pods from the API server during the reboot, the pod-local directory got removed in both cases by our fallback cleaner, and the global dir was not removed (because reconstruction failed).

This PR did not have any visible effect.

@jsafrane
Member Author

/retest

With the new reconstruction, ASW.MarkVolumeAsMounted will update the outer spec name with the correct value from the Pod.
@ehashman
Member

/milestone v1.25

@k8s-ci-robot added this to the v1.25 milestone Mar 24, 2022
@k8s-ci-robot
Contributor

@jsafrane: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ehashman moved this from Triage to Waiting on Author in SIG Node PR Triage Mar 24, 2022
@k8s-ci-robot added the needs-rebase label Mar 24, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jun 22, 2022
@hosseinsalahi

Hi @jsafrane
Bug triage team here!
It looks like this PR has the following linked PRs:

Just checking in to see if this is still on track for k8s 1.25.

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jul 29, 2022
@cici37
Contributor

cici37 commented Aug 3, 2022

Hello 👋, 1.25 Release Lead here.

Unfortunately, this enhancement did not meet the code freeze criteria because there are still unmerged k/k code PRs.

If you still wish to progress this enhancement in v1.25, please file an exception request. Thank you so much!

/milestone clear

@k8s-ci-robot removed this from the v1.25 milestone Aug 3, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

SIG Node PR Triage automation moved this from Waiting on Author to Done Sep 2, 2022