
Data lost by k3s-uninstall.sh #3264

Closed
angelnu opened this issue May 3, 2021 · 9 comments

Comments

@angelnu
Contributor

angelnu commented May 3, 2021

Environmental Info:
K3s Version:

k3s version v1.20.6+k3s1 (8d043282)
go version go1.15.10

Node(s) CPU architecture, OS, and Version:

Linux test-k3s2 5.4.0-72-generic #80-Ubuntu SMP Mon Apr 12 17:35:00 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

3 servers

Describe the bug:

The k3s-killall.sh script does not unmount all folders under /var/lib/kubelet. Specifically, it does not unmount the CSI Ceph mount points, which are placed under /var/lib/kubelet/plugins/kubernetes.io/csi/pv. As a result, k3s-uninstall.sh later deletes their contents when it runs rm -rf /var/lib/kubelet.
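
For reference, a minimal sketch of the killall logic as I understand it (assumed shape, reconstructed from the script's shell trace later in this thread, not verbatim; only the relevant prefix is shown):

# k3s-killall.sh (sketch): unmount everything under a path prefix, deepest first
do_unmount() {
    awk -v path="$1" '$2 ~ ("^" path) { print $2 }' /proc/self/mounts \
        | sort -r | xargs -r -t -n 1 umount
}

do_unmount '/var/lib/kubelet/pods'   # misses /var/lib/kubelet/plugins/...

# k3s-uninstall.sh then deletes straight through the still-mounted CSI volumes:
rm -rf /var/lib/kubelet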

Steps To Reproduce:

  1. Install K3s
  2. Install the Ceph CSI driver
  3. Deploy a pod with a static CephFS volume (I use a cluster on Proxmox bare metal)
  4. Put some data in the mounted Ceph volume
  5. Run k3s-uninstall.sh
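
Before step 5, the mounts that killall leaves behind can be listed (the exact paths will vary):

mount | grep /var/lib/kubelet/plugins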

Expected behavior:

  • All mounts under /var/lib/kubelet are unmounted
  • The static Ceph volume content is untouched

Actual behavior:

Ceph volume content is lost.

Additional context / logs:

NA

@angelnu
Contributor Author

angelnu commented May 3, 2021

My proposal would be to:

  • unmount all /var/lib/kubelet mounts in k3s-killall.sh, not only those under /var/lib/kubelet/pods
  • ensure that rm -rf /var/lib/kubelet does not cross filesystem boundaries (in case a mount point could not be unmounted for any reason)

If needed I could propose a PR.
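
A minimal sketch of what those two changes could look like (assuming GNU rm for --one-file-system; not a final patch):

# 1) k3s-killall.sh: widen the unmount prefix from /var/lib/kubelet/pods
#    to all of /var/lib/kubelet
awk -v path=/var/lib/kubelet '$2 ~ ("^" path) { print $2 }' /proc/self/mounts \
    | sort -r | xargs -r -t -n 1 umount

# 2) k3s-uninstall.sh: never delete across a filesystem boundary, so a mount
#    that could not be unmounted keeps its data
rm -rf --one-file-system /var/lib/kubelet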

angelnu added a commit to angelnu/k3s that referenced this issue May 3, 2021
@brandond brandond self-assigned this May 4, 2021
@brandond brandond added this to the v1.21.1+k3s1 milestone May 4, 2021
@brandond brandond added this to To Triage in Development [DEPRECATED] via automation May 4, 2021
@brandond brandond moved this from To Triage to Backlog in Development [DEPRECATED] May 4, 2021
@brandond brandond moved this from Backlog to Working in Development [DEPRECATED] May 4, 2021
@brandond brandond moved this from Working to Peer Review in Development [DEPRECATED] May 4, 2021
@brandond brandond moved this from Peer Review to To Test in Development [DEPRECATED] May 4, 2021
Development [DEPRECATED] automation moved this from To Test to Done Issue / Merged PR May 4, 2021
@brandond brandond reopened this May 4, 2021
Development [DEPRECATED] automation moved this from Done Issue / Merged PR to Working May 4, 2021
@brandond brandond moved this from Working to Done Issue / Merged PR in Development [DEPRECATED] May 4, 2021
@brandond brandond moved this from Done Issue / Merged PR to To Test in Development [DEPRECATED] May 4, 2021
@angelnu
Contributor Author

angelnu commented May 4, 2021

@bradtopol - thanks for merging!

Would you consider a backport of this fix to 1.20? I would be happy to trigger a PR there.

@brandond
Contributor

brandond commented May 4, 2021

@angelnu install.sh is only served off master, and is live as soon as merged - so there's no point in backporting it. You will need to re-run the installer to get the updated uninstall script though.
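
For example, re-running the standard installer is enough to refresh the scripts (environment variables and flags depend on your setup):

curl -sfL https://get.k3s.io | sh -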

@angelnu
Contributor Author

angelnu commented May 4, 2021

I see - I did a test to see if I was getting the fix, but it did not work for me. The reason is that I install with Ansible, and it turns out that project keeps a derived copy of install.sh at https://github.com/PyratLabs/ansible-role-k3s/blob/main/templates/k3s-killall.sh.j2

I will do a PR to commit the fix there as well.

Thanks!

Update: opened PyratLabs/ansible-role-k3s#113

angelnu added a commit to angelnu/ansible-role-k3s that referenced this issue May 4, 2021
@ShylajaDevadiga
Contributor

@angelnu Following the steps to reproduce, I am seeing that not all mounts in /var/lib/kubelet are unmounted after running k3s-uninstall.

$ ps aux|grep k3s
ubuntu    278043  0.0  0.0   8160   736 pts/0    S+   15:58   0:00 grep --color=auto k3s
$ mount |grep kubelet
devtmpfs on /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-df60e9f8-22ac-486a-a3c1-fbba60a48649/dev/6113cbff-7f98-479b-be10-45687c50e6c1 type devtmpfs (rw,relatime,size=2008280k,nr_inodes=502070,mode=755)

During uninstall, the trace shows the target is busy:

+ do_unmount_and_remove /var/lib/kubelet/plugins
+ awk -v path=/var/lib/kubelet/plugins '$2 ~ ("^" path) { print $2 }' /proc/self/mounts
+ sort -r
+ xargs -r -t -n 1 sh -c 'umount "$0" && rm -rf "$0"'
sh -c 'umount "$0" && rm -rf "$0"' /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-df60e9f8-22ac-486a-a3c1-fbba60a48649/dev/6113cbff-7f98-479b-be10-45687c50e6c1 
umount: /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-df60e9f8-22ac-486a-a3c1-fbba60a48649/dev/6113cbff-7f98-479b-be10-45687c50e6c1: target is busy.

@angelnu
Contributor Author

angelnu commented May 15, 2021

@ShylajaDevadiga - could you please check what process is keeping the mount busy?

I tested with Ceph, and there the unmount works after killing all the pods (done a few lines earlier in killall). Maybe the volumeDevices plugin requires additional cleanup.

And for confirmation - did the killall abort when hitting the busy error? That should prevent the unexpected delete if the unmount fails.
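
For example, either of these should show what is holding the mount open (placeholder path; take the real one from the umount error):

sudo fuser -vmM /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/<pvc>/dev/<volume>
sudo lsof +f -- /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/<pvc>/dev/<volume>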

@davidnuzik davidnuzik modified the milestones: v1.21.1+k3s1, v1.21.2+k3s1 May 19, 2021
@davidnuzik davidnuzik moved this from To Test to Working in Development [DEPRECATED] May 19, 2021
@ShylajaDevadiga
Contributor

@angelnu Yes, by deleting the pod that uses the PVC, the umount is successful.

kubectl delete pod pod-raw

Without deleting the pod, here are the fuser results, in case it helps.

ubuntu@ip-172-31-33-20:~/csi-driver-host-path/examples$ mount |grep plugin
devtmpfs on /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-f8f344e2-360d-43bf-a4f7-d340a9c706cd/dev/6a7c560a-2491-4447-8527-a23ba9124a11 type devtmpfs (rw,relatime,size=2008276k,nr_inodes=502069,mode=755)
ubuntu@ip-172-31-33-20:~/csi-driver-host-path/examples$ sudo fuser -vmM  /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-f8f344e2-360d-43bf-a4f7-d340a9c706cd/dev/6a7c560a-2491-4447-8527-a23ba9124a11
                     USER        PID ACCESS COMMAND
/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-f8f344e2-360d-43bf-a4f7-d340a9c706cd/dev/6a7c560a-2491-4447-8527-a23ba9124a11:
                     root     kernel mount /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-f8f344e2-360d-43bf-a4f7-d340a9c706cd/dev/6a7c560a-2491-4447-8527-a23ba9124a11
ubuntu@ip-172-31-33-20:~/csi-driver-host-path/examples$ 

@angelnu
Contributor Author

angelnu commented May 20, 2021

If the umount fails, then at least some files are still in use - if it is a process within the container, it should have been killed by the time the pods are deleted within k3s-killall.sh.

This is why my suggestion is to check, after the umount fails, which process keeps the mount busy with lsof. I suspect that your CSI is starting a process outside the pod that keeps the mount busy and that is not killed by k3s-killall.sh. Handling for those processes would need to be added if this gets confirmed.

When we cleanly delete the container, the CSI does the unmount.

@davidnuzik davidnuzik moved this from Working to To Test in Development [DEPRECATED] Jun 3, 2021
@ShylajaDevadiga
Contributor

@angelnu I had used hostpath in the earlier scenario. After internal discussion we decided to use the Longhorn CSI. Validated the fix on k3s version v1.21.1+k3s1. The umount was successful.

mount |grep kubelet
...
/dev/longhorn/pvc-c72e462c-81e2-4d37-9a05-456d3aec381f on /var/lib/kubelet/pods/3cd82bb6-1790-403c-966b-fda638ba60ab/volumes/kubernetes.io~csi/pvc-c72e462c-81e2-4d37-9a05-456d3aec381f/mount type ext4 (rw,relatime)

Development [DEPRECATED] automation moved this from To Test to Done Issue / Merged PR Jun 7, 2021