
[BUG] longhorn manager pod fails to start in container-based K3s #5693

Closed
zedi-pramodh opened this issue Mar 31, 2023 · 24 comments
@zedi-pramodh

Describe the bug (🐛 if you encounter this issue)

longhorn manager pod fails to start.
5e2b3989-174a-450f-ad73-47b021784f28:/# kubectl get pods -n longhorn-system
NAME                                           READY   STATUS             RESTARTS          AGE
longhorn-admission-webhook-5bc4b984c4-6bpp6    1/1     Running            1 (22h ago)       34h
longhorn-admission-webhook-5bc4b984c4-wcfwv    1/1     Running            1 (22h ago)       34h
longhorn-conversion-webhook-75d97f9fc8-f4c9g   1/1     Running            1 (22h ago)       34h
longhorn-conversion-webhook-75d97f9fc8-m28xz   1/1     Running            1 (22h ago)       34h
longhorn-driver-deployer-c654d94c9-hmj8l       0/1     Init:0/1           1                 34h
longhorn-manager-mxsgm                         0/1     CrashLoopBackOff   220 (2m40s ago)   18h
longhorn-recovery-backend-bc84b6dbf-gwf85      1/1     Running            1 (22h ago)       34h
longhorn-recovery-backend-bc84b6dbf-rb7kg      1/1     Running            1 (22h ago)       34h
longhorn-ui-677c9cb6d7-kk496                   1/1     Running            3 (22h ago)       34h
longhorn-ui-677c9cb6d7-nnwcq                   1/1     Running            3 (22h ago)       34h

5e2b3989-174a-450f-ad73-47b021784f28:/# kubectl logs longhorn-manager-mxsgm -n longhorn-system
Defaulted container "longhorn-manager" out of: longhorn-manager, wait-longhorn-admission-webhook (init)
time="2023-03-31T15:29:52Z" level=error msg="Failed environment check, please make sure you have iscsiadm/open-iscsi installed on the host"
time="2023-03-31T15:29:52Z" level=fatal msg="Error starting manager: environment check failed: failed to execute: nsenter [--mount=/host/proc/1/ns/mnt --net=/host/proc/1/ns/net iscsiadm --version], output , stderr nsenter: failed to execute iscsiadm: No such file or directory\n: exit status 127"

To Reproduce

My env is not so typical.

  1. Base OS is Alpine 3.16, kernel version 5.15.90
  2. k3s is installed in an OCI container on top of Alpine
  3. longhorn is installed within that k3s container.

longhorn was installed using the following command:
kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/v1.4.0/deploy/longhorn.yaml

5e2b3989-174a-450f-ad73-47b021784f28:/# kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
5e2b3989-174a-450f-ad73-47b021784f28 Ready control-plane,etcd,master 22d v1.25.3+k3s1 10.129.17.90 Unknown 5.15.99-linuxkit containerd://1.6.8-k3s1

I did install open-iscsi and iscsiadm is present in the k3s container.

5e2b3989-174a-450f-ad73-47b021784f28:/# lsmod | grep iscsi
iscsi_tcp 24576 0
libiscsi_tcp 28672 1 iscsi_tcp
libiscsi 53248 2 iscsi_tcp,libiscsi_tcp
scsi_transport_iscsi 102400 4 iscsi_tcp,libiscsi_tcp,libiscsi

Expected behavior

Expect longhorn pods to start

Initially it looked like the path to iscsiadm was missing, but after a deeper dive it appears to be something to do with namespaces: /proc/1/ns/mnt is not found in the longhorn-manager pod, since my env is BaseOS -> k3s in a container -> longhorn launched in that k3s container.

Has anyone seen this issue, and is this even a supported config, i.e. can we launch longhorn in a k3s container?

NOTE: Pardon my ignorance since I am just getting started on longhorn

@zedi-pramodh
Author

Can someone please provide your input on this issue and any workaround?

@zedi-pramodh
Author

On further debugging, it turned out that the issue is in the following code:

https://github.com/longhorn/go-iscsi-helper/blob/master/util/process.go

const (
	DockerdProcess    = "dockerd"
	ContainerdProcess = "containerd"
)

But the containerd runtime process is containerd-shim in my case, so there will never be a match with the parent process.
Hence it always falls back to /proc/1/ns, but since I am running k3s within a container, it is not in the same namespace as the base OS.

Is my understanding correct?
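
To make that concrete, here is a minimal sketch of the kind of ancestor lookup being described (an assumption about the behaviour, not the actual go-iscsi-helper implementation): walk up PPid entries in /proc/<pid>/status, return the first ancestor whose Name is in the known set, and otherwise end at PID 1, which is the fallback that breaks the nested-k3s case.

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// findAncestorByName walks up the process tree via /proc/<pid>/status and
// returns the first ancestor whose Name matches one of `names`. If nothing
// matches (e.g. the runtime ancestor is "containerd-shim" but only "dockerd"
// and "containerd" are listed), the walk ends at PID 1 and the caller falls
// back to /proc/1/ns, which is the host init on bare metal but the wrong
// namespace when k3s itself runs in a container.
func findAncestorByName(pid int, names map[string]bool) (int, bool) {
	for pid > 1 {
		name, ppid, err := readStatus(pid)
		if err != nil {
			return 0, false
		}
		if names[name] {
			return pid, true
		}
		pid = ppid
	}
	return 1, false
}

// readStatus extracts the Name and PPid fields from /proc/<pid>/status.
func readStatus(pid int) (name string, ppid int, err error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/status", pid))
	if err != nil {
		return "", 0, err
	}
	for _, line := range strings.Split(string(data), "\n") {
		if v, ok := strings.CutPrefix(line, "Name:"); ok {
			name = strings.TrimSpace(v)
		}
		if v, ok := strings.CutPrefix(line, "PPid:"); ok {
			if ppid, err = strconv.Atoi(strings.TrimSpace(v)); err != nil {
				return "", 0, err
			}
		}
	}
	return name, ppid, nil
}

func main() {
	// With only dockerd/containerd in the set, a containerd-shim ancestor never
	// matches and the result degrades to PID 1.
	pid, found := findAncestorByName(os.Getpid(), map[string]bool{"dockerd": true, "containerd": true})
	fmt.Println(pid, found)
}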

@mantissahz
Contributor

time="2023-03-31T15:29:52Z" level=error msg="Failed environment check, please make sure you have iscsiadm/open-iscsi installed on the host"

Could you use this script
curl -sSfL https://raw.githubusercontent.com/longhorn/longhorn/v1.4.1/scripts/environment_check.sh | bash
to check that the environment settings are OK?
Here is the documentation: https://longhorn.io/docs/1.4.1/deploy/install/#using-the-environment-check-script

@zedi-pramodh
Author

@mantissahz

5e2b3989-174a-450f-ad73-47b021784f28:/# curl -sSfL https://raw.githubusercontent.com/longhorn/longhorn/v1.4.1/scripts/environment_check.sh | bash
[INFO] Required dependencies 'kubectl jq mktemp' are installed.
[INFO] Hostname uniqueness check is passed.
[INFO] Waiting for longhorn-environment-check pods to become ready (0/1)...
[INFO] Waiting for longhorn-environment-check pods to become ready (0/1)...
[INFO] All longhorn-environment-check pods are ready (1/1).
[WARN] Unable to check kernel config CONFIG_NFS_V4_1 on node 5e2b3989-174a-450f-ad73-47b021784f28
[WARN] Unable to check kernel config CONFIG_NFS_V4_2 on node 5e2b3989-174a-450f-ad73-47b021784f28
[WARN] NFS client kernel support, CONFIG_NFS_V4_1 CONFIG_NFS_V4_2, is not enabled on Longhorn nodes. Please refer to https://longhorn.io/docs/1.4.0/deploy/install/#installing-nfsv4-client for more information.
[INFO] Cleaning up longhorn-environment-check pods...
[INFO] Cleanup completed.

I think the issue is as I mentioned above:

https://github.com/longhorn/go-iscsi-helper/blob/master/util/process.go

const (
	DockerdProcess    = "dockerd"
	ContainerdProcess = "containerd"
)

There should be a check for "containerd-shim" too. Without that, the code always walks up to PPID 1. In most cases, when k3s runs on bare metal, the namespace of k3s matches the init process namespace, and hence it works.

In my case k3s is running in an OCI container, so its namespace is different from the init process's.

I temporarily patched the code to replace containerd with containerd-shim; the longhorn pods started fine and I was able to create PVs.

So my suggestion is to check for all three processes.

const (
	DockerdProcess        = "dockerd"
	ContainerdProcess     = "containerd"
	ContainerdShimProcess = "containerd-shim"
)

@naiming-zededa

Not sure anyone has a working system with Kubernetes running inside a container (not on bare metal) with longhorn. In both cases, when trying to find the PPid of the process, the name does not match 'containerd', so the lookup always goes to the 'Init' process. In the bare-metal case, using the 'Init' process namespace is fine, but with K8s/K3s in a container it is not OK. Either we replace "containerd" with "containerd-shim", or we add another check for "containerd-shim"; that will let longhorn work in both cases. We can send a patch if people agree on this.

@mantissahz
Contributor

mantissahz commented Apr 7, 2023

containerd is not well supported; related issues: #2702, #3643

@naiming-zededa

naiming-zededa commented Apr 8, 2023

@mantissahz I'm not talking about adding any extra support to longhorn, just a simple patch to 'FindAncestorByName()', which is used by longhorn-manager and a number of other containers, for example. If I change
/vendor/github.com/longhorn/go-iscsi-helper/util/process.go to this:

const (
	DockerdProcess = "dockerd"
-	ContainerdProcess = "containerd"
+	//ContainerdProcess = "containerd"
+	ContainerdProcess = "containerd-shim"
)

then it works both with k3s/longhorn inside a container and on bare metal. Otherwise, I have traced this ancestor lookup with some debug info:

find the process (0) longhorn-manage, id 28641; (1) containerd-shim, id 26544; (2) containerd-shim, id 1738; (3) init, id 1; ppid is zero

If the lookup stops at pid 26544, then it works. Without the above patch, it walks all the way to the 'init' process and uses /proc/1/ns/ for the nsenter, which seems to work only in the bare-metal case.

We can also add another check for 'containerd-shim' (instead of replacing it as above): if the lookup finds a process with that name, it also returns the pid.
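
For illustration only (hypothetical names, not the submitted patch), that non-replacing variant amounts to matching against a set of runtime process names, so the ancestor walk stops at containerd-shim in a nested k3s setup while dockerd/containerd keep working on bare metal:

package main

import "fmt"

// runtimeProcessNames treats any known container-runtime process name as a
// valid ancestor match instead of using a single constant, so neither layout
// falls through to PID 1.
var runtimeProcessNames = map[string]bool{
	"dockerd":         true,
	"containerd":      true,
	"containerd-shim": true,
}

func isRuntimeProcess(name string) bool {
	return runtimeProcessNames[name]
}

func main() {
	fmt.Println(isRuntimeProcess("containerd-shim")) // nested k3s-in-container case
	fmt.Println(isRuntimeProcess("containerd"))      // bare-metal case
}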

@mantissahz
Contributor

@naiming-zededa happy to hear that you have a solution.
Could you create a PR for this solution to improve the container support?
Thanks

@naiming-zededa

will do @mantissahz

@naiming-zededa

naiming-zededa commented Apr 12, 2023

@mantissahz can you add permission for me to submit a PR to https://github.com/longhorn/go-iscsi-helper? Thanks.

@mantissahz
Contributor

mantissahz commented Apr 13, 2023

@innobead Do we have any permission limitations for submitting a PR to go-iscsi-helper?

@naiming-zededa

naiming-zededa commented Apr 13, 2023

I encountered this error during 'git push':
go-iscsi-helper [naiming-containerd-shim] git push --set-upstream origin naiming-containerd-shim
remote: Permission to longhorn/go-iscsi-helper.git denied to naiming-zededa.
fatal: unable to access 'https://github.com/longhorn/go-iscsi-helper.git/': The requested URL returned error: 403

@mantissahz
Contributor

mantissahz commented Apr 13, 2023

@naiming-zededa
Did you push the commit to your fork of go-iscsi-helper first?
Then you can create a new pull request from your fork repository.
Directly pushing the commit/branch to the longhorn repository is not allowed.

@naiming-zededa

@mantissahz PR submitted: longhorn/go-iscsi-helper#63

@andrewd-zededa

@mantissahz and @shuo-wu I've submitted the PR mentioned just above to incorporate the fix into 1.5.x/1.5.4. Can someone take a look and advise, please? Thanks!

@innobead innobead added this to the v1.7.0 milestone Jan 5, 2024
@innobead innobead added the component/longhorn-manager Longhorn manager (control plane) label Jan 5, 2024
@innobead innobead modified the milestones: v1.7.0, v1.6.0 Jan 8, 2024
@innobead innobead added the area/v1-data-engine v1 data engine (iSCSI tgt) label Jan 8, 2024
@innobead innobead modified the milestones: v1.6.0, v1.5.4 Jan 10, 2024
@innobead
Member

innobead commented Feb 1, 2024

@ChanYiLin Please assist @andrewd-zededa on this issue. Move it forward.

@ChanYiLin
Contributor

Sure, I will pick it up. It seems some of the PRs were closed due to being inactive for too long.

@ChanYiLin
Contributor

ChanYiLin commented Feb 5, 2024

Hi @andrewd-zededa
I am here to help you move this feature forward.
Longhorn components have the following import chain:
go-iscsi-helper/go-common-libs -> longhorn-engine -> longhorn-instance-manager -> longhorn-manager
That means we have to update the vendored dependency in each repo one by one, from bottom to top,
so that every component has the latest update.

Besides the import chain, our releases are based on branches:
v1.5.x -> head of v1.5.x; every patch version is cut from this branch, e.g. v1.5.4
master -> every major release is cut from this branch, e.g. v1.6.0, v1.7.0

So if you want to patch a previous release, you have to create a branch from the correct head and update it.
For example, to backport this feature to v1.5.x, you have to create the branch from v1.5.x and open the PR against that version.

Now we have to make sure

Master-head

  • go-iscsi-helper updated. PR
  • longhorn-engine updated. PR
  • longhorn-instance-manager updated (already included here). It is common for others' PRs to update the vendor as well, so no separate PR is needed.

v1.6.x

  • go-iscsi-helper updated. PR (this repo has no version)
  • longhorn-engine updated. (already included here)
  • longhorn-instance-manager updated (already included here)

v1.5.x

  • go-iscsi-helper updated. PR (this repo has no version)
  • longhorn-engine updated. (Please refer to this comment to update the vendor in longhorn/longhorn-engine v1.5.x)
  • longhorn-instance-manager updated (Please update the go-iscsi-helper/go-common-libs/longhorn-engine vendor in go.mod after the longhorn-engine PR is merged into v1.5.x)
  • longhorn-manager updated (Please update the go-iscsi-helper/go-common-libs/longhorn-engine/longhorn-instance-manager vendor in go.mod after the longhorn-instance-manager PR is merged into v1.5.x)

Thanks!

@chriscchien
Contributor

Hi @zedi-pramodh, for testing purposes, it would be appreciated if you could elaborate more on how to install k3s in a container. Thank you.

@bashofmann

@chriscchien The easiest way is to use K3d: https://k3d.io/

@chriscchien
Contributor

Hi @bashofmann, thank you for your information.

By creating a k3d cluster with defaults, I can reproduce the longhorn-manager CrashLoopBackOff issue:

> k get pods -n longhorn-system
NAME                                        READY   STATUS             RESTARTS         AGE
longhorn-ui-7bfc767bfd-dpj66                1/1     Running            0                169m
longhorn-driver-deployer-766f858d87-c27gf   0/1     Init:0/1           0                169m
longhorn-ui-7bfc767bfd-jr2kx                1/1     Running            0                169m
longhorn-manager-j9zmv                      0/1     CrashLoopBackOff   37 (3m38s ago)   169m

From the longhorn-manager logs, I can observe the same error as this issue described:

> k -n longhorn-system logs longhorn-manager-j9zmv
warning: GOCOVERDIR not set, no coverage data emitted
time="2024-03-12T06:47:59Z" level=fatal msg="Error starting manager: Failed environment check, please make sure you have iscsiadm/open-iscsi installed on the host: failed to execute: /usr/bin/nsenter [nsenter --mount=/host/proc/6163/ns/mnt --net=/host/proc/6163/ns/net iscsiadm --version], output , stderr nsenter: failed to execute iscsiadm: No such file or directory\n: exit status 127" func=main.main.DaemonCmd.func3 file="daemon.go:92"

In the container, only a few iSCSI modules are loaded and iscsiadm is not installed:

> docker exec -it 43599ac14951 /bin/sh -c "lsmod | grep iscsi"
iscsi_ibft             16384  0 
iscsi_boot_sysfs       20480  1 iscsi_ibft
> docker exec -it 43599ac14951 /bin/sh -c "iscsiadm --version"
/bin/sh: iscsiadm: not found

After trying again with a custom image that has iscsiadm installed, the longhorn-manager no longer crashes (ref):

> k get nodes -o wide
NAME                            STATUS   ROLES                  AGE   VERSION        INTERNAL-IP   EXTERNAL-IP   OS-IMAGE           KERNEL-VERSION                 CONTAINER-RUNTIME
k3d-one-node-cluster-server-0   Ready    control-plane,master   39m   v1.29.2+k3s1   172.18.0.2    <none>        K3s v1.29.2+k3s1   5.14.21-150500.55.44-default   containerd://1.7.11-k3s2
> k get pods -n longhorn-system | grep longhorn-manager
longhorn-manager-wrwwr                              1/1     Running                0               28m
> docker exec -it d8fa9a96a81b /bin/sh -c "iscsiadm --version"
iscsiadm version 2.1.8

@chriscchien chriscchien self-assigned this Mar 12, 2024
@chriscchien
Contributor

Verified as passing on longhorn master (longhorn-engine 9ff2e8, longhorn-instance-manager a09d9d).

Created a k3d cluster with a custom image that has iscsiadm installed, then deployed longhorn master; the longhorn-manager pod runs correctly. (detail)

Closing this ticket first, as we do not fully know how the issue creator's environment was built. Currently we can only mock the environment with k3d, and there longhorn-manager runs correctly. If there is further information, I will test again. Thank you.

@kust-soptim

If you are facing this issue and are in dire need of a quick workaround, you can open a shell on your affected nodes and run the following command:
sudo dnf -y install iscsi-initiator-utils
Then delete the longhorn-manager pods and your cluster will be up and running again.
