
[BUG] longhorn manager pod fails to start in container-based K3s #5693

Closed
zedi-pramodh opened this issue Mar 31, 2023 · 24 comments
@zedi-pramodh

Describe the bug (🐛 if you encounter this issue)

longhorn manager pod fails to start.
5e2b3989-174a-450f-ad73-47b021784f28:/# kubectl get pods -n longhorn-system
NAME                                           READY   STATUS             RESTARTS          AGE
longhorn-admission-webhook-5bc4b984c4-6bpp6    1/1     Running            1 (22h ago)       34h
longhorn-admission-webhook-5bc4b984c4-wcfwv    1/1     Running            1 (22h ago)       34h
longhorn-conversion-webhook-75d97f9fc8-f4c9g   1/1     Running            1 (22h ago)       34h
longhorn-conversion-webhook-75d97f9fc8-m28xz   1/1     Running            1 (22h ago)       34h
longhorn-driver-deployer-c654d94c9-hmj8l       0/1     Init:0/1           1                 34h
longhorn-manager-mxsgm                         0/1     CrashLoopBackOff   220 (2m40s ago)   18h
longhorn-recovery-backend-bc84b6dbf-gwf85      1/1     Running            1 (22h ago)       34h
longhorn-recovery-backend-bc84b6dbf-rb7kg      1/1     Running            1 (22h ago)       34h
longhorn-ui-677c9cb6d7-kk496                   1/1     Running            3 (22h ago)       34h
longhorn-ui-677c9cb6d7-nnwcq                   1/1     Running            3 (22h ago)       34h

5e2b3989-174a-450f-ad73-47b021784f28:/# kubectl logs longhorn-manager-mxsgm -n longhorn-system
Defaulted container "longhorn-manager" out of: longhorn-manager, wait-longhorn-admission-webhook (init)
time="2023-03-31T15:29:52Z" level=error msg="Failed environment check, please make sure you have iscsiadm/open-iscsi installed on the host"
time="2023-03-31T15:29:52Z" level=fatal msg="Error starting manager: environment check failed: failed to execute: nsenter [--mount=/host/proc/1/ns/mnt --net=/host/proc/1/ns/net iscsiadm --version], output , stderr nsenter: failed to execute iscsiadm: No such file or directory\n: exit status 127"

To Reproduce

My env is not so typical.

  1. Base OS is Alpine 3.16, kernel version 5.15.90
  2. k3s is installed in an OCI container on top of Alpine
  3. longhorn is installed within that k3s container.

longhorn was installed using the following command:
kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/v1.4.0/deploy/longhorn.yaml

5e2b3989-174a-450f-ad73-47b021784f28:/# kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
5e2b3989-174a-450f-ad73-47b021784f28 Ready control-plane,etcd,master 22d v1.25.3+k3s1 10.129.17.90 Unknown 5.15.99-linuxkit containerd://1.6.8-k3s1

I did install open-iscsi and iscsiadm is present in the k3s container.

5e2b3989-174a-450f-ad73-47b021784f28:/# lsmod | grep iscsi
iscsi_tcp 24576 0
libiscsi_tcp 28672 1 iscsi_tcp
libiscsi 53248 2 iscsi_tcp,libiscsi_tcp
scsi_transport_iscsi 102400 4 iscsi_tcp,libiscsi_tcp,libiscsi

Expected behavior

Expect longhorn pods to start

Initially it looked like the path to iscsiadm was missing, but after a deeper dive it appears to be something to do with namespaces: /proc/1/ns/mnt is not found in the longhorn-manager pod, since my env is BaseOS -> k3s in a container -> longhorn launched in that k3s container.

Has anyone seen this issue, and is this even a supported config, i.e. can we launch longhorn in a k3s container?

NOTE: Pardon my ignorance since I am just getting started on longhorn

@zedi-pramodh
Author

Can someone please provide your input on this issue and any workaround?

@zedi-pramodh
Author

On further debugging, it turned out that the issue is in the following code:

https://github.com/longhorn/go-iscsi-helper/blob/master/util/process.go

const (
	DockerdProcess    = "dockerd"
	ContainerdProcess = "containerd"
)

But the containerd runtime process is containerd-shim in my case, so there will never be a match with the parent process.
Hence it always falls back to /proc/1/ns, but since I am running k3s within a container, it is not in the same namespace as the base OS.

Is my understanding correct?
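
To make that concrete, here is a minimal sketch of the kind of ancestor lookup being described (an assumption about the behaviour, not the actual go-iscsi-helper implementation): walk up PPid entries in /proc/<pid>/status, return the first ancestor whose Name is in the known set, and otherwise end at PID 1, which is the fallback that breaks the nested-k3s case.

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// findAncestorByName walks up the process tree via /proc/<pid>/status and
// returns the first ancestor whose Name matches one of `names`. If nothing
// matches (e.g. the runtime ancestor is "containerd-shim" but only "dockerd"
// and "containerd" are listed), the walk ends at PID 1 and the caller falls
// back to /proc/1/ns, which is the host init on bare metal but the wrong
// namespace when k3s itself runs in a container.
func findAncestorByName(pid int, names map[string]bool) (int, bool) {
	for pid > 1 {
		name, ppid, err := readStatus(pid)
		if err != nil {
			return 0, false
		}
		if names[name] {
			return pid, true
		}
		pid = ppid
	}
	return 1, false
}

// readStatus extracts the Name and PPid fields from /proc/<pid>/status.
func readStatus(pid int) (name string, ppid int, err error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/status", pid))
	if err != nil {
		return "", 0, err
	}
	for _, line := range strings.Split(string(data), "\n") {
		if v, ok := strings.CutPrefix(line, "Name:"); ok {
			name = strings.TrimSpace(v)
		}
		if v, ok := strings.CutPrefix(line, "PPid:"); ok {
			if ppid, err = strconv.Atoi(strings.TrimSpace(v)); err != nil {
				return "", 0, err
			}
		}
	}
	return name, ppid, nil
}

func main() {
	// With only dockerd/containerd in the set, a containerd-shim ancestor never
	// matches and the result degrades to PID 1.
	pid, found := findAncestorByName(os.Getpid(), map[string]bool{"dockerd": true, "containerd": true})
	fmt.Println(pid, found)
}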

@mantissahz
Contributor

time="2023-03-31T15:29:52Z" level=error msg="Failed environment check, please make sure you have iscsiadm/open-iscsi installed on the host"

Could you use this script
curl -sSfL https://raw.githubusercontent.com/longhorn/longhorn/v1.4.1/scripts/environment_check.sh | bash
to check that the environment settings are OK?
Here is the documentation: https://longhorn.io/docs/1.4.1/deploy/install/#using-the-environment-check-script

@zedi-pramodh
Author

@mantissahz

5e2b3989-174a-450f-ad73-47b021784f28:/# curl -sSfL https://raw.githubusercontent.com/longhorn/longhorn/v1.4.1/scripts/environment_check.sh | bash
[INFO] Required dependencies 'kubectl jq mktemp' are installed.
[INFO] Hostname uniqueness check is passed.
[INFO] Waiting for longhorn-environment-check pods to become ready (0/1)...
[INFO] Waiting for longhorn-environment-check pods to become ready (0/1)...
[INFO] All longhorn-environment-check pods are ready (1/1).
[WARN] Unable to check kernel config CONFIG_NFS_V4_1 on node 5e2b3989-174a-450f-ad73-47b021784f28
[WARN] Unable to check kernel config CONFIG_NFS_V4_2 on node 5e2b3989-174a-450f-ad73-47b021784f28
[WARN] NFS client kernel support, CONFIG_NFS_V4_1 CONFIG_NFS_V4_2, is not enabled on Longhorn nodes. Please refer to https://longhorn.io/docs/1.4.0/deploy/install/#installing-nfsv4-client for more information.
[INFO] Cleaning up longhorn-environment-check pods...
[INFO] Cleanup completed.

I think the issue is as I mentioned above:

https://github.com/longhorn/go-iscsi-helper/blob/master/util/process.go

const (
	DockerdProcess    = "dockerd"
	ContainerdProcess = "containerd"
)

There should be a check for "containerd-shim" too. Without that, the code always walks up to PPID 1. In most cases, when k3s runs on bare metal, the namespace of k3s matches the init process namespace, and hence it works.

In my case k3s is running in an OCI container, so its namespace is different from the init process's.

I temporarily patched the code to replace containerd with containerd-shim; the longhorn pods started fine and I was able to create PVs.

So my suggestion is to check for all three processes.

const (
	DockerdProcess        = "dockerd"
	ContainerdProcess     = "containerd"
	ContainerdShimProcess = "containerd-shim"
)

@naiming-zededa

Not sure anyone has a working system with Kubernetes running inside a container (not on bare metal) with longhorn. In both cases, when trying to find the PPid of the process, the name does not match 'containerd', so the lookup always goes to the 'Init' process. In the bare-metal case, using the 'Init' process namespace is fine, but with K8s/K3s in a container it is not OK. Either we replace "containerd" with "containerd-shim", or we add another check for "containerd-shim"; that will let longhorn work in both cases. We can send a patch if people agree on this.

@mantissahz
Contributor

mantissahz commented Apr 7, 2023

containerd is not well supported; related issues: #2702, #3643

@naiming-zededa

naiming-zededa commented Apr 8, 2023

@mantissahz I'm not talking about adding any extra support to longhorn, just a simple patch to 'FindAncestorByName()', which is used by longhorn-manager and a number of other containers, for example. If I change
/vendor/github.com/longhorn/go-iscsi-helper/util/process.go to this:

const (
	DockerdProcess = "dockerd"
-	ContainerdProcess = "containerd"
+	//ContainerdProcess = "containerd"
+	ContainerdProcess = "containerd-shim"
)

then it works both with k3s/longhorn inside a container and on bare metal. Otherwise, I have traced this ancestor lookup with some debug info:

find the process (0) longhorn-manage, id 28641; (1) containerd-shim, id 26544; (2) containerd-shim, id 1738; (3) init, id 1; ppid is zero

If the lookup stops at pid 26544, then it works. Without the above patch, it walks all the way to the 'init' process and uses /proc/1/ns/ for the nsenter, which seems to work only in the bare-metal case.

We can also add another check for 'containerd-shim' (instead of replacing it as above): if the lookup finds a process with that name, it also returns the pid.
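
For illustration only (hypothetical names, not the submitted patch), that non-replacing variant amounts to matching against a set of runtime process names, so the ancestor walk stops at containerd-shim in a nested k3s setup while dockerd/containerd keep working on bare metal:

package main

import "fmt"

// runtimeProcessNames treats any known container-runtime process name as a
// valid ancestor match instead of using a single constant, so neither layout
// falls through to PID 1.
var runtimeProcessNames = map[string]bool{
	"dockerd":         true,
	"containerd":      true,
	"containerd-shim": true,
}

func isRuntimeProcess(name string) bool {
	return runtimeProcessNames[name]
}

func main() {
	fmt.Println(isRuntimeProcess("containerd-shim")) // nested k3s-in-container case
	fmt.Println(isRuntimeProcess("containerd"))      // bare-metal case
}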

@mantissahz
Contributor

@naiming-zededa happy to hear that you have a solution.
Could you create a PR for this solution to improve the container support?
Thanks

@naiming-zededa

will do @mantissahz

@naiming-zededa

naiming-zededa commented Apr 12, 2023

@mantissahz can you add permission for me to submit a PR to https://github.com/longhorn/go-iscsi-helper? Thanks.

@mantissahz
Contributor

mantissahz commented Apr 13, 2023

@innobead Do we have any permission limitations for submitting a PR to go-iscsi-helper?

@naiming-zededa

naiming-zededa commented Apr 13, 2023

I encountered this error during 'git push':
go-iscsi-helper [naiming-containerd-shim] git push --set-upstream origin naiming-containerd-shim
remote: Permission to longhorn/go-iscsi-helper.git denied to naiming-zededa.
fatal: unable to access 'https://github.com/longhorn/go-iscsi-helper.git/': The requested URL returned error: 403

@mantissahz
Contributor

mantissahz commented Apr 13, 2023

@naiming-zededa
Did you push the commit to your fork of go-iscsi-helper first?
Then you can create a new pull request from your fork repository.
Directly pushing the commit/branch to the longhorn repository is not allowed.

@naiming-zededa

@mantissahz PR submitted: longhorn/go-iscsi-helper#63

@andrewd-zededa

@mantissahz and @shuo-wu I've submitted the PR mentioned just above to incorporate the fix into 1.5.x/1.5.4. Can someone take a look and advise, please? Thanks!

@innobead innobead added this to the v1.7.0 milestone Jan 5, 2024
@innobead innobead added the component/longhorn-manager Longhorn manager (control plane) label Jan 5, 2024
@innobead innobead modified the milestones: v1.7.0, v1.6.0 Jan 8, 2024
@innobead innobead added the area/v1-data-engine v1 data engine (iSCSI tgt) label Jan 8, 2024
@innobead innobead modified the milestones: v1.6.0, v1.5.4 Jan 10, 2024
@innobead
Member

innobead commented Feb 1, 2024

@ChanYiLin Please assist @andrewd-zededa on this issue. Move it forward.

@ChanYiLin
Contributor

Sure, I will pick it up. It seems some of the PRs were closed due to being inactive for too long.

@ChanYiLin
Contributor

ChanYiLin commented Feb 5, 2024

Hi @andrewd-zededa
I am here to help you move this feature forward.
Longhorn components have the following import chain:
go-iscsi-helper/go-common-libs -> longhorn-engine -> longhorn-instance-manager -> longhorn-manager
That means we have to update the vendored dependency in each repo one by one, from bottom to top,
so that every component has the latest update.

Besides the import chain, our releases are based on branches:
v1.5.x -> head of v1.5.x; every patch version is cut from this branch, e.g. v1.5.4
master -> every major release is cut from this branch, e.g. v1.6.0, v1.7.0

So if you want to patch a previous release, you have to create a branch from the correct head and update it.
For example, to backport this feature to v1.5.x, you have to create the branch from v1.5.x and open the PR against that version.

Now we have to make sure

Master-head

  • go-iscsi-helper updated. PR
  • longhorn-engine updated. PR
  • longhorn-instance-manager updated (already included here). It is common for others' PRs to update the vendor as well, so no separate PR is needed.

v1.6.x

  • go-iscsi-helper updated. PR (this repo has no version)
  • longhorn-engine updated. (already included here)
  • longhorn-instance-manager updated (already included here)

v1.5.x

  • go-iscsi-helper updated. PR (this repo has no version)
  • longhorn-engine updated. (Please refer to this comment to update the vendor in longhorn/longhorn-engine v1.5.x)
  • longhorn-instance-manager updated (Please update the go-iscsi-helper/go-common-libs/longhorn-engine vendor in go.mod after the longhorn-engine PR is merged into v1.5.x)
  • longhorn-manager updated (Please update the go-iscsi-helper/go-common-libs/longhorn-engine/longhorn-instance-manager vendor in go.mod after the longhorn-instance-manager PR is merged into v1.5.x)

Thanks!

@chriscchien
Contributor

Hi @zedi-pramodh, for testing purposes, it would be appreciated if you could elaborate more on how to install k3s in a container. Thank you.

@bashofmann

@chriscchien The easiest way is to use K3d: https://k3d.io/

@chriscchien
Contributor

Hi @bashofmann, thank you for your information.

By creating a k3d cluster with defaults, I can reproduce the longhorn-manager CrashLoopBackOff issue:

> k get pods -n longhorn-system
NAME                                        READY   STATUS             RESTARTS         AGE
longhorn-ui-7bfc767bfd-dpj66                1/1     Running            0                169m
longhorn-driver-deployer-766f858d87-c27gf   0/1     Init:0/1           0                169m
longhorn-ui-7bfc767bfd-jr2kx                1/1     Running            0                169m
longhorn-manager-j9zmv                      0/1     CrashLoopBackOff   37 (3m38s ago)   169m

From the longhorn-manager logs, I can observe the same error as this issue described:

> k -n longhorn-system logs longhorn-manager-j9zmv
warning: GOCOVERDIR not set, no coverage data emitted
time="2024-03-12T06:47:59Z" level=fatal msg="Error starting manager: Failed environment check, please make sure you have iscsiadm/open-iscsi installed on the host: failed to execute: /usr/bin/nsenter [nsenter --mount=/host/proc/6163/ns/mnt --net=/host/proc/6163/ns/net iscsiadm --version], output , stderr nsenter: failed to execute iscsiadm: No such file or directory\n: exit status 127" func=main.main.DaemonCmd.func3 file="daemon.go:92"

In the container, only a few iSCSI modules are loaded and iscsiadm is not installed:

> docker exec -it 43599ac14951 /bin/sh -c "lsmod | grep iscsi"
iscsi_ibft             16384  0 
iscsi_boot_sysfs       20480  1 iscsi_ibft
> docker exec -it 43599ac14951 /bin/sh -c "iscsiadm --version"
/bin/sh: iscsiadm: not found

After trying again with a custom image that has iscsiadm installed, the longhorn-manager no longer crashes (ref):

> k get nodes -o wide
NAME                            STATUS   ROLES                  AGE   VERSION        INTERNAL-IP   EXTERNAL-IP   OS-IMAGE           KERNEL-VERSION                 CONTAINER-RUNTIME
k3d-one-node-cluster-server-0   Ready    control-plane,master   39m   v1.29.2+k3s1   172.18.0.2    <none>        K3s v1.29.2+k3s1   5.14.21-150500.55.44-default   containerd://1.7.11-k3s2
> k get pods -n longhorn-system | grep longhorn-manager
longhorn-manager-wrwwr                              1/1     Running                0               28m
> docker exec -it d8fa9a96a81b /bin/sh -c "iscsiadm --version"
iscsiadm version 2.1.8

@chriscchien chriscchien self-assigned this Mar 12, 2024
@chriscchien
Contributor

Verified as passing on longhorn master (longhorn-engine 9ff2e8, longhorn-instance-manager a09d9d).

Created a k3d cluster with a custom image that has iscsiadm installed, then deployed longhorn master; the longhorn-manager pod runs correctly. (detail)

Closing this ticket first, as we do not fully know how the issue creator's environment was built. Currently we can only mock the environment with k3d, and there longhorn-manager runs correctly. If there is further information, I will test again. Thank you.

@kust-soptim

If you are facing this issue and are in dire need of a quick workaround, you can open a shell on your affected nodes and run the following command:
sudo dnf -y install iscsi-initiator-utils
Then delete the longhorn-manager pods and your cluster will be up and running again.
