
[BUG] cannot restart engine image daemonset sometimes #2351

Closed
c3y1huang opened this issue Mar 16, 2021 · 4 comments
Labels
kind/bug kind/test Request for adding test reproduce/often 80 - 50% reproducible severity/4 Function working but has a minor issue (a minor incident with low impact) wontfix

Comments

@c3y1huang
Contributor

c3y1huang commented Mar 16, 2021

Describe the bug
While running e2e tests, I noticed that deleting the engine image daemonset to trigger a restart sometimes fails to bring all engine-image pods back to the running state.

To Reproduce
Steps to reproduce the behavior:

  1. Start with a new Longhorn cluster.
  2. Delete the engine-image daemonset:
kubectl -n longhorn-system delete ds/engine-image-ei-605a0f3e

Expected behavior
All engine image pods should restart and become running eventually.
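
One way to confirm recovery, assuming the recreated daemonset keeps the same ei-605a0f3e name, is to wait for its rollout to complete:

kubectl -n longhorn-system rollout status ds/engine-image-ei-605a0f3e --timeout=300s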

Log

ip-172-30-0-21:~ # kubectl -n longhorn-system get ds | grep engine-image
engine-image-ei-605a0f3e   3         3         1       3            1           <none>          7m1s

ip-172-30-0-21:~ # kubectl -n longhorn-system get pod | grep engine-image
engine-image-ei-605a0f3e-dnrhd             0/1     Running   0          7m6s
engine-image-ei-605a0f3e-l5vfx             0/1     Running   0          7m6s
engine-image-ei-605a0f3e-bcd67             1/1     Running   0          7m6s

ip-172-30-0-21:~ # kubectl -n longhorn-system describe pod/engine-image-ei-605a0f3e-dnrhd
Name:         engine-image-ei-605a0f3e-dnrhd
Namespace:    longhorn-system
Priority:     0
Node:         ip-172-30-0-21/172.30.0.21
Start Time:   Tue, 16 Mar 2021 03:39:41 +0000
Labels:       controller-revision-hash=845db86794
              longhorn.io/component=engine-image
              longhorn.io/engine-image=ei-605a0f3e
              pod-template-generation=1
Annotations:  <none>
Status:       Running
IP:           10.42.0.45
IPs:
  IP:           10.42.0.45
Controlled By:  DaemonSet/engine-image-ei-605a0f3e
Containers:
  engine-image-ei-605a0f3e:
    Container ID:  containerd://63b10894e9877f8c516ee89a620a009558d528e2f074c7400196c32aa0436c5f
    Image:         longhornio/longhorn-engine:master
    Image ID:      docker.io/longhornio/longhorn-engine@sha256:128676e30bd6436772ac358d3e3d52ef1a43035bd9ff1186fc681974ea105ed0
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
    Args:
      -c
      diff /usr/local/bin/longhorn /data/longhorn > /dev/null 2>&1; if [ $? -ne 0 ]; then cp -p /usr/local/bin/longhorn /data/ && echo installed; fi && trap 'rm /data/longhorn* && echo cleaned up' EXIT && sleep infinity
    State:          Running
      Started:      Tue, 16 Mar 2021 03:39:42 +0000
    Ready:          False
    Restart Count:  0
    Readiness:      exec [sh -c ls /data/longhorn && /data/longhorn version --client-only] delay=5s timeout=1s period=5s #success=1 #failure=3
    Environment:    <none>
    Mounts:
      /data/ from data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from longhorn-service-account-token-xxhfk (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  data:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master
    HostPathType:  
  longhorn-service-account-token-xxhfk:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  longhorn-service-account-token-xxhfk
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  7m17s                  default-scheduler  Successfully assigned longhorn-system/engine-image-ei-605a0f3e-dnrhd to ip-172-30-0-21
  Normal   Pulled     7m17s                  kubelet            Container image "longhornio/longhorn-engine:master" already present on machine
  Normal   Created    7m16s                  kubelet            Created container engine-image-ei-605a0f3e
  Normal   Started    7m16s                  kubelet            Started container engine-image-ei-605a0f3e
  Warning  Unhealthy  2m14s (x60 over 7m9s)  kubelet            Readiness probe failed: ls: cannot access '/data/longhorn': No such file or directory
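
The readiness probe above only checks that the engine binary has been installed into the pod's /data mount, which is backed by the hostPath shown under Volumes. A quick check on the affected node (illustrative, using the path from this describe output) to confirm whether the binary is actually missing:

ls /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master/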

longhorn-support-bundle_c34b0423-f841-4052-8bc2-80fe3d53afc5_2021-03-16T04-03-16Z.zip

Environment:

  • Longhorn version: master-head
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K3s/1.20.4
    • Number of management nodes in the cluster: 1
    • Number of worker nodes in the cluster: 2
  • Node config
    • OS type and version: sles15-sp2
    • CPU per node:
    • Memory per node:
    • Disk type(e.g. SSD/NVMe):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS
  • Number of Longhorn volumes in the cluster:

Additional context
This is also causing regression test runs to fail. The results vary.
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/101/
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/102/testReport/tests/

@c3y1huang c3y1huang added kind/bug kind/test Request for adding test labels Mar 16, 2021
@c3y1huang c3y1huang added this to the v1.1.1 milestone Mar 16, 2021
@c3y1huang c3y1huang added reproduce/rare < 50% reproducible reproduce/often 80 - 50% reproducible and removed reproduce/rare < 50% reproducible labels Mar 19, 2021
@innobead innobead modified the milestones: v1.1.1, v1.1.2 Mar 29, 2021
@innobead
Member

In the current implementation, the engine image container acts like a driver/binary installer, so the installation should be an atomic operation. Unfortunately, because it involves copying files into the mounted hostPath volume, it's possible to hit a race condition before and after the operation.
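
For illustration, here is a rough sketch of the install/cleanup logic from the container command shown in the describe output above, with a comment marking where such a race could plausibly occur; the exact interleaving is an assumption, not confirmed from the code:

#!/bin/bash
# Install the engine binary into the shared hostPath mount if it differs
# from the one bundled in the image.
diff /usr/local/bin/longhorn /data/longhorn > /dev/null 2>&1
if [ $? -ne 0 ]; then
  cp -p /usr/local/bin/longhorn /data/ && echo installed
fi
# On exit, remove the installed binaries again.
trap 'rm /data/longhorn* && echo cleaned up' EXIT
# Assumed race: if an old pod's EXIT trap fires after a new pod has already
# copied the binary into the same hostPath directory, the cleanup deletes the
# new pod's binary, and the new pod's readiness probe keeps failing with
# "No such file or directory" until the binary is reinstalled.
sleep infinity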

I doubt this only happens in the tests. Let's figure it out in the regression test runs.

cc @khushboo-rancher @meldafrawi

@c3y1huang
Contributor Author

I was able to reproduce this one manually. But when it gets stuck, restarting again usually fixes it.
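
For example, a restart can be triggered again by deleting the daemonset once more (the Longhorn manager recreates it), or just the stuck pod (the daemonset recreates it); the pod name here is taken from the log above:

kubectl -n longhorn-system delete ds/engine-image-ei-605a0f3e
kubectl -n longhorn-system delete pod/engine-image-ei-605a0f3e-dnrhd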

@innobead innobead added the severity/4 Function working but has a minor issue (a minor incident with low impact) label Mar 30, 2021
@khushboo-rancher
Contributor

#2113 could be a similar one.

@c3y1huang
Contributor Author

c3y1huang commented Apr 21, 2021

Cannot reproduce manually on the current master (067763d). Closing.

@innobead innobead modified the milestones: v1.1.2, v1.2.0 Apr 29, 2021