
[BUG] cannot restart engine image daemonset sometimes #2351

Closed
c3y1huang opened this issue Mar 16, 2021 · 4 comments
Labels
kind/bug kind/test Request for adding test reproduce/often 80 - 50% reproducible severity/4 Function working but has a minor issue (a minor incident with low impact) wontfix

Comments

@c3y1huang
Contributor

c3y1huang commented Mar 16, 2021

Describe the bug
While running e2e tests, I noticed that deleting the engine image daemonset to trigger a restart sometimes fails to bring all engine-image pods back to the running state.

To Reproduce
Steps to reproduce the behavior:

  1. Start with a new Longhorn cluster.
  2. Delete the engine-image daemonset:
kubectl -n longhorn-system delete ds/engine-image-ei-605a0f3e

Expected behavior
All engine image pods should restart and become running eventually.
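
One way to confirm recovery, assuming the recreated daemonset keeps the same ei-605a0f3e name, is to wait for its rollout to complete:

kubectl -n longhorn-system rollout status ds/engine-image-ei-605a0f3e --timeout=300s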

Log

ip-172-30-0-21:~ # kubectl -n longhorn-system get ds | grep engine-image
engine-image-ei-605a0f3e   3         3         1       3            1           <none>          7m1s

ip-172-30-0-21:~ # kubectl -n longhorn-system get pod | grep engine-image
engine-image-ei-605a0f3e-dnrhd             0/1     Running   0          7m6s
engine-image-ei-605a0f3e-l5vfx             0/1     Running   0          7m6s
engine-image-ei-605a0f3e-bcd67             1/1     Running   0          7m6s

ip-172-30-0-21:~ # kubectl -n longhorn-system describe pod/engine-image-ei-605a0f3e-dnrhd
Name:         engine-image-ei-605a0f3e-dnrhd
Namespace:    longhorn-system
Priority:     0
Node:         ip-172-30-0-21/172.30.0.21
Start Time:   Tue, 16 Mar 2021 03:39:41 +0000
Labels:       controller-revision-hash=845db86794
              longhorn.io/component=engine-image
              longhorn.io/engine-image=ei-605a0f3e
              pod-template-generation=1
Annotations:  <none>
Status:       Running
IP:           10.42.0.45
IPs:
  IP:           10.42.0.45
Controlled By:  DaemonSet/engine-image-ei-605a0f3e
Containers:
  engine-image-ei-605a0f3e:
    Container ID:  containerd://63b10894e9877f8c516ee89a620a009558d528e2f074c7400196c32aa0436c5f
    Image:         longhornio/longhorn-engine:master
    Image ID:      docker.io/longhornio/longhorn-engine@sha256:128676e30bd6436772ac358d3e3d52ef1a43035bd9ff1186fc681974ea105ed0
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
    Args:
      -c
      diff /usr/local/bin/longhorn /data/longhorn > /dev/null 2>&1; if [ $? -ne 0 ]; then cp -p /usr/local/bin/longhorn /data/ && echo installed; fi && trap 'rm /data/longhorn* && echo cleaned up' EXIT && sleep infinity
    State:          Running
      Started:      Tue, 16 Mar 2021 03:39:42 +0000
    Ready:          False
    Restart Count:  0
    Readiness:      exec [sh -c ls /data/longhorn && /data/longhorn version --client-only] delay=5s timeout=1s period=5s #success=1 #failure=3
    Environment:    <none>
    Mounts:
      /data/ from data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from longhorn-service-account-token-xxhfk (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  data:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master
    HostPathType:  
  longhorn-service-account-token-xxhfk:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  longhorn-service-account-token-xxhfk
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  7m17s                  default-scheduler  Successfully assigned longhorn-system/engine-image-ei-605a0f3e-dnrhd to ip-172-30-0-21
  Normal   Pulled     7m17s                  kubelet            Container image "longhornio/longhorn-engine:master" already present on machine
  Normal   Created    7m16s                  kubelet            Created container engine-image-ei-605a0f3e
  Normal   Started    7m16s                  kubelet            Started container engine-image-ei-605a0f3e
  Warning  Unhealthy  2m14s (x60 over 7m9s)  kubelet            Readiness probe failed: ls: cannot access '/data/longhorn': No such file or directory
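
The readiness probe above only checks that the engine binary has been installed into the pod's /data mount, which is backed by the hostPath shown under Volumes. A quick check on the affected node (illustrative, using the path from this describe output) to confirm whether the binary is actually missing:

ls /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master/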

longhorn-support-bundle_c34b0423-f841-4052-8bc2-80fe3d53afc5_2021-03-16T04-03-16Z.zip

Environment:

  • Longhorn version: master-head
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K3s/1.20.4
    • Number of management nodes in the cluster: 1
    • Number of worker nodes in the cluster: 2
  • Node config
    • OS type and version: sles15-sp2
    • CPU per node:
    • Memory per node:
    • Disk type(e.g. SSD/NVMe):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS
  • Number of Longhorn volumes in the cluster:

Additional context
This is also causing regression test runs to fail. The results vary.
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/101/
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/102/testReport/tests/

@c3y1huang c3y1huang added kind/bug kind/test Request for adding test labels Mar 16, 2021
@c3y1huang c3y1huang added this to the v1.1.1 milestone Mar 16, 2021
@c3y1huang c3y1huang added reproduce/rare < 50% reproducible reproduce/often 80 - 50% reproducible and removed reproduce/rare < 50% reproducible labels Mar 19, 2021
@innobead innobead modified the milestones: v1.1.1, v1.1.2 Mar 29, 2021
@innobead
Member

In the current implementation, the engine image container acts like a driver/binary installer, so the installation should be an atomic operation. Unfortunately, because it involves copying files into the mounted hostPath volume, it's possible to hit a race condition before and after the operation.
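
For illustration, here is a rough sketch of the install/cleanup logic from the container command shown in the describe output above, with a comment marking where such a race could plausibly occur; the exact interleaving is an assumption, not confirmed from the code:

#!/bin/bash
# Install the engine binary into the shared hostPath mount if it differs
# from the one bundled in the image.
diff /usr/local/bin/longhorn /data/longhorn > /dev/null 2>&1
if [ $? -ne 0 ]; then
  cp -p /usr/local/bin/longhorn /data/ && echo installed
fi
# On exit, remove the installed binaries again.
trap 'rm /data/longhorn* && echo cleaned up' EXIT
# Assumed race: if an old pod's EXIT trap fires after a new pod has already
# copied the binary into the same hostPath directory, the cleanup deletes the
# new pod's binary, and the new pod's readiness probe keeps failing with
# "No such file or directory" until the binary is reinstalled.
sleep infinity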

I doubt this only happens in the tests. Let's figure it out in the regression test runs.

cc @khushboo-rancher @meldafrawi

@c3y1huang
Contributor Author

I was able to reproduce this one manually. But when it gets stuck, restarting again usually fixes it.
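
For example, a restart can be triggered again by deleting the daemonset once more (the Longhorn manager recreates it), or just the stuck pod (the daemonset recreates it); the pod name here is taken from the log above:

kubectl -n longhorn-system delete ds/engine-image-ei-605a0f3e
kubectl -n longhorn-system delete pod/engine-image-ei-605a0f3e-dnrhd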

@innobead innobead added the severity/4 Function working but has a minor issue (a minor incident with low impact) label Mar 30, 2021
@khushboo-rancher
Contributor

#2113 could be a similar one.

@c3y1huang
Contributor Author

c3y1huang commented Apr 21, 2021

Cannot reproduce manually on the current master (067763d). Closing.

@innobead innobead modified the milestones: v1.1.2, v1.2.0 Apr 29, 2021