
Azure Disks occasionally mounted in a way leading to I/O errors #71453

Closed
antoineco opened this issue Nov 27, 2018 · 3 comments · Fixed by #71495
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@antoineco
Contributor

antoineco commented Nov 27, 2018

What happened:

Azure Disks get randomly mounted in a way that makes them unusable, returning Input/output error for every disk I/O operation. Affected Pods remain in CrashLoopBackOff until someone manually recreates them.

What you expected to happen:

Azure Disks are consistently usable by Pods inside which they are mounted.

How to reproduce it:

Create the following StatefulSet. Adjust the replicas and delete the associated Pods (kubectl delete pod -l app=azure-disk-failure) until the problem occurs, i.e. some Pod ends up in CrashLoopBackOff; a repro loop is sketched after the manifest below.

sts.yaml
---
apiVersion: v1
kind: Service
metadata:
  name: azure-disk-failure
spec:
  ports:
  - port: 65535
    targetPort: 65535
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: azure-disk-failure
spec:
  podManagementPolicy: Parallel
  serviceName: azure-disk-failure
  replicas: 3
  selector:
    matchLabels:
      app: azure-disk-failure
  template:
    metadata:
      labels:
        app: azure-disk-failure
    spec:
      containers:
      - image: busybox
        name: touch
        command:
        - sh
        - -c
        args:
        - "touch /vol/$(date '+%Y%m%d%H%M%S') && sleep 9999999999"
        volumeMounts:
        - mountPath: /vol
          name: azure-disk
      terminationGracePeriodSeconds: 1
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - metadata:
      name: azure-disk
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
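
For completeness, the repro loop I use (assuming the manifest above is saved as sts.yaml; the label selector matches the Pod template above):

$ kubectl apply -f sts.yaml
$ # repeat until some Pod gets stuck in CrashLoopBackOff
$ kubectl delete pod -l app=azure-disk-failure
$ kubectl get pod -l app=azure-disk-failure -w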

Default StorageClass (as created by ACS-Engine):

sc.yaml
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: "true"
  labels:
    kubernetes.io/cluster-service: "true"
  name: default
parameters:
  cachingmode: None
  kind: Managed
  storageaccounttype: Standard_LRS
provisioner: kubernetes.io/azure-disk
reclaimPolicy: Delete
volumeBindingMode: Immediate

Anything else we need to know?:

Nodes were updated from 1.10.4 to 1.10.10, then from 1.10.10 to 0180b22 without a reboot.

According to this document written by @andyzhangx the issue is fixed, but I can still observe it quite often.

Ref Azure/acs-engine#1918

Environment:

  • Kubernetes version: v1.10.10 + 0180b22
  • Cloud provider or hardware configuration: Azure (westeurope)
  • OS: Ubuntu 16.04.4 LTS
  • Kernel: 4.13.0-1018-azure
  • Install tools: ACS-Engine
  • Others:

/kind bug
/sig storage
/sig azure

@k8s-ci-robot added kind/bug, sig/storage, sig/azure labels Nov 27, 2018
@antoineco
Contributor Author

antoineco commented Nov 27, 2018

In my current test environment:

$ kubectl describe pod azure-disk-failure-0 (pod uid is 8296eb99-f23b-11e8-bf38-000d3a2f91ba)

Events:
  Type     Reason                  Age                 From                                Message
  ----     ------                  ----                ----                                -------
  Normal   Scheduled               50m                 default-scheduler                   Successfully assigned azure-disk-failure-0 to k8s-my-cluster-vmss00000x
  Normal   SuccessfulMountVolume   49m                 kubelet, k8s-my-cluster-vmss00000x  MountVolume.SetUp succeeded for volume "pvc-d76c49ae-cc71-11e8-99e7-000d3a2f91ba"
  Normal   Pulling                 48m (x4 over 49m)   kubelet, k8s-my-cluster-vmss00000x  pulling image "busybox:latest"
  Normal   Pulled                  48m (x4 over 49m)   kubelet, k8s-my-cluster-vmss00000x  Successfully pulled image "busybox:latest"
  Normal   Created                 48m (x4 over 49m)   kubelet, k8s-my-cluster-vmss00000x  Created container
  Normal   Started                 48m (x4 over 49m)   kubelet, k8s-my-cluster-vmss00000x  Started container
  Normal   SuccessfulAttachVolume  48m (x2 over 49m)   attachdetach-controller             AttachVolume.Attach succeeded for volume "pvc-d76c49ae-cc71-11e8-99e7-000d3a2f91ba"
  Warning  BackOff                 4m (x210 over 49m)  kubelet, k8s-my-cluster-vmss00000x  Back-off restarting failed container

$ kubectl get node k8s-my-cluster-vmss00000x -o yaml (status excerpt)

  volumesAttached:
  - devicePath: "1"
    name: kubernetes.io/azure-disk//subscriptions/<redacted>/resourceGroups/my-group/providers/Microsoft.Compute/disks/k8s-mstr-my-group-kub-pvc-d773c5ab-cc71-11e8-99e7-000d3a2f91ba
  - devicePath: "0"
    name: kubernetes.io/azure-disk//subscriptions/<redacted>/resourceGroups/my-group/providers/Microsoft.Compute/disks/k8s-mstr-my-group-kub-pvc-d76c49ae-cc71-11e8-99e7-000d3a2f91ba
  - devicePath: "2"
    name: kubernetes.io/azure-disk//subscriptions/<redacted>/resourceGroups/my-group/providers/Microsoft.Compute/disks/k8s-mstr-my-group-kub-pvc-be97bfbf-cc6e-11e8-99e7-000d3a2f91ba
  volumesInUse:
  - kubernetes.io/azure-disk//subscriptions/<redacted>/resourceGroups/my-group/providers/Microsoft.Compute/disks/k8s-mstr-my-group-kub-pvc-be97bfbf-cc6e-11e8-99e7-000d3a2f91ba
  - kubernetes.io/azure-disk//subscriptions/<redacted>/resourceGroups/my-group/providers/Microsoft.Compute/disks/k8s-mstr-my-group-kub-pvc-d76c49ae-cc71-11e8-99e7-000d3a2f91ba
  - kubernetes.io/azure-disk//subscriptions/<redacted>/resourceGroups/my-group/providers/Microsoft.Compute/disks/k8s-mstr-my-group-kub-pvc-d773c5ab-cc71-11e8-99e7-000d3a2f91ba

$ tree /dev/disk/azure

/dev/disk/azure
|-- resource -> ../../sdb
|-- resource-part1 -> ../../sdb1
|-- root -> ../../sda
|-- root-part1 -> ../../sda1
`-- scsi1
    |-- lun0 -> ../../../sdf
    |-- lun1 -> ../../../sdd
    `-- lun2 -> ../../../sde

$ mount -l on the node

/dev/sdc on /var/lib/kubelet/plugins/kubernetes.io/azure-disk/mounts/m1001924619 type ext4 (rw,relatime,data=ordered)
/dev/sdc on /var/lib/kubelet/pods/8296eb99-f23b-11e8-bf38-000d3a2f91ba/volumes/kubernetes.io~azure-disk/pvc-d76c49ae-cc71-11e8-99e7-000d3a2f91ba type ext4 (rw,relatime,data=ordered)
/dev/sdc on /var/lib/kubelet/plugins/kubernetes.io/azure-disk/mounts/m1001924619 type ext4 (rw,relatime,data=ordered)
/dev/sdc on /var/lib/kubelet/pods/8296eb99-f23b-11e8-bf38-000d3a2f91ba/volumes/kubernetes.io~azure-disk/pvc-d76c49ae-cc71-11e8-99e7-000d3a2f91ba type ext4 (rw,relatime,data=ordered)

Notice the mismatch: the node status reports devicePath "0" for this PVC's disk, and /dev/disk/azure/scsi1/lun0 points to /dev/sdf, yet the volume is actually mounted from /dev/sdc.
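
For the record, this is roughly how I cross-check the expected device against the one actually mounted on the node (the LUN number and mount path are taken from the outputs above; adjust them for your case):

$ readlink -f /dev/disk/azure/scsi1/lun0
/dev/sdf
$ findmnt -n -o SOURCE /var/lib/kubelet/plugins/kubernetes.io/azure-disk/mounts/m1001924619
/dev/sdc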

$ ls -l /var/lib/kubelet/plugins/kubernetes.io/azure-disk/mounts/m1001924619 on the node

ls: reading directory '/var/lib/kubelet/plugins/kubernetes.io/azure-disk/mounts/m1001924619': Input/output error
total 0

@antoineco
Contributor Author

antoineco commented Nov 27, 2018

After deleting the failing Pod and waiting for its successful recreation:

$ kubectl describe pod azure-disk-failure-0 (pod uid is now 38b167b3-f247-11e8-bf38-000d3a2f91ba)

Events:
  Type     Reason                  Age              From                                Message
  ----     ------                  ----             ----                                -------
  Normal   Scheduled               2m               default-scheduler                   Successfully assigned azure-disk-failure-0 to k8s-my-cluster-vmss00001a
  Normal   SuccessfulAttachVolume  1m (x2 over 2m)  attachdetach-controller             AttachVolume.Attach succeeded for volume "pvc-d76c49ae-cc71-11e8-99e7-000d3a2f91ba"
  Normal   SuccessfulMountVolume   29s              kubelet, k8s-my-cluster-vmss00001a  MountVolume.SetUp succeeded for volume "pvc-d76c49ae-cc71-11e8-99e7-000d3a2f91ba"
  Normal   Pulling                 28s              kubelet, k8s-my-cluster-vmss00001a  pulling image "busybox:latest"
  Normal   Pulled                  28s              kubelet, k8s-my-cluster-vmss00001a  Successfully pulled image "busybox:latest"
  Normal   Created                 28s              kubelet, k8s-my-cluster-vmss00001a  Created container
  Normal   Started                 27s              kubelet, k8s-my-cluster-vmss00001a  Started container

$ kubectl get node k8s-my-cluster-vmss00001a -o yaml (status excerpt)

  volumesAttached:
  - devicePath: "0"
    name: kubernetes.io/azure-disk//subscriptions/<redacted>/resourceGroups/my-group/providers/Microsoft.Compute/disks/k8s-mstr-my-group-kub-pvc-d7621669-cc71-11e8-99e7-000d3a2f91ba
  - devicePath: "1"
    name: kubernetes.io/azure-disk//subscriptions/<redacted>/resourceGroups/my-group/providers/Microsoft.Compute/disks/k8s-mstr-my-group-kub-pvc-d76c49ae-cc71-11e8-99e7-000d3a2f91ba
  volumesInUse:
  - kubernetes.io/azure-disk//subscriptions/<redacted>/resourceGroups/my-group/providers/Microsoft.Compute/disks/k8s-mstr-my-group-kub-pvc-d7621669-cc71-11e8-99e7-000d3a2f91ba
  - kubernetes.io/azure-disk//subscriptions/<redacted>/resourceGroups/my-group/providers/Microsoft.Compute/disks/k8s-mstr-my-group-kub-pvc-d76c49ae-cc71-11e8-99e7-000d3a2f91ba

$ tree /dev/disk/azure

/dev/disk/azure
|-- resource -> ../../sdb
|-- resource-part1 -> ../../sdb1
|-- root -> ../../sda
|-- root-part1 -> ../../sda1
`-- scsi1
    |-- lun0 -> ../../../sdc
    `-- lun1 -> ../../../sdd

$ mount -l on the node

/dev/sdd on /var/lib/kubelet/plugins/kubernetes.io/azure-disk/mounts/m1001924619 type ext4 (rw,relatime,data=ordered)
/dev/sdd on /var/lib/kubelet/pods/38b167b3-f247-11e8-bf38-000d3a2f91ba/volumes/kubernetes.io~azure-disk/pvc-d76c49ae-cc71-11e8-99e7-000d3a2f91ba type ext4 (rw,relatime,data=ordered)
/dev/sdd on /var/lib/kubelet/plugins/kubernetes.io/azure-disk/mounts/m1001924619 type ext4 (rw,relatime,data=ordered)
/dev/sdd on /var/lib/kubelet/pods/38b167b3-f247-11e8-bf38-000d3a2f91ba/volumes/kubernetes.io~azure-disk/pvc-d76c49ae-cc71-11e8-99e7-000d3a2f91ba type ext4 (rw,relatime,data=ordered)

This time the devices match: lun1 points to /dev/sdd, which is also the device that is mounted.

$ ls -l /var/lib/kubelet/plugins/kubernetes.io/azure-disk/mounts/m1001924619 on the node

<my data>

@andyzhangx
Member

/assign
