File exists when re-using mounts after scaling pods #240

Closed · faandg opened this issue Mar 16, 2021 · 5 comments

faandg commented Mar 16, 2021

What happened:
We have an issue that seems to happen when pods are scaled down/up quickly in OpenShift (due to a deployment change, for example a change to the liveness probe).
If the terminated pod was running on the same node that is used for the new pod, the node already has the mount point configured and fails to re-use it:

2 minutes ago
Generated from kubelet on node.domain.com
25 times in the last 2 minutes
MountVolume.MountDevice failed for volume "pv-smb-awsfin-converter-uat" : kubernetes.io/csi: attacher.MountDevice failed to create dir "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pv-smb-awsfin-converter-uat/globalmount": mkdir /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pv-smb-awsfin-converter-uat/globalmount: file exists

We can work around this by marking that specific node as unschedulable and deleting the pod (the mount is then created on a new node), which works, but it is a manual intervention we would like to avoid.
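
For reference, a minimal sketch of that manual workaround with kubectl; the node and pod names are placeholders:

kubectl cordon <node-name>                                # mark the node unschedulable
kubectl delete pod <pod-name> -n system-development-uat   # pod is rescheduled onto another node and the volume is staged there
kubectl uncordon <node-name>                              # allow scheduling on the node again afterwards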

What you expected to happen:
Scaling down/up quickly should not cause any issues with the re-use or re-initialization of mounts.

How to reproduce it:
Install the latest driver, then edit a deployment (for example, change the liveness probe) so that pods terminate and are recreated (possibly on the same node). Most of the time, 1 out of 2 pods runs into this issue.
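
A minimal sketch of triggering such a rolling update from the command line; the deployment/namespace names and the probe value are placeholders, and any change to the pod template has the same effect:

# Trigger a rolling update by changing the liveness probe period (placeholder names/values)
kubectl -n <namespace> patch deployment <deployment> --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/livenessProbe/periodSeconds","value":15}]'
# Watch the old pods terminate and the new ones get scheduled (possibly on the same node)
kubectl -n <namespace> get pods -o wide -w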

Anything else we need to know?:

Environment:
CSI Driver version:
latest

Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2-0-g52c56ce", GitCommit:"3f6a83fb70bfe5c6ef0e0886923c90015a4256bc", GitTreeState:"clean", BuildDate:"2020-10-08T15:54:21Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.3+2fbd7c7", GitCommit:"2fbd7c7", GitTreeState:"clean", BuildDate:"2020-10-09T11:41:17Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}
OS (e.g. from /etc/os-release):
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="45.82.202010091130-0"
VERSION_ID="4.5"
OPENSHIFT_VERSION="4.5"
RHEL_VERSION="8.2"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 45.82.202010091130-0 (Ootpa)"
ID="rhcos"
ID_LIKE="rhel fedora"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.5"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.5"
OSTREE_VERSION='45.82.202010091130-0'
Kernel (e.g. uname -a):
Linux sainfr00071.domainx 4.18.0-193.24.1.el8_2.dt1.x86_64 #1 SMP Thu Sep 24 14:57:05 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

Install tools:
Installed with kubectl

andyzhangx (Member) commented

What's your PV/PVC/deployment config? If you only have one PV, NodeStageVolume would only happen once, for the first pod; the second pod would not trigger NodeStageVolume since that PV is already staged on the node. Providing more details could help us understand this issue. I just used the example here to scale up and down quickly and could not reproduce this issue.
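
As an illustrative sketch, one way to check whether the global staging mount already exists on the node, using the PV name and kubelet path from the error message above:

# On the affected node (e.g. via "oc debug node/<node-name>" followed by "chroot /host")
ls -la /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pv-smb-awsfin-converter-uat/globalmount
# Check whether the SMB share is still mounted at the staging path
mount | grep pv-smb-awsfin-converter-uat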

faandg (Author) commented Mar 16, 2021

PV:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-smb-awsfin-converter-uat
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  mountOptions:
    - dir_mode=0770
    - file_mode=0660
    - vers=3.0
  csi:
    driver: smb.csi.k8s.io
    readOnly: false
    volumeHandle: vh-pv-awsfin-converter-uat
    volumeAttributes:
      source: "//DOMAIN.COM/GLOBAL/APPS-A/ASFI"
    nodeStageSecretRef:
      name: secret-smb-awsfin-converter-uat
      namespace: system-development-uat

PVC:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pvc-smb-awsfin-converter-uat
  namespace: system-development-uat
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 20Gi
  volumeName: pv-smb-awsfin-converter-uat
  storageClassName: ""

The deployment itself is managed by the OpenLiberty operator but just contains a simple reference to the PVC.

kind: Deployment
apiVersion: apps/v1
metadata:
  name: awsfin-converter-uat
  namespace: system-development-uat
  ownerReferences:
    - apiVersion: openliberty.io/v1beta1
      kind: OpenLibertyApplication
      name: awsfin-converter-uat
      uid: ea147876-b264-453e-85ac-f1b1cb93e59e
      controller: true
      blockOwnerDeletion: true
  labels:
    app.kubernetes.io/component: backend
    app.kubernetes.io/instance: awsfin-converter-uat
    app.kubernetes.io/managed-by: open-liberty-operator
    app.kubernetes.io/name: awsfin-converter-uat
    app.kubernetes.io/part-of: awsfin-converter-uat
    kappnav.app.auto-create: 'true'
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/instance: awsfin-converter-uat
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: backend
        app.kubernetes.io/instance: awsfin-converter-uat
        app.kubernetes.io/managed-by: open-liberty-operator
        app.kubernetes.io/name: awsfin-converter-uat
        app.kubernetes.io/part-of: awsfin-converter-uat
    spec:
      restartPolicy: Always
      serviceAccountName: awsfin-converter-uat
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30
      securityContext: {}
      containers:
        - resources: {}
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 9080
              scheme: HTTP
            timeoutSeconds: 1
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          name: app
          livenessProbe:
            httpGet:
              path: /health/live
              port: 9080
              scheme: HTTP
            timeoutSeconds: 1
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          env:
            - name: WLP_LOGGING_CONSOLE_LOGLEVEL
              value: info
            - name: WLP_LOGGING_CONSOLE_SOURCE
              value: 'message,accessLog,ffdc,audit'
            - name: WLP_LOGGING_CONSOLE_FORMAT
              value: json
          ports:
            - name: http
              containerPort: 9080
              protocol: TCP
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: volume-awsfin-converter-uat
              mountPath: /config/bootstrap.properties
              subPath: bootstrap.properties
            - name: volume-awsfin-converter-uat
              mountPath: /config/resources/log4j/log4j2.xml
              subPath: log4j2.xml
            - name: volume-awsfin-converter-uat
              mountPath: /config/includes/application-bnd.xml
              subPath: application-bnd.xml
            - name: volume-awsfin-converter-uat
              mountPath: /config/includes/ldap.xml
              subPath: ldap.xml
            - name: volume-awsfin-converter-uat
              mountPath: /config/resources/security/trust.p12
              subPath: trust.p12
            - name: volume-awsfin-converter-uat
              mountPath: /config/jvm.options
              subPath: jvm.options
            - name: volume-awsfin-converter-filemount-uat
              mountPath: /data/awsfin-converter
            - name: volume-awsfin-converter-uat-outputlog
              mountPath: /output/logs
          terminationMessagePolicy: File
          envFrom:
            - secretRef:
                name: secret-awsfin-converter-uat
          image: 'nexus.sbbaddelijn.be:5001/awsfin-converter:1.2.0'
        - name: awsfin-converter-uat-fluent-bit-sidecar
          image: 'fluent/fluent-bit:1.6'
          args:
            - /fluent-bit/bin/fluent-bit
            - '-c'
            - /fluentbit/etc/fluentbit.conf
          resources: {}
          volumeMounts:
            - name: volume-awsfin-converter-uat-outputlog
              mountPath: /output/logs
            - name: volume-configmap-awsfin-converter-fluent-bit-uat
              mountPath: /fluentbit/etc
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
      serviceAccount: awsfin-converter-uat
      volumes:
        - name: volume-awsfin-converter-uat
          configMap:
            name: configmap-awsfin-converter-uat
            defaultMode: 420
        - name: volume-configmap-awsfin-converter-fluent-bit-uat
          configMap:
            name: configmap-awsfin-converter-fluent-bit-uat
            defaultMode: 420
        - name: volume-awsfin-converter-filemount-uat
          persistentVolumeClaim:
            claimName: pvc-smb-awsfin-converter-uat
        - name: volume-awsfin-converter-uat-outputlog
          emptyDir: {}
      dnsPolicy: ClusterFirst
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600

andyzhangx (Member) commented

Can you reproduce this issue with this example (https://github.com/kubernetes-csi/csi-driver-smb/blob/master/deploy/example/deployment.yaml) in your cluster? I tried, and it works well after a few tries of scaling up/down quickly.
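
A minimal sketch of that reproduction attempt, assuming the resource name used in the linked example (the Deployment there is named deployment-smb; adjust if the example differs) and that the example's SMB secret/StorageClass prerequisites are already in place:

kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/csi-driver-smb/master/deploy/example/deployment.yaml
# Scale down and back up quickly so pods are recreated, possibly on the same node
kubectl scale deployment deployment-smb --replicas=0
kubectl scale deployment deployment-smb --replicas=2
kubectl get pods -o wide -w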

faandg (Author) commented Mar 17, 2021

Hi @andyzhangx ,

It looks like the issue was caused by having this driver and the old deprecated driver installed at the same time.

We installed both drivers to make the transition easier, as we have a lot of applications that had to be tested.
Now that we are done migrating to the new driver, we have uninstalled the old one and can no longer reproduce the issue above.

Issue can be closed for me, unless there are actions you want to take.
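
A minimal sketch of how one could verify that only one SMB CSI driver remains registered after uninstalling the old one; the grep pattern is only an illustration:

# List registered CSI drivers; only smb.csi.k8s.io should remain for SMB
kubectl get csidrivers
# Look for leftover driver pods from the old installation (labels/namespaces differ per install method)
kubectl get pods -A | grep -i smb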

andyzhangx (Member) commented

ok, thanks

andyzhangx added a commit to andyzhangx/csi-driver-smb that referenced this issue Nov 27, 2023