hwameistor schedule error when a bunch of pods are being created #1425

Open
AmazingPangWei opened this issue Mar 25, 2024 · 2 comments

@AmazingPangWei (Contributor)

There are 3 nodes in my k8s cluster, running Hwameistor v0.14.1:

root@pw-k8s01:~# kubectl get node
NAME       STATUS   ROLES                                    AGE   VERSION
pw-k8s01   Ready    control-plane,controlplane,etcd,master   58d   v1.28.3+rke2r2
pw-k8s02   Ready    control-plane,controlplane,etcd,master   56d   v1.28.3+rke2r2
pw-k8s03   Ready    control-plane,controlplane,etcd,master   56d   v1.28.3+rke2r2

Every node has 20G of LVM capacity. I then apply a test YAML file containing 4 pods and 4 PVCs, each PVC requesting 6Gi of storage:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pw-pvc1
spec:
  volumeMode: Block
  storageClassName: hwameistor-storage-lvm-hdd
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 6Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pw-pvc2
spec:
  volumeMode: Block
  storageClassName: hwameistor-storage-lvm-hdd
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 6Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pw-pvc3
spec:
  volumeMode: Block
  storageClassName: hwameistor-storage-lvm-hdd
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 6Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pw-pvc4
spec:
  volumeMode: Block
  storageClassName: hwameistor-storage-lvm-hdd
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 6Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: pw-pod-1
spec:
  containers:
    - name: busybox
      image: busybox:1.31.1
      command:
        - sleep
        - "360000000"
      imagePullPolicy: IfNotPresent
      volumeDevices:
        - name: temp-pvc
          devicePath: /dev/temp-disk
  terminationGracePeriodSeconds: 0
  volumes:
    - name: temp-pvc
      persistentVolumeClaim:
        claimName: pw-pvc1
---
apiVersion: v1
kind: Pod
metadata:
  name: pw-pod-2
spec:
  containers:
    - name: busybox
      image: busybox:1.31.1
      command:
        - sleep
        - "360000000"
      imagePullPolicy: IfNotPresent
      volumeDevices:
        - name: temp-pvc
          devicePath: /dev/temp-disk
  terminationGracePeriodSeconds: 0
  volumes:
    - name: temp-pvc
      persistentVolumeClaim:
        claimName: pw-pvc2
---
apiVersion: v1
kind: Pod
metadata:
  name: pw-pod-3
spec:
  containers:
    - name: busybox
      image: busybox:1.31.1
      command:
        - sleep
        - "360000000"
      imagePullPolicy: IfNotPresent
      volumeDevices:
        - name: temp-pvc
          devicePath: /dev/temp-disk
  terminationGracePeriodSeconds: 0
  volumes:
    - name: temp-pvc
      persistentVolumeClaim:
        claimName: pw-pvc3
---
apiVersion: v1
kind: Pod
metadata:
  name: pw-pod-4
spec:
  containers:
    - name: busybox
      image: busybox:1.31.1
      command:
        - sleep
        - "360000000"
      imagePullPolicy: IfNotPresent
      volumeDevices:
        - name: temp-pvc
          devicePath: /dev/temp-disk
  terminationGracePeriodSeconds: 0
  volumes:
    - name: temp-pvc
      persistentVolumeClaim:
        claimName: pw-pvc4

root@pw-k8s01:~/pangwei/yaml# kubectl apply -f local-pvc-test.yaml
persistentvolumeclaim/pw-pvc1 created
persistentvolumeclaim/pw-pvc2 created
persistentvolumeclaim/pw-pvc3 created
persistentvolumeclaim/pw-pvc4 created
pod/pw-pod-1 created
pod/pw-pod-2 created
pod/pw-pod-3 created
pod/pw-pod-4 created

You can see that three pods land on pw-k8s03 while pw-pod-4 stays Pending:

root@pw-k8s01:~/pangwei/yaml# kubectl get pod -o wide
NAME       READY   STATUS    RESTARTS   AGE     IP              NODE       NOMINATED NODE   READINESS GATES
pw-pod-1   1/1     Running   0          6h15m   100.65.76.184   pw-k8s03   <none>           <none>
pw-pod-2   1/1     Running   0          6h15m   100.65.76.185   pw-k8s03   <none>           <none>
pw-pod-3   1/1     Running   0          6h15m   100.65.76.183   pw-k8s03   <none>           <none>
pw-pod-4   0/1     Pending   0          6h15m   <none>          <none>     <none>           <none>

The scheduler log shows the error:

time="2024-03-19T07:03:49Z" level=debug msg="Filtered out the node" error="can't schedule the LVM volume to node pw-k8s03" node=pw-k8s03 pod=pw-pod-4
I0319 07:03:49.095941       1 scheduler.go:351] "Unable to schedule pod; no fit; waiting" pod="default/pw-pod-4" err="0/3 nodes are available: 1 can't schedule the LVM volume to node pw-k8s03, 2 node(s) didn't find available persistent volumes to bind. preemption: 0/3 nodes are available: 1 No preemption victims found for incoming pod, 2 Preemption is not helpful for scheduling."

PVC pw-pvc4 looks like this:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"PersistentVolumeClaim","metadata":{"annotations":{},"name":"pw-pvc4","namespace":"default"},"spec":{"accessModes":["ReadWriteOnce"],"resources":{"requests":{"storage":"6Gi"}},"storageClassName":"hwameistor-storage-lvm-hdd","volumeMode":"Block"}}
    volume.beta.kubernetes.io/storage-provisioner: lvm.hwameistor.io
    # As you can see, pvc has been scheduled to pw-k8s03
    volume.kubernetes.io/selected-node: pw-k8s03
    volume.kubernetes.io/storage-provisioner: lvm.hwameistor.io
  creationTimestamp: "2024-03-19T03:22:17Z"
  finalizers:
  - kubernetes.io/pvc-protection
  name: pw-pvc4
  namespace: default
  resourceVersion: "56041423"
  uid: eea7fed7-fff1-4304-b4ce-fa1b06e4c942
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 6Gi
  storageClassName: hwameistor-storage-lvm-hdd
  volumeMode: Block
status:
  phase: Pending

Pod pw-pod-4 and PVC pw-pvc4 stay Pending even though there is enough capacity on pw-k8s01/pw-k8s02. Three 6Gi volumes already fill most of the 20G pool on pw-k8s03, so a fourth cannot be created there, yet the volume.kubernetes.io/selected-node annotation has already pinned pw-pvc4 to that node.

@AmazingPangWei (Contributor, Author)

I've read through the hwameistor scheduler source code. In my opinion, there are a couple of problems with the hwameistor scheduler:

  1. Lack of a reservation mechanism. There is currently a window between an LV being scheduled to a node and the LV actually being created and recorded on the lsn, so resource accounting lags behind scheduling. If a bunch of pods are created during that window, nodes that no longer have enough free capacity can still pass the scheduler's Filter function, and there is no meaningful difference between node scores, so the pods all end up scheduled to the same node, where the later LV creations then fail due to insufficient resources. (See the reservation sketch after this list.)
  2. Lack of a reschedule mechanism. Once an LV has been scheduled to a node and its creation fails there because of insufficient resources, nothing moves it to another node. The CSI interface (hwameistor) should be implemented correctly so that the PVC can be rescheduled. (See the CreateVolume sketch after this list.)
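
To make point 1 concrete, here is a minimal, hypothetical sketch of a reservation-aware capacity check. None of these types or functions come from the hwameistor code base; they only illustrate the idea of subtracting not-yet-recorded reservations from the free capacity the lsn reports:

// Hypothetical sketch only: these types are not hwameistor APIs.
package main

import (
	"fmt"
	"sync"
)

// reservationCache tracks bytes promised to volumes whose creation has not yet
// shown up in the node's reported free capacity.
type reservationCache struct {
	mu       sync.Mutex
	reserved map[string]int64 // node name -> reserved bytes
}

func newReservationCache() *reservationCache {
	return &reservationCache{reserved: map[string]int64{}}
}

// CanSchedule compares the request against the reported free capacity minus
// outstanding reservations, instead of against the reported capacity alone.
func (c *reservationCache) CanSchedule(node string, reportedFree, request int64) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return reportedFree-c.reserved[node] >= request
}

// Reserve records the promise; it would be released once the LV is created and
// the lsn capacity is updated, or when creation fails.
func (c *reservationCache) Reserve(node string, request int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.reserved[node] += request
}

func main() {
	const gi int64 = 1 << 30
	cache := newReservationCache()
	reportedFree := 20 * gi // the lsn still reports the full pool during the window

	// Four 6Gi volumes arrive before any creation has been recorded.
	for i := 1; i <= 4; i++ {
		if cache.CanSchedule("pw-k8s03", reportedFree, 6*gi) {
			cache.Reserve("pw-k8s03", 6*gi)
			fmt.Printf("volume %d: fits on pw-k8s03\n", i)
		} else {
			fmt.Printf("volume %d: filtered out, try another node\n", i)
		}
	}
}

With the numbers from this issue (20G pool, four 6Gi requests), the first three reservations go through and the fourth is filtered out, instead of all four passing the filter and the last creation failing on the node.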
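
For point 2, my understanding (please verify against the external-provisioner version in use) is that when a storage class uses WaitForFirstConsumer binding and CreateVolume fails with the gRPC ResourceExhausted code, the CSI external-provisioner drops the volume.kubernetes.io/selected-node annotation so the scheduler can pick a different node. A rough, hypothetical sketch of what that could look like on the driver side; the topology key and the freeCapacityOnNode helper are placeholders, not real hwameistor identifiers:

// Hypothetical sketch only, not hwameistor's actual CSI controller.
package driver

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type controllerServer struct{}

// freeCapacityOnNode would look up the node's remaining LVM pool capacity,
// for example from the LocalStorageNode status. Placeholder implementation.
func freeCapacityOnNode(ctx context.Context, node string) (int64, error) {
	return 0, nil
}

func (s *controllerServer) CreateVolume(ctx context.Context, req *csi.CreateVolumeRequest) (*csi.CreateVolumeResponse, error) {
	required := req.GetCapacityRange().GetRequiredBytes()

	// With WaitForFirstConsumer, the node chosen by the scheduler arrives via
	// the topology requirement. The segment key below is illustrative.
	var node string
	if tr := req.GetAccessibilityRequirements(); tr != nil && len(tr.GetRequisite()) > 0 {
		node = tr.GetRequisite()[0].GetSegments()["example.io/hostname"]
	}

	free, err := freeCapacityOnNode(ctx, node)
	if err != nil {
		return nil, status.Error(codes.Internal, err.Error())
	}
	if free < required {
		// ResourceExhausted marks this as a final, per-node failure, which is
		// what should let the external-provisioner clear the selected-node
		// annotation and have the PVC rescheduled to a different node.
		return nil, status.Errorf(codes.ResourceExhausted,
			"node %s has %d bytes free, %d requested", node, free, required)
	}

	// ... normal LV creation would continue here ...
	return &csi.CreateVolumeResponse{
		Volume: &csi.Volume{VolumeId: req.GetName(), CapacityBytes: required},
	}, nil
}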

@AmazingPangWei (Contributor, Author)

This issue looks very similar to #1424.
