hwameistor schedule error when a bunch of pods are being created #1425

Open
AmazingPangWei opened this issue Mar 25, 2024 · 2 comments

@AmazingPangWei (Contributor)

There are 3 nodes in my k8s cluster, running Hwameistor v0.14.1:

root@pw-k8s01:~# kubectl get node
NAME       STATUS   ROLES                                    AGE   VERSION
pw-k8s01   Ready    control-plane,controlplane,etcd,master   58d   v1.28.3+rke2r2
pw-k8s02   Ready    control-plane,controlplane,etcd,master   56d   v1.28.3+rke2r2
pw-k8s03   Ready    control-plane,controlplane,etcd,master   56d   v1.28.3+rke2r2

Every node has 20G of LVM capacity. I then apply a test YAML file containing 4 pods and 4 PVCs, each PVC requesting 6Gi of storage:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pw-pvc1
spec:
  volumeMode: Block
  storageClassName: hwameistor-storage-lvm-hdd
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 6Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pw-pvc2
spec:
  volumeMode: Block
  storageClassName: hwameistor-storage-lvm-hdd
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 6Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pw-pvc3
spec:
  volumeMode: Block
  storageClassName: hwameistor-storage-lvm-hdd
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 6Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pw-pvc4
spec:
  volumeMode: Block
  storageClassName: hwameistor-storage-lvm-hdd
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 6Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: pw-pod-1
spec:
  containers:
    - name: busybox
      image: busybox:1.31.1
      command:
        - sleep
        - "360000000"
      imagePullPolicy: IfNotPresent
      volumeDevices:
        - name: temp-pvc
          devicePath: /dev/temp-disk
  terminationGracePeriodSeconds: 0
  volumes:
    - name: temp-pvc
      persistentVolumeClaim:
        claimName: pw-pvc1
---
apiVersion: v1
kind: Pod
metadata:
  name: pw-pod-2
spec:
  containers:
    - name: busybox
      image: busybox:1.31.1
      command:
        - sleep
        - "360000000"
      imagePullPolicy: IfNotPresent
      volumeDevices:
        - name: temp-pvc
          devicePath: /dev/temp-disk
  terminationGracePeriodSeconds: 0
  volumes:
    - name: temp-pvc
      persistentVolumeClaim:
        claimName: pw-pvc2
---
apiVersion: v1
kind: Pod
metadata:
  name: pw-pod-3
spec:
  containers:
    - name: busybox
      image: busybox:1.31.1
      command:
        - sleep
        - "360000000"
      imagePullPolicy: IfNotPresent
      volumeDevices:
        - name: temp-pvc
          devicePath: /dev/temp-disk
  terminationGracePeriodSeconds: 0
  volumes:
    - name: temp-pvc
      persistentVolumeClaim:
        claimName: pw-pvc3
---
apiVersion: v1
kind: Pod
metadata:
  name: pw-pod-4
spec:
  containers:
    - name: busybox
      image: busybox:1.31.1
      command:
        - sleep
        - "360000000"
      imagePullPolicy: IfNotPresent
      volumeDevices:
        - name: temp-pvc
          devicePath: /dev/temp-disk
  terminationGracePeriodSeconds: 0
  volumes:
    - name: temp-pvc
      persistentVolumeClaim:
        claimName: pw-pvc4

root@pw-k8s01:~/pangwei/yaml# kubectl apply -f local-pvc-test.yaml
persistentvolumeclaim/pw-pvc1 created
persistentvolumeclaim/pw-pvc2 created
persistentvolumeclaim/pw-pvc3 created
persistentvolumeclaim/pw-pvc4 created
pod/pw-pod-1 created
pod/pw-pod-2 created
pod/pw-pod-3 created
pod/pw-pod-4 created

You can see that three pods land on pw-k8s03 while pw-pod-4 stays Pending:

root@pw-k8s01:~/pangwei/yaml# kubectl get pod -o wide
NAME       READY   STATUS    RESTARTS   AGE     IP              NODE       NOMINATED NODE   READINESS GATES
pw-pod-1   1/1     Running   0          6h15m   100.65.76.184   pw-k8s03   <none>           <none>
pw-pod-2   1/1     Running   0          6h15m   100.65.76.185   pw-k8s03   <none>           <none>
pw-pod-3   1/1     Running   0          6h15m   100.65.76.183   pw-k8s03   <none>           <none>
pw-pod-4   0/1     Pending   0          6h15m   <none>          <none>     <none>           <none>

The scheduler log shows the error:

time="2024-03-19T07:03:49Z" level=debug msg="Filtered out the node" error="can't schedule the LVM volume to node pw-k8s03" node=pw-k8s03 pod=pw-pod-4
I0319 07:03:49.095941       1 scheduler.go:351] "Unable to schedule pod; no fit; waiting" pod="default/pw-pod-4" err="0/3 nodes are available: 1 can't schedule the LVM volume to node pw-k8s03, 2 node(s) didn't find available persistent volumes to bind. preemption: 0/3 nodes are available: 1 No preemption victims found for incoming pod, 2 Preemption is not helpful for scheduling."

PVC pw-pvc4 looks like this:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"PersistentVolumeClaim","metadata":{"annotations":{},"name":"pw-pvc4","namespace":"default"},"spec":{"accessModes":["ReadWriteOnce"],"resources":{"requests":{"storage":"6Gi"}},"storageClassName":"hwameistor-storage-lvm-hdd","volumeMode":"Block"}}
    volume.beta.kubernetes.io/storage-provisioner: lvm.hwameistor.io
    # As you can see, pvc has been scheduled to pw-k8s03
    volume.kubernetes.io/selected-node: pw-k8s03
    volume.kubernetes.io/storage-provisioner: lvm.hwameistor.io
  creationTimestamp: "2024-03-19T03:22:17Z"
  finalizers:
  - kubernetes.io/pvc-protection
  name: pw-pvc4
  namespace: default
  resourceVersion: "56041423"
  uid: eea7fed7-fff1-4304-b4ce-fa1b06e4c942
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 6Gi
  storageClassName: hwameistor-storage-lvm-hdd
  volumeMode: Block
status:
  phase: Pending

Pod pw-pod-4 and PVC pw-pvc4 stay Pending even though there is enough capacity on pw-k8s01/pw-k8s02. Three 6Gi volumes already fill most of the 20G pool on pw-k8s03, so a fourth cannot be created there, yet the volume.kubernetes.io/selected-node annotation has already pinned pw-pvc4 to that node.

@AmazingPangWei (Contributor, Author)

I've read through the hwameistor scheduler source code. In my opinion, there are a couple of problems with the hwameistor scheduler:

  1. Lack of a reservation mechanism. There is currently a window between an LV being scheduled to a node and the LV actually being created and recorded on the lsn, so resource accounting lags behind scheduling. If a bunch of pods are created during that window, nodes that no longer have enough free capacity can still pass the scheduler's Filter function, and there is no meaningful difference between node scores, so the pods all end up scheduled to the same node, where the later LV creations then fail due to insufficient resources. (See the reservation sketch after this list.)
  2. Lack of a reschedule mechanism. Once an LV has been scheduled to a node and its creation fails there because of insufficient resources, nothing moves it to another node. The CSI interface (hwameistor) should be implemented correctly so that the PVC can be rescheduled. (See the CreateVolume sketch after this list.)
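
To make point 1 concrete, here is a minimal, hypothetical sketch of a reservation-aware capacity check. None of these types or functions come from the hwameistor code base; they only illustrate the idea of subtracting not-yet-recorded reservations from the free capacity the lsn reports:

// Hypothetical sketch only: these types are not hwameistor APIs.
package main

import (
	"fmt"
	"sync"
)

// reservationCache tracks bytes promised to volumes whose creation has not yet
// shown up in the node's reported free capacity.
type reservationCache struct {
	mu       sync.Mutex
	reserved map[string]int64 // node name -> reserved bytes
}

func newReservationCache() *reservationCache {
	return &reservationCache{reserved: map[string]int64{}}
}

// CanSchedule compares the request against the reported free capacity minus
// outstanding reservations, instead of against the reported capacity alone.
func (c *reservationCache) CanSchedule(node string, reportedFree, request int64) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return reportedFree-c.reserved[node] >= request
}

// Reserve records the promise; it would be released once the LV is created and
// the lsn capacity is updated, or when creation fails.
func (c *reservationCache) Reserve(node string, request int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.reserved[node] += request
}

func main() {
	const gi int64 = 1 << 30
	cache := newReservationCache()
	reportedFree := 20 * gi // the lsn still reports the full pool during the window

	// Four 6Gi volumes arrive before any creation has been recorded.
	for i := 1; i <= 4; i++ {
		if cache.CanSchedule("pw-k8s03", reportedFree, 6*gi) {
			cache.Reserve("pw-k8s03", 6*gi)
			fmt.Printf("volume %d: fits on pw-k8s03\n", i)
		} else {
			fmt.Printf("volume %d: filtered out, try another node\n", i)
		}
	}
}

With the numbers from this issue (20G pool, four 6Gi requests), the first three reservations go through and the fourth is filtered out, instead of all four passing the filter and the last creation failing on the node.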
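
For point 2, my understanding (please verify against the external-provisioner version in use) is that when a storage class uses WaitForFirstConsumer binding and CreateVolume fails with the gRPC ResourceExhausted code, the CSI external-provisioner drops the volume.kubernetes.io/selected-node annotation so the scheduler can pick a different node. A rough, hypothetical sketch of what that could look like on the driver side; the topology key and the freeCapacityOnNode helper are placeholders, not real hwameistor identifiers:

// Hypothetical sketch only, not hwameistor's actual CSI controller.
package driver

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type controllerServer struct{}

// freeCapacityOnNode would look up the node's remaining LVM pool capacity,
// for example from the LocalStorageNode status. Placeholder implementation.
func freeCapacityOnNode(ctx context.Context, node string) (int64, error) {
	return 0, nil
}

func (s *controllerServer) CreateVolume(ctx context.Context, req *csi.CreateVolumeRequest) (*csi.CreateVolumeResponse, error) {
	required := req.GetCapacityRange().GetRequiredBytes()

	// With WaitForFirstConsumer, the node chosen by the scheduler arrives via
	// the topology requirement. The segment key below is illustrative.
	var node string
	if tr := req.GetAccessibilityRequirements(); tr != nil && len(tr.GetRequisite()) > 0 {
		node = tr.GetRequisite()[0].GetSegments()["example.io/hostname"]
	}

	free, err := freeCapacityOnNode(ctx, node)
	if err != nil {
		return nil, status.Error(codes.Internal, err.Error())
	}
	if free < required {
		// ResourceExhausted marks this as a final, per-node failure, which is
		// what should let the external-provisioner clear the selected-node
		// annotation and have the PVC rescheduled to a different node.
		return nil, status.Errorf(codes.ResourceExhausted,
			"node %s has %d bytes free, %d requested", node, free, required)
	}

	// ... normal LV creation would continue here ...
	return &csi.CreateVolumeResponse{
		Volume: &csi.Volume{VolumeId: req.GetName(), CapacityBytes: required},
	}, nil
}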

@AmazingPangWei (Contributor, Author)

This issue looks very similar to #1424.
