Detect and delete GPU zombie pods in a Kubernetes cluster.
In a GPU cluster, if a GPU is allocated to a pod but shows zero utilization for an extended period, the pod may be hanging or deadlocked, which hurts overall cluster utilization.
This repository provides a tool that automatically detects such "GPU zombie pods" and deletes them after a configurable idle timeout.
We scrape GPU metrics from heyfey/nvidia_smi_exporter. Deploy the exporter first:
git clone https://github.com/heyfey/nvidia_smi_exporter.git
kubectl apply -f nvidia_smi_exporter/nvidia_smi_exporter.yaml
Then deploy the tool:
git clone https://github.com/heyfey/kill-gpu-zombie-pod.git
cd kill-gpu-zombie-pod
kubectl apply -f kill-gpu-zombie-pod.yaml
args:
-check_period_seconds float
Check for zombies every # seconds (default 10)
-idle_timeout_seconds float
Kill the pod after an idle timeout of # seconds (default 90)
-namespace string
Detect and kill GPU zombie pods in the namespace (default "default")
You can specify the args in the YAML, then apply it:
kubectl apply -f kill-gpu-zombie-pod.yaml
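For example, the container spec in the YAML could set the flags like this (a sketch: the container name and image are illustrative, not necessarily what the repository's YAML uses):

```yaml
containers:
- name: kill-gpu-zombie-pod
  image: heyfey/kill-gpu-zombie-pod   # illustrative image name
  args:
  - "-check_period_seconds=10"
  - "-idle_timeout_seconds=120"
  - "-namespace=default"
```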
To build the Docker image yourself:
docker build .