NVIDIA GPU Operator on y-k8s
huataihuang committed Jul 25, 2023
1 parent 7507d3b commit cf81894
Showing 9 changed files with 77 additions and 86 deletions.
2 changes: 1 addition & 1 deletion source/kubernetes/deploy/kubespray/kubespray_startup.rst
@@ -53,7 +53,7 @@ Kubespray quick start
- ``calico_rr`` : for :ref:`kubespray_calico`
- ``bastion`` : if the servers cannot be reached directly (isolated network), a bastion host needs to be specified

- First, copy out the cluster configuration set that will be revised; here the cluster name is :ref:`y-k8s` (a sketch of the copy command follows the include below):

.. literalinclude:: kubespray_startup/cp_y-k8s
:language: bash
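
A sketch of what this copy step presumably amounts to, assuming the stock Kubespray checkout layout with ``inventory/sample`` as the template::

    cd kubespray
    cp -rfp inventory/sample inventory/y-k8s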
@@ -0,0 +1,6 @@
NAME: gpu-operator-1690303523
LAST DEPLOYED: Wed Jul 26 00:45:31 2023
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None
119 changes: 34 additions & 85 deletions source/kubernetes/gpu/install_nvidia_gpu_operator_y-k8s.rst
@@ -8,7 +8,7 @@ Installing NVIDIA GPU Operator on y-k8s

The earlier :ref:`install_nvidia_gpu_operator` practice was done on the ``z-k8s`` cluster of :ref:`priv_cloud_infra`; at that time I had not yet set up :ref:`vgpu`, so the full :ref:`tesla_p10` card was passed straight through via :ref:`ovmf_gpu_nvme`.

To better mimic a large-scale :ref:`gpu_k8s` in miniature, I redeployed :ref:`y-k8s` (the cluster was deployed with :ref:`kubespray_startup`) to run :ref:`machine_learning` on multiple :ref:`vgpu`.

This article is a second run-through of :ref:`install_nvidia_gpu_operator`.

@@ -17,13 +17,19 @@ Installing NVIDIA GPU Operator on y-k8s

Before installing the NVIDIA GPU Operator, you need to make sure the Kubernetes cluster ( :ref:`y-k8s` ) meets the following conditions:

- The Kubernetes worker nodes have a container engine configured, such as :ref:`docker` CE/EE, :ref:`cri-o` or :ref:`containerd` ( note that the ``NVIDIA GPU Operator`` configures :ref:`nvidia_container_runtimes` on the nodes automatically, so there is no need to manually :ref:`install_nvidia_container_toolkit_for_containerd`; a standard :ref:`container_runtimes` installation is enough ): clusters deployed with :ref:`kubespray_startup` use :ref:`containerd` by default
- Node Feature Discovery (NFD) needs to be deployed on every node: by default the NVIDIA GPU Operator deploys the NFD master and workers automatically
- On Kubernetes 1.13 and 1.14 the ``KubeletPodResources`` feature of ``kubelet`` must be enabled (see the sketch right after this list); from Kubernetes 1.15 onward it is enabled by default
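
A minimal sketch of enabling that feature gate on those older releases, assuming kubelet picks up extra flags from a ``KUBELET_EXTRA_ARGS`` environment file (path and mechanism vary by distribution; not needed on 1.15+)::

    # e.g. in /etc/default/kubelet on a kubeadm-style Ubuntu node (assumed location)
    KUBELET_EXTRA_ARGS=--feature-gates=KubeletPodResources=true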

Also confirm the following:

- Every hypervisor host running :ref:`vgpu` -accelerated Kubernetes worker node VMs must first have the NVIDIA vGPU Host Driver version 12.0 (or later) installed:

- :ref:`install_vgpu_license_server`
- :ref:`install_vgpu_manager`
- :ref:`install_vgpu_guest_driver`
- :ref:`vgpu_unlock`

- An NVIDIA vGPU License Server needs to be installed, serving all Kubernetes VM nodes
- A private registry is deployed so the NVIDIA vGPU specific driver container image can be uploaded
- Every Kubernetes worker node can reach that private registry
@@ -47,6 +53,10 @@ Installing NVIDIA GPU Operator on y-k8s
Operands
------------

.. note::

   This step is skipped: right now both workers of my ``y-k8s`` cluster already have :ref:`vgpu` deployed, and there is no node that would have to be excluded from deployment ( the ``NVIDIA GPU Operator`` automatically deploys only onto nodes that have a GPU )

By default, GPU Operator operands are deployed on all GPU worker nodes in the cluster. GPU worker nodes are identified by the label ``feature.node.kubernetes.io/pci-10de.present=true``, where ``0x10de`` is the PCI vendor ID assigned to NVIDIA

- First, label the nodes in the cluster that have an NVIDIA GPU installed (sketched below):
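
A sketch of the kind of labeling this refers to; the node name is a placeholder, and in practice NFD usually applies this label automatically::

    kubectl label nodes <gpu-worker-node> feature.node.kubernetes.io/pci-10de.present=true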
@@ -72,6 +82,12 @@ Operands
:language: bash
:caption: Baremetal/Passthrough default configuration on Ubuntu, installing GPU Operator with helm
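
The helm command itself is included from a file not shown in this diff; NVIDIA's stock installation is typically along these lines (treat it as a sketch; chart options may differ)::

    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
    helm repo update
    helm install --wait --generate-name \
         -n gpu-operator --create-namespace \
         nvidia/gpu-operator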

Installation output:

.. literalinclude:: install_nvidia_gpu_operator/helm_install_gnu-operator_baremetal_passthrough_output
   :language: bash
   :caption: Baremetal/Passthrough default configuration on Ubuntu, installing GPU Operator with helm, output

- After it completes, check:

.. literalinclude:: install_nvidia_gpu_operator/get_gnu-operator_pods
Expand All @@ -80,7 +96,7 @@ Operands(操作数)

You can see the following pods running:

.. literalinclude:: install_nvidia_gpu_operator_y-k8s/get_gnu-operator_pods_output
:language: bash
:caption: after installing GPU Operator, check the nvidia gpu-operator related pods in the cluster
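
The included check is presumably something along these lines (the namespace is an assumption based on the helm output above)::

    kubectl get pods -n gpu-operator -o wide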

@@ -100,99 +116,32 @@ CUDA VectorAdd

pod/cuda-vectoradd created
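
The manifest that produced this is collapsed in the diff above; a typical one-shot CUDA vectorAdd test pod looks roughly like this (the image tag is an assumption)::

    cat <<EOF | kubectl create -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-vectoradd
    spec:
      restartPolicy: OnFailure
      containers:
      - name: cuda-vectoradd
        image: "nvidia/samples:vectoradd-cuda11.2.1"
        resources:
          limits:
            nvidia.com/gpu: 1
    EOF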

Here I ran into a (container) startup problem

Troubleshooting
~~~~~~~~~~~~~~~~~

- Check the pod status::

    kubectl get pods -o wide

  You can see it is not ready (NotReady)::

    NAME             READY   STATUS     RESTARTS   AGE   IP           NODE        NOMINATED NODE   READINESS GATES
    cuda-vectoradd   1/2     NotReady   0          9h    10.0.3.168   z-k8s-n-1   <none>           <none>

- Check why the pod failed to start::

    kubectl describe pods cuda-vectoradd

  The output shows:

.. literalinclude:: install_nvidia_gpu_operator/simple_cuda_sample_1_fail
   :language: bash
   :caption: troubleshooting why a simple CUDA sample run failed (actually normal, see below)
   :emphasize-lines: 17

.. note::

   Phew — it turns out this is actually normal: this NVIDIA CUDA sample simply exits once the computation finishes, so its service status shows as unreachable (it is a one-shot run). Checking the pod logs is enough to verify whether CUDA works correctly (see below). The problems and troubleshooting above were encountered during the first :ref:`install_nvidia_gpu_operator` practice (see that original article).

- Check the CUDA sample pods:

  .. literalinclude:: install_nvidia_gpu_operator_y-k8s/get_pods
     :caption: check the CUDA sample pods

  Status output:

  .. literalinclude:: install_nvidia_gpu_operator_y-k8s/get_pods_output
     :caption: check the CUDA sample pods, output

- Check the logs to see the computation result:

  .. literalinclude:: install_nvidia_gpu_operator_y-k8s/logs_pods
     :caption: read the pod logs with ``kubectl logs`` to judge the computation result

  The output shown below indicates that the NVIDIA GPU Operator is installed and working correctly:

  .. literalinclude:: install_nvidia_gpu_operator_y-k8s/logs_pods_output
     :caption: read the pod logs with ``kubectl logs`` to judge the computation result, output

GPU node scheduling troubleshooting
=====================================

- After one reboot of ``z-k8s-n-1``, some of the ``gpu-operator`` containers were no longer running properly:

.. literalinclude:: install_nvidia_gpu_operator/gpu-operator_pod_fail
   :language: bash
   :caption: after rebooting z-k8s-n-1 the nvidia-device-plugin-validator pod did not start
   :emphasize-lines: 15,16

- Check why it did not start with the ``kubectl describe pods`` command:

.. literalinclude:: install_nvidia_gpu_operator/describe_pods_gpu-operator_pod_fail
   :language: bash
   :caption: use kubectl describe pods to check why nvidia-operator-validator-rbmwr did not start
   :emphasize-lines: 7,8

Why does ``failed to get sandbox runtime: no runtime for "nvidia" is configured`` appear?

.. note::

   Checking ``z-k8s-n-1`` showed that the ``NVIDIA GPU Operator`` installation modifies ``/etc/containerd/config.toml`` and wiped out the configuration I had amended earlier via ``containerd-config.path``. It looks like the containerd patching method in NVIDIA's official documentation should no longer be needed: installing ``nvidia-container-toolkit`` corrects the configuration automatically.

   I will verify this guess when I get another chance.
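
A quick way to confirm the toolkit rewrote the CRI configuration, assuming the default containerd config path::

    # the CRI runtime table should now contain an "nvidia" runtime entry...
    grep -n 'runtimes.nvidia' /etc/containerd/config.toml
    # ...pointing at the NVIDIA runtime wrapper binary
    grep -n 'nvidia-container-runtime' /etc/containerd/config.toml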

``nvidia-device-plugin-validator`` did not start because it detected 0 devices and therefore could not run:

.. literalinclude:: install_nvidia_gpu_operator/describe_pods_gpu-operator_pod_fail_1
   :language: bash
   :caption: kubectl describe pods shows nvidia-device-plugin-validator did not start because no device was detected
   :emphasize-lines: 6

Why?

The post `pod的状态出现UnexpectedAdmissionError是什么鬼? <https://izsk.me/2022/01/27/Kubernetes-pod-status-is-UnexpectedAdmissionError/>`_ gave me a clue:

- Because this setup is :ref:`ovmf_gpu_nvme` rather than :ref:`vgpu`, the virtual machine actually has only one GPU card
- A passthrough GPU card can only be allocated to one pod container; once it is allocated, a second pod that needs the GPU can no longer be scheduled onto this node
- Among the pods of the ``NVIDIA GPU Operator``, ``nvidia-device-plugin-validator`` is a special pod: after every node start-up you will see it go into the ``Completed`` state

  - ``nvidia-device-plugin-validator`` runs only once at node start-up; its job is to verify that the worker node has an NVIDIA GPU device, and once the check passes it ends its own pod run
  - Another pod with a similar validation role is ``nvidia-cuda-validator``, which also runs its check once at start-up

- Earlier, to verify the ``NVIDIA GPU Operator``, I ran a ``simple CUDA sample, adding two vectors``; that pod is not deleted automatically after it finishes and stays in the ``NotReady`` state. That is exactly the problem: this pod held on to the GPU device, so later pods such as :ref:`stable_diffusion_on_k8s` could not be scheduled
- I then rebooted the GPU worker node without deleting the ``simple CUDA sample, adding two vectors`` pod; at boot it grabbed the GPU device again, so ``nvidia-device-plugin-validator`` could not obtain a GPU device to test and the validation could not pass

**Solution**

- Delete the unused ``simple CUDA sample, adding two vectors`` pod; after that ``nvidia-device-plugin-validator`` runs to completion normally and enters the ``Completed`` state
- This also fixed the scheduling failure of :ref:`stable_diffusion_on_k8s`

If the pod is not deleted after the computation finishes, the GPU stays occupied and no further GPU compute tasks can be scheduled; in that case, delete the pod in ``Completed`` state to release the GPU resource. See :ref:`gpu_node_schedule_err_debug` for details.
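
For example, releasing the GPU held by the one-shot sample above::

    kubectl delete pod cuda-vectoradd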

Next steps
===========
@@ -0,0 +1,22 @@
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
gpu-feature-discovery-bksgp 1/1 Running 0 3m52s 10.233.89.132 y-k8s-n-1 <none> <none>
gpu-feature-discovery-wstlz 1/1 Running 0 3m52s 10.233.78.71 y-k8s-n-2 <none> <none>
gpu-operator-1690303523-node-feature-discovery-master-6f5b7rdpm 1/1 Running 0 7m46s 10.233.109.74 y-k8s-m-1 <none> <none>
gpu-operator-1690303523-node-feature-discovery-worker-bblkw 1/1 Running 0 7m46s 10.233.109.75 y-k8s-m-1 <none> <none>
gpu-operator-1690303523-node-feature-discovery-worker-dw5gs 1/1 Running 0 7m47s 10.233.93.213 y-k8s-m-3 <none> <none>
gpu-operator-1690303523-node-feature-discovery-worker-l7hbw 1/1 Running 0 7m47s 10.233.89.129 y-k8s-n-1 <none> <none>
gpu-operator-1690303523-node-feature-discovery-worker-nc2dh 1/1 Running 0 7m46s 10.233.78.67 y-k8s-n-2 <none> <none>
gpu-operator-1690303523-node-feature-discovery-worker-sb5hn 1/1 Running 0 7m47s 10.233.121.11 y-k8s-m-2 <none> <none>
gpu-operator-56849f4cc-82vqm 1/1 Running 0 7m46s 10.233.78.66 y-k8s-n-2 <none> <none>
nvidia-container-toolkit-daemonset-g7sf9 1/1 Running 0 3m55s 10.233.89.131 y-k8s-n-1 <none> <none>
nvidia-container-toolkit-daemonset-tgjqk 1/1 Running 0 3m55s 10.233.78.69 y-k8s-n-2 <none> <none>
nvidia-cuda-validator-45ngk 0/1 Completed 0 2m35s 10.233.78.73 y-k8s-n-2 <none> <none>
nvidia-cuda-validator-wjqvw 0/1 Completed 0 2m22s 10.233.89.136 y-k8s-n-1 <none> <none>
nvidia-dcgm-exporter-85qt9 1/1 Running 0 3m53s 10.233.89.133 y-k8s-n-1 <none> <none>
nvidia-dcgm-exporter-sgnkt 1/1 Running 0 3m53s 10.233.78.72 y-k8s-n-2 <none> <none>
nvidia-device-plugin-daemonset-2p6hw 1/1 Running 0 3m54s 10.233.78.74 y-k8s-n-2 <none> <none>
nvidia-device-plugin-daemonset-bccw8 1/1 Running 0 3m54s 10.233.89.134 y-k8s-n-1 <none> <none>
nvidia-device-plugin-validator-c4lcq 0/1 Completed 0 70s 10.233.78.75 y-k8s-n-2 <none> <none>
nvidia-device-plugin-validator-h8v2q 0/1 Completed 0 66s 10.233.89.137 y-k8s-n-1 <none> <none>
nvidia-operator-validator-mrhp9 1/1 Running 0 3m55s 10.233.89.135 y-k8s-n-1 <none> <none>
nvidia-operator-validator-vkgx5 1/1 Running 0 3m54s 10.233.78.70 y-k8s-n-2 <none> <none>
@@ -0,0 +1 @@
kubectl get pods -o wide
@@ -0,0 +1,2 @@
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cuda-vectoradd 0/1 Completed 0 24s 10.233.89.138 y-k8s-n-1 <none> <none>
@@ -0,0 +1 @@
kubectl logs cuda-vectoradd
@@ -0,0 +1,6 @@
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
4 changes: 4 additions & 0 deletions source/real/priv_cloud/y-k8s_nvidia_gpu_operator.rst
@@ -3,3 +3,7 @@
==============================================================================
The y-k8s cluster deploys :ref:`gpu_k8s` through :ref:`nvidia_gpu_operator`
==============================================================================

.. note::

   Via :ref:`install_nvidia_gpu_operator_y-k8s`
