
NVIDIA GPU support #2079

Closed
paulbaumgart opened this issue Dec 14, 2017 · 8 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

paulbaumgart commented Dec 14, 2017

This would make it easier to use kubespray in conjunction with kubeflow.

I think this is what it would take:

  1. Add two new boolean vars: accelerators_enabled and device_plugins_enabled to inventory/group_vars/k8s-cluster.yml
  2. Expand the kube_feature_gates list var in main.yml to set Accelerators={{ accelerators_enabled|string }} and DevicePlugins={{ device_plugins_enabled|string }}
  3. Create a new playbook called contrib/nvidia_gpu/nvidia_gpu_ubuntu.yml that does the following:
    On every node:
    a) Install a user-configurable nvidia-XYZ driver apt package (perhaps defaulting to nvidia-384)
    b) Install the nvidia-docker2 package following the instructions here: https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0)
    c) Update /etc/docker/daemon.json to set "default-runtime": "nvidia"
    d) Run pkill -SIGHUP dockerd
    e) Install the k8s-device-plugin daemonset via kubectl
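
The daemon.json change and daemon reload described above could look something like this; the "runtimes" stanza is the one nvidia-docker2 registers, and the file path is the default Docker daemon config location:

```json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```

After writing /etc/docker/daemon.json, `pkill -SIGHUP dockerd` asks the daemon to reload its configuration without restarting running containers.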
@AtzeDeVries

I would like to work on this. I've tested a deployment with kubespray and (manually) added Nvidia GPU support. This functions well.
To make this work in an automated way using ansible/kubespray, there are a few things to consider.

  • NVIDIA drivers are proprietary, and technically a license must be accepted. Is there a preferred way to handle this in kubespray?
  • nvidia-docker is tied to specific Docker versions; the only out-of-the-box match I could find was docker-ce 1.13.
  • If you are running different GPU types, we should label the nodes accordingly. Or is this out of scope for kubespray?
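
For the node-labelling point, a minimal sketch with plain kubectl; the label key `accelerator`, its values, and the node names are all example choices, not a kubespray convention:

```shell
# Label each GPU node with its accelerator type so pods can target it
# with a nodeSelector; key/value names here are illustrative only.
kubectl label nodes gpu-node-1 accelerator=nvidia-tesla-k80
kubectl label nodes gpu-node-2 accelerator=nvidia-gtx-1080ti
```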

Some notes:
The best way to install the NVIDIA drivers on (Ubuntu) Linux is:

  1. download http://us.download.nvidia.com/XFree86/Linux-x86_64/384.111/NVIDIA-Linux-x86_64-384.111.run
  2. make it executable
  3. run it with the correct switches (to accept the license and make the install non-interactive)

This will install the drivers without installing Xorg.
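
The three runfile steps above can be sketched in shell. The version pinned here is the 384.111 build linked above; the installer's `--silent` switch runs non-interactively and implies license acceptance:

```shell
# 1. Download the pinned driver runfile (384.111, as linked above)
wget http://us.download.nvidia.com/XFree86/Linux-x86_64/384.111/NVIDIA-Linux-x86_64-384.111.run
# 2. Make it executable
chmod +x NVIDIA-Linux-x86_64-384.111.run
# 3. Run it non-interactively; --dkms registers the module with DKMS
#    so it is rebuilt on kernel upgrades
sudo ./NVIDIA-Linux-x86_64-384.111.run --silent --dkms
```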

@clkao clkao mentioned this issue Jan 16, 2018
arielramraz commented Feb 5, 2018

Hello,
Two small questions:

  1. In step 2, which main.yml? There are many of them.
  2. Do you have an example of nvidia_gpu_ubuntu.yml?

Thank you in advance.


mlushpenko commented Feb 11, 2018

@AtzeDeVries as I am playing with mining setups now, I can confirm that installing via the runfile is the best approach; I had too much pain with apt-get installations. I haven't had any problems with the license or anything. Here is part of the code I am using (it's not very clean yet, but it gives the general idea):

@arielramraz

# https://gist.github.com/wangruohui/df039f0dc434d6486f5d4d098aa52d07#install-nvidia-graphics-driver-via-apt-get
- name: Install dependencies for run files
  apt:
    name: "{{ item }}"
    state: present
  with_items:
    - build-essential
    - dkms

- name: Download driver
  get_url:
    url: http://us.download.nvidia.com/XFree86/Linux-x86_64/390.25/NVIDIA-Linux-x86_64-390.25.run
    dest: /root
    mode: 0755

- name: Blacklist for Nouveau Driver
  copy:
    src: blacklist-nouveau.conf
    dest: /etc/modprobe.d/

- name: Update initramfs
  command: update-initramfs -u

- name: Reboot machine (not quite sure why this step is needed, was following tutorial)
  include: reboot.yml

- name: Stop lightdm/gdm/kdm
  service:
    name: lightdm
    state: stopped

- name: Install driver
  command: /root/NVIDIA-Linux-x86_64-390.25.run --dkms -s

- name: Check if driver is installed correctly
  command: nvidia-smi

- name: Start lightdm/gdm/kdm
  service:
    name: lightdm
    state: started

blacklist-nouveau.conf

blacklist nouveau
options nouveau modeset=0

An X server is needed if you want access to nvidia-settings, as nvidia-smi doesn't expose all GPU characteristics. Generally, I am not sure whether driver installation should be part of the code; perhaps handle just the Kubernetes and Docker configuration and leave driver installation to the user, since it is done on the host itself and is not directly related to Kubernetes?

@arielramraz

Thank you very much for the help!

@Atoms Atoms added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 21, 2018
@jayunit100

xref #3438

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 11, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
