
NVIDIA GPU support #2079

Closed
paulbaumgart opened this issue Dec 14, 2017 · 8 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

paulbaumgart commented Dec 14, 2017

This would make it easier to use kubespray in conjunction with kubeflow.

I think this is what it would take:

  1. Add two new boolean vars: accelerators_enabled and device_plugins_enabled to inventory/group_vars/k8s-cluster.yml
  2. Expand the kube_feature_gates list var in main.yml to set Accelerators={{ accelerators_enabled|string }} and DevicePlugins={{ device_plugins_enabled|string }}
  3. Create a new playbook called contrib/nvidia_gpu/nvidia_gpu_ubuntu.yml that does the following:
    On every node:
    a) Install a user-configurable nvidia-XYZ driver apt package (perhaps defaulting to nvidia-384)
    b) Install the nvidia-docker2 package following the instructions here: https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0)
    c) Update /etc/docker/daemon.json to set "default-runtime": "nvidia"
    d) Run pkill -SIGHUP dockerd
    e) Install the k8s-device-plugin daemonset via kubectl
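
The daemon.json change and daemon reload described above could look something like this; the "runtimes" stanza is the one nvidia-docker2 registers, and the file path is the default Docker daemon config location:

```json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```

After writing /etc/docker/daemon.json, `pkill -SIGHUP dockerd` asks the daemon to reload its configuration without restarting running containers.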
@AtzeDeVries

I would like to work on this. I've tested a deployment with kubespray and (manually) added Nvidia GPU support. This functions well.
To make this work in an automated way using ansible/kubespray, there are a few things to consider.

  • NVIDIA drivers are proprietary, and technically a license must be accepted. Is there a preferred way to handle this in kubespray?
  • nvidia-docker is tied to specific Docker versions; the only out-of-the-box match I could find was docker-ce 1.13.
  • If you are running different GPU types, we should label the nodes accordingly. Or is this out of scope for kubespray?
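
For the node-labelling point, a minimal sketch with plain kubectl; the label key `accelerator`, its values, and the node names are all example choices, not a kubespray convention:

```shell
# Label each GPU node with its accelerator type so pods can target it
# with a nodeSelector; key/value names here are illustrative only.
kubectl label nodes gpu-node-1 accelerator=nvidia-tesla-k80
kubectl label nodes gpu-node-2 accelerator=nvidia-gtx-1080ti
```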

Some notes:
The best way to install the NVIDIA drivers on (Ubuntu) Linux is:

  1. download http://us.download.nvidia.com/XFree86/Linux-x86_64/384.111/NVIDIA-Linux-x86_64-384.111.run
  2. make it executable
  3. run it with the correct switches (to accept the license and make the install non-interactive)

This will install the drivers without installing Xorg.
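
The three runfile steps above can be sketched in shell. The version pinned here is the 384.111 build linked above; the installer's `--silent` switch runs non-interactively and implies license acceptance:

```shell
# 1. Download the pinned driver runfile (384.111, as linked above)
wget http://us.download.nvidia.com/XFree86/Linux-x86_64/384.111/NVIDIA-Linux-x86_64-384.111.run
# 2. Make it executable
chmod +x NVIDIA-Linux-x86_64-384.111.run
# 3. Run it non-interactively; --dkms registers the module with DKMS
#    so it is rebuilt on kernel upgrades
sudo ./NVIDIA-Linux-x86_64-384.111.run --silent --dkms
```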

@clkao clkao mentioned this issue Jan 16, 2018
arielramraz commented Feb 5, 2018

Hello,
Two small questions:

  1. In step 2, which main.yml? There are many of them.
  2. Do you have an example of nvidia_gpu_ubuntu.yml?

Thank you in advance.


mlushpenko commented Feb 11, 2018

@AtzeDeVries as I am playing with mining setups now, I can confirm that installing via the runfile is the best approach; I had too much pain with apt-get installations. I haven't had any problems with the license or anything. Here is part of the code I am using (it's not very clean yet, but it gives the general idea):

@arielramraz

# https://gist.github.com/wangruohui/df039f0dc434d6486f5d4d098aa52d07#install-nvidia-graphics-driver-via-apt-get
- name: Install dependencies for run files
  apt:
    name: "{{ item }}"
    state: present
  with_items:
    - build-essential
    - dkms

- name: Download driver
  get_url:
    url: http://us.download.nvidia.com/XFree86/Linux-x86_64/390.25/NVIDIA-Linux-x86_64-390.25.run
    dest: /root
    mode: 0755

- name: Blacklist for Nouveau Driver
  copy:
    src: blacklist-nouveau.conf
    dest: /etc/modprobe.d/

- name: Update initramfs
  command: update-initramfs -u

- name: Reboot machine (not quite sure why this step is needed, was following tutorial)
  include: reboot.yml

- name: Stop lightdm/gdm/kdm
  service:
    name: lightdm
    state: stopped

- name: Install driver
  command: /root/NVIDIA-Linux-x86_64-390.25.run --dkms -s

- name: Check if driver is installed correctly
  command: nvidia-smi

- name: Start lightdm/gdm/kdm
  service:
    name: lightdm
    state: started

blacklist-nouveau.conf

blacklist nouveau
options nouveau modeset=0

An X server is needed if you want access to nvidia-settings, as nvidia-smi doesn't expose all GPU characteristics. Generally, I am not sure whether driver installation should be part of the code; perhaps handle just the Kubernetes and Docker configuration and leave driver installation to the user, since it is done on the host itself and is not directly related to Kubernetes?

@arielramraz

Thank you very much for the help!

@Atoms Atoms added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 21, 2018
@jayunit100

xref #3438

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 11, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
