This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

NVIDIA driver installation support on GPU instances #645

Merged
8 commits merged into kubernetes-retired:master on May 22, 2017
Conversation

everpeace
Contributor

@everpeace everpeace commented May 11, 2017

Hello community! Thank you for developing this great tool! I'm glad to have the opportunity to contribute to this project, since my colleague @mumoshu always encourages me.

As everybody knows, AWS offers NVIDIA GPU-ready instance type families (P2 and G2), and Kubernetes has supported GPU resource scheduling since 1.6. However, NVIDIA drivers are not installed in the default CoreOS AMI used by kube-aws. So, let's support it!

This PR implements automatic installation of the NVIDIA GPU driver. I borrowed some of the driver installation scripts from /Clarifai/coreos-nvidia.

Design summary

Configuration and what will happen

The new configuration for this feature is really simple: worker.nodePool[i].gpu.nvidia.{enabled,version} is introduced in cluster.yaml.

  • The default value of enabled is false.
  • The user will be warned if
    • they set enabled: true when instanceType doesn't support GPUs (in this case the setting is ignored), or
    • they set enabled: false when instanceType does support GPUs.
  • When enabled: true is set on a GPU-supported instance type,
    • the NVIDIA driver is installed automatically on each node in the node pool.
    • The installation happens just before kubelet.service starts (see below).
    • kubelet starts with --feature-gates="Accelerators=true",
    • and containers can then mount the NVIDIA driver like this.
  • Several labels are assigned to the node so that workloads can be scheduled onto the appropriate GPU model and driver version using nodeAffinity:
    • alpha.kubernetes.io/nvidia-gpu-name=<GPU hardware type name>
    • kube-aws.coreos.com/gpu=nvidia
    • kube-aws.coreos.com/nvidia-gpu-version=<version>
    • Because substitution is not used in the unit definition, I introduced /etc/default/kubelet for defining these label values in this commit (a hedged sketch of this wiring follows the list).
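
A minimal, hypothetical sketch of that wiring, expressed as a cloud-config fragment. The variable name NODE_LABELS and the example label values are illustrative assumptions, not the PR's actual cloud-config-worker content:

```
write_files:
  - path: /etc/default/kubelet
    permissions: "0644"
    content: |
      # Example values only; the real file is generated per node / node pool.
      NODE_LABELS=kube-aws.coreos.com/gpu=nvidia,kube-aws.coreos.com/nvidia-gpu-version=375.66,alpha.kubernetes.io/nvidia-gpu-name=Tesla-K80
# kubelet.service could then load this file with EnvironmentFile=/etc/default/kubelet and pass
# --feature-gates="Accelerators=true" --node-labels="${NODE_LABELS}" on its ExecStart line.
```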

Driver installation process

Most of the installation scripts are borrowed from /Clarifai/coreos-nvidia. In particular, for device node installation I referenced Clarifai/coreos-nvidia#4. Here is a summary of the installation process (a condensed unit sketch follows this list).

  • kubelet.service requires nvidia-start.service.
  • nvidia-start.service invokes build-and-install.sh, which installs the NVIDIA drivers and kernel module files, via ExecStartPre. nvidia-start.service then creates the device nodes (nvidiactl and nvidia0,1,...). Other dynamic device nodes are handled by udevadm (the configuration is in this rule file).
    • nvidia-start.service is Type=oneshot because kubelet.service should wait until nvidia-start.sh has completely succeeded.
    • A Restart policy cannot be used with Type=oneshot, so nvidia-start.service doesn't rely on systemd's retry feature; a manual retry.sh is used instead.
  • nvidia-persistenced is also enabled to speed up startup. This service is started/stopped via udevadm too.
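
For illustration, here is a condensed sketch of that unit relationship as a cloud-config coreos.units entry. The unit body below (description, the nvidia-start.sh path, and the retry.sh invocation style) is an assumption; the PR's cloud-config-worker contains the real definitions.

```
coreos:
  units:
    - name: nvidia-start.service
      enable: true
      content: |
        [Unit]
        Description=Build, install and start the NVIDIA driver (sketch)
        Before=kubelet.service

        [Service]
        # oneshot + RemainAfterExit so that kubelet.service, which declares
        # Requires= and After= on this unit, only starts once installation has succeeded.
        Type=oneshot
        RemainAfterExit=true
        # Restart= cannot be combined with Type=oneshot, hence the manual retry wrapper.
        # Assuming retry.sh re-runs the command given as its arguments until it succeeds.
        ExecStartPre=/opt/nvidia-build/util/retry.sh /opt/nvidia-build/build-and-install.sh
        # Assumed path: the script that loads the module and creates /dev/nvidiactl, /dev/nvidia0, ...
        ExecStart=/opt/nvidia-build/nvidia-start.sh

        [Install]
        RequiredBy=kubelet.service
```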

How to try

  1. build kube-aws on this branch
  2. kube-aws up with the minimal node pool configuration below
    worker:
     nodePools:
      - name: p2xlarge
        count: 1
        instanceType: p2.xlarge
        rootVolume:
          size: 30
          type: gp2
        gpu:
          nvidia:
            enabled: true
            version: "375.66"
    
  3. check kubectl get nodes --show-labels. You'll see one node with GPU-related labels.
  4. try starting this pod (a hedged sketch of such a pod spec follows these steps)
    kubectl create -f pod.yaml
    
  5. the log reports that a sample matrix multiplication is computed on the GPUs.
    kubectl logs gpu-pod
    
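A hedged sketch of what a GPU test pod along these lines could look like (the referenced pod.yaml gist is the authoritative example; the image, command, in-container mount path, and driver-version value here are illustrative assumptions):

```
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  # Schedule only onto nodes carrying the labels added by this PR.
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - {key: kube-aws.coreos.com/gpu, operator: In, values: ["nvidia"]}
              - {key: kube-aws.coreos.com/nvidia-gpu-version, operator: In, values: ["375.66"]}
  containers:
    - name: cuda
      image: nvidia/cuda:8.0-runtime               # illustrative image
      command: ["/usr/local/nvidia/bin/nvidia-smi"]
      env:
        - name: LD_LIBRARY_PATH
          value: /usr/local/nvidia/lib
      resources:
        limits:
          alpha.kubernetes.io/nvidia-gpu: 1        # GPU resource behind the Accelerators feature gate
      volumeMounts:
        - name: nvidia-driver
          mountPath: /usr/local/nvidia             # assumed in-container path
          readOnly: true
  volumes:
    - name: nvidia-driver
      hostPath:
        path: /opt/nvidia/current                  # host driver install prefix (see the nvidia-persistenced unit)
```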

Feedback is always welcome!

@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://github.com/kubernetes/kubernetes/wiki/CLA-FAQ to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label May 11, 2017
@codecov-io

codecov-io commented May 11, 2017

Codecov Report

Merging #645 into master will decrease coverage by 1.12%.
The diff coverage is 0%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #645      +/-   ##
==========================================
- Coverage   38.26%   37.14%   -1.13%     
==========================================
  Files          51       51              
  Lines        3316     3201     -115     
==========================================
- Hits         1269     1189      -80     
+ Misses       1845     1836       -9     
+ Partials      202      176      -26
Impacted Files Coverage Δ
model/gpu.go 0% <0%> (ø)
model/node_pool_config.go 20.28% <0%> (-0.93%) ⬇️
core/controlplane/config/credential.go 57.14% <0%> (-3.09%) ⬇️
core/controlplane/config/tls_config.go
core/controlplane/config/token_config.go
core/controlplane/config/encrypted_assets.go 73.09% <0%> (ø)
core/controlplane/config/config.go 56.42% <0%> (+0.44%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f22f2cf...0546e94.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels May 11, 2017
@mumoshu
Contributor

mumoshu commented May 12, 2017

cc @jollinshead @redbaron I believe you're currently running batch workloads on your kube-aws clusters. Are you also willing to run machine learning workloads utilizing GPUs? 😃

@mumoshu mumoshu changed the title from "Support Nvidia driver installation support on GPU instances." to "NVIDIA driver installation support on GPU instances" May 12, 2017
# # (Experimental) GPU Driver installation support
# # Currently, only the NVIDIA driver is supported.
# # This setting takes effect only when the node's instance family is p2 or g2.
# # Otherwise, installation will be skipped even if enabled.
Contributor

According to https://github.com/kubernetes-incubator/kube-aws/pull/645/files#diff-5e5dcac90c0e906cb335a42b0352ce9cR47, it seems like kube-aws emits a validation error when GPU support is enabled on a node pool with an instance type other than p2 or g2?

Contributor

Ah, sorry! I missed that there is only a warning.

Anyway, I believe we'd better make it an error rather than a warning, because the user clearly intends to enable GPU support but kube-aws was unable to do so.
WDYT?

Contributor Author

Oh, OK. Yes, I agree with you. kube-aws should prohibit enabled: true with instance types that don't support GPUs. I will update my code.

ExecStart=/opt/nvidia/current/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced

- path: /opt/nvidia-build/nvidia-start.service
Contributor

Just for my education, would you mind sharing how this systemd unit gets installed into systemd?
Can we set up the systemd unit via the units section of cloud-config like the others?

Contributor Author

Please see the comment; nvidia-install.sh does that.

tar -C ${ARTIFACT_DIR} -cvj ${TOOLS} > tools-${VERSION}.tar.bz2
tar -C ${ARTIFACT_DIR}/kernel -cvj $(basename -a ${ARTIFACT_DIR}/kernel/*.ko) > modules-${COMBINED_VERSION}.tar.bz2

- path: /opt/nvidia-build/build.sh
Contributor

AFAICS, coreos-nvidia supports cross-building against any version of Container Linux.
Just curious, but could we run build.sh locally, e.g. in a Vagrant machine hosting Container Linux, export the built assets to disk, and then embed them in cloud-config or put them on S3 for faster startup of GPU-enabled nodes?

Contributor Author

@everpeace everpeace May 12, 2017

Yeah, probably. I will try this in my local Vagrant!

Contributor Author

As you pointed out, building the libraries, kernel modules, and NVIDIA tools succeeded in Vagrant!
Putting pre-built binaries onto GPU nodes directly makes startup faster. However, the kube-aws up process would become more complex, and it would require users to install Vagrant and VirtualBox.

Honestly speaking, the build process takes several minutes (probably 5 to 10 minutes, depending on how fast the CoreOS dev container and the NVIDIA installer download). I believe this duration would be acceptable for many users because a GPU node pool usually doesn't need to scale as quickly as a normal node pool which hosts service pods.

What do you think? Do you prefer a local build and shipping pre-built binaries directly for faster startup?

Contributor

I believe this duration would be acceptable for many users because a GPU node pool usually doesn't need to scale as quickly as a normal node pool which hosts service pods.

I completely agree with you here 👍
The local build feature could be an extra thing we may or may not add in the future.

}

// This function is used when rendering cloud-config-worker
func (c NvidiaSetting) IsEnabledOn(instanceType string) bool {
Contributor

I like the nice naming 👍

# # Make sure to choose 'docker' as the container runtime when enabling this feature.
# gpu:
# nvidia:
# enabled: true
Contributor

I guess an installed driver can become unusable once Container Linux is updated afterwards, due to an updated kernel.
Do you think so too?
Then, I believe we'd better document it - maybe something like:

Ensure that automatic Container Linux updates are disabled (they are disabled by default, btw). Otherwise the installed driver may stop working when an OS update results in an updated kernel.

would work.

Contributor Author

Yes, you're absolutely right. I will put this in the comment.
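
For the documentation note above, a generic Container Linux cloud-config sketch of disabling automatic updates (this is standard Container Linux configuration rather than anything added by this PR, and kube-aws may expose its own setting for it):

```
coreos:
  update:
    # Never reboot into a newly downloaded (and possibly kernel-incompatible) OS image automatically.
    reboot-strategy: "off"
  units:
    # Optionally stop and mask the update engine so new images are not even downloaded.
    - name: update-engine.service
      command: stop
      mask: true
```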

@mumoshu mumoshu modified the milestones: backlog, wip, tbd May 12, 2017
@mumoshu mumoshu added this to To be reminded in v0.9.7 May 12, 2017
@mumoshu mumoshu modified the milestones: tbd, v0.9.7-rc.<tbd> May 12, 2017
cp *.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable nvidia-start.service
systemctl start nvidia-start.service
Contributor Author

nvidia-install.sh installs nvidia-start.service and nvidia-persistenced.service into systemd, and the script only starts nvidia-start.service. That unit insmods the nvidia module, and udevadm then spawns several actions defined in 71-nvidia.rules, which include starting nvidia-persistenced.service.
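
For illustration, one plausible form of rule that a file like 71-nvidia.rules could contain, shown as a cloud-config write_files entry. The install path and the exact rule are assumptions; the rules actually shipped by this PR and by Clarifai/coreos-nvidia may differ:

```
write_files:
  - path: /etc/udev/rules.d/71-nvidia.rules   # assumed install location
    content: |
      # When the nvidia kernel module is loaded, have systemd pull in nvidia-persistenced.
      ACTION=="add", SUBSYSTEM=="module", KERNEL=="nvidia", TAG+="systemd", ENV{SYSTEMD_WANTS}+="nvidia-persistenced.service"
```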

Contributor

@mumoshu mumoshu May 12, 2017

Thanks! So, basically we're copying systemd unit files written via write_files into /etc/systemd/system?
Then, it should be possible to just write those units directly into the units section, right?

Contributor Author

@everpeace everpeace May 12, 2017

If we put nvidia-start.service in the units section, can we control when the service starts? Unless we define enable explicitly in the unit definition, it doesn't start automatically, right?

Contributor

I guess so.

You can even enable it by default.
More concretely, my idea is modifying nvidia-start.service to something like:

[Unit]
Description=Start NVIDIA daemon(?)
After=local-fs.target
Before=kubelet.service
[Service]
Type=oneshot
RemainAfterExit=true
ExecStartPre=/opt/bin/retry /opt/nvidia/current/bin/nvidia-start.sh
ExecStart=/bin/true
[Install]
RequiredBy=kubelet.service

And trigger it via a systemd dependency from a newly introduced nvidia-install.service:

[Unit]
Description=Start NVIDIA daemon(?)
After=local-fs.target
Before=nvidia-start.service
[Service]
Type=oneshot
RemainAfterExit=true
ExecStartPre=/opt/bin/retry /opt/nvidia/current/bin/nvidia-install.sh
ExecStart=/bin/true
[Install]
RequiredBy=nvidia-start.service

where /opt/bin/retry is:

#!/usr/bin/bash

set -e

# could be improved to a finite loop
while true; do
  if "$@"; then
    exit 0
  fi
  echo retrying "$@"
  # could be improved to an exponential backoff
  sleep 1
done

and omit ExecStartPre=/opt/nvidia-build/build-and-install.sh from kubelet.service.


systemctl daemon-reload
systemctl enable nvidia-start.service
systemctl start nvidia-start.service
Contributor

It may sound like a nit but -

If we could transform the build-and-install.sh in ExecStartPre into a systemd unit as suggested in #645 (comment), then systemctl daemon-reload, enable, and start could be omitted altogether, and such dependencies could be handled completely by systemd rather than by a bash script?

Contributor Author

@mumoshu
Yes, sounds nice! I'll fix this.

deleted the `systemctl` commands from the bash script. Instead, the unit dependency above is introduced.

nvidia-install.service, which just invokes build-and-install.sh, is implemented as type=oneshot because nvidia-start should wait until nvidia-install.service has succeeded completely.
To enable retrying build-and-install.sh, /opt/nvidia-build/util/retry.sh is introduced, because type=oneshot and Restart=always can't be combined in systemd.
…ld-and-install.sh via ExecStartPre with retry.sh

kubelet.service 'Requires' and 'After' nvidia-start.service.
@everpeace
Contributor Author

I updated 'Driver installation process' in this PR summary because of 2710606 and 0546e94

@everpeace
Contributor Author

@mumoshu I updated the systemd units' dependencies. I'd be glad if you could take a look!

@mumoshu mumoshu merged commit ae54601 into kubernetes-retired:master May 22, 2017
@mumoshu
Contributor

mumoshu commented May 22, 2017

@everpeace LGTM. Thanks for your efforts on the great feature 👍

@mumoshu mumoshu modified the milestones: v0.9.7-rc.1, v0.9.7-rc.<tbd> May 22, 2017
@everpeace everpeace mentioned this pull request May 25, 2017
camilb added a commit to camilb/kube-aws that referenced this pull request May 25, 2017
* kubernetes-incubator/master:
  Fix "install-kube-system" script when "clusterAutoscaler" is disabled.
  Remove obsolete etcd locking logic
  Re: cluster-autoscaler support
  Make `go test` timeout longer enough for Travis Fixes kubernetes-retired#667
  NVIDIA driver installation support on GPU instances (kubernetes-retired#645)
  Make kubelet flags more consistent
  Fix taint being assigned as labels
  Avoid unnecessary node replacements when TLS bootstrapping is enabled (kubernetes-retired#639)
  Update Kubernetes dashboard to v1.6.1. Update calico to v2.2.1.
  Fix typo in help message
@kylegato
Contributor

Has anyone tested this w/ the new G3 instances yet?

kylehodgetts pushed a commit to HotelsDotCom/kube-aws that referenced this pull request Mar 27, 2018
…ed#645)


## Full changelog

* add /etc/default/kubelet to worker nodes.

* add nvidia driver installation support.

* add gpu related config test.

* it should be an error when the user sets gpu.nvidia.enabled: true with GPU-unsupported instance types.

This change is caused by:
kubernetes-retired#645 (comment)

* add note which warns that driver may stop working when OS is updated.

This change is caused by:
kubernetes-retired#645 (comment)

* move nvidia-{start, persistenced}.service to `coreos.units` section.

moved creation of the nvidia-persistenced user to the `users` section, too.

This change is caused by:
kubernetes-retired#645 (comment)

* introduce unit dependency: kubelet --> nvidia-start --> nvidia-install

deleted the `systemctl` commands from the bash script. Instead, the unit dependency above is introduced.

nvidia-install.service, which just invokes build-and-install.sh, is implemented as type=oneshot because nvidia-start should wait until nvidia-install.service has succeeded completely.
To enable retrying build-and-install.sh, /opt/nvidia-build/util/retry.sh is introduced, because type=oneshot and Restart=always can't be combined in systemd.

* delete nvidia-install.service; now nvidia-start.service invokes build-and-install.sh via ExecStartPre with retry.sh

kubelet.service 'Requires' and 'After' nvidia-start.service.