
Upgrade CUDA from 9.1 to 10.0 #8482

Merged
6 commits merged on Apr 17, 2020

Conversation

fifar
Contributor

@fifar fifar commented Feb 6, 2020

Recent deep learning frameworks like TensorFlow and PyTorch require at least CUDA 10.0.

TensorFlow: https://www.tensorflow.org/install/gpu#software_requirements
PyTorch: https://pytorch.org/get-started/locally/

@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Feb 6, 2020
@k8s-ci-robot
Contributor

Welcome @fifar!

It looks like this is your first PR to kubernetes/kops 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kops has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 6, 2020
@k8s-ci-robot
Contributor

Hi @fifar. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 6, 2020
@fifar fifar force-pushed the nvidia-device-plugin-cuda10.0 branch from 5bb1a67 to 9b2ed91 on April 16, 2020 at 08:49
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 16, 2020
@fifar fifar force-pushed the nvidia-device-plugin-cuda10.0 branch from 9b2ed91 to ffbd7d7 on April 16, 2020 at 09:03
@fifar fifar changed the title Upgraded CUDA from 9.1 to 10.0 Upgrade CUDA from 9.1 to 10.0 Apr 16, 2020
@rifelpet
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 16, 2020
@rifelpet
Member

Thanks for the update! Do you think you could update the documentation with the more recent versions you've tested this on? https://github.com/kubernetes/kops/tree/master/hooks/nvidia-device-plugin#prerequisites Otherwise, we can get this merged and update the docs in a separate PR.

@fifar
Contributor Author

fifar commented Apr 17, 2020

Thanks for the update! Do you think you could update the documentation with more recent versions that you've tested this on? https://github.com/kubernetes/kops/tree/master/hooks/nvidia-device-plugin#prerequisites otherwise can get this merged and update the docs in a separate PR

Sure, I will update the doc.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 17, 2020
@fifar fifar force-pushed the nvidia-device-plugin-cuda10.0 branch from 7acd9a5 to ac457e9 on April 17, 2020 at 07:58
@fifar fifar requested a review from rifelpet April 17, 2020 13:32
@rifelpet
Member

Looks great! Now that I've read through the readme, I'm wondering about the support for CUDA 9.1. The Makefile and Docker image will only support CUDA 10.0, correct? It might be odd to have docs that walk through setting up CUDA 9.1 if the Makefile can't build a CUDA 9.1 image anymore. Though the docs do reference someone else's third-party Docker image, so in theory that image should still work with CUDA 9.1.

Perhaps we add something to the readme like:

For CUDA 10.0, run DOCKER_REGISTRY=<registry> make image push with the desired registry to self-host the Docker image. For CUDA 9.1, the image is already hosted according to the InstanceGroup spec example, but it can be mirrored elsewhere.

Member

@rifelpet rifelpet left a comment

Sorry, one more minor thing and then I think it's good to merge :) Thanks for sticking with this.

That e2e job failure is just a flake, so we can retry it if we need to.

@fifar fifar requested a review from rifelpet April 17, 2020 15:21
@rifelpet
Member

Thanks! Glad we can finally get this up to date
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 17, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fifar, rifelpet

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 17, 2020
@k8s-ci-robot k8s-ci-robot merged commit a523e6e into kubernetes:master Apr 17, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.18 milestone Apr 17, 2020
@fifar
Contributor Author

fifar commented Apr 17, 2020

Thanks @rifelpet for your review, and especially for the suggestions.

@fifar fifar deleted the nvidia-device-plugin-cuda10.0 branch April 17, 2020 16:15
@marcoaleixo

marcoaleixo commented May 1, 2020

Hello!

I'm trying to use the new CUDA 10.0 image:

apiVersion: v1
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: my_cluster
  name: ai_nodes_gpu
spec:
  image: kope.io/k8s-1.15-debian-stretch-amd64-hvm-ebs-2020-01-17
  kubelet:
    featureGates:
      DevicePlugins: "true"
  hooks:
  - execContainer:
      image: marcooliv/nvidia-device-plugin-cuda-10.0

  machineType: p2.xlarge
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: ai_nodes_gpu
    spot: ai_nodes_gpu
  role: Node
  subnets:
  - us-east-1a

After I create the node I run kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta5/nvidia-device-plugin.yml

Some information about this nvidia-gpu node:

Container Runtime Version: docker://18.9.9
Kubelet Version: v1.16.8
Kube-Proxy Version: v1.16.8

The node is created, but the nvidia device isn't available.
Am I missing something?

@fifar
Contributor Author

fifar commented May 2, 2020

@marcoaleixo Bringing a GPU node to a ready state takes several minutes and consists of two steps: 1) the node joins the cluster (say, 2~3 minutes), and 2) the hook container exposes the devices (say, 5~6 minutes). So after creating the cluster, take a rest, then come back and check.
I use this command to check whether the GPU devices are ready: kubectl --namespace=<namespace> get node -o=custom-columns=NAME:.metadata.name,GPUs:.status.capacity.'nvidia\.com/gpu'
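The check above can also be wrapped in a small wait loop. A hedged sketch: the retry count and sleep interval are my own guesses, and the JSONPath expression mirrors the custom-columns query quoted above rather than coming from the kops docs.

```shell
# Hedged sketch: poll until some node advertises nvidia.com/gpu capacity.
# Retry count and sleep are illustrative guesses, not recommended values.
wait_for_gpus() {
  for _ in $(seq 1 20); do
    if kubectl get nodes \
         -o jsonpath='{.items[*].status.capacity.nvidia\.com/gpu}' \
         | grep -q '[1-9]'; then
      echo "GPUs exposed"
      return 0
    fi
    sleep 30
  done
  echo "timed out waiting for GPUs" >&2
  return 1
}
```

With the thread's timings (roughly 8 minutes end to end), 20 tries at 30 seconds gives the hook container comfortable headroom.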

@marcoaleixo

marcoaleixo commented May 2, 2020

@fifar thank you for the response.
I'm trying again, but it seems my node is in a loop. Can it take more than 15 minutes?

[screenshot: node stuck in a loop]

Your command is returning "none" for all my nodes. In the AWS console my node is Ready.

Are you able to test my Docker Hub image? Or can you share a Node.yaml config?

Edit:

I think I found the problem:

kubectl logs -f nvidia-device-plugin-daemonset-bkq57 --namespace=kube-system

2020/05/02 03:49:30 Loading NVML
2020/05/02 03:49:30 Failed to initialize NVML: could not load NVML library.
2020/05/02 03:49:30 If this is a GPU node, did you set the docker default runtime to nvidia?
2020/05/02 03:49:30 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2020/05/02 03:49:30 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
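For reference, the "set the docker default runtime to nvidia" prerequisite that this log points at corresponds to an /etc/docker/daemon.json along these lines (a sketch of the config documented in the NVIDIA device plugin README; the nvidia-container-runtime path can differ per install):

```json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```

Docker has to be restarted after this change so the device plugin pod can load NVML.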

@fifar
Contributor Author

fifar commented May 2, 2020

@marcoaleixo Not sure what the issue is.

  1. You can try my image fifarzh/nvidia-device-plugin:0.2.0-cuda10.0
  2. I noticed you had these configurations, which I don't set:
spec:
  kubelet:
    featureGates:
      DevicePlugins: "true"

@marcoaleixo

marcoaleixo commented May 2, 2020

@fifar Yeah, same error: "Failed to initialize NVML: could not load NVML library."
Did you make any modifications inside your node?
Are you using a P2 instance?
Is your apiVersion "kops/v1alpha2"?

@fifar
Contributor Author

fifar commented May 2, 2020

@marcoaleixo Below is an example:

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: my-cluster-name
  name: gpu
spec:
  hooks:
  - execContainer:
      image: fifarzh/nvidia-device-plugin:0.2.0-cuda10.0
  image: kope.io/k8s-1.15-debian-stretch-amd64-hvm-ebs-2020-01-17
  machineType: 'p2.xlarge'
  maxSize: 1
  minSize: 1
  rootVolumeSize: 256
  nodeLabels:
    kops.k8s.io/instancegroup: gpu
  role: Node
  subnets:
  - us-east-2a

@marcoaleixo

@fifar even with your configuration it didn't work :/

@fifar
Contributor Author

fifar commented May 3, 2020

@marcoaleixo Sorry it didn't help. Please note that my environment is Kubernetes 1.15.5 + kops 1.15.0; check the first row of the test matrix here.

Also, could you make the GPU node SSH-able and SSH into it? Then you can check the logs in the directory /nvidia-device-plugin, and check init services like kubelet.service and kubelet (whether they are running, and what their statuses are).
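Once on the node, those checks could look like the following. A hedged sketch: the /nvidia-device-plugin log path and the kubelet service name come from the comment above; everything else is illustrative.

```shell
# Hedged debugging sketch for the GPU node (run after SSHing in).
check_gpu_node() {
  echo "--- hook logs (/nvidia-device-plugin) ---"
  # List the hook's log directory, if the hook container has created it.
  ls /nvidia-device-plugin 2>/dev/null || echo "no /nvidia-device-plugin directory"
  echo "--- kubelet ---"
  # Report whether the kubelet unit is active.
  systemctl is-active kubelet 2>/dev/null || echo "kubelet status unknown"
}
```

If the hook logs are missing entirely, the execContainer hook likely never ran, which points at the InstanceGroup spec rather than the plugin.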

@marcoaleixo

marcoaleixo commented May 3, 2020

@fifar Connected via SSH, I'm running every script manually, and when I run nvidia-device-plugin.sh I receive this error:

+++ systemctl stop protokube
Failed to stop protokube.service: Unit protokube.service not loaded.

The kubelet is running.
My driver is 410.129 and my CUDA is 10.0, but for some reason kubectl --namespace=default get node -o=custom-columns=NAME:.metadata.name,GPUs:.status.capacity.'nvidia\.com/gpu' returns none for all nodes.

Is protokube the main reason for the problem?

@marcoaleixo

Well, @fifar
It worked when I built my image without the lines "systemctl stop protokube" and "systemctl start protokube" in the file 02-nvidia-docker.sh.

Thank you!

@fifar
Contributor Author

fifar commented May 4, 2020

Well, @fifar
It worked when I built my image without the lines "systemctl stop protokube" and "systemctl start protokube" in the file 02-nvidia-docker.sh.

Thank you!

Yeah, these init services are tricky. Good to know you finally got your cluster working. @marcoaleixo
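For anyone hitting the same protokube error, the workaround described above can be scripted. A hedged sketch: the heredoc is a stand-in for the real 02-nvidia-docker.sh (not reproduced here) so the commands are self-contained.

```shell
# Hedged sketch of the workaround: drop the protokube stop/start lines from
# 02-nvidia-docker.sh before building the hook image. The heredoc is a
# stand-in for the real script's contents.
cat > 02-nvidia-docker.sh <<'EOF'
systemctl stop protokube
echo "install nvidia-docker here"   # placeholder for the real install steps
systemctl start protokube
EOF

# Delete every line that mentions protokube.
sed -i '/protokube/d' 02-nvidia-docker.sh
cat 02-nvidia-docker.sh
```

This only suppresses the stop/start calls; on images where protokube.service actually exists and is loaded, removing them may not be appropriate.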

@rifelpet rifelpet mentioned this pull request Jul 9, 2020