
Upgrade CUDA from 9.1 to 10.0 #8482

Merged
6 commits merged on Apr 17, 2020

Conversation

fifar
Contributor

@fifar fifar commented Feb 6, 2020

Recent deep learning frameworks like TensorFlow and PyTorch require at least CUDA 10.0.

TensorFlow: https://www.tensorflow.org/install/gpu#software_requirements
PyTorch: https://pytorch.org/get-started/locally/

@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Feb 6, 2020
@k8s-ci-robot
Contributor

Welcome @fifar!

It looks like this is your first PR to kubernetes/kops 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kops has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 6, 2020
@k8s-ci-robot
Contributor

Hi @fifar. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 6, 2020
@fifar fifar force-pushed the nvidia-device-plugin-cuda10.0 branch from 5bb1a67 to 9b2ed91 on April 16, 2020 at 08:49
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 16, 2020
@fifar fifar force-pushed the nvidia-device-plugin-cuda10.0 branch from 9b2ed91 to ffbd7d7 on April 16, 2020 at 09:03
@fifar fifar changed the title Upgraded CUDA from 9.1 to 10.0 Upgrade CUDA from 9.1 to 10.0 Apr 16, 2020
@rifelpet
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 16, 2020
@rifelpet
Member

Thanks for the update! Do you think you could update the documentation with the more recent versions you've tested this on? https://github.com/kubernetes/kops/tree/master/hooks/nvidia-device-plugin#prerequisites Otherwise, we can get this merged and update the docs in a separate PR.

@fifar
Contributor Author

fifar commented Apr 17, 2020

Thanks for the update! Do you think you could update the documentation with more recent versions that you've tested this on? https://github.com/kubernetes/kops/tree/master/hooks/nvidia-device-plugin#prerequisites otherwise can get this merged and update the docs in a separate PR

Sure, I will update the doc.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 17, 2020
@fifar fifar force-pushed the nvidia-device-plugin-cuda10.0 branch from 7acd9a5 to ac457e9 on April 17, 2020 at 07:58
@fifar fifar requested a review from rifelpet April 17, 2020 13:32
@rifelpet
Member

Looks great! Now that I've read through the readme, I'm wondering about the support for CUDA 9.1. The Makefile and Docker image will only support CUDA 10.0, correct? It might be odd to have docs that walk through setting up CUDA 9.1 if the Makefile can't build a CUDA 9.1 image anymore. Though the docs do reference someone else's third-party Docker image, so in theory that image should still work with CUDA 9.1.

Perhaps we add something to the readme like:

For CUDA 10.0, run DOCKER_REGISTRY=<registry> make image push with the desired registry to self-host the Docker image. For CUDA 9.1, the image is already hosted according to the InstanceGroup spec example, but it can be mirrored elsewhere.

Member

@rifelpet rifelpet left a comment

Sorry, one more minor thing and then I think it's good to merge :) Thanks for sticking with this.

That e2e job failure is just a flake, so we can retry it if we need to.

@fifar fifar requested a review from rifelpet April 17, 2020 15:21
@rifelpet
Member

Thanks! Glad we can finally get this up to date
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 17, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fifar, rifelpet

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 17, 2020
@k8s-ci-robot k8s-ci-robot merged commit a523e6e into kubernetes:master Apr 17, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.18 milestone Apr 17, 2020
@fifar
Contributor Author

fifar commented Apr 17, 2020

Thanks @rifelpet for your review, and especially for the suggestions.

@fifar fifar deleted the nvidia-device-plugin-cuda10.0 branch April 17, 2020 16:15
@marcoaleixo

marcoaleixo commented May 1, 2020

Hello!

I'm trying to use the new CUDA 10.0 image:

apiVersion: v1
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: my_cluster
  name: ai_nodes_gpu
spec:
  image: kope.io/k8s-1.15-debian-stretch-amd64-hvm-ebs-2020-01-17
  kubelet:
    featureGates:
      DevicePlugins: "true"
  hooks:
  - execContainer:
      image: marcooliv/nvidia-device-plugin-cuda-10.0

  machineType: p2.xlarge
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: ai_nodes_gpu
    spot: ai_nodes_gpu
  role: Node
  subnets:
  - us-east-1a

After I create the node I run kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta5/nvidia-device-plugin.yml

Some information about this nvidia-gpu node:

Container Runtime Version: docker://18.9.9
Kubelet Version: v1.16.8
Kube-Proxy Version: v1.16.8

The node is created, but the nvidia device isn't available.
Am I missing something?

@fifar
Contributor Author

fifar commented May 2, 2020

@marcoaleixo Bringing a GPU node to a ready state takes several minutes and consists of two steps: 1) the node joins the cluster (say, 2~3 minutes), and 2) the hook container exposes the devices (say, 5~6 minutes). So after creating the cluster, take a rest, then come back and check.
I use this command to check whether the GPU devices are ready: kubectl --namespace=<namespace> get node -o=custom-columns=NAME:.metadata.name,GPUs:.status.capacity.'nvidia\.com/gpu'
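The check above can also be wrapped in a small wait loop. A hedged sketch: the retry count and sleep interval are my own guesses, and the JSONPath expression mirrors the custom-columns query quoted above rather than coming from the kops docs.

```shell
# Hedged sketch: poll until some node advertises nvidia.com/gpu capacity.
# Retry count and sleep are illustrative guesses, not recommended values.
wait_for_gpus() {
  for _ in $(seq 1 20); do
    if kubectl get nodes \
         -o jsonpath='{.items[*].status.capacity.nvidia\.com/gpu}' \
         | grep -q '[1-9]'; then
      echo "GPUs exposed"
      return 0
    fi
    sleep 30
  done
  echo "timed out waiting for GPUs" >&2
  return 1
}
```

With the thread's timings (roughly 8 minutes end to end), 20 tries at 30 seconds gives the hook container comfortable headroom.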

@marcoaleixo

marcoaleixo commented May 2, 2020

@fifar thank you for the response.
I'm trying again, but it seems my node is in a loop. Can it take more than 15 minutes?

[screenshot: node stuck in a loop]

Your command is returning "none" for all my nodes. In the AWS console my node is Ready.

Are you able to test my Docker Hub image? Or can you share a Node.yaml config?

Edit:

I think I found the problem:

kubectl logs -f nvidia-device-plugin-daemonset-bkq57 --namespace=kube-system

2020/05/02 03:49:30 Loading NVML
2020/05/02 03:49:30 Failed to initialize NVML: could not load NVML library.
2020/05/02 03:49:30 If this is a GPU node, did you set the docker default runtime to nvidia?
2020/05/02 03:49:30 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2020/05/02 03:49:30 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
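For reference, the "set the docker default runtime to nvidia" prerequisite that this log points at corresponds to an /etc/docker/daemon.json along these lines (a sketch of the config documented in the NVIDIA device plugin README; the nvidia-container-runtime path can differ per install):

```json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```

Docker has to be restarted after this change so the device plugin pod can load NVML.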

@fifar
Contributor Author

fifar commented May 2, 2020

@marcoaleixo Not sure what the issue is.

  1. You can try my image fifarzh/nvidia-device-plugin:0.2.0-cuda10.0
  2. I noticed you had these configurations, which I don't set:
spec:
  kubelet:
    featureGates:
      DevicePlugins: "true"

@marcoaleixo

marcoaleixo commented May 2, 2020

@fifar Yeah, same error: "Failed to initialize NVML: could not load NVML library."
Did you make any modifications inside your node?
Are you using a P2 instance?
Is your apiVersion "kops/v1alpha2"?

@fifar
Contributor Author

fifar commented May 2, 2020

@marcoaleixo Below is an example:

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: my-cluster-name
  name: gpu
spec:
  hooks:
  - execContainer:
      image: fifarzh/nvidia-device-plugin:0.2.0-cuda10.0
  image: kope.io/k8s-1.15-debian-stretch-amd64-hvm-ebs-2020-01-17
  machineType: 'p2.xlarge'
  maxSize: 1
  minSize: 1
  rootVolumeSize: 256
  nodeLabels:
    kops.k8s.io/instancegroup: gpu
  role: Node
  subnets:
  - us-east-2a

@marcoaleixo

@fifar even with your configuration it didn't work :/

@fifar
Contributor Author

fifar commented May 3, 2020

@marcoaleixo Sorry it didn't help. Please note that my environment is Kubernetes 1.15.5 + kops 1.15.0; check the first row of the test matrix here.

Also, could you make the GPU node SSH-able and SSH into it? Then you can check the logs in the directory /nvidia-device-plugin, and check init services like kubelet.service and kubelet (whether they are running, and what their statuses are).
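Once on the node, those checks could look like the following. A hedged sketch: the /nvidia-device-plugin log path and the kubelet service name come from the comment above; everything else is illustrative.

```shell
# Hedged debugging sketch for the GPU node (run after SSHing in).
check_gpu_node() {
  echo "--- hook logs (/nvidia-device-plugin) ---"
  # List the hook's log directory, if the hook container has created it.
  ls /nvidia-device-plugin 2>/dev/null || echo "no /nvidia-device-plugin directory"
  echo "--- kubelet ---"
  # Report whether the kubelet unit is active.
  systemctl is-active kubelet 2>/dev/null || echo "kubelet status unknown"
}
```

If the hook logs are missing entirely, the execContainer hook likely never ran, which points at the InstanceGroup spec rather than the plugin.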

@marcoaleixo

marcoaleixo commented May 3, 2020

@fifar Connected via SSH, I'm running every script manually, and when I run nvidia-device-plugin.sh I receive this error:

+++ systemctl stop protokube
Failed to stop protokube.service: Unit protokube.service not loaded.

The kubelet is running.
My driver is 410.129 and my CUDA is 10.0, but for some reason kubectl --namespace=default get node -o=custom-columns=NAME:.metadata.name,GPUs:.status.capacity.'nvidia\.com/gpu' returns none for all nodes.

Is protokube the main reason for the problem?

@marcoaleixo

Well, @fifar
It worked when I built my image without the lines "systemctl stop protokube" and "systemctl start protokube" in the file 02-nvidia-docker.sh.

Thank you!

@fifar
Contributor Author

fifar commented May 4, 2020

Well, @fifar
It worked when I built my image without the lines "systemctl stop protokube" and "systemctl start protokube" in the file 02-nvidia-docker.sh.

Thank you!

Yeah, these init services are tricky. Good to know you finally got your cluster working. @marcoaleixo
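For anyone hitting the same protokube error, the workaround described above can be scripted. A hedged sketch: the heredoc is a stand-in for the real 02-nvidia-docker.sh (not reproduced here) so the commands are self-contained.

```shell
# Hedged sketch of the workaround: drop the protokube stop/start lines from
# 02-nvidia-docker.sh before building the hook image. The heredoc is a
# stand-in for the real script's contents.
cat > 02-nvidia-docker.sh <<'EOF'
systemctl stop protokube
echo "install nvidia-docker here"   # placeholder for the real install steps
systemctl start protokube
EOF

# Delete every line that mentions protokube.
sed -i '/protokube/d' 02-nvidia-docker.sh
cat 02-nvidia-docker.sh
```

This only suppresses the stop/start calls; on images where protokube.service actually exists and is loaded, removing them may not be appropriate.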

@rifelpet rifelpet mentioned this pull request Jul 9, 2020