
Include GPU daemonset in GKE configs? #288

Closed
jlewi opened this issue Feb 24, 2018 · 19 comments

jlewi (Contributor) commented Feb 24, 2018

On GKE, users need to deploy a daemonset to configure GPU nodes:

https://cloud.google.com/kubernetes-engine/docs/concepts/gpus#installing_drivers
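For reference, the step being eliminated is a single kubectl apply of Google's driver-installer manifest. A minimal sketch, assuming the manifest path the GKE docs pointed at (it may have moved since):

```sh
# Deploy the NVIDIA driver-installer daemonset on a GKE cluster
# (URL is illustrative; the authoritative one is in the GKE docs above).
kubectl apply -f \
  https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
```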

I think the main advantage is eliminating another step users have to perform to get a GKE cluster configured.

I'm guessing we can probably just start the daemonset unconditionally, since it only runs on GPU nodes; starting it on a CPU-only cluster is a no-op.

I'm wondering if having multiple copies of the daemonset is actually problematic (e.g. one started manually and one via Kubeflow). My guess is one of them would succeed on a given node and mark the node as processed.

One problem could be that if the daemonsets are installing different versions of the drivers, users might end up with a random mix of driver versions across nodes.

/cc @vishh

bhack commented Feb 24, 2018

It seems we are quite stalled for other providers, and for the very frequent kops deployments on different providers: kubernetes/kubernetes#54011

jlewi (Contributor, Author) commented Feb 24, 2018

Thanks @bhack.

jlewi (Contributor, Author) commented Apr 30, 2018

@kunmingg we should think about whether it makes sense to have the bootstrapper decide to enable this.

bhack commented Apr 30, 2018

lluunn (Contributor) commented Jun 20, 2018

/assign

lluunn (Contributor) commented Jun 21, 2018

Can we just always deploy the daemonset, since it has tolerations for GPU nodes:
https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/master/daemonset.yaml#L45?
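For context, a minimal sketch of the scheduling bits that would make "always deploy" safe; the label and taint keys below are my recollection of GKE's conventions, not copied from the linked manifest, so treat them as hypothetical:

```yaml
# Hypothetical excerpt of a driver-installer DaemonSet pod spec.
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              # Only schedule on nodes GKE has labeled with an accelerator,
              # so on a CPU-only cluster the daemonset schedules zero pods.
              - key: cloud.google.com/gke-accelerator
                operator: Exists
      tolerations:
      # Tolerate the taint GKE places on GPU nodes so the installer can land there.
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
```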

bhack commented Jun 21, 2018

lluunn (Contributor) commented Jun 21, 2018

Thanks @bhack, looks really cool.

IIUC, using this, we can just run a (driver) container and avoid installing the driver?
Is this going to be the suggested pattern?

Pasting the links here:
https://docs.google.com/presentation/d/1NY4X2K6BMaByfnF9rMEcNq6hS3NtmOKGTfihZ44zfrw/edit#slide=id.g3730f1de4f_0_16
https://github.com/NVIDIA/nvidia-container-runtime/
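For anyone following along: nvidia-container-runtime doesn't remove the need for the kernel driver on the host; it injects the host's driver libraries into containers so images don't have to bundle matching CUDA user-space bits. A usage sketch, with an illustrative image tag and config path:

```sh
# Register the NVIDIA runtime with Docker, typically via /etc/docker/daemon.json:
#   { "runtimes": { "nvidia": { "path": "/usr/bin/nvidia-container-runtime" } } }

# Containers started with the runtime then see the GPUs without a CUDA
# toolkit installed on the host or mismatched libraries in the image:
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all \
  nvidia/cuda:9.0-base nvidia-smi
```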

jlewi (Contributor, Author) commented Jun 28, 2018

@lluunn Can we just follow whatever GKE recommends and then update it when GKE updates their solution?

bhack commented Jun 28, 2018

Why GKE only?

jlewi (Contributor, Author) commented Jun 28, 2018

This issue is GKE specific. I don't think Kubeflow has the bandwidth to try to figure out the proper way to install and configure GPUs on different K8s distributions. This should be solved upstream in Kubernetes.

GPUs are supported in GKE. The sole purpose of this issue is to remove the extra step of making users deploy the daemonset to install GPU drivers on GKE clusters.
https://cloud.google.com/kubernetes-engine/docs/concepts/gpus#installing_drivers

bhack commented Jun 28, 2018

That doesn't seem fair, IMHO.
We could just point users to the official Kubernetes documentation without including a vendor-specific partial solution in the repository.

Doesn't that seem a little more vendor neutral to you?

jlewi (Contributor, Author) commented Jun 28, 2018

We explicitly want to provide mechanisms for vendors to optimize the experience for their particular platform. So in this particular case we want to provide a good deployment experience on GCP. We expect other providers will want to do the same, and they are free to do so. Alibaba, for example, is customizing the Kubeflow deployment to use docker images mirrored on their cloud to improve performance.

The ideal situation would be a non-vendor-specific solution for GPUs in K8s, but that's not the case today.

bhack commented Jun 28, 2018

Yes, I understand your point of view as a GCP member, but then who will take care of bare-metal Kubernetes? Bare metal is not a vendor, I suppose.

> The ideal situation would be a non-vendor-specific solution for GPUs in K8s, but that's not the case today.

I agree with you that that would be ideal. But the upstream Kubernetes docs are currently the only solution we have, and they still differentiate between GCP and the others.

bhack commented Jun 28, 2018

vishh (Contributor) commented Jun 28, 2018 via email

bhack commented Jun 28, 2018

@vishh Yes, I've already linked that comment in this issue.
I agree, but currently, in k8s, we "officially" document two solutions: GKE/GCE and "the others".

It is hard to integrate all the other k8s GPU-node solutions in this repository:

So probably only GCP will have first-class support in the repository if we merge the daemonset, because how will we integrate all these solutions until NVIDIA helps us find something to integrate upstream in k8s?

bhack commented Jun 28, 2018

Oh, sorry for the noise; you already merged the GKE solution some hours ago. What documentation will we add, e.g. for #32?

jlewi (Contributor, Author) commented Jul 7, 2018

@bhack I responded on #32 (this comment).

You raise a good point about bare metal; feel free to open an issue about it. I don't have much experience with deploying on bare metal, so I'm not sure where to start in terms of getting a good bare-metal deployment story.

Closing this issue because the deploy.sh script for GKE (see #1111) will now create the daemonset.

jlewi closed this as completed Jul 7, 2018