
Include GPU daemonset in GKE configs? #288

Closed
jlewi opened this issue Feb 24, 2018 · 19 comments

jlewi (Contributor) commented Feb 24, 2018

On GKE, users need to deploy a daemonset to configure GPU nodes:

https://cloud.google.com/kubernetes-engine/docs/concepts/gpus#installing_drivers
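For reference, the step being eliminated is a single kubectl apply of Google's driver-installer manifest. A minimal sketch, assuming the manifest path the GKE docs pointed at (it may have moved since):

```sh
# Deploy the NVIDIA driver-installer daemonset on a GKE cluster
# (URL is illustrative; the authoritative one is in the GKE docs above).
kubectl apply -f \
  https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
```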

I think the main advantage is eliminating another step users have to perform to get a GKE cluster configured.

I'm guessing we can probably just start the daemonset unconditionally, since it only runs on GPU nodes; starting it on a CPU-only cluster is a no-op.

I'm wondering if having multiple copies of the daemonset is actually problematic (e.g. one started manually and one via Kubeflow). My guess is one of them would succeed on a given node and mark the node as processed.

One problem could be that if the daemonsets are installing different versions of the drivers, users might end up with a random mix of driver versions across nodes.

/cc @vishh

bhack commented Feb 24, 2018

It seems we are quite stalled for other providers, and for the very frequent kops deployments on different providers: kubernetes/kubernetes#54011

jlewi (Contributor, Author) commented Feb 24, 2018

Thanks @bhack.

jlewi (Contributor, Author) commented Apr 30, 2018

@kunmingg we should think about whether it makes sense to have the bootstrapper decide to enable this.

bhack commented Apr 30, 2018

lluunn (Contributor) commented Jun 20, 2018

/assign

lluunn (Contributor) commented Jun 21, 2018

Can we just always deploy the daemonset, since it has tolerations for GPU nodes:
https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/master/daemonset.yaml#L45?
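For context, a minimal sketch of the scheduling bits that would make "always deploy" safe; the label and taint keys below are my recollection of GKE's conventions, not copied from the linked manifest, so treat them as hypothetical:

```yaml
# Hypothetical excerpt of a driver-installer DaemonSet pod spec.
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              # Only schedule on nodes GKE has labeled with an accelerator,
              # so on a CPU-only cluster the daemonset schedules zero pods.
              - key: cloud.google.com/gke-accelerator
                operator: Exists
      tolerations:
      # Tolerate the taint GKE places on GPU nodes so the installer can land there.
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
```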

bhack commented Jun 21, 2018

lluunn (Contributor) commented Jun 21, 2018

Thanks @bhack, looks really cool.

IIUC, using this, we can just run a (driver) container and avoid installing the driver?
Is this going to be the suggested pattern?

Pasting the links here:
https://docs.google.com/presentation/d/1NY4X2K6BMaByfnF9rMEcNq6hS3NtmOKGTfihZ44zfrw/edit#slide=id.g3730f1de4f_0_16
https://github.com/NVIDIA/nvidia-container-runtime/
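For anyone following along: nvidia-container-runtime doesn't remove the need for the kernel driver on the host; it injects the host's driver libraries into containers so images don't have to bundle matching CUDA user-space bits. A usage sketch, with an illustrative image tag and config path:

```sh
# Register the NVIDIA runtime with Docker, typically via /etc/docker/daemon.json:
#   { "runtimes": { "nvidia": { "path": "/usr/bin/nvidia-container-runtime" } } }

# Containers started with the runtime then see the GPUs without a CUDA
# toolkit installed on the host or mismatched libraries in the image:
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all \
  nvidia/cuda:9.0-base nvidia-smi
```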

jlewi (Contributor, Author) commented Jun 28, 2018

@lluunn Can we just follow whatever GKE recommends and then update it when GKE updates their solution?

bhack commented Jun 28, 2018

Why GKE only?

jlewi (Contributor, Author) commented Jun 28, 2018

This issue is GKE specific. I don't think Kubeflow has the bandwidth to try to figure out the proper way to install and configure GPUs on different K8s distributions. This should be solved upstream in Kubernetes.

GPUs are supported in GKE. The sole purpose of this issue is to remove the extra step of making users deploy the daemonset to install GPU drivers on GKE clusters.
https://cloud.google.com/kubernetes-engine/docs/concepts/gpus#installing_drivers

bhack commented Jun 28, 2018

That doesn't seem fair, IMHO.
We could just point users to the official Kubernetes documentation without including a vendor-specific partial solution in the repository.

Doesn't that seem a little more vendor neutral to you?

jlewi (Contributor, Author) commented Jun 28, 2018

We explicitly want to provide mechanisms for vendors to optimize the experience for their particular platform. So in this particular case we want to provide a good deployment experience on GCP. We expect other providers will want to do the same, and they are free to do so. Alibaba, for example, is customizing the Kubeflow deployment to use docker images mirrored on their cloud to improve performance.

The ideal situation would be a non-vendor-specific solution for GPUs in K8s, but that's not the case today.

bhack commented Jun 28, 2018

Yes, I understand your point of view as a GCP member, but then who will take care of bare-metal Kubernetes? Bare metal is not a vendor, I suppose.

> The ideal situation would be a non-vendor-specific solution for GPUs in K8s, but that's not the case today.

I agree with you that that would be ideal. But the upstream Kubernetes docs are currently the only solution we have, and they still differentiate between GCP and the others.

bhack commented Jun 28, 2018

vishh (Contributor) commented Jun 28, 2018 via email

bhack commented Jun 28, 2018

@vishh Yes, I've already linked that comment in this issue.
I agree, but currently, in k8s, we "officially" document two solutions: GKE/GCE and "the others".

It is hard to integrate all the other k8s GPU-node solutions in this repository:

So probably only GCP will have first-class support in the repository if we merge the daemonset, because how will we integrate all these solutions until NVIDIA helps us find something to integrate upstream in k8s?

bhack commented Jun 28, 2018

Oh, sorry for the noise; you already merged the GKE solution some hours ago. What documentation will we add, e.g. for #32?

jlewi (Contributor, Author) commented Jul 7, 2018

@bhack I responded on #32 (this comment).

You raise a good point about bare metal; feel free to open an issue about it. I don't have much experience with deploying on bare metal, so I'm not sure where to start in terms of getting a good bare-metal deployment story.

Closing this issue because the deploy.sh script for GKE (see #1111) will now create the daemonset.

jlewi closed this as completed Jul 7, 2018