Include GPU daemonset in GKE configs? #288
Comments
It seems we are quite stalled for other providers, even for the very frequent kops deployments on different providers: kubernetes/kubernetes#54011
Thanks @bhack.
@kunmingg we should think about whether it makes sense to have the bootstrapper decide to enable this.
/assign
Can we just always deploy the daemonset, since it has tolerations for GPUs?
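For context, the scheduling-related fields of a GPU driver daemonset typically look something like the sketch below. This is a minimal illustration, not the actual installer manifest; it assumes the node label (`cloud.google.com/gke-accelerator`) and taint key (`nvidia.com/gpu`) that GKE documents for accelerator nodes, and the image name is a placeholder:

```yaml
# Sketch: how a driver daemonset restricts itself to GPU nodes.
# On a cluster with no GPU nodes, no pods are scheduled, so deploying
# it unconditionally should be harmless.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-driver-installer
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-driver-installer
  template:
    metadata:
      labels:
        name: nvidia-driver-installer
    spec:
      # Schedule only onto nodes that carry the GKE accelerator label.
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists
      # Tolerate the taint that GKE places on GPU nodes.
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: nvidia-driver-installer
        image: gcr.io/example/driver-installer  # placeholder image
```

Because both the affinity and the toleration gate scheduling onto GPU nodes, the "always deploy it" approach only ever materializes pods where there is a GPU to configure.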
Thanks @bhack, looks really cool. IIUC, using this we can just run a (driver) container and avoid installing the driver manually? Paste the link here:
@lluunn Can we just follow whatever GKE recommends and then update it when GKE updates their solution?
Why GKE only? |
This issue is GKE specific. I don't think Kubeflow has the bandwidth to try to figure out the proper way to install and configure GPUs on different K8s distributions. This should be solved upstream in Kubernetes. GPUs are supported in GKE. The sole purpose of this issue is to remove the extra step of making users deploy the daemonset to install GPU drivers on GKE clusters.
That doesn't seem quite fair, IMHO. Wouldn't it be better to be a little more vendor neutral?
We explicitly want to provide mechanisms for vendors to optimize the experience for their particular platform. So in this particular case we want to provide a good deployment experience on GCP. We expect other providers will want to do the same, and they are free to do so. Alibaba, for example, is customizing the Kubeflow deployment to use Docker images mirrored on their cloud to improve performance. The ideal situation would be a non-vendor-specific solution for GPUs in K8s, but that's not the case.
Yes, I understand your point of view as a GCP member, but in that case who will take care of bare-metal Kubernetes? Bare metal is not a vendor, I suppose.
I agree with you that that would be ideal. But the upstream Kubernetes docs are currently the only solution we have, and they still make a distinction between GCP and the others.
Driver installation is distro and environment (primarily legal) specific. Only Nvidia can truly simplify that, and they are working on it as mentioned in GoogleCloudPlatform/container-engine-accelerators#51 (comment).
I agree with Jeremy that it is difficult to abstract that in Kubeflow. Bringing up k8s is non-trivial; Kubeflow relies on k8s to be up and running. Similarly, KF should rely on GPUs to be provisioned and managed as part of the k8s layer.
> On Thu, Jun 28, 2018 at 5:20 AM bhack wrote:
> /cc @vissh @3XX0 for GoogleCloudPlatform/container-engine-accelerators#51 (comment)
@vishh Yes, I've already linked that comment in this issue. It is hard to integrate all the other k8s GPU node solutions into this repository.
So if we merge the daemonset, probably only GCP will have first-class support in the repository, because how will we integrate all these other solutions until Nvidia helps us find a solution to integrate upstream in k8s?
Oh, sorry for the noise; you already merged the GKE solution some hours ago. What documentation will we add, e.g. for #32?
@bhack I responded on #32 (this comment). You raise a good point about bare metal; feel free to open an issue about it. I don't have much experience deploying on bare metal, so I'm not sure where to start in terms of getting a good bare-metal deployment story. Closing this issue because the deploy.sh script for GKE (see #1111) will now create the daemonset.
On GKE, users need to deploy a daemonset to configure GPU nodes.
I think the main advantage is eliminating another step users have to do to get a GKE cluster configured.
I'm guessing we can probably start the daemonset unconditionally, as it only runs on GPU nodes, so starting it on a CPU-only cluster is a no-op.
I'm wondering if having multiple copies of the daemonset is actually problematic (e.g. one started manually and one via Kubeflow). My guess is one of them would succeed on a given node and mark the node as processed.
One problem could be that if the daemonsets install different driver versions, users might wind up with a random mix of driver versions across nodes.
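For reference, the manual step under discussion is the one from GKE's GPU documentation, which amounts to applying NVIDIA's driver-installer daemonset from the `GoogleCloudPlatform/container-engine-accelerators` repository (shown here for COS nodes; it requires access to a running GKE cluster):

```shell
# The GKE-documented step for installing NVIDIA drivers on COS nodes;
# this is the step the issue proposes folding into the Kubeflow deployment.
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
```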
/cc @vishh