Ability to prefer using all gpus on a single node #781

Closed
ashahab opened this issue Aug 17, 2018 · 10 comments

Comments

@ashahab

ashahab commented Aug 17, 2018

We are interested in having the ability in tf-operator to prefer a single node and use its GPUs if possible. That can dramatically improve training performance if the workers and PS don't have to communicate over the network.

@gaocegege
Member

Yeah, I agree with you, but it is not in our scope. We should support the feature via the scheduler kube-arbitrator: https://github.com/kubernetes-incubator/kube-arbitrator/

@ashahab
Author

ashahab commented Aug 17, 2018 via email

@gaocegege
Member

You need to enable gang scheduling in tf-operator and let kube-arbitrator schedule the training jobs.

@ashahab
Author

ashahab commented Aug 17, 2018 via email

@ChanYiLin
Member

First of all, you can find or build your own kube-arbitrator image by following the tutorial here:
https://github.com/kubernetes-incubator/kube-arbitrator/blob/master/doc/usage/tutorial.md

After that, you can use the following YAML file (I forgot where to find the sample, so here is my own YAML file):

apiVersion: apps/v1beta1
kind: Deployment
metadata:
  labels:
    component: scheduler
    tier: control-plane
  name: kube-batchd
  namespace: kube-system
spec:
  replicas: 1
  template:
    metadata:
      labels:
        component: scheduler
        tier: control-plane
        version: second
    spec:
      containers:
      - command:
        - ./opt/kube-batchd
        - --kubeconfig=/tmp/kubernetes/conf/admin.conf  # change according to your environment
        - --scheduler-name=kube-batchd
        image: {YOUR-IMAGE}
        name: kube-second-scheduler
        resources:
          requests:
            cpu: '0.1'
        securityContext:
          privileged: false
        volumeMounts:  # change according to your environment
        - mountPath: /tmp/kubernetes/conf
          name: kubeconfig
          readOnly: true
      hostNetwork: false
      hostPID: false
      volumes:
        - hostPath:      # change according to your environment
            path: /tmp/kubernetes/conf
          name: kubeconfig

NOTE: kube-arbitrator needs to collect cluster information (such as Pods, Nodes, CRDs, etc.) for scheduling, so the service account used by the deployment must have permission to access those cluster resources; otherwise, kube-arbitrator will fail to start up. (from the README)
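
For reference, here is a minimal RBAC sketch that satisfies that note, assuming the deployment above runs under the default service account in kube-system; the binding name is hypothetical, and cluster-admin is the blunt option (a narrower ClusterRole is preferable in production):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-batchd-admin          # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin              # broad on purpose; scope it down for real clusters
subjects:
- kind: ServiceAccount
  name: default                    # assumes the deployment runs as this service account
  namespace: kube-system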

On the tf-operator side, there is an option, EnableGangScheduling, that you have to set to true (a rough sketch of the corresponding flag is below).
Then, in the TFJob YAML file, assign the scheduler to each Pod (Master, PS, Worker).
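
For illustration, the operator side might look roughly like this excerpt from a tf-operator Deployment spec; the flag spelling --enable-gang-scheduling is my assumption and may differ between tf-operator versions:

# excerpt from a tf-operator Deployment pod spec (flag name is an assumption)
containers:
- name: tf-job-operator
  image: {YOUR-TF-OPERATOR-IMAGE}        # placeholder, same convention as above
  args:
  - --enable-gang-scheduling=true        # assumed CLI spelling of the EnableGangScheduling option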

It should work as shown in the following video:
https://www.youtube.com/watch?v=hhwU7reNJDU

@ChanYiLin
Member

@ashahab
However, kube-arbitrator still can't achieve exactly what you want.
It can only schedule all the Pods of a TFJob together, which prevents the situation where some Pods are bound to nodes while others can't be scheduled due to a lack of resources, leaving the job pending.

@ChanYiLin
Member

@gaocegege
IMO, scheduling all the workers of a TFJob onto the same node is not in the scope of kube-arbitrator either, since this requirement only arises for jobs like distributed TensorFlow training.

Another thing: I also found there is no option for the user to assign schedulerName in the v1alpha2 TFJob spec like we had in v1alpha1, so it seems we have to add this setting to each PodSpec.
In v1alpha1:

//types.go
// SchedulerName specifies the name of scheduler which should handle the TFJob
	SchedulerName string `json:"schedulerName,omitempty"`
// replica.go
pod.Spec.SchedulerName = s.Job.SchedulerName()

@ashahab
Author

ashahab commented Aug 18, 2018 via email

@ChanYiLin
Member

Yes, you can.
In the TFJob YAML file, the PS/Worker/Master sections are actually Kubernetes Pod specs.
You can follow the Pod spec format to add anything you want; see the sketch below.
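
For example, a minimal v1alpha2 TFJob sketch; the job name, image, and replica counts are placeholders, and the key line is schedulerName under each replica's pod template spec:

apiVersion: kubeflow.org/v1alpha2
kind: TFJob
metadata:
  name: dist-training-example            # hypothetical name
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      template:
        spec:
          schedulerName: kube-batchd     # hand the Pod to the second scheduler
          containers:
          - name: tensorflow
            image: {YOUR-TRAINING-IMAGE} # placeholder
    Worker:
      replicas: 2
      template:
        spec:
          schedulerName: kube-batchd
          containers:
          - name: tensorflow
            image: {YOUR-TRAINING-IMAGE} # placeholder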

@gaocegege
Member

@ashahab

Agree with @ChanYiLin; I am closing the issue. If you have any questions, feel free to add new comments here.
