Ability to prefer using all gpus on a single node #781

Closed
ashahab opened this issue Aug 17, 2018 · 10 comments

Comments

@ashahab

ashahab commented Aug 17, 2018

We are interested in having the ability in tf-operator to prefer a single node and use its GPUs if possible. That can dramatically improve training performance if the workers and PS don't have to communicate over the network.

@gaocegege
Member

Yeah, I agree with you, but it is not in our scope. We should support the feature via the scheduler kube-arbitrator: https://github.com/kubernetes-incubator/kube-arbitrator/

@ashahab
Author

ashahab commented Aug 17, 2018 via email

@gaocegege
Member

You need to enable gang scheduling in tf-operator and let kube-arbitrator schedule the training jobs.

@ashahab
Author

ashahab commented Aug 17, 2018 via email

@ChanYiLin
Member

First of all, you can find or build your own kube-arbitrator image by following the tutorial here:
https://github.com/kubernetes-incubator/kube-arbitrator/blob/master/doc/usage/tutorial.md

After that, you can use the following YAML file (I forgot where to find the sample, so here is my own YAML file):

apiVersion: apps/v1beta1
kind: Deployment
metadata:
  labels:
    component: scheduler
    tier: control-plane
  name: kube-batchd
  namespace: kube-system
spec:
  replicas: 1
  template:
    metadata:
      labels:
        component: scheduler
        tier: control-plane
        version: second
    spec:
      containers:
      - command:
        - ./opt/kube-batchd
        - --kubeconfig=/tmp/kubernetes/conf/admin.conf  # change according to your environment
        - --scheduler-name=kube-batchd
        image: {YOUR-IMAGE}
        name: kube-second-scheduler
        resources:
          requests:
            cpu: '0.1'
        securityContext:
          privileged: false
        volumeMounts:  # change according to your environment
        - mountPath: /tmp/kubernetes/conf
          name: kubeconfig
          readOnly: true
      hostNetwork: false
      hostPID: false
      volumes:
        - hostPath:      # change according to your environment
            path: /tmp/kubernetes/conf
          name: kubeconfig

NOTE: kube-arbitrator needs to collect cluster information (such as Pods, Nodes, CRDs, etc.) for scheduling, so the service account used by the deployment must have permission to access those cluster resources; otherwise, kube-arbitrator will fail to start up. (from the README)
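
For reference, here is a minimal RBAC sketch that satisfies that note, assuming the deployment above runs under the default service account in kube-system; the binding name is hypothetical, and cluster-admin is the blunt option (a narrower ClusterRole is preferable in production):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-batchd-admin          # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin              # broad on purpose; scope it down for real clusters
subjects:
- kind: ServiceAccount
  name: default                    # assumes the deployment runs as this service account
  namespace: kube-system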

On the tf-operator side, there is an option, EnableGangScheduling, that you have to set to true (a rough sketch of the corresponding flag is below).
Then, in the TFJob YAML file, assign the scheduler to each Pod (Master, PS, Worker).
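
For illustration, the operator side might look roughly like this excerpt from a tf-operator Deployment spec; the flag spelling --enable-gang-scheduling is my assumption and may differ between tf-operator versions:

# excerpt from a tf-operator Deployment pod spec (flag name is an assumption)
containers:
- name: tf-job-operator
  image: {YOUR-TF-OPERATOR-IMAGE}        # placeholder, same convention as above
  args:
  - --enable-gang-scheduling=true        # assumed CLI spelling of the EnableGangScheduling option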

It should work as shown in the following video:
https://www.youtube.com/watch?v=hhwU7reNJDU

@ChanYiLin
Member

@ashahab
However, kube-arbitrator still can't achieve exactly what you want.
It can only schedule all the Pods of a TFJob together, which prevents the situation where some Pods are bound to nodes while others can't be scheduled due to a lack of resources, leaving the job pending.

@ChanYiLin
Member

@gaocegege
IMO, scheduling all the workers of a TFJob onto the same node is not in the scope of kube-arbitrator either, since this requirement only arises for jobs like distributed TensorFlow training.

Another thing: I also found there is no option for the user to assign schedulerName in the v1alpha2 TFJob spec like we had in v1alpha1, so it seems we have to add this setting to each PodSpec.
In v1alpha1:

//types.go
// SchedulerName specifies the name of scheduler which should handle the TFJob
	SchedulerName string `json:"schedulerName,omitempty"`
// replica.go
pod.Spec.SchedulerName = s.Job.SchedulerName()

@ashahab
Author

ashahab commented Aug 18, 2018 via email

@ChanYiLin
Member

Yes, you can.
In the TFJob YAML file, the PS/Worker/Master sections are actually Kubernetes Pod specs.
You can follow the Pod spec format to add anything you want; see the sketch below.
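
For example, a minimal v1alpha2 TFJob sketch; the job name, image, and replica counts are placeholders, and the key line is schedulerName under each replica's pod template spec:

apiVersion: kubeflow.org/v1alpha2
kind: TFJob
metadata:
  name: dist-training-example            # hypothetical name
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      template:
        spec:
          schedulerName: kube-batchd     # hand the Pod to the second scheduler
          containers:
          - name: tensorflow
            image: {YOUR-TRAINING-IMAGE} # placeholder
    Worker:
      replicas: 2
      template:
        spec:
          schedulerName: kube-batchd
          containers:
          - name: tensorflow
            image: {YOUR-TRAINING-IMAGE} # placeholder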

@gaocegege
Member

@ashahab

Agree with @ChanYiLin; I am closing the issue. If you have any questions, feel free to add new comments here.
