
Proposal: using gang scheduling API for generic distributed training support in Kubeflow #37

Open
karthikv2k opened this issue May 10, 2019 · 9 comments

Comments

@karthikv2k

Problem

Currently in Kubeflow we have one controller per framework (e.g. TF-Job and PyTorch-Operator), and the message we are giving is that to support a new framework, users have to write a new controller. This creates a lot of friction for data scientists, most of whom don't know Go or Kubernetes. Even if they do, getting a new version of a controller deployed in a corporate cluster is not easy.

Proposed Solution

In reality, users don't have to write a new controller if they have access to a generic gang scheduling API; the TF-Job controller already exposes a restricted version of such an API that works for almost all use cases. In fact, the Google AI Platform team implemented distributed PyTorch and XGBoost jobs using the TF-Job API for the Google AI Hub. So if we can create a single controller for gang scheduling, it becomes easy to add support for new frameworks.

Advantages

  • Less effort to support a new framework (users don't need K8s or Go expertise).
  • A better portability story between Kubeflow and other platforms like Mesos: the same container can be used on other platforms without any changes.

Other infrastructures that support some version of a gang scheduling API

Frameworks Support

From my understanding, distributed training for the following frameworks can be implemented easily using just a generic gang scheduling API.

  • TensorFlow
  • Horovod
  • PyTorch
  • XGBoost
  • Julia
  • LightGBM

Rough API Spec

Almost the same as the current TF-Job spec, but with more generic names and a generalized number of worker groups.

apiVersion: kubeflow.org/v1beta1
kind: GangJob
metadata:
  generateName: gangjob
  namespace: kubeflow
spec:
  replicaSpecs:
    WorkerGroup1:
      replicas: 4
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: 
            image: 
            command:
    WorkerGroup2:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: 
            image: 
            command:
    # ... additional worker groups ...
    WorkerGroupN:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: 
            image: 
            command:
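
To make the spec concrete, here is a minimal sketch of how a generic controller might turn the replicaSpecs above into a cluster spec that it injects into every pod. The GANG_CLUSTER_SPEC variable name, the pod/service naming scheme, and the port are illustrative assumptions, not part of the proposal.

import json

# Minimal sketch (assumptions only): derive a cluster spec from the replicaSpecs
# of a GangJob. Each worker group maps to a list of stable pod addresses, which
# the controller could inject into every container, e.g. as GANG_CLUSTER_SPEC.
def build_cluster_spec(job_name, replica_specs, port=2222):
    cluster = {}
    for group_name, spec in replica_specs.items():
        cluster[group_name.lower()] = [
            # hypothetical headless-service naming: <job>-<group>-<index>.<job>
            f"{job_name}-{group_name.lower()}-{i}.{job_name}:{port}"
            for i in range(spec["replicas"])
        ]
    return cluster

replica_specs = {"WorkerGroup1": {"replicas": 4}, "WorkerGroup2": {"replicas": 3}}
print(json.dumps(build_cluster_spec("gangjob", replica_specs), indent=2))

Keying the spec by worker-group name mirrors how TF_CONFIG keys its cluster dictionary by role, which keeps the later translation to framework-specific variables mechanical.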

@karthikv2k
Author

CC @richardsliu @abhi-g

@richardsliu
Contributor

/cc @k82cn
/cc @gaocegege
/cc @johnugeorge

@johnugeorge
Member

In this proposal, how is the distributed training environment set up for each framework? E.g., the TF_CONFIG env variable in TensorFlow (https://www.tensorflow.org/guide/distribute_strategy#setting_up_tf_config_environment_variable) and MASTER_ADDR, MASTER_PORT, etc. in PyTorch (https://pytorch.org/tutorials/intermediate/dist_tuto.html#initialization-methods).

It looks similar to the common operator discussion, which would support all the frameworks described in this proposal.

@k82cn
Contributor

k82cn commented May 12, 2019

Some input here: gang scheduling/co-scheduling is a requirement on the scheduler, so the common operator defines a SchedulingSpec to communicate with kube-batch; the other part, e.g. multiple pod templates, is more about the controller than the scheduling policy. Both of them are fundamental features for k8s. Please refer to kubernetes/kubernetes#68357 and http://github.com/volcano-sh/volcano for what we're doing there :)

@karthikv2k
Author

karthikv2k commented May 13, 2019

In this proposal, how is the distributed training environment set up for each framework?

The user's code will be responsible for setting the right environment variables for the framework they are using. The gang scheduler/controller can set a cluster spec in an env variable that is similar to TF_CONFIG. Taking a cluster spec and converting it into framework-specific env variables should be trivial.
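
For illustration, here is a minimal sketch of that translation, assuming the controller injects the hypothetical GANG_CLUSTER_SPEC variable from the controller sketch above plus GANG_GROUP and GANG_INDEX for the pod's own identity (none of these names are defined by an existing operator):

import json
import os

# Assumed inputs (hypothetical variable names, not an existing operator's API):
#   GANG_CLUSTER_SPEC: {"workergroup1": ["host:port", ...], "workergroup2": [...]}
#   GANG_GROUP / GANG_INDEX: this pod's worker group and index within it.
cluster = json.loads(os.environ["GANG_CLUSTER_SPEC"])
group = os.environ["GANG_GROUP"]
index = int(os.environ["GANG_INDEX"])

# TensorFlow: TF_CONFIG is a JSON blob with the cluster layout and this task's
# identity (TensorFlow expects role names like "worker"/"ps"/"chief", so group
# names would need to map onto those roles).
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": cluster,
    "task": {"type": group, "index": index},
})

# PyTorch (torch.distributed env:// init): point every process at the first
# address in the spec and compute this process's global rank.
all_hosts = [host for hosts in cluster.values() for host in hosts]
master_addr, master_port = all_hosts[0].rsplit(":", 1)
os.environ["MASTER_ADDR"] = master_addr
os.environ["MASTER_PORT"] = master_port
os.environ["WORLD_SIZE"] = str(len(all_hosts))
os.environ["RANK"] = str(all_hosts.index(cluster[group][index]))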

It looks similar to the common operator discussion, which would support all the frameworks described in this proposal.

Where can I find the "common operator discussion"? From the name it sounds similar.

@karthikv2k
Author

Some input here: gang scheduling/co-scheduling is a requirement on the scheduler, so the common operator defines a SchedulingSpec to communicate with kube-batch; the other part, e.g. multiple pod templates, is more about the controller than the scheduling policy. Both of them are fundamental features for k8s. Please refer to kubernetes/kubernetes#68357 and http://github.com/volcano-sh/volcano for what we're doing there :)

@k82cn https://github.com/volcano-sh/volcano/blob/master/docs/design/job-api.md describes everything that I need and even goes beyond that!
However, I don't have a clear idea of when to use Volcano vs. a Kubeflow job operator. Are these complementary offerings?

@k82cn
Contributor

k82cn commented May 15, 2019

@k82cn https://github.com/volcano-sh/volcano/blob/master/docs/design/job-api.md describes everything that I need and even goes beyond that!

Very glad to hear that :)

However, I don't have a clear idea of when to use Volcano vs. a Kubeflow job operator. Are these complementary offerings?

Volcano enhances k8s's batch capabilities (based on kubernetes/kubernetes#68357); Kubeflow makes it easier for users to use ML frameworks :)

And we're going to work together on the batch scheduling part :)

@johnugeorge
Member

@karthikv2k This is the issue tracking it: kubeflow/training-operator#960
