Proposal for a Common Operator #960

richardsliu · 2019-03-15T00:11:51Z

Proposal

Move common API types and lower-level libraries to a new, common repository.

Motivation

TFJob is currently in v1beta1 (v1beta2 after 0.5.0) and is fairly stable. Its common types and libraries are being used in other operators like pytorch and MPI.

As we continue to grow and support other distributed training frameworks, it makes sense to refactor and make these common types available in a standalone repository. This has the following advantages:

Reducing code duplication
Other operators do not need to import tf-operator as a dependency
Providing a cleaner pattern for future operators
More convenient for long-term API governance
Preserves the independence of each operator (as opposed to introducing a single monolithic operator for all frameworks)

Details

Create a common-operator repository, consisting of the following directories from tf-operator:

pkg/api/common
pkg/common
pkg/control
pkg/logger
pkg/util

If possible, py/kubeflow (containing test methods) can also be moved to common.

For the common API, we can introduce a new type:

type RunPolicy struct {
	// CleanPodPolicy defines the policy to kill pods after TFJob is
	// succeeded.
	// Default to Running.
	CleanPodPolicy *CleanPodPolicy `json:"cleanPodPolicy,omitempty"`

	// TTLSecondsAfterFinished is the TTL to clean up tf-jobs (temporary
	// before kubernetes adds the cleanup controller).
	// It may take extra ReconcilePeriod seconds for the cleanup, since
	// reconcile gets called periodically.
	// Default to infinite.
	TTLSecondsAfterFinished *int32 `json:"ttlSecondsAfterFinished,omitempty"`
 
	// Specifies the duration in seconds relative to the startTime that the job may be active
	// before the system tries to terminate it; value must be positive integer
	// +optional
	ActiveDeadlineSeconds *int64 `json:"activeDeadlineSeconds,omitempty"`

	// Optional number of retries before marking this job failed.
	// Defaults to 6
	// +optional
	BackoffLimit *int32 `json:"backoffLimit,omitempty"`

	// Not implemented today - see https://github.com/kubeflow/tf-operator/issues/916#issuecomment-458729706
	SchedulingPolicy *SchedulingPolicy `json:"schedulingPolicy,omitempty"`
}

This will be included as part of Job specs. So after this refactoring, a job spec should contain just replica types and other framework-specific details:

type TFJobSpec struct {
	// Common run policy
	RunPolicy *common.RunPolicy `json:"runPolicy,omitempty"`

	// TFReplicaSpecs is map of TFReplicaType and ReplicaSpec
	// specifies the TF replicas to run.
	// For example,
	//   {
	//     "PS": ReplicaSpec,
	//     "Worker": ReplicaSpec,
	//   }
	TFReplicaSpecs map[TFReplicaType]*common.ReplicaSpec `json:"tfReplicaSpecs"`
}

This way the operators do not need to duplicate common functionalities. But the operators are still loosely-coupled enough such that they do not have to rely on a single implementation.
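As a rough illustration of how an operator might consume the shared type, here is a minimal sketch that applies the defaults documented in the field comments above (CleanPodPolicy defaults to Running, BackoffLimit defaults to 6) and checks that ActiveDeadlineSeconds is positive. The helper names setRunPolicyDefaults and validateRunPolicy are hypothetical and not part of tf-operator, and the struct is redeclared locally, without the unimplemented SchedulingPolicy field, so the example is self-contained.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CleanPodPolicy is redeclared here so the sketch compiles on its own.
type CleanPodPolicy string

// CleanPodPolicyRunning is the default named in the RunPolicy comments.
const CleanPodPolicyRunning CleanPodPolicy = "Running"

// RunPolicy mirrors the proposed common type, minus SchedulingPolicy.
type RunPolicy struct {
	CleanPodPolicy          *CleanPodPolicy `json:"cleanPodPolicy,omitempty"`
	TTLSecondsAfterFinished *int32          `json:"ttlSecondsAfterFinished,omitempty"`
	ActiveDeadlineSeconds   *int64          `json:"activeDeadlineSeconds,omitempty"`
	BackoffLimit            *int32          `json:"backoffLimit,omitempty"`
}

// setRunPolicyDefaults (hypothetical helper) fills in the defaults stated
// in the field comments. TTLSecondsAfterFinished stays nil, which stands
// for "infinite".
func setRunPolicyDefaults(p *RunPolicy) {
	if p.CleanPodPolicy == nil {
		running := CleanPodPolicyRunning
		p.CleanPodPolicy = &running
	}
	if p.BackoffLimit == nil {
		limit := int32(6)
		p.BackoffLimit = &limit
	}
}

// validateRunPolicy (hypothetical helper) enforces the one constraint the
// comments spell out: ActiveDeadlineSeconds must be a positive integer.
func validateRunPolicy(p *RunPolicy) error {
	if p.ActiveDeadlineSeconds != nil && *p.ActiveDeadlineSeconds <= 0 {
		return fmt.Errorf("activeDeadlineSeconds must be positive, got %d", *p.ActiveDeadlineSeconds)
	}
	return nil
}

func main() {
	p := &RunPolicy{}
	setRunPolicyDefaults(p)
	if err := validateRunPolicy(p); err != nil {
		fmt.Println("invalid run policy:", err)
		return
	}
	out, err := json.Marshal(p)
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(out))
}
```

Because every field is a pointer with omitempty, an operator can tell "unset" apart from an explicit zero value, which is what makes defaulting in one shared place workable.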

How does this sound?

@gaocegege @johnugeorge @terrytangyuan @cheyang @k82cn


terrytangyuan · 2019-03-15T00:31:18Z

Looks great to me in general! I really like the newly introduced RunPolicy type. Also, are we missing JobStatus here, which is what initially triggered our further discussion on the common operator?

richardsliu · 2019-03-15T00:34:33Z

@terrytangyuan JobStatus is already in TFJob, so there should not be any changes:

type TFJob struct {
	metav1.TypeMeta `json:",inline"`

	// Standard object's metadata.
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// Specification of the desired behavior of the TFJob.
	Spec TFJobSpec `json:"spec,omitempty"`

	// Most recently observed status of the TFJob.
	// This data may not be up to date.
	// Populated by the system.
	// Read-only.
	Status common.JobStatus `json:"status,omitempty"`
}

terrytangyuan · 2019-03-15T00:44:22Z

Thanks. Sounds good then.

johnugeorge · 2019-03-15T01:40:13Z

Yes. As we discussed, we can wait till #958 and #954 are merged.

Since this would be a breaking API change, should we target it for 0.5 itself? Could we refactor the API now and move the code later?

gaocegege · 2019-03-15T01:56:53Z

SGTM

I think it will help us a lot. One thing I care about is whether the common operator will be a real operator or just a repository for the commonly maintained CRD APIs.

richardsliu · 2019-03-15T02:09:38Z

@johnugeorge This will be after 0.5, and after those 2 PRs are merged.

@gaocegege I think it is cleaner for it to be just a repository for common APIs. What do you think?

johnugeorge · 2019-03-15T03:49:07Z

@richardsliu Since it would be a breaking API change, won't this need one more release before v1?

gaocegege · 2019-03-15T04:59:37Z

@richardsliu SGTM, this is much cleaner.

k82cn · 2019-03-15T08:32:12Z

+1 to this proposal!

Actually, we already have such a common operator for those frameworks, and we plan to open source it soon (I can give a demo in the weekly meeting if that's OK). It will try to support other frameworks: not only ML frameworks but also others, e.g. big data. This operator will also work with kube-batch for batch scheduling capability. I'm thinking about whether we can work together on that.

ScorpioCPH · 2019-03-15T15:01:35Z

+1 for the common operator. For the common API, can we abstract more, such as defining a common master/slave model for distributed jobs?

richardsliu · 2019-03-15T17:48:28Z

@k82cn @ScorpioCPH I have considered further abstractions to make the operator common for distributed jobs, but it can become difficult to keep one implementation across different frameworks. For example the MPI operator has framework-specific details that do not apply to TF or PyTorch. I think it will be cleaner and safer to just refactor the most shared and least controversial elements (such as RunPolicy) for now. However I would be interested to see how the common operator works.

k82cn · 2019-03-16T01:18:52Z

but it can become difficult to keep one implementation across different frameworks

We already have something there that runs TF and MPI jobs with one operator/controller; maybe I can give a demo later and we can discuss our next step. WDYT?

ScorpioCPH · 2019-03-17T14:27:45Z

@richardsliu SGTM, let's achieve the final goal step by step :)

richardsliu · 2019-03-18T17:24:24Z

@k82cn Thanks for the information you shared, we would be glad to see a demo. Will you be available on Mar 27 at 8:30 a.m. Beijing time (our community call in a week)?

richardsliu · 2019-03-18T18:57:53Z

@k82cn My current thought is that for the upcoming v1.0 release, we (kubeflow) will limit the scope of changes to just abstracting the common APIs and types. This will help stabilize our patterns across different training frameworks without introducing significant changes.

Meanwhile, I would like to see how volcano develops in parallel, in particular its batch-scheduling functionalities. Post v1, we can look at options for using volcano to introduce batch scheduling support to our common operators. How does that sound?

richardsliu · 2019-03-28T21:47:29Z

The common repository is created now: https://github.com/kubeflow/common. I thought about naming it api to follow the conventions of Kubernetes, but unlike https://github.com/kubernetes/api/ which contains only types and generated Go files, we also have common libraries like the job controller.

Directory structure will be mapped as follows:

pkg/api/common/<version>               --> common/operator/<version>
pkg/control                            --> common/job_controller [1]
pkg/common/jobcontroller               --> common/job_controller
pkg/common/util/<version>/unstructured --> common/unstructured/<version>
pkg/common/util/<version>/testutil     --> common/test_util/<version>
pkg/common/logger                      --> common/util [2]
pkg/common/util                        --> common/util

[1] This was originally its own package called control, but I only see it being used by jobcontroller, so I think it can be merged.
[2] Combined into util since there's not much on its own.

richardsliu · 2019-03-28T22:39:35Z

@johnugeorge @terrytangyuan @gaocegege
A couple of open questions:

What should the initial version be? TFJob and PytorchJob are currently in v1beta2 while MPI is in v1alpha2. Should we just start with v1?
What to do about dependency management? Other kubeflow components like kfctl are moving towards using Go modules (https://github.com/golang/go/wiki/Modules) but it requires Golang 1.11+.

terrytangyuan · 2019-03-29T00:48:20Z

@johnugeorge @terrytangyuan @gaocegege
A couple of open questions:

What should the initial version be? TFJob and PytorchJob are currently in v1beta2 while MPI is in v1alpha2. Should we just start with v1?

Yes v1 sounds good to me.

What to do about dependency management? Other kubeflow components like kfctl are moving towards using Go modules (https://github.com/golang/go/wiki/Modules) but it requires Golang 1.11+.

I am fine with Go modules, though from a quick glance it seems the majority of kubeflow projects don't use Go modules yet (only kfp does). MPI operator should be fine with Golang 1.11+.

johnugeorge · 2019-03-29T17:18:38Z

Sounds good to me.

k82cn · 2019-03-30T01:12:33Z

Post v1, we can look at options for using volcano to introduce batch scheduling support to our common operators. How does that sound?

That's great to work together on the batch scheduling part :)

ScorpioCPH · 2019-04-02T06:01:47Z

@richardsliu kubebuilder provides powerful libraries and tools for building operators; we have used it to build some operators internally :)

merlintang · 2019-04-14T05:07:13Z

This would be very useful for other types of operators. I have one question: can we provide documentation or an example showing how other operators can use the common operator? This would help new users get started.

richardsliu · 2019-04-16T00:36:40Z

@merlintang Agreed, we should create an example to demonstrate how to use the common APIs. Meanwhile, @jian-he has committed a PR that defines the common interfaces to be implemented by custom operators: kubeflow/common#12.

jlewi · 2019-04-22T12:13:12Z

@richardsliu What is the remaining work to close out this issue?

jian-he · 2019-04-22T19:36:47Z

@jlewi
The common repo still needs a couple of patches to go in and also requires documentation and examples.

richardsliu · 2019-04-22T21:04:15Z

We also need a plan to migrate the existing operators (tf, pytorch, mxnet, etc) to the common library.

richardsliu · 2019-05-09T17:53:19Z

After discussion with contributors, we are postponing the migration of TF and Pytorch to the common library. Instead, a new operator (e.g. XGBoost or MPI) can start using it first, which gives us sufficient time to find all the issues. So this issue will no longer be blocking for TFJob 1.0.

merlintang · 2019-05-09T18:07:32Z

An XGBoost operator based on the common library would be a better way to debug and demonstrate it.


gaocegege · 2019-07-29T08:07:48Z

I think we can close the issue now. We already have the repo for the common operator.

This was referenced Mar 27, 2019

Pytorch documentation for v1beta1 and v1beta2 kubeflow/website#545

Merged

TFJob 1.0 #968

Closed

This was referenced Apr 2, 2019

Add common types to kubeflow/common kubeflow/common#2

Merged

Common job controller library kubeflow/common#5

Merged

richardsliu added this to To Do in TFJob-PyTorch 1.0 Apr 8, 2019

jlewi moved this from To Do to In Progress in TFJob-PyTorch 1.0 Apr 15, 2019

richardsliu removed this from In Progress in TFJob-PyTorch 1.0 May 9, 2019

johnugeorge mentioned this issue May 15, 2019

Proposal: using gang scheduling API for generic distributed training support in Kubeflow kubeflow/common#37

Open

gaocegege closed this as completed Jul 29, 2019
