[discussion] Differences between tensorflow/k8s and caicloud/kubeflow-controller #283

gaocegege · 2018-01-11T07:43:50Z

Checklist

For tracking all issues we discussed above, we'd better set a checklist as below:

We plan to contribute our controller caicloud/kubeflow-controller to tensorflow/k8s. To do this we summarize the differences between tensorflow/k8s and caicloud/kubeflow-controller. Welcome comments to help us solve these conflicts :-)

Version: 2 (Every time I updated the content I will update the version number here)

CRD

TFJobSpec

TensorBoard support (TBD)

#209 (comment)

I think the current design of CRD is dedicated to TensorFlow training jobs. While single CRD for multiple TFJob types might not a good choose. So I'd like to close this issue, and we can open another PR to discuss the CRD of TensorBoard and TensorFlow Serving. @jlewi WDYT

caicloud/kubeflow-controller's CRD only support training, while the TFJobSpec in tensorflow/k8s still has TensorBoard. I think we could remove the field if we do not support TensorBoard in the CRD.

    //TensorBoardSpec specifies the configuration to start a TensorBoard deployment
	TensorBoard *TensorBoardSpec `json:"tensorboard"`

TfImage (TBD)

    // TfImage defines the tensorflow docker image that should be used for Tensorboard
	// and the default parameter server
	TfImage string `json:"tfImage,omitempty"`

We think the image is not necessary since the TFReplica has a pod template, which contains image field.

TerminationPolicy (TBD)

	// TerminationPolicy specifies the condition that the tfjob should be considered finished.
	TerminationPolicy *TerminationPolicySpec `json:"terminationPolicy,omitempty"`

We placed the policy in TFReplicaSpec, not int TFJobSpec, although we haven't implement the logic to finish the TFJob according to the condition defined in TerminationPolicySpec. Personally, I think we should discuss about it.

Dir (TBD)

        // DataDir specify the path of dataset
	DataDir string `json:"dataDir,omitempty"`
	// ModelDir specify the path of checkpoint, graph, etc
	ModelDir string `json:"modelDir,omitempty"`
	// LogDir specify the path of tf.events files
	LogDir string `json:"logDir,omitempty"`
	// ExportDir specify the path of saved model
        ExportDir string `json:"exportDir,omitempty"`

There is an issue about it: #224. Personally, I think if this controller is not used directly by AI engineers and there is no need to add these dirs since these could be added in TFReplicaSpec's pod template, since we all know how to write pod specification.

TFReplicaSpec

Type (TBD)

We have three type: local, worker, and PS. tensorflow/k8s has master, worker, and PS.

IsDefaultPS (TBD)

	// IsDefaultPS denotes if the parameter server should use the default grpc_tensorflow_server
	IsDefaultPS bool

We do not have a pre-defined grpc_tensorflow_server in the repository, so we do not have this field.

TfPort (TBD)

	// TfPort is the port to use for TF services.
	TfPort        *int32 `json:"tfPort,omitempty" protobuf:"varint,1,opt,name=tfPort"`

Container spec in pod template has Port so I am not sure if we need this field.

TfJobStatus

TFReplicasStates (TBD)

Now we keep consistent with tensorflow/k8s but there is one comment to use a new field to replace TFReplicasStates: https://github.com/caicloud/kubeflow-controller/issues/80#issuecomment-356509119

    // ReplicasStates provides the number of TFReplicas in each status.
	TFReplicasStates map[TFReplicaState]int `json:"tfReplicasStates,omitempty"`

// TFClusterStatus represents a TensorFlow cluster status.
// See: https://www.tensorflow.org/deploy/distributed
//      https://www.tensorflow.org/api_docs/python/tf/train/ClusterSpec
// It is a map from "job_name + task_index" to status (submitted, created, failed):
// {
//     "worker_0": "created",
//     "worker_1": "created",
//     "worker_2": "submitted",
//     "worker_4": "failed",
//     "ps_0": "created",
//     "ps_1": "submitted",
// }
type TFClusterStatus map[string]string

We think the TFReplicasStates misses a little info and we should keep all statuses of the instances.

Controller

When to create CRD (TBD)

There is a discussion about this section: #281. tensorflow/k8s creates CRD when the controller is initialized, and our controller assumes that the CRD has been created before the controller is run.

How to manage replicas (Solved)

In the current implementation if a TfProcess (e.g. PS, MASTER, WORKER) has N replicas. We end up creating N job controllers. This is largely a hold over from the initial implementation which predated StatefulSets. Now that StatefulSets are more mature we should consider switching to StatefulSets. This should simplify the logic in the CRD.

There are lots of discussions: #45 and https://github.com/caicloud/kubeflow-controller/issues/71. Now we come to an agreement:

Pod is more appropriate, then we could reuse the code in caicloud/kubeflow-controller.

Operator or controller (Solved)

According to #206, we decide to move to the controller pattern

The text was updated successfully, but these errors were encountered:

DjangoPeng · 2018-01-15T03:40:53Z

@jlewi PTAL

gaocegege · 2018-01-15T09:06:50Z

@jlewi

I think we could discuss the differences together, some of them are not necessary for upstream and we could remove them.

Now I am trying to refactor the code and there are some things I could do now: #312 . But most of the refactor work is blocked by this issue. 😄

Look forward to receiving your reply

jlewi · 2018-01-16T13:34:32Z

TfImage

TfImage is used as the default TensorFlow image for parameter servers and TensorBoard. If we get rid of it what image would we use for TensorBoard and parameter servers?

TerminationPolicy

TerminationPolicy as currently implemented is a property of the job and describes when the job should be considered terminated. So why would you add it to the replica?

Dirs

Per #224 lets add these at higher layers not in TfJobSpec.

Type

I think we should get rid of type and instead add properties that control the behavior of each replica e.g. restart behavior. Each replica can then be given a name by the user.

IsDefaultPs

Are you suggesting we get rid of support for standard gRPC servers for TF or keep it?

TfPort

You raise a good point about this being redundant with the ports in PodTemplateSpec. I think it makes sense to get rid of TfPort and just allow it to be specified in PodTemplateSpec. We could require that either

No ports be specified in which case a default port is specified
If a single port is specified we use that port
If multiple ports are specified one must be named TfPort

TfJobStatus

Regarding having a map from "job_name + task_index" to status (submitted, created, failed), won't this get pretty verbose for really large jobs? Why wouldn't we just emit events for each "job_name + task_index"?

What's the purpose of tracking individual replica state in the TfJobStatus?

It looks like StatefulSet only tracks aggregate statistics like readyReplicas.

When to create CRD

See #281.

DjangoPeng · 2018-01-18T03:08:57Z

@gaocegege Thanks for sorting out the difference.
@jlewi Thanks for your reply.

TfImage

I prefer get rid of the TfImage field. For TensorBoard, it does not belong to the TFJob scope, so we just need to focus on PS. There are two alternative ways as below:

Use TensorFlow official image, e.g. tensorflow/tensorflow:1.4.0 .
We maintain a set of stable TensorFlow images according to our requirement and release cycle.

TerminationPolicy

@gaocegege I remember that we ever offline discussed the field. Is it for fine-grained controlling and termination policy at replica level?

Dirs

We have implemented and wrapped them in a higher level component internally, so please ignore them.

Type

It's helpful for users to set the replicas name. IMO, an explicit type field is benefit to AI engineer. Meanwhile we can provide some advanced policy to control the replicas of the same type.

IsDefaultPs

Outside Google, it's not a common case to provide default PS. We'd better get rid of it and focus on TensorFlow distributed training job lifecycle management.

TfPort

Vote for No ports be specified in which case a default port is specified .

TfJobStatus

For more complicated case, like model parallelism, there are different computation works for each replica, so we have to get the status and index of them. Even for data parallelism case, when it comes to policy (restart, restart and restore, don't restart), we have to known the index of replicas. If chief worker fail, the whole distributed job may fail. If non-chief worker fail, we may have different policy.

When to create CRD

Agree. We can discuss it in #281

CRD Name

I noticed that there is a difference of CRD name. We prefer TFJob cause TF is an abbreviation of TensorFlow. According to the Golang and Kubernetes convention, TFJob is better as well. While in tensorflow/k8s, it's TfJob. Could you give more info about the purpose and consideration of TfJob? @jlewi

jlewi · 2018-01-19T01:40:03Z

@gaocegege @DjangoPeng Could we split this up into multiple issues for the unresolved issues? For me at least its easier to discuss a single issue per thread rather than a whole bunch of issues.

Most of what your proposing seems good to me but it will be easier to track and resolve as separate issues.

TfImage

I'd be ok getting rid of TfImage and perhaps as a result TB and DefaultPs (since they depend on it). Running TB was in many ways a hack and there are probably better solutions for running TB.

IsDefaultPs

I'd be ok getting rid of this and considering alternative suggestions; e.g. providing a Docker image that could be used as a TF server and maybe some ksonnet templates.

Type

I think this bears more discussion and relates to how we think about TerminationPolicy. e.g. should we derive TerminationPolicy for the replica from its type or should we allow TerminationPolicy to be set explicitly (in which case maybe we don't need type).

TfPort

Seems reasonable. I can't think of reason why users would want to specify a port.

TfJobStatus

I think determining the right API for Status deserves more thought and its own issue. What we have right now is very confusing.

CRD Name

My vote would be to follow the Kubernetes conventions.

DjangoPeng · 2018-01-19T03:58:33Z

I have opened four issues to track clear and definitive parts. A checklist has been added at the most top of this issue.

PTAL @jlewi

DjangoPeng · 2018-01-19T03:59:28Z

@ScorpioCPH Could you please open a dedicated issue for TfJobStatus?

ScorpioCPH · 2018-01-19T06:45:40Z

FYI #333.

DjangoPeng · 2018-01-19T09:46:20Z

@ScorpioCPH I have updated it to the checklist.

jlewi · 2018-01-20T15:54:32Z

I think I've commented on all the issues.

gaocegege · 2018-04-22T10:41:08Z

Closed by #492

ScorpioCPH mentioned this issue Jan 15, 2018

Fix a bunch of problems in TfJob CRD that crept in while tests were broken #308

Merged

jlewi mentioned this issue Jan 15, 2018

Make the TfJob controller more event driven #314

Closed

DjangoPeng added kind/enhancement kind/question labels Jan 18, 2018

This was referenced Jan 19, 2018

Deprecate TfPort and set default port for users #327

Closed

[Convention] Replace Tf with TF in CRD #328

Closed

Deprecate the IsDefaultPS field #329

Closed

Deprecate the TfImage field #330

Closed

ScorpioCPH mentioned this issue Jan 19, 2018

Refactor TFJobStatus in CRD API #333

Closed

DjangoPeng removed the kind/question label Jan 22, 2018

gaocegege added kind/discussion area/api area/lifecycle area/crd area/operator team/caicloud kind/api-change and removed kind/enhancement labels Feb 1, 2018

DjangoPeng mentioned this issue Apr 19, 2018

[v1alpha2] The state of distributed model training. #544

Closed

gaocegege closed this as completed Apr 22, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[discussion] Differences between tensorflow/k8s and caicloud/kubeflow-controller #283

[discussion] Differences between tensorflow/k8s and caicloud/kubeflow-controller #283

gaocegege commented Jan 11, 2018 •

edited

DjangoPeng commented Jan 15, 2018

gaocegege commented Jan 15, 2018

jlewi commented Jan 16, 2018

DjangoPeng commented Jan 18, 2018

jlewi commented Jan 19, 2018

DjangoPeng commented Jan 19, 2018

DjangoPeng commented Jan 19, 2018

ScorpioCPH commented Jan 19, 2018

DjangoPeng commented Jan 19, 2018

jlewi commented Jan 20, 2018

gaocegege commented Apr 22, 2018

[discussion] Differences between tensorflow/k8s and caicloud/kubeflow-controller #283

[discussion] Differences between tensorflow/k8s and caicloud/kubeflow-controller #283

Comments

gaocegege commented Jan 11, 2018 • edited

Checklist

CRD

TFJobSpec

TensorBoard support (TBD)

TfImage (TBD)

TerminationPolicy (TBD)

Dir (TBD)

TFReplicaSpec

Type (TBD)

IsDefaultPS (TBD)

TfPort (TBD)

TfJobStatus

TFReplicasStates (TBD)

Controller

When to create CRD (TBD)

How to manage replicas (Solved)

Operator or controller (Solved)

DjangoPeng commented Jan 15, 2018

gaocegege commented Jan 15, 2018

jlewi commented Jan 16, 2018

DjangoPeng commented Jan 18, 2018

jlewi commented Jan 19, 2018

TfImage

IsDefaultPs

Type

TfPort

TfJobStatus

CRD Name

DjangoPeng commented Jan 19, 2018

DjangoPeng commented Jan 19, 2018

ScorpioCPH commented Jan 19, 2018

DjangoPeng commented Jan 19, 2018

jlewi commented Jan 20, 2018

gaocegege commented Apr 22, 2018

gaocegege commented Jan 11, 2018 •

edited