Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[discussion] Differences between tensorflow/k8s and caicloud/kubeflow-controller #283

Closed
6 tasks done
gaocegege opened this issue Jan 11, 2018 · 11 comments
Closed
6 tasks done

Comments

@gaocegege
Copy link
Member

gaocegege commented Jan 11, 2018

Checklist

For tracking all issues we discussed above, we'd better set a checklist as below:

We plan to contribute our controller caicloud/kubeflow-controller to tensorflow/k8s. To do this we summarize the differences between tensorflow/k8s and caicloud/kubeflow-controller. Welcome comments to help us solve these conflicts :-)

Version: 2 (Every time I updated the content I will update the version number here)

CRD

TFJobSpec

TensorBoard support (TBD)

#209 (comment)

I think the current design of CRD is dedicated to TensorFlow training jobs. While single CRD for multiple TFJob types might not a good choose. So I'd like to close this issue, and we can open another PR to discuss the CRD of TensorBoard and TensorFlow Serving. @jlewi WDYT

caicloud/kubeflow-controller's CRD only support training, while the TFJobSpec in tensorflow/k8s still has TensorBoard. I think we could remove the field if we do not support TensorBoard in the CRD.

    //TensorBoardSpec specifies the configuration to start a TensorBoard deployment
	TensorBoard *TensorBoardSpec `json:"tensorboard"`

TfImage (TBD)

    // TfImage defines the tensorflow docker image that should be used for Tensorboard
	// and the default parameter server
	TfImage string `json:"tfImage,omitempty"`

We think the image is not necessary since the TFReplica has a pod template, which contains image field.

TerminationPolicy (TBD)

	// TerminationPolicy specifies the condition that the tfjob should be considered finished.
	TerminationPolicy *TerminationPolicySpec `json:"terminationPolicy,omitempty"`

We placed the policy in TFReplicaSpec, not int TFJobSpec, although we haven't implement the logic to finish the TFJob according to the condition defined in TerminationPolicySpec. Personally, I think we should discuss about it.

Dir (TBD)

        // DataDir specify the path of dataset
	DataDir string `json:"dataDir,omitempty"`
	// ModelDir specify the path of checkpoint, graph, etc
	ModelDir string `json:"modelDir,omitempty"`
	// LogDir specify the path of tf.events files
	LogDir string `json:"logDir,omitempty"`
	// ExportDir specify the path of saved model
        ExportDir string `json:"exportDir,omitempty"`

There is an issue about it: #224. Personally, I think if this controller is not used directly by AI engineers and there is no need to add these dirs since these could be added in TFReplicaSpec's pod template, since we all know how to write pod specification.

TFReplicaSpec

Type (TBD)

We have three type: local, worker, and PS. tensorflow/k8s has master, worker, and PS.

IsDefaultPS (TBD)

	// IsDefaultPS denotes if the parameter server should use the default grpc_tensorflow_server
	IsDefaultPS bool

We do not have a pre-defined grpc_tensorflow_server in the repository, so we do not have this field.

TfPort (TBD)

	// TfPort is the port to use for TF services.
	TfPort        *int32 `json:"tfPort,omitempty" protobuf:"varint,1,opt,name=tfPort"`

Container spec in pod template has Port so I am not sure if we need this field.

TfJobStatus

TFReplicasStates (TBD)

Now we keep consistent with tensorflow/k8s but there is one comment to use a new field to replace TFReplicasStates: https://github.com/caicloud/kubeflow-controller/issues/80#issuecomment-356509119

    // ReplicasStates provides the number of TFReplicas in each status.
	TFReplicasStates map[TFReplicaState]int `json:"tfReplicasStates,omitempty"`
// TFClusterStatus represents a TensorFlow cluster status.
// See: https://www.tensorflow.org/deploy/distributed
//      https://www.tensorflow.org/api_docs/python/tf/train/ClusterSpec
// It is a map from "job_name + task_index" to status (submitted, created, failed):
// {
//     "worker_0": "created",
//     "worker_1": "created",
//     "worker_2": "submitted",
//     "worker_4": "failed",
//     "ps_0": "created",
//     "ps_1": "submitted",
// }
type TFClusterStatus map[string]string

We think the TFReplicasStates misses a little info and we should keep all statuses of the instances.

Controller

When to create CRD (TBD)

There is a discussion about this section: #281. tensorflow/k8s creates CRD when the controller is initialized, and our controller assumes that the CRD has been created before the controller is run.

How to manage replicas (Solved)

In the current implementation if a TfProcess (e.g. PS, MASTER, WORKER) has N replicas. We end up creating N job controllers. This is largely a hold over from the initial implementation which predated StatefulSets. Now that StatefulSets are more mature we should consider switching to StatefulSets. This should simplify the logic in the CRD.

There are lots of discussions: #45 and https://github.com/caicloud/kubeflow-controller/issues/71. Now we come to an agreement:

Pod is more appropriate, then we could reuse the code in caicloud/kubeflow-controller.

Operator or controller (Solved)

According to #206, we decide to move to the controller pattern

@DjangoPeng
Copy link
Member

@jlewi PTAL

@gaocegege
Copy link
Member Author

@jlewi

I think we could discuss the differences together, some of them are not necessary for upstream and we could remove them.

Now I am trying to refactor the code and there are some things I could do now: #312 . But most of the refactor work is blocked by this issue. 😄

Look forward to receiving your reply

@jlewi
Copy link
Contributor

jlewi commented Jan 16, 2018

TfImage

TfImage is used as the default TensorFlow image for parameter servers and TensorBoard. If we get rid of it what image would we use for TensorBoard and parameter servers?

TerminationPolicy

TerminationPolicy as currently implemented is a property of the job and describes when the job should be considered terminated. So why would you add it to the replica?

Dirs

Per #224 lets add these at higher layers not in TfJobSpec.

Type

I think we should get rid of type and instead add properties that control the behavior of each replica e.g. restart behavior. Each replica can then be given a name by the user.

IsDefaultPs

Are you suggesting we get rid of support for standard gRPC servers for TF or keep it?

TfPort

You raise a good point about this being redundant with the ports in PodTemplateSpec. I think it makes sense to get rid of TfPort and just allow it to be specified in PodTemplateSpec. We could require that either

  • No ports be specified in which case a default port is specified
  • If a single port is specified we use that port
  • If multiple ports are specified one must be named TfPort

TfJobStatus

Regarding having a map from "job_name + task_index" to status (submitted, created, failed), won't this get pretty verbose for really large jobs? Why wouldn't we just emit events for each "job_name + task_index"?

What's the purpose of tracking individual replica state in the TfJobStatus?

It looks like StatefulSet only tracks aggregate statistics like readyReplicas.

When to create CRD

See #281.

@DjangoPeng
Copy link
Member

@gaocegege Thanks for sorting out the difference.
@jlewi Thanks for your reply.

TfImage

I prefer get rid of the TfImage field. For TensorBoard, it does not belong to the TFJob scope, so we just need to focus on PS. There are two alternative ways as below:

  • Use TensorFlow official image, e.g. tensorflow/tensorflow:1.4.0 .
  • We maintain a set of stable TensorFlow images according to our requirement and release cycle.

TerminationPolicy

@gaocegege I remember that we ever offline discussed the field. Is it for fine-grained controlling and termination policy at replica level?

Dirs

We have implemented and wrapped them in a higher level component internally, so please ignore them.

Type

It's helpful for users to set the replicas name. IMO, an explicit type field is benefit to AI engineer. Meanwhile we can provide some advanced policy to control the replicas of the same type.

IsDefaultPs

Outside Google, it's not a common case to provide default PS. We'd better get rid of it and focus on TensorFlow distributed training job lifecycle management.

TfPort

Vote for No ports be specified in which case a default port is specified .

TfJobStatus

For more complicated case, like model parallelism, there are different computation works for each replica, so we have to get the status and index of them. Even for data parallelism case, when it comes to policy (restart, restart and restore, don't restart), we have to known the index of replicas. If chief worker fail, the whole distributed job may fail. If non-chief worker fail, we may have different policy.

When to create CRD

Agree. We can discuss it in #281

CRD Name

I noticed that there is a difference of CRD name. We prefer TFJob cause TF is an abbreviation of TensorFlow. According to the Golang and Kubernetes convention, TFJob is better as well. While in tensorflow/k8s, it's TfJob. Could you give more info about the purpose and consideration of TfJob? @jlewi

@jlewi
Copy link
Contributor

jlewi commented Jan 19, 2018

@gaocegege @DjangoPeng Could we split this up into multiple issues for the unresolved issues? For me at least its easier to discuss a single issue per thread rather than a whole bunch of issues.

Most of what your proposing seems good to me but it will be easier to track and resolve as separate issues.

TfImage

I'd be ok getting rid of TfImage and perhaps as a result TB and DefaultPs (since they depend on it). Running TB was in many ways a hack and there are probably better solutions for running TB.

IsDefaultPs

I'd be ok getting rid of this and considering alternative suggestions; e.g. providing a Docker image that could be used as a TF server and maybe some ksonnet templates.

Type

I think this bears more discussion and relates to how we think about TerminationPolicy. e.g. should we derive TerminationPolicy for the replica from its type or should we allow TerminationPolicy to be set explicitly (in which case maybe we don't need type).

TfPort

Seems reasonable. I can't think of reason why users would want to specify a port.

TfJobStatus

I think determining the right API for Status deserves more thought and its own issue. What we have right now is very confusing.

CRD Name

My vote would be to follow the Kubernetes conventions.

@DjangoPeng
Copy link
Member

I have opened four issues to track clear and definitive parts. A checklist has been added at the most top of this issue.

PTAL @jlewi

@DjangoPeng
Copy link
Member

@ScorpioCPH Could you please open a dedicated issue for TfJobStatus?

@ScorpioCPH
Copy link
Member

FYI #333.

@DjangoPeng
Copy link
Member

@ScorpioCPH I have updated it to the checklist.

@jlewi
Copy link
Contributor

jlewi commented Jan 20, 2018

I think I've commented on all the issues.

@gaocegege
Copy link
Member Author

Closed by #492

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants