
Refactor TFJobStatus in CRD API #333

Closed
ScorpioCPH opened this issue Jan 19, 2018 · 22 comments


@ScorpioCPH
Member

Hi, this is a separate tracker/discussion issue from #283.

Motivation

As we discussed, TFJobStatus needs to be more detailed so we can track the status of individual TF jobs. Here are our considerations:

  • Different data/model parallel training
    • Synchronous and Asynchronous
  • Manage the Pod/Service lifecycle more efficiently
    • Restart or ignore failed Pods depending on the strategy
  • Give us more precise information so we can keep good control of the whole process

What we have now

There is some confusion about the TFReplicaStatus we have now:

type TFReplicaStatus struct {
	TFReplicaType `json:"tf_replica_type"`

	// State is the overall state of the replica
	State ReplicaState `json:"state"`

	// ReplicasStates provides the number of replicas in each status.
	ReplicasStates map[ReplicaState]int
}
  • ReplicasStates just tells us how many Replicas are in each state (Running, Failed, Succeeded); it is hard to maintain and not very useful to the user (e.g. which one is running?)
  • I think users care about the overall status of a TFJob:
    • Has this Job started training?
    • And the status of each "job_name + task_index"

Proposed Design

Add TFClusterStatus in TFJob.Status:

// TFClusterStatus represents a TensorFlow cluster status.
// See: https://www.tensorflow.org/deploy/distributed
//      https://www.tensorflow.org/api_docs/python/tf/train/ClusterSpec
// It is a map from "job_name + task_index" to status (submitted, created, failed):
// chief worker is worker_0 by default (I'm not sure about this)
// {
//     "worker_0": "created",
//     "worker_1": "created",
//     "worker_2": "submitted",
//     "worker_4": "failed",
//     "ps_0": "created",
//     "ps_1": "submitted",
// }
type TFClusterStatus map[string]string
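For illustration only (the helper name and the phase-to-state mapping are assumptions, not part of the proposal), the operator could fill one entry of this map from an observed Pod roughly like this:

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
)

// Illustrative sketch: derive one TFClusterStatus entry (type above) from an
// observed Pod, keyed by "job_name + task_index" as in the example.
func setReplicaState(status TFClusterStatus, jobName string, taskIndex int, pod *v1.Pod) {
    key := fmt.Sprintf("%s_%d", jobName, taskIndex)
    switch pod.Status.Phase {
    case v1.PodFailed:
        status[key] = "failed"
    case v1.PodPending:
        status[key] = "submitted"
    default:
        status[key] = "created"
    }
}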

This topic is open for discussion, so please share your ideas and let's make it clearer together :)

@jlewi @jimexist @DjangoPeng @gaocegege @ddysher @mqliang @mitake WDYT?

@jlewi
Contributor

jlewi commented Jan 20, 2018

  • Is scalability an issue here? What if a job has 50 or 100 replicas? That seems like it would make the status unnecessarily verbose.

  • The pattern of most controllers (e.g. ReplicaSet, StatefulSet) seems to be to provide aggregated statistics. Why would we do anything different?

  • In many cases, it seems like the state of the job can be adequately captured by an overall status for the job (e.g. creating, running, done) and individual replica status is just unnecessary verbosity.

  • I think the only time you care about individual replicas is when one replica is in a different state (e.g. crash looping). Can we handle this case without reporting all replicas all the time?

    • We can use aggregate statistics to indicate to the user that not all replicas are in the same state (e.g. some are crash looping)
    • We can also use events to indicate that replicas are crash looping or terminated (see the sketch after this list).
    • Users can identify afflicted pods by looking at events or by getting the status of all pods associated with a job.
    • This flow seems like the common flow in K8s when dealing with controllers that start multiple replicas.
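As a rough sketch of the events idea (the recorder wiring, event reason, and helper name below are assumptions, not existing tf-operator code), a controller built on client-go could surface a crash-looping replica like this:

import (
    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/client-go/tools/record"
)

// Sketch: emit a Warning event on the TFJob object instead of enumerating
// every replica in TFJobStatus.
func reportCrashLoop(recorder record.EventRecorder, job runtime.Object, replica string) {
    recorder.Eventf(job, v1.EventTypeWarning, "ReplicaCrashLooping",
        "replica %s is crash looping", replica)
}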

/cc @erikerlandson @foxish

@ScorpioCPH
Member Author

Is scalability an issue here? What if a job has 50 or 100 replicas? That seems like it would make the status unnecessarily verbose.

@jlewi
Thanks, this is a good question!
I will re-think this. A quick comment: maybe we should let the user decide whether we restart a failed Pod or ignore it (via a StartPolicy).

@jlewi
Contributor

jlewi commented Jan 23, 2018

Yes. I generally agree we might want to let the user decide what the restart behavior should be.

@ScorpioCPH
Member Author

@jlewi @gaocegege How about this structure for TFJobStatus:

type TFJobStatus struct {
	// Status for the chief worker
	// The number of actively running chief workers.
	ActiveChief int32 `json:"activeChief"`
	// The number of chief workers which have completed.
	CompletedChief int32 `json:"completedChief"`
	// The number of chief workers which have failed.
	FailedChief int32 `json:"failedChief"`

	// Status for workers (includes the chief worker)
	// The number of actively running workers.
	ActiveWorkers int32 `json:"activeWorkers"`
	// The number of workers which have completed.
	CompletedWorkers int32 `json:"completedWorkers"`
	// The number of workers which have failed.
	FailedWorkers int32 `json:"failedWorkers"`

	// Status for PSs
	// The number of actively running PSs.
	ActivePSs int32 `json:"activePSs"`
	// The number of PSs which have failed.
	FailedPSs int32 `json:"failedPSs"`
}

@DjangoPeng
Member

@ScorpioCPH How about using Running instead of Active?
There is only one chief worker, so only one of ActiveChief, CompletedChief and FailedChief is 1 and the others would be 0. Am I right?

@gaocegege
Member

LGTM

@ScorpioCPH
Member Author

only one of ActiveChief, CompletedChief and FailedChief is 1 and the others would be 0.

Yes. And the same applies to a local job:

	// Status for a local job
	// The number of actively running local jobs.
	ActiveLocalJob int32 `json:"activeLocalJob"`
	// The number of local jobs which have completed.
	CompletedLocalJob int32 `json:"completedLocalJob"`
	// The number of local jobs which have failed.
	FailedLocalJob int32 `json:"failedLocalJob"`

@jlewi
Contributor

jlewi commented Feb 6, 2018

What is a LocalJob?

Why not

type TFJobStatus struct {
       ReplicaStatus []ReplicaStatus
}

type ReplicaStatus struct{
    Name string
    Active int32
    Completed int32
    Failed int32
}

This seems more flexible in terms of supporting changes to the Replica type (see #64).
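For instance, a hypothetical status value (the evaluator entry is only illustrative, see #61/#64) shows why this is flexible: a new process type is just one more list entry, not a new set of fields:

// Hypothetical example status; adding an evaluator needs no API change,
// only another entry in the list.
var exampleStatus = TFJobStatus{
    ReplicaStatus: []ReplicaStatus{
        {Name: "chief", Active: 1},
        {Name: "worker", Active: 3, Failed: 1},
        {Name: "ps", Active: 2},
        {Name: "evaluator", Active: 1},
    },
}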

@ScorpioCPH
Member Author

What is a LocalJob?

LocalJob is a non-distributed training job.

I'm not sure it is a good pattern to keep information in deeper layers. I think it might be a bit hard to get the information directly and explicitly.

And about flexibility and changes to the Replica type: can you give an example of the changes?
And since we use TFJob to represent TensorFlow jobs, keeping the status more explicit may be better.

@jlewi
Contributor

jlewi commented Feb 6, 2018

LocalJob is a non-distributed training job.
Why does a non-distributed job require special handling? Why isn't it just a job with a single chief, no workers, and no PS?

And about flexibility and changes to the Replica type: can you give an example of the changes?
And since we use TFJob to represent TensorFlow jobs, keeping the status more explicit may be better.

The proposal in #64 is to get rid of ReplicaType and introduce different properties to control different behaviors, such as restart behavior, rather than inferring this based on ReplicaType.

One motivation for this is to add replicas to do evaluation, which I believe is part of the Estimator API (#61).

If we define an Enum of replica types and explicitly have fields for each replica type, then any time TF introduces a new type of process (e.g. eval worker) our API has to change.

I'm not sure it is a good pattern to keep information in deeper layers. I think it might be a bit hard to get the information directly and explicitly.

Can you explain? If we have an array or map of Replica status then it's much easier to programmatically check all replicas, because you can just iterate over the map or list. I think using a container like a map or list makes it clear that all items are the same and should be treated identically.
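For example (a sketch against the list form above; the helper name is illustrative), one loop covers every replica uniformly:

// Sketch: check all replicas with a single loop, regardless of their role.
func anyReplicaFailed(statuses []ReplicaStatus) bool {
    for _, s := range statuses {
        if s.Failed > 0 {
            return true
        }
    }
    return false
}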

I think having distinct fields for each ReplicaType makes it harder to process programmatically and doesn't make it clear that they are identical.

Also, we should be consistent with the TFJobSpec. In the Spec we treat Replicas as a list of TFReplicaSpec, so I'd expect the status to have a ReplicaStatus field which is a list of TFReplicaStatus.

In the same way that a pod is an arbitrary list of containers, I think we should make TFJob an arbitrary list of replicated pods. What's the advantage of limiting TFJob to workers, ps, chief?

@ScorpioCPH
Member Author

Can you explain?

Sure: each time I want to know how many worker Pods are running, I must iterate over the whole array, filter for type == worker, and return the value I want. This is inconvenient.

If we have an array or map of Replica status then it's much easier to programmatically check all replicas, because you can just iterate over the map or list.

A map is better than an array/list: we can use map operations (e.g. set/get) directly.

a pod is an arbitrary list of containers.

But the containers follow the same template (there is no separate type).

How about using a map:

type TFJobStatus struct {
    // the key is the replica type as a string
    ReplicaStatus map[string]ReplicaStatus
}

type ReplicaStatus struct{
    Active int32
    Completed int32
    Failed int32
}
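A usage sketch for this shape (the helper name is illustrative; note that with value-typed map entries, Go requires copy-modify-store to update a field):

// Sketch: look a role up directly by name, then copy, modify, and store back.
func markWorkerCompleted(status *TFJobStatus) {
    ws := status.ReplicaStatus["worker"]
    ws.Active--
    ws.Completed++
    status.ReplicaStatus["worker"] = ws
}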

@jlewi
Contributor

jlewi commented Feb 6, 2018

I think the spec and status should be consistent; either both lists or both maps.

So option 1

type TFJobSpec struct {
   // ReplicaSpecs specifies the TF replicas to run.
   ReplicaSpecs []*TFReplicaSpec `json:"replicaSpecs"`
}

type TFJobStatus struct {
     ReplicaStatus []*TFReplicaStatus 
}

Option 2

type TFJobSpec struct {
   // ReplicaSpecs specifies the TF replicas to run.
  ReplicaSpecs map[string]*TFReplicaSpec
}

type TFJobStatus struct {
   ReplicaStatus map[string]*TFReplicaStatus
}

I think I prefer lists, for the following reasons:

  • This is the convention in K8s
    • e.g. a Pod contains a list of containers/volumes not a map of containers/volumes
  • With a map, the question becomes whether we store the name in ReplicaSpecs/ReplicaStatus or just rely on the key in the map
    • If we don't store the name in ReplicaSpecs/ReplicaStatus, then I think that makes for messy code, because we will usually want to deal with the tuples (name, ReplicaSpec) or (name, ReplicaStatus)
    • If we do store the name in ReplicaSpecs/ReplicaStatus, then the key will duplicate information in ReplicaSpecs.name/ReplicaStatus.name, and inevitably this will lead to bugs when they get out of sync

It's true that if you are dealing with a list you have to do more work to find the Spec/Status for a particular Replica; but the same is true for getting a container's spec/status in a pod with multiple containers.
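To make the duplication hazard concrete (a purely hypothetical shape for illustration; it is not the current TFReplicaSpec):

// Hypothetical spec type that embeds its own name.
type namedReplicaSpec struct {
    Name string
    // ... other spec fields
}

// With a map, the key duplicates the embedded name and the two can drift:
var specs = map[string]*namedReplicaSpec{
    "worker": {Name: "ps"}, // key says "worker", field says "ps": which wins?
}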

@ScorpioCPH
Member Author

Pod contains a list of containers/volumes not a map of containers/volumes.

TFJobs are not the same as Pods; we have ReplicaType, and that is the key point that makes things more complex.

store the name in ReplicaSpecs/ReplicaStatus.

Why should we store the name? I think it is read-only. Is there any chance we will modify it?

Another case: we create Pods/Services according to the ReplicaSpecs, then watch for these Pods being created to update the ReplicaStatus. We can set this directly via TFJob.Status["worker"].active without iteration.

@gaocegege
Member

I prefer a map, for the following reasons. The spec already reads naturally as a map from role to spec in YAML:

ps:
    # spec
worker:
    # spec

And a Pod has lists for containers and volumes because each container in a Pod is equivalent, IMO. In our case, PS and workers are different roles, so I think a map is more appropriate.

@ScorpioCPH
Member Author

@jlewi WDYT? Maybe we can start this change in api/v1alpha2.

@jlewi
Contributor

jlewi commented Feb 9, 2018

@ScorpioCPH yes, let's start api/v1alpha2.go and use that to iterate on the API.

TFJobs are not the same as Pods; we have ReplicaType, and that is the key point that makes things more complex.

I'm not sure I understand your point about ReplicaType. My suggestion is to get rid of ReplicaType and just have a Name and various properties (e.g. TerminationPolicy) that determine the behavior of a replica.

At this point a TFJob consists of many Replicas, each identified by a Name. The pattern in K8s, e.g. with volumes and containers, is that when you have an object containing many items, a list of named entities is used rather than a map from name to entity.

So if we use a map and not a list we aren't following K8s conventions.

Why should we store the name? I think it is read-only. Is there any chance we will modify it?

If you don't store the name then I think it makes the code more complicated, because ReplicaSpec is incomplete: it doesn't contain the name. So instead of doing

func somefunc(r ReplicaSpec) {
...
}

we'd have to do

func somefunc(name string, r ReplicaSpec) {
...
}

name is a key property of ReplicaSpec so it should be a part of the struct.

If you put name in ReplicaSpec and use a map then the user has to specify name in two places.

Another case: we create Pods/Services according to the ReplicaSpecs, then watch for these Pods being created to update the ReplicaStatus. We can set this directly via TFJob.Status["worker"].active without iteration.

I agree this might be a little more convenient, but I think the API issues mentioned above are more important because they aren't hidden. In this case we can define a simple func

func updateStatus(j *TFJobStatus, r *ReplicaStatus) {
....
}
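One possible body for that helper, assuming the list-based ReplicaStatus shape proposed earlier in this thread (a sketch, not existing operator code):

// Sketch: find the matching entry by name and update it in place,
// appending a new entry if none exists yet.
func updateStatus(j *TFJobStatus, r *ReplicaStatus) {
    for i := range j.ReplicaStatus {
        if j.ReplicaStatus[i].Name == r.Name {
            j.ReplicaStatus[i] = *r
            return
        }
    }
    j.ReplicaStatus = append(j.ReplicaStatus, *r)
}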

And a Pod has lists for containers and volumes because each container in a Pod is equivalent, IMO. In our case, PS and workers are different roles, so I think a map is more appropriate.

I don't understand this. Each container can run a different binary and have different properties (e.g. resource requirements). That seems analogous to the situation we have with replica.

@gaocegege
Member

gaocegege commented Feb 9, 2018

I don't understand this. Each container can run a different binary and have different properties (e.g. resource requirements). That seems analogous to the situation we have with replica.

Sorry for the vague comment.

My opinion is that containers can be handled by the same logic, but in our case PS and worker have different execution paths. Because PS and worker are different roles, we have to deal with them in different ways.

The logic to handle container creation may be:

func HandleContainerCreation(pod *api.Pod) {
    for _, container := range pod.Spec.Containers {
        _ = container // same logic here for every container
    }
}

But we cannot use the same code for PS and worker, so we have to:

func HandleTFJobCreation(tfJob *api.TFJob) {
    for _, spec := range tfJob.Specs {
        if isPS(spec) {
            // PS logic
        } else {
            // worker logic
        }
    }
}

For example, we will check whether all workers are finished, because if they are, the TFJob is considered finished. But we do not check the PS, since PS replicas never finish unless we kill them. When I was implementing an event-driven operator for TFJob, I found that the functions getPSSpec(tfJob) and getWorkerSpec(tfJob) were used heavily, since the spec is a list and I have to handle the roles in different ways.
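As a rough sketch of that completion rule (assuming the map-based status shape discussed above; the helper name and the exact handling of failures are assumptions):

// Sketch: only workers decide completion; PS replicas run until they are killed.
func isTFJobFinished(status TFJobStatus) bool {
    w, ok := status.ReplicaStatus["worker"]
    if !ok {
        return false
    }
    return w.Active == 0 && w.Completed > 0
}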

I think our case is similar to ResourceList, which is a map in Kubernetes core v1. But it also works for me if we use a list; I am not strongly against it.

@jlewi
Contributor

jlewi commented Feb 13, 2018

You make a good point with ResourceList.

The fact that we need different logic for different types of containers seems unrelated to whether it's a map or a list. We will still need different logic. However, with the proposed changes to get rid of ReplicaType, the logic will depend instead on the TerminationPolicy of the replica.

@gaocegege @ScorpioCPH you convinced me that a map is better. Let's go with a map from replica name to spec and a map from replica name to status.

@jlewi
Contributor

jlewi commented Mar 2, 2018

@ScorpioCPH @gaocegege Any update on api/v1alpha2?

I checked the refactor branch. I didn't see a v1alpha2 directory under https://github.com/kubeflow/tf-operator/tree/refactor/pkg/apis/tensorflow.

@gaocegege
Member

@jlewi I think we are going to implement v1alpha2 these days.

@ScorpioCPH
Member Author

ScorpioCPH commented Mar 2, 2018 via email

@gaocegege
Member

Closed by #492
