
Refactor TFJobStatus in CRD API #333

Closed
ScorpioCPH opened this issue Jan 19, 2018 · 22 comments


@ScorpioCPH
Member

Hi, this is a separate tracker/discussion issue from #283.

Motivation

As we discussed, TFJobStatus needs to be more detailed so we can track the status of individual TF jobs. Here are our considerations:

  • Different data/model parallel training
    • Synchronous and Asynchronous
  • Manage the Pod/Service lifecycle more efficiently
    • Restart or ignore failed Pods depending on the strategy
  • Give us more precise information so we can keep good control of the whole process

What we have now

There is some confusion about the TFReplicaStatus we have now:

type TFReplicaStatus struct {
	TFReplicaType `json:"tf_replica_type"`

	// State is the overall state of the replica
	State ReplicaState `json:"state"`

	// ReplicasStates provides the number of replicas in each status.
	ReplicasStates map[ReplicaState]int
}
  • ReplicasStates just tells us how many Replicas are in each state (Running, Failed, Succeeded); it is hard to maintain and not very useful to the user (e.g. which one is running?)
  • I think users care about the overall status of a TFJob:
    • Has this Job started training?
    • And the status of each "job_name + task_index"

Proposed Design

Add TFClusterStatus in TFJob.Status:

// TFClusterStatus represents a TensorFlow cluster status.
// See: https://www.tensorflow.org/deploy/distributed
//      https://www.tensorflow.org/api_docs/python/tf/train/ClusterSpec
// It is a map from "job_name + task_index" to status (submitted, created, failed):
// chief worker is worker_0 by default (I'm not sure about this)
// {
//     "worker_0": "created",
//     "worker_1": "created",
//     "worker_2": "submitted",
//     "worker_4": "failed",
//     "ps_0": "created",
//     "ps_1": "submitted",
// }
type TFClusterStatus map[string]string
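For illustration only (the helper name and the phase-to-state mapping are assumptions, not part of the proposal), the operator could fill one entry of this map from an observed Pod roughly like this:

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
)

// Illustrative sketch: derive one TFClusterStatus entry (type above) from an
// observed Pod, keyed by "job_name + task_index" as in the example.
func setReplicaState(status TFClusterStatus, jobName string, taskIndex int, pod *v1.Pod) {
    key := fmt.Sprintf("%s_%d", jobName, taskIndex)
    switch pod.Status.Phase {
    case v1.PodFailed:
        status[key] = "failed"
    case v1.PodPending:
        status[key] = "submitted"
    default:
        status[key] = "created"
    }
}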

This topic is open for discussion, so please share your ideas and let's make it clearer together :)

@jlewi @jimexist @DjangoPeng @gaocegege @ddysher @mqliang @mitake WDYT?

@jlewi
Contributor

jlewi commented Jan 20, 2018

  • Is scalability an issue here? What if a job has 50 or 100 replicas? That seems like it would make the status unnecessarily verbose.

  • The pattern of most controllers (e.g. ReplicaSet, StatefulSet) seems to be to provide aggregated statistics. Why would we do anything different?

  • In many cases, it seems like the state of the job can be adequately captured by an overall status for the job (e.g. creating, running, done) and individual replica status is just unnecessary verbosity.

  • I think the only time you care about individual replicas is when one replica is in a different state (e.g. crash looping). Can we handle this case without reporting all replicas all the time?

    • We can use aggregate statistics to indicate to the user that not all replicas are in the same state (e.g. some are crash looping)
    • We can also use events to indicate that replicas are crash looping or terminated (see the sketch after this list).
    • Users can identify afflicted pods by looking at events or by getting the status of all pods associated with a job.
    • This flow seems like the common flow in K8s when dealing with controllers that start multiple replicas.
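As a rough sketch of the events idea (the recorder wiring, event reason, and helper name below are assumptions, not existing tf-operator code), a controller built on client-go could surface a crash-looping replica like this:

import (
    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/client-go/tools/record"
)

// Sketch: emit a Warning event on the TFJob object instead of enumerating
// every replica in TFJobStatus.
func reportCrashLoop(recorder record.EventRecorder, job runtime.Object, replica string) {
    recorder.Eventf(job, v1.EventTypeWarning, "ReplicaCrashLooping",
        "replica %s is crash looping", replica)
}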

/cc @erikerlandson @foxish

@ScorpioCPH
Member Author

Is scalability an issue here? What if a job has 50 or 100 replicas? That seems like it would make the status unnecessarily verbose.

@jlewi
Thanks, this is a good question!
I will re-think this. A quick comment: maybe we should let the user decide whether we restart a failed Pod or ignore it (via a StartPolicy).

@jlewi
Contributor

jlewi commented Jan 23, 2018

Yes. I generally agree we might want to let the user decide what the restart behavior should be.

@ScorpioCPH
Member Author

@jlewi @gaocegege How about this structure for TFJobStatus:

type TFJobStatus struct {
	// Status for the chief worker
	// The number of actively running chief workers.
	ActiveChief int32 `json:"activeChief"`
	// The number of chief workers which have completed.
	CompletedChief int32 `json:"completedChief"`
	// The number of chief workers which have failed.
	FailedChief int32 `json:"failedChief"`

	// Status for workers (includes the chief worker)
	// The number of actively running workers.
	ActiveWorkers int32 `json:"activeWorkers"`
	// The number of workers which have completed.
	CompletedWorkers int32 `json:"completedWorkers"`
	// The number of workers which have failed.
	FailedWorkers int32 `json:"failedWorkers"`

	// Status for PSs
	// The number of actively running PSs.
	ActivePSs int32 `json:"activePSs"`
	// The number of PSs which have failed.
	FailedPSs int32 `json:"failedPSs"`
}

@DjangoPeng
Member

@ScorpioCPH How about using Running instead of Active?
There is only one chief worker, so only one of ActiveChief, CompletedChief and FailedChief is 1 and the others would be 0. Am I right?

@gaocegege
Member

LGTM

@ScorpioCPH
Member Author

only one of ActiveChief, CompletedChief and FailedChief is 1 and the others would be 0.

Yes. And the same applies to a local job:

	// Status for a local job
	// The number of actively running local jobs.
	ActiveLocalJob int32 `json:"activeLocalJob"`
	// The number of local jobs which have completed.
	CompletedLocalJob int32 `json:"completedLocalJob"`
	// The number of local jobs which have failed.
	FailedLocalJob int32 `json:"failedLocalJob"`

@jlewi
Contributor

jlewi commented Feb 6, 2018

What is a LocalJob?

Why not

type TFJobStatus struct {
       ReplicaStatus []ReplicaStatus
}

type ReplicaStatus struct{
    Name string
    Active int32
    Completed int32
    Failed int32
}

This seems more flexible in terms of supporting changes to the Replica type (see #64).
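For instance, a hypothetical status value (the evaluator entry is only illustrative, see #61/#64) shows why this is flexible: a new process type is just one more list entry, not a new set of fields:

// Hypothetical example status; adding an evaluator needs no API change,
// only another entry in the list.
var exampleStatus = TFJobStatus{
    ReplicaStatus: []ReplicaStatus{
        {Name: "chief", Active: 1},
        {Name: "worker", Active: 3, Failed: 1},
        {Name: "ps", Active: 2},
        {Name: "evaluator", Active: 1},
    },
}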

@ScorpioCPH
Member Author

What is a LocalJob?

LocalJob is a non-distributed training job.

I'm not sure it is a good pattern to keep information in deeper layers. I think it might be a bit hard to get the information directly and explicitly.

And about flexibility and changes to the Replica type: can you give an example of the changes?
And since we use TFJob to represent TensorFlow jobs, keeping the status more explicit may be better.

@jlewi
Contributor

jlewi commented Feb 6, 2018

LocalJob is a non-distributed training job.
Why does a non-distributed job require special handling? Why isn't it just a job with a single chief, no workers, and no PS?

And about flexibility and changes to the Replica type: can you give an example of the changes?
And since we use TFJob to represent TensorFlow jobs, keeping the status more explicit may be better.

The proposal in #64 is to get rid of ReplicaType and introduce different properties to control different behaviors, such as restart behavior, rather than inferring this based on ReplicaType.

One motivation for this is to add replicas to do evaluation, which I believe is part of the Estimator API (#61).

If we define an Enum of replica types and explicitly have fields for each replica type, then any time TF introduces a new type of process (e.g. eval worker) our API has to change.

I'm not sure it is a good pattern to keep information in deeper layers. I think it might be a bit hard to get the information directly and explicitly.

Can you explain? If we have an array or map of Replica status then it's much easier to programmatically check all replicas, because you can just iterate over the map or list. I think using a container like a map or list makes it clear that all items are the same and should be treated identically.
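For example (a sketch against the list form above; the helper name is illustrative), one loop covers every replica uniformly:

// Sketch: check all replicas with a single loop, regardless of their role.
func anyReplicaFailed(statuses []ReplicaStatus) bool {
    for _, s := range statuses {
        if s.Failed > 0 {
            return true
        }
    }
    return false
}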

I think having distinct fields for each ReplicaType makes it harder to process programmatically and doesn't make it clear that they are identical.

Also, we should be consistent with the TFJobSpec. In the Spec we treat Replicas as a list of TFReplicaSpec, so I'd expect the status to have a ReplicaStatus field which is a list of TFReplicaStatus.

In the same way that a pod is an arbitrary list of containers, I think we should make TFJob an arbitrary list of replicated pods. What's the advantage of limiting TFJob to workers, ps, chief?

@ScorpioCPH
Member Author

Can you explain?

Sure: each time I want to know how many worker Pods are running, I must iterate over the whole array, filter for type == worker, and return the value I want. This is inconvenient.

If we have an array or map of Replica status then it's much easier to programmatically check all replicas, because you can just iterate over the map or list.

A map is better than an array/list: we can use map operations (e.g. set/get) directly.

a pod is an arbitrary list of containers.

But the containers follow the same template (there is no separate type).

How about using a map:

type TFJobStatus struct {
    // the key is the replica type as a string
    ReplicaStatus map[string]ReplicaStatus
}

type ReplicaStatus struct{
    Active int32
    Completed int32
    Failed int32
}
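A usage sketch for this shape (the helper name is illustrative; note that with value-typed map entries, Go requires copy-modify-store to update a field):

// Sketch: look a role up directly by name, then copy, modify, and store back.
func markWorkerCompleted(status *TFJobStatus) {
    ws := status.ReplicaStatus["worker"]
    ws.Active--
    ws.Completed++
    status.ReplicaStatus["worker"] = ws
}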

@jlewi
Contributor

jlewi commented Feb 6, 2018

I think the spec and status should be consistent; either both lists or both maps.

So option 1

type TFJobSpec struct {
   // ReplicaSpecs specifies the TF replicas to run.
   ReplicaSpecs []*TFReplicaSpec `json:"replicaSpecs"`
}

type TFJobStatus struct {
     ReplicaStatus []*TFReplicaStatus 
}

Option 2

type TFJobSpec struct {
   // ReplicaSpecs specifies the TF replicas to run.
  ReplicaSpecs map[string]*TFReplicaSpec
}

type TFJobStatus struct {
   ReplicaStatus map[string]*TFReplicaStatus
}

I think I prefer lists, for the following reasons:

  • This is the convention in K8s
    • e.g. a Pod contains a list of containers/volumes not a map of containers/volumes
  • With a map, the question becomes whether we store the name in ReplicaSpecs/ReplicaStatus or just rely on the key in the map
    • If we don't store the name in ReplicaSpecs/ReplicaStatus, then I think that makes for messy code, because we will usually want to deal with the tuples (name, ReplicaSpec) or (name, ReplicaStatus)
    • If we do store the name in ReplicaSpecs/ReplicaStatus, then the key will duplicate information in ReplicaSpecs.name/ReplicaStatus.name, and inevitably this will lead to bugs when they get out of sync

It's true that if you are dealing with a list you have to do more work to find the Spec/Status for a particular Replica; but the same is true for getting a container's spec/status in a pod with multiple containers.
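To make the duplication hazard concrete (a purely hypothetical shape for illustration; it is not the current TFReplicaSpec):

// Hypothetical spec type that embeds its own name.
type namedReplicaSpec struct {
    Name string
    // ... other spec fields
}

// With a map, the key duplicates the embedded name and the two can drift:
var specs = map[string]*namedReplicaSpec{
    "worker": {Name: "ps"}, // key says "worker", field says "ps": which wins?
}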

@ScorpioCPH
Member Author

Pod contains a list of containers/volumes not a map of containers/volumes.

TFJobs are not the same as Pods; we have ReplicaType, and that is the key point that makes things more complex.

store the name in ReplicaSpecs/ReplicaStatus.

Why should we store the name? I think it is read-only. Is there any chance we will modify it?

Another case: we create Pods/Services according to the ReplicaSpecs, then watch for these Pods being created to update the ReplicaStatus. We can set this directly via TFJob.Status["worker"].active without iteration.

@gaocegege
Member

I prefer a map, for the following reasons. The spec already reads naturally as a map from role to spec in YAML:

ps:
    # spec
worker:
    # spec

And a Pod has lists for containers and volumes because each container in a Pod is equivalent, IMO. In our case, PS and workers are different roles, so I think a map is more appropriate.

@ScorpioCPH
Member Author

@jlewi WDYT? Maybe we can start this change in api/v1alpha2.

@jlewi
Contributor

jlewi commented Feb 9, 2018

@ScorpioCPH yes, let's start api/v1alpha2.go and use that to iterate on the API.

TFJobs are not the same as Pods; we have ReplicaType, and that is the key point that makes things more complex.

I'm not sure I understand your point about ReplicaType. My suggestion is to get rid of ReplicaType and just have a Name and various properties (e.g. TerminationPolicy) that determine the behavior of a replica.

At this point a TFJob consists of many Replicas, each identified by a Name. The pattern in K8s, e.g. with volumes and containers, is that when you have an object containing many items, a list of named entities is used rather than a map from name to entity.

So if we use a map and not a list we aren't following K8s conventions.

Why should we store the name? I think it is read-only. Is there any chance we will modify it?

If you don't store the name then I think it makes the code more complicated, because ReplicaSpec is incomplete: it doesn't contain the name. So instead of doing

func somefunc(r ReplicaSpec) {
...
}

we'd have to do

func somefunc(name string, r ReplicaSpec) {
...
}

name is a key property of ReplicaSpec so it should be a part of the struct.

If you put name in ReplicaSpec and use a map then the user has to specify name in two places.

Another case: we create Pods/Services according to the ReplicaSpecs, then watch for these Pods being created to update the ReplicaStatus. We can set this directly via TFJob.Status["worker"].active without iteration.

I agree this might be a little more convenient, but I think the API issues mentioned above are more important because they aren't hidden. In this case we can define a simple func

func updateStatus(j *TFJobStatus, r *ReplicaStatus) {
....
}
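One possible body for that helper, assuming the list-based ReplicaStatus shape proposed earlier in this thread (a sketch, not existing operator code):

// Sketch: find the matching entry by name and update it in place,
// appending a new entry if none exists yet.
func updateStatus(j *TFJobStatus, r *ReplicaStatus) {
    for i := range j.ReplicaStatus {
        if j.ReplicaStatus[i].Name == r.Name {
            j.ReplicaStatus[i] = *r
            return
        }
    }
    j.ReplicaStatus = append(j.ReplicaStatus, *r)
}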

And a Pod has lists for containers and volumes because each container in a Pod is equivalent, IMO. In our case, PS and workers are different roles, so I think a map is more appropriate.

I don't understand this. Each container can run a different binary and have different properties (e.g. resource requirements). That seems analogous to the situation we have with replica.

@gaocegege
Member

gaocegege commented Feb 9, 2018

I don't understand this. Each container can run a different binary and have different properties (e.g. resource requirements). That seems analogous to the situation we have with replica.

Sorry for the vague comment.

My opinion is that containers can be handled by the same logic, but in our case PS and worker have different execution paths. Because PS and worker are different roles, we have to deal with them in different ways.

The logic to handle container creation may be:

func HandleContainerCreation(pod *api.Pod) {
    for _, container := range pod.Spec.Containers {
        _ = container // same logic here for every container
    }
}

But we cannot use the same code for PS and worker, so we have to:

func HandleTFJobCreation(tfJob *api.TFJob) {
    for _, spec := range tfJob.Specs {
        if isPS(spec) {
            // PS logic
        } else {
            // worker logic
        }
    }
}

For example, we will check whether all workers are finished, because if they are, the TFJob is considered finished. But we do not check the PS, since PS replicas never finish unless we kill them. When I was implementing an event-driven operator for TFJob, I found that the functions getPSSpec(tfJob) and getWorkerSpec(tfJob) were used heavily, since the spec is a list and I have to handle the roles in different ways.
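As a rough sketch of that completion rule (assuming the map-based status shape discussed above; the helper name and the exact handling of failures are assumptions):

// Sketch: only workers decide completion; PS replicas run until they are killed.
func isTFJobFinished(status TFJobStatus) bool {
    w, ok := status.ReplicaStatus["worker"]
    if !ok {
        return false
    }
    return w.Active == 0 && w.Completed > 0
}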

I think our case is similar to ResourceList, which is a map in Kubernetes core v1. But it also works for me if we use a list; I am not strongly against it.

@jlewi
Contributor

jlewi commented Feb 13, 2018

You make a good point with ResourceList.

The fact that we need different logic for different types of containers seems unrelated to whether it's a map or a list. We will still need different logic. However, with the proposed changes to get rid of ReplicaType, the logic will depend instead on the TerminationPolicy of the replica.

@gaocegege @ScorpioCPH you convinced me that a map is better. Let's go with a map from replica name to spec and a map from replica name to status.

@jlewi
Contributor

jlewi commented Mar 2, 2018

@ScorpioCPH @gaocegege Any update on api/v1alpha2?

I checked the refactor branch. I didn't see a v1alpha2 directory under https://github.com/kubeflow/tf-operator/tree/refactor/pkg/apis/tensorflow.

@gaocegege
Member

@jlewi I think we are going to implement v1alpha2 these days.

@ScorpioCPH
Member Author

ScorpioCPH commented Mar 2, 2018 via email

@gaocegege
Member

Closed by #492
