Job controller proposal #11746
Conversation
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project, in which case you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed, please reply here (e.g. "I signed it!") and we'll verify it.
Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist") If this message is too spammy, please complain to ixdy.
/sub
FWIW, I'd be curious as to whether you've thought about how much of what you need to implement for Jobs is specific to them, or whether it'd be useful to split out some of the functionality into a separate standalone object. That is, instead of Job -> Pod, perhaps Job -> {ForeverPod, RunToCompletionPod, Pet} -> Pod(s) would also make sense?
I think users care about which pods are doing which work units. For example, if there are repeated failures on the same work unit, users usually want to know about that so they can debug. Or they want to know which work units are taking the longest so they can change their sharding scheme. But the JobScheduler is oblivious to what work unit a pod does. Does this make it less useful to users? Or are those things handled by a different component? If the latter, I'm having trouble envisioning how it all fits together, so maybe you could present an example (pseudocode) of a map-reduce master that uses Job as a building block?
How would you envision running a map-reduce whose master starts a variable number of workers (e.g. based on the size of the input files or other heuristics)? A Job of size 1 that runs a master, and then the master starts the workers as a Job with a computed size, and the master does not exit with success until all the work is done? How are separate map, shuffle and reduce stages implemented? A Mapper Job followed by a Shuffler Job followed by a Reducer Job? Can I make those phases overlapping, while still respecting a total resource or pod count quota? What does that look like?
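A rough sketch of the master-of-size-1 pattern asked about above. Everything here is illustrative, not part of the proposal: `createJob`, `waitForCompletion`, and `countInputShards` are hypothetical stand-ins for real client calls and sizing heuristics.

```go
package main

import "fmt"

// jobSpecLite mirrors the two fields discussed in this proposal.
type jobSpecLite struct {
	Parallelism int // how many worker pods may run at once
	Completions int // how many successful pod runs are required
}

// createJob and waitForCompletion stand in for whatever client calls
// would be made against the apiserver; they are not proposed API.
func createJob(name string, spec jobSpecLite) error { return nil }
func waitForCompletion(name string) error           { return nil }

// countInputShards is a placeholder for the master's sizing
// heuristic, e.g. one completion per input file.
func countInputShards() int { return 42 }

func main() {
	// The master itself runs as a Job with Completions=1. It computes
	// the worker count from the input, creates a worker Job of that
	// size, and only exits successfully once all work is done, so the
	// outer Job of size 1 tracks the whole computation.
	shards := countInputShards()
	if err := createJob("mapper", jobSpecLite{Parallelism: 10, Completions: shards}); err != nil {
		panic(err)
	}
	if err := waitForCompletion("mapper"); err != nil {
		panic(err)
	}
	fmt.Println("all map work finished")
}
```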
@alex-mohr can you elaborate more on your Job -> {ForeverPod, RunToCompletionPod, Pet} -> Pod(s) suggestion? My understanding of the three objects you've mentioned between a Job and a Pod is that they're basically the same, IOW all of them represent a job that will be run to completion, and the actual time it runs depends entirely on the author of the Job. The only one that might be stepping out, imho, would be ForeverPod, which could be implemented with the current ReplicationController as well, but again that depends on the Job author, who might create a Job without any constraints which might end up running forever.
CLAs look good, thanks!
## Motivation
Jobs are needed for executing multi-pod computation to completion; a good example
here would be the ability to implement any type of batch oriented tasks or a MapReduce
or Hadoop style workload.
Please remove "or a MapReduce or Hadoop style workload"
@soltysh I was getting at whether Pod is the building block you want for Job, or whether there's an intermediate Thing (regardless of what we actually call it) that might be useful either independently or in other contexts. See the last para of #1624 (comment)
@alex-mohr
@erictune @alex-mohr my understanding of the Job is that it will be a supplement to a ReplicationController, which is responsible for running an app/task/whatever your image does forever, whereas a Job runs to completion. Obviously those runs will be represented by an intermediate object you'll actually be getting status from (in my proposal it's called the JobExec).
Additionally, I agree that adding the ability to assign a "virtual amount" of work per job execution would be valuable. I'll update my proposal after the weekend, if you don't mind.
}

// JobExec represents the current state of a single execution of a Job.
type JobExec struct {
This name shouldn't be abbreviated (so JobExecution). Task is also appropriate. Is there any opposition to using that?
Task might deserve to be a top-level API object in its own right.
We should also implement ActiveDeadlineSeconds on pods (attempts) if there is not already a way to do that.
Task is extremely overloaded in cluster management, but I think it's a good choice here.
Can you explain what you mean by making it a top-level object? As-is it can presumably be incorporated into other types, if that's what you were after?
I think we do already have ActiveDeadlineSeconds on pods: https://github.com/GoogleCloudPlatform/kubernetes/blob/affba42a0520ecf6bab040fb7971284ef9bf450a/pkg/kubelet/kubelet.go#L1358
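For reference, a minimal sketch of setting that existing field on a run-once pod, assuming the `pkg/api` types at the revision linked above; the pod name and image are made up.

```go
package main

import (
	"fmt"

	"github.com/GoogleCloudPlatform/kubernetes/pkg/api"
)

func main() {
	// Limit a run-once pod (one attempt) to 10 minutes of active
	// time; the kubelet kills the pod once the deadline passes,
	// per the kubelet code linked above.
	deadline := int64(600)
	pod := api.Pod{
		ObjectMeta: api.ObjectMeta{Name: "attempt-1"},
		Spec: api.PodSpec{
			ActiveDeadlineSeconds: &deadline,
			RestartPolicy:         api.RestartPolicyNever,
			Containers: []api.Container{
				{Name: "worker", Image: "example/worker"},
			},
		},
	}
	fmt.Println(pod.Name)
}
```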
By "top-level object", I mean it should have a dedicated REST path in the apiserver and storage path in etcd (like replication controller, node, pod, endpoint, see this), not be embedded in the JobStatus like it is in the current revision of this proposal. It also makes sense that Task would have it's own Spec and Status.
What would a task resource add compared to pods?
My intention was to replicate the behavior of ReplicationController with run-once pods in mind, which means I agree with Brian about not creating a top-level object for JobExecution. The only difference between a JobExecution and Pods is that the former groups a certain number of Pods, but that does not deserve its own object, imho.
I agree.
Other than the one suggestion, LGTM.
I'm sure there will be updates to this as you implement, so happy to merge this now.
@erictune let me change MaxParallelism to Parallelism as you suggested, and let's merge it.
Force-pushed from 1fcded7 to 01731a7
@erictune changed MaxParallelism to Parallelism; now it's ready for merge. Thank you!
Once we have this in, I'll update the ScheduledJob proposal (#11980) to match the API proposed here.
Force-pushed from 01731a7 to d978719
Fixed travis failure.
// job should be run with. Defaults to 1.
Completions *int

// Selector is a label query over pods that should match the pod count.
This comment is out of date.
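For context, a sketch of how the fields under review might sit together in the proposed spec; only the Completions and Selector comments come from this diff, the rest (Parallelism, Template, the package, and the Selector type) are assumptions.

```go
package expapi // placement is illustrative

import "github.com/GoogleCloudPlatform/kubernetes/pkg/api"

// JobSpec as assembled from the fragments reviewed above; not verbatim.
type JobSpec struct {
	// Parallelism is the maximum number of pods running at any given
	// time (renamed from MaxParallelism during this review).
	Parallelism *int

	// Completions is the number of successful pod completions the
	// job should be run with. Defaults to 1.
	Completions *int

	// Selector is a label query over pods that should match the pod
	// count, i.e. the pods this job counts as its own.
	Selector map[string]string

	// Template is the pod template from which the job creates pods.
	Template *api.PodTemplateSpec
}
```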
I don't think we need further iterations on the proposal at this point. It's pretty minimalistic. You might need to rebase in order to make shippable pass.
I'll update that comment. I just did a rebase to make travis happy; will look into shippable as well.
I don't see any difference in your shippable.yml from that at master/HEAD, so I just restarted Shippable in case it was a random failure.
Force-pushed from d978719 to 688f3da
I've updated the Selector comment; hopefully Shippable will like me more now.
Shippable failure is caused by a GitHub outage: https://status.github.com/messages
I kicked shippable to get a green status prior to merging.
@pmorie to answer your questions:
You'll still be searching for the particular condition JobSucceeded; if that value is True then you're all set.
Nope for both. This is the difference between phases and conditions. There are no direct phases you can observe a job to be in, IOW no state machine. Conditions, as stated in here, "...represent the latest available observations of an object's current state...". There's an issue regarding that topic I recommend reading.
Can you elaborate on that a bit? Do you mean something like prematurely killing a job if we know it won't reach the desired Completions?
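A sketch of what searching for that JobSucceeded condition might look like in client code; the condition type name is taken from the comment above, while the types below are minimal stand-ins for the proposed API, not real definitions.

```go
package main

import "fmt"

// Minimal stand-ins for the proposed types; field names follow the
// discussion above, not a finished API.
type JobCondition struct {
	Type   string // e.g. "JobSucceeded"
	Status string // "True", "False", or "Unknown"
}

type JobStatus struct {
	Conditions []JobCondition
}

type Job struct {
	Status JobStatus
}

// jobSucceeded reports whether the job carries a JobSucceeded
// condition observed as True. There is no phase/state machine to
// inspect; conditions are the only observations of job state.
func jobSucceeded(job *Job) bool {
	for _, c := range job.Status.Conditions {
		if c.Type == "JobSucceeded" && c.Status == "True" {
			return true
		}
	}
	return false
}

func main() {
	job := &Job{Status: JobStatus{Conditions: []JobCondition{
		{Type: "JobSucceeded", Status: "True"},
	}}}
	fmt.Println(jobSucceeded(job)) // true
}
```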
## Motivation

Jobs are needed for executing multi-pod computation to completion; a good example
here would be the ability to implement any type of batch oriented tasks.
should remove "any" b/c - workflow DAGs or graphs are not supported.
Job controller proposal
To continue discussion started in openshift/origin#3693
// cc @smarterclayton @timothysc @pmorie @bgrant0607 @davidopp @nwendling @derekwaynecarr