job DAG workflow proposal #17787
Conversation
Labelling this PR as size/L
GCE e2e test build/test passed for commit 2480ed60d8306a93ae64959dfbd09ea27ad6055d.
The author of this PR is not in the whitelist for merge, can one of the admins add the 'ok-to-merge' label?
## Implementation

The basic idea consists of adding a label selector to the current Job API object. The new selector determines the parent jobs, i.e. the jobs the current job depends on. The current job will be scheduled once all parent jobs have run to completion.
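As a rough sketch of the intended semantics (hypothetical, simplified stand-ins for the real Job API types, not actual controller code; the "at least one matching job must exist" rule reflects the wait-indefinitely behaviour discussed below):

```go
package main

import "fmt"

// Job is a simplified stand-in for the real API type; field names are illustrative.
type Job struct {
	Name           string
	Labels         map[string]string
	ParentSelector map[string]string // the proposed new field
	Complete       bool
}

// matches reports whether labels satisfy every key/value pair in selector.
func matches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// runnable reports whether a job may be scheduled: with no selector it starts
// immediately; with a selector it waits until at least one matching job exists
// and every matching job has run to completion.
func runnable(job Job, all []Job) bool {
	if len(job.ParentSelector) == 0 {
		return true // no parents: start immediately
	}
	found := false
	for _, candidate := range all {
		if candidate.Name == job.Name {
			continue
		}
		if matches(job.ParentSelector, candidate.Labels) {
			found = true
			if !candidate.Complete {
				return false // a parent is still running
			}
		}
	}
	return found // false while no matching parent exists yet
}

func main() {
	parent := Job{Name: "extract", Labels: map[string]string{"stage": "extract"}}
	child := Job{Name: "transform", ParentSelector: map[string]string{"stage": "extract"}}
	fmt.Println(runnable(child, []Job{parent, child})) // false: parent not complete yet
	parent.Complete = true
	fmt.Println(runnable(child, []Job{parent, child})) // true
}
```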
do we prevent circular dependencies, or are they ok?
In general this kind of system does not permit cycles (hence the DAG); the tool we currently have in production does not permit cycles either.
It's true that this is probably the major drawback of the label selector approach, see.
What happens if the parent job does not exist? Are we going to wait indefinitely? Warn a user?
Yes, we'll wait indefinitely, and warning the user is an option.
The status is the best option for that, imho. See this comment.
#### JobSpec and JobStatus

No modifications
I'd add a status field informing that a job is waiting for parents, if a parent selector is specified. That might be useful in the cases I was talking about in this comment. Or better, extend our current JobConditions to include that status.
## Known drawbacks

* Using only a label selector won't permit implementing backtracking for failures, a common feature of DAG workflow systems. In such cases [controllerRef](https://github.com/kubernetes/kubernetes/issues/2210#issuecomment-134878215) could help.
What kind of backtracking do you mean here? Does this address the issue partially?
Don't use indentation ;)
By backtracking I mean the ability to backtrack from a failed job to its parent job (which most of the time is the reason for the failure). Our current tool has this troubleshooting ability since the causality of the jobs is very clear. Using a label selector and a boolean field alone won't give us this feature.
But the misunderstanding is my fault. I wrote "implement" but I wasn't thinking of any specific functionality, just a way to troubleshoot failures.
I think I'm done, at least for now ;)
The initial proposal was written following @bgrant0607's idea; that's why I proposed a generalized label selector. But I agree with @davidopp's comments. The #341 model is too powerful here and could produce some unexpected behaviour.
@soltysh PTAL
GCE e2e test build/test passed for commit e2ee02ed7d382989d092538c6bf34c1cc3c6112b.
I agree with @sdminonne that k8s should provide users with some simple solution for building basic job workflows ootb. Once a user is familiar with the basic ideas and, most importantly with time, is able to better specify his requirements, it'll be easier for him to choose one of the more powerful solutions. @bgrant0607 the problems you're mentioning already apply to jobs atm, imho. What is proposed here is a slight enhancement that will greatly improve the user experience with jobs. Just my 2 cents ;)
The controller has to topologically sort all jobs every cycle in order to find things it can work on. It might as well do DFS to detect cycles. If there are millions of jobs in a namespace, doing this every job reconciliation cycle could be somewhat slow, but millions of jobs in a namespace seems unlikely for the near future; thousands should not slow things down too much. It could do some kind of incremental update, but that adds complexity, so we should not do that at first, though it's nice to have as an out. So I think I'm okay with the sorting overhead.
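The cycle check mentioned above could be the classic three-color DFS over the parent edges. A rough sketch (hypothetical shape, not the actual controller code; job names and the parent-edge map are illustrative):

```go
package main

import "fmt"

// color tracks DFS progress for cycle detection.
type color int

const (
	white color = iota // unvisited
	gray               // on the current DFS stack
	black              // fully explored
)

// hasCycle reports whether the dependency graph (job -> list of parent jobs)
// contains a cycle, using a three-color depth-first search: finding a gray
// node again means we followed a back edge, i.e. a cycle.
func hasCycle(parents map[string][]string) bool {
	state := make(map[string]color)
	var visit func(job string) bool
	visit = func(job string) bool {
		switch state[job] {
		case gray:
			return true // back edge: cycle detected
		case black:
			return false // already fully explored
		}
		state[job] = gray
		for _, p := range parents[job] {
			if visit(p) {
				return true
			}
		}
		state[job] = black
		return false
	}
	for job := range parents {
		if visit(job) {
			return true
		}
	}
	return false
}

func main() {
	dag := map[string][]string{"extract": nil, "transform": {"extract"}, "load": {"transform"}}
	fmt.Println(hasCycle(dag)) // false: a proper DAG

	cyclic := map[string][]string{"a": {"b"}, "b": {"a"}}
	fmt.Println(hasCycle(cyclic)) // true: a -> b -> a
}
```

The DFS visits each job and edge once, so the check is linear in the graph size, which matches the comment that thousands of jobs per reconciliation cycle should be acceptable.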
Because a missing job is treated as an unsatisfied dependency, jobs in a graph can be created in any order, and nothing starts until a dependency-less job is created. This works well. And a user can delete any job that has not been started yet, and insert a modified one without fear of a race. So that is good.
@sdminonne had you considered a list of job names instead of label selectors? Absent a clear use case for selectors, would this be easier for users to understand?
How would one start a DAG from a ScheduledJob?
@soltysh Job doesn't currently have the problems I mentioned. Job could be a composable building block for implementing a multi-cluster Ubernetes Job. It doesn't define a Job-level notion of success, for instance. Nor does it currently assign indices. Job doesn't guarantee execution order.
I agree with @erictune that a label selector isn't a good idea in this case.
If `job.spec.parentSelector` is absent, the `job` will be started immediately.
If `job.spec.parentSelector` matches no existing job, the job will wait indefinitely,
until a matching job is created and runs to completion.
There's no notion of Job-level success or failure. The dependent jobs would start regardless?
What would happen if a user deleted a job that other jobs were waiting on?
If dependent jobs consume the output of preceding jobs, they'd need to wait on the data, anyway, assuming weak consistency. Why not just start all the jobs?
If someone did `kubectl -f <file>` that contained all the jobs, or `kubectl -f <directory>`, what should happen if the jobs were created out of order? Should the behavior be the same as when depended-upon jobs were already deleted?
Also note that at some point we'll need to GC terminated jobs, so they may disappear spontaneously unless we do something to prevent that.
@sdminonne My point about Ubernetes is that either Job would need to become cross-cluster aware, which is undesirable for failure isolation reasons, or there would need to be a way to externalize dependency resolution that is different from this mechanism, at which point that might as well be used in the single-cluster case also.
@sdminonne JobSpec contains a PodTemplate, which contains labels. That's not a problem.
Were you thinking that to create a graph of N Jobs, N ScheduledJobs would be scheduled at the same time?
Regarding unique names: Whether using names or label selectors, in order to prevent conflicts between successive executions of a DAG, one would need to ensure that either previous instances were deleted before the subsequent set was launched or that each successive set was comprised of jobs with unique names and label selectors. Otherwise, later runs could delay completion of earlier ones (if using label selectors) or could be prevented from being created (due to name conflicts).
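One way to get the per-run uniqueness described above could be to stamp every job and selector in a given DAG execution with a run identifier, so selectors from one run never match jobs from another. A minimal sketch, assuming a hypothetical `run-id` label key (not part of any existing API):

```go
package main

import "fmt"

// stampRun returns a copy of labels with a unique per-run identifier added.
// The "run-id" key is hypothetical; any reserved key unique per execution works.
func stampRun(labels map[string]string, runID string) map[string]string {
	out := map[string]string{"run-id": runID}
	for k, v := range labels {
		out[k] = v
	}
	return out
}

// scopeSelector applies the same run identifier to a parent selector, so the
// selector only matches jobs belonging to the same DAG execution.
func scopeSelector(selector map[string]string, runID string) map[string]string {
	return stampRun(selector, runID)
}

func main() {
	labels := stampRun(map[string]string{"stage": "extract"}, "run-42")
	sel := scopeSelector(map[string]string{"stage": "extract"}, "run-43")
	// A selector scoped to run-43 must not match labels from run-42.
	fmt.Println(labels["run-id"] == sel["run-id"]) // false
}
```

This addresses the selector half of the conflict; name uniqueness would still need something like a run-ID suffix on job names to avoid creation conflicts.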
Regarding scheduling: I was thinking of creating a graph of jobs without any particular order, except for the first node of the graph (no parent => nil job label selector). Every job will wait for the termination (not failure or success) of its parents and then will start its pods; as soon as a job terminates it could eventually trigger children jobs.
About names (or labels): I see your point (I think), so a mechanism to uniquely identify jobs must be forward-propagated from the graph entry point (starting job) to the final nodes. Again, if I'm understanding well: modifying parentSelector (I know I'd need to modify the name and the object) to childSelector, the graph approach would work.
Sort of relevant: #13567
Note that we're also discussing initializers, which are another mechanism for deferring execution until preconditions are satisfied. #17305 @derekwaynecarr
A list of object references could maybe be extended to cross-cluster object references in the future.
```go
type JobSpec struct {
	...
	// Job label selector to detect parent jobs.
	ParentSelector map[string]string `json:"parentSelector"`
```
"parent" is what Chronos uses, but has a different meaning in other systems, and doesn't inherently imply "after", so we should use another term here.
Also, having worked on a number of workflow and dataflow systems: the system can make MUCH better decisions if it has a representation of the whole DAG at once.
@bgrant0607: I'm OK to modify this proposal to remove the label selector (which is the only thing that was added :) ) and to replace it with an array of references, at the price that the graph will become a
💯 cross-ref:
A common need is to be able to compose workflows -- to run a whole workflow as a single step of a larger workflow. The ways I've seen that accomplished at the configuration layer on top of an approach similar to the one in this proposal add a lot of complexity I'd like to avoid.
@bgrant0607 yep, as I was mentioning at KubeCon, in production we have a feature-rich workflow system running thousands of jobs per day. My goal was to have minimal support for composing jobs, no more, no less. If we want to support
@sdminonne What do you mean by |
@bgrant0607 the ability to manually re-start a sequence of steps and the ability to run only a subset of the sequence |
@sdminonne Let's keep this open for now. Would you write an alternative proposal with a
@erictune absolutely. Sorry, probably it was not clear: my idea was to modify this proposal, since what @bgrant0607 pointed out are real weak points. But I'm OK to keep this one
A proposal to modify the [`Job` resource](../../docs/user-guide/jobs.md) to implement a minimal
[Workflow management system](https://en.wikipedia.org/wiki/Workflow_management_system) in Kubernetes.
Workflows (aka DAG workflows, since jobs are organized in a Direct Acyclic Graph) are ubiquitous
nit: directed
@erictune I'm taking more time for the new proposal but I'm still working on it. |
@sdminonne may I close this, since we are working on #18827 now?
@erictune sure. I'll close it.
As mentioned in #11980
@pmorie @soltysh @erictune @bgrant0607 @mikedanese