
job DAG workflow proposal #17787

Closed · wants to merge 4 commits

Conversation

@sdminonne (Contributor, Author)

@EricMountain-1A @davidopp

@k8s-github-robot k8s-github-robot added kind/design Categorizes issue or PR as related to design. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 25, 2015
@k8s-github-robot

Labelling this PR as size/L

@mikedanese mikedanese added the area/api Indicates an issue on api area. label Nov 25, 2015
@k8s-bot commented Nov 25, 2015

GCE e2e test build/test passed for commit 2480ed60d8306a93ae64959dfbd09ea27ad6055d.

@k8s-github-robot

The author of this PR is not in the whitelist for merge, can one of the admins add the 'ok-to-merge' label?


## Implementation

The basic idea consists of adding a label selector to the current Job API object. The new selector identifies the parent jobs, i.e. the jobs the current job depends on. The current job is scheduled once all of its parent jobs have run to completion.
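
To make these semantics concrete, here is a minimal sketch of the gating check a job controller might perform, assuming a hypothetical `ParentSelector` field of equality-based label key/values (illustrative Go; not code from this PR):

```go
// Toy types standing in for the real API objects; every name below is
// illustrative, not part of this PR.
type Job struct {
	Name           string
	Labels         map[string]string
	ParentSelector map[string]string // the field this proposal adds
	Complete       bool              // true once the job has run to completion
}

// selectorMatches reports whether every key/value pair in sel is present in
// labels (simple equality-based selector semantics assumed here).
func selectorMatches(sel, labels map[string]string) bool {
	for k, v := range sel {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// canStart applies the proposed gating rule: no selector means start
// immediately; otherwise every matching (parent) job must have completed,
// and a selector that matches nothing counts as an unsatisfied dependency.
func canStart(job *Job, all []*Job) bool {
	if len(job.ParentSelector) == 0 {
		return true // no parents: schedule immediately
	}
	matched := false
	for _, parent := range all {
		if parent == job || !selectorMatches(job.ParentSelector, parent.Labels) {
			continue
		}
		matched = true
		if !parent.Complete {
			return false // a parent has not yet run to completion
		}
	}
	return matched // false when no parent exists yet: keep waiting
}
```

Note that under this rule a selector matching no jobs simply keeps the job pending, which is the "wait indefinitely" behaviour discussed below.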
Member

do we prevent circular dependencies, or are they ok?

Contributor Author

In general this kind of system does not permit cycles (hence DAG), and the tool we currently have in production does not permit cycles either.
It's true that this is probably the major drawback of the label selector approach, see.

Contributor

What happens if the parent job does not exist? Are we going to wait indefinitely? Warn a user?

Contributor Author

Yes, we'll wait indefinitely; warning a user is an option.

Contributor

The status is the best option for that, imho. See this comment.


#### JobSpec and JobStatus

No modifications
Contributor

I'd add a status field informing that a job is waiting for parents, if a parent selector is specified. That might be useful in the cases I was talking about in this comment. Or better, extend our current JobConditions to include that status.
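
For illustration, the suggested JobConditions extension might look roughly like this (the `WaitingForParents` value is a hypothetical name, not something defined by this PR):

```go
// Hypothetical JobConditions extension; "WaitingForParents" is an
// illustrative condition name, not something this PR defines.
type JobConditionType string

const (
	// JobComplete means the job has completed its execution.
	JobComplete JobConditionType = "Complete"
	// JobWaitingForParents would mean job.spec.parentSelector is set and at
	// least one matching parent job has not yet run to completion.
	JobWaitingForParents JobConditionType = "WaitingForParents"
)
```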


## Known drawbacks

* Using only a label selector won't permit implementing backtracking for failures, a common feature of DAG workflow systems. In such cases [controllerRef](https://github.com/kubernetes/kubernetes/issues/2210#issuecomment-134878215) could help.
Contributor

What kind of backtracking do you mean here? Does this address the issue partially?

Contributor

Don't use indentation ;)

Contributor Author

By backtracking I mean the ability to backtrack from a failed job to its parent job (which most of the time is the reason for the failure). Our current tool has this troubleshooting ability, since the causality of a job is very clear. Using a label selector and a boolean field alone won't give us this feature.

Contributor Author

The misunderstanding is my fault: I wrote "implement" but I wasn't thinking of any specific functionality, just a way to troubleshoot failures.

@soltysh (Contributor) commented Nov 27, 2015

I think I'm done, at least for now ;)

@sdminonne (Contributor, Author)

The initial proposal was written following @bgrant0607's idea; that's why I proposed a generalized label selector. But I agree with @davidopp's comments: the #341 model is too powerful here and could produce some unexpected behaviour.

@sdminonne (Contributor, Author)

@soltysh PTAL

@k8s-bot commented Nov 30, 2015

GCE e2e test build/test passed for commit e2ee02ed7d382989d092538c6bf34c1cc3c6112b.

@soltysh (Contributor) commented Dec 2, 2015

I agree with @sdminonne that k8s should provide users with some simple solution for building basic job workflows out of the box. Once a user is familiar with the basic ideas and, with time, better able to specify their requirements, it'll be easier for them to choose one of the more powerful solutions. @bgrant0607 the problems you're mentioning already apply to jobs atm, imho. What is proposed here is a slight enhancement that will greatly improve the user experience with jobs. Just my 2 cents ;)

@erictune (Member) commented Dec 2, 2015

The controller has to topologically sort all jobs every cycle in order to find things it can work on. It might as well do a DFS to detect cycles. If there are millions of jobs in a namespace, doing this every job reconciliation cycle could be somewhat slow, but millions of jobs in a namespace seems unlikely for the near future; thousands should not slow things down too much. The controller could do some kind of incremental update instead, but that adds complexity, so we should not do that at first; it is nice to have as an out, though. So I think I am okay with the sorting overheads.
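
A sketch of that cycle check (illustrative Go; not from this PR): a standard three-color depth-first search over the parent relation.

```go
// hasCycle runs a three-color DFS over the dependency graph, where parents
// maps a job name to the names of the jobs it depends on. All names are
// illustrative.
func hasCycle(parents map[string][]string) bool {
	const (
		white = iota // not yet visited
		gray         // on the current DFS path
		black        // fully explored
	)
	color := make(map[string]int)
	var visit func(string) bool
	visit = func(n string) bool {
		switch color[n] {
		case gray:
			return true // back edge: found a cycle
		case black:
			return false
		}
		color[n] = gray
		for _, p := range parents[n] {
			if visit(p) {
				return true
			}
		}
		color[n] = black
		return false
	}
	for n := range parents {
		if color[n] == white && visit(n) {
			return true
		}
	}
	return false
}
```

This visits every job and edge once, so the check is linear in the size of the graph, consistent with the estimate that thousands of jobs per cycle should be fine.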

@erictune (Member) commented Dec 2, 2015

Because a missing job is treated as an unsatisfied dependency, jobs in a graph can be created in any order, and nothing starts until a dependency-less job is created. This works well. And a user can delete any job that has not been started yet, and insert a modified one without fear of a race. So that is good.

@erictune (Member) commented Dec 2, 2015

@sdminonne had you considered a list of job names instead of label selectors? Absent a clear use case for selectors, would this be easier for users to understand?
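
That alternative would change the API shape roughly as follows (a hypothetical sketch; the `DependsOn` field name is illustrative, not from this PR):

```go
type JobSpec struct {
	// ...existing fields...

	// DependsOn names the jobs, in the same namespace, that must run to
	// completion before this job starts. (Illustrative field name.)
	DependsOn []string `json:"dependsOn,omitempty"`
}
```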

@bgrant0607 (Member)

How would one start a DAG from a ScheduledJob?

@bgrant0607 (Member)

@soltysh Job doesn't currently have the problems I mentioned. Job could be a composable building block for implementing a multi-cluster Ubernetes Job. It doesn't define a Job-level notion of success, for instance. Nor does it currently assign indices. Job doesn't guarantee execution order.

@bgrant0607 (Member)

I agree with @erictune that a label selector isn't a good idea in this case.

If `job.spec.parentSelector` is absent, the `job` starts immediately.
If `job.spec.parentSelector` matches no existing job, the job waits indefinitely, or until a matching job is created and runs to completion.
Member

There's no notion of Job-level success or failure. The dependent jobs would start regardless?

Member

What would happen if a user deleted a job that other jobs were waiting on?

Member

If dependent jobs consume the output of preceding jobs, they'd need to wait on the data, anyway, assuming weak consistency. Why not just start all the jobs?

Member

If someone did `kubectl create -f <file>` where the file contained all the jobs, or `kubectl create -f <directory>`, what should happen if the jobs were created out of order? Should the behavior be the same as when depended-upon jobs were already deleted?

Member

Also note that at some point we'll need to GC terminated jobs, so they may disappear spontaneously unless we do something to prevent that.

Member

@sdminonne My point about Ubernetes is that either Job would need to become cross-cluster aware, which is undesirable for failure isolation reasons, or there would need to be a way to externalize dependency resolution that is different from this mechanism, at which point that might as well be used in the single-cluster case also.

Member

@sdminonne JobSpec contains a PodTemplate, which contains labels. That's not a problem.

Were you thinking that to create a graph of N Jobs, N ScheduledJobs would be scheduled at the same time?

Member

Regarding unique names: whether using names or label selectors, in order to prevent conflicts between successive executions of a DAG, one would need to ensure either that previous instances were deleted before the subsequent set was launched, or that each successive set was comprised of jobs with unique names and label selectors. Otherwise, later runs could delay completion of earlier ones (if using label selectors) or could be prevented from being created (due to name conflicts).
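
One way a client could satisfy that uniqueness requirement (a sketch, not part of the proposal; `runID` and all names here are illustrative) is to stamp every job of a DAG execution with a run-unique identifier in both its name and its labels, and include that identifier in every parent selector:

```go
// runJobName and runLabels stamp a run-unique identifier onto each job of a
// DAG execution; runID (e.g. a timestamp or random suffix) and all names
// here are illustrative.
func runJobName(base, runID string) string {
	return base + "-" + runID
}

func runLabels(workflow, step, runID string) map[string]string {
	return map[string]string{
		"workflow": workflow,
		"step":     step,
		"run":      runID, // parent selectors must also match on this key
	}
}
```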

Contributor Author

Regarding scheduling: I was thinking of creating the graph of jobs without any particular order, except for the first node of the graph (no parent => nil job label selector). Every job will wait for the termination (regardless of failure or success) of its parents and then start its pods; as soon as a job terminates it can trigger its child jobs.

Contributor Author

About names (or labels): I see your point (I think), so a mechanism to uniquely identify jobs must be forward-propagated from the graph entry point (the starting job) to the final nodes. Again, if I'm understanding correctly: by changing parentSelector (I know I'd need to modify the name and the object too) to childSelector, the graph approach would work.

@bgrant0607 (Member)

Sort of relevant: #13567

@bgrant0607 (Member)

Note that we're also discussing initializers, which are another mechanism for deferring execution until preconditions are satisfied. #17305 @derekwaynecarr

@bgrant0607 (Member)

A list of object references could maybe be extended to cross-cluster object references in the future.

type JobSpec struct {
    ...
    // Label selector used to detect parent jobs.
    ParentSelector map[string]string `json:"parentSelector"`
Member

"parent" is what Chronos uses, but has a different meaning in other systems, and doesn't inherently imply "after", so we should use another term here.

@bgrant0607 (Member)

Also, having worked on a number of workflow and dataflow systems, the system can make MUCH better decisions if it has a representation of the whole DAG at once.

@sdminonne (Contributor, Author)

@bgrant0607: I'm OK with modifying this proposal to remove the label selector (which is the only thing that was added :) ) and replacing it with an array of references, at the price that the graph becomes a chain.

@timothysc (Member)

> Also, having worked on a number of workflow and dataflow systems, the system can make MUCH better decisions if it has a representation of the whole DAG at once.

💯

cross-ref:
http://ccl.cse.nd.edu/software/makeflow/
http://pegasus.isi.edu/

@bgrant0607 (Member)

A common need is to be able to compose workflows -- to run a whole workflow as a single step of a larger workflow. The ways I've seen that accomplished at the configuration layer on top of an approach similar to the one in this proposal add a lot of complexity I'd like to avoid.

@sdminonne (Contributor, Author)

@bgrant0607 yep, as I mentioned at KubeCon, in our production system we have a feature-rich workflow system running thousands of jobs per day. My goal was to have minimal support for composing jobs, no more, no less. If we want to support re-start and partial-run (not sure of the name you use for this) we need a new workflow resource. Thanks again for your comments and time on this.

@bgrant0607 (Member)

@sdminonne What do you mean by re-start and partial-run? They sound related to failure handling.

@sdminonne (Contributor, Author)

@bgrant0607 the ability to manually re-start a sequence of steps and the ability to run only a subset of the sequence

@erictune (Member) commented Dec 4, 2015

@sdminonne Let's keep this open for now. Would you write an alternative proposal with a workflow resource as well? Then the community can compare this proposal and the workflow one side by side.

@sdminonne (Contributor, Author)

@erictune absolutely. Sorry, it probably wasn't clear: my idea was to modify this proposal, since what @bgrant0607 pointed out are real weak points. But I'm OK with keeping this one as is, and I'll write the proposal with the workflow resource next week (aiming for Wednesday, more or less).
Thanks


A proposal to modify the [`Job` resource](../../docs/user-guide/jobs.md) to implement a minimal
[Workflow management system](https://en.wikipedia.org/wiki/Workflow_management_system) in Kubernetes.
Workflows (aka DAG workflows, since jobs are organized in a Direct Acyclic Graph) are ubiquitous
Member

nit: directed

@sdminonne (Contributor, Author)

@erictune I'm taking more time for the new proposal but I'm still working on it.

@erictune (Member)

@sdminonne may I close this since we are working on #18827 now?

@sdminonne (Contributor, Author)

@erictune sure, I'll close it.

@sdminonne sdminonne closed this Jan 13, 2016
@sdminonne sdminonne deleted the dag_workflow branch April 20, 2016 13:59