
job DAG workflow proposal #17787

Closed · wants to merge 4 commits

Conversation

@sdminonne (Contributor, Author)

@EricMountain-1A @davidopp

@k8s-github-robot k8s-github-robot added kind/design Categorizes issue or PR as related to design. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 25, 2015
@k8s-github-robot

Labelling this PR as size/L

@mikedanese mikedanese added the area/api Indicates an issue on api area. label Nov 25, 2015
@k8s-bot commented Nov 25, 2015

GCE e2e test build/test passed for commit 2480ed60d8306a93ae64959dfbd09ea27ad6055d.

@k8s-github-robot

The author of this PR is not in the whitelist for merge, can one of the admins add the 'ok-to-merge' label?


## Implementation

The basic idea consists of adding a label selector to the current Job API object. The new selector identifies the parent jobs, i.e. the jobs the current job depends on. The current job is scheduled once all of its parent jobs have run to completion.
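
To make these semantics concrete, here is a minimal sketch of the gating check a job controller might perform, assuming a hypothetical `ParentSelector` field of equality-based label key/values (illustrative Go; not code from this PR):

```go
// Toy types standing in for the real API objects; every name below is
// illustrative, not part of this PR.
type Job struct {
	Name           string
	Labels         map[string]string
	ParentSelector map[string]string // the field this proposal adds
	Complete       bool              // true once the job has run to completion
}

// selectorMatches reports whether every key/value pair in sel is present in
// labels (simple equality-based selector semantics assumed here).
func selectorMatches(sel, labels map[string]string) bool {
	for k, v := range sel {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// canStart applies the proposed gating rule: no selector means start
// immediately; otherwise every matching (parent) job must have completed,
// and a selector that matches nothing counts as an unsatisfied dependency.
func canStart(job *Job, all []*Job) bool {
	if len(job.ParentSelector) == 0 {
		return true // no parents: schedule immediately
	}
	matched := false
	for _, parent := range all {
		if parent == job || !selectorMatches(job.ParentSelector, parent.Labels) {
			continue
		}
		matched = true
		if !parent.Complete {
			return false // a parent has not yet run to completion
		}
	}
	return matched // false when no parent exists yet: keep waiting
}
```

Note that under this rule a selector matching no jobs simply keeps the job pending, which is the "wait indefinitely" behaviour discussed below.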
Member

do we prevent circular dependencies, or are they ok?

Contributor Author

In general this kind of system does not permit cycles (hence DAG), and the tool we currently have in production does not permit cycles either.
It's true that this is probably the major drawback of the label selector approach, see.

Contributor

What happens if the parent job does not exist? Are we going to wait indefinitely? Warn a user?

Contributor Author

Yes, we'll wait indefinitely; warning a user is an option.

Contributor

The status is the best option for that, imho. See this comment.


#### JobSpec and JobStatus

No modifications
Contributor

I'd add a status field informing that a job is waiting for parents, if a parent selector is specified. That might be useful in the cases I was talking about in this comment. Or better, extend our current JobConditions to include that status.
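
For illustration, the suggested JobConditions extension might look roughly like this (the `WaitingForParents` value is a hypothetical name, not something defined by this PR):

```go
// Hypothetical JobConditions extension; "WaitingForParents" is an
// illustrative condition name, not something this PR defines.
type JobConditionType string

const (
	// JobComplete means the job has completed its execution.
	JobComplete JobConditionType = "Complete"
	// JobWaitingForParents would mean job.spec.parentSelector is set and at
	// least one matching parent job has not yet run to completion.
	JobWaitingForParents JobConditionType = "WaitingForParents"
)
```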


## Known drawbacks

* Using only a label selector won't permit implementing backtracking for failures, a common feature of DAG workflow systems. In such cases [controllerRef](https://github.com/kubernetes/kubernetes/issues/2210#issuecomment-134878215) could help.
Contributor

What kind of backtracking do you mean here? Does this address the issue partially?

Contributor

Don't use indentation ;)

Contributor Author

By backtracking I mean the ability to backtrack from a failed job to its parent job (which most of the time is the reason for the failure). Our current tool has this troubleshooting ability, since the causality of a job is very clear. Using a label selector and a boolean field alone won't give us this feature.

Contributor Author

The misunderstanding is my fault: I wrote "implement" but I wasn't thinking of any specific functionality, just a way to troubleshoot failures.

@soltysh (Contributor) commented Nov 27, 2015

I think I'm done, at least for now ;)

@sdminonne (Contributor, Author)

The initial proposal was written following @bgrant0607's idea; that's why I proposed a generalized label selector. But I agree with @davidopp's comments: the #341 model is too powerful here and could produce some unexpected behaviour.

@sdminonne (Contributor, Author)

@soltysh PTAL

@k8s-bot commented Nov 30, 2015

GCE e2e test build/test passed for commit e2ee02ed7d382989d092538c6bf34c1cc3c6112b.

@soltysh (Contributor) commented Dec 2, 2015

I agree with @sdminonne that k8s should provide users with some simple solution for building basic job workflows out of the box. Once a user is familiar with the basic ideas and, with time, better able to specify their requirements, it'll be easier for them to choose one of the more powerful solutions. @bgrant0607 the problems you're mentioning already apply to jobs atm, imho. What is proposed here is a slight enhancement that will greatly improve the user experience with jobs. Just my 2 cents ;)

@erictune (Member) commented Dec 2, 2015

The controller has to topologically sort all jobs every cycle in order to find things it can work on. It might as well do a DFS to detect cycles. If there are millions of jobs in a namespace, doing this every job reconciliation cycle could be somewhat slow, but millions of jobs in a namespace seems unlikely for the near future; thousands should not slow things down too much. The controller could do some kind of incremental update instead, but that adds complexity, so we should not do that at first; it is nice to have as an out, though. So I think I am okay with the sorting overheads.
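
A sketch of that cycle check (illustrative Go; not from this PR): a standard three-color depth-first search over the parent relation.

```go
// hasCycle runs a three-color DFS over the dependency graph, where parents
// maps a job name to the names of the jobs it depends on. All names are
// illustrative.
func hasCycle(parents map[string][]string) bool {
	const (
		white = iota // not yet visited
		gray         // on the current DFS path
		black        // fully explored
	)
	color := make(map[string]int)
	var visit func(string) bool
	visit = func(n string) bool {
		switch color[n] {
		case gray:
			return true // back edge: found a cycle
		case black:
			return false
		}
		color[n] = gray
		for _, p := range parents[n] {
			if visit(p) {
				return true
			}
		}
		color[n] = black
		return false
	}
	for n := range parents {
		if color[n] == white && visit(n) {
			return true
		}
	}
	return false
}
```

This visits every job and edge once, so the check is linear in the size of the graph, consistent with the estimate that thousands of jobs per cycle should be fine.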

@erictune (Member) commented Dec 2, 2015

Because a missing job is treated as an unsatisfied dependency, jobs in a graph can be created in any order, and nothing starts until a dependency-less job is created. This works well. And a user can delete any job that has not been started yet, and insert a modified one without fear of a race. So that is good.

@erictune (Member) commented Dec 2, 2015

@sdminonne had you considered a list of job names instead of label selectors? Absent a clear use case for selectors, would this be easier for users to understand?
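
That alternative would change the API shape roughly as follows (a hypothetical sketch; the `DependsOn` field name is illustrative, not from this PR):

```go
type JobSpec struct {
	// ...existing fields...

	// DependsOn names the jobs, in the same namespace, that must run to
	// completion before this job starts. (Illustrative field name.)
	DependsOn []string `json:"dependsOn,omitempty"`
}
```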

@bgrant0607 (Member)

How would one start a DAG from a ScheduledJob?

@bgrant0607 (Member)

@soltysh Job doesn't currently have the problems I mentioned. Job could be a composable building block for implementing a multi-cluster Ubernetes Job. It doesn't define a Job-level notion of success, for instance. Nor does it currently assign indices. Job doesn't guarantee execution order.

@bgrant0607 (Member)

I agree with @erictune that a label selector isn't a good idea in this case.

If `job.spec.parentSelector` is absent, the `job` starts immediately.
If `job.spec.parentSelector` matches no existing job, the job waits indefinitely, or until a matching job is created and runs to completion.
Member

There's no notion of Job-level success or failure. The dependent jobs would start regardless?

Member

What would happen if a user deleted a job that other jobs were waiting on?

Member

If dependent jobs consume the output of preceding jobs, they'd need to wait on the data, anyway, assuming weak consistency. Why not just start all the jobs?

Member

If someone did `kubectl create -f <file>` where the file contained all the jobs, or `kubectl create -f <directory>`, what should happen if the jobs were created out of order? Should the behavior be the same as when depended-upon jobs were already deleted?

Member

Also note that at some point we'll need to GC terminated jobs, so they may disappear spontaneously unless we do something to prevent that.

Member

@sdminonne My point about Ubernetes is that either Job would need to become cross-cluster aware, which is undesirable for failure isolation reasons, or there would need to be a way to externalize dependency resolution that is different from this mechanism, at which point that might as well be used in the single-cluster case also.

Member

@sdminonne JobSpec contains a PodTemplate, which contains labels. That's not a problem.

Were you thinking that to create a graph of N Jobs, N ScheduledJobs would be scheduled at the same time?

Member

Regarding unique names: whether using names or label selectors, in order to prevent conflicts between successive executions of a DAG, one would need to ensure either that previous instances were deleted before the subsequent set was launched, or that each successive set was comprised of jobs with unique names and label selectors. Otherwise, later runs could delay completion of earlier ones (if using label selectors) or could be prevented from being created (due to name conflicts).
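
One way a client could satisfy that uniqueness requirement (a sketch, not part of the proposal; `runID` and all names here are illustrative) is to stamp every job of a DAG execution with a run-unique identifier in both its name and its labels, and include that identifier in every parent selector:

```go
// runJobName and runLabels stamp a run-unique identifier onto each job of a
// DAG execution; runID (e.g. a timestamp or random suffix) and all names
// here are illustrative.
func runJobName(base, runID string) string {
	return base + "-" + runID
}

func runLabels(workflow, step, runID string) map[string]string {
	return map[string]string{
		"workflow": workflow,
		"step":     step,
		"run":      runID, // parent selectors must also match on this key
	}
}
```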

Contributor Author

Regarding scheduling: I was thinking of creating the graph of jobs without any particular order, except for the first node of the graph (no parent => nil job label selector). Every job will wait for the termination (regardless of failure or success) of its parents and then start its pods; as soon as a job terminates it can trigger its child jobs.

Contributor Author

About names (or labels): I see your point (I think), so a mechanism to uniquely identify jobs must be forward-propagated from the graph entry point (the starting job) to the final nodes. Again, if I'm understanding correctly: by changing parentSelector (I know I'd need to modify the name and the object too) to childSelector, the graph approach would work.

@bgrant0607 (Member)

Sort of relevant: #13567

@bgrant0607 (Member)

Note that we're also discussing initializers, which are another mechanism for deferring execution until preconditions are satisfied. #17305 @derekwaynecarr

@bgrant0607 (Member)

A list of object references could maybe be extended to cross-cluster object references in the future.

type JobSpec struct {
    ...
    // Label selector used to detect parent jobs.
    ParentSelector map[string]string `json:"parentSelector"`
Member

"parent" is what Chronos uses, but has a different meaning in other systems, and doesn't inherently imply "after", so we should use another term here.

@bgrant0607 (Member)

Also, having worked on a number of workflow and dataflow systems, the system can make MUCH better decisions if it has a representation of the whole DAG at once.

@sdminonne (Contributor, Author)

@bgrant0607: I'm OK with modifying this proposal to remove the label selector (which is the only thing that was added :) ) and replacing it with an array of references, at the price that the graph becomes a chain.

@timothysc (Member)

> Also, having worked on a number of workflow and dataflow systems, the system can make MUCH better decisions if it has a representation of the whole DAG at once.

💯

cross-ref:
http://ccl.cse.nd.edu/software/makeflow/
http://pegasus.isi.edu/

@bgrant0607 (Member)

A common need is to be able to compose workflows -- to run a whole workflow as a single step of a larger workflow. The ways I've seen that accomplished at the configuration layer on top of an approach similar to the one in this proposal add a lot of complexity I'd like to avoid.

@sdminonne (Contributor, Author)

@bgrant0607 yep, as I mentioned at KubeCon, in our production system we have a feature-rich workflow system running thousands of jobs per day. My goal was to have minimal support for composing jobs, no more, no less. If we want to support re-start and partial-run (not sure of the name you use for this) we need a new workflow resource. Thanks again for your comments and time on this.

@bgrant0607 (Member)

@sdminonne What do you mean by re-start and partial-run? They sound related to failure handling.

@sdminonne (Contributor, Author)

@bgrant0607 the ability to manually re-start a sequence of steps and the ability to run only a subset of the sequence

@erictune (Member) commented Dec 4, 2015

@sdminonne Let's keep this open for now. Would you write an alternative proposal with a workflow resource as well? Then the community can compare this proposal and the workflow one side by side.

@sdminonne (Contributor, Author)

@erictune absolutely. Sorry, it probably wasn't clear: my idea was to modify this proposal, since what @bgrant0607 pointed out are real weak points. But I'm OK with keeping this one as is, and I'll write the proposal with the workflow resource next week (aiming for Wednesday, more or less).
Thanks


A proposal to modify the [`Job` resource](../../docs/user-guide/jobs.md) to implement a minimal
[Workflow management system](https://en.wikipedia.org/wiki/Workflow_management_system) in Kubernetes.
Workflows (aka DAG workflows, since jobs are organized in a Direct Acyclic Graph) are ubiquitous
Member

nit: directed

@sdminonne (Contributor, Author)

@erictune I'm taking more time for the new proposal but I'm still working on it.

@erictune (Member)

@sdminonne may I close this since we are working on #18827 now?

@sdminonne (Contributor, Author)

@erictune sure, I'll close it.

@sdminonne sdminonne closed this Jan 13, 2016
@sdminonne sdminonne deleted the dag_workflow branch April 20, 2016 13:59