Proposal: WDL resources block #183
Comments
Not sure whether the resource naming (ie
Oh yeah, probably; I'll update.
While I can see the need for this in certain execution environments, it seems to add a significant level of complexity, especially by introducing the new object. I can understand the need to be able to use the same cluster for certain tasks, but should cluster lifecycle management be a part of the WDL specification? I feel like it's the wrong place to put it, but I don't know where the right place is. I would think that a specific implementation that supports a backend with these requirements might accept a secondary file that specifies the before and after tasks and does cluster orchestration, but I don't feel like this is needed in the core specification just yet.
Agree about the implementation-specific / certain-environments / should-this-be-in-WDL points, and also on the "but I do not know where the right place is?" point :) I could imagine it being a before and after that is specific to the execution engine and environment, like I think you are saying, and that also seems fine to me. The main point here is that I like the idea of people being able to write, modify, and reuse their own without needing to get that code into the execution environment. Not unlike the templates issue, this is a means of saving people from having to copy and paste a ton of WDL across workflows (you can clearly do this already by writing your own cluster management tasks), and copy/paste becomes a nightmare for maintenance.
Here are some more notes following a conversation with @abaumann and @cwhelan.

Take the following example:

```wdl
# task1 and task2 can be run in parallel
resources {
  before before_task1 {input: cluster_size=cluster_size}
  call my_task1 {input: cluster_name=before_task1.cluster_name, i="foo"}
  after after_task1 {input: cluster_name=before_task1.cluster_name}
}

resources {
  before before_task2 {input: cluster_size=cluster_size}
  call my_task2 {input: cluster_name=before_task2.cluster_name, i="foo"}
  after after_task2 {input: cluster_name=before_task2.cluster_name}
}
```

In this case, I would expect the DAG to contain two independent chains that can run in parallel: `before_task1 -> my_task1 -> after_task1` and `before_task2 -> my_task2 -> after_task2`.

Now if we have a case where task2 is dependent on task1, I would not expect the two blocks to run in parallel:

```wdl
# task1 then task2
resources {
  before before_task1 {input: cluster_size=cluster_size}
  call my_task1 {input: cluster_name=before_task1.cluster_name, i="foo"}
  after after_task1 {input: cluster_name=before_task1.cluster_name}
}

resources {
  before before_task2 {input: cluster_size=cluster_size}
  call my_task2 {input: cluster_name=before_task2.cluster_name, i=my_task1.output_file}
  after after_task2 {input: cluster_name=before_task2.cluster_name}
}
```

I would expect the DAG to order the second block after `my_task1`, since `my_task2` consumes `my_task1.output_file`.
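For concreteness, here is a minimal sketch of what the `before`/`after` callables themselves might look like; the task names, the `gcloud dataproc` commands, and the cluster naming are illustrative assumptions, not part of the proposal text:

```wdl
# Hypothetical cluster-management tasks; commands and names are illustrative only.
task before_task1 {
  Int cluster_size

  command {
    # Assumed: provision a Dataproc cluster and emit its name for downstream calls.
    gcloud dataproc clusters create my-cluster --num-workers ${cluster_size}
    echo "my-cluster"
  }

  output {
    String cluster_name = read_string(stdout())
  }
}

task after_task1 {
  String cluster_name

  command {
    # Assumed: tear the cluster down once the calls in this resources block finish.
    gcloud dataproc clusters delete ${cluster_name} --quiet
  }
}
```

Under this sketch, `before_task1.cluster_name` is the value the `call`s and the `after` consume, which is why both examples above pass it through explicitly.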
Did we land on a decision regarding whether to consider this for inclusion in WDL? My impression is that the answer is no: this should be done elsewhere. Can I get some ayes/nays on closing this issue? (Yes, I am absolutely procrastinating on some writing by cleaning up old issues.)
I don't think we specifically landed anywhere. My impression is that this goes against the goal of abstracting WDL from the execution environment. It's possible this logic could be put into the `hints` section.
Right. Closing this issue with the recommendation that, if someone cares very, very strongly, they can open a new proposal working out how this would work as a hints-based thing.
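If this is ever revisited as a hints-based proposal, a rough sketch of how it might look in a task using the WDL 1.1 `hints` section follows; the hint keys `cluster_name` and `cluster_size` and the job-submission command are made-up, engine-specific assumptions rather than anything defined by the spec:

```wdl
version 1.1

task my_spark_task {
  input {
    File i
  }

  command <<<
    # Assumed job-submission command; a real task would run spark-submit or similar.
    run_spark_job --input ~{i} --output out.txt
  >>>

  hints {
    # Hypothetical, engine-specific hints: an engine that manages Dataproc clusters
    # could interpret these to create/reuse/delete clusters; others would ignore them.
    cluster_name: "shared-spark-cluster"
    cluster_size: 10
  }

  output {
    File output_file = "out.txt"
  }
}
```

This keeps cluster lifecycle management out of the core language, at the cost of leaving the create/delete semantics entirely to the engine.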
This came from discussions at OpenBio Winter Codefest around how to allow people to manage computational resources, specifically driven by Spark (or grid engine, or whatever else).
When using an external resource like Dataproc to run Spark jobs on clusters, you need to do some management of those clusters during the lifetime of your workflow. In some cases you might want one cluster per task; in other cases you might want to reuse the same cluster across multiple tasks. To support this, this proposal adds a before and an after to help with that management.
New keywords below are `before`, `after`, and `resources`.

`before` is a callable that is called before any of the `call`s in a workflow. `after` is a callable that is called after all the `call`s are complete in a workflow, and is guaranteed to be called as long as `before` succeeds, regardless of whether `continueWhilePossible` is used or not. `resources` is a block that contains exactly one `before`, one `after`, and one or more `call`s inside of it. `resources` is added so that one workflow can have more than one set of `before` and `after`, e.g. to run tasks in parallel on different Spark clusters. `before` and `after` are used in the same way `call` is used.

An example I tried out using this and #182 for porting the Hail WDL task I've been working on (https://github.com/broadinstitute/firecloud-tools/blob/ab_hail_wdl/scripts/hail_wdl_test/hail_test_cleanup.wdl) to this proposed syntax.
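As a rough sketch of how a workflow might use the proposed keywords (task names and inputs below are illustrative assumptions, not the actual example from the linked file):

```wdl
workflow hail_example {
  Int cluster_size

  resources {
    # 'before' runs before any call in this block and provisions the cluster.
    before create_cluster {input: cluster_size=cluster_size}

    # Ordinary calls can consume outputs of the 'before' callable.
    call run_hail {input: cluster_name=create_cluster.cluster_name}

    # 'after' runs once all calls in the block finish, and is guaranteed to run
    # as long as 'before' succeeded, even if a call fails.
    after delete_cluster {input: cluster_name=create_cluster.cluster_name}
  }
}
```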