Proposal: WDL resources block #183

Closed
abaumann opened this issue Jan 19, 2018 · 8 comments
Comments

@abaumann

abaumann commented Jan 19, 2018

This came from discussions at OpenBio Winter Codefest around how to allow people to manage computational resources, specifically driven by Spark (or grid engine, or whatever else).

When using an external resource like Dataproc to run Spark jobs on clusters, you need to manage those clusters during the lifetime of your workflow. In some cases you might want one cluster per task; in others you might want to reuse the same cluster across multiple tasks. To support this, the proposal adds a before and an after hook to handle that management.

The new keywords are before, after, and resources:

- before is a callable that runs before any of the calls in a workflow.
- after is a callable that runs after all the calls are complete, and is guaranteed to run as long as before succeeds, regardless of whether continueWhilePossible is used.
- resources is a block that contains exactly one before, one after, and one or more calls. It exists so that one workflow can have more than one before/after pair, e.g. to run tasks in parallel on different Spark clusters.

before and after are invoked the same way call is.

workflow my_variant_caller {
    String cluster_size
    String bam
 
    resources {
        before before_task{input: cluster_size=cluster_size}
        call my_task1{input: cluster_name=before_task.cluster_name, bam=bam}
        after after_task{input: cluster_name=before_task.cluster_name}
    }

    output {
        String job_output = my_task1.spark_output
    }
}
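The "guaranteed cleanup" semantics of after can be sketched as engine logic. A minimal Python sketch (purely illustrative; the function name and callable shapes are hypothetical, not part of the proposal):

```python
# Hypothetical engine logic for one resources block: `after` runs whenever
# `before` succeeded, even if one of the calls fails (like try/finally).
def run_resources_block(before, calls, after):
    ctx = before()                    # e.g. returns {"cluster_name": ...}
    results = []
    try:
        for call in calls:
            results.append(call(ctx))  # a failure here still triggers `after`
    finally:
        after(ctx)                     # guaranteed cleanup, e.g. delete cluster
    return results
```

This mirrors the wording above: if before fails, after is never invoked; once before succeeds, after always is.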

Here is an example I tried out, using this proposal together with #182, to port the Hail WDL task I've been working on (https://github.com/broadinstitute/firecloud-tools/blob/ab_hail_wdl/scripts/hail_wdl_test/hail_test_cleanup.wdl) to the proposed syntax:

############################
Template file contents:

task start_cluster {
  # TODO: a struct later would be much easier
  Map[String, String] dataproc_cluster_specs

  command { ... spin up cluster here ... }

  output {
    String cluster_name = "name of cluster made"
    String cluster_staging_bucket = "path to cluster's staging bucket"
  }
}

task delete_cluster {
  String cluster_name

  command { ... delete cluster here ... }

  output {}
}

# template workflow for running pyspark on dataproc
template workflow pyspark_dataproc {
   # TODO: a struct later would be much easier
   Map[String, String] dataproc_cluster_specs

   String cluster_name
   
   resources dataproc_cluster_manager {
        before start_cluster{input: dataproc_cluster_specs=dataproc_cluster_specs}
        call submit_job{input: cluster_name=cluster_name, cluster_staging_bucket=start_cluster.cluster_staging_bucket}
        after delete_cluster{input: cluster_name=cluster_name}
   }
}
############################
# User workflow file contents:

import "pyspark_dataproc.wdl" as pyspark_dataproc_wdl

# user defined task
task submit_job {
  String cluster_name
  String cluster_staging_bucket
  File   hailCommandFile
  String inputVds
  String inputAnnot
   
  File outputVdsFileName
  File qcResultsFileName  

  command { ... submit to the cluster and output to cluster staging bucket ... }
}

# workflow that uses template
workflow submit_hail_job {
    String cluster_name
    String cluster_staging_bucket
    File   hailCommandFile
    String inputVds
    String inputAnnot
   
    File outputVdsFileName
    File qcResultsFileName  

    call pyspark_dataproc_wdl.pyspark_dataproc {
      input: dataproc_cluster_specs = {"master_machine_type":"n1-standard-8", "master_machine_disk": "100"},
      inject: submit_job = submit_job(input: cluster_name=cluster_name, cluster_staging_bucket=cluster_staging_bucket, 
                                             hailCommandFile=hailCommandFile, inputVds=inputVds, inputAnnot=inputAnnot,
                                             outputVdsFileName=outputVdsFileName, qcResultsFileName=qcResultsFileName)
    }

    output {
        String out1 = pyspark_dataproc.out1
    }
}
@cjllanwarne
Contributor

cjllanwarne commented Jan 19, 2018

Not sure whether the resource naming (i.e. some_spark_jobs) is really needed? We could just have
resources { ... } and, in the outputs, ... = my_task1.spark_output

@abaumann
Author

Oh yeah, probably. I'll update.

@patmagee
Member

While I can see the need for this in certain execution environments, it seems to add a significant level of complexity, especially by introducing the new resources object for something that will be implementation-specific and only apply in certain environments.

I can understand the need to use the same cluster for certain tasks, but should cluster lifecycle management be part of the WDL specification? I feel like it's the wrong place to put it, but I'm not sure where the right place is.

I would think that a specific implementation supporting a backend with these requirements might accept a secondary file that specifies the before and after tasks and handles cluster orchestration, but I don't feel this is needed in the core specification just yet.

@abaumann
Author

Agree on the implementation-specific / certain-environments / should-this-be-in-WDL points, and also on the "I'm not sure where the right place is" point :)

I could imagine this being a before and after specific to the execution engine and environment, as I think you are saying, and that also seems fine to me. The main point is that I like the idea of people being able to write, modify, and reuse their own hooks without needing to get that code into the execution environment. As with the templates issue, this is a way to avoid copying and pasting a ton of WDL across workflows (you can clearly do this already by writing your own cluster-management tasks), and copy/paste becomes a maintenance nightmare.

@LeeTL1220

LeeTL1220 commented Jan 31, 2018

Here are some more notes following a conversation with @abaumann and @cwhelan

Take the following example:

# task1 and task2 can be run in parallel
resources {
        before before_task1{input: cluster_size=cluster_size}
        call my_task1{input: cluster_name=before_task1.cluster_name, i="foo"}
        after after_task1{input: cluster_name=before_task1.cluster_name}
    }

resources {
        before before_task2{input: cluster_size=cluster_size}
        call my_task2{input: cluster_name=before_task2.cluster_name, i="foo"}
        after after_task2{input: cluster_name=before_task2.cluster_name}
    }

In this case, I would expect the DAG to be before_task1 --> my_task1 --> after_task1 running in parallel with before_task2 --> my_task2 --> after_task2.

Now if we have a case where task2 depends on task1, I would not expect before_task2 to run until we know that we can run my_task2.

# task1 then task2
resources {
        before before_task1{input: cluster_size=cluster_size}
        call my_task1{input: cluster_name=before_task1.cluster_name, i="foo"}
        after after_task1{input: cluster_name=before_task1.cluster_name}
    }

resources {
        before before_task2{input: cluster_size=cluster_size}
        call my_task2{input: cluster_name=before_task2.cluster_name, i=my_task1.output_file}
        after after_task2{input: cluster_name=before_task2.cluster_name}
    }

I would expect the DAG to be before_task1 --> my_task1 --> after_task1 --> before_task2 --> my_task2 --> after_task2
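The expected ordering can be checked mechanically. A small Python sketch (illustrative only; task names taken from the example above), encoding the data dependencies with before_task2 inheriting my_task2's dependency on my_task1:

```python
from graphlib import TopologicalSorter

# Each node maps to its predecessors. The key edge: my_task2 consumes
# my_task1's output, and before_task2 inherits that dependency, so the
# second cluster is not started until my_task2 is actually runnable.
deps = {
    "my_task1": {"before_task1"},
    "after_task1": {"my_task1"},
    "before_task2": {"my_task1"},   # inherited from my_task2's inputs
    "my_task2": {"before_task2"},
    "after_task2": {"my_task2"},
}

order = list(TopologicalSorter(deps).static_order())
```

Any topological order of this graph satisfies the expectation: both before/after pairs bracket their own calls, and before_task2 never precedes my_task1.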

@vdauwera
Member

vdauwera commented Apr 1, 2021

Did we land on a decision wrt whether to consider this for inclusion in WDL? My impression is that the answer is no, this should be done elsewhere.

Can I get some ayes/nays on closing this issue?

(yes I am absolutely procrastinating on some writing by cleaning up old issues)

@patmagee
Member

patmagee commented Apr 1, 2021

I don't think we specifically landed anywhere. My impression is that this goes against WDL's goal of abstraction from the execution environment. It's possible this logic could be put into the hints section, but I would not be particularly keen on voting this forward myself.

@vdauwera
Member

vdauwera commented Apr 1, 2021

Right. Closing this issue with the recommendation that if someone cares very very strongly they can open a new proposal working out how this would work as a hints-based thing.

@vdauwera vdauwera closed this as completed Apr 1, 2021