Proposal: WDL resources block #183

Closed
abaumann opened this issue Jan 19, 2018 · 8 comments
Comments

@abaumann

abaumann commented Jan 19, 2018

This came from discussions at OpenBio Winter Codefest around how to allow people to manage computational resources, specifically driven by Spark (or grid engine, or whatever else).

When using an external resource like Dataproc to run Spark jobs on clusters, you need to manage those clusters during the lifetime of your workflow. In some cases you might want one cluster per task; in others you might want to reuse the same cluster across multiple tasks. To support this, the proposal adds a before and an after hook to handle that management.

The new keywords are before, after, and resources:

- before is a callable that runs before any of the calls in a workflow.
- after is a callable that runs after all the calls are complete, and is guaranteed to run as long as before succeeds, regardless of whether continueWhilePossible is used.
- resources is a block that contains exactly one before, one after, and one or more calls. It exists so that one workflow can have more than one before/after pair, e.g. to run tasks in parallel on different Spark clusters.

before and after are invoked the same way call is.

workflow my_variant_caller {
    String cluster_size
    String bam
 
    resources {
        before before_task{input: cluster_size=cluster_size}
        call my_task1{input: cluster_name=before_task.cluster_name, bam=bam}
        after after_task{input: cluster_name=before_task.cluster_name}
    }

    output {
        String job_output = my_task1.spark_output
    }
}
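The "guaranteed cleanup" semantics of after can be sketched as engine logic. A minimal Python sketch (purely illustrative; the function name and callable shapes are hypothetical, not part of the proposal):

```python
# Hypothetical engine logic for one resources block: `after` runs whenever
# `before` succeeded, even if one of the calls fails (like try/finally).
def run_resources_block(before, calls, after):
    ctx = before()                    # e.g. returns {"cluster_name": ...}
    results = []
    try:
        for call in calls:
            results.append(call(ctx))  # a failure here still triggers `after`
    finally:
        after(ctx)                     # guaranteed cleanup, e.g. delete cluster
    return results
```

This mirrors the wording above: if before fails, after is never invoked; once before succeeds, after always is.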

Here is an example I tried out, using this proposal together with #182, to port the Hail WDL task I've been working on (https://github.com/broadinstitute/firecloud-tools/blob/ab_hail_wdl/scripts/hail_wdl_test/hail_test_cleanup.wdl) to the proposed syntax:

############################
Template file contents:

task start_cluster {
  # TODO: a struct later would be much easier
  Map[String, String] dataproc_cluster_specs

  command { ... spin up cluster here ... }

  output {
    String cluster_name = "name of cluster made"
    String cluster_staging_bucket = "path to cluster's staging bucket"
  }
}

task delete_cluster {
  String cluster_name

  command { ... delete cluster here ... }

  output {}
}

# template workflow for running pyspark on dataproc
template workflow pyspark_dataproc {
   # TODO: a struct later would be much easier
   Map[String, String] dataproc_cluster_specs

   String cluster_name
   
   resources dataproc_cluster_manager {
        before start_cluster{input: dataproc_cluster_specs=dataproc_cluster_specs}
        call submit_job{input: cluster_name=cluster_name, cluster_staging_bucket=start_cluster.cluster_staging_bucket}
        after delete_cluster{input: cluster_name=cluster_name}
   }
}
############################
# User workflow file contents:

import "pyspark_dataproc.wdl" as pyspark_dataproc_wdl

# user defined task
task submit_job {
  String cluster_name
  String cluster_staging_bucket
  File   hailCommandFile
  String inputVds
  String inputAnnot
   
  File outputVdsFileName
  File qcResultsFileName  

  command { ... submit to the cluster and output to cluster staging bucket ... }
}

# workflow that uses template
workflow submit_hail_job {
    String cluster_name
    String cluster_staging_bucket
    File   hailCommandFile
    String inputVds
    String inputAnnot
   
    File outputVdsFileName
    File qcResultsFileName  

    call pyspark_dataproc_wdl.pyspark_dataproc {
      input: dataproc_cluster_specs = {"master_machine_type":"n1-standard-8", "master_machine_disk": "100"},
      inject: submit_job = submit_job(input: cluster_name=cluster_name, cluster_staging_bucket=cluster_staging_bucket, 
                                             hailCommandFile=hailCommandFile, inputVds=inputVds, inputAnnot=inputAnnot,
                                             outputVdsFileName=outputVdsFileName, qcResultsFileName=qcResultsFileName)
    }

    output {
        String out1 = pyspark_dataproc.out1
    }
}
@cjllanwarne
Contributor

cjllanwarne commented Jan 19, 2018

Not sure whether the resource naming (i.e. some_spark_jobs) is really needed? We could just have
resources { ... } and, in the outputs, ... = my_task1.spark_output

@abaumann
Author

Oh yeah, probably. I'll update.

@patmagee
Member

While I can see the need for this in certain execution environments, it seems to add a significant level of complexity, especially by introducing the new resources object for something that will be implementation-specific and only apply in certain environments.

I can understand the need to use the same cluster for certain tasks, but should cluster lifecycle management be part of the WDL specification? I feel like it's the wrong place to put it, but I'm not sure where the right place is.

I would think that a specific implementation supporting a backend with these requirements might accept a secondary file that specifies the before and after tasks and handles cluster orchestration, but I don't feel this is needed in the core specification just yet.

@abaumann
Author

Agree on the implementation-specific / certain-environments / should-this-be-in-WDL points, and also on the "I'm not sure where the right place is" point :)

I could imagine this being a before and after specific to the execution engine and environment, as I think you are saying, and that also seems fine to me. The main point is that I like the idea of people being able to write, modify, and reuse their own hooks without needing to get that code into the execution environment. As with the templates issue, this is a way to avoid copying and pasting a ton of WDL across workflows (you can clearly do this already by writing your own cluster-management tasks), and copy/paste becomes a maintenance nightmare.

@LeeTL1220

LeeTL1220 commented Jan 31, 2018

Here are some more notes following a conversation with @abaumann and @cwhelan

Take the following example:

# task1 and task2 can be run in parallel
resources {
        before before_task1{input: cluster_size=cluster_size}
        call my_task1{input: cluster_name=before_task1.cluster_name, i="foo"}
        after after_task1{input: cluster_name=before_task1.cluster_name}
    }

resources {
        before before_task2{input: cluster_size=cluster_size}
        call my_task2{input: cluster_name=before_task2.cluster_name, i="foo"}
        after after_task2{input: cluster_name=before_task2.cluster_name}
    }

In this case, I would expect the DAG to be before_task1 --> my_task1 --> after_task1 running in parallel with before_task2 --> my_task2 --> after_task2.

Now if we have a case where task2 depends on task1, I would not expect before_task2 to run until we know that we can run my_task2.

# task1 then task2
resources {
        before before_task1{input: cluster_size=cluster_size}
        call my_task1{input: cluster_name=before_task1.cluster_name, i="foo"}
        after after_task1{input: cluster_name=before_task1.cluster_name}
    }

resources {
        before before_task2{input: cluster_size=cluster_size}
        call my_task2{input: cluster_name=before_task2.cluster_name, i=my_task1.output_file}
        after after_task2{input: cluster_name=before_task2.cluster_name}
    }

I would expect the DAG to be before_task1 --> my_task1 --> after_task1 --> before_task2 --> my_task2 --> after_task2
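The expected ordering can be checked mechanically. A small Python sketch (illustrative only; task names taken from the example above), encoding the data dependencies with before_task2 inheriting my_task2's dependency on my_task1:

```python
from graphlib import TopologicalSorter

# Each node maps to its predecessors. The key edge: my_task2 consumes
# my_task1's output, and before_task2 inherits that dependency, so the
# second cluster is not started until my_task2 is actually runnable.
deps = {
    "my_task1": {"before_task1"},
    "after_task1": {"my_task1"},
    "before_task2": {"my_task1"},   # inherited from my_task2's inputs
    "my_task2": {"before_task2"},
    "after_task2": {"my_task2"},
}

order = list(TopologicalSorter(deps).static_order())
```

Any topological order of this graph satisfies the expectation: both before/after pairs bracket their own calls, and before_task2 never precedes my_task1.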

@vdauwera
Member

vdauwera commented Apr 1, 2021

Did we land on a decision wrt whether to consider this for inclusion in WDL? My impression is that the answer is no, this should be done elsewhere.

Can I get some ayes/nays on closing this issue?

(yes I am absolutely procrastinating on some writing by cleaning up old issues)

@patmagee
Member

patmagee commented Apr 1, 2021

I don't think we specifically landed anywhere. My impression is that this goes against WDL's goal of abstraction from the execution environment. It's possible this logic could be put into the hints section, but I would not be particularly keen on voting this forward myself.

@vdauwera
Member

vdauwera commented Apr 1, 2021

Right. Closing this issue with the recommendation that if someone cares very very strongly they can open a new proposal working out how this would work as a hints-based thing.

@vdauwera vdauwera closed this as completed Apr 1, 2021