Rerunnable workflow as CWL #5

Open
rvosa opened this issue Apr 11, 2020 · 11 comments

Comments

@rvosa
Member

rvosa commented Apr 11, 2020

The goal of the basic workflow is to be able to consume unaligned FASTA, align this (i.e. solve #3) and build a tree with it (by addressing #4). These steps are implemented with tools, scripts, and web service calls that are all provisioned inside a Docker container (whose Dockerfile is in the root of the repo, and whose tag will be the same as the repo name).

Subsequently, these steps will be chained together using CWL, most of which is already scaffolded in PR #1. The essential test is therefore that we should be able to run the whole thing on a clean computer using something like cwl-runner. We will then submit this to covid19.workflowhub.eu.
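
As a rough illustration of the chaining described above (this is not the scaffold from PR #1), the two steps could be wired together along these lines. The sketch builds the workflow description as a Python dict and prints it as JSON, which works because a CWL document may be written in either YAML or JSON. The file names `align.cwl` and `tree.cwl`, and all step and port names, are hypothetical placeholders.

```python
import json


def build_workflow():
    """Return a minimal two-step CWL Workflow description as a plain dict.

    All names below (align.cwl, tree.cwl, sequences, alignment, tree)
    are illustrative assumptions, not files or ports from this repository.
    """
    return {
        "cwlVersion": "v1.0",
        "class": "Workflow",
        # Workflow input: the unaligned FASTA file.
        "inputs": {"sequences": "File"},
        # Workflow output: the tree produced by the second step.
        "outputs": {
            "tree": {"type": "File", "outputSource": "build_tree/tree"},
        },
        "steps": {
            # Step 1: align the sequences.
            "align": {
                "run": "align.cwl",
                "in": {"sequences": "sequences"},
                "out": ["alignment"],
            },
            # Step 2: build a tree from the output of step 1.
            "build_tree": {
                "run": "tree.cwl",
                "in": {"alignment": "align/alignment"},
                "out": ["tree"],
            },
        },
    }


def workflow_as_json():
    """Serialize the workflow so it can be saved as a .cwl file."""
    return json.dumps(build_workflow(), indent=2)
```

Saving the output of `workflow_as_json()` to a file next to the two tool descriptions would give a document that can be handed to `cwl-runner` together with an inputs file.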

@rvosa rvosa added this to the Full workflow milestone Apr 11, 2020
@rvosa rvosa changed the title Rerunnable workflow Rerunnable workflow as CWL Apr 11, 2020
@rvosa
Member Author

rvosa commented Apr 11, 2020

(For consulting on how to finalize this, we might talk to Tazro and/or Michael Crusoe)

@rvosa
Member Author

rvosa commented Apr 11, 2020

(For consulting on workflow hub registration, we can consult Carole Goble)

@mr-c

mr-c commented Apr 11, 2020

@rvosa Happy to help!

@rvosa
Member Author

rvosa commented Apr 13, 2020

Hi @mr-c, thanks! Here's something I'm wondering about. In this repo they built a little workflow that does the alignment and tree building locally, for the purpose of then doing a tree shape analysis that assigns clade identifiers to the different sequences. People running that pipeline might experience some performance issues especially with the alignment step, because MAFFT is kind of expensive.

To address that, I would like to be able to provide our pipeline to them so that the compute steps are done on the CIPRES server instead.

Could you sketch out what it would take for our project to be portable enough to make that as painless as possible? I'm thinking something like:

  1. our docker container is on docker hub
  2. the CWL orchestrates the interaction with the container to do our pipeline
  3. the CWL workflow ends up on workflow hub
    ...
  4. the conda environment.yml that they're running pulls in our workflow

@rvosa
Member Author

rvosa commented Apr 13, 2020

i.e. what are the ... steps that would need to happen?

@mr-c

mr-c commented Apr 13, 2020

Hello @rvosa !

I think that is a great idea to both run the analysis and also provide a portable "take home" version.

  1. Make a CWL workflow. Ensure that each application has its own Docker container, preferably from biocontainers.pro.
  2. Distribute this workflow. Users can run it from any CWL-compatible system. The workflow should also be registered with the Workflow Hub.
  3. No need for a conda environment.yml; their CWL runners will automatically use the Docker containers. If you'd like to have a non-Docker version, we can add SoftwareRequirement hints, which some CWL runners will translate into conda packages.
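
To make points 1 and 3 concrete, here is a minimal sketch of one tool description that carries both a Docker hint and a SoftwareRequirement hint. It uses MAFFT because that is the alignment tool mentioned earlier in the thread. The image name is a placeholder (the `TAG` part must be replaced with a real BioContainers tag), and the input and output names are assumptions made for illustration. The description is built as a Python dict and can be dumped to JSON, which is a valid way to write a CWL document.

```python
import json


def build_mafft_tool(docker_image="quay.io/biocontainers/mafft:TAG"):
    """Return a CWL CommandLineTool description for a MAFFT alignment step.

    docker_image is a placeholder: TAG is not a real tag and has to be
    replaced with one that actually exists in the container registry.
    """
    return {
        "cwlVersion": "v1.0",
        "class": "CommandLineTool",
        # MAFFT writes the alignment to standard output.
        "baseCommand": ["mafft", "--auto"],
        "hints": {
            # Used by runners that have Docker available.
            "DockerRequirement": {"dockerPull": docker_image},
            # Lets runners without Docker look for the package another
            # way (for example via conda, if the runner supports that).
            "SoftwareRequirement": {"packages": [{"package": "mafft"}]},
        },
        "inputs": {
            # Hypothetical input name: the unaligned FASTA file.
            "sequences": {"type": "File", "inputBinding": {"position": 1}},
        },
        # Capture standard output into a file and expose it as an output.
        "stdout": "alignment.fasta",
        "outputs": {"alignment": {"type": "stdout"}},
    }


def tool_as_json(docker_image="quay.io/biocontainers/mafft:TAG"):
    """Serialize the tool description so it can be saved as a .cwl file."""
    return json.dumps(build_mafft_tool(docker_image), indent=2)
```

Because both entries are hints rather than requirements, a runner that cannot satisfy one of them can still attempt the step with whatever `mafft` is on the path.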

@rvosa
Member Author

rvosa commented Apr 13, 2020

Hi @mr-c,

well, for step 3 the issue is not so much that we need an environment.yml (we don't), the issue is that these guys distribute their pipeline with an environment.yml. What I would like to accomplish is that we can contribute our work as a drop-in replacement for some of the steps they've been taking. How would that work?

@mr-c

mr-c commented Apr 13, 2020

While I've never packaged a CWL workflow as a single Conda tool, it should be possible. A CWL workflow can start with #!/usr/bin/env cwl-runner and be marked executable. The Conda package could recommend or depend on the CWL reference runner, so everything would be invisible to the user. When using cwltool they would even get a --help output derived from the workflow inputs and doc property.
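
A minimal sketch of that idea, assuming a POSIX system: the script below writes a small CWL document that starts with the `#!/usr/bin/env cwl-runner` line and then sets the executable bits on it. The tool itself (a wrapper around `echo`), the file name, and the input name `message` are placeholders chosen for illustration; a real package would ship the actual workflow instead.

```python
import os
import stat

# Placeholder CWL document. The first line is both a shebang for the
# shell and an ordinary comment as far as YAML is concerned.
CWL_TEXT = """\
#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: CommandLineTool
doc: Print a message (placeholder for the real workflow).
baseCommand: echo
inputs:
  message:
    type: string
    doc: Text to print.
    inputBinding:
      position: 1
outputs: []
"""


def write_executable_cwl(path, text=CWL_TEXT):
    """Write a CWL document to `path` and mark it executable."""
    with open(path, "w") as handle:
        handle.write(text)
    # Add execute permission for user, group, and others.
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return path
```

With `cwl-runner` on the path, a file written this way (for example `hello.cwl`) could then be invoked directly as `./hello.cwl`, and the `doc` fields are what a runner such as cwltool would draw on for its `--help` text.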

@rvosa
Member Author

rvosa commented Apr 14, 2020

How would it work the other way around? Like, I make conda recipes for the reusable tools developed here, and now I want to invoke those from CWL. Is there some facility that wraps that?

@mr-c

mr-c commented Apr 22, 2020

There is a basic CWL workflow

https://view.commonwl.org/workflows/github.com/common-workflow-lab/2020-covid-19-bh/blob/8fd2d9814a5641a55efd8e63fa65a652b66f9d0b/msa/msa.cwl

Workflow diagram

It can be run locally:

cwltool https://github.com/common-workflow-lab/2020-covid-19-bh/raw/master/msa/msa.cwl  \
  https://github.com/common-workflow-lab/2020-covid-19-bh/raw/master/msa/msa_test.yaml

or via the Arvados instance at biohackathon.curii.com:

arvados-cwl-runner https://github.com/common-workflow-lab/2020-covid-19-bh/raw/master/msa/msa.cwl  \
  https://github.com/common-workflow-lab/2020-covid-19-bh/raw/master/msa/msa_test.yaml

@mr-c

mr-c commented Apr 22, 2020

Throughout this repository I found conflicting command line arguments in use, so please tell me the preferred options.

There are two options for the XSEDE version of IQTree that I was unable to decipher:

vparam.specify_runtype_=2 - Specify the nrun type - 2 for Tree Inference.
and
vparam.specify_numparts_=1 - How many partitions does your data set have.

Is there a source file that shows how http://www.phylo.org/index.php/rest/iqtree_xsede.html is turned into a command line?
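
For reference while this is being sorted out, here is a small sketch of how those two options might be collected as form fields for a submission. It does not answer the question of how they map onto an IQ-TREE command line. The two `vparam.` keys and their values are the ones quoted above; the `tool` field name and the tool identifier `IQTREE_XSEDE` are assumptions inferred from the documentation page name and should be checked against the service documentation. Nothing is sent over the network.

```python
def build_iqtree_fields(tool_id="IQTREE_XSEDE", runtype=2, numparts=1):
    """Collect submission fields for the XSEDE version of IQ-TREE.

    tool_id and the "tool" key are assumptions, not confirmed in this
    thread. The vparam keys are copied from the options quoted above.
    """
    return {
        "tool": tool_id,
        # 2 is described above as "Tree Inference".
        "vparam.specify_runtype_": str(runtype),
        # Number of partitions in the data set.
        "vparam.specify_numparts_": str(numparts),
    }
```

Keeping the fields in one function like this would make it easier to settle on a single preferred set of options once the conflicting arguments in the repository are resolved.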
