Update: TERRA Workflow Documentation
Document 3 WDL workflows to be run on Terra:

1. ncov:master - run the basic ncov workflow
2. ncov:wdl/genbank_ingest - pull a public dataset and send them through our preprocessing scripts.
3. ncov:wdl/gisaid_ingest - pull a private dataset if a user has their own API key, account, and password. Mostly to make available our preprocessing scripts.

The workflows are separated so that only parameters specific to a particular usecase are shown in Terra.
j23414 committed Sep 12, 2022
1 parent de1b4f7 commit c81a837
Showing 5 changed files with 117 additions and 3 deletions.
40 changes: 37 additions & 3 deletions docs/src/guides/run-analysis-on-terra.rst
@@ -2,6 +2,12 @@
Run the workflow on Terra
*************************

We have wrapped the ncov workflow for use in Terra:

.. image:: ../images/terra-ncov.png

We recommend starting with the "minimal" use case (first row in "User Provided Data"), following the steps described below:

Import ``ncov`` WDL workflow from Dockstore
===========================================

@@ -64,8 +70,36 @@ Connect your data files to the WDL workflow
|Nextstrain_WRKFLW| sequence_fasta   | File  | this.sequences       |
+-----------------+------------------+-------+----------------------+

10. Click on the **OUTPUTS** tab
11. Connect your generated output back to the data table by filling in the values:
10. If you are creating a build with multiple sequence and metadata files, you can upload a ``.tar.gz`` archive containing the files. Otherwise, skip this step.

+-----------------+-----------------+-------+----------------------+
|Task name        | Variable        | Type  | Attribute            |
+=================+=================+=======+======================+
|Nextstrain_WRKFLW| context_targz   | File  | this.context_targz   |
+-----------------+-----------------+-------+----------------------+

11. OPTIONAL CHANGE: If you are uploading GISAID or GenBank data, or other very large sequence files, we highly recommend increasing the disk size.

+-----------------+-------------------+-------+---------------------------------------------+
|Task name        | Variable          | Type  | Description                                 |
+=================+===================+=======+=============================================+
|Nextstrain_WRKFLW| disk_size         | Int   | 30 GB by default; may need to expand to 500 |
+-----------------+-------------------+-------+---------------------------------------------+

12. OPTIONAL CHANGE: If you have a private or public Nextstrain Group, specify the following variables to push to an S3 site. Otherwise, skip this step.

+-----------------+-----------------------+--------+--------------------------------+
|Task name        | Variable              | Type   | Description                    |
+=================+=======================+========+================================+
|Nextstrain_WRKFLW| s3deply               | String | Nextstrain-provided URL string |
+-----------------+-----------------------+--------+--------------------------------+
|Nextstrain_WRKFLW| AWS_ACCESS_KEY_ID     | String | your group access key ID       |
+-----------------+-----------------------+--------+--------------------------------+
|Nextstrain_WRKFLW| AWS_SECRET_ACCESS_KEY | String | your group secret access key   |
+-----------------+-----------------------+--------+--------------------------------+

13. Click on the **OUTPUTS** tab
14. Connect your generated output back to the data table by filling in the values:

+-----------------+-----------------+-------+----------------------+
|Task name        | Variable        | Type  | Attribute            |
@@ -75,6 +109,6 @@ Connect your data files to the WDL workflow
|Nextstrain_WRKFLW| results_zip     | File  | this.results_zip     |
+-----------------+-----------------+-------+----------------------+

12. Click on **Save** then click on **Run Analysis**
15. Click on **Save** then click on **Run Analysis**
#. Under the tab **JOB HISTORY**, verify that your job is running.
#. When the run is complete, check the **DATA** / **TABLES** / **ncov_examples** tab and download the ``auspice.zip`` file.
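
The ``context_targz`` input in step 10 expects several sequence and metadata files bundled into one gzip-compressed tar archive. A minimal sketch of building such an archive locally before uploading it to the workspace is shown here. The function name ``build_context_targz`` and the flat layout (every file stored at the top level of the archive) are assumptions made for illustration; this guide does not state the layout the workflow requires.

```python
import tarfile
from pathlib import Path


def build_context_targz(files, archive_path):
    """Bundle the given files into a gzip-compressed tar archive.

    Assumption: files are stored flat, under their base names only.
    Adjust `arcname` if the workflow expects a different layout.
    """
    archive_path = Path(archive_path)
    with tarfile.open(archive_path, "w:gz") as tar:
        for file_path in files:
            file_path = Path(file_path)
            # Store only the file name, not the full local path.
            tar.add(file_path, arcname=file_path.name)
    return archive_path
```

For example, ``build_context_targz(["a.fasta", "a_metadata.tsv"], "context.tar.gz")`` writes ``context.tar.gz`` to the current directory, which can then be uploaded and referenced as ``this.context_targz``.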
79 changes: 79 additions & 0 deletions docs/src/guides/run-ingest-on-terra.rst
@@ -0,0 +1,79 @@
**************************
Run Data Ingest on Terra
**************************

We have provided two pipelines for importing data into Terra:

* GenBank Ingest - pull a public dataset and send it through our preprocessing scripts.
* GISAID Ingest - pull a private dataset, for users who have their own API key, account, and password.

The main motivation for these pipelines is to provide access to our data preprocessing scripts. Currently, they focus on pulling datasets for the ncov workflow, as roughly diagrammed below:

.. image:: ../images/terra-ingest.png


Import ``ingest`` WDL workflow from Dockstore
=============================================

1. `Set up a Terra account <https://terra.bio/>`_
#. Navigate to Dockstore: `ncov:wdl/genbank_ingest`_ or `ncov:wdl/gisaid_ingest`_, depending on whether you wish to pull open (GenBank) data or private (GISAID) data. The latter requires a GISAID API key.
#. In the top right corner, under **Launch with**, click on **Terra**
#. Under "Workflow Name", set a name, such as ``genbank_ingest`` or ``gisaid_ingest``, and select your "Destination Workspace" in the drop-down menu.
#. Click the **IMPORT** button
#. In your workspace, click on the **WORKFLOWS** tab and verify that the imported workflow is showing a card

.. _`ncov:wdl/genbank_ingest`: https://dockstore.org/workflows/github.com/nextstrain/ncov:wdl/genbank_ingest?tab=info
.. _`ncov:wdl/gisaid_ingest`: https://dockstore.org/workflows/github.com/nextstrain/ncov:wdl/gisaid_ingest?tab=info

Create Terra Variables for GISAID API
=====================================

If you are pulling GISAID data, you must have your own API key. If you are pulling open GenBank data, click on your imported ``genbank_ingest`` workflow and skip to step 6 of the next section.

1. Navigate to your workspace on Terra
#. On the **Data** tab, from the left menu click **Workspace Data**
#. Create and fill in values for the following workspace variables:

+-----------------------------+----------------------------+-----------------------------------------------+
|Key                          | Value                      | Description                                   |
+=============================+============================+===============================================+
|GISAID_API_ENDPOINT          | your API endpoint URL      | Provided by GISAID for your account           |
+-----------------------------+----------------------------+-----------------------------------------------+
|GISAID_USERNAME_AND_PASSWORD | username:password          | GISAID username and password for API access   |
+-----------------------------+----------------------------+-----------------------------------------------+
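
Before pasting the two values into **Workspace Data**, it can help to check their shape. The sketch below is only an illustration: the function name ``check_workspace_variables`` is hypothetical, and the rule that the endpoint starts with ``http://`` or ``https://`` is an assumption, since the exact endpoint format is provided by GISAID for each account.

```python
REQUIRED_KEYS = ("GISAID_API_ENDPOINT", "GISAID_USERNAME_AND_PASSWORD")


def check_workspace_variables(variables):
    """Return a list of problems found in a dict of workspace variables."""
    problems = []
    for key in REQUIRED_KEYS:
        if not variables.get(key):
            problems.append(f"{key} is missing or empty")

    endpoint = variables.get("GISAID_API_ENDPOINT", "")
    # Assumption: the endpoint is an http(s) URL.
    if endpoint and not endpoint.startswith(("http://", "https://")):
        problems.append("GISAID_API_ENDPOINT does not look like a URL")

    credentials = variables.get("GISAID_USERNAME_AND_PASSWORD", "")
    # The value must be a single string of the form username:password.
    username, separator, password = credentials.partition(":")
    if credentials and not (separator and username and password):
        problems.append(
            "GISAID_USERNAME_AND_PASSWORD is not in username:password form"
        )
    return problems
```

An empty list means both variables are present and shaped as the table describes; it does not confirm that GISAID will accept the credentials.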

Connect your workspace variables to the WDL ingest workflow
===========================================================

1. Navigate back to the **WORKFLOWS** tab, and click on your imported ``gisaid_ingest`` workflow
#. Click on the radio button "Run workflow(s) with inputs defined by data table"
#. Under **Step 1**, select your root entity type **ncov_examples** from the drop-down menu.
#. Select ONLY the first entry in the data table, since the workflow should run only once.
#. Most of the values will be blank, but fill in the values below:

+-----------------+-------------------------------+-------+----------------------------------------+
|Task name        | Variable                      | Type  | Attribute                              |
+=================+===============================+=======+========================================+
|Nextstrain_WRKFLW| GISAID_API_ENDPOINT           | String| workspace.GISAID_API_ENDPOINT          |
+-----------------+-------------------------------+-------+----------------------------------------+
|Nextstrain_WRKFLW| GISAID_USERNAME_AND_PASSWORD  | String| workspace.GISAID_USERNAME_AND_PASSWORD |
+-----------------+-------------------------------+-------+----------------------------------------+

6. Click on the **OUTPUTS** tab
#. Connect your generated output back to the workspace data by filling in the values:

+-----------------+------------------+-------+----------------------------------+
|Task name        | Variable         | Type  | Attribute                        |
+=================+==================+=======+==================================+
|Nextstrain_WRKFLW| sequences_fasta  | File  | workspace.gisaid_sequences_fasta |
+-----------------+------------------+-------+----------------------------------+
|Nextstrain_WRKFLW| metadata_tsv     | File  | workspace.gisaid_metadata_tsv    |
+-----------------+------------------+-------+----------------------------------+
|Nextstrain_WRKFLW| nextclade_tsv    | File  | workspace.gisaid_nextclade_tsv   |
+-----------------+------------------+-------+----------------------------------+

If you are pulling GenBank data, use something like ``workspace.genbank_sequences_fasta`` instead.

8. Click on **Save** then click on **Run Analysis**
#. Under the tab **JOB HISTORY**, verify that your job is running.
#. When the run is complete, check the **DATA** / **Workspace Data** tab and use ``workspace.gisaid_sequences_fasta`` and ``workspace.gisaid_metadata_tsv`` during normal ncov Terra runs.
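
The output attribute names in the **OUTPUTS** table follow one pattern, ``workspace.<source>_<variable>``, which is what the GenBank hint above relies on. The sketch below spells that pattern out. The helper ``output_attributes`` is hypothetical, and the ``genbank_`` names it returns are only the suggested convention ("something like"), not names the workflow enforces.

```python
OUTPUT_VARIABLES = ("sequences_fasta", "metadata_tsv", "nextclade_tsv")


def output_attributes(source):
    """Map each workflow output variable to a workspace attribute name."""
    if source not in ("gisaid", "genbank"):
        raise ValueError(f"unknown source: {source!r}")
    # Pattern taken from the table: workspace.<source>_<variable>
    return {name: f"workspace.{source}_{name}" for name in OUTPUT_VARIABLES}
```

Calling ``output_attributes("genbank")`` yields the three attribute strings to enter in the **Attribute** column for a GenBank run.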
Binary file added docs/src/images/terra-ingest.png
Binary file added docs/src/images/terra-ncov.png
1 change: 1 addition & 0 deletions docs/src/index.rst
@@ -46,6 +46,7 @@ If you have a specific question, `post a note on the discussion board <https://d
guides/workflow-config-file
guides/customizing-visualization
guides/run-analysis-on-terra
guides/run-ingest-on-terra

.. toctree::
:maxdepth: 1