Commit

Update: TERRA Workflow Documentation
Document 3 WDL workflows to be run on Terra:

1. ncov/ncov - run the basic ncov workflow.
2. ncov/genbank_ingest - pull a public dataset and send it through our preprocessing scripts.
3. ncov/gisaid_ingest - pull a private dataset if a user has their own API key, account, and password. Mainly intended to make our preprocessing scripts available.

The workflows are separated so that only parameters specific to a particular use case are shown in Terra.
Apply suggestions from code review related to wording and spelling.

* styling fixes
* update docs to reflect observational differences

Co-authored-by: Victor Lin <13424970+victorlin@users.noreply.github.com>
Co-authored-by: Jover Lee <joverlee521@gmail.com>
3 people committed Nov 3, 2022
1 parent de1b4f7 commit 721f4a0
Showing 5 changed files with 141 additions and 10 deletions.
57 changes: 47 additions & 10 deletions docs/src/guides/run-analysis-on-terra.rst
@@ -2,17 +2,24 @@
Run the workflow on Terra
*************************

We have wrapped the ncov workflow for use in Terra:

.. image:: ../images/terra-ncov.png

We recommend starting with the "minimal" use case (first row in "User Provided Data") with steps described below:

Import ``ncov`` WDL workflow from Dockstore
===========================================

1. `Setup a Terra account <https://terra.bio/>`_
#. Navigate to Dockstore: `ncov:master`_
#. Top right corner, under **Launch with**, click on **Terra**
#. Under "Workflow Name" set a name, can also leave default ``ncov``, and select your **Destination Workspace** in the drop down menu.
#. Click button **IMPORT**
#. In your workspace, click on the **WORKFLOWS** tab and verify that the imported workflow is showing a card
1. `Set up a Terra account <https://terra.bio/>`_.
#. Navigate to Dockstore: `nextstrain/ncov/ncov`_
#. At the top right corner, under **Launch with**, click on **Terra**. You may be prompted to log in.
#. Provide a **Workflow Name** (e.g. ``ncov``).
#. Select a **Destination Workspace** from the dropdown menu.
#. Click **IMPORT**.
#. In your workspace, click on the **WORKFLOWS** tab and verify that the imported workflow is showing a card.

.. _`ncov:master`: https://dockstore.org/workflows/github.com/nextstrain/ncov:master?tab=info
.. _`nextstrain/ncov/ncov`: https://dockstore.org/workflows/github.com/nextstrain/ncov/ncov:master?tab=info

Upload your data files into Terra
=================================
@@ -64,8 +71,38 @@ Connect your data files to the WDL workflow
|Nextstrain_WRKFLW| sequence_fasta   | File  | this.sequences       |
+-----------------+------------------+-------+----------------------+

10. Click on the **OUTPUTS** tab
11. Connect your generated output back to the data table by filling in values:
10. If you are creating a build with multiple sequence and metadata files, you can upload a tarball containing the files as described in `this tutorial`_. Otherwise, skip this step.

+-----------------+-----------------+-------+----------------------+
|Task name        | Variable        | Type  | Attribute            |
+=================+=================+=======+======================+
|Nextstrain_WRKFLW| context_targz   | File  | this.context_targz   |
+-----------------+-----------------+-------+----------------------+

.. _`this tutorial`: https://docs.nextstrain.org/projects/ncov/en/latest/guides/data-prep/gisaid-search.html#download-contextual-data-for-your-region-of-interest

11. OPTIONAL CHANGE: If you are uploading GISAID/GenBank data or very large sequence files, we highly recommend increasing the disk size.

+-----------------+-------------------+-------+---------------------------------------------+
|Task name        | Variable          | Type  | Description                                 |
+=================+===================+=======+=============================================+
|Nextstrain_WRKFLW| disk_size         | Int   | 30 GB by default, may need to expand to 500 |
+-----------------+-------------------+-------+---------------------------------------------+

12. OPTIONAL CHANGE: If you have a private or public Nextstrain group, specify the following variables to push to an S3 site. Otherwise, skip this step.

+-----------------+-----------------------+--------+--------------------------------+
|Task name        | Variable              | Type   | Description                    |
+=================+=======================+========+================================+
|Nextstrain_WRKFLW| s3deploy              | String | nextstrain provided url string |
+-----------------+-----------------------+--------+--------------------------------+
|Nextstrain_WRKFLW| AWS_ACCESS_KEY_ID     | String | your group access key id       |
+-----------------+-----------------------+--------+--------------------------------+
|Nextstrain_WRKFLW| AWS_SECRET_ACCESS_KEY | String | your group secret access key   |
+-----------------+-----------------------+--------+--------------------------------+

13. Click on the **OUTPUTS** tab
14. Connect your generated output back to the data table by filling in values:

+-----------------+-----------------+-------+----------------------+
|Task name        | Variable        | Type  | Attribute            |
@@ -75,6 +112,6 @@ Connect your data files to the WDL workflow
|Nextstrain_WRKFLW| results_zip     | File  | this.results_zip     |
+-----------------+-----------------+-------+----------------------+

12. Click on **Save** then click on **Run Analysis**
15. Click on **Save** then click on **Run Analysis**
#. Under the tab **JOB HISTORY**, verify that your job is running.
#. When the run is complete, check the **DATA** / **TABLES** / **ncov_examples** tab and download the ``auspice.zip`` file.
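
Step 10 above mentions uploading a tarball that bundles several sequence and metadata files. The following is only a minimal sketch of how such an archive could be built locally with Python's standard ``tarfile`` module before uploading it to Terra. The function names and the flat archive layout are assumptions made for illustration; the layout the workflow actually expects is described in the linked tutorial, not here.

```python
import tarfile
from pathlib import Path


def build_context_tarball(input_files, output_path):
    """Bundle sequence and metadata files into one gzip-compressed tarball.

    Hypothetical helper: the flat layout used here is an assumption.
    """
    output_path = Path(output_path)
    with tarfile.open(output_path, "w:gz") as archive:
        for file_path in map(Path, input_files):
            # Store each file under its base name so the archive is flat.
            archive.add(file_path, arcname=file_path.name)
    return output_path


def list_tarball_members(tarball_path):
    """Return the sorted member names as a quick check before uploading."""
    with tarfile.open(tarball_path, "r:gz") as archive:
        return sorted(archive.getnames())
```

Listing the members before uploading is a cheap way to confirm that every intended file made it into the archive.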
93 changes: 93 additions & 0 deletions docs/src/guides/run-ingest-on-terra.rst
@@ -0,0 +1,93 @@
*******************************
Ingest SARS-CoV-2 data on Terra
*******************************

We have provided two pipelines for importing data into Terra:

* GenBank Ingest - pull a public dataset and send it through our preprocessing scripts.
* GISAID Ingest - pull a private dataset if a user has their own API endpoint, username, and password.

The pipelines were mainly motivated by the need to provide access to our data preprocessing scripts. Currently, they focus on pulling datasets for the ncov workflow, as roughly diagrammed below:

.. image:: ../images/terra-ingest.png


Import the WDL workflow from Dockstore
=============================================

1. `Set up a Terra account <https://terra.bio/>`_.
2. Navigate to one of the following in Dockstore:

   - `nextstrain/ncov/genbank_ingest`_: for open (GenBank) data.
   - `nextstrain/ncov/gisaid_ingest`_: for private (GISAID) data. Requires access to the GISAID API endpoint.

3. At the top right corner, under **Launch with**, click on **Terra**. You may be prompted to log in.
4. Provide a **Workflow Name** (e.g. ``genbank_ingest`` or ``gisaid_ingest``).
5. Select a **Destination Workspace** from the dropdown menu.
6. Click **IMPORT**.
7. In your workspace, click on the **WORKFLOWS** tab and verify that the imported workflow is showing a card.

.. _`nextstrain/ncov/genbank_ingest`: https://dockstore.org/workflows/github.com/nextstrain/ncov/genbank_ingest:master?tab=info
.. _`nextstrain/ncov/gisaid_ingest`: https://dockstore.org/workflows/github.com/nextstrain/ncov/gisaid_ingest:master?tab=info

Create Terra Variables for GISAID API Endpoint
================================================

If you are pulling GISAID data, you must have your own API endpoint, username, and password. If you are pulling GenBank (open) data, skip to step 6 of the next section.

1. Navigate to your workspace on Terra.
2. On the **Data** tab, from the left menu click **Workspace Data**.
3. Create and fill in values for the following workspace variables:

+-----------------------------+----------------------------+-----------------------------------------------+
|Key                          | Value                      | Description                                   |
+=============================+============================+===============================================+
|GISAID_API_ENDPOINT          |URL API endpoint value here | Provided by GISAID for your account           |
+-----------------------------+----------------------------+-----------------------------------------------+
|GISAID_USERNAME_AND_PASSWORD | username:password          | GISAID username and password for API access   |
+-----------------------------+----------------------------+-----------------------------------------------+
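
The ``GISAID_USERNAME_AND_PASSWORD`` value combines two fields into a single ``username:password`` string. As a rough illustration, the sketch below builds and checks such a value before it is pasted into the workspace variable. Both helper names are hypothetical, and the rule of splitting on the first colon is an assumption rather than something documented by GISAID or Terra.

```python
def format_gisaid_credentials(username, password):
    """Join a username and password into the ``username:password`` form."""
    if not username or not password:
        raise ValueError("both username and password are required")
    if ":" in username:
        # A colon in the username would make the combined value ambiguous.
        raise ValueError("username must not contain ':'")
    return f"{username}:{password}"


def split_gisaid_credentials(value):
    """Split on the first colon only, so the password may contain colons."""
    username, separator, password = value.partition(":")
    if not separator or not username or not password:
        raise ValueError("expected a value of the form username:password")
    return username, password
```

Checking the value locally can catch a missing colon or an empty field before a workflow run fails on authentication.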

Connect your workspace variables to the WDL ingest workflow
===========================================================

1. Navigate back to the **WORKFLOWS** tab, and click on the workflow imported to your workspace.
2. Click on the radio button **Run workflow(s) with inputs defined by data table**.
3. Under **Step 1**:

   1. Select root entity type as **ncov_examples** from the drop down menu.

4. Under **Step 2**:

   1. Click **SELECT DATA**.
   2. Select **Choose specific ncov_exampless to process**.
   3. Select the 1st row in the data table. The first column should have value ``blank``. Selecting more rows will cause the workflow to run more than once.
   4. Click **OK**.

5. Most of the values can be left blank, but fill in the values below:

+-----------------+-------------------------------+-------+----------------------------------------+
|Task name        | Variable                      | Type  | Attribute                              |
+=================+===============================+=======+========================================+
|Nextstrain_WRKFLW| GISAID_API_ENDPOINT           | String| workspace.GISAID_API_ENDPOINT          |
+-----------------+-------------------------------+-------+----------------------------------------+
|Nextstrain_WRKFLW| GISAID_USERNAME_AND_PASSWORD  | String| workspace.GISAID_USERNAME_AND_PASSWORD |
+-----------------+-------------------------------+-------+----------------------------------------+

6. Click on the **OUTPUTS** tab.
7. Connect your generated output back to the workspace data by filling in values:

+-----------------+------------------+-------+----------------------------------+
|Task name        | Variable         | Type  | Attribute                        |
+=================+==================+=======+==================================+
|Nextstrain_WRKFLW| metadata_tsv     | File  | workspace.gisaid_metadata_tsv    |
+-----------------+------------------+-------+----------------------------------+
|Nextstrain_WRKFLW| nextclade_tsv    | File  | workspace.gisaid_nextclade_tsv   |
+-----------------+------------------+-------+----------------------------------+
|Nextstrain_WRKFLW| sequences_fasta  | File  | workspace.gisaid_sequences_fasta |
+-----------------+------------------+-------+----------------------------------+

If you are pulling GenBank data, use something like ``workspace.genbank_sequences_fasta`` instead.

1. Click **SAVE** then **RUN ANALYSIS**.
#. Optionally enter a job description, then click **LAUNCH**.
#. The new job will appear in the **JOB HISTORY** tab. You can monitor its status by refreshing that page.
#. When the run is complete, check the **DATA** / **Workspace Data** tab and use ``workspace.gisaid_sequences_fasta`` and ``workspace.gisaid_metadata_tsv`` during normal ncov Terra runs.
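
The output attributes in the table above follow a simple pattern: ``workspace.<source>_<variable>``, with ``gisaid`` or ``genbank`` as the source. The sketch below generates the attribute names for either source so they can be copied into the **OUTPUTS** tab consistently. The function name and the restriction to exactly these two sources are assumptions for illustration only; Terra accepts any attribute name you choose.

```python
# Output variables listed in the table above.
OUTPUT_VARIABLES = ("metadata_tsv", "nextclade_tsv", "sequences_fasta")

# Data sources covered by the two ingest workflows.
KNOWN_SOURCES = ("gisaid", "genbank")


def workspace_attributes(source):
    """Map each output variable to a ``workspace.<source>_<variable>`` name."""
    if source not in KNOWN_SOURCES:
        raise ValueError(f"unknown source: {source!r}")
    return {
        variable: f"workspace.{source}_{variable}"
        for variable in OUTPUT_VARIABLES
    }


if __name__ == "__main__":
    for variable, attribute in workspace_attributes("genbank").items():
        print(f"{variable}\t{attribute}")
```

Keeping the source prefix in the attribute name lets GISAID and GenBank outputs coexist in the same workspace without overwriting each other.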

Binary file added docs/src/images/terra-ingest.png
Binary file added docs/src/images/terra-ncov.png
1 change: 1 addition & 0 deletions docs/src/index.rst
@@ -46,6 +46,7 @@ If you have a specific question, `post a note on the discussion board <https://d
   guides/workflow-config-file
   guides/customizing-visualization
   guides/run-analysis-on-terra
   guides/run-ingest-on-terra

.. toctree::
   :maxdepth: 1
