wdl docs: split run ingest into genbank and gisaid
j23414 committed Dec 6, 2022
1 parent 721f4a0 commit fcb4ad6
Showing 4 changed files with 84 additions and 20 deletions.
1 change: 1 addition & 0 deletions docs/src/guides/run-analysis-on-terra.rst
@@ -46,6 +46,7 @@ Connect your data files to the WDL workflow
::

  entity:ncov_examples_id metadata sequences configfile_yaml
  blank
  example gs://COPY_PATH_HERE/example_metadata.tsv gs://COPY_PATH_HERE/example_datasets/example_sequences.fasta.gz
  example_build gs://COPY_PATH_HERE/example-build.yaml
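
The data table above can also be generated rather than typed by hand. The following is a minimal Python sketch, assuming the table is uploaded to Terra as a tab-separated file and that the ``example_build`` row fills only the ``configfile_yaml`` column. ``render_table``, ``COLUMNS``, and ``ROWS`` are illustrative names, and ``gs://COPY_PATH_HERE`` is the guide's own placeholder, which still has to be replaced by hand:

```python
import csv
import io

# Column names copied from the data table in the guide.
COLUMNS = ["entity:ncov_examples_id", "metadata", "sequences", "configfile_yaml"]

# Placeholder from the guide; replace with your own bucket path.
BUCKET = "gs://COPY_PATH_HERE"

# Assumption: the "example_build" row only sets configfile_yaml.
ROWS = [
    {"entity:ncov_examples_id": "blank"},
    {
        "entity:ncov_examples_id": "example",
        "metadata": f"{BUCKET}/example_metadata.tsv",
        "sequences": f"{BUCKET}/example_datasets/example_sequences.fasta.gz",
    },
    {
        "entity:ncov_examples_id": "example_build",
        "configfile_yaml": f"{BUCKET}/example-build.yaml",
    },
]


def render_table(rows, columns=COLUMNS):
    """Render the rows as tab-separated text, leaving missing cells empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        delimiter="\t",
        restval="",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


table_text = render_table(ROWS)
```

The ``blank`` row has no file paths at all, which is what lets the ingest workflows run once without any data-table inputs.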

65 changes: 65 additions & 0 deletions docs/src/guides/run-genbank-ingest-on-terra.rst
@@ -0,0 +1,65 @@
*********************************************
Ingest SARS-CoV-2 data from GenBank on Terra
*********************************************

We have provided two pipelines for importing data into Terra:

* **GenBank Ingest - pull a public dataset and send it through our preprocessing scripts.**
* GISAID Ingest - pull a private dataset if a user has their own API endpoint, username, and password.

The pipelines were created mainly to provide access to our data preprocessing scripts. Currently, they focus on pulling datasets for the ncov workflow, as roughly diagrammed below:

.. image:: ../images/terra-ingest.png

This guide describes the **GenBank Ingest**.

Import the GenBank ingest WDL workflow from Dockstore
======================================================

1. `Set up a Terra account <https://terra.bio/>`_.
2. Navigate to the following workflow in Dockstore:

   - `nextstrain/ncov/genbank_ingest`_: for open (GenBank) data

3. At the top right corner, under **Launch with**, click on **Terra**. You may be prompted to log in.
4. Provide a **Workflow Name** (e.g. ``genbank_ingest``).
5. Select a **Destination Workspace** from the dropdown menu.
6. Click **IMPORT**.
7. In your workspace, click on the **WORKFLOWS** tab and verify that the imported workflow is showing a card.

.. _`nextstrain/ncov/genbank_ingest`: https://dockstore.org/workflows/github.com/nextstrain/ncov/genbank_ingest:master?tab=info

Connect any workspace variables to the WDL ingest workflow
===========================================================

1. Navigate back to the **WORKFLOWS** tab, and click on the workflow imported to your workspace.
2. Click on the radio button **Run workflow(s) with inputs defined by data table**.
3. Under **Step 1**:

   1. Select root entity type as **ncov_examples** from the dropdown menu.

4. Under **Step 2**:

   1. Click **SELECT DATA**.
   2. Select **Choose specific ncov_examples to process**.
   3. Select the first row in the data table. The first column should have the value ``blank``. Selecting more rows will cause the workflow to run more than once.
   4. Click **OK**.

5. Leave the input values blank.
6. Click on the **OUTPUTS** tab.
7. Connect your generated output back to the workspace data by filling in these values:

+-----------------+------------------+-------+----------------------------------+
|Task name | Variable | Type | Attribute |
+=================+==================+=======+==================================+
|Nextstrain_WRKFLW| metadata_tsv | File | workspace.genbank_metadata_tsv |
+-----------------+------------------+-------+----------------------------------+
|Nextstrain_WRKFLW| nextclade_tsv | File | workspace.genbank_nextclade_tsv |
+-----------------+------------------+-------+----------------------------------+
|Nextstrain_WRKFLW| sequences_fasta | File | workspace.genbank_sequences_fasta|
+-----------------+------------------+-------+----------------------------------+


8. Click **SAVE** then **RUN ANALYSIS**.
9. Optionally enter a job description, then click **LAUNCH**.
10. The new job will appear in the **JOB HISTORY** tab. You can monitor its status by refreshing that page.
11. When the run is complete, check the **DATA** / **Workspace Data** tab and use ``workspace.genbank_sequences_fasta`` and ``workspace.genbank_metadata_tsv`` during normal ncov Terra runs.
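
The attribute names in the output table follow a ``workspace.<source>_<variable>`` pattern. As a rough Python sketch (the helper names ``workspace_attribute`` and ``output_mapping`` are hypothetical and not part of the workflow), the same mapping can be written as:

```python
# Output variables listed in the OUTPUTS table of the guide.
OUTPUT_VARIABLES = ["metadata_tsv", "nextclade_tsv", "sequences_fasta"]


def workspace_attribute(source, variable):
    """Build one workspace attribute name, e.g. workspace.genbank_metadata_tsv."""
    return f"workspace.{source}_{variable}"


def output_mapping(source):
    """Map every output variable to its workspace attribute for one data source."""
    return {
        variable: workspace_attribute(source, variable)
        for variable in OUTPUT_VARIABLES
    }


genbank_outputs = output_mapping("genbank")
for variable, attribute in genbank_outputs.items():
    print(f"{variable} -> {attribute}")
```

Calling ``output_mapping("gisaid")`` gives the corresponding names for the GISAID ingest, so both guides can share one naming scheme.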

@@ -1,37 +1,36 @@
*******************************
Ingest SARS-CoV-2 data on Terra
*******************************
*******************************************
Ingest SARS-CoV-2 data from GISAID on Terra
*******************************************

We have provided two pipelines for importing data into Terra:

* GenBank Ingest - pull a public dataset and send it through our preprocessing scripts.
* GISAID Ingest - pull a private dataset if a user has their own API endpoint, username, and password.
* **GISAID Ingest - pull a private dataset if a user has their own API endpoint, username, and password.**

The pipelines were created mainly to provide access to our data preprocessing scripts. Currently, they focus on pulling datasets for the ncov workflow, as roughly diagrammed below:

.. image:: ../images/terra-ingest.png

This guide describes the **GISAID Ingest**.

Import the wdl workflow from Dockstore
=============================================
Import the GISAID ingest WDL workflow from Dockstore
=====================================================

1. `Set up a Terra account <https://terra.bio/>`_.
2. Navigate to one of the following in Dockstore:
- `nextstrain/ncov/genbank_ingest`_: for open (GenBank) data
2. Navigate to the following in Dockstore:
- `nextstrain/ncov/gisaid_ingest`_ for private (GISAID) data. Requires access to the GISAID API endpoint.
3. At the top right corner, under **Launch with**, click on **Terra**. You may be prompted to log in.
4. Provide a **Workflow Name** (e.g. ``genbank_ingest`` or ``gisaid_ingest``).
4. Provide a **Workflow Name** (e.g. ``gisaid_ingest``).
5. Select a **Destination Workspace** from the dropdown menu.
6. Click **IMPORT**.
7. In your workspace, click on the **WORKFLOWS** tab and verify that the imported workflow is showing a card.

.. _`nextstrain/ncov/genbank_ingest`: https://dockstore.org/workflows/github.com/nextstrain/ncov/genbank_ingest:master?tab=info
.. _`nextstrain/ncov/gisaid_ingest`: https://dockstore.org/workflows/github.com/nextstrain/ncov/gisaid_ingest:master?tab=info

Create Terra Variables for GISAID API Endpoint
================================================

If you are pulling GISAID data, you must have your own API key. If you are pulling GenBank data (open), skip to step 6 of the next section.
If you are pulling GISAID data, you must have your own API key.

1. Navigate to your workspace on Terra.
2. On the **Data** tab, from the left menu click **Workspace Data**.
@@ -45,7 +44,7 @@ If you are pulling GISAID data, you must have your own API key. If you are pulli
|GISAID_USERNAME_AND_PASSWORD | username:password | Your GISAID username:password for API access |
+-----------------------------+----------------------------+-----------------------------------------------+
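
The ``GISAID_USERNAME_AND_PASSWORD`` value is a single ``username:password`` string. A small Python sketch for checking that shape before saving it as a workspace variable is shown here. ``split_credentials`` is a hypothetical helper, and splitting at the first colon is an assumption, not a documented behavior of the ingest workflow:

```python
def split_credentials(value):
    """Split a username:password string at the first colon.

    Assumption: the username contains no colon, so any later colons
    belong to the password.
    """
    username, separator, password = value.partition(":")
    if not separator or not username or not password:
        raise ValueError("expected a value of the form username:password")
    return username, password


# Placeholder credentials for illustration only.
username, password = split_credentials("my_user:my_password")
```

A value without a colon, or with an empty username or password, raises ``ValueError`` instead of being stored silently.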

Connect your workspace variables to the wdl ingest workflow
Connect any workspace variables to the WDL ingest workflow
===========================================================

1. Navigate back to the **WORKFLOWS** tab, and click on the workflow imported to your workspace.
@@ -57,7 +56,7 @@ Connect your workspace variables to the wdl ingest workflow
4. Under **Step 2**:

1. Click **SELECT DATA**.
2. Select **Choose specific ncov_exampless to process**.
2. Select **Choose specific ncov_examples to process**.
3. Select the first row in the data table. The first column should have the value ``blank``. Selecting more rows will cause the workflow to run more than once.
4. Click **OK**.

@@ -84,10 +83,8 @@ Connect your workspace variables to the wdl ingest workflow
|Nextstrain_WRKFLW| sequences_fasta | File | workspace.gisaid_sequences_fasta |
+-----------------+------------------+-------+----------------------------------+

If you are pulling GenBank data, use something like ``workspace.genbank_sequences_fasta`` instead.

1. Click **SAVE** then **RUN ANALYSIS**.
#. Optionally enter a job description, then click **LAUNCH**.
#. The new job will appear in the **JOB HISTORY** tab. You can monitor its status by refreshing that page.
#. When run is complete, check the **DATA** / **Workspace Data** tab and use the "workspace.gisaid_sequences_fasta" and "workspace.gisaid_metadata.tsv" during normal ncov Terra runs.
8. Click **SAVE** then **RUN ANALYSIS**.
9. Optionally enter a job description, then click **LAUNCH**.
10. The new job will appear in the **JOB HISTORY** tab. You can monitor its status by refreshing that page.
11. When the run is complete, check the **DATA** / **Workspace Data** tab and use the "workspace.gisaid_sequences_fasta" and "workspace.gisaid_metadata.tsv" during normal ncov Terra runs.

3 changes: 2 additions & 1 deletion docs/src/index.rst
@@ -46,7 +46,8 @@ If you have a specific question, `post a note on the discussion board <https://d
guides/workflow-config-file
guides/customizing-visualization
guides/run-analysis-on-terra
guides/run-ingest-on-terra
guides/run-genbank-ingest-on-terra
guides/run-gisaid-ingest-on-terra

.. toctree::
:maxdepth: 1