From fcb4ad6f0bb4870b33d7bee5936aca19c0292692 Mon Sep 17 00:00:00 2001
From: j23414
Date: Tue, 6 Dec 2022 11:47:17 -0800
Subject: [PATCH] wdl docs: split run ingest into genbank and gisaid

---
 docs/src/guides/run-analysis-on-terra.rst     |  1 +
 .../guides/run-genbank-ingest-on-terra.rst    | 65 +++++++++++++++++++
 ...rra.rst => run-gisaid-ingest-on-terra.rst} | 35 +++++-----
 docs/src/index.rst                            |  3 +-
 4 files changed, 84 insertions(+), 20 deletions(-)
 create mode 100644 docs/src/guides/run-genbank-ingest-on-terra.rst
 rename docs/src/guides/{run-ingest-on-terra.rst => run-gisaid-ingest-on-terra.rst} (75%)

diff --git a/docs/src/guides/run-analysis-on-terra.rst b/docs/src/guides/run-analysis-on-terra.rst
index 10f140242..7a006501f 100644
--- a/docs/src/guides/run-analysis-on-terra.rst
+++ b/docs/src/guides/run-analysis-on-terra.rst
@@ -46,6 +46,7 @@ Connect your data files to the WDL workflow
    ::
 
      entity:ncov_examples_id metadata sequences configfile_yaml
+     blank
      example gs://COPY_PATH_HERE/example_metadata.tsv gs://COPY_PATH_HERE/example_datasets/example_sequences.fasta.gz
      example_build gs://COPY_PATH_HERE/example-build.yaml
 
diff --git a/docs/src/guides/run-genbank-ingest-on-terra.rst b/docs/src/guides/run-genbank-ingest-on-terra.rst
new file mode 100644
index 000000000..e1b9472b9
--- /dev/null
+++ b/docs/src/guides/run-genbank-ingest-on-terra.rst
@@ -0,0 +1,65 @@
+*********************************************
+Ingest SARS-CoV-2 data from GenBank on Terra
+*********************************************
+
+We have provided two pipelines for importing data into Terra:
+
+* **GenBank Ingest - pull a public dataset and send them through our preprocessing scripts.**
+* GISAID Ingest - pull a private dataset if a user has their own API endpoint, username, and password.
+
+The pipelines were mainly motivated to provide access to our data pre-processing scripts. Currently, these are focused on pulling datasets for the ncov workflow and is roughly diagrammed below:
+
+.. image:: ../images/terra-ingest.png
+
+This guide describes the **GenBank Ingest**.
+
+Import the GenBank ingest wdl workflow from Dockstore
+======================================================
+
+1. `Set up a Terra account `_.
+2. Navigate to one of the following in Dockstore:
+   - `nextstrain/ncov/genbank_ingest`_: for open (GenBank) data
+3. At the top right corner, under **Launch with**, click on **Terra**. You may be prompted to log in.
+4. Provide a **Workflow Name** (e.g. ``genbank_ingest`` ).
+5. Select a **Destination Workspace** from the dropdown menu.
+6. Click **IMPORT**.
+7. In your workspace, click on the **WORKFLOWS** tab and verify that the imported workflow is showing a card.
+
+.. _`nextstrain/ncov/genbank_ingest`: https://dockstore.org/workflows/github.com/nextstrain/ncov/genbank_ingest:master?tab=info
+
+Connect any workspace variables to the wdl ingest workflow
+===========================================================
+
+1. Navigate back to the **Workflow** tab, and click on the workflow imported to your workspace.
+2. Click on the radio button **Run workflow(s) with inputs defined by data table**.
+3. Under **Step 1**:
+
+   1. Select root entity type as **ncov_examples** from the drop down menu.
+
+4. Under **Step 2**:
+
+   1. Click **SELECT DATA**.
+   2. Select **Choose specific ncov_examples to process**.
+   3. Select the 1st row in the data table. The first column should have value ``blank``. Selecting more rows will cause the workflow to run more than once.
+   4. Click **OK**.
+
+5. Leave the values blank.
+6. Click on the **OUTPUTS** tab.
+7. Connect your generated output back to the workspace data, but filling in values:
+
+   +-----------------+------------------+-------+----------------------------------+
+   |Task name        | Variable         | Type  | Attribute                        |
+   +=================+==================+=======+==================================+
+   |Nextstrain_WRKFLW| metadata_tsv     | File  | workspace.genbank_metadata_tsv   |
+   +-----------------+------------------+-------+----------------------------------+
+   |Nextstrain_WRKFLW| nextclade_tsv    | File  | workspace.genbank_nextclade_tsv  |
+   +-----------------+------------------+-------+----------------------------------+
+   |Nextstrain_WRKFLW| sequences_fasta  | File  | workspace.genbank_sequences_fasta|
+   +-----------------+------------------+-------+----------------------------------+
+
+
+8. Click **SAVE** then **RUN ANALYSIS**.
+9. Optionally enter a job description, then click **LAUNCH**.
+10. The new job will appear in the **JOB HISTORY** tab. You can monitor its status by refreshing that page.
+11. When run is complete, check the **DATA** / **Workspace Data** tab and use the "workspace.genbank_sequences_fasta" and "workspace.genbank_metadata.tsv" during normal ncov Terra runs.
+
diff --git a/docs/src/guides/run-ingest-on-terra.rst b/docs/src/guides/run-gisaid-ingest-on-terra.rst
similarity index 75%
rename from docs/src/guides/run-ingest-on-terra.rst
rename to docs/src/guides/run-gisaid-ingest-on-terra.rst
index 7e0875d24..c371f1ff6 100644
--- a/docs/src/guides/run-ingest-on-terra.rst
+++ b/docs/src/guides/run-gisaid-ingest-on-terra.rst
@@ -1,37 +1,36 @@
-*******************************
-Ingest SARS-CoV-2 data on Terra
-*******************************
+*******************************************
+Ingest SARS-CoV-2 data from GISAID on Terra
+*******************************************
 
 We have provided two pipelines for importing data into Terra:
 
 * GenBank Ingest - pull a public dataset and send them through our preprocessing scripts.
-* GISAID Ingest - pull a private dataset if a user has their own API endpoint, username, and password.
+* **GISAID Ingest - pull a private dataset if a user has their own API endpoint, username, and password.**
 
 The pipelines were mainly motivated to provide access to our data pre-processing scripts. Currently, these are focused on pulling datasets for the ncov workflow and is roughly diagrammed below:
 
 .. image:: ../images/terra-ingest.png
 
+This guide describes the **GISAID Ingest**.
 
-Import the wdl workflow from Dockstore
-=============================================
+Import the GISAID ingest wdl workflow from Dockstore
+=====================================================
 
 1. `Set up a Terra account `_.
-2. Navigate to one of the following in Dockstore:
-   - `nextstrain/ncov/genbank_ingest`_: for open (GenBank) data
+2. Navigate to the following in Dockstore:
    - `nextstrain/ncov/gisaid_ingest`_ for private (GISAID) data. Requires access to the GISAID API endpoint.
 3. At the top right corner, under **Launch with**, click on **Terra**. You may be prompted to log in.
-4. Provide a **Workflow Name** (e.g. ``genbank_ingest`` or ``gisaid_ingest``).
+4. Provide a **Workflow Name** (e.g. ``gisaid_ingest``).
 5. Select a **Destination Workspace** from the dropdown menu.
 6. Click **IMPORT**.
 7. In your workspace, click on the **WORKFLOWS** tab and verify that the imported workflow is showing a card.
 
-.. _`nextstrain/ncov/genbank_ingest`: https://dockstore.org/workflows/github.com/nextstrain/ncov/genbank_ingest:master?tab=info
 .. _`nextstrain/ncov/gisaid_ingest`: https://dockstore.org/workflows/github.com/nextstrain/ncov/gisaid_ingest:master?tab=info
 
 Create Terra Variables for GISAID API Endpoint
 ================================================
 
-If you are pulling GISAID data, you must have your own API key. If you are pulling GenBank data (open), skip to step 6 of the next section.
+If you are pulling GISAID data, you must have your own API key.
 
 1. Navigate to your workspace on Terra.
 2. On the **Data** tab, from the left menu click **Workspace Data**.
@@ -45,7 +44,7 @@ If you are pulling GISAID data, you must have your own API key. If you are pulli
    |GISAID_USERNAME_AND_PASSWORD | username:password          | Your GISAID username password for API access  |
    +-----------------------------+----------------------------+-----------------------------------------------+
 
-Connect your workspace variables to the wdl ingest workflow
+Connect any workspace variables to the wdl ingest workflow
 ===========================================================
 
 1. Navigate back to the **Workflow** tab, and click on the workflow imported to your workspace.
@@ -57,7 +56,7 @@ Connect your workspace variables to the wdl ingest workflow
 4. Under **Step 2**:
 
    1. Click **SELECT DATA**.
-   2. Select **Choose specific ncov_exampless to process**.
+   2. Select **Choose specific ncov_examples to process**.
    3. Select the 1st row in the data table. The first column should have value ``blank``. Selecting more rows will cause the workflow to run more than once.
    4. Click **OK**.
 
@@ -84,10 +83,8 @@ Connect your workspace variables to the wdl ingest workflow
    |Nextstrain_WRKFLW| sequences_fasta  | File  | workspace.gisaid_sequences_fasta |
    +-----------------+------------------+-------+----------------------------------+
 
-If you are pulling GenBank data, use something like ``workspace.genbank_sequences_fasta`` instead.
-
-1. Click **SAVE** then **RUN ANALYSIS**.
-#. Optionally enter a job description, then click **LAUNCH**.
-#. The new job will appear in the **JOB HISTORY** tab. You can monitor its status by refreshing that page.
-#. When run is complete, check the **DATA** / **Workspace Data** tab and use the "workspace.gisaid_sequences_fasta" and "workspace.gisaid_metadata.tsv" during normal ncov Terra runs.
+8. Click **SAVE** then **RUN ANALYSIS**.
+9. Optionally enter a job description, then click **LAUNCH**.
+10. The new job will appear in the **JOB HISTORY** tab. You can monitor its status by refreshing that page.
+11. When run is complete, check the **DATA** / **Workspace Data** tab and use the "workspace.gisaid_sequences_fasta" and "workspace.gisaid_metadata.tsv" during normal ncov Terra runs.
 
diff --git a/docs/src/index.rst b/docs/src/index.rst
index f8e6dd512..7c4033b7a 100644
--- a/docs/src/index.rst
+++ b/docs/src/index.rst
@@ -46,7 +46,8 @@ If you have a specific question, `post a note on the discussion board