Merge pull request #47 from nextstrain/ingest-uploads
Add uploads to S3 for NCBI and Andersen Lab ingests
joverlee521 committed Jun 3, 2024
2 parents 82b0315 + e86aa97 commit 4eaba84
Showing 14 changed files with 196 additions and 153 deletions.
9 changes: 7 additions & 2 deletions README.md
@@ -69,10 +69,15 @@ Specifically, the files needed are `ingest/results/metadata.tsv` and `ingest/res
Run full genome builds with the following command.

``` bash
nextstrain build . --snakefile Snakefile.genome --config local_ingest=True ingest_source=ncbi
nextstrain build \
--env AWS_ACCESS_KEY_ID \
--env AWS_SECRET_ACCESS_KEY \
. \
--snakefile Snakefile.genome \
--config s3_src=s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi
```

Currently this is only set up for the "h5n1-cattle-outbreak" build using locally ingested NCBI data,
Currently this is only set up for the "h5n1-cattle-outbreak" build using NCBI data,
and the build is restricted to a set of strains where we think there's no reassortment, with outgroups
excluded (in `config/dropped_strains_h5n1-cattle-outbreak.txt`).
Output files will be placed in `results/h5n1-cattle-outbreak/genome`.
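
For comparison, the build can still run on locally ingested NCBI data (the `local_ingest` branch of the assertion in `Snakefile.genome` below), using the previous style of invocation:

``` bash
nextstrain build . --snakefile Snakefile.genome --config local_ingest=True ingest_source=ncbi
```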
8 changes: 6 additions & 2 deletions Snakefile.genome
@@ -1,7 +1,11 @@
include: "rules/common.smk"

assert LOCAL_INGEST == True and INGEST_SOURCE == "ncbi", \
    "Full genome build is only set up for local ingest from 'ncbi'."
if LOCAL_INGEST:
    assert INGEST_SOURCE == "ncbi", \
        "Full genome build is only set up for local ingest from 'ncbi'."
else:
    assert S3_SRC.startswith("s3://nextstrain-data/"), \
        "Full genome build is only set up for data from the public S3 bucket"

import json

47 changes: 34 additions & 13 deletions ingest/README.md
@@ -12,14 +12,46 @@ This workflow requires the Nextstrain CLI's Docker runtime which includes [fauna
> NOTE: All command examples assume you are within the `ingest` directory.
> If running commands from the outer `avian-flu` directory, replace the `.` with `ingest`.
### Ingest data from NCBI GenBank
### Ingest and upload data from public sources to S3

#### Ingest NCBI GenBank

To download, parse, and curate data from NCBI GenBank, run the following command.
```sh
nextstrain build . ingest_ncbi --configfile build-configs/ncbi/defaults/config.yaml
```

This results in the files `metadata.tsv`, `sequences_ha.fasta`, etc... under `ingest/ncbi/results/`.
This results in the files `metadata.tsv`, `sequences_ha.fasta`, etc... under `ncbi/results/`.
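
A quick spot check of the outputs, assuming the workflow completed and you are in `ingest/`:

```sh
ls ncbi/results/
# expect metadata.tsv plus one sequences_<segment>.fasta per configured segment
```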

#### Ingest from Andersen lab's avian-influenza repo

Ingest publicly available consensus sequences and metadata from Andersen lab's [avian-influenza repo](https://github.com/andersen-lab/avian-influenza).
Only run this workflow as needed to see the latest available data in the repo.
It does not merge or deduplicate the data with the NCBI GenBank workflow.

```sh
nextstrain build . ingest_andersen_lab --configfile build-configs/ncbi/defaults/config.yaml
```

The results will be available in `andersen-lab/results/`.

#### Upload to S3

To run both NCBI GenBank and Andersen Lab ingests _and_ upload results to S3,
run the following command:

```sh
nextstrain build \
--env AWS_ACCESS_KEY_ID \
--env AWS_SECRET_ACCESS_KEY \
. \
upload_all_ncbi \
--configfile build-configs/ncbi/defaults/config.yaml
```

The workflow compresses the local files and uploads them to S3 at the corresponding paths
under `s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi` and
`s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab`.
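
One way to verify the uploads is to list the destination prefixes; a sketch assuming the AWS CLI and read access to the bucket:

```sh
aws s3 ls s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi/
aws s3 ls s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab/
```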

### Ingest and upload data from fauna to S3

@@ -53,17 +85,6 @@ nextstrain build \
. upload_all
```

### Ingest from Andersen lab's avian-influenza repo

Ingest publicly available consensus sequences and metadata from Andersen lab's [avian-influenza repo](https://github.com/andersen-lab/avian-influenza).
Only run this workflow as needed to see the latest available data in the repo.
It does not merge or deduplicate the data with the fauna data used in the default ingest workflow.

```sh
nextstrain build . merge_andersen_segment_metadata
```

The results will be available in `andersen-lab/results/`.

## Configuration

10 changes: 7 additions & 3 deletions ingest/Snakefile
@@ -3,8 +3,11 @@ path_to_fauna = '../fauna'
# Use default configuration values. Override with Snakemake's --configfile/--config options.
configfile: "defaults/config.yaml"

SUPPORTED_DATA_SOURCES = ["fauna", "ncbi", "andersen-lab"]

wildcard_constraints:
    segment = "|".join(config["segments"])
    segment = "|".join(config["segments"]),
    data_source = "|".join(SUPPORTED_DATA_SOURCES)

rule all:
# As of 2024-05-16 the default ingest only ingests data from fauna
@@ -18,8 +21,9 @@ rule upload_all:
        sequences=expand("fauna/s3/sequences_{segment}.done", segment=config["segments"]),
        metadata="fauna/s3/metadata.done",

include: "rules/upload_from_fauna.smk"
include: "rules/ingest_andersen_lab.smk"
include: "rules/ingest_fauna.smk"
include: "rules/merge_segment_metadata.smk"
include: "rules/upload_to_s3.smk"

# Allow users to import custom rules provided via the config.
if "custom_rules" in config:
26 changes: 25 additions & 1 deletion ingest/build-configs/ncbi/Snakefile
@@ -5,13 +5,37 @@ workflow and defines its default outputs.
# Use default configuration values. Override with Snakemake's --configfile/--config options.
configfile: "build-configs/ncbi/defaults/config.yaml"

# Sanity check that the requested segments match our ncbi_segments map
assert all(segment in config["ncbi_segments"].keys() for segment in config["segments"])

NCBI_DATA_SOURCES = ["ncbi", "andersen-lab"]

rule ingest_ncbi:
    input:
        expand([
            "ncbi/results/sequences_{segment}.fasta",
        ], segment=config["ncbi_segments"].keys()),
        ], segment=config["segments"]),
        "ncbi/results/metadata.tsv",


rule ingest_andersen_lab:
    input:
        expand([
            "andersen-lab/results/sequences_{segment}.fasta",
        ], segment=config["segments"]),
        "andersen-lab/results/metadata.tsv",


# Uploads all results for NCBI and Andersen Lab ingests
rule upload_all_ncbi:
    input:
        expand([
            "{data_source}/s3/sequences_{segment}.done",
            "{data_source}/s3/metadata.done",
        ], data_source=NCBI_DATA_SOURCES, segment=config["segments"]),
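    # With "ha" among the configured segments, the expand above yields targets
    # such as ncbi/s3/sequences_ha.done, ncbi/s3/metadata.done, and
    # andersen-lab/s3/metadata.done.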


# Include file paths are relative to this Snakefile
include: "rules/ingest_andersen_lab.smk"
include: "rules/fetch_from_ncbi.smk"
include: "rules/curate.smk"
ingest/scripts/curate_andersen_lab_data.py → ingest/build-configs/ncbi/bin/curate_andersen_lab_data
@@ -1,3 +1,4 @@
#!/usr/bin/env python3
"""
Curate the metadata that originated from Andersen Lab's avian-influenza repo
<https://github.com/andersen-lab/avian-influenza>.
Expand Down
6 changes: 6 additions & 0 deletions ingest/build-configs/ncbi/defaults/config.yaml
@@ -126,3 +126,9 @@ curate:
- gisaid_clade
- h5_clade
- genbank_accession

s3_dst:
  ncbi: s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi
  andersen-lab: s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab

cloudfront_domain: data.nextstrain.org
19 changes: 0 additions & 19 deletions ingest/build-configs/ncbi/rules/curate.smk
@@ -148,22 +148,3 @@ rule subset_metadata:
        tsv-select -H -f {params.metadata_fields} \
            {input.metadata} > {output.subset_metadata}
        """


rule merge_ncbi_segment_metadata:
    """
    Add a column "n_segments" which reports how many segments
    have sequence data (no QC performed).
    """
    input:
        segments = expand("ncbi/data/metadata_{segment}.tsv", segment=config["ncbi_segments"]),
        metadata = "ncbi/data/metadata_ha.tsv",
    output:
        metadata = "ncbi/results/metadata.tsv",
    shell:
        """
        python scripts/add_segment_counts.py \
            --segments {input.segments} \
            --metadata {input.metadata} \
            --output {output.metadata}
        """
ingest/rules/ingest_andersen_lab.smk → ingest/build-configs/ncbi/rules/ingest_andersen_lab.smk
@@ -83,7 +83,7 @@ rule curate_metadata:
"""
augur curate normalize-strings \
--metadata {input.metadata} \
| python3 ./scripts/curate_andersen_lab_data.py \
| ./build-configs/ncbi/bin/curate_andersen_lab_data \
| ./vendored/apply-geolocation-rules \
--geolocation-rules {input.geolocation_rules} \
| augur curate passthru \
@@ -99,7 +99,7 @@ rule match_metadata_and_segment_fasta:
metadata = "andersen-lab/data/metadata.tsv",
fasta = "andersen-lab/data/{segment}.fasta"
output:
metadata = "andersen-lab/results/metadata_{segment}.tsv",
metadata = "andersen-lab/data/metadata_{segment}.tsv",
fasta = "andersen-lab/results/sequences_{segment}.fasta"
log:
"andersen-lab/logs/match_segment_metadata_and_fasta/{segment}.txt",
@@ -118,21 +118,3 @@ rule match_metadata_and_segment_fasta:
            --output-seq-field sequence \
            2> {log}
        """

rule merge_andersen_segment_metadata:
    """
    Add a column "n_segments" which reports how many segments
    have sequence data (no QC performed).
    """
    input:
        segments = expand("andersen-lab/results/metadata_{segment}.tsv", segment=config["segments"]),
        metadata = "andersen-lab/results/metadata_ha.tsv",
    output:
        metadata = "andersen-lab/results/metadata.tsv",
    shell:
        """
        python scripts/add_segment_counts.py \
            --segments {input.segments} \
            --metadata {input.metadata} \
            --output {output.metadata}
        """
3 changes: 2 additions & 1 deletion ingest/defaults/config.yaml
@@ -8,4 +8,5 @@ segments:
- mp
- ns

s3_dst: "s3://nextstrain-data-private/files/workflows/avian-flu"
s3_dst:
  fauna: s3://nextstrain-data-private/files/workflows/avian-flu
41 changes: 41 additions & 0 deletions ingest/rules/ingest_fauna.smk
@@ -0,0 +1,41 @@
from pathlib import Path


rule download_segment:
    output:
        sequences = "fauna/data/{segment}.fasta",
    params:
        fasta_fields = "strain virus accession collection_date region country division location host domestic_status subtype originating_lab submitting_lab authors PMID gisaid_clade h5_clade",
        output_dir = lambda wildcards, output: Path(output.sequences).parent,
        output_fstem = lambda wildcards, output: Path(output.sequences).stem,
    benchmark:
        "fauna/benchmarks/download_segment_{segment}.txt"
    shell:
        """
        python3 {path_to_fauna}/vdb/download.py \
            --database vdb \
            --virus avian_flu \
            --fasta_fields {params.fasta_fields} \
            --select locus:{wildcards.segment} \
            --path {params.output_dir} \
            --fstem {params.output_fstem}
        """

rule parse_segment:
    input:
        sequences = "fauna/data/{segment}.fasta",
    output:
        sequences = "fauna/results/sequences_{segment}.fasta",
        metadata = "fauna/data/metadata_{segment}.tsv",
    params:
        fasta_fields = "strain virus isolate_id date region country division location host domestic_status subtype originating_lab submitting_lab authors PMID gisaid_clade h5_clade",
        prettify_fields = "region country division location host originating_lab submitting_lab authors PMID"
    shell:
        """
        augur parse \
            --sequences {input.sequences} \
            --output-sequences {output.sequences} \
            --output-metadata {output.metadata} \
            --fields {params.fasta_fields} \
            --prettify-fields {params.prettify_fields}
        """
26 changes: 26 additions & 0 deletions ingest/rules/merge_segment_metadata.smk
@@ -0,0 +1,26 @@
"""
This part of the workflow handles how we merge the metadata for each segment
into a central metadata file.
"""


rule merge_segment_metadata:
    """
    For each subtype's HA metadata file add a column "n_segments" which reports
    how many segments have sequence data (no QC performed). This will force the
    download & parsing of all segments for a given subtype. Note that this does
    not currently consider the prescribed min lengths (see min_length function)
    for each segment, but that would be a nice improvement.
    """
    input:
        segments = expand("{{data_source}}/data/metadata_{segment}.tsv", segment=config["segments"]),
        metadata = "{data_source}/data/metadata_ha.tsv",
    output:
        metadata = "{data_source}/results/metadata.tsv",
    shell:
        """
        python scripts/add_segment_counts.py \
            --segments {input.segments} \
            --metadata {input.metadata} \
            --output {output.metadata}
        """