Merge pull request #47 from nextstrain/ingest-uploads
Add uploads to S3 for NCBI and Andersen Lab ingests
Showing 14 changed files with 196 additions and 153 deletions.
1 change: 1 addition & 0 deletions. Renamed: ingest/scripts/curate_andersen_lab_data.py → ...configs/ncbi/bin/curate_andersen_lab_data (file mode 100644 → 100755)
@@ -0,0 +1,41 @@
from pathlib import Path


rule download_segment:
    output:
        sequences = "fauna/data/{segment}.fasta",
    params:
        fasta_fields = "strain virus accession collection_date region country division location host domestic_status subtype originating_lab submitting_lab authors PMID gisaid_clade h5_clade",
        output_dir = lambda wildcards, output: Path(output.sequences).parent,
        output_fstem = lambda wildcards, output: Path(output.sequences).stem,
    benchmark:
        "fauna/benchmarks/download_segment_{segment}.txt"
    shell:
        """
        python3 {path_to_fauna}/vdb/download.py \
            --database vdb \
            --virus avian_flu \
            --fasta_fields {params.fasta_fields} \
            --select locus:{wildcards.segment} \
            --path {params.output_dir} \
            --fstem {params.output_fstem}
        """


rule parse_segment:
    input:
        sequences = "fauna/data/{segment}.fasta",
    output:
        sequences = "fauna/results/sequences_{segment}.fasta",
        metadata = "fauna/data/metadata_{segment}.tsv",
    params:
        fasta_fields = "strain virus isolate_id date region country division location host domestic_status subtype originating_lab submitting_lab authors PMID gisaid_clade h5_clade",
        prettify_fields = "region country division location host originating_lab submitting_lab authors PMID"
    shell:
        """
        augur parse \
            --sequences {input.sequences} \
            --output-sequences {output.sequences} \
            --output-metadata {output.metadata} \
            --fields {params.fasta_fields} \
            --prettify-fields {params.prettify_fields}
        """
@@ -0,0 +1,26 @@
"""
This part of the workflow handles how we merge the metadata for each segment
into a central metadata file.
"""


rule merge_segment_metadata:
    """
    For each subtype's HA metadata file add a column "n_segments" which reports
    how many segments have sequence data (no QC performed). This will force the
    download & parsing of all segments for a given subtype. Note that this does
    not currently consider the prescribed min lengths (see min_length function)
    for each segment, but that would be a nice improvement.
    """
    input:
        segments = expand("{{data_source}}/data/metadata_{segment}.tsv", segment=config["segments"]),
        metadata = "{data_source}/data/metadata_ha.tsv",
    output:
        metadata = "{data_source}/results/metadata.tsv",
    shell:
        """
        python scripts/add_segment_counts.py \
            --segments {input.segments} \
            --metadata {input.metadata} \
            --output {output.metadata}
        """