Harmonize with pathogen repo guide #31

j23414 · 2024-01-22T19:14:29Z

Description of proposed changes

Harmonize the ingest code organization with the pathogen repo guide

Related issue(s)

Harmonize with pathogen-repo-guide #29

Checklist

Checks pass

post merge update of data.nextstrain.org zika files
Rename transform.smk to curate.smk
Simplify field name remapping
location of usvi data files
Harmonize the ingest/config/defaults.yaml file

jameshadfield · 2024-01-22T19:54:15Z

phylogenetic/rules/merge_sequences_usvi.smk

+
+REQUIRED INPUTS:
+
+    usvi_sequences  = data/sequences_usvi.fasta


not blocking

Is our thinking to use this docstring-like approach to document each rule file? A brief description of intent seems helpful, but detailing the inputs & outputs seems prone to falling out of sync.

Thanks for taking a look and yes, you are correct!

The rationale is to target the docstring for a high-level summary of the rule file's objectives. Users can delve into specific rule docstrings for more in-depth information. We'd aim to keep the documented inputs and outputs up to date as part of any future modification, a bit like maintaining an API.

I would advise reconsidering this. The snakemake code already documents the inputs/outputs, so duplicating these in the ~docstring is going to fall out of sync. Completely up to you & Jover -- just a suggestion!

It's a good point! We were also considering just moving the "append_usvi" rule into "prepare_sequences.smk" to bypass having to track another file. Do you have a preference for keeping it separate or just merging it among the "prepare_sequences" rules?

No preference. In general I'd see the naming of the rules as pathogen-dependent; if the bioinformatician feels it's appropriate to carve certain rules out into a new .smk file then I think that's just fine.

Yeah, I'm also worried about inputs/outputs getting out of sync, but I'm not sure how else to give people a high level view of how the smaller .smk files are connected.

The goal is for these required inputs/outputs to rarely change since they will become the standard connection points.

My reading of the comments in the template's (otherwise-empty) snakemake rule files was that they provided a really nice overview of what the ruleset should accomplish, including the expected inputs and outputs¹, but that the docstring paths would be removed by the author when they actually add the rules.

¹ For the simple case -- for builds using wildcards the author will have to understand that these paths need modifying.
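One hedged way to address the out-of-sync concern raised in this thread is to check the docstring mechanically. The sketch below assumes the docstring lists inputs as unquoted `name = path` lines (as in the diff above) and treats a path as "in sync" if it also appears somewhere else in the same rule file. The file `example_rules.smk`, its contents, and the function `check_docstring_paths` are hypothetical and not part of this PR.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical rule file, written here only so the sketch is self-contained.
cat > example_rules.smk <<'EOF'
"""
REQUIRED INPUTS:

    usvi_sequences  = data/sequences_usvi.fasta
    usvi_metadata   = data/metadata_usvi.tsv
"""

rule append_usvi:
    input:
        usvi_sequences = "data/sequences_usvi.fasta",
        usvi_metadata = "data/metadata_usvi_renamed.tsv",
EOF

# Print every documented path that occurs on fewer than two lines of the file,
# i.e. paths that appear in the docstring but nowhere in the rules.
check_docstring_paths() {
    local smk_file="$1"
    # Documented paths: indented, unquoted "name = path" lines.
    awk '/^ +[A-Za-z_]+ += +[^" ]+$/ {print $NF}' "$smk_file" \
    | while read -r path; do
        count=$(grep -cF -- "$path" "$smk_file" || true)
        if [ "$count" -lt 2 ]; then
            echo "$path"
        fi
    done
}

check_docstring_paths example_rules.smk > stale_paths.txt
cat stale_paths.txt
```

In this toy file the metadata path was renamed in the rule but not in the docstring, so it is the only path reported. A check like this is deliberately crude (plain substring matching), which may be enough if the documented paths really are stable connection points.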

joverlee521

Thanks for following up with the additional changes here @j23414!

joverlee521 · 2024-01-23T23:01:16Z

phylogenetic/data/README.md

+
+### Integration of USVI data
+
+This Zika build incorporates data from https://github.com/blab/zika-usvi/. The sequences and metadata for USVI from that GitHub repository have undergone curation and were uploaded to https://github.com/nextstrain/fauna. Subsequently, they were downloaded as sequences and metadata, and a filter was applied to include only those records not yet submitted to NCBI GenBank. The resulting records are now available as a pair of metadata and sequences files in this directory.


I think it might be helpful to add a little more detail here.

This is my best guess looking through git history:

1. Sequences were uploaded to the fauna database following these instructions.

2. Sequences were downloaded from the fauna database following these instructions.

3. It's not clear to me what filter was applied to include only records not yet submitted to NCBI.

Good point. I should have stated that I used blastn to identify USVI records not yet submitted to NCBI. Not sure how much detail I should give here.

For point 3:

1. Run zika ingest to get a sequences.fasta file from GenBank

2. Blast fauna's zika.fasta file against the sequences.fasta file and pull any records that are not a 100% match

    GENBANK_SEQUENCES=sequences.fasta
    FAUNA_SEQUENCES=zika.fasta

    # Create a local blast database
    makeblastdb \
      -in ${GENBANK_SEQUENCES} \
      -dbtype nucl

    # Blast fauna against GenBank
    blastn \
      -db ${GENBANK_SEQUENCES} \
      -query ${FAUNA_SEQUENCES} \
      -num_alignments 1 \
      -outfmt 6 \
      -out blast_output.txt

    # USVI strains that
    # + match at 100%
    # + match at least a 5000nt region (to filter out short substring matches)
    cat blast_output.txt \
      | awk -F'\t' '$1~"USVI" && $3>=100 && $4>5000 , OFS="\t" {print $1}' \
      > USVI_100_match.txt

    less USVI_100_match.txt
    # USVI/5/2016|zika|MW165881|2016-10-17|north_america|usvi|saint_thomas|saint_thomas|genbank|genome|Santiago
    # USVI/43/2016|zika|MW165884|2016-07-19|north_america|usvi|saint_thomas|saint_thomas|genbank|genome|Santiago
    # USVI/4/2016|zika|MW165880|2016-10-14|north_america|usvi|saint_thomas|saint_thomas|genbank|genome|Santiago
    # USVI/35/2016|zika|MW165883|2016-09-08|north_america|usvi|saint_thomas|saint_thomas|genbank|genome|Santiago
    # USVI/25/2016|zika|MW165882|2016-09-27|north_america|usvi|saint_thomas|saint_thomas|genbank|genome|Santiago

    # USVI strains that are not in the 100 match list
    cat blast_output.txt \
      | awk -F'\t' '$1~"USVI" , OFS="\t" {print}' \
      | grep -Fvf USVI_100_match.txt \
      | awk -F'\t' '{print $1}' \
      > USVI_not_match.txt

    head USVI_not_match.txt
    # USVI/12/2016|zika|VI12|2016-11-04|north_america|usvi|saint_croix|saint_croix|fh|genome|Black
    # USVI/12/2016|zika|VI12|2016-11-04|north_america|usvi|saint_croix|saint_croix|fh|genome|Black
    # USVI/12/2016|zika|VI12|2016-11-04|north_america|usvi|saint_croix|saint_croix|fh|genome|Black
    # USVI/11/2016|zika|VI11|2016-03-22|north_america|usvi|saint_thomas|saint_thomas|fh|genome|Black
    # USVI/11/2016|zika|VI11|2016-03-22|north_america|usvi|saint_thomas|saint_thomas|fh|genome|Black
    # ...
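The list of non-matching IDs still has to be turned into a FASTA file. The following is a minimal sketch of that step, assuming the IDs in the list match the FASTA headers exactly and that repeated IDs (one per blast hit) should be collapsed. The toy file names and contents are made up for illustration, and this is not necessarily how the files in this PR were produced.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy inputs (hypothetical) standing in for zika.fasta and USVI_not_match.txt.
cat > toy_fauna.fasta <<'EOF'
>USVI/5/2016
ACGT
ACGA
>USVI/12/2016
GGCC
>USVI/11/2016
TTAA
TT
EOF

# The blast-derived list may repeat an ID once per hit.
printf 'USVI/12/2016\nUSVI/12/2016\nUSVI/11/2016\n' > toy_not_match.txt

# First file (NR == FNR): remember wanted IDs; duplicates collapse in the array.
# Second file: on each header line decide whether to keep the record,
# then print header and sequence lines while "keep" is set.
awk '
    NR == FNR { wanted[$1] = 1; next }
    /^>/      { keep = (substr($1, 2) in wanted) }
    keep      { print }
' toy_not_match.txt toy_fauna.fasta > toy_usvi.fasta

cat toy_usvi.fasta
```

Records come out in the order they appear in the FASTA file rather than in list order, which is one reason a regenerated file can differ in record order from the original.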

I think it's helpful to keep as much level of detail as needed to be able to reproduce the metadata_usvi.tsv and sequences_usvi.fasta files.

Added documentation to reproduce the metadata_usvi.tsv and sequences_usvi.fasta files in more detail here: f9eff33.

In the end, I regenerated the two files (albeit with records in a slightly different order) and then aligned the pre and post FASTA files against each other to ensure parity.
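When the sequences are expected to be identical, a parity check like this can also be done without an aligner. The sketch below is one possible approach, not the method used in this PR: it flattens each FASTA record to a single `header<TAB>sequence` line, sorts, and compares, so record order and line wrapping do not matter. File names and contents are toy examples.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy "pre" and "post" files: same records, different order and line wrapping.
cat > pre.fasta <<'EOF'
>USVI/12/2016
ACGTAC
GT
>USVI/11/2016
TTGGCC
EOF

cat > post.fasta <<'EOF'
>USVI/11/2016
TTG
GCC
>USVI/12/2016
ACGTACGT
EOF

# Flatten a FASTA file to sorted "header<TAB>sequence" lines.
# Sequence case is folded to upper case so acgt and ACGT compare equal.
normalize_fasta() {
    awk '
        /^>/ { if (id != "") print id "\t" seq; id = substr($0, 2); seq = ""; next }
             { seq = seq toupper($0) }
        END  { if (id != "") print id "\t" seq }
    ' "$1" | LC_ALL=C sort
}

normalize_fasta pre.fasta  > pre.normalized.tsv
normalize_fasta post.fasta > post.normalized.tsv

# cmp exits 0 only when the two normalized files are byte-identical.
if cmp -s pre.normalized.tsv post.normalized.tsv; then
    echo "parity: OK"
else
    echo "parity: MISMATCH"
    diff pre.normalized.tsv post.normalized.tsv || true
fi
```

The same normalization could be applied to the metadata TSV by sorting its rows, assuming the column order is unchanged between the two versions.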

joverlee521 · 2024-01-24T00:14:37Z

phylogenetic/rules/merge_sequences_usvi.smk

+
+REQUIRED INPUTS:
+
+    usvi_sequences  = data/sequences_usvi.fasta


Yeah, I'm also worried about inputs/outputs getting out of sync, but I'm not sure how else to give people a high level view of how the smaller .smk files are connected.

The goal is for these required inputs/outputs to rarely change since they will become the standard connection points.

j23414 · 2024-02-01T23:17:26Z

As part of addressing this comment, I updated ingest but forgot to update phylogenetic. Fixed in b6d3ccb

joverlee521 · 2024-02-06T18:41:40Z

Thanks for following up and addressing changes to match the pathogen-repo guide!

I've made a couple more issues to track some remaining tasks:

j23414 linked an issue Jan 22, 2024 that may be closed by this pull request

Harmonize with pathogen-repo-guide #29

Closed

5 tasks

j23414 marked this pull request as draft January 22, 2024 19:14

j23414 self-assigned this Jan 22, 2024

jameshadfield reviewed Jan 22, 2024

j23414 force-pushed the harmonize-with-pathogen-repo-template branch from 29eca1c to 8db1278 Compare January 22, 2024 20:04

j23414 marked this pull request as ready for review January 22, 2024 20:07

j23414 requested a review from a team January 22, 2024 23:01

joverlee521 approved these changes Jan 24, 2024

j23414 force-pushed the harmonize-with-pathogen-repo-template branch from 2f914a1 to 978a4ac Compare January 25, 2024 01:08

j23414 added 5 commits January 30, 2024 12:19

Move fetch_sequences.smk to fetch_from_ncbi.smk

ad34fa3

This is a more accurate name for the rule, since it fetches from NCBI and matches the pathogen-repo-template/ingest/ncbi_fetch_sequences.smk rule.

Move transform.smk to curate.smk

0115484

This matches the pathogen-repo-template/ingest/rules/curate.smk

ingest/curate: Make the field map config more user friendly

5b1369e

Incorporating changes from the pathogen repo template: * nextstrain/pathogen-repo-guide@5e1b1ef

Harmonize ingest/config/defaults.yaml with the pathogen repo template

446047e

https://github.com/nextstrain/pathogen-repo-template/blob/b8ae886b25877a218ad50380fb44f8825d50aedb/ingest/config/defaults.yaml

More specific naming: merging in USVI sequences

19c8259

j23414 force-pushed the harmonize-with-pathogen-repo-template branch 2 times, most recently from dd74ddf to f9eff33 Compare January 30, 2024 20:22

j23414 changed the title ~~Harmonize with pathogen repo template~~ Harmonize with pathogen repo guide Feb 5, 2024

j23414 added 2 commits February 5, 2024 16:04

Move USVI data to the data folder

540f32d

Document provenance of the USVI data that is merged into the Zika live site.

Update the phylogenetic data url to match ingest data url

f66a276

The phylogenetic data url was updated to match the ingest data url from https://github.com/nextstrain/zika/blob/c0b9a6d38af405324c968aed9922cc2b3d136db2/ingest/config/optional.yaml#L7

j23414 force-pushed the harmonize-with-pathogen-repo-template branch from b6d3ccb to f66a276 Compare February 6, 2024 00:04

j23414 merged commit c710ca1 into main Feb 6, 2024
8 checks passed

j23414 deleted the harmonize-with-pathogen-repo-template branch February 6, 2024 00:19

joverlee521 mentioned this pull request Feb 23, 2024

phylo: clean rule removes the USVI data #37

Closed

joverlee521 mentioned this pull request Mar 1, 2024

Add phylogenetic directory nextstrain/measles#18

Merged

1 task
