-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Harmonize with pathogen repo guide #31
Conversation
|
||
REQUIRED INPUTS: | ||
|
||
usvi_sequences = data/sequences_usvi.fasta |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not blocking
Is our thinking to use this docstring-like approach to document each rule file? A brief description of intent seems helpful but detailing the inputs & outputs seems prone to falling out of sync
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking a look and yes, you are correct!
The rational is to target the docstring for a high-level summary of the rule file's objectives. Users can delve into specific rule docstrings for more in-depth information. We'd aim at keeping the documented inputs and outputs up to date as part of any future modification, a bit similar to maintaining an API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would advise reconsidering this. The snakemake code documents the inputs/outputs, duplicating these in the ~docstring is going to fall out of sync. Completely up to you & Jover -- just a suggestion!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good point! We were also considering just moving the "append_usvi" rule into "prepare_sequences.smk" to bypass having to track anther file. Do you have a preference for keeping it separate or just merging it among the "prepare_sequences" rules?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No preference. In general I'd see the naming of the rules as pathogen-dependent; if the bioinformatician feels it's appropriate to carve certain rules out into a new .smk
file then I think that's just fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm also worried about inputs/outputs getting out of sync, but I'm not sure how else to give people a high level view of how the smaller .smk
files are connected.
The goal is for these required inputs/outputs to rarely change since they will become the standard connection points.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My reading of the comments in the template's (otherwise-empty) snakemake rule files was that they provided a really nice overview of what the ruleset should accomplish, including the expected inputs and outputs¹, but that the docstring paths would be removed by the author when they actually add the rules.
¹ For the simple case -- for builds using wildcards the author will have to understand that these paths need modifying.
29eca1c
to
8db1278
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for following up with the additional changes here @j23414!
|
||
### Integration of USVI data | ||
|
||
This Zika build incorporates data from https://github.com/blab/zika-usvi/. The sequences and metadata for USVI from that GitHub repository have undergone curation and were uploaded to https://github.com/nextstrain/fauna. Subsequently, they were downloaded as sequences and metadata, and a filter was applied to include only those records not yet submitted to NCBI GenBank. The resulting records are now available as a pair of metadata and sequences files in this directory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it might be helpful to add a little more detail here.
This is my best guess looking through git history:
- Sequences were uploaded to the fauna database following these instructions.
- Sequences were downloaded from the fauna database following these instructions.
- It's not clear to me what filter was applied to include only records not yet submitted to NCBI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I should have stated that I used blastn
to identify USVI records not yet submitted to NCBI. Not sure how much detail I should give here.
For point 3
- Run zika ingest to get a
sequences.fasta
file from GenBank - Blast fauna's
zika.fasta
file against thesequences.fasta
file and pull any records that are not a 100% match
GENBANK_SEQUENCES=sequences.fasta
FAUNA_SEQUENCES=zika.fasta
# Create a local blast database
makeblastdb \
-in ${GENBANK_SEQUENCES} \
-dbtype nucl
# Blast fauna against GenBank
blastn \
-db ${GENBANK_SEQUENCES} \
-query ${FAUNA_SEQUENCES} \
-num_alignments 1 \
-outfmt 6 \
-out blast_output.txt
# USVI strains that
# + match at 100%
# + match at least a 5000nt region (to filter out short substring matches)
cat blast_output.txt \
| awk -F'\t' '$1~"USVI" && $3>=100 && $4>5000 , OFS="\t" {print $1}' \
> USVI_100_match.txt
less USVI_100_match.txt
# USVI/5/2016|zika|MW165881|2016-10-17|north_america|usvi|saint_thomas|saint_thomas|genbank|genome|Santiago
# USVI/43/2016|zika|MW165884|2016-07-19|north_america|usvi|saint_thomas|saint_thomas|genbank|genome|Santiago
# USVI/4/2016|zika|MW165880|2016-10-14|north_america|usvi|saint_thomas|saint_thomas|genbank|genome|Santiago
# USVI/35/2016|zika|MW165883|2016-09-08|north_america|usvi|saint_thomas|saint_thomas|genbank|genome|Santiago
# USVI/25/2016|zika|MW165882|2016-09-27|north_america|usvi|saint_thomas|saint_thomas|genbank|genome|Santiago
# USVI strains that are not in the 100 match list
cat blast_output.txt \
| awk -F'\t' '$1~"USVI" , OFS="\t" {print}' \
| grep -Fvf USVI_100_match.txt \
| awk -F'\t' '{print $1}' \
> USVI_not_match.txt
head USVI_not_match.txt
# USVI/12/2016|zika|VI12|2016-11-04|north_america|usvi|saint_croix|saint_croix|fh|genome|Black
# USVI/12/2016|zika|VI12|2016-11-04|north_america|usvi|saint_croix|saint_croix|fh|genome|Black
# USVI/12/2016|zika|VI12|2016-11-04|north_america|usvi|saint_croix|saint_croix|fh|genome|Black
# USVI/11/2016|zika|VI11|2016-03-22|north_america|usvi|saint_thomas|saint_thomas|fh|genome|Black
# USVI/11/2016|zika|VI11|2016-03-22|north_america|usvi|saint_thomas|saint_thomas|fh|genome|Black
# ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's helpful to keep as much level of detail as needed to be able to reproduce the metadata_usvi.tsv
and sequences_usvi.fasta
files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added documentation to reproduce the metadata_usvi.tsv and sequences_usvi.fasta files in more detail here: f9eff33.
In the end, I regenerated the two files (albeit in slightly different order) but then subsequently aligned pre and post fasta files against each other to ensure parity.
|
||
REQUIRED INPUTS: | ||
|
||
usvi_sequences = data/sequences_usvi.fasta |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm also worried about inputs/outputs getting out of sync, but I'm not sure how else to give people a high level view of how the smaller .smk
files are connected.
The goal is for these required inputs/outputs to rarely change since they will become the standard connection points.
2f914a1
to
978a4ac
Compare
This is a more accurate name for the rule, since it fetches from NCBI and matches the pathogen-repo-template/ingest/ncbi_fetch_sequences.smk rule.
This matches the pathogen-repo-template/ingest/rules/curate.smk
Incorporating changes from the pathogen repo template: * nextstrain/pathogen-repo-guide@5e1b1ef
dd74ddf
to
f9eff33
Compare
As part of addressing this comment, I updated ingest but forgot to update phylogentic. Fixed in b6d3ccb |
Document provenance of the USVI data that is merged into the Zika live site.
The phylogenetic data url was updated to match the ingest data url from https://github.com/nextstrain/zika/blob/c0b9a6d38af405324c968aed9922cc2b3d136db2/ingest/config/optional.yaml#L7
b6d3ccb
to
f66a276
Compare
Thanks for following up and addressing changes to match the pathogen-repo guide! I've made a couple more issues to track some remaining tasks: |
Description of proposed changes
Harmonize the ingest code organization with the pathogen repo guide
Related issue(s)
Checklist
ingest/config/defaults.yaml
file