Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Harmonize with pathogen repo guide #31

Merged
merged 7 commits into from
Feb 6, 2024

Conversation

j23414
Copy link
Contributor

@j23414 j23414 commented Jan 22, 2024

Description of proposed changes

Harmonize the ingest code organization with the pathogen repo guide

Related issue(s)

Checklist

  • Checks pass

@j23414 j23414 linked an issue Jan 22, 2024 that may be closed by this pull request
5 tasks
@j23414 j23414 marked this pull request as draft January 22, 2024 19:14
@j23414 j23414 self-assigned this Jan 22, 2024

REQUIRED INPUTS:

usvi_sequences = data/sequences_usvi.fasta
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not blocking

Is our thinking to use this docstring-like approach to document each rule file? A brief description of intent seems helpful but detailing the inputs & outputs seems prone to falling out of sync

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look and yes, you are correct!

The rational is to target the docstring for a high-level summary of the rule file's objectives. Users can delve into specific rule docstrings for more in-depth information. We'd aim at keeping the documented inputs and outputs up to date as part of any future modification, a bit similar to maintaining an API.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would advise reconsidering this. The snakemake code documents the inputs/outputs, duplicating these in the ~docstring is going to fall out of sync. Completely up to you & Jover -- just a suggestion!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good point! We were also considering just moving the "append_usvi" rule into "prepare_sequences.smk" to bypass having to track anther file. Do you have a preference for keeping it separate or just merging it among the "prepare_sequences" rules?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No preference. In general I'd see the naming of the rules as pathogen-dependent; if the bioinformatician feels it's appropriate to carve certain rules out into a new .smk file then I think that's just fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm also worried about inputs/outputs getting out of sync, but I'm not sure how else to give people a high level view of how the smaller .smk files are connected.

The goal is for these required inputs/outputs to rarely change since they will become the standard connection points.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reading of the comments in the template's (otherwise-empty) snakemake rule files was that they provided a really nice overview of what the ruleset should accomplish, including the expected inputs and outputs¹, but that the docstring paths would be removed by the author when they actually add the rules.

¹ For the simple case -- for builds using wildcards the author will have to understand that these paths need modifying.

@j23414 j23414 force-pushed the harmonize-with-pathogen-repo-template branch from 29eca1c to 8db1278 Compare January 22, 2024 20:04
@j23414 j23414 marked this pull request as ready for review January 22, 2024 20:07
@j23414 j23414 requested a review from a team January 22, 2024 23:01
Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for following up with the additional changes here @j23414!


### Integration of USVI data

This Zika build incorporates data from https://github.com/blab/zika-usvi/. The sequences and metadata for USVI from that GitHub repository have undergone curation and were uploaded to https://github.com/nextstrain/fauna. Subsequently, they were downloaded as sequences and metadata, and a filter was applied to include only those records not yet submitted to NCBI GenBank. The resulting records are now available as a pair of metadata and sequences files in this directory.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be helpful to add a little more detail here.

This is my best guess looking through git history:

  1. Sequences were uploaded to the fauna database following these instructions.
  2. Sequences were downloaded from the fauna database following these instructions.
  3. It's not clear to me what filter was applied to include only records not yet submitted to NCBI

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I should have stated that I used blastn to identify USVI records not yet submitted to NCBI. Not sure how much detail I should give here.

For point 3

  1. Run zika ingest to get a sequences.fasta file from GenBank
  2. Blast fauna's zika.fasta file against the sequences.fasta file and pull any records that are not a 100% match
GENBANK_SEQUENCES=sequences.fasta
FAUNA_SEQUENCES=zika.fasta

# Create a local blast database
makeblastdb \
  -in ${GENBANK_SEQUENCES} \  
  -dbtype nucl

# Blast fauna against GenBank
blastn \
  -db ${GENBANK_SEQUENCES} \
  -query ${FAUNA_SEQUENCES} \
  -num_alignments 1 \
  -outfmt 6 \
  -out blast_output.txt

# USVI strains that 
# + match at 100% 
# + match at least a 5000nt region (to filter out short substring matches)
cat blast_output.txt \
| awk -F'\t' '$1~"USVI" && $3>=100 && $4>5000 , OFS="\t" {print $1}' \
> USVI_100_match.txt

less USVI_100_match.txt 
# USVI/5/2016|zika|MW165881|2016-10-17|north_america|usvi|saint_thomas|saint_thomas|genbank|genome|Santiago
# USVI/43/2016|zika|MW165884|2016-07-19|north_america|usvi|saint_thomas|saint_thomas|genbank|genome|Santiago
# USVI/4/2016|zika|MW165880|2016-10-14|north_america|usvi|saint_thomas|saint_thomas|genbank|genome|Santiago
# USVI/35/2016|zika|MW165883|2016-09-08|north_america|usvi|saint_thomas|saint_thomas|genbank|genome|Santiago
# USVI/25/2016|zika|MW165882|2016-09-27|north_america|usvi|saint_thomas|saint_thomas|genbank|genome|Santiago
# USVI strains that are not in the 100 match list
cat blast_output.txt \
| awk -F'\t' '$1~"USVI" , OFS="\t" {print}' \
| grep -Fvf USVI_100_match.txt \
| awk -F'\t' '{print $1}' \
> USVI_not_match.txt

head USVI_not_match.txt
# USVI/12/2016|zika|VI12|2016-11-04|north_america|usvi|saint_croix|saint_croix|fh|genome|Black
# USVI/12/2016|zika|VI12|2016-11-04|north_america|usvi|saint_croix|saint_croix|fh|genome|Black
# USVI/12/2016|zika|VI12|2016-11-04|north_america|usvi|saint_croix|saint_croix|fh|genome|Black
# USVI/11/2016|zika|VI11|2016-03-22|north_america|usvi|saint_thomas|saint_thomas|fh|genome|Black
# USVI/11/2016|zika|VI11|2016-03-22|north_america|usvi|saint_thomas|saint_thomas|fh|genome|Black
# ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's helpful to keep as much level of detail as needed to be able to reproduce the metadata_usvi.tsv and sequences_usvi.fasta files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added documentation to reproduce the metadata_usvi.tsv and sequences_usvi.fasta files in more detail here: f9eff33.

In the end, I regenerated the two files (albeit in slightly different order) but then subsequently aligned pre and post fasta files against each other to ensure parity.


REQUIRED INPUTS:

usvi_sequences = data/sequences_usvi.fasta
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm also worried about inputs/outputs getting out of sync, but I'm not sure how else to give people a high level view of how the smaller .smk files are connected.

The goal is for these required inputs/outputs to rarely change since they will become the standard connection points.

@j23414 j23414 force-pushed the harmonize-with-pathogen-repo-template branch from 2f914a1 to 978a4ac Compare January 25, 2024 01:08
@j23414 j23414 force-pushed the harmonize-with-pathogen-repo-template branch 2 times, most recently from dd74ddf to f9eff33 Compare January 30, 2024 20:22
@j23414
Copy link
Contributor Author

j23414 commented Feb 1, 2024

As part of addressing this comment, I updated ingest but forgot to update phylogentic. Fixed in b6d3ccb

@j23414 j23414 changed the title Harmonize with pathogen repo template Harmonize with pathogen repo guide Feb 5, 2024
Document provenance of the USVI data that is merged into the Zika live site.
@j23414 j23414 force-pushed the harmonize-with-pathogen-repo-template branch from b6d3ccb to f66a276 Compare February 6, 2024 00:04
@j23414 j23414 merged commit c710ca1 into main Feb 6, 2024
8 checks passed
@j23414 j23414 deleted the harmonize-with-pathogen-repo-template branch February 6, 2024 00:19
@joverlee521
Copy link
Contributor

Thanks for following up and addressing changes to match the pathogen-repo guide!

I've made a couple more issues to track some remaining tasks:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Harmonize with pathogen-repo-guide
3 participants