-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add E gene trees #18
Add E gene trees #18
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not quite obvious to me why the separate rules for E are necessary, e.g. export
What part of export differs between E and genome? Maybe I missed this but I tried to find differences and couldn't see any obvious ones.
Thanks for checking! The distinction lies in the fact that the E gene build excludes the "augur clades" command, resulting in the absence of a Instead, we'd rely on "augur traits" and a metadata column for annotating types and subtypes. |
You could either create an empty dummy file in Do you see what I mean? You can embed a little shell script in the clades rule. I do that all the time, e.g. here I use a little bash rule preprocess_clades:
input:
clades="builds/clades{clade_type}.tsv",
outgroup="profiles/clades/{build_name}/outgroup.tsv",
output:
clades="builds/{build_name}/clades{clade_type}.tsv",
wildcard_constraints:
clade_type=".*", # Snakemake wildcard default is ".+" which doesn't match empty strings
params:
strain_set=lambda w: config["strainSet"][w.build_name],
shell:
"""
cp {input.clades} {output.clades};
cat <(echo) {input.outgroup} >> {output.clades};
if [ {params.strain_set} = 21L ]; then
for clade in 19A 19B 20A 20B 20C 20D 20E 20F 20G 20H 20I \
20J 21A 21B 21C 21D 21E 21F 21G 21H 21I 21J 21K 21M \
Alpha Beta Gamma Delta Epsilon Eta Theta Iota Kappa Lambda Mu;
do
sed -i "/$clade/d" {output.clades};
done
fi
""" |
Yeah I agree with Cornelius here.
As an example of this, you can provide a function instead of a list of node-data JSONs in export and then have the function generate the list of node data JSONs conditional on the wildcards. This has benefits when visualising the DAG, as the clades rule won't appear in the graph for the E gene target. |
Thanks @corneliusroemer and @jameshadfield! Explored various suggested approaches, outlined below:
Defining the wildcard conditional inline had the added benefit of consolidating down the augur translate rule. Thanks for the suggested solutions! This is open to further review and discussion. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I scanned through the commits and added a few comments, but they're relatively minor. The simplifications in e440c0d are really nice (that's the commit I focused on, but I did read the others).
I think there's some simplifications to be had around
ruleorder: nextclade3_cut_E > decompress
ruleorder: filter_E > align
but I didn't explore. Happy to look again if you'd like.
313b394
to
a9b5a4c
Compare
0706ae0
to
6bf0d18
Compare
cc134af
to
3e0d85c
Compare
3285195
to
3edb07d
Compare
* Autogenerate reference_dengue_serotype_E.gb and fasta files * Add rules to prepare E sequences for phylogenetic analysis * Use nextclade3 to align and cut out E sequences from sequences_all.fasta
Drop the augur clades call since clades.tsv includes mutations outside the E gene. Pull type and subtype from metadata instead.
Use wildcard conditionals to further combine rules that take a different set of input files for the genome and E gene builds. Case 1: The E gene build excludes the "augur clades" command, resulting in the absence of a clades_{serotype}_E.json file in the E gene export rule. Case 2: The reference Genbank files for E genes are dynamically generated in the results folder.
* Dropped "_dengue" from names, since it's redundant in the context of the project * Added "_genome" to the end of the reference genome files, to parallel the "_E" at the end of the reference E sequence files.
…uses. The purpose of this rule is to align sequences to an E gene reference sequence and extract the E gene region, if any. The rule is renamed to reflect this.
Placeholder of staged builds to help me visually check them, will update links to latest run soon.
|
Move gene annotation to top of CDS to match other genbank files (denv1,3,4)
This PR has been superseded by merged PRs: Closing since no longer needed |
Description of proposed changes
The goal of this PR is to add dengue E gene trees (e.g.
dengue_denv1_E.json
, ...dengue_denv4_E.json
) in response to feedback that certain locations may only provide the E sequences. To maintain clarity, E gene specific rules that differ from the standard genome rules are placed in separate*_E.smk
files. Shared rules are consolidated using wildcards for streamlined implementation where possible.General steps to support E gene trees are outlined as follows:
*_E.smk
filesRelated issue(s)
#17
Checklist