Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add E gene trees #18

Closed
wants to merge 14 commits into from
Closed

Add E gene trees #18

wants to merge 14 commits into from

Conversation

j23414
Copy link
Contributor

@j23414 j23414 commented Jan 13, 2024

Description of proposed changes

The goal of this PR is to add dengue E gene trees (e.g. dengue_denv1_E.json, ... dengue_denv4_E.json) in response to feedback that certain locations may only provide the E sequences. To maintain clarity, E gene specific rules that differ from the standard genome rules are placed in separate *_E.smk files. Shared rules are consolidated using wildcards for streamlined implementation where possible.

General steps to support E gene trees are outlined as follows:

  1. Use newreference.py from the RSV pipeline to generate reference files (gb, fasta) specific to the E gene
  2. Use Nextclade v3 to extract E gene sequences from the full dataset
  3. Add a filter to exclude sequences shorter than 1000nt, reducing the risk of misclassification
  4. Add E gene rules as separate *_E.smk files
  5. Consolidate and merge redundant rules using Snakemake wildcards

Related issue(s)

#17

Checklist

  • Checks pass

@j23414 j23414 linked an issue Jan 13, 2024 that may be closed by this pull request
@j23414 j23414 changed the title Phylogenetic dir e Add E gene trees Jan 13, 2024
@j23414 j23414 requested a review from a team January 16, 2024 17:30
Copy link
Member

@corneliusroemer corneliusroemer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not quite obvious to me why the separate rules for E are necessary, e.g. export

What part of export differs between E and genome? Maybe I missed this but I tried to find differences and couldn't see any obvious ones.

@j23414
Copy link
Contributor Author

j23414 commented Jan 16, 2024

Thanks for checking! The distinction lies in the fact that the E gene build excludes the "augur clades" command, resulting in the absence of a clades_{serotype}_E.json file in the E gene export rule.

Instead, we'd rely on "augur traits" and a metadata column for annotating types and subtypes.

@corneliusroemer
Copy link
Member

corneliusroemer commented Jan 16, 2024

The distinction lies in the fact that the E gene build excludes the "augur clades" command, resulting in the absence of a clades_{serotype}_E.json file in the E gene export rule.

You could either create an empty dummy file in augur clades when augur clades is called for E or make the input conditional in export. While I'm in general very much in favor of low complexity of snakemake workflows and don't mind duplication if it reduces obscurity, here it should be straightforward to pack into a little bit of Python.

Do you see what I mean? You can embed a little shell script in the clades rule. I do that all the time, e.g. here I use a little bash if then to make the rule do different things based on a parameter. You could adapt this to test for he value of the wildcard.

rule preprocess_clades:
    input:
        clades="builds/clades{clade_type}.tsv",
        outgroup="profiles/clades/{build_name}/outgroup.tsv",
    output:
        clades="builds/{build_name}/clades{clade_type}.tsv",
    wildcard_constraints:
        clade_type=".*",  # Snakemake wildcard default is ".+" which doesn't match empty strings
    params:
        strain_set=lambda w: config["strainSet"][w.build_name],
    shell:
        """
        cp {input.clades} {output.clades};
        cat <(echo) {input.outgroup} >> {output.clades};
        if [ {params.strain_set} = 21L ]; then
            for clade in 19A 19B 20A 20B 20C 20D 20E 20F 20G 20H 20I \
                20J 21A 21B 21C 21D 21E 21F 21G 21H 21I 21J 21K 21M \
                Alpha Beta Gamma Delta Epsilon Eta Theta Iota Kappa Lambda Mu;
            do
                sed -i "/$clade/d" {output.clades};
            done
        fi
        """

@jameshadfield
Copy link
Member

Yeah I agree with Cornelius here.

or make the input conditional in export

As an example of this, you can provide a function instead of a list of node-data JSONs in export and then have the function generate the list of node data JSONs conditional on the wildcards. This has benefits when visualising the DAG, as the clades rule won't appear in the graph for the E gene target.

@j23414
Copy link
Contributor Author

j23414 commented Jan 16, 2024

Thanks @corneliusroemer and @jameshadfield! Explored various suggested approaches, outlined below:

  1. Adding a preprocess_clades rule: Noticed that not every subtype in clades_serotype.tsv has E gene defining mutations listed (e.g. DENV1/V, DENV2/AII). Therefore, achieving a comprehensive solution may involve specifically identifying the E gene mutations for each subtype. This could be a potential task for later exploration.
  2. Creating a dummy file in augur clades: Encountered a failure with "augur export" on empty clades_{serotype}_E.json files which is expected from prior github discussion threads. However, I didn't dig too deeply into determining the minimal content required for the empty file, hoping for a more elegant solution through an alternative approach.
  3. Defining a node_data_files function similar to hepB: Ended up using this approach. Instead of placing the wildcard pattern in a separate function, opted to define the wildcard conditional inline.

Defining the wildcard conditional inline had the added benefit of consolidating down the augur translate rule. Thanks for the suggested solutions! This is open to further review and discussion.

@joverlee521 joverlee521 self-requested a review January 16, 2024 23:57
@jameshadfield jameshadfield self-requested a review January 17, 2024 01:37
Copy link
Member

@jameshadfield jameshadfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I scanned through the commits and added a few comments, but they're relatively minor. The simplifications in e440c0d are really nice (that's the commit I focused on, but I did read the others).

I think there's some simplifications to be had around

ruleorder: nextclade3_cut_E > decompress
ruleorder: filter_E > align

but I didn't explore. Happy to look again if you'd like.

phylogenetic/rules/annotate_phylogeny.smk Show resolved Hide resolved
phylogenetic/bin/newreference.py Show resolved Hide resolved
phylogenetic/config/auspice_config_all.json Outdated Show resolved Hide resolved
phylogenetic/rules/prepare_sequences_E.smk Outdated Show resolved Hide resolved
@j23414 j23414 force-pushed the phylogenetic_dir_E branch 3 times, most recently from 3285195 to 3edb07d Compare May 1, 2024 23:38
j23414 added 13 commits May 2, 2024 11:12
* Autogenerate reference_dengue_serotype_E.gb and fasta files
* Add rules to prepare E sequences for phylogenetic analysis
* Use nextclade3 to align and cut out E sequences from sequences_all.fasta
Drop the augur clades call since clades.tsv includes mutations outside
the E gene. Pull type and subtype from metadata instead.
Use wildcard conditionals to further combine rules that take a different
set of input files for the genome and E gene builds.

Case 1: The E gene build excludes the "augur clades" command, resulting in the
absence of a clades_{serotype}_E.json file in the E gene export rule.

Case 2: The reference Genbank files for E genes are dynamically generated
in the results folder.
* Dropped "_dengue" from names, since it's redundant in the context of the project
* Added "_genome" to the end of the reference genome files, to parallel the "_E"
  at the end of the reference E sequence files.
…uses.

The purpose of this rule is to align sequences to an E gene reference sequence
and extract the E gene region, if any. The rule is renamed to reflect this.
@j23414
Copy link
Contributor Author

j23414 commented May 2, 2024

Placeholder of staged builds to help me visually check them, will update links to latest run soon.

genome E gene
all all/genome all/E
denv1 denv1/genome denv1/E
denv2 denv2/genome denv2/E
denv3 denv3/genome denv3/E
denv4 denv4/genome denv4/E

Move gene annotation to top of CDS to match other genbank files (denv1,3,4)
@j23414
Copy link
Contributor Author

j23414 commented May 23, 2024

This PR has been superseded by merged PRs:

Closing since no longer needed

@j23414 j23414 closed this May 23, 2024
@j23414 j23414 deleted the phylogenetic_dir_E branch May 23, 2024 22:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add E gene builds
4 participants