Add E gene trees #18

j23414 · 2024-01-13T00:17:30Z

Description of proposed changes

The goal of this PR is to add dengue E gene trees (e.g. dengue_denv1_E.json, ... dengue_denv4_E.json) in response to feedback that some locations may provide only E gene sequences. To maintain clarity, E gene-specific rules that differ from the standard genome rules are placed in separate *_E.smk files. Where possible, shared rules are consolidated using wildcards.

General steps to support E gene trees are outlined as follows:

* Use newreference.py from the RSV pipeline to generate reference files (gb, fasta) specific to the E gene
* Use Nextclade v3 to extract E gene sequences from the full dataset
* Add a filter to exclude sequences shorter than 1000 nt, reducing the risk of misclassification
* Add E gene rules as separate *_E.smk files
* Consolidate and merge redundant rules using Snakemake wildcards
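
To illustrate the length filter step, here is a minimal standalone sketch in plain Python. It is not the rule used in this PR. The helper names (read_fasta, count_acgt, filter_by_length) are hypothetical, and counting only A/C/G/T toward the 1000 nt threshold is an assumption, not something stated above.

```python
from typing import Dict, Iterable, Iterator, Tuple

MIN_LENGTH = 1000  # threshold from the PR description, in nucleotides


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def count_acgt(seq: str) -> int:
    """Count unambiguous bases (assumption: N and gaps do not count)."""
    return sum(1 for base in seq.upper() if base in "ACGT")


def filter_by_length(
    records: Iterable[Tuple[str, str]], min_length: int = MIN_LENGTH
) -> Dict[str, str]:
    """Keep only records with at least min_length unambiguous bases."""
    return {name: seq for name, seq in records if count_acgt(seq) >= min_length}


# One full-length E-like record and one record that is mostly N
example = ">long\n" + "ACGT" * 250 + "\n>short\n" + "ACGT" * 100 + "N" * 700 + "\n"
kept = filter_by_length(read_fasta(example))
print(sorted(kept))  # ['long']
```

The second record is 1100 characters long but has only 400 unambiguous bases, so it is dropped. Whether ambiguous bases should count toward the threshold is a design choice worth stating explicitly in the rule.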

Related issue(s)

#17

Checklist

Checks pass

phylogenetic/config/auspice_config_all_E.json

corneliusroemer

It's not quite obvious to me why the separate rules for E are necessary, e.g. export

What part of export differs between E and genome? Maybe I missed this but I tried to find differences and couldn't see any obvious ones.

j23414 · 2024-01-16T18:38:07Z

Thanks for checking! The distinction lies in the fact that the E gene build excludes the "augur clades" command, resulting in the absence of a clades_{serotype}_E.json file in the E gene export rule.

Instead, we'd rely on "augur traits" and a metadata column for annotating types and subtypes.

corneliusroemer · 2024-01-16T18:51:28Z

> The distinction lies in the fact that the E gene build excludes the "augur clades" command, resulting in the absence of a clades_{serotype}_E.json file in the E gene export rule.

You could either create an empty dummy file in augur clades when augur clades is called for E or make the input conditional in export. While I'm in general very much in favor of low complexity of snakemake workflows and don't mind duplication if it reduces obscurity, here it should be straightforward to pack into a little bit of Python.

Do you see what I mean? You can embed a little shell script in the clades rule. I do that all the time, e.g. here I use a little bash if/then to make the rule do different things based on a parameter. You could adapt this to test for the value of the wildcard.

rule preprocess_clades:
    input:
        clades="builds/clades{clade_type}.tsv",
        outgroup="profiles/clades/{build_name}/outgroup.tsv",
    output:
        clades="builds/{build_name}/clades{clade_type}.tsv",
    wildcard_constraints:
        clade_type=".*",  # Snakemake wildcard default is ".+" which doesn't match empty strings
    params:
        strain_set=lambda w: config["strainSet"][w.build_name],
    shell:
        """
        cp {input.clades} {output.clades};
        cat <(echo) {input.outgroup} >> {output.clades};
        if [ {params.strain_set} = 21L ]; then
            for clade in 19A 19B 20A 20B 20C 20D 20E 20F 20G 20H 20I \
                20J 21A 21B 21C 21D 21E 21F 21G 21H 21I 21J 21K 21M \
                Alpha Beta Gamma Delta Epsilon Eta Theta Iota Kappa Lambda Mu;
            do
                sed -i "/$clade/d" {output.clades};
            done
        fi
        """

jameshadfield · 2024-01-16T19:58:33Z

Yeah I agree with Cornelius here.

> or make the input conditional in export

As an example of this, you can provide a function instead of a list of node-data JSONs in export and then have the function generate the list of node data JSONs conditional on the wildcards. This has benefits when visualising the DAG, as the clades rule won't appear in the graph for the E gene target.
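
A rough sketch of what such an input function could look like, written as plain Python so it can be run outside Snakemake. The file paths, the "gene" and "serotype" wildcard names, and the node_data_files name are assumptions for illustration, not the paths used in this repository.

```python
from types import SimpleNamespace


def node_data_files(wildcards):
    """Return the node-data JSONs for export, conditional on the gene wildcard.

    All paths below are hypothetical placeholders.
    """
    prefix = f"results/{wildcards.gene}"
    files = [
        f"{prefix}/branch-lengths_{wildcards.serotype}.json",
        f"{prefix}/traits_{wildcards.serotype}.json",
        f"{prefix}/nt-muts_{wildcards.serotype}.json",
        f"{prefix}/aa-muts_{wildcards.serotype}.json",
    ]
    # Only the genome build runs "augur clades", so only it has a clades JSON.
    if wildcards.gene == "genome":
        files.append(f"{prefix}/clades_{wildcards.serotype}.json")
    return files


# In a Snakefile this would be used as:
#     rule export:
#         input:
#             node_data=node_data_files,
# Here SimpleNamespace stands in for the Snakemake wildcards object.
genome = node_data_files(SimpleNamespace(serotype="denv1", gene="genome"))
e_gene = node_data_files(SimpleNamespace(serotype="denv1", gene="E"))
print(len(genome), len(e_gene))  # 5 4
```

Because the clades JSON is never requested for the E gene target, Snakemake has no reason to schedule the clades rule for it, which is the DAG benefit mentioned above.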

j23414 · 2024-01-16T23:29:19Z

Thanks @corneliusroemer and @jameshadfield! Explored various suggested approaches, outlined below:

* Adding a preprocess_clades rule: Noticed that not every subtype in clades_serotype.tsv has E gene defining mutations listed (e.g. DENV1/V, DENV2/AII). Therefore, achieving a comprehensive solution may involve specifically identifying the E gene mutations for each subtype. This could be a potential task for later exploration.
* Creating a dummy file in augur clades: Encountered a failure with "augur export" on empty clades_{serotype}_E.json files, which is expected from prior GitHub discussion threads. However, I didn't dig too deeply into determining the minimal content required for the empty file, hoping for a more elegant solution through an alternative approach.
* Defining a node_data_files function similar to hepB: Ended up using this approach. Instead of placing the wildcard pattern in a separate function, opted to define the wildcard conditional inline.

Defining the wildcard conditional inline had the added benefit of consolidating down the augur translate rule. Thanks for the suggested solutions! This is open to further review and discussion.
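
For readers unfamiliar with the inline style, here is a hedged sketch of an inline conditional that picks the reference file for a shared translate rule. The directory names and wildcard names are assumptions. The only detail taken from this PR is that genome references end in "_genome" and that E gene references are generated into the results folder.

```python
from types import SimpleNamespace

# Inline conditional as it might appear directly under "input:" in a rule.
# Paths are hypothetical placeholders.
reference_input = lambda wildcards: (
    f"config/reference_{wildcards.serotype}_genome.gb"
    if wildcards.gene == "genome"
    else f"results/config/reference_{wildcards.serotype}_{wildcards.gene}.gb"
)

# In a Snakefile:
#     rule translate:
#         input:
#             reference=lambda wildcards: (... same expression as above ...),
# SimpleNamespace stands in for the Snakemake wildcards object.
print(reference_input(SimpleNamespace(serotype="denv2", gene="genome")))
print(reference_input(SimpleNamespace(serotype="denv2", gene="E")))
```

The trade-off compared with a named function is that the inline form keeps the condition next to the rule that uses it, at the cost of repeating the expression if several rules need the same choice.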

jameshadfield

I scanned through the commits and added a few comments, but they're relatively minor. The simplifications in e440c0d are really nice (that's the commit I focused on, but I did read the others).

I think there are some simplifications to be had around

ruleorder: nextclade3_cut_E > decompress
ruleorder: filter_E > align

but I didn't explore. Happy to look again if you'd like.

Use wildcard conditionals to further combine rules that take a different set of input files for the genome and E gene builds. Case 1: The E gene build excludes the "augur clades" command, resulting in the absence of a clades_{serotype}_E.json file in the E gene export rule. Case 2: The reference Genbank files for E genes are dynamically generated in the results folder.

j23414 · 2024-05-02T20:57:04Z

Placeholder of staged builds to help me visually check them, will update links to latest run soon.

	genome	E gene
all	all/genome	all/E
denv1	denv1/genome	denv1/E
denv2	denv2/genome	denv2/E
denv3	denv3/genome	denv3/E
denv4	denv4/genome	denv4/E

j23414 · 2024-05-23T22:04:43Z

This PR has been superseded by merged PRs:

Closing since no longer needed

j23414 linked an issue Jan 13, 2024 that may be closed by this pull request

Add E gene builds #17

Closed

j23414 changed the title ~~Phylogenetic dir e~~ Add E gene trees Jan 13, 2024

j23414 requested a review from a team January 16, 2024 17:30

corneliusroemer reviewed Jan 16, 2024

phylogenetic/config/auspice_config_all_E.json (outdated)

corneliusroemer reviewed Jan 16, 2024

joverlee521 self-requested a review January 16, 2024 23:57

jameshadfield self-requested a review January 17, 2024 01:37

jameshadfield reviewed Jan 17, 2024

phylogenetic/rules/annotate_phylogeny.smk

phylogenetic/bin/newreference.py

phylogenetic/config/auspice_config_all.json (outdated)

phylogenetic/rules/prepare_sequences_E.smk (outdated)

This was referenced Feb 6, 2024

Split by dengue serotype (denv1-denv4) #19

Closed

Harmonize with pathogen repo guide #22

Closed

Fix: update dropped strains file to list accession instead of strain names #26

Merged

j23414 force-pushed the phylogenetic_dir_E branch 2 times, most recently from 313b394 to a9b5a4c Compare February 13, 2024 22:45

j23414 force-pushed the phylogenetic_dir_E branch 3 times, most recently from 0706ae0 to 6bf0d18 Compare February 26, 2024 18:18

This was referenced Mar 8, 2024

DRAFT: add 'gene' feature capture when reading genbank files nextstrain/augur#1435

Draft

Generalize the "extend-metadata.py" script for any {gene}_coverage columns nextstrain/rsv#57

Open

This was referenced Mar 18, 2024

Harmonize ingest with pathogen repo guide #35

Merged

Add "--start" and "--end" arguments to newreference.py to allow for creating subgenic trees nextstrain/rsv#58

Open

Add gene coverage columns during ingest workflow #36

Merged

kimandrews mentioned this pull request Mar 22, 2024

Make tree for 450bp of the N gene ("N450") nextstrain/measles#20

Merged

1 task

joverlee521 mentioned this pull request Apr 5, 2024

Automate ingest and phylogenetic workflows #38

Merged

2 tasks

This was referenced Apr 25, 2024

Instead of '?', set default gene_coverage to 0 #44

Merged

Dynamically generate the auspice config files #45

Merged

j23414 force-pushed the phylogenetic_dir_E branch from cc134af to 3e0d85c Compare April 30, 2024 21:52

j23414 force-pushed the phylogenetic_dir_E branch 3 times, most recently from 3285195 to 3edb07d Compare May 1, 2024 23:38

j23414 added 13 commits May 2, 2024 11:12

Copy newreference script from RSV

bebcf2f

Prepare E sequences for phylogenetic analysis

ce642e2

* Autogenerate reference_dengue_serotype_E.gb and fasta files
* Add rules to prepare E sequences for phylogenetic analysis
* Use nextclade3 to align and cut out E sequences from sequences_all.fasta

Construct phylogeny for E gene

8a5835d

Generate node data for E gene trees

6ca8224

Drop the augur clades call since clades.tsv includes mutations outside the E gene. Pull type and subtype from metadata instead.

Export E gene builds

8caa412

Include root sequences in E build

044145b

Update snakemake targets

bd8beb6

Use wildcards and combine redundant rules

4cf6b1c

Consistent naming of reference genome files

ab5dff4

* Dropped "_dengue" from names, since it's redundant in the context of the project
* Added "_genome" to the end of the reference genome files, to parallel the "_E" at the end of the reference E sequence files.

Rename rule to describe what it does instead of the internal tool it …

6a5cd62

…uses. The purpose of this rule is to align sequences to an E gene reference sequence and extract the E gene region, if any. The rule is renamed to reflect this.

Exclude outliers

05d58e8

AWS cores

baf71f7

j23414 force-pushed the phylogenetic_dir_E branch from 3edb07d to baf71f7 Compare May 2, 2024 18:13

fixup: denv2 reference genbank

7d8c420

Move gene annotation to top of CDS to match other genbank files (denv1,3,4)

This was referenced May 8, 2024

Generate gene reference files #47

Merged

Use gene reference files to generate E gene trees #48

Merged

j23414 closed this May 23, 2024

j23414 deleted the phylogenetic_dir_E branch May 23, 2024 22:05
