Skip to content

Commit

Permalink
This commit fixes an error in which gaps in the HA cleavage site were…
Browse files Browse the repository at this point in the history
… being filled in as Ns, then being inferred to match the reference sequence. This resulted in most H5N1 and H5Nx sequences being inferred to have a polybasic cleavage site. This was fixed by removing `--fill-gaps` in `augur align` and by adding `--keep-ambiguous` to `augur translate`. I then added in functionality to annotate the HA cleavage site sequence and infer whether a furin cleavage motif is present. This is now a new rule, `rule cleavage_site`, which calls `scripts/annotate-ha-cleavage-site.py`. `scripts/annotate-ha-cleavage-site.py` will read in the HA alignment, translate it, find the start of HA2 (which always begins with amino acids `GLFG`, and infer whether the preceding 4 amino acids contain a furin cleavage motif. Here, we define a furin cleavage motif as `R-X-K/R-R`, where `X` can be any amino acid. This script will output whether a furin cleavage motif is present or absent to `results/cleavage-site_{subtype}_ha.json` and the sequence of the cleavage site to `results/cleavage-site-sequencese_{subtype}_ha.json`. These are then fed to `rule export` so that they are displayed in auspice. This commit contains new auspice configs, `scripts/annotate-ha-cleavage-site.py`, the new `rule cleavage_site` in the Snakefile, and an updated README.md.
  • Loading branch information
lmoncla committed Nov 23, 2021
1 parent 2d6a8f1 commit 6426cca
Show file tree
Hide file tree
Showing 7 changed files with 167 additions and 17 deletions.
31 changes: 29 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,12 @@ All 32 builds (4 subtypes x 8 segments) can be build by running `snakemake`. For

Please see [nextstrain.org/docs](https://nextstrain.org/docs) for details about augur and pathogen builds.

# A note on clade labeling
# Features unique to avian flu builds

### cleavage site annotations
Influenza virus HA is translated as a single peptide (HA0) that is cleaved to form the mature, functional form (HA1 and HA2). In all human strains and many avian strains, the cleavage site is composed of a single, basic amino acid residue. However, some avian influenza subtypes, particularly H5s, have acquired additional basic residues immediately preceding the HA cleavage site. In some cases, this results in addition of a furin cleavage motif, allowing HA to be cleaved by furin, which is ubiquitously expressed, and allows for viral replication across a range of tissues. The addition of this "polybasic cleavage site" is one of the prime determinants of avian influenza virulence. In these builds, we have annotated whether strains contain a furin cleavage motif, defined here as the sequence `R-X-K/R-R` immediately preceding the start of HA2, where `X` can be any amino acid. We have also added a color by for the cleavage site sequence, which we define here as the 4 bases preceding HA2.

### clade labeling
H5 viruses are classified into clades, which are currently viewable as a color by on [nextstrain.org](https://nextstrain.org/flu/avian/h5n1/ha?c=h5_label_clade). Because clade annotations are not available in all public databases, we annotate sequences with their most likely clade using a tool developed by Samuel S. Shepard at CDC called [LABEL](https://wonder.cdc.gov/amd/flu/label/). The assigned clade for each H5N1 or H5Nx sequence are available as public tsvs [here](https://github.com/nextstrain/avian-flu/tree/master/clade-labeling).

To update the `clades.tsv` file with clade annotations for new sequences, run:
Expand All @@ -25,4 +30,26 @@ To run the builds on without updating the clades file, run:

To string these together and update the `clades.tsv` file for new sequences and then run the builds:

`snakemake -s -p Snakefile.clades --cores 1 && snakemake -p --cores 1`
`snakemake -s -p Snakefile.clades --cores 1 && snakemake -p --cores 1`

# To make your own, custom build
You are welcome to clone this repo and use it as a starting point for running your own, local builds if you'd like. The [Nextstrain docs](https://docs.nextstrain.org/en/latest/index.html) are a fantastic resource for getting started with the Nextstrain pipeline, and include some [great tutorials](https://docs.nextstrain.org/en/latest/install.html) to get you started. This build is slightly more complicated than other builds, and has a few custom functions in it to accommodate all the features on [nextstrain.org](https://nextstrain.org/flu/avian), and makes use of wildcards for both subtypes and gene segments. I originally used the [Lassa virus build](https://github.com/nextstrain/lassa) as a template for this one, because lassa also has a segmented genome. So if you'd like a simpler example, I would recommend looking through that build, then building up in complexity to the wildcard structure here. If you'd like to adapt the pipeline here to your own data, you would need to make a few changes and additions:

#### 1. fauna / RethinkDB credentials
This build starts by pulling sequences from our live [fauna](https://github.com/nextstrain/fauna) database (a RethinkDB
instance). This requires environment variables `RETHINK_HOST` and `RETHINK_AUTH_KEY` to be
set.

If you don't have access to our database, you can run the build using the example data
provided in this repository. Before running the build, copy the example sequences into the
`data/` directory like so:

```
mkdir data/
cp example_data/* data/
```

Then run the the build. If you'd like to consistently run your own data, then you can place your fasta file in `data`. Alternatively, you can alter the `Snakefile` to remove references to our database and add paths to your own files. To do this, remove `rule download`, add paths to your input data (sequences and metadata) in `rule files`, and add those paths as the input to `rule parse`.

#### 2. clade labeling
If you'd like to run clade labeling, you will need to install [LABEL](https://wonder.cdc.gov/amd/flu/label/) yourself. This pipeline assumes that LABEL is located in `avian-flu/flu-amd/`, and should work if you install it into the `avian-flu` directory. If you do not need to label clades, then you can delete `rule add_h5_clade`, the function `metadata_by_wildcards`. You will need to make sure that all references to `metadata` in the pipeline are referencing `metadata_subtype_segment`, not `metadata-with-clade_subtype_segment`, which is generated by `rule add_h5_clade` and adds a new column to the metadata file with clade information.
29 changes: 15 additions & 14 deletions Snakefile
Original file line number Diff line number Diff line change
Expand Up @@ -154,11 +154,11 @@ rule align:
--sequences {input.sequences} \
--reference-sequence {input.reference} \
--output {output.alignment} \
--fill-gaps \
--remove-reference \
--nthreads 1
"""


rule tree:
message: "Building tree"
input:
Expand Down Expand Up @@ -225,7 +225,8 @@ rule ancestral:
--tree {input.tree} \
--alignment {input.alignment} \
--output-node-data {output.node_data} \
--inference {params.inference}
--inference {params.inference}\
--keep-ambiguous
"""

rule translate:
Expand Down Expand Up @@ -264,21 +265,19 @@ rule traits:
--confidence
"""

rule reconstruct_translations:
message: "Reconstructing translations for DMS data view"
rule cleavage_site:
message: "determining sequences that harbor furin cleavage sites"
input:
tree = rules.refine.output.tree,
node_data = "results/aa-muts_{subtype}_{segment}.json",
alignment = "results/aligned_{subtype}_ha.fasta"
output:
aa_alignment = "results/aa-alignment_{subtype}_{segment}.fasta"
cleavage_site_annotations = "results/cleavage-site_{subtype}_ha.json",
cleavage_site_sequences = "results/cleavage-site-sequences_{subtype}_ha.json"
shell:
"""
augur reconstruct-sequences \
--tree {input.tree} \
--mutations {input.node_data} \
--gene PB2 \
--output {output.aa_alignment} \
--internal-nodes
python scripts/annotate-ha-cleavage-site.py \
--alignment {input.alignment} \
--furin_site_motif {output.cleavage_site_annotations} \
--cleavage_site_sequence {output.cleavage_site_sequences}
"""

rule export:
Expand All @@ -290,6 +289,8 @@ rule export:
traits = rules.traits.output.node_data,
nt_muts = rules.ancestral.output.node_data,
aa_muts = rules.translate.output.node_data,
fcs = rules.cleavage_site.output.cleavage_site_annotations,
cleavage_site_sequences = rules.cleavage_site.output.cleavage_site_sequences,
colors = files.colors,
lat_longs = files.lat_longs,
auspice_config = files.auspice_config
Expand All @@ -300,7 +301,7 @@ rule export:
augur export v2 \
--tree {input.tree} \
--metadata {input.metadata} \
--node-data {input.branch_lengths} {input.traits} {input.nt_muts} {input.aa_muts}\
--node-data {input.branch_lengths} {input.traits} {input.nt_muts} {input.aa_muts} {input.fcs} {input.cleavage_site_sequences}\
--colors {input.colors} \
--lat-longs {input.lat_longs} \
--auspice-config {input.auspice_config} \
Expand Down
10 changes: 10 additions & 0 deletions config/auspice_config_h5n1.json
Original file line number Diff line number Diff line change
Expand Up @@ -35,6 +35,16 @@
"title": "H5 clade",
"type": "categorical"
},
{
"key": "furin_cleavage_motif",
"title": "furin cleavage motif",
"type": "categorical"
},
{
"key": "cleavage_site_sequence",
"title": "cleavage site sequence",
"type": "categorical"
},
{
"key": "submitting_lab",
"title": "Submitting lab",
Expand Down
10 changes: 10 additions & 0 deletions config/auspice_config_h5nx.json
Original file line number Diff line number Diff line change
Expand Up @@ -40,6 +40,16 @@
"title": "H5 clade",
"type": "categorical"
},
{
"key": "furin_cleavage_motif",
"title": "furin cleavage motif",
"type": "categorical"
},
{
"key": "cleavage_site_sequence",
"title": "cleavage site sequence",
"type": "categorical"
},
{
"key": "submitting_lab",
"title": "Submitting lab",
Expand Down
12 changes: 11 additions & 1 deletion config/auspice_config_h7n9.json
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
"maintainers": [
{"name": "Louise Moncla", "url": "http://bedford.io/team/louise-moncla/"}
],
"build_url": "https://github.com/nextstrain/avian-flu",
"build_url": "https://github.com/nextstrain/avian-flu",
"colorings": [
{
"key": "gt",
Expand All @@ -30,6 +30,16 @@
"title": "Host",
"type": "categorical"
},
{
"key": "furin_cleavage_motif",
"title": "furin cleavage motif",
"type": "categorical"
},
{
"key": "cleavage_site_sequence",
"title": "cleavage site sequence",
"type": "categorical"
},
{
"key": "submitting_lab",
"title": "Submitting lab",
Expand Down
10 changes: 10 additions & 0 deletions config/auspice_config_h9n2.json
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,16 @@
"title": "Host",
"type": "categorical"
},
{
"key": "furin_cleavage_motif",
"title": "furin cleavage motif",
"type": "categorical"
},
{
"key": "cleavage_site_sequence",
"title": "cleavage site sequence",
"type": "categorical"
},
{
"key": "submitting_lab",
"title": "Submitting lab",
Expand Down
82 changes: 82 additions & 0 deletions scripts/annotate-ha-cleavage-site.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,82 @@
"""
This script will read in the HA alignment file, translate the sequence to amino acids,
find the beginning of HA2, and pull out the 4 amino acid sites immediately preceding HA2.
If HA2 is preceded immediately by amino acids R-X-K/R-R, then it is annotated
as having a furin cleavage motif. Otherwise, it is annotated as wild type. This produces 2
json files, one annotating a binary has a furin site or wild type site, the other coding
the actual sequence at the cleavage sites. These can both be used in augur export so
that the annotation and sequence show up in auspice as a color by.
"""

import Bio
from Bio import SeqIO
import json

import argparse
parser = argparse.ArgumentParser()

parser.add_argument('--alignment', type=str, help='alignment file output by rule augur align')
parser.add_argument('--furin_site_motif', type=str, help='name of output json file that annotates tips as having a furin cleavage site or a wt cleavage site')
parser.add_argument('--cleavage_site_sequence', type=str, help='name of output json file that annotates the cleavage site sequence for each tip')


args = parser.parse_args()
alignment = args.alignment
furin_site_motif_json = args.furin_site_motif
cleavage_site_sequence_json = args.cleavage_site_sequence


def output_furin_cleavage_site_jsons(alignment, output_json1, output_json2):

output_dict_furin = {"nodes":{}}
output_dict_seq = {"nodes":{}}

with open(output_json1, "w") as outfile:
outfile.write("")
with open(output_json2, "w") as outfile:
outfile.write("")


for seq in SeqIO.parse(alignment, "fasta"):

strain_name = seq.description

# convert gaps to Ns to avoid translation errors
sequence = str(seq.seq).upper().replace("-","N")

# convert back to sequence object
sequence = Bio.Seq.Seq(sequence)

# translate and find beginning of ha2
aa = str(sequence.translate())
ha2_begin = "GLFG"

start_pos_ha2 = aa.find(ha2_begin)

# define the furin site as the 4 positions prior to the start of HA2
furin_site = aa[start_pos_ha2-4:start_pos_ha2]

# if those 4 preceding amino acids have the pattern R-X-R/K-R, then it is cleavable
# by furin. Here, X is any amino acid (but not a gap), and the 3rd position can be
# K or R
if furin_site[0] == "R" and furin_site[3] == "R" and (furin_site[2] == "R" or furin_site[2]=="K") and furin_site[1]!="X":
furin_site_annotation = "present"
else:
furin_site_annotation = "absent"


output_dict_furin["nodes"][strain_name] = {"furin_cleavage_motif":furin_site_annotation}
output_dict_seq["nodes"][strain_name] = {"cleavage_site_sequence":furin_site.replace("X","-")}

f = open(output_json1, "w")
json.dump(output_dict_furin, f)
f.close()

f = open(output_json2, "w")
json.dump(output_dict_seq, f)
f.close()


output_furin_cleavage_site_jsons(alignment, furin_site_motif_json, cleavage_site_sequence_json)


0 comments on commit 6426cca

Please sign in to comment.