This commit fixes an error in which gaps in the HA cleavage site were…

… being filled in as Ns, then being inferred to match the reference sequence. This resulted in most H5N1 and H5Nx sequences being inferred to have a polybasic cleavage site. This was fixed by removing `--fill-gaps` in `augur align` and by adding `--keep-ambiguous` to `augur translate`. I then added in functionality to annotate the HA cleavage site sequence and infer whether a furin cleavage motif is present. This is now a new rule, `rule cleavage_site`, which calls `scripts/annotate-ha-cleavage-site.py`. `scripts/annotate-ha-cleavage-site.py` will read in the HA alignment, translate it, find the start of HA2 (which always begins with amino acids `GLFG`, and infer whether the preceding 4 amino acids contain a furin cleavage motif. Here, we define a furin cleavage motif as `R-X-K/R-R`, where `X` can be any amino acid. This script will output whether a furin cleavage motif is present or absent to `results/cleavage-site_{subtype}_ha.json` and the sequence of the cleavage site to `results/cleavage-site-sequencese_{subtype}_ha.json`. These are then fed to `rule export` so that they are displayed in auspice. This commit contains new auspice configs, `scripts/annotate-ha-cleavage-site.py`, the new `rule cleavage_site` in the Snakefile, and an updated README.md.
nextstrain · Nov 23, 2021 · 6426cca · 6426cca
1 parent 2d6a8f1
commit 6426cca
Show file tree

Hide file tree

Showing 7 changed files with 167 additions and 17 deletions.
diff --git a/README.md b/README.md
@@ -12,7 +12,12 @@ All 32 builds (4 subtypes x 8 segments) can be build by running `snakemake`. For
 
 Please see [nextstrain.org/docs](https://nextstrain.org/docs) for details about augur and pathogen builds.
 
-# A note on clade labeling
+# Features unique to avian flu builds
+
+### cleavage site annotations 
+Influenza virus HA is translated as a single peptide (HA0) that is cleaved to form the mature, functional form (HA1 and HA2). In all human strains and many avian strains, the cleavage site is composed of a single, basic amino acid residue. However, some avian influenza subtypes, particularly H5s, have acquired additional basic residues immediately preceding the HA cleavage site. In some cases, this results in addition of a furin cleavage motif, allowing HA to be cleaved by furin, which is ubiquitously expressed, and allows for viral replication across a range of tissues. The addition of this "polybasic cleavage site" is one of the prime determinants of avian influenza virulence. In these builds, we have annotated whether strains contain a furin cleavage motif, defined here as the sequence `R-X-K/R-R` immediately preceding the start of HA2, where `X` can be any amino acid. We have also added a color by for the cleavage site sequence, which we define here as the 4 bases preceding HA2. 
+
+### clade labeling
 H5 viruses are classified into clades, which are currently viewable as a color by on [nextstrain.org](https://nextstrain.org/flu/avian/h5n1/ha?c=h5_label_clade). Because clade annotations are not available in all public databases, we annotate sequences with their most likely clade using a tool developed by Samuel S. Shepard at CDC called [LABEL](https://wonder.cdc.gov/amd/flu/label/). The assigned clade for each H5N1 or H5Nx sequence are available as public tsvs [here](https://github.com/nextstrain/avian-flu/tree/master/clade-labeling).
 
 To update the `clades.tsv` file with clade annotations for new sequences, run: 
@@ -25,4 +30,26 @@ To run the builds on without updating the clades file, run:
 
 To string these together and update the `clades.tsv` file for new sequences and then run the builds: 
 
-`snakemake -s -p Snakefile.clades --cores 1 && snakemake -p --cores 1`
+`snakemake -s -p Snakefile.clades --cores 1 && snakemake -p --cores 1`
+
+# To make your own, custom build
+You are welcome to clone this repo and use it as a starting point for running your own, local builds if you'd like. The [Nextstrain docs](https://docs.nextstrain.org/en/latest/index.html) are a fantastic resource for getting started with the Nextstrain pipeline, and include some [great tutorials](https://docs.nextstrain.org/en/latest/install.html) to get you started. This build is slightly more complicated than other builds, and has a few custom functions in it to accommodate all the features on [nextstrain.org](https://nextstrain.org/flu/avian), and makes use of wildcards for both subtypes and gene segments. I originally used the [Lassa virus build](https://github.com/nextstrain/lassa) as a template for this one, because lassa also has a segmented genome. So if you'd like a simpler example, I would recommend looking through that build, then building up in complexity to the wildcard structure here. If you'd like to adapt the pipeline here to your own data, you would need to make a few changes and additions:
+
+#### 1. fauna / RethinkDB credentials
+This build starts by pulling sequences from our live [fauna](https://github.com/nextstrain/fauna) database (a RethinkDB
+instance). This requires environment variables `RETHINK_HOST` and `RETHINK_AUTH_KEY` to be
+set.
+
+If you don't have access to our database, you can run the build using the example data
+provided in this repository. Before running the build, copy the example sequences into the
+`data/` directory like so:
+
+```
+mkdir data/
+cp example_data/* data/
+```
+
+Then run the the build. If you'd like to consistently run your own data, then you can place your fasta file in `data`. Alternatively, you can alter the `Snakefile` to remove references to our database and add paths to your own files. To do this, remove `rule download`, add paths to your input data (sequences and metadata) in `rule files`, and add those paths as the input to `rule parse`. 
+
+#### 2. clade labeling 
+If you'd like to run clade labeling, you will need to install [LABEL](https://wonder.cdc.gov/amd/flu/label/) yourself. This pipeline assumes that LABEL is located in `avian-flu/flu-amd/`, and should work if you install it into the `avian-flu` directory. If you do not need to label clades, then you can delete `rule add_h5_clade`, the function `metadata_by_wildcards`. You will need to make sure that all references to `metadata` in the pipeline are referencing `metadata_subtype_segment`, not `metadata-with-clade_subtype_segment`, which is generated by `rule add_h5_clade` and adds a new column to the metadata file with clade information. 
diff --git a/Snakefile b/Snakefile
@@ -154,11 +154,11 @@ rule align:
             --sequences {input.sequences} \
             --reference-sequence {input.reference} \
             --output {output.alignment} \
-            --fill-gaps \
             --remove-reference \
             --nthreads 1
         """
 
+
 rule tree:
     message: "Building tree"
     input:
@@ -225,7 +225,8 @@ rule ancestral:
             --tree {input.tree} \
             --alignment {input.alignment} \
             --output-node-data {output.node_data} \
-            --inference {params.inference}
+            --inference {params.inference}\
+            --keep-ambiguous
         """
 
 rule translate:
@@ -264,21 +265,19 @@ rule traits:
             --confidence
         """
 
-rule reconstruct_translations:
-    message: "Reconstructing translations for DMS data view"
+rule cleavage_site:
+    message: "determining sequences that harbor furin cleavage sites"
     input:
-        tree = rules.refine.output.tree,
-        node_data = "results/aa-muts_{subtype}_{segment}.json",
+        alignment = "results/aligned_{subtype}_ha.fasta"
     output:
-        aa_alignment = "results/aa-alignment_{subtype}_{segment}.fasta"
+        cleavage_site_annotations = "results/cleavage-site_{subtype}_ha.json",
+        cleavage_site_sequences = "results/cleavage-site-sequences_{subtype}_ha.json"
     shell:
         """
-        augur reconstruct-sequences \
-            --tree {input.tree} \
-            --mutations {input.node_data} \
-            --gene PB2 \
-            --output {output.aa_alignment} \
-            --internal-nodes
+        python scripts/annotate-ha-cleavage-site.py \
+            --alignment {input.alignment} \
+            --furin_site_motif {output.cleavage_site_annotations} \
+            --cleavage_site_sequence {output.cleavage_site_sequences}
         """
 
 rule export:
@@ -290,6 +289,8 @@ rule export:
         traits = rules.traits.output.node_data,
         nt_muts = rules.ancestral.output.node_data,
         aa_muts = rules.translate.output.node_data,
+        fcs = rules.cleavage_site.output.cleavage_site_annotations,
+        cleavage_site_sequences = rules.cleavage_site.output.cleavage_site_sequences,
         colors = files.colors,
         lat_longs = files.lat_longs,
         auspice_config = files.auspice_config
@@ -300,7 +301,7 @@ rule export:
         augur export v2 \
             --tree {input.tree} \
             --metadata {input.metadata} \
-            --node-data {input.branch_lengths} {input.traits} {input.nt_muts} {input.aa_muts}\
+            --node-data {input.branch_lengths} {input.traits} {input.nt_muts} {input.aa_muts} {input.fcs} {input.cleavage_site_sequences}\
             --colors {input.colors} \
             --lat-longs {input.lat_longs} \
             --auspice-config {input.auspice_config} \

diff --git a/config/auspice_config_h5n1.json b/config/auspice_config_h5n1.json
@@ -35,6 +35,16 @@
       "title": "H5 clade",
       "type": "categorical"
     },
+    {
+      "key": "furin_cleavage_motif",
+      "title": "furin cleavage motif",
+      "type": "categorical"
+    },
+    {
+      "key": "cleavage_site_sequence",
+      "title": "cleavage site sequence",
+      "type": "categorical"
+    },
     {
       "key": "submitting_lab",
       "title": "Submitting lab",

diff --git a/config/auspice_config_h5nx.json b/config/auspice_config_h5nx.json
@@ -40,6 +40,16 @@
       "title": "H5 clade",
       "type": "categorical"
     },
+    {
+      "key": "furin_cleavage_motif",
+      "title": "furin cleavage motif",
+      "type": "categorical"
+    },
+    {
+      "key": "cleavage_site_sequence",
+      "title": "cleavage site sequence",
+      "type": "categorical"
+    },
     {
       "key": "submitting_lab",
       "title": "Submitting lab",

diff --git a/config/auspice_config_h7n9.json b/config/auspice_config_h7n9.json
@@ -3,7 +3,7 @@
   "maintainers": [
     {"name": "Louise Moncla", "url": "http://bedford.io/team/louise-moncla/"}
   ],
-  "build_url": "https://github.com/nextstrain/avian-flu",  
+  "build_url": "https://github.com/nextstrain/avian-flu",
   "colorings": [
     {
       "key": "gt",
@@ -30,6 +30,16 @@
       "title": "Host",
       "type": "categorical"
     },
+    {
+      "key": "furin_cleavage_motif",
+      "title": "furin cleavage motif",
+      "type": "categorical"
+    },
+    {
+      "key": "cleavage_site_sequence",
+      "title": "cleavage site sequence",
+      "type": "categorical"
+    },
     {
       "key": "submitting_lab",
       "title": "Submitting lab",

diff --git a/config/auspice_config_h9n2.json b/config/auspice_config_h9n2.json
@@ -30,6 +30,16 @@
       "title": "Host",
       "type": "categorical"
     },
+    {
+      "key": "furin_cleavage_motif",
+      "title": "furin cleavage motif",
+      "type": "categorical"
+    },
+    {
+      "key": "cleavage_site_sequence",
+      "title": "cleavage site sequence",
+      "type": "categorical"
+    },
     {
       "key": "submitting_lab",
       "title": "Submitting lab",

diff --git a/scripts/annotate-ha-cleavage-site.py b/scripts/annotate-ha-cleavage-site.py
@@ -0,0 +1,82 @@
+"""
+This script will read in the HA alignment file, translate the sequence to amino acids, 
+find the beginning of HA2, and pull out the 4 amino acid sites immediately preceding HA2. 
+If HA2 is preceded immediately by amino acids R-X-K/R-R, then it is annotated
+as having a furin cleavage motif. Otherwise, it is annotated as wild type. This produces 2
+json files, one annotating a binary has a furin site or wild type site, the other coding 
+the actual sequence at the cleavage sites. These can both be used in augur export so 
+that the annotation and sequence show up in auspice as a color by.  
+"""
+
+import Bio
+from Bio import SeqIO
+import json
+
+import argparse
+parser = argparse.ArgumentParser()
+
+parser.add_argument('--alignment', type=str, help='alignment file output by rule augur align')
+parser.add_argument('--furin_site_motif', type=str, help='name of output json file that annotates tips as having a furin cleavage site or a wt cleavage site')
+parser.add_argument('--cleavage_site_sequence', type=str, help='name of output json file that annotates the cleavage site sequence for each tip')
+
+
+args = parser.parse_args()
+alignment = args.alignment
+furin_site_motif_json = args.furin_site_motif
+cleavage_site_sequence_json = args.cleavage_site_sequence
+
+
+def output_furin_cleavage_site_jsons(alignment, output_json1, output_json2):
+
+    output_dict_furin = {"nodes":{}}
+    output_dict_seq = {"nodes":{}}
+
+    with open(output_json1, "w") as outfile: 
+        outfile.write("")
+    with open(output_json2, "w") as outfile: 
+        outfile.write("")
+
+
+    for seq in SeqIO.parse(alignment, "fasta"):
+
+        strain_name = seq.description
+
+        # convert gaps to Ns to avoid translation errors
+        sequence = str(seq.seq).upper().replace("-","N")
+
+        # convert back to sequence object
+        sequence = Bio.Seq.Seq(sequence) 
+
+        # translate and find beginning of ha2
+        aa = str(sequence.translate())
+        ha2_begin = "GLFG"
+
+        start_pos_ha2 = aa.find(ha2_begin)
+
+        # define the furin site as the 4 positions prior to the start of HA2
+        furin_site = aa[start_pos_ha2-4:start_pos_ha2]
+
+		# if those 4 preceding amino acids have the pattern R-X-R/K-R, then it is cleavable 
+		# by furin. Here, X is any amino acid (but not a gap), and the 3rd position can be 
+		# K or R
+        if furin_site[0] == "R" and furin_site[3] == "R" and (furin_site[2] == "R" or furin_site[2]=="K") and furin_site[1]!="X":
+            furin_site_annotation = "present"
+        else:
+            furin_site_annotation = "absent"
+
+
+        output_dict_furin["nodes"][strain_name] = {"furin_cleavage_motif":furin_site_annotation}
+        output_dict_seq["nodes"][strain_name] = {"cleavage_site_sequence":furin_site.replace("X","-")}
+
+    f = open(output_json1, "w")
+    json.dump(output_dict_furin, f)
+    f.close()
+
+    f = open(output_json2, "w")
+    json.dump(output_dict_seq, f)
+    f.close()
+
+
+output_furin_cleavage_site_jsons(alignment, furin_site_motif_json, cleavage_site_sequence_json)
+
+