Commit
This commit fixes an error in which gaps in the HA cleavage site were being filled in as Ns, then being inferred to match the reference sequence. This resulted in most H5N1 and H5Nx sequences being inferred to have a polybasic cleavage site. This was fixed by removing `--fill-gaps` in `augur align` and by adding `--keep-ambiguous` to `augur translate`.

I then added functionality to annotate the HA cleavage site sequence and infer whether a furin cleavage motif is present. This is now a new rule, `rule cleavage_site`, which calls `scripts/annotate-ha-cleavage-site.py`. The script reads in the HA alignment, translates it, finds the start of HA2 (which always begins with amino acids `GLFG`), and infers whether the preceding 4 amino acids contain a furin cleavage motif. Here, we define a furin cleavage motif as `R-X-K/R-R`, where `X` can be any amino acid. The script writes whether a furin cleavage motif is present or absent to `results/cleavage-site_{subtype}_ha.json` and the sequence of the cleavage site to `results/cleavage-site-sequencese_{subtype}_ha.json`. These are then fed to `rule export` so that they are displayed in auspice.

This commit contains new auspice configs, `scripts/annotate-ha-cleavage-site.py`, the new `rule cleavage_site` in the Snakefile, and an updated README.md.
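The `R-X-K/R-R` definition can also be written as a short pattern check. This is a minimal sketch rather than the code in this commit: the helper name `has_furin_motif` is hypothetical, and it assumes the four residues immediately upstream of `GLFG` have already been extracted as a string. It also assumes that an ambiguous residue (`X`), a stop (`*`), or a gap (`-`) in the second position should not count as a match.

```python
import re

# R-X-K/R-R: arginine, any resolved residue, lysine or arginine, arginine.
# Assumption: "X" (ambiguous), "*" (stop) and "-" (gap) are rejected at position 2.
FURIN_MOTIF = re.compile(r"R[^X*-][KR]R")


def has_furin_motif(site: str) -> bool:
    """Return True if the 4-residue string matches R-X-K/R-R."""
    return len(site) == 4 and FURIN_MOTIF.fullmatch(site) is not None
```

Using `fullmatch` keeps the check anchored to exactly four residues, so a longer or shorter string never matches by accident.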
Showing 7 changed files with 167 additions and 17 deletions.
@@ -0,0 +1,82 @@
"""
This script will read in the HA alignment file, translate the sequence to amino acids,
find the beginning of HA2, and pull out the 4 amino acid sites immediately preceding HA2.
If HA2 is preceded immediately by amino acids R-X-K/R-R, then it is annotated
as having a furin cleavage motif. Otherwise, it is annotated as wild type. This produces 2
json files, one annotating a binary has a furin site or wild type site, the other coding
the actual sequence at the cleavage sites. These can both be used in augur export so
that the annotation and sequence show up in auspice as a color by.
"""

import Bio
from Bio import SeqIO
import json

import argparse
parser = argparse.ArgumentParser()

parser.add_argument('--alignment', type=str, help='alignment file output by rule augur align')
parser.add_argument('--furin_site_motif', type=str, help='name of output json file that annotates tips as having a furin cleavage site or a wt cleavage site')
parser.add_argument('--cleavage_site_sequence', type=str, help='name of output json file that annotates the cleavage site sequence for each tip')


args = parser.parse_args()
alignment = args.alignment
furin_site_motif_json = args.furin_site_motif
cleavage_site_sequence_json = args.cleavage_site_sequence


def output_furin_cleavage_site_jsons(alignment, output_json1, output_json2):

    output_dict_furin = {"nodes":{}}
    output_dict_seq = {"nodes":{}}

    with open(output_json1, "w") as outfile:
        outfile.write("")
    with open(output_json2, "w") as outfile:
        outfile.write("")


    for seq in SeqIO.parse(alignment, "fasta"):

        strain_name = seq.description

        # convert gaps to Ns to avoid translation errors
        sequence = str(seq.seq).upper().replace("-","N")

        # convert back to sequence object
        sequence = Bio.Seq.Seq(sequence)

        # translate and find beginning of ha2
        aa = str(sequence.translate())
        ha2_begin = "GLFG"

        start_pos_ha2 = aa.find(ha2_begin)

        # define the furin site as the 4 positions prior to the start of HA2
        furin_site = aa[start_pos_ha2-4:start_pos_ha2]

        # if those 4 preceding amino acids have the pattern R-X-R/K-R, then it is cleavable
        # by furin. Here, X is any amino acid (but not a gap), and the 3rd position can be
        # K or R
        if furin_site[0] == "R" and furin_site[3] == "R" and (furin_site[2] == "R" or furin_site[2]=="K") and furin_site[1]!="X":
            furin_site_annotation = "present"
        else:
            furin_site_annotation = "absent"


        output_dict_furin["nodes"][strain_name] = {"furin_cleavage_motif":furin_site_annotation}
        output_dict_seq["nodes"][strain_name] = {"cleavage_site_sequence":furin_site.replace("X","-")}

    f = open(output_json1, "w")
    json.dump(output_dict_furin, f)
    f.close()

    f = open(output_json2, "w")
    json.dump(output_dict_seq, f)
    f.close()


output_furin_cleavage_site_jsons(alignment, furin_site_motif_json, cleavage_site_sequence_json)
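One edge case in the script above may be worth guarding against. `str.find` returns `-1` when `GLFG` is absent (for example, when the region is gapped or ambiguous), so the slice `aa[start_pos_ha2-4:start_pos_ha2]` becomes `aa[-5:-1]` and silently returns residues from the end of the protein. The following is a hedged sketch of one way to handle this, not part of the commit. The helper names `extract_cleavage_site` and `annotate`, the `"unknown"` label, and the `"----"` placeholder are all hypothetical. It works on an already translated amino acid string so that it stays free of Biopython.

```python
from typing import Optional

HA2_BEGIN = "GLFG"  # start of HA2, as in the script above
SITE_WIDTH = 4      # residues taken immediately upstream of HA2


def extract_cleavage_site(aa: str) -> Optional[str]:
    """Return the SITE_WIDTH residues before HA2, or None if they cannot be located."""
    start = aa.find(HA2_BEGIN)
    # find() returns -1 when GLFG is absent; a start below SITE_WIDTH also means
    # there are not enough upstream residues to form a full site.
    if start < SITE_WIDTH:
        return None
    return aa[start - SITE_WIDTH:start]


def annotate(aa: str) -> dict:
    """Build both annotations for one strain; "unknown" is a hypothetical extra label."""
    site = extract_cleavage_site(aa)
    if site is None:
        return {"furin_cleavage_motif": "unknown", "cleavage_site_sequence": "----"}
    # Same rule as the script: R, any residue except X, K or R, R.
    present = site[0] == "R" and site[1] != "X" and site[2] in ("K", "R") and site[3] == "R"
    return {
        "furin_cleavage_motif": "present" if present else "absent",
        "cleavage_site_sequence": site.replace("X", "-"),
    }
```

For example, `annotate("MEKRKKRGLFGAIA")` reports the motif as present with sequence `RKKR`, while a string without `GLFG` is labelled `unknown` rather than being scored on unrelated residues. Whether a separate `unknown` category is wanted in auspice is a design choice for the pipeline. The two values can still be written to separate node-data JSON files as the script does.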