# Usher Syndrome and the Evolution of Metazoan Sensory Structures


## Introduction

Usher syndrome (USH), a genetic sensory disorder, is the most common cause of combined deafness and blindness in humans. The genes associated with Usher syndrome play key structural and functional roles in ciliated sensory cells of the vertebrate retina and inner ear. Their protein products form interciliary links and the complexes that anchor them in photoreceptors and mechanosensory hair cells (Kremer et al. 2006). When one of these genes is mutated, mechanotransduction is abolished and the retina degenerates, resulting in blindness, deafness, and impaired vestibular function.

Given the key role these genes play in vertebrate sensory structures, it is conceivable that they serve similar sensory functions in other metazoan groups. USH genes were previously thought to be confined to vertebrates, but homologs were identified in an echinoderm genome (Sodergren et al. 2006). More recently, USH homologs have been shown to be upregulated in the choanocytes of the sponge *Ephydatia*, hinting that these genes may have played a conserved role in the evolution of ciliated sensory structures in the Metazoa, and raising the question of how early these genes arose (Peña et al. 2016). By investigating the evolutionary history of the genes involved in Usher syndrome, this project aims to determine how this suite of genes was assembled within the Metazoa and its close relatives, and what role these genes might have played in the sensory evolution of early animals.

## Steps in the process

### BLAST human USH sequences against the NCBI database

```
def search_taxa_all_gene_delay(list_of_taxa):
    # blasts sequences in a file against a user submitted list of taxa
    # loop through the list and run blast for each one
    # will save each result to a separate xml file
    
    
    from Bio.Blast import NCBIWWW
        # imports the NCBIWWW module from Biopython to allow remote queries
        
    import time
        # lets us delay requests so we don't spam the NCBI servers and get kicked off
    
    with open("USH_Search_seq.fasta", "r") as fasta_file:
        sequences = fasta_file.read()
        # reads in the sequences we will be searching; the with block closes the file for us
        
    for i in list_of_taxa:
        result_handle = NCBIWWW.qblast("blastp", # specifies the program for a protein-protein search
                                       "refseq_protein",  #  database of protein sequences
                                       sequences, # our list of sequences we read in
                                       alignments = 100, # asks for 100 best hits
                                       descriptions = 100, 
                                       expect = 0.00001, # specifies the E-value cutoff (i.e. the number of hits this good expected by chance)
                                       entrez_query = str(i)) # specifies the taxa as we loop through it
                                       
        file_name=str("USH_Search_"+str(i)+".xml") #this creates a name for the file
        
        save_file=open(file_name, "w")  #we are opening a file that does not yet exist to write to it
        
        save_file.write(result_handle.read())  #writing the result of our blast search to local file
        
        save_file.close() #closing it to allow the file to actually write it
        result_handle.close() #close the results handle
        
        print("created "+ file_name) #this is just a nice way to track the progress of the program
        
        time.sleep(60)  #this gives 1 minute between writing the output and sending another request to the ncbi server
            # NCBI is a shared resource, so you can't monopolize computer time
        
```


Here is the list of taxa:
```
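with open("/home/eeb177-student/Desktop/eeb-177/project/sandbox/Testrun_multi_genes_same_org/full_list_taxa_NCBI.txt", "r") as full_taxa_file:
    list_of_taxa = full_taxa_file.read().splitlines()
    # assumes one taxon per line; this list is what gets passed to search_taxa_all_gene_delay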
```
Please note that this list only includes taxa from NCBI's database; there are still a number of taxa from separate databases that I want to include in the analysis.
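
As a minimal sketch of how the taxa list feeds the search, the helpers below read one Entrez query per line, build the XML file name that `search_taxa_all_gene_delay` writes, and recover the organism name from that file name using the same split that the parsing step applies. The function names (`load_taxa`, `blast_output_name`, `organism_from_file_name`) and the `Genus species[Organism]` line format are assumptions made for illustration; the actual contents of the list are not shown here.

```python
def load_taxa(lines):
    """Return the non-empty, whitespace-stripped entries from an iterable of lines."""
    return [line.strip() for line in lines if line.strip()]


def blast_output_name(taxon):
    """Build the per-taxon XML file name used by the search function."""
    return "USH_Search_" + str(taxon) + ".xml"


def organism_from_file_name(file_name):
    """Recover an underscore-joined organism name from an output file name."""
    return file_name.split("_")[2].split("[")[0].replace(" ", "_")


# Example entries in the assumed format: one Entrez organism query per line.
example_lines = ["Homo sapiens[Organism]\n", "\n", "Amphimedon queenslandica[Organism]\n"]
taxa = load_taxa(example_lines)
for taxon in taxa:
    output_name = blast_output_name(taxon)
    print(output_name, "->", organism_from_file_name(output_name))
```

Because the organism name is recovered by splitting on underscores, a taxon entry that itself contains an underscore would break the round trip, which is one reason to keep the entries in a single consistent format.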

### Parse BLAST Output to CSV
```
def parse_and_summarize(blast_output_xml):
    #  goes through the output of a BLAST xml file and finds the relevant stats to summarize the search
    from Bio.Blast import NCBIXML
    from Bio.SeqRecord import SeqRecord
    #import the required libraries
    
    for file_name in blast_output_xml:
        result_handle = open(str(file_name), "r") #sets the result handle
        
        blast_records = NCBIXML.parse(result_handle)
        #need to use parse if it has multiple records in it
        
        for blast_record in blast_records:
            org_desig=file_name.split("_")[2].split("[")[0].replace(" ", "_")
            #properly formats the taxa id so we can use it to name things
            
            homo_sapiens = "[Homo sapiens]"
            blast_query=blast_record.query
            if homo_sapiens in blast_query:
                gene_name=blast_record.query.split("|")[4].split("[")[0].replace(" ", "_").replace("_protein", "").replace("_isoform_b3", "")
                formated_gene_name = gene_name[1:-1]
            else:
                formated_gene_name=blast_query.split("|")[4].split("_")[0]
            #this conditional properly formats the gene name so we can use it to name things
                #Basically there are 2 formats of sequence names I used to do the search
                #this statement switches the naming convention depending on which is used
                #it's not very general, so in the future I will make sure to properly name my search sequences
                #that should make this statement unnecessary 
```
This gives us the BLAST results for one gene in one species as a CSV file.
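
The CSV-writing part of the summary is not shown above, so here is a hedged sketch of what that step could look like. It assumes each parsed record exposes `alignments`, each alignment exposes `hit_id`, `hit_def` and `hsps`, and each HSP exposes `expect` and `bits`, as Biopython's parsed BLAST records do. The function name `write_hit_summary`, the column layout, and the stand-in data are hypothetical; the only constraint taken from this write-up is that the hit ID is the first column, since the next step cuts field 1.

```python
import csv
import io
from types import SimpleNamespace


def write_hit_summary(blast_records, out_handle, e_value_cutoff=1e-5):
    """Write one CSV row per hit: hit ID, description, best E-value, best bit score."""
    writer = csv.writer(out_handle)
    rows_written = 0
    for record in blast_records:
        for alignment in record.alignments:
            # The HSP with the lowest E-value represents the hit.
            best_hsp = min(alignment.hsps, key=lambda hsp: hsp.expect)
            if best_hsp.expect > e_value_cutoff:
                continue
            writer.writerow([alignment.hit_id, alignment.hit_def,
                             best_hsp.expect, best_hsp.bits])
            rows_written += 1
    return rows_written


# Stand-in objects with made-up values, shaped like parsed BLAST records.
strong_hsp = SimpleNamespace(expect=2e-30, bits=120.5)
weak_hsp = SimpleNamespace(expect=0.5, bits=20.0)
fake_record = SimpleNamespace(alignments=[
    SimpleNamespace(hit_id="example_id_1", hit_def="example protein, partial",
                    hsps=[strong_hsp]),
    SimpleNamespace(hit_id="example_id_2", hit_def="weak hit", hsps=[weak_hsp]),
])
buffer = io.StringIO()
kept = write_hit_summary([fake_record], buffer)
print(buffer.getvalue())
```

Using `csv.writer` rather than joining strings by hand means descriptions that contain commas are quoted, so the ID in the first field stays easy to extract.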

### Grab Gene IDs from output with bash for each gene

```
cut -d ',' -f 1 gene_WHRN_tiny_subset_summary.csv > WHRN_tiny_gi.txt
```

### Download full sequence for gene from NCBI
```
def download_seqs_from_list_and_autoname(in_filename):    
    with open(in_filename, "r") as query_file:
        query_ids = query_file.read().splitlines()
    #open the file, grab the ids

    from Bio import Entrez
    Entrez.email = "hspeck@ucla.edu"
    # import Bio Entrez, let NCBI know who I am

    search_results = Entrez.read(Entrez.epost(db ="protein", 
                                              id = ",".join(query_ids)))  # needs the id's to be as a list separated by commas
    #upload the IDs to NCBI
    webenv_return = search_results["WebEnv"]
    query_key_return = search_results["QueryKey"]
    #grab the relevant variables to call on the stuff we posted

    count = len(query_ids)
    #assigns the count based on the number of sequences we searched for

    from urllib.error import HTTPError
    # load required library for the try and except conditions

    batch_size = 20
    #this determines how many things we retrieve and write to the file
    #can safely be larger in the future

    out_filename = str(in_filename[:-7]+".fasta")
    #attempting to rename the file based on the input of the original file
    out_handle = open(out_filename, "w")
    #open file to write to

    import time
    # needed to pause between retry attempts

    for start in range(0, count, batch_size):
        end = min(count, start+batch_size)
        print("Going to download record %i to %i" % (start+1, end))
        attempt = 0
        while attempt < 3:
            attempt += 1
            try:
                fetch_handle = Entrez.efetch(db="protein", #says which db
                                             rettype="fasta", #says what format the data should be in
                                             retmode="text", #what the output should be
                                             retstart = start, #index of the first record to return
                                             retmax = batch_size, #maximum number of records to return in this batch
                                             webenv = webenv_return, #specify the info we uploaded with ePost
                                             query_key = query_key_return)
                break
                #stop retrying once the request succeeds
            except HTTPError as err:
                if 500 <= err.code <= 599 and attempt < 3:
                    print("Received error from server %s" % err)
                    print("Attempt %i of 3" % attempt)
                    time.sleep(15)
                    #wait before retrying after a server-side error
                else:
                    raise
        data = fetch_handle.read()
        fetch_handle.close()
        out_handle.write(data)
    out_handle.close()
```
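
The batching and retry logic in the download function can be isolated and exercised without contacting NCBI. The sketch below is one way to do that under stated assumptions: `batch_ranges` and `fetch_with_retries` are hypothetical helper names, the 15-second delay is an arbitrary choice, and `fetch` stands for any zero-argument callable, such as a wrapper around the `Entrez.efetch` call.

```python
import time
from urllib.error import HTTPError


def batch_ranges(count, batch_size):
    """Yield (start, end) index pairs that cover range(count) in batches."""
    for start in range(0, count, batch_size):
        yield start, min(count, start + batch_size)


def fetch_with_retries(fetch, attempts=3, delay=15, sleep=time.sleep):
    """Call fetch() until it succeeds, retrying only on HTTP 5xx errors."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except HTTPError as err:
            # Client errors (4xx) and the final failed attempt are re-raised.
            if not 500 <= err.code <= 599 or attempt == attempts:
                raise
            print("Server error %s (attempt %i of %i)" % (err, attempt, attempts))
            sleep(delay)


# 45 IDs in batches of 20 give two full batches and one partial batch.
print(list(batch_ranges(45, 20)))
```

Passing `sleep` as a parameter keeps the helper testable: a test can substitute a no-op function instead of waiting for real delays.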

### Align with MUSCLE in Biopython
```
from Bio.Align.Applications import MuscleCommandline
muscle_cline = MuscleCommandline(input = "WHRN_tiny.fasta", out = "WHRN_tiny_Muscle_align.txt")
stdout, stderr = muscle_cline()
```

### Run through RAxML

In the shell:
```
./raxmlHPC -m PROTGAMMAWAG -p 12345 -s Path/to/alignment/WHRN_tiny_Muscle_align.txt  -# 5 -n WHRNtn2
```
The path depends on where the alignment is stored.

### Format the Sequence Names For Tree making


Use Bash to grab the header lines of the FASTA file:

```
grep ">" WHRN_tiny.fasta > WHRN_tiny_annotations.txt
```

In Python:

```
with open("WHRN_tiny_annotations.txt", "r") as annotation_file:
    WHRN_annotations = annotation_file.readlines()
name_file = open("WHRN_gene&sci&annotation.txt", "w")
for line in WHRN_annotations:
    #this handles formatting of names
    #the genes can be named in a number of formats, so the naming process is a bit tricky
    
    sequence_id = line.split(" ")[0].replace(">", "")
    #split the sequence ID out and get rid of the FASTA formatting
    
    
    formated_org = line.split("[")[1].split(' ')
    genus_code = formated_org[0][:3]
    genus = formated_org[0]
    species_code = formated_org[1][:3]
    final_org_name = genus_code + "_" + species_code
    #splits out the genus and species
    
    taxa_group = taxa_Dict[final_org_name]
    #assigns the species to a broad taxonomic division
    #taxa_Dict must already be defined: a dict mapping codes like "Hom_sap" to a group name
    
    gene_annotation_slice = line.split("[")[0].split(" ")[1:]
    if "PREDICTED:" in gene_annotation_slice:
        if any("-like" in word for word in gene_annotation_slice):
            gene_annotation = gene_annotation_slice[1:2]
            #assumes the gene name is the single word right after "PREDICTED:"
        else:
            gene_annotation = gene_annotation_slice[1:3]
    elif "Drosophila" in gene_annotation_slice:
        gene_annotation = gene_annotation_slice[1:3]
    else:
        gene_annotation = gene_annotation_slice[0:1]
    gene_annotation_final = "_".join(gene_annotation).replace(",", "")
    #cuts down the gene annotation names to make them more manageable
    #problematic as the gene names do not conform to a single, easily parsable format
    #have to switch between formats
    
    combined_name = '"' +final_org_name + " " + gene_annotation_final + '"'
    genus_and_gene = '"' +genus + " " + gene_annotation_final + '"'
    
    name_file.write(sequence_id+","+
                    final_org_name+","+
                    gene_annotation_final+","+ 
                    taxa_group+ ","+ 
                    combined_name+ "," +
                    genus_and_gene+'\n')
    #Writes it to the file
    
name_file.close()
```
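
To make the header handling easier to test on its own, the organism-related fields can be pulled out by a small function. This is only a sketch: `parse_fasta_header` is a hypothetical name, the accession in the example header is made up, and the `"Unassigned"` fallback is an assumption rather than part of the workflow above, which looks the code up in `taxa_Dict` directly.

```python
def parse_fasta_header(line, taxa_dict):
    """Split an NCBI-style FASTA header into the fields used for tip labels."""
    sequence_id = line.split(" ")[0].replace(">", "")
    # The organism name sits inside the last pair of square brackets.
    organism = line.rsplit("[", 1)[1].split("]")[0]
    genus, species = organism.split(" ")[:2]
    org_code = genus[:3] + "_" + species[:3]
    return {
        "sequence_id": sequence_id,
        "genus": genus,
        "org_code": org_code,
        # Unknown codes get a visible placeholder instead of raising KeyError.
        "group": taxa_dict.get(org_code, "Unassigned"),
    }


# Made-up header in the usual "ID description [Genus species]" layout.
header = ">XP_000000.1 PREDICTED: whirlin-like [Amphimedon queenslandica]\n"
fields = parse_fasta_header(header, {"Amp_que": "Porifera"})
print(fields)
```

Splitting on the last `[` rather than the first guards against gene descriptions that contain their own square brackets. Headers whose organism name is a single word would still fail at the genus and species unpacking, so those need separate handling.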

### Plot Tree in R

```
rm(list = ls())
library(ape)
### Basic Tree!

WHRN_tree <- read.tree("/Path/to/best/Tree/RAxML_bestTree.WHRNtn2")
plot(WHRN_tree)
# bring in the tree from wherever it is stored and plot it
```
<img src = "files/WHRN_basic_tree.png">


#### Increase the readability of the tree

```
# Read in and format the annotation data
WHRN_gene_annotations <- read.csv("WHRN_gene&sci&annotation.txt", 
                                  header =FALSE, 
                                  stringsAsFactors = FALSE)
names(WHRN_gene_annotations) <- c("GeneID", "OrgName", "Annotation", "Group", "CombinedName", "Genus_Gene")
rownames(WHRN_gene_annotations) <- WHRN_gene_annotations$GeneID



# Reorder the data frame rows to match the order of the tree tips

tip_order <- match(WHRN_tree$tip.label, rownames(WHRN_gene_annotations))
WHRN_gene_annotations <- WHRN_gene_annotations[tip_order, ]


WHRN_tree$tip.label <- WHRN_gene_annotations$Genus_Gene
# Rename tips of Tree to increase Readability

plot(WHRN_tree, cex = 0.5, label.offset = 0.1)
# decrease size of labels a bit

title("Whirlin (USH2D) best hits tree")
#add a title
add.scale.bar()
#add a scale bar!

# Color the tips to indicate which lineage the organism belongs to

WHRN_gene_annotations$Group <- as.factor(WHRN_gene_annotations$Group)

#start by setting the Group to factor
lineages <- unique(WHRN_gene_annotations$Group)
cols <- rainbow(n = length(lineages))
#color vector for legend
colvec <- cols[WHRN_gene_annotations$Group]
#color vector for tree
tiplabels(pch = 19, col = colvec)


# Add a legend!
legend(x = "bottomright", lwd =0, pch = 19 , legend = levels(lineages), col = cols, cex = 0.7)
```
Here is the output:



<img src ="files/Whirlin_presentation_color_tree.png">

### Check for Homology
Use HMMER to annotate the domains of each retrieved sequence and check that the expected domains are present.

```
for gene in list_of_retrieved_sequences:
    hmmer gene vs pfam database
        for hit from hmmer of sufficient significance:
            add domain id and length to metadata of sequence
        if annotated sequence has at least one of each domain present in search sequence:
            add to document containing list of potential homologs and their sequence
```
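
The final filter in the pseudocode above ("at least one of each domain present in the search sequence") reduces to a subset test on domain names. The sketch below shows only that step and assumes the domain annotations have already been extracted from the HMMER output into plain lists. The function names, the sequence IDs, and the domain name `domain_B` are placeholders, not real annotations.

```python
def has_required_domains(query_domains, candidate_domains):
    """Return True if every domain family in the query also occurs in the candidate."""
    return set(query_domains) <= set(candidate_domains)


def filter_potential_homologs(query_domains, candidates):
    """Return the IDs of candidates that carry every query domain at least once."""
    return [seq_id for seq_id, domains in candidates.items()
            if has_required_domains(query_domains, domains)]


# Placeholder annotations: sequence ID -> list of domain names found by the scan.
query = ["PDZ", "PDZ", "domain_B"]
candidates = {
    "seq_1": ["PDZ", "domain_B", "extra_domain"],
    "seq_2": ["PDZ", "PDZ"],
    "seq_3": ["domain_B", "PDZ"],
}
print(filter_potential_homologs(query, candidates))
```

Converting to sets deliberately ignores copy number and domain order, which matches the "at least one of each" criterion; a stricter check on domain counts or architecture would need a different comparison.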

### Reconcile tree with species tree
Compare the generated gene trees to established species trees.
```
for gene_tree in list_of_tree_objects:
    comparison_tree = (well-supported tree of Metazoa)
    calibrate gene_tree with comparison_tree
    
```
<img src="Darwins_first_tree.jpg">

Placeholder image of a tree by Charles Darwin, retrieved from Wikimedia Commons.

## References
Kremer, H., van Wijk, E., Märker, T., Wolfrum, U., & Roepman, R. (2006). Usher syndrome: molecular links of pathogenesis, proteins and pathways. Human Molecular Genetics, 15(suppl 2), R262-R270.

Sodergren, E., Weinstock, G. M., Davidson, E. H., Cameron, R. A., Gibbs, R. A., Angerer, R. C., Angerer, L. M., Arnone, M. I., Burgess, D. R., Burke, R. D., Coffman, J. A., et al. (2006). The genome of the sea urchin Strongylocentrotus purpuratus. Science, 314(5801), 941-952.

Peña, J. F., Alié, A., Richter, D. J., Wang, L., Funayama, N., & Nichols, S. A. (2016). Conserved expression of vertebrate microvillar gene homologs in choanocytes of freshwater sponges. EvoDevo, 7(1), 13.