# Usher Syndrome and the Evolution of Metazoan Sensory Structures


## Introduction

Usher syndrome (USH), a genetic sensory disorder, is the most common cause of combined deafness and blindness in humans. The genes associated with Usher syndrome play key structural and functional roles in the ciliated sensory cells of the vertebrate retina and inner ear. The proteins they encode form interciliary links and their anchoring complexes in photoreceptors and mechanosensory hair cells (Kremer et al 2006). When a mutation occurs in one of these genes, mechanotransduction is abolished and the retina degenerates, resulting in blindness, deafness and impaired vestibular function.

Given the key role these genes play in vertebrate sensory structures, it is conceivable that they serve similar sensory functions in other metazoan groups. Although USH genes were previously thought to be confined to vertebrates, USH homologs were identified within an echinoderm genome (Sodergren et al 2006). More recently, USH homologs have been shown to be upregulated in the choanocytes of the sponge *Ephydatia*, hinting that these genes may play a conserved role in the evolution of ciliated sensory structures in the Metazoa, and raising the question of how early these genes arose (Peña et al 2016). By investigating the evolutionary history of the genes involved in Usher syndrome, this project aims to determine how this suite of genes was assembled within the Metazoa and its close relatives, and what role these genes might have played in the sensory evolution of early animals.

## Steps to the process

### BLAST human USH sequences against the NCBI database

```
from Bio.Blast import NCBIWWW  # Biopython module that allows remote BLAST queries
import time  # lets us delay requests so we don't spam the NCBI servers and get kicked off


def search_taxa_all_gene_delay(list_of_taxa):
    # BLASTs the sequences in a FASTA file against a user-submitted list of taxa
    # loops through the list, runs BLAST for each taxon,
    # and saves each result to a separate XML file

    # read in the sequences we will be searching with
    # (the with-block closes the file for us)
    with open("USH_Search_seq.fasta", "r") as fasta_file:
        sequences = fasta_file.read()

    for taxon in list_of_taxa:
        result_handle = NCBIWWW.qblast(
            "blastp",  # specifies the program for a protein-protein search
            "refseq_protein",  # database of protein sequences
            sequences,  # the sequences we read in
            alignments=100,  # asks for the 100 best hits
            descriptions=100,
            expect=0.00001,  # E-value cutoff (how many hits this good we would expect by chance)
            entrez_query=str(taxon),  # restricts the search to the current taxon
        )

        file_name = "USH_Search_" + str(taxon) + ".xml"  # name for the output file

        # write the result of our BLAST search to a new local file
        with open(file_name, "w") as save_file:
            save_file.write(result_handle.read())
        result_handle.close()  # close the results handle

        print("created " + file_name)  # a simple way to track the progress of the program

        # wait 1 minute between writing the output and sending another request
        # NCBI is a shared resource, so you can't monopolize computer time
        time.sleep(60)
```


Here is the list of taxa:
```
full_taxa_file_name=open("/home/eeb177-student/Desktop/eeb-177/project/sandbox/Testrun_multi_genes_same_org/full_list_taxa_NCBI.txt", "r")
```
Please note that this list only includes taxa available in NCBI's database. There are still a number of taxa from separate databases that I want to include in the analysis.
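
The `open(...)` call above only creates a file handle, so the taxa still have to be read into a list before they can be passed to `search_taxa_all_gene_delay`. Here is a minimal sketch of that step. The helper name `read_taxa_list` is hypothetical, and the file layout (one Entrez taxon query per line, with `#` marking comment lines) is an assumption about `full_list_taxa_NCBI.txt`, not something confirmed above.

```python
def read_taxa_list(path):
    """Return the taxon entries in a text file as a list of strings."""
    # Assumption: one taxon query per line, e.g. "Porifera[Organism]".
    # Blank lines and lines starting with "#" are skipped.
    taxa = []
    with open(path, "r") as taxa_file:
        for line in taxa_file:
            entry = line.strip()
            if entry and not entry.startswith("#"):
                taxa.append(entry)
    return taxa
```

The result could then be handed straight to the search function, for example `search_taxa_all_gene_delay(read_taxa_list("full_list_taxa_NCBI.txt"))`.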

### Parse BLAST Output to CSV
```
from Bio.Blast import NCBIXML  # Biopython parser for BLAST XML output


def parse_and_summarize(blast_output_xml):
    # goes through a list of BLAST XML output files and finds the relevant
    # names needed to summarize each search
    homo_sapiens = "[Homo sapiens]"

    for file_name in blast_output_xml:
        # formats the taxon ID from the file name so we can use it to name things
        org_desig = file_name.split("_")[2].split("[")[0].replace(" ", "_")

        # the with-block closes the result handle when we are done
        with open(str(file_name), "r") as result_handle:
            # need to use parse (not read) because the file has multiple records in it
            blast_records = NCBIXML.parse(result_handle)

            for blast_record in blast_records:
                blast_query = blast_record.query

                # this conditional formats the gene name so we can use it to name things
                # there are 2 formats of sequence names among the search sequences,
                # and it switches the naming convention depending on which is used
                # it's not very general, so in the future I will name my search
                # sequences consistently, which should make it unnecessary
                if homo_sapiens in blast_query:
                    gene_name = blast_query.split("|")[4].split("[")[0].replace(" ", "_").replace("_protein", "").replace("_isoform_b3", "")
                    formatted_gene_name = gene_name[1:-1]
                else:
                    formatted_gene_name = blast_query.split("|")[4].split("_")[0]
```
This gives us the BLAST results for one gene in one species as a CSV
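
The excerpt above stops once the organism and gene names are formatted, so the CSV-writing step itself is not shown. Below is a minimal sketch of how that step might look using only the standard library. The helper names, the `<gene>_<organism>.csv` naming scheme, and the three output columns are all assumptions made for illustration. The two name helpers simply repeat the string handling from `parse_and_summarize` so that it can be checked on its own.

```python
import csv
import os


def organism_from_file_name(file_name):
    # "USH_Search_<taxon>[...].xml" -> "<taxon>", with spaces turned into underscores
    return file_name.split("_")[2].split("[")[0].replace(" ", "_")


def gene_from_query(blast_query):
    # Mirrors the two naming conventions handled in parse_and_summarize
    description = blast_query.split("|")[4]
    if "[Homo sapiens]" in blast_query:
        name = description.split("[")[0].replace(" ", "_")
        name = name.replace("_protein", "").replace("_isoform_b3", "")
        return name[1:-1]  # drop the leading and trailing underscore
    return description.split("_")[0]


def write_hits_csv(org_desig, gene_name, hits, out_dir="."):
    # hits: iterable of (hit_id, hit_description, e_value) tuples (assumed columns)
    csv_name = os.path.join(out_dir, gene_name + "_" + org_desig + ".csv")
    with open(csv_name, "w", newline="") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["hit_id", "hit_description", "e_value"])
        writer.writerows(hits)
    return csv_name
```

With a made-up header such as `"gi|0|ref|XX_0| harmonin protein [Homo sapiens]"`, `gene_from_query` returns `"harmonin"`. Inside the parsing loop, the `hits` tuples could plausibly be built from each record's `alignments` (for example `hit_id`, `hit_def`, and the `expect` value of the first HSP), but which statistics belong in the summary is still an open choice.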

### Further functions needed

##### Combine Records for a gene into one CSV and download from NCBI

```
for gene in list of genes:
    Search the names of files for the common gene, regardless of organism
        if file matches this,
        grab the column with gene ids and append to txt document holding all the sequences for one gene
    Check for redundant gene ids (use sort and uniq)
    
    Submit list of genes to NCBI for retrieval
        (do this during a low usage moment so as to not get banned from the system)
    save sequences and IDs to file > list_of_retrieved_sequences.
```
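
As a starting point, the file-matching and de-duplication steps of this plan could be sketched as follows. This assumes the per-organism CSVs are named `<gene>_<organism>.csv` and have a header row with a `hit_id` column. Both are my assumptions, as are the function names. The NCBI retrieval step is left out of the sketch.

```python
import csv
import glob
import os


def collect_gene_ids(gene_name, csv_dir=".", id_column="hit_id"):
    """Gather the unique hit IDs for one gene across every organism's CSV."""
    # Assumption: files are named "<gene>_<organism>.csv" and have a header row
    pattern = os.path.join(glob.escape(csv_dir), glob.escape(gene_name) + "_*.csv")
    unique_ids = set()
    for csv_path in sorted(glob.glob(pattern)):
        with open(csv_path, newline="") as csv_file:
            for row in csv.DictReader(csv_file):
                unique_ids.add(row[id_column])
    # a set plus sorted() plays the role of "sort | uniq"
    return sorted(unique_ids)


def write_id_list(ids, out_path):
    """Write one ID per line, ready to submit for sequence retrieval."""
    with open(out_path, "w") as out_file:
        out_file.write("\n".join(ids) + "\n")
```

The resulting ID list could then be submitted in batches, for instance through Biopython's `Bio.Entrez.efetch` with `db="protein"` and `rettype="fasta"`, keeping the same kind of delay between requests as in the BLAST step.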

##### Check for Homology
Use HMMER to annotate the domains of each retrieved sequence and check whether the expected domains are present.

```
for gene in list_of_retrieved_sequences:
    hmmer gene vs pfam database
        for hit from hmmer of sufficient significance:
            add domain id and length to metadata of sequence
        if annotated sequence has at least one of each domain present in search sequence:
            add to document containing list of potential homologs and their sequence
```
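
The filtering logic in this plan could be sketched as below. It assumes the domain annotation comes from `hmmscan --domtblout` run against Pfam, where the first column is the domain name, the fourth is the sequence ID, the thirteenth is the per-domain independent E-value, and the alignment coordinates sit in columns 18 and 19. The function names and the `1e-5` significance cutoff are placeholders of mine, not settled choices.

```python
def parse_domtblout(lines, max_evalue=1e-5):
    """Map each sequence ID to a list of (domain_name, aligned_length) hits."""
    domains = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the commented header/footer
        fields = line.split()
        domain_name, seq_id = fields[0], fields[3]
        i_evalue = float(fields[12])  # per-domain independent E-value
        ali_from, ali_to = int(fields[17]), int(fields[18])
        if i_evalue <= max_evalue:
            hit = (domain_name, ali_to - ali_from + 1)
            domains.setdefault(seq_id, []).append(hit)
    return domains


def has_all_query_domains(candidate_domains, query_domains):
    """True if every domain in the search sequence also appears in the candidate."""
    candidate_names = {name for name, _ in candidate_domains}
    query_names = {name for name, _ in query_domains}
    return query_names <= candidate_names
```

Comparing sets of domain names implements the "at least one of each domain" rule literally. It ignores domain order and copy number, which may matter for the larger Usher proteins and could be tightened later.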

##### Align and build tree
Use MUSCLE and a tree builder to build the trees:
```
for all docs in list of vetted sequences:
    alignment_muscle = muscle align(sequences within the docs)
    tree object = treemaker(alignment_muscle)
```
##### Vet the genes again:
```
If gene in gene_list clusters with known_human_USH_gene
    Search CSV of taxa for gene ID
    Extract taxa ID associated with gene ID
    write to homolog_summary_doc: "ORG NAME" + "GENE NAME" + "PRESENT"
Create summary Table
```
    
<img src="USH_SummaryTable_dummy.png">

Summary table of homologs I made by hand prior to this class, not programmatically.
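
A programmatic version of this table could be assembled roughly as follows, once the vetting step yields (organism, gene) pairs. This is only a sketch: the function names, the CSV layout, and the `ABSENT` label are my assumptions, and `ABSENT` here only means that no vetted homolog was found in the searched data.

```python
import csv


def build_summary_table(homolog_records, gene_names):
    """Turn (organism, gene) pairs into one presence/absence row per organism."""
    present = {}
    for organism, gene in homolog_records:
        present.setdefault(organism, set()).add(gene)
    rows = []
    for organism in sorted(present):
        row = {"organism": organism}
        for gene in gene_names:
            row[gene] = "PRESENT" if gene in present[organism] else "ABSENT"
        rows.append(row)
    return rows


def write_summary_table(rows, gene_names, out_path):
    """Write the summary rows to a CSV with one column per gene."""
    with open(out_path, "w", newline="") as out_file:
        writer = csv.DictWriter(out_file, fieldnames=["organism"] + list(gene_names))
        writer.writeheader()
        writer.writerows(rows)
```

Organisms with no vetted homolog for any gene never appear in `homolog_records`, so they would need to be added explicitly if the table should list every searched taxon.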

##### Reconcile tree with species tree
Compare the gene trees generated above to established species trees:
```
for gene_tree in list_of_tree_objects:
    comparison_tree = (well-supported tree of Metazoa)
    calibrate gene_tree with comparison_tree
    
```
<img src="Darwins_first_tree.jpg">

Placeholder image of a tree by Charles Darwin, retrieved from Wikimedia Commons.

## References
Kremer, H., van Wijk, E., Märker, T., Wolfrum, U., & Roepman, R. (2006). Usher syndrome: molecular links of pathogenesis, proteins and pathways. Human molecular genetics, 15(suppl 2), R262-R270.

Sodergren, E., Weinstock, G. M., Davidson, E. H., Cameron, R. A., Gibbs, R. A., Angerer, R. C., Angerer, L. M., Arnone, M. I., Burgess, D. R., Burke, R. D., Coffman, J. A., et al. (2006). The genome of the sea urchin Strongylocentrotus purpuratus. Science, 314(5801), 941-952.

Peña, J. F., Alié, A., Richter, D. J., Wang, L., Funayama, N., & Nichols, S. A. (2016). Conserved expression of vertebrate microvillar gene homologs in choanocytes of freshwater sponges. EvoDevo, 7(1), 13.