## Generate FASTA Files
Supervisor: Dr. Cedric Chauve.

Author: Laura Gutierrez Funderburk.

Date: June 2017, Revised September 2017.


### Abstract

In this notebook, we write a function that takes three inputs: a file containing synteny results, a location where results will be stored, and a column number that can be either 0 or 2. As output, it writes FASTA files containing the gene sequence of the chromosome or scaffold specified in the synteny file synteny_in_i_and_n-i_species.xlsx.

For example, consider row 1 of the file synteny_in_4_and_n-4-0_species.xlsx:

<img src="row2.png" style="width: 2000px;">

From this row, the algorithm extracts the species name from either the first or the third column, as specified.

Say we pick the first column.

The algorithm then extracts the species name Anopheles_merus, takes the associated key word "merus", and uses that key word to look up a pre-defined dictionary holding the name(s) of the FASTA file(s) with the sequences for that species.

Once it has identified the right file, the algorithm extracts a second key word naming the chromosome or scaffold. In this example, the second key word is KI440021.

With the species key word and the chromosome/scaffold name in hand, the program finds the sequence record that matches both and writes a new FASTA file containing only the sequence we are interested in.
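
The two extraction steps described above can be sketched in plain Python. This is only an illustration under stated assumptions: the cell text is a made-up example assumed to look like `Species_name.SCAFFOLD:coordinates` (the actual layout of the synteny file may differ), and `KNOWN_SPECIES` and `parse_cell` are hypothetical names that do not appear in the notebook.

```python
# Hypothetical subset of the species key words used in this notebook.
KNOWN_SPECIES = {"merus", "melas", "gambiae"}


def parse_cell(cell):
    """Return (species key word, scaffold ID) for one synteny cell.

    Assumes the cell looks like 'Anopheles_merus.KI440021:...'.
    """
    fields = cell.split(".")
    if len(fields) < 2:
        raise ValueError("no '.' separator in cell: %r" % cell)
    # The species key word is found by substring match on the species name.
    matches = sorted(word for word in KNOWN_SPECIES if word in fields[0])
    if len(matches) != 1:
        raise ValueError("expected one species key word, found %r" % matches)
    # The scaffold ID is whatever precedes the first ':' in the second field.
    scaffold = fields[1].split(":")[0]
    return matches[0], scaffold


print(parse_cell("Anopheles_merus.KI440021:1000-2000"))  # ('merus', 'KI440021')
```

Raising an error when zero or several key words match is a design choice of this sketch; it makes an unexpected cell format visible instead of silently picking a species.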


#### Open areas of work

This code should be extended so that, once it has extracted the appropriate FASTA record, it keeps only the region given by the specified coordinates within the gene sequence. The resulting sequences will then be aligned.
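
One way the coordinate step could look, as a minimal sketch. It assumes 1-based, inclusive start and end coordinates, which is a common convention but is not confirmed for the synteny file, and it works on a plain string. `slice_region` is a hypothetical helper. A Biopython `SeqRecord` supports the same slice syntax, so the idea should carry over to records parsed with `SeqIO`.

```python
def slice_region(sequence, start, end):
    """Return the region start..end of sequence.

    Coordinates are assumed to be 1-based and inclusive.
    """
    if not (1 <= start <= end <= len(sequence)):
        raise ValueError(
            "coordinates %d-%d fall outside a sequence of length %d"
            % (start, end, len(sequence))
        )
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return sequence[start - 1:end]


scaffold = "ACGTACGTAC"
print(slice_region(scaffold, 3, 6))  # GTAC
```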

In [3]:
# for pandas DataFrames
import pandas as pd

# for sequence splitting (parsing and writing FASTA files)
from Bio import SeqIO

In [4]:
#Specify paths to results
results ="/media/lgutierrezfunderburk/DATA/lgutierrezfunderburk/Documents/DATA/Results/MultiAlignment_Use_with_Muscle/synteny_in_x_vs_n-x-q/synteny_in_4_and_n-4-2/synteny_in_4_and_n-4-2_species.xlsx"

#Specify where all the FASTA files are found
path_to_FASTA = "/media/lgutierrezfunderburk/DATA/lgutierrezfunderburk/Documents/DATA/FASTA/"

#Specify path to where the results will be written.
path_to_F1 = "/media/lgutierrezfunderburk/DATA/lgutierrezfunderburk/Documents/DATA/Results/MultiAlignment_Use_with_Muscle/synteny_in_x_vs_n-x-q/synteny_in_4_and_n-4-2/P_F1/"
path_to_F2 = "/media/lgutierrezfunderburk/DATA/lgutierrezfunderburk/Documents/DATA/Results/MultiAlignment_Use_with_Muscle/synteny_in_x_vs_n-x-q/synteny_in_4_and_n-4-2/P_F2/"


In [5]:
# Define dictionary: species key word -> FASTA file name(s).
# Species with two candidate files use a tuple.
the_dictionary = {
    'maculatus': 'Anopheles-maculatus-Maculatus3_SCAFFOLDS_AmacM1.fa',
    'epiroticus': 'Anopheles-epiroticus-Epiroticus2_SCAFFOLDS.AepiE1.fa',
    'atroparvus': 'Anopheles-atroparvus-EBRO_SCAFFOLDS_AatrE1.fa',
    'sinensis': 'Anopheles-sinensis-SINENSIS_SCAFFOLDS_AsinS1.fa',
    'melas': 'Anopheles-melas-CM1001059_SCAFFOLDS_AmelC1.fa',
    'merus': 'Anopheles-merus-MAF1_SCAFFOLDS_AmerM1.fa',
    'stephensi': ('Anopheles-stephensi-SDA-500_SCAFFOLDS_AsteS1.fa',
                  'Anopheles-stephensiI-Indian_SCAFFOLDS_AsteI2.fa'),
    'darlingi': ('Anopheles-darlingi-Coari_SCAFFOLDS_AdarC3.fa',
                 'Anopheles-darlingi-Coari_SCAFFOLDS_AdarC2.fa'),
    'gambiae': ('Anopheles-gambiae-PEST_SCAFFOLDS_AgamP3.fa',
                'Anopheles-gambiae-PEST_CHROMOSOMES_AgamP3.fa'),
    'minimus': 'Anopheles-minimus-MINIMUS1_SCAFFOLDS_AminM1.fa',
    'arabiensis': 'Anopheles-arabiensis-Dongola_SCAFFOLDS_AaraD1.fa',
    'farauti': 'Anopheles-farauti-FAR1_SCAFFOLDS_AfarF1.fa',
    'quadriannulatus': 'Anopheles-quadriannulatus-SANGWE_SCAFFOLDS_AquaS1.fa',
    'funestus': 'Anopheles-funestus-FUMOZ_SCAFFOLDS_AfunF1.fa',
    'dirus': 'Anopheles-dirus-WRAIR2_SCAFFOLDS_AdirW1.fa',
    'christyi': 'Anopheles-christyi-ACHKN1017_SCAFFOLDS_AchrA1.fa',
    'culicifacies': 'Anopheles-culicifacies-A37_SCAFFOLDS_AculA1.fa',
    'albimanus': 'Anopheles-albimanus-STECLA_SCAFFOLDS_AalbS1.fa',
}
# Single file names are repeated on both rows, so pd_dictionary[key][j] is the j-th candidate file
pd_dictionary = pd.DataFrame(the_dictionary)

# Key words
key_words= {'maculatus', 'epiroticus', 'atroparvus', 'sinensis', 'melas', 'merus', 'stephensi', 
'darlingi', 'gambiae', 'minimus', 'arabiensis', 'farauti', 'quadriannulatus', 'funestus', 'dirus', 
'christyi', 'culicifacies', 'albimanus'}

In [6]:
def extract_ID(word):
    
    """Split a cell from the file synteny_in_3_and_n-3_species.xlsx twice:
    take the text between the first and second '.', then split it on ':'.
    Returns a list whose first element is the chromosome or scaffold name."""
    
    stage_one = word.split('.')
    stage_two = stage_one[1]
    stage_three = stage_two.split(':')
    return stage_three

In [7]:
def Generate_FASTA(name_of_file, path_to_write, column_number):
    
    """This function takes as input a file containing synteny results, the path where
    results will be written, and a column number. 
    Based on the file synteny_in_i_and_n-i_species.xlsx, the column number can be either 0 or 2,
    corresponding to F1 and F2 respectively.
    
    Output is a set of FASTA files, each of which contains the gene sequence of the specified 
    chromosome. One file is written per row in the synteny_in_3_and_n-3_species.xlsx file."""
    
    # Open file and specify length of the file
    result_file = pd.read_excel(name_of_file, header=0)
    length_of_file = len(result_file)
    
    
    for i in range(length_of_file):
        # This area runs through the rows in synteny_in_3_and_n-3_species.xlsx for a specified column 
        # and extracts two words: species key word and chromosome name
        get_key = [word for word in key_words if word in result_file.iloc[i,column_number]]
        get_ID = extract_ID(result_file.iloc[i,column_number])
        # list() is needed on Python 3, where zip returns an iterator that cannot be indexed
        identificator = list(zip(get_key, get_ID))
        
        # Store 
        # Species key word is assigned to key variable while chromosome name is assigned to ID
        key = identificator[0][0]
        ID = identificator[0][1]
        
        # Parse the sequences using the key to open the adequate FASTA file, extract the chromosome
        # specified in the ith row, and write a FASTA file with the gene sequence of the chromosome
        short_seq_iterator = []
        # Try each candidate FASTA file for this species until the chromosome is found
        for fasta_file in pd_dictionary[key]:
            input_seq_iterator = SeqIO.parse(path_to_FASTA + fasta_file, "fasta")
            short_seq_iterator = [record for record in input_seq_iterator if ID == record.id]
            if short_seq_iterator:
                break
        # Fail with a clear message instead of running past the last candidate file
        if not short_seq_iterator:
            raise ValueError("Chromosome " + str(ID) + " not found in the FASTA files for " + str(key))
        SeqIO.write(short_seq_iterator, path_to_write + str(i) + "_row_" + str(key) + "_" + str(ID) + ".fa", "fasta")
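
The search over candidate files can also be illustrated without Biopython. The sketch below is a simplified stand-in, not the notebook's implementation: `read_fasta` is a minimal hypothetical parser, the FASTA contents are made-up in-memory strings, and `find_record` stops at the first candidate that holds the requested ID.

```python
def read_fasta(text):
    """Yield (record_id, sequence) pairs from FASTA-formatted text."""
    record_id, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            # The ID is the first whitespace-delimited token after '>'.
            header = line[1:].split()
            record_id, chunks = (header[0] if header else ""), []
        elif line and record_id is not None:
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


def find_record(candidates, wanted_id):
    """Return (file name, sequence) for the first candidate holding wanted_id."""
    for name, text in candidates:
        for record_id, sequence in read_fasta(text):
            if record_id == wanted_id:
                return name, sequence
    raise KeyError("ID %r not found in any candidate file" % wanted_id)


# Made-up data standing in for two assemblies of one species.
candidates = [
    ("assembly_A.fa", ">scaf1 first scaffold\nACGT\nAC\n>scaf2\nGGGG\n"),
    ("assembly_B.fa", ">scaf3\nTTTT\n"),
]
print(find_record(candidates, "scaf3"))  # ('assembly_B.fa', 'TTTT')
```

Raising an error when no candidate contains the ID keeps a missing scaffold from going unnoticed.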

In [44]:
# Sample usage
Generate_FASTA(results,path_to_F1,0)