#### Matt Fry mkfry@uncc.edu
#### <font color='purple'>Working with Pandas</font>

## Pandas Lab 1
For this lab, you will need to read in the provided FASTA file and generate a DataFrame containing the following information for each sequence record:
1. Sequence Length
2. GC content
3. Begins with start codon?
4. Ends in stop codon?
5. Complete frame? (Meaning is the length a multiple of 3?)

The row index should be the sequence label, and columns should be labeled with descriptions of these 5 criteria I've given you. Be mindful of the datatype you choose for description 3-5.

### PIP installs that might have been performed if python modules are not available added some time magics to check performance of code

In [None]:
#%pip install biopython
#%pip install pandas
#import sys
#!{sys.executable} -m pip install ipython-autotime

### __Imports

In [1]:
from Bio import SeqIO
import pandas as pd
#supplied codon/aa dictionary
aa_dict = {'M':['ATG'], 'F':['TTT', 'TTC'], 'L':['TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'], 'C':['TGT', 'TGC'], 'Y':['TAC', 'TAT'], 'W':['TGG'], 'P':['CCT', 'CCC', 'CCA', 'CCG'], 'H':['CAT', 'CAC'],
'Q':['CAA', 'CAG'], 'R':['CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'], 'I':['ATT', 'ATC', 'ATA'], 'T':['ACT', 'ACC', 'ACA', 'ACG'],
'N':['AAT', 'AAC'], 'K':['AAA', 'AAG'], 'S':['AGT', 'AGC', 'TCT', 'TCC', 'TCA', 'TCG'], 'V':['GTT', 'GTC', 'GTA', 'GTG'],
'A':['GCT', 'GCC', 'GCA', 'GCG'], 'D':['GAT', 'GAC'], 'E':['GAA', 'GAG'], 'G':['GGT', 'GGC', 'GGA', 'GGG'], '*':['TAA','TAG','TGA']}

### __Functions 

In [2]:
def codoncheckerSTR(stringy):
    if len(stringy)==3:
        for key,val in aa_dict.items():
            if str.upper(stringy) in val:  #Function returns a key entry in aa_dict and @ if fails
                return key
            else:
                return "@"
    else:
        return "Function needs to be supplied to be 3 chars"

def codoncheckerBool(stringy):
    if len(stringy)==3:
        for key,val in aa_dict.items():
            if str.upper(stringy) in val:#Function checks for 3 chars and returns a boolean if value in dict
                return True
            else:
                return False
    else:
        return False
    
def returnTriples(stringy):  #breaks DNA sequence into 3 character item list
    triples = [stringy[i:i+3] for i in range(0, len(stringy), 3)]
    return triples

### Delcare new Seq objects from parsing supplied FASTA file

In [3]:
%%time
fh = open("Mdomestica-pandasLab.fa","r")
newOBJs = SeqIO.to_dict(SeqIO.parse(fh,"fasta"))#convert to dict for easier dataframe translation

CPU times: user 143 ms, sys: 7.88 ms, total: 151 ms
Wall time: 150 ms


### Build new Pandas DataFrame from Dictionary and adding future columns

In [4]:
%%time 
myDF = pd.DataFrame(newOBJs.items(),columns=["Label","Seq"])
myDF.set_index("Label",inplace=True)# Setting index to Seq ID
#Build columns that will be updated later
myDF["SeqLen"]=0  # 1. Sequence Length
myDF["GCcontent"]= 0.0000  #2.GC content ratio of "G" & "C" divided by seq length
myDF["startCodon"]=False#3. if a valid start Codon Boolean by checking supplied aa_dict 
myDF["stopCodon"]=False#4. if a valid end Codon Boolean by checking supplied aa_dict 
myDF["compFrame"]=False#5. Boolean if divisble by 3 using modulus to check
#added coloumns for fun and testing later commented out for better dataframe visablility
#myDF['startString3'] ="AAA" # Test field to first 3 char
#myDF["endString3"]="ZZZ" # Test field to last 3 char
#myDF["startCodonAA"]="NA"#return translated AA -@ if not avail
#myDF["stopCodonAA"]="NA"#return translated AA -@ if not avail
#myDF["NumberofValidCodons"]=0

CPU times: user 5.81 ms, sys: 4.2 ms, total: 10 ms
Wall time: 8.47 ms


In [5]:
myDF.index #index proof that assigned Label 

Index(['MD10G1276500', 'MD10G1110200', 'MD10G1036500', 'MD10G1170700',
       'MD10G1250900', 'MD10G1316600', 'MD10G1188400', 'MD10G1113500',
       'MD10G1046100', 'MD10G1288900',
       ...
       'MD14G1048800', 'MD14G1203800', 'MD14G1121400', 'MD14G1080400',
       'MD14G1172500', 'MD14G1066400', 'MD14G1225900', 'MD14G1120900',
       'MD14G1057700', 'MD14G1237500'],
      dtype='object', name='Label', length=7496)

### Parsing through new Dataframe object, updating values

In [6]:
%%time
for indx in myDF.index:
        L=len(str(myDF.loc[indx,"Seq"].seq)) #determine seq length and store for recalling with ease
        myDF.at[indx,"SeqLen"]=L  #set seq length
        ctr = 0
        for i in str(myDF.loc[indx,"Seq"].seq): #iterate through chars in seq for GC
            if i == "C" or i=="G":
                ctr+=1
        myDF.at[indx,"GCcontent"]=ctr/L #set GC raito value
        #myDF.at[indx,"startString3"]=str(myDF.loc[indx,"Seq"].seq[0:3])  #first 3 chars of seq
        myDF.at[indx,"startCodon"]=codoncheckerBool(str(myDF.loc[indx,"Seq"].seq)[0:3]) #boolean check if start is valid DNA codon
        #myDF.at[indx,"startCodonAA"]=codoncheckerSTR(str(myDF.loc[indx,"Seq"].seq)[0:3]) #use codonchecker function to return in aa dictionary
        #myDF.at[indx,"endString3"]=str(myDF.loc[indx,"Seq"].seq)[L-3:L] #last 3 chars of seq
        myDF.at[indx,"stopCodon"]=codoncheckerBool(str(myDF.loc[indx,"Seq"].seq)[L-3:L]) #boolean check if end is valid DNA codon
        #myDF.at[indx,"stopCodonAA"]=codoncheckerSTR(str(myDF.loc[indx,"Seq"].seq)[L-3:L])#use codoncheck function to return in aa dictionary
               
        if L%3 == 0:  #determine complete dna frame
            myDF.at[indx,"compFrame"]=True
        else:
            myDF.at[indx,"compFrame"]=False

CPU times: user 2.25 s, sys: 3.64 ms, total: 2.25 s
Wall time: 2.25 s


### DataFrame output

In [7]:
%time myDF  #output of DataFrame

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 5.48 µs


Unnamed: 0_level_0,Seq,SeqLen,GCcontent,startCodon,stopCodon,compFrame
Label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
MD10G1276500,"(C, A, G, T, C, C, G, T, G, G, C, T, C, C, T, ...",2940,0.460544,False,False,True
MD10G1110200,"(A, T, G, G, C, G, T, C, T, C, T, C, T, C, C, ...",1731,0.464471,True,False,True
MD10G1036500,"(A, T, G, T, C, G, T, C, G, T, C, G, T, C, G, ...",468,0.544872,True,False,True
MD10G1170700,"(A, T, G, T, A, T, C, G, C, T, T, C, G, C, C, ...",1728,0.446759,True,False,True
MD10G1250900,"(A, T, G, G, A, A, G, T, G, T, A, T, G, G, G, ...",1278,0.402973,True,False,True
...,...,...,...,...,...,...
MD14G1066400,"(A, T, G, C, C, A, T, C, G, T, G, G, T, T, C, ...",423,0.529551,True,False,True
MD14G1225900,"(A, T, G, G, C, T, T, C, C, C, C, T, A, A, C, ...",594,0.503367,True,False,True
MD14G1120900,"(A, T, G, G, A, T, A, A, C, T, C, T, G, C, A, ...",939,0.481363,True,False,True
MD14G1057700,"(A, T, G, G, A, A, G, C, T, A, T, C, A, C, T, ...",483,0.523810,True,False,True
