A Distance Unit for Genes Based on Codon Usage Bias

Think about KNN, a simple method that labels a data point according to its nearest neighbors under some distance metric. The most commonly used metric is Euclidean distance, but different distance metrics can change the result of KNN drastically.
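To make the effect of the metric concrete, here is a minimal sketch of a 1-nearest-neighbor lookup on two toy points. The function names (`euclidean`, `manhattan`, `nearest_label`) and the data are made up for illustration; the only thing that changes between the two calls is the metric.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def nearest_label(query, labeled_points, metric):
    # 1-NN: return the label of the labeled point closest to the query.
    return min(labeled_points, key=lambda item: metric(query, item[0]))[1]

# Toy data (hypothetical): two labeled points and one query.
points = [((3.0, 3.0), "A"), ((0.0, 5.0), "B")]
query = (0.0, 0.0)

print(nearest_label(query, points, euclidean))  # A (sqrt(18) ~ 4.24 vs 5)
print(nearest_label(query, points, manhattan))  # B (6 vs 5)
```

The same query gets a different neighbor depending on the metric, which is why the choice of a gene distance matters for everything built on top of it.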

Now suppose we have a FASTA file and want to cluster the genes in it based on their codon usage bias (CUB). CUB is known to vary between groups of genes under different conditions, even within the same species. Can we devise a way to cluster these genes using metrics derived from CUB statistics? In other words, can we come up with a measure of distance between genes based on CUB statistics?

Let's take a first shot. First, we load the FASTA file.

In [1]:
from Bio import SeqIO

def findSequenceByID(inputFile,idType="locus_tag"):
    """Map gene IDs to sequences; idType="raw" keys by the full header."""
    print ("Selected id Type: %s"%(idType))
    geneDict=dict()
    cnt=0    # records whose header does not contain idType
    mySum=0  # total number of records read
    for record in SeqIO.parse(inputFile, "fasta"):
        mySum+=1
        header=str(record.description)
        if idType=="raw":
            geneDict[header]=str(record.seq)
            continue
        startTargetIndex=header.find(str(idType))
        if startTargetIndex<0:
            cnt+=1
            continue
        # Skip the tag and the one separator character that follows it,
        # then read the ID up to the next "]" or "," (or the end of the header).
        charIndex=startTargetIndex+len(idType)+1
        idName=""
        while charIndex<len(header) and header[charIndex] not in "],":
            idName+=header[charIndex]
            charIndex+=1
        # Keep only the first sequence seen for each ID.
        if idName not in geneDict:
            geneDict[idName]=str(record.seq)
    print ("There are %s entries NOT found out of %s"%(cnt,mySum))
    print ("%s distinct record in %s entries"%(len(geneDict),mySum))
    return geneDict

In [2]:
targetFastaFile="Fastas/c_elegan.fasta"
geneDict=findSequenceByID(targetFastaFile,idType='Gn')


Selected id Type: Gn
There are 4617 entries NOT found out of 30167
19514 distinct record in 30167 entries


Now we need to define a function that gets CUB statistics from a gene and converts them to a distance metric.
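As a point of reference, the simplest version of this idea might look like the sketch below: turn each gene into a vector of 64 codon frequencies and take the Euclidean distance between two such vectors. This is only an illustrative baseline, not the method used in this notebook, and the names `codon_frequencies` and `codon_distance` are hypothetical. Note that raw codon frequencies also reflect amino acid composition, so this distance mixes codon bias with differences in protein content.

```python
from collections import Counter
from itertools import product
import math

# All 64 codons in a fixed order, so every gene maps to the same vector layout.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]
CODON_SET = set(CODONS)

def codon_frequencies(seq):
    # Split the sequence into in-frame codons, ignoring a trailing partial codon.
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    # Count only unambiguous codons (skips anything containing N, etc.).
    counts = Counter(c for c in codons if c in CODON_SET)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(CODONS)
    return [counts[c] / total for c in CODONS]

def codon_distance(seq_a, seq_b):
    # Euclidean distance between the two codon frequency vectors.
    fa = codon_frequencies(seq_a)
    fb = codon_frequencies(seq_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))

# Toy sequences (hypothetical): same protein, different synonymous codons.
gene_a = "ATGGCTGCT"
gene_b = "ATGGCCGCC"
print(codon_distance(gene_a, gene_a))
print(codon_distance(gene_a, gene_b))
```

With `geneDict` loaded as above, any two entries could be compared in the same way, for example `codon_distance(geneDict[id1], geneDict[id2])` for two IDs of interest.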

In [8]:


sampleGeneSeq=geneDict[list(geneDict.keys())[0]]


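def MLE_from_gene(seq):
    from Code import MLE

    # Point the MLE module at the selection and mutation parameter files.
    MLE.deltaEtaFile="Archive/Crei_Selection.csv"
    MLE.deltaMFile="Archive/Crei_Mutation.csv"
    MLE.main()
    # Load the sequence passed in and run the module's estimation on it.
    codonList=MLE.loadSequence(seq)
    MLE_PHI_List=MLE.method4(codonList)
    return MLE_PHI_List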


  

ATGAGCTCTTCATCATCTTCTAGAATTCACAATGGTGAAGATGTTTATGAAAAGGCGGAGGAATACTGGAGCCGCGCGAGCCAGGACGTCAACGGAATGCTCGGCGGATTCGAAGCGCTTCACGCGCCCGACATATCGGCGTCGAAACGATTTATTGAAGGACTGAAGAAAAAGAATCTATTCGGCTACTTTGACTATGCACTGGACTGCGGAGCGGGTATCGGACGTGTTACAAAGCATCTCTTAATGCCATTCTTCTCGAAAGTTGATATGGAAGACGTCGTCGAGGAGTTGATCACGAAAAGTGATCAATATATTGGAAAACATCCACGAATTGGAGATAAATTCGTCGAAGGACTGCAGACGTTTGCACCGCCCGAACGACGTTATGATTTGATATGGATTCAATGGGTTTCAGGGCATTTGGTTGATGAGGATTTGGTTGATTTCTTTAAAAGATGTGCGAAAGGACTGAAACCTGGTGGATGTATTGTGCTCAAGGATAATGTGACAAATCACGAGAAACGGTTATTCGACGATGATGATCATAGTTGGACGAGAACAGAGCCCGAGCTTCTTAAAGCGTTCGCCGATTCTCAACTGGACATGGTCTCGAAAGCACTGCAAACCGGATTCCCAAAGGAGATTTATCCAGTAAAAATGTATGCATTGAAGCCTCAACACACCGGATTCACCAATAATTGA
