# Finding a Shared Motif

## Searching Through the Haystack

In “Finding a Motif in DNA”, we searched a given genetic string for a motif; however, this problem assumed that we know the motif in advance. In practice, biologists often do not know exactly what they are looking for. Rather, they must hunt through several different genomes at the same time to identify regions of similarity that may indicate genes shared by different organisms or species.

The simplest such region of similarity is a motif occurring without mutation in every one of a collection of genetic strings taken from a database; such a motif corresponds to a substring shared by all the strings. We want to search for long shared substrings, as a longer motif will likely indicate a greater shared function.

## Problem

A common substring of a collection of strings is a substring of every member of the collection. We say that a common substring is a **longest common substring** if there does not exist a longer common substring. For example, "CG" is a common substring of "A**CG**TACGT" and "AAC**CG**TATA", but it is not as long as possible; in this case, "CGTA" is a longest common substring of "A**CGTA**CGT" and "AAC**CGTA**TA".

Note that the longest common substring is not necessarily unique; for a simple example, "AA" and "CC" are both longest common substrings of "AACC" and "CCAA".
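
The definition can be checked directly on small inputs. The following is a minimal brute-force sketch, not the approach used in the solution further down; the function name `longest_common_substrings` is made up for illustration. It tries every substring of the shortest string, longest first, and returns all of the ties.

```python
def longest_common_substrings(strings):
    """Return every longest common substring of `strings`, sorted."""
    shortest = min(strings, key=len)
    # Try the longest candidate lengths first and stop at the first hit
    for length in range(len(shortest), 0, -1):
        candidates = {shortest[i:i + length]
                      for i in range(len(shortest) - length + 1)}
        shared = sorted(c for c in candidates if all(c in s for s in strings))
        if shared:
            return shared
    return []  # the strings share no characters at all

print(longest_common_substrings(["ACGTACGT", "AACCGTATA"]))  # ['CGTA']
print(longest_common_substrings(["AACC", "CCAA"]))           # ['AA', 'CC']
```

The second call shows the non-uniqueness noted above: both "AA" and "CC" come back, and either one would be an acceptable answer.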

### Given: 
A collection of k (k≤100) DNA strings of length at most 1 kbp each in FASTA format.
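
In FASTA format each record starts with a `>` label line, and the sequence may be wrapped over several lines. As a rough sketch of what the input looks like once parsed, here is a standard-library reader. The helper name `read_fasta` and the labels `seq_1` to `seq_3` are assumptions made for illustration; the solution below uses Biopython instead.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (label, sequence) pairs."""
    records = []
    label, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new label closes the previous record, if there was one
            if label is not None:
                records.append((label, "".join(chunks)))
            label, chunks = line[1:], []
        else:
            chunks.append(line)  # sequence lines are joined per record
    if label is not None:
        records.append((label, "".join(chunks)))
    return records

# Hypothetical input; the second record is wrapped over two lines
sample = """>seq_1
GATTACA
>seq_2
TAGA
CCA
>seq_3
ATACA"""

sequences = [seq for _, seq in read_fasta(sample)]
print(sequences)  # ['GATTACA', 'TAGACCA', 'ATACA']
```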

### Return: 
A longest common substring of the collection. (If multiple solutions exist, you may return any single solution.)

# Solution One: Lego Bricks

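Start with a list of all 16 possible pairs of bases. Check each pair against the sequences in the FASTA file and keep, in a new list, only the pairs that are found in every sequence.
Then append each of the four bases to every motif in this new list and check again, building the next list. Keep doing this until the new list is empty; the last non-empty list holds the longest shared motifs. This works because every substring of a common substring is itself common, so a shared motif of length n+1 always extends a shared motif of length n.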
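
To see the rounds of this search in isolation, here is a compact, self-contained sketch that yields the survivors of each round. The generator name `grow_rounds` and its `start` parameter are made up for illustration and are not part of the notebook code.

```python
from itertools import product

BASES = "ATCG"

def grow_rounds(seqs, start=2):
    """Yield (length, survivors) for each round of the grow-and-filter search."""
    # Round one: every possible motif of the starting length
    candidates = ["".join(p) for p in product(BASES, repeat=start)]
    length = start
    while candidates:
        # Keep only the candidates found in every sequence
        survivors = [c for c in candidates if all(c in s for s in seqs)]
        yield length, survivors
        # Extend each survivor by one base for the next round
        candidates = [s + b for s in survivors for b in BASES]
        length += 1

for length, survivors in grow_rounds(["GATTACA", "TAGACCA", "ATACA"]):
    print(length, survivors)
# 2 ['AC', 'TA', 'CA']
# 3 []
```

The last non-empty round is the answer. Because the search starts from pairs, a collection whose longest shared motif is a single base would report nothing; passing `start=1` covers that case.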

## Functions

* load fasta files using biopython
* loop through fasta sequences and check motif
* loop through motifs and check if they are valid (call above)

In [1]:
# Load packages to parse fasta file and load file
from Bio import SeqIO # Import biopython

In [2]:
def getFasta():

    # Load file from the path provided by the user
    path = input("Please enter path to file (q to exit): ")

    if path == "q":
        raise SystemExit

    # SeqIO.parse takes the path (or an open handle) plus the format name.
    # Return the sequences as a list of plain strings so they can be scanned
    # more than once and tested with `in`.
    try:
        return [str(record.seq) for record in SeqIO.parse(path, 'fasta')]
    except FileNotFoundError:
        raise SystemExit("File not found: " + path)
            

In [3]:
# Check that a motif string occurs in every FASTA sequence. Returns True or False.
def checkFasta(motif, fastas):
    # all() stops at the first sequence that lacks the motif
    return all(motif in seq for seq in fastas)
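
One thing to watch with a check like this is that the sequences get scanned once per candidate motif, so they need to be in a reusable container such as a list. The sketch below uses a stand-in function, `check_all` (a hypothetical name), to show what happens when a one-shot generator is passed instead: after the first scan the generator is empty, and a check over nothing passes vacuously.

```python
def check_all(motif, seqs):
    """Stand-in for the check above: True if motif is in every sequence."""
    return all(motif in s for s in seqs)

seq_list = ["GATTACA", "TAGACCA", "ATACA"]
seq_gen = (s for s in seq_list)  # one-shot iterator over the same data

first = check_all("AC", seq_gen)      # True, but the generator is now used up
second = check_all("GGGG", seq_gen)   # True: nothing left to contradict it
third = check_all("GGGG", seq_list)   # False: the list can be rescanned

print(first, second, third)  # True True False
```

Converting parsed records to a list before the search avoids this.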
    

In [4]:
# for each motif in a list add a base and check if found in fasta sequences
# return list
def newMotifs(motifs, fasta):
    bases = ["A", "T", "C", "G"]
    newLst = []
    
    for motif in motifs:
        for base in bases:
            potential = motif + base
            if checkFasta(potential, fasta):
                newLst.append(potential)
    
    #print(newLst)
    
    return newLst
    

In [9]:
# Main function: load the sequences, then print the longest shared motifs

def main():
    # get fasta files
    fastaSeq = getFasta()
    
    getMotif(fastaSeq)
    
    
def getMotif(fasta):

    # Original list of potential motifs: all 16 pairs of bases
    motifs = ['AT', 'AC', 'AG', 'AA', 'TA', 'TC', 'TG', 'TT','CA', 'CT', 'CG', 'CC', 'GA', 'GT', 'GC', 'GG']

    # Keep only the pairs found in every sequence
    motifs = [motif for motif in motifs if checkFasta(motif, fasta)]
    longest = motifs

    # Extend the survivors one base at a time until none are left.
    # longest always holds the last non-empty list, so if no pair is
    # shared at all the result is an empty list rather than all 16 pairs.
    while len(motifs) > 0:
        longest = motifs
        motifs = newMotifs(motifs, fasta)

    print(longest)
    return longest
        
        

In [6]:
check = """GATTACA
TAGACCA
ATACA"""

check = check.split('\n')

print(check)
getMotif(check)

['GATTACA', 'TAGACCA', 'ATACA']
['AC', 'TA', 'CA']


In [11]:
check2 = """ATCCAGCTACT
GGGCAACTACT
ATGGATCTACT
AAGCAACCACT
TTGGAACTACT
ATGCCATTACT
ATGGCACTACT"""

check2 = check2.split('\n')

getMotif(check2)

['ACT']


In [23]:
# C:\Users\rwswo\Documents\Bioinformatics\git\rosalindTry\rosalind_cons.txt

fast = """>Rosalind_1
ATCCAGCTACT
>Rosalind_2
GGGCAACTACT
>Rosalind_3
ATGGATCTACT
>Rosalind_4
AAGCAACCACT
>Rosalind_5
TTGGAACTACT
>Rosalind_6
ATGCCATTACT
>Rosalind_7
ATGGCACTACT"""

# SeqIO.parse expects a file path or an open handle, not raw FASTA text,
# so wrap the string in StringIO
from io import StringIO

records = list(SeqIO.parse(StringIO(fast), 'fasta'))
fa = [str(record.seq) for record in records]

for record in records:
    print(record.id)

Rosalind_1
Rosalind_2
Rosalind_3
Rosalind_4
Rosalind_5
Rosalind_6
Rosalind_7