# Finding a Shared Motif
**Given:** A collection of *k* (*k* ≤ 100) DNA strings of length at most 1 kbp each in FASTA format.

**Return:** A longest common substring of the collection. (If multiple solutions exist, you may return any single solution.)


# Sample Dataset

In [1]:
%%file Sample_Dataset.txt
>Rosalind_1
GATTACA
>Rosalind_2
TAGACCA
>Rosalind_3
ATACA



Overwriting Sample_Dataset.txt


# Sample Output

In [2]:
%%file Sample_Output.txt
AC



Overwriting Sample_Output.txt

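The sample output lists `AC`, but the problem statement accepts any longest common substring. The short check below is only an illustration, using the three sample strings and a hypothetical helper named `isCommonSubstring`. It suggests that `AC`, `CA`, and `TA` are all shared by the sample strings, while the length-3 candidate `ACA` is not.

```python
def isCommonSubstring(candidate, strings):
    """Return True if `candidate` occurs contiguously in every string."""
    return all(candidate in s for s in strings)

# The three strings from the sample dataset.
sample_strings = ["GATTACA", "TAGACCA", "ATACA"]

# Length-2 candidates that appear in all three strings.
print([m for m in ("AC", "CA", "TA") if isCommonSubstring(m, sample_strings)])

# "ACA" is missing from TAGACCA, so it is not a shared motif.
print(isCommonSubstring("ACA", sample_strings))
```

Any of the three length-2 motifs would therefore be an acceptable answer for the sample.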

# Solution

In [3]:
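import collections

def parseFastaFile(fasta_file_path):
    "Parse a FASTA file into an OrderedDict mapping each header (without '>') to its sequence."
    fasta_records = collections.OrderedDict()
    current_header = None

    # The with-block closes the file even if parsing raises.
    with open(fasta_file_path, 'r') as fasta_file:
        for line in fasta_file:
            line = line.rstrip()
            if line.startswith(">"):
                current_header = line[1:]
                fasta_records[current_header] = ""
            elif line and current_header is not None:
                # Sequence data may span several lines; join them.
                fasta_records[current_header] += line

    return fasta_records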

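FASTA records in the downloaded datasets wrap their sequence over several lines, so a parser has to join every line between two headers. As an alternative illustration of that grouping step, here is a minimal sketch built on `itertools.groupby`. The name `iterFastaRecords` is hypothetical, and the function takes an iterable of lines rather than a file path so that it can be tried without touching the disk.

```python
import itertools

def iterFastaRecords(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    stripped = (line.strip() for line in lines)
    non_empty = (line for line in stripped if line)
    # Consecutive lines are grouped by whether they are header lines.
    groups = itertools.groupby(non_empty, key=lambda line: line.startswith(">"))
    header = None
    for is_header, group in groups:
        if is_header:
            # Drop the leading '>'; if headers repeat, keep the last one.
            header = list(group)[-1][1:]
        elif header is not None:
            # Join the wrapped sequence lines into one string.
            yield header, "".join(group)
            header = None

example_lines = [">Rosalind_1\n", "GATT\n", "ACA\n", ">Rosalind_2\n", "TAGACCA\n"]
print(list(iterFastaRecords(example_lines)))
```

In this sketch a header with no sequence lines after it yields no record, which is a design choice rather than part of the FASTA format.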

In [4]:
def getLongestCommonSubsequence(sequences):
    """Return a longest common substring (contiguous, despite the name) of a list of DNA strings.

    Ties are broken by lexicographic order so the result does not depend on scan order.
    """
    reference = sequences[0]
    longest_common_subsequence = ""
    for subsequence_start in range(len(reference)):
        # Slice ends are exclusive, so the end index must run up to len(reference) inclusive.
        for subsequence_end in range(subsequence_start + 1, len(reference) + 1):
            subsequence = reference[subsequence_start:subsequence_end]
            if not all(subsequence in sequence for sequence in sequences):
                break  # no longer substring starting here can be common either
            longer = len(subsequence) > len(longest_common_subsequence)
            tie = len(subsequence) == len(longest_common_subsequence)
            if longer or (tie and subsequence < longest_common_subsequence):
                longest_common_subsequence = subsequence

    return longest_common_subsequence

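The function above scans substrings of the first sequence and tests each one against every other sequence. For inputs near the stated limits (100 strings of 1 kbp), one common way to reduce the work is a binary search on motif length: if a common substring of length *n* exists, a common substring of every shorter length exists too. The following is a sketch of that idea, not a tuned implementation; both function names are hypothetical.

```python
def commonSubstringsOfLength(sequences, length):
    """Return the set of substrings of exactly `length` found in every sequence."""
    shortest = min(sequences, key=len)
    candidates = {shortest[i:i + length] for i in range(len(shortest) - length + 1)}
    return {c for c in candidates if all(c in s for s in sequences)}

def longestCommonSubstringBinarySearch(sequences):
    """Binary-search the motif length; return the smallest motif of the best length."""
    if not sequences:
        return ""
    low, high = 0, min(len(s) for s in sequences)
    best = ""
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        found = commonSubstringsOfLength(sequences, mid)
        if found:
            best = min(found)  # deterministic pick among equally long motifs
            low = mid
        else:
            high = mid - 1
    return best

print(longestCommonSubstringBinarySearch(["GATTACA", "TAGACCA", "ATACA"]))
```

Candidates are drawn from the shortest sequence because no common substring can be longer than it.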

In [5]:
def getLongestCommonSubsequenceFromFileToFile(input_file_path, output_file_path):
    "Wraps getLongestCommonSubsequence to read from input_file_path and write to output_file_path"

    sequences = list(parseFastaFile(input_file_path).values())
    output_string = getLongestCommonSubsequence(sequences)
    # The with-block closes the output file even if the write fails.
    with open(output_file_path, 'w') as output_file:
        output_file.write("%s\n" % output_string)


# Test Solution

In [6]:
getLongestCommonSubsequenceFromFileToFile("Sample_Dataset.txt", "Test_Output.txt")

In [7]:
%%bash
echo Sample_Output.txt
md5sum Sample_Output.txt
cat Sample_Output.txt

Sample_Output.txt
a82b620e3d0ff46272c794c576f9040a  Sample_Output.txt
AC


In [8]:
%%bash
echo Test_Output.txt
md5sum Test_Output.txt
cat Test_Output.txt

Test_Output.txt
a82b620e3d0ff46272c794c576f9040a  Test_Output.txt
AC


In [9]:
%%bash
if [ "$(md5sum Sample_Output.txt|cut -f1 -d' ')" = "$(md5sum Test_Output.txt|cut -f1 -d' ')" ]
then
    echo Sample output matches test output.
else
    echo Sample output does not match test output.
fi

Sample output matches test output.

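Comparing checksums only succeeds when the solver happens to pick the same motif as the sample file, yet the problem accepts any longest common substring. A looser check, sketched below with a hypothetical helper named `isLongestCommonSubstring`, tests the two properties that actually matter: the candidate occurs in every sequence, and no common substring one character longer exists.

```python
def isLongestCommonSubstring(candidate, sequences):
    """True if `candidate` is in every sequence and no longer common substring exists."""
    if not all(candidate in s for s in sequences):
        return False
    shortest = min(sequences, key=len)
    longer = len(candidate) + 1
    # If any common substring longer than candidate existed, one of length
    # len(candidate) + 1 would exist too, so checking that length suffices.
    return not any(
        all(shortest[i:i + longer] in s for s in sequences)
        for i in range(len(shortest) - longer + 1)
    )

sample_sequences = ["GATTACA", "TAGACCA", "ATACA"]
for motif in ("AC", "CA", "TA", "A", "ACA"):
    print(motif, isLongestCommonSubstring(motif, sample_sequences))
```

With the sample strings, the three length-2 motifs pass, `A` fails because it is too short, and `ACA` fails because it is absent from `TAGACCA`.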

# Downloaded Dataset

In [10]:
%%bash
cp ~/Downloads/rosalind_lcsm.txt ./
cat rosalind_lcsm.txt

>Rosalind_1648
TAATATGGATGTATTGGCTGGTTCGTGTTTACATGGTGGATAAGAATTCCATCCACATTC
CGTACTCAGTAGTAACAGTGTTAGAAGTATTTCGGTTTTCCTGGCTTATTTTACGCTCGT
ACAAACGCTGACCAAACCTTCTTTGTGCCGTAGATGGTAGGTGGTGGACATGTTAATTGA
GTAAATGCCCCTATCCCGCTACGACGGCCGTAGCTATAAGCAGCTTTATAGAAGGGGTGA
ATGATGAGGAATGTCGCGCCTGCTTTCATCGACCTGCGACTTCTTCCGCGCTGACGTAAC
GGATTCCACTGATGGCTCAGGTCCTTGGGTAGCTGAGGGTGACACAGGTCAGCCCGGCAC
GTCCCGGCCGACTCGTTACTCAGCTTGAATTTAAGGTGTTTCTTAAGCAGTTCAATTATT
TAGACAAGCAAATTACGCACAAGTCTCCCCTGACGGGGCGAAATCCCAAGCCTGTTTCTT
ATAGCGTGGGCAGTCCTAGCGCGAATTGCGCGTAACTACCAGTACTCGTCCACTGACGAG
TAGTACTGGAAGCGAGACTGTTATCATTAAACCGGCACTAATCGATTCGCATGATACTTA
GCCCAGTCAGTATATAGGAGATTCCTGGCTAGGTGTCATCGCAGCAGCATACTGCCTAGA
CTGCGATTTAACGCTATAGGATTGCATCGCTGATCACTTCAAGGTTCCCCGAAGCAGTTA
AAGAGAGAGTCATCACATAAGGGGGTGCCGGTGAATAGTAGGAGGAGAAGTAGCATTTAT
TTCAGGTCGATATGTTTCTTTCTCTGGCAGAAAGTCAGGTAATGCGACCCACGCAATTGA
TGCGTACGGAACAGGACTGCGAGACGTTAAGTAGCCGGTACTGTTTGAATCAACTCGTAA
TATGCCCCGAGGACGGCCTAACTCGGCGAACAAGGATAGTACTCGCGTGACGTTTTGTAT
CGCATTGAA

# Solution to Downloaded Dataset

In [11]:
getLongestCommonSubsequenceFromFileToFile("rosalind_lcsm.txt", "Solution_Output.txt")

In [12]:
%%bash
cat Solution_Output.txt

AGCGAGACTGTTATCATTAAACCGGCACTAATCGATTCGCATGATACTTAGCCCAGTCAGTATATAGGAGATTCCTGGCTAGGTGTCATCGCAGCAGCATACTGCCTAGACTGCGATTTAACGCTATAGGATTGCATCGCTGATCACTTCAAGGTTCCCCGAAGCAGTTAAAGAGAGAGTCATCACATAAGGGGGTGCCGGTGAATAGTAGGAGGAGAAGTAGCATTTATTTCAGGTCGATATGTT
