Bioinformatics Algorithms: genome assembly

To assemble a genome from its shotgun fragments (the complete collection of all fragments of one fixed length k), do the following:
. Turn the k-mer collection (p137_data.txt) into an adjacency list by running the code 'De Bruijn algorithm'.
. Use the adjacency list (p147_data.txt) to put the k-mers in order by running the code 'Euler's theorem: path'.
. Use the ordered list (p125_data.txt) to reconstruct the DNA string by running the code 'Un-shotgun'.
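
The three steps can be sketched end to end on a toy input. This is a minimal sketch, not the notebook's own cells: the function names (`de_bruijn`, `eulerian_path`, `path_to_string`, `assemble`) and the example string are made up for illustration, and the path search uses the standard stack-based (Hierholzer) method, which may differ from the code in 'Euler's theorem: path'. It assumes an Eulerian path exists.

```python
from collections import defaultdict

def de_bruijn(kmers):
    """Adjacency list: each k-mer is an edge from its prefix to its suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_path(graph):
    """Return the nodes of an Eulerian path (assumes one exists)."""
    # out-degree minus in-degree for every node
    balance = defaultdict(int)
    for node, successors in graph.items():
        balance[node] += len(successors)
        for nxt in successors:
            balance[nxt] -= 1
    # the path starts at the node with one surplus outgoing edge;
    # if every node is balanced (a cycle), any node will do
    start = next((n for n, b in balance.items() if b == 1), next(iter(graph)))
    unused = {node: list(successors) for node, successors in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if unused.get(node):
            stack.append(unused[node].pop())   # follow an unused edge
        else:
            path.append(stack.pop())           # dead end: node is final
    return path[::-1]

def path_to_string(nodes):
    """First node in full, then the last base of each following node."""
    return nodes[0] + ''.join(node[-1] for node in nodes[1:])

def assemble(kmers):
    return path_to_string(eulerian_path(de_bruijn(kmers)))

# toy example: the 3-mers of ACGTTGCA in alphabetical order
kmers = sorted(['ACG', 'CGT', 'GTT', 'TTG', 'TGC', 'GCA'])
print(assemble(kmers))  # ACGTTGCA
```

In this toy input every 2-mer occurs only once, so the path (and the assembled string) is unique. With repeated (k-1)-mers several Eulerian paths can exist, and the sketch returns just one of them.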

Shotgun<br>
Input: An integer k and a string Text.<br>
Output: All k-mers that occur in Text (the k-mer composition of Text), listed in alphabetical order.

In [1]:
# read k (line 1) and Text (line 2)
with open('p120_data.txt') as f:
    k = int(f.readline().strip())
    text = f.readline().strip()
# slide a window of width k over Text
output = []
for i in range(len(text)-k+1):
    output.append(text[i:i+k])
# sort once, after all k-mers have been collected
output.sort()
print('\n'.join(output))

AAAACAATTTATACGGCTGAGAACGGGCGACCGCACTGCTTCGGCCATTCGGAGGAAAGTAGTTCACCTATATCCCACCTAAAATGGACGTCCCTATCAT
AAAACCCATACCGTACCTCGTATTCTGTTCCTATTCTATGGCTATGCAGAGCGCTTGTATACGTTCTGCTGTCGCGCTCACTCCATTCAACATCCCGCCT
AAAACGCATAGCCTCTCGCAGTTGAGCTGGCGAGGCCGTCGTAACCCGTGGATCTGTAGCGGAACTCGATGAGATCTCGCCTATTAGGGACCCTAGGGAT
AAAACGGAAGCACCTGGAGTAACAGCTCTCGCCAAGTGCCGGGAGTATGCGTTGCTATGGAGGCGGCCAGCACTGCAGTTACGGACATAGTGCGCAAGTC
AAAACTCAGGTCTGGCATCAACGTTTCGGTCGCGACTCCCTAGTATGGTAATTTGAGGAGAATGCTAGCGGGCCATGTTCCGCCATGACGAGAGCTAGGG
AAAAGAAATGGCAGTCTCATTGACGACCAGTAGTACTTGCCGCCCCGAAAAGGTCAGCCCTCTACTAGTGACTCAGCATCCACCCTCTTGGCTACCAAGC
AAAAGCGGTGACTCTGCGTGCCGCTGTAGCTTACTTTTAGTACCCACTCTCAGGAATAGTTACAATGTCCAATGAACCAGAAGCACTGTACCGCATCATG
AAAAGGCCACCAAGCCAACGACTTTGAACTTCTGTACGCGTAGCCTCTGAAGGTCAAATGCAAGCAAGCCACCAAAAGAAATGGCAGTCTCATTGACGAC
AAAAGGCGGAGCTTGAAGCTCATCCGCGTCCGCGAAGGGGCCAACTGTGGTGCGGTGCCAGTTGCGGTGAGGTCTTACCCCACTTTGTTCATTTATATGA
AAAAGGTCAGCCCTCTACTAGTGACTCAGCATCCACCCTCTTGGCTACCAAGCGCCAGGCCATCAACCAGGAGATCTCAATTACGCCATCG

Un-shotgun<br>
Input: An ordered list of k-mers (k_set) in which each k-mer overlaps the next one perfectly, shifted one base to the right.<br>
Output: The re-assembled string.

In [11]:
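with open('p125_data.txt') as f:
    k_set = f.read().split()   # split() also drops a trailing empty line
# take the first base of every k-mer except the last, then the whole last k-mer
output = ''
for i in range(len(k_set)-1):
    output = output + k_set[i][0]
output = output + k_set[-1]
print(output)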

ATCACCTACATTTATCCCAGTCCGATCCGAATCTATTGTACCTTTTTACACGACATGTCCGAGAACTATGTCGTGTCCAGCTCAGAGTACACCAAGCGCACGAAGTCGTCATCCCGTCGTGAACTATCGCAAAACCCCTGACCTAAGACTGGACCTTCTATCGGCTGAACTTTACGCTCTGCGGGGGACATTGGACACACGAGCGGTGTGCGGGAGACTAAAAGAAGTCGAGGCCTTCGATAAAATATCAACTGGACCACGTCACCGGTGCTTGACGGTTACCTTGCCCCCCGGAGGTCGATCGGAGTGGAGGGAGTAGGGTCTGCGGTGGATAGGATATCATTGGGACTCACTAGTCGTATCAAGTCAACACCATCGGGGTAGCTACACACATGACCATTGACTCATCGCTCGATAACCGGGGAACAAGGTGACTCCTTCATGGTTGCTTTCGTGTGGTATTCTCCAGAAGCTGGCTCAGAATCTGCTACATCCCCCGCATATTACCTCTCGAGGCGAGATACCACGTACCTTATTAAAATCCCTTAATAGAAAATTAAAACATCTTGTCCTGTTAAATTCCAGCACGGCCTCTGTGGGGCATGTGCTGAGATACGAGTCTGCGAGGTGTGCAGGGAAAGAAGTCTCACGTGACCCCAGGGACCTTACGGCAGGCTCTTCCAATACGGAGCTAGCGTTCTGCCGACTAATAACTTGTACAAAACTTACGCACAGAACGGAGTCCACACGTGCCCAATCGCGATACACCCAGGTTATCACGATGCACCAAGACTTGCCTCACAATCCGTGTGAACAGATGACTCGAGGCCCTTCGCTAGCTAAGTTGAATCTAGTCTTTCTTCAGTTCGTAACCTTCATAGGACTAGGACGCCGGAATTAGTGGCTGACGAGTGTCTTAAACCCGTGTATTGACACTCTGACACTTAAAACGCCGGTTGTCCAGGAAGATTGTGGAAGTAAAGAAAGCTCTCTAAACCAC

Construct overlap graph<br>
Input: A collection Patterns of k-mers.<br>
Output: The overlap graph Overlap(Patterns), in the form of an adjacency list (one list of neighbours for each of the N nodes).
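
As a small illustration of the definition, the sketch below builds the overlap graph for five 5-mers held in memory. The function name `overlap_graph` and the sample k-mers are assumptions made for this example. It indexes the k-mers by prefix, so each suffix is looked up directly instead of being compared against every other k-mer. Note one difference from a pairwise comparison that skips the k-mer itself: here a k-mer such as AAAAA would be listed as its own neighbour.

```python
from collections import defaultdict

def overlap_graph(patterns):
    """Return {k-mer: [k-mers whose prefix equals its suffix]}."""
    # index every k-mer by its first k-1 bases
    by_prefix = defaultdict(list)
    for pattern in patterns:
        by_prefix[pattern[:-1]].append(pattern)
    graph = {}
    for pattern in patterns:
        # neighbours are the k-mers that start with this k-mer's suffix
        neighbours = by_prefix.get(pattern[1:], [])
        if neighbours:
            graph[pattern] = list(neighbours)
    return graph

patterns = ['ATGCG', 'GCATG', 'CATGC', 'AGGCA', 'GGCAT']
for node, neighbours in sorted(overlap_graph(patterns).items()):
    print(node + ' -> ' + ','.join(neighbours))
# AGGCA -> GGCAT
# CATGC -> ATGCG
# GCATG -> CATGC
# GGCAT -> GCATG
```

ATGCG has no outgoing edge because no k-mer starts with TGCG, so it does not appear as a key.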

In [53]:
with open('p128_data.txt') as f:
    k_set = f.read().split()
for i in range(len(k_set)):
    # compare against every other k-mer (one copy of the k-mer itself is left out)
    compare_set = list(k_set)
    compare_set.remove(k_set[i])
    for j in range(len(compare_set)):
        # edge when the suffix of k_set[i] equals the prefix of compare_set[j]
        if k_set[i][1:] == compare_set[j][:-1]:
            # print every edge, not only the first one found
            print(k_set[i] + ' -> ' + compare_set[j])

GCAACTGTTATTCAAATAAAACGTA -> CAACTGTTATTCAAATAAAACGTAT
AACGCCCCGGTCGCGGTGTGGCGAT -> ACGCCCCGGTCGCGGTGTGGCGATT
GGGCGTAGGAAAAAGCTTCAGTAGG -> GGCGTAGGAAAAAGCTTCAGTAGGA
TTTTTGGTTTTCATTAACACAGTAA -> TTTTGGTTTTCATTAACACAGTAAC
TTTGCAGGGTTTACACCCTCGGTAT -> TTGCAGGGTTTACACCCTCGGTATG
ATGTCTATTGAATAGCCTTCGCTCG -> TGTCTATTGAATAGCCTTCGCTCGG
ACAGTCATTCCGTGGGGGGCGTCGT -> CAGTCATTCCGTGGGGGGCGTCGTA
CGTTTGCGCGGCCAGCACGGCGAGC -> GTTTGCGCGGCCAGCACGGCGAGCA
TCTATCACGTCTAAGATAGCTTGGT -> CTATCACGTCTAAGATAGCTTGGTG
TCGCCCTGTTGCTCTGGGATTGGTA -> CGCCCTGTTGCTCTGGGATTGGTAC
ATCAAGAATTGCTCGCTTCGACGCT -> TCAAGAATTGCTCGCTTCGACGCTG
TCCTGAGCCGCATCACGGTTAATCA -> CCTGAGCCGCATCACGGTTAATCAG
TTTACCATACAGCATTACTCGGCCG -> TTACCATACAGCATTACTCGGCCGA
ACTCTCTAGCATTTACCAGAATTAC -> CTCTCTAGCATTTACCAGAATTACA
TCCAAATTTGCATTTGGTACGAGGC -> CCAAATTTGCATTTGGTACGAGGCC
CCTTCGCTCGGATACTACTGAGTAC -> CTTCGCTCGGATACTACTGAGTACT
AGATGTCATCAATTAAGCGGGCAGA -> GATGTCATCAATTAAGCGGGCAGAC
GATAACGGGTTTACCCGCAGTATCG -> ATAACGGGTTTACCCGCAGTATCGC
AAAGAAGATT

TTGTATACCACTTTATCTAGAACTT -> TGTATACCACTTTATCTAGAACTTC
ATCATCGTACCTAGTCGTACCTAGC -> TCATCGTACCTAGTCGTACCTAGCT
TTTCCGTGGAAATACCCGCAGAGGT -> TTCCGTGGAAATACCCGCAGAGGTA
CCCTATCGTTGGATGCAACGCAGGA -> CCTATCGTTGGATGCAACGCAGGAT
CATCAATTAAGCGGGCAGACGCAGG -> ATCAATTAAGCGGGCAGACGCAGGG
ACAAACTAAACAGTTACATTATGAG -> CAAACTAAACAGTTACATTATGAGG
ACGCGGACTACCATTCATTAGTGCG -> CGCGGACTACCATTCATTAGTGCGG
TTGAGCTGGACCAGACAGTTCGGCA -> TGAGCTGGACCAGACAGTTCGGCAT
ACCTGACCGCTAGGTACTGGCACCG -> CCTGACCGCTAGGTACTGGCACCGA
GAATAGGTGGACAAACTAAACAGTT -> AATAGGTGGACAAACTAAACAGTTA
CGCCCCGGTCGCGGTGTGGCGATTG -> GCCCCGGTCGCGGTGTGGCGATTGT
ACTGTTCGCTCTCAGCATTATGGTT -> CTGTTCGCTCTCAGCATTATGGTTT
AACGGAGGCTTTACATGATTCGTGA -> ACGGAGGCTTTACATGATTCGTGAT
TGCGTGCCCCCACATCACAGGACCT -> GCGTGCCCCCACATCACAGGACCTA
GAGCTTCAATCCTCGAGAAGGGACG -> AGCTTCAATCCTCGAGAAGGGACGC
CCAGTCTGCTCTTGGGAGGTGTTCA -> CAGTCTGCTCTTGGGAGGTGTTCAT
GTATTTCAGTGCGAACAGTATCCAC -> TATTTCAGTGCGAACAGTATCCACT
GCACGCTGGCCCCCGCAGGCGCTAA -> CACGCTGGCCCCCGCAGGCGCTAAT
CAAGCCGCAA

GTCCTATGAGCTAAAACCAAATCCT -> TCCTATGAGCTAAAACCAAATCCTT
TCGCCTCTCTAGATTCAAGGTATAC -> CGCCTCTCTAGATTCAAGGTATACG
GGGACGCCTGCGAGGACGCCGTGCT -> GGACGCCTGCGAGGACGCCGTGCTC
CGACGCTGTGAAAGTAGAATCCGGG -> GACGCTGTGAAAGTAGAATCCGGGC
TGAACAGTAAAACTGTTGATCACTA -> GAACAGTAAAACTGTTGATCACTAC
CATGCTCGGAAGAGCAAATATCATG -> ATGCTCGGAAGAGCAAATATCATGT
CCTGGGATCATCGTACCTAGTCGTA -> CTGGGATCATCGTACCTAGTCGTAC
CAGGTCTCGCGACTACGAGTACACC -> AGGTCTCGCGACTACGAGTACACCG
TCTTTACTAGAAGGTAGAAGCTGAG -> CTTTACTAGAAGGTAGAAGCTGAGA
GCGATTGGAGGGTAGAGTCTACTTG -> CGATTGGAGGGTAGAGTCTACTTGG
TATCAATATAGTCGCGCAGACGGAA -> ATCAATATAGTCGCGCAGACGGAAC
GTATCGAGCAATAGTGAGCTGATCA -> TATCGAGCAATAGTGAGCTGATCAG
TTGGGAGCACCGCCTTGACGTCTTT -> TGGGAGCACCGCCTTGACGTCTTTC
CACCCTGCGACACAGACGTGTGCAC -> ACCCTGCGACACAGACGTGTGCACG
GAGCATGCTCGGAAGAGCAAATATC -> AGCATGCTCGGAAGAGCAAATATCA
AACACGACAAGCACAGTTCGCCTCG -> ACACGACAAGCACAGTTCGCCTCGC
CCTGTCAATGTTATTAGGAATCTCA -> CTGTCAATGTTATTAGGAATCTCAT
GGCTTATAGCGTGCCTCGCTGGCCT -> GCTTATAGCGTGCCTCGCTGGCCTG
GTAAAACTGT

TCTCTAGATTCAAGGTATACGTCCG -> CTCTAGATTCAAGGTATACGTCCGT
CGTATTTCAGTGCGAACAGTATCCA -> GTATTTCAGTGCGAACAGTATCCAC
TTTCTGACTAGTTACAGAGCCTGCG -> TTCTGACTAGTTACAGAGCCTGCGA
ACGGAAAGCAGAATGGTGTTCACTA -> CGGAAAGCAGAATGGTGTTCACTAG
CGTCCCGGTAATGCTCCTCGGATAG -> GTCCCGGTAATGCTCCTCGGATAGC
TTTGTGCCCGGACTGTAAGAAGCGT -> TTGTGCCCGGACTGTAAGAAGCGTT
TGAGATGCGTGCCCCCACATCACAG -> GAGATGCGTGCCCCCACATCACAGG
GTTTACACCCTCGGTATGGTGCTCG -> TTTACACCCTCGGTATGGTGCTCGT
TAAACAGTTACATTATGAGGCCAAC -> AAACAGTTACATTATGAGGCCAACA
ACTAGAAGGTAGAAGCTGAGATACA -> CTAGAAGGTAGAAGCTGAGATACAT
TTCCGACTATGACTCTCTTCTGATC -> TCCGACTATGACTCTCTTCTGATCC
TTAGGGAACAAATAAGCGACATTGA -> TAGGGAACAAATAAGCGACATTGAG
TCAGGAATGGTCGGCACGCTACGCC -> CAGGAATGGTCGGCACGCTACGCCT
TCACAGGACCTATGATCGCCCAGCG -> CACAGGACCTATGATCGCCCAGCGT
GTGGTGATCTCCTTACTATATGGAG -> TGGTGATCTCCTTACTATATGGAGG
CCGCGCCTGCAACAGGTCTCGCGAC -> CGCGCCTGCAACAGGTCTCGCGACT
CAATTCCTGGCCGCTTGCGTTGCGC -> AATTCCTGGCCGCTTGCGTTGCGCC
GGTCAGGTCTACCTTTATTGGCTGA -> GTCAGGTCTACCTTTATTGGCTGAT
TTCCTGGTAT

ACCTTTACCATACAGCATTACTCGG -> CCTTTACCATACAGCATTACTCGGC
GCTGGCCTGGGACGCCTGCGAGGAC -> CTGGCCTGGGACGCCTGCGAGGACG
CTGTCATTTATCCGTTCCTGTCAAT -> TGTCATTTATCCGTTCCTGTCAATG
TAACATAACCATGGTTTTTCTGTGT -> AACATAACCATGGTTTTTCTGTGTA
GCGCTCGTACTGTTGGGAGCACCGC -> CGCTCGTACTGTTGGGAGCACCGCC
CTCGGAAGAGCAAATATCATGTCTA -> TCGGAAGAGCAAATATCATGTCTAC
TCCTAGGGGACGCTATACCTTTACC -> CCTAGGGGACGCTATACCTTTACCA
CGAAATAATTTCCGACTAAAGTAAC -> GAAATAATTTCCGACTAAAGTAACG
ACCTAGTCGTACCTAGCTATCAAGT -> CCTAGTCGTACCTAGCTATCAAGTT
TTCAAATAAAACGTATTGGCCTCTA -> TCAAATAAAACGTATTGGCCTCTAC
CGTTTGCCGATTCACAAGCGTTTAG -> GTTTGCCGATTCACAAGCGTTTAGA
GGGTATACCAGAGAGGAGTCCGAAT -> GGTATACCAGAGAGGAGTCCGAATC
TCGCGACGACAGACCGTAACGGTTG -> CGCGACGACAGACCGTAACGGTTGG
AATTGCTCGCTTCGACGCTGTGAAA -> ATTGCTCGCTTCGACGCTGTGAAAG
CATTCATTAGTGCGGGTCAGGTCTA -> ATTCATTAGTGCGGGTCAGGTCTAC
CAATCTATGAGAAAATCAGCCTTCG -> AATCTATGAGAAAATCAGCCTTCGC
ACCCCGCAAAAAACAAAAGAAGATT -> CCCCGCAAAAAACAAAAGAAGATTA
GCGTGCCTCGCTGGCCTGGGACGCC -> CGTGCCTCGCTGGCCTGGGACGCCT
TCGCCATCGG

GGTTTGCATGCCGCCCCTCTACGAA -> GTTTGCATGCCGCCCCTCTACGAAA
TCCTTTTGTACTCCTAGGGGACGCT -> CCTTTTGTACTCCTAGGGGACGCTA
GTTCGCCTCGCAATTCCAGTCTGCT -> TTCGCCTCGCAATTCCAGTCTGCTC
TAAAAATCGCGGAAGTACGACTCCC -> AAAAATCGCGGAAGTACGACTCCCG
GATTTGTATACCACTTTATCTAGAA -> ATTTGTATACCACTTTATCTAGAAC
TCTACTTGCCCCGAGCAATTATTGT -> CTACTTGCCCCGAGCAATTATTGTC
CATGATTCGTGATGAGGCCGTAATC -> ATGATTCGTGATGAGGCCGTAATCT
CTAAGACGCCGGCGGTGCGGGGCGT -> TAAGACGCCGGCGGTGCGGGGCGTA
TGCCGTATATGCTCCAACACTAGCC -> GCCGTATATGCTCCAACACTAGCCT
GGCATCTGCAACTGTTATTCAAATA -> GCATCTGCAACTGTTATTCAAATAA
GTCTCTATACAGATGTTCTTGGATA -> TCTCTATACAGATGTTCTTGGATAC
CTGTGAAAGTAGAATCCGGGCTTAT -> TGTGAAAGTAGAATCCGGGCTTATA
GAGCCTGCGATCAGTGGCTCGCTAT -> AGCCTGCGATCAGTGGCTCGCTATT
AGCTTCCTGGGATCATCGTACCTAG -> GCTTCCTGGGATCATCGTACCTAGT
CTCTACGAAACCGTATTTTTCGAAA -> TCTACGAAACCGTATTTTTCGAAAG
TTTTGACCGATCTTGTACGATTGTA -> TTTGACCGATCTTGTACGATTGTAT
TCATCCGGGTCAGGTCACCCAACGA -> CATCCGGGTCAGGTCACCCAACGAA
TTTAGACTAGGTTCCCCGTTCAGGC -> TTAGACTAGGTTCCCCGTTCAGGCT
GGGCACTCGT

ACTTAGCACGCGCCGATCGGGCGCG -> CTTAGCACGCGCCGATCGGGCGCGA
ATGACGCGACCGACCTCACAGTCAT -> TGACGCGACCGACCTCACAGTCATT
TCATTCCGTGGGGGGCGTCGTATAG -> CATTCCGTGGGGGGCGTCGTATAGA
TATGTCTTTACTAGAAGGTAGAAGC -> ATGTCTTTACTAGAAGGTAGAAGCT
CAGACGTGTGCACGGAGTACGTCGT -> AGACGTGTGCACGGAGTACGTCGTG
GATACGGCGGTAATATCTATCACGT -> ATACGGCGGTAATATCTATCACGTC
CCGCATCACGGTTAATCAGGAATGG -> CGCATCACGGTTAATCAGGAATGGT
CACCTCTCTGCTTTGTCTATGTTAT -> ACCTCTCTGCTTTGTCTATGTTATA
CAGTGGCTCGCTATTAAAGCCTTTA -> AGTGGCTCGCTATTAAAGCCTTTAT
TCTATTGAATAGCCTTCGCTCGGAT -> CTATTGAATAGCCTTCGCTCGGATA
CCAGCACGGCGAGCACAGTGATAAC -> CAGCACGGCGAGCACAGTGATAACA
CGCGACTACGAGTACACCGATGACG -> GCGACTACGAGTACACCGATGACGC
AGATGCGTGCCCCCACATCACAGGA -> GATGCGTGCCCCCACATCACAGGAC
CGGTATGGTGCTCGTCTGTCCGCAA -> GGTATGGTGCTCGTCTGTCCGCAAT
TGCAACTGTTATTCAAATAAAACGT -> GCAACTGTTATTCAAATAAAACGTA
AGGTTTGACCCTATCGTTGGATGCA -> GGTTTGACCCTATCGTTGGATGCAA
ATTATGCTTCCGACTATGACTCTCT -> TTATGCTTCCGACTATGACTCTCTT
GAGCTGGACCAGACAGTTCGGCATC -> AGCTGGACCAGACAGTTCGGCATCT
AGGTCACCCA

TAAGTAGACGTGCATCTTGACAACG -> AAGTAGACGTGCATCTTGACAACGG
GTGGCAAGCTGCTTTGATTTTTCGC -> TGGCAAGCTGCTTTGATTTTTCGCC
TGTACCAGAGTATGAACTAATCTAG -> GTACCAGAGTATGAACTAATCTAGC
CGGAATTTCTCCATCAAGAATTGCT -> GGAATTTCTCCATCAAGAATTGCTC
AGAGCATGATTCTAGCGTGTGGGAC -> GAGCATGATTCTAGCGTGTGGGACA
TGCATTTTCAGAGCTTCAATCCTCG -> GCATTTTCAGAGCTTCAATCCTCGA
AAACAGTTACATTATGAGGCCAACA -> AACAGTTACATTATGAGGCCAACAG
GGTATACCAGAGAGGAGTCCGAATC -> GTATACCAGAGAGGAGTCCGAATCG
CAGACCGTAACGGTTGGTCCAAATT -> AGACCGTAACGGTTGGTCCAAATTT
ACTCCCAGACACATCGTCCCGGTAA -> CTCCCAGACACATCGTCCCGGTAAT
TACTACGCGGACTACCATTCATTAG -> ACTACGCGGACTACCATTCATTAGT
CTTATGCGCGGCTAAGTAGACGTGC -> TTATGCGCGGCTAAGTAGACGTGCA
GAGTGGCGGGTATACCAGAGAGGAG -> AGTGGCGGGTATACCAGAGAGGAGT
TTGCGTTTGCCGATTCACAAGCGTT -> TGCGTTTGCCGATTCACAAGCGTTT
TCATTTGCGATCAACGGCTGAATTC -> CATTTGCGATCAACGGCTGAATTCA
GTTACATTCGCTGAGTAACAGTACC -> TTACATTCGCTGAGTAACAGTACCT
ATGGGAACGGTAACAGTTCCTGGCC -> TGGGAACGGTAACAGTTCCTGGCCT
TCTGTCTACAGTGGAAGGGTAAATT -> CTGTCTACAGTGGAAGGGTAAATTA
AAATGTCATT

GCGCACGCTGGTAGTCTGCACGACA -> CGCACGCTGGTAGTCTGCACGACAT
TTGTCTGAGCGCTCGTACTGTTGGG -> TGTCTGAGCGCTCGTACTGTTGGGA
ATCGGGCGCGACACATTCATAAGAG -> TCGGGCGCGACACATTCATAAGAGA
GTTATTCAAATAAAACGTATTGGCC -> TTATTCAAATAAAACGTATTGGCCT
TCTAGCATTTACCAGAATTACAGGC -> CTAGCATTTACCAGAATTACAGGCA
GTGCGAACAGTATCCACTCCCAGAC -> TGCGAACAGTATCCACTCCCAGACA
CTCTTACGCCACTCTTATTACTGCT -> TCTTACGCCACTCTTATTACTGCTC
GAAGGGTAAATTACGTGCTTCTATA -> AAGGGTAAATTACGTGCTTCTATAG
CGTTCCTGAGCCGCATCACGGTTAA -> GTTCCTGAGCCGCATCACGGTTAAT
CTGACCGCTAGGTACTGGCACCGAA -> TGACCGCTAGGTACTGGCACCGAAG
ATAGGAAGTTGCCTTTCCTTTCGAA -> TAGGAAGTTGCCTTTCCTTTCGAAA
TGATGATTGAGGCGCCGTTTTAGTC -> GATGATTGAGGCGCCGTTTTAGTCG
AGGAGTCCGAATCGCCTCTCTAGAT -> GGAGTCCGAATCGCCTCTCTAGATT
ATTGGCTGATTACTGCCTAGTTAAG -> TTGGCTGATTACTGCCTAGTTAAGA
CCTTACTATATGGAGGCCAAAACGG -> CTTACTATATGGAGGCCAAAACGGA
ACTGTGTGGTCTGTGAATAGGTGGA -> CTGTGTGGTCTGTGAATAGGTGGAC
AATTTCCGACTAAAGTAACGTAAGA -> ATTTCCGACTAAAGTAACGTAAGAG
GGCGTCGTATAGAGGCTCTGCCTTA -> GCGTCGTATAGAGGCTCTGCCTTAT
ACTGTAAGAA

CAGACAGTTGGCAGCTAGTGCGCAC -> AGACAGTTGGCAGCTAGTGCGCACG
AGTAACGTAAGAGCTTATGCGCGGC -> GTAACGTAAGAGCTTATGCGCGGCT
GACAGGACCCATATGTCTATTGAAT -> ACAGGACCCATATGTCTATTGAATA
CTGACTAGTTACAGAGCCTGCGATC -> TGACTAGTTACAGAGCCTGCGATCA
CCGCCCCTCTACGAAACCGTATTTT -> CGCCCCTCTACGAAACCGTATTTTT
GTAGAGTCTACTTGGGATATGATAA -> TAGAGTCTACTTGGGATATGATAAA
TCGCCCTGGTCGGAGCCGAGTGGCG -> CGCCCTGGTCGGAGCCGAGTGGCGG
TAAAAGCATCCGAGATCGTGCCCCC -> AAAAGCATCCGAGATCGTGCCCCCG
CGTGTGGGACACGGCAGCATGTTTG -> GTGTGGGACACGGCAGCATGTTTGA
ATAAAACGTATTGGCCTCTACGGGA -> TAAAACGTATTGGCCTCTACGGGAC
GGTGGCAAAGAGAACTCATAAGTGG -> GTGGCAAAGAGAACTCATAAGTGGA
TCAATGTTATTAGGAATCTCATGGG -> CAATGTTATTAGGAATCTCATGGGT
GCAATAGGAAGTTGCCTTTCCTTTC -> CAATAGGAAGTTGCCTTTCCTTTCG
GAACACAGATCTACAAAACTGTATT -> AACACAGATCTACAAAACTGTATTC
AAAAAGGCCGTTGCCTGCGAGGCCC -> AAAAGGCCGTTGCCTGCGAGGCCCC
CGTACTGTTGGGAGCACCGCCTTGA -> GTACTGTTGGGAGCACCGCCTTGAC
ACAGTAACTCGTAATTGACAAGAAG -> CAGTAACTCGTAATTGACAAGAAGA
CTGTGACCACAGAGATTCGCCCTGG -> TGTGACCACAGAGATTCGCCCTGGT
GGGGTTGTGT

CCATATGTCTATTGAATAGCCTTCG -> CATATGTCTATTGAATAGCCTTCGC
GAGGCTTTACATGATTCGTGATGAG -> AGGCTTTACATGATTCGTGATGAGG
CTTTTTTGGTTTTCATTAACACAGT -> TTTTTTGGTTTTCATTAACACAGTA
CCGCAATGTCTTTTTTGGTTTTCAT -> CGCAATGTCTTTTTTGGTTTTCATT
TTTCAGTGCGAACAGTATCCACTCC -> TTCAGTGCGAACAGTATCCACTCCC
CAGCCTGTGACCACAGAGATTCGCC -> AGCCTGTGACCACAGAGATTCGCCC
CCGGCGACAATGTGATGTCGTCCAC -> CGGCGACAATGTGATGTCGTCCACA
GATCATCGTACCTAGTCGTACCTAG -> ATCATCGTACCTAGTCGTACCTAGC
CCGAGTGGCGGGTATACCAGAGAGG -> CGAGTGGCGGGTATACCAGAGAGGA
GTCTACTTGGGATATGATAAAATGT -> TCTACTTGGGATATGATAAAATGTC
TGCCTTGAGATGCGTGCCCCCACAT -> GCCTTGAGATGCGTGCCCCCACATC
CTACTACAGCCTTGTTAATTCATAC -> TACTACAGCCTTGTTAATTCATACT
TTTCGAAAGGAAATATTCCATGCCG -> TTCGAAAGGAAATATTCCATGCCGT
TTAGTACCACACGGGAAAGCGAACA -> TAGTACCACACGGGAAAGCGAACAC
CACGGGAAAGCGAACACAGATCTAC -> ACGGGAAAGCGAACACAGATCTACA
AGCGGGCAGACGCAGGGTATGTAAC -> GCGGGCAGACGCAGGGTATGTAACT
CGTTGATAACGGGTTTACCCGCAGT -> GTTGATAACGGGTTTACCCGCAGTA
ATGTCTTTACTAGAAGGTAGAAGCT -> TGTCTTTACTAGAAGGTAGAAGCTG
TATAGCGTGC

TCGCTAACATAACCATGGTTTTTCT -> CGCTAACATAACCATGGTTTTTCTG
CGGAGGCTTTACATGATTCGTGATG -> GGAGGCTTTACATGATTCGTGATGA
AACGGGAATGAAGCCTCTGATCACG -> ACGGGAATGAAGCCTCTGATCACGC
AAAGGCCTTTCCCACATACGCTTAA -> AAGGCCTTTCCCACATACGCTTAAA
TCTGTCATTTATCCGTTCCTGTCAA -> CTGTCATTTATCCGTTCCTGTCAAT
GTCATTCCGTGGGGGGCGTCGTATA -> TCATTCCGTGGGGGGCGTCGTATAG
TAAAGATATTTGGTCAAGTGGTGTT -> AAAGATATTTGGTCAAGTGGTGTTT
GCGTTTGCCGATTCACAAGCGTTTA -> CGTTTGCCGATTCACAAGCGTTTAG
AATTCCTGGCCGCTTGCGTTGCGCC -> ATTCCTGGCCGCTTGCGTTGCGCCT
CAGCATTACTCGGCCGACGGGCAAC -> AGCATTACTCGGCCGACGGGCAACG
CAGTTACATTATGAGGCCAACAGGC -> AGTTACATTATGAGGCCAACAGGCT
CAATCCTCGAGAAGGGACGCACGCT -> AATCCTCGAGAAGGGACGCACGCTG
ATAACCAATCTGCCCATTACTCTGT -> TAACCAATCTGCCCATTACTCTGTG
TGCCCATTACTCTGTGCCATTATGC -> GCCCATTACTCTGTGCCATTATGCT
AGACTAAACATTTATTTATACCAGG -> GACTAAACATTTATTTATACCAGGT
TCAACCCGCTGATACTACGCGGACT -> CAACCCGCTGATACTACGCGGACTA
GTCCGCAATGTCTTTTTTGGTTTTC -> TCCGCAATGTCTTTTTTGGTTTTCA
CTCTCTAGCATTTACCAGAATTACA -> TCTCTAGCATTTACCAGAATTACAG
CAGTATTGTA

GTACGAGGCCCTAACGTATTTCGGT -> TACGAGGCCCTAACGTATTTCGGTG
CAGTTCGGCATCTGCAACTGTTATT -> AGTTCGGCATCTGCAACTGTTATTC
TTGGGAGGTGTTCATCGCCTGAACA -> TGGGAGGTGTTCATCGCCTGAACAG
GTGCATTTTCAGAGCTTCAATCCTC -> TGCATTTTCAGAGCTTCAATCCTCG
CAACAGGTGCCATAACCAATCTGCC -> AACAGGTGCCATAACCAATCTGCCC
GAGGGGGGATCGCTGTCGTGCCGCA -> AGGGGGGATCGCTGTCGTGCCGCAG
TTCAGGCTGTGGTCAAGTCCCCGGC -> TCAGGCTGTGGTCAAGTCCCCGGCC
TCTACCTTTATTGGCTGATTACTGC -> CTACCTTTATTGGCTGATTACTGCC
ACTGCCTAGTTAAGATGCGACGTAC -> CTGCCTAGTTAAGATGCGACGTACC
TCATTTATCCGTTCCTGTCAATGTT -> CATTTATCCGTTCCTGTCAATGTTA
TACCTAGTCGTACCTAGCTATCAAG -> ACCTAGTCGTACCTAGCTATCAAGT
AGGGAACAAATAAGCGACATTGAGA -> GGGAACAAATAAGCGACATTGAGAA
AATCTGCAAAGAGACATACAGGCCA -> ATCTGCAAAGAGACATACAGGCCAA
AAGATATTTGGTCAAGTGGTGTTTG -> AGATATTTGGTCAAGTGGTGTTTGA
TCCTGCCATATATCATATAGATGTC -> CCTGCCATATATCATATAGATGTCA
CTCGCTGGCCTGGGACGCCTGCGAG -> TCGCTGGCCTGGGACGCCTGCGAGG
CGGGTCAGGTCACCCAACGAAATAA -> GGGTCAGGTCACCCAACGAAATAAT
GCCGGAATTTCTCCATCAAGAATTG -> CCGGAATTTCTCCATCAAGAATTGC
CGGCGTACAG

GAGCTTATGCGCGGCTAAGTAGACG -> AGCTTATGCGCGGCTAAGTAGACGT
CCGCCTTGACGTCTTTCAATGTCTG -> CGCCTTGACGTCTTTCAATGTCTGT
TATGCTCCAACACTAGCCTGGTCGT -> ATGCTCCAACACTAGCCTGGTCGTA
CATTCCGTGGGGGGCGTCGTATAGA -> ATTCCGTGGGGGGCGTCGTATAGAG
TGGTGATCTCCTTACTATATGGAGG -> GGTGATCTCCTTACTATATGGAGGC
AGAGCCTGCGATCAGTGGCTCGCTA -> GAGCCTGCGATCAGTGGCTCGCTAT
ATGTCCTACGCGATCTTGCATAAAA -> TGTCCTACGCGATCTTGCATAAAAA
GATTGCGTTTGCCGATTCACAAGCG -> ATTGCGTTTGCCGATTCACAAGCGT
CCTATGAGCTAAAACCAAATCCTTT -> CTATGAGCTAAAACCAAATCCTTTT
GCATAAAAATCGCGGAAGTACGACT -> CATAAAAATCGCGGAAGTACGACTC
AGCCTTGTTAATTCATACTTGAAAT -> GCCTTGTTAATTCATACTTGAAATG
TGAGGCCAACAGGCTCCTGCCATAT -> GAGGCCAACAGGCTCCTGCCATATA
CGACGTACCAAGGTAACTCGGATTT -> GACGTACCAAGGTAACTCGGATTTA
GAGACATACAGGCCAACACGTGCGG -> AGACATACAGGCCAACACGTGCGGG
CAAAGAGAACTCATAAGTGGAATAT -> AAAGAGAACTCATAAGTGGAATATG
CTGCGTTGAGCAATAAAGCACAACG -> TGCGTTGAGCAATAAAGCACAACGG
TAGAAGGTAGAAGCTGAGATACATG -> AGAAGGTAGAAGCTGAGATACATGT
AGCAAAGATGGGAACGGTAACAGTT -> GCAAAGATGGGAACGGTAACAGTTC
ATCTACTTGC

GTAAAATACCAATCTGCAAAGAGAC -> TAAAATACCAATCTGCAAAGAGACA
CTTTACGCATCCACGATGCGCGTGC -> TTTACGCATCCACGATGCGCGTGCA
ACGCTTAAAAGCATCCGAGATCGTG -> CGCTTAAAAGCATCCGAGATCGTGC
TCCGTTCCTGTCAATGTTATTAGGA -> CCGTTCCTGTCAATGTTATTAGGAA
TCTTCTGATCCAATCTATGAGAAAA -> CTTCTGATCCAATCTATGAGAAAAT
CACGCTGGCCCCCGCAGGCGCTAAT -> ACGCTGGCCCCCGCAGGCGCTAATT
TGTGACCACAGAGATTCGCCCTGGT -> GTGACCACAGAGATTCGCCCTGGTC
GTATTAAAGATATTTGGTCAAGTGG -> TATTAAAGATATTTGGTCAAGTGGT
GAGCATGATTCTAGCGTGTGGGACA -> AGCATGATTCTAGCGTGTGGGACAC
GATTACTGCCTAGTTAAGATGCGAC -> ATTACTGCCTAGTTAAGATGCGACG
CCATGCCGTATATGCTCCAACACTA -> CATGCCGTATATGCTCCAACACTAG
AGCCTGCGATCAGTGGCTCGCTATT -> GCCTGCGATCAGTGGCTCGCTATTA
TCCTAAGGAGCCTTCCAGGAAGACC -> CCTAAGGAGCCTTCCAGGAAGACCT
CTGATCCAATCTATGAGAAAATCAG -> TGATCCAATCTATGAGAAAATCAGC
CACACCGTTGGGTCCTAAGGAGCCT -> ACACCGTTGGGTCCTAAGGAGCCTT
ACTAAAGTAACGTAAGAGCTTATGC -> CTAAAGTAACGTAAGAGCTTATGCG
GACGTGTGCACGGAGTACGTCGTGA -> ACGTGTGCACGGAGTACGTCGTGAT
GGTCAGGTCACCCAACGAAATAATT -> GTCAGGTCACCCAACGAAATAATTT
ATTAAGCGGG

TCTGTTACATTCGCTGAGTAACAGT -> CTGTTACATTCGCTGAGTAACAGTA
GGACCTATGATCGCCCAGCGTCCAT -> GACCTATGATCGCCCAGCGTCCATG
TCGTACGCAGAGTACTGCAGCTGGA -> CGTACGCAGAGTACTGCAGCTGGAT
CTGGTAGTCTGCACGACATCGAGAC -> TGGTAGTCTGCACGACATCGAGACT
CATAGGCGCCGAGATGAATCTTGGA -> ATAGGCGCCGAGATGAATCTTGGAC
TCTCTGCTTTGTCTATGTTATACAC -> CTCTGCTTTGTCTATGTTATACACG
CGCCTGCAACAGGTCTCGCGACTAC -> GCCTGCAACAGGTCTCGCGACTACG
ACGGCAGCATGTTTGATTAGAGGCG -> CGGCAGCATGTTTGATTAGAGGCGA
CGTTGAACGCATCTACTTGCCCCGA -> GTTGAACGCATCTACTTGCCCCGAG
ACGCGACCGACCTCACAGTCATTCC -> CGCGACCGACCTCACAGTCATTCCG
CTAGGTTTCTGACTAGTTACAGAGC -> TAGGTTTCTGACTAGTTACAGAGCC
ATTACTCGGCCGACGGGCAACGTAA -> TTACTCGGCCGACGGGCAACGTAAC
TGCTTTGTCTATGTTATACACGAAT -> GCTTTGTCTATGTTATACACGAATT
GCGCCGATCGGGCGCGACACATTCA -> CGCCGATCGGGCGCGACACATTCAT
TGACTAGTTACAGAGCCTGCGATCA -> GACTAGTTACAGAGCCTGCGATCAG
TGTCCGGCGTGGTGATCTCCTTACT -> GTCCGGCGTGGTGATCTCCTTACTA
ACCAGCCTGTGACCACAGAGATTCG -> CCAGCCTGTGACCACAGAGATTCGC
ACGGCGGTAATATCTATCACGTCTA -> CGGCGGTAATATCTATCACGTCTAA
GCGGGGCGTA

ACCATATTAAAATGATCCCAGCTGA -> CCATATTAAAATGATCCCAGCTGAT
ATCTATCACGTCTAAGATAGCTTGG -> TCTATCACGTCTAAGATAGCTTGGT
GCGAGCACAGTGATAACAAAGATTG -> CGAGCACAGTGATAACAAAGATTGC
ACACCCACTTATCCGCCACGTATCG -> CACCCACTTATCCGCCACGTATCGA
CGTACGCAGAGTACTGCAGCTGGAT -> GTACGCAGAGTACTGCAGCTGGATA
TCAAAGTGACAACTGCCTTCCAGAC -> CAAAGTGACAACTGCCTTCCAGACA
TGTGATGTCGTCCACACGTCTAACC -> GTGATGTCGTCCACACGTCTAACCC
ATTTCCGCGCCTGCAACAGGTCTCG -> TTTCCGCGCCTGCAACAGGTCTCGC
AGACCGTAACGGTTGGTCCAAATTT -> GACCGTAACGGTTGGTCCAAATTTG
TGTTGTCTGTCTACAGTGGAAGGGT -> GTTGTCTGTCTACAGTGGAAGGGTA
CTGTTGCTCTGGGATTGGTACTCCG -> TGTTGCTCTGGGATTGGTACTCCGA
GAGCTCGCCGTAAACACACCGTTGG -> AGCTCGCCGTAAACACACCGTTGGG
CGGTGCGGGGCGTAGGAAAAAGCTT -> GGTGCGGGGCGTAGGAAAAAGCTTC
CCTGCGAGGCCCCCGTATTAAAGAT -> CTGCGAGGCCCCCGTATTAAAGATA
TGAGCTAAAACCAAATCCTTTTGAG -> GAGCTAAAACCAAATCCTTTTGAGC
ATGGTTTGCCGCGAACCAGACAGTT -> TGGTTTGCCGCGAACCAGACAGTTG
CCTCAGCGGGAATTTCCGTGGAAAT -> CTCAGCGGGAATTTCCGTGGAAATA
AACGGTTGGTCCAAATTTGCATTTG -> ACGGTTGGTCCAAATTTGCATTTGG
CGTAGCCTAG

TGTTATACACGAATTGTACCAGAGT -> GTTATACACGAATTGTACCAGAGTA
ACCCACTTATCCGCCACGTATCGAG -> CCCACTTATCCGCCACGTATCGAGC
CTGGCCTGGGACGCCTGCGAGGACG -> TGGCCTGGGACGCCTGCGAGGACGC
ACAGGAGCTACTGTTCGCTCTCAGC -> CAGGAGCTACTGTTCGCTCTCAGCA
AAACGTATTGGCCTCTACGGGACAG -> AACGTATTGGCCTCTACGGGACAGG
GAGCACCGCCTTGACGTCTTTCAAT -> AGCACCGCCTTGACGTCTTTCAATG
CTGGGATTGGTACTCCGAGTCGCTC -> TGGGATTGGTACTCCGAGTCGCTCG
TCGAAAAAAAGGCCGTTGCCTGCGA -> CGAAAAAAAGGCCGTTGCCTGCGAG
CAGTCATTCCGTGGGGGGCGTCGTA -> AGTCATTCCGTGGGGGGCGTCGTAT
ATACTACGCGGACTACCATTCATTA -> TACTACGCGGACTACCATTCATTAG
GCCTTTATGATTCTCGCACAATGAC -> CCTTTATGATTCTCGCACAATGACT
TACGTCGGACACAGCAAAGATGGGA -> ACGTCGGACACAGCAAAGATGGGAA
CTAGCGTGTGGGACACGGCAGCATG -> TAGCGTGTGGGACACGGCAGCATGT
CTAGAAGGTAGAAGCTGAGATACAT -> TAGAAGGTAGAAGCTGAGATACATG
AGGCGCTAATTAGTACCACACGGGA -> GGCGCTAATTAGTACCACACGGGAA
ACGCCCCAGTCCCACCATATTAAAA -> CGCCCCAGTCCCACCATATTAAAAT
TGTGGTCTGTGAATAGGTGGACAAA -> GTGGTCTGTGAATAGGTGGACAAAC
GCATAAACGGCCTGTAGATCCGGGG -> CATAAACGGCCTGTAGATCCGGGGA
GGCACGCTAC

ACCGTATTTTTCGAAAGGAAATATT -> CCGTATTTTTCGAAAGGAAATATTC
GCTAGGTACTGGCACCGAAGGGACT -> CTAGGTACTGGCACCGAAGGGACTT
GTGTGGTCTGTGAATAGGTGGACAA -> TGTGGTCTGTGAATAGGTGGACAAA
AGTCGCGCAGACGGAACTTTACATT -> GTCGCGCAGACGGAACTTTACATTG
GCCTCAGCGGGAATTTCCGTGGAAA -> CCTCAGCGGGAATTTCCGTGGAAAT
CGGCAAGCGCCCGGTGCGGTCAGCG -> GGCAAGCGCCCGGTGCGGTCAGCGG
TAATGCTCCTCGGATAGCTTCCTGG -> AATGCTCCTCGGATAGCTTCCTGGG
CAAGTGGTGTTTGAACTAGACCTGT -> AAGTGGTGTTTGAACTAGACCTGTA
TCCTATACAGAGACTACCGTGAGCC -> CCTATACAGAGACTACCGTGAGCCG
TAAACACACCGTTGGGTCCTAAGGA -> AAACACACCGTTGGGTCCTAAGGAG
TCTATGTTATACACGAATTGTACCA -> CTATGTTATACACGAATTGTACCAG
CGCCAAAGGGGCACTCGTGCGAACG -> GCCAAAGGGGCACTCGTGCGAACGC
TAGCCTTCGCTCGGATACTACTGAG -> AGCCTTCGCTCGGATACTACTGAGT
GTGGCCTCGTCGGCGCACAAAGATT -> TGGCCTCGTCGGCGCACAAAGATTT
ACAAAGATTGCGTTTGCCGATTCAC -> CAAAGATTGCGTTTGCCGATTCACA
CTAGTAACGAGCACGAATAAGACCT -> TAGTAACGAGCACGAATAAGACCTG
ACTGTGGGTGCACAGGAGCTACTGT -> CTGTGGGTGCACAGGAGCTACTGTT
TATGGTGCTCGTCTGTCCGCAATGT -> ATGGTGCTCGTCTGTCCGCAATGTC
TCGTGCGAAC

CCGCATAAACGGCCTGTAGATCCGG -> CGCATAAACGGCCTGTAGATCCGGG
CTGATGATTGAGGCGCCGTTTTAGT -> TGATGATTGAGGCGCCGTTTTAGTC
ACCACTTTATCTAGAACTTCCTCCT -> CCACTTTATCTAGAACTTCCTCCTA
GCGGGAATTTCCGTGGAAATACCCG -> CGGGAATTTCCGTGGAAATACCCGC
TACAAAACTGTATTCTTCGAGTCTG -> ACAAAACTGTATTCTTCGAGTCTGT
CGGTTGGTCCAAATTTGCATTTGGT -> GGTTGGTCCAAATTTGCATTTGGTA
TGTCTCTATACAGATGTTCTTGGAT -> GTCTCTATACAGATGTTCTTGGATA
CCGTTCGCGACGACAGACCGTAACG -> CGTTCGCGACGACAGACCGTAACGG
ACCGCTATTTCATAAATCGGCAGCG -> CCGCTATTTCATAAATCGGCAGCGG
TCGGGCAGCATTTGCTCTCTTCTTA -> CGGGCAGCATTTGCTCTCTTCTTAG
TGTCGGAAGAGTGGGTCTTCATTTT -> GTCGGAAGAGTGGGTCTTCATTTTG
GGAATGGTCGGCACGCTACGCCTGA -> GAATGGTCGGCACGCTACGCCTGAC
CCATCAAGAATTGCTCGCTTCGACG -> CATCAAGAATTGCTCGCTTCGACGC
CAAGCACAGTTCGCCTCGCAATTCC -> AAGCACAGTTCGCCTCGCAATTCCA
AAGATTAGCTGAGCAAAAATTGCGT -> AGATTAGCTGAGCAAAAATTGCGTA
ATCTTGCATAAAAATCGCGGAAGTA -> TCTTGCATAAAAATCGCGGAAGTAC
CAGCGAGGTGCCTTGAGATGCGTGC -> AGCGAGGTGCCTTGAGATGCGTGCC
GTCCCACCATATTAAAATGATCCCA -> TCCCACCATATTAAAATGATCCCAG
TCAAGTTAAC

TAACATCAGCGAGGTGCCTTGAGAT -> AACATCAGCGAGGTGCCTTGAGATG
AATGCTCCTCGGATAGCTTCCTGGG -> ATGCTCCTCGGATAGCTTCCTGGGA
AACGAGCACGAATAAGACCTGACCG -> ACGAGCACGAATAAGACCTGACCGC
GCTTAAAAGCATCCGAGATCGTGCC -> CTTAAAAGCATCCGAGATCGTGCCC
ACGACAAGCACAGTTCGCCTCGCAA -> CGACAAGCACAGTTCGCCTCGCAAT
TTCTTGGATACGTCGGACACAGCAA -> TCTTGGATACGTCGGACACAGCAAA
GATCTCCTTACTATATGGAGGCCAA -> ATCTCCTTACTATATGGAGGCCAAA
CCACTTAGACATATACGCCCCAGTC -> CACTTAGACATATACGCCCCAGTCC
GGTCGGGCAGCATTTGCTCTCTTCT -> GTCGGGCAGCATTTGCTCTCTTCTT
TCGGCGATTGGAGGGTAGAGTCTAC -> CGGCGATTGGAGGGTAGAGTCTACT
CCGGCGTGGTGATCTCCTTACTATA -> CGGCGTGGTGATCTCCTTACTATAT
AGCACGGCGAGCACAGTGATAACAA -> GCACGGCGAGCACAGTGATAACAAA
GCTCTCTTCTTAGTCAACCTTGCAT -> CTCTCTTCTTAGTCAACCTTGCATC
GTGACAACTGCCTTCCAGACAGAGG -> TGACAACTGCCTTCCAGACAGAGGG
GACACATTCATAAGAGAGGCAGTAT -> ACACATTCATAAGAGAGGCAGTATC
GTCTCAACCCGCTGATACTACGCGG -> TCTCAACCCGCTGATACTACGCGGA
CTCGAGAAGGGACGCACGCTGGCCC -> TCGAGAAGGGACGCACGCTGGCCCC
GCAGCATTTGCTCTCTTCTTAGTCA -> CAGCATTTGCTCTCTTCTTAGTCAA
GACTTTACGC

CATATGTCTATTGAATAGCCTTCGC -> ATATGTCTATTGAATAGCCTTCGCT
AGATACCAACGCTGTTTTGACCGAT -> GATACCAACGCTGTTTTGACCGATC
AGCACCTCTCTGCTTTGTCTATGTT -> GCACCTCTCTGCTTTGTCTATGTTA
AGAAATAACCCCGCAAAAAACAAAA -> GAAATAACCCCGCAAAAAACAAAAG
TCGCCTCGCAATTCCAGTCTGCTCT -> CGCCTCGCAATTCCAGTCTGCTCTT
TGATCCAATCTATGAGAAAATCAGC -> GATCCAATCTATGAGAAAATCAGCC
GCAAAAATTGCGTACCCGTCAATTC -> CAAAAATTGCGTACCCGTCAATTCC
TCGTGATAGGCGGACCTCAGGTGGC -> CGTGATAGGCGGACCTCAGGTGGCA
ACAGGACCTATGATCGCCCAGCGTC -> CAGGACCTATGATCGCCCAGCGTCC
ATGATCCCAGCTGATCTATGCTGAC -> TGATCCCAGCTGATCTATGCTGACC
GGCTAAGTAGACGTGCATCTTGACA -> GCTAAGTAGACGTGCATCTTGACAA
AATCTTGGACATAAGTTATGAATGA -> ATCTTGGACATAAGTTATGAATGAT
ACTCTCTTCTGATCCAATCTATGAG -> CTCTCTTCTGATCCAATCTATGAGA
CTATCACGTCTAAGATAGCTTGGTG -> TATCACGTCTAAGATAGCTTGGTGG
AGTCAACCTTGCATCAGACAGACTG -> GTCAACCTTGCATCAGACAGACTGT
CTAAGTAGACGTGCATCTTGACAAC -> TAAGTAGACGTGCATCTTGACAACG
ACGACATCGAGACTGTAAAATACCA -> CGACATCGAGACTGTAAAATACCAA
CACAGCAAAGATGGGAACGGTAACA -> ACAGCAAAGATGGGAACGGTAACAG
CGCTAATTAG

TAACAGTTCCTGGCCTAACCACGTC -> AACAGTTCCTGGCCTAACCACGTCC
GCCCTAACGTATTTCGGTGGCCTCG -> CCCTAACGTATTTCGGTGGCCTCGT
CGCGGCTAAGTAGACGTGCATCTTG -> GCGGCTAAGTAGACGTGCATCTTGA
AGCGCTCGTACTGTTGGGAGCACCG -> GCGCTCGTACTGTTGGGAGCACCGC
TGCACCCTCGCCATCGGGACCGTTG -> GCACCCTCGCCATCGGGACCGTTGA
GAGGCCAACAGGCTCCTGCCATATA -> AGGCCAACAGGCTCCTGCCATATAT
CATTATGCTTCCGACTATGACTCTC -> ATTATGCTTCCGACTATGACTCTCT
GTTATGAATGATACTAGCGTTGTCT -> TTATGAATGATACTAGCGTTGTCTG
GGCGTCCTCAATAACTTACTTACAC -> GCGTCCTCAATAACTTACTTACACA
GGAGTCCGAATCGCCTCTCTAGATT -> GAGTCCGAATCGCCTCTCTAGATTC
TACAGGCAGCAACAGGTGCCATAAC -> ACAGGCAGCAACAGGTGCCATAACC
AGAAGCGTTCTCCATTACTCTCTAG -> GAAGCGTTCTCCATTACTCTCTAGC
CTGTTACATTCGCTGAGTAACAGTA -> TGTTACATTCGCTGAGTAACAGTAC
ACGCTGAAGAAATAACCCCGCAAAA -> CGCTGAAGAAATAACCCCGCAAAAA
ATTGCTCGCTTCGACGCTGTGAAAG -> TTGCTCGCTTCGACGCTGTGAAAGT
ATCTTGTACGATTGTATGTATCCAA -> TCTTGTACGATTGTATGTATCCAAG
GGCGGGTATACCAGAGAGGAGTCCG -> GCGGGTATACCAGAGAGGAGTCCGA
TCCTCGGATAGCTTCCTGGGATCAT -> CCTCGGATAGCTTCCTGGGATCATC
GGGGATCGCT

ACTAGTTACAGAGCCTGCGATCAGT -> CTAGTTACAGAGCCTGCGATCAGTG
TGCTCGTCTGTCCGCAATGTCTTTT -> GCTCGTCTGTCCGCAATGTCTTTTT
TTTCGTACGCAGAGTACTGCAGCTG -> TTCGTACGCAGAGTACTGCAGCTGG
AGACGCAGGGTATGTAACTGGGGGT -> GACGCAGGGTATGTAACTGGGGGTT
GGGGCCCTCCAAGCCGCAACCAGCC -> GGGCCCTCCAAGCCGCAACCAGCCT
TGTGCTACCGTTCGCGACGACAGAC -> GTGCTACCGTTCGCGACGACAGACC
TACCAACGCTGTTTTGACCGATCTT -> ACCAACGCTGTTTTGACCGATCTTG
AGAAAAACGTATTTCAGTGCGAACA -> GAAAAACGTATTTCAGTGCGAACAG
GTGTGGGACACGGCAGCATGTTTGA -> TGTGGGACACGGCAGCATGTTTGAT
CCTAGTTAAGATGCGACGTACCAAG -> CTAGTTAAGATGCGACGTACCAAGG
GCTTAAAAAACACGACAAGCACAGT -> CTTAAAAAACACGACAAGCACAGTT
GGAACCATAGAAGTAGTTCTGTCAT -> GAACCATAGAAGTAGTTCTGTCATT
CGTTGGATGCAACGCAGGATTCCTG -> GTTGGATGCAACGCAGGATTCCTGG
ATACGCCCCAGTCCCACCATATTAA -> TACGCCCCAGTCCCACCATATTAAA
AGCGCCCGGTGCGGTCAGCGGGGAC -> GCGCCCGGTGCGGTCAGCGGGGACG
CTTCTTAGTCAACCTTGCATCAGAC -> TTCTTAGTCAACCTTGCATCAGACA
CAGACTAAACATTTATTTATACCAG -> AGACTAAACATTTATTTATACCAGG
AACTAGCACCTCTCTGCTTTGTCTA -> ACTAGCACCTCTCTGCTTTGTCTAT
TGGTCGTAGC

TCATCGTATAGTCCGAAGGAAATAG -> CATCGTATAGTCCGAAGGAAATAGA
TGCGAACAGTATCCACTCCCAGACA -> GCGAACAGTATCCACTCCCAGACAC
AGAACTTCCTCCTATACAGAGACTA -> GAACTTCCTCCTATACAGAGACTAC
GCAGCATGTTTGATTAGAGGCGAAC -> CAGCATGTTTGATTAGAGGCGAACT
AACATCAGCGAGGTGCCTTGAGATG -> ACATCAGCGAGGTGCCTTGAGATGC
TACCGTGAGCCGGTTTGCATGCCGC -> ACCGTGAGCCGGTTTGCATGCCGCC
GGCAGTATCCGGCGACAATGTGATG -> GCAGTATCCGGCGACAATGTGATGT
ACCGATGACGCGACCGACCTCACAG -> CCGATGACGCGACCGACCTCACAGT
ATCGTTGAGCTGGACCAGACAGTTC -> TCGTTGAGCTGGACCAGACAGTTCG
CCGCTGATACTACGCGGACTACCAT -> CGCTGATACTACGCGGACTACCATT
TTTGCTCTCTTCTTAGTCAACCTTG -> TTGCTCTCTTCTTAGTCAACCTTGC
ACCCTAGAACTGTTGTCTGTCTACA -> CCCTAGAACTGTTGTCTGTCTACAG
AACCGTATTTTTCGAAAGGAAATAT -> ACCGTATTTTTCGAAAGGAAATATT
CCGCAGGCGCTAATTAGTACCACAC -> CGCAGGCGCTAATTAGTACCACACG
CGCTCCCAACGGAAAGCAGAATGGT -> GCTCCCAACGGAAAGCAGAATGGTG
ATTTCCGACTAAAGTAACGTAAGAG -> TTTCCGACTAAAGTAACGTAAGAGC
TTAGTCGTTGATAACGGGTTTACCC -> TAGTCGTTGATAACGGGTTTACCCG
TTGTCCCCTGAGCTCGCCGTAAACA -> TGTCCCCTGAGCTCGCCGTAAACAC
TTTTAGTCGT

De Bruijn graph<br>
Input: An integer k and a string Text.<br>
Output: DeBruijn_k(Text), in the form of an adjacency list (dictionary).
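
A short worked example may help here. The sketch below is an illustration only: the function name `de_bruijn_from_text` and the sample string are assumptions, not taken from the data file. Each k-mer of Text contributes one edge from its (k-1)-prefix to its (k-1)-suffix, and identical (k-1)-mers are merged into a single node.

```python
def de_bruijn_from_text(k, text):
    """Return {(k-1)-mer: [successor (k-1)-mers]} for the k-mers of text."""
    graph = {}
    for i in range(len(text) - k + 1):
        kmer = text[i:i+k]
        # edge from the k-mer's prefix to its suffix; repeats are kept
        graph.setdefault(kmer[:-1], []).append(kmer[1:])
    return graph

graph = de_bruijn_from_text(4, 'AAGATTCTCTAC')
for node in sorted(graph):
    print(node + ' -> ' + ','.join(graph[node]))
# AAG -> AGA
# AGA -> GAT
# ATT -> TTC
# CTA -> TAC
# CTC -> TCT
# GAT -> ATT
# TCT -> CTC,CTA
# TTC -> TCT
```

TCT occurs twice in the text, so its single node collects two outgoing edges. This merging of repeats is what separates the de Bruijn graph from a plain path through the k-mers.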

In [92]:
with open('p132_data.txt') as f:
    k = int(f.readline())
    text = f.readline().strip()
# nodes are the consecutive (k-1)-mers of Text
nodes = [text[i:i+k-1] for i in range(len(text)-k+2)]
# each pair of consecutive nodes is one edge (repeated edges are kept)
bruijn = {}
for j in range(len(nodes)-1):
    bruijn.setdefault(nodes[j], []).append(nodes[j+1])
# print the adjacency list with the nodes in alphabetical order
for key, value in sorted(bruijn.items()):
    print(key + ' -> ' + ','.join(value))

AA -> AT
AT -> TG,TG,TG
CA -> AT
CC -> CA
GA -> AT
GC -> CC
GG -> GG,GA
GT -> TT
TA -> AA
TG -> GG,GC,GT


De Bruijn algorithm<br>
Input: A collection of k-mers Patterns.<br>
Output: The adjacency list of the de Bruijn graph DeBruijn(Patterns).

In [9]:
with open('p137_data.txt') as f:
    patterns = f.read().split()   # split() also drops a trailing empty line
bruijn = {}
for pattern in patterns:
    # each k-mer is an edge from its prefix to its suffix (both of length k-1)
    bruijn.setdefault(pattern[:-1], []).append(pattern[1:])
# print nodes and their successors in alphabetical order
for key, value in sorted(bruijn.items()):
    print(key + ' -> ' + ','.join(sorted(value)))

AAAAACAGGCTCATATGCCCAACT -> AAAACAGGCTCATATGCCCAACTG
AAAAACCACACCCCGCAATTCTAC -> AAAACCACACCCCGCAATTCTACT
AAAAAGAGTGTATCATCGAACAGT -> AAAAGAGTGTATCATCGAACAGTC
AAAACAGGCTCATATGCCCAACTG -> AAACAGGCTCATATGCCCAACTGG
AAAACATCTTGTCCTGTTAAATTC -> AAACATCTTGTCCTGTTAAATTCC
AAAACCACACCCCGCAATTCTACT -> AAACCACACCCCGCAATTCTACTC
AAAACCCCTGACCTAAGACTGGAC -> AAACCCCTGACCTAAGACTGGACC
AAAACGCCGGTTGTCCAGGAAGAT -> AAACGCCGGTTGTCCAGGAAGATT
AAAACTTACGCACAGAACGGAGTC -> AAACTTACGCACAGAACGGAGTCC
AAAAGAAGTCGAGGCCTTCGATAA -> AAAGAAGTCGAGGCCTTCGATAAA
AAAAGAGTGTATCATCGAACAGTC -> AAAGAGTGTATCATCGAACAGTCA
AAAATATCAACTGGACCACGTCAC -> AAATATCAACTGGACCACGTCACC
AAAATCCCTTAATAGAAAATTAAA -> AAATCCCTTAATAGAAAATTAAAA
AAAATCTCCTGGGAGCATCACGAC -> AAATCTCCTGGGAGCATCACGACA
AAAATTAAAACATCTTGTCCTGTT -> AAATTAAAACATCTTGTCCTGTTA
AAACAGGCTCATATGCCCAACTGG -> AACAGGCTCATATGCCCAACTGGT
AAACATCATTCTTATTTCAGGCCC -> AACATCATTCTTATTTCAGGCCCC
AAACATCTTGTCCTGTTAAATTCC -> AACATCTTGTCCTGTTAAATTCCA
AAACCACACCCCGCAATTCTACTC -> AACCACACCCCGCAATTC

GGAGACTAAAAGAAGTCGAGGCCT -> GAGACTAAAAGAAGTCGAGGCCTT
GGAGCATCACGACACTGAACGCAT -> GAGCATCACGACACTGAACGCATG
GGAGCTAGCGTTCTGCCGACTAAT -> GAGCTAGCGTTCTGCCGACTAATA
GGAGGGAGTAGGGTCTGCGGTGGA -> GAGGGAGTAGGGTCTGCGGTGGAT
GGAGGTCGATCGGAGTGGAGGGAG -> GAGGTCGATCGGAGTGGAGGGAGT
GGAGGTGCAAAGCTTATTTCTTCT -> GAGGTGCAAAGCTTATTTCTTCTG
GGAGTAGGGTCTGCGGTGGATAGG -> GAGTAGGGTCTGCGGTGGATAGGA
GGAGTCCACACGTGCCCAATCGCG -> GAGTCCACACGTGCCCAATCGCGA
GGAGTGGAGGGAGTAGGGTCTGCG -> GAGTGGAGGGAGTAGGGTCTGCGG
GGAGTTGCTCGCTTATTCGAATAA -> GAGTTGCTCGCTTATTCGAATAAG
GGATACCCATGATACACGCGTCTT -> GATACCCATGATACACGCGTCTTA
GGATAGCTTATGTTGCCGCGAAGT -> GATAGCTTATGTTGCCGCGAAGTG
GGATAGGATATCATTGGGACTCAC -> GATAGGATATCATTGGGACTCACT
GGATATCATTGGGACTCACTAGTC -> GATATCATTGGGACTCACTAGTCG
GGATCACTTGTGTAAGTCTACTAT -> GATCACTTGTGTAAGTCTACTATC
GGATCATTGATGCCAGGCAAACTG -> GATCATTGATGCCAGGCAAACTGG
GGATCCATGCTGTCTTGATTATCC -> GATCCATGCTGTCTTGATTATCCA
GGATCGATTAATCTCGCGGAGGTG -> GATCGATTAATCTCGCGGAGGTGC
GGATGAGTGCGACCCTTGTAATAT -> GATGAGTGCGACCCTTGT
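
The prefix-to-suffix construction used for the adjacency list above is easier to follow on a handful of short k-mers. The following is a minimal sketch, not the notebook's own cell: the function name `de_bruijn` and the `toy_kmers` list are made up for illustration, and all k-mers are assumed to have the same length.

```python
from collections import defaultdict

def de_bruijn(kmers):
    """Map each (k-1)-mer prefix to the sorted list of (k-1)-mer suffixes it points to."""
    graph = defaultdict(list)
    for kmer in kmers:
        # every k-mer is one edge: prefix (all but last base) -> suffix (all but first base)
        graph[kmer[:-1]].append(kmer[1:])
    # sort nodes and successors so the printed adjacency list is alphabetical
    return {node: sorted(successors) for node, successors in sorted(graph.items())}

# made-up toy data; a repeated k-mer yields a repeated edge
toy_kmers = ['GAGG', 'CAGG', 'GGGG', 'GGGA', 'CAGG', 'AGGG', 'GGAG']
toy_graph = de_bruijn(toy_kmers)
for node, successors in toy_graph.items():
    print(node + ' -> ' + ','.join(successors))
```

Because 'CAGG' occurs twice in the toy input, the node CAG gets two edges to AGG; keeping such duplicates matters later, since each k-mer must be used exactly once in the Eulerian path.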

Euler's theorem: cycle<br>
Input: The adjacency list of an Eulerian directed graph.<br>
Output: An Eulerian cycle in this graph (6->8->7->9->6->5->4->2->1->0->3->2->6).

In [107]:
import re
bruijn = {}
with open('p146_data.txt') as f:
    for line in f:
        numbers = [int(n) for n in re.findall(r'[0-9]+', line)]
        #first number is the node, the rest are its successors
        bruijn[numbers[0]] = numbers[1:]
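#note: this is an alias, not a copy; removing an edge from bruijn_leftover also removes it from bruijn
bruijn_leftover = bruijn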
cycle = []
start = list(bruijn.keys())[0]
cycle.append(start)
cycle_next_step = bruijn[start][0]
current_position = start

#find the first cycle
while cycle_next_step != start:
    cycle.append(cycle_next_step)
    current_position = cycle_next_step
    cycle_next_step = bruijn[current_position][0]
    bruijn_leftover[cycle[-2]].remove(current_position)
bruijn_leftover[cycle[-1]].remove(bruijn_leftover[cycle[-1]][0])
unexplored_edges = []
for key in bruijn_leftover:
    unexplored_edges.extend(bruijn_leftover[key])

#find the other cycles
while len(unexplored_edges) > 0:
    #find a node on the cycle that still has unexplored outgoing edges
    for node in cycle:
        if len(bruijn_leftover[node]) > 0:
            new_start = node
    #re-order cycle to start at new_start
    cycle = cycle[cycle.index(new_start):] + cycle[:cycle.index(new_start)]
    current_position = new_start
    unexplored_edges.remove(new_start)
    cycle.append(current_position)
    cycle_next_step = bruijn_leftover[current_position][0]
    while cycle_next_step != new_start:
        cycle.append(cycle_next_step)
        unexplored_edges.remove(cycle_next_step)
        current_position = cycle_next_step
        cycle_next_step = bruijn_leftover[current_position][0]
        bruijn_leftover[cycle[-2]].remove(current_position)
    bruijn_leftover[cycle[-1]].remove(bruijn_leftover[cycle[-1]][0])
cycle.append(cycle_next_step)
output = [str(node) for node in cycle]
print('->'.join(output))

1002->1000->1604->1603->1605->2179->2181->2180->1605->1000->1001->12->11->14->1657->1658->1659->14->13->751->753->752->1444->1445->1446->752->13->102->339->338->337->102->527->722->723->776->1009->1044->1042->1043->1009->1011->1010->2325->2323->2324->1010->776->775->777->723->1300->1301->1302->723->721->527->526->2080->2082->2081->526->528->102->101->204->2107->2414->2413->2415->2107->2109->2108->204->202->2011->2013->2012->202->203->101->100->2966->2965->2967->100->463->464->465->902->901->2528->2527->2529->901->903->2759->2758->2760->903->465->100->676->2404->2405->2406->676->678->677->1513->1514->2260->2261->2262->1514->1515->677->100->13->15->54->223->1737->1736->1735->223->224->225->734->2369->2370->2368->2885->2886->2884->2368->734->733->2798->2797->2799->733->735->2863->2864->2865->735->225->54->52->374->2508->2506->2507->374->2957->2956->2958->374->375->373->52->53->1007->1008->1006->53->15->11->233->720->719->718->1083->1081->1082->718->233->1780->1781->1782->233->234->2450->2
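
The cycle search above repeatedly splices new cycles into the one found so far. The same idea (Hierholzer's algorithm) can also be written with an explicit stack, which avoids re-ordering the list. This is a hedged sketch under the assumption that the graph is connected and every node has equal in- and out-degree; `eulerian_cycle` and `toy_graph` are illustrative names, not part of the notebook.

```python
def eulerian_cycle(adjacency):
    """Return an Eulerian cycle as a list of nodes (the first node is repeated at the end)."""
    # copy the successor lists so the caller's graph is left untouched
    unused = {node: list(successors) for node, successors in adjacency.items()}
    start = next(iter(unused))
    stack = [start]
    cycle = []
    while stack:
        node = stack[-1]
        if unused.get(node):
            # follow any unused edge out of the current node
            stack.append(unused[node].pop())
        else:
            # dead end: all edges of this node are used, so it is final in the cycle
            cycle.append(stack.pop())
    # nodes were collected in reverse order of the walk
    return cycle[::-1]

# small made-up graph in which every node is balanced
toy_graph = {0: [3], 1: [0], 2: [1, 6], 3: [2], 4: [2],
             5: [4], 6: [5, 8], 7: [9], 8: [7], 9: [6]}
toy_cycle = eulerian_cycle(toy_graph)
print('->'.join(str(node) for node in toy_cycle))
```

A graph usually has several Eulerian cycles, so the printed order depends on which edge is taken first; any result that uses every edge exactly once is valid.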

Euler's theorem: path<br>
Input: The adjacency list of a directed graph that has an Eulerian path.<br>
Output: An Eulerian path in this graph (6->7->8->9->6->3->0->2->1->3->4).<br>

In [10]:
import re
bruijn = {}
with open('p147_data.txt') as f:
    for line in f:
        numbers = re.findall(r'[0-9A-Z]+', line)
        #first token is the node, the rest are its successors
        bruijn[numbers[0]] = numbers[1:]

#find the start and end nodes of the path: they are the only unbalanced nodes
#balance = out-degree - in-degree; it is +1 at the start and -1 at the end
balance = {}
for key in bruijn:
    balance[key] = balance.get(key, 0) + len(bruijn[key])
    for successor in bruijn[key]:
        balance[successor] = balance.get(successor, 0) - 1
for key in balance:
    if balance[key] == 1: beginning = key
    if balance[key] == -1: end = key

#connect the unbalanced nodes
if end in bruijn:
    bruijn[end].append(beginning)
else:
    bruijn[end] = [beginning]
    
#find the first cycle
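#note: this is an alias, not a copy; removing an edge from bruijn_leftover also removes it from bruijn
bruijn_leftover = bruijn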
cycle = []
start = list(bruijn.keys())[0]
cycle.append(start)
cycle_next_step = bruijn[start][0]
current_position = start
while cycle_next_step != start:
    cycle.append(cycle_next_step)
    current_position = cycle_next_step
    cycle_next_step = bruijn[current_position][0]
    bruijn_leftover[cycle[-2]].remove(current_position)
bruijn_leftover[cycle[-1]].remove(bruijn_leftover[cycle[-1]][0])
unexplored_edges = []
for key in bruijn_leftover:
    unexplored_edges.extend(bruijn_leftover[key])
    
#find the other cycles
while len(unexplored_edges) > 0:
    #find a node on the cycle that still has unexplored outgoing edges
    for node in cycle:
        if len(bruijn_leftover[node]) > 0:
            new_start = node
    #re-order cycle to start at new_start
    cycle = cycle[cycle.index(new_start):] + cycle[:cycle.index(new_start)]
    current_position = new_start
    unexplored_edges.remove(new_start)
    cycle.append(current_position)
    cycle_next_step = bruijn_leftover[current_position][0]
    while cycle_next_step != new_start:
        cycle.append(cycle_next_step)
        unexplored_edges.remove(cycle_next_step)
        current_position = cycle_next_step
        cycle_next_step = bruijn_leftover[current_position][0]
        bruijn_leftover[cycle[-2]].remove(current_position)
    bruijn_leftover[cycle[-1]].remove(bruijn_leftover[cycle[-1]][0])
cycle.append(cycle_next_step)

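#cut the cycle at the added edge end->beginning so that the path runs from beginning to end
#(done unconditionally: a cycle that happens to start at end still contains the added edge)
for c in range(len(cycle)-1):
    if cycle[c] == end and cycle[c+1] == beginning: cutting_point = c+1
cycle = cycle[cutting_point:-1] + cycle[:cutting_point]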

output = [str(node) for node in cycle]
print('\n'.join(output))

ATCACCTACATTTATCCCAGTCCG
TCACCTACATTTATCCCAGTCCGA
CACCTACATTTATCCCAGTCCGAT
ACCTACATTTATCCCAGTCCGATC
CCTACATTTATCCCAGTCCGATCC
CTACATTTATCCCAGTCCGATCCG
TACATTTATCCCAGTCCGATCCGA
ACATTTATCCCAGTCCGATCCGAA
CATTTATCCCAGTCCGATCCGAAT
ATTTATCCCAGTCCGATCCGAATC
TTTATCCCAGTCCGATCCGAATCT
TTATCCCAGTCCGATCCGAATCTA
TATCCCAGTCCGATCCGAATCTAT
ATCCCAGTCCGATCCGAATCTATT
TCCCAGTCCGATCCGAATCTATTG
CCCAGTCCGATCCGAATCTATTGT
CCAGTCCGATCCGAATCTATTGTA
CAGTCCGATCCGAATCTATTGTAC
AGTCCGATCCGAATCTATTGTACC
GTCCGATCCGAATCTATTGTACCT
TCCGATCCGAATCTATTGTACCTT
CCGATCCGAATCTATTGTACCTTT
CGATCCGAATCTATTGTACCTTTT
GATCCGAATCTATTGTACCTTTTT
ATCCGAATCTATTGTACCTTTTTA
TCCGAATCTATTGTACCTTTTTAC
CCGAATCTATTGTACCTTTTTACA
CGAATCTATTGTACCTTTTTACAC
GAATCTATTGTACCTTTTTACACG
AATCTATTGTACCTTTTTACACGA
ATCTATTGTACCTTTTTACACGAC
TCTATTGTACCTTTTTACACGACA
CTATTGTACCTTTTTACACGACAT
TATTGTACCTTTTTACACGACATG
ATTGTACCTTTTTACACGACATGT
TTGTACCTTTTTACACGACATGTC
TGTACCTTTTTACACGACATGTCC
GTACCTTTTTACACGACATGTCCG
TACCTTTTTACACGACATGTCCGA
ACCTTTTTACACGACATGTCCGAG
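
Since a graph can have more than one Eulerian path, comparing against a single expected answer is not a reliable check. A simple alternative is to confirm that a candidate path walks every edge of the adjacency list exactly once. The sketch below assumes the path is a list of node labels; `is_eulerian_path` is an illustrative helper, and `toy_graph` is a small made-up graph chosen to match the example path quoted in the cell description.

```python
from collections import Counter

def is_eulerian_path(adjacency, path):
    """Return True if `path` uses every edge of `adjacency` exactly once."""
    # multiset of edges in the graph (repeated edges are counted separately)
    edges = Counter((node, successor)
                    for node, successors in adjacency.items()
                    for successor in successors)
    # multiset of edges walked by consecutive nodes of the path
    walked = Counter(zip(path, path[1:]))
    return edges == walked

# made-up graph; node labels are strings, as in the k-mer version of the problem
toy_graph = {'0': ['2'], '1': ['3'], '2': ['1'], '3': ['0', '4'],
             '6': ['3', '7'], '7': ['8'], '8': ['9'], '9': ['6']}
toy_path = '6->7->8->9->6->3->0->2->1->3->4'.split('->')
print(is_eulerian_path(toy_graph, toy_path))
```

The same check works for the k-mer output above if the printed lines are read back into a list and compared with the adjacency list they were built from.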
