__[Problem 1I](http://rosalind.info/problems/ba1i/)__

>  __Frequent Words with Mismatches Problem__ - `FreqWordsMismatches( text, k, d )`

*Objective:* Find the most frequent $k$-mers with mismatches in a string

*  Input: a DNA string (*Text*), and integers $k, d$
*  Output: all most frequent $k$-mers with up to $d$ mismatches in *Text*

Def:
*  given strings *Text*, *Pattern*, and an integer $d$, we define $\textit{Count}_d(\textit{Text, Pattern})$ as the total number of occurrences of *Pattern* in *Text* with at most $d$ mismatches,
*  a __most frequent $k$-mer with up to $d$ mismatches__ in *Text* is a string *Pattern* maximizing $\textit{Count}_d(\textit{Text, Pattern})$ among all $k$-mers. Note that *Pattern* does not need to occur exactly in *Text*.
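
As a small illustration of the definition of $\textit{Count}_d$, the sketch below slides a window of length $|\textit{Pattern}|$ over *Text* and counts the windows within Hamming distance $d$ of *Pattern*. The names `hamming` and `count_d` are illustrative helpers for this example only and are not part of the solution that follows.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def count_d(text, pattern, d):
    """Count the windows of text within Hamming distance d of pattern."""
    k = len(pattern)
    return sum(
        1
        for i in range(len(text) - k + 1)
        if hamming(text[i:i + k], pattern) <= d
    )


sample = 'ACGTTGCATGTCGCATGATGCATGAGAGCT'
print(count_d(sample, 'GATG', 1))  # 5
print(count_d(sample, 'ATGC', 1))  # 5
```

Both patterns reach the same count on the Rosalind sample string, which is why the problem asks for all maximizers rather than a single one.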

In [1]:
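def d_Nbhd(Pattern, d):
    """Return the d-neighborhood of Pattern: every string of the same
    length within Hamming distance d of it."""

    # Trivial cases
    if d == 0:
        # NOTE: returns the bare string, not a collection; callers must wrap it
        return Pattern
    if len(Pattern) == 1:
        return {'A', 'C', 'G', 'T'}

    Nbhd = set()
    Suffix_Pattern = Pattern[1:]
    Suffix_Nbhd = d_Nbhd(Suffix_Pattern, d)

    for text in Suffix_Nbhd:
        if HammingDistance(Suffix_Pattern, text) < d:
            # Suffix has mismatches to spare, so any first symbol is allowed
            for x in ['A', 'C', 'G', 'T']:
                Nbhd.add(x + text)
        else:
            # Suffix already uses all d mismatches; first symbol must match
            Nbhd.add(Pattern[0] + text)

    return Nbhd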
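
One way to sanity-check a neighborhood routine is to compare its size with a closed-form count. Over the alphabet {A, C, G, T}, a $k$-mer has $\sum_{i=0}^{\min(d,k)} \binom{k}{i} 3^i$ strings within Hamming distance $d$: choose the $i$ mismatched positions, then one of the 3 other bases at each. The sketch below checks this against a brute-force enumeration. The helper names `brute_force_neighbors` and `neighborhood_size` are hypothetical, and `math.comb` assumes Python 3.8 or later.

```python
from itertools import product
from math import comb


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def brute_force_neighbors(pattern, d):
    """Enumerate all 4**k strings and keep those within distance d."""
    k = len(pattern)
    candidates = (''.join(chars) for chars in product('ACGT', repeat=k))
    return {c for c in candidates if hamming(c, pattern) <= d}


def neighborhood_size(k, d):
    """Closed-form size of the d-neighborhood of any DNA k-mer."""
    return sum(comb(k, i) * 3 ** i for i in range(min(d, k) + 1))


print(len(brute_force_neighbors('ATA', 1)), neighborhood_size(3, 1))   # 10 10
print(len(brute_force_neighbors('ACGT', 2)), neighborhood_size(4, 2))  # 67 67
```

The brute-force enumeration is exponential in $k$, so it is only meant as a check on short patterns.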

In [2]:
def HammingDistance(String1, String2):
    # Hamming distance is only defined for strings of equal length
    if len(String1) != len(String2):
        raise ValueError('Lengths do not match.')
    # Count the positions at which the two strings differ
    return sum(a != b for a, b in zip(String1, String2))

In [3]:
def FreqWordsMismatches( Text, k, d ):

    # Initialize helper values, sets
    noChar = len(Text)
    pttn_Nbhd = set()

    # Loop 1: Collect the d-neighborhood of every k-mer observed in Text
    for i in range(0, noChar-k+1):
        pattern = Text[i:i+k]
        # For d = 0, d_Nbhd returns the bare string; wrap it in a list so that
        # union adds the whole k-mer rather than its individual characters
        if d == 0:
            pttn_Nbhd = pttn_Nbhd.union([d_Nbhd(pattern, d)])
        else:
            pttn_Nbhd = pttn_Nbhd.union(d_Nbhd(pattern, d))
    # Convert final d-nbhd set to a list
    pttn_Nbhd = list(pttn_Nbhd)

    # No k-mers at all (k > len(Text)): nothing to report
    if not pttn_Nbhd:
        return []

    # Loop 2: Get the frequency of each candidate k-mer from Loop 1
    count_arr = []
    for pttn in pttn_Nbhd:
        count = 0
        # Loop 2.1: Compare the candidate against every k-mer window of Text
        for i in range(0, noChar-k+1):
            stringText = Text[i:i+k]
            if HammingDistance(pttn, stringText) <= d:
                count += 1
        count_arr.append(count)

    # Keep every candidate whose count equals the maximum
    max_count = max(count_arr)
    most_freq = [pttn for pttn, count in zip(pttn_Nbhd, count_arr) if count == max_count]

    return most_freq
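
The function above compares every candidate against every window of *Text*. A possible alternative, sketched here under the assumption that a single counting pass is preferred, lets each window add one to the tally of every string in its own $d$-neighborhood, so the final tally of a pattern equals $\textit{Count}_d(\textit{Text, Pattern})$. The names `neighbors` and `freq_words_mismatches` are illustrative; this is a different technique from the one used in this notebook, and the result is sorted only to make the output reproducible.

```python
from collections import Counter
from itertools import combinations, product


def neighbors(pattern, d):
    """All strings within Hamming distance d of pattern."""
    k = len(pattern)
    result = set()
    for m in range(min(d, k) + 1):
        # Choose m positions to mutate ...
        for positions in combinations(range(k), m):
            # ... and, at each one, one of the three other bases
            options = [[b for b in 'ACGT' if b != pattern[p]] for p in positions]
            for subs in product(*options):
                chars = list(pattern)
                for p, b in zip(positions, subs):
                    chars[p] = b
                result.add(''.join(chars))
    return result


def freq_words_mismatches(text, k, d):
    """Most frequent k-mers with up to d mismatches, in sorted order."""
    counts = Counter()
    for i in range(len(text) - k + 1):
        # Each window votes once for every pattern it approximately matches
        counts.update(neighbors(text[i:i + k], d))
    if not counts:
        return []
    best = max(counts.values())
    return sorted(p for p, c in counts.items() if c == best)


print(freq_words_mismatches('ACGTTGCATGTCGCATGATGCATGAGAGCT', 4, 1))
# ['ATGC', 'ATGT', 'GATG']
```

On the Rosalind sample input this reproduces the same three $k$-mers as the sample test in this notebook.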

In [4]:
Text = 'ACGTTGCATGTCGCATGATGCATGAGAGCT'
print('Sample test: ' + str(FreqWordsMismatches(Text,4,1)))

Sample test: ['ATGC', 'ATGT', 'GATG']


In [5]:
Text = 'AAAAAAAAAA'
print('Test 1: ' + str(FreqWordsMismatches(Text,2,1)))

Test 1: ['AT', 'AA', 'GA', 'CA', 'TA', 'AG', 'AC']


In [6]:
Text = 'AGTCAGTC'
print('Test 2: ' + str(FreqWordsMismatches(Text,4,2)))

Test 2: ['GGCC', 'GTTC', 'TCTC', 'CGTT', 'ACAC', 'TGTG', 'AGAG', 'GGTA', 'ACTG', 'AGGT', 'AGCA', 'CATC', 'AATT', 'ATTA', 'CGGC', 'TGAC', 'AAGC', 'ATCC']


In [7]:
Text = 'AATTAATTGGTAGGTAGGTA'
print('Test 3: ' + str(FreqWordsMismatches(Text,4,0)))

Test 3: ['GGTA']


In [8]:
Text = 'ATA'
print('Test 4: ' + str(FreqWordsMismatches(Text,3,1)))

Test 4: ['AGA', 'AAA', 'CTA', 'TTA', 'ATA', 'ATC', 'ATT', 'ACA', 'GTA', 'ATG']


In [9]:
Text = 'AAT'
print('Test 5: ' + str(FreqWordsMismatches(Text,3,0)))

Test 5: ['AAT']


In [10]:
Text = 'TAGCG'
print('Test 6: ' + str(FreqWordsMismatches(Text,2,1)))

Test 6: ['GG', 'TG']


In [11]:
Text = 'CACAGTAGGCGCCGGCACACACAGCCCCGGGCCCCGGGCCGCCCCGGGCCGGCGGCCGCCGGCGCCGGCACACCGGCACAGCCGTACCGGCACAGTAGTACCGGCCGGCCGGCACACCGGCACACCGGGTACACACCGGGGCGCACACACAGGCGGGCGCCGGGCCCCGGGCCGTACCGGGCCGCCGGCGGCCCACAGGCGCCGGCACAGTACCGGCACACACAGTAGCCCACACACAGGCGGGCGGTAGCCGGCGCACACACACACAGTAGGCGCACAGCCGCCCACACACACCGGCCGGCCGGCACAGGCGGGCGGGCGCACACACACCGGCACAGTAGTAGGCGGCCGGCGCACAGCC'
print('Extra Test: ' + str(FreqWordsMismatches(Text,10,2)))

Extra Test: ['GCACACAGAC', 'GCGCACACAC']


<br/>

[Rosalind final test:](http://rosalind.info/problems/ba1i/)

In [12]:
Text = 'GGGGGGATCTGGGGGGATCTGGGGGGATCTAGAAAACGGAGAAAACGGAGAAAACGGTATGTCGGTTATAACTCGGGGGGATCTGGGGGGATCTGGGGGGATCTTATAACTCTATGTCGGTTGCTTGAACAAGAAAACGGAGAAAACGGTGCTTGAACATATAACTCTATAACTCTATGTCGGTTGCTTGAACATGCTTGAACAGGGGGGATCTTATAACTCTGCTTGAACAAGAAAACGGTGCTTGAACAGGGGGGATCTTGCTTGAACATATAACTCAGAAAACGGTATGTCGGTTATGTCGGTTGCTTGAACAGGGGGGATCTTATAACTCTGCTTGAACATGCTTGAACATATGTCGGTGGGGGGATCTTATGTCGGTTATGTCGGTTATAACTCAGAAAACGGTGCTTGAACATATAACTCTATGTCGGTTGCTTGAACAAGAAAACGGAGAAAACGGTATAACTCAGAAAACGGTATAACTCTGCTTGAACATATGTCGGTAGAAAACGGTGCTTGAACATATAACTCTGCTTGAACATATGTCGGTAGAAAACGGAGAAAACGGTGCTTGAACATGCTTGAACATATAACTCAGAAAACGGGGGGGGATCTAGAAAACGGTATGTCGGTTGCTTGAACAAGAAAACGGTGCTTGAACAAGAAAACGGTATGTCGGTGGGGGGATCTTGCTTGAACATGCTTGAACAAGAAAACGGAGAAAACGGAGAAAACGGAGAAAACGGGGGGGGATCTTGCTTGAACATGCTTGAACAGGGGGGATCTTATGTCGGTTATAACTCGGGGGGATCTTGCTTGAACATATGTCGGTTGCTTGAACATGCTTGAACATGCTTGAACATGCTTGAACAAGAAAACGGAGAAAACGGTATAACTCTGCTTGAACA'
answer = FreqWordsMismatches(Text,5,2)
print('Final test: ' + str(answer))

Final test: ['TAAAA']
