# Chapter 1: Where in the genome does DNA replication begin?

## Counting Words

> We use the term **k-mer** to refer to a string of length *k* and define COUNT(*Text*, *Pattern*) as the number of times that a k-mer *Pattern* appears as a substring of *Text*.

Below, implement PATTERNCOUNT(*Text*, *Pattern*) as a sliding window over *Text* that compares each window of length |*Pattern*| against *Pattern*. Return the number of times *Pattern* appears as a substring of *Text*, counting overlapping occurrences.

In [1]:
def pattern_count(text, pattern):
    count = 0
    k = len(pattern)
    for i in range(len(text) - k + 1):
        if text[i:i+k] == pattern:
            count += 1
            
    return count

# test data - should return 3 -------------------------------------------------
s = 'ACAACTATGCATACTATCGGGAACTATCCT'
p = 'ACTAT'

print(pattern_count(s, p))

3
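
The sliding window matters because occurrences of *Pattern* may overlap. As a small illustration (the helper name `count_overlapping` and the toy strings are assumptions made for this sketch, not part of the original exercise), Python's built-in `str.count` only counts non-overlapping matches, so it can undercount:

```python
def count_overlapping(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    k = len(pattern)
    return sum(1 for i in range(len(text) - k + 1) if text[i:i+k] == pattern)


text = 'GCGCG'
pattern = 'GCG'

print(count_overlapping(text, pattern))  # matches start at positions 0 and 2
print(text.count(pattern))               # str.count skips the overlapping match
```

Here the window-based count is 2, while `str.count` reports 1, because it resumes searching only after the end of the previous match.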


## Frequent Words

> A straightforward algorithm for finding the most frequent k-mers in a string *Text* checks all k-mers appearing in this string and then computes how many times each k-mer appears in *Text*.

Below, implement the inefficient FREQUENTWORDS(*Text*, *k*) algorithm, which calls PATTERNCOUNT once for each of the |*Text*| - *k* + 1 windows. Return the most frequent k-mers in *Text*.

In [2]:
def frequent_words(text, k):
    frequent_patterns = set()
    count = []
    for i in range(len(text) - k + 1):
        pattern = text[i:i+k]
        count.append(pattern_count(text, pattern))
    
    max_count = max(count)
    for i in range(len(text) - k + 1):
        if count[i] == max_count:
            frequent_patterns.add(text[i:i+k])
            
    return frequent_patterns

# -----------------------------------------------------------------------------
Text = 'ACTGACTCCCACCCC'
K = 3

print(frequent_words(Text, K))

{'CCC'}
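
The inefficiency comes from rescanning *Text* for every window. One possible alternative, sketched here with `collections.Counter` from the standard library (the function name `frequent_words_counter` is hypothetical, and this is not the approach used in the cell above), tallies every k-mer in a single pass:

```python
from collections import Counter


def frequent_words_counter(text, k):
    """Return the set of most frequent k-mers using one pass over text."""
    counts = Counter(text[i:i+k] for i in range(len(text) - k + 1))
    if not counts:  # text shorter than k: there are no k-mers at all
        return set()
    max_count = max(counts.values())
    return {kmer for kmer, n in counts.items() if n == max_count}


print(frequent_words_counter('ACTGACTCCCACCCC', 3))
```

On the input used above this should agree with `frequent_words`. When several k-mers tie for the maximum count, all of them are returned in the set.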


## Building a Faster Frequent Words Function

This implementation slides along *Text* only once, incrementing entries of a frequency array as it goes. The array has one entry for each of the `4**k` possible k-mers, and its *i*-th element holds the number of times that the *i*-th k-mer (in lexicographic order) appears in *Text*.

In [23]:
def pattern_to_number(pattern):
    if not pattern:
        return 0
    
    symbols = {"A":0, "C":1, "G":2, "T":3}
    symbol = pattern[-1]
    prefix = pattern[:-1]
    
    return 4 * pattern_to_number(prefix) + symbols[symbol]


def number_to_pattern(index, k):
    numbers = "ACGT"
    
    if k == 1:
        return numbers[index]
    
    prefix_index = index // 4
    r = index % 4
    symbol = numbers[r]
    prefix_pattern = number_to_pattern(prefix_index, k-1)
    
    return prefix_pattern + symbol


def computing_frequencies(text, k):
    frequency_array = [0 for _ in range(4**k)]
    for i in range(len(text) - k + 1):
        pattern = text[i:i+k]
        j = pattern_to_number(pattern)
        frequency_array[j] += 1
    
    return frequency_array


def faster_frequent_words(text, k):
    frequent_patterns = set()
    frequency_array = computing_frequencies(text, k)
    max_count = max(frequency_array)
    for i in range(4**k):  # cover every index, including the all-T k-mer at 4**k - 1
        if frequency_array[i] == max_count:
            pattern = number_to_pattern(i, k)
            frequent_patterns.add(pattern)
    
    return frequent_patterns


faster_frequent_words(Text, K)

{'CCC'}
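
The two recursive helpers amount to reading a k-mer as a base-4 number with digits A=0, C=1, G=2, T=3. A minimal iterative sketch of the same mapping (the names `encode` and `decode` are assumptions made for this example) makes the round trip explicit:

```python
SYMBOLS = "ACGT"


def encode(pattern):
    """Interpret pattern as a base-4 number (A=0, C=1, G=2, T=3)."""
    number = 0
    for symbol in pattern:
        number = 4 * number + SYMBOLS.index(symbol)
    return number


def decode(number, k):
    """Inverse of encode: recover the k-mer of length k for a given number."""
    symbols = []
    for _ in range(k):
        number, remainder = divmod(number, 4)
        symbols.append(SYMBOLS[remainder])
    return "".join(reversed(symbols))


print(encode("ATGCAA"))  # 3*4**4 + 2*4**3 + 1*4**2 = 912
print(decode(912, 6))    # round trip back to the original k-mer
print(encode("TTT"))     # the last 3-mer sits at index 4**3 - 1
```

Because the all-T k-mer maps to index `4**k - 1`, any loop over the frequency array has to cover indices 0 through `4**k - 1` inclusive.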

## Finding Frequent Words by Sorting

In [24]:
def frequent_words_by_sorting(text, k):
    frequent_patterns = set()
    index = [0 for _ in range(len(text) - k + 1)]
    count = [0 for i in index]
    for i in range(len(text) - k + 1):
        pattern = text[i:i+k]
        index[i] = pattern_to_number(pattern)
        count[i] += 1
    
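    # Sorting places equal k-mers next to each other, so the last position
    # of each run ends up holding that k-mer's number of occurrences.
    sorted_index = sorted(index)
    for i in range(1, len(text) - k + 1):  # start at 1: i-1 would wrap around at i = 0
        if sorted_index[i] == sorted_index[i-1]:
            count[i] = count[i-1] + 1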
    
    max_count = max(count)
    for i in range(len(text) - k + 1):
        if count[i] == max_count:
            pattern = number_to_pattern(sorted_index[i], k)
            frequent_patterns.add(pattern)
    
    return frequent_patterns


frequent_words_by_sorting(Text, K)

{'CCC'}
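
The idea behind the sorting approach is that sorting brings identical k-mers next to each other, so the length of each run is that k-mer's count. As a hedged sketch of the same idea, the version below sorts the k-mer strings directly and uses `itertools.groupby` to measure the runs. The function name `frequent_words_by_runs` is hypothetical, and skipping the number encoding is a simplification rather than the method in the cell above:

```python
from itertools import groupby


def frequent_words_by_runs(text, k):
    """Return the most frequent k-mers by sorting them and measuring runs."""
    kmers = sorted(text[i:i+k] for i in range(len(text) - k + 1))
    # groupby yields one group per run of equal neighbours in the sorted list
    run_lengths = {kmer: len(list(run)) for kmer, run in groupby(kmers)}
    if not run_lengths:  # text shorter than k: there are no k-mers at all
        return set()
    max_count = max(run_lengths.values())
    return {kmer for kmer, n in run_lengths.items() if n == max_count}


print(frequent_words_by_runs('ACTGACTCCCACCCC', 3))
```

On the input used above this should agree with `frequent_words_by_sorting`.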