In [11]:
#Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## DnaA boxes

DnaA proteins bind to short segments called DnaA boxes within the region known as ori (the origin of replication), where the replication process begins. Finding ori is one of the key tasks in understanding how cells replicate. First, let's search for frequent words (substrings) in ori, because some nucleotide strings appear surprisingly often in small regions of the genome.
We will start with a bacterium called Vibrio cholerae and then design a computational approach for finding ori in other bacterial genomes.

Here is the nucleotide sequence appearing in the ori of Vibrio cholerae:

![ORI vibrio](data/Screenshot_20200712_003201.png)

## K-mer

One possible approach is to design a "sliding window" that moves along the text, checking whether each substring of length k (a k-mer) matches the pattern we are looking for.

K-mer pseudocode:

![Pseudocode k-mer](data/k-mer.png)

Let's implement the pseudocode in Python.

P.S. Python offers many other ways to do this, several of them simpler and more efficient, but let's keep the code close to the pseudocode.

In [3]:
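def PatternCount(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    k = len(pattern)

    # The last window starts at len(text) - k, hence the "+ 1".
    for i in range(len(text) - k + 1):
        if text[i: i + k] == pattern:
            count += 1
    return count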

In [9]:
#Example
# Read the sequence; the context manager closes the file afterwards.
with open("data/ori_vibrio.txt", "r") as text:
    texto = text.read()
count = PatternCount(texto, 'CCG')
print(count)

13125
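
As noted earlier, Python allows shorter formulations of the same sliding window. The sketch below is one possibility, not a replacement for the pseudocode-style version; `count_overlapping` is a hypothetical name introduced here for illustration. It also shows why the built-in `str.count` is not equivalent: it counts only non-overlapping occurrences.

```python
def count_overlapping(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps."""
    last_start = len(text) - len(pattern)
    # startswith(pattern, i) tests the window that begins at position i.
    return sum(text.startswith(pattern, i) for i in range(last_start + 1))

print(count_overlapping("ATATA", "ATA"))  # 2: matches start at 0 and 2
print("ATATA".count("ATA"))               # 1: str.count skips overlaps
```

For DNA motifs, which often overlap themselves, the overlapping count is the one that matches Count(Text, Pattern).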


## Frequent words
We say that a pattern is a most frequent k-mer in the input text if it maximizes Count(Text, Pattern) among all k-mers. For instance, ACTAT is a most frequent 5-mer of ACAACTATGCATACTATCGGGAACTATCCT.

One algorithm for finding the most frequent k-mers in a string checks every k-mer appearing in the string and computes how many times each one occurs. To implement this FrequentWords algorithm, let's create an array Count, where Count(i) stores Count(Text, Pattern) for Pattern = Text(i, k), the k-mer starting at position i of Text.

FrequentWords pseudocode:

![Pseudocode frequentwords](data/frequentwords.png)

In [12]:
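def FrequentWordsProblem(text, k):
    frequentPatterns = []
    # One counter per k-mer start position (len(text) - k + 1 windows).
    count = np.zeros(shape=(len(text) - k + 1))

    for i in range(len(text) - k + 1):
        pattern = text[i: i + k]
        count[i] = PatternCount(text, pattern)

    # Start positions whose k-mer reaches the maximum count.
    max_count_indices = np.where(count == np.max(count))

    # Collect each most frequent k-mer once, in order of first appearance.
    for i in max_count_indices[0]:
        if text[i:i+k] not in frequentPatterns:
            frequentPatterns.append(text[i:i+k])
    return frequentPatterns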

In [14]:
frequent = FrequentWordsProblem('ACGTTGCATGTCGCATGATGCATGAGAGCT', 4)
print(frequent)

['GCAT', 'CATG']
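
As a quick cross-check of this result, the counts can also be tallied in a single pass with `collections.Counter`. This is a different technique from the Count array in the pseudocode, shown only as a sketch; `most_frequent_kmers` is a hypothetical helper name, and it returns the k-mers sorted alphabetically rather than in order of appearance.

```python
from collections import Counter

def most_frequent_kmers(text, k):
    """Return (sorted most frequent k-mers, their count) for text."""
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    if not counts:  # text shorter than k: there are no k-mers at all
        return [], 0
    best = max(counts.values())
    return sorted(kmer for kmer, n in counts.items() if n == best), best

print(most_frequent_kmers("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4))
# (['CATG', 'GCAT'], 3)
print(most_frequent_kmers("ACAACTATGCATACTATCGGGAACTATCCT", 5))
# (['ACTAT'], 3)
```

Both calls agree with the text: GCAT and CATG each occur three times in the first string, and ACTAT is the most frequent 5-mer of the second.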


FrequentWords finds the most frequent k-mers, but it is not very efficient. Each call to PatternCount checks whether the k-mer Pattern occurs at every position of Text. Since each call makes |Text| − k + 1 such checks, each requiring as many as k comparisons, one call to PatternCount takes (|Text| − k + 1) · k steps. Furthermore, FrequentWords must call PatternCount |Text| − k + 1 times (once for each k-mer of Text), so its overall number of steps is (|Text| − k + 1) · (|Text| − k + 1) · k.
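
To get a feel for this bound, the formula can be evaluated directly. The helper below is only an illustration of the arithmetic; `frequent_words_steps` is a hypothetical name, and the result is the worst-case number of character comparisons, not a measured running time.

```python
def frequent_words_steps(text_length, k):
    """Worst-case character comparisons for the naive FrequentWords."""
    windows = text_length - k + 1        # number of k-mers in Text
    pattern_count_steps = windows * k    # cost of one PatternCount call
    return windows * pattern_count_steps # one call per k-mer

# The 30-nucleotide example above with k = 4: 27 * 27 * 4
print(frequent_words_steps(30, 4))  # 2916

# A hypothetical text of one million nucleotides with k = 9
print(frequent_words_steps(1_000_000, 9))
```

Because the window count appears twice in the product, the cost grows roughly with the square of the text length.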

A frequency array is another approach to solving the Frequent Words Problem.