## BA2F

**Implement RandomizedMotifSearch**

Given: Positive integers k and t, followed by a collection of strings Dna.

Return: A collection BestMotifs resulting from running RandomizedMotifSearch(Dna, k, t) 1000 times. Remember to use pseudocounts!

Link: https://rosalind.info/problems/ba2f/

**Functions from previous problems that are used:**

In [16]:
def profile(k_mers):
    
    prof = []  # one probability dict per column

    for i in range(len(k_mers[0])):
        prof_col = {'A':0, 'C':0, 'G':0, 'T':0}
        for j in range(len(k_mers)):
            prof_col[k_mers[j][i]] += 1

        # divide by the number of k-mers to turn counts into probabilities
        for key in prof_col.keys():
            prof_col[key] = prof_col[key] / len(k_mers)

        prof.append(prof_col)

    return prof

In [17]:
# slightly different from the BA2C function because of the different input format
def profile_most_probable_kmer(dna, k, profile):
    max_probability = 0
    # default to the first k-mer
    most_probable = dna[0:k]
    for i in range(len(dna)-k+1):
        substr = dna[i:i+k]
        probability = 1
        for j in range(len(substr)):
            probability *= profile[j][substr[j]]
        if probability > max_probability:
            most_probable = substr
            max_probability = probability

    return most_probable

In [18]:
import collections
def score(motifs,k,t):

    columns = [''.join(sequence) for sequence in zip(*motifs)]
    maxCount = 0
    for column in columns:
      maxCount += collections.Counter(column).most_common(1)[0][1]

    return k*t - maxCount
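
As a quick sanity check of the scoring idea, the sketch below recomputes the score for a small hand-made motif set by counting, per column, the symbols that differ from the most common one. The helper name `column_score` and the toy motifs are made up for illustration and are not part of the Rosalind dataset.

```python
import collections

def column_score(motifs):
    # For every column, count the symbols that differ from the most common symbol.
    total = 0
    for column in zip(*motifs):
        most_common_count = collections.Counter(column).most_common(1)[0][1]
        total += len(column) - most_common_count
    return total

# Hypothetical toy motifs (t = 3, k = 4)
toy_motifs = ["ACGT", "ACGA", "TCGT"]
toy_score = column_score(toy_motifs)
print(toy_score)  # 2: one mismatch in the first column, one in the last
```

This gives the same value as `k*t - maxCount` in `score` above (here 4*3 - 10 = 2).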

In [19]:
def profilePseudocounts(motifs):
  prof = []
  for i in range(len(motifs[0])):
    prof_col = {'A':0, 'C':0, 'G':0, 'T':0}
    for j in range(len(motifs)):
      prof_col[motifs[j][i]] += 1

    # only this step differs from profile in BA2D: add a pseudocount of 1 to every count
    for key in prof_col.keys():
      prof_col[key] = (prof_col[key] + 1) / (len(motifs) + 4)

    prof.append(prof_col)
  return prof
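
To illustrate why the problem statement insists on pseudocounts, the sketch below compares a plain profile with a smoothed one on a toy motif set. Without pseudocounts, a single unseen nucleotide drives the probability of a whole k-mer to zero. The helper names `column_probabilities` and `kmer_probability` and the toy motifs are assumptions made for this example only.

```python
def column_probabilities(motifs, pseudocount):
    # Build one {'A', 'C', 'G', 'T'} probability dict per column.
    t = len(motifs)
    result = []
    for column in zip(*motifs):
        result.append({
            base: (column.count(base) + pseudocount) / (t + 4 * pseudocount)
            for base in "ACGT"
        })
    return result

def kmer_probability(kmer, prof):
    # Multiply the per-position probabilities of the k-mer's symbols.
    probability = 1.0
    for position, base in enumerate(kmer):
        probability *= prof[position][base]
    return probability

# Hypothetical toy motifs: 'T' never occurs in the first column
toy_motifs = ["ACG", "ACG", "ACT"]
plain = column_probabilities(toy_motifs, 0)
smoothed = column_probabilities(toy_motifs, 1)

p_plain = kmer_probability("TCG", plain)        # 0.0, because P(T in column 1) = 0
p_smoothed = kmer_probability("TCG", smoothed)  # (1/7) * (4/7) * (3/7)
print(p_plain, p_smoothed)
```

With a pseudocount of 1, every entry is positive, so a k-mer that differs from the current motifs in one position can still be selected in the next iteration.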

**New:**

In [20]:
from random import randint

In [21]:
def MotifsFromProfile(dna, k, profile):
  motifs_list = []
  for seq in dna:
    motifs_list.append(profile_most_probable_kmer(seq,k,profile))
  
  return motifs_list

In [22]:
def RandomizedMotifSearch(k,t,Dna):
  best_motifs = list()
  for i in Dna:
    random_int = randint(0, len(i)-k)
    best_motifs.append(i[random_int:random_int + k])
    
  while True:
    prof = profilePseudocounts(best_motifs)
    motifs = MotifsFromProfile(Dna,k,prof)
    if score(motifs,k,t) < score(best_motifs,k,t):
      best_motifs = motifs
    else:
      return best_motifs

In [23]:
def RandomizedMotifSearch_Iteration(k, t, Dna):

  best_motifs = list()
  for i in range(t):
    best_motifs.append(Dna[i][0:k])
  
  count = 0
  while count < 1000:
    motifs = RandomizedMotifSearch(k,t,Dna)
    if score(motifs,k,t) < score(best_motifs,k,t):
        best_motifs = motifs
    count += 1

  return best_motifs
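
Because the search draws random starting positions, two runs can return different motifs. If repeatable output is wanted (for example while debugging), one option is to pass a seeded `random.Random` instance into the search. The sketch below is a compact, self-contained variant written for illustration; the names `seeded_search` and `repeated_search`, the `rng` and `seed` parameters, and the toy sequences are assumptions and are not the notebook's functions above.

```python
import collections
import random

def score(motifs):
    # Number of symbols that differ from the per-column consensus.
    return sum(len(col) - collections.Counter(col).most_common(1)[0][1]
               for col in zip(*motifs))

def profile_with_pseudocounts(motifs):
    t = len(motifs)
    return [{b: (col.count(b) + 1) / (t + 4) for b in "ACGT"}
            for col in zip(*motifs)]

def most_probable_kmer(seq, k, prof):
    best, best_p = seq[:k], -1.0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        p = 1.0
        for j, base in enumerate(kmer):
            p *= prof[j][base]
        if p > best_p:
            best, best_p = kmer, p
    return best

def seeded_search(dna, k, rng):
    # One run of the randomized search; all randomness comes from rng.
    best = []
    for seq in dna:
        start = rng.randint(0, len(seq) - k)
        best.append(seq[start:start + k])
    while True:
        prof = profile_with_pseudocounts(best)
        motifs = [most_probable_kmer(seq, k, prof) for seq in dna]
        if score(motifs) < score(best):
            best = motifs
        else:
            return best

def repeated_search(dna, k, runs, seed):
    # Keep the best-scoring result over several runs sharing one seeded generator.
    rng = random.Random(seed)
    best = seeded_search(dna, k, rng)
    for _ in range(runs - 1):
        motifs = seeded_search(dna, k, rng)
        if score(motifs) < score(best):
            best = motifs
    return best

# Hypothetical toy input, not the Rosalind dataset
toy_dna = ["TTACCTTAAC", "GATGTCTGTC", "ACGGCGTTAG", "CCCTAACGAG", "CGTCAGAGGT"]
first = repeated_search(toy_dna, 4, 50, seed=0)
second = repeated_search(toy_dna, 4, 50, seed=0)
print(first == second)  # True: the same seed reproduces the same result
```

Using a dedicated `random.Random` object rather than the module-level functions keeps the seeding local to the search, so other code that uses `random` is unaffected.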

In [24]:
# sample dataset
with open('/content/sample_data/input.txt') as input_data:
    k, t = map(int, input_data.readline().split())
    dna = [line.strip() for line in input_data.readlines()]

In [25]:
results = RandomizedMotifSearch_Iteration(k,t,dna)

In [26]:
for result in results:
  print(result)

TCTCGGGG
CCAAGGTG
TACAGGCG
TTCAGGTG
TCCACGTG


In [None]:
# dataset
with open('/content/rosalind_ba2f.txt') as input_data:
    k, t = map(int, input_data.readline().split())
    dna = [line.strip() for line in input_data.readlines()]

In [None]:
results = RandomizedMotifSearch_Iteration(k,t,dna)

In [None]:
for result in results:
  print(result)

TGCTGTCTCCACCCG
TGTACCCCCGAGCGT
TGTTGTCCCGACATT
TGTTCAACCGAGCGT
CTCTGTCCCGAGCGT
GATTGTCCCGAGCGG
TGTTGTCCGAGGCGT
TGAGCTCCCGAGCGT
TAACGTCCCGAGCGT
TGTTGTTGTGAGCGT
TGTTTGACCGAGCGT
TGTTGTCCCGCATGT
AGTTGTCCCGAGCTG
TGTTGTCCCGAGTAC
TGTGTACCCGAGCGT
TGTTGCATCGAGCGT
TGTTGTTAGGAGCGT
TGTTGTCCCTGACGT
TGTTGTCGTTAGCGT
TGTTGGAACGAGCGT
