- is there a 'clock gene'?
- transcription begins when an RNA polymerase binds to a promoter sequence on the DNA molecule, which is often located just upstream from the starting point for transcription
- initiation of transcription is a convenient control point for the cell to regulate gene expression

- plants: each cell keeps track of day and night independently of other cells; the genes LHY, CCA1, and TOC1 encode regulatory proteins

- negative feedback loop
- transcription factors
- regulatory motif
- transcription factor binding site
- upstream region

# 2.2  
What is the expected number of occurrences of a 9-mer in 500 random DNA strings, each of length 1000? Assume that each nucleotide is selected independently with equal probability.
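
One way to work this out, assuming overlapping occurrences are counted and relying on linearity of expectation: a fixed 9-mer matches at any given starting position with probability 1/4^9, each string offers 1000 - 9 + 1 = 992 starting positions, and there are 500 strings. The function name below is only illustrative.

```python
def expected_kmer_occurrences(num_strings, length, k, alphabet_size=4):
    """Expected number of occurrences of one fixed k-mer in uniformly random strings."""
    positions_per_string = length - k + 1          # possible starting positions
    match_probability = (1 / alphabet_size) ** k   # chance of a match at one position
    return num_strings * positions_per_string * match_probability

expected = expected_kmer_occurrences(500, 1000, 9)
print(round(expected, 4))  # 1.8921
```

So a particular 9-mer is expected to show up about twice by chance alone across the whole collection.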

- evening element
- immunity genes
- NF-kB

In [1]:
# Helper functions
symbols = ['A', 'T', 'C', 'G']

NumberToSymbol = {0: 'A', 1: 'C', 2:'G', 3:'T'}

def NumberToPattern(index, k):
    if k == 1:
        return NumberToSymbol[index]
    prefixIndex = index//4
    r = index - 4*prefixIndex
    symbol = NumberToSymbol[r]
    PrefixPattern = NumberToPattern(prefixIndex, k-1)
    return PrefixPattern + symbol

def HammingDistance(dna1, dna2):
    # number of positions at which two equal-length strings differ
    if len(dna1) != len(dna2):
        raise ValueError('Strings are not of equal length')
    return sum(n1 != n2 for n1, n2 in zip(dna1, dna2))

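def Neighbors(Pattern, d):
    # all strings within Hamming distance d of Pattern
    if d == 0:
        # return a list, so callers iterate over patterns rather than characters
        return [Pattern]
    if len(Pattern) == 1:
        return list(symbols)
    neighborhood = []
    suffixNeighbors = Neighbors(Pattern[1:], d)
    for text in suffixNeighbors:
        if HammingDistance(Pattern[1:], text) < d:
            # the suffix has mismatches to spare: any first symbol is allowed
            for symbol in symbols:
                neighborhood.append(symbol + text)
        else:
            # the suffix already uses all d mismatches: the first symbol must match
            neighborhood.append(Pattern[0] + text)
    return neighborhood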
    

In [2]:
# 2a Motif Enumeration
def MotifEnumeration(k, d, strings):
    patterns = set()
    for string in strings:
        # strings that contain the current string already contain pattern, so skip them
        other_strings = [other for other in strings if string not in other]
        for i in range(len(string) - k + 1):
            # for each k-mer pattern in Dna
            pattern = string[i:i+k]
            # for each k-mer pattern' differing from pattern by <= d mismatches
            for pattern_d in Neighbors(pattern, d):
                # keep pattern' if it appears in each string from Dna with <= d mismatches
                if all(any(HammingDistance(pattern_d, other[j:j+k]) <= d
                           for j in range(len(other) - k + 1))
                       for other in other_strings):
                    patterns.add(pattern_d)
    return list(patterns)

In [4]:
' '.join(MotifEnumeration(3, 1, ['ATTTGGC', 'TGCCTTA', 'CGGTATC', 'GAAAATT']))

'TTT ATA GTT ATT'

In [5]:
test_file = 'rosalind_ba2a.txt'
k_d_strings = []
with open(test_file, 'r') as reader:
    for line in reader:
        k_d_strings.append(line.strip('\n'))
k = int(k_d_strings[0].split(' ')[0])
d = int(k_d_strings[0].split(' ')[1])
# starts at the third line of the file; use [1:] if the DNA strings begin on the second line
strings = k_d_strings[2:]
' '.join(MotifEnumeration(k, d, strings))

'GTTGT GTTTT GTTCT GTTAT'

A more appropriate problem formulation would score individual instances of motifs depending on how similar they are to an “ideal” motif (i.e., a transcription factor binding site that binds the best to the transcription factor). However, since the ideal motif is unknown, we attempt to select a k-mer from each string and score these k-mers depending on how similar they are to each other.

- count matrix
- profile matrix
- consensus string
- probability distribution
- entropy: a measure of the uncertainty of a probability distribution (p1, ..., pN), $H(p_1, \ldots, p_N) = -\sum_{i=1}^{N} p_i \log_2 p_i$

max entropy $= -4 \cdot \frac{1}{4} \cdot \log_2 \frac{1}{4} = 2$ (all four nucleotides equally likely), min entropy $= 0$ (one nucleotide has probability 1)

- motif logo: a diagram for visualizing motif conservation that consists of a stack of letters at each position
- information content: total height of a column = 2 - H(p1, ..., pN). Lower entropy --> higher information content, i.e., a more conserved column is drawn taller in the motif logo
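
The entropy and information-content definitions above can be checked numerically. This is a minimal sketch; the function names are illustrative, and it assumes the usual convention that a zero probability contributes 0 to the sum.

```python
import math

def column_entropy(probabilities):
    """Shannon entropy (in bits) of one profile column; terms with p = 0 contribute 0."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

def information_content(probabilities):
    """Height of the column in a motif logo for a 4-letter alphabet: 2 - H."""
    return 2 - column_entropy(probabilities)

uniform = [0.25, 0.25, 0.25, 0.25]    # least conserved column
conserved = [1.0, 0.0, 0.0, 0.0]      # fully conserved column
mixed = [0.5, 0.5, 0.0, 0.0]          # two equally likely nucleotides

for column in (uniform, conserved, mixed):
    print(column_entropy(column), information_content(column))
```

The uniform column reaches the maximum entropy of 2 bits (height 0 in the logo), while the fully conserved column has entropy 0 (height 2).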

__Motif Finding Problem__: Given a collection of strings, find a set of k-mers, one from each string, that minimizes the score of the resulting motif.

__Input__: A collection of strings Dna and an integer k.

__Output__: A collection Motifs of k-mers, one from each string in Dna, minimizing Score(Motifs) among all possible choices of k-mers.

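Score(Motifs) counts mismatches to the consensus column by column; counted row by row, the same total is the sum of the Hamming distances between Consensus(Motifs) and each row, i.e., d(Consensus(Motifs), Motifs).
--> instead of searching for a collection of k-mers Motifs minimizing Score(Motifs), search for a potential consensus string Pattern minimizing d(Pattern, Motifs)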
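
A small check that the score can be counted either way: per column as the number of symbols that disagree with the most frequent one, or per row as the Hamming distance to the consensus string. The three motifs are made up for illustration and the helper names are not from the notebook.

```python
from collections import Counter

def consensus(motifs):
    # most frequent symbol in each column
    return ''.join(Counter(column).most_common(1)[0][0] for column in zip(*motifs))

def score_by_columns(motifs):
    # per column: number of rows minus the count of the most frequent symbol
    return sum(len(column) - Counter(column).most_common(1)[0][1] for column in zip(*motifs))

def score_by_rows(motifs):
    # per row: Hamming distance between the consensus string and the motif
    pattern = consensus(motifs)
    return sum(sum(a != b for a, b in zip(pattern, motif)) for motif in motifs)

example_motifs = ['ACG', 'ACT', 'TCG']
print(consensus(example_motifs))         # ACG
print(score_by_columns(example_motifs))  # 2
print(score_by_rows(example_motifs))     # 2
```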

__Equivalent Motif Finding Problem__: Given a collection of strings, find a collection of k-mers (one from each string) that minimizes the distance between all possible patterns and all possible collections of k-mers.

__Input__: A collection of strings Dna and an integer k.

__Output__: A k-mer Pattern and a collection of k-mers, one from each string in Dna, minimizing d(Pattern, Motifs) among all possible choices of Pattern and Motifs.

__Median String Problem__: Find a median string.

__Input__: A collection of strings Dna and an integer k.

__Output__: A k-mer Pattern that minimizes d(Pattern, Dna) among all possible choices of k-mers.

Notice that finding a median string requires solving a double minimization problem. We must find a k-mer Pattern that minimizes d(Pattern, Dna), where this function is itself computed by taking a minimum over all choices of k-mers from each string in Dna.


In [6]:
# 2b Find a Median String

def DistancePatternAndString(Pattern, Text):
    k = len(Pattern)
    pattern_prime = [Text[i:i+k] for i in range(len(Text) - k + 1)]
    d_pattern_text = min(list(map(lambda x: HammingDistance(Pattern, x), pattern_prime)))
    
    return d_pattern_text

def MedianString(k, strings):
    distance = 1e9
    for i in range(4**k):
        pattern = NumberToPattern(i, k)
        d_pattern_strings = sum(list(map(lambda x: DistancePatternAndString(pattern, x), strings)))
        if distance > d_pattern_strings:
            distance = d_pattern_strings
            median = pattern
    return median

In [100]:
distance = 0
for text in ['TTACCTTAAC', 'GATATCTGTC', 'ACGGCGTTCG', 'CCCTAAAGAG', 'CGTCAGAGGT']:
    distance += DistancePatternAndString('AAA', text)
distance

5

In [108]:
test_file = 'rosalind_ba2h.txt'
pattern_texts = []
with open(test_file, 'r') as reader:
    for line in reader:
        pattern_texts.append(line.strip('\n'))
pattern = pattern_texts[0]
texts = pattern_texts[1].split(' ')
distance = 0
for text in texts:
    distance += DistancePatternAndString(pattern, text)
distance

40

In [7]:
MedianString(3, ['AAATTGACGCAT', 'GACGACCACGTT', 'CGTCAGCGCCTG', 'GCTGAGCACCGG', 'AGTACGGGACAG'])

'ACG'

In [8]:
test_data = 'rosalind_ba2b.txt'
k_strings = []
with open(test_data, 'r') as reader:
    for line in reader:
        k_strings.append(line.strip('\n'))
k = int(k_strings[0])
strings = k_strings[1:]
MedianString(k, strings)

'TTTTAA'

- greedy algorithm
- profile-most probable

__Profile-most Probable k-mer Problem__: Find a Profile-most probable k-mer in a string.

__Input__: A string Text, an integer k, and a 4 × k matrix Profile.

__Output__: A Profile-most probable k-mer in Text.
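
To make "Profile-most probable" concrete: the probability of a k-mer given a profile is the product of the profile entries picked out by each of its symbols, position by position. The sketch below uses the same 4 × 5 profile values as the sample cell in this notebook; the function name is illustrative.

```python
def kmer_probability(kmer, profile):
    """Pr(kmer | profile): product of profile[symbol][position] over all positions."""
    probability = 1.0
    for position, symbol in enumerate(kmer):
        probability *= profile[symbol][position]
    return probability

example_profile = {'A': [0.2, 0.2, 0.3, 0.2, 0.3],
                   'C': [0.4, 0.3, 0.1, 0.5, 0.1],
                   'G': [0.3, 0.3, 0.5, 0.2, 0.4],
                   'T': [0.1, 0.2, 0.1, 0.1, 0.2]}

# 0.4 * 0.3 * 0.5 * 0.2 * 0.4 = 0.0048
print(kmer_probability('CCGAG', example_profile))
# 0.2 * 0.3 * 0.1 * 0.1 * 0.4 = 0.00024
print(kmer_probability('ACCTG', example_profile))
```

A Profile-most probable k-mer in Text is then any k-mer of Text that maximizes this value.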

In [9]:
A = '0.2 0.2 0.3 0.2 0.3'
C = '0.4 0.3 0.1 0.5 0.1'
G = '0.3 0.3 0.5 0.2 0.4'
T = '0.1 0.2 0.1 0.1 0.2'

Profile = {'A': list(map(lambda x: float(x), A.split(' '))),
           'C': list(map(lambda x: float(x), C.split(' '))),
           'G': list(map(lambda x: float(x), G.split(' '))),
           'T': list(map(lambda x: float(x), T.split(' ')))}

Profile

{'A': [0.2, 0.2, 0.3, 0.2, 0.3],
 'C': [0.4, 0.3, 0.1, 0.5, 0.1],
 'G': [0.3, 0.3, 0.5, 0.2, 0.4],
 'T': [0.1, 0.2, 0.1, 0.1, 0.2]}

In [10]:
# 2c Find Profile-most probable kmer
def ProfileMostProbable(Text, k, Profile):
    probability = 0
    kmers = [Text[i:i + k] for i in range(len(Text) - k + 1)]
    most_probable_kmer = kmers[0]
    for kmer in kmers:
        prob_kmer = 1
        for index in range(len(kmer)):
            prob_kmer *= Profile[kmer[index]][index]
        if probability < prob_kmer:
            probability = prob_kmer
            most_probable_kmer = kmer
    return most_probable_kmer

In [11]:
ProfileMostProbable('ACCTGTTTATTGCCTAAGTTCCGAACAAACCCAATATAGCCCGAGGGCCT', 5, Profile)

'CCGAG'

In [12]:
test_data = 'rosalind_ba2c.txt'
text_k_profile = []
with open(test_data, 'r') as reader:
    for line in reader:
        text_k_profile.append(line.strip('\n'))
text = text_k_profile[0]
k = int(text_k_profile[1])
A = text_k_profile[2]
C = text_k_profile[3]
G = text_k_profile[4]
T = text_k_profile[5]

Profile = {'A': list(map(lambda x: float(x), A.split(' '))),
           'C': list(map(lambda x: float(x), C.split(' '))),
           'G': list(map(lambda x: float(x), G.split(' '))),
           'T': list(map(lambda x: float(x), T.split(' ')))}

ProfileMostProbable(text, k, Profile)

'GGGGTCA'

Our proposed greedy motif search algorithm, GreedyMotifSearch, starts by forming a motif matrix from arbitrarily selected k-mers in each string from Dna (which in our specific implementation is the first k-mer in each string). It then attempts to improve this initial motif matrix by trying each of the k-mers in Dna_1 as the first motif. For a given choice of k-mer Motif_1 in Dna_1, it builds a profile matrix Profile for this lone k-mer, and sets Motif_2 equal to the Profile-most probable k-mer in Dna_2. It then iterates by updating Profile as the profile matrix formed from Motif_1 and Motif_2, and sets Motif_3 equal to the Profile-most probable k-mer in Dna_3. In general, after finding i − 1 k-mers Motifs in the first i − 1 strings of Dna, GreedyMotifSearch constructs Profile(Motifs) and selects the Profile-most probable k-mer from Dna_i based on this profile matrix. After obtaining a k-mer from each string to obtain a collection Motifs, GreedyMotifSearch tests to see whether Motifs outscores the current best scoring collection of motifs and then moves Motif_1 one symbol over in Dna_1, beginning the entire process of generating Motifs again.

In [13]:
# 2d GreedyMotifSearch
k = 3
t = 5
strings = ['GGCGTTCAGGCA', 'AAGAATCAGTCA', 'CAAGGAGTTCGC','CACGTCAATCAC','CAATAATATTCG']
best_motifs = [string[:k] for string in strings]

def FindConsensus(motifs):
    consensus = ""
    for i in range(len(motifs[0])):
        countA, countC, countG, countT = 0, 0, 0, 0
        for motif in motifs:
            if motif[i] == "A":
                countA += 1
            elif motif[i] == "C":
                countC += 1
            elif motif[i] == "G":
                countG += 1
            elif motif[i] == "T":
                countT += 1
        if countA >= max(countC, countG, countT):
            consensus += "A"
        elif countC >= max(countA, countG, countT):
            consensus += "C"
        elif countG >= max(countC, countA, countT):
            consensus += "G"
        elif countT >= max(countC, countG, countA):
            consensus += "T"
    return consensus

def ScoreMotifs(motifs):
    consensus = FindConsensus(motifs)
    score = 0
    for motif in motifs:
        score += HammingDistance(consensus, motif)
    return score


def CreateProfile(motifs, k):
    count = {'A':[0]*k, 'C': [0]*k, 'G': [0]*k, 'T': [0]*k}
    for kmer in motifs:
        for i in range(len(kmer)):
            count[kmer[i]][i] += 1
    profile = {base: [j/len(motifs) for j in count[base]] for base in count}
    return count, profile

def GreedyMotifSearch(k, t, strings):
    best_motifs = [string[:k] for string in strings]
    
    for i in range(len(strings[0]) - k + 1):
        motifs = []
        kmer_0 = strings[0][i:i+k]
        motif_0 = kmer_0
        motifs.append(motif_0)
        for j in range(1, t):
            count_j, profile_j = CreateProfile(motifs, k)
            motif_j = ProfileMostProbable(strings[j], k, profile_j)
            motifs.append(motif_j)
        if ScoreMotifs(motifs) < ScoreMotifs(best_motifs):
            best_motifs = motifs
    return best_motifs

In [14]:
GreedyMotifSearch(k, t, strings)

['CAG', 'CAG', 'CAA', 'CAA', 'CAA']

In [None]:
test_file = 'rosalind_ba2d.txt'
k_t_strings = []
with open(test_file, 'r') as reader:
    for line in reader:
        k_t_strings.append(line.strip('\n'))
k = int(k_t_strings[0].split(' ')[0])
t = int(k_t_strings[0].split(' ')[1])
strings = k_t_strings[1:]
for motif in GreedyMotifSearch(k, t, strings):
    print(motif)

In [None]:
# 2e Greedy Motif Search with Pseudo Count
def GreedyMotifSearchPseudocount(k, t, strings):
    best_motifs = [string[:k] for string in strings]
    
    for i in range(len(strings[0]) - k + 1):
        motifs = []
        kmer_0 = strings[0][i:i+k]
        motif_0 = kmer_0
        motifs.append(motif_0)
        for j in range(1, t):
            count_j, profile_j = CreateProfile(motifs, k)
            # apply Laplace's rule of succession: add a pseudocount of 1 to every entry,
            # so each column total grows by 4
            count_j = {base: [c+1 for c in count_j[base]] for base in count_j}
            profile_j = {base: [c/(len(motifs)+4) for c in count_j[base]] for base in count_j}
            
            motif_j = ProfileMostProbable(strings[j], k, profile_j)
            motifs.append(motif_j)
        if ScoreMotifs(motifs) < ScoreMotifs(best_motifs):
            best_motifs = motifs
    return best_motifs

In [None]:
GreedyMotifSearchPseudocount(k, t, strings)

In [None]:
test_file = 'rosalind_ba2e.txt'
k_t_strings = []
with open(test_file, 'r') as reader:
    for line in reader:
        k_t_strings.append(line.strip('\n'))
k = int(k_t_strings[0].split(' ')[0])
t = int(k_t_strings[0].split(' ')[1])
strings = k_t_strings[1:]
for motif in GreedyMotifSearchPseudocount(k, t, strings):
    print(motif)

In [92]:
# 2f Greedy Motif Search with Randomized algorithm

import random
def GreedyMotifSearchRandomized(k, t, strings):
    # randomly generate k-mers from each sequence in the DNAs
    rand_ints = [random.randint(0, len(strings[0]) - k) for a in range(t)]
    best_motifs = [strings[i][r:r+k] for i, r in enumerate(rand_ints)]
    
    while True:
        count_j, profile_j = CreateProfile(best_motifs, k)
        
        # pseudocounts (Laplace's rule of succession): add 1 to every entry
        count_j = {base: [c+1 for c in count_j[base]] for base in count_j}
        profile_j = {base: [c/(len(best_motifs)+4) for c in count_j[base]] for base in count_j}
        
        motifs = [ProfileMostProbable(string, k, profile_j) for string in strings]
        if ScoreMotifs(motifs) < ScoreMotifs(best_motifs):
            best_motifs = motifs
        else:
            return best_motifs

In [94]:
best_scores = 8 * 5
best_motifs = []
for repeat in range(1000):
    current_motifs = GreedyMotifSearchRandomized(8, 5, ['CGCCCCTCTCGGGGGTGTTCAGTAAACGGCCA',
                            'GGGCGAGGTATGTGTAAGTGCCAAGGTGCCAG',
                            'TAGTACCGAGACCGAAAGAAGTATACAGGCGT',
                            'TAGATCAAGTTTCAGGTGCACGTCGGTGAACC',
                            'AATCCACCAGCTCCACGTGCAATGTTGGCCTA'])
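    current_score = ScoreMotifs(current_motifs)
    if current_score < best_scores:
        # remember the best score so far, so later runs must beat it
        best_scores = current_score
        best_motifs = current_motifs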
best_motifs

['GGGTGTTC', 'AGGTGCCA', 'AAGTATAC', 'AGGTGCAC', 'AGCTCCAC']

In [95]:
test_file = 'rosalind_ba2f.txt'
k_t_strings = []
with open(test_file, 'r') as reader:
    for line in reader:
        k_t_strings.append(line.strip('\n'))
k = int(k_t_strings[0].split(' ')[0])
t = int(k_t_strings[0].split(' ')[1])
strings = k_t_strings[1:]

best_scores = k * t
best_motifs = []
for repeat in range(1000):
    current_motifs = GreedyMotifSearchRandomized(k, t, strings)
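    current_score = ScoreMotifs(current_motifs)
    if current_score < best_scores:
        # remember the best score so far, so later runs must beat it
        best_scores = current_score
        best_motifs = current_motifs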
best_motifs

['TGGGAGTCAAGTGTG',
 'TGGGAGTCAAGTGTG',
 'CGCGAAAACTGGGTA',
 'TGCGCGGCGTGTTTG',
 'CGCGAACTCAGTTAC',
 'CGGAGGGAGTGTGCA',
 'TGGGGGTTTTAGGTC',
 'AGCAAAACCACCGTC',
 'TGGAAGAGGTGCGTC',
 'TGCTAGCTGTTCCTC',
 'AGTCAAGTCAGCCTT',
 'TGGCTGGTAAGTATT',
 'ACCCAGGTGACTGTT',
 'CGGGAGATTCCGGTT',
 'AGGATAGTGCGACTC',
 'ACGCAGTGCACGGTT',
 'TGGCAGCCATCTTTC',
 'TGTGAGAGGATAGTT',
 'TGGAAAATATGCTTT',
 'ACCGAAGGGTCTCTC']

Note that RandomizedMotifSearch may change all t strings in Motifs in a single iteration. This strategy may prove reckless, since some correct motifs (captured in Motifs) may potentially be discarded at the next iteration. GibbsSampler is a more cautious iterative algorithm that discards a single k-mer from the current set of motifs at each iteration and decides to either keep it or replace it with a new one.

We now define a Profile-randomly generated k-mer in a string Text. For each k-mer Pattern in Text, compute the probability Pr(Pattern | Profile), resulting in n = |Text| - k + 1 probabilities (p1, …, pn). These probabilities do not necessarily sum to 1, but we can still form the random number generator Random(p1, …, pn) based on them. GibbsSampler uses this random number generator to select a Profile-randomly generated k-mer at each step: if the die rolls the number i, then we define the Profile-randomly generated k-mer as the i-th k-mer in Text.
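
The random number generator Random(p1, …, pn) can be sketched as a weighted die that rescales by the sum of the weights, which is why the probabilities need not sum to 1. The function name and the optional `u` argument (a number in [0, 1), exposed only so the roll can be reproduced) are assumptions made for illustration.

```python
import random
from itertools import accumulate

def weighted_die(weights, u=None):
    """Return index i with probability weights[i] / sum(weights).

    u is a number in [0, 1); it is drawn at random when omitted.
    """
    if u is None:
        u = random.random()
    threshold = u * sum(weights)  # rescale instead of normalizing the weights
    for index, cumulative in enumerate(accumulate(weights)):
        if threshold < cumulative:
            return index
    return len(weights) - 1  # guard against floating-point round-off

# probabilities of three k-mers under some profile; they sum to 0.4, not 1
probabilities = [0.1, 0.2, 0.1]
rolled = weighted_die(probabilities)
print(rolled)  # 0, 1 or 2; index 1 comes up about half of the time
```

If the die rolls i, the Profile-randomly generated k-mer is the k-mer of Text at that index. The standard library's `random.choices(kmers, weights=probabilities)` performs the same kind of weighted draw.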

In [90]:
import random

def GibbsSampler(k, t, N, strings):
    # randomly choose one k-mer from each string in Dna
    motifs = []
    for string in strings:
        start = random.randint(0, len(string) - k)
        motifs.append(string[start:start+k])
    best_motifs = list(motifs)

    for _ in range(N):
        # pick the single motif to be reconsidered in this iteration
        i = random.randint(0, t-1)
        # build the profile from every motif except the i-th one
        others = motifs[:i] + motifs[i+1:]
        count_j, _profile = CreateProfile(others, k)
        # pseudocounts (Laplace's rule of succession): add 1 to every entry
        profile_j = {base: [(c+1)/(len(others)+4) for c in count_j[base]] for base in count_j}

        # Profile-randomly generated k-mer: roll a die weighted by Pr(k-mer | Profile)
        kmers = [strings[i][s:s+k] for s in range(len(strings[i]) - k + 1)]
        weights = []
        for kmer in kmers:
            prob_kmer = 1
            for index, base in enumerate(kmer):
                prob_kmer *= profile_j[base][index]
            weights.append(prob_kmer)
        motifs[i] = random.choices(kmers, weights=weights)[0]

        if ScoreMotifs(motifs) < ScoreMotifs(best_motifs):
            best_motifs = list(motifs)
    return best_motifs
    

In [91]:
GibbsSampler(8, 5, 100, ['CGCCCCTCTCGGGGGTGTTCAGTAAACGGCCA',
                            'GGGCGAGGTATGTGTAAGTGCCAAGGTGCCAG',
                            'TAGTACCGAGACCGAAAGAAGTATACAGGCGT',
                            'TAGATCAAGTTTCAGGTGCACGTCGGTGAACC',
                            'AATCCACCAGCTCCACGTGCAATGTTGGCCTA'])

['TAAACGGC', 'TGTAAGTG', 'AAGTATAC', 'TGCACGTC', 'CACCAGCT']

In [98]:
test_file = 'rosalind_ba2g.txt'
k_t_N_strings = []
with open(test_file, 'r') as reader:
    for line in reader:
        k_t_N_strings.append(line.strip('\n'))
k = int(k_t_N_strings[0].split(' ')[0])
t = int(k_t_N_strings[0].split(' ')[1])
N = int(k_t_N_strings[0].split(' ')[2])
strings = k_t_N_strings[1:]

best_scores = k * t
best_motifs = []
for repeat in range(20):
    current_motifs = GibbsSampler(k, t, N, strings)
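    current_score = ScoreMotifs(current_motifs)
    if current_score < best_scores:
        # remember the best score so far, so later runs must beat it
        best_scores = current_score
        best_motifs = current_motifs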
best_motifs

['GCTGGCTACTTTTGG',
 'CTTAGTCATATGGTT',
 'CAATACTCGTACTGT',
 'GCTGGAACTTCATGG',
 'TGTGGGTTTTAATAG',
 'TTCGGTGTATGCTGT',
 'ATTTCATCCGACAGC',
 'GGATCATTCACCTAC',
 'ATGTGGATTCAAGCT',
 'GAGAGTTTTTCAGTG',
 'GAAGCTTTGCATAGG',
 'AAGCCAAGGCAAGAG',
 'GCCGGGAACTGAATT',
 'CGGTTCTTCACTAGA',
 'ACGTTAACCGCAGCA',
 'CCCTCGCGAGCGTAA',
 'GCTTCTACTTCAGGG',
 'AAACATGTCAGTCCT',
 'ATTGTCGTCATTTGA',
 'GTTGGACCCTAGCGA']

In [99]:
test = ['GCTGGCTACTTTTGG',
 'CTTAGTCATATGGTT',
 'CAATACTCGTACTGT',
 'GCTGGAACTTCATGG',
 'TGTGGGTTTTAATAG',
 'TTCGGTGTATGCTGT',
 'ATTTCATCCGACAGC',
 'GGATCATTCACCTAC',
 'ATGTGGATTCAAGCT',
 'GAGAGTTTTTCAGTG',
 'GAAGCTTTGCATAGG',
 'AAGCCAAGGCAAGAG',
 'GCCGGGAACTGAATT',
 'CGGTTCTTCACTAGA',
 'ACGTTAACCGCAGCA',
 'CCCTCGCGAGCGTAA',
 'GCTTCTACTTCAGGG',
 'AAACATGTCAGTCCT',
 'ATTGTCGTCATTTGA',
 'GTTGGACCCTAGCGA']
for i in test:
    print(i)

GCTGGCTACTTTTGG
CTTAGTCATATGGTT
CAATACTCGTACTGT
GCTGGAACTTCATGG
TGTGGGTTTTAATAG
TTCGGTGTATGCTGT
ATTTCATCCGACAGC
GGATCATTCACCTAC
ATGTGGATTCAAGCT
GAGAGTTTTTCAGTG
GAAGCTTTGCATAGG
AAGCCAAGGCAAGAG
GCCGGGAACTGAATT
CGGTTCTTCACTAGA
ACGTTAACCGCAGCA
CCCTCGCGAGCGTAA
GCTTCTACTTCAGGG
AAACATGTCAGTCCT
ATTGTCGTCATTTGA
GTTGGACCCTAGCGA
