# Code Challenge: Solve the Clump Finding Problem

The problem is restated below. Make sure that your algorithm is efficient enough to handle a large dataset.

Clump Finding Problem: Find patterns forming clumps in a string.

Input: A string Genome, and integers k, L, and t.
Output: All distinct k-mers forming (L, t)-clumps in Genome.

```
FindClumps(Text, k, L, t)
    Patterns ← an array of strings of length 0
    n ← |Text|
    for every integer i between 0 and n − L
        Window ← Text(i, L)
        freqMap ← FrequencyTable(Window, k)
        for every key s in freqMap
            if freqMap[s] ≥ t
                append s to Patterns
    remove duplicates from Patterns
    return Patterns
```

## Sample Input:
CGGACTCGACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAGTGAAGAGAAGAGGAAACATTGTAA
5 50 4
## Sample Output:
CGACA GAAGA
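
As a quick sanity check on the sample, the sketch below lists where each reported k-mer starts in the sample genome. It assumes the usual definition, which the statement above does not spell out: a k-mer forms an (L, t)-clump if it occurs at least t times inside some window of length L. The helper `occurrences` and the name `sample_genome` are illustrative and not part of the challenge.

```python
def occurrences(pattern, text):
    """Return every start index at which pattern occurs in text (overlaps allowed)."""
    k = len(pattern)
    return [i for i in range(len(text) - k + 1) if text[i:i + k] == pattern]


sample_genome = "CGGACTCGACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAGTGAAGAGAAGAGGAAACATTGTAA"

for kmer in ("CGACA", "GAAGA"):
    starts = occurrences(kmer, sample_genome)
    # number of characters from the first occurrence's start to the last one's end
    span = starts[-1] + len(kmer) - starts[0]
    print(kmer, starts, span)
```

Each k-mer occurs four times (CGACA at 6, 22, 38, 43 and GAAGA at 16, 31, 52, 57), and the spans of 42 and 46 characters both fit inside a window of L = 50, which is consistent with t = 4.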

In [47]:
def FrequencyTable(text, k):
    # make a dictionary to store all k-mers
    kmers = {}
    # iterate through text
    for i in range(len(text)-(k-1)):
        slide = text[i:i+k]
        # for each k length slice, update the dictionary value
        if slide in kmers:
            kmers[slide] = kmers[slide] + 1
        else:
            kmers[slide] = 1
    return kmers


def FindClumps(k, L, t, genome):
    patterns = []
    n = len(genome)
    # window starts run from 0 to n - L inclusive, as in the pseudocode
    for i in range(n - L + 1):
        window = genome[i:i+L]
        freqMap = FrequencyTable(window, k)
        for key in freqMap:
            if freqMap[key] >= t:
                patterns.append(key)
    # remove duplicates from patterns
    patterns = list(dict.fromkeys(patterns))
    # format as string for answer checking
    return " ".join(patterns)

In [48]:
# import the test data
test_data_file = "FindClumps Test Data/dataset_30274_5.txt"
with open(test_data_file, "r") as file:
    genome = file.readline().strip()
    # the second line holds k, L and t separated by whitespace
    k, L, t = (int(value) for value in file.readline().split())
    print(FindClumps(k, L, t, genome))

TGCCTTTTGT TCGCCGGGTT TTCGCCGGGT ATTCCCGGTG GACGTTAAAA ATAGTGTGGC TTCATAGAAT CTGTTTGAAG GCCTAACTAC


In [49]:
# use the function to find clumps in the E. coli genome

# define k, L, t from text
k = 9
L = 500
t = 3

# import the E. coli genome
test_data_file = "Genomes/E_coli.txt"
with open(test_data_file, "r") as file:
    genome = file.readline().strip()
    print(FindClumps(k, L, t, genome))
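
Rebuilding the frequency table from scratch for every window costs roughly O(n · L) work, which gets slow on a genome the size of E. coli (about 4.6 million nucleotides). One possible speed-up, sketched below, is to count the first window once and then slide it: each step removes the k-mer that leaves on the left and adds the one that enters on the right, so only the entering k-mer can newly reach t. The name `FindClumpsFast` is hypothetical, and returning a sorted list (rather than first-seen order) is a choice made here for determinism.

```python
def FindClumpsFast(k, L, t, genome):
    """Sliding-window sketch: return the sorted distinct k-mers forming (L, t)-clumps."""
    n = len(genome)
    if k > L or L > n:
        return []

    counts = {}      # k-mer -> number of occurrences in the current window
    clumps = set()   # k-mers that reached t occurrences in some window

    # count every k-mer in the first window, genome[0:L]
    for i in range(L - k + 1):
        kmer = genome[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
        if counts[kmer] >= t:
            clumps.add(kmer)

    # slide the window one position at a time; i is the new window start
    for i in range(1, n - L + 1):
        # the k-mer starting at i - 1 is no longer fully inside the window
        outgoing = genome[i - 1:i - 1 + k]
        counts[outgoing] -= 1
        if counts[outgoing] == 0:
            del counts[outgoing]

        # the k-mer ending at the window's new last character enters
        incoming = genome[i + L - k:i + L]
        counts[incoming] = counts.get(incoming, 0) + 1
        if counts[incoming] >= t:
            clumps.add(incoming)

    return sorted(clumps)


sample = "CGGACTCGACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAGTGAAGAGAAGAGGAAACATTGTAA"
print(" ".join(FindClumpsFast(5, 50, 4, sample)))
```

On the sample input this prints `CGACA GAAGA`. Each slide does a constant number of dictionary updates on length-k slices, so the total work is about O(n · k) instead of O(n · L).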