# Code Challenge: Implement DistanceBetweenPatternAndStrings

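The first potential issue with implementing MedianString is writing a function to compute $d(\text{Pattern}, \text{Dna}) = \sum_{i=1}^{t} d(\text{Pattern}, \text{Dna}_i)$, the sum of distances between Pattern and each string in $\text{Dna} = \{\text{Dna}_1, \ldots, \text{Dna}_t\}$. Here $d(\text{Pattern}, \text{Dna}_i)$ is the minimum Hamming distance between Pattern and any k-mer in $\text{Dna}_i$. This task is achieved by the following pseudocode.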

```
DistanceBetweenPatternAndStrings(Pattern, Dna)
    k ← |Pattern|
    distance ← 0
    for each string Text in Dna
        HammingDistance ← ∞
        for each k-mer Pattern’ in Text
            if HammingDistance > HammingDistance(Pattern, Pattern’)
                HammingDistance ← HammingDistance(Pattern, Pattern’)
        distance ← distance + HammingDistance
    return distance
```

Input: A string Pattern followed by a collection of space-separated strings Dna.
Output: d(Pattern, Dna).

Sample Input:
AAA
TTACCTTAAC GATATCTGTC ACGGCGTTCG CCCTAAAGAG CGTCAGAGGT
Sample Output:
5
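
To see where the 5 in the sample output comes from, the sketch below breaks the sample down string by string. The helper names `hamming` and `min_distance` are hypothetical and exist only for this illustration; they are not part of the challenge statement.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def min_distance(pattern, text):
    """Smallest Hamming distance between pattern and any k-mer of text."""
    k = len(pattern)
    return min(hamming(pattern, text[i:i + k]) for i in range(len(text) - k + 1))

pattern = "AAA"
dna = "TTACCTTAAC GATATCTGTC ACGGCGTTCG CCCTAAAGAG CGTCAGAGGT".split()

# One minimum distance per string; their sum is d(Pattern, Dna).
per_string = [min_distance(pattern, text) for text in dna]
print(per_string)       # [1, 1, 2, 0, 1]
print(sum(per_string))  # 5
```

The fourth string contains AAA exactly, so it contributes 0, while the third string has no k-mer closer than ACG and contributes 2.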

In [30]:
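def split_substrings(Dna, k):
    # collect every k-mer from each space-separated string in Dna
    dna_array = Dna.split(" ")
    final = []
    for entry in dna_array:
        # the +1 ensures the last k-mer of each string is included
        for i in range(len(entry) - k + 1):
            final.append(entry[i:i+k])
    return final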

def HammingDistance(string_1, string_2):
    # the distance is only defined for strings of equal length
    if len(string_1) != len(string_2):
        raise ValueError("The strings must have the same length.")

    # count the positions at which the two strings differ
    hamming_distance = 0
    for char_1, char_2 in zip(string_1, string_2):
        if char_1 != char_2:
            hamming_distance = hamming_distance + 1
    return hamming_distance

def DistanceBetweenPatternAndStrings(Pattern, Dna):
    k = len(Pattern)
    distance = 0
    for Text in Dna.split(" "):
        # start at infinity so any real Hamming distance replaces it
        ham_dist = float("inf")
        for kmer in split_substrings(Text, k):
            # compute the distance once per k-mer and keep the smallest
            current = HammingDistance(Pattern, kmer)
            if ham_dist > current:
                ham_dist = current
        distance = distance + ham_dist
    return distance

In [31]:
print(DistanceBetweenPatternAndStrings("AAA", "TTACCTTAAC GATATCTGTC ACGGCGTTCG CCCTAAAGAG CGTCAGAGGT"))

5


In [32]:
# raw string so the backslash is not read as the start of an escape sequence
test_file = r"MedianString Test Files\dataset_30312_1 (1).txt"

with open(test_file, "r") as file:
    pattern = file.readline().strip()
    dna = file.readline().strip()
    print(DistanceBetweenPatternAndStrings(pattern, dna))

65


# Code Challenge: Implement MedianString

Input: An integer k, followed by a space-separated collection of strings Dna.
Output: A k-mer Pattern that minimizes d(Pattern, Dna) among all possible choices of k-mers. (If there are multiple such strings Pattern, then you may return any one.)

Sample Input:
3
AAATTGACGCAT GACGACCACGTT CGTCAGCGCCTG GCTGAGCACCGG AGTTCGGGACAG
Sample Output:
GAC
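
Because the problem allows any minimizing k-mer to be returned, it can help to list all of them for the sample. The following is a minimal sketch under that assumption; `hamming` and `total_distance` are hypothetical helper names introduced here, and the brute-force scan over all 4^k k-mers is only practical for small k.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def total_distance(pattern, strings):
    """d(Pattern, Dna): for each string take its closest k-mer, then sum."""
    k = len(pattern)
    return sum(
        min(hamming(pattern, text[i:i + k]) for i in range(len(text) - k + 1))
        for text in strings
    )

k = 3
dna = "AAATTGACGCAT GACGACCACGTT CGTCAGCGCCTG GCTGAGCACCGG AGTTCGGGACAG".split()

# Score every possible k-mer, then keep all that reach the minimum.
scores = {"".join(p): total_distance("".join(p), dna) for p in product("ACGT", repeat=k)}
best = min(scores.values())
medians = sorted(p for p, d in scores.items() if d == best)
print(best, medians)  # 2 ['GAC']
```

For this sample the minimizer happens to be unique: GAC occurs exactly in three of the five strings and is one mismatch away from a k-mer in each of the other two.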

```
MedianString(Dna, k)
    distance ← ∞
    for each k-mer Pattern from AA…AA to TT…TT
        if distance > d(Pattern, Dna)
             distance ← d(Pattern, Dna)
             Median ← Pattern
    return Median
```
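
The loop "for each k-mer Pattern from AA…AA to TT…TT" in the MedianString pseudocode ranges over all 4^k possible k-mers, not only those that occur in Dna. One way to generate them in Python, offered as a sketch, is `itertools.product`; the generator name `all_kmers` is a hypothetical choice for this example.

```python
from itertools import product

def all_kmers(k, alphabet="ACGT"):
    """Yield every k-mer over the alphabet in lexicographic order."""
    for letters in product(alphabet, repeat=k):
        yield "".join(letters)

kmers = list(all_kmers(2))
print(len(kmers))           # 16, i.e. 4**2
print(kmers[0], kmers[-1])  # AA TT
print(kmers[:5])            # ['AA', 'AC', 'AG', 'AT', 'CA']
```

Since the number of candidates grows as 4^k, this exhaustive enumeration is reasonable for short motifs but becomes slow as k increases.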

In [14]:
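from itertools import product

def MedianString(k, Dna):
    distance = float("inf")
    median = ""
    # try every possible k-mer from AA...AA to TT...TT, not only those occurring in Dna
    for letters in product("ACGT", repeat=k):
        pattern = "".join(letters)
        current = DistanceBetweenPatternAndStrings(pattern, Dna)
        if distance > current:
            distance = current
            median = pattern
    return median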

In [33]:
print(MedianString(3, "AAATTGACGCAT GACGACCACGTT CGTCAGCGCCTG GCTGAGCACCGG AGTTCGGGACAG"))

GAC


In [34]:
# raw string so the backslash is not read as the start of an escape sequence
test_file = r"MedianString Test Files\dataset_30304_9.txt"

with open(test_file, "r") as file:
    k = int(file.readline().strip())
    dna = file.readline().strip()
    print(MedianString(k, dna))

TTAGTT
