### Module 4: Distance Measures

Consider the following three vectors u, v, w in a 6-dimensional space:

u = [1, 0.25, 0, 0, 0.5, 0]

v = [0.75, 0, 0, 0.2, 0.4, 0]

w = [0, 0.1, 0.75, 0, 0, 1]

Let cos(x, y) denote the cosine similarity of vectors x and y. Compute all three pairwise similarities among u, v, and w.

In [4]:
import numpy as np

u = np.array([1, 0.25, 0, 0, 0.5, 0])
v = np.array([0.75, 0, 0, 0.2, 0.4, 0])
w = np.array([0, 0.1, 0.75, 0, 0, 1])

vectors = [u, v, w]
names = ['u', 'v', 'w']

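# Visit every unordered pair (i < j) exactly once.
for i in range(len(vectors)):
    for j in range(i + 1, len(vectors)):
        v1 = vectors[i]
        v2 = vectors[j]
        # Cosine similarity: dot product divided by the product of the Euclidean norms.
        cos_sim = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
        print('Vectors {} & {}: {}'.format(names[i], names[j], cos_sim))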

Vectors u & v: 0.9496291235462798
Vectors u & w: 0.01740183416297251
Vectors v & w: 0.0
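The same three similarities can also be read off a single matrix product. The following is a minimal sketch, not part of the original solution: it assumes the vectors are stacked as rows of a matrix `X` (a name introduced here for illustration), scales each row to unit length, and multiplies the result by its own transpose. The conversion to angles is an optional extra that makes values such as 0.0 easier to interpret as "orthogonal".

```python
import numpy as np

# Same vectors as in the cell above, stacked as rows of one matrix.
X = np.array([
    [1, 0.25, 0, 0, 0.5, 0],    # u
    [0.75, 0, 0, 0.2, 0.4, 0],  # v
    [0, 0.1, 0.75, 0, 0, 1],    # w
])

# Scale each row to unit length; the dot product of two unit vectors is their cosine.
unit_rows = X / np.linalg.norm(X, axis=1, keepdims=True)
cos_matrix = unit_rows @ unit_rows.T

# Angles in degrees; clip guards against values that drift just past 1.0 through rounding.
angles = np.degrees(np.arccos(np.clip(cos_matrix, -1.0, 1.0)))

print(np.round(cos_matrix, 4))
print(np.round(angles, 1))
```

Entry `[i, j]` of `cos_matrix` is the similarity of rows `i` and `j`, so the matrix is symmetric and its diagonal is 1 up to rounding.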


Here are five vectors in a 10-dimensional space:

1111000000  0100100101  0000011110  0111111111  1011111111

Compute the Jaccard distance (not the Jaccard "measure", i.e. the similarity) between each pair of the vectors, treating each vector as the set of positions that hold a 1.

In [26]:
vectors = ['1111000000', '0100100101', '0000011110', '0111111111', '1011111111']

for i in range(0, len(vectors)):
    for j in range(i+1, len(vectors)):
        v1 = vectors[i]
        v2 = vectors[j]
        # Positions where both vectors hold a 1 (size of the intersection).
        both_1 = sum(a == b == '1' for a, b in zip(v1, v2))
        # Positions where at least one vector holds a 1 (size of the union).
        either_1 = sum(a == '1' or b == '1' for a, b in zip(v1, v2))
        jaccard_sim = both_1 / either_1
        jaccard_dist = 1 - jaccard_sim
        print('Vectors {} & {}: {}'.format(v1, v2, jaccard_dist))

Vectors 1111000000 & 0100100101: 0.8571428571428572
Vectors 1111000000 & 0000011110: 1.0
Vectors 1111000000 & 0111111111: 0.7
Vectors 1111000000 & 1011111111: 0.7
Vectors 0100100101 & 0000011110: 0.8571428571428572
Vectors 0100100101 & 0111111111: 0.5555555555555556
Vectors 0100100101 & 1011111111: 0.7
Vectors 0000011110 & 0111111111: 0.5555555555555556
Vectors 0000011110 & 1011111111: 0.5555555555555556
Vectors 0111111111 & 1011111111: 0.19999999999999996
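The last value printed above, 0.19999999999999996, is a floating-point artifact of computing 1 - 0.8; the exact distance is 1/5. As a rough sketch of how to get exact answers, the version below treats each bit string as the set of positions holding a 1 and uses `fractions.Fraction` from the standard library. The helper names `ones` and `jaccard_distance` are introduced here for illustration, and the handling of two all-zero vectors is an assumption, since the exercise never produces that case.

```python
from fractions import Fraction
from itertools import combinations

vectors = ['1111000000', '0100100101', '0000011110', '0111111111', '1011111111']

def ones(bits):
    """Return the set of positions at which the bit string holds a 1."""
    return {k for k, bit in enumerate(bits) if bit == '1'}

def jaccard_distance(a, b):
    """Exact Jaccard distance 1 - |A & B| / |A | B| between two bit strings."""
    set_a, set_b = ones(a), ones(b)
    union = set_a | set_b
    if not union:
        # Assumption: two all-zero vectors are treated as identical.
        return Fraction(0)
    return 1 - Fraction(len(set_a & set_b), len(union))

# Exact distance for every unordered pair, keyed by the pair of strings.
exact = {(a, b): jaccard_distance(a, b) for a, b in combinations(vectors, 2)}
for (a, b), dist in exact.items():
    print('Vectors {} & {}: {}'.format(a, b, dist))
```

Calling `float(dist)` on any of these fractions recovers a decimal value when one is needed.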


Here are five vectors in a 10-dimensional space:
    
1111000000  0100100101  0000011110  0111111111  1011111111

Compute the Manhattan distance (L1 norm) between each pair of these vectors.

In [27]:
vectors = ['1111000000', '0100100101', '0000011110', '0111111111', '1011111111']

for i in range(0, len(vectors)):
    for j in range(i+1, len(vectors)):
        v1 = vectors[i]
        v2 = vectors[j]
        # For 0/1 vectors, the L1 distance is the number of positions that differ.
        manhattan_dist = sum(a != b for a, b in zip(v1, v2))
        print('Vectors {} & {}: {}'.format(v1, v2, manhattan_dist))

Vectors 1111000000 & 0100100101: 6
Vectors 1111000000 & 0000011110: 8
Vectors 1111000000 & 0111111111: 7
Vectors 1111000000 & 1011111111: 7
Vectors 0100100101 & 0000011110: 6
Vectors 0100100101 & 0111111111: 5
Vectors 0100100101 & 1011111111: 7
Vectors 0000011110 & 0111111111: 5
Vectors 0000011110 & 1011111111: 5
Vectors 0111111111 & 1011111111: 2
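Counting mismatched positions works because the components are 0 or 1. A more general sketch, assuming the bit strings are first converted to an integer matrix `X` (a name introduced here), sums absolute component differences and so would also apply to non-binary vectors. NumPy broadcasting produces all pairwise distances in one expression.

```python
import numpy as np

vectors = ['1111000000', '0100100101', '0000011110', '0111111111', '1011111111']

# Turn each bit string into a row of integers (0 or 1).
X = np.array([[int(bit) for bit in vec] for vec in vectors])

# Broadcasting forms every pairwise difference at once, shape (5, 5, 10);
# summing absolute values over the last axis gives the L1 distance matrix.
l1 = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)

print(l1)
```

Entry `[i, j]` of `l1` is the distance between vectors `i` and `j`; the matrix is symmetric with zeros on the diagonal, so the upper triangle carries the ten values listed above.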


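The edit distance is the minimum number of character insertions and character deletions required to turn one string into another (substitutions are not allowed). Under this definition, the edit distance between strings x and y equals |x| + |y| - 2|LCS(x, y)|, where LCS(x, y) is a longest common subsequence of the two strings. Compute the edit distance between each pair of the strings he, she, his, and hers.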

In [42]:
strings = ['he', 'she', 'his', 'hers']

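def lcs(s1, s2):
    """Return a longest common subsequence of s1 and s2 (plain recursion)."""
    if len(s1) == 0 or len(s2) == 0:
        return ''
    elif s1[-1] == s2[-1]:
        # Matching last characters belong to an LCS of the two strings.
        return lcs(s1[0:-1], s2[0:-1]) + s1[-1]
    else:
        # Otherwise drop the last character of one string and keep the longer result.
        seq1 = lcs(s1, s2[0:-1])
        seq2 = lcs(s1[0:-1], s2)
        return seq1 if len(seq1) > len(seq2) else seq2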
    
for i in range(0, len(strings)):
    for j in range(i+1, len(strings)):
        s1 = strings[i]
        s2 = strings[j]
        longest_common_subseq = lcs(s1, s2)
        edit_dist = len(s1) + len(s2) - 2 * len(longest_common_subseq)
        print('Strings {} & {}: {}'.format(s1, s2, edit_dist))

Strings he & she: 1
Strings he & his: 3
Strings he & hers: 2
Strings she & his: 4
Strings she & hers: 3
Strings his & hers: 3
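The recursive `lcs` above recomputes the same subproblems many times, which is harmless for four short words but grows exponentially with string length. As a hedged alternative, the sketch below fills a bottom-up table of LCS lengths in time proportional to the product of the two lengths. The names `lcs_length` and `indel_distance` are introduced here for illustration; only the length is computed, since the distance formula does not need the subsequence itself.

```python
from itertools import combinations

def lcs_length(s1, s2):
    """Length of a longest common subsequence, via a bottom-up table."""
    # table[i][j] holds the LCS length of the prefixes s1[:i] and s2[:j].
    table = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i, c1 in enumerate(s1, start=1):
        for j, c2 in enumerate(s2, start=1):
            if c1 == c2:
                # Extend the LCS of the two shorter prefixes by this character.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Skip the last character of one prefix or the other.
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

def indel_distance(s1, s2):
    """Edit distance when only insertions and deletions are allowed."""
    return len(s1) + len(s2) - 2 * lcs_length(s1, s2)

strings = ['he', 'she', 'his', 'hers']
for s1, s2 in combinations(strings, 2):
    print('Strings {} & {}: {}'.format(s1, s2, indel_distance(s1, s2)))
```

`itertools.combinations` yields the pairs in the same order as the nested loops used in the cell above.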
