# Worksheet 04

Name:  Mark Maci
UID: U30478693

### Topics

- Distance & Similarity

### Distance & Similarity

#### Part 1

a) In the Minkowski distance, describe what the parameters p and d are.

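p is the order of the distance: it determines which Minkowski distance is used (p = 1 gives the Manhattan distance, p = 2 gives the Euclidean distance). d is the dimension of the vector space, i.e. the number of components in each of the two vectors x and y being compared.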
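For reference, the Minkowski distance is usually written as below. The component notation $x_i$, $y_i$ is an assumption made here; the worksheet itself only names p and d.

$$D_p(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}$$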

b) In your own words describe the difference between the Euclidean distance and the Manhattan distance.

Euclidean distance is the square root of the sum of the squared component-wise differences, i.e. the straight-line distance between the two points. Manhattan distance is the sum of the absolute component-wise differences, i.e. the length of a path that may only move parallel to the axes (along the "grid"). The Euclidean distance is therefore never larger than the Manhattan distance.

Consider A = (0, 0) and B = (1, 1). When:

- p = 1, d(A, B) = 2
- p = 2, d(A, B) = $\sqrt{2} \approx 1.41$
- p = 3, d(A, B) = $2^{1/3} \approx 1.26$
- p = 4, d(A, B) = $2^{1/4} \approx 1.19$
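The values listed above can be reproduced with a few lines of Python. This is a minimal sketch; the helper name `minkowski` is chosen here for illustration and is not part of the worksheet.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

A, B = (0, 0), (1, 1)

# Distance between A and B for each order p used in the list above
distances = {p: minkowski(A, B, p) for p in (1, 2, 3, 4)}
for p, dist in distances.items():
    print(p, round(dist, 2))
```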

c) Describe what you think distance would look like when p is very large.

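As p grows, the largest component-wise difference dominates the sum, so the distance decreases toward $\max_i |x_i - y_i|$ (the Chebyshev distance). For A and B above, every difference is 1, so the distance approaches 1.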
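To see this limiting behaviour numerically, here is a small sketch. The vectors `x` and `y` are hypothetical, picked so that the component-wise differences (3, 4, 0) are not all equal; `minkowski` is an illustrative helper, not a function from the worksheet.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(u - v) ** p for u, v in zip(a, b)) ** (1 / p)

x, y = (1, 5, 2), (4, 1, 2)  # hypothetical vectors; differences are 3, 4, 0

# Largest single component-wise difference
largest_diff = max(abs(u - v) for u, v in zip(x, y))

# The distance shrinks toward largest_diff as p increases
by_p = {p: minkowski(x, y, p) for p in (1, 2, 10, 100)}
for p, dist in by_p.items():
    print(p, dist)
print("largest difference:", largest_diff)
```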

d) Is the Minkowski distance still a distance function when p < 1? Explain why / why not.

No, because it violates the triangle inequality, which requires $d(A, B) \le d(A, C) + d(C, B)$ for every third point C. When p < 1, going directly from A to B can be longer than the detour through C.
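A quick numerical check makes the violation concrete. This sketch assumes p = 0.5 and a third point C = (1, 0), both chosen here for illustration; `minkowski_unchecked` is a hypothetical helper that applies the formula without rejecting p < 1.

```python
def minkowski_unchecked(a, b, p):
    """Minkowski formula applied as-is, with no restriction on p."""
    return sum(abs(u - v) ** p for u, v in zip(a, b)) ** (1 / p)

A, B, C = (0, 0), (1, 1), (1, 0)  # C is an assumed intermediate point
p = 0.5

direct = minkowski_unchecked(A, B, p)  # (1 + 1) ** 2
detour = minkowski_unchecked(A, C, p) + minkowski_unchecked(C, B, p)  # 1 + 1

# The triangle inequality would need direct <= detour
print(direct, detour, direct <= detour)  # 4.0 2.0 False
```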

e) When would you use cosine similarity over the Euclidean distance?

When the angle between the vectors (their direction) matters more for the data than the difference in their magnitudes, for example when comparing documents of very different lengths.
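The sketch below illustrates the point with made-up word-count vectors (all three documents are hypothetical). Two vectors pointing in the same direction get a cosine similarity of 1 even when they are far apart in Euclidean terms.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(u * v for u, v in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

short_doc = (1, 2, 0)    # hypothetical word counts
long_doc = (10, 20, 0)   # same proportions, ten times longer
other_doc = (2, 0, 1)    # similar length to short_doc, different content

cos_long = cosine_similarity(short_doc, long_doc)
cos_other = cosine_similarity(short_doc, other_doc)
euc_long = math.dist(short_doc, long_doc)
euc_other = math.dist(short_doc, other_doc)

# Cosine ranks long_doc as closer to short_doc; Euclidean ranks other_doc as closer
print(cos_long, cos_other)
print(euc_long, euc_other)
```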

f) What does the Jaccard distance account for that the Manhattan distance doesn't?

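The Jaccard distance accounts for how many items the two vectors share: it divides the number of differing items by the size of the union, so the same number of differences counts for less when the vectors have more in common. The Manhattan distance only counts the differences.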
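As an illustration, the two hypothetical pairs of binary vectors below differ in exactly two positions each, so their Manhattan distance is the same, while their Jaccard distance is not. The helper names are chosen here and are not from the worksheet.

```python
def manhattan(a, b):
    """Sum of absolute component-wise differences."""
    return sum(abs(u - v) for u, v in zip(a, b))

def jaccard_distance(a, b):
    """1 - |intersection| / |union| for binary (0/1) vectors."""
    both = sum(1 for u, v in zip(a, b) if u and v)
    either = sum(1 for u, v in zip(a, b) if u or v)
    return 1 - both / either if either else 0.0

# Hypothetical pair with nothing in common
few_a, few_b = [1, 0, 0, 0, 0], [0, 1, 0, 0, 0]
# Hypothetical pair sharing three items
many_a, many_b = [1, 1, 1, 1, 0], [1, 1, 1, 0, 1]

print(manhattan(few_a, few_b), jaccard_distance(few_a, few_b))
print(manhattan(many_a, many_b), jaccard_distance(many_a, many_b))
```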

#### Part 2

Consider the following two sentences:

In [13]:
s1 = "hello my name is Alice"  
s2 = "hello my name is Bob"

Using the union of words from both sentences, we can represent each sentence as a vector. Each element of the vector represents the presence or absence of the word at that index.

In this example, the union of words is ("hello", "my", "name", "is", "Alice", "Bob") so we can represent the above sentences as such:

In [14]:
v1 = [1,    1, 1,   1, 1,    0]
#     hello my name is Alice
v2 = [1,    1, 1,   1, 0,    1]
#     hello my name is       Bob

Programmatically, we can do the following:

In [15]:
corpus = [s1, s2]
all_words = list(set([item for x in corpus for item in x.split()]))
print(all_words)
v1 = [1 if x in s1.split() else 0 for x in all_words]  # split() so whole words are matched, not substrings
print(v1)

['Alice', 'my', 'is', 'Bob', 'hello', 'name']
[1, 1, 1, 0, 1, 1]
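When building presence vectors like this, it helps to remember that Python's `in` on a string tests for substrings, while `in` on a list of tokens tests for whole words. The sentence below is hypothetical and only serves to show the difference.

```python
sentence = "this is it"  # hypothetical sentence, not from the worksheet
word = "hi"

substring_hit = word in sentence      # True: "hi" occurs inside "this"
token_hit = word in sentence.split()  # False: no whole word equals "hi"

print(substring_hit, token_hit)  # True False
```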


Let's add a new sentence to our corpus:

In [16]:
s3 = "hi my name is Claude"
corpus.append(s3)

a) What is the new union of words used to represent s1, s2, and s3?

In [17]:
new_words = list(set([item for x in corpus for item in x.split()]))
print(new_words)

['Alice', 'my', 'is', 'Bob', 'Claude', 'hello', 'hi', 'name']


b) Represent s1, s2, and s3 as vectors as above, using this new set of words.

In [18]:
def vectorize(s, words):
    tokens = set(s.split())  # compare whole words, not substrings of the sentence
    return [1 if x in tokens else 0 for x in words]

s1_vec = vectorize(s1, new_words)
s2_vec = vectorize(s2, new_words)
s3_vec = vectorize(s3, new_words)

print(s1_vec)
print(s2_vec)
print(s3_vec)


[1, 1, 1, 0, 0, 1, 0, 1]
[0, 1, 1, 1, 0, 1, 0, 1]
[0, 1, 1, 0, 1, 0, 1, 1]


c) Write a function that computes the Manhattan distance between two vectors. Which pair of vectors are the most similar under that distance function?

In [19]:
def minkowski_distance(v1, v2, p):
    if p < 1:
        raise ValueError("p must be at least 1")
    if len(v1) != len(v2):
        raise ValueError("vectors must be same length")
    res = 0
    for i in range(len(v1)):
        res += abs(v1[i] - v2[i]) ** p
    return res ** (1/p)

def manhattan_distance(v1, v2):
    return minkowski_distance(v1, v2, 1)

def euclidean_distance(v1, v2):
    return minkowski_distance(v1, v2, 2)

def cosine_similarity(v1, v2):
    if len(v1) != len(v2):
        raise ValueError("vectors must be same length")
    dot = 0
    for i in range(len(v1)):
        dot += v1[i] * v2[i]
    mag1 = sum([x ** 2 for x in v1]) ** 0.5
    mag2 = sum([x ** 2 for x in v2]) ** 0.5
    return dot / (mag1 * mag2)

print(cosine_similarity(s1_vec, s2_vec))

0.7999999999999998
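To answer the "most similar pair" part of the question with the Manhattan distance, one option is to compare all three pairs directly. The sketch below is self-contained and uses its own names (`vocab`, `vectors`, `manhattan`), which are assumptions for illustration. It sorts the vocabulary for a reproducible column order, which does not affect the distances.

```python
from itertools import combinations

sentences = {
    "s1": "hello my name is Alice",
    "s2": "hello my name is Bob",
    "s3": "hi my name is Claude",
}

# Sorted union of words; the column order has no effect on the distances
vocab = sorted({w for s in sentences.values() for w in s.split()})
vectors = {
    name: [1 if w in s.split() else 0 for w in vocab]
    for name, s in sentences.items()
}

def manhattan(a, b):
    """Sum of absolute component-wise differences."""
    return sum(abs(u - v) for u, v in zip(a, b))

# Manhattan distance for every unordered pair of sentences
pair_distances = {
    (a, b): manhattan(vectors[a], vectors[b])
    for a, b in combinations(vectors, 2)
}
closest = min(pair_distances, key=pair_distances.get)

print(pair_distances)
print(closest)
```

Under this sketch, s1 and s2 come out as the closest pair, since they differ only in the two name words.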


d) Create a matrix of all these vectors (row major) and add the following sentences in vector form:

- "hi Alice"
- "hello Claude"
- "Bob my name is Claude"
- "hi Claude my name is Alice"
- "hello Bob"

In [24]:
import numpy as np

corpus = ["hi Alice",
          "hello Claude",
          "Bob my name is Claude",
          "hi Claude my name is Alice",
          "hello Bob",]

vectorized_corpus = np.array([vectorize(x, new_words) for x in corpus])  # one row per sentence

print(vectorized_corpus)




[[1 0 0 0 0 0 1 0]
 [0 0 0 0 1 1 0 0]
 [0 1 1 1 1 0 0 1]
 [1 1 1 0 1 0 1 1]
 [0 0 0 1 0 1 0 0]]


e) How many rows and columns does this matrix have?

In [22]:
rows = len(vectorized_corpus)     # one row per sentence
cols = len(vectorized_corpus[0])  # one column per word in new_words
print(rows, cols)

f) When using the Manhattan distance, which two sentences are the most similar?

In [27]:
best, best_pair = float("inf"), None  # not named min, which would shadow the built-in
for i in range(len(vectorized_corpus)):  # start at 0 so row 0 is compared with every row
    for j in range(i + 1, len(vectorized_corpus)):
        dist = manhattan_distance(vectorized_corpus[i], vectorized_corpus[j])
        if dist < best:
            best, best_pair = dist, (i, j)

print(best)
print(corpus[best_pair[0]], "|", corpus[best_pair[1]])

2.0
hello Claude | hello Bob
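As an alternative to nested loops, all pairwise Manhattan distances can be computed at once with NumPy broadcasting. This is a sketch, not the worksheet's required approach: the matrix `M` is copied from the output printed in part d), and the names `M` and `D` are chosen here for illustration.

```python
import numpy as np

# Rows copied from the matrix printed in part d)
M = np.array([
    [1, 0, 0, 0, 0, 0, 1, 0],  # "hi Alice"
    [0, 0, 0, 0, 1, 1, 0, 0],  # "hello Claude"
    [0, 1, 1, 1, 1, 0, 0, 1],  # "Bob my name is Claude"
    [1, 1, 1, 0, 1, 0, 1, 1],  # "hi Claude my name is Alice"
    [0, 0, 0, 1, 0, 1, 0, 0],  # "hello Bob"
])

# D[i, j] = sum over k of |M[i, k] - M[j, k]|, via broadcasting to shape (5, 5, 8)
D = np.abs(M[:, None, :] - M[None, :, :]).sum(axis=2).astype(float)
np.fill_diagonal(D, np.inf)  # ignore each row's zero distance to itself

# Index pair of the smallest off-diagonal entry
i, j = np.unravel_index(np.argmin(D), D.shape)
print(M.shape, (int(i), int(j)), D[i, j])
```

Here rows 1 and 4 ("hello Claude" and "hello Bob") are the closest pair, at distance 2.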