# Worksheet 04

Name:  Priscilla Nguyen

UID: U83582329

### Topics

- Distance & Similarity

### Distance & Similarity

#### Part 1

a) In the Minkowski distance, describe what the parameters p and d are.

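p is the order of the distance: each absolute coordinate difference is raised to the power p, the results are summed, and the sum is raised to the power 1/p.

d is the dimensionality of the space in which the distance is calculated, i.e. the number of coordinates in each point.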
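
For reference, the usual way to write the Minkowski distance between two points $x$ and $y$ is shown below. The coordinate symbols $x_i$ and $y_i$ are notation introduced here, not taken from the worksheet.

$$
D_p(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}
$$

Setting p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance.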

b) In your own words describe the difference between the Euclidean distance and the Manhattan distance.

Euclidean distance (p = 2) is the straight-line distance between two points, i.e. the length of the segment joining them.

Manhattan distance (p = 1) is the distance along grid-like paths: it sums the absolute differences in each dimension, allowing only horizontal and vertical movement. This makes it suitable when movement is constrained to a grid or network.

Consider A = (0, 0) and B = (1, 1). When:

- p = 1, d(A, B) = 2
- p = 2, d(A, B) = $\sqrt{2} \approx 1.41$
- p = 3, d(A, B) = $2^{1/3} \approx 1.26$
- p = 4, d(A, B) = $2^{1/4} \approx 1.19$
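
The values above can be checked numerically. This is a minimal sketch; the helper name `minkowski` is chosen here for illustration and is not part of the worksheet.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

A, B = (0, 0), (1, 1)

# Distance between A and B for each order p listed above
distances = {p: minkowski(A, B, p) for p in (1, 2, 3, 4)}
for p, dist in distances.items():
    print(f"p = {p}: d(A, B) = {dist:.2f}")
```

The distance shrinks as p grows, because raising to the power 1/p damps the sum more strongly.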

c) Describe what you think distance would look like when p is very large.

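As p becomes very large, the Minkowski distance converges to the Chebyshev distance, which is the largest absolute difference in any single dimension. The dimension with the biggest difference dominates and the contribution of every other dimension vanishes. For A and B above, the values 2, 1.41, 1.26, 1.19 approach 1 as p grows.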
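
A small numeric sketch of this limit follows. The points `X` and `Y` are hypothetical values picked so that one dimension clearly dominates, and the helper names are illustrative.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def chebyshev(x, y):
    """Largest absolute difference in any single dimension."""
    return max(abs(a - b) for a, b in zip(x, y))

# Hypothetical points: the differences are 3 in one dimension and 1 in the other
X, Y = (0.0, 0.0), (3.0, 1.0)

for p in (1, 2, 10, 100):
    print(f"p = {p}: {minkowski(X, Y, p):.6f}")
print("largest single-dimension difference:", chebyshev(X, Y))
```

As p increases, the printed distances fall towards the largest single-dimension difference.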

d) Is the Minkowski distance still a distance function when p < 1? Explain why / why not.

No. To be a valid distance metric, a function must satisfy non-negativity, identity of indiscernibles, symmetry, and the triangle inequality. When p < 1, the Minkowski formula still satisfies the first three but violates the triangle inequality: the direct distance between two points can exceed the sum of the distances through an intermediate point.
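
The failure of the triangle inequality can be seen with a concrete case. In this sketch the intermediate point `C` and the choice p = 0.5 are assumptions made for illustration, and `minkowski_unchecked` is a hypothetical helper that deliberately skips any check on p.

```python
def minkowski_unchecked(x, y, p):
    """Minkowski formula with no restriction on p (for demonstration only)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

A, B, C = (0, 0), (1, 1), (0, 1)
p = 0.5

direct = minkowski_unchecked(A, B, p)  # straight from A to B
detour = minkowski_unchecked(A, C, p) + minkowski_unchecked(C, B, p)  # via C

print(f"d(A, B) = {direct}, d(A, C) + d(C, B) = {detour}")
print("triangle inequality holds:", direct <= detour)
```

Here the direct distance is larger than the detour through C, which a metric never allows.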

e) When would you use cosine similarity over the Euclidean distance?

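I would choose cosine similarity over Euclidean distance for high-dimensional, sparse data such as text, where what matters is the direction of the vectors (the relative proportions of their components) rather than their magnitude. For example, a short document and a longer one with the same word proportions point in the same direction, so their cosine similarity is 1 even though their Euclidean distance is large.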
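
The sketch below contrasts the two measures on a pair of hypothetical word-count vectors, where `long_doc` is simply `short_doc` scaled by 10. All names and values are illustrative.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between x and y (assumes neither is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def euclidean(x, y):
    """Straight-line distance between x and y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical word counts: same proportions, different document lengths
short_doc = [1, 2, 0]
long_doc = [10, 20, 0]

cos = cosine_similarity(short_doc, long_doc)
euc = euclidean(short_doc, long_doc)
print(f"cosine similarity = {cos:.3f}, Euclidean distance = {euc:.2f}")
```

Cosine similarity treats the two documents as essentially identical, while Euclidean distance reports them as far apart purely because of length.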

f) What does the Jaccard distance account for that the Manhattan distance doesn't?

The Jaccard distance measures dissimilarity between sets (or binary vectors) as one minus the size of the intersection divided by the size of the union. It therefore accounts for how large the sets are and ignores elements that are absent from both: two mismatched elements make small sets very different but large sets only slightly different. The Manhattan distance on binary vectors simply counts mismatches regardless of set size. On numerical vectors it also reflects the magnitude of the difference in each dimension, which makes it the better fit for quantitative data.
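
A short sketch of that contrast, using word sets as the data. The helper names are illustrative, and returning 0 for two empty sets is a convention assumed here.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumed convention: two empty sets are identical
    return 1 - len(a & b) / len(union)

def mismatch_count(a, b):
    """Manhattan distance between the binary presence vectors of two sets,
    which equals the number of elements found in exactly one of them."""
    return len(set(a) ^ set(b))

small = ({"hi", "Alice"}, {"hi", "Bob"})
large = ({"hello", "my", "name", "is", "Alice"},
         {"hello", "my", "name", "is", "Bob"})

for label, (a, b) in (("small", small), ("large", large)):
    print(label, mismatch_count(a, b), round(jaccard_distance(a, b), 3))
```

Both pairs differ by two words, so the mismatch count is the same, but the Jaccard distance is smaller for the larger sets because they share more.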

#### Part 2

Consider the following two sentences:

In [None]:
s1 = "hello my name is Alice"  
s2 = "hello my name is Bob"

Using the union of words from both sentences, we can represent each sentence as a vector. Each element of the vector represents the presence or absence of the word at that index.

In this example, the union of words is ("hello", "my", "name", "is", "Alice", "Bob") so we can represent the above sentences as such:

In [20]:
#     hello  my  name  is  Alice  Bob
v1 = [1,     1,  1,    1,  1,     0]
v2 = [1,     1,  1,    1,  0,     1]

Programmatically, we can do the following:

In [17]:
corpus = [s1, s2]
all_words = list(set([item for x in corpus for item in x.split()]))
print(all_words)
# Split s1 into words so membership is tested per word, not per substring
v1 = [1 if x in s1.split() else 0 for x in all_words]
print(v1)

['Alice', 'Bob', 'hello', 'is', 'name', 'my']
[1, 0, 1, 1, 1, 1]


Let's add a new sentence to our corpus:

In [21]:
s3 = "hi my name is Claude"
corpus.append(s3)

a) What is the new union of words used to represent s1, s2, and s3?

In [22]:
all_words = list(set([item for x in corpus for item in x.split()]))

# Printing the new union of words
print(all_words)

['Alice', 'Bob', 'hello', 'Claude', 'is', 'hi', 'name', 'my']


b) Represent s1, s2, and s3 as vectors as above, using this new set of words.

In [23]:
s1 = "hello my name is Alice"
s2 = "hello my name is Bob"
s3 = "hi my name is Claude"

# Create the vector representations
v1 = [1 if word in s1.split() else 0 for word in all_words]
v2 = [1 if word in s2.split() else 0 for word in all_words]
v3 = [1 if word in s3.split() else 0 for word in all_words]

# Print the vector representations
print(v1)  
print(v2)  
print(v3)

[1, 0, 1, 0, 1, 0, 1, 1]
[0, 1, 1, 0, 1, 0, 1, 1]
[0, 0, 0, 1, 1, 1, 1, 1]


c) Write a function that computes the Manhattan distance between two vectors. Which pair of vectors is the most similar under that distance function?

In [24]:
def minkowski_dist(x, y, p):
    if p < 1:
        raise ValueError("p must be at least 1")
    if len(x) != len(y):
        raise ValueError("x and y must be in the same dimensional space")
    res = 0
    for i in range(len(x)):
        res += abs(x[i] - y[i]) ** p
    return res ** (1/p)

def manhattan_dist(x, y):
    return minkowski_dist(x, y, 1)

print(manhattan_dist(v1, v2))

2.0
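
The cell above only compares v1 with v2. To answer which pair is the most similar, all three pairs need to be compared. The sketch below does that with the vectors printed in part b; the names `named_vectors`, `manhattan`, `pairwise`, and `closest` are introduced here for illustration.

```python
from itertools import combinations

# Vectors as printed in part b
named_vectors = {
    "v1": [1, 0, 1, 0, 1, 0, 1, 1],
    "v2": [0, 1, 1, 0, 1, 0, 1, 1],
    "v3": [0, 0, 0, 1, 1, 1, 1, 1],
}

def manhattan(x, y):
    """Sum of absolute differences between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Distance for every unordered pair of vectors
pairwise = {
    (m, n): manhattan(named_vectors[m], named_vectors[n])
    for m, n in combinations(named_vectors, 2)
}
closest = min(pairwise, key=pairwise.get)

print(pairwise)
print("most similar pair:", closest)
```

Under the Manhattan distance, v1 and v2 are the most similar pair: they differ only in the name, while each of them differs from v3 in both the greeting and the name.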


d) Create a matrix of all these vectors (row major) and add the following sentences in vector form:

- "hi Alice"
- "hello Claude"
- "Bob my name is Claude"
- "hi Claude my name is Alice"
- "hello Bob"

In [25]:
corpus = ["hi Alice", "hello Claude", "Bob my name is Claude", "hi Claude my name is Alice", "hello Bob"]
all_words = list(set([item for x in corpus for item in x.split()]))

print(all_words)

# One row per sentence; test membership against whole words, not substrings
matrix = [[1 if word in sentence.split() else 0 for word in all_words] for sentence in corpus]
for vector in matrix:
    print(vector)

['Bob', 'Alice', 'hello', 'Claude', 'is', 'hi', 'name', 'my']
[0, 1, 0, 0, 0, 1, 0, 0]
[0, 0, 1, 1, 0, 0, 0, 0]
[1, 0, 0, 1, 1, 0, 1, 1]
[0, 1, 0, 1, 1, 1, 1, 1]
[1, 0, 1, 0, 0, 0, 0, 0]


e) How many rows and columns does this matrix have?

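The full matrix, with the rows for s1, s2, and s3 followed by the five new sentences, has 8 rows and 8 columns. Note that the `matrix` variable built in part d contains only the five new sentences, so `len(matrix)` below evaluates to 5 while `len(matrix[0])` evaluates to 8.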

In [26]:
rows = len(matrix)
cols = len(matrix[0])
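
One way to build the matrix with all eight sentences is sketched below. Sorting the vocabulary is an assumption made here so that the column order is reproducible; the original cells rely on `set` ordering, which can vary between runs. The names `sentences`, `vocabulary`, and `full_matrix` are illustrative.

```python
sentences = [
    "hello my name is Alice",      # s1
    "hello my name is Bob",        # s2
    "hi my name is Claude",        # s3
    "hi Alice",
    "hello Claude",
    "Bob my name is Claude",
    "hi Claude my name is Alice",
    "hello Bob",
]

# Sorted so that the column order is the same on every run
vocabulary = sorted({word for s in sentences for word in s.split()})

# Row-major: one row per sentence, one column per vocabulary word
full_matrix = [
    [1 if word in s.split() else 0 for word in vocabulary]
    for s in sentences
]

print(vocabulary)
print(len(full_matrix), "rows,", len(full_matrix[0]), "columns")
```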

f) When using the Manhattan distance, which two sentences are the most similar?

In [27]:
minimum = manhattan_dist(matrix[0], matrix[1])
vectors = (0, 1)
for i in range(len(matrix)):
    for j in range(len(matrix)):
        if i != j:
            new_min = manhattan_dist(matrix[i], matrix[j])
            if minimum > new_min:
                minimum = new_min
                vectors = (i, j)
print(vectors)

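(1, 4)

Among the five sentences in `matrix`, the most similar pair is row 1 ("hello Claude") and row 4 ("hello Bob"), with a Manhattan distance of 2.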
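
If s1, s2, and s3 are also included as rows, which is an assumption about what part d intends, the closest pair changes. The sketch below repeats the search over all eight sentences using `itertools.combinations`, so each unordered pair is visited once. The helper names are illustrative.

```python
from itertools import combinations

sentences = [
    "hello my name is Alice", "hello my name is Bob", "hi my name is Claude",
    "hi Alice", "hello Claude", "Bob my name is Claude",
    "hi Claude my name is Alice", "hello Bob",
]
vocabulary = sorted({word for s in sentences for word in s.split()})
rows = [[1 if word in s.split() else 0 for word in vocabulary] for s in sentences]

def manhattan(x, y):
    """Sum of absolute differences between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

def closest_pair(rows):
    """Indices (i, j), i < j, of the two rows with the smallest Manhattan distance."""
    return min(
        combinations(range(len(rows)), 2),
        key=lambda pair: manhattan(rows[pair[0]], rows[pair[1]]),
    )

i, j = closest_pair(rows)
best = manhattan(rows[i], rows[j])
print((i, j), best)
print(repr(sentences[i]), "and", repr(sentences[j]))
```

With all eight rows, the closest pair is "hi my name is Claude" and "hi Claude my name is Alice", which differ by a single word.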
