# Worksheet 04

Name:  Priscilla Nguyen

UID: U83582329

### Topics

- Distance & Similarity

### Distance & Similarity

#### Part 1

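a) For the Minkowski distance, describe what the parameters p and d are.

p is the order of the distance: each absolute coordinate difference is raised to the power p, the results are summed, and the p-th root of the sum is taken.

d is the dimensionality of the space in which the distance is being calculated, that is, the number of coordinates in each point.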

b) In your own words describe the difference between the Euclidean distance and the Manhattan distance.

Euclidean distance is the straight-line distance between two points: the square root of the sum of the squared coordinate differences.

Manhattan distance measures the distance along grid-like paths, allowing only horizontal and vertical movements. It is the sum of the absolute coordinate differences, which makes it suitable for situations where movement is constrained to a grid or network.

Consider A = (0, 0) and B = (1, 1). When:

- p = 1, d(A, B) = 2
- p = 2, d(A, B) = $\sqrt{2} \approx 1.41$
- p = 3, d(A, B) = $2^{1/3} \approx 1.26$
- p = 4, d(A, B) = $2^{1/4} \approx 1.19$
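
The values above can be checked with a minimal sketch of the Minkowski distance, $d(x, y) = \left(\sum_{i=1}^{d} |x_i - y_i|^p\right)^{1/p}$. The function name `minkowski_distance` is chosen here for illustration and is not part of the worksheet.

```python
def minkowski_distance(x, y, p):
    """Minkowski distance of order p between two equal-length points."""
    if len(x) != len(y):
        raise ValueError("Points must have the same dimensionality")
    # Sum the p-th powers of the absolute differences, then take the p-th root
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

A = (0, 0)
B = (1, 1)
for p in (1, 2, 3, 4):
    print(f"p = {p}: d(A, B) = {minkowski_distance(A, B, p):.2f}")
```

Rounded to two decimals, this should reproduce the list above (2.00, 1.41, 1.26, 1.19).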

c) Describe what you think distance would look like when p is very large.

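As p becomes very large, the largest coordinate difference dominates the sum, so the Minkowski distance converges to the maximum absolute difference over all dimensions, $\max_i |x_i - y_i|$ (the Chebyshev distance). The contribution of every other dimension vanishes. For A = (0, 0) and B = (1, 1) this limit is 1, consistent with the decreasing values above.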
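
As a rough illustration of the large-p behaviour, the sketch below compares the Minkowski distance for growing p with the maximum absolute coordinate difference (the Chebyshev distance). The point `C = (3, 4)` and both function names are assumptions made for this example only.

```python
def minkowski_distance(x, y, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def chebyshev_distance(x, y):
    """Largest absolute coordinate difference between two points."""
    return max(abs(a - b) for a, b in zip(x, y))

A = (0, 0)
C = (3, 4)  # hypothetical point, chosen so the two coordinates differ

for p in (1, 2, 10, 100):
    print(f"p = {p:>3}: {minkowski_distance(A, C, p):.4f}")
print(f"max difference: {chebyshev_distance(A, C)}")
```

As p grows, the printed distances shrink toward 4, the larger of the two coordinate differences.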

d) Is the Minkowski distance still a distance function when p < 1? Explain why / why not.

No. A distance function must satisfy non-negativity, identity of indiscernibles, symmetry, and the triangle inequality. When p < 1 the first three still hold, but the triangle inequality fails. For example, with p = 0.5, A = (0, 0), B = (1, 1), and C = (1, 0): d(A, B) = $(1 + 1)^2 = 4$, while d(A, C) + d(C, B) = 1 + 1 = 2, so the direct path is longer than the detour through C.
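
One way to see the problem concretely is to test the triangle inequality numerically for p = 0.5. The sketch below uses three hypothetical points A, B, and C chosen for this illustration, and the helper name `minkowski_distance` is an assumption.

```python
def minkowski_distance(x, y, p):
    """Minkowski formula of order p (not a true metric when p < 1)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

A, B, C = (0, 0), (1, 1), (1, 0)
p = 0.5

direct = minkowski_distance(A, B, p)
detour = minkowski_distance(A, C, p) + minkowski_distance(C, B, p)

print("d(A, B)           =", direct)
print("d(A, C) + d(C, B) =", detour)
# The triangle inequality requires direct <= detour
print("triangle inequality holds:", direct <= detour)
```

Here the direct distance exceeds the detour through C, so the triangle inequality is violated.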

e) When would you use cosine similarity over the Euclidean distance?

I would choose cosine similarity over Euclidean distance when dealing with high-dimensional, sparse, or text data, where the direction of the vectors matters more than their magnitude. Two documents with the same word proportions but very different lengths have a cosine similarity of 1 even though their Euclidean distance is large.
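
A small sketch of this difference, using two made-up word-count vectors (`short_doc` and `long_doc` are hypothetical) that point in the same direction but differ in length:

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def euclidean_distance(x, y):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

short_doc = [1, 2, 0]    # hypothetical word counts
long_doc = [10, 20, 0]   # same proportions, ten times longer

print("cosine similarity: ", cosine_similarity(short_doc, long_doc))
print("Euclidean distance:", euclidean_distance(short_doc, long_doc))
```

The cosine similarity is 1 up to floating-point rounding, while the Euclidean distance is large because it is sensitive to vector length.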

f) What does the Jaccard distance account for that the Manhattan distance doesn't?

The Jaccard distance is specifically designed to measure dissimilarity between sets or binary data based on set membership: it is one minus the size of the intersection divided by the size of the union, so it accounts for how large the sets are relative to what they share. The Manhattan distance measures dissimilarity between numerical vectors by summing the magnitude of the differences in each dimension, without any such normalization. The Jaccard distance is suitable for categorical or binary data analysis, whereas the Manhattan distance is suitable for quantitative data analysis in multi-dimensional space.
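
To illustrate the normalization, the sketch below compares two pairs of word sets that each differ by exactly two words. On binary vectors the Manhattan distance equals the number of differing words, so it is 2 for both pairs, while the Jaccard distance differs. The function name and the convention for two empty sets are assumptions of this example.

```python
def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| for two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumed convention: two empty sets are identical
    return 1 - len(a & b) / len(union)

long_1 = {"hello", "my", "name", "is", "Alice"}
long_2 = {"hello", "my", "name", "is", "Bob"}
short_1 = {"hi"}
short_2 = {"hello"}

for a, b in ((long_1, long_2), (short_1, short_2)):
    differing = len(a ^ b)  # equals the Manhattan distance on binary vectors
    print(f"differing words: {differing}, Jaccard distance: {jaccard_distance(a, b):.3f}")
```

Both pairs differ by two words, but the longer sentences share most of their words and so get a much smaller Jaccard distance.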

#### Part 2

Consider the following two sentences:

In [None]:
s1 = "hello my name is Alice"  
s2 = "hello my name is Bob"

Using the union of words from both sentences, we can represent each sentence as a vector. Each element of the vector represents the presence or absence of the word at that index.

In this example, the union of words is ("hello", "my", "name", "is", "Alice", "Bob") so we can represent the above sentences as such:

In [1]:
v1 = [1,    1, 1,   1, 1,    0]
#     hello my name is Alice
v2 = [1,    1, 1,   1, 0,    1]
#     hello my name is       Bob

Programmatically, we can do the following:

In [None]:
corpus = [s1, s2]
all_words = list(set([item for x in corpus for item in x.split()]))
print(all_words)
# Compare against the list of words, not the raw string, to avoid substring matches
v1 = [1 if x in s1.split() else 0 for x in all_words]
print(v1)

['hello', 'Bob', 'is', 'name', 'my', 'Alice']
[1, 0, 1, 1, 1, 1]
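
Because `set` ordering is arbitrary, the column order above can change between runs. A possible variant, sketched here with hypothetical helper names `build_vocabulary` and `vectorize`, sorts the vocabulary so the columns are reproducible and matches whole words only:

```python
def build_vocabulary(sentences):
    """Sorted list of the distinct words across all sentences."""
    return sorted({word for sentence in sentences for word in sentence.split()})

def vectorize(sentence, vocabulary):
    """Binary presence/absence vector of a sentence over the vocabulary."""
    words = set(sentence.split())
    return [1 if word in words else 0 for word in vocabulary]

sentences = ["hello my name is Alice", "hello my name is Bob"]
vocabulary = build_vocabulary(sentences)

print(vocabulary)
for sentence in sentences:
    print(vectorize(sentence, vocabulary))
```

Sorting places capitalized words first, so the columns here are in a different order from the output above, but each sentence is represented by the same set of words.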


Let's add a new sentence to our corpus:

In [None]:
s3 = "hi my name is Claude"
corpus.append(s3)

a) What is the new union of words used to represent s1, s2, and s3?

In [4]:
["hello", "my", "name", "is", "Alice", "Bob", "hi", "Claude"]

['hello', 'my', 'name', 'is', 'Alice', 'Bob', 'hi', 'Claude']

b) Represent s1, s2, and s3 as vectors as above, using this new set of words.

In [3]:
s1 = "hello my name is Alice"
s2 = "hello my name is Bob"
s3 = "hi my name is Claude"

corpus = [s1, s2, s3]
all_words = list(set([item for x in corpus for item in x.split()]))

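# Compare against the list of words, not the raw string, to avoid substring matches
v1 = [1 if x in s1.split() else 0 for x in all_words]
v2 = [1 if x in s2.split() else 0 for x in all_words]
v3 = [1 if x in s3.split() else 0 for x in all_words]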

print(all_words)
print(v1)
print(v2)
print(v3)

['Alice', 'Bob', 'hello', 'Claude', 'is', 'hi', 'name', 'my']
[1, 0, 1, 0, 1, 0, 1, 1]
[0, 1, 1, 0, 1, 0, 1, 1]
[0, 0, 0, 1, 1, 1, 1, 1]


c) Write a function that computes the Manhattan distance between two vectors. Which pair of vectors is the most similar under that distance function?

In [5]:
def manhattan_distance(v1, v2):
    if len(v1) != len(v2):
        raise ValueError("Vectors must have the same dimensionality")
    return sum(abs(x - y) for x, y in zip(v1, v2))

# Calculate Manhattan distances
distance_s1_s2 = manhattan_distance(v1, v2)
distance_s1_s3 = manhattan_distance(v1, v3)
distance_s2_s3 = manhattan_distance(v2, v3)

print("Manhattan Distance between s1 and s2:", distance_s1_s2)
print("Manhattan Distance between s1 and s3:", distance_s1_s3)
print("Manhattan Distance between s2 and s3:", distance_s2_s3)

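Manhattan Distance between s1 and s2: 2
Manhattan Distance between s1 and s3: 4
Manhattan Distance between s2 and s3: 4

s1 and s2 are the most similar pair under this distance: they differ in only two words (Alice and Bob), giving a distance of 2, while each of them is at distance 4 from s3.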
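
The closest pair can also be picked programmatically rather than by reading the printed distances. The following is a sketch only: it copies the three vectors printed in part b into a dictionary named `vectors`, which is an assumption of this example.

```python
from itertools import combinations

def manhattan_distance(u, v):
    """Sum of absolute coordinate differences between two vectors."""
    if len(u) != len(v):
        raise ValueError("Vectors must have the same dimensionality")
    return sum(abs(x - y) for x, y in zip(u, v))

# Vectors copied from the output of part b
vectors = {
    "s1": [1, 0, 1, 0, 1, 0, 1, 1],
    "s2": [0, 1, 1, 0, 1, 0, 1, 1],
    "s3": [0, 0, 0, 1, 1, 1, 1, 1],
}

# Evaluate every unordered pair once and keep the one with the smallest distance
closest = min(
    combinations(vectors, 2),
    key=lambda pair: manhattan_distance(vectors[pair[0]], vectors[pair[1]]),
)
print("Most similar pair:", closest)
```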


d) Create a matrix of all these vectors (row major) and add the following sentences in vector form:

- "hi Alice"
- "hello Claude"
- "Bob my name is Claude"
- "hi Claude my name is Alice"
- "hello Bob"

In [6]:
# Existing vectors, in the column order of all_words printed in part b:
# ['Alice', 'Bob', 'hello', 'Claude', 'is', 'hi', 'name', 'my']
v1 = [1, 0, 1, 0, 1, 0, 1, 1]
v2 = [0, 1, 1, 0, 1, 0, 1, 1]
v3 = [0, 0, 0, 1, 1, 1, 1, 1]

# New sentences
new_sentences = [
    "hi Alice",
    "hello Claude",
    "Bob my name is Claude",
    "hi Claude my name is Alice",
    "hello Bob"
]

# Create vectors for new sentences (match whole words, not substrings)
new_vectors = []
for sentence in new_sentences:
    words = sentence.split()
    new_vector = [1 if word in words else 0 for word in all_words]
    new_vectors.append(new_vector)

# Combine all vectors into a matrix
import numpy as np

matrix = np.array([v1, v2, v3] + new_vectors)

# Print the matrix
print(matrix)

[[1 0 1 0 1 0 1 1]
 [0 1 1 0 1 0 1 1]
 [0 0 0 1 1 1 1 1]
 [1 0 0 0 0 1 0 0]
 [0 0 1 1 0 0 0 0]
 [0 1 0 1 1 0 1 1]
 [1 0 0 1 1 1 1 1]
 [0 1 1 0 0 0 0 0]]


e) How many rows and columns does this matrix have?

There are 8 rows (one per sentence) and 8 columns (one per word in the union).

f) When using the Manhattan distance, which two sentences are the most similar?

In [7]:
import numpy as np

# Define the matrix of vectors (including the new sentences)
matrix = np.array([
    [1, 0, 1, 0, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0],
    [0, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 1, 1, 1, 1],
    [0, 1, 1, 0, 0, 0, 0, 0],
])

# Calculate Manhattan distances between all pairs of sentences
manhattan_distances = np.zeros((len(matrix), len(matrix)))

for i in range(len(matrix)):
    for j in range(len(matrix)):
        manhattan_distances[i][j] = np.sum(np.abs(matrix[i] - matrix[j]))

# Set diagonal elements to infinity to exclude self-comparisons
np.fill_diagonal(manhattan_distances, float('inf'))

# Find the row and column indices of the minimum distance
row, col = np.unravel_index(np.argmin(manhattan_distances), manhattan_distances.shape)

# Determine the most similar sentences (convert NumPy integers to plain ints for printing)
most_similar_sentences = (int(row), int(col))
print("The most similar sentences (by Manhattan distance) are sentences", most_similar_sentences)

The most similar sentences (by Manhattan distance) are sentences (2, 6)

Rows 2 and 6 are "hi my name is Claude" and "hi Claude my name is Alice". They differ only in the word "Alice", so their Manhattan distance is 1.
