# Worksheet 04

Name: Youxuan Ma

UID: U23330522

### Topics

- Distance & Similarity

### Distance & Similarity

#### Part 1

a) In the Minkowski distance, describe what the parameters p and d are.

1. **$p$ (the order of the Minkowski metric):** This parameter determines the type of distance measurement that the Minkowski formula will compute. It's a real number greater than or equal to 1 ($p \geq 1$), where:
   - When $p = 1$, the Minkowski distance becomes the Manhattan (or L1) distance.
   - When $p = 2$, it becomes the Euclidean (or L2) distance.
   - As $p$ approaches infinity ($\infty$), the Minkowski distance converges to the Chebyshev distance.
   
   The formula for the Minkowski distance between two points $x$ and $y$ in a $d$-dimensional space is given by:
   $$
   D(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{\frac{1}{p}}
   $$
2. **$d$ (the dimensionality of the points):** This parameter is the number of dimensions (attributes) of each point $x$ and $y$. It sets the range of the sum in the formula: $d$ is the total number of components (features) in the vectors $x$ and $y$, and $x_i$ and $y_i$ are the components of $x$ and $y$ in the $i^{th}$ dimension.
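
The formula above can be sketched directly in Python. This is a minimal illustration, not part of the worksheet: the function name `minkowski_distance` and the check on `p` are assumptions made here, and the points A = (0, 0) and B = (1, 1) are used only as sample input.

```python
def minkowski_distance(x, y, p):
    """Minkowski distance of order p between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of dimensions d")
    if p < 1:
        raise ValueError("p must be >= 1 for the result to be a metric")
    # Sum |x_i - y_i|^p over the d dimensions, then take the p-th root.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


A, B = (0, 0), (1, 1)
for p in (1, 2, 3, 4):
    print(f"p = {p}: {minkowski_distance(A, B, p):.2f}")
```

Here `len(x)` plays the role of $d$, and `p` selects which member of the family (Manhattan, Euclidean, and so on) is computed.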

b) In your own words describe the difference between the Euclidean distance and the Manhattan distance.

- The Euclidean distance between two points is the length of the shortest path connecting them: a straight line, which matches how we typically measure distances in the physical world.

- The Manhattan distance, by contrast, only allows travel along axis-parallel (right-angled) paths, similar to navigating the blocks of a grid-like city, so it is the sum of the absolute differences in each coordinate.

Consider A = (0, 0) and B = (1, 1). When:

- p = 1, d(A, B) = 2
- p = 2, d(A, B) = $\sqrt{2} \approx 1.41$
- p = 3, d(A, B) = $2^{1/3} \approx 1.26$
- p = 4, d(A, B) = $2^{1/4} \approx 1.19$

c) Describe what you think distance would look like when p is very large.

When $p$ becomes very large, approaching infinity, the distance metric starts to resemble the Chebyshev distance, focusing solely on the maximum difference across all dimensions and ignoring the smaller ones. 

Therefore, when $p$ is very large, the distance between points A and B would be equivalent to the largest single-coordinate difference between them, which is $1$ in this case.
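
The convergence is easier to see when the coordinate differences are unequal. The following is a small sketch under assumptions made here: the points (0, 0) and (3, 4) and the helper names are illustrative and do not come from the worksheet.

```python
def minkowski_distance(x, y, p):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


def chebyshev_distance(x, y):
    """Largest single-coordinate difference between x and y."""
    return max(abs(a - b) for a, b in zip(x, y))


x, y = (0, 0), (3, 4)  # coordinate differences are 3 and 4
for p in (1, 2, 10, 100):
    print(f"p = {p:>3}: {minkowski_distance(x, y, p):.4f}")
print("Chebyshev:", chebyshev_distance(x, y))
```

As `p` grows, the larger difference (4) dominates the sum and the printed values approach the Chebyshev distance.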

d) Is the Minkowski distance still a distance function when p < 1? Explain why / why not.

It would not be a valid distance function when $p < 1$, because it violates the triangle inequality:
- When $p < 1$, the outer exponent $1/p$ is greater than 1, so the formula inflates the distance between points that differ in several coordinates relative to points that differ in only one. The direct path between two points can then be longer than an indirect path through a third point. For example, with $p = 0.5$, $A = (0, 0)$, $B = (1, 1)$ and $C = (1, 0)$: $D(A, B) = (1 + 1)^2 = 4$, but $D(A, C) + D(C, B) = 1 + 1 = 2$, so $D(A, B) > D(A, C) + D(C, B)$, which contradicts the triangle inequality.
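
The violation can be checked numerically. This is a minimal sketch, assuming $p = 0.5$ and three points A = (0, 0), B = (1, 1), C = (1, 0) chosen here for illustration; the name `minkowski_form` is hypothetical and signals that the formula is no longer a metric in this range.

```python
def minkowski_form(x, y, p):
    """Minkowski formula evaluated for any p > 0 (not a metric when p < 1)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


A, B, C = (0, 0), (1, 1), (1, 0)
p = 0.5

direct = minkowski_form(A, B, p)                             # A -> B
detour = minkowski_form(A, C, p) + minkowski_form(C, B, p)   # A -> C -> B

print("direct:", direct)
print("detour:", detour)
print("triangle inequality holds:", direct <= detour)
```

The direct value exceeds the detour value, so the triangle inequality fails for this choice of points.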

e) When would you use cosine similarity over the Euclidean distance?

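I would use cosine similarity instead of Euclidean distance when the direction of the vectors matters more than their magnitude, which is common in high-dimensional spaces such as text represented by word-count vectors. Two documents with the same word proportions but different lengths have a cosine similarity of 1, yet they can be far apart in Euclidean distance.

On the other hand, I would use Euclidean distance for spatial analyses and algorithms that rely on the physical distance between points.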
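
A short sketch of this contrast, assuming two hypothetical word-count vectors where one document is simply ten times longer than the other. The function names and the sample vectors are illustrative, and the cosine function assumes neither vector is all zeros.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between x and y (assumes both are nonzero)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def euclidean_distance(x, y):
    """Straight-line distance between x and y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


short_doc = (1, 2, 0)    # hypothetical word counts
long_doc = (10, 20, 0)   # same proportions, ten times longer

print("cosine similarity: ", round(cosine_similarity(short_doc, long_doc), 4))
print("Euclidean distance:", round(euclidean_distance(short_doc, long_doc), 4))
```

The two vectors point in the same direction, so cosine similarity treats them as identical in orientation, while the Euclidean distance between them is large because of the difference in magnitude.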

f) What does the Jaccard distance account for that the Manhattan distance doesn't?

Jaccard distance accounts for which features are present in each item relative to how many features are present overall, and it is designed for binary or set-valued data.
- It uses set operations (intersection and union) to assess similarity and dissimilarity, which is fundamentally different from the arithmetic operation (sum of absolute differences) used by the Manhattan distance.
- It normalizes by the size of the union, $1 - |A \cap B| / |A \cup B|$, so two large sets that differ in two elements are closer than two small sets that differ in two elements. The Manhattan distance on binary vectors only counts the differing positions, so it gives both pairs the same distance.
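
The effect of normalizing by the union can be sketched on binary vectors. The vectors below are made up for illustration, and returning 0 for two empty sets is a convention assumed here rather than something the worksheet specifies.

```python
def manhattan_distance(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def jaccard_distance(u, v):
    """Jaccard distance between two binary vectors of equal length."""
    intersection = sum(1 for a, b in zip(u, v) if a and b)
    union = sum(1 for a, b in zip(u, v) if a or b)
    if union == 0:
        return 0.0  # assumed convention: two empty sets are identical
    return 1 - intersection / union


# Both pairs differ in exactly two positions.
small_u, small_v = [1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0]  # nothing shared
large_u, large_v = [1, 1, 1, 1, 1, 0], [1, 1, 1, 1, 0, 1]  # four shared features

for name, u, v in (("small", small_u, small_v), ("large", large_u, large_v)):
    print(name, manhattan_distance(u, v), round(jaccard_distance(u, v), 3))
```

Manhattan distance gives both pairs the same value, while Jaccard distance separates the pair with nothing in common from the pair that shares most of its features.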

#### Part 2

Consider the following two sentences:

In [30]:
s1 = "hello my name is Alice"  
s2 = "hello my name is Bob"

Using the union of words from both sentences, we can represent each sentence as a vector. Each element of the vector represents the presence or absence of the word at that index.

In this example, the union of words is ("hello", "my", "name", "is", "Alice", "Bob") so we can represent the above sentences as such:

In [31]:
v1 = [1,    1, 1,   1, 1,    0]
#     hello my name is Alice
v2 = [1,    1, 1,   1, 0,    1]
#     hello my name is       Bob

Programmatically, we can do the following:

In [32]:
corpus = [s1, s2]
all_words = list(set([item for x in corpus for item in x.split()]))
print(all_words)
v1 = [1 if x in s1.split() else 0 for x in all_words]  # match whole words, not substrings
print(v1)

['is', 'name', 'Bob', 'Alice', 'hello', 'my']
[1, 1, 0, 1, 1, 1]


Let's add a new sentence to our corpus:

In [33]:
s3 = "hi my name is Claude"
corpus.append(s3)

a) What is the new union of words used to represent s1, s2, and s3?

In [34]:
all_words = list(set(word for sentence in corpus for word in sentence.split()))
print("The new union of words: ", all_words)

The new union of words:  ['is', 'name', 'Bob', 'hi', 'Alice', 'hello', 'my', 'Claude']


b) Represent s1, s2, and s3 as vectors as above, using this new set of words.

In [35]:
vectors = [[1 if word in sentence.split() else 0 for word in all_words] for sentence in corpus]

for i, vector in enumerate(vectors, start=1):
    print(f"Vector for s{i}:", vector)

Vector for s1: [1, 1, 0, 0, 1, 1, 1, 0]
Vector for s2: [1, 1, 1, 0, 0, 1, 1, 0]
Vector for s3: [1, 1, 0, 1, 0, 0, 1, 1]


c) Write a function that computes the manhattan distance between two vectors. Which pair of vectors are the most similar under that distance function?

In [36]:
def manhattan_distance(vec1, vec2):
    if len(vec1) != len(vec2):
        raise ValueError("Vectors must be of the same length.")
    distance = sum(abs(a - b) for a, b in zip(vec1, vec2))
    return distance

distances = {}
for i in range(len(vectors)):
    for j in range(i+1, len(vectors)):
        distance = manhattan_distance(vectors[i], vectors[j])
        distances[(i+1, j+1)] = distance

most_similar_pair = min(distances, key=distances.get)
most_similar_distance = distances[most_similar_pair]

print(f"The most similar pair of vectors under the Manhattan distance is the sentences {most_similar_pair}, with a Manhattan distance of {most_similar_distance}.")


The most similar pair of vectors under the Manhattan distance is the sentences (1, 2), with a Manhattan distance of 2.


As the output above shows, the most similar pair of vectors under the Manhattan distance is s1 and s2, with a distance of 2.

d) Create a matrix of all these vectors (row major) and add the following sentences in vector form:

- "hi Alice"
- "hello Claude"
- "Bob my name is Claude"
- "hi Claude my name is Alice"
- "hello Bob"

In [37]:
new_sentences = [
    "hi Alice",
    "hello Claude",
    "Bob my name is Claude",
    "hi Claude my name is Alice",
    "hello Bob"
]
corpus.extend(new_sentences)

all_words = list(set(word for sentence in corpus for word in sentence.split()))

vectors = [[1 if word in sentence.split() else 0 for word in all_words] for sentence in corpus]

for i, vector in enumerate(vectors, start=1):
    print(f"Vector for sentence {i}: {vector}")


Vector for sentence 1: [1, 1, 0, 0, 1, 1, 1, 0]
Vector for sentence 2: [1, 1, 1, 0, 0, 1, 1, 0]
Vector for sentence 3: [1, 1, 0, 1, 0, 0, 1, 1]
Vector for sentence 4: [0, 0, 0, 1, 1, 0, 0, 0]
Vector for sentence 5: [0, 0, 0, 0, 0, 1, 0, 1]
Vector for sentence 6: [1, 1, 1, 0, 0, 0, 1, 1]
Vector for sentence 7: [1, 1, 0, 1, 1, 0, 1, 1]
Vector for sentence 8: [0, 0, 1, 0, 0, 1, 0, 0]


e) How many rows and columns does this matrix have?

In [38]:
# Determine the number of rows and columns of the matrix
num_rows = len(vectors)  # The number of rows is the number of vectors
num_columns = len(vectors[0]) if vectors else 0  # The number of columns is the length of any vector

print(f"This matrix has {num_rows} rows and {num_columns} columns.")


This matrix has 8 rows and 8 columns.


f) When using the Manhattan distance, which two sentences are the most similar?

In [39]:
def manhattan_distance(vec1, vec2):
    if len(vec1) != len(vec2):
        raise ValueError("Vectors must be of the same length.")
    distance = sum(abs(a - b) for a, b in zip(vec1, vec2))
    return distance

distances = {}

for i in range(len(vectors)):
    for j in range(i+1, len(vectors)):
        distance = manhattan_distance(vectors[i], vectors[j])
        distances[(i, j)] = distance

most_similar_pair = min(distances, key=distances.get)
most_similar_distance = distances[most_similar_pair]

sentence1 = corpus[most_similar_pair[0]]
sentence2 = corpus[most_similar_pair[1]]

print(f"The most similar pair of sentences under the Manhattan distance is: \n'{sentence1}' and '{sentence2}', \nwith a distance of {most_similar_distance}.")


The most similar pair of sentences under the Manhattan distance is: 
'hi my name is Claude' and 'hi Claude my name is Alice', 
with a distance of 1.


#### Part 3 Challenge

Given a set of graphs $\mathcal{G}$, each graph $G \in \mathcal{G}$ is defined over the same set of nodes $V$. The graphs are represented by their adjacency matrices, which are 2D arrays where each element indicates whether a pair of nodes is connected by an edge.

Your task is to compute the pairwise distances between these graphs based on a specific distance metric. The distance $d(G, G')$ between two graphs $G = (V, E)$ and $G' = (V, E')$ is defined as the sum of the number of edges in $G$ but not in $G'$, and the number of edges in $G'$ but not in $G$. Mathematically, this can be expressed as:

$$
d(G, G') = |E \setminus E'| + |E' \setminus E|.
$$

##### Requirements:
1. **Input**: Should take a list of 2D numpy arrays as input. Each array represents the adjacency matrix of a graph.

2. **Output**: Should output a pairwise distance matrix. If there are $n$ graphs in the input list, the output should be an $n \times n$ matrix where the entry at position $(i, j)$ represents the distance between the $i^{th}$ and $j^{th}$ graph.
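
One possible approach to the challenge is sketched below; it is not the worksheet's reference solution. It assumes the graphs are undirected and simple, so each adjacency matrix is symmetric and only the upper triangle is counted. It is written in plain Python so it runs without dependencies: the matrices are indexed as `A[i][j]`, which works for nested lists and for 2D numpy arrays, and the result is a nested list that can be wrapped in a numpy array if needed. The function names and the three sample graphs are hypothetical.

```python
def graph_distance(A, B):
    """Number of edges that appear in exactly one of the two graphs.

    Assumes undirected graphs over the same node set, given as
    symmetric adjacency matrices of identical shape.
    """
    n = len(A)
    if len(B) != n:
        raise ValueError("Graphs must be defined over the same set of nodes.")
    count = 0
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle: each undirected edge once
            if bool(A[i][j]) != bool(B[i][j]):
                count += 1
    return count


def pairwise_graph_distances(graphs):
    """Return an n x n matrix (nested list) of pairwise graph distances."""
    n = len(graphs)
    D = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = graph_distance(graphs[i], graphs[j])
    return D


# Three sample graphs on nodes {0, 1, 2}.
G1 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # edges (0,1), (1,2)
G2 = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]  # edge  (0,1)
G3 = [[0, 0, 1], [0, 0, 0], [1, 0, 0]]  # edge  (0,2)

for row in pairwise_graph_distances([G1, G2, G3]):
    print(row)
```

Because $d(G, G') = d(G', G)$ and $d(G, G) = 0$, only the entries above the diagonal are computed and then mirrored. For directed graphs, the inner loop would need to cover every off-diagonal entry instead of the upper triangle.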