In [1]:
import numpy as np

## Euclidean Distance

In [2]:
def euclidean(v1, v2):
    # Convert once so that lists and arrays are handled the same way
    v1, v2 = np.array(v1), np.array(v2)

    # Sum of the squared element-wise differences
    diff = np.power(v1 - v2, 2)
    sigma_val = np.sum(diff)

    return np.sqrt(sigma_val)


In [4]:
user_1 = [5, 1, 2, 4, 5]
# Note: the upper bound of randint is exclusive, so these ratings fall in 1..4
user_2 = np.random.randint(1, 5, size=(5,))
user_3 = np.random.randint(1, 5, size=(5,))

In [5]:
user_1, user_2, user_3

([5, 1, 2, 4, 5], array([4, 2, 4, 1, 2]), array([2, 2, 1, 2, 1]))

In [7]:
euclidean(user_1, user_2), euclidean(user_1, user_3)

(4.898979485566356, 5.5677643628300215)
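
As a cross-check, the same distance is the L2 norm of the difference vector, so NumPy's `np.linalg.norm` should agree with the hand-written function. The sketch below hard-codes the `user_2` values printed above (they came from a random draw, so a fresh run would differ).

```python
import numpy as np

user_1 = np.array([5, 1, 2, 4, 5])
user_2 = np.array([4, 2, 4, 1, 2])  # values copied from the output above

# Explicit formula: square root of the sum of squared differences
manual = np.sqrt(np.sum((user_1 - user_2) ** 2))

# Same quantity as the L2 norm of the difference vector
via_norm = np.linalg.norm(user_1 - user_2)

print(manual, via_norm)
```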

In [8]:
alice = [1, 1, 3, 2, 4]
bob = [2, 2, 4, 3, 5]

euclidean(alice, bob)

2.23606797749979

We get a Euclidean distance of about 2.24. However, on closer inspection, we see that Bob
always gives a rating that is exactly one higher than Alice's. Therefore, Alice's and Bob's
ratings are perfectly correlated. In other words, if we know Alice's rating for a movie, we
can compute Bob's rating for the same movie exactly (in this case, by just adding 1).

In [9]:
eve = [5, 5, 3, 4, 2]
euclidean(eve, alice)

6.324555320336759

We get a large distance of 6.32, which suggests that the two people are very dissimilar. With
Euclidean distance alone, that is all we could conclude. However, on inspection, we see that
Alice's and Eve's ratings for a movie always add up to 6. Therefore, although the two rate
movies very differently, one person's rating can be used to accurately predict the
corresponding rating of the other.
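
The two observations above (Bob is always one point above Alice, and Alice and Eve always sum to 6) can be checked directly. This is a small sketch that redefines the three rating vectors as arrays so that it runs on its own.

```python
import numpy as np

alice = np.array([1, 1, 3, 2, 4])
bob = np.array([2, 2, 4, 3, 5])
eve = np.array([5, 5, 3, 4, 2])

# Bob's rating minus Alice's rating, movie by movie
diff_bob_alice = bob - alice

# Alice's rating plus Eve's rating, movie by movie
sum_alice_eve = alice + eve

print(diff_bob_alice)  # [1 1 1 1 1]
print(sum_alice_eve)   # [6 6 6 6 6]
```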

## Pearson Correlation

Euclidean distance emphasizes magnitude and, as a result, cannot capture how closely two
sets of ratings move together. This is where the Pearson correlation comes into the picture.
The Pearson correlation is a score between -1 and 1, where -1 indicates total negative
correlation (as in the case of Alice and Eve), 1 indicates total positive correlation (as in
the case of Alice and Bob), and 0 indicates that there is no linear relationship between the
two entities (which is a weaker statement than saying they are independent).
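
To make the definition concrete, here is a minimal from-scratch sketch of the Pearson score: center each vector on its mean, then divide the sum of products of the deviations by the product of their magnitudes. The function name `pearson` is chosen here for illustration; it is not part of NumPy or SciPy.

```python
import numpy as np

def pearson(v1, v2):
    # Hypothetical helper for illustration, not a library function
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)

    # Deviations of each rating from that user's own mean
    d1 = v1 - v1.mean()
    d2 = v2 - v2.mean()

    # Covariance-like numerator over the product of the deviation norms
    return np.sum(d1 * d2) / np.sqrt(np.sum(d1 ** 2) * np.sum(d2 ** 2))

alice = [1, 1, 3, 2, 4]
bob = [2, 2, 4, 3, 5]
eve = [5, 5, 3, 4, 2]

r_alice_bob = pearson(alice, bob)
r_alice_eve = pearson(alice, eve)
print(r_alice_bob, r_alice_eve)
```

Up to floating-point rounding, the two values should be 1 and -1. Note that this sketch does not guard against a user who gives every movie the same rating: the deviations are then all zero and the division is undefined.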

In [10]:
from scipy.stats import pearsonr
pearsonr(alice, bob)

(1.0, 0.0)

In [11]:
pearsonr(alice, eve)

(-1.0, 0.0)

The first element of the output tuple is the Pearson score; the second is the p-value. We see
that Alice and Bob have the highest possible similarity score, whereas Alice and Eve have the
lowest possible score.

In [12]:
pearsonr(bob, eve)

(-1.0, 0.0)

## Cosine Similarity

Consider two vectors, x and y, both with zero mean. In this case, the Pearson correlation
score is exactly the same as the cosine similarity score. In other words, the Pearson
correlation is the cosine similarity of the mean-centered vectors.
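
The relationship can be checked numerically. The sketch below defines a small `cosine` helper (a name chosen here for illustration), applies it to the mean-centered ratings, and compares the result with NumPy's `np.corrcoef`. It also computes the cosine similarity of the raw, uncentered ratings to show that the two scores differ when the vectors do not have zero mean.

```python
import numpy as np

def cosine(v1, v2):
    # Hypothetical helper: dot product divided by the product of the norms
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

alice = np.array([1, 1, 3, 2, 4])
eve = np.array([5, 5, 3, 4, 2])

# Cosine similarity of the raw ratings (means are not zero)
raw_cos = cosine(alice, eve)

# Cosine similarity after subtracting each user's mean rating
centered_cos = cosine(alice - alice.mean(), eve - eve.mean())

# Pearson correlation as computed by NumPy
pearson_r = np.corrcoef(alice, eve)[0, 1]

print(raw_cos, centered_cos, pearson_r)
```

The raw cosine similarity is positive because all ratings are positive, while the centered version matches the Pearson score of -1 (up to floating-point rounding).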