### Euclidean distance
The Euclidean distance is the length of the line segment joining two data
points in n-dimensional Cartesian space.

More generally, consider two n-dimensional points (or vectors):
* <b>v1</b>: $(q_1, q_2, \dots, q_n)$
* <b>v2</b>: $(r_1, r_2, \dots, r_n)$

Then, the Euclidean score is mathematically defined as:
$$d(v_1, v_2) = \sqrt{\sum_{i=1}^n (q_i-r_i)^2}$$

A Euclidean score can be any non-negative real number, and it is 0 only when the two vectors are identical. The lower the Euclidean score (or distance), the more similar the two vectors are to each other.

In [4]:
import numpy as np

In [1]:
#Function to compute Euclidean Distance.
def euclidean(v1, v2):
    #Convert 1-D Python lists to numpy vectors
    v1 = np.array(v1)
    v2 = np.array(v2)

    #Compute vector which is the element-wise square of the difference
    diff = np.power(v1 - v2, 2)

    #Perform summation of the elements of the above vector
    sigma_val = np.sum(diff)

    #Compute square root and return final Euclidean score
    euclid_score = np.sqrt(sigma_val)
    
    return euclid_score

In [2]:
#Define 3 users with ratings for 5 movies
u1 = [5, 1, 2, 4, 5]
u2 = [1, 5, 4, 2, 1]
u3 = [5, 2, 2, 4, 4]

From the ratings, we can see that users 1 and 2 have extremely different tastes, whereas the tastes of users 1 and 3 are largely similar.

In [5]:
euclidean(u1, u2)

7.483314773547883

In [6]:
euclidean(u1, u3)

1.4142135623730951

Users 1 and 3 have a much smaller Euclidean score between them than users 1 and 2. Therefore, in this case, the Euclidean distance was able to satisfactorily capture the relationships between our users.
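The same distance can also be computed in one step with NumPy's `np.linalg.norm`, since the Euclidean distance is the L2 norm of the difference vector. The following is a minimal sketch; the helper name `euclidean_norm` is hypothetical and is only used here for illustration.

```python
import numpy as np

def euclidean_norm(v1, v2):
    # Hypothetical helper: Euclidean distance as the L2 norm of (v1 - v2)
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    return np.linalg.norm(v1 - v2)

# Same three users as above
u1 = [5, 1, 2, 4, 5]
u2 = [1, 5, 4, 2, 1]
u3 = [5, 2, 2, 4, 4]

print(euclidean_norm(u1, u2))  # distance between users 1 and 2
print(euclidean_norm(u1, u3))  # distance between users 1 and 3
```

Both calls should agree with the step-by-step `euclidean` function up to floating-point rounding.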

### Pearson correlation

Consider two users, Alice and Bob, who have rated the same five movies. Alice is extremely stingy with her ratings and never gives more than a 4 to any movie. On the other hand, Bob is more liberal and never gives anything below a 2 when rating movies. Let's define the vectors representing Alice's and Bob's ratings and compute their Euclidean distance:

In [7]:
alice = [1,1,3,2,4]
bob = [2,2,4,3,5]

euclidean(alice, bob)

2.23606797749979

We get a Euclidean distance of about 2.24. However, on closer inspection, we see that Bob always gives a rating that is exactly one higher than Alice's. Therefore, we can say that Alice's and Bob's ratings are perfectly correlated. In other words, if we know Alice's rating for a movie, we can compute Bob's rating for the same movie with high accuracy (in this case, by just adding 1).

Consider another user, Eve, who has the polar opposite tastes to Alice:

In [8]:
eve = [5,5,3,4,2]

euclidean(eve, alice)

6.324555320336759

We get a very high score of 6.32, which indicates that the two people are very dissimilar. If we used Euclidean distances, we would not be able to do much beyond this. However, on inspection, we see that Alice's and Eve's ratings for a movie always add up to 6. Therefore, although the two have very different tastes, one's rating can be used to accurately predict the corresponding rating of the other. Mathematically speaking, we say Alice's and Eve's ratings are strongly negatively correlated.

Euclidean distance depends on the magnitude of the ratings, so it cannot tell us how strongly two users' ratings move together. This is where the Pearson correlation comes into the picture. The Pearson correlation is a score between -1 and 1, where -1 indicates total negative correlation (as with Alice and Eve), 1 indicates total positive correlation (as with Alice and Bob), and 0 indicates no linear correlation between the two entities. Note that a score of 0 does not by itself imply that the two are independent of each other.

Mathematically, the Pearson correlation is defined as follows:
$$r=\frac{\sum_{i=1}^n (x_i-\overline{x})(y_i-\overline{y})}{\sqrt{\sum_{i=1}^n (x_i-\overline{x})^2} \sqrt{\sum_{i=1}^n (y_i-\overline{y})^2}}$$
Here, $\overline{x}$ and $\overline{y}$ denote the means of all the elements in vectors x and y, respectively.
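To make the formula concrete, it can be written out directly in NumPy. This is a minimal sketch, not a replacement for a library routine: the function name `pearson` is hypothetical, and the function assumes neither vector is constant (otherwise the denominator is zero).

```python
import numpy as np

def pearson(x, y):
    # Hypothetical helper that follows the formula term by term
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Center each vector by subtracting its mean
    x_centered = x - x.mean()
    y_centered = y - y.mean()

    # Numerator: sum of products of the centered values
    numerator = np.sum(x_centered * y_centered)

    # Denominator: product of the square roots of the summed squares
    denominator = np.sqrt(np.sum(x_centered ** 2)) * np.sqrt(np.sum(y_centered ** 2))

    return numerator / denominator

alice = [1, 1, 3, 2, 4]
bob = [2, 2, 4, 3, 5]
eve = [5, 5, 3, 4, 2]

print(pearson(alice, bob))  # close to 1: Bob is always Alice + 1
print(pearson(alice, eve))  # close to -1: Eve is always 6 - Alice
```

Because the means are subtracted first, adding a constant to every rating (as Bob effectively does) leaves the score unchanged.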

In [9]:
from scipy.stats import pearsonr

pearsonr(alice, bob)

PearsonRResult(statistic=1.0, pvalue=0.0)

In [10]:
pearsonr(alice, eve)

PearsonRResult(statistic=-1.0, pvalue=0.0)

The `statistic` field of the result is the Pearson score, and the `pvalue` field is the associated p-value. We see that Alice and Bob have the highest possible similarity score, whereas Alice and Eve have the lowest possible score.

In [11]:
pearsonr(bob, eve)

PearsonRResult(statistic=-1.0, pvalue=0.0)

### Cosine similarity
Mathematically, the cosine similarity is defined as follows:
$$\text{cosine}(x, y)=\frac{x \cdot y}{\|x\| \, \|y\|}$$
The cosine similarity score is the cosine of the angle between two vectors in an n-dimensional space. When the cosine score is 1 (an angle of 0 degrees), the vectors point in exactly the same direction. On the other hand, a cosine score of -1 (an angle of 180 degrees) denotes that the two vectors point in exactly opposite directions.
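As an illustration, the definition can be sketched in NumPy as a dot product divided by the product of the two vector norms. The function name `cosine_similarity` is hypothetical here, and the sketch assumes neither vector is all zeros.

```python
import numpy as np

def cosine_similarity(x, y):
    # Hypothetical helper: dot product divided by the product of the L2 norms
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Simple 2-D checks of the three landmark angles
print(cosine_similarity([1, 0], [2, 0]))   # same direction (0 degrees)
print(cosine_similarity([1, 0], [0, 3]))   # perpendicular (90 degrees)
print(cosine_similarity([1, 0], [-4, 0]))  # opposite direction (180 degrees)

# Ratings from earlier in the section
alice = [1, 1, 3, 2, 4]
eve = [5, 5, 3, 4, 2]
print(cosine_similarity(alice, eve))
```

Note that raw rating vectors contain only non-negative values, so their cosine score can never fall below 0, even for users with opposite tastes such as Alice and Eve.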

Now, consider two vectors, x and y, both with zero mean. In this case, $x_i-\overline{x}=x_i$ and $y_i-\overline{y}=y_i$, so the Pearson formula reduces to $\frac{\sum_{i=1}^n x_i y_i}{\sqrt{\sum_{i=1}^n x_i^2} \sqrt{\sum_{i=1}^n y_i^2}}$, which is exactly the cosine similarity score. In other words, for centered vectors with zero mean, the Pearson correlation is the cosine similarity score.
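This relationship can be checked numerically. The sketch below centers two rating vectors, computes the cosine similarity of the centered versions, and compares it with the Pearson score from `np.corrcoef`. The helper name `cosine_similarity` is hypothetical and defined here only for the comparison.

```python
import numpy as np

def cosine_similarity(x, y):
    # Hypothetical helper: dot product divided by the product of the L2 norms
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

u1 = np.array([5, 1, 2, 4, 5], dtype=float)
u3 = np.array([5, 2, 2, 4, 4], dtype=float)

# Center each vector so that it has zero mean
u1_centered = u1 - u1.mean()
u3_centered = u3 - u3.mean()

cosine_of_centered = cosine_similarity(u1_centered, u3_centered)
pearson_score = np.corrcoef(u1, u3)[0, 1]  # off-diagonal entry is the Pearson score
raw_cosine = cosine_similarity(u1, u3)     # cosine without centering, for contrast

print(cosine_of_centered, pearson_score)
print(np.isclose(cosine_of_centered, pearson_score))
print(raw_cosine)
```

The cosine of the centered vectors matches the Pearson score up to floating-point rounding, while the cosine of the uncentered ratings generally gives a different value.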