# Similarity and Distance Metrics

In this notebook we'll cover three key similarity and distance metrics used in NLP: *Euclidean distance*, *cosine similarity*, and *dot product similarity*.

First, let's define three vectors - `a`, `b`, and `c`.

In [2]:
a = [0.01, 0.07, 0.1]
b = [0.01, 0.08, 0.11]
c = [0.91, 0.57, 0.6]

## Euclidean Distance

Euclidean distance is the simplest of the three metrics, and the only one that measures *distance* rather than *similarity*: smaller values mean the vectors are closer together. It is the **L2 norm** of the difference between two vectors. Given two vectors **u** and **v** it is calculated using:

$$
d(u, v) = \sqrt{\sum_{i=1}^{n}(u_i - v_i)^2}
$$

So for our vectors **a** and **b** this would look like:

$$
d(a, b) = \sqrt{(b_1 - a_1)^2 + (b_2 - a_2)^2 + (b_3 - a_3)^2} = \sqrt{(0.01 - 0.01)^2 + (0.08 - 0.07)^2 + (0.11 - 0.1)^2} = 0.0141
$$

In Python (using NumPy) we would calculate the Euclidean distance like so:

In [8]:
import numpy as np

np.sqrt(sum(np.square(np.subtract(a, b))))

0.014142135623730944

The same calculation for **b** and **c** gives a much larger distance:

In [10]:
np.sqrt(sum(np.square(np.subtract(b, c))))

1.1358697108383513

We can confirm that our approach is correct by using the `scipy.spatial` `distance` module:

In [2]:
from scipy.spatial import distance

distance.euclidean(a, b)

0.014142135623730944
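
As a more compact alternative, the same distance can be written as the L2 norm of the difference vector using `np.linalg.norm`. This is only a minimal sketch; the helper name `euclidean_distance` is introduced here for illustration and is not part of NumPy or SciPy.

```python
import numpy as np

a = [0.01, 0.07, 0.1]
b = [0.01, 0.08, 0.11]
c = [0.91, 0.57, 0.6]

def euclidean_distance(u, v):
    """L2 norm of the difference between two equal-length vectors.

    Illustrative helper (hypothetical name), not a library function.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.linalg.norm(u - v))

d_ab = euclidean_distance(a, b)  # a and b sit close together
d_bc = euclidean_distance(b, c)  # b and c sit far apart
print(d_ab, d_bc)
```

Converting to arrays first means the helper accepts plain Python lists as well as NumPy arrays.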

## Dot Product

The dot product considers both direction and magnitude. It is calculated as:

$$
u \cdot v = \lVert u \rVert \lVert v \rVert \cos \theta = \sum_{i=1}^{n} u_i v_i
$$

For our vectors **a** and **b**:

$$
a \cdot b = (a_1 b_1) + (a_2 b_2) + (a_3 b_3) = (0.01 * 0.01) + (0.07 * 0.08) + (0.1 * 0.11) = 0.0167
$$

We can calculate the dot product easily with NumPy:

In [6]:
np.dot(a, b)

0.0167

*(For 1-D arrays like these, `np.matmul` returns the same inner product as `np.dot`. Transposing a 1-D array has no effect, so the `.T` below is not strictly needed.)*

In [6]:
np.matmul(a, np.array(b).T)

0.0167

Which is written in plain Python as:

In [7]:
a[0]*b[0] + a[1]*b[1] + a[2]*b[2]

0.016700000000000003

A drawback of the dot product is that it is not normalized by scale, so vectors with larger magnitudes tend to produce larger dot products even when they are less similar. For example, `a` is identical to itself, yet the dot product scores `a` against `c` higher than `a` against itself, and `c` against itself higher still:

In [13]:
np.dot(a, a)

0.015000000000000003

In [11]:
np.dot(a, c)

0.10899999999999999

In [14]:
np.dot(c, c)

1.513
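
To isolate the effect of scale, the sketch below compares `a` with a copy of itself stretched to ten times the length. The name `a_scaled` and the factor of 10 are arbitrary choices made for this illustration.

```python
import numpy as np

a = np.array([0.01, 0.07, 0.1])

# a_scaled points in exactly the same direction as a, but is 10x longer
# (the factor 10 is an arbitrary illustrative choice)
a_scaled = 10 * a

dot_original = np.dot(a, a)
dot_scaled = np.dot(a, a_scaled)

# The dot product grows with the scale factor although the direction is unchanged
print(dot_original, dot_scaled)
```

The direction of the two vectors is identical, so any difference between the two scores comes from magnitude alone.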

And so we must find a way to normalize...

## Cosine Similarity

Cosine similarity is through-and-through a *similarity* metric. It measures the cosine of the angle between two vectors: if they are oriented in the same direction, the angle between them is *very small* and the cosine similarity is close to 1 (i.e. they are similar). As the angle grows the cosine similarity falls, reaching 0 for orthogonal vectors.

We calculate it like so:

$$
sim(u, v) = \frac{u \cdot v}{\lVert u \rVert \lVert v \rVert} = \frac{\sum_{i=1}^{n}u_i v_i}{\sqrt{\sum_{i=1}^{n}u_{i}^2}\sqrt{\sum_{i=1}^{n}v_{i}^2}}
$$

The cosine similarity calculation takes the dot product between two vectors (which considers both magnitude and direction), and divides it by the product of the two vectors' magnitudes (their lengths multiplied together). In effect we calculate `(magnitude and direction) / magnitude`, leaving us with just the direction, i.e. the angular/directional similarity.

So this metric is like a *normalized* dot product!
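
The normalization can be checked directly: rescaling one of the vectors should leave the score unchanged. The following is a minimal sketch, where `cosine_sim` is a hypothetical helper written for this illustration and assumes neither vector is all zeros.

```python
import numpy as np

def cosine_sim(u, v):
    """Dot product divided by the product of the two vector lengths.

    Illustrative helper (hypothetical name); undefined for zero vectors,
    since that would divide by zero.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = [0.01, 0.07, 0.1]
c = [0.91, 0.57, 0.6]

sim_ac = cosine_sim(a, c)
# Stretching a by an arbitrary factor changes its length but not its direction
sim_scaled = cosine_sim(10 * np.array(a), c)
print(sim_ac, sim_scaled)
```

Both calls should agree up to floating-point error, because the scale factor appears in the numerator and the denominator and cancels.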

We can apply this to our vectors **a** and **b**:

$$
sim(a, b) = \frac{(a_1 * b_1) + (a_2 * b_2) + (a_3 * b_3)}{\sqrt{a_{1}^2+a_{2}^2+a_{3}^2}\sqrt{b_{1}^2+b_{2}^2+b_{3}^2}} = \frac{(0.01 * 0.01) + (0.07 * 0.08) + (0.1 * 0.11)}{\sqrt{0.01^2+0.07^2+0.1^2}\sqrt{0.01^2+0.08^2+0.11^2}} = \frac{0.0167}{0.016703} = 0.9998
$$

And in Python with NumPy:

In [8]:
np.dot(a, b) / (np.sqrt(sum(np.square(a))) * np.sqrt(sum(np.square(b))))

0.9998028479490471

Again, we can confirm this using another implementation, this time from `sklearn`:

In [13]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity([a], [b])

array([[0.99980285]])

Let's compare this to two copies of the same vector (i.e. exactly similar):

In [10]:
cosine_similarity([a], [a])

array([[1.]])

And we will get the exact same value for `c`:

In [15]:
cosine_similarity([c], [c])

array([[1.]])

Comparing `a` with `c`, which point in different directions, gives a noticeably lower score:

In [14]:
cosine_similarity([a], [c])

array([[0.7235381]])

So is *cosine similarity* always the metric to use? Not necessarily. *Dot product* similarity is still widely used because it is cheaper to compute, which matters for large datasets: cosine similarity computes the same dot product and then also divides by the two vector magnitudes, which adds to the cost of each comparison.
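
One way to see the relationship between the two metrics: if every vector is scaled to unit length first, the dot product and cosine similarity give the same number. The sketch below assumes the vectors can be normalized once up front; the names `vectors`, `unit`, and `sims` are illustrative only.

```python
import numpy as np

vectors = np.array([
    [0.01, 0.07, 0.1],   # a
    [0.01, 0.08, 0.11],  # b
    [0.91, 0.57, 0.6],   # c
])

# Normalize each row to unit length once (assumes no all-zero rows)
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Dot products between unit vectors equal their cosine similarities;
# sims[i, j] compares row i with row j
sims = unit @ unit.T
print(np.round(sims, 4))
```

With this approach the division happens once per vector rather than once per comparison, and each later comparison is a plain dot product.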

Here's a little walkthrough of dot product and cosine similarity calculations for our three vectors:

![Dot product and cosine similarity](../../assets/images/dot_product_and_cosine_similarity_workthrough.png)