# Similarity and Distance Metrics

In this notebook we'll cover three key similarity and distance metrics used in NLP: *Euclidean distance*, *cosine similarity*, and *dot product similarity*.

First, let's define three vectors - `a`, `b`, and `c`.

In [1]:
a = [0.01, 0.07, 0.1]
b = [0.01, 0.08, 0.11]
c = [0.91, 0.57, 0.6]

## Euclidean Distance

Euclidean distance is the simplest *similarity* metric - although it is more accurately a *distance*. It measures the straight-line distance between two points, so smaller values mean the vectors are more alike.

In Python (using NumPy) we would calculate the Euclidean distance like so:

In [5]:
import numpy as np

np.sqrt(sum(np.square(np.subtract(a, b))))

0.014142135623730944

We can confirm that our approach is correct by using the `scipy.spatial.distance` module:

In [2]:
from scipy.spatial import distance

distance.euclidean(a, b)

0.014142135623730944
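
To see how the metric behaves across all three vectors, here is a minimal sketch that uses `np.linalg.norm` as an alternative way to compute the same distance. The helper name `euclidean` is our own for illustration, not part of NumPy or SciPy.

```python
import numpy as np

a = [0.01, 0.07, 0.1]
b = [0.01, 0.08, 0.11]
c = [0.91, 0.57, 0.6]

def euclidean(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    # The L2 norm of the difference vector is the straight-line distance
    return np.linalg.norm(np.asarray(u) - np.asarray(v))

dist_ab = euclidean(a, b)  # a and b sit close together
dist_ac = euclidean(a, c)  # c is much further away from a
print(dist_ab, dist_ac)
```

The distance between `a` and `b` is far smaller than the distance between `a` and `c`, which matches the intuition that `a` and `b` are the more similar pair.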

## Dot Product

The dot product considers both direction and magnitude.

We can calculate the dot product easily with NumPy:

In [6]:
np.dot(a, b)

0.0167

In plain Python, this is written as:

In [7]:
a[0]*b[0] + a[1]*b[1] + a[2]*b[2]

0.016700000000000003

The main drawback of the dot product is that it is not normalized by scale, so vectors with larger magnitudes tend to score higher dot products even when they are less similar. For example, vectors `a` and `c` are each identical to themselves - but the dot product scores `c` with itself far higher than `a` with itself:

In [13]:
np.dot(a, a)

0.015000000000000003

In [14]:
np.dot(c, c)

1.513
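
The same effect shows up when comparing different vectors. As a small sketch using the vectors defined at the top of the notebook, we can score `a` against its near-duplicate `b` and against the much larger `c`:

```python
import numpy as np

a = [0.01, 0.07, 0.1]
b = [0.01, 0.08, 0.11]
c = [0.91, 0.57, 0.6]

dot_ab = np.dot(a, b)  # a and b are almost identical, but both are small
dot_ac = np.dot(a, c)  # c points in a different direction, but is much larger
print(dot_ab, dot_ac)
```

Here `np.dot(a, c)` comes out several times larger than `np.dot(a, b)`, even though `b` is clearly the closer match to `a`. The magnitude of `c` dominates the score.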

And so we must find a way to normalize...

## Cosine Similarity

Cosine similarity is through-and-through a *similarity* metric. This is because, if two vectors are oriented in the same direction, the angle between them will be *very small* - meaning the cosine similarity will be *very large*, close to 1 (i.e. they are very similar).

The cosine similarity calculation takes the dot product between two vectors (which considers both magnitude and direction), and divides it by the product of the magnitudes of both vectors (the length of each, multiplied together). This process means that we calculate the `(magnitude and direction) / magnitude` - leaving us with just the direction - i.e. the angular/directional similarity.

So this metric is a *normalized* dot product!

In [8]:
np.dot(a, b) / (np.sqrt(sum(np.square(a))) * np.sqrt(sum(np.square(b))))

0.9998028479490471
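
The calculation above can be wrapped in a small reusable helper. This is a minimal sketch that swaps the manual square-root-of-sum-of-squares for `np.linalg.norm`; the function name `cosine_sim` is our own and is not a NumPy or scikit-learn API.

```python
import numpy as np

a = [0.01, 0.07, 0.1]
b = [0.01, 0.08, 0.11]
c = [0.91, 0.57, 0.6]

def cosine_sim(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

sim_ab = cosine_sim(a, b)  # nearly the same direction
sim_ac = cosine_sim(a, c)  # noticeably different direction
print(sim_ab, sim_ac)
```

Unlike the raw dot product, this ranks `b` as more similar to `a` than `c` is, because the magnitude of `c` has been divided out.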

Again, we can confirm this using another implementation, this time from `sklearn`:

In [9]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity([a], [b])

array([[0.99980285]])

Let's compare this to two copies of the same vector (i.e. identical vectors):

In [10]:
cosine_similarity([a], [a])

array([[1.]])

And we will get the exact same value for `c`:

In [15]:
cosine_similarity([c], [c])

array([[1.]])

So, it seems that *cosine similarity* is the metric to use at all times? Well, no. We will still often use *dot product* similarity because it is less computationally expensive (important for large datasets). With cosine similarity we compute the dot product and then normalize, which adds to the computational cost.
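
One way to get the best of both, assuming the vectors can be normalized once ahead of time, is to scale each vector to unit length and then use plain dot products. For unit-length vectors the dot product and the cosine similarity are the same number. The sketch below illustrates this with our three vectors; the variable names `unit` and `sims` are our own.

```python
import numpy as np

a = [0.01, 0.07, 0.1]
b = [0.01, 0.08, 0.11]
c = [0.91, 0.57, 0.6]

vectors = np.array([a, b, c])

# Scale every row to unit length (done once, up front)
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Dot products between unit vectors equal their cosine similarities
sims = unit @ unit.T
print(sims)
```

Row `i`, column `j` of `sims` holds the similarity between vector `i` and vector `j`, so the diagonal is all ones and `sims[0, 1]` matches the cosine similarity of `a` and `b` computed earlier.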

Here's a little walkthrough of dot product and cosine similarity calculations for our three vectors:

![Dot product and cosine similarity](../../assets/images/dot_product_and_cosine_similarity_workthrough.png)