# Unsupervised ML: distances

In this notebook we will learn about distances. Distances are important in unsupervised ML because they tell us how similar two things are: the smaller the distance, the more similar they are.

For example, we have 3 people: Raju, Soniya, and Amabel. We ask them to score how much they like burgers from 1 (bad) to 10 (amazing).
- Raju: 8
- Soniya: 9
- Amabel: 3

We can see that Raju and Soniya have a similar taste for burgers: there is only a distance of 1 between their scores (9 - 8). We can also see that Soniya and Amabel do not share a similar taste in burgers: there is a distance of 6 between their scores (9 - 3).
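
As a minimal sketch of the arithmetic above, the distance on a single food is just the absolute difference between two scores. The dictionary and the `score_distance` helper are names made up for this illustration; they are not part of the dataset loaded later.

```python
# Burger scores from the example above (1 = bad, 10 = amazing)
burger_scores = {"Raju": 8, "Soniya": 9, "Amabel": 3}


def score_distance(person_a, person_b):
    """Absolute difference between two people's burger scores (hypothetical helper)."""
    return abs(burger_scores[person_a] - burger_scores[person_b])


print(score_distance("Raju", "Soniya"))    # 1
print(score_distance("Soniya", "Amabel"))  # 6
```

Taking the absolute value means the order of the two people does not matter: the distance from Soniya to Amabel is the same as from Amabel to Soniya.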

Now if we repeat this for multiple foods such as peanuts, olives, eggs, etc., we can get an idea of whose tastes are similar and whose are not: we simply calculate the distances between all the scores.

Let's go through this example together, using distances to find the similarities (and differences) in food preferences.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
url = "https://drive.google.com/file/d/13OaXbtc_hFtdCI5vEPzHbR7CDx1WH97Q/view?usp=share_link" 
path = "https://drive.google.com/uc?export=download&id="+url.split("/")[-2]

df = pd.read_csv(path, index_col='student')
df.drop(['age','favorite_color'], axis=1, inplace=True)
df


In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Euclidean distance
The euclidean distance is simply the length of a straight line between 2 points.

There are other ways of measuring distances besides a straight line (mathematicians love tweaking things!), but for this week we will only be using the euclidean distance. If you remember the Pythagorean theorem from school, the euclidean distance is the hypotenuse of a right-angled triangle whose other two sides are the difference in x and the difference in y.

$$ euclideanDistance = \sqrt {\left( {x_1 - x_2 } \right)^2 + \left( {y_1 - y_2 } \right)^2 }$$
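
To make the link to the Pythagorean theorem concrete, here is a small sketch of the formula as a Python function. The function name and the two points are invented for illustration; the points are chosen so that the differences form the classic 3-4-5 right-angled triangle.

```python
import math


def euclidean_distance_2d(x1, y1, x2, y2):
    """Straight-line distance between the points (x1, y1) and (x2, y2)."""
    return math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)


# Hypothetical points: the x values differ by 3 and the y values differ by 4
print(euclidean_distance_2d(1, 1, 4, 5))  # 5.0

# math.hypot computes the hypotenuse directly from the two side lengths
print(math.hypot(4 - 1, 5 - 1))           # 5.0
```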


We might not all be familiar with mathematical notation, so let's see how the above formula relates to a DataFrame.

Above we explored the distance between students' tastes on one food. Now let's expand on that and look at the first 2 students and 2 foods.

### 2 Students 2 Foods

In [None]:
sample1 = df.iloc[0:2, 0:2].copy()
sample1

In [None]:
student1_food1 = sample1.iloc[0, 0]
student1_food2 = sample1.iloc[0, 1]

student2_food1 = sample1.iloc[1, 0]
student2_food2 = sample1.iloc[1, 1]

Now let's plot these points on a scatter plot to highlight what the x's and y's in the formula relate to.

In [None]:
#@title Plot euclidean
# Create the scatter plot
sample1.plot(kind="scatter", x=sample1.columns[0], y=sample1.columns[1], figsize=(12, 8));

# Add a line between the two points
x = [student1_food1, student2_food1]
y = [student1_food2, student2_food2]
plt.plot(x, y, c='r');

# Add names and coordinates
plt.annotate(sample1.index[0] + f" ({student1_food1}, {student1_food2})", (student1_food1+0.01, student1_food2+0.01));
plt.annotate(sample1.index[1] + f" ({student2_food1}, {student2_food2})", (student2_food1+0.01, student2_food2+0.01));

# Add title
plt.title('Plotting the euclidean distance with 2 dimensions');

Each student has an x-value (their food1 score) and a y-value (their food2 score). We can put these values into the formula above to calculate the euclidean distance between the 2 students.

In [None]:
e_dist_2_foods = ((student1_food1 - student2_food1)**2 + (student1_food2 - student2_food2)**2)**0.5

In [None]:
print(f"The euclidean distance between {sample1.index[0]} and {sample1.index[1]} is {e_dist_2_foods}")

### 2 Students 3 Foods

Now let's have a look at how we can plot 3 foods for 2 students.

Before, we only had 2 foods, which could be plotted on a chart with x and y. Now that we have 3 foods, we need to plot them in 3 dimensions, which means that our formula now includes x, y, and z.

$$ euclideanDistance = \sqrt {\left( {x_1 - x_2 } \right)^2 + \left( {y_1 - y_2 } \right)^2  + \left( {z_1 - z_2 } \right)^2 }$$
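
As a quick sanity check of the 3-dimensional formula, the sketch below computes it by hand and compares the result with `math.dist` from the standard library (available from Python 3.8). The two score tuples are made-up values, not rows from the course dataset.

```python
import math

# Hypothetical scores for 3 foods: (food1, food2, food3)
student_a = (8, 3, 6)
student_b = (4, 6, 6)

# The formula written out term by term: one squared difference per food
manual = (
    (student_a[0] - student_b[0]) ** 2
    + (student_a[1] - student_b[1]) ** 2
    + (student_a[2] - student_b[2]) ** 2
) ** 0.5

# math.dist applies the same formula to points of any (matching) length
built_in = math.dist(student_a, student_b)

print(manual, built_in)  # 5.0 5.0
```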

 

In [None]:
sample2 = df.iloc[0:2, 0:3].copy()
sample2

In [None]:
student1_food1 = sample2.iloc[0, 0]
student1_food2 = sample2.iloc[0, 1]
student1_food3 = sample2.iloc[0, 2]

student2_food1 = sample2.iloc[1, 0]
student2_food2 = sample2.iloc[1, 1]
student2_food3 = sample2.iloc[1, 2]

In [None]:
#@title Plot euclidean 3D
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Initialise plot
fig = plt.figure(figsize=(16, 12))
ax = fig.add_subplot(projection = '3d')

# Add dots to represent students, in the brackets are coordinates (x, y, z)
ax.scatter(student1_food1, student1_food2, student1_food3)
ax.scatter(student2_food1, student2_food2, student2_food3)

# Add a line between the two points
x = [student1_food1, student2_food1]
y = [student1_food2, student2_food2]
z = [student1_food3, student2_food3]
ax.plot(x, y, z, c='r');

# Label data points
ax.text(student1_food1+0.02, student1_food2+0.02, student1_food3+0.02, sample2.index[0] + f" ({student1_food1}, {student1_food2}, {student1_food3})")
ax.text(student2_food1+0.02, student2_food2+0.02, student2_food3+0.02, sample2.index[1] + f" ({student2_food1}, {student2_food2}, {student2_food3})")

# Label axes
ax.set_xlabel(sample2.columns[0])
ax.set_ylabel(sample2.columns[1])
ax.set_zlabel(sample2.columns[2])

# Add title
plt.title('Plotting the euclidean distance with 3 dimensions').set_position([.5, 0.97]);

plt.show()

In [None]:
e_dist_3_foods = ((student1_food1 - student2_food1)**2 + (student1_food2 - student2_food2)**2 + (student1_food3 - student2_food3)**2)**0.5

In [None]:
print(f"The euclidean distance between {sample2.index[0]} and {sample2.index[1]} is {e_dist_3_foods}")

#### CHALLENGE: Calculating euclidean distance between all the students for three foods
We'll take 20 minutes for you all to produce a DataFrame showing the distances between all of the students for the first 3 foods.

There are many ways to complete this and no single correct answer. Give it a go and see how far you get.

In [None]:
# your solution here


In [None]:
# here you can create a heatmap to visualise the distances

### 2 Students 4 Foods

$$ euclideanDistance = \sqrt {\left( {x_1 - x_2 } \right)^2 + \left( {y_1 - y_2 } \right)^2  + \left( {z_1 - z_2 } \right)^2 + \left( {a_1 - a_2 } \right)^2 }$$

The next obvious step after plotting 2 students 2 foods, then 2 students 3 foods, would be to plot 2 students 4 foods. However, there is a problem. When we make these plots, each column becomes a dimension. We can plot in 2 dimensions, and we can kind of plot in 3 dimensions, but we cannot draw 4 spatial dimensions on a chart. This is where our perception of space and mathematics diverge: the distance formula works just as well in 4 or more dimensions, with one squared difference per column, even though we cannot picture the result.
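
Although we cannot draw it, the calculation itself extends to any number of foods: square the difference for each column, add them all up, and take the square root. The sketch below assumes `numpy` is installed; the function name and the scores are invented for illustration.

```python
import numpy as np


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length sequences of scores."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # One squared difference per dimension, summed, then square-rooted
    return float(np.sqrt(np.sum((a - b) ** 2)))


# Hypothetical scores for 4 foods
student_a = [7, 2, 9, 4]
student_b = [5, 2, 6, 10]

print(euclidean_distance(student_a, student_b))  # 7.0
```

The same function works unchanged for 2, 3, 4, or 100 foods, because it never refers to x, y, or z by name.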

### All students and all foods

In [None]:
# your solution here

### scikit-learn
There are many Python libraries and modules that can calculate the euclidean distance for you; `numpy`, `scipy`, and `math` are just a few. For this project and the next few weeks we'll start to use a Python library called `scikit-learn`, a great open-source machine learning library. Its `pairwise_distances` function uses the euclidean metric by default and returns the distance between every pair of rows. Here's all the code you need to calculate the euclidean distances with scikit-learn:

In [None]:
from sklearn.metrics import pairwise_distances

pd.DataFrame(pairwise_distances(df), index=df.index, columns=df.index)
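
To see roughly what happens behind that one line, here is a sketch that builds the same kind of distance table with `numpy` broadcasting. It is not how scikit-learn is implemented, just one way to get an equivalent result. The burger scores come from the introduction; the olive scores and the `toy` DataFrame are made up for this example.

```python
import numpy as np
import pandas as pd

# Burger scores from the introduction; olive scores are hypothetical
toy = pd.DataFrame(
    {"burgers": [8, 9, 3], "olives": [2, 2, 10]},
    index=["Raju", "Soniya", "Amabel"],
)

values = toy.to_numpy(dtype=float)

# differences[i, j, k] = score of person i minus score of person j on food k
differences = values[:, None, :] - values[None, :, :]

# Square, sum over the foods, and take the square root for every pair (i, j)
distances = np.sqrt((differences ** 2).sum(axis=2))

toy_distances = pd.DataFrame(distances, index=toy.index, columns=toy.index)
print(toy_distances.round(2))
```

The table is symmetric with zeros on the diagonal: everyone is at distance 0 from themselves, and the distance from Raju to Amabel equals the distance from Amabel to Raju.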