# Euclidean Distances 

Euclidean distance measures the straight-line distance between two points in a multi-dimensional space. It is calculated as the square root of the sum of the squared differences between the coordinates of the two points. In machine learning it is often used as a similarity metric between two data points: the smaller the distance, the more similar the points. Clustering algorithms, for example, tend to group data points that are close together in Euclidean distance into the same cluster. It is also used in the K-nearest neighbors algorithm, and the least-squares objective of linear regression minimizes a squared Euclidean distance between predictions and targets.
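As a minimal sketch of the definition above, the distance can be computed directly from the coordinates. The helper name `euclidean_distance` and the sample points are assumptions made for illustration; `math.dist` (Python 3.8+) is the standard-library equivalent.

```python
from math import dist, sqrt

def euclidean_distance(p, q):
    """Square root of the sum of squared coordinate differences."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

# Illustrative points in 2D and 3D (made up for this sketch)
d2 = euclidean_distance((0.0, 0.0), (3.0, 4.0))
d3 = euclidean_distance((1.0, 2.0, 3.0), (4.0, 6.0, 3.0))
print(d2, d3)

# Standard-library equivalent of the 2D case
print(dist((0.0, 0.0), (3.0, 4.0)))
```

The same function works for any number of dimensions, as long as both points have the same length.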

We will calculate the Euclidean distances for one object and display its Silhouette Score, in order to illustrate the math behind <code>euclidean_distances</code> in <code>sklearn.metrics</code>.

For the purposes of illustration, we will only calculate the distances between one object and the other objects in its own cluster, the distances between that object and the elements of the other clusters, and the Silhouette Score of that object.

Let's imagine we have three clusters, each containing three people. To name these people (the elements within the clusters), we will install and import the library <code>names</code>.

In [1]:
! pip install names



The formula to calculate the Silhouette Score is the following:
# <center> $ s = \frac{\beta - \alpha}{\max(\alpha, \beta)} $ </center>

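$\alpha$ (alpha) = the average distance between our element and the other elements within its own cluster

$\beta$ (beta) = the average distance between our element and the elements of the nearest other cluster, that is, the smallest of the per-cluster average distances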
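The formula and the two definitions above can be sketched as a small function. This is only an illustration under stated assumptions: the function name `silhouette_of_point`, its parameters, and the toy points are hypothetical and are not part of the notebook's class.

```python
from math import dist

def silhouette_of_point(point, own_cluster, other_clusters):
    """Silhouette score of a single point.

    own_cluster: the other members of the point's cluster (point excluded).
    other_clusters: a list of clusters, each a list of points.
    """
    # alpha: mean distance to the other members of the same cluster
    alpha = sum(dist(point, p) for p in own_cluster) / len(own_cluster)
    # beta: smallest mean distance to the members of any other cluster
    beta = min(
        sum(dist(point, p) for p in cluster) / len(cluster)
        for cluster in other_clusters
    )
    return (beta - alpha) / max(alpha, beta)

# Made-up toy data for this sketch
point = (0.0, 0.0)
own = [(0.0, 1.0), (1.0, 0.0)]        # mean distance 1.0, so alpha = 1.0
others = [
    [(4.0, 0.0), (0.0, 4.0)],         # mean distance 4.0, so beta = 4.0
    [(10.0, 0.0)],                    # mean distance 10.0
]
score = silhouette_of_point(point, own, others)
print(score)  # (4.0 - 1.0) / 4.0 = 0.75
```

The score always lies between -1 and 1; values close to 1 mean the point is much closer to its own cluster than to the nearest other cluster.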

The Euclidean distance formula can be found below. To calculate it, we will import the function <code>sqrt</code> (square root) from the module <code>math</code>.

To make the results easier to read, we will return DataFrames using <code>pandas</code>.

<img src = 'alfa.png' width = 200> <img src = 'beta.png' width = 200> 

<img src = 'distancia.png'> 

In [2]:
import names 
import pandas as pd
from math import sqrt

We will use Object-Oriented Programming. After creating the class <code>Euclidian</code> we will instantiate an object that receives the parameters <code>a</code>, <code>b</code>, <code>c</code>, <code>d</code>, <code>e</code>, <code>f</code>, <code>g</code>, <code>h</code> and <code>i</code>.

These parameters will be names randomly generated by <code>names.get_first_name()</code>, and the coordinates will be arbitrary, since the purpose is only to show how to calculate the Silhouette Score.

The function <code>clusters</code> will return three strings listing the elements in each cluster; the first one also identifies A.

The function <code>cluster_a</code> will print the Euclidean distances between A - B and A - C, as well as <code>self.alpha</code>, and return a DataFrame with the elements of the cluster and their respective <code>x</code> and <code>y</code> coordinates. It must be called before the other methods, because it stores the coordinates of A and the value of $\alpha$.

The function <code>cluster_b</code> will print the Euclidean distances between A - D, A - E and A - F, as well as the average distance between A and Cluster 2 and the coordinates of A, and return a DataFrame with the elements of the cluster and their respective <code>x</code> and <code>y</code> coordinates.

The function <code>cluster_c</code> will print the Euclidean distances between A - G, A - H and A - I, as well as the average distance between A and Cluster 3 and the coordinates of A, and return a DataFrame with the elements of the cluster and their respective <code>x</code> and <code>y</code> coordinates.

The function <code>silhouette</code> will print the values of $\alpha$ (alpha) and $\beta$ (beta) and return the Silhouette Score of A.

The Silhouette Score of a cluster is the average of the scores of all of its elements. We will calculate only the score of A for purposes of demonstration.
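For completeness, the cluster-level average described above could be sketched as follows. The coordinates are the ones hard-coded in the notebook's class; the dictionary layout and the helper names `mean_distance` and `silhouette` are assumptions made for this illustration, not the notebook's own code.

```python
from math import dist

# Coordinates copied from the notebook's hard-coded values (A..I in order)
clusters = {
    "cluster1": [(1.0, 0.9), (1.0, 1.7), (1.3, 1.5)],
    "cluster2": [(1.3, 3.9), (1.5, 4.3), (1.6, 3.7)],
    "cluster3": [(2.1, 0.8), (2.2, 1.5), (2.5, 0.9)],
}

def mean_distance(point, points):
    """Average Euclidean distance from point to every element of points."""
    return sum(dist(point, p) for p in points) / len(points)

def silhouette(point, own_name):
    """Silhouette score of one point that belongs to clusters[own_name]."""
    # Exclude the point itself (by value, which is enough for this data)
    own = [p for p in clusters[own_name] if p != point]
    alpha = mean_distance(point, own)
    beta = min(
        mean_distance(point, pts)
        for name, pts in clusters.items()
        if name != own_name
    )
    return (beta - alpha) / max(alpha, beta)

# Score of every element of cluster 1, then the cluster average
scores = [silhouette(p, "cluster1") for p in clusters["cluster1"]]
cluster1_score = sum(scores) / len(scores)
print(scores, cluster1_score)
```

Because this sketch does not round intermediate values, the score it gives for A is close to 0.44, slightly different from the value the notebook prints after rounding each distance to two decimals.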

In [3]:
class Euclidian:
    def __init__ (self,a,b,c,d,e,f,g,h,i):
        self.a = a
        self.b = b
        self.c = c
        self.d = d
        self.e = e
        self.f = f
        self.g = g
        self.h = h
        self.i = i
    
    def clusters(self):
        self.cluster1 = [self.a, self.b, self.c]
        self.cluster2 = [self.d, self.e, self.f]
        self.cluster3 = [self.g, self.h, self.i]
        return f"A: {self.a}, Cluster 1: {self.cluster1}", f"Cluster 2: {self.cluster2}", f"Cluster 3: {self.cluster3}"
    def cluster_a(self):
        self.dist_a_x = 1.0 # x coordinate of A
        self.dist_a_y = 0.9 # y coordinate of A
        dist_b_x = 1.0 # x coordinate of B
        dist_b_y = 1.7 # y coordinate of B
        dist_c_x = 1.3 # x coordinate of C
        dist_c_y = 1.5 # y coordinate of C
        index = self.cluster1
        data = {'x' : [self.dist_a_x, dist_b_x, dist_c_x], 'y':[self.dist_a_y, dist_b_y, dist_c_y]}
        self.df1 = pd.DataFrame(data=data, index=index)
        euclidean_a_b = round(sqrt((dist_b_x - self.dist_a_x)**2 + (dist_b_y - self.dist_a_y)**2),2)
        euclidean_a_c = round(sqrt((dist_c_x - self.dist_a_x)**2 + (dist_c_y - self.dist_a_y)**2),2)
        self.alpha = round((euclidean_a_b + euclidean_a_c)/2, 2) 
        print(f"Euclidean distance {self.a} to {self.b}: {euclidean_a_b}")
        print(f"Euclidean distance {self.a} to {self.c}: {euclidean_a_c}")
        print(f"Silhouette Alpha: {self.alpha}")
        return self.df1
    
    def cluster_b(self):
        dist_d_x = 1.3 # x coordinate of D
        dist_d_y = 3.9 # y coordinate of D
        dist_e_x = 1.5 # x coordinate of E
        dist_e_y = 4.3 # y coordinate of E
        dist_f_x = 1.6 # x coordinate of F
        dist_f_y = 3.7 # y coordinate of F
        index = self.cluster2 
        data = {'x' : [dist_d_x, dist_e_x, dist_f_x], 'y':[dist_d_y, dist_e_y,  dist_f_y]}
        self.df2 = pd.DataFrame(data=data, index=index)
        euclidean_a_d = round(sqrt((dist_d_x - self.dist_a_x)**2 + (dist_d_y - self.dist_a_y)**2),2)
        euclidean_a_e = round(sqrt((dist_e_x - self.dist_a_x)**2 + (dist_e_y - self.dist_a_y)**2),2)
        euclidean_a_f = round(sqrt((dist_f_x - self.dist_a_x)**2 + (dist_f_y - self.dist_a_y)**2),2)
        self.avg_dist_a_c2 = round((euclidean_a_d + euclidean_a_e + euclidean_a_f)/3,2)
        print(f"Euclidean distance {self.a} to {self.d}: {euclidean_a_d}")
        print(f"Euclidean distance {self.a} to {self.e}: {euclidean_a_e}")
        print(f"Euclidean distance {self.a} to {self.f}: {euclidean_a_f}")
        print(f"Average distance {self.a} to Cluster 2: {self.avg_dist_a_c2}")
        print("\n")
        print(self.df1.iloc[0].to_frame())
        return self.df2
    
    def cluster_c(self):
        dist_g_x = 2.1 # x coordinate of G
        dist_g_y = 0.8 # y coordinate of G
        dist_h_x = 2.2 # x coordinate of H
        dist_h_y = 1.5 # y coordinate of H
        dist_i_x = 2.5 # x coordinate of I
        dist_i_y = 0.9 # y coordinate of I
        index = self.cluster3 
        data = {'x' : [dist_g_x, dist_h_x, dist_i_x], 'y':[dist_g_y, dist_h_y,  dist_i_y]}
        self.df3 = pd.DataFrame(data=data, index=index)
        euclidean_a_g = round(sqrt((dist_g_x - self.dist_a_x)**2 + (dist_g_y - self.dist_a_y)**2),2)
        euclidean_a_h = round(sqrt((dist_h_x - self.dist_a_x)**2 + (dist_h_y - self.dist_a_y)**2),2)
        euclidean_a_i = round(sqrt((dist_i_x - self.dist_a_x)**2 + (dist_i_y - self.dist_a_y)**2),2)
        self.avg_dist_a_c3 = round((euclidean_a_g + euclidean_a_h + euclidean_a_i)/3,2)
        print(f"Euclidean distance {self.a} to {self.g}: {euclidean_a_g}")
        print(f"Euclidean distance {self.a} to {self.h}: {euclidean_a_h}")
        print(f"Euclidean distance {self.a} to {self.i}: {euclidean_a_i}")
        print(f"Average distance {self.a} to Cluster 3: {self.avg_dist_a_c3}")
        print("\n")
        print(self.df1.iloc[0].to_frame())
        return self.df3
    
    def silhouette(self):
        beta = min([self.avg_dist_a_c2,self.avg_dist_a_c3])
        max_alpha_beta = max([beta, self.alpha])
        silhouette = (beta - self.alpha)/max_alpha_beta
        print(f"Alpha: {self.alpha}")
        print(f"Beta: {beta}")
        return f"Silhouette Score of {self.a}: {silhouette}"
    
    
    

In [4]:
test = Euclidian(*[names.get_first_name() for i in range(0,9)]) 

In [5]:
test.clusters()

("A: Elizabeth, Cluster 1: ['Elizabeth', 'Beverly', 'Matthew']",
 "Cluster 2: ['John', 'Glenna', 'Rene']",
 "Cluster 3: ['Burl', 'Raymond', 'Glenn']")

In [6]:
test.cluster_a()

Euclidean distance Elizabeth to Beverly: 0.8
Euclidean distance Elizabeth to Matthew: 0.67
Silhouette Alpha: 0.74


             x    y
Elizabeth  1.0  0.9
Beverly    1.0  1.7
Matthew    1.3  1.5
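
To make the arithmetic explicit, the two distances and their average can be worked out by hand from the coordinates above, with A = Elizabeth (1.0, 0.9), B = Beverly (1.0, 1.7) and C = Matthew (1.3, 1.5):

$$ d(A, B) = \sqrt{(1.0 - 1.0)^2 + (1.7 - 0.9)^2} = \sqrt{0.64} = 0.8 $$

$$ d(A, C) = \sqrt{(1.3 - 1.0)^2 + (1.5 - 0.9)^2} = \sqrt{0.45} \approx 0.67 $$

$$ \alpha = \frac{d(A, B) + d(A, C)}{2} = \frac{0.8 + 0.67}{2} = 0.735 $$

The notebook prints this value rounded to two decimals, 0.74.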


In [7]:
test.cluster_b()

Euclidean distance Elizabeth to John: 3.01
Euclidean distance Elizabeth to Glenna: 3.44
Euclidean distance Elizabeth to Rene: 2.86
Average distance Elizabeth to Cluster 2: 3.1


   Elizabeth
x        1.0
y        0.9


          x    y
John    1.3  3.9
Glenna  1.5  4.3
Rene    1.6  3.7


In [8]:
test.cluster_c()

Euclidean distance Elizabeth to Burl: 1.1
Euclidean distance Elizabeth to Raymond: 1.34
Euclidean distance Elizabeth to Glenn: 1.5
Average distance Elizabeth to Cluster 3: 1.31


   Elizabeth
x        1.0
y        0.9


           x    y
Burl     2.1  0.8
Raymond  2.2  1.5
Glenn    2.5  0.9


In [9]:
test.silhouette()

Alpha: 0.74
Beta: 1.31


'Silhouette Score of Elizabeth: 0.4351145038167939'
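
The final value can be checked by hand from the printed numbers. Cluster 3 is the nearest other cluster, since its average distance to Elizabeth (1.31) is smaller than that of Cluster 2 (3.1):

$$ \beta = \min(3.1,\ 1.31) = 1.31 $$

$$ s = \frac{\beta - \alpha}{\max(\alpha, \beta)} = \frac{1.31 - 0.74}{\max(0.74,\ 1.31)} = \frac{0.57}{1.31} \approx 0.435 $$

The score lies between -1 and 1. A positive value such as this one indicates that Elizabeth is, on average, closer to the members of her own cluster than to those of the nearest other cluster.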