# <center> Similarity Measures in Python </center>

1. **Similarity is the basic building block for techniques such as recommendation engines, clustering, classification, and anomaly detection.**
2. **Similarity functions measure the ‘distance’ between two objects, such as a pair of vectors or numbers.**
3. **Two objects are deemed similar if the distance between them is small, and dissimilar if it is large.**

### Libraries

In [1]:
import pandas as pd #for dataframe
import numpy as np #for numerical calculations
import math #for mathematical functions like sqrt, pow
from sklearn import preprocessing #for standardising

# To print multiple outputs together
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"


### Create Data

In [2]:
# initialize list of lists 
data = [[64.0, 580.0, 29.0],
        [66.0, 570.0, 33.0],
        [68.0, 590.0, 37.0],
        [69.0, 660.0, 46.0],
        [73.0, 600.0, 55.0]] 

# Create the pandas DataFrame 
df = pd.DataFrame(data, columns = ['Height', 'Score', 'Age']) 
  
# print dataframe. 
df 

|   | Height | Score | Age  |
|---|--------|-------|------|
| 0 | 64.0   | 580.0 | 29.0 |
| 1 | 66.0   | 570.0 | 33.0 |
| 2 | 68.0   | 590.0 | 37.0 |
| 3 | 69.0   | 660.0 | 46.0 |
| 4 | 73.0   | 600.0 | 55.0 |


### Measures of Similarity

### 1. Euclidean Distance

The most popular distance measure is the Euclidean distance. The Euclidean distance $d_{ij}$ between
two cases, $i$ and $j$, is defined by:

<center>$d_{ij} = \sqrt{\sum_{k=1}^n (x_{ik}-x_{jk})^2}$, where $x_{ik}$ is the value of the $k$-th measurement for case $i$ and $n$ is the vector dimension</center>



In [3]:
def euclidean_distance(x,y):
  return math.sqrt(sum(math.pow(a-b,2) for a, b in zip(x, y)))

#calculate Euclidean distance between first two rows
euclidean_distance(df.loc[0,], df.loc[1,])

10.954451150103322
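
As a cross-check on the hand-written function above, the same distance can be sketched with NumPy, where the Euclidean distance is the L2 norm of the difference vector. The variable names (`x`, `y`, `dist_norm`, `dist_explicit`) are illustrative and not part of the original notebook.

```python
import numpy as np

# First two rows of the example data (Height, Score, Age)
x = np.array([64.0, 580.0, 29.0])
y = np.array([66.0, 570.0, 33.0])

# Euclidean distance as the L2 norm of the difference vector
dist_norm = np.linalg.norm(x - y)

# Equivalent explicit form: square root of the summed squared differences
dist_explicit = np.sqrt(np.sum((x - y) ** 2))

print(dist_norm, dist_explicit)
```

Both values should agree with the result of `euclidean_distance` for the first two rows, since the squared differences are 4, 100 and 16.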

### Normalised Euclidean Distance

- The measure computed above is highly influenced by the scale of each variable: variables with larger scales (like Score) have a much greater influence on the total distance.
- It is therefore customary to normalize (or standardize) continuous measurements before computing the Euclidean distance. This converts all measurements to the same scale.
- Normalizing a measurement means subtracting the average and dividing by the standard deviation (normalized values are also called z-scores).
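
To make the scale effect concrete, the following sketch breaks the squared Euclidean distance between the first two rows into per-column contributions. The names `squared_diff` and `share` are illustrative only.

```python
import numpy as np

columns = ["Height", "Score", "Age"]
x = np.array([64.0, 580.0, 29.0])  # row 0
y = np.array([66.0, 570.0, 33.0])  # row 1

# Squared difference contributed by each column
squared_diff = (x - y) ** 2

# Fraction of the total squared distance that each column accounts for
share = squared_diff / squared_diff.sum()

for name, sq, s in zip(columns, squared_diff, share):
    print(f"{name}: squared difference = {sq:.1f}, share = {s:.1%}")
```

On the raw data, Score accounts for 100 of the 120 units of squared distance, which is the imbalance that standardizing is meant to remove.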

<center>
<img src="../Images/Standardise.jpg" width="20%" height="20%"/>
</center>

In [4]:
# Get column names first
names = df.columns

# Create the Scaler object
scaler = preprocessing.StandardScaler()

# Fit your data on the scaler object
scaled_df = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_df, columns=names)
scaled_df

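|   | Height    | Score     | Age       |
|---|-----------|-----------|-----------|
| 0 | -1.318761 | -0.632456 | -1.172604 |
| 1 | -0.659380 | -0.948683 | -0.746203 |
| 2 | 0.000000  | -0.316228 | -0.319801 |
| 3 | 0.329690  | 1.897367  | 0.639602  |
| 4 | 1.648451  | 0.000000  | 1.599005  |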


In [5]:
#calculate Euclidean distance between first two rows
euclidean_distance(scaled_df.loc[0,], scaled_df.loc[1,])

0.8465227643210985
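
The standardization step can also be sketched by hand with pandas, which makes the z-score definition explicit. One detail to be aware of: `StandardScaler` divides by the population standard deviation, so the sketch passes `ddof=0` to `DataFrame.std` (whose default is `ddof=1`). The names `df_example`, `z` and `dist_z` are illustrative.

```python
import numpy as np
import pandas as pd

df_example = pd.DataFrame(
    [[64.0, 580.0, 29.0],
     [66.0, 570.0, 33.0],
     [68.0, 590.0, 37.0],
     [69.0, 660.0, 46.0],
     [73.0, 600.0, 55.0]],
    columns=["Height", "Score", "Age"],
)

# z-score: subtract the column mean, divide by the column standard deviation.
# ddof=0 selects the population standard deviation, matching StandardScaler.
z = (df_example - df_example.mean()) / df_example.std(ddof=0)

# Euclidean distance between the first two standardized rows
diff = z.loc[0] - z.loc[1]
dist_z = float(np.sqrt((diff ** 2).sum()))
print(dist_z)
```

With `ddof=0` the result should match the normalized distance computed above; with the pandas default (`ddof=1`) every z-score, and therefore the distance, would be slightly smaller.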

### 2. Mahalanobis Distance

- A more sophisticated technique is the Mahalanobis distance, which takes into account the variability of each dimension and the correlations between dimensions.
- It is a better metric than Euclidean distance when real-world datasets have correlated columns.
- It is very useful in outlier detection.

The Mahalanobis distance of an observation ${\vec {x}}=(x_{1},x_{2},x_{3},\dots ,x_{N})^{T}$ from a set of observations with mean ${\vec {\mu }}=(\mu _{1},\mu _{2},\mu _{3},\dots ,\mu _{N})^{T}$ and covariance matrix $S$ is defined as shown in the first equation.

It can also be defined as a dissimilarity measure between two random vectors ${\vec {x}}$ and ${\vec {y}}$ with the covariance matrix $S$, as shown in the second equation.
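
For reference, the two definitions can be written out in LaTeX. This assumes the standard textbook form of the Mahalanobis distance; the symbols $D_M$ and $d$ are labels chosen here for illustration.

$$D_M(\vec{x}) = \sqrt{(\vec{x}-\vec{\mu})^{T}\, S^{-1}\, (\vec{x}-\vec{\mu})}$$

$$d(\vec{x},\vec{y}) = \sqrt{(\vec{x}-\vec{y})^{T}\, S^{-1}\, (\vec{x}-\vec{y})}$$

When $S$ is the identity matrix, the second form reduces to the Euclidean distance; when $S$ is diagonal, it reduces to the normalized Euclidean distance, with each squared difference divided by that variable's variance.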

<center>
<img src="../Images/Mahalanobis.jpg" width="50%" height="40%"/>
</center>

In [6]:
def mahalanobis_distance(x, y, df):
    V = np.cov(df.T)  # covariance matrix of the dataset (one row per variable)
    VI = np.linalg.inv(V)  # inverse covariance matrix
    diff = x - y
    return np.sqrt(np.dot(np.dot(diff, VI), diff.T))

# calculate Mahalanobis distance between first two rows
mahalanobis_distance(df.loc[0,], df.loc[1,], df)

1.0995372903212337
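
Since the Mahalanobis distance is often used for outlier detection, here is a minimal sketch of the other form of the definition: the distance of every row from the mean vector. It uses the sample covariance (the `np.cov` default), and all variable names are illustrative. With only five rows this is a toy example, not a reliable outlier test.

```python
import numpy as np

data = np.array([[64.0, 580.0, 29.0],
                 [66.0, 570.0, 33.0],
                 [68.0, 590.0, 37.0],
                 [69.0, 660.0, 46.0],
                 [73.0, 600.0, 55.0]])

mean = data.mean(axis=0)        # mean vector of the dataset
cov = np.cov(data.T)            # sample covariance matrix (ddof=1 by default)
cov_inv = np.linalg.inv(cov)    # inverse covariance matrix

centered = data - mean
# Quadratic form (x - mu)^T S^-1 (x - mu), evaluated for every row at once
d_squared = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)
d_from_mean = np.sqrt(d_squared)

for row, dist in enumerate(d_from_mean):
    print(f"row {row}: Mahalanobis distance from mean = {dist:.3f}")
```

Rows with the largest distance are the most unusual relative to the correlation structure of the data. In practice a threshold (for example, one based on a chi-squared quantile) is usually applied to the squared distances; that choice is left out here.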

### 3. Manhattan Distance

- Manhattan distance is a metric in which the distance between two points is the sum of the absolute differences of their Cartesian coordinates.
- It is also known as taxicab metric, rectilinear distance, L1 distance, L1 norm, snake distance, city block distance or Manhattan length

<center>
<img src="../Images/Manhattan.jpg" width="50%" height="40%"/>
</center>

In [7]:
def manhattan_distance(x,y):
  return sum(abs(a-b) for a,b in zip(x,y))

manhattan_distance(df.loc[0,], df.loc[1,])

16.0
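
Written as a formula, the Manhattan distance is $\sum_{k=1}^{n} |x_k - y_k|$. A vectorized NumPy sketch of the same calculation is shown below; the variable names are illustrative.

```python
import numpy as np

x = np.array([64.0, 580.0, 29.0])  # row 0
y = np.array([66.0, 570.0, 33.0])  # row 1

# Sum of absolute coordinate differences: 2 + 10 + 4
manhattan = np.abs(x - y).sum()

# The same value as the L1 norm of the difference vector
manhattan_norm = np.linalg.norm(x - y, ord=1)

print(manhattan, manhattan_norm)
```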

### 4. Cosine Similarity

- The cosine similarity metric is the normalized dot product of two vectors, which equals the cosine of the angle between them.
- It measures the difference in orientation between two objects, not the difference in magnitude.
- Two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude.

In [8]:
def square_rooted(x):
    # Euclidean norm of x; left unrounded so no error carries into the ratio
    return math.sqrt(sum(a * a for a in x))

def cosine_similarity(x, y):
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = square_rooted(x) * square_rooted(y)
    return round(numerator / float(denominator), 3)

cosine_similarity(df.loc[0,], df.loc[1,])

1.0
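
The three reference cases listed above (same orientation, 90°, diametrically opposed) can be checked with a small NumPy sketch. The helper `cosine_sim` is a hypothetical name introduced here and returns the unrounded value.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine of the angle between two non-zero vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([64.0, 580.0, 29.0])  # row 0
b = np.array([66.0, 570.0, 33.0])  # row 1

same_direction = cosine_sim(a, 3 * a)            # scaled copy: magnitude is ignored
orthogonal = cosine_sim([1.0, 0.0], [0.0, 1.0])  # vectors at 90 degrees
opposite = cosine_sim(a, -a)                     # diametrically opposed
rows_0_1 = cosine_sim(a, b)                      # unrounded value for the example rows

print(same_direction, orthogonal, opposite, rows_0_1)
```

The unrounded similarity of the first two rows is slightly below 1 (about 0.99996), which is why it displays as 1.0 after rounding to three decimals. On raw data like this, the large Score values dominate the direction of both vectors.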

### 5. Distances for Binary Data

<center>
<img src="../Images/Binary.jpg" width="60%" height="60%"/>
</center>
    
#### Binary Euclidean Distance: $\frac{b+c}{a+b+c+d}$
#### Simple Matching Coefficient: $\frac{a+d}{a+b+c+d}$
#### Jaccard coefficient: $\frac{d}{b+c+d}$
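
The three formulas can be sketched in code as follows. The cell labels are an assumption inferred from the formulas, since the image is not reproduced here: `a` counts positions where both vectors are 0, `d` counts positions where both are 1, and `b` and `c` count the two kinds of mismatch. The helper `binary_counts` and the example vectors `u` and `v` are hypothetical.

```python
import numpy as np

def binary_counts(u, v):
    """Return the 2x2 agreement counts (a, b, c, d) for two binary vectors.

    Assumed convention: a = both 0, b = 0 in u and 1 in v,
    c = 1 in u and 0 in v, d = both 1.
    """
    u = np.asarray(u, dtype=bool)
    v = np.asarray(v, dtype=bool)
    a = int(np.sum(~u & ~v))
    b = int(np.sum(~u & v))
    c = int(np.sum(u & ~v))
    d = int(np.sum(u & v))
    return a, b, c, d

# Made-up binary vectors for illustration
u = [1, 0, 1, 1, 0, 0]
v = [1, 1, 0, 1, 0, 0]

a, b, c, d = binary_counts(u, v)
total = a + b + c + d

binary_euclidean = (b + c) / total   # share of mismatching positions
simple_matching = (a + d) / total    # share of matching positions
jaccard = d / (b + c + d)            # ignores joint absences; undefined if b + c + d == 0

print(a, b, c, d)
print(binary_euclidean, simple_matching, jaccard)
```

Note that the binary Euclidean distance and the simple matching coefficient always sum to 1, while the Jaccard coefficient differs from simple matching only in leaving out the positions where both vectors are 0.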

