## Introduction
* Many supervised and unsupervised machine learning models such as K-NN and K-Means depend upon the distance between two data points to predict the output. Therefore, the metric we use to compute distances plays an important role in these models.
* Knowing when to use which distance measure can help you go from a poor classifier to an accurate model.
* There are many distance metrics; some of the most common are described below.

## 1. Euclidean Distance
![ed.png](attachment:ed.png)
* It is the most common distance measure.
* It is best described as the length of the straight line segment connecting two points.
* We are most likely to use Euclidean distance when calculating the distance between two rows of data that have numerical values, such as floating-point or integer values.
* If columns have values with differing scales, it is common to normalize or standardize the numerical values across all columns prior to calculating the Euclidean distance. Otherwise, columns that have large values will dominate the distance measure.
* **Advantages**:
    * Euclidean distance works well on low-dimensional data where the magnitude of the vectors matters. Methods such as kNN and HDBSCAN often give good results out of the box when Euclidean distance is used on low-dimensional data.
    * It is very simple to use and visualize and it is one of the most used distance measures.
* **Disadvantages**:
    * The higher the dimensionality of the data, the less useful Euclidean distance becomes. This is a consequence of the curse of dimensionality: high-dimensional space does not behave the way our intuition from 2- or 3-dimensional space suggests.
    * Euclidean distance is not scale-invariant, so the computed distances can be skewed by the units of the features. Typically, the data needs to be normalized before using this distance measure.
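
The effect of feature scale can be illustrated with a minimal sketch. The data below is hypothetical (age in years, income in dollars), and `standardize` is an illustrative helper, not part of any library; it rescales each column to zero mean and unit population standard deviation.

```python
import math
from statistics import mean, pstdev

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def standardize(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical data: [age in years, income in dollars]
people = [
    [25, 50_000],  # A
    [60, 51_000],  # B: much older than A, similar income
    [26, 53_000],  # C: almost the same age as A, slightly higher income
    [40, 90_000],  # D
]

# On raw values the income column dominates, so B looks closer to A than C does
raw_ab = euclidean_distance(people[0], people[1])
raw_ac = euclidean_distance(people[0], people[2])
print(raw_ab < raw_ac)

# After standardization both columns contribute on a comparable scale,
# and C becomes the closer neighbour of A
scaled = standardize(people)
scaled_ab = euclidean_distance(scaled[0], scaled[1])
scaled_ac = euclidean_distance(scaled[0], scaled[2])
print(scaled_ac < scaled_ab)
```

In this toy example the nearest neighbour of A changes once the columns are put on a common scale, which is the kind of skew the last bullet refers to.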

In [4]:
# Implementation:

import math

def euclidean_distance(x, y):
    return math.sqrt(sum((e1-e2)**2 for e1, e2 in zip(x,y)))

row1 = [9, 16, 20, 11, 7]
row2 = [12, 24, 11, 9, 8]
# calculate distance
dist1 = euclidean_distance(row1, row2)
print(dist1)

# We can also perform the same calculation using SciPy
from scipy.spatial.distance import euclidean
dist2 = euclidean(row1, row2)
print(dist2)

12.609520212918492
12.609520212918492


## 2. Manhattan Distance
* The Manhattan distance, often called Taxicab distance or City Block distance, calculates the distance between real-valued vectors.
* It is perhaps more useful for vectors that describe objects on a uniform grid, like a chessboard or city blocks. The taxicab name refers to the intuition behind the measure: the shortest path that a taxicab would take between city blocks (coordinates on the grid).
* It might make sense to calculate Manhattan distance instead of Euclidean distance for two vectors in an integer feature space.
* Manhattan distance is the distance between two points when movement is restricted to right angles, that is, along the grid axes. No diagonal movement is involved in calculating the distance.
![manhattan_distance.jpeg](attachment:manhattan_distance.jpeg)

* Formula (for two points):
![manhattan.jpg](attachment:manhattan.jpg)

* More generalized formula:
![0*m9AgwqgzAZsdcf-z.gif](attachment:0*m9AgwqgzAZsdcf-z.gif)
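
In case the formula images do not render, the standard definitions can be written out as follows. This assumes the images show the usual textbook form; the symbols $x$, $y$, and $n$ are notation chosen here. For two points $(x_1, y_1)$ and $(x_2, y_2)$ in the plane, and then for two $n$-dimensional vectors $x$ and $y$:

$$
d = |x_1 - x_2| + |y_1 - y_2|
\qquad\qquad
d(x, y) = \sum_{i=1}^{n} |x_i - y_i|
$$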

* **Advantages** :
    * It tends to hold up reasonably well on high-dimensional data.
    * When the dataset has discrete and/or binary attributes, this distance tends to work well because it follows paths that could realistically be taken between values of those attributes. Euclidean distance, by contrast, draws a straight line between two vectors even when such a path is not actually possible.
    
* **Disadvantages** :
    * It is somewhat less intuitive than Euclidean distance, especially when used on high-dimensional data.
    * Moreover, it never gives a smaller value than Euclidean distance, and usually gives a larger one, since it does not take the shortest possible path.
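
The relationship between the two measures can be checked with a short sketch. For any pair of `n`-dimensional vectors, the Euclidean distance is at most the Manhattan distance, which in turn is at most `sqrt(n)` times the Euclidean distance. The example reuses the two rows from this notebook; the variable names `d1`, `d2`, and `n` are chosen here for illustration.

```python
import math

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan_distance(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

row1 = [9, 16, 20, 11, 7]
row2 = [12, 24, 11, 9, 8]

d2 = euclidean_distance(row1, row2)  # straight-line distance
d1 = manhattan_distance(row1, row2)  # axis-aligned distance
n = len(row1)

print(d1)                              # 23
print(d2 <= d1 <= math.sqrt(n) * d2)   # True

# The two distances coincide when the vectors differ in a single coordinate
print(manhattan_distance([0, 0], [3, 0]) == euclidean_distance([0, 0], [3, 0]))  # True
```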

In [5]:
# Implementation:

def manhattan_distance(x, y):
    # Sum of absolute coordinate differences (no math import needed)
    return sum(abs(e1-e2) for e1, e2 in zip(x, y))

row1 = [9, 16, 20, 11, 7]
row2 = [12, 24, 11, 9, 8]
# calculate distance
dist1 = manhattan_distance(row1, row2)
print(dist1)

# We can also perform the same calculation using SciPy
from scipy.spatial.distance import cityblock

dist2 = cityblock(row1, row2)
print(dist2)

23
23


## 3. Hamming Distance
* Hamming distance calculates the distance between two binary vectors, also referred to as binary strings or bitstrings for short.
* It is the number of positions at which two binary vectors differ. It is typically used to compare two binary strings of equal length. It can also be applied to ordinary strings of equal length, where it counts the number of positions at which the characters differ.
* We are most likely going to encounter bitstrings when we one-hot encode categorical columns of data.
* For example, if a column had the categories ‘red,’ ‘green,’ and ‘blue,’ we might one-hot encode each example as a bitstring with one bit for each category.

    red = [1, 0, 0]
    green = [0, 1, 0]
    blue = [0, 0, 1]

* The distance between red and green could be calculated as the sum or the average number of bit differences between the two bitstrings. This is the Hamming distance.
* For a one-hot encoded column, it often makes more sense to report the sum of the bit differences between the strings, which is always 0 (same category) or 2 (different categories).
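
A minimal sketch of both variants on the colour example above. The helper name `hamming_count` is chosen here for illustration; it counts differing positions, and dividing by the length gives the average.

```python
def hamming_count(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(e1 != e2 for e1, e2 in zip(a, b))

red = [1, 0, 0]
green = [0, 1, 0]
blue = [0, 0, 1]

# Sum of bit differences
print(hamming_count(red, green))  # 2
print(hamming_count(red, red))    # 0

# Average number of bit differences (2 differing bits out of 3)
print(hamming_count(red, green) / len(red))
```

Two different categories always differ in exactly two bits of a single one-hot encoding: the bit that is set in one vector and the bit that is set in the other.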

![hd.jpeg](attachment:hd.jpeg)
* **Use Cases** :
    * Typical use cases include error correction/detection when data is transmitted over computer networks. It can be used to determine the number of distorted bits in a binary word as a way to estimate error.
    * We can also use Hamming distance to measure the distance between categorical variables.

* **Disadvantages**:
    * Hamming distance is only defined for vectors of equal length, so it cannot be applied directly when the lengths differ.
    * It only records whether two values are equal or different, not by how much they differ. It is therefore not recommended when the magnitude of the difference matters.
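
The equal-length requirement and the use on ordinary strings can be sketched as follows. The function name `hamming_distance_strict` is hypothetical; it returns the count of differing positions and raises an error for unequal lengths, since `zip` would otherwise silently ignore the extra elements of the longer input.

```python
def hamming_distance_strict(a, b):
    """Count differing positions; refuse inputs of unequal length."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(e1 != e2 for e1, e2 in zip(a, b))

# Works on strings as well as on bit vectors
print(hamming_distance_strict("karolin", "kathrin"))        # 3
print(hamming_distance_strict([1, 0, 1, 1], [1, 1, 1, 0]))  # 2

# Unequal lengths are rejected instead of being truncated
try:
    hamming_distance_strict("abc", "abcd")
except ValueError as err:
    print(err)
```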


In [7]:
def hamming_distance(a, b):
    # abs(e1 - e2) is 1 for a mismatch and 0 for a match, assuming 0/1 values only
    return sum(abs(e1 - e2) for e1, e2 in zip(a, b)) / len(a) # Dividing to get average
 
# Data
row1 = [0, 0, 0, 0, 0, 1]
row2 = [0, 0, 0, 0, 1, 0]
# calculate distance
dist1 = hamming_distance(row1, row2)
print(dist1)

# Calculating Hamming distance using SciPy (returns the proportion of differing positions, not the count)
from scipy.spatial.distance import hamming
# calculate distance
dist2 = hamming(row1, row2)
print(dist2)

0.3333333333333333
0.3333333333333333
