# Distance Measures in Data Science 

From https://towardsdatascience.com/9-distance-measures-in-data-science-918109d069fa

In [3]:
import numpy as np
import pandas as pd

## Euclidean Distance

 $$ D(x,y)= \sqrt{\sum \limits_{i=1} ^n (x_{i}-y_{i})^{2}}$$

In [2]:
# https://stackoverflow.com/questions/56115205/euclidean-distance-between-two-pandas-dataframes

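def Euclidean_Dist(df1, df2, cols=('x_coord', 'y_coord')):
    # Row-wise Euclidean distance between matching rows of df1 and df2
    cols = list(cols)  # tuple default avoids a mutable default argument
    return np.linalg.norm(df1[cols].values - df2[cols].values, axis=1)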
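As a quick check of the formula, the sketch below applies the same `np.linalg.norm` approach to two small, made-up DataFrames. The column names `x_coord` and `y_coord` follow the defaults used above; the helper name `euclidean_rows` and the sample points are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def euclidean_rows(df1, df2, cols=("x_coord", "y_coord")):
    """Row-wise Euclidean distance between two equally sized DataFrames."""
    cols = list(cols)
    diff = df1[cols].to_numpy() - df2[cols].to_numpy()
    return np.linalg.norm(diff, axis=1)


# Hypothetical sample points, chosen so the results are easy to verify by hand
p = pd.DataFrame({"x_coord": [0.0, 1.0], "y_coord": [0.0, 1.0]})
q = pd.DataFrame({"x_coord": [3.0, 1.0], "y_coord": [4.0, 3.0]})

dists = euclidean_rows(p, q)
print(dists)
```

The first pair forms a 3-4-5 right triangle and the second differs only by 2 along `y_coord`, so the distances should come out as 5 and 2.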

## Cosine Similarity

 $$ D(x,y)= \cos{\theta} = \frac{{x}\cdot{y}}{\|x\|\cdot\|y\|}$$

In [1]:
from scipy.spatial import distance
# SciPy returns the cosine *distance*, 1 - cos(theta), so orthogonal vectors give 1.0
distance.cosine([1,0,0],[0,1,0])
    

1.0
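The formula above gives the cosine similarity, whereas `scipy.spatial.distance.cosine` returns one minus that value. The following sketch computes the similarity directly with NumPy and compares it with SciPy's result; the helper name `cosine_similarity` is an assumption, not part of SciPy.

```python
import numpy as np
from scipy.spatial import distance


def cosine_similarity(x, y):
    """cos(theta) between two non-zero vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))


u, v = [1, 0, 0], [0, 1, 0]
sim = cosine_similarity(u, v)   # orthogonal vectors: similarity 0
dist = distance.cosine(u, v)    # SciPy's cosine distance: 1 - similarity
print(sim, dist)
```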

## Hamming Distance

The number of positions at which two equal-length vectors differ.
Use case: distance between categorical variables.
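A minimal sketch of the idea, assuming the categories have already been encoded as integers (the sample vectors are made up). Note that `scipy.spatial.distance.hamming` returns the fraction of differing positions rather than the raw count.

```python
import numpy as np
from scipy.spatial.distance import hamming

# Hypothetical integer-encoded categorical vectors of equal length
x = np.array([1, 0, 2, 1, 0])
y = np.array([1, 1, 2, 0, 0])

n_diff = int(np.sum(x != y))   # raw count of differing positions
frac_diff = hamming(x, y)      # SciPy normalizes by the vector length

print(n_diff, frac_diff)
```

Multiplying `frac_diff` by `len(x)` recovers the count.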


## Manhattan Distance

 $$ D(x,y)= \sum \limits _{i=1} ^{n} {|x_{i}-y_{i}|}$$

In [13]:
from scipy.spatial.distance import cdist
import numpy as np
a = np.array([[ 0.1,  0.2,  0.4]])
b = np.array([[ 0.3,  0.2,  0.4]])
out = cdist(a, b, metric='cityblock')
print(out)

[[0.2]]


## Chebyshev Distance

 $$ D(x,y)= \max_{i} ({|x_{i}-y_{i}|})$$

In [16]:
from scipy.spatial.distance import cdist

out = cdist(a, b, metric='chebyshev')
print(out)

[[0.2]]


## Minkowski Distance

 $$ D(x,y)= (\sum \limits _{i=1} ^{n} {|x_{i}-y_{i}|}^{p})^{\frac{1}{p}} $$

In [17]:
# Note: for p < 1 the triangle inequality no longer holds, so this is not a true metric
out = cdist(a, b, 'minkowski', p=0.5)
print(out)

[[0.2]]
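Because `a` and `b` differ in a single coordinate, every choice of `p` gives the same value here. The sketch below uses two made-up vectors that differ in all coordinates to illustrate how the Minkowski distance reduces to Manhattan for `p=1` and Euclidean for `p=2`, and approaches Chebyshev as `p` grows. The value `p=50` is an arbitrary choice for "large".

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical vectors with coordinate differences 1, 2 and 2
u = np.array([[0.0, 0.0, 0.0]])
v = np.array([[1.0, 2.0, 2.0]])

d_p1 = cdist(u, v, metric="minkowski", p=1)[0, 0]
d_p2 = cdist(u, v, metric="minkowski", p=2)[0, 0]
d_p50 = cdist(u, v, metric="minkowski", p=50)[0, 0]

d_city = cdist(u, v, metric="cityblock")[0, 0]
d_eucl = cdist(u, v, metric="euclidean")[0, 0]
d_cheb = cdist(u, v, metric="chebyshev")[0, 0]

print(d_p1, d_city)   # p=1 matches Manhattan
print(d_p2, d_eucl)   # p=2 matches Euclidean
print(d_p50, d_cheb)  # large p is close to Chebyshev
```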


## Jaccard Index

 $$ D(x,y)= 1- {\frac{|{x}\cap{y}|}{|{y}\cup{x}|}} $$

In [20]:
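# Note: 'jaccard' is defined for boolean vectors; the result for float inputs
# like these may depend on the SciPy version
cdist(a, b, 'jaccard')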

array([[0.33333333]])
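The formula is stated in terms of sets, so a set-based sketch may be easier to follow than the array call above. The function name `jaccard_distance` and the sample sets are assumptions for illustration; treating two empty sets as identical is also a convention chosen here, not something the formula defines.

```python
def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b| for two collections of hashable items."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


# Hypothetical sets: 2 shared items out of 5 distinct items overall
s1 = {"apple", "banana", "cherry"}
s2 = {"banana", "cherry", "date", "elderberry"}

jd = jaccard_distance(s1, s2)
print(jd)
```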

## Haversine
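
The haversine distance is the great-circle distance between two points on a sphere, given their latitudes $\varphi$ and longitudes $\lambda$ in radians and the sphere radius $r$:

 $$ D = 2r \arcsin{\sqrt{\sin^{2}\left(\frac{\varphi_{2}-\varphi_{1}}{2}\right) + \cos{\varphi_{1}}\cos{\varphi_{2}}\sin^{2}\left(\frac{\lambda_{2}-\lambda_{1}}{2}\right)}} $$

A minimal sketch using only the standard library follows. The function name `haversine_km`, the mean Earth radius of 6371 km, and the test coordinates are assumptions for illustration; the formula treats the Earth as a perfect sphere.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(h))


# Two points on the equator, 90 degrees of longitude apart:
# a quarter of the circumference, i.e. pi * r / 2
quarter = haversine_km(0.0, 0.0, 0.0, 90.0)
print(quarter)
```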

## Sørensen-Dice Index

 $$ D(x,y)= {\frac{2 |{x}\cap{y}|}{|{y}|+|{x}|}} $$

In [25]:
#https://stackoverflow.com/questions/31273652/how-to-calculate-dice-coefficient-for-measuring-accuracy-of-image-segmentation-i
import numpy as np

k=1

# segmentation
seg = np.zeros((100,100), dtype='int')
seg[30:70, 30:70] = k

# ground truth
gt = np.zeros((100,100), dtype='int')
gt[30:70, 40:80] = k

dice = np.sum(seg[gt==k])*2.0 / (np.sum(seg) + np.sum(gt))

print( 'Dice similarity score is {}'.format(dice))

Dice similarity score is 0.75
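
The Dice score and the Jaccard similarity $J$ are related by $J = D / (2 - D)$. The sketch below rebuilds the same two masks as boolean arrays and checks that relation numerically; the variable names are illustrative.

```python
import numpy as np

# Same two overlapping 40x40 squares as above, stored as boolean masks
seg = np.zeros((100, 100), dtype=bool)
seg[30:70, 30:70] = True
gt = np.zeros((100, 100), dtype=bool)
gt[30:70, 40:80] = True

intersection = np.logical_and(seg, gt).sum()
union = np.logical_or(seg, gt).sum()

dice = 2.0 * intersection / (seg.sum() + gt.sum())
jaccard = intersection / union

print(dice, jaccard, dice / (2 - dice))
```

The overlap is 40 x 30 = 1200 pixels against 1600 pixels per mask, which gives a Dice score of 0.75 and a Jaccard similarity of 0.6.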
