<a href="https://colab.research.google.com/github/nluizsoliveira/IA-Applications-Notes/blob/main/0_proximity_measurements.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Proximity Measures 

## Motivation
Datasets are often represented as a matrix with `n` rows and `d` columns, in which each row is an instance (one data item) and each column is a feature of that instance.

$$
M = 
\begin{bmatrix}
x_{11} & x_{12} & x_{13} & \cdots & x_{1d} \\
x_{21} & x_{22} & x_{23} & \cdots & x_{2d} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & x_{n3} & \cdots & x_{nd}
\end{bmatrix}
$$

For example, the dataset  

  
| Item | Feature #1 | Feature #2 |
|------|------------|------------|
| x1   | 1.5        | 5055       |
| x2   | 2.3        | 5943       |
| x3   | 5.4        | 7100       |
| x4   | 3.2        | 8590       |
  

can be represented as:

$$
M = 
\begin{bmatrix}
1.5 & 5055 \\
2.3 & 5943 \\
5.4 & 7100 \\
3.2 & 8590
\end{bmatrix}
$$
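As a minimal sketch, the same matrix can be built in Python. NumPy is an assumption here (the notes themselves only use pandas), and the variable names are illustrative:

```python
import numpy as np

# Rows are instances (x1..x4) and columns are features,
# matching the example table.
M = np.array([
    [1.5, 5055],
    [2.3, 5943],
    [5.4, 7100],
    [3.2, 8590],
])

n, d = M.shape        # n instances, d features
x2 = M[1]             # second instance (row); indices are zero-based
feature_1 = M[:, 0]   # first feature (column) across all instances

print(n, d)
print(x2)
print(feature_1)
```

Indexing a row returns one instance, and indexing a column returns one feature for every instance.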


Proximity measures produce a real number, preferably but not necessarily in the range `[0,1]`, that represents the similarity (or dissimilarity) between two instances of a dataset. This enables a wide range of statistical and data science techniques.

For example, it's possible to group items into **clusters**: sets of similar items with internal cohesion and external separation.


Proximity measures are denoted by `d(xi, xj)`, where `xi` and `xj` are instances of a dataset (rows of the matrix).


# Properties of a dissimilarity measure

1. Symmetry: `d(xi, xj) = d(xj, xi)`
2. Positivity: `d(xi, xj) >= 0` for every pair `xi, xj`
3. Reflexivity: `d(xi, xj) = 0` if, and only if, `xi = xj`
4. Triangle inequality: `d(xi, xj) <= d(xi, xk) + d(xk, xj)` for all `xi, xj, xk`
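The four properties can be checked numerically on a small set of points. The sketch below is only an illustration: `euclidean` and `check_metric_properties` are hypothetical helper names, and passing the check on a few sample points does not prove that a measure satisfies the properties in general.

```python
import itertools
import numpy as np


def euclidean(a, b):
    """Euclidean distance between two 1-D points."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))


def check_metric_properties(points, d, tol=1e-9):
    """Check the four properties of d on every pair/triple of points."""
    results = {
        "symmetry": True,
        "positivity": True,
        "reflexivity": True,
        "triangle_inequality": True,
    }
    for a, b in itertools.product(points, repeat=2):
        dab = d(a, b)
        if abs(dab - d(b, a)) > tol:
            results["symmetry"] = False
        if dab < -tol:
            results["positivity"] = False
        # d(a, b) must be zero exactly when a and b are the same point.
        if (dab <= tol) != np.array_equal(a, b):
            results["reflexivity"] = False
    for a, b, c in itertools.product(points, repeat=3):
        if d(a, c) > d(a, b) + d(b, c) + tol:
            results["triangle_inequality"] = False
    return results


# Three points on a line: 0, 1 and 2.
points = [np.array([0.0]), np.array([1.0]), np.array([2.0])]


def squared_euclidean(a, b):
    return euclidean(a, b) ** 2


euclidean_report = check_metric_properties(points, euclidean)
squared_report = check_metric_properties(points, squared_euclidean)
print(euclidean_report)
print(squared_report)
```

On these points the Euclidean distance passes all four checks, while the squared Euclidean distance fails the triangle inequality: the squared distance from 0 to 2 is 4, which is larger than 1 + 1.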

# Euclidean Distance

Euclidean distance can be applied to **continuous data** (real-valued numeric features). It's calculated as:

![image](https://user-images.githubusercontent.com/49366837/193611845-6e0f314f-62ec-4b73-89f7-21b3a6c2a547.png)
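In case the image does not render, the standard definition of the Euclidean distance, written with the matrix notation from the Motivation section and assuming it matches the image, is:

$$
d(x_i, x_j) = \sqrt{\sum_{k=1}^{d} \left(x_{ik} - x_{jk}\right)^2}
$$

Here $x_{ik}$ is the value of feature $k$ for instance $i$, and the $d$ in the summation limit is the number of features, not the distance function.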

Note that:
1. Features with larger values or variances can "dominate" the Euclidean distance, so it is often worth normalizing the data before computing it.
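The dominance effect can be seen on the small example matrix from the Motivation section. The sketch below assumes NumPy and uses min-max scaling as one possible normalization; the notes do not prescribe a specific one, and z-score standardization would be another common choice.

```python
import numpy as np

# Example matrix from the Motivation section: rows are instances,
# columns are Feature #1 and Feature #2.
M = np.array([
    [1.5, 5055],
    [2.3, 5943],
    [5.4, 7100],
    [3.2, 8590],
])


def euclidean(a, b):
    """Euclidean distance between two rows."""
    return float(np.sqrt(np.sum((a - b) ** 2)))


# Distance between x1 and x2 on the raw values.
raw_distance = euclidean(M[0], M[1])

# Share of the squared distance contributed by each feature.
squared_diffs = (M[0] - M[1]) ** 2
raw_shares = squared_diffs / squared_diffs.sum()

# Min-max scaling: each column is mapped to the range [0, 1].
col_min = M.min(axis=0)
col_max = M.max(axis=0)
M_scaled = (M - col_min) / (col_max - col_min)

scaled_distance = euclidean(M_scaled[0], M_scaled[1])

print("raw distance:", raw_distance)
print("feature shares (raw):", raw_shares)
print("scaled distance:", scaled_distance)
```

On the raw values, almost all of the squared distance between `x1` and `x2` comes from Feature #2, because its values are thousands of times larger. After scaling, both features contribute on a comparable scale.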

In [1]:
import requests
import pandas as pd 
from io import StringIO


data = pd.read_csv('https://raw.githubusercontent.com/nluizsoliveira/'
                   'Data-Notes/main/csvs/1_bank.csv')
data

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,deposit
0,59,admin.,married,secondary,no,2343,yes,no,unknown,5,may,1042,1,-1,0,unknown,yes
1,56,admin.,married,secondary,no,45,no,no,unknown,5,may,1467,1,-1,0,unknown,yes
2,41,technician,married,secondary,no,1270,yes,no,unknown,5,may,1389,1,-1,0,unknown,yes
3,55,services,married,secondary,no,2476,yes,no,unknown,5,may,579,1,-1,0,unknown,yes
4,54,admin.,married,tertiary,no,184,no,no,unknown,5,may,673,2,-1,0,unknown,yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11157,33,blue-collar,single,primary,no,1,yes,no,cellular,20,apr,257,1,-1,0,unknown,no
11158,39,services,married,secondary,no,733,no,no,unknown,16,jun,83,4,-1,0,unknown,no
11159,32,technician,single,secondary,no,29,no,no,cellular,19,aug,156,2,-1,0,unknown,no
11160,43,technician,married,secondary,no,0,no,yes,cellular,8,may,9,2,172,5,failure,no
