## Pablo Valdunciel Sánchez 
## 26th September, 2019

Given two samples X and Y, each described by 10 binary attributes, compute:

- The contingency table
- The Sokal-Michener and Jaccard similarities
- The distance, via Gower's transformation
- The Hamming distance (number of mismatches)

In [23]:
import numpy as np 
import math
from scipy.spatial.distance import hamming

In [10]:
X = np.array([1, 0, 0, 0, 1, 1, 0, 1, 0, 0])
X

array([1, 0, 0, 0, 1, 1, 0, 1, 0, 0])

In [11]:
Y = np.array([0, 0, 1, 0, 1, 1, 1, 1, 0, 1])
Y

array([0, 0, 1, 0, 1, 1, 1, 1, 0, 1])

#### Contingency table

In [13]:
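def contingency_table(X, Y):
    """Count the four 0/1 combinations of two binary vectors.

    Convention used here: a = both 0, b = X is 0 and Y is 1,
    c = X is 1 and Y is 0, d = both 1.
    """
    table = {"a": 0, "b": 0, "c": 0, "d": 0}
    for i in range(X.shape[0]):
        if X[i] == 0 and Y[i] == 0:
            table["a"] += 1
        elif X[i] == 0 and Y[i] == 1:
            table["b"] += 1
        elif X[i] == 1 and Y[i] == 0:
            table["c"] += 1
        else:
            table["d"] += 1
    return table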
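
As a cross-check, the same four counts can be obtained without an explicit loop. The sketch below is not part of the original notebook: `contingency_table_vectorized` is a hypothetical name, it keeps the convention used above (`a` = both 0, `d` = both 1), and it assumes both inputs contain only 0s and 1s.

```python
import numpy as np

def contingency_table_vectorized(x, y):
    """Count the four 0/1 combinations of two binary vectors (hypothetical helper)."""
    x = np.asarray(x).astype(bool)
    y = np.asarray(y).astype(bool)
    return {
        "a": int(np.sum(~x & ~y)),  # both 0
        "b": int(np.sum(~x & y)),   # 0 in x, 1 in y
        "c": int(np.sum(x & ~y)),   # 1 in x, 0 in y
        "d": int(np.sum(x & y)),    # both 1
    }

x = np.array([1, 0, 0, 0, 1, 1, 0, 1, 0, 0])
y = np.array([0, 0, 1, 0, 1, 1, 1, 1, 0, 1])
table = contingency_table_vectorized(x, y)
print(table)
```

For the two vectors of this exercise, the counts are a = 3, b = 3, c = 1 and d = 3, which add up to the 10 attributes.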

#### Similarities

In [17]:
n = X.shape[0]
c_table = contingency_table(X, Y)
# Sokal-Michener: proportion of attributes on which both samples agree
s_sokal_michel = (c_table["a"] + c_table["d"]) / n
# Jaccard: joint presences over the attributes present in at least one sample
s_jaccard = c_table["d"] / (c_table["d"] + c_table["b"] + c_table["c"])

print("- Sokal-Michener = " + str(s_sokal_michel))
print("- Jaccard = " + str(s_jaccard))

- Sokal-Michener = 0.6
- Jaccard = 0.42857142857142855
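
Written out by hand, the two coefficients can be checked against the printed values. The notation follows this notebook's convention, which is an assumption worth keeping in mind when comparing with textbooks: $a$ counts joint absences (0/0) and $d$ counts joint presences (1/1). Counting the positions of X and Y gives $a = 3$, $b = 3$, $c = 1$, $d = 3$.

```latex
S_{\text{SM}} = \frac{a + d}{a + b + c + d} = \frac{3 + 3}{10} = 0.6,
\qquad
S_{\text{J}} = \frac{d}{b + c + d} = \frac{3}{3 + 1 + 3} = \frac{3}{7} \approx 0.4286
```

The Jaccard coefficient leaves the joint absences $a$ out of both numerator and denominator, which is why it is lower than Sokal-Michener here.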


#### Distances

In [21]:
# Gower's transformation: d^2 = s_XX + s_YY - 2*s_XY, with s_XX = s_YY = 1
d_XY = math.sqrt(2 - 2*s_sokal_michel)

print("- Distance = " + str(d_XY))

- Distance = 0.8944271909999159
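
The expression `2 - 2*s_sokal_michel` comes from Gower's transformation of a similarity into a squared distance. The short derivation below assumes that each sample has similarity 1 with itself, which holds for the Sokal-Michener coefficient.

```latex
d_{XY}^2 = s_{XX} + s_{YY} - 2\,s_{XY} = 2\,(1 - s_{XY}),
\qquad
d_{XY} = \sqrt{2 - 2 \cdot 0.6} = \sqrt{0.8} \approx 0.8944
```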


In [26]:
# scipy's hamming returns the fraction of positions that differ, not the count
d_hamming = hamming(X, Y)
print("- Hamming distance = " + str(d_hamming))

- Hamming distance = 0.4
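
`scipy.spatial.distance.hamming` returns the fraction of positions that differ, which is why the result above is 0.4 rather than a count. Since the exercise asks for the number of mismatches, a minimal NumPy sketch of the count is given below; the variable names `n_mismatches` and `fraction` are illustrative only.

```python
import numpy as np

x = np.array([1, 0, 0, 0, 1, 1, 0, 1, 0, 0])
y = np.array([0, 0, 1, 0, 1, 1, 1, 1, 0, 1])

# Count the positions where the two vectors differ
n_mismatches = int(np.count_nonzero(x != y))
# Dividing by the vector length gives the normalized form
fraction = n_mismatches / x.size

print(n_mismatches, fraction)
```

The count equals `b + c` from the contingency table (3 + 1 = 4), and the normalized value equals one minus the Sokal-Michener similarity.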


#### Mixed similarity

Compute the similarity between samples 2 and 7.

| Sample | X1   | X2  | X3 | X4 | X5 | X6 |
|------|------|-----|----|----|----|----|
| 1    | 52.5 | 3.8 | 1  | 1  | 1  | 1  |
| 2    | 50.2 | 2.9 | 0  | 1  | 1  | 1  |
| 3    | 53.4 | 4.2 | 0  | 1  | 3  | 2  |
| 4    | 49.8 | 2.8 | 0  | 0  | 1  | 1  |
| 5    | 53.4 | 3.9 | 1  | 1  | 2  | 2  |
| 6    | 54.1 | 4.6 | 0  | 1  | 1  | 1  |
| 7    | 52.3 | 3.7 | 1  | 1  | 1  | 2  |
| 8    | 53.8 | 3.9 | 0  | 1  | 4  | 1  |
| 9    | 50.7 | 2.6 | 1  | 0  | 2  | 1  |
| 10   | 51.6 | 3.5 | 1  | 1  | 1  | 3  |

where

- X1: height
- X2: weight
- X3: sex (1: female, 0: male)
- X4: gestation time (1: more than 35 weeks, 0: less)
- X5: blood group (1: O, 2: A, 3: B, 4: AB)
- X6: race (1: white, 2: black, 3: other)

In [28]:
X = np.array([
    [52.5 , 3.8 , 1  , 1  , 1  , 1], 
    [50.2 , 2.9 , 0  , 1  , 1  , 1], 
    [53.4 , 4.2 , 0  , 1  , 3  , 2], 
    [49.8 , 2.8 , 0  , 0  , 1  , 1], 
    [53.4 , 3.9 , 1  , 1  , 2  , 2], 
    [54.1 , 4.6 , 0  , 1  , 1  , 1], 
    [52.3 , 3.7 , 1  , 1  , 1  , 2], 
    [53.8 , 3.9 , 0  , 1  , 4  , 1], 
    [50.7 , 2.6 , 1  , 0  , 2  , 1], 
    [51.6 , 3.5 , 1  , 1  , 1  , 3]
])

In [30]:
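# Column indices of each attribute type
n1 = [0, 1]  # continuous: X1, X2
n2 = [2, 3]  # binary: X3, X4
n3 = [4, 5]  # qualitative: X5, X6

X_continuos = X[:, n1]
X_binary = X[:, n2]
X_cualitative = X[:, n3]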

In [34]:
def coincidences(X, Y):
    """Return the number of positions at which X and Y hold the same value."""
    c = 0
    for i in range(X.shape[0]):
        if X[i] == Y[i]:
            c += 1
    return c

In [72]:
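# Samples 2 and 7 are rows 1 and 6 (zero-based indexing)
c_table = contingency_table(X_binary[1], X_binary[6])
# Number of qualitative attributes on which both samples agree
alpha = coincidences(X_cualitative[1], X_cualitative[6])
# Range (max - min) of each continuous attribute over all samples
ranges = np.ptp(X_continuos, axis=0)
# Per-attribute continuous similarity: 1 - |x - y| / range
aux = 1 - np.abs(X_continuos[1] - X_continuos[6]) / ranges
sum_continuos = aux.sum()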

In [70]:
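# Gower's similarity for mixed data: the denominator drops the binary
# attributes that are 0 in both samples (c_table["a"])
mixed_similarity = (sum_continuos + c_table["d"] + alpha) / (len(n1) + len(n2) - c_table["a"] + len(n3))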
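
The cell above evaluates Gower's similarity coefficient for mixed data. The worked version below uses symbols that are assumptions of this note: $p_1$, $p_2$, $p_3$ are the numbers of continuous, binary and qualitative attributes, $R_h$ is the range of continuous attribute $h$, $d$ and $a$ are the 1/1 and 0/0 counts among the binary attributes (this notebook's convention), and $\alpha$ is the number of matching qualitative attributes. For samples 2 and 7, the ranges are $R_1 = 4.3$ and $R_2 = 2.0$, with $d = 1$, $a = 0$ and $\alpha = 1$.

```latex
S_{ij} = \frac{\sum_{h=1}^{p_1} \left(1 - \frac{|x_{ih} - x_{jh}|}{R_h}\right) + d + \alpha}
              {p_1 + (p_2 - a) + p_3}
       = \frac{\left(1 - \frac{2.1}{4.3}\right) + \left(1 - \frac{0.8}{2.0}\right) + 1 + 1}{2 + (2 - 0) + 2}
       \approx \frac{3.1116}{6} \approx 0.5186
```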

In [71]:
print("Mixed similarity = " + str(mixed_similarity))

Mixed similarity = 0.518604651162791
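
The steps above are tied to rows 1 and 6. A minimal sketch of how they might be wrapped into a reusable function follows. `gower_similarity` and its parameter names are hypothetical, not from the original notebook, and the sketch assumes binary columns hold only 0/1 and that every continuous column has a non-zero range.

```python
import numpy as np

def gower_similarity(data, i, j, continuous, binary, qualitative):
    """Gower similarity between rows i and j of a mixed-type matrix (sketch)."""
    xi, xj = data[i], data[j]
    # Continuous part: 1 - |difference| / range, summed over attributes
    ranges = np.ptp(data[:, continuous], axis=0)
    cont = np.sum(1 - np.abs(xi[continuous] - xj[continuous]) / ranges)
    # Binary part: count joint presences (1/1) and joint absences (0/0)
    both_one = np.sum((xi[binary] == 1) & (xj[binary] == 1))
    both_zero = np.sum((xi[binary] == 0) & (xj[binary] == 0))
    # Qualitative part: count attributes with the same category
    matches = np.sum(xi[qualitative] == xj[qualitative])
    # Joint absences are excluded from the denominator
    denominator = len(continuous) + (len(binary) - both_zero) + len(qualitative)
    return float((cont + both_one + matches) / denominator)

data = np.array([
    [52.5, 3.8, 1, 1, 1, 1],
    [50.2, 2.9, 0, 1, 1, 1],
    [53.4, 4.2, 0, 1, 3, 2],
    [49.8, 2.8, 0, 0, 1, 1],
    [53.4, 3.9, 1, 1, 2, 2],
    [54.1, 4.6, 0, 1, 1, 1],
    [52.3, 3.7, 1, 1, 1, 2],
    [53.8, 3.9, 0, 1, 4, 1],
    [50.7, 2.6, 1, 0, 2, 1],
    [51.6, 3.5, 1, 1, 1, 3],
])

# Samples 2 and 7 are rows 1 and 6 (zero-based indexing)
s_2_7 = gower_similarity(data, 1, 6, [0, 1], [2, 3], [4, 5])
print("Mixed similarity = " + str(s_2_7))
```

For samples 2 and 7 this should agree with the value printed above (about 0.5186). The function is symmetric in `i` and `j`, and the similarity of a sample with itself is 1.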
