# Classification: 'Vest' vs. 'No Vest'

In this example we build a classifier using *Random Forest* that takes the *features* extracted by the CNN as input and detects whether or not a person is wearing a vest.

The steps are:
1. Load the data.
2. Build the *data sets* from the loaded data.
3. Train the model.
4. Validate the model.
5. Show an example.

## Loading the data
We load the data with the *features* extracted by the CNN. The data is stored in the file *./datasets/Chalecos.csv*.

Next, the file is loaded with pandas and its first rows are displayed.

In [1]:
import pandas as pd

datos = pd.read_csv('./datasets/Chalecos.csv', sep=';')
datos.head(5)

Unnamed: 0,Path,signature_1,signature_2,signature_3,signature_4,signature_5,signature_6,signature_7,signature_8,signature_9,...,B_25,B_26,B_27,B_28,B_29,B_30,B_31,B_32,B_33,label
0,chaleco_0000000643_00001.jpg,25528610.0,1289.115,32648.097664,963.204223,136.339264,113.0038,2033.042236,1483872.0,233392.093469,...,0.002785,0.012823,0.009986,0.011726,0.01255,0.013856,0.014765,0.001652,0.010662,0
1,chaleco_0000000371_00001.jpg,-5.395348e-315,83093.39,1936.38208,371277.312545,179126.781705,2066947.0,62801.101632,21.04087,31.789776,...,0.001035,0.001451,0.001583,0.007212,0.006741,0.001099,0.001435,0.000634,0.001503,0
2,chaleco_0000001592_00001.jpg,-5.391256e-315,29.36839,1.802548,213.491974,58.976784,184.2215,2724.506351,0.5980717,2711.857893,...,0.0,0.005006,0.003638,0.005729,0.00754,0.004387,0.005235,0.0,0.004612,1
3,chaleco_0000000202_00001.jpg,606.2828,2013213.0,373301.812739,9463.767583,6387.141616,159433.1,56676.851604,1687501.0,1910.136471,...,0.03586,0.007249,0.002453,0.00985,0.008871,0.002997,0.00445,0.001721,0.002858,0
4,chaleco_0000000303_00001.jpg,-5.375468e-315,171546.2,1117.738524,6708.946291,388763.063308,200520.3,20229.128951,3.094792,23.038914,...,0.005268,0.008404,0.001363,0.014296,0.014993,0.002415,0.003174,0.001725,0.001411,0


In [2]:
tam = datos.shape

print("Filas: " + str(tam[0]) + ". Columnas: " + str(tam[1]))

Filas: 785. Columnas: 201


## Creating the datasets
We create the *data frame* Xdf, which holds the observed variables: every column except the first (the path to the photograph) and the last (the class). We also create the pandas Series Ydf, which holds the class label of each observation.

In [3]:
Xdf = datos.iloc[: , 1:200]
Xdf.head(5)

Unnamed: 0,signature_1,signature_2,signature_3,signature_4,signature_5,signature_6,signature_7,signature_8,signature_9,signature_10,...,B_24,B_25,B_26,B_27,B_28,B_29,B_30,B_31,B_32,B_33
0,25528610.0,1289.115,32648.097664,963.204223,136.339264,113.0038,2033.042236,1483872.0,233392.093469,333.065125,...,0.017265,0.002785,0.012823,0.009986,0.011726,0.01255,0.013856,0.014765,0.001652,0.010662
1,-5.395348e-315,83093.39,1936.38208,371277.312545,179126.781705,2066947.0,62801.101632,21.04087,31.789776,5965.858402,...,0.0042,0.001035,0.001451,0.001583,0.007212,0.006741,0.001099,0.001435,0.000634,0.001503
2,-5.391256e-315,29.36839,1.802548,213.491974,58.976784,184.2215,2724.506351,0.5980717,2711.857893,12.534834,...,0.011086,0.0,0.005006,0.003638,0.005729,0.00754,0.004387,0.005235,0.0,0.004612
3,606.2828,2013213.0,373301.812739,9463.767583,6387.141616,159433.1,56676.851604,1687501.0,1910.136471,3008.183111,...,0.008654,0.03586,0.007249,0.002453,0.00985,0.008871,0.002997,0.00445,0.001721,0.002858
4,-5.375468e-315,171546.2,1117.738524,6708.946291,388763.063308,200520.3,20229.128951,3.094792,23.038914,7719.600588,...,0.009811,0.005268,0.008404,0.001363,0.014296,0.014993,0.002415,0.003174,0.001725,0.001411


In [4]:
Ydf = datos['label']
Ydf.head(5)

0    0
1    0
2    1
3    0
4    0
Name: label, dtype: int64
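The positional slice `1:200` only works while the file keeps exactly this column order. As a minimal sketch of an alternative, the non-feature columns can be dropped by name. The frame `toy` below is a hypothetical miniature with the same layout as the CSV (a `Path` column, feature columns, and a final `label` column); it is not the real data.

```python
import pandas as pd

# Hypothetical miniature frame mimicking the CSV layout (not the real data).
toy = pd.DataFrame({
    "Path": ["a.jpg", "b.jpg", "c.jpg"],
    "signature_1": [1.0, 2.0, 3.0],
    "B_33": [0.1, 0.2, 0.3],
    "label": [0, 1, 0],
})

# Drop the non-feature columns by name instead of slicing by position.
X_toy = toy.drop(columns=["Path", "label"])
y_toy = toy["label"]

print(list(X_toy.columns))
print(y_toy.tolist())
```

Applied to the notebook's data, the same idea would read `datos.drop(columns=['Path', 'label'])`, assuming those are the actual column names shown in the table above.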

Once X and Y are separated, we generate the *train* and *test* sets: 70% of the data goes to the training set and the remaining 30% to the test set used for validation.

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(Xdf, Ydf, test_size=0.3)  # 70% for training, 30% for test
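The split above is random, so the metrics change from run to run. The following sketch shows one possible way to make it repeatable and to keep the class proportions equal in both subsets, using the `random_state` and `stratify` parameters of `train_test_split`. The data (`X_demo`, `y_demo`) is synthetic and the seed value is an arbitrary assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 4 features, 60/40 class balance.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(100, 4)), columns=[f"f{i}" for i in range(4)])
y_demo = pd.Series([0] * 60 + [1] * 40, name="label")

# random_state fixes the shuffle; stratify keeps the class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print("Class 1 share in test:", y_te.mean())
```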

## Training the model
We train the *random forest* model with 50 trees.

In [6]:
from sklearn.ensemble import RandomForestClassifier

modelo = RandomForestClassifier(n_estimators=50)
modelo = modelo.fit(X_train,Y_train)
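With roughly two hundred input features, it can help to see which ones the forest relies on. This is a sketch on synthetic data, not on the notebook's features: the label is built to depend only on the hypothetical column `f0`, and the fitted model's `feature_importances_` attribute is used to rank the columns.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data in which only the column 'f0' determines the label.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=["f0", "f1", "f2", "f3"])
y_demo = (X_demo["f0"] > 0).astype(int)

# Same number of trees as in the notebook; the seed is an arbitrary assumption.
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances, one value per feature, summing to 1.
importances = pd.Series(clf.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```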

## Validation
We use the test data to obtain quality metrics for the model.

In [7]:
import sklearn.metrics

predicciones = modelo.predict(X_test)
# Rows are true classes and columns are predicted classes: top-left is TN, bottom-right is TP.
print(sklearn.metrics.confusion_matrix(Y_test, predicciones))
print("Accuracy: ", sklearn.metrics.accuracy_score(Y_test, predicciones))

[[145   1]
 [ 19  71]]
Accuracy:  0.9152542372881356
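Accuracy alone hides how the errors are distributed: in the matrix above, 19 of the 90 samples of class 1 are missed, while only 1 sample of class 0 is misclassified. As a short worked example, precision, recall, and F1 for class 1 can be derived directly from the four printed counts.

```python
# Counts taken from the confusion matrix printed above
# (rows are true classes, columns are predicted classes).
tn, fp = 145, 1
fn, tp = 19, 71

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # share of predicted class 1 that is correct
recall = tp / (tp + fn)      # share of true class 1 that is found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
# accuracy=0.9153 precision=0.9861 recall=0.7889 f1=0.8765
```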


## Showing an example

Next we run a manual test to inspect the results.

In [10]:
import numpy as np

X = Xdf.iloc[123, :]
Y = Ydf.iloc[123]
print("Clase real: ", Y)
Y_p = modelo.predict(np.array(X).reshape(1, -1))
print("Clase estimada: ", Y_p[0])
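# NOTE: the name is read from row 256, while the features and label above come from row 123,
# so the file name printed here does not belong to the classified sample.
print("Nombre:", datos.iloc[256,0])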

Clase real:  1
Clase estimada:  1
Nombre: chaleco_0000000523_00001.jpg
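The row used in the manual test is taken from the full data set, so it may have been part of the training data. A possibly more informative check takes a sample from the held-out set and uses a single index for the features, the label, and the file name. The sketch below uses synthetic data; the frame `demo`, its file names, and the seed are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same layout: Path, features, label.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f0", "f1", "f2"])
demo.insert(0, "Path", [f"img_{i:04d}.jpg" for i in range(200)])
demo["label"] = (demo["f0"] > 0).astype(int)

X_all = demo.drop(columns=["Path", "label"])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, demo["label"], test_size=0.3, random_state=0
)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

idx = X_te.index[0]        # row label of one held-out sample
sample = X_te.loc[[idx]]   # one-row DataFrame, feature names preserved
pred = clf.predict(sample)[0]

# The same index is used for the file name, the true label, and the features.
print("File:", demo.loc[idx, "Path"])
print("True class:", y_te.loc[idx])
print("Predicted class:", pred)
```

Passing a one-row DataFrame instead of a reshaped NumPy array keeps the column names the model was fitted with.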
