<br>
<img align="center" src="imagenes/logo.png"  width="200" height="141">
<font size=36><center> Machine Learning con Python </center> </font>
<br>

<h1 align='center'> Modulo IV: Machine Learning </h1>
<h2 align='center'>  Evaluación de Modelos de Clasificación </h2> 

---

# Introducción

En esta sección veremos un conjunto de métricas que nos permitirán evaluar un modelo de clasificación, de tal manera que podamos comprar con otros modelos, hacer ajustes, tomar decisiones, etc.

# Tasas de clasificación de un modelo

### Exactitud (Accuracy)

En un modelo de clasificación binaria tenemos cuatro posibles resultados a la hora de clasificar. Supongamos que tenemos una clase $C$ y queremos determinar la pertenencia o no a dicha clase, entonces tendremos los posibles resultados:

* **Verdaderos Positivos (True positives):** Son aquellas observaciones que pertenecen a la clase $C$ y que el modelo acierta identificando que pertenece a la clase $C$.


* **Falsos Positivos (False positives):** Son aquellas observaciones que no pertenecen a la clase $C$, pero sin embargo el modelo determina que si pertenecen a esta.


* **Verdaderos Negativos (True negatives):**  Son aquellas observaciones que no   pertenecen a la clase $C$ y el modelo determina que no pertenecen a la clase $C$.


* **Falsos Negativos (False negatives):** Son aquellos casos en los que las observaciones pertenecen a la clase $C$ y el modelo determina que no pertenece a la clase $C$.


La **exactitud** es una medida general de cómo se comporta el modelo, mide el porcentaje de casos que han clasificados correctamente. Viene dado por la siguiente relación:

$$ \mbox{Exactitud} = \frac{\mbox{Nro de observaciones clasificadas correctamente}}{\mbox{Nro total de observaciones}} = \frac{VP+VN}{VP+VN+FP+FN}$$

### Precisión (Precission)

La precisión de un modelo viene dada por su capacidad de identificar a los modelos que pertenecen a una clase, en este caso la precisión se determina de la siguiente forma:

$$\mbox{Precisión} = \frac{\mbox{Nro de observaciones positivas clasificadas correctamente}}{\mbox{Nro de observaciones clasificadas como positivas}} = \frac{VP}{VP+FP}$$

### Sensibilidad

La sensibilidad nos da una medida de la habilidad del modelo para encontrar todos los valores perteneciente a una clase. La sensibilidad se mide en función de una clase.

$$\mbox{Sensibilidad} = \frac{\mbox{Nro de observaciones positivas clasificadas correctamente}}{\mbox{Nro de observaciones positivas totales}} = \frac{VP}{VP+FN}$$

### Matriz de Confusión

La **matriz de confusión** contabiliza y agrupa cómo han sido clasificadas las observaciones por el modelo. 

$$\left(\begin{array}{cc} VP & FP \\ FN & VN \end{array}\right)$$


### F1 score

Esta medida se define como la media armónica de la precisión y la sensibilidad, esto es:

$$\mbox{f1 score} = 2\frac{1}{\frac{1}{\mbox{precisión}}+ \frac{1}{\mbox{sensibilidad}}} = 2\frac{(precisión)(sensibilidad)}{precisión + sensibilidad}$$

# Práctica

##  Importamos librerias

In [112]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics

## Dataset

In [66]:
data = load_breast_cancer()
X = data.data
y = data.target

In [136]:
# Se cambio el orden de los datos, para que el valor 1 representara la presencia de cancer y 0 para que no
y_new = np.array([1 if val==0 else 0 for val in y])

## Creamos un DataFrame para visualizar los datos

In [137]:
df =pd.DataFrame(data=X, columns=list(data.feature_names))

In [138]:
df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


## Agregamos la variable objeetivo a la tabla

In [139]:
df['target']=y_new

In [76]:
df.head(10)

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0
5,12.45,15.7,82.57,477.1,0.1278,0.17,0.1578,0.08089,0.2087,0.07613,0.3345,0.8902,2.217,27.19,0.00751,0.03345,0.03672,0.01137,0.02165,0.005082,15.47,23.75,103.4,741.6,0.1791,0.5249,0.5355,0.1741,0.3985,0.1244,0
6,18.25,19.98,119.6,1040.0,0.09463,0.109,0.1127,0.074,0.1794,0.05742,0.4467,0.7732,3.18,53.91,0.004314,0.01382,0.02254,0.01039,0.01369,0.002179,22.88,27.66,153.2,1606.0,0.1442,0.2576,0.3784,0.1932,0.3063,0.08368,0
7,13.71,20.83,90.2,577.9,0.1189,0.1645,0.09366,0.05985,0.2196,0.07451,0.5835,1.377,3.856,50.96,0.008805,0.03029,0.02488,0.01448,0.01486,0.005412,17.06,28.14,110.6,897.0,0.1654,0.3682,0.2678,0.1556,0.3196,0.1151,0
8,13.0,21.82,87.5,519.8,0.1273,0.1932,0.1859,0.09353,0.235,0.07389,0.3063,1.002,2.406,24.32,0.005731,0.03502,0.03553,0.01226,0.02143,0.003749,15.49,30.73,106.2,739.3,0.1703,0.5401,0.539,0.206,0.4378,0.1072,0
9,12.46,24.04,83.97,475.9,0.1186,0.2396,0.2273,0.08543,0.203,0.08243,0.2976,1.599,2.039,23.94,0.007149,0.07217,0.07743,0.01432,0.01789,0.01008,15.09,40.68,97.65,711.4,0.1853,1.058,1.105,0.221,0.4366,0.2075,0


## Dividimos el dataset

In [140]:
xtrain, xtest, ytrain, ytest = train_test_split(X,y_new,test_size = 0.3, random_state=42)

In [141]:
print(xtrain.shape,xtest.shape)

(398, 30) (171, 30)


## Normalizamos los datos

In [142]:
scaler = MinMaxScaler()

In [143]:
xtrain_normal = scaler.fit_transform(xtrain)
xtest_normal = scaler.fit_transform(xtest)

## Creamos el modelo de Regresión Logística

In [144]:
modelo = LogisticRegression()

In [145]:
modelo.fit(xtrain_normal,ytrain)

LogisticRegression()

## Predicción del modelo

In [146]:
y_pred = modelo.predict(xtest_normal)

## Metricas 

### Matriz de confusión

In [150]:
MC = confusion_matrix(ytest,y_pred, labels=[0,1])
print(MC)

[[100   8]
 [  1  62]]


In [161]:
print('Verdaderos Negativos (Cancer benigno) = {}'.format(MC[0,0]))
print('Falsos Positivos (Cancer beligno calificado como maligno) = {}'.format(MC[1,0]))
print('Falsos Negativos (Carcer maligno calificado como benigno) = {}'.format(MC[0,1]))
print('Verdaderos Positivos (Cancer maligno) = {}'.format(MC[1,1]))

Verdaderos Negativos (Cancer benigno) = 100
Falsos Positivos (Cancer beligno calificado como maligno) = 1
Falsos Negativos (Carcer maligno calificado como benigno) = 8
Verdaderos Positivos (Cancer maligno) = 62


### Exactitud

In [157]:
metrics.accuracy_score(ytest,y_pred)

0.9473684210526315

### Precisión

In [158]:
metrics.average_precision_score(ytest,y_pred)

0.8775032820145602

### Sensibilidad

In [159]:
metrics.recall_score(ytest,y_pred)

0.9841269841269841

### F1_score

In [160]:
metrics.f1_score(ytest,y_pred)

0.9323308270676691