# MAT281 - Laboratory No. 10



<a id='p1'></a>
## I.- Problem 01


<img src="https://www.goodnewsnetwork.org/wp-content/uploads/2019/07/immunotherapy-vaccine-attacks-cancer-cells-immune-blood-Fotolia_purchased.jpg" width="360" height="360" align="center"/>


**Breast cancer** is a malignant proliferation of the epithelial cells lining the ducts or lobules of the breast. It is a clonal disease: a single cell, as the result of a series of somatic or germline mutations, acquires the ability to divide without control or order and reproduces until it forms a tumor. The resulting tumor, which begins as a mild abnormality, becomes severe, invades neighboring tissue and, finally, spreads to other parts of the body.

The dataset is called `BC.csv`. It contains information on different patients with tumors (benign or malignant) and some characteristics of each tumor.


The features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe the characteristics of the cell nuclei present in the image.
Details can be found in [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].


The first step is to load the dataset:

In [22]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn import datasets

%matplotlib inline
sns.set_palette("deep", desat=.6)
sns.set(rc={'figure.figsize':(11.7,8.27)})

In [23]:
# load the data
df = pd.read_csv(os.path.join("data","BC.csv"), sep=",")
df['diagnosis'] = df['diagnosis'].replace({'M':1,'B':0}) # target: 1 = malignant (M), 0 = benign (B)
df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,1,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,1,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,1,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,1,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


Based on the information presented, answer the following questions:

1. Carry out an exploratory analysis of the dataset.
2. Normalize the numeric variables with **StandardScaler**.
3. Apply one of the dimensionality reduction methods seen in class.
4. Apply at least three different classification models. For each chosen model, optimize its hyperparameters and compute the corresponding metrics. Conclude.




In [24]:
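def resumen_por_columna(df,cols):
    """One-row summary of column `cols`: number of distinct values and of missing values."""
    pd_serie=df[cols]
    unico=pd_serie.unique()          # distinct values (NaN counts as one value)
    vacio=pd_serie[pd_serie.isna()]  # entries that are missing
    df_info=pd.DataFrame({'columna':[cols],'unicos':[len(unico)],'vacios':[len(vacio)]})
    return df_info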

In [25]:
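# build the summary column by column and stack the one-row frames
frame=[]
for col in df:
    aux_df=resumen_por_columna(df,col)
    frame.append(aux_df)
df_i=pd.concat(frame).reset_index(drop=True)
df_i['% vacios']=100*df_i['vacios']/len(df)  # missing values as a percentage of rows
df_i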

Unnamed: 0,columna,unicos,vacios,% vacios
0,id,569,0,0.0
1,diagnosis,2,0,0.0
2,radius_mean,456,0,0.0
3,texture_mean,479,0,0.0
4,perimeter_mean,522,0,0.0
5,area_mean,539,0,0.0
6,smoothness_mean,474,0,0.0
7,compactness_mean,537,0,0.0
8,concavity_mean,537,0,0.0
9,concave points_mean,542,0,0.0
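
The same per-column summary can be obtained without an explicit loop. The following is a minimal sketch, assuming a small toy frame (`toy`, with made-up values) in place of `BC.csv`; it uses pandas' `nunique` and `isna`, and keeps the column names of the table above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for BC.csv (hypothetical values, not the real data).
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "diagnosis": [1, 0, 0, 1],
    "radius_mean": [17.99, np.nan, 19.69, 17.99],
})

# One row per column: distinct values (NaN counted once) and missing values.
summary = pd.DataFrame({
    "columna": toy.columns,
    "unicos": toy.nunique(dropna=False).to_numpy(),
    "vacios": toy.isna().sum().to_numpy(),
})
summary["% vacios"] = 100 * summary["vacios"] / len(toy)
print(summary)
```

`nunique(dropna=False)` is used so that a missing value counts as one distinct value, which is how `Series.unique()` behaves in the loop version.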


### Training PCA with StandardScaler

In [26]:
from sklearn.pipeline import make_pipeline

# df still includes 'id' and 'diagnosis', so both columns enter the PCA (32 columns in total)
pca_pipe=make_pipeline(StandardScaler(),PCA()).fit(df)
modelo_pca=pca_pipe.named_steps['pca']

In [27]:
pd.DataFrame(data=modelo_pca.components_,columns=df.columns,index=[f'PC{i}' for i in range(1,modelo_pca.n_components_+1)])

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
PC1,0.022013,0.216836,0.216404,0.103706,0.224541,0.218007,0.137491,0.231311,0.251115,0.255161,...,0.225597,0.105058,0.233636,0.222022,0.125188,0.204288,0.222927,0.246161,0.120461,0.126524
PC2,-0.032368,-0.077035,-0.226045,-0.058587,-0.207125,-0.222897,0.189042,0.158967,0.068175,-0.026702,...,-0.212408,-0.045164,-0.192094,-0.211595,0.172729,0.148267,0.103591,-0.001785,0.142765,0.276795
PC3,0.097903,-0.104562,-0.000271,0.057256,-0.000892,0.037809,-0.101731,-0.067796,0.009542,-0.019013,...,-0.03921,-0.050352,-0.039889,-0.002394,-0.257203,-0.229456,-0.165984,-0.162802,-0.271611,-0.229383
PC4,0.0273,0.098214,-0.051257,0.599487,-0.052045,-0.061834,-0.149217,-0.040524,-0.028097,-0.070732,...,-0.024516,0.627975,-0.023827,-0.033993,-0.010667,0.075525,0.058062,-0.016847,0.035909,0.067294
PC5,-0.009117,-0.08068,0.04202,-0.020158,0.041831,0.014292,-0.367699,0.017037,0.089721,-0.04138,...,0.001651,-0.060463,0.014025,-0.021243,-0.321004,0.133442,0.197066,0.04898,-0.235423,0.101783
PC6,0.31631,0.005288,-0.029793,0.030418,-0.028623,-0.006413,0.261919,0.004499,0.002251,0.034264,...,-0.004876,0.044032,-0.013302,0.023557,0.364404,-0.034475,-0.018082,0.029356,-0.452128,0.092201
PC7,0.906762,-0.028818,-0.041255,0.019096,-0.042348,-0.027189,-0.13756,-0.042472,-0.031721,-0.080222,...,-0.014912,-0.004676,-0.012529,0.001915,-0.064929,0.053367,0.035847,-0.020216,0.235544,0.036969
PC8,-0.099014,-0.145998,-0.105531,0.018122,-0.094321,-0.035776,-0.099449,0.061094,-0.090455,-0.122745,...,-0.000748,0.025191,0.010757,0.075227,-0.108297,0.142236,-0.061912,-0.158853,-0.041253,0.362993
PC9,-0.15036,0.198361,0.016627,0.061302,0.005334,0.049416,-0.303182,-0.128519,-0.095406,-0.137329,...,0.057153,-0.010226,0.042817,0.080026,0.155876,0.09884,0.027892,-0.05744,0.182309,0.114979
PC10,-0.15862,0.076739,-0.23256,0.100103,-0.235151,-0.198613,-0.056163,-0.200352,0.028392,-0.141259,...,-0.105098,0.09728,-0.104645,-0.071884,0.152487,-0.074612,0.188064,0.054002,0.091539,-0.104132
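
The loadings alone do not say how many components are worth keeping. The following is a minimal sketch of one common criterion, the cumulative explained variance. It assumes that scikit-learn's bundled Wisconsin breast cancer data (`load_breast_cancer`, 30 measurement columns, no `id` or `diagnosis` in `X`) is a reasonable stand-in for `BC.csv`; the 95% threshold is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for BC.csv.
X, y = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(StandardScaler(), PCA()).fit(X)
pca = pipe.named_steps["pca"]

# Cumulative share of the total variance explained by the first k components.
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Smallest k whose cumulative explained variance reaches 95%.
k95 = int(np.searchsorted(cum_var, 0.95) + 1)
print(k95, cum_var[:k95].round(3))
```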


### Computing the projections

In [28]:
proyec=pca_pipe.transform(X=df)
proyecc=pd.DataFrame(proyec,columns=[f'PC{i}' for i in range(1,proyec.shape[1]+1)],index=df.index)
proyecc.head()

Unnamed: 0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,...,PC23,PC24,PC25,PC26,PC27,PC28,PC29,PC30,PC31,PC32
0,9.216724,2.13675,-0.996666,-3.701989,-1.254759,-1.375762,0.395573,2.211307,-0.160213,-0.060519,...,0.172314,0.082037,0.085714,-0.177079,-0.156238,-0.186689,-0.26274,-0.033102,0.043804,-0.046911
1,2.651244,-3.770741,-0.554319,-1.133416,0.553634,-0.122295,-0.308984,-0.067303,0.560147,-0.608057,...,-0.059225,-0.08547,-0.21172,0.007385,-0.17113,-0.044703,0.181939,0.03193,-0.004078,-0.002272
2,5.900697,-1.010916,-0.467841,-0.933755,-0.194574,-0.402068,0.460244,-0.70804,-0.046968,-0.088025,...,0.206359,-0.049831,-0.073666,0.110708,0.175046,-0.005597,0.045919,0.047545,0.001681,0.001097
3,7.137376,10.315636,-3.256828,-0.053782,-2.944418,-2.566999,1.962902,1.231422,1.311039,-1.197071,...,0.239471,-0.196805,-0.136421,0.162871,0.080311,-0.288804,0.167094,0.042934,-0.070615,-0.019377
4,4.139263,-1.916891,1.46528,-2.877471,0.36075,1.240987,-0.242104,-1.093712,0.713389,-0.156483,...,-0.082677,-0.025528,0.136264,-0.01683,0.001169,0.045126,0.038431,-0.035528,0.007469,0.020662


In [34]:
from sklearn.preprocessing import scale

# Manual projection: principal axes times the standardized data (same result as pca_pipe.transform)
proyeccion = np.dot(modelo_pca.components_, scale(df).T)
proyecciones = pd.DataFrame(proyeccion,index = [f'PC{i}' for i in range(1,proyeccion.shape[0]+1)])
proyecciones = proyecciones.transpose().set_index(df.index)
proyecciones.head()

Unnamed: 0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,...,PC23,PC24,PC25,PC26,PC27,PC28,PC29,PC30,PC31,PC32
0,9.216724,2.13675,-0.996666,-3.701989,-1.254759,-1.375762,0.395573,2.211307,-0.160213,-0.060519,...,0.172314,0.082037,0.085714,-0.177079,-0.156238,-0.186689,-0.26274,-0.033102,0.043804,-0.046911
1,2.651244,-3.770741,-0.554319,-1.133416,0.553634,-0.122295,-0.308984,-0.067303,0.560147,-0.608057,...,-0.059225,-0.08547,-0.21172,0.007385,-0.17113,-0.044703,0.181939,0.03193,-0.004078,-0.002272
2,5.900697,-1.010916,-0.467841,-0.933755,-0.194574,-0.402068,0.460244,-0.70804,-0.046968,-0.088025,...,0.206359,-0.049831,-0.073666,0.110708,0.175046,-0.005597,0.045919,0.047545,0.001681,0.001097
3,7.137376,10.315636,-3.256828,-0.053782,-2.944418,-2.566999,1.962902,1.231422,1.311039,-1.197071,...,0.239471,-0.196805,-0.136421,0.162871,0.080311,-0.288804,0.167094,0.042934,-0.070615,-0.019377
4,4.139263,-1.916891,1.46528,-2.877471,0.36075,1.240987,-0.242104,-1.093712,0.713389,-0.156483,...,-0.082677,-0.025528,0.136264,-0.01683,0.001169,0.045126,0.038431,-0.035528,0.007469,0.020662
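
The two tables agree because the pipeline's `transform` is the standardized data multiplied by the principal axes. The following is a minimal sketch that checks this numerically instead of by eye, again assuming scikit-learn's bundled breast cancer data as a stand-in for `BC.csv`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, scale

# Assumption: the bundled dataset stands in for BC.csv.
X, _ = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(StandardScaler(), PCA()).fit(X)
pca = pipe.named_steps["pca"]

# Projection computed by the pipeline.
via_pipeline = pipe.transform(X)

# Projection computed by hand. scale() standardizes like StandardScaler, so the
# columns already have zero mean and PCA's internal centering changes nothing.
via_dot = scale(X) @ pca.components_.T

print(np.allclose(via_pipeline, via_dot))
```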


### Reconstruction

In [36]:
rec=pca_pipe.inverse_transform(proyecciones)
reconstruccion=pd.DataFrame(rec,columns=df.columns,index=df.index)
print('The reconstructed values are:')
display(reconstruccion.head())
print('The original values are:')
display(df.head())

The reconstructed values are:


Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302.0,1.0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517.0,1.0,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903.0,1.0,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301.0,1.0,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402.0,1.0,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


The original values are:


Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,1,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,1,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,1,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,1,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678
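
The reconstruction is exact here because all components are kept. The following is a minimal sketch of what happens when only the first few components are retained, assuming scikit-learn's bundled breast cancer data as a stand-in for `BC.csv`; `reconstruction_rmse` is a hypothetical helper written for this illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for BC.csv.
X, _ = load_breast_cancer(return_X_y=True)


def reconstruction_rmse(n_components):
    """RMSE, in standardized units, after projecting onto n_components and back."""
    Z = StandardScaler().fit_transform(X)
    pca = PCA(n_components=n_components).fit(Z)
    Z_rec = pca.inverse_transform(pca.transform(Z))
    return float(np.sqrt(np.mean((Z - Z_rec) ** 2)))


# Fewer components give a lossier reconstruction; all 30 recover the data.
for k in (2, 10, 30):
    print(k, round(reconstruction_rmse(k), 4))
```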


### Applying classification models

In [40]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

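# Caution: X is built from the full df, so it still contains 'id' and the target
# 'diagnosis'; the classifiers below therefore see the label among their features.
X=df
X=StandardScaler().fit_transform(X)
Y=df['diagnosis']

# Train/test split (80/20)
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,random_state=2)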

In [41]:
print('Splitting the data:\n')
print('Rows in original data:',len(X))
print('Rows in train set:',len(X_train))
print('Rows in test set:',len(X_test))

Splitting the data:

Rows in original data: 569
Rows in train set: 455
Rows in test set: 114
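
The split above is made on `X=df`, which still holds the `id` and `diagnosis` columns, and the scaler is fit on all rows before splitting. The following is a minimal sketch of a split that avoids both issues. It assumes scikit-learn's bundled breast cancer data as a stand-in for `BC.csv` (for the CSV itself, the equivalent would presumably be `df.drop(columns=['id','diagnosis'])`); the `stratify` argument is an addition not present in the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the bundled dataset stands in for BC.csv. Its target is coded
# 0 = malignant, 1 = benign, the reverse of the coding used in this notebook.
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target  # X holds only the 30 measurement columns

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2, stratify=y
)

# The scaler sits inside the pipeline, so it is fit on the training rows only.
model = make_pipeline(StandardScaler(), SVC()).fit(X_train, y_train)
print(len(X_train), len(X_test), round(model.score(X_test, y_test), 4))
```

Keeping the scaler inside the pipeline means the test rows never influence the means and standard deviations used for scaling.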


In [42]:
slog=SVC()
slog.fit(X_train,Y_train)

SVC()

In [43]:
slog.score(X_train,Y_train)

1.0
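
A training score by itself says little about generalization, and the task statement also asks for hyperparameter optimization. The following is a minimal sketch of a cross-validated grid search for `SVC`, assuming scikit-learn's bundled breast cancer data as a stand-in for `BC.csv`; the grid values and the `f1_macro` scoring are illustrative choices, not the course's prescribed settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the bundled dataset stands in for BC.csv.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2, stratify=y
)

pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# Hypothetical grid, chosen only for illustration.
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["linear", "rbf"],
    "svc__gamma": ["scale", 0.01],
}

# 5-fold cross-validation on the training rows for every grid combination.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1_macro")
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.best_score_, 4), round(search.score(X_test, y_test), 4))
```

The same pattern applies to the other two classifiers by swapping the final pipeline step and its grid.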

In [44]:
y_true=list(Y_test)
y_pred=list(slog.predict(X_test))
print('True values:',y_true)
print('Predicted values:',y_pred)

True values: [0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
Predicted values: [0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]


In [47]:
from metrics_classification import summary_metrics  # local helper module
from sklearn.metrics import confusion_matrix

print('Confusion matrix with SVC:')
print(confusion_matrix(y_true,y_pred))

Confusion matrix with SVC:
[[68  1]
 [ 0 45]]


In [48]:
df_temp=pd.DataFrame({'y':y_true,'yhat':y_pred})
df_metrics=summary_metrics(df_temp)
df_metrics

Unnamed: 0,accuracy,recall,precision,fscore
0,0.9912,0.9928,0.9891,0.9909
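
`summary_metrics` comes from a local module whose code is not shown. The following is a minimal sketch that recomputes the four metrics from the SVC confusion matrix with `sklearn.metrics`, under the assumption that recall, precision and F-score are macro-averaged over the two classes.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Rebuild label vectors from the SVC confusion matrix
# (rows = true class, columns = predicted class).
cm = np.array([[68, 1], [0, 45]])
y_true, y_pred = [], []
for true_cls in range(2):
    for pred_cls in range(2):
        count = int(cm[true_cls, pred_cls])
        y_true += [true_cls] * count
        y_pred += [pred_cls] * count

# Assumption: macro averaging, i.e. the unweighted mean of the per-class scores.
metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred, average="macro"),
    "precision": precision_score(y_true, y_pred, average="macro"),
    "fscore": f1_score(y_true, y_pred, average="macro"),
}
print({name: round(float(value), 4) for name, value in metrics.items()})
```

Rounded to four decimals, these values coincide with the table above, which is consistent with the macro-averaging assumption.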


In [49]:
from sklearn.linear_model import LogisticRegression

rlog=LogisticRegression()
rlog.fit(X_train,Y_train)

LogisticRegression()

In [50]:
rlog.score(X_train,Y_train)

1.0

In [52]:
y_pred=list(rlog.predict(X_test))
print('True values:',y_true)
print('Predicted values:',y_pred)

True values: [0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
Predicted values: [0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]


In [53]:
print('Confusion matrix with LogisticRegression:')
print(confusion_matrix(y_true,y_pred))

Confusion matrix with LogisticRegression:
[[69  0]
 [ 0 45]]


In [54]:
df_temp=pd.DataFrame({'y':y_true,'yhat':y_pred})
df_metrics=summary_metrics(df_temp)
df_metrics

Unnamed: 0,accuracy,recall,precision,fscore
0,1.0,1.0,1.0,1.0


In [55]:
from sklearn.ensemble import RandomForestClassifier

flog=RandomForestClassifier()
flog.fit(X_train,Y_train)

RandomForestClassifier()

In [56]:
flog.score(X_train,Y_train)

1.0

In [61]:
y_pred=list(flog.predict(X_test))
print('True values:',y_true)
print('Predicted values:',y_pred)

True values: [0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
Predicted values: [0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]


In [62]:
print('Confusion matrix with RandomForestClassifier:')
print(confusion_matrix(y_true,y_pred))

Confusion matrix with RandomForestClassifier:
[[68  1]
 [ 1 44]]


In [63]:
df_temp=pd.DataFrame({'y':y_true,'yhat':y_pred})
df_metrics=summary_metrics(df_temp)
df_metrics

Unnamed: 0,accuracy,recall,precision,fscore
0,0.9825,0.9816,0.9816,0.9816
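
The three models differ by only one or two test rows, so a single 114-row split is a thin basis for ranking them. The following is a minimal sketch of a cross-validated comparison, assuming scikit-learn's bundled breast cancer data as a stand-in for `BC.csv`; the 5 folds, `max_iter=1000` and `random_state=2` are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the bundled dataset stands in for BC.csv.
X, y = load_breast_cancer(return_X_y=True)

# Scaling lives inside each pipeline so it is refit on every training fold.
models = {
    "SVC": make_pipeline(StandardScaler(), SVC()),
    "LogisticRegression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "RandomForestClassifier": RandomForestClassifier(random_state=2),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    for name, model in models.items()
}

# Mean and spread of the accuracy over the five folds.
for name, fold_scores in scores.items():
    print(f"{name}: {fold_scores.mean():.4f} +/- {fold_scores.std():.4f}")
```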


In the end, LogisticRegression is the most accurate model on this test set: its confusion matrix has no false negatives or false positives, and every value in its metrics table equals 1.
