# LAB: SVM: More predictions on breast cancer data

## 1. Introduction

We continue the task from earlier classification labs: predicting a breast cancer diagnosis from cell characteristics.


* class_t is the target variable (2 = benign, 4 = malignant)

* the remaining columns, other than ID, are features with integer values on a 1 to 10 scale


More information about the dataset is available [here](https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names).

**Note:** 16 cases with missing values in some fields were removed from the original dataset.
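
The cleaning step mentioned in the note can be sketched as follows. This is a minimal illustration, assuming missing entries are encoded as `?` as in the original UCI file; the three sample rows, their IDs, and the names `raw`, `sample`, and `clean` are made up for the example.

```python
import io
import pandas as pd

# Hypothetical rows in the layout of the original file; '?' marks a missing value
raw = io.StringIO(
    "101,5,1,1,1,2,1,3,1,1,2\n"
    "102,8,4,5,1,2,?,7,3,1,4\n"
    "103,8,10,10,8,7,10,9,7,1,4\n"
)
cols = ['ID', 'clump_Thickness', 'unif_cell_size', 'unif_cell_shape', 'adhesion',
        'epith_cell_Size', 'bare_nuclei', 'bland_chromatin', 'norm_nucleoli',
        'mitoses', 'class_t']

# na_values turns '?' into NaN so that dropna can remove the incomplete rows
sample = pd.read_csv(raw, header=None, names=cols, na_values='?')
clean = sample.dropna().reset_index(drop=True)

# The column was read as float because of the NaN; restore the integer type
clean['bare_nuclei'] = clean['bare_nuclei'].astype(int)
print(clean.shape)
```

Only the row with the `?` entry is dropped; the other rows are kept unchanged.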

In [1]:
# Import the libraries we will use
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

In [2]:
# Load the data

df = pd.read_csv('breast-cancer.csv', header = None)
df.columns = ['ID', 'clump_Thickness', 'unif_cell_size', 'unif_cell_shape', 'adhesion', 'epith_cell_Size', 'bare_nuclei',
              'bland_chromatin ','norm_nucleoli', 'mitoses', 'class_t']
df.sample(3)

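| | ID | clump_Thickness | unif_cell_size | unif_cell_shape | adhesion | epith_cell_Size | bare_nuclei | bland_chromatin | norm_nucleoli | mitoses | class_t |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 364 | 657753 | 3 | 1 | 1 | 4 | 3 | 1 | 2 | 2 | 1 | 2 |
| 16 | 1048672 | 4 | 1 | 1 | 1 | 2 | 1 | 2 | 1 | 1 | 2 |
| 567 | 1107684 | 6 | 10 | 5 | 5 | 4 | 10 | 6 | 10 | 1 | 4 |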


## 2. Classification workflow

The idea is to fit the model on the dataset, using cross validation to search over the hyperparameters.
To get both an estimate of the best hyperparameters and a sense of how the model performs on new data, we will complete the following workflow:

<img src='worflowtt.png'></img>

Carry out the workflow under the following conditions:

1) Make an initial train/test split, keeping 75% and 25% of the data respectively

2) When exploring the hyperparameters, try C = [0.1, 0.2, 0.3] with both the radial (RBF) and linear kernels

Normalize the features. Why? Is normalization necessary for SVM?

3) Use 5 folds to compute the cross-validation score
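
One way to think about the normalization question: an SVM works with distances (or dot products) between samples, so a feature measured on a much larger scale can dominate them. The sketch below illustrates this on a tiny made-up matrix; `X_demo` and `squared_distance_share` are hypothetical names, not part of the lab. In this dataset all features already share a 1 to 10 scale, so the effect is expected to be milder.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: feature 0 spans about 10 units, feature 1 spans thousands
X_demo = np.array([[1.0, 1000.0],
                   [10.0, 1100.0],
                   [1.0, 5000.0],
                   [10.0, 5100.0]])

def squared_distance_share(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    sq = (a - b) ** 2
    return sq / sq.sum()

# Rows 0 and 1 sit at opposite ends of feature 0 but are close in feature 1
raw_share = squared_distance_share(X_demo[0], X_demo[1])

# After standardization each feature has zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X_demo)
scaled_share = squared_distance_share(X_scaled[0], X_scaled[1])

print(raw_share)     # feature 1 accounts for almost all of the raw distance
print(scaled_share)  # feature 0 accounts for almost all of the scaled distance
```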



In [3]:
X = df[['clump_Thickness', 'unif_cell_size', 'unif_cell_shape', 'adhesion', 'epith_cell_Size', 'bare_nuclei',
              'bland_chromatin ','norm_nucleoli', 'mitoses']]

In [4]:
y = df['class_t']

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=53)

In [6]:
# Use sklearn to standardize the feature matrix
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Note: the scaled matrix is stored in `Xtrain`; `X_train` itself remains unscaled
Xtrain = scaler.fit_transform(X_train)

In [10]:
scores = []
for c in [0.1,0.2,0.3]:
    for krn in ['linear','rbf']:
        score = {}
        score['kernel'] = krn
        score['c'] = c
        model = SVC(C=c,kernel=krn)
        # Note: cross-validation here runs on the full, unscaled X and y, not on the training split
        cv_scores = cross_val_score(model,X,y,cv=5)
        score['mean'] = np.mean(cv_scores)
        score['std'] = np.std(cv_scores)
        scores.append(score)



In [11]:
df_scores = pd.DataFrame(scores)

In [12]:
df_scores

| | c | kernel | mean | std |
|---|---|---|---|---|
| 0 | 0.1 | linear | 0.96784 | 0.015675 |
| 1 | 0.1 | rbf | 0.93714 | 0.03275 |
| 2 | 0.2 | linear | 0.96638 | 0.014976 |
| 3 | 0.2 | rbf | 0.940059 | 0.029615 |
| 4 | 0.3 | linear | 0.96638 | 0.014976 |
| 5 | 0.3 | rbf | 0.944461 | 0.031104 |
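
The best combination can also be read off programmatically instead of by eye. The sketch below rebuilds the summary with the values from the table above and picks the row with the highest mean score; the variable name `best` is an assumption for the example.

```python
import pandas as pd

# Cross-validation summary copied from the table above
df_scores = pd.DataFrame({
    'c': [0.1, 0.1, 0.2, 0.2, 0.3, 0.3],
    'kernel': ['linear', 'rbf', 'linear', 'rbf', 'linear', 'rbf'],
    'mean': [0.96784, 0.93714, 0.96638, 0.940059, 0.96638, 0.944461],
    'std': [0.015675, 0.03275, 0.014976, 0.029615, 0.014976, 0.031104],
})

# idxmax returns the index label of the first row with the highest mean score
best = df_scores.loc[df_scores['mean'].idxmax()]
print(best['c'], best['kernel'])
```

This selects C = 0.1 with the linear kernel, the combination used in the next cell.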


In [13]:
best_model = SVC(C=0.1,kernel='linear')

In [14]:
best_model.fit(X_train,y_train)

SVC(C=0.1, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [15]:
# First, apply the same normalization to the test data
X_test = scaler.transform(X_test)


In [16]:
y_pred = best_model.predict(X_test)

In [17]:
accuracy_score(y_test,y_pred)

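0.6783625730994152

This is far below the cross-validation means above (about 0.97 for the linear kernel). The likely cause is a preprocessing mismatch: in In [6] the scaled training matrix is stored in `Xtrain`, so `best_model` is fit on the unscaled `X_train`, while `X_test` is standardized before prediction.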
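
A test accuracy well below the cross-validation mean is often a sign that training and test features were not preprocessed in the same way. The sketch below shows the usual pattern of fitting the scaler on the training split and applying it to both splits. It uses synthetic data from `make_classification` as a stand-in for the lab's CSV, and all variable names (`X_syn`, `scaler_demo`, `model_demo`, and so on) are assumptions, so its score says nothing about the breast cancer data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the lab data (9 features, 2 classes)
X_syn, y_syn = make_classification(n_samples=400, n_features=9, n_informative=5,
                                   n_redundant=2, n_clusters_per_class=1,
                                   class_sep=2.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=53)

scaler_demo = StandardScaler()
X_tr_s = scaler_demo.fit_transform(X_tr)  # fit the scaler on the training split only
X_te_s = scaler_demo.transform(X_te)      # reuse the training statistics on the test split

model_demo = SVC(C=0.1, kernel='linear')
model_demo.fit(X_tr_s, y_tr)              # train on the scaled features
acc = accuracy_score(y_te, model_demo.predict(X_te_s))  # evaluate on scaled features too
print(acc)
```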

# The cleaner way

We will run a grid search with cross validation to tune the SVM hyperparameters.

In [18]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

parameters = {'kernel':('linear', 'rbf'), 'C':[0.1,0.2,0.3]}

svc = SVC()
clf = GridSearchCV(svc, parameters, cv = 5, n_jobs = -1, iid = True, scoring='accuracy', return_train_score=True)
clf.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='auto_deprecated', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid=True, n_jobs=-1,
             param_grid={'C': [0.1, 0.2, 0.3], 'kernel': ('linear', 'rbf')},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring='accuracy', verbose=0)

In [19]:
# Retrieve the cross-validation results
scores = clf.cv_results_
scores = pd.DataFrame(scores)
scores.head()

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_C | param_kernel | params | split0_test_score | split1_test_score | split2_test_score | ... | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | split2_train_score | split3_train_score | split4_train_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.006314 | 0.00051 | 0.002213 | 0.000255 | 0.1 | linear | {'C': 0.1, 'kernel': 'linear'} | 0.980583 | 0.941748 | 0.92233 | ... | 0.964844 | 0.028508 | 1 | 0.97555 | 0.977995 | 0.987775 | 0.973171 | 0.975669 | 0.978032 | 0.005105 |
| 1 | 0.014483 | 0.002697 | 0.003625 | 0.000376 | 0.1 | rbf | {'C': 0.1, 'kernel': 'rbf'} | 0.902913 | 0.92233 | 0.941748 | ... | 0.935547 | 0.022398 | 6 | 0.94132 | 0.93643 | 0.933985 | 0.934146 | 0.924574 | 0.934091 | 0.005446 |
| 2 | 0.005439 | 0.000832 | 0.001778 | 0.000167 | 0.2 | linear | {'C': 0.2, 'kernel': 'linear'} | 0.980583 | 0.941748 | 0.92233 | ... | 0.962891 | 0.026285 | 2 | 0.97555 | 0.977995 | 0.987775 | 0.973171 | 0.975669 | 0.978032 | 0.005105 |
| 3 | 0.011962 | 0.001416 | 0.00347 | 0.000327 | 0.2 | rbf | {'C': 0.2, 'kernel': 'rbf'} | 0.902913 | 0.941748 | 0.941748 | ... | 0.943359 | 0.024687 | 5 | 0.95599 | 0.9511 | 0.9511 | 0.946341 | 0.941606 | 0.949228 | 0.004882 |
| 4 | 0.006117 | 0.000741 | 0.002044 | 0.00022 | 0.3 | linear | {'C': 0.3, 'kernel': 'linear'} | 0.980583 | 0.941748 | 0.92233 | ... | 0.962891 | 0.026285 | 2 | 0.977995 | 0.977995 | 0.987775 | 0.973171 | 0.973236 | 0.978034 | 0.005321 |


In [20]:
param = clf.best_params_
print(param)

{'C': 0.1, 'kernel': 'linear'}


In [21]:
best_model = SVC(C=0.1,kernel='linear')
best_model.fit(X_train,y_train)

SVC(C=0.1, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [22]:
# Apply the same normalization to the test data
# Note: X_test was already transformed in In [15], so this applies the scaler a second time
X_test = scaler.transform(X_test)

In [23]:
y_pred = best_model.predict(X_test)
accuracy_score(y_test,y_pred)

0.6783625730994152
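
A common way to keep scaling and model selection consistent is to wrap both steps in a `Pipeline` and hand that to `GridSearchCV`, so the scaler is refit inside every fold and applied automatically at prediction time. The following is a minimal sketch under stated assumptions: it runs on synthetic data from `make_classification` rather than the lab's CSV, the names `pipe`, `param_grid`, and `search` are made up, and the `iid` argument is left out because recent scikit-learn releases no longer accept it.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the lab data (9 features, 2 classes)
X_syn, y_syn = make_classification(n_samples=400, n_features=9, n_informative=5,
                                   n_redundant=2, n_clusters_per_class=1,
                                   class_sep=2.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=53)

# Scaling is part of the estimator, so each CV fold fits its own scaler
pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'
param_grid = {'svc__kernel': ['linear', 'rbf'], 'svc__C': [0.1, 0.2, 0.3]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_tr, y_tr)  # with the default refit=True, the best pipeline is refit on X_tr

print(search.best_params_)
test_acc = search.score(X_te, y_te)  # raw test features; the pipeline scales them itself
print(test_acc)
```

With this layout there is no separate `scaler.transform(X_test)` call to forget or repeat, and `search.best_estimator_` can be used directly for predictions.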