# Machine Learning Workshop
## Exercise - Breast Cancer Diagnosis
Dataset: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data<br>
Task: Preprocess a breast cancer dataset and train a machine learning model to diagnose the disease.

In [39]:
import pandas as pd
import numpy as np
import seaborn as sns

# | remove max column restriction for printing DataFrames (default=20)
pd.options.display.max_columns = None 

In [40]:
dataset_path = '../Datasets/breast-cancer.csv'

### 1. Preprocessing

<b> 1a) </b> Read the CSV file <i> breast-cancer.csv </i> with pandas and store it in a DataFrame.

In [41]:
df = pd.read_csv(dataset_path)
df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


<b> 1b) </b> Are there columns that we do not need for training the model? If so, drop them. (Hint: <code>DataFrame.drop()</code>)

In [42]:
df = df.drop('id', axis=1)

<b> 1c) </b> The "diagnosis" column indicates whether a person has cancer ('M', malignant) or not ('B', benign). We will use this column as the label to train our model. To use the scikit-learn functions, however, we need to convert it to a numeric column, i.e. 'B' -> 0, 'M' -> 1. <br>(Hint: <code>df['diagnosis'] = df['diagnosis'].map(...)</code> https://pandas.pydata.org/pandas-docs/stable//reference/api/pandas.Series.map.html)

In [43]:
df['diagnosis'] = df['diagnosis'].map({'B': 0, 'M': 1})
df['diagnosis'].head()

0    1
1    1
2    1
3    1
4    1
Name: diagnosis, dtype: int64
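
One caveat with `Series.map` and a dictionary: a value that is not a key of the dictionary becomes `NaN` without raising an error. The following minimal sketch uses a small made-up Series (not the workshop data, and the names `labels`, `mapped` and `n_unmapped` are only illustrative) to show how such a case could be detected.

```python
import pandas as pd

# Toy labels (made up for illustration); 'X' is deliberately not in the mapping
labels = pd.Series(['M', 'B', 'B', 'X'])

mapped = labels.map({'B': 0, 'M': 1})

# Values that are missing from the dictionary are turned into NaN
n_unmapped = int(mapped.isna().sum())
print(n_unmapped)
```

If the count is greater than zero, the column contains labels other than 'B' and 'M' and should be inspected before training.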

<b> 1d) </b> Counting the two classes 'B' and 'M' shows that there are more examples of class 'B' (healthy people) than of class 'M' (people with cancer). Some machine learning models do not perform well when the training set is imbalanced. Use the following code to balance the dataset.

In [44]:
# | count the examples of each class
print(df['diagnosis'].value_counts())

0    357
1    212
Name: diagnosis, dtype: int64


In [45]:
# | shuffle
df = df.sample(frac=1)

# | class balancing: downsample every class to the size of the smallest class
g = df.groupby('diagnosis')
df = g.apply(lambda x: x.sample(g.size().min())).reset_index(drop=True)
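
To make the balancing step easier to follow, here is a small sketch on made-up data. It is an alternative formulation, not the cell above: it assumes pandas 1.1 or newer, where `DataFrameGroupBy.sample` is available, and the names `toy`, `n_min` and `balanced` are hypothetical. Passing `random_state` makes the draw repeatable.

```python
import pandas as pd

# Toy data (made up): 6 examples of class 0 and 3 examples of class 1
toy = pd.DataFrame({
    'feature': range(9),
    'diagnosis': [0, 0, 0, 0, 0, 0, 1, 1, 1],
})

# Size of the smallest class
n_min = int(toy['diagnosis'].value_counts().min())

# Draw n_min rows from every class, then renumber the rows
balanced = (
    toy.groupby('diagnosis')
       .sample(n=n_min, random_state=0)
       .reset_index(drop=True)
)

counts = balanced['diagnosis'].value_counts()
print(counts)
```

Both classes end up with the same number of rows; the surplus rows of the larger class are discarded.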

<b> 1e) </b> Split the DataFrame into two new objects <code>X, y</code>, where <code>y</code> contains the labels (column "diagnosis") and <code>X</code> the remaining columns (features).

In [46]:
label_name = "diagnosis"

X = df.drop(label_name, axis=1)
y = df[label_name]

<b> 1f) </b> Standardize the features (X). Remember: the goal of standardization is to bring the values of the numeric columns in the dataset to a common scale. (Hint: <code>df.mean(), df.std()</code>)<br>
\begin{equation*}
x = \frac{x-\mu}{\sigma}
\end{equation*}

In [47]:
X = (X - X.mean()) / X.std()
X.head()

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,-1.14932,-0.291097,-1.15339,-1.01296,0.657859,-0.835227,-0.970585,-0.940886,-1.281112,0.424852,-0.974832,0.59543,-0.950951,-0.719816,0.622074,-0.97519,-0.590831,-0.519255,-0.30979,-0.230248,-1.197586,-0.324099,-1.211303,-0.992277,0.259204,-1.058478,-1.046555,-0.911956,-1.153186,-0.395958
1,-0.93762,-1.514032,-0.914953,-0.879829,1.354009,0.009151,-0.688725,-0.20959,-0.232676,-0.309273,-0.349257,0.206975,-0.583983,-0.400538,1.427734,0.440164,-0.44874,1.217268,-0.546701,-0.177637,-1.030483,-1.750152,-1.049854,-0.897313,0.058927,-0.592771,-1.03517,-0.608093,-1.320918,-0.949761
2,-0.511542,-0.853366,-0.550696,-0.54397,-0.252212,-0.918006,-0.777334,-0.980894,0.63437,-0.730561,-0.691528,-0.011973,-0.721146,-0.559489,-0.507195,-0.289115,-0.298026,-0.995963,0.590698,-0.718477,-0.645756,-0.470445,-0.69338,-0.632354,-0.42435,-0.601151,-0.51448,-1.021073,0.948759,-0.804243
3,-1.213633,-0.860394,-1.166593,-1.039587,0.607098,-0.035936,-0.436643,-0.411084,-0.334254,1.368926,-0.252736,0.432985,-0.698211,-0.472253,1.165348,0.863919,0.232473,0.51774,0.732171,1.570726,-1.154839,-0.96314,-1.189758,-0.967969,0.211312,-0.270726,-0.557982,-0.442814,-0.5563,0.577647
4,-0.921542,0.329742,-0.935147,-0.848985,-0.735167,-0.853803,-0.628688,-0.912032,-1.262973,0.326135,-0.691857,-0.423206,-0.698211,-0.595445,0.702136,-0.632208,-0.352763,-0.978517,-0.560175,-0.583793,-0.818689,0.425513,-0.841118,-0.761025,1.504405,-0.368896,0.007552,-0.656671,-0.222346,0.122251
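
The cell above computes the mean and standard deviation on the whole dataset, before the train/test split of the next step. A common alternative, sketched here with random numbers and under the assumption that the test set should stay completely unseen, is to compute both statistics on the training rows only and reuse them for the test rows. All names (`X_demo`, `mu`, `sigma`, ...) are illustrative.

```python
import numpy as np

# Random demo features (not the workshop data): 100 rows, 3 columns
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# Simple 70/30 split by position, only for this illustration
X_tr, X_te = X_demo[:70], X_demo[70:]

# Statistics come from the training rows only
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0, ddof=1)  # ddof=1 matches the default of pandas' std()

# The same mu and sigma are applied to both parts
X_tr_std = (X_tr - mu) / sigma
X_te_std = (X_te - mu) / sigma
```

After this transformation each training column has mean 0 and standard deviation 1; the test columns are only approximately standardized, which is expected.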


<b> 1g) </b> Split the dataset into a training set (70%) and a test set (30%). You can use the scikit-learn function <code>train_test_split()</code> https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html. <br>

In the end you should have four new variables: <br>
- X_train : Training features
- y_train : Training labels
- X_test : Test features
- y_test : Test labels

In [48]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
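
The split above is random, so every run produces different sets and slightly different scores. As a sketch on made-up arrays, the optional `random_state` and `stratify` parameters of `train_test_split` can make the split repeatable and keep the class proportions equal in both parts. The variable names below are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up): 20 rows, 2 features, 10 examples per class
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0] * 10 + [1] * 10)

# random_state fixes the shuffle; stratify keeps the 50/50 class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

print(len(X_tr), len(X_te), int(y_te.sum()))
```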

## 2. Machine Learning

### Training

<b> 2a) </b> Our data is now ready for training our first model. Train a Random Forest model using the scikit-learn class <code>RandomForestClassifier</code> (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

In [49]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

### Prediction & Evaluation

<b> 2b) </b> Use <code>model.predict(X_test)</code> to compute the predictions on the test set. Then compute the percentage of correct predictions ("accuracy"). (Hint: You can use the scikit-learn function <code>accuracy_score</code> https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html, or compute it manually.) If you get a value above 90%, you have done everything right; if the value is lower, check your code or ask the tutor.

In [50]:
from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test)

print(accuracy_score(y_test, y_pred))

0.9765625
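
For the manual route mentioned in the exercise, accuracy is simply the fraction of predictions that equal the true label. A minimal sketch with made-up label arrays (not the model output above):

```python
import numpy as np

# Made-up labels and predictions, 8 examples
y_true_demo = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_demo = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Element-wise comparison gives booleans; their mean is the share of hits
accuracy = np.mean(y_true_demo == y_pred_demo)
print(accuracy)  # 6 of 8 predictions are correct -> 0.75
```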


<b> 2c) </b> Evaluate your model using <code>cross_val_score(..., scoring='accuracy')</code> with 5 dataset splits (cv=5), and compute the mean and standard deviation of the results (with numpy). https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html. Define a new model variable and initialize it with a new RandomForestClassifier before calling cross_val_score, so that you do not reuse the model that has already been trained!

In [55]:
from sklearn.model_selection import cross_val_score
import numpy as np

model = RandomForestClassifier(n_estimators=100, random_state=0)

cv = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
print(f'mean: {np.mean(cv)}, std: {np.std(cv)}')

mean: 0.949080459770115, std: 0.028591824062547577


### Bonus exercise: Model Selection & Grid Search

<b> 2d) </b> Compare the models <code>RandomForestClassifier, KNeighborsClassifier, SVC</code>. Which one works best? Try training them with different parameters and observe whether the results improve.
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html <br>
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html <br>
https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

Hint: The most relevant parameters of these models are listed below:

- RandomForestClassifier(): n_estimators, criterion, max_depth
- KNeighborsClassifier(): n_neighbors, weights, algorithm
- SVC(): C, kernel, degree, gamma
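
One possible way to run such a comparison is to loop over the candidate models and score each one with the same cross-validation setup. The sketch below uses synthetic data from `make_classification` so that it runs on its own; in the workshop one would pass `X_train, y_train` instead. The parameter values are arbitrary starting points, not recommendations, and the names `models` and `summary` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: replace with X_train, y_train)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

models = {
    'RandomForestClassifier': RandomForestClassifier(n_estimators=50, random_state=0),
    'KNeighborsClassifier': KNeighborsClassifier(n_neighbors=5),
    'SVC': SVC(C=1, kernel='rbf', gamma='scale'),
}

# Same 5-fold accuracy evaluation for every model
summary = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
    summary[name] = (scores.mean(), scores.std())
    print(f'{name}: mean={scores.mean():.3f}, std={scores.std():.3f}')
```

Comparing the means together with the standard deviations helps to judge whether a difference between two models is larger than the variation between folds.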

<b> 2e) </b> You have probably noticed that there are many different combinations of parameters. Changing the parameters by hand every time a new model is trained is not very efficient. Scikit-learn has a class that helps to speed up this process: <code>GridSearchCV</code> performs an exhaustive search over specified parameter values for an estimator.
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

Hint:
It is important not to choose too many parameters. A new model is trained for every combination of parameters; if we run the search with thousands of different combinations, it can take a long time, depending on the model and the size of the dataset.<br>

In [16]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

pipe = Pipeline([('classifier' , None)])

param_grid = [    
    {'classifier' : [SVC()],
    'classifier__C': [0.1, 1, 5, 10],
    'classifier__kernel': ["poly","rbf"],
    'classifier__degree': [1, 2, 3, 4],
    'classifier__gamma': [0.001, 0.01, 0.1]},
    
    {'classifier' : [RandomForestClassifier()],
    'classifier__n_estimators' : list(range(50,200,20)),
    'classifier__max_features' : ["sqrt", "log2"],  # "sqrt" is what "auto" meant for classifiers; "auto" was removed in scikit-learn 1.3
    'classifier__criterion' : ['gini', 'entropy'],
    'classifier__oob_score' : [True, False]},

    {'classifier' : [KNeighborsClassifier()],
    'classifier__n_neighbors': np.arange(3, 8),
    'classifier__weights': ['uniform', 'distance'],
    'classifier__algorithm': ['ball_tree', 'kd_tree', 'brute']}
]


grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy', return_train_score=True)
_ = grid_search.fit(X, y)
# grid_search.cv_results_ 
# grid_search.best_params_
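
To get a feeling for the cost of the search above, the number of candidates can be counted before fitting: each dictionary contributes the product of the lengths of its value lists. The sketch below re-creates the list lengths of the three grids in plain Python (the helper `grid_size` is hypothetical, not part of scikit-learn). It also shows a leaner SVC grid, based on the fact that `degree` is only used by the `poly` kernel, so varying it for `rbf` repeats identical models.

```python
from math import prod

def grid_size(grid):
    """Number of parameter combinations in one grid dictionary."""
    return prod(len(values) for values in grid.values())

svc_grid = {'C': [0.1, 1, 5, 10], 'kernel': ['poly', 'rbf'],
            'degree': [1, 2, 3, 4], 'gamma': [0.001, 0.01, 0.1]}
rf_grid = {'n_estimators': list(range(50, 200, 20)), 'max_features': ['sqrt', 'log2'],
           'criterion': ['gini', 'entropy'], 'oob_score': [True, False]}
knn_grid = {'n_neighbors': [3, 4, 5, 6, 7], 'weights': ['uniform', 'distance'],
            'algorithm': ['ball_tree', 'kd_tree', 'brute']}

n_candidates = sum(grid_size(g) for g in (svc_grid, rf_grid, knn_grid))
n_fits = n_candidates * 5  # cv=5: every candidate is trained five times

# Leaner SVC grid: vary degree only where it has an effect
svc_poly = {'C': [0.1, 1, 5, 10], 'kernel': ['poly'],
            'degree': [1, 2, 3, 4], 'gamma': [0.001, 0.01, 0.1]}
svc_rbf = {'C': [0.1, 1, 5, 10], 'kernel': ['rbf'], 'gamma': [0.001, 0.01, 0.1]}
n_svc_lean = grid_size(svc_poly) + grid_size(svc_rbf)

print(n_candidates, n_fits, grid_size(svc_grid), n_svc_lean)
```

The redundant `rbf` candidates are one plausible reason why many rows in a results table of such a search share exactly the same score.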



In [17]:
results = pd.DataFrame(grid_search.cv_results_)
results.sort_values(by='rank_test_score', ascending=True, inplace=True)
results[['rank_test_score', 'mean_test_score', 'params']].head(20)

# for r in results['params'].head(20):
#    print(r)

Unnamed: 0,rank_test_score,mean_test_score,params
28,1,0.959906,"{'classifier': SVC(C=1, cache_size=200, class_..."
74,1,0.959906,"{'classifier': SVC(C=1, cache_size=200, class_..."
93,3,0.955189,"{'classifier': SVC(C=1, cache_size=200, class_..."
87,3,0.955189,"{'classifier': SVC(C=1, cache_size=200, class_..."
50,3,0.955189,"{'classifier': SVC(C=1, cache_size=200, class_..."
81,3,0.955189,"{'classifier': SVC(C=1, cache_size=200, class_..."
51,3,0.955189,"{'classifier': SVC(C=1, cache_size=200, class_..."
75,3,0.955189,"{'classifier': SVC(C=1, cache_size=200, class_..."
69,3,0.955189,"{'classifier': SVC(C=1, cache_size=200, class_..."
57,3,0.955189,"{'classifier': SVC(C=1, cache_size=200, class_..."
