# Class 3 - Classification Models

This is the notebook for class (unit) 3, which covers classification models. In particular, we will review K-NN, Naive Bayes, LDA, QDA, SVM and classification trees, from splitting the data into training and test sets, through hyperparameter optimization, to measuring model performance.

The data we will work with is breast cancer data collected by the University of Wisconsin Hospitals, Madison. It contains 699 records, 10 attributes, and the target variable, which indicates whether the tumor is benign or malignant.

## K-Nearest Neighbors

In [None]:
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

breast_data = pd.read_excel('./Data/breast-cancer-wisconsin.xlsx')
breast_data.head()


In [None]:
breast_data.Class.unique()

In [None]:
breast_data.describe()

In [None]:
breast_data['Class'].value_counts()

The first thing to do is split the data into a training set and a test set, using the `train_test_split` function from `sklearn.model_selection`. Remember that K-NN is based on distances, so the data should be standardized or normalized before it is fed to the model.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

#breast_data.drop(['ID'], axis='columns',inplace=True)
breast_X =  breast_data.drop(['Class'], axis='columns')
breast_y = breast_data['Class']

scaler = MinMaxScaler().fit(breast_X)
scaled_breast_X = pd.DataFrame(scaler.transform(breast_X), columns=breast_X.columns)
scaled_breast_X.head()

In [None]:
# Split the scaled features, since K-NN works with distances
train_X, test_X, train_y, test_y = train_test_split(scaled_breast_X, breast_y, test_size=0.2, stratify=breast_y, random_state=2023)

In [None]:
display(train_y.value_counts())
display(test_y.value_counts())
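
The scaling advice above can also be applied without letting the test rows influence the scaler. The following is a minimal sketch, not the notebook's own method: it assumes a synthetic stand-in data set built with `make_classification` (the 9 features are an arbitrary choice) and wraps `MinMaxScaler` and `KNeighborsClassifier` in a pipeline so that the scaler is fitted on the training rows only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the breast cancer table (assumption: 9 numeric features).
X, y = make_classification(n_samples=699, n_features=9, n_informative=5,
                           random_state=2023)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2023)

# fit() learns the min/max on the training rows only; score() reuses those
# values to transform the test rows before predicting.
knn_pipeline = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=3))
knn_pipeline.fit(X_train, y_train)

test_accuracy = knn_pipeline.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Keeping the scaler inside the pipeline also means it is refitted on each training fold if the pipeline is later passed to a cross-validation routine.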

Now we fit the K-NN model.

In [None]:
knn_model = KNeighborsClassifier(n_neighbors = 3)
knn_model.fit(train_X,train_y)

Let's see how well the model does: we will make predictions and then print the confusion matrix.

In [None]:
knn_model.predict_proba(test_X)

In [None]:
knn_model.predict(test_X)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

pred_values = knn_model.predict(test_X)
print(classification_report(test_y,pred_values))
print(confusion_matrix(test_y,pred_values))
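
To make the link between the confusion matrix and the report explicit, here is a small worked example. The labels are made up for illustration (they are not the notebook's data); the point is only how precision and recall for the positive class follow from the four cells of the matrix.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 is the positive class, 0 the negative class.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# Rows are true classes and columns are predicted classes, so for a binary
# problem ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many are right
recall = tp / (tp + fn)     # of the true positives, how many were found

print(tn, fp, fn, tp)
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
```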

In [None]:
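# Column order follows knn_model.classes_; "No" and "Si" are display names for the two classes
probs = pd.DataFrame(knn_model.predict_proba(test_X),columns = ["No",'Si'])
# Lower the decision threshold for the "Si" class from the default 0.5 to 0.3
probs["New_si"] = np.where(probs["Si"] >= 0.3,1,0)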

In [None]:
probs.head(100)

In [None]:
print(classification_report(test_y,probs["New_si"]))
print(confusion_matrix(test_y,probs["New_si"]))
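
The thresholding step above can be written so that predictions keep the model's own class labels. This is a sketch under assumptions: the data is synthetic, and the helper `predict_with_threshold` is a hypothetical name introduced here. It reads the label order from `classes_`, which is the column order of `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# classes_ gives the column order of predict_proba.
negative_label, positive_label = model.classes_
positive_proba = model.predict_proba(X_test)[:, 1]

def predict_with_threshold(proba, threshold):
    """Hypothetical helper: label as positive when proba >= threshold."""
    return np.where(proba >= threshold, positive_label, negative_label)

recall_default = recall_score(
    y_test, predict_with_threshold(positive_proba, 0.5), pos_label=positive_label)
recall_low = recall_score(
    y_test, predict_with_threshold(positive_proba, 0.3), pos_label=positive_label)
print(recall_default, recall_low)
```

Lowering the threshold can only add predicted positives, so recall for the positive class never decreases; the cost usually shows up as lower precision.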

Let's try another data set.

In [None]:
heart = pd.read_excel('./Data/Heart.xlsx')
heart.head()

In [None]:
heart.dropna(inplace=True)
heart_X = heart.drop(['Sex','ChestPain','Thal','AHD'],axis='columns')
# Encode the target as 1 (heart disease) / 0 (no heart disease)
heart_y = heart['AHD'].map({'Yes': 1, 'No': 0})

In [None]:
minmax_ahd = MinMaxScaler().fit(heart_X)
scaled_heart_df = pd.DataFrame(minmax_ahd.transform(heart_X), columns=heart_X.columns)
scaled_heart_df.head()

In [None]:
train_X, test_X, train_y, test_y = train_test_split(heart_X, heart_y, test_size=0.2, stratify=heart_y,random_state=2023)

In [None]:
minmax_ahd = MinMaxScaler().fit(train_X)
scaled_heart_train_X = pd.DataFrame(minmax_ahd.transform(train_X), columns=heart_X.columns)
scaled_heart_test_X = pd.DataFrame(minmax_ahd.transform(test_X), columns=heart_X.columns)
scaled_heart_train_X.head()

In [None]:
scaled_heart_train_X.describe()

In [None]:
scaled_heart_test_X.describe()

In [None]:
heart_y.value_counts()

In [None]:
160/(160+137)
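
The ratio above is the accuracy a model would reach by always predicting the majority class, which is a useful floor for judging the classifiers that follow. A minimal sketch with `DummyClassifier`, assuming only the two class counts from the previous cell (which label is the majority is an assumption made for illustration):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Class counts taken from the cell above; assigning 0 to the larger
# class is an assumption for this illustration.
y = np.array([0] * 160 + [1] * 137)
X = np.zeros((len(y), 1))  # features are ignored by the dummy model

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_accuracy = baseline.score(X, y)
print(baseline_accuracy)
```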

In [None]:
knn_model = KNeighborsClassifier(n_neighbors = 3)
knn_model.fit(scaled_heart_train_X,train_y)

pred_values = knn_model.predict(scaled_heart_test_X)
print(classification_report(test_y,pred_values))
print(confusion_matrix(test_y,pred_values))

Let's see what happens when we optimize the model's hyperparameters using `GridSearchCV`.

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_neighbors' : [2, 3, 5, 7, 8, 9, 10, 11, 15],
    'weights' : ['uniform','distance'],
    'metric' : ['euclidean','manhattan']
}

opt_knn_model = GridSearchCV(KNeighborsClassifier(), 
                             param_grid = param_grid, 
                             n_jobs=-1,
                             cv = 5,
                             scoring = 'accuracy')

opt_knn_model.fit(scaled_heart_train_X,train_y)
pred_values_knn = opt_knn_model.predict(scaled_heart_test_X)

print(opt_knn_model.best_params_)
print(opt_knn_model.best_score_)
print(classification_report(test_y,pred_values_knn))
print(confusion_matrix(test_y,pred_values_knn))


In [None]:
param_grid = {
    'n_neighbors' : range(2,25),
    'weights' : ['uniform','distance'],
    'metric' : ['euclidean','manhattan']
}

opt_knn_model = GridSearchCV(KNeighborsClassifier(), 
                             param_grid = param_grid, 
                             n_jobs=-1,
                             cv = 5,
                             scoring = 'f1')

opt_knn_model.fit(scaled_heart_train_X,train_y)
pred_values_knn = opt_knn_model.predict(scaled_heart_test_X)

print(opt_knn_model.best_params_)
print(opt_knn_model.best_score_)
print(classification_report(test_y,pred_values_knn))
print(confusion_matrix(test_y,pred_values_knn))

In [None]:
model = KNeighborsClassifier(n_neighbors=9, metric='manhattan', weights= "distance")

In [None]:
opt_knn_model.cv_results_
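
The raw `cv_results_` dictionary is hard to read. One option, sketched here on synthetic data with a smaller grid than the notebook's, is to load it into a DataFrame and sort by rank so that the best parameter combinations appear first.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data and a reduced grid, used only for illustration.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
param_grid = {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]}

search = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid,
                      cv=5, scoring="f1")
search.fit(X, y)

# One row per parameter combination, best-ranked first.
results = pd.DataFrame(search.cv_results_)
summary = (results[["params", "mean_test_score", "std_test_score", "rank_test_score"]]
           .sort_values("rank_test_score")
           .reset_index(drop=True))
print(summary.head())
```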

In [None]:
opt_knn_model.predict_proba(scaled_heart_test_X)

## Naive Bayes

Now let's see how the Naive Bayes model performs on this last data set.

In [None]:
from sklearn.naive_bayes import GaussianNB

naive_model = GaussianNB()
naive_model.fit(train_X,train_y)
pred_values_nb = naive_model.predict(test_X)

print(classification_report(test_y,pred_values_nb))
print(confusion_matrix(test_y,pred_values_nb))

In [None]:
# so far
from sklearn.metrics import f1_score

print(f"K-NN: {f1_score(test_y, pred_values_knn)}")
print(f"Naive Bayes: {f1_score(test_y, pred_values_nb)}")

## SVM

Now it's the turn of the support vector machine.

In [None]:
from sklearn.svm import SVC

svm_model = SVC(kernel = 'rbf', C = 1) # C parameter, kernel
svm_model.fit(train_X,train_y)
pred_values_svm = svm_model.predict(test_X)

print(classification_report(test_y,pred_values_svm))
print(confusion_matrix(test_y,pred_values_svm))
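
Like K-NN, an SVM with an RBF kernel depends on distances between observations, so feature scale can matter. The sketch below is an assumption-laden illustration rather than the notebook's procedure: it uses synthetic data with one deliberately inflated feature and puts `MinMaxScaler` and `SVC` in a `Pipeline`, tuning the SVM through the `svc__` parameter prefix.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic data; the first feature is inflated to mimic mixed units.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X[:, 0] *= 1000
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

svm_pipeline = Pipeline([("scaler", MinMaxScaler()), ("svc", SVC())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {"svc__kernel": ["linear", "rbf"], "svc__C": [0.1, 1, 10]}
search = GridSearchCV(svm_pipeline, param_grid=param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

test_f1 = search.score(X_test, y_test)
print(search.best_params_)
print(test_f1)
```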

Let's tune the SVM model a bit more.

In [None]:
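# Note: 'degree' only affects kernel='poly'; it is ignored by 'linear' and 'rbf',
# so each (kernel, C) pair is evaluated twice with identical results.
param_grid = {
    'degree' : [2,3],
    'kernel' : ['linear','rbf'],
    'C' : [0.01,0.1,1,10,100]
}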

opt_svm_model = GridSearchCV(SVC(), 
                             param_grid = param_grid, 
                             cv = 5, 
                             n_jobs = -1,
                             scoring = "f1")

opt_svm_model.fit(train_X,train_y)
pred_values_osvm = opt_svm_model.predict(test_X)

print(opt_svm_model.best_params_)
print(classification_report(test_y,pred_values_osvm))
print(confusion_matrix(test_y,pred_values_osvm))

In [None]:
# so far
from sklearn.metrics import f1_score

print(f"K-NN: {f1_score(test_y, pred_values_knn)}")
print(f"Naive Bayes: {f1_score(test_y, pred_values_nb)}")
print(f"SVM: {f1_score(test_y, pred_values_osvm)}")

## Decision Trees

In [None]:
train_y

In [None]:
from sklearn.tree import DecisionTreeClassifier, plot_tree

tree_model = DecisionTreeClassifier(max_depth=4, min_samples_leaf=8)
tree_model.fit(train_X, train_y)
pred_values_tree = tree_model.predict(test_X)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(test_y,pred_values_tree))
print(confusion_matrix(test_y,pred_values_tree))

Let's look at what the tree looks like.

In [None]:
heart.columns

In [None]:
from matplotlib import pyplot as plt

plt.figure(figsize=(20,20))
features = heart_X.columns
classes = ['Not heart disease','heart disease']
plot_tree(tree_model,feature_names=features,class_names=classes,filled=True)
plt.show()

In [None]:
tree_model.feature_importances_

In [None]:
tree_model.feature_names_in_

We can make this tree a bit better.

In [None]:
from sklearn.model_selection import GridSearchCV

params = {'max_depth': [2,4,6,8,10,12],
         'min_samples_split': [5,10,15,20],
         'min_samples_leaf': [5,10,15,20]}

opt_tree_model = GridSearchCV(DecisionTreeClassifier(),param_grid=params, cv = 5, n_jobs=-1)
opt_tree_model.fit(train_X,train_y)
pred_values_otree = opt_tree_model.predict(test_X)

print(opt_tree_model.best_params_)
print(classification_report(test_y,pred_values_otree))
print(confusion_matrix(test_y,pred_values_otree))

Let's see what the tree looks like now.

In [None]:
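# best_estimator_ is already refit on the full training set (refit=True by default)
best_tree_model = opt_tree_model.best_estimator_

plt.figure(figsize=(20,20))
# Use the columns the tree was trained on, not every column of the raw table
features = heart_X.columns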
classes = ['Not heart disease','heart disease']
plot_tree(best_tree_model,feature_names=features,class_names=classes,filled=True)
plt.show()

What we did above, limiting the maximum depth of the tree, helps avoid overfitting and obtain better predictions. Trees have another parameter, `ccp_alpha` (the cost-complexity pruning parameter, analogous to `cp` in R's `rpart`), which controls the complexity of the model and can also be tuned.

In [None]:
from sklearn.metrics import accuracy_score

tree_model = DecisionTreeClassifier(random_state=0)  # same seed as the models fitted below
path = tree_model.cost_complexity_pruning_path(train_X, train_y)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
print(ccp_alphas[:-1])

Next, we fit one model for each of the values above and store them in a list.

In [None]:
models = []
for ccp_alpha in ccp_alphas[:-1]:
    model = DecisionTreeClassifier(random_state=0, ccp_alpha=ccp_alpha)
    model.fit(train_X, train_y)
    models.append(model)

Finally, we compute the accuracy of each model on the test set.

In [None]:
test_acc = []
for c in models:
    test_pred = c.predict(test_X)
    # accuracy_score expects (y_true, y_pred)
    test_acc.append(accuracy_score(test_y, test_pred))

plt.scatter(ccp_alphas[:-1],test_acc)
plt.plot(ccp_alphas[:-1],test_acc,label='test_accuracy',drawstyle="steps-post")
plt.title('Accuracy vs alpha')
plt.show()
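
Picking alpha from the test-set curve above means the test set is no longer an untouched estimate of performance. An alternative, sketched here on synthetic data and not taken from the notebook, is to choose alpha by cross-validation on the training set and use the test set only once at the end.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train)
# Drop the last alpha (it prunes the tree to a single node) and clip to
# guard against tiny negative values caused by floating-point rounding.
ccp_alphas = np.clip(path.ccp_alphas[:-1], 0, None)

# Mean 5-fold accuracy on the training set for each candidate alpha.
cv_scores = [
    cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=alpha),
                    X_train, y_train, cv=5).mean()
    for alpha in ccp_alphas
]
best_alpha = ccp_alphas[int(np.argmax(cv_scores))]

final_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha)
final_tree.fit(X_train, y_train)
test_accuracy = final_tree.score(X_test, y_test)
print(best_alpha, test_accuracy)
```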

In [None]:
params = {'max_depth': [2,4,6,8,10,12],
         'min_samples_split': [5,10,15,20],
         'min_samples_leaf': [5,10,15,20]}

opt_tree_model = GridSearchCV(DecisionTreeClassifier(ccp_alpha=0.01514514),param_grid=params,cv = 5, n_jobs=-1)
opt_tree_model.fit(train_X,train_y)
pred_values_otree = opt_tree_model.predict(test_X)

print(opt_tree_model.best_params_)
print(classification_report(test_y,pred_values_otree))
print(confusion_matrix(test_y,pred_values_otree))

We could also do it directly with `GridSearchCV`.

In [None]:
params = {'ccp_alpha': ccp_alphas[:-1],
         'max_depth': [2,4,6,8,10,12],
         'min_samples_split': [5,10,15,20],
         'min_samples_leaf': [5,10,15,20]}

opt_tree_model = GridSearchCV(DecisionTreeClassifier(),param_grid=params, cv = 5, n_jobs=-1)
opt_tree_model.fit(train_X,train_y)
pred_values_otree = opt_tree_model.predict(test_X)

print(opt_tree_model.best_params_)
print(classification_report(test_y,pred_values_otree))
print(confusion_matrix(test_y,pred_values_otree))

In [None]:
# so far
from sklearn.metrics import f1_score

print(f"K-NN: {f1_score(test_y, pred_values_knn)}")
print(f"Naive Bayes: {f1_score(test_y, pred_values_nb)}")
print(f"SVM: {f1_score(test_y, pred_values_osvm)}")
print(f"Tree: {f1_score(test_y, pred_values_otree)}")

In [None]:
# scipy.io.arff ships with SciPy, so the separate "arff" package is not needed
%pip install scipy

In [None]:
import scipy.io.arff as arff

In [None]:
# loadarff accepts a path directly and closes the file when it is done
data = arff.loadarff('./Data/Rice_Cammeo_Osmancik.arff')

In [None]:
data

In [None]:
import pandas as pd
import numpy as np

df = pd.DataFrame(data[0])
df.head()

In [None]:
df.Class.value_counts()
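
`scipy.io.arff.loadarff` returns nominal attributes as byte strings, so a column such as `Class` shows values like `b'Cammeo'`. The sketch below uses a tiny inline ARFF document (made up for illustration, with only an `Area` and a `Class` attribute) to show one way to decode those columns into regular strings.

```python
from io import StringIO

import pandas as pd
from scipy.io import arff

# Tiny inline ARFF file; the attribute names and rows are illustrative only.
arff_text = """@relation rice
@attribute Area numeric
@attribute Class {Cammeo,Osmancik}
@data
15231,Cammeo
11434,Osmancik
"""

data, meta = arff.loadarff(StringIO(arff_text))
df = pd.DataFrame(data)

# Nominal attributes arrive as bytes; decode every object column to str.
for column in df.columns:
    if df[column].dtype == object:
        df[column] = df[column].str.decode("utf-8")

print(df)
```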

In [None]:
import scipy.io.arff as arff
import pandas as pd

# Replace FILE_PATH with the path to your .arff file
data = arff.loadarff('FILE_PATH')
df = pd.DataFrame(data[0])
df.head()