# K nearest neighbors

Implementaremos el algoritmo que vimos en la teoría, utilizando sklearn.

Sklearn tiene una implementación para KNN classifier: [documentación KNNClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)

Y otra para el regressor: [documentación KNNRegressor](
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html)




In [269]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Vamos a trabajar con el dataset penguins que podemos cargar desde seaborn.

La idea es que entrenemos un KNN para clasificar pinguinos (predecir la variable species)

In [270]:
df = sns.load_dataset("penguins")

In [271]:
df.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


¿ Hay nulos ?

In [272]:
df.isna().sum()

species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
dtype: int64

En caso de haberlos, por simplicidad los vamos a descartar.

Descartar nulos:

In [273]:
df.dropna(inplace=True)

In [274]:
df.isna().sum()

species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
body_mass_g          0
sex                  0
dtype: int64

Dividimos en X e y

In [275]:
X = df[["island", "bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g", "sex"]]
y = df["species"].copy()

In [276]:
X.head()

Unnamed: 0,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Torgersen,40.3,18.0,195.0,3250.0,Female
4,Torgersen,36.7,19.3,193.0,3450.0,Female
5,Torgersen,39.3,20.6,190.0,3650.0,Male


In [277]:
y.head()

0    Adelie
1    Adelie
2    Adelie
4    Adelie
5    Adelie
Name: species, dtype: object

¿ Cuántos pinguinos tenemos de cada especie ?



In [278]:
df["species"].value_counts()

Adelie       146
Gentoo       119
Chinstrap     68
Name: species, dtype: int64

¿ Y en porcentajes ?

In [279]:
df["species"].value_counts() / df["species"].count()

Adelie       0.438438
Gentoo       0.357357
Chinstrap    0.204204
Name: species, dtype: float64

## Baseline

¿Cómo se les ocurre definir un baseline para este caso?

No hay una única manera correcta, tiene que ser un modelo simple.



In [280]:
df.groupby("species").mean()

Unnamed: 0_level_0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Adelie,38.823973,18.34726,190.10274,3706.164384
Chinstrap,48.833824,18.420588,195.823529,3733.088235
Gentoo,47.568067,14.996639,217.235294,5092.436975


In [281]:
y_baseline = []
for i in range(333):
    y_baseline.append("Adelie")

In [282]:
df["species"].count()

333

El modelo que desarrollemos, tiene que ser mejor que este baseline. ¿ Qué accuracy_score tiene el baseline ?

In [283]:
from sklearn.metrics import accuracy_score

baseline_score = accuracy_score(df["species"], y_baseline)
baseline_score

0.43843843843843844

## Train - test split

Como vimos la clase anterior, es importante guardarnos un conjunto de test para evaluar el modelo.

Vamos a hacer un train-test split utilizando sklearn.

primero, importar train_test_split de sklearn:

In [284]:
from sklearn.model_selection import train_test_split

Aplicar la función para obtener: X_train, X_test, y_train e y_test.

Vamos a tomar un 15% de los datos para el conjunto de test. Como las clases no están balanceadas, sería bueno utilizar el stratify que nos provee sklearn (vimos en el notebook de la clase pasada)

In [285]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, stratify=y)

In [286]:
X_train.shape

(283, 6)

In [287]:
X_test.shape

(50, 6)

In [288]:
y_train.shape

(283,)

In [289]:
y_test.shape

(50,)

## Preprocesamiento de datos

Vimos que en KNN es muy importante que los datos estén en una misma escala.

¿En que rango de valores se encuentran las variables numéricas del dataset?

In [290]:
df.describe()

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
count,333.0,333.0,333.0,333.0
mean,43.992793,17.164865,200.966967,4207.057057
std,5.468668,1.969235,14.015765,805.215802
min,32.1,13.1,172.0,2700.0
25%,39.5,15.6,190.0,3550.0
50%,44.5,17.3,197.0,4050.0
75%,48.6,18.7,213.0,4775.0
max,59.6,21.5,231.0,6300.0


Debemos llevar todo a una misma escala. Para esto utilizaremos el StandardScaler de sklearn.

Importar standard scaler:

In [291]:
from sklearn.preprocessing import StandardScaler

Crear una instancia de StandardScaler

In [292]:
scaler = StandardScaler()

Como siempre en Sklearn, tenemos que hacer un fit con nuestros datos de entrenamiento a el objeto.

Hacer un fit a el scaler con los datos NUMERICOS de train:

In [293]:
columnas_numericas = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]

In [294]:
scaler.fit(X_train[columnas_numericas])

StandardScaler()

Ahora, con el scaler podemos transformar los datos tanto en train como en test.

Transformar los datos numéricos de train (aplicar el scaler):

In [295]:
X_train[columnas_numericas] = scaler.transform(X_train[columnas_numericas])

In [296]:
X_train

Unnamed: 0,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
341,Biscoe,1.187135,-0.750946,1.500496,1.958254,Male
102,Biscoe,-1.182980,-0.599508,-1.296966,-1.430584,Female
135,Dream,-0.548461,0.157681,-0.794857,-0.385429,Male
48,Dream,-1.500240,0.359598,-0.794857,-0.955514,Female
72,Torgersen,-0.828396,0.006243,-0.364479,-0.828828,Female
...,...,...,...,...,...,...
181,Dream,1.635031,1.419663,0.281090,0.438028,Male
234,Biscoe,0.328668,-1.306218,0.639738,-0.005372,Female
243,Biscoe,0.421980,-0.700467,0.998387,1.071456,Male
154,Dream,1.355096,1.015828,-0.579668,-0.702142,Male


Ahora nos quedan 2 variables categóricas, vamos a aplicar one hot encoder.

Recuerden que el fit se hace sobre los datos de entrenamiento y luego sobre los datos de test aplicamos únicamente transform.

Importar one hot encoder:

In [297]:
from sklearn.preprocessing import OneHotEncoder

Instanciar one hot encoder para cada variable categórica:

In [298]:
ohe_island = OneHotEncoder(sparse=False, handle_unknown="ignore")
ohe_sex = OneHotEncoder(sparse=False, handle_unknown="ignore")

Hacer fit con los datos de entrenamiento para ambos encoders:

In [299]:
X_train.reset_index(drop=True, inplace=True)

In [300]:
ohe_island.fit(X_train[["island"]])
ohe_sex.fit(X_train[["sex"]])

OneHotEncoder(handle_unknown='ignore', sparse=False)

Obtener las variables con one hot encoded para ambas variables categóricas:

In [301]:
island_onehot = pd.DataFrame(ohe_island.transform(X_train[["island"]]))
sex_onehot = pd.DataFrame(ohe_sex.transform(X_train[["sex"]]))

Hacer concat con X_train:

In [302]:
X_train = pd.concat([X_train, island_onehot], axis = 1)
X_train

Unnamed: 0,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,0,1,2
0,Biscoe,1.187135,-0.750946,1.500496,1.958254,Male,1.0,0.0,0.0
1,Biscoe,-1.182980,-0.599508,-1.296966,-1.430584,Female,1.0,0.0,0.0
2,Dream,-0.548461,0.157681,-0.794857,-0.385429,Male,0.0,1.0,0.0
3,Dream,-1.500240,0.359598,-0.794857,-0.955514,Female,0.0,1.0,0.0
4,Torgersen,-0.828396,0.006243,-0.364479,-0.828828,Female,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...
278,Dream,1.635031,1.419663,0.281090,0.438028,Male,0.0,1.0,0.0
279,Biscoe,0.328668,-1.306218,0.639738,-0.005372,Female,1.0,0.0,0.0
280,Biscoe,0.421980,-0.700467,0.998387,1.071456,Male,1.0,0.0,0.0
281,Dream,1.355096,1.015828,-0.579668,-0.702142,Male,0.0,1.0,0.0


In [303]:
X_train["island"].unique()

array(['Biscoe', 'Dream', 'Torgersen'], dtype=object)

In [304]:
X_train.rename(columns={0: "island_Biscoe", 1: "island_Dream", 2: "island_Torgersen"}, inplace=True)

In [305]:
X_train = pd.concat([X_train, sex_onehot], axis = 1)
X_train

Unnamed: 0,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,island_Biscoe,island_Dream,island_Torgersen,0,1
0,Biscoe,1.187135,-0.750946,1.500496,1.958254,Male,1.0,0.0,0.0,0.0,1.0
1,Biscoe,-1.182980,-0.599508,-1.296966,-1.430584,Female,1.0,0.0,0.0,1.0,0.0
2,Dream,-0.548461,0.157681,-0.794857,-0.385429,Male,0.0,1.0,0.0,0.0,1.0
3,Dream,-1.500240,0.359598,-0.794857,-0.955514,Female,0.0,1.0,0.0,1.0,0.0
4,Torgersen,-0.828396,0.006243,-0.364479,-0.828828,Female,0.0,0.0,1.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
278,Dream,1.635031,1.419663,0.281090,0.438028,Male,0.0,1.0,0.0,0.0,1.0
279,Biscoe,0.328668,-1.306218,0.639738,-0.005372,Female,1.0,0.0,0.0,1.0,0.0
280,Biscoe,0.421980,-0.700467,0.998387,1.071456,Male,1.0,0.0,0.0,0.0,1.0
281,Dream,1.355096,1.015828,-0.579668,-0.702142,Male,0.0,1.0,0.0,0.0,1.0


In [306]:
X_train.rename(columns={0: "sex_female", 1: "sex_male"}, inplace=True)

descartar las columnas originales:

In [307]:
X_train = X_train.drop(["island", "sex"], axis=1)
X_train.head()

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,island_Biscoe,island_Dream,island_Torgersen,sex_female,sex_male
0,1.187135,-0.750946,1.500496,1.958254,1.0,0.0,0.0,0.0,1.0
1,-1.18298,-0.599508,-1.296966,-1.430584,1.0,0.0,0.0,1.0,0.0
2,-0.548461,0.157681,-0.794857,-0.385429,0.0,1.0,0.0,0.0,1.0
3,-1.50024,0.359598,-0.794857,-0.955514,0.0,1.0,0.0,1.0,0.0
4,-0.828396,0.006243,-0.364479,-0.828828,0.0,0.0,1.0,1.0,0.0


## KNN

Ahora, con nuestro dataset limpio, entrenemos un KNN classifier.

Primero, importar knn classifier de sklearn:

In [308]:
from sklearn.neighbors import KNeighborsClassifier

Instanciar un KNN con n_neighbors = 5 y weights="uniform".

INVESTIGAR: ¿Qué significa weights = "uniform" ???

In [309]:
knn = KNeighborsClassifier(n_neighbors=5, weights="uniform")

Entrenar el modelo con los datos de entrenamiento:

In [310]:
knn.fit(X_train, y_train)

KNeighborsClassifier()

Generar las predicciones para train y test. Tener en cuenta que para generar las de test, debemos aplicar el preprocesamiento a los datos (OHE y scaler)

In [311]:
X_test

Unnamed: 0,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
299,Biscoe,45.2,16.4,223.0,5950.0,Male
244,Biscoe,42.9,13.1,215.0,5000.0,Female
237,Biscoe,49.2,15.2,221.0,6300.0,Male
193,Dream,46.2,17.5,187.0,3650.0,Female
52,Biscoe,35.0,17.9,190.0,3450.0,Female
83,Torgersen,35.1,19.4,193.0,4200.0,Male
55,Biscoe,41.4,18.6,191.0,3700.0,Male
58,Biscoe,36.5,16.6,181.0,2850.0,Female
280,Biscoe,45.3,13.8,208.0,4200.0,Female
56,Biscoe,39.0,17.5,186.0,3550.0,Female


In [312]:
island_onehot_test = pd.DataFrame(ohe_island.fit_transform(X_test[["island"]]), columns= ["island_Biscoe", "island_Dream", "island_Torgersen"])
sex_onehot_test = pd.DataFrame(ohe_sex.fit_transform(X_test[["sex"]]), columns= ["sex_female", "sex_male"])
numeric = pd.DataFrame(scaler.fit_transform(X_test[columnas_numericas]), columns= ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"])
X_test = pd.concat([island_onehot_test, numeric, sex_onehot_test], axis=1)
X_test = X_test[["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g", "island_Biscoe", "island_Dream", "island_Torgersen", "sex_female", "sex_male"]]
X_test

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,island_Biscoe,island_Dream,island_Torgersen,sex_female,sex_male
0,0.244714,-0.339109,1.588628,1.957645,1.0,0.0,0.0,0.0,1.0
1,-0.138693,-2.098639,1.028265,0.88077,1.0,0.0,0.0,1.0,0.0
2,0.911508,-0.978938,1.448537,2.354388,1.0,0.0,0.0,0.0,1.0
3,0.411412,0.247401,-0.933004,-0.649525,0.0,1.0,0.0,1.0,0.0
4,-1.455613,0.460677,-0.722868,-0.876236,1.0,0.0,0.0,1.0,0.0
5,-1.438943,1.260463,-0.512732,-0.026072,0.0,0.0,1.0,0.0,1.0
6,-0.388741,0.833911,-0.652822,-0.592848,1.0,0.0,0.0,0.0,1.0
7,-1.205565,-0.232471,-1.353276,-1.556367,1.0,0.0,0.0,1.0,0.0
8,0.261384,-1.725406,0.537948,-0.026072,1.0,0.0,0.0,1.0,0.0
9,-0.788818,0.247401,-1.003049,-0.762881,1.0,0.0,0.0,1.0,0.0


In [313]:
pred_train = knn.predict(X_train)
pred_test = knn.predict(X_test)

Medir accuracy_score para train y test.

In [316]:
accuracy_score(y_train, pred_train)

0.9964664310954063

In [317]:
accuracy_score(y_test, pred_test)

1.0

Vemos que las métricas obtenidas van a ser bastante altas. Este es un dataset simple.

Cuando trabajemos con datasets más complejos, podemos hacer un bucle for y entrenar muchos KNN con distintos valores de n_neighbors. Para cada valor de n_neighbors, calcular las métricas en train y test y luego graficarlas para encontrar el mejor valor de n_neighbors. Hay distintas técnicas para encontrar el mejor valor de hiperparámetros pero en general todas se basan en prueba/error. Veremos esto más adelante.