# k-NN with Hyperparameters

### k-NN (k-Nearest Neighbours)


- The k closest cases to the query point are examined.

- The query is assigned either by comparing the mean distance to each class among those neighbours or by majority vote (the class with the most neighbours).

- The value of k is usually chosen heuristically as $k=\sqrt{n}$, where $n$ is the number of examples. (This choice has some theoretical motivation.)
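
The rule above can be sketched in a few lines. This is a minimal illustration and not the notebook's implementation: `heuristic_k` and `knn_predict` are hypothetical helper names, the toy points are made up, and bumping an even k to the next odd number (to avoid ties in a two-class vote) is an assumption added here.

```python
import math
from collections import Counter


def heuristic_k(n_examples):
    """Return floor(sqrt(n)), bumped to the next odd number if it is even."""
    k = max(1, math.isqrt(n_examples))
    return k if k % 2 == 1 else k + 1


def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the labels of the k closest points and take the most frequent one
    nearest_labels = [label for _, label in sorted(distances)[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data (hypothetical): two well-separated groups
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 1, 1, 1]

print(heuristic_k(1138))
print(knn_predict(points, labels, (0.5, 0.5), k=3))
print(knn_predict(points, labels, (5.5, 5.5), k=3))
```

For a dataset of 1138 rows, $\sqrt{n}$ is about 33.7, so a search over values of k in the low twenties to high thirties brackets the heuristic.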

## 1. Libraries and initial configuration


In [1]:
# Data handling
# ==============================================================================
import pandas as pd
import numpy as np


# Caching function results and persisting objects to disk
# ==============================================================================
import joblib


# Library management
# ==============================================================================
from importlib import reload


# Mathematics and statistics
# ==============================================================================
import math


# Preprocessing and modelling
# ==============================================================================

# Train/test split
from sklearn.model_selection import train_test_split


# Variable scaling
from sklearn.preprocessing import MinMaxScaler


# Model evaluation
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve


# Model creation
from sklearn.neighbors import KNeighborsClassifier


# Hyperparameter tuning
from sklearn.model_selection import GridSearchCV


# Plotting
# ==============================================================================
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns


# Warning configuration
# ==============================================================================
import warnings
warnings.filterwarnings('ignore')

## 2. Functions

In [2]:
#reload(utils.funciones)

# External functions
# ==============================================================================
from utils.funciones import multiple_plot, plot_roc_curve

## 3. Loading the dataset

In [3]:
# Create the DataFrame d from the input file
d = pd.read_csv('./datasets/02_GermanCredit_Prep.csv')

In [4]:
## Load data with Colab
## =============================================================================

#from google.colab import drive 
#import os

#drive.mount('/gdrive')

In [5]:
#os.chdir("/gdrive/MyDrive/ModelosCuantitativosPython/Notebooks")
#!ls

In [6]:
d.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1138 entries, 0 to 1137
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   checking_account_status   1138 non-null   object
 1   loan_duration_mo          1138 non-null   int64 
 2   credit_history            1138 non-null   object
 3   purpose                   1138 non-null   object
 4   loan_amount               1138 non-null   int64 
 5   savings_account_balance   1138 non-null   object
 6   time_employed_yrs         1138 non-null   object
 7   payment_pcnt_income       1138 non-null   int64 
 8   gender_status             1138 non-null   object
 9   other_signators           1138 non-null   object
 10  time_in_residence         1138 non-null   int64 
 11  property                  1138 non-null   object
 12  age_yrs                   1138 non-null   int64 
 13  other_credit_outstanding  1138 non-null   object
 14  home_ownership          

## 4. Data visualization

### Input variables

In [7]:
# List of categorical variables
catCols = d.select_dtypes(include = ["object", 'category']).columns.tolist()

d[catCols].head(2)

Unnamed: 0,checking_account_status,credit_history,purpose,savings_account_balance,time_employed_yrs,gender_status,other_signators,property,other_credit_outstanding,home_ownership,job_category,telephone,foreign_worker
0,< 0 DM,critical account - other non-bank loans,car,< 100 DM,1 - 4 years,female-divorced/separated/married,co-applicant,real estate,none,own,skilled,none,yes
1,< 0 DM,current loans paid,car,< 100 DM,1 - 4 years,male-married/widowed,none,real estate,none,own,unskilled-resident,none,yes


In [8]:
# List of numeric variables

numCols = d.select_dtypes(include = ['float64','int32','int64']).columns.tolist()

d[numCols].head(2)

Unnamed: 0,loan_duration_mo,loan_amount,payment_pcnt_income,time_in_residence,age_yrs,number_loans,dependents,bad_credit
0,12,3499,3,2,29,2,1,1
1,12,1168,4,3,27,1,1,0


In [9]:
## Frequency of instances for the categorical variables
#multiple_plot(3, d , catCols, None, 'countplot', 'Frecuencia de instancias para variables categóricas',30)

In [10]:
## Visualization of the numeric variables
#multiple_plot(1, d , numCols, None, 'scatterplot', 'Relación entre las variables numéricas',30)

In [11]:
# Remove the output variable from the list of numeric variables
numCols.remove('bad_credit')

### Output variable

In [12]:
# Distribution of the output variable

d.groupby('bad_credit').bad_credit.count().sort_values(ascending=False)

bad_credit
0    569
1    569
Name: bad_credit, dtype: int64

In [13]:
## Visualization of the output variable
#multiple_plot(1, d , None, 'bad_credit', 'countplot', 'Gráfica de frecuencia de bad Credit',0)

## 5. Data transformation

### Creating dummy variables

In [14]:
# One-hot encode the categorical variables with pd.get_dummies

d = pd.get_dummies(d, drop_first=True)

d.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1138 entries, 0 to 1137
Data columns (total 46 columns):
 #   Column                                                  Non-Null Count  Dtype
---  ------                                                  --------------  -----
 0   loan_duration_mo                                        1138 non-null   int64
 1   loan_amount                                             1138 non-null   int64
 2   payment_pcnt_income                                     1138 non-null   int64
 3   time_in_residence                                       1138 non-null   int64
 4   age_yrs                                                 1138 non-null   int64
 5   number_loans                                            1138 non-null   int64
 6   dependents                                              1138 non-null   int64
 7   bad_credit                                              1138 non-null   int64
 8   checking_account_status_< 0 DM                          11
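
To make the effect of `drop_first` concrete, here is a small sketch on a made-up frame. The column values are hypothetical and only mimic the shape of the dataset; the point is that each categorical column with m levels becomes m - 1 indicator columns, while numeric columns pass through unchanged.

```python
import pandas as pd

# Hypothetical miniature of the dataset: one numeric and one categorical column
toy = pd.DataFrame({
    "loan_amount": [3499, 1168, 2500, 900],
    "purpose": ["car", "radio", "car", "furniture"],
})

# drop_first=True drops the first level ('car', in sorted order), so the
# remaining indicator columns are not perfectly collinear
encoded = pd.get_dummies(toy, drop_first=True)

print(list(encoded.columns))
print(encoded.shape)
```

A row whose indicators are all zero (or False) belongs to the dropped level, here `car`.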

## 6. Building the model

### Selecting the dataset

In [15]:
# Define the input variables 'X' and the output variable 'y'

X = d.drop(columns ='bad_credit')
y = d['bad_credit']

# Cross-validation is performed on the whole dataset
X_Completo = X
y_Completo = y

### Scaling variables

In [16]:
# Define the numeric variables to scale

# num_vars holds the list of numeric variables that will be scaled later
num_vars = numCols

print(num_vars)

['loan_duration_mo', 'loan_amount', 'payment_pcnt_income', 'time_in_residence', 'age_yrs', 'number_loans', 'dependents']


In [17]:
# Scale the numeric variables

pd.set_option('display.float_format', lambda x: '%.4f' % x)

# Create a MinMaxScaler object
scaler = MinMaxScaler()

# Scale the numeric columns of the full dataset
X_Completo[num_vars] = scaler.fit_transform(X_Completo[num_vars])


X_Completo[num_vars].head(2)

Unnamed: 0,loan_duration_mo,loan_amount,payment_pcnt_income,time_in_residence,age_yrs,number_loans,dependents
0,0.1176,0.2356,0.6667,0.3333,0.1667,0.3333,0.0
1,0.1176,0.0619,1.0,0.6667,0.1296,0.0,0.0
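
Min-max scaling maps each column to the range [0, 1] through $x' = (x - x_{min}) / (x_{max} - x_{min})$. The sketch below applies that formula by hand. `min_max_scale` is a hypothetical helper, and the three durations are illustrative values chosen so that 12 maps to 0.1176 as in the output above; the actual minimum and maximum of `loan_duration_mo` are not shown in this notebook.

```python
import numpy as np


def min_max_scale(column):
    """Scale a 1-D sequence to [0, 1] using its own minimum and maximum."""
    column = np.asarray(column, dtype=float)
    col_min, col_max = column.min(), column.max()
    if col_max == col_min:
        # A constant column carries no spread; map it to zeros
        return np.zeros_like(column)
    return (column - col_min) / (col_max - col_min)


# Hypothetical loan durations in months (assumed minimum 4 and maximum 72)
durations = [4, 12, 72]
scaled = min_max_scale(durations)

print(np.round(scaled, 4))
```

Because k-NN relies on distances, leaving `loan_amount` unscaled would let it dominate every other variable, which is the reason for this step.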


In [18]:
# Save the scaler
joblib.dump(scaler, './modelos/scaler/minmaxFull_GermanCredits.pkl')

['./modelos/scaler/minmaxFull_GermanCredits.pkl']
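
Fitting the scaler on the full dataset means that, during cross-validation, each training fold has already seen the minimum and maximum of its validation rows. One common alternative, sketched here on synthetic data, is to put the scaler inside a `Pipeline` so that it is refitted on each training fold. This is an illustration under assumptions, not the notebook's method: the data come from `make_classification`, the grid is deliberately tiny, and the step names `scaler` and `knn` are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real features and target
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=4)

# The scaler is a pipeline step, so it is fitted only on each training fold
pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("knn", KNeighborsClassifier()),
])

# Parameters of a step are addressed as <step name>__<parameter>
param_grid = {
    "knn__n_neighbors": [5, 11],
    "knn__metric": ["euclidean", "manhattan"],
}

search = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring="f1")
search.fit(X_demo, y_demo)

print(search.best_params_)
```

With this layout the fitted pipeline carries its own scaler, so a single saved object would cover both preprocessing and prediction.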

### Model creation

#### Creating and training the model

In [19]:
np.random.seed(4)


# Model definition
modelKNN = KNeighborsClassifier()

# Numbers of neighbours to evaluate
k=[21, 25, 31, 35, 37]

# Number of cross-validation folds
CV = 10

# Scoring metric used to evaluate the model
scoring = 'f1' # Other accepted values include: accuracy, precision, recall, roc_auc, balanced_accuracy

# Parameter grid definition
parameters = {'n_neighbors':k, 'metric':['euclidean','manhattan','chebyshev']}


# Create the grid search over the parameter grid
grid_knn = GridSearchCV(estimator=modelKNN
                    , param_grid = parameters
                    , cv=CV
                    , scoring=scoring
                    , return_train_score=True
                    , verbose=4)


grid_knn.fit(X_Completo, y_Completo)

Fitting 10 folds for each of 15 candidates, totalling 150 fits
[CV 1/10] END metric=euclidean, n_neighbors=21;, score=(train=0.735, test=0.703) total time=   0.9s
[CV 2/10] END metric=euclidean, n_neighbors=21;, score=(train=0.723, test=0.673) total time=   0.5s
[CV 3/10] END metric=euclidean, n_neighbors=21;, score=(train=0.731, test=0.685) total time=   0.6s
[CV 4/10] END metric=euclidean, n_neighbors=21;, score=(train=0.735, test=0.679) total time=   0.5s
[CV 5/10] END metric=euclidean, n_neighbors=21;, score=(train=0.734, test=0.708) total time=   0.5s
[CV 6/10] END metric=euclidean, n_neighbors=21;, score=(train=0.739, test=0.673) total time=   0.4s
[CV 7/10] END metric=euclidean, n_neighbors=21;, score=(train=0.733, test=0.720) total time=   0.4s
[CV 8/10] END metric=euclidean, n_neighbors=21;, score=(train=0.735, test=0.770) total time=   0.5s
[CV 9/10] END metric=euclidean, n_neighbors=21;, score=(train=0.747, test=0.689) total time=   0.4s
[CV 10/10] END metric=euclidean, n_ne

[CV 3/10] END metric=manhattan, n_neighbors=35;, score=(train=0.724, test=0.667) total time=   0.4s
[CV 4/10] END metric=manhattan, n_neighbors=35;, score=(train=0.719, test=0.655) total time=   0.4s
[CV 5/10] END metric=manhattan, n_neighbors=35;, score=(train=0.721, test=0.649) total time=   0.4s
[CV 6/10] END metric=manhattan, n_neighbors=35;, score=(train=0.712, test=0.667) total time=   0.4s
[CV 7/10] END metric=manhattan, n_neighbors=35;, score=(train=0.711, test=0.694) total time=   0.4s
[CV 8/10] END metric=manhattan, n_neighbors=35;, score=(train=0.718, test=0.718) total time=   0.3s
[CV 9/10] END metric=manhattan, n_neighbors=35;, score=(train=0.713, test=0.729) total time=   0.5s
[CV 10/10] END metric=manhattan, n_neighbors=35;, score=(train=0.713, test=0.695) total time=   0.7s
[CV 1/10] END metric=manhattan, n_neighbors=37;, score=(train=0.707, test=0.709) total time=   0.5s
[CV 2/10] END metric=manhattan, n_neighbors=37;, score=(train=0.698, test=0.697) total time=   0.5s
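
The first line of the log ("15 candidates, totalling 150 fits") follows directly from the grid: 5 values of k times 3 metrics, each evaluated on 10 folds. The sketch below enumerates the combinations with the standard library to show where those numbers come from. The variable names `metrics_to_try` and `candidates` are illustrative; scikit-learn builds the same Cartesian product internally.

```python
from itertools import product

k = [21, 25, 31, 35, 37]
metrics_to_try = ["euclidean", "manhattan", "chebyshev"]
CV = 10

# Every (metric, n_neighbors) pair is one candidate configuration
candidates = [
    {"metric": metric, "n_neighbors": n_neighbors}
    for metric, n_neighbors in product(metrics_to_try, k)
]

# Each candidate is fitted once per fold
total_fits = len(candidates) * CV

print(len(candidates), total_fits)
```

With the default `refit=True`, `GridSearchCV` also refits the best candidate once on all the data after the search; that extra fit is not part of the 150.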

### Model evaluation

In [20]:
# Results
resultados = pd.DataFrame(grid_knn.cv_results_)
resultados.filter(regex = '(param.*|mean_t|std_t)') \
    .drop(columns = 'params') \
    .sort_values('mean_test_score', ascending = False) \
    .head(4)

Unnamed: 0,param_metric,param_n_neighbors,mean_test_score,std_test_score,mean_train_score,std_train_score
7,manhattan,31,0.7033,0.0269,0.7319,0.0067
0,euclidean,21,0.7005,0.0278,0.7337,0.0064
5,manhattan,21,0.6992,0.0279,0.7391,0.0081
6,manhattan,25,0.6987,0.0356,0.7314,0.007


In [21]:
#grid_knn.cv_results_
#grid_knn.best_score_


In [22]:
# Get the grid search results for grid_knn
results_grid_knn = pd.DataFrame(grid_knn.cv_results_)

# Select the desired columns
columns_grid_knn = ['param_metric', 'param_n_neighbors']  + \
               ['mean_test_score', 'std_test_score']  + \
               [f'split{i}_test_score' for i in range(CV)]

# Filter and display the results
results_grid_knn_filtered = results_grid_knn[columns_grid_knn]

results_grid_knn_filtered.sort_values(by='mean_test_score', ascending=False).head(10)

Unnamed: 0,param_metric,param_n_neighbors,mean_test_score,std_test_score,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score
7,manhattan,31,0.7033,0.0269,0.7188,0.7207,0.7339,0.6667,0.6429,0.7018,0.7059,0.713,0.7288,0.7009
0,euclidean,21,0.7005,0.0278,0.7031,0.6729,0.6847,0.6789,0.708,0.6726,0.72,0.7705,0.6891,0.7049
5,manhattan,21,0.6992,0.0279,0.6719,0.6604,0.7207,0.6847,0.7069,0.6607,0.72,0.75,0.7,0.7167
6,manhattan,25,0.6987,0.0356,0.6825,0.6916,0.7273,0.6364,0.6607,0.6903,0.7581,0.7241,0.7377,0.6783
2,euclidean,31,0.6966,0.0262,0.7087,0.7156,0.7091,0.6727,0.6435,0.6783,0.6833,0.7333,0.7273,0.6942
4,euclidean,37,0.6963,0.0342,0.72,0.7037,0.7027,0.6239,0.6552,0.6724,0.736,0.7273,0.7273,0.6942
1,euclidean,25,0.696,0.0271,0.6822,0.6667,0.7321,0.6607,0.6783,0.7018,0.704,0.7179,0.7438,0.6723
8,manhattan,35,0.6925,0.0298,0.7231,0.729,0.6667,0.6545,0.6491,0.6667,0.6942,0.7179,0.7288,0.6949
3,euclidean,35,0.691,0.0344,0.6984,0.6852,0.7207,0.6296,0.6261,0.6897,0.7154,0.7,0.7333,0.7119
9,manhattan,37,0.691,0.0266,0.7087,0.6972,0.6909,0.6486,0.6491,0.6903,0.6949,0.7009,0.7458,0.6838
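
The summary columns are plain aggregates of the per-fold scores. As a check, the sketch below recomputes them for the top row (manhattan, k = 31) from the ten split scores printed in the table. It uses the population standard deviation, which is what reproduces the reported `std_test_score`; the variable names are illustrative.

```python
from statistics import fmean, pstdev

# split0..split9 test scores for metric='manhattan', n_neighbors=31 (from the table)
split_scores = [0.7188, 0.7207, 0.7339, 0.6667, 0.6429,
                0.7018, 0.7059, 0.713, 0.7288, 0.7009]

mean_score = fmean(split_scores)   # corresponds to mean_test_score
std_score = pstdev(split_scores)   # population std, corresponds to std_test_score

print(round(mean_score, 4), round(std_score, 4))
```

The result matches the 0.7033 and 0.0269 shown above. Note that the gap between the best and the fourth-best configuration (about 0.005) is much smaller than one standard deviation across folds, so the ranking among the top rows is not clear-cut.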


In [23]:
# grid_knn results
print("grid_knn results:")
print("Best validation score (", scoring, "):"  ,grid_knn.best_score_)
print("Best set of hyperparameters:", grid_knn.best_params_)

grid_knn results:
Best validation score ( f1 ): 0.7033287961716663
Best set of hyperparameters: {'metric': 'manhattan', 'n_neighbors': 31}
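
The reported score is an F1 value, the harmonic mean of precision and recall for the positive class (here `bad_credit = 1`). The following sketch computes it from scratch on made-up labels to show what the number measures. `f1_from_labels` is a hypothetical helper, and the label vectors are illustrative only.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """Compute F1 for the `positive` class from two equal-length label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero or undefined
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 3 true positives, 1 false positive, 1 false negative
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

print(f1_from_labels(y_true, y_pred))
```

Here precision and recall are both 3/4, so F1 is 0.75. An F1 near 0.70 on a balanced target therefore indicates that roughly 70% of predicted and actual bad credits coincide, in the harmonic-mean sense.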


### Creating the final model

In [24]:
# Refit the model using the best parameters
modelKNN.set_params(**grid_knn.best_params_)
modelKNN.fit(X_Completo, y_Completo)

### Saving the model

In [25]:
# Save the k-NN model
joblib.dump(modelKNN, './modelos/clasificacion/KNN_CV_manhattan.pkl')

['./modelos/clasificacion/KNN_CV_manhattan.pkl']
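
When the saved model is used later, new rows have to pass through the saved scaler before prediction, because the model was trained on scaled values. The round trip can be sketched as follows. Everything here is illustrative: the data are made up, the file names are placeholders inside a temporary directory, and k = 3 is chosen only to suit the tiny sample.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training data: one numeric feature, two classes
X_train = np.array([[10.0], [12.0], [14.0], [50.0], [55.0], [60.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# Fit the scaler, then train the model on the scaled values
scaler = MinMaxScaler().fit(X_train)
model = KNeighborsClassifier(n_neighbors=3, metric="manhattan")
model.fit(scaler.transform(X_train), y_train)

with tempfile.TemporaryDirectory() as tmp_dir:
    scaler_path = os.path.join(tmp_dir, "scaler.pkl")
    model_path = os.path.join(tmp_dir, "knn.pkl")

    # Persist both objects, then load them back as a later session would
    joblib.dump(scaler, scaler_path)
    joblib.dump(model, model_path)
    loaded_scaler = joblib.load(scaler_path)
    loaded_model = joblib.load(model_path)

# New rows go through the same scaler before prediction
X_new = np.array([[11.0], [58.0]])
predictions = loaded_model.predict(loaded_scaler.transform(X_new))

print(predictions)
```

Skipping the `transform` step would feed raw values to a model that expects inputs in [0, 1], which silently distorts every distance.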
