# Using a K-Nearest Neighbors Classifier to Classify Bank Customers

We will build and train a K-Nearest Neighbors classifier with scikit-learn to predict whether a bank customer will subscribe to a term deposit.

# Exploring the Data

First, we read the bank customer data into a pandas DataFrame.

In [1]:
import pandas as pd
banking_df = pd.read_csv("C:/Users/Linus/Documents/Sheets/bank/bank-additional/bank-additional-full.csv", sep=";")
banking_df.head()

| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |


Each customer has 20 features and 1 target variable, `y`. Because the target is categorical (`yes`/`no`), we convert it to 1/0 for easier evaluation. We also one-hot encode the remaining categorical features with `pd.get_dummies` so that every column is numeric.

In [2]:
banking_df["y"] = banking_df["y"].apply(lambda x:1 if x=="yes" else 0)
banking_df = pd.get_dummies(data = banking_df, drop_first = True)
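
As a small illustration of what this cell does, the sketch below applies the same two steps to a hypothetical three-row frame (the values are made up for the example, not taken from the dataset). With `drop_first=True`, the first category of each text column is dropped, so a column with two categories becomes a single indicator column.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
toy_df = pd.DataFrame({
    "age": [30, 45, 52],
    "housing": ["no", "yes", "no"],
    "y": ["no", "yes", "no"],
})

# Map the target to 1/0, as in the cell above.
toy_df["y"] = toy_df["y"].apply(lambda x: 1 if x == "yes" else 0)

# One-hot encode the remaining text columns, dropping the first category of each.
toy_encoded = pd.get_dummies(data=toy_df, drop_first=True)

print(list(toy_encoded.columns))
print(toy_encoded)
```

Here `housing` is replaced by a single `housing_yes` column, while the numeric `age` and the already-converted `y` pass through unchanged.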

# Feature Selection

Not all features in a dataset might be relevant to a model's performance. Identifying and removing such features in the data preparation step, before training a model, can not only boost its performance, but also reduce the computational cost. The latter is especially important when we have to work with large datasets and complex machine learning models.

We'll calculate the Pearson Correlation Coefficient on our columns to identify which features are strongly correlated to the target variable.

In [3]:
correlations = abs(banking_df.corr())

In [4]:
top_5_features = correlations["y"].sort_values(ascending=False)[1:6].index
print(correlations["y"].sort_values(ascending=False)[1:6])

duration            0.405274
nr.employed         0.354678
pdays               0.324914
poutcome_success    0.316269
euribor3m           0.307771
Name: y, dtype: float64


We can see that `duration` is more strongly correlated with the target variable than any other feature. We will use these five most correlated features to train the model.
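
For reference, the Pearson coefficient that `DataFrame.corr()` computes by default can be reproduced by hand. The sketch below uses two short made-up series (not the bank data), and the helper name `pearson_r` is hypothetical; it only aims to show that the hand-written formula agrees with pandas.

```python
import numpy as np
import pandas as pd

def pearson_r(a, b):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return (a_centered * b_centered).sum() / np.sqrt(
        (a_centered ** 2).sum() * (b_centered ** 2).sum()
    )

# Made-up values for illustration only.
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0],
                    "y": [0, 0, 1, 0, 1]})

manual = pearson_r(toy["x"], toy["y"])
from_pandas = toy["x"].corr(toy["y"])
print(manual, from_pandas)
```

Taking the absolute value, as cell 3 does, ranks features by the strength of the linear relationship regardless of its direction.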

# Training, Validation and Test sets

We will split the dataset into training, validation, and test sets using the `train_test_split` function from `sklearn.model_selection`, then scale the features using `MinMaxScaler`. The scaler is fit on the training set only and later reused to transform the validation and test sets.

In [5]:
X = banking_df.drop("y", axis=1)
y = banking_df["y"]

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

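# First split: hold out 20% of all rows as the validation set.
X_train, X_val, y_train, y_val = train_test_split(X[top_5_features], y, test_size=0.20, 
                                                  random_state = 417)
# Second split: hold out another 20% of all rows (25% of the remainder) as the test set.
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, 
                                                    test_size=0.20*X.shape[0]/X_train.shape[0], 
                                                    random_state = 417)

# Fit the scaler on the training set only.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)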
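
To see why the second `test_size` is written as `0.20*X.shape[0]/X_train.shape[0]`, the sketch below repeats the two splits on a hypothetical 100-row frame. The expression evaluates to 0.25 of the remaining 80 rows, so the test set ends up being 20% of the original rows, giving a 60/20/20 split. The variable names and data are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: 100 rows, one feature, alternating labels.
X_toy = pd.DataFrame({"feature": np.arange(100, dtype=float)})
y_toy = pd.Series(np.arange(100) % 2)

# First split: 80 rows kept, 20 rows for validation.
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.20, random_state=417)

# Second split: 20 / 80 = 0.25 of the remaining rows go to the test set.
second_fraction = 0.20 * X_toy.shape[0] / X_tr.shape[0]
X_tr, X_te, y_tr, y_te = train_test_split(X_tr, y_tr, test_size=second_fraction, random_state=417)

print(len(X_tr), len(X_va), len(X_te))

# Fit the scaler on the training rows only, then reuse it for the validation rows.
toy_scaler = MinMaxScaler()
X_tr_scaled = toy_scaler.fit_transform(X_tr)
X_va_scaled = toy_scaler.transform(X_va)
print(X_tr_scaled.min(), X_tr_scaled.max())
```

Because the scaler only sees the training rows, validation or test values may fall slightly outside the [0, 1] range after transformation; that is expected.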

# Training and Evaluating the Model

Once the training data is scaled, we can fit a `KNeighborsClassifier` for several values of `n_neighbors` and calculate each model's accuracy on the validation set.

In [7]:
from sklearn.neighbors import KNeighborsClassifier

num_neighbors = list(range(1, 6))
X_val_scaled = scaler.transform(X_val)

accuracies = {}

for neighbors in num_neighbors:
    knn = KNeighborsClassifier(n_neighbors = neighbors)
    knn.fit(X_train_scaled, y_train)
    val_accuracy = knn.score(X_val_scaled, y_val)
    accuracies[neighbors] = val_accuracy
    
print(accuracies)

{1: 0.8962126729788784, 2: 0.9062879339645545, 3: 0.9072590434571498, 4: 0.9089584850691915, 5: 0.909686817188638}
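
Conceptually, the classifier predicts the majority label among the `k` training points closest to a query point. The following is a minimal from-scratch sketch of that idea on made-up 2-D points, using Euclidean distance. The function name `knn_predict` is hypothetical, and this is not how scikit-learn implements the neighbor search internally.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, query, k):
    """Return the majority label among the k training points nearest to `query`."""
    distances = np.linalg.norm(X_train - query, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Made-up points: class 0 clustered near the origin, class 1 near (1, 1).
X_demo = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [1.0, 1.0], [0.9, 0.8], [0.8, 0.9]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

pred_near_origin = knn_predict(X_demo, y_demo, np.array([0.15, 0.15]), k=3)
pred_near_corner = knn_predict(X_demo, y_demo, np.array([0.85, 0.95]), k=3)
print(pred_near_origin, pred_near_corner)
```

Because distances drive the prediction, features on very different scales would dominate the vote, which is why the data was scaled first.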


# Hyperparameter Optimization

We can set the `weights` hyperparameter of `KNeighborsClassifier` to `"distance"` so that points in each neighborhood are weighted by the inverse of their distance, and set the power parameter `p` of the Minkowski metric to 5. We then calculate the accuracies again.


In [8]:
for neighbors in num_neighbors:
    knn = KNeighborsClassifier(n_neighbors = neighbors, weights = "distance", p=5)
    knn.fit(X_train_scaled, y_train)
    val_accuracy = knn.score(X_val_scaled, y_val)
    accuracies[neighbors] = val_accuracy
    
print(accuracies)

{1: 0.8970623937848993, 2: 0.898154891964069, 3: 0.9032532168001942, 4: 0.9039815489196407, 5: 0.906409322651129}


Modifying these two hyperparameters improved validation accuracy only for `k = 1` (0.8962 to 0.8971); for `k = 2` through `k = 5` it decreased. Not every attempt will result in an improvement.
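
To make the two hyperparameters concrete: with `weights="distance"`, each neighbor's vote counts in proportion to the inverse of its distance, and `p` is the exponent in the Minkowski distance (`p=1` is Manhattan, `p=2` is Euclidean). The sketch below uses hypothetical helper names and made-up values, and it ignores the zero-distance edge case that a real implementation has to handle.

```python
import numpy as np

def minkowski_distance(a, b, p):
    """Minkowski distance: (sum(|a_i - b_i|^p))^(1/p)."""
    return (np.abs(a - b) ** p).sum() ** (1.0 / p)

def weighted_vote(distances, labels):
    """Sum inverse-distance weights per label and return the heaviest label."""
    totals = {}
    for dist, label in zip(distances, labels):
        totals[label] = totals.get(label, 0.0) + 1.0 / dist
    return max(totals, key=totals.get)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
manhattan = minkowski_distance(a, b, p=1)
euclidean = minkowski_distance(a, b, p=2)
higher_order = minkowski_distance(a, b, p=5)
print(manhattan, euclidean, higher_order)

# Three neighbors: one very close with label 1, two farther away with label 0.
neighbor_distances = [0.1, 0.5, 0.5]
neighbor_labels = [1, 0, 0]
winner = weighted_vote(neighbor_distances, neighbor_labels)
print(winner)
```

In this toy case a plain majority vote would pick label 0 (two votes to one), while the inverse-distance weighting picks label 1 because the single close neighbor outweighs the two distant ones.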

We can't always try every possible combination of hyperparameter values. Depending on the size of the dataset, the number of hyperparameters, and the range of values they can take, an exhaustive search would be computationally expensive.

Instead, we can try out a smaller subset of values. A commonly used approach for finding good hyperparameter values within such a subset is called grid search.

# Grid Search

`GridSearchCV` allows us to input a dictionary of hyperparameters and the values we want to search. Additionally, `GridSearchCV` automatically evaluates each candidate model using cross-validation on the training data (five folds by default), so we do not need to manage a separate validation set ourselves.

In [9]:
from sklearn.model_selection import GridSearchCV

grid_params = {"n_neighbors": range(1, 10),
                "metric": ["minkowski", "manhattan"]
              }

knn = KNeighborsClassifier()
knn_grid = GridSearchCV(knn, grid_params, scoring='accuracy')
knn_grid.fit(X_train_scaled, y_train)

best_score = knn_grid.best_score_
best_params = knn_grid.best_params_

print(f"Best Model's Accuracy: {best_score*100:.2f}")
print(f"Best Model's Parameters: {best_params}")

Best Model's Accuracy: 90.92
Best Model's Parameters: {'metric': 'minkowski', 'n_neighbors': 9}
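
As a rough sketch of how much work this search involves, `ParameterGrid` from `sklearn.model_selection` can enumerate the same grid. The fold count below is an assumption: it relies on `GridSearchCV` using 5-fold cross-validation when no `cv` argument is passed, as in the cell above.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above, written with a list for clarity.
grid_params = {"n_neighbors": list(range(1, 10)),
               "metric": ["minkowski", "manhattan"]}

combinations = list(ParameterGrid(grid_params))
n_folds = 5  # assumed: GridSearchCV's default when cv is not specified

n_combinations = len(combinations)
n_cv_fits = n_combinations * n_folds
print(n_combinations)  # hyperparameter combinations in the grid
print(n_cv_fits)       # model fits performed during cross-validation
```

Each additional hyperparameter multiplies the number of combinations, which is why an exhaustive search quickly becomes expensive.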


# Evaluating the Model on Test Set

With grid search and the features we selected earlier, we obtained a model with a mean cross-validated accuracy of `~90.92%`.

The best model uses the following hyperparameter values:
* `metric = "minkowski"`
* `n_neighbors = 9`

We can now use this model and evaluate it on the test set. 

Scikit-learn again makes this simple for us to do:

* We can obtain our best model, known as an **estimator**, from `GridSearchCV`.
* We can evaluate the model on the test set by calculating the accuracy score with the best estimator.

In [10]:
X_test_scaled = scaler.transform(X_test)
accuracy = knn_grid.best_estimator_.score(X_test_scaled, y_test)
print(f"Model Accuracy on Test Set: {accuracy*100:.2f}")

Model Accuracy on Test Set: 91.14


The model's accuracy on the test set is `91.14%`.
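
For a classifier, `score` returns the mean accuracy, that is, the fraction of predictions that match the true labels. A minimal sketch with made-up labels (not the model's actual predictions) shows the equivalence with `accuracy_score` from `sklearn.metrics`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration only.
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 0, 1, 0, 0, 0, 1, 1])

# Fraction of positions where the prediction equals the true label.
manual_accuracy = (y_true == y_pred).mean()
sklearn_accuracy = accuracy_score(y_true, y_pred)
print(manual_accuracy, sklearn_accuracy)
```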