# Objective 

This notebook predicts whether a customer will default on a loan payment. The file consists of the following columns: *ClientId*, *Income*, *Age*, *Loan* and *Default*.

*ClientId* can be dropped as it is not a deciding factor in whether a client will default on the loan payment. *Income*, *Age* and *Loan* are important features in deciding whether a client will default.

Importing the necessary libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import itertools

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

*confusion_matrix* shows which class the model is biased towards. *accuracy_score* calculates the accuracy, but accuracy alone does not show how well the model behaves on each class.
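
To make the point about accuracy concrete, here is a small sketch with made-up labels (not the notebook's data). A classifier that always predicts class 0 still reaches 80% accuracy on this toy set, while the confusion matrix shows that it never identifies class 1. The names `toy_true`, `toy_pred` and `toy_cm` are illustrative only.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration: 8 non-defaulters (0) and 2 defaulters (1).
toy_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A classifier that always predicts "no default".
toy_pred = [0] * 10

toy_acc = accuracy_score(toy_true, toy_pred)
# scikit-learn's convention: rows are true classes, columns are predicted classes.
toy_cm = confusion_matrix(toy_true, toy_pred, labels=[0, 1])

print(toy_acc)
print(toy_cm)
```

The second row of `toy_cm` holds the true defaulters; both of them land in the "predicted 0" column.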

*train_test_split* is used to split the features and output into train and test datasets. This is recommended because the model should not be evaluated on the data it was trained on: the resulting score would be overly optimistic and would hide overfitting.
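
As a rough illustration of the split, the sketch below uses synthetic data rather than the credit file. The `stratify` and `random_state` arguments are optional extras that are not used elsewhere in this notebook: the first keeps the class ratio the same in both parts, the second makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, 20% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print(X_tr.shape, X_te.shape)   # 80 training rows, 20 test rows
print(y_tr.mean(), y_te.mean()) # share of positives in each part
```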

*KNeighborsClassifier* is the scikit-learn implementation of the k-NN classification algorithm.

Reading the file using the `read_csv` function of `pandas`.

In [None]:
data = pd.read_csv("../input/credit-risk/original.csv")
data.head()

In the *default* column, 0 means the client did not default while 1 means the client defaulted.

In [None]:
data.info()

The *age* column has three rows with NaN values, which are filled with the mean of the column.

In [None]:
data.fillna(data.mean(),inplace=True)
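
The effect of mean filling can be checked on a tiny made-up frame. This is only a sketch; `toy` and `filled` are hypothetical names and the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({"age": [20.0, np.nan, 40.0],
                    "loan": [100.0, 200.0, 300.0]})

# Number of missing values per column before filling.
missing_before = toy.isna().sum()
print(missing_before)

# Each NaN is replaced by the mean of its own column.
filled = toy.fillna(toy.mean())
print(filled)
```

Here the missing age becomes 30.0, the mean of 20 and 40, and the `loan` column is left unchanged.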

Dropping *clientid* as it does not determine whether the client defaults or not. *age* is also cast to an integer, which discards the fractional part.

In [None]:
data.drop(columns="clientid",inplace=True)
data["age"] = data["age"].astype("int")
data.head()

Splitting the dataset into features and output. 

> X -> *income*, *age* and *loan*

> y -> *default*

In [None]:
X = data[["income","age","loan"]]
y = data["default"]

Splitting the features and output into train and test datasets, with 20% of the rows held out for testing.

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,shuffle=True)

Initializing the k-NN algorithm with `n_neighbors=4`, so each prediction is a majority vote among the 4 nearest training points.

In [None]:
neigh = KNeighborsClassifier(n_neighbors=4)

model = neigh.fit(X_train,y_train)

y_pred = model.predict(X_test)
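
For intuition, the voting idea behind k-NN can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation; `knn_predict` and the toy arrays are hypothetical names and data.

```python
from collections import Counter

import numpy as np


def knn_predict(train_X, train_y, query, k=3):
    """Predict one label by majority vote among the k nearest training points."""
    # Euclidean distance from the query to every training row.
    dists = np.linalg.norm(train_X - query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(dists)[:k]
    # Most common label among those neighbours.
    return Counter(train_y[nearest].tolist()).most_common(1)[0][0]


# Two made-up clusters: three points near the origin, two near (5, 5).
toy_X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
toy_y = np.array([0, 0, 0, 1, 1])

print(knn_predict(toy_X, toy_y, np.array([0.05, 0.05])))  # 0
print(knn_predict(toy_X, toy_y, np.array([5.0, 5.1])))    # 1
```

With an even `k` such as 4, a two-class vote can end in a tie, which is one reason odd values of `k` are often tried as well.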

Printing the *Mean Squared Error* and *Accuracy Score* of the classifier.

In [None]:
print("Mean Squared Error: {:.3f}".format(mean_squared_error(y_test,y_pred)))
print("Accuracy score: {:.3f}".format(accuracy_score(y_test,y_pred)*100))

An accuracy of *84.750* is achieved with a mean squared error of *0.152*.
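
For 0/1 labels the two numbers carry the same information: every wrong prediction contributes a squared error of 1, so the mean squared error equals one minus the accuracy. The figures above fit this relation (1 - 0.8475 = 0.1525). A small check on made-up labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Made-up 0/1 labels for illustration; predictions differ at two of eight positions.
toy_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
toy_pred = np.array([0, 1, 1, 0, 0, 1, 0, 0])

acc = accuracy_score(toy_true, toy_pred)
mse = mean_squared_error(toy_true, toy_pred)

print(acc, mse)
print(np.isclose(acc + mse, 1.0))
```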

Comparing the predicted values and actual values and storing them in a CSV file.

In [None]:
results = pd.DataFrame({"Actual Values":y_test,
                        "Predicted Values":y_pred})
results.head()

In [None]:
results.to_csv("k-NN.csv",index=False)

Plotting the confusion matrix to determine which class the model is biased towards.

In [None]:
# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions.
cm = confusion_matrix(y_test,y_pred)

def plot_confusion_matrix(cm,classes,title='Confusion Matrix',cmap=plt.cm.Blues):
    # Normalise each row so that cells show the fraction of each true class.
    cm = cm.astype('float')/cm.sum(axis=1)[:,np.newaxis]
    plt.figure(figsize=(10,10))
    plt.imshow(cm,interpolation='nearest',cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    # Write the value in each cell, white on dark cells and black on light ones.
    fmt = '.2f'
    thresh = cm.max()/2.
    for i,j in itertools.product(range(cm.shape[0]),range(cm.shape[1])):
        plt.text(j,i,format(cm[i,j],fmt),
                horizontalalignment="center",
                color="white" if cm[i,j] > thresh else "black")

    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')

classes=['0','1']

# The function opens its own figure, so no extra plt.figure() is needed here.
plot_confusion_matrix(cm,classes,title="KNN")
plt.show()

As can be seen, the model is biased towards the customer not defaulting. Accuracy should not be the only metric used for comparing models.
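
Precision and recall for the default class are two common complements to accuracy. The sketch below uses made-up labels, not the notebook's predictions, and scikit-learn's `precision_score`, `recall_score` and `classification_report`.

```python
from sklearn.metrics import classification_report, precision_score, recall_score

# Made-up labels: 8 non-defaulters (0) followed by 4 defaulters (1).
toy_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
# The toy model flags one non-defaulter by mistake and catches only one defaulter.
toy_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

# Recall: share of actual defaulters that were caught (1 of 4).
toy_recall = recall_score(toy_true, toy_pred)
# Precision: share of predicted defaulters that really defaulted (1 of 2).
toy_precision = precision_score(toy_true, toy_pred)

print(toy_recall, toy_precision)
print(classification_report(toy_true, toy_pred,
                            target_names=["no default", "default"]))
```

Accuracy on this toy set is 8 out of 12, yet three of the four defaulters are missed, which is exactly the kind of bias the confusion matrix exposes.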

# Normalising the Data

In the previous model the feature data was not normalised. Normalisation puts the features on the same scale, or within similar ranges.

Take, for example, two points with feature values (10000, 1, 100) and (20000, 0, 200). Using Euclidean or Minkowski distance, the change in the second value is overshadowed by the changes in the first and third values, and the first dominates the distance almost entirely. After normalising, a change in each feature is given a similar weight. Because k-NN classifies purely by distance, this lets every feature contribute to the choice of neighbours.
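
The arithmetic for these two points can be worked through directly. The sketch computes each feature's share of the squared Euclidean distance; the variable names are illustrative.

```python
import numpy as np

a = np.array([10000.0, 1.0, 100.0])
b = np.array([20000.0, 0.0, 200.0])

# Squared difference per feature: 1e8, 1 and 1e4.
sq_diff = (a - b) ** 2
# Each feature's share of the total squared distance.
share = sq_diff / sq_diff.sum()

print(np.sqrt(sq_diff.sum()))  # Euclidean distance, just over 10000
print(share)
```

The first feature accounts for more than 99.9% of the squared distance, while the second contributes about one part in a hundred million.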

Using StandardScaler from scikit-learn. 

The mean and standard deviation of each column are calculated. The mean is subtracted from each value, and the result is divided by the standard deviation.

The MinMaxScaler can also be used where the minimum is subtracted from each value and divided by the difference between maximum and minimum values.
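
Both formulas can be checked by hand against scikit-learn on a made-up column. This is a small sketch; `col`, `manual_std` and `manual_minmax` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# A single made-up feature column.
col = np.array([[10.0], [20.0], [30.0], [40.0]])

# Standardisation: subtract the mean, divide by the (population) standard deviation.
manual_std = (col - col.mean(axis=0)) / col.std(axis=0)
# Min-max scaling: subtract the minimum, divide by the range.
manual_minmax = (col - col.min(axis=0)) / (col.max(axis=0) - col.min(axis=0))

print(np.allclose(manual_std, StandardScaler().fit_transform(col)))    # True
print(np.allclose(manual_minmax, MinMaxScaler().fit_transform(col)))   # True
```

After standardisation the column has mean 0 and standard deviation 1; after min-max scaling it runs from 0 to 1.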

This is done for every column in the features data before splitting into train and test datasets.
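
A commonly recommended variant, which is not what this notebook does, is to fit the scaler on the training rows only so that statistics from the test rows do not influence the transformation. A minimal sketch on synthetic data, assuming scikit-learn's `make_pipeline`; the data and the labelling rule are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for income/age/loan features (values are made up).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[45000, 40, 4000], scale=[14000, 13, 3000], size=(200, 3))
# Invented labelling rule: large loans count as the positive class.
y_demo = (X_demo[:, 2] > 4000).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The pipeline fits the scaler on X_tr only and reuses it on X_te when scoring.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
pipe.fit(X_tr, y_tr)

print(pipe.score(X_te, y_te))
```

On a dataset of this size the difference from scaling everything up front is usually small, but the pipeline keeps the test set fully unseen.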

In [None]:
scaler = StandardScaler()

X = scaler.fit_transform(X)
X = pd.DataFrame(X)
X.columns = ["income","age","loan"]
X.head()

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,shuffle=True)

Fitting a model on the normalised data, this time with `n_neighbors=3`.

In [None]:
neigh = KNeighborsClassifier(n_neighbors=3)

model = neigh.fit(X_train,y_train)

y_pred = model.predict(X_test)

Calculating the accuracy and mean squared error.

In [None]:
print("Mean Squared Error: {:.3f}".format(mean_squared_error(y_test,y_pred)))
print("Accuracy score: {:.3f}".format(accuracy_score(y_test,y_pred)*100))

An accuracy of *97.500* is obtained with a mean squared error of *0.025*.

In [None]:
results_normalized = pd.DataFrame({"Actual Values":y_test,
                        "Predicted Values":y_pred})
results_normalized.head()

In [None]:
results_normalized.to_csv("k-NN_normalized.csv",index=False)

Plotting the confusion matrix to compare with the previous model.

In [None]:
# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions.
cm = confusion_matrix(y_test,y_pred)

# plot_confusion_matrix and classes are reused from the earlier cell.
plot_confusion_matrix(cm,classes,title="Normalized KNN")
plt.show()

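As can be seen, this model performs clearly better than the previous one. The two runs use different random splits and different values of `n_neighbors`, so part of the difference may come from those changes rather than from normalisation alone.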