# Exercise 2: Classification system with KNN - To Loan or Not To Loan

## Imports

Import the required libraries.

In [5]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.model_selection import train_test_split

## a. Getting started

### Data loading

The original dataset comes from Kaggle's [Loan Prediction](https://www.kaggle.com/ninzaami/loan-predication) problem. The provided dataset has already been preprocessed: some columns and invalid data were removed. Pandas is used to read the CSV file.

In [6]:
data = pd.read_csv("loandata.csv")

Display the head of the data.

In [7]:
data.head()

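| Unnamed: 0 | Gender | Married | Education | TotalIncome | LoanAmount | CreditHistory | LoanStatus |
|---|---|---|---|---|---|---|---|
| 0 | Male | Yes | Graduate | 6091.0 | 128.0 | 1.0 | N |
| 1 | Male | Yes | Graduate | 3000.0 | 66.0 | 1.0 | Y |
| 2 | Male | Yes | Not Graduate | 4941.0 | 120.0 | 1.0 | Y |
| 3 | Male | No | Graduate | 6000.0 | 141.0 | 1.0 | Y |
| 4 | Male | Yes | Graduate | 9613.0 | 267.0 | 1.0 | Y |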


Dataset columns:
* **Gender:** Applicant gender (Male/Female)
* **Married:** Is the applicant married? (Yes/No)
* **Education:** Applicant education (Graduate/Not Graduate)
* **TotalIncome:** Applicant total income (sum of the `ApplicantIncome` and `CoapplicantIncome` columns in the original dataset)
* **LoanAmount:** Loan amount in thousands
* **CreditHistory:** Credit history meets guidelines
* **LoanStatus (target):** Loan approved (Y/N)

### Data preprocessing

Define a list of categorical columns to encode.

In [8]:
categorical_columns = ["Gender", "Married", "Education", "LoanStatus"]

Encode the categorical columns as numeric codes using the [`OrdinalEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html) from scikit-learn.

In [9]:
data[categorical_columns] = OrdinalEncoder().fit_transform(data[categorical_columns])
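
To make the encoding step concrete, here is a minimal sketch on a toy frame. The rows are hypothetical; only the column names are taken from the dataset. It illustrates that `OrdinalEncoder` assigns integer codes following the sorted order of the categories, so `N` becomes 0 and `Y` becomes 1.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame with hypothetical rows; only the column names come from the dataset.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "LoanStatus": ["N", "Y", "Y"],
})

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(toy)  # float array of shape (3, 2)

# categories_ holds the sorted categories per column; the position is the code.
mapping = {
    column: {str(category): code for code, category in enumerate(categories)}
    for column, categories in zip(toy.columns, encoder.categories_)
}
print(mapping)
print(encoded)
```

The same mapping explains why label 1 stands for an approved loan in the rest of the notebook.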

Split into `X` and `y`.

In [10]:
X = data.drop(columns="LoanStatus")
y = data.LoanStatus

Standardize the features (zero mean, unit variance) using the [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) from scikit-learn.

In [11]:
X[X.columns] = StandardScaler().fit_transform(X[X.columns])
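
As a quick check of what the scaler computes, the sketch below compares `StandardScaler` with the manual formula $z = (x - \mu) / \sigma$ on hypothetical income values. The values and variable names are illustrative assumptions, not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical incomes, one feature column.
values = np.array([[3000.0], [6000.0], [9000.0]])

scaled = StandardScaler().fit_transform(values)

# Manual standardization; StandardScaler uses the population std (ddof=0).
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))
```

Note that the notebook fits the scaler on the full dataset before splitting. Fitting on the training split only is a common alternative that keeps test-set statistics out of the preprocessing.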

Convert `y` to type `int`.

In [12]:
y = y.astype(int)

Split the dataset into train (80%) and test (20%) sets, stratified on `y`.

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
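
The effect of `stratify=y` can be sketched on hypothetical imbalanced labels: both splits keep roughly the same class ratio as the full set. The toy arrays below are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 70 positives and 30 negatives.
y_toy = np.array([1] * 70 + [0] * 30)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.2, random_state=42
)

# Share of positives in each split; both should be close to 0.7.
print(y_tr.mean(), y_te.mean())
```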

## b. Dummy classifier

Build a dummy classifier that makes decisions randomly, drawing each label according to the class frequencies observed in the training set.

In [14]:
class DummyClassifier:
    def __init__(self):
        """
        Initialize the class.
        """
        self.perc = dict()
    
    def fit(self, X, y):
        """
        Fit the dummy classifier.
        
        Parameters
        ----------
        X : Numpy array or Pandas DataFrame of shape (n_samples, n_features)
            Training data.
        y : Numpy array or Pandas Series of shape (n_samples,)
            Target values.
        """
        unique, counts = np.unique(y, return_counts=True)
        counts = counts / y.shape[0]
        self.perc = dict(zip(unique, counts))
    
    def predict(self, X):
        """
        Predict the class labels for the provided data.

        Parameters
        ----------
        X : Numpy array or Pandas DataFrame of shape (n_queries, n_features)
            Test samples.

        Returns
        -------
        y : Numpy array of shape (n_queries,)
            Class labels for each data sample.
        """
        return np.random.choice(list(self.perc.keys()), size=X.shape[0], p=list(self.perc.values()))

Implement a function that evaluates a classifier by computing its accuracy ($N_{correct}/N$), the fraction of samples that are predicted correctly.

In [27]:
def accuracy_score(y_true, y_pred):
    """
    Compute the accuracy of the classifier.

    Parameters
    ----------
    y_true : Numpy array or Pandas Series of shape (n_queries,)
        Ground truth (correct) target values.
    y_pred : Numpy array or Pandas Series of shape (n_queries,)
        Predicted labels, as returned by a classifier.

    Returns
    -------
    score : float
        The accuracy of the classifier.
    """
    return np.count_nonzero(y_true == y_pred) / y_true.shape[0]
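
A small worked example of the accuracy formula, using hypothetical labels. The function below is a standalone restatement for illustration (named `accuracy` to avoid shadowing the notebook's function) and converts its inputs to arrays so that plain lists work too.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the ground truth."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return np.count_nonzero(y_true == y_pred) / y_true.shape[0]

# Two of the four predictions match the ground truth.
print(accuracy([1, 0, 1, 1], [1, 1, 1, 0]))  # 0.5
```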

Compute the performance of the dummy classifier using the provided test set.

In [24]:
dcl = DummyClassifier()
dcl.fit(X_train, y_train)
y_pred = dcl.predict(X_test)
print("Accuracy: " + str(accuracy_score(y_test, y_pred)))

Accuracy: 0.5104166666666666
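
For context on this baseline: if labels are drawn with probabilities $p_c$ and the test set has the same class frequencies, the expected accuracy of such random guessing is $\sum_c p_c^2$, while always predicting the majority class would reach $\max_c p_c$. The sketch below computes both for hypothetical labels; the helper name `baseline_accuracies` and the 70/30 split are assumptions, not values from the dataset.

```python
import numpy as np

def baseline_accuracies(y):
    """Return (expected accuracy of frequency-weighted random guessing,
    accuracy of always predicting the majority class)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p ** 2)), float(p.max())

# Hypothetical labels with a 70/30 class split.
y_toy = np.array([1] * 7 + [0] * 3)
random_acc, majority_acc = baseline_accuracies(y_toy)
print(random_acc, majority_acc)
```

Because the prediction is random, a single run on a small test set can deviate noticeably from the expected value.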


## c. K-Nearest Neighbors classifier

Build a K-Nearest Neighbors classifier using the Euclidean distance and a simple majority vote.

In [34]:
class KNNClassifier:
    def __init__(self, n_neighbors=3):
        """
        Initialize the class.
        
        Parameters
        ----------
        n_neighbors : int, default=3
            Number of neighbors to use by default.
        """
        self.n_neighbors = n_neighbors
        self.X = []
        self.y = []
    
    def fit(self, X, y):
        """
        Fit the k-nearest neighbors classifier.
        
        Parameters
        ----------
        X : Numpy array or Pandas DataFrame of shape (n_samples, n_features)
            Training data.
        y : Numpy array or Pandas Series of shape (n_samples,)
            Target values.
        """
        self.X = X
        self.y = np.array(y)
        
    
    @staticmethod
    def _euclidian_distance(a, b):
        """
        Utility function to compute the Euclidean distance.
        
        Parameters
        ----------
        a : Numpy array or Pandas DataFrame
            First operand.
        b : Numpy array or Pandas DataFrame
            Second operand.
        """
        return np.linalg.norm(a - b)

    def _predict(self, x):
        """
        Predict the class label for a single sample.

        Parameters
        ----------
        x : Numpy array of shape (n_features,)
            Test sample.

        Returns
        -------
        label : int
            Predicted class label.
        """
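        # Distance from x to every training sample.
        dists = np.array([self._euclidian_distance(x, a) for a in self.X.values])
        # Indices of the n_neighbors smallest distances (in no particular order).
        idx = np.argpartition(dists, self.n_neighbors)[:self.n_neighbors]
        classes = self.y[idx]
        # Majority vote; on a tie, argmax returns the smallest label.
        counts = np.bincount(classes)
        return np.argmax(counts)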
    
    def predict(self, X):
        """
        Predict the class labels for the provided data.

        Parameters
        ----------
        X : Numpy array or Pandas DataFrame of shape (n_queries, n_features)
            Test samples.

        Returns
        -------
        y : Numpy array of shape (n_queries,)
            Class labels for each data sample.
        """
        return np.array([self._predict(x) for x in X.values])
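
The class above computes distances one pair at a time in Python loops. As an alternative sketch, not the notebook's implementation, the same Euclidean distance and majority vote can be written with NumPy broadcasting. The function name `knn_predict_vectorized` and the toy points are hypothetical. It partitions at `n_neighbors - 1` rather than `n_neighbors`, so that `n_neighbors` may equal the number of training samples.

```python
import numpy as np

def knn_predict_vectorized(X_train, y_train, X_query, n_neighbors=3):
    """Hypothetical vectorized KNN prediction (Euclidean distance, majority vote)."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    y_train = np.asarray(y_train)

    # Pairwise differences of shape (n_queries, n_samples, n_features).
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))

    # Indices of the n_neighbors closest training samples for each query.
    idx = np.argpartition(dists, n_neighbors - 1, axis=1)[:, :n_neighbors]

    # Majority vote per query; ties go to the smallest label.
    return np.array([np.bincount(row).argmax() for row in y_train[idx]])

# Two well-separated toy clusters.
X_toy = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_toy = [0, 0, 0, 1, 1, 1]
pred = knn_predict_vectorized(X_toy, y_toy, [[0.2, 0.1], [5.5, 5.5]])
print(pred)
```

The broadcasted array needs memory proportional to `n_queries * n_samples * n_features`, which is fine for a dataset of this size but may not scale to large ones.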

Compute the performance of the system as a function of $k = 1, \dots, 7$.

In [36]:
# Test the classifier for k = 1 to 7
for k in range(1, 8):
    knn = KNNClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    print("Accuracy for k = " + str(k) + ": " + str(accuracy_score(y_test, y_pred)))

Accuracy for k = 1: 0.6979166666666666
Accuracy for k = 2: 0.6354166666666666
Accuracy for k = 3: 0.7916666666666666
Accuracy for k = 4: 0.7395833333333334
Accuracy for k = 5: 0.8125
Accuracy for k = 6: 0.78125
Accuracy for k = 7: 0.8020833333333334
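
Even values of $k$ score lower than their odd neighbors in these results. One plausible contributor, not verified here, is tie-breaking: with two classes and an even $k$, the vote can be split evenly, and `np.argmax` on the `np.bincount` result then returns the smallest label, which is 0 (`N`, assuming the ordinal encoding sorts `N` before `Y`). A minimal sketch with hypothetical neighbor labels:

```python
import numpy as np

# Hypothetical labels of k = 4 neighbors: two votes for each class.
votes = np.array([0, 1, 1, 0])

counts = np.bincount(votes)         # one count per label: [2, 2]
prediction = int(np.argmax(counts)) # argmax returns the first maximum

print(counts, prediction)
```

Using odd values of $k$ avoids such ties in a two-class problem.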


Run the KNN algorithm using only the features `TotalIncome` and `CreditHistory`.

In [37]:
X_train_2c = X_train[["TotalIncome", "CreditHistory"]]
X_test_2c = X_test[["TotalIncome", "CreditHistory"]]

for k in range(1, 8):
    knn = KNNClassifier(n_neighbors=k)
    knn.fit(X_train_2c, y_train)
    y_pred = knn.predict(X_test_2c)
    print("Accuracy for k = " + str(k) + ": " + str(accuracy_score(y_test, y_pred)))

Accuracy for k = 1: 0.7604166666666666
Accuracy for k = 2: 0.6875
Accuracy for k = 3: 0.78125
Accuracy for k = 4: 0.6979166666666666
Accuracy for k = 5: 0.8229166666666666
Accuracy for k = 6: 0.7708333333333334
Accuracy for k = 7: 0.8125


Re-run the KNN algorithm using the features `TotalIncome`, `CreditHistory` and `Married`.

In [38]:
X_train_3c = X_train[["TotalIncome", "CreditHistory", "Married"]]
X_test_3c = X_test[["TotalIncome", "CreditHistory", "Married"]]

for k in range(1, 8):
    knn = KNNClassifier(n_neighbors=k)
    knn.fit(X_train_3c, y_train)
    y_pred = knn.predict(X_test_3c)
    print("Accuracy for k = " + str(k) + ": " + str(accuracy_score(y_test, y_pred)))

Accuracy for k = 1: 0.7916666666666666
Accuracy for k = 2: 0.6875
Accuracy for k = 3: 0.8645833333333334
Accuracy for k = 4: 0.7604166666666666
Accuracy for k = 5: 0.8020833333333334
Accuracy for k = 6: 0.75
Accuracy for k = 7: 0.7916666666666666


Re-run the KNN algorithm using all features with $k = 5$.

In [43]:
k = 5
knn = KNNClassifier(n_neighbors=k)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print("Accuracy for k = " + str(k) + ": " + str(accuracy_score(y_test, y_pred)))

Accuracy for k = 5: 0.8125


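Based on the previous results, we choose $k=5$: it gives the highest accuracy with all features (0.8125) and with the two-feature subset (0.8229), although $k=3$ scores higher (0.8646) with the three-feature subset. The evaluation uses the accuracy metric; other metrics, such as the F1-score, could be used as well.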
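
As an illustration of the F1-score mentioned above, here is a minimal sketch for the binary case, computed from precision and recall with NumPy. The function name `f1_score_binary` and the toy labels are hypothetical. scikit-learn provides `sklearn.metrics.f1_score` for the same purpose.

```python
import numpy as np

def f1_score_binary(y_true, y_pred, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for one positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    tp = np.count_nonzero((y_pred == positive) & (y_true == positive))
    fp = np.count_nonzero((y_pred == positive) & (y_true != positive))
    fn = np.count_nonzero((y_pred != positive) & (y_true == positive))

    if tp == 0:
        # No true positives: precision and recall are 0 or undefined.
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels: tp = 2, fp = 1, fn = 1, so precision = recall = 2/3.
score = f1_score_binary([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
print(score)
```

Unlike accuracy, the F1-score ignores true negatives, which makes it more informative when one class dominates.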