# Exercice 2: Classification system with KNN - To Loan or Not To Loan

## Imports

Import some useful libraries

In [52]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.model_selection import train_test_split

## a. Getting started

### Data loading

The original dataset comes from the Kaggle's [Loan Prediction](https://www.kaggle.com/ninzaami/loan-predication) problem. The provided dataset has already undergone some processing, such as removing some columns and invalid data. Pandas is used to read the CSV file.

In [53]:
data = pd.read_csv("loandata.csv")

Display the head of the data.

In [54]:
data.head()

Unnamed: 0,Gender,Married,Education,TotalIncome,LoanAmount,CreditHistory,LoanStatus
0,Male,Yes,Graduate,6091.0,128.0,1.0,N
1,Male,Yes,Graduate,3000.0,66.0,1.0,Y
2,Male,Yes,Not Graduate,4941.0,120.0,1.0,Y
3,Male,No,Graduate,6000.0,141.0,1.0,Y
4,Male,Yes,Graduate,9613.0,267.0,1.0,Y


Data's columns:
* **Gender:** Applicant gender (Male/ Female)
* **Married:** Is the Applicant married? (Y/N)
* **Education:** Applicant Education (Graduate/ Not Graduate)
* **TotalIncome:** Applicant total income (sum of `ApplicantIncome` and `CoapplicantIncome` columns in the original dataset)
* **LoanAmount:** Loan amount in thousands
* **CreditHistory:** Credit history meets guidelines
* **LoanStatus** (Target)**:** Loan approved (Y/N)

### Data preprocessing

Define a list of categorical columns to encode.

In [55]:
categorical_columns = ["Gender", "Married", "Education", "LoanStatus"]

Encode categorical columns using the [`OrdinalEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html) of scikit learn.

In [56]:
data[categorical_columns] = OrdinalEncoder().fit_transform(data[categorical_columns])

Split into `X` and `y`.

In [57]:
X = data.drop(columns="LoanStatus")
y = data.LoanStatus

Normalize data using the [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) of scikit learn.

In [58]:
X[X.columns] = StandardScaler().fit_transform(X[X.columns])

Convert `y` type to `int` 

In [59]:
y = y.astype(int)

Split dataset into train and test sets.

In [60]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

## b. Dummy classifier

Build a dummy classifier that takes decisions randomly.

In [61]:
class DummyClassifier():
    
    def __init__(self):
        """
        Initialize the class.
        """
        pass
    
    def fit(self, X, y):
        """
        Fit the dummy classifier.
        
        Parameters
        ----------
        X : Numpy array or Pandas DataFrame of shape (n_samples, n_features)
            Training data.
        y : Numpy array or Pandas DataFrame of shape (n_samples,)
            Target values.
        """
        pass
    
    def predict(self, X):
        """
        Predict the class labels for the provided data.

        Parameters
        ----------
        X : Numpy array or Pandas DataFrame of shape (n_queries, n_features)
            Test samples.

        Returns
        -------
        y : Numpy array or Pandas DataFrame of shape (n_queries,)
            Class labels for each data sample.
        """
        size = np.shape(X)[0]
        return np.random.randint(0, 1, size=size)

Implement a function to evaluate the performance of a classification by computing the accuracy ($N_{correct}/N$).

In [62]:
def accuracy_score(y_true, y_pred):
    ok=0
    y_true = y_true.array
    for i in range(np.shape(y_true)[0]):
        if y_true[i] == y_pred[i]:
            ok+=1
    return ok/np.shape(y_true)[0]

Compute the performance of the dummy classifier using the provided test set.

In [63]:
dc = DummyClassifier()
y_pred = dc.predict(X_test)
print(accuracy_score(y_test, y_pred))

0.3125


## c. K-Nearest Neighbors classifier

Build a K-Nearest Neighbors classifier using an Euclidian distance computation and a simple majority voting criterion.

In [88]:
class KNNClassifier():
    
    def __init__(self, n_neighbors=3):
        """
        Initialize the class.
        
        Parameters
        ----------
        n_neighbors : int, default=3
            Number of neighbors to use by default.
        """
        self.n_neighbors = n_neighbors
    
    def fit(self, X, y):
        """
        Fit the k-nearest neighbors classifier.
        
        Parameters
        ----------
        X : Numpy array or Pandas DataFrame of shape (n_samples, n_features)
            Training data.
        y : Numpy array or Pandas DataFrame of shape (n_samples,)
            Target values.
        """
        self.X=X
        self.y=y
    
    @staticmethod
    def _euclidian_distance(a, b):
        """
        Utility function to compute the euclidian distance.
        
        Parameters
        ----------
        a : Numpy array or Pandas DataFrame
            First operand.
        b : Numpy array or Pandas DataFrame
            Second operand.
        """
        d=0.0
        print(a)
        print(b)
        for x in a:
            print(x)
            d+=pow(abs(a[column]-b[column]), 2)
        math.sqrt(d)
    
    def predict(self, X):
        """
        Predict the class labels for the provided data.

        Parameters
        ----------
        X : Numpy array or Pandas DataFrame of shape (n_queries, n_features)
            Test samples.

        Returns
        -------
        y : Numpy array or Pandas DataFrame of shape (n_queries,)
            Class labels for each data sample.
        """
        for e in X.iterrows():
            #compute distances
            distances = {}
            for row in self.X.iterrows() :
                distances[self._euclidian_distance(e, row)] = row
            #sort distances
            distances = dict(sorted(d.items()))
            print(distances)

Compute the performance of the system as a function of $k = 1...7$.

In [89]:
kc = KNNClassifier()
kc.fit(X_train, y_train)
kc.predict(X_test)

(189, Gender           0.467198
Married          0.737162
Education        1.987072
TotalIncome      0.147169
LoanAmount      -0.270259
CreditHistory    0.413197
Name: 189, dtype: float64)
(370, Gender           0.467198
Married          0.737162
Education       -0.503253
TotalIncome      1.776352
LoanAmount       0.065460
CreditHistory    0.413197
Name: 370, dtype: float64)
189


NameError: name 'column' is not defined

Run the KNN algorithm using only the features `TotalIncome` and `CreditHistory`.

Re-run the KNN algorithm using the features `TotalIncome`, `CreditHistory` and `Married`.

Re-run the KNN algorithm using all features.