# Loan approval prediction using KNN model (from scratch)

https://www.computersciencejournals.com/ijecs/article/view/30/1-2-18

data: https://drive.google.com/file/d/1LIvIdqdHDFEGnfzIgEh4L6GFirzsE3US/view

https://www.geeksforgeeks.org/loan-approval-prediction-using-machine-learning/

for r: https://www.datacamp.com/tutorial/k-nearest-neighbors-knn-classification-with-r-tutorial

## Contents

[**1. Introduction**](#introduction)

[**2. Loading data**](#loading_data)
* [2.1. Importing libraries](#libraries)
* [2.2. Loading data](#stock)

[**3. Basic analysis of loan data**](#racial_diversity)
* [3.1. Data overview](#closing_price)
* [3.2. Data imputation](#trading_volume)

**4. Building the KNN model**

**5. Evaluating accuracy of the KNN model**

**6. Feature engineering - improving model accuracy**

## 1. Introduction

This project has 3 primary objectives:
1. **Exploratory data analysis**: Inspect the loan application data, checking column types, summary statistics, duplicates and missing values.
2. **Data preparation**: Impute missing values and encode categorical columns as numbers so that distances between applicants can be computed.
3. **Predictive modelling**: Implement a k-nearest neighbours (KNN) classifier from scratch to predict whether a loan is approved, and evaluate its accuracy.

### Scope
This project uses a loan approval dataset of 598 applications. Each application has an identifier (`Loan_ID`), 11 applicant and loan attributes, and the outcome `Loan_Status` (Y/N). The data is split into a training set and a test set, with 10% of the rows held out for testing.

### Inquiry questions

* Which columns contain missing values, and how should they be filled?
* How can the categorical columns be converted into a numeric form?
* How accurately does a KNN classifier built from scratch predict loan status on the test set?
* Does restricting the model to a few selected features improve its accuracy?

### Project outline

The project begins with importing libraries and loading the loan data (Section 2). Section 3 covers a basic analysis of the data and the imputation of missing values. Section 4 builds the KNN model: encoding the categorical columns, splitting the data into training and test sets, and computing distances to find the nearest neighbours. Section 5 evaluates the accuracy of the model on the test set. Section 6 fits a regression on the features, selects a smaller set of predictors and re-evaluates the model.

<a name="loading_data"></a>
## 2. Loading data

<a name="libraries"></a>
### 2.1. Importing libraries

In [355]:
# Import necessary libraries
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Ignore warnings
warnings.filterwarnings("ignore")

# Set style for seaborn plots
# (sns.set() resets the theme to its defaults, so it is called first)
sns.set()
sns.set_style('dark')
sns.set_palette("viridis")
sns.set_context('talk')

# Display options for NumPy and pandas output
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)
pd.set_option('display.float_format', '{:.2f}'.format)

The dataset contains the following columns:

| # | Column | Description |
|---|---|---|
| 1 | Loan_ID | Unique ID of the loan application |
| 2 | Gender | Gender of the applicant (Male/Female) |
| 3 | Married | Marital status of the applicant (Yes/No) |
| 4 | Dependents | Number of dependents of the applicant |
| 5 | Education | Whether the applicant is a graduate (Graduate/Not Graduate) |
| 6 | Self_Employed | Whether the applicant is self-employed (Yes/No) |
| 7 | ApplicantIncome | Applicant income |
| 8 | CoapplicantIncome | Co-applicant income |
| 9 | LoanAmount | Loan amount (in thousands) |
| 10 | Loan_Amount_Term | Term of the loan (in months) |
| 11 | Credit_History | Credit history of the individual’s repayment of their debts |
| 12 | Property_Area | Area of the property (Rural/Urban/Semi-urban) |
| 13 | Loan_Status | Whether the loan was approved (Y = Yes, N = No) |

<a name="stock"></a>
### 2.2. Loading data

In [428]:
data = pd.read_csv('LoanApprovalPrediction.csv')
display(data)

duplicates = data['Loan_ID'].duplicated()
print(f'Duplicates: {duplicates.sum()}')

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0.00,Graduate,No,5849,0.00,,360.00,1.00,Urban,Y
1,LP001003,Male,Yes,1.00,Graduate,No,4583,1508.00,128.00,360.00,1.00,Rural,N
2,LP001005,Male,Yes,0.00,Graduate,Yes,3000,0.00,66.00,360.00,1.00,Urban,Y
3,LP001006,Male,Yes,0.00,Not Graduate,No,2583,2358.00,120.00,360.00,1.00,Urban,Y
4,LP001008,Male,No,0.00,Graduate,No,6000,0.00,141.00,360.00,1.00,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...
593,LP002978,Female,No,0.00,Graduate,No,2900,0.00,71.00,360.00,1.00,Rural,Y
594,LP002979,Male,Yes,3.00,Graduate,No,4106,0.00,40.00,180.00,1.00,Rural,Y
595,LP002983,Male,Yes,1.00,Graduate,No,8072,240.00,253.00,360.00,1.00,Urban,Y
596,LP002984,Male,Yes,2.00,Graduate,No,7583,0.00,187.00,360.00,1.00,Urban,Y


Duplicates: 0


<a name="racial_diversity"></a>
## 3. Basic analysis of loan data

<a name="closing_price"></a>
### 3.1. Data overview

In [357]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 598 entries, 0 to 597
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            598 non-null    object 
 1   Gender             598 non-null    object 
 2   Married            598 non-null    object 
 3   Dependents         586 non-null    float64
 4   Education          598 non-null    object 
 5   Self_Employed      598 non-null    object 
 6   ApplicantIncome    598 non-null    int64  
 7   CoapplicantIncome  598 non-null    float64
 8   LoanAmount         577 non-null    float64
 9   Loan_Amount_Term   584 non-null    float64
 10  Credit_History     549 non-null    float64
 11  Property_Area      598 non-null    object 
 12  Loan_Status        598 non-null    object 
dtypes: float64(5), int64(1), object(7)
memory usage: 60.9+ KB


In [358]:
data.describe()

Unnamed: 0,Dependents,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History
count,586.0,598.0,598.0,577.0,584.0,549.0
mean,0.76,5292.25,1631.5,144.97,341.92,0.84
std,1.01,5807.27,2953.32,82.7,65.21,0.36
min,0.0,150.0,0.0,9.0,12.0,0.0
25%,0.0,2877.5,0.0,100.0,360.0,1.0
50%,0.0,3806.0,1211.5,127.0,360.0,1.0
75%,1.75,5746.0,2324.0,167.0,360.0,1.0
max,3.0,81000.0,41667.0,650.0,480.0,1.0


In [359]:
data.isna().sum()

Loan_ID               0
Gender                0
Married               0
Dependents           12
Education             0
Self_Employed         0
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           21
Loan_Amount_Term     14
Credit_History       49
Property_Area         0
Loan_Status           0
dtype: int64

<a name="trading_volume"></a>
### 3.2. Data imputation

Four columns contain missing values: `Dependents`, `LoanAmount`, `Loan_Amount_Term` and `Credit_History`. These blanks must be replaced with values derived by statistical methods, such as the mean or median for numerical attributes and the mode for categorical attributes, after which we check again for null values to make sure that no blank fields remain. Irrelevant or noisy values can be replaced in the same way so that they do not affect training or prediction.

Here every gap is filled with the column mean. Note that `Credit_History` otherwise takes only the values 0 and 1, so its imputed rows receive the fractional value 0.84.

In [360]:
data['Dependents'] = data['Dependents'].fillna(data['Dependents'].mean())
data['LoanAmount'] = data['LoanAmount'].fillna(data['LoanAmount'].mean())
data['Loan_Amount_Term'] = data['Loan_Amount_Term'].fillna(data['Loan_Amount_Term'].mean())
data['Credit_History'] = data['Credit_History'].fillna(data['Credit_History'].mean())

data.isna().sum()

Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64
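
Mean imputation is not the only option. The following is a minimal sketch, on a made-up toy frame rather than the loan data, of filling a binary column with its mode and a continuous column with its median. The frame `toy` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with gaps in a binary and a continuous column
toy = pd.DataFrame({
    "Credit_History": [1.0, 1.0, np.nan, 0.0, 1.0],
    "LoanAmount": [100.0, np.nan, 120.0, 140.0, 160.0],
})

imputed = toy.copy()

# Binary flag: fill with the most frequent value so it stays 0 or 1
imputed["Credit_History"] = imputed["Credit_History"].fillna(
    imputed["Credit_History"].mode().iloc[0]
)

# Continuous amount: the median is less sensitive to outliers than the mean
imputed["LoanAmount"] = imputed["LoanAmount"].fillna(
    imputed["LoanAmount"].median()
)

print(imputed)
print(imputed.isna().sum())
```

Whether the mode or the mean suits a given column depends on how that column is used later, so this is one possible choice rather than a recommendation.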

`Loan_ID` is unique for every row and is not correlated with any of the other columns, so it carries no information for prediction. It is kept in the table as an identifier and removed with `.drop()` inside the distance function in Section 4.

With no missing values remaining, we can proceed to model training.

## 4. Building the KNN model

First we divide the data into independent and dependent variables. The first 12 columns (of which `Loan_ID` is only an identifier) form one group, and the final `Loan_Status` column forms the other, since it depends on the other attributes of the dataset.
* x = predictor variables
* y = response variable - loan status
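
The split into predictors and response described above could be sketched as follows. The toy frame and the names `X` and `y` are assumptions for illustration; the notebook itself keeps everything in one table and drops the extra columns inside its distance function.

```python
import pandas as pd

# Toy rows (made-up values) with the same kind of columns as the loan data
toy = pd.DataFrame({
    "Loan_ID": ["LP1", "LP2", "LP3"],
    "Married": [0, 1, 1],
    "Credit_History": [1.0, 0.0, 1.0],
    "Loan_Status": [1, 0, 1],
})

# Predictors: everything except the identifier and the response
X = toy.drop(columns=["Loan_ID", "Loan_Status"])

# Response: the loan status
y = toy["Loan_Status"]

print(X.columns.tolist())
print(y.tolist())
```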

Next we must transform all the categorical variables into a numeric format that the model can work with. scikit-learn offers `LabelEncoder` and `OneHotEncoder` for this. Here we use the `fit_transform` method of `LabelEncoder`, which replaces each category with an integer code.

In [361]:
categorical = [
    'Gender',
    'Married',
    'Education',
    'Self_Employed',
    'Property_Area',
    'Loan_Status'
]

label_encoder = LabelEncoder()

for col in categorical:
    data[col] = label_encoder.fit_transform(data[col])

display(data.head())

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,1,0,0.0,0,0,5849,0.0,144.97,360.0,1.0,2,1
1,LP001003,1,1,1.0,0,0,4583,1508.0,128.0,360.0,1.0,0,0
2,LP001005,1,1,0.0,0,1,3000,0.0,66.0,360.0,1.0,2,1
3,LP001006,1,1,0.0,1,0,2583,2358.0,120.0,360.0,1.0,2,1
4,LP001008,1,0,0.0,0,0,6000,0.0,141.0,360.0,1.0,2,1
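
Label encoding turns `Property_Area` into the codes 0, 1 and 2, which a distance-based model treats as ordered and evenly spaced. One-hot encoding avoids that. The sketch below uses `pd.get_dummies` on a made-up toy column; the category spellings are assumptions and this is not the encoding used in the rest of the notebook.

```python
import pandas as pd

# Toy column (made-up rows) with three categories
toy = pd.DataFrame({"Property_Area": ["Urban", "Rural", "Semiurban", "Urban"]})

# One indicator column per category; exactly one of them is 1 in each row
one_hot = pd.get_dummies(toy, columns=["Property_Area"], dtype=int)

print(one_hot)
```

With indicator columns, any two different areas are the same distance apart, whereas with integer codes the categories coded 0 and 2 are twice as far apart as those coded 0 and 1.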


After encoding the categorical columns, we split the data into training and test sets using `train_test_split` from scikit-learn, holding out 10% of the rows for testing. Because KNN relies on distances, the features would normally then be standardised, for example with `StandardScaler`. The cells below skip this step and work with the raw values.

In [397]:
train, test = train_test_split(
    data,
    test_size = 0.1, 
    random_state = 26)

train = train.reset_index(drop = True)
test = test.reset_index(drop = True)

display(train)
display(test)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP002804,0,1,0.00,0,0,4180,2306.00,182.00,360.00,1.00,1,1
1,LP002941,1,1,2.00,1,1,6383,1000.00,187.00,360.00,1.00,0,0
2,LP001813,1,0,0.00,0,1,6050,4333.00,120.00,180.00,1.00,2,0
3,LP002272,1,1,2.00,0,0,3276,484.00,135.00,360.00,0.84,1,1
4,LP002140,1,0,0.00,0,0,8750,4167.00,308.00,360.00,1.00,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
533,LP001570,1,1,2.00,0,0,4167,1447.00,158.00,360.00,1.00,0,1
534,LP001263,1,1,3.00,0,0,3167,4000.00,180.00,300.00,0.00,1,0
535,LP001356,1,1,0.00,0,0,4652,3583.00,144.97,360.00,1.00,1,1
536,LP002409,1,1,0.00,0,0,7901,1833.00,180.00,360.00,1.00,0,1


Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP002699,1,1,2.00,0,1,17500,0.00,400.00,360.00,1.00,0,1
1,LP002837,1,1,3.00,0,0,3400,2500.00,123.00,360.00,0.00,0,0
2,LP002347,1,1,0.00,0,0,3246,1417.00,138.00,360.00,1.00,1,1
3,LP001241,0,0,0.00,0,0,4300,0.00,136.00,360.00,0.00,1,0
4,LP002619,1,1,0.00,1,0,3814,1483.00,124.00,300.00,1.00,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
55,LP002494,1,0,0.00,0,0,6000,0.00,140.00,360.00,1.00,0,1
56,LP002670,0,1,2.00,0,0,2031,1632.00,113.00,480.00,1.00,1,1
57,LP001498,1,0,0.00,0,0,5417,0.00,168.00,360.00,1.00,2,1
58,LP002106,1,1,0.76,0,1,5503,4490.00,70.00,341.92,1.00,1,1
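
The text before the split mentions `StandardScaler`, but no scaling cell appears in the notebook. As a hedged sketch of what standardisation would involve, the example below rescales two made-up columns using the mean and standard deviation of the training rows only. The frames `train_toy` and `test_toy` are assumptions for illustration.

```python
import pandas as pd

# Toy training and test rows (made-up values) on very different scales
train_toy = pd.DataFrame({
    "ApplicantIncome": [2000.0, 4000.0, 6000.0],
    "Credit_History": [0.0, 1.0, 1.0],
})
test_toy = pd.DataFrame({
    "ApplicantIncome": [4000.0],
    "Credit_History": [1.0],
})

# Learn the mean and standard deviation from the training rows only
mu = train_toy.mean()
sigma = train_toy.std(ddof=0)  # population std, the convention StandardScaler uses

# Apply the same transformation to both sets
train_scaled = (train_toy - mu) / sigma
test_scaled = (test_toy - mu) / sigma

print(train_scaled)
print(test_scaled)
```

Without a step like this, columns measured in thousands, such as the incomes, dominate the Euclidean distance and binary columns such as `Credit_History` contribute almost nothing.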


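The distance function converts the training table and the input row to NumPy arrays and computes all distances in a single vectorised operation, which is much faster than looping over the rows.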

In [382]:
def distance(point1, point2):
    """Return the Euclidean distance between point1 and point2,
    where each argument is an array
    consisting of the coordinates of the point."""
    return np.sqrt(np.sum((point1 - point2)**2))

def fast_distances(input_row, train_set):
    """Return an array of the distances between input_row and each row in train_set.

    Takes 2 arguments:
      input_row: A row of a table containing the features of one
        loan application (e.g., test.iloc[0]).
      train_set: A table of features (for example, the whole
        table train)."""

    # Keep only the feature columns
    train_copy = train_set.drop(columns = ['Loan_ID', 'Loan_Status'])
    input_row = input_row.drop(['Loan_ID', 'Loan_Status'])
    # A row taken from a mixed-type table has dtype object, so convert it to numbers
    input_row = input_row.apply(pd.to_numeric, errors = 'coerce')

    # Convert train_set and input_row to NumPy arrays
    train_matrix = np.asarray(train_copy)
    test_array = np.asarray(input_row)

    # Calculate all distances at once using NumPy broadcasting
    distances = np.sqrt(np.sum((train_matrix - test_array)**2, axis=1))

    # For tie-breaking purposes, add a small amount of noise
    # (the fixed seed makes the tie-breaking reproducible)
    np.random.seed(0)
    eps = np.random.uniform(size = distances.shape) * 1e-10
    distances = distances + eps
    return distances
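
To illustrate why the vectorised calculation gives the same result as a row-by-row loop, here is a small check on made-up points. Subtracting a 1-D array from a 2-D array broadcasts the subtraction across every row, and `axis=1` sums the squared differences within each row.

```python
import numpy as np

# Three made-up training points and one query point
train_matrix = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
query = np.array([0.0, 0.0])

# Row-by-row: one Euclidean distance per loop iteration
loop = np.array([np.sqrt(np.sum((row - query) ** 2)) for row in train_matrix])

# Vectorised: the query is broadcast against every row at once
vectorised = np.sqrt(np.sum((train_matrix - query) ** 2, axis=1))

print(loop)
print(vectorised)
print(np.allclose(loop, vectorised))
```

Both approaches give distances of 0, 5 and 10 for these points.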

In [386]:
testr = test.iloc[5]
testr

Loan_ID              LP001451
Gender                      1
Married                     1
Dependents               1.00
Education                   0
Self_Employed               1
ApplicantIncome         10513
CoapplicantIncome     3850.00
LoanAmount             160.00
Loan_Amount_Term       180.00
Credit_History           0.00
Property_Area               2
Loan_Status                 0
Name: 5, dtype: object

In [387]:
fast_distances(testr, train)

array([4820.05, 9144.52, 7498.02, ..., 8001.58, 5916.73, 9182.48])

In [388]:
test_row = test.iloc[0]

In [389]:
fast_distances(test_row, train)

array([ 3901.83,  3767.98,  1289.39, ...,  3863.04, 12977.49,  3772.62])

In [390]:
def k_nearest_neighbours_model(input_row, train_set, k):
    """Predict Loan_Status for input_row as the majority class
    among its k nearest neighbours in train_set."""
    # Compute distances and create a distance column
    distances_array = fast_distances(input_row, train_set)
    train_distances = pd.DataFrame({'Distance': distances_array})
    # Merge on the train_set argument (not the global train) so any training table works;
    # this relies on train_set having a default 0..n-1 index
    all_info = train_set.merge(train_distances, left_index = True, right_index = True)
    # Identify classification of the majority of 'k' nearest neighbours
    k_neighbours = all_info.sort_values(by = 'Distance', ascending = True)
    k_nearest_neighbours = k_neighbours.head(k)
    mode = k_nearest_neighbours['Loan_Status'].mode().loc[0]
    return mode

In [391]:
k_nearest_neighbours_model(test_row, train, 3)

0
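
The same majority-vote idea can be written with NumPy alone. The following is a sketch on made-up points, not a replacement for the function above; `knn_predict`, `features` and `labels` are hypothetical names introduced for this example.

```python
import numpy as np

def knn_predict(query, features, labels, k):
    """Majority label among the k rows of `features` closest to `query`."""
    # Euclidean distance from the query to every row
    distances = np.sqrt(np.sum((features - query) ** 2, axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances, kind="stable")[:k]
    # Count the votes per label; argmax returns the smallest label on a tie
    votes = np.bincount(labels[nearest])
    return int(np.argmax(votes))

# Two made-up clusters: label 0 near the origin, label 1 near (5, 5)
features = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1])

near_origin = knn_predict(np.array([0.2, 0.1]), features, labels, k=3)
near_cluster = knn_predict(np.array([5.5, 5.0]), features, labels, k=3)

print(near_origin, near_cluster)
```

An odd `k` avoids tied votes when there are only two classes, which is one reason values such as 3 and 5 are common choices.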

## 5. Evaluating accuracy of the KNN model

Now that it's easy to use the classifier, let's see how accurate it is on the whole test set.

The function below uses `k_nearest_neighbours_model` and `apply` to classify every application in the test set, stores these guesses in a `prediction` column, and then computes the proportion of correct classifications.

In [433]:
def evaluate_accuracy(test_set, train_set, k):
    # Create a copy of the test set to avoid modifying the original data
    test_accuracy_table = test_set.copy()

    # Apply the k_nearest_neighbours_model function to each row of the test set
    prediction = test_accuracy_table.apply(
        lambda row: k_nearest_neighbours_model(row, train_set, k),
        axis=1
    )

    # Add the prediction and correct columns to the test accuracy table
    test_accuracy_table['prediction'] = prediction
    test_accuracy_table['correct'] = (
        test_accuracy_table['prediction'] == test_accuracy_table['Loan_Status']
    )

    # Proportion of correct predictions (the mean of a boolean column);
    # unlike value_counts()[True], this also works when no prediction is correct
    accuracy = test_accuracy_table['correct'].mean()

    print(f'Accuracy: {accuracy*100:.2f}%')

    return test_accuracy_table.head(5)

In [434]:
evaluate_accuracy(test, train, 5)

Accuracy: 70.00%


Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status,prediction,correct
0,LP002699,1,1,2.0,0,1,17500,0.0,400.0,360.0,1.0,0,1,1,True
1,LP002837,1,1,3.0,0,0,3400,2500.0,123.0,360.0,0.0,0,0,1,False
2,LP002347,1,1,0.0,0,0,3246,1417.0,138.0,360.0,1.0,1,1,1,True
3,LP001241,0,0,0.0,0,0,4300,0.0,136.0,360.0,0.0,1,0,1,False
4,LP002619,1,1,0.0,1,0,3814,1483.0,124.0,300.0,1.0,1,1,1,True
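
Accuracy alone does not show which kind of mistake the model makes. As a small sketch, the example below takes only the five rows displayed above (not the full test set) and computes their accuracy, the accuracy of always guessing the most common class, and a confusion table with `pd.crosstab`. The name `shown` is an assumption for illustration.

```python
import pandas as pd

# Loan_Status and prediction for the five rows displayed above
shown = pd.DataFrame({
    "Loan_Status": [1, 0, 1, 0, 1],
    "prediction": [1, 1, 1, 1, 1],
})

# Proportion of rows where the prediction matches the true status
accuracy = (shown["prediction"] == shown["Loan_Status"]).mean()

# Share of the most common class: the accuracy of always guessing that class
baseline = shown["Loan_Status"].value_counts(normalize=True).max()

# Rows are the true status, columns are the predicted status
confusion = pd.crosstab(shown["Loan_Status"], shown["prediction"])

print(accuracy, baseline)
print(confusion)
```

In these five rows every prediction is 1, so both rejected applications are misclassified. Running the same comparison on the full test set would show whether the model does better than the majority-class baseline.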



## 6. Feature engineering - improving model accuracy

In [411]:
import statsmodels.formula.api as smf

reg = smf.ols(
    'Loan_Status ~ Gender + Married + Dependents + Education + Self_Employed + '
    'ApplicantIncome + CoapplicantIncome + LoanAmount + Loan_Amount_Term + '
    'Credit_History + Property_Area',
    data = data
).fit()

reg.summary()

0,1,2,3
Dep. Variable:,Loan_Status,R-squared:,0.305
Model:,OLS,Adj. R-squared:,0.292
Method:,Least Squares,F-statistic:,23.43
Date:,"Thu, 01 Feb 2024",Prob (F-statistic):,5.13e-40
Time:,23:19:22,Log-Likelihood:,-279.82
No. Observations:,598,AIC:,583.6
Df Residuals:,586,BIC:,636.4
Df Model:,11,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,0.1353,0.113,1.197,0.232,-0.087,0.357
Gender,-0.0158,0.045,-0.354,0.723,-0.103,0.072
Married,0.0974,0.038,2.553,0.011,0.022,0.172
Dependents,0.0022,0.017,0.127,0.899,-0.032,0.036
Education,-0.0633,0.040,-1.600,0.110,-0.141,0.014
Self_Employed,-0.0158,0.042,-0.379,0.705,-0.098,0.066
ApplicantIncome,1.595e-07,3.39e-06,0.047,0.963,-6.51e-06,6.83e-06
CoapplicantIncome,-9.366e-06,5.78e-06,-1.622,0.105,-2.07e-05,1.98e-06
LoanAmount,-0.0003,0.000,-1.202,0.230,-0.001,0.000

0,1,2,3
Omnibus:,93.137,Durbin-Watson:,1.928
Prob(Omnibus):,0.0,Jarque-Bera (JB):,135.228
Skew:,-1.145,Prob(JB):,4.3199999999999995e-30
Kurtosis:,3.432,Cond. No.,57300.0
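
A quick alternative screen, different from the regression above, is to rank features by the absolute value of their correlation with `Loan_Status`. The sketch below uses a made-up toy frame, so the ranking it produces says nothing about the real data; `toy`, `ranked` and `top_two` are hypothetical names.

```python
import pandas as pd

# Toy encoded rows (made-up values)
toy = pd.DataFrame({
    "Credit_History": [1, 1, 1, 0, 0, 1, 0, 1],
    "Married":        [1, 0, 1, 0, 1, 1, 0, 0],
    "Gender":         [1, 1, 0, 1, 0, 0, 1, 0],
    "Loan_Status":    [1, 1, 1, 0, 0, 1, 0, 0],
})

# Pearson correlation of each feature with the response
correlations = toy.corr()["Loan_Status"].drop("Loan_Status")

# Rank by strength of association, ignoring the sign
ranked = correlations.abs().sort_values(ascending=False)
top_two = ranked.index[:2].tolist()

print(ranked)
print(top_two)
```

Correlation looks at one feature at a time, whereas the regression estimates each coefficient while holding the other features fixed, so the two screens can disagree.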


In [422]:
features = ['Credit_History', 'Married']

# Keep the identifier and response alongside the selected features,
# because fast_distances drops 'Loan_ID' and 'Loan_Status' itself
columns = ['Loan_ID', 'Loan_Status'] + features

train_features = train[columns].copy()
display(train_features.head(2))

test_features = test[columns].copy()
display(test_features.head(2))

Unnamed: 0,Loan_ID,Loan_Status,Credit_History,Married
0,LP002804,1,1.0,1
1,LP002941,0,1.0,1


Unnamed: 0,Loan_ID,Loan_Status,Credit_History,Married
0,LP002699,1,1.0,1
1,LP002837,0,0.0,1


In [427]:
evaluate_accuracy(test_features, train_features, 5)

Accuracy: 81.67%


Unnamed: 0,Loan_ID,Loan_Status,Credit_History,Married,prediction,correct
0,LP002699,1,1.0,1,1,True
1,LP002837,0,0.0,1,0,True
2,LP002347,1,1.0,1,1,True
3,LP001241,0,0.0,0,0,True
4,LP002619,1,1.0,1,1,True
