![Alt text](Untitled-1.png)

# Loan approval prediction using KNN model (from scratch)

https://www.computersciencejournals.com/ijecs/article/view/30/1-2-18

data: https://drive.google.com/file/d/1LIvIdqdHDFEGnfzIgEh4L6GFirzsE3US/view

https://www.geeksforgeeks.org/loan-approval-prediction-using-machine-learning/

for r: https://www.datacamp.com/tutorial/k-nearest-neighbors-knn-classification-with-r-tutorial

## Contents

[**1. Introduction**](#introduction)

[**2. Loading data**](#loading_data)
* [2.1. Importing libraries](#libraries)
* [2.2. Loading stock data](#stock)

[**3. Basic analysis of stock information**](#racial_diversity)
* [3.1. Closing price](#closing_price)
* [3.2. Trading volume](#trading_volume)
* [3.3. Moving average](#moving_average)
* [3.4. Daily returns](#daily_returns)


[**4. Risk analysis - Value at Risk**](#var)
* [4.1. Historical approach](#historical)
* [4.2. Parametric approach](#parametric)
* [4.3. Monte Carlo approach](#montecarlo)

[**5. Recommendations for future work**](#recommendations)

## 1. Introduction

This project has 3 primary objectives:
1. **Exploratory data analysis**: Conduct a thorough analysis of real-time financial data to identify patterns and trends.
2. **Data visualisation**: Utilise visualisation techniques to present stock information effectively, aiding in the interpretation of market dynamics.
3. **Predictive modelling**: Implement models to forecast stock movements and calculate Value at Risk (VaR).

### Scope
This project focuses on examining 4 prominent technology stocks i.e. Apple Inc. (AAPL), Amazon.com Inc. (AMZN), Alphabet Inc. (GOOG), and Microsoft Corporation (MSFT). The analysis spans a one-year period, from 28th January, 2023, to 28th January, 2024, covering 252 trading days.

### Inquiry questions

* How has the stock price changed over time?
* What is the average daily return of the stock?
* What is the moving average of the selected stocks?
* What is the correlation between closing prices of different stocks?
* What is the correlation between daily returns of different stocks?
* How much value is at risk by investing in a particular stock?

### Project outline

The project begins with importing libraries and loading stock data for the aforementioned companies (Section 2). Section 3 then conducts some basic analysis of the stock performance, covering closing prices, trading volumes, moving averages, and daily returns. Section 4, 'Risk analysis - Value at Risk', explores historical, parametric, and Monte Carlo approaches to computing a stock's Value at Risk. The project concludes with some potential areas for improvement in future projects.

| # | Field | Description |
|---|---|---|
| 1 | Loan_ID | A unique ID |
| 2 | Gender | Gender of the applicant: Male/Female |
| 3 | Married | Marital status of the applicant: Yes/No |
| 4 | Dependents | Whether the applicant has any dependents |
| 5 | Education | Whether the applicant is a graduate or not |
| 6 | Self_Employed | Whether the applicant is self-employed: Yes/No |
| 7 | ApplicantIncome | Applicant income |
| 8 | CoapplicantIncome | Co-applicant income |
| 9 | LoanAmount | Loan amount (in thousands) |
| 10 | Loan_Amount_Term | Term of the loan (in months) |
| 11 | Credit_History | Credit history of the individual’s repayment of their debts |
| 12 | Property_Area | Area of the property: Rural/Urban/Semi-urban |
| 13 | Loan_Status | Whether the loan was approved: Y (Yes) / N (No) |

In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import display
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder 

pd.set_option('mode.chained_assignment', None)
sns.set_style('white')
%matplotlib inline

In [2]:
data = pd.read_csv('LoanApprovalPrediction.csv')
display(data)

In [None]:
data.info()

Total number of applicants: 598

Total number of fields: 13

In [None]:
data.describe()

In [None]:
data.isna().sum()

### Data imputation

After counting the blank fields in the dataset, we must replace them with values derived by statistical methods such as the mean, median, or mode, for both the numerical and categorical attributes, and then check for null values again to make sure that no blank fields remain. We can also replace irrelevant or noisy data with more precise values so that it does not affect training or prediction.
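As a minimal sketch of the idea above, the following uses a small made-up table (the `toy` DataFrame and its values are hypothetical, not taken from the loan dataset) to show mean imputation for a numerical column and mode imputation for a categorical one, followed by the null check.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the loan dataset
toy = pd.DataFrame({
    'LoanAmount': [100.0, np.nan, 140.0, 120.0],   # numerical column
    'Gender': ['Male', None, 'Female', 'Male'],    # categorical column
})

# Numerical column: replace blanks with the column mean
toy['LoanAmount'] = toy['LoanAmount'].fillna(toy['LoanAmount'].mean())

# Categorical column: replace blanks with the most frequent value (mode)
toy['Gender'] = toy['Gender'].fillna(toy['Gender'].mode().iloc[0])

# Confirm that no blank fields remain
missing_after = toy.isna().sum()
print(toy)
print(missing_after)
```

The mean is only meaningful for numerical columns, so the mode is the usual fallback for categorical ones.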

In [None]:
data['Dependents'] = data['Dependents'].fillna(data['Dependents'].mean())
data['LoanAmount'] = data['LoanAmount'].fillna(data['LoanAmount'].mean())
data['Loan_Amount_Term'] = data['Loan_Amount_Term'].fillna(data['Loan_Amount_Term'].mean())
data['Credit_History'] = data['Credit_History'].fillna(data['Credit_History'].mean())

data.isna().sum()

As Loan_ID is completely unique and not correlated with any of the other columns, we will drop it using the .drop() function.

In [None]:
data.drop(['Loan_ID'],axis = 1,inplace = True)
data

As there are no missing values left, we can proceed to model training.

### Splitting dataset

We must divide the data into independent and dependent variables: the predictor attributes go into one group, and the final Loan_Status attribute goes into another, as it depends on the other attributes of the dataset.
* x = predictor variables
* y = response variable - loan status

After splitting the variables into two groups, we must transform all the categorical variables into a machine-readable format by converting them into numeric codes. Here we use LabelEncoder and its fit_transform() method; OneHotEncoder is an alternative that creates dummy variables instead.
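To make the encoding step concrete, here is a small sketch on a made-up column (the `toy` DataFrame is hypothetical). It shows how `LabelEncoder.fit_transform` assigns integer codes and how the fitted `classes_` attribute can be used to read the mapping back.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column using the Property_Area categories
toy = pd.DataFrame({'Property_Area': ['Urban', 'Rural', 'Semi-urban', 'Urban']})

encoder = LabelEncoder()
toy['Property_Area_code'] = encoder.fit_transform(toy['Property_Area'])

# classes_ lists the categories in the order of their integer codes
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(toy)
print(mapping)
```

Note that the integer codes impose an ordering on the categories, which a distance-based model such as k-NN will treat as meaningful; that is one reason to consider dummy variables for columns with more than two categories.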

In [None]:
categorical = ['Gender', 'Married', 'Education', 'Self_Employed', 'Property_Area', 'Loan_Status']

# Initialising LabelEncoder
label_encoder = LabelEncoder()

# Predictor variables
for col in categorical:
    data[col] = label_encoder.fit_transform(data[col])

display(data.head())


In [None]:
x, y = data.drop(columns = 'Loan_Status'), data['Loan_Status']
x

After converting all the categorical data into numeric codes, we split the data into training and test sets using train_test_split from scikit-learn. Because k-NN relies on distances, the predictors should then be put on a common scale, for example with StandardScaler.

In [None]:
train, test = train_test_split(
    data,
    test_size = 0.4, 
    random_state = 404)

display(train.head(2))
display(test.head(2))
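The scaling step is not shown in the cells here, so the following is only a sketch of how it might look, assuming a made-up table (`toy`, `features`, `toy_train` and `toy_test` are hypothetical names). The scaler is fitted on the training rows only and then reused for the test rows, so that no information from the test set leaks into training.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the encoded loan table
toy = pd.DataFrame({
    'ApplicantIncome':   [2500.0, 4000.0, 6000.0, 3200.0, 5100.0,
                          8000.0, 1500.0, 4700.0, 3900.0, 7200.0],
    'CoapplicantIncome': [0.0, 1500.0, 0.0, 2000.0, 0.0,
                          3000.0, 1200.0, 0.0, 1800.0, 0.0],
    'Loan_Status':       [0, 1, 1, 0, 1, 1, 0, 1, 0, 1],
})
features = ['ApplicantIncome', 'CoapplicantIncome']

toy_train, toy_test = train_test_split(toy, test_size=0.4, random_state=404)
toy_train, toy_test = toy_train.copy(), toy_test.copy()

# Fit the scaler on the training rows only, then reuse it for the test rows
scaler = StandardScaler()
toy_train[features] = scaler.fit_transform(toy_train[features])
toy_test[features] = scaler.transform(toy_test[features])

print(toy_train)
print(toy_test)
```

After scaling, each training feature has mean 0 and unit standard deviation, so neither income column dominates the Euclidean distance purely because of its units.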

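### k-NN

To classify a new point (Alice), k-NN proceeds in three steps:

1. Find the distance between the new point and each point in the training sample.
2. Sort the data table by these distances.
3. Select the top k rows and take the most common class among them.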
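The steps above can be traced by hand on a tiny example. The sketch below uses a made-up training sample (`sample`, `alice` and `k` are hypothetical values chosen for illustration) and finishes with a majority vote among the selected rows.

```python
import numpy as np
import pandas as pd

# Hypothetical training sample with two attributes and a class label
sample = pd.DataFrame({
    'x1':    [1.0, 2.0, 6.0, 7.0, 8.0],
    'x2':    [1.0, 2.0, 6.0, 7.0, 9.0],
    'label': [0, 0, 1, 1, 1],
})
alice = np.array([6.5, 6.5])   # the new point to classify
k = 3

# Step 1: Euclidean distance from Alice to every training point
coords = sample[['x1', 'x2']].to_numpy()
distances = np.sqrt(((coords - alice) ** 2).sum(axis=1))

# Step 2: sort the table by distance
ranked = sample.assign(distance=distances).sort_values('distance')

# Step 3: keep the top k rows and take a majority vote on their labels
nearest = ranked.head(k)
prediction = nearest['label'].mode().iloc[0]

print(nearest)
print('Predicted class:', prediction)
```

Here the three closest rows all carry label 1, so Alice is assigned to class 1.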

In [None]:
# 1. Compute the Euclidean distance between any 2 points
def dist_pt_pt1(point, point1):
    '''
    Input: point & point1 (each is an array consisting of the coordinates of the point)
    Output: Euclidean distance between point and point1
    '''
    return np.sqrt(np.sum((point - point1)**2))

# 2. Compute the distance between a point and every point in the training set
def dist_pt_other(point, train_dataset):
    '''
    Input: point (predictor values only) & train_dataset (predictors plus 'Loan_Status')
    Output: train_dataset with an added 'distance' column
    '''
    # Use only the predictors as coordinates: drop the label and any 'distance'
    # column left over from a previous call
    predictor_var = train_dataset.drop(columns = ['Loan_Status', 'distance'], errors = 'ignore')
    distance_column = predictor_var.apply(lambda row: dist_pt_pt1(point, row), axis = 1)
    train_dataset['distance'] = distance_column
    return train_dataset

# 3. Pick out the k nearest neighbours and identify the classification of the point
def knn(point, k, train_dataset):
    '''
    Input: point (predictor values only), k (number of neighbours) & train_dataset
    Output: most common 'Loan_Status' among the k nearest neighbours
    '''
    train_dataset = dist_pt_other(point, train_dataset)
    train_dataset = train_dataset.sort_values(by = 'distance', ascending = True)
    nearest = train_dataset.head(k)
    classification = nearest['Loan_Status'].mode()
    return classification.iloc[0]

In [None]:
attributes = ['CoapplicantIncome', 'ApplicantIncome']
train = data[['Loan_Status'] + attributes]

# Pass only the predictor values of the query point, not its Loan_Status
knn_result = knn(train[attributes].iloc[4], 1, train)

knn_result

In [None]:
train.iloc[4]

In [None]:
train

## Evaluating the accuracy of the knn model


Mistake to avoid: when running the model on the training set, the Loan_Status column must be removed from the query points. Otherwise each point has more than 2 attributes, the distances become very large, and the predictions are wrong.

In [None]:
knn_prediction = train.drop(columns = ['Loan_Status', 'distance'], errors = 'ignore')
knn_prediction

In [None]:
knn_prediction = knn_prediction.apply(lambda row: knn(row, 1, train), axis = 1)

In [None]:
knn_prediction

runtime 3m 58.9s
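Much of that runtime is probably spent in the nested row-wise `apply` calls, which compute every distance one pair at a time in Python. As a hedged alternative, the sketch below computes all pairwise distances at once with NumPy broadcasting. The function name `knn_predict_all` and the toy arrays are hypothetical, and the function assumes the class labels are non-negative integers (as they are after label encoding).

```python
import numpy as np

def knn_predict_all(train_x, train_y, query_x, k):
    '''
    train_x: (n, d) array of training coordinates
    train_y: (n,) array of non-negative integer class labels
    query_x: (m, d) array of points to classify
    Returns an (m,) array of predicted labels.
    '''
    # Pairwise squared distances via broadcasting: shape (m, n).
    # Squaring preserves the ordering, so the square root is not needed.
    diff = query_x[:, None, :] - train_x[None, :, :]
    sq_dist = (diff ** 2).sum(axis=2)
    # Indices of the k closest training points for each query point
    nearest = np.argsort(sq_dist, axis=1)[:, :k]
    # Majority vote among the neighbours' labels (ties go to the smaller label)
    votes = train_y[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])

# Hypothetical toy data
toy_train_x = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
toy_train_y = np.array([0, 0, 1, 1])
toy_query_x = np.array([[0.2, 0.1], [5.5, 5.2]])

predictions = knn_predict_all(toy_train_x, toy_train_y, toy_query_x, k=3)
print(predictions)
```

The intermediate `diff` array has shape (m, n, d), which is small for a few hundred rows and two attributes but can become a memory concern for much larger tables.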

In [None]:
knn_result = train.copy()
knn_result['prediction'] = knn_prediction
knn_result['correct'] = (knn_result['Loan_Status'] == knn_result['prediction'])

display(knn_result)
knn_result.value_counts('correct')


In [None]:
knn_result[~knn_result['correct']]
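One caveat about this evaluation: with k = 1 and query points taken from the training set itself, each point is its own nearest neighbour at distance 0, so the accuracy is bound to look very high. The sketch below illustrates this on synthetic data (all names and values are hypothetical) by comparing the accuracy on the training rows with the accuracy on rows that were held out, mirroring the test_size = 0.4 split used earlier.

```python
import numpy as np

def predict_1nn(train_x, train_y, query_x):
    # Label of the single closest training point for each query point
    diff = query_x[:, None, :] - train_x[None, :, :]
    nearest = (diff ** 2).sum(axis=2).argmin(axis=1)
    return train_y[nearest]

# Synthetic, hypothetical data: two overlapping clusters
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Hold out 40% of the rows
order = rng.permutation(len(x))
train_idx, test_idx = order[:60], order[60:]

train_pred = predict_1nn(x[train_idx], y[train_idx], x[train_idx])
test_pred = predict_1nn(x[train_idx], y[train_idx], x[test_idx])

train_acc = (train_pred == y[train_idx]).mean()
test_acc = (test_pred == y[test_idx]).mean()
print(f'accuracy on training rows: {train_acc:.2f}')
print(f'accuracy on held-out rows: {test_acc:.2f}')
```

The held-out figure is the more honest estimate of how the model would perform on new applicants.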
