# Loan approval prediction using KNN model (from scratch)

https://www.computersciencejournals.com/ijecs/article/view/30/1-2-18

data: https://drive.google.com/file/d/1LIvIdqdHDFEGnfzIgEh4L6GFirzsE3US/view

https://www.geeksforgeeks.org/loan-approval-prediction-using-machine-learning/

for r: https://www.datacamp.com/tutorial/k-nearest-neighbors-knn-classification-with-r-tutorial

## Contents

[**1. Introduction**](#introduction)

[**2. Loading data**](#loading_data)
* [2.1. Importing libraries](#libraries)
* [2.2. Loading stock data](#stock)

[**3. Basic analysis of stock information**](#racial_diversity)
* [3.1. Closing price](#closing_price)
* [3.2. Trading volume](#trading_volume)
* [3.3. Moving average](#moving_average)
* [3.4. Daily returns](#daily_returns)


[**4. Risk analysis - Value at Risk**](#var)
* [4.1. Historical approach](#historical)
* [4.2. Parametric approach](#parametric)
* [4.3. Monte Carlo approach](#montecarlo)

[**5. Recommendations for future work**](#recommendations)

## 1. Introduction

This project has 3 primary objectives:
1. **Exploratory data analysis**: Conduct a thorough analysis of real-time financial data to identify patterns and trends.
2. **Data visualisation**: Utilise visualisation techniques to present stock information effectively, aiding in the interpretation of market dynamics.
3. **Predictive modelling**: Implement models to forecast stock movements and calculate Value at Risk (VaR).

### Scope
This project focuses on examining 4 prominent technology stocks i.e. Apple Inc. (AAPL), Amazon.com Inc. (AMZN), Alphabet Inc. (GOOG), and Microsoft Corporation (MSFT). The analysis spans a one-year period, from 28th January, 2023, to 28th January, 2024, covering 252 trading days.

### Inquiry questions

* How has the stock price changed over time?
* What is the average daily return of the stock?
* What is the moving average of the selected stocks?
* What is the correlation between closing prices of different stocks?
* What is the correlation between daily returns of different stocks?
* How much value is at risk by investing in a particular stock?

### Project outline

The project begins with importing libraries and loading stock data for the aforementioned companies (Section 2). Section 3 then conducts a basic analysis of stock performance, covering closing prices, trading volumes, moving averages, and daily returns. Section 4 - 'Risk Analysis - Value at Risk' - explores historical, parametric, and Monte Carlo approaches to computing a stock's Value at Risk. The project concludes with some potential areas for improvement in future projects.

In [52]:
# Import necessary libraries
import numpy as np
import seaborn as sns
from IPython.display import display
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

# Set style for seaborn plots
sns.set_style('dark')
sns.color_palette("viridis", as_cmap=True)
sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)
pd.set_option('display.float_format', '{:.2f}'.format)

from sklearn.preprocessing import LabelEncoder

| # | Column | Description |
|---|---|---|
| 1 | Loan_ID | Unique ID of the loan |
| 2 | Gender | Gender of the applicant (Male/Female) |
| 3 | Married | Marital status of the applicant (Yes/No) |
| 4 | Dependents | Number of dependents of the applicant (0 to 3 in this data) |
| 5 | Education | Whether the applicant is a graduate (Graduate/Not Graduate) |
| 6 | Self_Employed | Whether the applicant is self-employed (Yes/No) |
| 7 | ApplicantIncome | Applicant income |
| 8 | CoapplicantIncome | Co-applicant income |
| 9 | LoanAmount | Loan amount (in thousands) |
| 10 | Loan_Amount_Term | Term of the loan (in months) |
| 11 | Credit_History | Credit history of the individual’s repayment of their debts |
| 12 | Property_Area | Area of the property (Rural/Urban/Semi-urban) |
| 13 | Loan_Status | Whether the loan was approved (Y = Yes, N = No) |

In [159]:
data = pd.read_csv('LoanApprovalPrediction.csv')
display(data)

duplicates = data['Loan_ID'].duplicated()
duplicates.sum()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0.00,Graduate,No,5849,0.00,,360.00,1.00,Urban,Y
1,LP001003,Male,Yes,1.00,Graduate,No,4583,1508.00,128.00,360.00,1.00,Rural,N
2,LP001005,Male,Yes,0.00,Graduate,Yes,3000,0.00,66.00,360.00,1.00,Urban,Y
3,LP001006,Male,Yes,0.00,Not Graduate,No,2583,2358.00,120.00,360.00,1.00,Urban,Y
4,LP001008,Male,No,0.00,Graduate,No,6000,0.00,141.00,360.00,1.00,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...
593,LP002978,Female,No,0.00,Graduate,No,2900,0.00,71.00,360.00,1.00,Rural,Y
594,LP002979,Male,Yes,3.00,Graduate,No,4106,0.00,40.00,180.00,1.00,Rural,Y
595,LP002983,Male,Yes,1.00,Graduate,No,8072,240.00,253.00,360.00,1.00,Urban,Y
596,LP002984,Male,Yes,2.00,Graduate,No,7583,0.00,187.00,360.00,1.00,Urban,Y


0

In [54]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 598 entries, 0 to 597
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            598 non-null    object 
 1   Gender             598 non-null    object 
 2   Married            598 non-null    object 
 3   Dependents         586 non-null    float64
 4   Education          598 non-null    object 
 5   Self_Employed      598 non-null    object 
 6   ApplicantIncome    598 non-null    int64  
 7   CoapplicantIncome  598 non-null    float64
 8   LoanAmount         577 non-null    float64
 9   Loan_Amount_Term   584 non-null    float64
 10  Credit_History     549 non-null    float64
 11  Property_Area      598 non-null    object 
 12  Loan_Status        598 non-null    object 
dtypes: float64(5), int64(1), object(7)
memory usage: 60.9+ KB


In [55]:
data.describe()

Unnamed: 0,Dependents,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History
count,586.0,598.0,598.0,577.0,584.0,549.0
mean,0.76,5292.25,1631.5,144.97,341.92,0.84
std,1.01,5807.27,2953.32,82.7,65.21,0.36
min,0.0,150.0,0.0,9.0,12.0,0.0
25%,0.0,2877.5,0.0,100.0,360.0,1.0
50%,0.0,3806.0,1211.5,127.0,360.0,1.0
75%,1.75,5746.0,2324.0,167.0,360.0,1.0
max,3.0,81000.0,41667.0,650.0,480.0,1.0


In [56]:
data.isna().sum()

Loan_ID               0
Gender                0
Married               0
Dependents           12
Education             0
Self_Employed         0
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           21
Loan_Amount_Term     14
Credit_History       49
Property_Area         0
Loan_Status           0
dtype: int64

### Data imputation

After counting the blank fields in each column, we replace them with values derived from a statistical method such as the mean, median, or mode, and then check for null values again to make sure that no blank fields remain in the dataset. The cell below uses the column mean for all four affected columns. Irrelevant or noisy values can be replaced in the same way so that they do not distort training or prediction.

In [57]:
data['Dependents'] = data['Dependents'].fillna(data['Dependents'].mean())
data['LoanAmount'] = data['LoanAmount'].fillna(data['LoanAmount'].mean())
data['Loan_Amount_Term'] = data['Loan_Amount_Term'].fillna(data['Loan_Amount_Term'].mean())
data['Credit_History'] = data['Credit_History'].fillna(data['Credit_History'].mean())

data.isna().sum()

Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64
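
The cell above fills every gap with the column mean, which turns discrete columns such as `Credit_History` into fractional values (0.84 appears in later outputs). As a hedged alternative, the sketch below uses the median for a continuous column and the mode for discrete ones. The `toy` DataFrame and its values are made up for illustration; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration), reusing three column names from the dataset
toy = pd.DataFrame({
    'Dependents':     [0.0, 1.0, np.nan, 0.0, 3.0],
    'LoanAmount':     [128.0, np.nan, 66.0, 120.0, 141.0],
    'Credit_History': [1.0, 1.0, np.nan, 0.0, 1.0],
})

# Median for the continuous column
toy['LoanAmount'] = toy['LoanAmount'].fillna(toy['LoanAmount'].median())

# Mode (most frequent value) for the discrete columns
for col in ['Dependents', 'Credit_History']:
    toy[col] = toy[col].fillna(toy[col].mode().iloc[0])

print(toy)
print(toy.isna().sum())
```

Whether the mean, median, or mode is the better choice depends on the column; this is only one reasonable option, not a correction of the results recorded in this notebook.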

Because Loan_ID is unique for every row and carries no information about the other columns, we will drop it using the .drop() function.

Now that no missing values remain, we can proceed to model training.

### Splitting dataset

We must divide the data into independent and dependent variables: the predictor attributes (every column except Loan_Status) form one group, and Loan_Status forms the other, since it depends on the other attributes of the dataset.
* x = predictor variables
* y = response variable - loan status

We then transform the categorical variables into a numeric, machine-readable format. Here we use scikit-learn's LabelEncoder and its fit_transform method, which replaces each category with an integer code.

In [58]:
categorical = ['Gender', 'Married', 'Education', 'Self_Employed', 'Property_Area', 'Loan_Status']

# Initialising LabelEncoder
label_encoder = LabelEncoder()

# Encode each categorical column (predictors and the Loan_Status target) as integer codes
for col in categorical:
    data[col] = label_encoder.fit_transform(data[col])

display(data.head())


Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,1,0,0.0,0,0,5849,0.0,144.97,360.0,1.0,2,1
1,LP001003,1,1,1.0,0,0,4583,1508.0,128.0,360.0,1.0,0,0
2,LP001005,1,1,0.0,0,1,3000,0.0,66.0,360.0,1.0,2,1
3,LP001006,1,1,0.0,1,0,2583,2358.0,120.0,360.0,1.0,2,1
4,LP001008,1,0,0.0,0,0,6000,0.0,141.0,360.0,1.0,2,1
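
Label encoding gives `Property_Area` the codes 0, 1 and 2, so a Euclidean distance treats two of the areas as twice as far apart as the others, even though the categories have no natural order. One common alternative is one-hot (dummy) encoding. The sketch below uses `pd.get_dummies` on made-up rows; the `toy` DataFrame is hypothetical and only the column names and category labels come from the data dictionary.

```python
import pandas as pd

# Toy rows (made up), using the Property_Area categories from the data dictionary
toy = pd.DataFrame({
    'Property_Area': ['Urban', 'Rural', 'Semi-urban', 'Urban'],
    'ApplicantIncome': [5849, 4583, 3000, 2583],
})

# One 0/1 indicator column per category instead of a single integer code
encoded = pd.get_dummies(toy, columns=['Property_Area'], dtype=int)
print(encoded)
```

Binary columns such as `Gender` or `Married` are unaffected by this issue, because a two-category label encoding is already a single 0/1 indicator.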


After encoding the categorical columns, we split the data into training and test sets with train_test_split from scikit-learn, holding out 40% of the rows for testing, and drop Loan_ID from both. The features should then be standardised, for example with StandardScaler, so that large-valued columns such as ApplicantIncome do not dominate the distance calculation; the cells below do not yet apply this step.

In [130]:
train, test = train_test_split(
    data,
    test_size = 0.4, 
    random_state = 404)

train = train.drop(columns = 'Loan_ID')
test = test.drop(columns = 'Loan_ID')

display(train)
display(test)

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
286,1,1,2.00,0,0,3153,1560.00,134.00,360.00,1.00,2,1
482,1,1,0.00,0,0,3597,2157.00,119.00,360.00,0.00,0,0
361,1,1,0.00,0,0,19730,5266.00,570.00,360.00,1.00,0,0
0,1,0,0.00,0,0,5849,0.00,144.97,360.00,1.00,2,1
542,1,1,1.00,0,0,5468,1032.00,26.00,360.00,1.00,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...
191,1,0,0.00,0,0,8333,3750.00,187.00,360.00,1.00,0,1
507,1,1,2.00,1,0,2192,1742.00,45.00,360.00,1.00,1,1
71,1,0,0.00,0,0,3500,0.00,81.00,300.00,1.00,1,1
317,1,1,3.00,0,0,15000,0.00,300.00,360.00,1.00,0,1


Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
148,1,1,0.00,0,1,2577,3750.00,152.00,360.00,1.00,0,1
386,1,1,2.00,0,0,3100,1400.00,113.00,360.00,1.00,2,1
156,1,1,0.00,0,0,4583,5625.00,255.00,360.00,1.00,1,1
585,1,1,0.00,1,1,2894,2792.00,155.00,360.00,1.00,0,1
355,1,1,0.00,0,0,3013,3033.00,95.00,300.00,0.84,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...
329,1,1,2.00,0,1,2500,4600.00,176.00,360.00,1.00,0,1
336,1,1,2.00,0,1,2583,2330.00,125.00,360.00,1.00,0,1
97,1,1,0.00,1,0,4188,0.00,115.00,180.00,1.00,1,1
157,1,1,0.00,1,0,1863,1041.00,98.00,360.00,1.00,1,1
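
The split above leaves the features on their original scales. A minimal sketch of standardisation follows, assuming the usual convention of computing the mean and standard deviation on the training rows only and reusing them for the test rows. It is written with plain pandas; `ddof=0` is the population standard deviation, which is also what scikit-learn's `StandardScaler` uses. The `toy_train` and `toy_test` frames are made up for illustration.

```python
import pandas as pd

# Toy train/test rows (made up): two features on very different scales
toy_train = pd.DataFrame({
    'ApplicantIncome': [3000, 5000, 7000],
    'Credit_History': [1.0, 0.0, 1.0],
})
toy_test = pd.DataFrame({
    'ApplicantIncome': [5000],
    'Credit_History': [1.0],
})

# Statistics come from the training rows only
mean = toy_train.mean()
std = toy_train.std(ddof=0)  # population standard deviation

# Apply the same shift and scale to both sets
train_scaled = (toy_train - mean) / std
test_scaled = (toy_test - mean) / std

print(train_scaled)
print(test_scaled)
```

After scaling, each training column has mean 0 and unit variance, so income no longer outweighs the 0/1 columns in a Euclidean distance.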



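### k-NN

For a new point (Alice), classification proceeds as follows:

1. Find the distance between the new point and each point in the training sample.
2. Sort the data table by these distances.
3. Select the top k rows.
4. Take the majority class among those k rows as the prediction.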
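
The k-NN procedure (distances, sort, top k, then a majority vote) can be sketched in a few lines of NumPy. This is an illustration on made-up two-feature data, not the implementation used later in the notebook; `train_points`, `train_labels` and `new_point` are hypothetical names.

```python
import numpy as np

# Toy training data (made up): two features per row, with a 0/1 label
train_points = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5], [5.5, 6.0]])
train_labels = np.array([0, 0, 1, 1, 1])
new_point = np.array([5.2, 5.1])
k = 3

# 1. Euclidean distance from the new point to every training point
distances = np.sqrt(((train_points - new_point) ** 2).sum(axis=1))

# 2. Order the training rows by distance (closest first)
order = np.argsort(distances)

# 3. Keep the labels of the k nearest rows
nearest_labels = train_labels[order[:k]]

# 4. Majority vote among those labels
prediction = np.bincount(nearest_labels).argmax()
print(prediction)
```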

In [169]:
train

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status,distance
286,1,1,2.00,0,0,3153,1560.00,134.00,360.00,1.00,2,1,6132.48
482,1,1,0.00,0,0,3597,2157.00,119.00,360.00,0.00,0,0,5895.82
361,1,1,0.00,0,0,19730,5266.00,570.00,360.00,1.00,0,0,11883.03
0,1,0,0.00,0,0,5849,0.00,144.97,360.00,1.00,2,1,3235.07
542,1,1,1.00,0,0,5468,1032.00,26.00,360.00,1.00,1,1,3764.84
...,...,...,...,...,...,...,...,...,...,...,...,...,...
191,1,0,0.00,0,0,8333,3750.00,187.00,360.00,1.00,0,1,3824.48
507,1,1,2.00,1,0,2192,1742.00,45.00,360.00,1.00,1,1,7110.13
71,1,0,0.00,0,0,3500,0.00,81.00,300.00,1.00,1,1,5585.26
317,1,1,3.00,0,0,15000,0.00,300.00,360.00,1.00,0,1,5917.44


In [173]:
knn.score(features_train, target_train)


0.729050279329609

In [175]:
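# NOTE: these helpers use the `datascience` Table API (drop, with_column, sort, take, where),
# so `training` must be a datascience Table, not a pandas DataFrame.
from datascience import are

def distance(point1, point2):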
    """Returns the distance between point1 and point2
    where each argument is an array 
    consisting of the coordinates of the point"""
    return np.sqrt(np.sum((point1 - point2)**2))

def all_distances(training, new_point):
    """Returns an array of distances
    between each point in the training set
    and the new point (which is a row of attributes)"""
    attributes = training.drop('Class')
    def distance_from_point(row):
        return distance(np.array(new_point), np.array(row))
    return attributes.apply(distance_from_point)

def table_with_distances(training, new_point):
    """Augments the training table 
    with a column of distances from new_point"""
    return training.with_column('Distance', all_distances(training, new_point))

def closest(training, new_point, k):
    """Returns a table of the k rows of the augmented table
    corresponding to the k smallest distances"""
    with_dists = table_with_distances(training, new_point)
    sorted_by_distance = with_dists.sort('Distance')
    topk = sorted_by_distance.take(np.arange(k))
    return topk

def majority(topkclasses):
    ones = topkclasses.where('Class', are.equal_to(1)).num_rows
    zeros = topkclasses.where('Class', are.equal_to(0)).num_rows
    if ones > zeros:
        return 1
    else:
        return 0

def classify(training, new_point, k):
    closestk = closest(training, new_point, k)
    topkclasses = closestk.select('Class')
    return majority(topkclasses)

In [181]:
test_row = train.

AttributeError: 'DataFrame' object has no attribute 'row'

In [140]:
# 1. Compute the distance between any 2 points
def dist_pt_pt1(point, point1):
    '''
    Input: point & point1 (each holds the coordinates of one point)
    Output: Euclidean distance between point and point1
    '''
    return np.sqrt(np.sum((point - point1)**2))

# 2. Compute the distance between a point and every point in the training set
def dist_pt_other(point, train_dataset):
    '''
    Input: point (a row of attributes) & train_dataset (DataFrame of training rows)
    Output: train_dataset with a 'distance' column added (or overwritten) in place
    '''
    # IMPORTANT: drop Loan_Status (and any distance column left by an earlier call)
    # before calculating the distance
    predictor_var = train_dataset.drop(columns = ['Loan_Status', 'distance'], errors = 'ignore')
    point = point[predictor_var.columns]  # compare on the predictor columns only
    distance_column = predictor_var.apply(lambda row: dist_pt_pt1(point, row), axis = 1)
    train_dataset['distance'] = distance_column
    return train_dataset

# 3. Pick out the k nearest neighbours and identify the classification of the test point
def knn(point, k, train_dataset):
    '''
    Input: point (a row of attributes), k (number of neighbours) & train_dataset
    Output: most common Loan_Status among the k nearest training rows
    '''
    train_dataset = dist_pt_other(point, train_dataset)
    train_dataset = train_dataset.sort_values(by = 'distance', ascending = True)
    nearest = train_dataset.head(k)
    classification = nearest['Loan_Status'].mode()
    return classification.iloc[0]

In [148]:
fast_distances(test.iloc[0], train)

TypeError: unsupported operand type(s) for -: 'float' and 'str'

In [151]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 358 entries, 286 to 182
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Gender             358 non-null    int32  
 1   Married            358 non-null    int32  
 2   Dependents         358 non-null    float64
 3   Education          358 non-null    int32  
 4   Self_Employed      358 non-null    int32  
 5   ApplicantIncome    358 non-null    int64  
 6   CoapplicantIncome  358 non-null    float64
 7   LoanAmount         358 non-null    float64
 8   Loan_Amount_Term   358 non-null    float64
 9   Credit_History     358 non-null    float64
 10  Property_Area      358 non-null    int32  
 11  Loan_Status        358 non-null    int32  
 12  distance           358 non-null    float64
dtypes: float64(6), int32(6), int64(1)
memory usage: 30.8 KB


In [134]:
knn_result = knn(test.iloc[0], 3, train)

knn_result

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status,distance
19,1,1,0.0,0,1,2600,3500.0,115.0,341.92,1.0,2,1,254.42
589,1,1,0.0,1,0,2400,3800.0,144.97,180.0,1.0,2,0,257.46
47,0,1,0.0,0,0,2645,3440.0,120.0,360.0,0.0,2,0,318.99


0

In [135]:
test.iloc[0]

Gender                 1.00
Married                1.00
Dependents             0.00
Education              0.00
Self_Employed          1.00
ApplicantIncome     2577.00
CoapplicantIncome   3750.00
LoanAmount           152.00
Loan_Amount_Term     360.00
Credit_History         1.00
Property_Area          0.00
Loan_Status            1.00
Name: 148, dtype: float64

In [136]:
train

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status,distance
286,1,1,2.00,0,0,3153,1560.00,134.00,360.00,1.00,2,1,2264.55
482,1,1,0.00,0,0,3597,2157.00,119.00,360.00,0.00,0,0,1891.86
361,1,1,0.00,0,0,19730,5266.00,570.00,360.00,1.00,0,0,17224.94
0,1,0,0.00,0,0,5849,0.00,144.97,360.00,1.00,2,1,4976.80
542,1,1,1.00,0,0,5468,1032.00,26.00,360.00,1.00,1,1,3970.05
...,...,...,...,...,...,...,...,...,...,...,...,...,...
191,1,0,0.00,0,0,8333,3750.00,187.00,360.00,1.00,0,1,5756.11
507,1,1,2.00,1,0,2192,1742.00,45.00,360.00,1.00,1,1,2047.38
71,1,0,0.00,0,0,3500,0.00,81.00,300.00,1.00,1,1,3863.04
317,1,1,3.00,0,0,15000,0.00,300.00,360.00,1.00,0,1,12977.49


## Evaluating the accuracy of the knn model


Mistake to avoid: when running the model on the training set, the Loan_Status column MUST be removed before computing distances. Otherwise the label is treated as an extra attribute, the distances become very large, and the result is wrong.

In [128]:
knn_prediction = train.drop(columns = 'distance')
knn_prediction

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
550,1,0,0.00,1,1,5800,0.00,132.00,360.00,1.00,1,1
379,1,1,0.00,1,0,3010,3136.00,144.97,360.00,0.00,2,0
116,1,1,0.00,0,0,5568,2142.00,175.00,360.00,1.00,0,0
390,1,0,0.00,1,0,3902,1666.00,109.00,360.00,1.00,0,1
515,1,0,1.00,1,0,2679,1302.00,94.00,360.00,1.00,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...
191,1,0,0.00,0,0,8333,3750.00,187.00,360.00,1.00,0,1
507,1,1,2.00,1,0,2192,1742.00,45.00,360.00,1.00,1,1
71,1,0,0.00,0,0,3500,0.00,81.00,300.00,1.00,1,1
317,1,1,3.00,0,0,15000,0.00,300.00,360.00,1.00,0,1


In [129]:
knn_prediction = knn_prediction.apply(lambda row: knn(row, 1, train), axis = 1)

KeyboardInterrupt: 
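
The `KeyboardInterrupt` shows the cell was stopped before it finished, possibly because calling `knn` row by row recomputes and re-sorts every distance with a Python-level `apply`. As a hedged alternative, the sketch below computes all pairwise distances at once with NumPy broadcasting. `knn_predict_all` is a hypothetical helper, not part of the notebook, and the toy frames are made up; it assumes integer labels in a `Loan_Status` column.

```python
import numpy as np
import pandas as pd

def knn_predict_all(points, train_dataset, k, label='Loan_Status'):
    """Hypothetical helper: predict a label for every row of `points` in one pass."""
    # Use only predictor columns (ignore the label and any leftover distance column)
    feature_cols = [c for c in train_dataset.columns if c not in (label, 'distance')]
    X_train = train_dataset[feature_cols].to_numpy(dtype=float)
    y_train = train_dataset[label].to_numpy()
    X_new = points[feature_cols].to_numpy(dtype=float)

    # Pairwise squared distances, shape (n_new, n_train), via broadcasting
    diffs = X_new[:, None, :] - X_train[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)

    # Indices of the k nearest training rows for each new row
    nearest = np.argsort(sq_dists, axis=1, kind='stable')[:, :k]
    nearest_labels = y_train[nearest]

    # Majority vote per row (ties go to the smaller label)
    preds = [np.bincount(row).argmax() for row in nearest_labels]
    return pd.Series(preds, index=points.index)

# Toy data (made up) to show the call
toy_train = pd.DataFrame({'ApplicantIncome': [1000, 1100, 9000, 9500],
                          'LoanAmount': [100, 110, 300, 320],
                          'Loan_Status': [0, 0, 1, 1]})
toy_test = pd.DataFrame({'ApplicantIncome': [1040, 9200],
                         'LoanAmount': [104, 310],
                         'Loan_Status': [0, 1]})
toy_predictions = knn_predict_all(toy_test, toy_train, k=1)
print(toy_predictions.tolist())
```

Squared distances are enough for ranking neighbours, so the square root is skipped. The intermediate array has one entry per (new row, training row, feature) triple, which is small for a dataset of this size but would need batching for much larger ones.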

In [65]:
knn_prediction

0      1
1      0
2      1
3      1
4      1
      ..
593    1
594    1
595    1
596    1
597    0
Length: 598, dtype: int32

In [93]:
knn_result = train.copy()
knn_result['prediction'] = knn_prediction
knn_result['correct'] = (knn_result['Loan_Status'] == knn_result['prediction'])

display(knn_result)
knn_result.value_counts('correct')

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status,distance,prediction,correct
286,1,1,2.00,0,0,3153,1560.00,134.00,360.00,1.00,2,1,1841.86,1,True
482,1,1,0.00,0,0,3597,2157.00,119.00,360.00,0.00,0,0,2583.14,0,True
361,1,1,0.00,0,0,19730,5266.00,570.00,360.00,1.00,0,0,18331.97,0,True
0,1,0,0.00,0,0,5849,0.00,144.97,360.00,1.00,2,1,3672.34,1,True
542,1,1,1.00,0,0,5468,1032.00,26.00,360.00,1.00,1,1,3448.82,1,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
191,1,0,0.00,0,0,8333,3750.00,187.00,360.00,1.00,0,1,7208.66,1,True
507,1,1,2.00,1,0,2192,1742.00,45.00,360.00,1.00,1,1,1743.22,1,True
71,1,0,0.00,0,0,3500,0.00,81.00,300.00,1.00,1,1,1322.09,1,True
317,1,1,3.00,0,0,15000,0.00,300.00,360.00,1.00,0,1,12824.28,1,True


correct
True    358
Name: count, dtype: int64

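100% accuracy on the training set, as expected: with k = 1, each training point is its own nearest neighbour (distance 0), so it is always assigned its own label.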

### Using the test set

In [137]:
test

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
148,1,1,0.00,0,1,2577,3750.00,152.00,360.00,1.00,0,1
386,1,1,2.00,0,0,3100,1400.00,113.00,360.00,1.00,2,1
156,1,1,0.00,0,0,4583,5625.00,255.00,360.00,1.00,1,1
585,1,1,0.00,1,1,2894,2792.00,155.00,360.00,1.00,0,1
355,1,1,0.00,0,0,3013,3033.00,95.00,300.00,0.84,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...
329,1,1,2.00,0,1,2500,4600.00,176.00,360.00,1.00,0,1
336,1,1,2.00,0,1,2583,2330.00,125.00,360.00,1.00,0,1
97,1,1,0.00,1,0,4188,0.00,115.00,180.00,1.00,1,1
157,1,1,0.00,1,0,1863,1041.00,98.00,360.00,1.00,1,1


In [138]:
test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 240 entries, 148 to 342
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Gender             240 non-null    int32  
 1   Married            240 non-null    int32  
 2   Dependents         240 non-null    float64
 3   Education          240 non-null    int32  
 4   Self_Employed      240 non-null    int32  
 5   ApplicantIncome    240 non-null    int64  
 6   CoapplicantIncome  240 non-null    float64
 7   LoanAmount         240 non-null    float64
 8   Loan_Amount_Term   240 non-null    float64
 9   Credit_History     240 non-null    float64
 10  Property_Area      240 non-null    int32  
 11  Loan_Status        240 non-null    int32  
dtypes: float64(5), int32(6), int64(1)
memory usage: 18.8 KB


In [141]:
test_prediction = test.apply(lambda row: knn(row, 1, train), axis = 1)
test_prediction

148    1
386    1
156    1
585    1
355    1
      ..
329    1
336    1
97     1
157    1
342    0
Length: 240, dtype: int32

In [144]:
test_result = test.copy()
test_result['prediction'] = test_prediction
test_result['correct'] = (test_result['Loan_Status'] == test_result['prediction'])

display(test_result)
test_result.value_counts('correct')

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status,prediction,correct
148,1,1,0.00,0,1,2577,3750.00,152.00,360.00,1.00,0,1,1,True
386,1,1,2.00,0,0,3100,1400.00,113.00,360.00,1.00,2,1,1,True
156,1,1,0.00,0,0,4583,5625.00,255.00,360.00,1.00,1,1,1,True
585,1,1,0.00,1,1,2894,2792.00,155.00,360.00,1.00,0,1,1,True
355,1,1,0.00,0,0,3013,3033.00,95.00,300.00,0.84,2,1,1,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
329,1,1,2.00,0,1,2500,4600.00,176.00,360.00,1.00,0,1,1,True
336,1,1,2.00,0,1,2583,2330.00,125.00,360.00,1.00,0,1,1,True
97,1,1,0.00,1,0,4188,0.00,115.00,180.00,1.00,1,1,1,True
157,1,1,0.00,1,0,1863,1041.00,98.00,360.00,1.00,1,1,1,True


correct
True     142
False     98
Name: count, dtype: int64

In [122]:
142/(142+98)

0.5916666666666667
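
The same accuracy can be read directly from a boolean `correct` column, since the mean of a boolean Series is the fraction of `True` values. The sketch below rebuilds a stand-in Series from the counts reported above (142 correct, 98 incorrect) rather than from the actual `test_result` table.

```python
import pandas as pd

# Stand-in for the 'correct' column: 142 True and 98 False, matching the counts above
correct = pd.Series([True] * 142 + [False] * 98)

# The mean of a boolean Series is the share of True values, i.e. the accuracy
accuracy = correct.mean()
print(round(accuracy, 4))
```

In the notebook itself, the equivalent call would be `test_result['correct'].mean()`.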