**INTRODUCTION:**

This project predicts whether a loan application will be approved. It bases the prediction on applicant attributes such as education, income, number of dependents, and several other parameters.

**ACKNOWLEDGEMENT**

The dataset I have chosen is from Kaggle. To see the dataset, [click here](https://www.kaggle.com/datasets/rishikeshkonapure/home-loan-approval).

**Process for model building:**
1. Importing dataset and all the important libraries
2. Understanding the data
3. Cleaning and pre-processing the data
4. Splitting the data into train and test data
5. Compiling and training the model 
6. Evaluating its performance
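
The six steps above can also be sketched end to end on a small scale. The following is only an illustration under stated assumptions: the toy frame and its values are made up, and it swaps in scikit-learn's `Pipeline`, `ColumnTransformer`, `SimpleImputer`, and `OneHotEncoder` instead of the manual filling and label encoding used later in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Steps 1-2: hypothetical toy data standing in for the Kaggle CSV (values are made up)
toy = pd.DataFrame({
    "Education": ["Graduate", "Not Graduate", "Graduate", np.nan] * 5,
    "ApplicantIncome": [5000.0, 3000.0, np.nan, 2500.0] * 5,
    "Loan_Status": ["Y", "N", "Y", "N"] * 5,
})
X = toy[["Education", "ApplicantIncome"]]
y = (toy["Loan_Status"] == "Y").astype(int)

# Step 3: fill missing values (mean / mode), then scale or encode
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", MinMaxScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["ApplicantIncome"]),
    ("cat", categorical, ["Education"]),
])
model = Pipeline([("preprocess", preprocess), ("classify", LogisticRegression())])

# Step 4: split into train and test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2, stratify=y
)

# Steps 5-6: train the model and evaluate it on the held-out rows
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"Toy test accuracy: {test_accuracy:.2f}")
```

Bundling the preprocessing into one pipeline means every fill value and scaling range is learned from the training rows only and then reused unchanged on any other data.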

# 1. Importing dataset and all the important libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


In [None]:
org_data = pd.read_csv('/kaggle/input/home-loan-approval/loan_sanction_train.csv')
print(org_data.head())

# 2. Understanding the data

In [None]:
org_data.info()
org_data.describe()

In [None]:
org_data.hist(bins = 20, figsize = (10,8))
print(org_data['Gender'].value_counts())
print(org_data['Married'].value_counts())
print(org_data['Dependents'].value_counts())
print(org_data['Self_Employed'].value_counts())
print(org_data['Property_Area'].value_counts())
print(org_data['Loan_Status'].value_counts())

In [None]:
# Scatter each of columns 6 to 10 (ApplicantIncome ... Credit_History) against Loan_Status
for col in org_data.columns[6:11]:
    plt.scatter(org_data[col], org_data['Loan_Status'])
    plt.xlabel(col)
    plt.ylabel('Loan_Status')
    plt.show()

**Findings:**
* There are 614 rows and 13 columns including the LoanID
* There are a few outliers and missing values
* More than 75% of the applicants are male and more than 70% are married
* More than half of the applicants do not have any dependents
* Interestingly, more than 83% of applicants are self-employed, which possibly indicates that the bank receives a lot of requests for business loans
* No clear trend is apparent across property areas, which possibly means the bank's location is strategic enough that people from both urban and rural areas have access to it. From the user's point of view, this is a positive point.
* Finally, about 68% of the requests are approved by the bank, which is decent.
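
Percentage shares like the ones listed above can be read off `value_counts(normalize=True)` rather than worked out by hand from raw counts. A minimal sketch on a made-up frame (the helper name `percentage_shares` and the toy values are assumptions for illustration, not the real dataset):

```python
import pandas as pd

# Hypothetical stand-in for org_data; the values are made up for illustration
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", None],
    "Loan_Status": ["Y", "Y", "N", "Y", "N"],
})

def percentage_shares(frame, column):
    """Return each category's share of the non-missing rows, in percent."""
    return (frame[column].value_counts(normalize=True) * 100).round(1)

gender_shares = percentage_shares(toy, "Gender")
status_shares = percentage_shares(toy, "Loan_Status")
print(gender_shares)
print(status_shares)
```

Note that `value_counts` skips missing values by default, so the shares are relative to the rows where the column is filled in.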

# 3. Cleaning and pre-processing the data

First, let us fill in the missing values. In principle, rows with missing values could simply be dropped, but since the dataset has a limited number of rows, I will instead fill the categorical columns with their mode and the numerical columns with their mean.

In [None]:
print("Missing Values in Original Data")
print("")
miss_vals = org_data.isnull()
print(miss_vals.sum())

# Work on a copy so that org_data keeps its original missing values
filled_data = org_data.copy()

# Fill the categorical columns with their mode (Education has no missing values).
# Assigning the result back avoids the chained inplace fillna, which newer pandas versions no longer apply.
for col in ['Gender', 'Married', 'Dependents', 'Self_Employed', 'Property_Area', 'Credit_History']:
    filled_data[col] = filled_data[col].fillna(filled_data[col].mode()[0])

# Fill the remaining numerical columns with their mean
for col in ['LoanAmount', 'Loan_Amount_Term']:
    filled_data[col] = filled_data[col].fillna(filled_data[col].mean())

print("")
print("")
print("")
print("Missing Values in Filled Data")
print("")
miss_vals = filled_data.isnull()
print(miss_vals.sum())
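
As a possible variation on the filling step, the fill values could be learned once and stored, so that the same mode and mean can later be applied to another file. A minimal sketch, assuming a hypothetical helper `fit_fill_values` and made-up miniature frames (neither is part of this notebook):

```python
import pandas as pd

def fit_fill_values(frame, categorical_cols, numerical_cols):
    """Learn one fill value per column: mode for categorical, mean for numerical."""
    values = {col: frame[col].mode()[0] for col in categorical_cols}
    values.update({col: frame[col].mean() for col in numerical_cols})
    return values

# Hypothetical miniature train/test frames (made-up values)
train = pd.DataFrame({"Gender": ["Male", "Male", None, "Female"],
                      "LoanAmount": [100.0, None, 200.0, 150.0]})
test = pd.DataFrame({"Gender": [None, "Female"],
                     "LoanAmount": [None, 120.0]})

fill_values = fit_fill_values(train, ["Gender"], ["LoanAmount"])
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)  # reuses the training mode and mean
print(test_filled)
```

Passing a dictionary to `DataFrame.fillna` fills each listed column with its own value, which keeps the learned statistics in one place.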

Now we have no missing values, so we proceed to normalize the numerical columns.  
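
Min-max normalization rescales each value as `(x - min) / (max - min)`, so the smallest value in a column becomes 0 and the largest becomes 1. A small dependency-free sketch of that formula (the function name `min_max_scale`, the handling of constant columns, and the sample amounts are assumptions for illustration):

```python
def min_max_scale(values):
    """Rescale numbers linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant column has no spread; this sketch maps every value to 0.0
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]

# Hypothetical loan amounts (made-up values)
loan_amounts = [100, 150, 300]
scaled = min_max_scale(loan_amounts)
print(scaled)
```

Here 150 sits a quarter of the way between 100 and 300, so it is mapped to 0.25.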

In [None]:
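from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder

# Work on a copy so that filled_data keeps the unscaled, unencoded values
encoded_data = filled_data.copy()

# Scale the numerical columns 6 to 9 (ApplicantIncome ... Loan_Amount_Term) to the range [0, 1]
numerical_cols = encoded_data.columns[6:10]
numerical_scaler = MinMaxScaler()
encoded_data[numerical_cols] = numerical_scaler.fit_transform(encoded_data[numerical_cols])

# Label-encode the categorical columns 1 to 5, Property_Area (11) and Loan_Status (12).
# Credit_History (10) is already 0/1 and is left unchanged.
categorical_cols = list(encoded_data.columns[1:6]) + list(encoded_data.columns[11:13])
for col in categorical_cols:
    encoded_data[col] = LabelEncoder().fit_transform(encoded_data[col])
print(encoded_data.head())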
    


# 4. Splitting the data into train and test data

In [None]:
from sklearn.model_selection import train_test_split
X_train , X_test , Y_train , Y_test = train_test_split(encoded_data.iloc[: ,1:12] , encoded_data.iloc[:,12] , test_size = 0.2 , random_state =2)


In [None]:
print(X_train.shape, Y_train.shape)
# Fit the encoder on the training labels only, then reuse the same mapping for the test labels
label_encoder = LabelEncoder()
Y_train_encoded = label_encoder.fit_transform(Y_train)
Y_test_encoded = label_encoder.transform(Y_test)
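
A label encoder learns its class-to-integer mapping when it is fitted, so the mapping depends on which labels it sees. The sketch below, on made-up labels, shows how fitting a second encoder on a test split alone can produce a different mapping than reusing the encoder fitted on the training split:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels (made up); in this notebook they would come from Loan_Status
train_labels = ["Y", "N", "Y", "Y"]
test_labels = ["Y", "Y"]  # this small test split happens to contain no "N"

encoder = LabelEncoder()
train_encoded = encoder.fit_transform(train_labels)  # learns N -> 0, Y -> 1
test_encoded = encoder.transform(test_labels)        # reuses that mapping

# Refitting on the test labels alone learns a different mapping here (Y -> 0)
refit_encoded = LabelEncoder().fit_transform(test_labels)

print(list(encoder.classes_))
print(test_encoded, refit_encoded)
```

With both classes present in each split the two approaches agree, but reusing the fitted encoder does not depend on that.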

# 5. Compiling and training the model

In [None]:
log_model = LogisticRegression()

In [None]:
log_model.fit(X_train,Y_train_encoded)

# 6. Evaluating its performance

In [None]:
# Accuracy on training data (accuracy_score expects the true labels first, then the predictions)
X_train_predicted = log_model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train_encoded, X_train_predicted)

In [None]:
# Accuracy on test data

X_test_predicted = log_model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test_encoded, X_test_predicted)

In [None]:
print('Accuracy on Training Data: ',round((training_data_accuracy*100) ,2) , "%")
print('Accuracy on Test Data: ',round((test_data_accuracy*100) ,2) , "%")
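
Because about 68% of the requests in the data are approved, an accuracy figure is easier to judge next to the score of always predicting the most common class. A minimal, dependency-free sketch on made-up labels (the function names and the sample values are assumptions for illustration, not results from this notebook):

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most common class."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

# Hypothetical labels and predictions (made up): 1 = approved, 0 = rejected
y_true = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 1, 1, 1, 0, 1, 1, 1, 1]

model_accuracy = accuracy(y_true, y_pred)
baseline_accuracy = majority_baseline(y_true)
print(model_accuracy, baseline_accuracy)
```

In this toy case the model scores 0.8 against a baseline of 0.7, so only the gap between the two reflects what the model has actually learned.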

# Conclusion 
So this model predicts whether a request will be approved, based on the parameters above, with a test accuracy of about 77%. Please note that only about 120 data points were used for testing, while the training accuracy reached about 82%.

Finally, I use the same model to predict the loan status for the test file that ships with the same dataset I used for training. The final predictions are printed below.

In [None]:
org_test_data = pd.read_csv('/kaggle/input/home-loan-approval/loan_sanction_test.csv')

In [None]:
print("Missing Values in Original Test Data")
print("")
miss_test_vals = org_test_data.isnull()
print(miss_test_vals.sum())

# Work on a copy so that org_test_data keeps its original missing values
filled_test_data = org_test_data.copy()

# Fill the categorical columns with their mode (Education has no missing values)
for col in ['Gender', 'Married', 'Dependents', 'Self_Employed', 'Property_Area', 'Credit_History']:
    filled_test_data[col] = filled_test_data[col].fillna(filled_test_data[col].mode()[0])

# Fill the remaining numerical columns with their mean
for col in ['LoanAmount', 'Loan_Amount_Term']:
    filled_test_data[col] = filled_test_data[col].fillna(filled_test_data[col].mean())

print("")
print("")
print("")
print("Missing Values in Filled Data")
print("")
# Report on the test data (the original printed the training-data counts here by mistake)
miss_test_vals = filled_test_data.isnull()
print(miss_test_vals.sum())

In [None]:
from sklearn.preprocessing import LabelEncoder

encoded_test_data = filled_test_data.copy()

# Scale the numerical columns 6 to 9 with the scaler already fitted on the training data,
# so that train and test values share the same scale
numerical_cols = encoded_test_data.columns[6:10]
encoded_test_data[numerical_cols] = numerical_scaler.transform(encoded_test_data[numerical_cols])

# Label-encode the categorical columns 1 to 5 and Property_Area (11).
# Refitting here assumes the test file contains the same categories as the training file.
categorical_cols = list(encoded_test_data.columns[1:6]) + [encoded_test_data.columns[11]]
for col in categorical_cols:
    encoded_test_data[col] = LabelEncoder().fit_transform(encoded_test_data[col])

print(encoded_test_data.head())

test_data_final = encoded_test_data.iloc[:, 1:12]


In [None]:
final_predictions = log_model.predict(test_data_final)

In [None]:
print(final_predictions)
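
The printed predictions are integer codes. Assuming they follow the alphabetical mapping that `LabelEncoder` produces for the labels `N` and `Y` (N as 0, Y as 1), they could be turned back into readable labels and paired with their loan IDs. A minimal sketch with made-up IDs and predictions (the column name `Predicted_Loan_Status` is also an assumption):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Encoder with the assumed mapping N -> 0, Y -> 1
encoder = LabelEncoder().fit(["N", "Y"])

# Hypothetical IDs and model outputs (made up for illustration)
loan_ids = ["LP-A", "LP-B", "LP-C"]
predictions = np.array([1, 0, 1])

results = pd.DataFrame({
    "Loan_ID": loan_ids,
    "Predicted_Loan_Status": encoder.inverse_transform(predictions),
})
print(results)
```

Keeping the ID next to each prediction makes the output usable without counting rows in the printed array.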