# **Loan Eligibility Prediction Using Logistic Regression**


In [128]:
# Import pandas and NumPy for data loading and numerical operations
import pandas as pd
import numpy as np

Description of our dataset

The file contains 14 columns and 5000 rows. The columns are described below:

* ID: Customer ID
* Age: Customer's age
* Experience: Customer's professional experience
* Income: Customer's income
* ZipCode: ZIP code of the customer's residence
* Family: Number of family members of the customer
* CCAvg: Average monthly credit card spending
* Education: Education level of the customer
* Mortgage: Value of the customer's house mortgage (0 if none)
* Personal Loan: 0 = no personal loan given, 1 = personal loan given
* Securities Account: Whether the customer has a securities account
* CD Account: Whether the customer has a CD account
* Online: Whether the customer uses online banking
* Credit Card: Whether the customer has a credit card
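
The descriptions above can be checked against the data itself. The following is a minimal sketch that lists which columns contain only 0/1 values. It runs on a small hand-made DataFrame whose values are invented for illustration (they are not taken from `bankloan.csv`), and `binary_columns` is a hypothetical helper name, not part of pandas.

```python
import pandas as pd


def binary_columns(df):
    """Return the names of columns whose values are all 0 or 1."""
    return [col for col in df.columns if df[col].isin([0, 1]).all()]


# Invented sample rows, for illustration only
sample = pd.DataFrame({
    "Income": [49, 34, 112],
    "Mortgage": [0, 155, 0],
    "Online": [0, 1, 1],
    "CreditCard": [1, 0, 0],
})

flags = binary_columns(sample)
print(flags)
```

Calling `binary_columns(dataset)` on the loaded data would show which of the 14 columns are yes/no flags and which carry amounts or counts.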




In [129]:
# Load the dataset and print its shape (rows, columns)
dataset = pd.read_csv('bankloan.csv')
print(dataset.shape)

(5000, 14)


In [130]:
# Display dataset information
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   5000 non-null   int64  
 1   Age                  5000 non-null   int64  
 2   Experience           5000 non-null   int64  
 3   Income               5000 non-null   int64  
 4   ZIP.Code             5000 non-null   int64  
 5   Family               5000 non-null   int64  
 6   CCAvg                5000 non-null   float64
 7   Education            5000 non-null   int64  
 8   Mortgage             5000 non-null   int64  
 9   Personal.Loan        5000 non-null   int64  
 10  Securities.Account   5000 non-null   int64  
 11  CD.Account           5000 non-null   int64  
 12  Online               5000 non-null   int64  
 13  CreditCard           5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB


In [131]:
# Replace period (.) in column names with underscore (_) to avoid syntax issues
dataset.columns = [col.replace('.', '_') for col in dataset.columns]
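
As a small illustration of why the rename helps, the sketch below applies the same replacement to an invented one-row frame (the values are made up) and then uses attribute access and `query()`, both of which expect column names that are valid Python identifiers.

```python
import pandas as pd

# Invented one-row frame that mimics the dotted column names
demo = pd.DataFrame({"Personal.Loan": [1], "CD.Account": [0]})
demo.columns = [col.replace(".", "_") for col in demo.columns]

# With underscores, attribute access and query expressions work directly
loan_flag = demo.Personal_Loan.iloc[0]
approved = demo.query("Personal_Loan == 1")

print(list(demo.columns), loan_flag, len(approved))
```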

In [132]:
# Select the features (x) by dropping the 1st, 5th and 10th columns (ID, ZIP_Code, Personal_Loan);
# the target (y) is the 10th column, Personal_Loan
x = dataset.drop(dataset.columns[[0, 4, 9]], axis=1)
y = dataset.iloc[:, 9]

In [133]:
# Split the data into training and testing sets (80% train, 20% test)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# Check the shapes of the training and testing sets
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((4000, 11), (1000, 11), (4000,), (1000,))
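
The split above is purely random. When one class is rare, a possible refinement is to pass `stratify=y` so that the train and test sets keep the same class proportions. The sketch below shows this on invented labels with 10% positives; it is an optional variation, not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented data: 100 rows, 10 of them labelled 1
y_demo = np.array([1] * 10 + [0] * 90)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo keeps the 10% positive rate in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())
```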

In [134]:
# Scale the feature data to standardize it
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

# Display the scaled training data
x_train

array([[-0.54939992, -0.70298746,  0.02175233, ..., -0.25432119,
         0.82545668,  1.55150089],
       [-1.1586596 , -1.05065683, -0.19431256, ..., -0.25432119,
         0.82545668, -0.64453717],
       [ 1.71356461,  1.64378077, -1.10178513, ..., -0.25432119,
        -1.21145061,  1.55150089],
       ...,
       [-1.68088218, -1.65907822, -1.08017864, ..., -0.25432119,
        -1.21145061,  1.55150089],
       [ 1.01726783,  1.12227672, -0.5400164 , ...,  3.93203578,
         0.82545668,  1.55150089],
       [-1.0716225 , -0.96373949, -0.77768778, ..., -0.25432119,
        -1.21145061, -0.64453717]])
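
To make the scaling step concrete, here is a minimal sketch on a tiny invented array. `StandardScaler` learns each column's mean and standard deviation from the training data only (`fit_transform`), then reuses those same statistics on the test data (`transform`), so no information from the test set leaks into preprocessing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented values, for illustration only
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[4.0, 40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean_ and scale_ from train
test_scaled = scaler.transform(test)        # reuses the training statistics

# The transform is simply (value - training mean) / training std
manual = (test - scaler.mean_) / scaler.scale_

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled, manual)
```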

In [136]:
# Initialize and train a Logistic Regression model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(x_train, y_train)
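
For intuition about what the fitted model computes: a binary logistic regression turns a weighted sum of the features into a probability with the sigmoid function, and predicts class 1 when that probability exceeds 0.5. The sketch below reproduces scikit-learn's probabilities by hand on synthetic data from `make_classification` (not the bank data); the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X_demo, y_demo)

# Linear score z = w . x + b, then sigmoid(z) = 1 / (1 + exp(-z))
z = X_demo @ clf.coef_.ravel() + clf.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

sk_proba = clf.predict_proba(X_demo)[:, 1]
manual_pred = (manual_proba > 0.5).astype(int)

print(np.allclose(manual_proba, sk_proba))
```

In the notebook, `model.predict_proba(x_test)[:, 1]` would give the estimated probability of a personal loan for each test customer.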

In [137]:
# Save the model using pickle; uncomment the lines below if you want to save it.
# Note: new inputs must be scaled with the same fitted StandardScaler (sc) before prediction.
# import pickle
#
# filename = 'logistic_regression_model.sav'
# with open(filename, 'wb') as f:
#     pickle.dump(model, f)

In [138]:
# Make predictions on the test data
y_pred = model.predict(x_test)

# Combine predictions and actual values for comparison
val_df = pd.DataFrame({'Y_prediction': y_pred, 'Y_actual': y_test.values})

# Display the first 10 rows of the prediction comparison DataFrame
val_df.head(10)

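|   | Y_prediction | Y_actual |
|---|--------------|----------|
| 0 | 0 | 0 |
| 1 | 0 | 0 |
| 2 | 0 | 0 |
| 3 | 0 | 0 |
| 4 | 0 | 0 |
| 5 | 0 | 0 |
| 6 | 1 | 1 |
| 7 | 1 | 1 |
| 8 | 0 | 0 |
| 9 | 0 | 0 |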


In [139]:
# Calculate and print the accuracy of the model
from sklearn.metrics import accuracy_score
out = accuracy_score(y_test , y_pred)*100

print(f'Accuracy of the model : {out}')

Accuracy of the model : 96.0


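The model achieved 96% accuracy on the held-out test set, meaning its predictions matched the actual Personal Loan label for 96% of the 1,000 test customers. Accuracy alone does not show how well the loan-given class (1) is predicted, so precision and recall for that class are worth checking as well.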
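
A single accuracy figure can be broken down with a confusion matrix, precision, and recall. The sketch below uses invented labels (100 rows, 10 positives) purely to show the calls; the numbers it prints are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Invented labels: 90 negatives followed by 10 positives
y_true = np.array([0] * 90 + [1] * 10)
# Invented predictions: 88 true negatives, 2 false positives,
# then 6 true positives and 4 false negatives
y_hat = np.array([0] * 88 + [1] * 2 + [1] * 6 + [0] * 4)

acc = accuracy_score(y_true, y_hat)
cm = confusion_matrix(y_true, y_hat)    # rows: actual, columns: predicted
prec = precision_score(y_true, y_hat)   # TP / (TP + FP)
rec = recall_score(y_true, y_hat)       # TP / (TP + FN)

print(acc, prec, rec)
print(cm)
```

Here the accuracy is high even though 4 of the 10 positives are missed, which is why the per-class metrics are useful. In the notebook, the same functions can be called as `confusion_matrix(y_test, y_pred)`, `precision_score(y_test, y_pred)`, and `recall_score(y_test, y_pred)`.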