<center><h2 style="font-size: 30px;">🔍 Loan Approval Prediction Project</h2></center>



<left><h2 style="font-size: 24px;">📌 Objective:</h2></left>Predict whether a loan application will be approved (<code>Approved</code>) or rejected (<code>Rejected</code>) from applicant details, using machine learning techniques.






<h2 style="font-size: 24px;">📘 Importing Required Libraries</h2>
<p>Before working with the dataset, we need to import essential Python libraries:</p>


In [35]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


<h2 style="font-size: 24px;">📂 Loading the Dataset</h2>
<p>We load the CSV file into a pandas DataFrame to begin our analysis:</p>


In [36]:
df = pd.read_csv("loan_data.csv")


<h2 style="font-size: 24px;">🔍 Previewing the Dataset</h2>
<p>The table below shows the first five rows of the dataset using <code>df.head()</code>:</p>

In [37]:
df.head()

Unnamed: 0,loan_id,no_of_dependents,education,self_employed,income_annum,loan_amount,loan_term,cibil_score,residential_assets_value,commercial_assets_value,luxury_assets_value,bank_asset_value,loan_status
0,1,2,Graduate,No,9600000,29900000,12,778,2400000,17600000,22700000,8000000,Approved
1,2,0,Not Graduate,Yes,4100000,12200000,8,417,2700000,2200000,8800000,3300000,Rejected
2,3,3,Graduate,No,9100000,29700000,20,506,7100000,4500000,33300000,12800000,Rejected
3,4,3,Graduate,No,8200000,30700000,8,467,18200000,3300000,23300000,7900000,Rejected
4,5,5,Not Graduate,Yes,9800000,24200000,20,382,12400000,8200000,29400000,5000000,Rejected


In [38]:
df.shape         # Rows & Columns


(4269, 13)

In [39]:
df.info()         # Data types, missing values


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4269 entries, 0 to 4268
Data columns (total 13 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   loan_id                    4269 non-null   int64 
 1    no_of_dependents          4269 non-null   int64 
 2    education                 4269 non-null   object
 3    self_employed             4269 non-null   object
 4    income_annum              4269 non-null   int64 
 5    loan_amount               4269 non-null   int64 
 6    loan_term                 4269 non-null   int64 
 7    cibil_score               4269 non-null   int64 
 8    residential_assets_value  4269 non-null   int64 
 9    commercial_assets_value   4269 non-null   int64 
 10   luxury_assets_value       4269 non-null   int64 
 11   bank_asset_value          4269 non-null   int64 
 12   loan_status               4269 non-null   object
dtypes: int64(10), object(3)
memory usage: 433.7+ KB


In [40]:
df.describe()     # Statistical summary


Unnamed: 0,loan_id,no_of_dependents,income_annum,loan_amount,loan_term,cibil_score,residential_assets_value,commercial_assets_value,luxury_assets_value,bank_asset_value
count,4269.0,4269.0,4269.0,4269.0,4269.0,4269.0,4269.0,4269.0,4269.0,4269.0
mean,2135.0,2.498712,5059124.0,15133450.0,10.900445,599.936051,7472617.0,4973155.0,15126310.0,4976692.0
std,1232.498479,1.69591,2806840.0,9043363.0,5.709187,172.430401,6503637.0,4388966.0,9103754.0,3250185.0
min,1.0,0.0,200000.0,300000.0,2.0,300.0,-100000.0,0.0,300000.0,0.0
25%,1068.0,1.0,2700000.0,7700000.0,6.0,453.0,2200000.0,1300000.0,7500000.0,2300000.0
50%,2135.0,3.0,5100000.0,14500000.0,10.0,600.0,5600000.0,3700000.0,14600000.0,4600000.0
75%,3202.0,4.0,7500000.0,21500000.0,16.0,748.0,11300000.0,7600000.0,21700000.0,7100000.0
max,4269.0,5.0,9900000.0,39500000.0,20.0,900.0,29100000.0,19400000.0,39200000.0,14700000.0
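
<p>The summary above reports a minimum of -100000 for <code>residential_assets_value</code>, which is unusual for an asset value. A minimal sketch of how such entries could be counted, using a small made-up frame (<code>toy</code> and its values are illustrative assumptions, not rows from the dataset):</p>

```python
import pandas as pd

# Toy frame standing in for the real data; the values are illustrative only.
toy = pd.DataFrame({
    "residential_assets_value": [2400000, -100000, 7100000],
    "bank_asset_value": [8000000, 3300000, 0],
})

# Count strictly negative entries in every numeric column.
numeric = toy.select_dtypes(include="number")
negative_counts = (numeric < 0).sum()

# Show only the columns that contain at least one negative value.
print(negative_counts[negative_counts > 0])
```

<p>On the real DataFrame the same two lines could be run on <code>df</code>. Whether negative values should be dropped, clipped, or kept is a modelling decision this notebook does not make.</p>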


In [41]:
df.isnull().sum() # Missing value count per column


loan_id                      0
 no_of_dependents            0
 education                   0
 self_employed               0
 income_annum                0
 loan_amount                 0
 loan_term                   0
 cibil_score                 0
 residential_assets_value    0
 commercial_assets_value     0
 luxury_assets_value         0
 bank_asset_value            0
 loan_status                 0
dtype: int64

In [42]:
df.columns

Index(['loan_id', ' no_of_dependents', ' education', ' self_employed',
       ' income_annum', ' loan_amount', ' loan_term', ' cibil_score',
       ' residential_assets_value', ' commercial_assets_value',
       ' luxury_assets_value', ' bank_asset_value', ' loan_status'],
      dtype='object')

# Step 4: Data Cleaning & Preprocessing


In [43]:
df.columns = df.columns.str.strip().str.lower()
print(df.columns.tolist())               # check again

['loan_id', 'no_of_dependents', 'education', 'self_employed', 'income_annum', 'loan_amount', 'loan_term', 'cibil_score', 'residential_assets_value', 'commercial_assets_value', 'luxury_assets_value', 'bank_asset_value', 'loan_status']
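
<p>Leading spaces in the header usually come from a file that puts a space after every comma, in which case the string values may carry the same space. As a hedged alternative to stripping the column names afterwards, <code>pd.read_csv</code> accepts <code>skipinitialspace=True</code>. The excerpt below is a hypothetical stand-in for the file; that the real values are also padded is an assumption:</p>

```python
import io
import pandas as pd

# Hypothetical excerpt mimicking a "comma + space" layout (not the real file).
raw = (
    "loan_id, education, self_employed, loan_status\n"
    "1, Graduate, No, Approved\n"
    "2, Not Graduate, Yes, Rejected\n"
)

# skipinitialspace=True drops the spaces that directly follow each delimiter,
# in the header row and in the data rows.
clean = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

print(clean.columns.tolist())
print(clean["education"].tolist())
```

<p>Spaces inside a value, such as the one in <code>Not Graduate</code>, are left untouched.</p>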


# Step 5: Feature Selection


In [44]:
X = df.drop(columns=['loan_id', 'loan_status'])
y = df['loan_status']
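
<p>Before splitting, it can help to look at how the two classes are distributed in the target and, optionally, to map the labels to 0/1. A minimal sketch on a made-up series (<code>toy_status</code> and the padded labels are assumptions for illustration):</p>

```python
import pandas as pd

# Made-up target values; the leading spaces are an assumption about the raw file.
toy_status = pd.Series(
    [" Approved", " Rejected", " Rejected", " Approved", " Approved"],
    name="loan_status",
)

# Strip stray whitespace so that " Approved" and "Approved" count as one class.
cleaned = toy_status.str.strip()

# Share of each class in the target.
print(cleaned.value_counts(normalize=True))

# Optional explicit 0/1 encoding of the target.
y_numeric = cleaned.map({"Approved": 1, "Rejected": 0})
print(y_numeric.tolist())
```

<p><code>LogisticRegression</code> accepts string labels directly, so the numeric mapping is a convenience rather than a requirement.</p>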


# Step 6: Data Splitting (Train-Test Split)

In [45]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
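
<p>The split above is purely random. If the classes are imbalanced, passing <code>stratify=y</code> keeps the class proportions the same in both parts. A small sketch on made-up data (<code>toy_X</code> and <code>toy_y</code> are illustrative assumptions):</p>

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 15 approved and 5 rejected applications.
toy_X = pd.DataFrame({"cibil_score": range(300, 320)})
toy_y = pd.Series(["Approved"] * 15 + ["Rejected"] * 5)

# stratify=toy_y preserves the 75/25 class ratio in both splits.
Xtr, Xte, ytr, yte = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print(ytr.value_counts())
print(yte.value_counts())
```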

# Step 7: Model Training


In [46]:
categorical_cols = X_train.select_dtypes(include=['object']).columns
print(categorical_cols)  # shows the list of categorical columns


Index(['education', 'self_employed'], dtype='object')


In [47]:
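from sklearn.preprocessing import LabelEncoder

# Encode each categorical column as integers. The encoder is fitted on the
# training split only and then reused on the test split.
for col in categorical_cols:
    le = LabelEncoder()  # a fresh encoder for each column
    X_train[col] = le.fit_transform(X_train[col])
    X_test[col] = le.transform(X_test[col])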
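
<p><code>LabelEncoder</code> is designed for target labels. For feature columns, scikit-learn also provides <code>OrdinalEncoder</code>, which handles several columns at once. The following is a sketch of that alternative, not the approach this notebook uses; the two small frames are made up for illustration:</p>

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up train/test frames with the two categorical columns.
train = pd.DataFrame({
    "education": ["Graduate", "Not Graduate", "Graduate"],
    "self_employed": ["No", "Yes", "No"],
})
test = pd.DataFrame({"education": ["Not Graduate"], "self_employed": ["No"]})

cols = ["education", "self_employed"]
enc = OrdinalEncoder()

# Fit on the training data only, then apply the same mapping to the test data.
train_enc = pd.DataFrame(enc.fit_transform(train[cols]), columns=cols, index=train.index)
test_enc = pd.DataFrame(enc.transform(test[cols]), columns=cols, index=test.index)

# One array of learned categories per column, in sorted order.
print(enc.categories_)
print(test_enc)
```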


In [48]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)  # ✅ works now that the categorical columns are numeric

In [49]:
# Combine X_train and y_train temporarily
train_data = X_train.copy()
train_data['Loan_Status'] = y_train


In [50]:
# Drop rows where Loan_Status is NaN
train_data = train_data.dropna(subset=['Loan_Status'])


In [51]:
# Redefine X_train and y_train from the cleaned data
X_train = train_data.drop(columns=['Loan_Status'])
y_train = train_data['Loan_Status']


In [52]:
# Check missing values in y_train
print("Missing values in y_train:", y_train.isnull().sum())

# Remove rows with NaN in y_train from X_train and y_train
train_data = X_train.copy()
train_data['Loan_Status'] = y_train
train_data = train_data.dropna(subset=['Loan_Status'])

X_train = train_data.drop(columns=['Loan_Status'])
y_train = train_data['Loan_Status']

# Now fit model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)


Missing values in y_train: 0
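
<p>The features span very different scales (<code>cibil_score</code> runs from 300 to 900, while the amounts are in the millions), so standardizing them before logistic regression may help the solver converge. The metrics imported at the top can then summarize performance on the test split. The sketch below uses synthetic data with a made-up labelling rule, so its numbers say nothing about the real dataset, and the scaling pipeline is a suggested variation rather than what this notebook does:</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two features on very different scales.
rng = np.random.default_rng(0)
n = 400
cibil = rng.integers(300, 901, size=n)
income = rng.integers(200_000, 9_900_001, size=n)
X_toy = np.column_stack([cibil, income])

# Hypothetical labelling rule, used only so the example has a learnable signal.
y_toy = np.where(cibil >= 550, "Approved", "Rejected")

Xtr, Xte, ytr, yte = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Standardize the features, then fit logistic regression.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(Xtr, ytr)
pred = pipe.predict(Xte)

acc = accuracy_score(yte, pred)
cm = confusion_matrix(yte, pred, labels=["Approved", "Rejected"])

print("Accuracy:", acc)
print(cm)
print(classification_report(yte, pred))
```

<p>With the notebook's own data, <code>pipe.fit(X_train, y_train)</code> and <code>pipe.predict(X_test)</code> would take the place of the synthetic arrays.</p>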
