# Importing Dataset

🧩 **Project Title:** Loan Data Preprocessing

📅 **Timeline (Month/Year):** (Please fill in the date you worked on this)

💡 **Objective / Problem statement:** To clean and preprocess a loan application dataset for machine learning modeling.

🧠 **Tech Stack / Tools used:** Python, pandas, scikit-learn (LabelEncoder)

⚙️ **Approach / Key steps (briefly):**
1.  Load the dataset.
2.  Analyze data for missing values and unique IDs.
3.  Drop the 'Loan_ID' column.
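4.  Remove outliers in 'CoapplicantIncome' (keep only rows with values below 8000).
5.  Fill missing values with the mode ('Gender', 'Dependents', 'Self_Employed', 'Credit_History') or the median ('Loan_Amount_Term').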
6.  Perform one-hot encoding on 'Dependents' and 'Property_Area'.
7.  Perform label encoding on 'Gender', 'Married', 'Education', 'Self_Employed', and 'Loan_Status'.
8.  Drop the original 'Gender', 'Married', 'Education', 'Self_Employed', and 'Loan_Status' columns once their encoded versions exist.
9.  Export the preprocessed data to a new CSV file.

📈 **Results / Outcomes (accuracy, efficiency, insights, etc.):**
*   Cleaned the dataset by removing 3 rows with outlying 'CoapplicantIncome' values (381 rows down to 378) and filling the missing values in five columns.
*   Converted categorical features into numerical format using one-hot and label encoding, making the data ready for machine learning models.
*   Generated a clean, preprocessed dataset (`Encoded_loan_data.csv`) for further analysis or model training.

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv(r'D:\My Drive\loan_data.csv')
df

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
1,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
2,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
3,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
4,LP001013,Male,Yes,0,Not Graduate,No,2333,1516.0,95.0,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...
376,LP002953,Male,Yes,3+,Graduate,No,5703,0.0,128.0,360.0,1.0,Urban,Y
377,LP002974,Male,Yes,0,Graduate,No,3232,1950.0,108.0,360.0,1.0,Rural,Y
378,LP002978,Female,No,0,Graduate,No,2900,0.0,71.0,360.0,1.0,Rural,Y
379,LP002979,Male,Yes,3+,Graduate,No,4106,0.0,40.0,180.0,1.0,Rural,Y


# Data Analysis

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 381 entries, 0 to 380
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            381 non-null    object 
 1   Gender             376 non-null    object 
 2   Married            381 non-null    object 
 3   Dependents         373 non-null    object 
 4   Education          381 non-null    object 
 5   Self_Employed      360 non-null    object 
 6   ApplicantIncome    381 non-null    int64  
 7   CoapplicantIncome  381 non-null    float64
 8   LoanAmount         381 non-null    float64
 9   Loan_Amount_Term   370 non-null    float64
 10  Credit_History     351 non-null    float64
 11  Property_Area      381 non-null    object 
 12  Loan_Status        381 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 38.8+ KB


In [None]:
df.isnull().sum()

Loan_ID               0
Gender                5
Married               0
Dependents            8
Education             0
Self_Employed        21
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount            0
Loan_Amount_Term     11
Credit_History       30
Property_Area         0
Loan_Status           0
dtype: int64
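
The raw counts are easier to judge as a share of all rows (for example, 30 missing `Credit_History` values out of 381 rows is roughly 7.9%). A minimal sketch of that calculation follows; the `sample` frame is made up for illustration and is not the loan dataset.

```python
import numpy as np
import pandas as pd

# Illustrative frame only; column names mirror the loan data, values are made up.
sample = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "Credit_History": [1.0, np.nan, np.nan, 0.0],
    "LoanAmount": [128.0, 66.0, 120.0, 141.0],
})

# isnull().mean() gives the fraction of missing entries per column.
missing_pct = (sample.isnull().mean() * 100).round(1)

# Keep only the columns that actually have gaps, worst first.
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```

Sorting by share makes it quick to see which columns need the most attention before choosing a fill strategy.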

In [None]:
df.describe()

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History
count,381.0,381.0,381.0,370.0,351.0
mean,3579.845144,1277.275381,104.986877,340.864865,0.837607
std,1419.813818,2340.818114,28.358464,68.549257,0.369338
min,150.0,0.0,9.0,12.0,0.0
25%,2600.0,0.0,90.0,360.0,1.0
50%,3333.0,983.0,110.0,360.0,1.0
75%,4288.0,2016.0,127.0,360.0,1.0
max,9703.0,33837.0,150.0,480.0,1.0


In [None]:
df['Loan_ID'].is_unique

True

In [None]:
df.drop(columns=['Loan_ID'], inplace=True)

In [None]:
df

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
1,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
2,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
3,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
4,Male,Yes,0,Not Graduate,No,2333,1516.0,95.0,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...
376,Male,Yes,3+,Graduate,No,5703,0.0,128.0,360.0,1.0,Urban,Y
377,Male,Yes,0,Graduate,No,3232,1950.0,108.0,360.0,1.0,Rural,Y
378,Female,No,0,Graduate,No,2900,0.0,71.0,360.0,1.0,Rural,Y
379,Male,Yes,3+,Graduate,No,4106,0.0,40.0,180.0,1.0,Rural,Y


# Removing Outliers

In [None]:
df = df.loc[df['CoapplicantIncome'] < 8000]

In [None]:
df

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
1,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
2,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
3,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
4,Male,Yes,0,Not Graduate,No,2333,1516.0,95.0,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...
376,Male,Yes,3+,Graduate,No,5703,0.0,128.0,360.0,1.0,Urban,Y
377,Male,Yes,0,Graduate,No,3232,1950.0,108.0,360.0,1.0,Rural,Y
378,Female,No,0,Graduate,No,2900,0.0,71.0,360.0,1.0,Rural,Y
379,Male,Yes,3+,Graduate,No,4106,0.0,40.0,180.0,1.0,Rural,Y


In [None]:
df.describe()

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History
count,378.0,378.0,378.0,367.0,348.0
mean,3591.132275,1121.229947,104.94709,340.708447,0.83908
std,1419.543455,1264.670693,28.412598,68.80764,0.367986
min,150.0,0.0,9.0,12.0,0.0
25%,2600.0,0.0,90.0,360.0,1.0
50%,3336.5,918.0,110.0,360.0,1.0
75%,4297.0,1998.25,127.0,360.0,1.0
max,9703.0,6666.0,150.0,480.0,1.0
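
The cutoff of 8000 used above is hard-coded. One common alternative, shown here only as a sketch and not as the method this notebook uses, is Tukey's rule: flag values above Q3 + 1.5 × IQR. The `income` series below is made up for illustration.

```python
import pandas as pd

# Illustrative incomes only; not taken from the loan dataset.
income = pd.Series(
    [0.0, 0.0, 1000.0, 1500.0, 2000.0, 2500.0, 33000.0],
    name="CoapplicantIncome",
)

q1 = income.quantile(0.25)
q3 = income.quantile(0.75)
iqr = q3 - q1

# Tukey's rule: anything above Q3 + 1.5 * IQR is treated as a high outlier.
upper_fence = q3 + 1.5 * iqr

# .copy() makes the filtered result an independent object.
kept = income[income <= upper_fence].copy()
print(upper_fence, len(kept))
```

With the quartiles reported by the first `df.describe()` (Q1 = 0, Q3 = 2016), this rule would give a fence of 2016 + 1.5 × 2016 = 5040, which is stricter than the 8000 cutoff, so the choice of rule changes how many rows survive.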


# Filling Missing Values

In [None]:
df['Gender'].value_counts()

Male      289
Female     84
Name: Gender, dtype: int64

In [None]:
df['Gender'] = df['Gender'].fillna('Male')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Gender'] = df['Gender'].fillna('Male')
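
The warning above appears because `df` was produced by filtering another DataFrame, so pandas cannot tell whether the assignment is meant to reach the original data. The assignment still takes effect here. One common way to avoid the warning, sketched below on a made-up frame, is to take an explicit `.copy()` at the filtering step.

```python
import pandas as pd

# Illustrative frame only; values are made up.
raw = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "CoapplicantIncome": [1508.0, 0.0, 33837.0, 2358.0],
})

# .copy() makes the filtered frame independent of `raw`,
# so later column assignments are unambiguous.
filtered = raw.loc[raw["CoapplicantIncome"] < 8000].copy()

# Fill the missing value on the independent copy.
filtered["Gender"] = filtered["Gender"].fillna("Male")

print(filtered["Gender"].tolist())
print(raw["Gender"].isnull().sum())  # the original frame is untouched
```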


In [None]:
df['Dependents'].value_counts()

0     231
2      59
1      52
3+     28
Name: Dependents, dtype: int64

In [None]:
df['Dependents'] = df['Dependents'].fillna('0')



In [None]:
df['Self_Employed'].value_counts()

No     322
Yes     35
Name: Self_Employed, dtype: int64

In [None]:
df['Self_Employed'] = df['Self_Employed'].fillna('No')



In [None]:
df['Loan_Amount_Term'].value_counts()

360.0    309
180.0     29
480.0     11
300.0      7
120.0      3
84.0       3
240.0      2
60.0       1
12.0       1
36.0       1
Name: Loan_Amount_Term, dtype: int64

In [None]:
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].median())
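
The fills in this section hard-code the most frequent value after inspecting `value_counts()`. A minimal sketch of computing the fill value instead is shown below, using the mode for a categorical column and the median for a numeric one. The `sample` frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative frame only; values are made up.
sample = pd.DataFrame({
    "Self_Employed": ["No", "No", None, "Yes"],
    "Loan_Amount_Term": [360.0, np.nan, 180.0, 360.0],
})

# Categorical column: fill with the most frequent value.
# mode() returns a Series (ties are possible), so take the first entry.
most_common = sample["Self_Employed"].mode().iloc[0]
sample["Self_Employed"] = sample["Self_Employed"].fillna(most_common)

# Numeric column: fill with the median, which is less sensitive
# to extreme values than the mean.
median_term = sample["Loan_Amount_Term"].median()
sample["Loan_Amount_Term"] = sample["Loan_Amount_Term"].fillna(median_term)

print(sample)
```

Computing the value keeps the fill correct if the data changes, at the cost of hiding the chosen value from the reader.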



In [None]:
df['Credit_History'].value_counts()

1.0    292
0.0     56
Name: Credit_History, dtype: int64

In [None]:
df['Credit_History'] = df['Credit_History'].fillna(1.0)  # fill with the numeric mode; the string '1.0' would turn the column into object dtype

In [None]:
df.isnull().sum()

Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 378 entries, 0 to 380
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Gender             378 non-null    object 
 1   Married            378 non-null    object 
 2   Dependents         378 non-null    object 
 3   Education          378 non-null    object 
 4   Self_Employed      378 non-null    object 
 5   ApplicantIncome    378 non-null    int64  
 6   CoapplicantIncome  378 non-null    float64
 7   LoanAmount         378 non-null    float64
 8   Loan_Amount_Term   378 non-null    float64
 9   Credit_History     378 non-null    float64
 10  Property_Area      378 non-null    object 
 11  Loan_Status        378 non-null    object 
dtypes: float64(4), int64(1), object(7)
memory usage: 38.4+ KB


# Performing One-Hot Encoding

In [None]:
df

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
1,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
2,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
3,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
4,Male,Yes,0,Not Graduate,No,2333,1516.0,95.0,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...
376,Male,Yes,3+,Graduate,No,5703,0.0,128.0,360.0,1.0,Urban,Y
377,Male,Yes,0,Graduate,No,3232,1950.0,108.0,360.0,1.0,Rural,Y
378,Female,No,0,Graduate,No,2900,0.0,71.0,360.0,1.0,Rural,Y
379,Male,Yes,3+,Graduate,No,4106,0.0,40.0,180.0,1.0,Rural,Y


In [None]:
encoded_df = pd.get_dummies(df,columns=['Dependents','Property_Area'])

In [None]:
encoded_df

Unnamed: 0,Gender,Married,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Loan_Status,Dependents_0,Dependents_1,Dependents_2,Dependents_3+,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban
0,Male,Yes,Graduate,No,4583,1508.0,128.0,360.0,1.0,N,0,1,0,0,1,0,0
1,Male,Yes,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Y,1,0,0,0,0,0,1
2,Male,Yes,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Y,1,0,0,0,0,0,1
3,Male,No,Graduate,No,6000,0.0,141.0,360.0,1.0,Y,1,0,0,0,0,0,1
4,Male,Yes,Not Graduate,No,2333,1516.0,95.0,360.0,1.0,Y,1,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
376,Male,Yes,Graduate,No,5703,0.0,128.0,360.0,1.0,Y,0,0,0,1,0,0,1
377,Male,Yes,Graduate,No,3232,1950.0,108.0,360.0,1.0,Y,1,0,0,0,1,0,0
378,Female,No,Graduate,No,2900,0.0,71.0,360.0,1.0,Y,1,0,0,0,1,0,0
379,Male,Yes,Graduate,No,4106,0.0,40.0,180.0,1.0,Y,0,0,0,1,1,0,0
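
Two optional `pd.get_dummies` parameters are worth knowing here. Recent pandas versions return boolean indicator columns by default, so `dtype=int` is one way to get the 0/1 output shown above regardless of version. `drop_first=True` drops one level per feature, which removes the redundant column. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Illustrative frame only; values are made up.
sample = pd.DataFrame({
    "Property_Area": ["Rural", "Urban", "Semiurban", "Urban"],
    "LoanAmount": [128.0, 66.0, 120.0, 141.0],
})

# dtype=int forces 0/1 columns instead of booleans.
full = pd.get_dummies(sample, columns=["Property_Area"], dtype=int)

# drop_first=True drops the first level of each encoded feature;
# the dropped level is implied when all remaining indicators are 0.
reduced = pd.get_dummies(
    sample, columns=["Property_Area"], dtype=int, drop_first=True
)

print(list(full.columns))
print(list(reduced.columns))
```

Whether to drop a level depends on the downstream model: linear models often benefit from it, while tree-based models usually do not need it.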


# Performing Label Encoding

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
lbl = LabelEncoder()
columns_to_encode = ['Gender', 'Married', 'Education', 'Self_Employed', 'Loan_Status']
# Refit the encoder on each column and store the integer codes in a new 'Encoded_<column>' column
for column in columns_to_encode:
    encoded_df[f'Encoded_{column}'] = lbl.fit_transform(encoded_df[column])
encoded_df

Unnamed: 0,Gender,Married,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Loan_Status,...,Dependents_2,Dependents_3+,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Encoded_Gender,Encoded_Married,Encoded_Education,Encoded_Self_Employed,Encoded_Loan_Status
0,Male,Yes,Graduate,No,4583,1508.0,128.0,360.0,1.0,N,...,0,0,1,0,0,1,1,0,0,0
1,Male,Yes,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Y,...,0,0,0,0,1,1,1,0,1,1
2,Male,Yes,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Y,...,0,0,0,0,1,1,1,1,0,1
3,Male,No,Graduate,No,6000,0.0,141.0,360.0,1.0,Y,...,0,0,0,0,1,1,0,0,0,1
4,Male,Yes,Not Graduate,No,2333,1516.0,95.0,360.0,1.0,Y,...,0,0,0,0,1,1,1,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
376,Male,Yes,Graduate,No,5703,0.0,128.0,360.0,1.0,Y,...,0,1,0,0,1,1,1,0,0,1
377,Male,Yes,Graduate,No,3232,1950.0,108.0,360.0,1.0,Y,...,0,0,1,0,0,1,1,0,0,1
378,Female,No,Graduate,No,2900,0.0,71.0,360.0,1.0,Y,...,0,0,1,0,0,0,0,0,0,1
379,Male,Yes,Graduate,No,4106,0.0,40.0,180.0,1.0,Y,...,0,1,1,0,0,1,1,0,0,1
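
Because the single `lbl` encoder is refit on every column, only the classes of the last column remain available in `lbl.classes_` afterwards. If the label-to-code mapping needs to be kept, one alternative (a sketch, not the approach used in this notebook) is an explicit dictionary per column applied with `Series.map`. The mappings below reproduce the alphabetical order visible in the output above; the `sample` frame is made up.

```python
import pandas as pd

# Illustrative frame only; values are made up.
sample = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Loan_Status": ["N", "Y", "Y"],
})

# Explicit mappings; these follow the alphabetical order that
# LabelEncoder assigns (Female=0, Male=1; N=0, Y=1).
mappings = {
    "Gender": {"Female": 0, "Male": 1},
    "Loan_Status": {"N": 0, "Y": 1},
}

for column, mapping in mappings.items():
    sample[f"Encoded_{column}"] = sample[column].map(mapping)

# Inverting a mapping recovers the original labels from the codes.
inverse_status = {code: label for label, code in mappings["Loan_Status"].items()}
decoded = sample["Encoded_Loan_Status"].map(inverse_status).tolist()
print(decoded)
```

The explicit dictionaries double as documentation of what each code means when the exported file is used later.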


In [None]:
encoded_df.drop(columns=['Gender', 'Married', 'Education', 'Self_Employed', 'Loan_Status'],inplace=True)
encoded_df

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Dependents_0,Dependents_1,Dependents_2,Dependents_3+,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Encoded_Gender,Encoded_Married,Encoded_Education,Encoded_Self_Employed,Encoded_Loan_Status
0,4583,1508.0,128.0,360.0,1.0,0,1,0,0,1,0,0,1,1,0,0,0
1,3000,0.0,66.0,360.0,1.0,1,0,0,0,0,0,1,1,1,0,1,1
2,2583,2358.0,120.0,360.0,1.0,1,0,0,0,0,0,1,1,1,1,0,1
3,6000,0.0,141.0,360.0,1.0,1,0,0,0,0,0,1,1,0,0,0,1
4,2333,1516.0,95.0,360.0,1.0,1,0,0,0,0,0,1,1,1,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
376,5703,0.0,128.0,360.0,1.0,0,0,0,1,0,0,1,1,1,0,0,1
377,3232,1950.0,108.0,360.0,1.0,1,0,0,0,1,0,0,1,1,0,0,1
378,2900,0.0,71.0,360.0,1.0,1,0,0,0,1,0,0,0,0,0,0,1
379,4106,0.0,40.0,180.0,1.0,0,0,0,1,1,0,0,1,1,0,0,1


# Exporting CSV File

In [None]:
encoded_df.to_csv('Encoded_loan_data.csv')
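
By default, `to_csv` also writes the DataFrame index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read with `pd.read_csv`. If the row labels are not needed, passing `index=False` leaves them out. A minimal sketch on a made-up frame, using an in-memory buffer in place of a real file path:

```python
import io

import pandas as pd

# Illustrative frame only; the non-contiguous index mimics rows removed by filtering.
sample = pd.DataFrame(
    {"ApplicantIncome": [4583, 3000], "Encoded_Loan_Status": [0, 1]},
    index=[0, 2],
)

# index=False leaves the row labels out of the output.
buffer = io.StringIO()  # stands in for a path such as 'Encoded_loan_data.csv'
sample.to_csv(buffer, index=False)

# Reading the text back yields only the data columns.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(list(restored.columns))
```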