# Exploratory Data Analysis: Loan Approval Dataset

## Goal
Explore the loan approval dataset, clean the data, perform basic analysis, and visualize trends using NumPy, Pandas, Matplotlib, and Seaborn to understand factors influencing loan approval.

## Why This Project?
- **Data Cleaning**: Practice handling missing values and inconsistencies with Pandas.
- **Statistical Analysis**: Use NumPy and Pandas for calculations like mean and median.
- **Visualization**: Create plots with Matplotlib and Seaborn to identify trends.
- **Feature Relationships**: Analyze how features like income, credit history, and education relate to loan approval.


## Step 1: Import Libraries
Load the necessary libraries for data manipulation, analysis, and visualization.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('whitegrid')

## Step 2: Load Dataset
Load the loan approval dataset and inspect the first few rows.

In [5]:
df = pd.read_csv('loan.csv')
df.head(5)

|   | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |


## Step 3: Exploring the Data
Check the dataset's structure, missing values, and basic statistics.

In [6]:
print('Dataset Shape:', df.shape)
print(df.info())
print('\nMissing Values:')
print(df.isnull().sum())
print('\nSummary Statistics:')
print(df.describe())

Dataset Shape: (614, 13)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB
None

Missing Values:
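Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64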

### Initial Insights
- **Dataset Size**: 614 records and 13 columns: an identifier (`Loan_ID`), the target (`Loan_Status`), and 11 features, both categorical (e.g., `Gender`, `Education`) and numerical (e.g., `ApplicantIncome`, `LoanAmount`).
- **Missing Values**: Several columns have missing data: `Gender` (13), `Married` (3), `Dependents` (15), `Self_Employed` (32), `LoanAmount` (22), `Loan_Amount_Term` (14), and `Credit_History` (50).
- **Numerical Features**: `ApplicantIncome` and `CoapplicantIncome` show high variability, with some extreme values (max ApplicantIncome: 81,000).
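
The missing-value counts above are easier to compare as percentages of the dataset. The sketch below is one possible way to tabulate them; `missing_summary` is a hypothetical helper (not part of the original notebook), and the small `toy` frame uses made-up values so the example runs without `loan.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the loan data (values are made up).
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [128.0, np.nan, 66.0, np.nan],
    "ApplicantIncome": [5849, 4583, 3000, 2583],
})

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only the columns that actually have gaps.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(df)` on the real data should put `Credit_History` at the top, with 50 of 614 values (roughly 8%) missing.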


## Step 4: Data Cleaning
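Handle missing values by filling categorical columns with the mode and numerical columns with the median. `Credit_History` is stored as a float but only takes the values 0 and 1, so it is treated as categorical here.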

In [None]:
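# Fill missing values for categorical columns with the mode.
# Credit_History is a 0/1 flag stored as float, so it is handled like a category.
# Assign the result back instead of using inplace=True on a column selection,
# which is deprecated chained assignment in recent pandas versions.
for col in ['Gender', 'Married', 'Dependents', 'Self_Employed', 'Credit_History']:
    df[col] = df[col].fillna(df[col].mode()[0])

# Fill missing values for numerical columns with the median
for col in ['LoanAmount', 'Loan_Amount_Term']:
    df[col] = df[col].fillna(df[col].median())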

# Verify cleaning
print('Missing Values After Cleaning:')
print(df.isnull().sum())

### Cleaning Insights
- Missing values in the categorical columns (`Gender`, `Married`, `Dependents`, `Self_Employed`) and in the binary `Credit_History` flag were filled with the mode, i.e., the most common category.
- Numerical columns (`LoanAmount`, `Loan_Amount_Term`) were filled with the median, which is less sensitive to outliers than the mean.
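
To illustrate why the median is the safer fill value when outliers are present, here is a minimal sketch on a made-up series (the numbers are hypothetical, not taken from the loan dataset). A single extreme amount pulls the mean far above the typical value, while the median stays close to it.

```python
import numpy as np
import pandas as pd

# Hypothetical loan amounts (made up) with one extreme value and one gap.
amounts = pd.Series([100.0, 120.0, 130.0, np.nan, 700.0], name="LoanAmount")

# Both statistics skip the missing entry by default.
mean_value = amounts.mean()      # pulled upward by the 700 outlier
median_value = amounts.median()  # stays near the typical amounts

mean_fill = amounts.fillna(mean_value)
median_fill = amounts.fillna(median_value)

print("Fill value using mean:  ", mean_value)
print("Fill value using median:", median_value)
```

Here the mean-based fill is more than twice the median-based fill, even though three of the four observed amounts lie between 100 and 130.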


## Step 5: Basic Analysis
Perform simple statistical analysis to explore relationships with `Loan_Status`.

In [None]:
# Loan approval rate by Credit History
approval_by_credit = df.groupby('Credit_History')['Loan_Status'].value_counts(normalize=True).unstack()
print('Loan Approval Rate by Credit History:\n', approval_by_credit)

# Average Loan Amount for approved vs. rejected loans
avg_loan_approved = df[df['Loan_Status'] == 'Y']['LoanAmount'].mean()
avg_loan_rejected = df[df['Loan_Status'] == 'N']['LoanAmount'].mean()
print(f'Average Loan Amount (Approved): {avg_loan_approved:.2f}')
print(f'Average Loan Amount (Rejected): {avg_loan_rejected:.2f}')

### Analysis Insights
- **Credit History**: Applicants with a good credit history (1.0) are more likely to be approved than those with a bad one (0.0).
- **Loan Amount**: Approved loans tend to have slightly lower average amounts than rejected ones, possibly because larger loans are judged riskier.
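
The same approval-rate comparison can be repeated for other categorical features such as `Education`. The sketch below shows one way to do that; `approval_rate` is a hypothetical helper and the `toy` records are made up, so the printed rates say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy records (made up); the real analysis would use the cleaned df.
toy = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 1.0, 1.0, 0.0, 0.0],
    "Education": ["Graduate", "Graduate", "Not Graduate",
                  "Graduate", "Not Graduate", "Graduate"],
    "Loan_Status": ["Y", "Y", "Y", "N", "N", "N"],
})

def approval_rate(frame: pd.DataFrame, feature: str) -> pd.Series:
    """Share of approved ('Y') loans within each category of `feature`."""
    approved = frame["Loan_Status"] == "Y"
    # The mean of a boolean series is the fraction of True values.
    return approved.groupby(frame[feature]).mean()

credit_rates = approval_rate(toy, "Credit_History")
education_rates = approval_rate(toy, "Education")
print(credit_rates, "\n")
print(education_rates)
```

Reducing each group to a single approval share makes it straightforward to compare several features side by side.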


## Step 6: Visualizations
Create plots to visualize relationships between features and loan approval.

In [None]:
# Count plot of Loan Status by Credit History
plt.figure(figsize=(8, 5))
sns.countplot(x='Credit_History', hue='Loan_Status', data=df, palette='Set2')
plt.title('Loan Status by Credit History')
plt.xlabel('Credit History (0 = Bad, 1 = Good)')
plt.ylabel('Count')
plt.show()

# Boxplot of Loan Amount by Loan Status
plt.figure(figsize=(8, 5))
sns.boxplot(x='Loan_Status', y='LoanAmount', data=df, palette='Set1')
plt.title('Loan Amount by Loan Status')
plt.xlabel('Loan Status (Y = Approved, N = Rejected)')
plt.ylabel('Loan Amount')
plt.show()

# Correlation Heatmap for numerical features
plt.figure(figsize=(8, 6))
corr = df.select_dtypes(include=[np.number]).corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap of Numerical Features')
plt.show()

### Insights from Visualizations
- **Credit History Count Plot**: Applicants with good credit history (1.0) have a much higher approval rate than those with bad credit (0.0).
- **Loan Amount Boxplot**: Rejected loans tend to have a wider range of amounts, with some high outliers, suggesting larger loans may face stricter scrutiny.
- **Correlation Heatmap**:
  - Moderate positive correlation (0.56) between `ApplicantIncome` and `LoanAmount`, indicating that higher-income applicants tend to request larger loans.
  - Weak correlation (0.19) between `CoapplicantIncome` and `LoanAmount`, showing limited influence.
  - `Credit_History` has low correlation with the other numerical features. `Loan_Status` is categorical and therefore absent from the heatmap, but the count plot shows that it depends strongly on `Credit_History`.
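
Because `Loan_Status` is stored as text, it is left out of the numeric correlation matrix. One possible way to include it, sketched below, is to map it to 1/0 first; the column name `Loan_Status_Num` is an assumption introduced for this example, and the `toy` values are made up.

```python
import pandas as pd

# Hypothetical toy records (made up), not rows from the loan dataset.
toy = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 1.0, 0.0, 0.0, 1.0],
    "LoanAmount": [128.0, 66.0, 120.0, 141.0, 267.0, 95.0],
    "Loan_Status": ["Y", "Y", "Y", "N", "N", "N"],
})

# Encode the target as 1 (approved) / 0 (rejected) in a new, hypothetical column.
encoded = toy.assign(Loan_Status_Num=toy["Loan_Status"].map({"Y": 1, "N": 0}))

# Pearson correlation of each numeric feature with the encoded target.
corr_with_target = (
    encoded.select_dtypes(include="number")
    .corr()["Loan_Status_Num"]
    .drop("Loan_Status_Num")
)
print(corr_with_target)
```

With a 0/1 target, the Pearson coefficient is the point-biserial correlation, so it can be read alongside the other heatmap values.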


## Step 7: Key Takeaways
- **Credit History is Key**: A good credit history is strongly associated with loan approval.
- **Income and Loan Amount**: Higher applicant incomes are associated with larger loan amounts, and the highest loan amounts appear more often among rejected applications.
- **Next Steps**: Consider encoding categorical variables and building a predictive model to further analyze loan approval factors.
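
As a starting point for the encoding step, the sketch below one-hot encodes categorical columns with `pd.get_dummies` and maps the target to 1/0. The `toy` frame is made up and uses only a few of the dataset's columns. The choice of `drop_first=True` is an assumption that suits linear models, not something the notebook prescribes.

```python
import pandas as pd

# Hypothetical toy records (made up) with a subset of the dataset's columns.
toy = pd.DataFrame({
    "Education": ["Graduate", "Not Graduate", "Graduate"],
    "Property_Area": ["Urban", "Rural", "Semiurban"],
    "ApplicantIncome": [5849, 4583, 3000],
    "Loan_Status": ["Y", "N", "Y"],
})

# Target: 1 for approved, 0 for rejected.
y = toy["Loan_Status"].map({"Y": 1, "N": 0})

# One-hot encode the categorical features; numeric columns pass through unchanged.
# drop_first=True drops one level per feature to avoid redundant columns.
X = pd.get_dummies(toy.drop(columns="Loan_Status"), drop_first=True, dtype=int)

print(X)
print(y.tolist())
```

The resulting `X` and `y` are fully numeric and could be passed to a classifier in a follow-up notebook.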
