# Loan Prediction - Data Exploration
**Date:** December 21, 2025
**Goal:** Understand the loan dataset and identify patterns

## Dataset Overview
- Source: Kaggle Loan Prediction Dataset
- Purpose: Predict loan approval based on applicant information
- Two datasets: Training data (with target) and Test data (without target)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10,6)

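# Resolve the project root (parent of the notebook's working directory).
# __file__ is not defined inside a notebook, so derive the path from the cwd.
project_root = os.path.abspath(os.path.join(os.getcwd(), os.pardir))

# Load data from project root
train_df = pd.read_csv(os.path.join(project_root, 'train_u6lujuX_CVtuZ9i.csv'))
test_df = pd.read_csv(os.path.join(project_root, 'test_Y3wMUE5_7gLdaTN.csv'))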

print(f"Training Dataset shape: {train_df.shape}")
print(f"Test Dataset shape: {test_df.shape}")
print("\n" + "="*50 + "\n")
print("Training Data - First 5 rows:")
train_df.head()

## Key Observations from First Look

**Dataset Structure:**
- Training dataset has 614 rows and 13 columns (including target variable 'Loan_Status')
- Test dataset has 367 rows and 12 columns (no target variable)
- Target variable is 'Loan_Status' with values Y/N (Yes/No for loan approval)
- Both datasets share the same 12 columns: the Loan_ID identifier plus 11 predictive features

**Features Overview:**
- **Loan_ID**: Unique identifier for each loan application
- **Gender**: Male/Female
- **Married**: Yes/No
- **Dependents**: Number of dependents (0, 1, 2, 3+)
- **Education**: Graduate/Not Graduate
- **Self_Employed**: Yes/No
- **ApplicantIncome**: Applicant's income
- **CoapplicantIncome**: Co-applicant's income
- **LoanAmount**: Loan amount requested (in thousands)
- **Loan_Amount_Term**: Term of loan (in months)
- **Credit_History**: Credit history meets guidelines (1/0)
- **Property_Area**: Urban/Semiurban/Rural

In [None]:
# Dataset information - Training Data
print("="*50)
print("TRAINING DATA INFORMATION")
print("="*50)
train_df.info()
print("\n" + "="*50 + "\n")

# Dataset information - Test Data
print("="*50)
print("TEST DATA INFORMATION")
print("="*50)
test_df.info()

In [None]:
# Missing values analysis - Training Data
print("="*50)
print("MISSING VALUES - TRAINING DATA")
print("="*50)
missing_train = train_df.isnull().sum()
missing_percent_train = (missing_train / len(train_df)) * 100
missing_df_train = pd.DataFrame({
    'Missing_Count': missing_train,
    'Percentage': missing_percent_train.round(2)
})
print(missing_df_train[missing_df_train['Missing_Count'] > 0].sort_values('Percentage', ascending=False))

print("\n" + "="*50 + "\n")

# Missing values analysis - Test Data
print("="*50)
print("MISSING VALUES - TEST DATA")
print("="*50)
missing_test = test_df.isnull().sum()
missing_percent_test = (missing_test / len(test_df)) * 100
missing_df_test = pd.DataFrame({
    'Missing_Count': missing_test,
    'Percentage': missing_percent_test.round(2)
})
print(missing_df_test[missing_df_test['Missing_Count'] > 0].sort_values('Percentage', ascending=False))

In [None]:
# Target variable distribution (Training data only)
if 'Loan_Status' in train_df.columns:
    print("="*50)
    print("LOAN STATUS DISTRIBUTION (TRAINING DATA)")
    print("="*50)
    print(train_df['Loan_Status'].value_counts())
    print("\nPercentage:")
    print(train_df['Loan_Status'].value_counts(normalize=True) * 100)
    
    # Visualization
    plt.figure(figsize=(8, 6))
    train_df['Loan_Status'].value_counts().plot(kind='bar', color=['green', 'red'])
    plt.title('Loan Approval Distribution (Training Data)', fontsize=14, fontweight='bold')
    plt.xlabel('Loan Status (Y=Approved, N=Rejected)', fontsize=12)
    plt.ylabel('Count', fontsize=12)
    plt.xticks(rotation=0)
    plt.grid(axis='y', alpha=0.3)
    plt.tight_layout()
    plt.show()

In [None]:
# Statistical summary of numeric columns - Training Data
print("="*50)
print("STATISTICAL SUMMARY - TRAINING DATA")
print("="*50)
print(train_df.describe())

print("\n" + "="*50 + "\n")

# Statistical summary of numeric columns - Test Data
print("="*50)
print("STATISTICAL SUMMARY - TEST DATA")
print("="*50)
print(test_df.describe())

In [None]:
# Compare structure (shape and column names) between train and test
print("="*50)
print("COMPARING TRAIN VS TEST DATA")
print("="*50)
print(f"\nTrain shape: {train_df.shape}")
print(f"Test shape: {test_df.shape}")
print(f"\nTrain columns: {train_df.columns.tolist()}")
print(f"Test columns: {test_df.columns.tolist()}")
print(f"\nColumns only in train: {set(train_df.columns) - set(test_df.columns)}")
print(f"Columns only in test: {set(test_df.columns) - set(train_df.columns)}")

## Initial Insights

**Missing Data (Training Set):**
- Gender: 13 missing (~2.1%)
- Married: 3 missing (~0.5%)
- Dependents: 15 missing (~2.4%)
- Self_Employed: 32 missing (~5.2%)
- LoanAmount: 22 missing (~3.6%)
- Loan_Amount_Term: 14 missing (~2.3%)
- Credit_History: 50 missing (~8.1%)

**Missing Data (Test Set):**
- Similar pattern to training data
- Need to handle missing values consistently across both datasets
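
One way to keep the handling consistent is to learn the fill values from the training set only and reuse them on the test set. The sketch below runs on small made-up frames rather than the real CSVs. The helper names `fit_fill_values` and `apply_fill_values` are hypothetical, and the median/mode strategy is an assumption for illustration, not a decision made in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for train_df / test_df (values are made up for illustration)
toy_train = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", np.nan],
    "LoanAmount": [100.0, np.nan, 200.0, 150.0],
})
toy_test = pd.DataFrame({
    "Gender": [np.nan, "Female"],
    "LoanAmount": [np.nan, 120.0],
})

def fit_fill_values(df):
    """Learn one fill value per column: median if numeric, mode otherwise."""
    fill = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            fill[col] = df[col].median()
        else:
            fill[col] = df[col].mode().iloc[0]
    return fill

def apply_fill_values(df, fill):
    """Return a copy with missing values replaced by the learned fill values."""
    return df.fillna(value=fill)

# Fit on the training frame only, then apply the same values to both frames
fill_values = fit_fill_values(toy_train)
toy_train_filled = apply_fill_values(toy_train, fill_values)
toy_test_filled = apply_fill_values(toy_test, fill_values)
```

Fitting on the training frame alone avoids leaking test-set statistics into preprocessing.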

**Target Variable (Training Data):**
- Approximately 69% of loans are approved (Y)
- Approximately 31% of loans are rejected (N)
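- Dataset is moderately imbalanced (roughly 2.2 approvals per rejection) but manageable; a model that always predicts Y would already reach about 69% accuracy, so accuracy alone is a weak yardstick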
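
With this class balance, a validation split should probably keep the Y/N proportions intact. The following is a minimal sketch of a stratified holdout using only pandas, on a made-up frame that mimics the 69/31 split. The 20% holdout size and the seed are arbitrary assumptions.

```python
import pandas as pd

# Toy target column mirroring the roughly 69/31 split (made-up rows)
toy = pd.DataFrame({
    "Loan_ID": [f"LP{i:03d}" for i in range(100)],
    "Loan_Status": ["Y"] * 69 + ["N"] * 31,
})

# Sample 20% of each class so the holdout keeps the class proportions
valid = toy.groupby("Loan_Status", group_keys=False).sample(frac=0.2, random_state=42)
train = toy.drop(valid.index)

# Class shares in each part should stay close to the original split
train_share = train["Loan_Status"].value_counts(normalize=True)
valid_share = valid["Loan_Status"].value_counts(normalize=True)
print(train_share, valid_share, sep="\n")
```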

**Data Types:**
- Numeric columns: ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, Credit_History
- Categorical columns: Gender, Married, Dependents, Education, Self_Employed, Property_Area
- Total: 5 numeric + 6 categorical features (11 predictors; Loan_ID is an identifier, not a feature)
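
Rather than maintaining these lists by hand, they can be derived from the dtypes. This is a small sketch on a made-up frame; it assumes Loan_ID is dropped first and that every non-numeric column should be treated as categorical.

```python
import numpy as np
import pandas as pd

# Made-up rows with the same kinds of columns as the loan data
toy = pd.DataFrame({
    "Loan_ID": ["LP001", "LP002"],
    "Gender": ["Male", "Female"],
    "ApplicantIncome": [5000, 3000],
    "Credit_History": [1.0, np.nan],
})

# Exclude the identifier, then split the remaining columns by dtype
features = toy.drop(columns=["Loan_ID"])
numeric_cols = features.select_dtypes(include="number").columns.tolist()
categorical_cols = features.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

Note that a dtype-based split would place Credit_History among the numeric columns even though it behaves like a binary flag.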

**Key Observations:**
- Income ranges vary significantly (potential outliers)
- Credit_History appears to be binary (0/1)
- Loan_Amount_Term mostly 360 months (30 years)
- Both datasets need similar preprocessing
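
To make "potential outliers" concrete, one common rule of thumb flags values beyond 1.5 times the interquartile range. The sketch below applies it to made-up incomes, not the real ApplicantIncome column; the 1.5 multiplier is a convention and may need adjusting for skewed income data.

```python
import pandas as pd

# Made-up incomes with one extreme value, standing in for ApplicantIncome
toy_income = pd.Series(
    [2500, 3000, 3200, 4000, 4500, 5000, 60000], name="ApplicantIncome"
)

q1 = toy_income.quantile(0.25)
q3 = toy_income.quantile(0.75)
iqr = q3 - q1

# Conventional 1.5 * IQR fences around the middle 50% of the data
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (toy_income < lower) | (toy_income > upper)

print(toy_income[is_outlier])
```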

## Questions to Investigate:

1. **Which features correlate most with loan approval?**
   - Focus on Credit_History, Income levels, Education

2. **How does income affect loan approval?**
   - Analyze ApplicantIncome + CoapplicantIncome combined

3. **Does credit history impact approval rate?**
   - Compare approval rates for Credit_History = 1 vs 0

4. **Are there any outliers in the data?**
   - Check income and loan amount distributions

5. **How do categorical variables affect approval?**
   - Gender, Married status, Education, Property_Area

6. **What's the relationship between loan amount and income?**
   - Loan-amount-to-income ratio analysis (a rough proxy for debt-to-income)
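
Question 3 can be answered with a single group-by. The sketch below shows the idea on made-up rows, so the rates it prints say nothing about the real data; on the actual notebook the same two expressions would be applied to `train_df`.

```python
import pandas as pd

# Made-up rows; column names match the loan dataset
toy = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0],
    "Loan_Status":    ["Y", "Y", "Y", "N", "N", "N", "Y", "Y"],
})

# Map the target to 1/0 so the group mean equals the approval rate
approved = toy["Loan_Status"].map({"Y": 1, "N": 0})
approval_rate = approved.groupby(toy["Credit_History"]).mean()

# A row-normalised crosstab shows the same information as a table
table = pd.crosstab(toy["Credit_History"], toy["Loan_Status"], normalize="index")

print(approval_rate)
print(table)
```

The same pattern extends to question 5 by swapping Credit_History for Gender, Married, Education, or Property_Area.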

In [None]:
# List column names and dtypes for reference
print("="*50)
print("COLUMNS IN TRAINING DATASET")
print("="*50)
for i, col in enumerate(train_df.columns, 1):
    print(f"{i}. {col} - {train_df[col].dtype}")

print("\n" + "="*50 + "\n")

print("="*50)
print("COLUMNS IN TEST DATASET")
print("="*50)
for i, col in enumerate(test_df.columns, 1):
    print(f"{i}. {col} - {test_df[col].dtype}")

## Next Steps:

1. **Data Cleaning:**
   - Handle missing values (imputation strategy)
   - Remove or cap outliers if necessary
   - Ensure consistent data types

2. **Feature Engineering:**
   - Create TotalIncome = ApplicantIncome + CoapplicantIncome
   - Create LoanAmount_to_Income ratio
   - Encode categorical variables

3. **Exploratory Data Analysis:**
   - Univariate analysis (distributions)
   - Bivariate analysis (feature vs target)
   - Correlation analysis

4. **Model Building:**
   - Split training data for validation
   - Try multiple algorithms
   - Evaluate and tune models
   - Make predictions on test set
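
The feature engineering step above could look roughly like the following sketch, shown on made-up rows. Treating zero total income as missing and the `drop_first=True` encoding are assumptions for illustration. Because LoanAmount is recorded in thousands, the ratio is best read as a relative measure rather than a true proportion.

```python
import pandas as pd

# Made-up rows; column names match the loan dataset
toy = pd.DataFrame({
    "ApplicantIncome": [5000, 3000, 0],
    "CoapplicantIncome": [0.0, 1500.0, 0.0],
    "LoanAmount": [100.0, 90.0, 50.0],
    "Education": ["Graduate", "Not Graduate", "Graduate"],
})

toy["TotalIncome"] = toy["ApplicantIncome"] + toy["CoapplicantIncome"]

# Guard against division by zero: a zero total income becomes NaN in the ratio
safe_income = toy["TotalIncome"].where(toy["TotalIncome"] > 0)
toy["LoanAmount_to_Income"] = toy["LoanAmount"] / safe_income

# One-hot encode a categorical column; drop_first removes the redundant level
encoded = pd.get_dummies(toy, columns=["Education"], drop_first=True)

print(encoded)
```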