# Loan Approval Prediction System

## Project Overview

This notebook demonstrates a comprehensive machine learning pipeline for predicting loan approval status. The system analyzes applicant information to determine whether a loan should be approved or rejected using various machine learning algorithms.

### Key Features:
- **Data Preprocessing**: Comprehensive data cleaning and feature engineering
- **Model Comparison**: Evaluation of multiple ML algorithms
- **Hyperparameter Tuning**: Automated optimization for best performance
- **Model Persistence**: Save and load trained models
- **API Integration**: FastAPI-based REST API for real-time predictions

---

## 1. Import Libraries and Setup

First, we'll import all necessary libraries and set up our environment.

In [None]:
# Data handling, plotting, and standard-library imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from pathlib import Path

# Machine learning imports
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Custom modules
from src.data_preprocessing import DataPreprocessor
from src.model_training import ModelTrainer
from src.utils import create_sample_data, decode_prediction, print_data_info
from src.config import DATA_FILE, MODEL_FILE, SCALER_FILE

# Configuration
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Set random seed for reproducibility
np.random.seed(42)

print("✅ All libraries imported successfully!")
print("📊 Environment setup complete!")

## 2. Data Loading and Exploration

Let's load the dataset and explore its structure to understand the data we're working with.

In [None]:
# Initialize data preprocessor
preprocessor = DataPreprocessor()

# Load the dataset
df = preprocessor.load_data(DATA_FILE)

# Display basic information about the dataset
print("📋 Dataset Overview:")
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

# Display first few rows
print("\n🔍 First 5 rows:")
df.head()

In [None]:
# Get comprehensive data information
print_data_info(df)

## 3. Data Preprocessing

Now we'll clean the data, handle missing values, and prepare it for machine learning.

In [None]:
# Check missing values before preprocessing
print("❌ Missing Values (Before Preprocessing):")
print(df.isnull().sum())

# Handle missing values
df_clean = preprocessor.handle_missing_values(df)

# Check missing values after preprocessing
print("\n✅ Missing Values (After Preprocessing):")
print(df_clean.isnull().sum())

print(f"\n📊 Data shape after cleaning: {df_clean.shape}")
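
The body of `handle_missing_values` is not shown in this notebook. As a rough sketch of one common strategy (median for numeric columns, most frequent value for everything else), the hypothetical helper `impute_missing` below runs on a toy frame whose values are made up for illustration:

```python
import numpy as np
import pandas as pd


def impute_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric gaps with the column median and other gaps with the mode.

    Hypothetical helper: the real DataPreprocessor may use a different rule.
    """
    filled = frame.copy()
    for col in filled.columns:
        if not filled[col].isnull().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(filled[col]):
            filled[col] = filled[col].fillna(filled[col].median())
        else:
            # mode() ignores missing values; take the first mode on ties
            filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
    return filled


# Toy data with made-up values, only to exercise the helper
toy = pd.DataFrame({
    "Gender": ["Male", None, "Male", "Female"],
    "LoanAmount": [100.0, np.nan, 140.0, 120.0],
})
toy_clean = impute_missing(toy)
print(toy_clean)
```

A column that is entirely missing has no mode, so this sketch would raise an `IndexError` there; such columns are usually better dropped than imputed.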

In [None]:
# Explore unique values in categorical columns
categorical_columns = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area', 'Loan_Status']

print("🏷️ Unique Values in Categorical Columns:")
for col in categorical_columns:
    if col in df_clean.columns:
        unique_values = df_clean[col].unique()
        print(f"{col}: {unique_values}")

In [None]:
# Encode categorical variables
df_encoded = preprocessor.encode_categorical_variables(df_clean)

print("🔄 Categorical Variables Encoded:")
print(df_encoded.head())
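
How `encode_categorical_variables` maps labels to integers is not visible here. One way such an encoding could look, assuming explicit dictionaries, is sketched below. `ASSUMED_MAPPINGS`, the label spellings, and `encode_with_mappings` are all hypothetical and should be checked against the unique values printed in the previous cell:

```python
import pandas as pd

# Hypothetical label-to-integer mappings; the real ones live in DataPreprocessor
ASSUMED_MAPPINGS = {
    "Gender": {"Female": 0, "Male": 1},
    "Married": {"No": 0, "Yes": 1},
    "Loan_Status": {"N": 0, "Y": 1},
}


def encode_with_mappings(frame: pd.DataFrame, mappings: dict) -> pd.DataFrame:
    """Replace category labels with integers, failing loudly on unknown labels."""
    encoded = frame.copy()
    for col, mapping in mappings.items():
        if col not in encoded.columns:
            continue
        unknown = set(encoded[col].dropna().unique()) - set(mapping)
        if unknown:
            raise ValueError(f"Unmapped values in {col}: {sorted(unknown)}")
        encoded[col] = encoded[col].map(mapping)
    return encoded


# Toy rows with made-up values
toy = pd.DataFrame({
    "Gender": ["Male", "Female"],
    "Married": ["Yes", "No"],
    "Loan_Status": ["Y", "N"],
})
toy_encoded = encode_with_mappings(toy, ASSUMED_MAPPINGS)
print(toy_encoded)
```

Raising on unmapped labels avoids the silent `NaN` values that `Series.map` would otherwise produce for categories missing from the dictionary.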

## 4. Exploratory Data Analysis

Let's visualize the data to gain insights before model training.

In [None]:
# Create visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('Loan Data Analysis', fontsize=16, fontweight='bold')

# 1. Loan Status Distribution
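loan_status_counts = df_clean['Loan_Status'].value_counts()
# value_counts() sorts by frequency, so derive labels from the index instead of hard-coding their order
status_labels = {'Y': 'Approved (Y)', 'N': 'Rejected (N)'}
pie_labels = [status_labels.get(status, str(status)) for status in loan_status_counts.index]
axes[0, 0].pie(loan_status_counts.values, labels=pie_labels, autopct='%1.1f%%', startangle=90)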
axes[0, 0].set_title('Loan Status Distribution')

# 2. Income Distribution
axes[0, 1].hist(df_clean['ApplicantIncome'], bins=30, alpha=0.7, label='Applicant Income')
axes[0, 1].hist(df_clean['CoapplicantIncome'], bins=30, alpha=0.7, label='Coapplicant Income')
axes[0, 1].set_title('Income Distribution')
axes[0, 1].set_xlabel('Income')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].legend()

# 3. Loan Amount vs Loan Status
sns.boxplot(data=df_clean, x='Loan_Status', y='LoanAmount', ax=axes[1, 0])
axes[1, 0].set_title('Loan Amount by Loan Status')

# 4. Education vs Loan Status
education_loan = pd.crosstab(df_clean['Education'], df_clean['Loan_Status'])
education_loan.plot(kind='bar', ax=axes[1, 1])
axes[1, 1].set_title('Education vs Loan Status')
axes[1, 1].set_xlabel('Education')
axes[1, 1].set_ylabel('Count')
axes[1, 1].legend(['Rejected', 'Approved'])

plt.tight_layout()
plt.show()
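
The count-based bar chart in panel 4 can hide differences in approval rate when the education groups differ in size. A minimal sketch of row-normalizing the same kind of crosstab follows, using a toy frame with made-up labels and values rather than `df_clean`:

```python
import pandas as pd

# Toy data; the labels and counts are invented for illustration
toy = pd.DataFrame({
    "Education": ["Graduate", "Graduate", "Graduate", "Not Graduate", "Not Graduate"],
    "Loan_Status": ["Y", "Y", "N", "Y", "N"],
})

# normalize="index" turns each row into shares that sum to 1
approval_share = pd.crosstab(toy["Education"], toy["Loan_Status"], normalize="index")
print(approval_share.round(2))
```

Plotting `approval_share` with `kind='bar', stacked=True` would show the approval share per group directly.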

## 5. Feature Engineering and Data Splitting

Now we'll prepare the features and target variable for machine learning.

In [None]:
# Separate features and target
X, y = preprocessor.prepare_features_target(df_encoded)

print("🎯 Features and Target Separated:")
print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"Feature columns: {list(X.columns)}")

# Display target distribution
print(f"\n📊 Target Distribution:")
print(y.value_counts())
print(f"Approval rate: {y.mean():.2%}")

In [None]:
# Scale numerical features
X_scaled = preprocessor.scale_numerical_features(X, fit_scaler=True)

print("⚖️ Numerical Features Scaled:")
print("Numerical columns scaled:", preprocessor.numerical_columns)
print("\nScaled features preview:")
print(X_scaled.head())
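
The cell above fits the scaler on the full feature matrix. Whether `ModelTrainer` splits the data internally is not shown in this notebook, so the following is only a sketch of one common leakage-free ordering: split first with `train_test_split`, then fit `StandardScaler` on the training portion alone. The synthetic data and every `*_demo` name are assumptions made for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "ApplicantIncome": rng.normal(5000, 1500, size=200),
    "LoanAmount": rng.normal(150, 40, size=200),
})
y_demo = pd.Series(rng.integers(0, 2, size=200), name="Loan_Status")

# Split first so the test rows never influence the scaler statistics
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

scaler = StandardScaler()
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index
)
# Reuse the training mean and standard deviation on the held-out rows
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X_test.columns, index=X_test.index
)

print(X_train_scaled.shape, X_test_scaled.shape)
```

With this ordering, the held-out rows are transformed using statistics they did not contribute to, which keeps the test score an honest estimate.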

## 6. Model Training and Comparison

We'll train multiple machine learning models and compare their performance.

In [None]:
# Initialize model trainer
trainer = ModelTrainer()

print("🤖 Available Models:")
for name in trainer.models.keys():
    print(f"  • {name}")

print("\n🏃‍♂️ Starting model comparison...")
print("=" * 60)

In [None]:
# Compare all models
model_scores = trainer.compare_models(X_scaled, y)

# Create visualization of model performance
plt.figure(figsize=(12, 6))
models = list(model_scores.keys())
scores = list(model_scores.values())

bars = plt.bar(models, scores, color=['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4', '#FFEAA7'])
plt.title('Model Performance Comparison', fontsize=16, fontweight='bold')
plt.xlabel('Models', fontsize=12)
plt.ylabel('Cross-Validation Score', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.ylim(0, 1)

# Add score labels on bars
for bar, score in zip(bars, scores):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
             f'{score:.3f}', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()
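
The internals of `compare_models` are not shown. Given that the y-axis is labelled as a cross-validation score, a plausible sketch is a loop over `cross_val_score`, written here on synthetic data. The candidate dictionary, fold count, and scoring metric are assumptions, not the values used by `ModelTrainer`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the loan features
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Putting the scaler inside the pipeline refits it on each training fold
candidates = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
demo_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy").mean()
    for name, model in candidates.items()
}

for name, score in sorted(demo_scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.4f}")
```

Sharing one `StratifiedKFold` object across models means every candidate is scored on the same folds, which makes the comparison fairer.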

## 7. Hyperparameter Tuning

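We'll tune hyperparameters for a fixed shortlist of models (Random Forest, Logistic Regression, and Support Vector Machine), regardless of which one scored highest in the comparison above.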

In [None]:
# Get the best model from comparison
best_model_name = max(model_scores, key=model_scores.get)
print(f"🏆 Best Model: {best_model_name} (Score: {model_scores[best_model_name]:.4f})")

# Tune hyperparameters for a fixed shortlist of models
models_to_tune = ['Random Forest', 'Logistic Regression', 'Support Vector Machine']

tuned_models = {}
for model_name in models_to_tune:
    if model_name in trainer.models:
        print(f"\n🔧 Tuning {model_name}...")
        tuned_model = trainer.tune_hyperparameters(model_name, X_scaled, y)
        tuned_models[model_name] = tuned_model
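
What `tune_hyperparameters` searches over is not visible in this notebook. A minimal sketch of one standard approach, an exhaustive grid search with `GridSearchCV`, is shown below on synthetic data. The parameter grid is an assumption chosen to keep the example fast, not the grid used by `ModelTrainer`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the scaled loan features
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# Small illustrative grid (4 combinations x 3 folds = 12 fits)
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")
```

After fitting, `search.best_estimator_` holds a model refit on all of the supplied data with the winning parameters.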

## 8. Final Model Selection and Evaluation

Let's select the best model and evaluate its performance in detail.

In [None]:
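# Use the tuned Random Forest as the final model. This choice is hard-coded,
# so compare it against best_model_name from the previous section.
final_model_name = 'Random Forest'
final_model = tuned_models.get(final_model_name, trainer.models[final_model_name])

print(f"🎖️ Final Model Selected: {final_model_name}")
if best_model_name != final_model_name:
    print(f"⚠️ Note: {best_model_name} scored highest in the untuned comparison.")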
print("\n📊 Detailed Model Evaluation:")
print("=" * 50)

# Get detailed evaluation
evaluation = trainer.get_detailed_evaluation(final_model, X_scaled, y)

print(f"Accuracy: {evaluation['accuracy']:.4f}")
print(f"\nClassification Report:")
print(evaluation['classification_report'])

In [None]:
# Visualize confusion matrix
plt.figure(figsize=(8, 6))
conf_matrix = evaluation['confusion_matrix']
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Rejected', 'Approved'],
            yticklabels=['Rejected', 'Approved'])
plt.title('Confusion Matrix', fontsize=16, fontweight='bold')
plt.xlabel('Predicted', fontsize=12)
plt.ylabel('Actual', fontsize=12)
plt.show()

# Calculate and display additional metrics
tn, fp, fn, tp = conf_matrix.ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_score = 2 * (precision * recall) / (precision + recall)

print(f"\n📈 Additional Metrics:")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1_score:.4f}")
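
The formulas above divide by zero when the model never predicts the positive class, or when the data contains no positive cases. A small sketch of the same arithmetic with that guard added follows; `binary_metrics` is a hypothetical helper and the counts are made up:

```python
def binary_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Precision, recall, and F1 for the positive class, returning 0.0 on empty denominators."""

    def safe_div(numerator: float, denominator: float) -> float:
        return numerator / denominator if denominator else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up confusion-matrix counts
metrics = binary_metrics(tn=50, fp=10, fn=5, tp=35)
print({name: round(value, 4) for name, value in metrics.items()})

# Degenerate case: no positive predictions and no true positives
print(binary_metrics(tn=10, fp=0, fn=5, tp=0))
```

Returning 0.0 in the degenerate case mirrors a common convention; raising an error instead would also be defensible.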

## 9. Feature Importance Analysis

Let's analyze which features are most important for loan approval prediction.

In [None]:
# Plot feature importance
if hasattr(final_model, 'feature_importances_'):
    feature_importance = pd.DataFrame({
        'feature': X_scaled.columns,
        'importance': final_model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    plt.figure(figsize=(10, 8))
    sns.barplot(data=feature_importance, x='importance', y='feature', palette='viridis')
    plt.title('Feature Importance Analysis', fontsize=16, fontweight='bold')
    plt.xlabel('Importance Score', fontsize=12)
    plt.ylabel('Features', fontsize=12)
    
    # Add importance values on bars
    for i, (feature, importance) in enumerate(zip(feature_importance['feature'], feature_importance['importance'])):
        plt.text(importance + 0.002, i, f'{importance:.3f}', va='center', fontweight='bold')
    
    plt.tight_layout()
    plt.show()
    
    print("🔍 Top 5 Most Important Features:")
    for i, (feature, importance) in enumerate(feature_importance.head().values):
        print(f"{i+1}. {feature}: {importance:.4f}")
else:
    print("❌ Feature importance not available for this model type.")
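
For model types that land in the `else` branch (Logistic Regression and SVC have no `feature_importances_` attribute), permutation importance is one model-agnostic alternative. The sketch below uses synthetic data and placeholder feature names; it is an illustration of the technique, not part of the project code:

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data; the real notebook would pass its own features and target
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Shuffle each feature on held-out data and measure the drop in score
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=42)

ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    print(
        f"feature_{idx}: {result.importances_mean[idx]:.4f} "
        f"+/- {result.importances_std[idx]:.4f}"
    )
```

Because the score drop is measured on held-out rows, this ranking reflects what the model relies on for generalization rather than what it memorized during training.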

## 10. Model Persistence

Save the trained model and preprocessing objects for future use.

In [None]:
# Save the final model
trainer.save_model(final_model, MODEL_FILE)

# Save the scaler
from src.utils import save_preprocessing_objects
save_preprocessing_objects(preprocessor.scaler, SCALER_FILE)

print("💾 Model and preprocessing objects saved successfully!")
print(f"📁 Model saved to: {MODEL_FILE}")
print(f"📁 Scaler saved to: {SCALER_FILE}")
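
How `save_model` and `save_preprocessing_objects` serialize their inputs is not shown. A common choice for scikit-learn objects is `joblib`, and bundling the scaler and model into a single pipeline keeps the two artifacts from drifting apart. The following round-trip sketch assumes that approach; the file name and the synthetic data are hypothetical:

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data and a small pipeline standing in for the real model
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=42)
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    artifact_path = Path(tmp_dir) / "loan_pipeline.joblib"  # hypothetical file name
    joblib.dump(pipeline, artifact_path)
    restored = joblib.load(artifact_path)

# The restored pipeline should reproduce the original predictions exactly
same_predictions = bool((restored.predict(X_demo) == pipeline.predict(X_demo)).all())
print("Round trip preserved predictions:", same_predictions)
```

Serialized scikit-learn objects are generally only safe to load with the same library version that wrote them, so recording that version next to the artifact is a sensible habit.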

## 11. Model Testing with Sample Data

Let's test our model with sample data to ensure it works correctly.

In [None]:
# Test with sample data
print("🧪 Testing Model with Sample Data:")
print("=" * 50)

# Test approved case
approved_sample = create_sample_data("approved")
approved_processed = preprocessor.preprocess_prediction_data(approved_sample)
approved_prediction = final_model.predict(approved_processed)

print("✅ Sample Case 1 (Expected: Approved)")
print(f"Input: {approved_sample.iloc[0].to_dict()}")
print(f"Prediction: {decode_prediction(approved_prediction[0])}")

# Test rejected case
rejected_sample = create_sample_data("rejected")
rejected_processed = preprocessor.preprocess_prediction_data(rejected_sample)
rejected_prediction = final_model.predict(rejected_processed)

print("\n❌ Sample Case 2 (Expected: Rejected)")
print(f"Input: {rejected_sample.iloc[0].to_dict()}")
print(f"Prediction: {decode_prediction(rejected_prediction[0])}")

## 12. API Integration Testing

This cell does not start or call the FastAPI server. It checks that a payload in the API's request format passes through the same preprocessing and model used above.

In [None]:
# Test API compatibility
print("🌐 API Integration Test:")
print("=" * 30)

# Sample API request data
api_request = {
    "Gender": 1,
    "Married": 1,
    "Dependents": 0,
    "Education": 1,
    "Self_Employed": 0,
    "ApplicantIncome": 5000,
    "CoapplicantIncome": 2000,
    "LoanAmount": 150,
    "Loan_Amount_Term": 360,
    "Credit_History": 1,
    "Property_Area": 2
}

# Convert to DataFrame and process
api_df = pd.DataFrame([api_request])
api_processed = preprocessor.preprocess_prediction_data(api_df)
api_prediction = final_model.predict(api_processed)

print("📝 API Request Format:")
print(api_request)
print(f"\n🎯 API Response: {{'Loan status': '{decode_prediction(api_prediction[0])}'}}")

print("\n✅ API integration test completed successfully!")
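
Before a payload reaches `preprocess_prediction_data`, it can help to confirm that every expected field is present and numeric. The sketch below uses only the standard library; the field list is copied from `api_request` above, while `validate_payload` itself is a hypothetical helper and not part of `app.py`:

```python
# Field names taken from the sample request in this section
EXPECTED_FIELDS = [
    "Gender", "Married", "Dependents", "Education", "Self_Employed",
    "ApplicantIncome", "CoapplicantIncome", "LoanAmount",
    "Loan_Amount_Term", "Credit_History", "Property_Area",
]


def validate_payload(payload: dict) -> dict:
    """Return the payload with fields in the expected order, or raise ValueError."""
    missing = [field for field in EXPECTED_FIELDS if field not in payload]
    if missing:
        raise ValueError(f"Missing fields: {missing}")

    extra = sorted(set(payload) - set(EXPECTED_FIELDS))
    if extra:
        raise ValueError(f"Unexpected fields: {extra}")

    # bool is a subclass of int, so reject it explicitly
    non_numeric = [
        field for field in EXPECTED_FIELDS
        if isinstance(payload[field], bool) or not isinstance(payload[field], (int, float))
    ]
    if non_numeric:
        raise ValueError(f"Non-numeric fields: {non_numeric}")

    return {field: payload[field] for field in EXPECTED_FIELDS}


sample_payload = {
    "Gender": 1, "Married": 1, "Dependents": 0, "Education": 1,
    "Self_Employed": 0, "ApplicantIncome": 5000, "CoapplicantIncome": 2000,
    "LoanAmount": 150, "Loan_Amount_Term": 360, "Credit_History": 1,
    "Property_Area": 2,
}
validated = validate_payload(sample_payload)
print(list(validated))
```

Returning the fields in a fixed order also protects against column-order mismatches when the dictionary is turned into a DataFrame for prediction.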

## 13. Project Summary

Let's summarize the key findings and results of our loan approval prediction project.

In [None]:
print("📊 PROJECT SUMMARY")
print("=" * 50)

print(f"📁 Dataset: {DATA_FILE}")
print(f"📈 Dataset Size: {df.shape[0]} rows, {df.shape[1]} columns")
print(f"🎯 Target Variable: Loan_Status (Approval Rate: {y.mean():.2%})")

print(f"\n🤖 Models Evaluated: {len(trainer.models)}")
for name, score in sorted(model_scores.items(), key=lambda x: x[1], reverse=True):
    print(f"  • {name}: {score:.4f}")

print(f"\n🏆 Best Model: {type(final_model).__name__}")
print(f"🎯 Final Accuracy: {evaluation['accuracy']:.4f}")
print(f"📊 Precision: {precision:.4f}")
print(f"📊 Recall: {recall:.4f}")
print(f"📊 F1-Score: {f1_score:.4f}")

if hasattr(final_model, 'feature_importances_'):
    top_feature = feature_importance.iloc[0]['feature']
    top_importance = feature_importance.iloc[0]['importance']
    print(f"\n🔍 Most Important Feature: {top_feature} ({top_importance:.4f})")

print(f"\n💾 Model Saved: {MODEL_FILE}")
print(f"💾 Scaler Saved: {SCALER_FILE}")
print(f"🌐 API Ready: app.py")

print("\n✅ Project completed successfully!")
print("🚀 Ready for deployment and portfolio showcase!")

## 14. Next Steps

### For Production Deployment:
1. **API Testing**: Test the FastAPI application with Postman or curl
2. **Model Monitoring**: Implement monitoring for model drift
3. **Data Validation**: Add input validation and error handling
4. **Documentation**: Update API documentation with examples
5. **Containerization**: Create Docker container for easy deployment

### For Portfolio Enhancement:
1. **Visualizations**: Create interactive dashboards with Plotly/Dash
2. **Web Interface**: Build a user-friendly web interface
3. **Model Explainability**: Add SHAP or LIME explanations
4. **A/B Testing**: Compare multiple deployed models side by side
5. **Real-time Predictions**: Add streaming prediction capabilities

---

**This notebook demonstrates an end-to-end machine learning pipeline for a portfolio showcase; the production steps listed above remain before it is ready for real-world deployment.**