# Loan Approval Prediction - Professional ML Pipeline

## Project Overview
This notebook demonstrates a comprehensive machine learning pipeline for predicting loan approval decisions. The project focuses on:

- **Binary Classification**: Predicting loan approval (Approved/Rejected)
- **Imbalanced Data Handling**: Using SMOTE and other techniques
- **Model Comparison**: Logistic Regression vs Decision Tree vs Ensemble Methods
- **Production Readiness**: Complete pipeline with API deployment

## Business Context
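In loan approval, approving a bad loan is the costliest error, so the pipeline prioritizes avoiding false approvals while keeping false rejections of good applicants reasonably low. Which metric captures this depends on the positive class: with `Approved` as the positive class, false approvals are Type I errors and reduce **precision**; with `Rejected` as the positive class (the encoding logged in Section 3), false approvals are Type II errors and reduce **recall**.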

---

## 1. Environment Setup and Data Loading

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Import custom modules
from src.data_loader import DataLoader
from src.eda_analyzer import EDAAnalyzer
from src.visualizer import LoanDataVisualizer
from src.data_preprocessor import LoanDataPreprocessor
from src.imbalance_handler import ImbalanceHandler
from src.model_trainer import ModelTrainer
from src.model_evaluator import ModelEvaluator
from src.evaluation_visualizer import EvaluationVisualizer
from src.production_pipeline import ProductionPipeline

print("✅ All modules imported successfully")

2025-08-06 17:01:46.893 | INFO     | src.config_loader:_load_config:29 - Configuration loaded successfully from config\config.yaml


✅ All modules imported successfully


In [2]:
# Load and validate data
data_loader = DataLoader()
df, data_summary = data_loader.load_and_validate()

print(f"Dataset Shape: {df.shape}")
print(f"Target Distribution: {data_summary['target_distribution']}")
print(f"Missing Values: {sum(data_summary['missing_values'].values())}")

# Display first few rows
df.head()

2025-08-06 17:01:47.272 | INFO     | src.data_loader:load_raw_data:43 - Target values after cleaning: ['Approved' 'Rejected']
2025-08-06 17:01:47.273 | INFO     | src.data_loader:load_raw_data:45 - Cleaned column names: ['loan_id', 'no_of_dependents', 'education', 'self_employed', 'income_annum', 'loan_amount', 'loan_term', 'cibil_score', 'residential_assets_value', 'commercial_assets_value', 'luxury_assets_value', 'bank_asset_value', 'loan_status']
2025-08-06 17:01:47.274 | INFO     | src.data_loader:load_raw_data:47 - Data loaded successfully. Shape: (4269, 13)

Dataset Shape: (4269, 13)
Target Distribution: {'Approved': 2656, 'Rejected': 1613}
Missing Values: 0


|   | loan_id | no_of_dependents | education | self_employed | income_annum | loan_amount | loan_term | cibil_score | residential_assets_value | commercial_assets_value | luxury_assets_value | bank_asset_value | loan_status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 2 | Graduate | No | 9600000 | 29900000 | 12 | 778 | 2400000 | 17600000 | 22700000 | 8000000 | Approved |
| 1 | 2 | 0 | Not Graduate | Yes | 4100000 | 12200000 | 8 | 417 | 2700000 | 2200000 | 8800000 | 3300000 | Rejected |
| 2 | 3 | 3 | Graduate | No | 9100000 | 29700000 | 20 | 506 | 7100000 | 4500000 | 33300000 | 12800000 | Rejected |
| 3 | 4 | 3 | Graduate | No | 8200000 | 30700000 | 8 | 467 | 18200000 | 3300000 | 23300000 | 7900000 | Rejected |
| 4 | 5 | 5 | Not Graduate | Yes | 9800000 | 24200000 | 20 | 382 | 12400000 | 8200000 | 29400000 | 5000000 | Rejected |


## 2. Exploratory Data Analysis

In [3]:
# Comprehensive EDA
eda_analyzer = EDAAnalyzer(df)
eda_results = eda_analyzer.run_comprehensive_eda()

# Create visualizations
visualizer = LoanDataVisualizer(df)
visualizer.create_eda_report(eda_results)

print("✅ EDA completed - check the visualizations above")

2025-08-06 17:01:47.322 | INFO     | src.eda_analyzer:run_comprehensive_eda:192 - Starting comprehensive EDA analysis...
2025-08-06 17:01:47.327 | INFO     | src.eda_analyzer:analyze_missing_values:52 - Total missing values: 0
2025-08-06 17:01:47.327 | INFO     | src.eda_analyzer:analyze_missing_values:53 - Rows with missing values: 0
2025-08-06 17:01:47.329 | INFO     | src.eda_analyzer:analyze_target_distribution:71 - Target distribution: {'Approved': 2656, 'Rejected': 1613}
2025-08-06 17:01:47.330 | INFO     | src.eda_analyzer:analyze_target_distribution:72 - Imbalance ratio: 0.607
2025-08-06 17:01:47.367 | INFO     | src.eda_analyzer:run_comprehensive_eda:207 - EDA analysis compl

✅ EDA completed - check the visualizations above
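The logged imbalance ratio of 0.607 is consistent with dividing the minority class count by the majority class count. The sketch below reproduces that arithmetic from the counts reported above; the function name `imbalance_ratio` and the minority/majority definition are assumptions, since the internals of `EDAAnalyzer` are not shown.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return minority count / majority count (1.0 means perfectly balanced)."""
    counts = Counter(labels)
    return min(counts.values()) / max(counts.values())


# Rebuild the label column from the counts reported in the EDA log.
labels = ["Approved"] * 2656 + ["Rejected"] * 1613
ratio = imbalance_ratio(labels)
print(f"Imbalance ratio: {ratio:.3f}")
```

A ratio this close to 1 indicates a mild imbalance, which matches the `Severely imbalanced: False` message logged in Section 4.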


## 3. Data Preprocessing and Feature Engineering

In [4]:
# Data preprocessing pipeline
preprocessor = LoanDataPreprocessor()
X_train, X_val, X_test, y_train, y_val, y_test = preprocessor.process_full_pipeline(df)

print(f"Training Set: {X_train.shape}")
print(f"Validation Set: {X_val.shape}")
print(f"Test Set: {X_test.shape}")
print(f"Feature Names: {len(preprocessor.feature_names)} features")

print("\n✅ Data preprocessing completed")

2025-08-06 17:01:51.680 | INFO     | src.data_preprocessor:process_full_pipeline:301 - Starting complete preprocessing pipeline...
2025-08-06 17:01:51.687 | INFO     | src.data_preprocessor:engineer_features:140 - Feature engineering completed. Added 6 new features
2025-08-06 17:01:51.690 | INFO     | src.data_preprocessor:prepare_target_variable:160 - Target variable encoded: {'Approved': np.int64(0), 'Rejected': np.int64(1)}
2025-08-06 17:01:51.698 | INFO     | src.data_preprocessor:split_data:196 - Data split completed:
2025-08-06 17:01:51.699 | INFO     | src.data_preprocessor:split_data:197 -   Train: 2774 samples

Training Set: (2774, 20)
Validation Set: (641, 20)
Test Set: (854, 20)
Feature Names: 20 features

✅ Data preprocessing completed
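The three set sizes above are consistent with a two-stage split: 20% of the data held out for testing, then 15% of the original data carved out of the remainder for validation. The sketch below reproduces the sizes under that assumption. The fractions and the round-up rule for held-out sizes are inferred from the numbers, not read from `LoanDataPreprocessor`, and `two_stage_split_sizes` is a hypothetical helper.

```python
import math


def two_stage_split_sizes(n_samples, val_fraction=0.15, test_fraction=0.20):
    """Return (n_train, n_val, n_test) for a test split followed by a validation split."""
    # Stage 1: hold out the test set, rounding the held-out size up (assumed).
    n_test = math.ceil(test_fraction * n_samples)
    n_remaining = n_samples - n_test
    # Stage 2: the validation share must be rescaled to the remaining data.
    relative_val = val_fraction / (1.0 - test_fraction)
    n_val = math.ceil(relative_val * n_remaining)
    n_train = n_remaining - n_val
    return n_train, n_val, n_test


sizes = two_stage_split_sizes(4269)
print(sizes)
```

Rescaling the validation fraction in the second stage is the step that is easy to miss: applying 0.15 directly to the remaining 3415 rows would yield a smaller validation set than intended.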


## 4. Class Imbalance Analysis and Handling

In [5]:
# Handle class imbalance
imbalance_handler = ImbalanceHandler()

# Analyze original distribution
original_distribution = imbalance_handler.analyze_class_distribution(y_train)
print("Original Class Distribution:")
print(f"  Class Counts: {original_distribution['class_counts']}")
print(f"  Imbalance Ratio: {original_distribution['imbalance_ratio']:.3f}")

# Apply resampling
X_train_balanced, y_train_balanced, resampling_method = imbalance_handler.apply_best_resampling(X_train, y_train)

# Analyze new distribution
new_distribution = imbalance_handler.analyze_class_distribution(y_train_balanced)
print(f"\nAfter {resampling_method}:")
print(f"  Class Counts: {new_distribution['class_counts']}")
print(f"  Imbalance Ratio: {new_distribution['imbalance_ratio']:.3f}")

print("\n✅ Class imbalance handling completed")

2025-08-06 17:01:51.733 | INFO     | src.imbalance_handler:analyze_class_distribution:53 - Class distribution analysis:
2025-08-06 17:01:51.734 | INFO     | src.imbalance_handler:analyze_class_distribution:54 -   Class counts: {0: 1726, 1: 1048}
2025-08-06 17:01:51.734 | INFO     | src.imbalance_handler:analyze_class_distribution:55 -   Imbalance ratio: 0.607
2025-08-06 17:01:51.735 | INFO     | src.imbalance_handler:analyze_class_distribution:56 -   Severely imbalanced: False
2025-08-06 17:01:51.736 | INFO     | src.imbalance_handler:analyze_class_distribution:53 - Class distribution analysis:

Original Class Distribution:
  Class Counts: {0: 1726, 1: 1048}
  Imbalance Ratio: 0.607


2025-08-06 17:01:53.116 | INFO     | src.imbalance_handler:apply_smote:79 - SMOTE applied:
2025-08-06 17:01:53.117 | INFO     | src.imbalance_handler:apply_smote:80 -   Original shape: (2774, 20)
2025-08-06 17:01:53.118 | INFO     | src.imbalance_handler:apply_smote:81 -   Resampled shape: (3452, 20)
2025-08-06 17:01:53.118 | INFO     | src.imbalance_handler:apply_smote:82 -   New class distribution: Counter({0: 1726, 1: 1726})
2025-08-06 17:01:53.119 | INFO     | src.imbalance_handler:apply_best_resampling:219 - Applied resampling method: smote
2025-08-06 17:01:53.120 | INFO     | src.imbalance_handler:analyze_class_distribution:53 - Class distribution analysis:


After smote:
  Class Counts: {0: 1726, 1: 1726}
  Imbalance Ratio: 1.000

✅ Class imbalance handling completed
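SMOTE balances the classes by creating synthetic minority samples rather than duplicating existing ones: each new point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. The following is a simplified, standard-library illustration of that interpolation step. It is not the implementation used by `ImbalanceHandler`, and `smote_like` and the toy points are hypothetical.

```python
import math
import random


def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(b + lam * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy minority class in two dimensions.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
new_points = smote_like(minority, n_new=10, k=2)

# Number of synthetic rows needed to reach the balance reported above.
n_needed = 1726 - 1048
print(n_needed, 2774 + n_needed)
```

The arithmetic at the end matches the log: 678 synthetic rows bring the minority class to 1726 and the training set to 3452 rows.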


## 5. Model Training and Hyperparameter Tuning

In [6]:
# Train multiple models
trainer = ModelTrainer()
training_results = trainer.train_all_models(X_train_balanced, y_train_balanced, X_val, y_val)

print("Training Results Summary:")
for model_name, results in training_results.items():
    cv_scores = results['cv_scores']
    print(f"\n{model_name.upper()}:")
    for metric, scores in cv_scores.items():
        mean_score = np.mean(scores)
        std_score = np.std(scores)
        print(f"  {metric}: {mean_score:.4f} (+/- {std_score * 2:.4f})")

print("\n✅ Model training completed")

2025-08-06 17:01:53.132 | INFO     | src.model_trainer:train_all_models:190 - Training 4 models: ['logistic_regression', 'decision_tree', 'random_forest', 'gradient_boosting']
2025-08-06 17:01:53.132 | INFO     | src.model_trainer:train_single_model:116 - Training logistic_regression...
2025-08-06 17:01:53.133 | INFO     | src.model_trainer:perform_hyperparameter_tuning:94 - Starting hyperparameter tuning for logistic_regression...


Fitting 5 folds for each of 16 candidates, totalling 80 fits


2025-08-06 17:01:56.639 | INFO     | src.model_trainer:perform_hyperparameter_tuning:97 - Best parameters for logistic_regression: {'C': 1.0, 'max_iter': 1000, 'penalty': 'l1', 'solver': 'liblinear'}
2025-08-06 17:01:56.640 | INFO     | src.model_trainer:perform_hyperparameter_tuning:98 - Best CV score for logistic_regression: 0.9771
2025-08-06 17:01:56.827 | INFO     | src.model_trainer:_get_cv_scores:169 -   PRECISION CV: 0.9663 (+/- 0.0284)
2025-08-06 17:01:56.965 | INFO     | src.model_trainer:_get_cv_scores:169 -   RECALL CV: 0.9884 (+/- 0.0132)
2025-08-06 17:01:57.104 | INFO     | src.model_trainer:_get_cv_scores:169 -   F1 CV: 0.9771 (+/- 0.0150)

Fitting 5 folds for each of 24 candidates, totalling 120 fits


2025-08-06 17:01:57.492 | INFO     | src.model_trainer:perform_hyperparameter_tuning:97 - Best parameters for decision_tree: {'criterion': 'gini', 'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 2}
2025-08-06 17:01:57.493 | INFO     | src.model_trainer:perform_hyperparameter_tuning:98 - Best CV score for decision_tree: 0.9988
2025-08-06 17:01:57.548 | INFO     | src.model_trainer:_get_cv_scores:169 -   PRECISION CV: 0.9988 (+/- 0.0028)
2025-08-06 17:01:57.590 | INFO     | src.model_trainer:_get_cv_scores:169 -   RECALL CV: 0.9988 (+/- 0.0028)
2025-08-06 17:01:57.632 | INFO     | src.model_trainer:_get_cv_scores:169 -   F1 CV: 0.9988 (+/- 0.0022)

Fitting 5 folds for each of 108 candidates, totalling 540 fits


2025-08-06 17:02:37.796 | INFO     | src.model_trainer:perform_hyperparameter_tuning:97 - Best parameters for random_forest: {'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 100}
2025-08-06 17:02:37.797 | INFO     | src.model_trainer:perform_hyperparameter_tuning:98 - Best CV score for random_forest: 0.9977
2025-08-06 17:02:38.660 | INFO     | src.model_trainer:_get_cv_scores:169 -   PRECISION CV: 0.9988 (+/- 0.0028)
2025-08-06 17:02:39.413 | INFO     | src.model_trainer:_get_cv_scores:169 -   RECALL CV: 0.9965 (+/- 0.0068)
2025-08-06 17:02:40.162 | INFO     | src.model_trainer:_get_cv_scores:169 -   F1 CV: 0.9977 (+/- 0.0039)

Training Results Summary:

LOGISTIC_REGRESSION:
  precision: 0.9663 (+/- 0.0284)
  recall: 0.9884 (+/- 0.0132)
  f1: 0.9771 (+/- 0.0150)
  roc_auc: 0.9951 (+/- 0.0053)

DECISION_TREE:
  precision: 0.9988 (+/- 0.0028)
  recall: 0.9988 (+/- 0.0028)
  f1: 0.9988 (+/- 0.0022)
  roc_auc: 0.9988 (+/- 0.0022)

RANDOM_FOREST:
  precision: 0.9988 (+/- 0.0028)
  recall: 0.9965 (+/- 0.0068)
  f1: 0.9977 (+/- 0.0039)
  roc_auc: 0.9999 (+/- 0.0001)

GRADIENT_BOOSTING:
  precision: 1.0000 (+/- 0.0000)
  recall: 0.9994 (+/- 0.0023)
  f1: 0.9997 (+/- 0.0012)
  roc_auc: 1.0000 (+/- 0.0000)

✅ Model training completed
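The messages of the form "Fitting 5 folds for each of 16 candidates, totalling 80 fits" follow from an exhaustive grid search: the number of candidates is the product of the number of values tried per hyperparameter, and each candidate is fitted once per fold. The sketch below enumerates one grid that would yield 16 candidates for logistic regression. The grid values are purely illustrative assumptions; the actual search space lives in the project configuration and is not shown in this notebook.

```python
from itertools import product

# Hypothetical grid: 4 x 2 x 2 x 1 = 16 combinations.
param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear", "saga"],
    "max_iter": [1000],
}

# Expand the grid into one dict per candidate configuration.
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]

n_folds = 5
total_fits = len(candidates) * n_folds
print(f"{len(candidates)} candidates x {n_folds} folds = {total_fits} fits")
```

The same arithmetic explains why the random forest search (108 candidates, 540 fits) dominates the training time in the log.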


## 6. Model Evaluation and Performance Analysis

In [7]:
# Comprehensive model evaluation
evaluator = ModelEvaluator(feature_names=preprocessor.feature_names)

# Evaluate each model on test set
evaluation_results = {}
for model_name, model in trainer.models.items():
    results = evaluator.evaluate_single_model(model, model_name, X_test, y_test)
    evaluation_results[model_name] = results

# Generate evaluation report
evaluation_report = evaluator.generate_evaluation_report(evaluation_results)

# Display performance summary
comparison_df = pd.DataFrame(evaluation_report['model_comparison']['comparison_table'])
print("Model Performance Comparison:")
print(comparison_df.round(4))

print(f"\nRecommended Model: {evaluation_report['model_comparison']['recommended_model']}")
print(f"Reason: {evaluation_report['model_comparison']['recommendation_reason']}")

print("\n✅ Model evaluation completed")

2025-08-06 17:02:55.291 | INFO     | src.model_evaluator:evaluate_single_model:53 - Evaluating model: logistic_regression
2025-08-06 17:02:55.306 | INFO     | src.model_evaluator:evaluate_single_model:87 - Evaluation completed for logistic_regression
2025-08-06 17:02:55.307 | INFO     | src.model_evaluator:evaluate_single_model:88 -   Accuracy: 0.9731
2025-08-06 17:02:55.308 | INFO     | src.model_evaluator:evaluate_single_model:89 -   Precision: 0.9518
2025-08-06 17:02:55.308 | INFO     | src.model_evaluator:evaluate_single_model:90 -   Recall: 0.9783
2025-08-06 17:02:55.308 | INFO     | src.model_evaluator:evaluate_single_model:91 -   F1-Score: 0.9649

Model Performance Comparison:
                 Model  Accuracy  Precision  Recall  F1-Score  ROC-AUC  \
0  logistic_regression    0.9731     0.9518  0.9783    0.9649   0.9955   
1        decision_tree    1.0000     1.0000  1.0000    1.0000   1.0000   
2        random_forest    0.9988     0.9969  1.0000    0.9985   1.0000   
3    gradient_boosting    1.0000     1.0000  1.0000    1.0000   1.0000   

   Type I Error Rate  Type II Error Rate  
0             0.0301              0.0217  
1             0.0000              0.0000  
2             0.0019              0.0000  
3             0.0000              0.0000  

Recommended Model: decision_tree
Reason: Best F1 score for imbalanced classification

✅ Model evaluation completed
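Every column in the comparison table can be derived from four confusion-matrix counts. The sketch below shows the formulas, treating `Rejected` (encoded as 1) as the positive class, as in the preprocessing log. The counts used for logistic regression (316, 16, 7, 515) are inferred here because they reproduce the reported figures on the 854-row test set; they are not printed by the notebook, and `classification_metrics` is a hypothetical helper rather than part of `ModelEvaluator`.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the table's metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
        # Type I error rate: share of actual negatives predicted positive.
        "type_i_error_rate": fp / (fp + tn),
        # Type II error rate: share of actual positives predicted negative.
        "type_ii_error_rate": fn / (fn + tp),
    }


# Inferred counts for logistic_regression (positive class = Rejected).
lr = classification_metrics(tp=316, fp=16, fn=7, tn=515)
for name, value in lr.items():
    print(f"{name}: {value:.4f}")
```

Under this encoding a false positive is a good applicant who gets rejected, and a false negative is a bad loan that gets approved, so the Type II error rate is the figure tied to financial loss.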


In [8]:
# Create comprehensive evaluation visualizations
viz = EvaluationVisualizer()
viz.create_evaluation_dashboard(
    evaluation_results, y_test, trainer.models, X_test, 
    evaluation_report['model_comparison']
)

print("✅ Evaluation visualizations created")

2025-08-06 17:02:55.452 | INFO     | src.evaluation_visualizer:create_evaluation_dashboard:353 - Creating comprehensive evaluation dashboard...
2025-08-06 17:02:58.424 | INFO     | src.evaluation_visualizer:create_evaluation_dashboard:365 - Evaluation dashboard created successfully


✅ Evaluation visualizations created


## 7. Production Pipeline Setup

In [9]:
# Select and save best model for production
best_model_name, best_model = evaluator.select_best_model(evaluation_results)

# Get performance metrics
best_model_results = evaluation_results[best_model_name]
performance_metrics = {
    'accuracy': best_model_results['accuracy'],
    'precision': best_model_results['precision'],
    'recall': best_model_results['recall'],
    'f1_score': best_model_results['f1_score'],
    'roc_auc': best_model_results.get('roc_auc', 0)
}

# Save model for production
model_path = trainer.save_model(
    best_model, 
    best_model_name, 
    {
        'performance_metrics': performance_metrics,
        'resampling_method': resampling_method,
        'feature_names': preprocessor.feature_names
    }
)

print(f"Best model ({best_model_name}) saved to: {model_path}")
print(f"Performance Metrics: {performance_metrics}")

print("\n✅ Production pipeline setup completed")

2025-08-06 17:02:58.433 | INFO     | src.model_evaluator:select_best_model:309 - Best model selected: decision_tree with f1: 1.0000
2025-08-06 17:02:58.435 | INFO     | src.model_trainer:save_model:264 - Model saved to models\decision_tree_model.joblib


Best model (decision_tree) saved to: models\decision_tree_model.joblib
Performance Metrics: {'accuracy': 1.0, 'precision': 1.0, 'recall': 1.0, 'f1_score': 1.0, 'roc_auc': np.float64(1.0)}

✅ Production pipeline setup completed
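The printed metrics contain a NumPy scalar (`np.float64(1.0)`), which the standard `json` module cannot serialize directly. If the metadata is ever written next to the model as JSON, one option is to convert such values to built-in Python numbers first. The sketch below illustrates the idea with the standard library only. `write_metadata`, `to_plain`, the `ScalarLike` stand-in, and the file name are hypothetical and are not part of `ModelTrainer.save_model`.

```python
import json
import tempfile
from pathlib import Path


class ScalarLike:
    """Stand-in for a NumPy scalar such as np.float64 (exposes .item())."""

    def __init__(self, value):
        self._value = value

    def item(self):
        return self._value


def to_plain(value):
    """Convert NumPy-style scalars to built-in Python numbers."""
    return value.item() if hasattr(value, "item") else value


def write_metadata(path, model_name, metrics):
    """Write a JSON metadata file describing a saved model."""
    payload = {
        "model_name": model_name,
        "performance_metrics": {k: to_plain(v) for k, v in metrics.items()},
    }
    Path(path).write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return payload


metrics = {
    "accuracy": 1.0,
    "precision": 1.0,
    "recall": 1.0,
    "f1_score": 1.0,
    "roc_auc": ScalarLike(1.0),
}
with tempfile.TemporaryDirectory() as tmp_dir:
    metadata_path = Path(tmp_dir) / "decision_tree_model.metadata.json"
    write_metadata(metadata_path, "decision_tree", metrics)
    loaded = json.loads(metadata_path.read_text(encoding="utf-8"))
print(loaded["performance_metrics"])
```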


## 8. Production Inference Testing

In [11]:
# Test production pipeline
from src.production_pipeline import create_production_pipeline

# Create production pipeline
prod_pipeline = create_production_pipeline(best_model_name, model_path, performance_metrics)

# Test with sample application
sample_application = {
    "loan_id": "DEMO_001",
    "no_of_dependents": 2,
    "education": "Graduate",
    "self_employed": "No",
    "income_annum": 6000000.0,
    "loan_amount": 18000000.0,
    "loan_term": 15,
    "cibil_score": 780,
    "residential_assets_value": 2500000.0,
    "commercial_assets_value": 1200000.0,
    "luxury_assets_value": 600000.0,
    "bank_asset_value": 400000.0
}

# Round-trip the sample through a DataFrame to mimic a row of raw input data
sample_df = pd.DataFrame([sample_application])

# Pass the raw record as a dict; the pipeline handles preprocessing itself
prediction_response = prod_pipeline.predict(sample_df.iloc[0].to_dict())

print("Production Pipeline Test:")
print(f"  Loan ID: {prediction_response.loan_id}")
print(f"  Prediction: {prediction_response.prediction}")
print(f"  Confidence: {prediction_response.confidence:.3f}")
print(f"  Risk Score: {prediction_response.risk_score:.3f}")
print(f"  Key Factors: {prediction_response.key_factors}")

# Health check
health_status = prod_pipeline.health_check()
print(f"\nPipeline Health: {health_status['status']}")

print("\n✅ Production pipeline testing completed")

2025-08-06 17:07:04.197 | INFO     | src.production_pipeline:register_model:483 - Model registered: decision_tree_v15_20250806_170704
2025-08-06 17:07:04.200 | INFO     | src.production_pipeline:set_active_model:500 - Active model set to: decision_tree_v15_20250806_170704
2025-08-06 17:07:04.202 | INFO     | src.production_pipeline:_load_artifacts:93 - Preprocessor loaded successfully
2025-08-06 17:07:04.204 | INFO     | src.production_pipeline:_load_artifacts:102 - Model loaded successfully
2025-08-06 17:07:04.205 | INFO     | src.production_pipeline:create_production_pipeline:546 - Production pipeline created and configured successfully

Production Pipeline Test:
  Loan ID: DEMO_001
  Prediction: Rejected
  Confidence: 1.000
  Risk Score: 0.000
  Key Factors: ['credit_score_category_Poor', 'debt_to_income_ratio', 'loan_term', 'High confidence prediction']

Pipeline Health: healthy

✅ Production pipeline testing completed
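Before a raw application reaches the model, it is usually worth checking that every expected field is present and has a sensible type. The sketch below shows one way to do this for the fields used in the sample application. `validate_application` and `REQUIRED_FIELDS` are hypothetical, the 300 to 900 bound is the conventional CIBIL score range rather than something taken from the project, and the real `ProductionPipeline` may validate inputs differently.

```python
# Expected fields and accepted Python types (assumed from the sample above).
REQUIRED_FIELDS = {
    "loan_id": (str,),
    "no_of_dependents": (int,),
    "education": (str,),
    "self_employed": (str,),
    "income_annum": (int, float),
    "loan_amount": (int, float),
    "loan_term": (int,),
    "cibil_score": (int,),
    "residential_assets_value": (int, float),
    "commercial_assets_value": (int, float),
    "luxury_assets_value": (int, float),
    "bank_asset_value": (int, float),
}


def validate_application(application):
    """Return a list of problems found in a raw application dict (empty if valid)."""
    errors = []
    for field, expected_types in REQUIRED_FIELDS.items():
        if field not in application:
            errors.append(f"missing field: {field}")
        elif not isinstance(application[field], expected_types):
            errors.append(f"wrong type for {field}: {type(application[field]).__name__}")
    # Range check only when the score itself passed the checks above.
    score = application.get("cibil_score")
    if isinstance(score, int) and not 300 <= score <= 900:
        errors.append(f"cibil_score out of range: {score}")
    return errors


sample_application = {
    "loan_id": "DEMO_001",
    "no_of_dependents": 2,
    "education": "Graduate",
    "self_employed": "No",
    "income_annum": 6000000.0,
    "loan_amount": 18000000.0,
    "loan_term": 15,
    "cibil_score": 780,
    "residential_assets_value": 2500000.0,
    "commercial_assets_value": 1200000.0,
    "luxury_assets_value": 600000.0,
    "bank_asset_value": 400000.0,
}
print(validate_application(sample_application))
```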


## 9. Business Impact Analysis

In [12]:
# Analyze business impact of the best model
best_results = evaluation_results[best_model_name]
business_metrics = best_results['business_metrics']

print("BUSINESS IMPACT ANALYSIS")
print("=" * 50)

print(f"Model: {best_model_name}")
print(f"\nPrediction Accuracy: {best_results['accuracy']:.1%}")
# The target is encoded as Approved=0, Rejected=1 (see the Section 3 log), so the
# positive class is "Rejected" and the labels below are written accordingly.
print(f"Precision (Rejections That Were Correct): {best_results['precision']:.1%}")
print(f"Recall (Bad Loans Caught): {best_results['recall']:.1%}")
print(f"F1-Score (Balanced Performance): {best_results['f1_score']:.1%}")

print("\nBUSINESS RISK METRICS:")
print(f"Type I Error Rate (Good Loans Rejected): {business_metrics['type_i_error_rate']:.1%}")
print(f"Type II Error Rate (Bad Loans Approved): {business_metrics['type_ii_error_rate']:.1%}")
print(f"Cost Ratio (Financial Risk): {business_metrics['cost_ratio']:.1%}")
print(f"Opportunity Loss Ratio: {business_metrics['opportunity_loss_ratio']:.1%}")

print("\nCONFUSION MATRIX BREAKDOWN:")
print(f"True Positives (Correctly Rejected): {business_metrics['true_positives']}")
print(f"True Negatives (Correctly Approved): {business_metrics['true_negatives']}")
print(f"False Positives (Incorrectly Rejected): {business_metrics['false_positives']}")
print(f"False Negatives (Incorrectly Approved): {business_metrics['false_negatives']}")

BUSINESS IMPACT ANALYSIS
Model: decision_tree

Prediction Accuracy: 100.0%
Precision (Rejections That Were Correct): 100.0%
Recall (Bad Loans Caught): 100.0%
F1-Score (Balanced Performance): 100.0%

BUSINESS RISK METRICS:
Type I Error Rate (Good Loans Rejected): 0.0%
Type II Error Rate (Bad Loans Approved): 0.0%
Cost Ratio (Financial Risk): 0.0%
Opportunity Loss Ratio: 0.0%

CONFUSION MATRIX BREAKDOWN:
True Positives (Correctly Rejected): 323
True Negatives (Correctly Approved): 531
False Positives (Incorrectly Rejected): 0
False Negatives (Incorrectly Approved): 0


## 10. Model Interpretability and Feature Importance

In [13]:
# Feature importance analysis
if 'feature_importance' in best_results:
    feature_importance = best_results['feature_importance']
    print("TOP 10 MOST IMPORTANT FEATURES:")
    print("=" * 40)
    
    for i, (feature, importance) in enumerate(list(feature_importance.items())[:10], 1):
        print(f"{i:2d}. {feature:<25} {importance:.4f}")

elif 'feature_coefficients' in best_results:
    feature_coef = best_results['feature_coefficients']
    print("TOP 10 MOST INFLUENTIAL FEATURES (by coefficient magnitude):")
    print("=" * 60)
    
    # The positive class is "Rejected" (Approved=0, Rejected=1), so a positive
    # coefficient raises the predicted probability of rejection.
    for i, (feature, coef) in enumerate(list(feature_coef.items())[:10], 1):
        direction = "↑ Increases" if coef > 0 else "↓ Decreases"
        print(f"{i:2d}. {feature:<25} {coef:8.4f} ({direction} rejection probability)")

print("\n✅ Feature importance analysis completed")

TOP 10 MOST IMPORTANT FEATURES:
 1. credit_score_category_Poor 0.8544
 2. debt_to_income_ratio      0.0751
 3. loan_term                 0.0533
 4. loan_to_asset_ratio       0.0149
 5. cibil_score               0.0023
 6. no_of_dependents          0.0000
 7. income_annum              0.0000
 8. loan_amount               0.0000
 9. residential_assets_value  0.0000
10. commercial_assets_value   0.0000

✅ Feature importance analysis completed
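The ranking above is dominated by a single feature. A quick way to quantify that concentration is to sort the importances and accumulate them, as in the sketch below, which uses the values printed above. The cumulative view is an added illustration and is not something `ModelEvaluator` is known to compute.

```python
# Importances copied from the output above.
feature_importance = {
    "credit_score_category_Poor": 0.8544,
    "debt_to_income_ratio": 0.0751,
    "loan_term": 0.0533,
    "loan_to_asset_ratio": 0.0149,
    "cibil_score": 0.0023,
    "no_of_dependents": 0.0,
    "income_annum": 0.0,
    "loan_amount": 0.0,
    "residential_assets_value": 0.0,
    "commercial_assets_value": 0.0,
}

# Sort by importance, highest first, then build a running total.
ranked = sorted(feature_importance.items(), key=lambda item: item[1], reverse=True)
cumulative = []
running_total = 0.0
for feature, importance in ranked:
    running_total += importance
    cumulative.append((feature, running_total))

for feature, total in cumulative[:5]:
    print(f"{feature:<28} cumulative importance: {total:.4f}")
```

The top three features account for about 98% of the total importance, and the credit score category alone for about 85%, so the tree's decisions rest almost entirely on credit score information.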


## 11. Final Recommendations and Next Steps

In [14]:
print("FINAL RECOMMENDATIONS")
print("=" * 50)

print("\n🎯 BUSINESS RECOMMENDATIONS:")
business_recommendations = [
    "Deploy the trained model to automate initial loan screening",
    "Focus on precision to minimize bad loan approvals and reduce financial risk",
    "Implement manual review for borderline cases (confidence < 70%)",
    "Monitor model performance monthly and retrain quarterly",
    "Use model insights to improve loan application process"
]

for i, rec in enumerate(business_recommendations, 1):
    print(f"  {i}. {rec}")

print("\n🔧 TECHNICAL RECOMMENDATIONS:")
technical_recommendations = [
    "Deploy using the provided FastAPI service for scalable predictions",
    "Implement model monitoring to track prediction drift",
    "Set up automated retraining pipeline with new data",
    "Use A/B testing for model improvements",
    "Implement comprehensive logging and error handling"
]

for i, rec in enumerate(technical_recommendations, 1):
    print(f"  {i}. {rec}")

print("\n📊 NEXT STEPS:")
next_steps = [
    "Run the API server: python src/api_server.py",
    "Execute tests: pytest tests/",
    "Review results in the 'results' folder",
    "Deploy to production environment",
    "Set up monitoring and alerting"
]

for i, step in enumerate(next_steps, 1):
    print(f"  {i}. {step}")

print("\n" + "="*80)
print("🎉 LOAN APPROVAL PREDICTION PIPELINE COMPLETED SUCCESSFULLY!")
print("="*80)

FINAL RECOMMENDATIONS

🎯 BUSINESS RECOMMENDATIONS:
  1. Deploy the trained model to automate initial loan screening
  2. Focus on precision to minimize bad loan approvals and reduce financial risk
  3. Implement manual review for borderline cases (confidence < 70%)
  4. Monitor model performance monthly and retrain quarterly
  5. Use model insights to improve loan application process

🔧 TECHNICAL RECOMMENDATIONS:
  1. Deploy using the provided FastAPI service for scalable predictions
  2. Implement model monitoring to track prediction drift
  3. Set up automated retraining pipeline with new data
  4. Use A/B testing for model improvements
  5. Implement comprehensive logging and error handling

📊 NEXT STEPS:
  1. Run the API server: python src/api_server.py
  2. Execute tests: pytest tests/
  3. Review results in the 'results' folder
  4. Deploy to production environment
  5. Set up monitoring and alerting

🎉 LOAN APPROVAL PREDICTION PIPELINE COMPLETED SUCCESSFULLY!
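
The recommendation to send borderline cases (confidence < 70%) to manual review can be expressed as a small routing rule on top of the model's output. The sketch below is one possible shape for it. `route_prediction` and the route names are hypothetical, and the 0.70 threshold is taken from the recommendation text above rather than from the production code.

```python
def route_prediction(prediction, confidence, review_threshold=0.70):
    """Decide how to handle a prediction based on the model's confidence."""
    if confidence < review_threshold:
        # Low-confidence cases go to a human underwriter.
        return "manual_review"
    return "auto_approve" if prediction == "Approved" else "auto_reject"


examples = [
    ("Approved", 0.95),
    ("Rejected", 1.00),
    ("Approved", 0.62),
]
for prediction, confidence in examples:
    print(prediction, confidence, "->", route_prediction(prediction, confidence))
```

Keeping the threshold as a parameter makes it easy to tune against the manual review capacity once real confidence distributions are available.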
