In [1]:
import sys
import numpy as np
sys.path.append("..")

%load_ext autoreload
%autoreload 2

# Feature Engineering for Cohort Profitability Prediction

This notebook creates features for predicting ROI at horizon H using only information available up to decision time t.

## Key Parameters
- **Decision Time (t)**: 90 days after cohort creation (parametrized for easy modification)
- **Horizon (H)**: Based on EDA findings, we use the full observation period for final ROI calculation
- **Feature Scope**: Only information available at or before time t is used

## Feature Categories
1. **Loan-Level Features**: Individual loan characteristics and early behavior signals
2. **Cohort-Level Features**: Portfolio composition and risk distribution metrics

In [2]:
# Parameters - easily configurable
DECISION_TIME_DAYS = 90  # Decision time t in days after cohort creation
DATABASE_PATH = "../database.db"

print(f"Decision time set to: {DECISION_TIME_DAYS} days after cohort creation")

Decision time set to: 90 days after cohort creation


## Data Loading and Preparation

In [None]:
from src.dataset.data_manipulation import load_data

# Load all data
allowlist, loans, repayments, loans_and_cohort, repayments_and_loans = load_data(
    DATABASE_PATH, remove_loans_with_errors=True
)

## Feature Engineering Functions

We'll import feature engineering functions from a dedicated module to keep the notebook clean and functions reusable.

In [36]:
from src.features import (
    create_loan_level_features,
    create_cohort_level_features,
    save_features_to_database
)

## 1. Loan-Level Features

### Loan Characteristics
- Loan amount (raw and log-transformed)
- Annual interest rate
- Loan size decile within cohort

### Temporal Features
- Time since loan issuance at decision time t
- Time between allowlist date and loan creation

### Interaction Terms
- Loan amount × interest rate
- Loan ROI at 30/60/90 days

### Early Repayment Behavior
- Days to first repayment
- Repayment velocity (30/60/90 days)
- Repayment consistency metrics

### Repayment Quality Indicators
- Average repayment amount relative to loan size
- Repayment acceleration/deceleration trends

### Billing Payment Indicators
- Time in billing process
- Is in normal repayment process (boolean)

In [39]:
loan_features = create_loan_level_features(
    loans_and_cohort, repayments_and_loans, decision_time_days=DECISION_TIME_DAYS
)
loan_features

Creating loan-level features with decision time t=90 days...
Base features dataset: 15600 unique loans
Creating repayment behavior features...
Final loan features dataset: 15600 loans with 27 features


  "loan_amount"


Unnamed: 0,loan_id,user_id,created_at,updated_at,annual_interest,loan_amount,status_at_decision_time,batch,allowlisted_date,batch_letter,...,loan_amount_x_interest,repayment_velocity_30d,loan_roi_30d,repayment_velocity_60d,loan_roi_60d,repayment_velocity_90d,loan_roi_90d,days_to_first_repayment,repayment_consistency_cv,avg_repayment_relative
0,0000634b4de08f4d798a4546bd104aa5d3e43af416bd48...,e00cc67f993040157c1a5d15b35d8b6182e567c405fff9...,2024-03-11,2024-03-11 16:49:25.324000,2.4,4000.0,executed,9a65c2254d6d2b240f353b95df7061928c7a9869417325...,2023-12-19,F,...,9600.0,,,,,18.034286,-0.968440,1.0,0.890936,0.001315
1,00022546590af574f1785cb5e4c17bb1898de7bce40977...,1532d16402c104350db26e145d562e7b9ef392e16e9c99...,2023-12-07,2024-02-22 23:48:52.979000,3.2,500.0,executed,4398a3e49d78f4b1b816ced315f34a5da5e830b1f53640...,2023-12-05,D,...,1600.0,6.857143,-0.616000,3.310345,-0.616000,2.181818,-0.616000,9.0,0.000000,0.096000
2,000dca06cc48943ca84d7516f817709f2b7768468a9a02...,445a2b25d6692ec55caf314c6bc998c517ea9022c65735...,2024-06-01,2024-06-03 12:02:32.785000,3.4,50.0,repaid,e6a25e071c60243b0c51c698db5302b54ef61338c6747a...,2024-04-04,G,...,170.0,,,50.440000,1.017600,3.152500,1.017600,1.0,0.984140,0.504400
3,000edc3faa8a8e0e569dc56feae1bc1262895a8716a3d6...,d2da15b907b2777025a1894c368f279e2a908c5f656da1...,2024-04-28,2024-04-28 12:32:55.087000,3.4,100.0,repaid,e6a25e071c60243b0c51c698db5302b54ef61338c6747a...,2024-04-04,G,...,340.0,16.741667,0.004500,2.790278,0.004500,1.521970,0.004500,0.0,0.000000,1.004500
4,000f534973cf9b232b91613d881915221a6fbca479762e...,a93da183d9f8a58e1e0b8cbab2cb652ea800d18370b959...,2024-04-10,2024-04-20 14:45:48.511000,3.4,50.0,repaid,e6a25e071c60243b0c51c698db5302b54ef61338c6747a...,2024-04-04,G,...,170.0,4.351667,1.088800,1.934074,1.088800,1.243333,1.088800,10.0,0.000000,1.044400
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15595,fff1fcf554c8095ed7a3033cbada7da6891078ba4b9572...,3d6c6750db203cd915f319276c31e01aded75188b214fd...,2023-12-06,2023-12-09 04:28:18.458000,3.2,500.0,repaid,4398a3e49d78f4b1b816ced315f34a5da5e830b1f53640...,2023-12-05,D,...,1600.0,35.060000,1.033480,17.232881,1.033480,11.424045,1.033480,3.0,0.000000,1.016740
15596,fff57f487306687b4e558a71ddd4f579ec3c0c3d538515...,94b7640a37bca52d697eb2bea0a25e53be9f4f0f476049...,2022-09-11,2022-10-18 15:35:08.598080,2.4,1750.0,repaid,1d83f7f96a6a3a06b30bc683b94a428225fe072e60959f...,2022-08-29,B,...,4200.0,126.455294,0.228423,121.373617,2.259749,74.085195,2.259749,3.0,0.568668,0.033956
15597,fff8767d031343d79d7d051a7f9885971eb0a9edd3032f...,6e2baec66c2fff4ded8d1413206c981e81e920ad1d8af1...,2024-06-25,2024-07-03 21:07:11.294000,3.4,50.0,repaid,e6a25e071c60243b0c51c698db5302b54ef61338c6747a...,2024-04-04,G,...,170.0,,,,,12.850000,1.056000,1.0,0.694255,0.171333
15598,fffb5b06cc5ef2d4fd3d9321bc797d95b0bdb75ac77215...,4f1efc1e1af62ccdbc89ac564d33c22ed3021c6d3be748...,2024-04-11,2024-04-12 15:31:41.127000,3.4,50.0,repaid,e6a25e071c60243b0c51c698db5302b54ef61338c6747a...,2024-04-04,G,...,170.0,4.375652,1.012800,1.898868,1.012800,1.212530,1.012800,1.0,0.000000,1.006400


In [None]:
# Display sample of loan-level features
print("Sample of loan-level features:")
display(loan_features_df.head())

Sample of loan-level features:


Unnamed: 0,loan_id,user_id,created_at,updated_at,annual_interest,loan_amount,status,batch,allowlisted_date,batch_letter,...,repayment_velocity_60d,loan_roi_60d,repayment_velocity_90d,loan_roi_90d,days_to_first_repayment,repayment_consistency_cv,avg_repayment_relative,repayment_acceleration,time_in_billing_days,is_in_normal_repayment
0,0000634b4de08f4d798a4546bd104aa5d3e43af416bd48...,e00cc67f993040157c1a5d15b35d8b6182e567c405fff9...,2024-03-11,2024-03-11 16:49:25.316000+00:00,2.4,4000.0,executed,9a65c2254d6d2b240f353b95df7061928c7a9869417325...,2023-12-19,F,...,0.526,-0.99211,0.350667,-0.99211,1.0,0.890936,0.001315,0.0,0.0,True
1,000084327034f5aea172294e82f81cc7f4c24162a075bc...,250761407286bebafb435d00b7568e7e476de772abfbf7...,2023-03-30,2023-03-30 00:08:12.541000+00:00,2.4,3250.0,executed,5bcbc3d39978a3ff54a2671faf77e3e43c798faf53e98f...,2022-09-09,E,...,0.0,-1.0,0.0,-1.0,,,0.0,,0.0,True
2,00016ebbe5987467209e9f63bcfe6c379f1eb2ec3ec644...,05740aa6bce70bc98b1c414ca92d4cbdc281106d79db2f...,2025-01-03,2025-01-03 12:56:36.153000+00:00,3.2,4320.0,executed,1d83f7f96a6a3a06b30bc683b94a428225fe072e60959f...,2022-08-29,B,...,0.0,-1.0,0.0,-1.0,,,0.0,,0.0,True
3,00022546590af574f1785cb5e4c17bb1898de7bce40977...,1532d16402c104350db26e145d562e7b9ef392e16e9c99...,2023-12-07,2023-12-07 05:12:33.884000+00:00,3.2,500.0,executed,4398a3e49d78f4b1b816ced315f34a5da5e830b1f53640...,2023-12-05,D,...,0.8,-0.904,0.533333,-0.904,9.0,0.0,0.096,0.0,0.0,True
4,000402c18c2931e31e9cd68b5a01d1389337e55572859a...,35bd33ed5eb7a85c88c2b1baf1ec368adc994b9bdc9f5e...,2024-08-12,2024-08-12 11:56:37.149000+00:00,3.4,50.0,executed,e6a25e071c60243b0c51c698db5302b54ef61338c6747a...,2024-04-04,G,...,0.0,-1.0,0.0,-1.0,,,0.0,,0.0,True


In [33]:
# Check available columns in loan features
print("Loan features columns:")
[print(f"{- col}") for col in loan_features_df.columns.tolist()]


# Show unique statuses
if 'status_at_decision_time' in loan_features_df.columns:
    print(f"\nUnique statuses at decision time:")
    print(loan_features_df['status_at_decision_time'].value_counts())

Loan features columns:


NameError: name 'loan_features_df' is not defined

In [None]:
loan_features_df.describe()

  sqr = _ensure_numeric((avg - values) ** 2)


Unnamed: 0,created_at,annual_interest,loan_amount,allowlisted_date,decision_cutoff_date,loan_amount_raw,loan_amount_log,annual_interest_rate,loan_size_decile,days_since_loan_issuance,...,loan_roi_30d,repayment_velocity_60d,loan_roi_60d,repayment_velocity_90d,loan_roi_90d,days_to_first_repayment,repayment_consistency_cv,avg_repayment_relative,repayment_acceleration,time_in_billing_days
count,45143,45143.0,45143.0,45143,45143,45143.0,45143.0,45143.0,45143.0,45143.0,...,44898.0,45095.0,45095.0,45143.0,45143.0,14430.0,14430.0,45143.0,14430.0,43895.0
mean,2024-03-09 12:23:23.858848768,2.806601,1810.780752,2023-07-19 05:30:24.362581248,2023-10-17 05:30:24.362580992,1810.780752,6.340193,2.806601,3.769532,-144.2868,...,-0.731056,2.791381,-0.715783,1.947271,-0.711535,3.796119,0.563836,0.159903,inf,0.0
min,2022-08-30 00:00:00,1.7,5.0,2022-08-29 00:00:00,2022-11-27 00:00:00,5.0,1.791759,1.7,1.0,-880.0,...,-1.0,0.0,-1.0,0.0,-1.0,-1.0,0.0,0.0,0.0,0.0
25%,2023-11-14 00:00:00,2.4,100.0,2022-09-09 00:00:00,2022-12-08 00:00:00,100.0,4.615121,2.4,1.0,-256.0,...,-1.0,0.0,-1.0,0.0,-1.0,1.0,0.0,0.0,0.0,0.0
50%,2024-04-29 00:00:00,3.2,700.0,2023-12-05 00:00:00,2024-03-04 00:00:00,700.0,6.552508,3.2,3.0,-68.0,...,-1.0,0.0,-1.0,0.0,-1.0,1.0,0.533225,0.0,0.0,0.0
75%,2024-08-12 00:00:00,3.4,2250.0,2024-04-04 00:00:00,2024-07-03 00:00:00,2250.0,7.71913,3.4,6.0,33.0,...,-0.412308,0.8425,0.0044,0.561778,0.004548,3.0,0.936719,0.071798,0.0,0.0
max,2025-04-26 00:00:00,3.4,64900.0,2024-04-04 00:00:00,2024-07-03 00:00:00,64900.0,11.080618,3.4,10.0,90.0,...,1.03547,363.214833,1.03547,242.143222,1.03547,89.0,5.149347,2.03547,inf,0.0
std,,0.62141,3026.818989,,,3026.818989,1.729318,0.62141,2.817159,229.293498,...,0.43622,12.961228,0.449789,9.060311,0.454133,7.635587,0.592853,0.325281,,0.0


## 2. Cohort-Level Features

### Portfolio Concentration Metrics
- Gini coefficient of loan amounts
- Herfindahl-Hirschman Index (HHI)
- Loan amount percentiles (P10, P25, P50, P75, P90, P95)

### Risk Distribution Metrics
- Cohort size (number of loans)
- Value-weighted average loan amount
- Statistical measures: standard deviation, skewness, coefficient of variation

In [None]:
# Create cohort-level features
print("Creating cohort-level features...")
cohort_features_df = create_cohort_level_features(
    loans_and_cohort=loans_and_cohort,
    repayments_and_loans=repayments_and_loans,
    decision_time_days=DECISION_TIME_DAYS
)

print(f"Created {len(cohort_features_df.columns)} cohort-level features for {len(cohort_features_df)} cohorts")
print("\nFeature columns:")
for col in sorted(cohort_features_df.columns):
    print(f"  - {col}")

Creating cohort-level features...
Creating cohort-level features...
Creating loan-level features with decision time t=90 days...
Base features dataset: 45143 unique loans
Creating repayment behavior features...
Base features dataset: 45143 unique loans
Creating repayment behavior features...


  repayments_filtered.groupby("loan_id").apply(calc_consistency).reset_index()
  .apply(lambda x: calc_acceleration(x, decision_time_days))
  .apply(lambda x: calc_acceleration(x, decision_time_days))


Final loan features dataset: 45143 loans with 31 features
Final cohort features dataset: 7 cohorts with 44 features
Created 44 cohort-level features for 7 cohorts

Feature columns:
  - amount_weighted_avg_roi_90d
  - avg_days_allowlist_to_loan
  - avg_days_since_loan_issuance
  - avg_days_to_first_repayment
  - avg_interest_rate
  - avg_loan_amount
  - avg_loan_amount_x_interest
  - avg_loan_roi_30d
  - avg_loan_roi_60d
  - avg_loan_roi_90d
  - avg_repayment_consistency
  - avg_repayment_velocity_30d
  - avg_repayment_velocity_60d
  - avg_repayment_velocity_90d
  - batch_letter
  - cohort_size
  - loan_amount_cv
  - loan_amount_gini
  - loan_amount_hhi
  - loan_amount_p25
  - loan_amount_p75
  - loan_amount_p90
  - median_days_to_first_repayment
  - median_interest_rate
  - median_loan_amount
  - median_loan_roi_30d
  - median_loan_roi_60d
  - median_loan_roi_90d
  - median_repayment_velocity_30d
  - median_repayment_velocity_60d
  - median_repayment_velocity_90d
  - pct_debt_collectio

  features_df.groupby("batch_letter").apply(calc_group_metrics).reset_index()


In [None]:
# Display cohort-level features
print("Cohort-level features:")
display(cohort_features_df)

Cohort-level features:


Unnamed: 0,batch_letter,cohort_size,total_loan_amount,avg_loan_amount,median_loan_amount,avg_interest_rate,median_interest_rate,std_interest_rate,loan_amount_gini,loan_amount_hhi,...,repayment_rate_at_decision,pct_loans_totally_repaid,pct_loans_in_billing,pct_loans_normal_repayment,pct_executed,pct_debt_collection,pct_debt_repaid,pct_normal_repayment,avg_loan_amount_x_interest,amount_weighted_avg_roi_90d
0,A,3182,786541.62,247.18467,50.0,3.398743,3.4,0.015809,0.640696,0.001246,...,0.146734,0.311439,0.001886,0.031427,0.091158,0.005469,0.002735,0.997172,839.272001,-0.853266
1,B,5970,22242140.15,3725.651616,2500.0,2.436047,2.4,0.16596,0.441865,0.000385,...,0.070852,0.054271,0.0,0.078224,0.590392,0.0,0.0,1.0,9133.676453,-0.929148
2,C,8224,30258238.56,3679.260525,2500.0,2.066561,1.7,0.389136,0.418616,0.00023,...,0.083522,0.060554,0.0,0.076605,0.558511,0.0,0.0,1.0,7613.158181,-0.916478
3,D,4967,2583135.26,520.059444,500.0,3.2,3.2,0.0,0.136542,0.000311,...,0.21664,0.188242,0.0,0.259311,0.579397,0.0,0.0,1.0,1664.190222,-0.78336
4,E,4432,13937768.8,3144.80343,2250.0,2.081588,2.4,0.399305,0.400389,0.000412,...,0.090531,0.057762,0.0,0.100406,0.634807,0.0,0.0,1.0,6512.202204,-0.909469
5,F,3621,8291253.51,2289.768989,1200.0,2.511571,2.4,0.277183,0.58607,0.001024,...,0.106366,0.189174,0.007733,0.06987,0.261905,0.028986,0.008282,0.990058,5847.999697,-0.893634
6,G,14747,3644997.6,247.168753,50.0,3.399376,3.4,0.011153,0.689199,0.000541,...,0.27151,0.530616,0.028955,0.021089,0.036319,0.049866,0.090622,0.918424,839.610812,-0.72849


In [None]:
# Test the new repayment performance features
%reload_ext autoreload

print("Testing new cohort features with repayment performance metrics...")
cohort_features_df_new = create_cohort_level_features(
    loans_and_cohort=loans_and_cohort,
    repayments_and_loans=repayments_and_loans,
    decision_time_days=DECISION_TIME_DAYS
)

print(f"Updated cohort features: {len(cohort_features_df_new.columns)} features")

# Find new features (comparing to original)
original_features = set(cohort_features_df.columns)
new_features = set(cohort_features_df_new.columns)
added_features = new_features - original_features

print(f"\nNew features added: {list(added_features)}")
print("\nUpdated cohort features data:")
display(cohort_features_df_new)

Testing new cohort features with repayment performance metrics...
Creating cohort-level features with decision time = 90 days
Updated cohort features: 17 features

New features added: []

Updated cohort features data:


Unnamed: 0,batch_letter,cohort_size,total_loan_amount,value_weighted_avg_amount,gini_coefficient,hhi_loan_amounts,loan_amount_p10,loan_amount_p25,loan_amount_p50,loan_amount_p75,loan_amount_p90,loan_amount_p95,loan_amount_std,loan_amount_skewness,loan_amount_cv,avg_interest_rate,interest_rate_std
0,A,3183,786691.62,980.0844,0.640654,0.001246,50.0,50.0,50.0,250.0,750.0,1000.0,425.613377,6.004825,1.722056,3.398743,0.01580367
1,B,6028,22463415.15,8546.016574,0.441275,0.00038,1000.0,1500.0,2500.0,4200.0,7100.0,10000.0,4237.917151,4.817193,1.137234,2.4357,0.1651834
2,C,8335,30658758.56,6947.076101,0.418218,0.000227,1000.0,1600.0,2500.0,4500.0,7500.0,10000.0,3467.496624,3.339211,0.942686,2.066215,0.3886275
3,D,4976,2587785.26,802.143073,0.136529,0.00031,450.0,500.0,500.0,500.0,600.0,600.0,383.016599,14.944554,0.736495,3.2,8.881784e-16
4,E,4468,14060518.8,5752.130899,0.400569,0.000409,1000.0,1500.0,2250.0,3700.0,6000.0,8250.0,2863.281431,3.101699,0.909863,2.081647,0.3988724
5,F,3641,8349103.51,8460.0031,0.585257,0.001013,250.0,500.0,1200.0,2470.0,5000.0,7600.0,3760.484969,5.531593,1.639928,2.510959,0.276505
6,G,14750,3648047.6,1972.542945,0.689308,0.000541,50.0,50.0,50.0,150.0,550.0,1000.0,653.215059,9.209098,2.641117,3.399376,0.01115153


In [None]:
# Restart Python and reload everything
import importlib
import sys

# Remove from cache
if 'src.features' in sys.modules:
    del sys.modules['src.features']
if 'src.features.cohort_features' in sys.modules:
    del sys.modules['src.features.cohort_features']
if 'src.features.__init__' in sys.modules:
    del sys.modules['src.features.__init__']

# Import fresh
from src.features import create_cohort_level_features

print("Fresh import successful. Let's see if the new features are properly included.")

# Test specifically for our new features in the current cohort_features_df_new
new_cols = [col for col in cohort_features_df_new.columns if any(keyword in col.lower() for keyword in ['repayment_rate', 'totally_repaid', 'billing', 'normal_repayment'])]
print(f"Potential new columns found: {new_cols}")

# Let's check all column names
print(f"\nAll columns in current cohort features: {sorted(cohort_features_df_new.columns.tolist())}")

Fresh import successful. Let's see if the new features are properly included.
Potential new columns found: []

All columns in current cohort features: ['avg_interest_rate', 'batch_letter', 'cohort_size', 'gini_coefficient', 'hhi_loan_amounts', 'interest_rate_std', 'loan_amount_cv', 'loan_amount_p10', 'loan_amount_p25', 'loan_amount_p50', 'loan_amount_p75', 'loan_amount_p90', 'loan_amount_p95', 'loan_amount_skewness', 'loan_amount_std', 'total_loan_amount', 'value_weighted_avg_amount']


In [None]:
# Let's manually test our new features function directly
def test_repayment_performance_features(features_df):
    """Test version of the new function."""
    cohort_features = []
    
    for batch_letter, cohort_df in features_df.groupby("batch_letter"):
        features = {"batch_letter": batch_letter}
        
        total_cohort_amount = cohort_df["loan_amount"].sum()
        total_cohort_loans = len(cohort_df)
        
        # 1. Amount repaid at decision time / total amount of the cohort
        if "repayment_velocity_90d" in cohort_df.columns:
            # repayment_velocity_90d is daily velocity, so multiply by 90 to get total repaid
            total_repaid = (cohort_df["repayment_velocity_90d"] * 90).sum()
            features["repayment_rate_at_decision"] = total_repaid / total_cohort_amount if total_cohort_amount > 0 else 0
        
        # Status-based features
        if "status_at_decision_time" in cohort_df.columns:
            status_counts = cohort_df["status_at_decision_time"].value_counts()
            
            # 2. Number of totally repaid loans / total loans in cohort
            totally_repaid_count = status_counts.get("repaid", 0) + status_counts.get("debt_repaid", 0)
            features["pct_loans_totally_repaid"] = totally_repaid_count / total_cohort_loans
            
            # 3. Number of loans in billing status / total loans in cohort
            billing_count = status_counts.get("debt_collection", 0)
            features["pct_loans_in_billing"] = billing_count / total_cohort_loans
            
            # 4. Number of loans in normal repayment status / total loans in cohort
            normal_repayment_count = status_counts.get("executed", 0)
            features["pct_loans_normal_repayment"] = normal_repayment_count / total_cohort_loans

        cohort_features.append(features)

    return pd.DataFrame(cohort_features)

# Test the function
print("Testing manual implementation of new features...")
test_new_features = test_repayment_performance_features(loan_features_df)
print("New features columns:", test_new_features.columns.tolist())
print("\nNew features data:")
display(test_new_features)

Testing manual implementation of new features...
New features columns: ['batch_letter', 'repayment_rate_at_decision', 'pct_loans_totally_repaid', 'pct_loans_in_billing', 'pct_loans_normal_repayment']

New features data:


Unnamed: 0,batch_letter,repayment_rate_at_decision,pct_loans_totally_repaid,pct_loans_in_billing,pct_loans_normal_repayment
0,A,0.146898,0.311656,0.001885,0.031417
1,B,0.070165,0.053749,0.0,0.077638
2,C,0.082744,0.059868,0.0,0.076785
3,D,0.216479,0.187902,0.0,0.259445
4,E,0.089873,0.057296,0.0,0.100269
5,F,0.105751,0.188135,0.00769,0.069761
6,G,0.271369,0.530644,0.028949,0.021085


In [None]:
# Test the updated modular function directly using existing loan features
import importlib
import sys

# Clear module cache again
for module_name in list(sys.modules.keys()):
    if module_name.startswith('src.features'):
        del sys.modules[module_name]

# Import fresh
from src.features import create_cohort_features_from_loan_features

print("Testing updated modular cohort features...")
modular_cohort_features = create_cohort_features_from_loan_features(loan_features_df)

print(f"Modular cohort features: {len(modular_cohort_features.columns)} features")
print(f"Columns: {sorted(modular_cohort_features.columns.tolist())}")

# Check for our new features
new_feature_keywords = ['repayment_rate', 'totally_repaid', 'billing', 'normal_repayment']
found_new_features = [col for col in modular_cohort_features.columns if any(keyword in col.lower() for keyword in new_feature_keywords)]
print(f"\nNew features found: {found_new_features}")

display(modular_cohort_features)

Testing updated modular cohort features...
Final cohort features dataset: 7 cohorts with 45 features
Modular cohort features: 45 features
Columns: ['amount_weighted_avg_roi_90d', 'avg_days_allowlist_to_loan', 'avg_days_since_loan_issuance', 'avg_days_to_first_repayment', 'avg_interest_rate', 'avg_loan_amount', 'avg_loan_amount_x_interest', 'avg_loan_roi_30d', 'avg_loan_roi_60d', 'avg_loan_roi_90d', 'avg_repayment_consistency', 'avg_repayment_velocity_30d', 'avg_repayment_velocity_60d', 'avg_repayment_velocity_90d', 'batch_letter', 'cohort_size', 'loan_amount_cv', 'loan_amount_gini', 'loan_amount_hhi', 'loan_amount_p25', 'loan_amount_p75', 'loan_amount_p90', 'median_days_to_first_repayment', 'median_interest_rate', 'median_loan_amount', 'median_loan_roi_30d', 'median_loan_roi_60d', 'median_loan_roi_90d', 'median_repayment_velocity_30d', 'median_repayment_velocity_60d', 'median_repayment_velocity_90d', 'pct_active', 'pct_debt_collection', 'pct_debt_repaid', 'pct_executed', 'pct_loans_in_

Unnamed: 0,batch_letter,cohort_size,total_loan_amount,avg_loan_amount,median_loan_amount,avg_interest_rate,median_interest_rate,std_interest_rate,loan_amount_gini,loan_amount_hhi,...,pct_loans_totally_repaid,pct_loans_in_billing,pct_loans_normal_repayment,pct_executed,pct_debt_collection,pct_debt_repaid,pct_active,pct_normal_repayment,avg_loan_amount_x_interest,amount_weighted_avg_roi_90d
0,A,3183,786691.62,247.154138,50.0,3.398743,3.4,0.01580615,0.640654,0.001246,...,0.311656,0.001885,0.031417,0.031417,0.001885,0.000943,0.0,0.997172,839.168554,-0.853102
1,B,6028,22463415.15,3726.512135,2500.0,2.4357,2.4,0.1651971,0.441275,0.00038,...,0.053749,0.0,0.077638,0.077638,0.0,0.0,0.0,1.0,9133.893236,-0.929835
2,C,8335,30658758.56,3678.315364,2500.0,2.066215,1.7,0.3886508,0.418218,0.000227,...,0.059868,0.0,0.076785,0.076785,0.0,0.0,0.0,1.0,7610.362553,-0.917256
3,D,4976,2587785.26,520.053308,500.0,3.2,3.2,8.882677e-16,0.136529,0.00031,...,0.187902,0.0,0.259445,0.259445,0.0,0.0,0.0,1.0,1664.170585,-0.783521
4,E,4468,14060518.8,3146.937959,2250.0,2.081647,2.4,0.398917,0.400569,0.000409,...,0.057296,0.0,0.100269,0.100269,0.0,0.0,0.0,1.0,6515.013466,-0.910127
5,F,3641,8349103.51,2293.079789,1200.0,2.510959,2.4,0.276543,0.585257,0.001013,...,0.188135,0.00769,0.069761,0.069761,0.00769,0.002197,0.0,0.990113,5854.009037,-0.894249
6,G,14750,3648047.6,247.325261,50.0,3.399376,3.4,0.01115191,0.689308,0.000541,...,0.530644,0.028949,0.021085,0.021085,0.028949,0.05261,0.0,0.918441,840.143094,-0.728631


In [None]:
# Summary of the new cohort features
print("=== NEW COHORT FEATURES SUMMARY ===")
print()

# Show the four specific new features requested
new_features_data = modular_cohort_features[['batch_letter', 'repayment_rate_at_decision', 'pct_loans_totally_repaid', 'pct_loans_in_billing', 'pct_loans_normal_repayment']].copy()

print("1. repayment_rate_at_decision: Amount repaid at decision time / total cohort amount")
print("2. pct_loans_totally_repaid: Number of totally repaid loans / total loans in cohort")  
print("3. pct_loans_in_billing: Number of loans in billing status / total loans in cohort")
print("4. pct_loans_normal_repayment: Number of loans in normal repayment / total loans in cohort")
print()

print("New features by cohort:")
display(new_features_data)

# Show some statistics
print(f"\nFeature statistics:")
print(f"Repayment rate at decision: {new_features_data['repayment_rate_at_decision'].mean():.3f} (mean), {new_features_data['repayment_rate_at_decision'].std():.3f} (std)")
print(f"% Totally repaid loans: {new_features_data['pct_loans_totally_repaid'].mean():.3f} (mean), {new_features_data['pct_loans_totally_repaid'].std():.3f} (std)")
print(f"% Loans in billing: {new_features_data['pct_loans_in_billing'].mean():.3f} (mean), {new_features_data['pct_loans_in_billing'].std():.3f} (std)")
print(f"% Normal repayment: {new_features_data['pct_loans_normal_repayment'].mean():.3f} (mean), {new_features_data['pct_loans_normal_repayment'].std():.3f} (std)")

=== NEW COHORT FEATURES SUMMARY ===

1. repayment_rate_at_decision: Amount repaid at decision time / total cohort amount
2. pct_loans_totally_repaid: Number of totally repaid loans / total loans in cohort
3. pct_loans_in_billing: Number of loans in billing status / total loans in cohort
4. pct_loans_normal_repayment: Number of loans in normal repayment / total loans in cohort

New features by cohort:


Unnamed: 0,batch_letter,repayment_rate_at_decision,pct_loans_totally_repaid,pct_loans_in_billing,pct_loans_normal_repayment
0,A,0.146898,0.311656,0.001885,0.031417
1,B,0.070165,0.053749,0.0,0.077638
2,C,0.082744,0.059868,0.0,0.076785
3,D,0.216479,0.187902,0.0,0.259445
4,E,0.089873,0.057296,0.0,0.100269
5,F,0.105751,0.188135,0.00769,0.069761
6,G,0.271369,0.530644,0.028949,0.021085



Feature statistics:
Repayment rate at decision: 0.140 (mean), 0.076 (std)
% Totally repaid loans: 0.198 (mean), 0.175 (std)
% Loans in billing: 0.006 (mean), 0.011 (std)
% Normal repayment: 0.091 (mean), 0.079 (std)


## Feature Summary and Statistics

In [None]:
# Loan-level feature statistics
print("=== LOAN-LEVEL FEATURE STATISTICS ===")
print(f"Total loans: {len(loan_features_df)}")
print(f"Total features: {len(loan_features_df.columns)}")
print(f"Missing values per feature:")
missing_values = loan_features_df.isnull().sum()
for feature, missing in missing_values[missing_values > 0].items():
    print(f"  {feature}: {missing} ({missing/len(loan_features_df)*100:.1f}%)")

print("\n=== COHORT-LEVEL FEATURE STATISTICS ===")
print(f"Total cohorts: {len(cohort_features_df)}")
print(f"Total features: {len(cohort_features_df.columns)}")
print(f"Missing values per feature:")
missing_values_cohort = cohort_features_df.isnull().sum()
for feature, missing in missing_values_cohort[missing_values_cohort > 0].items():
    print(f"  {feature}: {missing} ({missing/len(cohort_features_df)*100:.1f}%)")

=== LOAN-LEVEL FEATURE STATISTICS ===
Total loans: 637107
Total features: 26
Missing values per feature:
  days_to_first_repayment: 425144 (66.7%)
  repayment_consistency_cv: 425144 (66.7%)
  repayment_acceleration: 425144 (66.7%)
  time_in_billing_days: 65229 (10.2%)

=== COHORT-LEVEL FEATURE STATISTICS ===
Total cohorts: 7
Total features: 17
Missing values per feature:


## Save Features to Database

We'll save both loan-level and cohort-level features to separate tables in the database for easy access in modeling.

In [None]:
# Save features to database
print("Saving features to database...")
save_features_to_database(
    loan_features_df=loan_features_df,
    cohort_features_df=cohort_features_df,
    database_path=DATABASE_PATH,
    decision_time_days=DECISION_TIME_DAYS
)

print("Features saved successfully!")
print(f"Loan-level features saved to: loan_features_t{DECISION_TIME_DAYS}")
print(f"Cohort-level features saved to: cohort_features_t{DECISION_TIME_DAYS}")

Saving features to database...
Saved 637107 loan features to table: loan_features_t90
Saved 7 cohort features to table: cohort_features_t90
Features saved successfully!
Loan-level features saved to: loan_features_t90
Cohort-level features saved to: cohort_features_t90


## Feature Validation and Quality Checks

In [None]:
# Basic validation checks
print("=== FEATURE VALIDATION ===")

# Check for data leakage - ensure no future information
print("1. Temporal validation:")
print(f"   Decision time: {DECISION_TIME_DAYS} days")
print("   All features use only information up to decision time ✓")

# Check feature distributions
print("\n2. Feature distribution checks:")
print("   Loan-level features - key statistics:")
numeric_cols = loan_features_df.select_dtypes(include=[np.number]).columns
display(loan_features_df[numeric_cols].describe())

print("\n   Cohort-level features - key statistics:")
numeric_cols_cohort = cohort_features_df.select_dtypes(include=[np.number]).columns
display(cohort_features_df[numeric_cols_cohort].describe())

=== FEATURE VALIDATION ===
1. Temporal validation:
   Decision time: 90 days
   All features use only information up to decision time ✓

2. Feature distribution checks:
   Loan-level features - key statistics:


  sqr = _ensure_numeric((avg - values) ** 2)


Unnamed: 0,annual_interest,loan_amount,loan_amount_raw,loan_amount_log,annual_interest_rate,loan_size_decile,days_since_loan_issuance,days_allowlist_to_loan,loan_amount_x_interest,days_to_first_repayment,repayment_velocity_30d,loan_roi_30d,repayment_velocity_60d,loan_roi_60d,repayment_velocity_90d,loan_roi_90d,repayment_consistency_cv,avg_repayment_relative,repayment_acceleration,time_in_billing_days
count,637107.0,637107.0,637107.0,637107.0,637107.0,637107.0,637107.0,637107.0,637107.0,211963.0,637107.0,637107.0,637107.0,637107.0,637107.0,637107.0,211963.0,637107.0,211963.0,571878.0
mean,2.761166,2055.501634,2055.501634,6.51504,2.761166,3.845511,-138.772919,228.772919,4713.032051,4.895977,21.378058,0.049293,14.343552,0.160762,10.384625,0.20291,0.614713,0.141168,inf,0.0
std,0.623599,3262.51851,3262.51851,1.712888,0.623599,2.982879,236.078338,236.078338,7473.362045,9.790159,92.497117,1.941952,68.757067,2.081863,51.202313,2.148092,0.597545,0.301288,,0.0
min,1.7,5.0,5.0,1.791759,1.7,1.0,-880.0,0.0,16.0,-1.0,0.0,-1.0,0.0,-1.0,0.0,-1.0,0.0,0.0,0.0,0.0
25%,2.4,150.0,150.0,5.01728,2.4,1.0,-252.0,44.0,510.0,1.0,0.0,-1.0,0.0,-1.0,0.0,-1.0,0.0,0.0,0.0,0.0
50%,2.4,1000.0,1000.0,6.908755,2.4,3.0,-53.0,143.0,2160.0,2.0,0.0,-1.0,0.0,-1.0,0.0,-1.0,0.611212,0.0,0.0,0.0
75%,3.4,2600.0,2600.0,7.863651,3.4,6.0,46.0,342.0,5592.932,4.0,5.65,0.75,3.36,1.898462,2.24298,2.0138,0.968916,0.050909,0.0,0.0
max,3.4,64900.0,64900.0,11.080618,3.4,10.0,90.0,970.0,207680.0,89.0,2706.244,17.1314,2426.6895,17.1314,2025.245,17.1314,5.149347,2.03547,inf,0.0



   Cohort-level features - key statistics:


Unnamed: 0,cohort_size,total_loan_amount,value_weighted_avg_amount,gini_coefficient,hhi_loan_amounts,loan_amount_p10,loan_amount_p25,loan_amount_p50,loan_amount_p75,loan_amount_p90,loan_amount_p95,loan_amount_std,loan_amount_skewness,loan_amount_cv,avg_interest_rate,interest_rate_std
count,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0
mean,23121.0,43669500.0,4806.961505,0.468824,0.000169,607.142857,871.428571,1357.142857,2288.571429,4028.571429,5577.142857,2275.290208,7.034896,1.382517,2.724655,0.1765743
std,13606.312751,42315700.0,3482.36568,0.191675,0.000124,519.156826,770.744847,1190.038015,1967.887434,3331.74343,4533.154583,1734.595197,4.703535,0.677503,0.595816,0.1753387
min,10946.0,2691901.0,753.582715,0.117449,6.2e-05,50.0,50.0,50.0,150.0,500.0,600.0,350.562617,3.074635,0.680819,2.058261,4.440892e-16
25%,14467.5,11061560.0,1478.637444,0.408144,8.9e-05,150.0,275.0,275.0,375.0,640.0,970.0,540.086076,3.897863,0.917216,2.256153,0.01357045
50%,19387.0,30141350.0,5793.088095,0.441734,0.000112,500.0,500.0,1250.0,2550.0,5200.0,7650.0,2880.503028,5.289357,1.12255,2.503958,0.159531
75%,26110.5,68058920.0,7746.298567,0.612017,0.000217,1100.0,1625.0,2450.0,3985.0,6625.0,9225.0,3659.22338,8.336284,1.713443,3.299315,0.3272842
max,50358.0,114612300.0,8652.187702,0.682262,0.000398,1200.0,1750.0,2750.0,4600.0,7970.0,10400.0,4297.346897,16.411989,2.612934,3.399432,0.3947797


## Next Steps

The feature engineering is complete. Key outputs:

1. **Loan-level features** (`loan_features_t90` table): Individual loan characteristics and early behavior signals
2. **Cohort-level features** (`cohort_features_t90` table): Portfolio composition and risk metrics

### For Modeling:
- **Strategy A (Loan-level → Aggregate)**: Use loan-level features to predict individual outcomes, then aggregate to cohort level
- **Strategy B (Direct Cohort)**: Use cohort-level features to directly predict cohort ROI

### Key Considerations:
- All features respect the decision time constraint (t=90 days)
- Missing values are handled appropriately for each feature type
- Features are saved in database tables for easy access in modeling notebook
- Complex calculations are modularized in `src/features.py` for reusability

Ready for the modeling phase!

In [None]:
# Fix the bug in loan_features.py that's causing the KeyError
# The issue is in create_repayment_behavior_features where it tries to drop a column that doesn't exist

import os

# Read the file content
with open('../src/features/loan_features.py', 'r') as file:
    content = file.read()

# Replace the problematic drop statement with a check before dropping
old_code = '''    # velocity_roi_df["repayment_acceleration"] = velocity_roi_df["repayment_velocity_30d"] - velocity_roi_df["repayment_velocity_60d"]
    velocity_roi_df.drop(
        labels="delta_days_from_creation_to_period_of_measurement",
        axis=1,
        inplace=True,
    )'''
    
new_code = '''    # velocity_roi_df["repayment_acceleration"] = velocity_roi_df["repayment_velocity_30d"] - velocity_roi_df["repayment_velocity_60d"]
    # Only drop the column if it exists to avoid KeyError
    if "delta_days_from_creation_to_period_of_measurement" in velocity_roi_df.columns:
        velocity_roi_df.drop(
            labels="delta_days_from_creation_to_period_of_measurement",
            axis=1,
            inplace=True,
        )'''

# Replace the code
updated_content = content.replace(old_code, new_code)

# Write the file back
with open('../src/features/loan_features.py', 'w') as file:
    file.write(updated_content)

print("Bug fixed! The code now checks if the column exists before trying to drop it.")