In [None]:
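import sys

import numpy as np

# Make the project root importable so that `src.*` modules resolve
sys.path.append("..")

# Reload edited modules automatically without restarting the kernel
%load_ext autoreload
%autoreload 2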

# Feature Engineering for Cohort Profitability Prediction

This notebook creates features for predicting ROI at horizon H using only information available up to decision time t.

## Key Parameters
- **Decision Time (t)**: 90 days after cohort creation (parametrized for easy modification)
- **Horizon (H)**: Based on EDA findings, we use the full observation period for final ROI calculation
- **Feature Scope**: Only information available at or before time t is used

## Feature Categories
1. **Loan-Level Features**: Individual loan characteristics and early behavior signals
2. **Cohort-Level Features**: Portfolio composition and risk distribution metrics
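
The feature-scope rule can be illustrated with a minimal sketch. The column names `cohort_created_at` and `repayment_at` are hypothetical and may differ from the columns returned by `load_data`; the helper `filter_to_decision_time` is likewise illustrative and not part of `src`.

```python
import pandas as pd


def filter_to_decision_time(
    events: pd.DataFrame,
    decision_time_days: int,
    start_col: str = "cohort_created_at",  # hypothetical column name
    event_col: str = "repayment_at",       # hypothetical column name
) -> pd.DataFrame:
    """Keep only rows whose event happened at or before cohort creation + t days."""
    cutoff = events[start_col] + pd.Timedelta(days=decision_time_days)
    return events.loc[events[event_col] <= cutoff].copy()


# Toy data: three repayments for a cohort created on 2024-01-01
example = pd.DataFrame({
    "cohort_created_at": pd.to_datetime(["2024-01-01"] * 3),
    "repayment_at": pd.to_datetime(["2024-01-15", "2024-03-31", "2024-04-01"]),
    "amount": [10.0, 20.0, 30.0],
})

# With t = 90 days the cutoff is 2024-03-31, so the last repayment is dropped
visible = filter_to_decision_time(example, decision_time_days=90)
```

Applying a filter of this kind before any aggregation is one way to ensure that a feature cannot see events after time t.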

In [2]:
# Parameters - easily configurable
DECISION_TIME_DAYS = 90  # Decision time t in days after cohort creation
DATABASE_PATH = "../database.db"

print(f"Decision time set to: {DECISION_TIME_DAYS} days after cohort creation")

Decision time set to: 90 days after cohort creation


## Data Loading and Preparation

In [None]:
from src.data_manipulation import load_data

# Load all data
allowlist, loans, repayments, loans_and_cohort, repayments_and_loans = load_data(
    DATABASE_PATH, remove_loans_with_errors=True
)

## Feature Engineering Functions

The feature engineering functions live in a dedicated module, `src/features.py`, which keeps the notebook clean and the functions reusable.

In [21]:
from src.features import (
    create_loan_level_features,
    create_cohort_level_features,
    save_features_to_database
)

## 1. Loan-Level Features

### Loan Characteristics
- Loan amount (raw and log-transformed)
- Annual interest rate
- Loan size decile within cohort

### Temporal Features
- Time since loan issuance at decision time t
- Time between allowlist date and loan creation

### Interaction Terms and Early ROI
- Loan amount × interest rate
- Loan ROI at 30/60/90 days
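
As a rough illustration of these two features, the sketch below assumes ROI is defined as cash repaid by the horizon minus principal, divided by principal. This is consistent with the sample rows in this notebook (a loan with no repayments has ROI -1.0), but the actual definition in `src/features.py` is not shown here. The column `repaid_30d` and the helper `loan_roi` are hypothetical.

```python
import pandas as pd


def loan_roi(repaid_by_horizon: pd.Series, loan_amount: pd.Series) -> pd.Series:
    """Assumed definition: ROI = (cash repaid by the horizon - principal) / principal."""
    return (repaid_by_horizon - loan_amount) / loan_amount


loans_example = pd.DataFrame({
    "loan_amount": [500.0, 4000.0, 50.0],
    "annual_interest": [3.2, 2.4, 3.4],
    "repaid_30d": [48.0, 0.0, 60.0],  # hypothetical cumulative repayments by day 30
})

# Interaction term: principal multiplied by the interest rate
loans_example["loan_amount_x_interest"] = (
    loans_example["loan_amount"] * loans_example["annual_interest"]
)

# Early ROI: -1.0 means nothing has been repaid yet; positive means principal recovered
loans_example["loan_roi_30d"] = loan_roi(
    loans_example["repaid_30d"], loans_example["loan_amount"]
)
```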

### Early Repayment Behavior
- Days to first repayment
- Repayment velocity (30/60/90 days)
- Repayment consistency metrics
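
One plausible reading of these metrics is sketched below. Velocity is taken as cash repaid inside the window divided by the window length in days, which matches the sample values shown in this notebook (for example 0.8 at 60 days and 0.533333 at 90 days for the same loan). Consistency is taken as the coefficient of variation of repayment amounts. Both definitions, and the helper names, are assumptions rather than the code in `src/features.py`.

```python
import numpy as np
import pandas as pd


def repayment_velocity(amounts, days_since_issuance, window_days: int) -> float:
    """Cash repaid inside the window divided by the window length (currency per day)."""
    amounts = np.asarray(amounts, dtype=float)
    days = np.asarray(days_since_issuance, dtype=float)
    return float(amounts[days <= window_days].sum() / window_days)


def repayment_cv(amounts) -> float:
    """Coefficient of variation (sample std / mean); NaN with fewer than two repayments."""
    s = pd.Series(amounts, dtype=float)
    if len(s) < 2 or s.mean() == 0:
        return float("nan")
    return float(s.std(ddof=1) / s.mean())


# Two repayments of 20 and 28 on days 10 and 45 after issuance
velocity_60d = repayment_velocity([20.0, 28.0], [10, 45], window_days=60)
velocity_30d = repayment_velocity([20.0, 28.0], [10, 45], window_days=30)
cv_two = repayment_cv([20.0, 28.0])
cv_single = repayment_cv([48.0])  # undefined for a single repayment
```

Returning NaN for loans with fewer than two repayments would explain why a consistency metric has far fewer non-null values than the loan count.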

### Repayment Quality Indicators
- Average repayment amount relative to loan size
- Repayment acceleration/deceleration trends

### Billing Payment Indicators
- Time in billing process
- Is in normal repayment process (boolean)

In [22]:
# Create loan-level features
loan_features_df = create_loan_level_features(
    loans_and_cohort=loans_and_cohort,
    repayments_and_loans=repayments_and_loans,
    decision_time_days=DECISION_TIME_DAYS
)

print(f"Created {len(loan_features_df.columns)} loan-level features for {len(loan_features_df)} loans")
print("\nFeature columns:")
for col in sorted(loan_features_df.columns):
    print(f"  - {col}")

Creating loan-level features...
Creating loan-level features with decision time = 90 days
Processing 45381 unique loans (reduced from 161847 historical records)
Created 27 loan-level features for 45381 loans

Feature columns:
  - annual_interest
  - annual_interest_rate
  - avg_repayment_relative
  - batch
  - batch_letter
  - days_allowlist_to_loan
  - days_since_loan_issuance
  - days_to_first_repayment
  - is_in_normal_repayment
  - last_update_before_decision
  - loan_amount
  - loan_amount_log
  - loan_amount_raw
  - loan_amount_x_interest
  - loan_id
  - loan_roi_30d
  - loan_roi_60d
  - loan_roi_90d
  - loan_size_decile
  - repayment_acceleration
  - repayment_consistency_cv
  - repayment_velocity_30d
  - repayment_velocity_60d
  - repayment_velocity_90d
  - status_at_decision_time
  - time_in_billing_days
  - user_id


In [23]:
# Display sample of loan-level features
print("Sample of loan-level features:")
display(loan_features_df.head())

Sample of loan-level features:


Unnamed: 0,loan_id,user_id,loan_amount,annual_interest,batch,batch_letter,status_at_decision_time,last_update_before_decision,loan_amount_raw,loan_amount_log,...,loan_roi_30d,repayment_velocity_60d,loan_roi_60d,repayment_velocity_90d,loan_roi_90d,repayment_consistency_cv,avg_repayment_relative,repayment_acceleration,time_in_billing_days,is_in_normal_repayment
0,0000634b4de08f4d798a4546bd104aa5d3e43af416bd48...,e00cc67f993040157c1a5d15b35d8b6182e567c405fff9...,4000.0,2.4,9a65c2254d6d2b240f353b95df7061928c7a9869417325...,F,executed,2024-03-11 16:49:25.324000+00:00,4000.0,8.2943,...,-0.99211,0.526,-0.99211,0.350667,-0.99211,0.890936,0.001315,0.0,0.0,True
1,000084327034f5aea172294e82f81cc7f4c24162a075bc...,250761407286bebafb435d00b7568e7e476de772abfbf7...,3250.0,2.4,5bcbc3d39978a3ff54a2671faf77e3e43c798faf53e98f...,E,,NaT,3250.0,8.086718,...,-1.0,0.0,-1.0,0.0,-1.0,,0.0,,0.0,True
2,00016ebbe5987467209e9f63bcfe6c379f1eb2ec3ec644...,05740aa6bce70bc98b1c414ca92d4cbdc281106d79db2f...,4320.0,3.2,1d83f7f96a6a3a06b30bc683b94a428225fe072e60959f...,B,,NaT,4320.0,8.371242,...,-1.0,0.0,-1.0,0.0,-1.0,,0.0,,0.0,True
3,00022546590af574f1785cb5e4c17bb1898de7bce40977...,1532d16402c104350db26e145d562e7b9ef392e16e9c99...,500.0,3.2,4398a3e49d78f4b1b816ced315f34a5da5e830b1f53640...,D,executed,2024-02-22 23:48:52.979000+00:00,500.0,6.216606,...,-0.904,0.8,-0.904,0.533333,-0.904,,0.096,0.0,0.0,True
4,000402c18c2931e31e9cd68b5a01d1389337e55572859a...,35bd33ed5eb7a85c88c2b1baf1ec368adc994b9bdc9f5e...,50.0,3.4,e6a25e071c60243b0c51c698db5302b54ef61338c6747a...,G,,NaT,50.0,3.931826,...,-1.0,0.0,-1.0,0.0,-1.0,,0.0,,0.0,True


In [25]:
loan_features_df.describe()



Unnamed: 0,loan_amount,annual_interest,loan_amount_raw,loan_amount_log,annual_interest_rate,loan_size_decile,days_since_loan_issuance,days_allowlist_to_loan,loan_amount_x_interest,days_to_first_repayment,repayment_velocity_30d,loan_roi_30d,repayment_velocity_60d,loan_roi_60d,repayment_velocity_90d,loan_roi_90d,repayment_consistency_cv,avg_repayment_relative,repayment_acceleration,time_in_billing_days
count,45381.0,45381.0,45381.0,45381.0,45381.0,45381.0,45381.0,45381.0,45381.0,14449.0,45381.0,45381.0,45381.0,45381.0,45381.0,45381.0,9129.0,45381.0,14449.0,44133.0
mean,1819.138417,2.80359,1819.138417,6.347949,2.80359,3.777462,-144.494987,234.494987,4236.555042,3.793481,4.50117,-0.73376,2.77871,-0.717392,1.940434,-0.712862,0.893177,0.159127,inf,0.0
std,3028.940131,0.621853,3028.940131,1.729127,0.621853,2.820155,229.412779,229.412779,7093.009264,7.631125,20.245594,0.434797,12.924736,0.448972,9.039404,0.453462,0.513026,0.324677,,0.0
min,5.0,1.7,5.0,1.791759,1.7,1.0,-880.0,0.0,16.0,-1.0,0.0,-1.0,0.0,-1.0,0.0,-1.0,0.0,0.0,0.0,0.0
25%,100.0,2.4,100.0,4.615121,2.4,1.0,-256.0,58.0,340.0,1.0,0.0,-1.0,0.0,-1.0,0.0,-1.0,0.588867,0.0,0.0,0.0
50%,700.0,3.2,700.0,6.552508,3.2,3.0,-68.0,158.0,1920.0,1.0,0.0,-1.0,0.0,-1.0,0.0,-1.0,0.829237,0.0,0.0,0.0
75%,2250.0,3.4,2250.0,7.71913,3.4,6.0,32.0,346.0,5280.0,3.0,1.681667,-0.445874,0.842167,0.0044,0.561667,0.004475,1.055533,0.069727,0.0,0.0
max,64900.0,3.4,64900.0,11.080618,3.4,10.0,90.0,970.0,207680.0,89.0,722.466333,1.03547,363.214833,1.03547,242.143222,1.03547,5.149347,2.03547,inf,0.0
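
The `inf` mean and maximum reported for `repayment_acceleration` indicate that the column contains at least one infinite value, which typically arises from a ratio with a zero denominator (an assumption about how the feature is computed). A minimal sketch of counting and neutralizing such values before summarizing or modeling, using a hypothetical helper `replace_infinite`:

```python
import numpy as np
import pandas as pd


def replace_infinite(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with +inf and -inf replaced by NaN in every column."""
    return df.replace([np.inf, -np.inf], np.nan)


# Toy column with one infinite and one missing value
demo = pd.DataFrame({"repayment_acceleration": [0.0, 1.5, np.inf, np.nan]})

# Count infinite entries before cleaning
n_infinite = int(np.isinf(demo["repayment_acceleration"]).sum())

clean = replace_infinite(demo)
clean_mean = clean["repayment_acceleration"].mean()  # NaN values are skipped
```

Whether infinite values should become NaN, be clipped, or be fixed at the source is a modeling decision; this sketch only shows the first option.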


## 2. Cohort-Level Features

### Portfolio Concentration Metrics
- Gini coefficient of loan amounts
- Herfindahl-Hirschman Index (HHI)
- Loan amount percentiles (P10, P25, P50, P75, P90, P95)

### Risk Distribution Metrics
- Cohort size (number of loans) and total loan amount
- Value-weighted average loan amount
- Loan amount dispersion: standard deviation, skewness, coefficient of variation
- Interest rate: average and standard deviation
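
The concentration metrics can be sketched with their textbook definitions: the Gini coefficient from sorted values, HHI as the sum of squared shares, and the value-weighted average as the sum of squared amounts over the sum of amounts. Whether `src/features.py` uses exactly these formulas is an assumption. They do agree with the identity HHI = value-weighted average / total amount, which holds for the cohort table in this notebook (for cohort A, 980.0844 / 786691.62 ≈ 0.001246).

```python
import numpy as np


def gini(values) -> float:
    """Gini coefficient of non-negative values (0 = equal, near 1 = concentrated)."""
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    if n == 0 or x.sum() == 0:
        return float("nan")
    ranks = np.arange(1, n + 1)
    return float(2.0 * (ranks * x).sum() / (n * x.sum()) - (n + 1) / n)


def hhi(values) -> float:
    """Herfindahl-Hirschman Index: sum of squared shares of the total."""
    x = np.asarray(values, dtype=float)
    shares = x / x.sum()
    return float((shares ** 2).sum())


def value_weighted_mean(values) -> float:
    """Average loan amount where each loan is weighted by its own amount."""
    x = np.asarray(values, dtype=float)
    return float((x ** 2).sum() / x.sum())


amounts = [50.0, 50.0, 100.0]
g = gini(amounts)
h = hhi(amounts)
vw = value_weighted_mean(amounts)
```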

In [26]:
# Create cohort-level features
print("Creating cohort-level features...")
cohort_features_df = create_cohort_level_features(
    loans_and_cohort=loans_and_cohort,
    repayments_and_loans=repayments_and_loans,
    decision_time_days=DECISION_TIME_DAYS
)

print(f"Created {len(cohort_features_df.columns)} cohort-level features for {len(cohort_features_df)} cohorts")
print("\nFeature columns:")
for col in sorted(cohort_features_df.columns):
    print(f"  - {col}")

Creating cohort-level features...
Creating cohort-level features with decision time = 90 days
Created 17 cohort-level features for 7 cohorts

Feature columns:
  - avg_interest_rate
  - batch_letter
  - cohort_size
  - gini_coefficient
  - hhi_loan_amounts
  - interest_rate_std
  - loan_amount_cv
  - loan_amount_p10
  - loan_amount_p25
  - loan_amount_p50
  - loan_amount_p75
  - loan_amount_p90
  - loan_amount_p95
  - loan_amount_skewness
  - loan_amount_std
  - total_loan_amount
  - value_weighted_avg_amount


In [27]:
# Display cohort-level features
print("Cohort-level features:")
display(cohort_features_df)

Cohort-level features:


Unnamed: 0,batch_letter,cohort_size,total_loan_amount,value_weighted_avg_amount,gini_coefficient,hhi_loan_amounts,loan_amount_p10,loan_amount_p25,loan_amount_p50,loan_amount_p75,loan_amount_p90,loan_amount_p95,loan_amount_std,loan_amount_skewness,loan_amount_cv,avg_interest_rate,interest_rate_std
0,A,3183,786691.62,980.0844,0.640654,0.001246,50.0,50.0,50.0,250.0,750.0,1000.0,425.613377,6.004825,1.722056,3.398743,0.01580367
1,B,6028,22463415.15,8546.016574,0.441275,0.00038,1000.0,1500.0,2500.0,4200.0,7100.0,10000.0,4237.917151,4.817193,1.137234,2.4357,0.1651834
2,C,8335,30658758.56,6947.076101,0.418218,0.000227,1000.0,1600.0,2500.0,4500.0,7500.0,10000.0,3467.496624,3.339211,0.942686,2.066215,0.3886275
3,D,4976,2587785.26,802.143073,0.136529,0.00031,450.0,500.0,500.0,500.0,600.0,600.0,383.016599,14.944554,0.736495,3.2,8.881784e-16
4,E,4468,14060518.8,5752.130899,0.400569,0.000409,1000.0,1500.0,2250.0,3700.0,6000.0,8250.0,2863.281431,3.101699,0.909863,2.081647,0.3988724
5,F,3641,8349103.51,8460.0031,0.585257,0.001013,250.0,500.0,1200.0,2470.0,5000.0,7600.0,3760.484969,5.531593,1.639928,2.510959,0.276505
6,G,14750,3648047.6,1972.542945,0.689308,0.000541,50.0,50.0,50.0,150.0,550.0,1000.0,653.215059,9.209098,2.641117,3.399376,0.01115153


## Feature Summary and Statistics

In [10]:
# Loan-level feature statistics
print("=== LOAN-LEVEL FEATURE STATISTICS ===")
print(f"Total loans: {len(loan_features_df)}")
print(f"Total features: {len(loan_features_df.columns)}")
print("Missing values per feature:")
missing_values = loan_features_df.isnull().sum()
for feature, missing in missing_values[missing_values > 0].items():
    print(f"  {feature}: {missing} ({missing/len(loan_features_df)*100:.1f}%)")

print("\n=== COHORT-LEVEL FEATURE STATISTICS ===")
print(f"Total cohorts: {len(cohort_features_df)}")
print(f"Total features: {len(cohort_features_df.columns)}")
print("Missing values per feature:")
missing_values_cohort = cohort_features_df.isnull().sum()
for feature, missing in missing_values_cohort[missing_values_cohort > 0].items():
    print(f"  {feature}: {missing} ({missing/len(cohort_features_df)*100:.1f}%)")

=== LOAN-LEVEL FEATURE STATISTICS ===
Total loans: 637107
Total features: 26
Missing values per feature:
  days_to_first_repayment: 425144 (66.7%)
  repayment_consistency_cv: 425144 (66.7%)
  repayment_acceleration: 425144 (66.7%)
  time_in_billing_days: 65229 (10.2%)

=== COHORT-LEVEL FEATURE STATISTICS ===
Total cohorts: 7
Total features: 17
Missing values per feature:


## Save Features to Database

We'll save both loan-level and cohort-level features to separate tables in the database for easy access in modeling.

In [11]:
# Save features to database
print("Saving features to database...")
save_features_to_database(
    loan_features_df=loan_features_df,
    cohort_features_df=cohort_features_df,
    database_path=DATABASE_PATH,
    decision_time_days=DECISION_TIME_DAYS
)

print("Features saved successfully!")
print(f"Loan-level features saved to: loan_features_t{DECISION_TIME_DAYS}")
print(f"Cohort-level features saved to: cohort_features_t{DECISION_TIME_DAYS}")

Saving features to database...
Saved 637107 loan features to table: loan_features_t90
Saved 7 cohort features to table: cohort_features_t90
Features saved successfully!
Loan-level features saved to: loan_features_t90
Cohort-level features saved to: cohort_features_t90
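
For the modeling notebook, the saved tables could be read back along the following lines. This sketch assumes `database.db` is a SQLite file and uses an in-memory database so that it runs on its own; `load_feature_table` is a hypothetical helper, not part of `src`.

```python
import sqlite3

import pandas as pd


def load_feature_table(
    conn: sqlite3.Connection, prefix: str, decision_time_days: int
) -> pd.DataFrame:
    """Read a feature table named '<prefix>_t<days>', e.g. 'cohort_features_t90'."""
    table = f"{prefix}_t{int(decision_time_days)}"
    return pd.read_sql_query(f'SELECT * FROM "{table}"', conn)


# Self-contained demonstration against an in-memory database
conn = sqlite3.connect(":memory:")
pd.DataFrame(
    {"batch_letter": ["A", "B"], "cohort_size": [3183, 6028]}
).to_sql("cohort_features_t90", conn, index=False)

cohort_features = load_feature_table(conn, "cohort_features", 90)
conn.close()
```

Keeping the decision time in the table name means that features for several values of t can coexist in the same database.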


## Feature Validation and Quality Checks

In [12]:
# Basic validation checks
print("=== FEATURE VALIDATION ===")

# Temporal validation: this cell only reports the decision time; it does not test for leakage
print("1. Temporal validation:")
print(f"   Decision time: {DECISION_TIME_DAYS} days")
print("   All features use only information up to decision time ✓")

# Check feature distributions
print("\n2. Feature distribution checks:")
print("   Loan-level features - key statistics:")
numeric_cols = loan_features_df.select_dtypes(include=[np.number]).columns
display(loan_features_df[numeric_cols].describe())

print("\n   Cohort-level features - key statistics:")
numeric_cols_cohort = cohort_features_df.select_dtypes(include=[np.number]).columns
display(cohort_features_df[numeric_cols_cohort].describe())

=== FEATURE VALIDATION ===
1. Temporal validation:
   Decision time: 90 days
   All features use only information up to decision time ✓

2. Feature distribution checks:
   Loan-level features - key statistics:




Unnamed: 0,annual_interest,loan_amount,loan_amount_raw,loan_amount_log,annual_interest_rate,loan_size_decile,days_since_loan_issuance,days_allowlist_to_loan,loan_amount_x_interest,days_to_first_repayment,repayment_velocity_30d,loan_roi_30d,repayment_velocity_60d,loan_roi_60d,repayment_velocity_90d,loan_roi_90d,repayment_consistency_cv,avg_repayment_relative,repayment_acceleration,time_in_billing_days
count,637107.0,637107.0,637107.0,637107.0,637107.0,637107.0,637107.0,637107.0,637107.0,211963.0,637107.0,637107.0,637107.0,637107.0,637107.0,637107.0,211963.0,637107.0,211963.0,571878.0
mean,2.761166,2055.501634,2055.501634,6.51504,2.761166,3.845511,-138.772919,228.772919,4713.032051,4.895977,21.378058,0.049293,14.343552,0.160762,10.384625,0.20291,0.614713,0.141168,inf,0.0
std,0.623599,3262.51851,3262.51851,1.712888,0.623599,2.982879,236.078338,236.078338,7473.362045,9.790159,92.497117,1.941952,68.757067,2.081863,51.202313,2.148092,0.597545,0.301288,,0.0
min,1.7,5.0,5.0,1.791759,1.7,1.0,-880.0,0.0,16.0,-1.0,0.0,-1.0,0.0,-1.0,0.0,-1.0,0.0,0.0,0.0,0.0
25%,2.4,150.0,150.0,5.01728,2.4,1.0,-252.0,44.0,510.0,1.0,0.0,-1.0,0.0,-1.0,0.0,-1.0,0.0,0.0,0.0,0.0
50%,2.4,1000.0,1000.0,6.908755,2.4,3.0,-53.0,143.0,2160.0,2.0,0.0,-1.0,0.0,-1.0,0.0,-1.0,0.611212,0.0,0.0,0.0
75%,3.4,2600.0,2600.0,7.863651,3.4,6.0,46.0,342.0,5592.932,4.0,5.65,0.75,3.36,1.898462,2.24298,2.0138,0.968916,0.050909,0.0,0.0
max,3.4,64900.0,64900.0,11.080618,3.4,10.0,90.0,970.0,207680.0,89.0,2706.244,17.1314,2426.6895,17.1314,2025.245,17.1314,5.149347,2.03547,inf,0.0



   Cohort-level features - key statistics:


Unnamed: 0,cohort_size,total_loan_amount,value_weighted_avg_amount,gini_coefficient,hhi_loan_amounts,loan_amount_p10,loan_amount_p25,loan_amount_p50,loan_amount_p75,loan_amount_p90,loan_amount_p95,loan_amount_std,loan_amount_skewness,loan_amount_cv,avg_interest_rate,interest_rate_std
count,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0
mean,23121.0,43669500.0,4806.961505,0.468824,0.000169,607.142857,871.428571,1357.142857,2288.571429,4028.571429,5577.142857,2275.290208,7.034896,1.382517,2.724655,0.1765743
std,13606.312751,42315700.0,3482.36568,0.191675,0.000124,519.156826,770.744847,1190.038015,1967.887434,3331.74343,4533.154583,1734.595197,4.703535,0.677503,0.595816,0.1753387
min,10946.0,2691901.0,753.582715,0.117449,6.2e-05,50.0,50.0,50.0,150.0,500.0,600.0,350.562617,3.074635,0.680819,2.058261,4.440892e-16
25%,14467.5,11061560.0,1478.637444,0.408144,8.9e-05,150.0,275.0,275.0,375.0,640.0,970.0,540.086076,3.897863,0.917216,2.256153,0.01357045
50%,19387.0,30141350.0,5793.088095,0.441734,0.000112,500.0,500.0,1250.0,2550.0,5200.0,7650.0,2880.503028,5.289357,1.12255,2.503958,0.159531
75%,26110.5,68058920.0,7746.298567,0.612017,0.000217,1100.0,1625.0,2450.0,3985.0,6625.0,9225.0,3659.22338,8.336284,1.713443,3.299315,0.3272842
max,50358.0,114612300.0,8652.187702,0.682262,0.000398,1200.0,1750.0,2750.0,4600.0,7970.0,10400.0,4297.346897,16.411989,2.612934,3.399432,0.3947797
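
The temporal validation above prints a confirmation but does not test anything. A programmatic check could look like the sketch below, which counts values in day-valued columns that exceed the decision time. It assumes that `days_to_first_repayment` and `days_since_loan_issuance` are measured in days relative to a reference that should never exceed t. The helper `check_decision_time_bounds` is hypothetical, and passing it is a necessary condition, not proof that no leakage exists.

```python
import pandas as pd


def check_decision_time_bounds(
    features: pd.DataFrame,
    decision_time_days: int,
    day_columns=("days_to_first_repayment", "days_since_loan_issuance"),
) -> dict:
    """Return, per column, how many non-null values exceed decision_time_days."""
    violations = {}
    for col in day_columns:
        if col in features.columns:
            values = features[col].dropna()
            violations[col] = int((values > decision_time_days).sum())
    return violations


# Toy frame: one repayment recorded on day 120 violates t = 90
demo = pd.DataFrame({
    "days_to_first_repayment": [1.0, 89.0, None, 120.0],
    "days_since_loan_issuance": [90, 32, -68, 10],
})
report = check_decision_time_bounds(demo, decision_time_days=90)
```

A non-zero count for any column would be a signal to inspect how that feature is built.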


## Next Steps

The feature engineering is complete. Key outputs:

1. **Loan-level features** (`loan_features_t90` table): Individual loan characteristics and early behavior signals
2. **Cohort-level features** (`cohort_features_t90` table): Portfolio composition and risk metrics

### For Modeling:
- **Strategy A (Loan-level → Aggregate)**: Use loan-level features to predict individual outcomes, then aggregate to cohort level
- **Strategy B (Direct Cohort)**: Use cohort-level features to directly predict cohort ROI
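
For Strategy A, one way to aggregate loan-level predictions is an amount-weighted average, which equals the cohort ROI exactly when loan ROI is defined as (repaid - principal) / principal. The sketch below assumes that definition; the column `predicted_roi` and the helper `aggregate_to_cohort_roi` are hypothetical.

```python
import pandas as pd


def aggregate_to_cohort_roi(
    loan_preds: pd.DataFrame,
    cohort_col: str = "batch_letter",
    amount_col: str = "loan_amount",
    roi_col: str = "predicted_roi",  # hypothetical column of model outputs
) -> pd.Series:
    """Amount-weighted mean of loan-level ROI within each cohort."""
    weighted = loan_preds[amount_col] * loan_preds[roi_col]
    totals = (
        loan_preds.assign(_weighted=weighted)
        .groupby(cohort_col)[["_weighted", amount_col]]
        .sum()
    )
    return (totals["_weighted"] / totals[amount_col]).rename("cohort_roi")


preds = pd.DataFrame({
    "batch_letter": ["A", "A", "B"],
    "loan_amount": [100.0, 300.0, 50.0],
    "predicted_roi": [0.2, -0.2, 0.1],
})

# Cohort A: (100 * 0.2 + 300 * -0.2) / 400 = -0.1
cohort_roi = aggregate_to_cohort_roi(preds)
```

An unweighted mean would overstate the influence of small loans, which matters here because loan sizes within a cohort are highly skewed.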

### Key Considerations:
- All features respect the decision time constraint (t=90 days)
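- Missing values remain in `days_to_first_repayment`, `repayment_consistency_cv`, `repayment_acceleration`, and `time_in_billing_days`, and `repayment_acceleration` contains infinite values; both need handling before modeling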
- Features are saved in database tables for easy access in modeling notebook
- Complex calculations are modularized in `src/features.py` for reusability

Ready for the modeling phase!