In [1]:
import sys
import numpy as np
sys.path.append("..")

%load_ext autoreload
%autoreload 2

# Feature Engineering for Cohort Profitability Prediction

This notebook creates features for predicting ROI at horizon H using only information available up to decision time t.

## Key Parameters
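- **Decision Time (t)**: Days after cohort creation, set by `DECISION_TIME_DAYS` in the parameters cell below (currently 180; some recorded outputs further down come from an earlier run with t = 90)
- **Horizon (H)**: Days after cohort creation at which the target ROI is measured, set by `TIME_HORIZON_DAYS` (currently 400); based on EDA findings, this is meant to cover the full observation period used for the final ROI calculation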
- **Feature Scope**: Only information available at or before time t is used
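
The feature-scope rule amounts to dropping every event dated after cohort start plus t days. A minimal sketch of that cutoff, assuming repayment records are plain dicts with a hypothetical `paid_at` date field (the actual filtering lives in `src/features.py` and may differ):

```python
from datetime import date, timedelta

def events_up_to_decision_time(events, cohort_start, decision_time_days, date_key="paid_at"):
    """Keep only events dated at or before cohort_start + decision_time_days."""
    cutoff = cohort_start + timedelta(days=decision_time_days)
    return [e for e in events if e[date_key] <= cutoff]

# Hypothetical records, for illustration only.
repayments = [
    {"loan_id": "a", "paid_at": date(2024, 1, 10), "amount": 20.0},
    {"loan_id": "a", "paid_at": date(2024, 9, 1), "amount": 30.0},  # after the cutoff
]
visible = events_up_to_decision_time(repayments, date(2024, 1, 1), decision_time_days=180)
```

Only the first record survives, because the cutoff falls on 2024-06-29.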

## Feature Categories
1. **Loan-Level Features**: Individual loan characteristics and early behavior signals
2. **Cohort-Level Features**: Portfolio composition and risk distribution metrics

In [15]:
# Parameters - easily configurable
DECISION_TIME_DAYS = 180  # Decision time t in days after cohort creation
TIME_HORIZON_DAYS = 400  # Time horizon H in days for target variable
DATABASE_PATH = "../database.db"

print(f"Decision time set to: {DECISION_TIME_DAYS} days after cohort creation")
print(f"Time horizon set to: {TIME_HORIZON_DAYS} days for target variable")

Decision time set to: 180 days after cohort creation
Time horizon set to: 400 days for target variable


## Data Loading and Preparation

In [3]:
from src.dataset.data_manipulation import load_data

# Load all data
allowlist, loans, repayments, loans_and_cohort, repayments_and_loans = load_data(
    DATABASE_PATH, remove_loans_with_errors=True
)

## Feature Engineering Functions

We'll import feature engineering functions from a dedicated module to keep the notebook clean and functions reusable.

In [16]:
from src.features import (
    create_loan_level_features,
    create_cohort_level_features,
    save_features_to_database
)

## 1. Loan-Level Features

### Loan Characteristics
- Loan amount (raw and log-transformed)
- Annual interest rate
- Loan size decile within cohort
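
A rough sketch of the two derived size features. The summary statistics further down (minimum amount 10.0 mapping to 2.397895, which is ln 11) suggest the log transform is `log1p`. The decile rule here, a percentile rank within the cohort, is an assumption, and the `cohort` and `loan_amount` keys are illustrative:

```python
import math
from collections import defaultdict

def add_size_features(loans):
    """Return copies of the loan dicts with loan_amount_log and loan_size_decile added."""
    amounts_by_cohort = defaultdict(list)
    for loan in loans:
        amounts_by_cohort[loan["cohort"]].append(loan["loan_amount"])

    enriched = []
    for loan in loans:
        amounts = amounts_by_cohort[loan["cohort"]]
        # Number of loans in the same cohort that are no larger than this one.
        rank = sum(a <= loan["loan_amount"] for a in amounts)
        # Integer ceiling of rank * 10 / n, giving a decile in 1..10.
        decile = -(-rank * 10 // len(amounts))
        enriched.append({
            **loan,
            "loan_amount_log": math.log1p(loan["loan_amount"]),
            "loan_size_decile": decile,
        })
    return enriched

loans = [{"cohort": "A", "loan_amount": 10.0 * k} for k in range(1, 11)]
loans.append({"cohort": "B", "loan_amount": 10.0})
features = add_size_features(loans)
```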

### Temporal Features
- Time since loan issuance at decision time t
- Time between allowlist date and loan creation

### Interaction Terms
- Loan amount × interest rate
- Loan ROI at 30/60/90 days
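
The rows displayed later are consistent with ROI being the amount repaid divided by the loan amount, minus one (for example 191.80 / 4000 - 1 = -0.95205). A sketch under that reading; the choice of reference date for the window is an assumption, and the function names are hypothetical:

```python
from datetime import date, timedelta

def loan_amount_x_interest(loan_amount, annual_interest):
    """Interaction term: loan amount times annual interest rate."""
    return loan_amount * annual_interest

def loan_roi_at(repayments, loan_amount, reference_date, window_days):
    """ROI counting only repayments dated within window_days of reference_date.

    repayments is a list of (paid_at, amount) tuples.
    """
    cutoff = reference_date + timedelta(days=window_days)
    repaid = sum(amount for paid_at, amount in repayments if paid_at <= cutoff)
    return repaid / loan_amount - 1.0

# Hypothetical repayment history; the last payment falls outside the 90-day window.
repayments = [
    (date(2024, 3, 12), 100.0),
    (date(2024, 4, 1), 91.8),
    (date(2024, 12, 1), 500.0),
]
roi_90d = loan_roi_at(repayments, 4000.0, date(2024, 3, 11), 90)
```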

### Early Repayment Behavior
- Days to first repayment
- Repayment velocity (30/60/90 days)
- Repayment consistency metrics

### Repayment Quality Indicators
- Average repayment amount relative to loan size
- Repayment acceleration/deceleration trends
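
Two of these indicators can be sketched with the standard library. The average relative repayment matches the displayed rows (two repayments totalling 50.44 on a 50.0 loan give 0.5044). Using the population standard deviation for the consistency CV is an assumption, chosen because it yields 0 for a single repayment:

```python
from statistics import mean, pstdev

def avg_repayment_relative(amounts, loan_amount):
    """Mean repayment size as a fraction of the loan amount (0.0 with no repayments)."""
    if not amounts:
        return 0.0
    return mean(amounts) / loan_amount

def repayment_consistency_cv(amounts):
    """Coefficient of variation of repayment amounts; None when undefined."""
    if not amounts or mean(amounts) == 0:
        return None
    return pstdev(amounts) / mean(amounts)

amounts = [25.0, 25.44]  # hypothetical split of a 50.44 total
relative = avg_repayment_relative(amounts, 50.0)
cv_single = repayment_consistency_cv([48.0])
cv_none = repayment_consistency_cv([])
```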

### Billing Payment Indicators
- Time in billing process
- Is in normal repayment process (boolean)

In [17]:
loan_features_df = create_loan_level_features(
    loans_and_cohort,
    repayments_and_loans,
    decision_time_days=DECISION_TIME_DAYS,
    time_horizon_days=TIME_HORIZON_DAYS,
)
loan_features_df

Creating loan-level features with decision time t=180 days...
Base features dataset: 24316 unique loans
Creating repayment behavior features...




Final loan features dataset: 24316 loans with 30 features


Unnamed: 0,loan_id,user_id,created_at,updated_at,annual_interest,loan_amount,status_at_decision_time,batch,allowlisted_date,batch_letter,...,repayment_velocity_120d,loan_roi_120d,repayment_velocity_180d,loan_roi_180d,days_to_first_repayment,num_repayments,total_repaid_amount,repayment_consistency_cv,avg_repayment_relative,repayment_at_H
0,0000634b4de08f4d798a4546bd104aa5d3e43af416bd48...,e00cc67f993040157c1a5d15b35d8b6182e567c405fff9...,2024-03-11,2024-06-06 23:45:31.989,2.4,4000.0,executed,9a65c2254d6d2b240f353b95df7061928c7a9869417325...,2023-12-19,F,...,1.382703,-0.98721,1.977320,-0.952050,1.0,24,191.80,1.450956,0.001998,191.800000
1,00022546590af574f1785cb5e4c17bb1898de7bce40977...,1532d16402c104350db26e145d562e7b9ef392e16e9c99...,2023-12-07,2024-03-20 12:01:00.658,3.2,500.0,debt_collection,4398a3e49d78f4b1b816ced315f34a5da5e830b1f53640...,2023-12-05,D,...,0.406780,-0.90400,0.269663,-0.904000,9.0,1,48.00,0.000000,0.096000,48.000000
2,000402c18c2931e31e9cd68b5a01d1389337e55572859a...,35bd33ed5eb7a85c88c2b1baf1ec368adc994b9bdc9f5e...,2024-08-12,2024-08-12 15:14:57.424,3.4,50.0,repaid,e6a25e071c60243b0c51c698db5302b54ef61338c6747a...,2024-04-04,G,...,,,1.004800,0.004800,0.0,1,50.24,0.000000,1.004800,50.240000
3,000dca06cc48943ca84d7516f817709f2b7768468a9a02...,445a2b25d6692ec55caf314c6bc998c517ea9022c65735...,2024-06-01,2024-06-03 12:02:32.785,3.4,50.0,repaid,e6a25e071c60243b0c51c698db5302b54ef61338c6747a...,2024-04-04,G,...,0.813548,0.00880,0.413443,0.008800,1.0,2,50.44,0.984140,0.504400,50.440000
4,000eb39b9c161b1f71e9ad6e36194639ee58fd61a3dad4...,9d4b09514327fecbf514ec885540846ffe6aafc0753e50...,2024-04-24,2024-05-12 20:20:24.469,3.2,500.0,executed,4398a3e49d78f4b1b816ced315f34a5da5e830b1f53640...,2023-12-05,D,...,0.000000,-1.00000,0.000000,-1.000000,,0,0.00,,0.000000,725.560000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24311,fff93baddcc61ede310a0bdf21e77c393345613ef3669b...,ff94388159b0e7fafe8b47e990aefed1efddc34c31ab90...,2024-08-14,2024-08-29 15:12:15.274,3.4,50.0,repaid,e6a25e071c60243b0c51c698db5302b54ef61338c6747a...,2024-04-04,G,...,,,1.110417,0.066000,15.0,1,53.30,0.000000,1.066000,53.300000
24312,fffa7d663d32bfa90ca35a874ef5b2a842595b7627dd39...,627575c514eec900ec0ac9f1780fb41c92708b3889b58e...,2024-07-27,2024-07-31 23:03:34.911,3.4,50.0,repaid,e6a25e071c60243b0c51c698db5302b54ef61338c6747a...,2024-04-04,G,...,8.493333,0.01920,0.772121,0.019200,2.0,3,50.96,0.226884,0.339733,50.960000
24313,fffb5b06cc5ef2d4fd3d9321bc797d95b0bdb75ac77215...,4f1efc1e1af62ccdbc89ac564d33c22ed3021c6d3be748...,2024-04-11,2024-04-12 15:31:41.127,3.4,50.0,repaid,e6a25e071c60243b0c51c698db5302b54ef61338c6747a...,2024-04-04,G,...,0.445310,0.00640,0.290867,0.006400,1.0,1,50.32,0.000000,1.006400,50.320000
24314,fffccf877a6b7745194286d6683b55d9d69ce2a800e64f...,ef9fa866ffbbd757283c5ade094cef617518b80cdb7bbc...,2023-01-23,2023-01-23 15:07:56.905,2.4,3000.0,executed,1d83f7f96a6a3a06b30bc683b94a428225fe072e60959f...,2022-08-29,B,...,,,88.189394,-0.029917,9.0,10,2910.25,1.071057,0.097008,3184.908196


In [6]:
# Check available columns in loan features
print("Loan features columns:")
for col in loan_features_df.columns:
    print(f"- {col}")

# Show unique statuses at decision time
if "status_at_decision_time" in loan_features_df.columns:
    print("\nUnique statuses at decision time:")
    print(loan_features_df["status_at_decision_time"].unique())

Loan features columns:
- loan_id
- user_id
- created_at
- updated_at
- annual_interest
- loan_amount
- status_at_decision_time
- batch
- allowlisted_date
- batch_letter
- cohort_start
- created_at_h_days
- updated_at_h_days
- loan_amount_log
- loan_size_decile
- days_since_loan_issuance
- days_allowlist_to_loan
- loan_amount_x_interest
- repayment_velocity_30d
- loan_roi_30d
- repayment_velocity_60d
- loan_roi_60d
- repayment_velocity_90d
- loan_roi_90d
- days_to_first_repayment
- num_repayments
- total_repaid_amount
- repayment_consistency_cv
- avg_repayment_relative

Unique statuses at decision time:


In [7]:
loan_features_df.describe()



Unnamed: 0,created_at,updated_at,annual_interest,loan_amount,allowlisted_date,cohort_start,created_at_h_days,updated_at_h_days,loan_amount_log,loan_size_decile,...,loan_roi_30d,repayment_velocity_60d,loan_roi_60d,repayment_velocity_90d,loan_roi_90d,days_to_first_repayment,num_repayments,total_repaid_amount,repayment_consistency_cv,avg_repayment_relative
count,15600,15600,15600.0,15600.0,15600,15600,15600.0,15600.0,15600.0,15600.0,...,7489.0,11703.0,11703.0,15600.0,15600.0,14427.0,15600.0,15600.0,14427.0,15600.0
mean,2024-01-09 13:15:52.615384832,2024-01-24 11:25:43.192004352,3.093154,777.341782,2023-12-04 07:35:54.461538304,2023-12-04 07:35:54.461538304,36.23609,50.541731,5.381216,2.5275,...,-0.348259,inf,-0.218508,inf,-0.165384,3.797117,6.001026,507.141129,0.563953,0.462583
min,2022-08-30 00:00:00,2022-10-10 12:42:27.332963,1.7,10.0,2022-08-29 00:00:00,2022-08-29 00:00:00,0.0,1.0,2.397895,1.0,...,-1.0,-30.3,-1.0,0.0,-1.0,-1.0,0.0,0.0,0.0,0.0
25%,2023-12-21 00:00:00,2024-02-19 19:36:49.975749888,3.2,50.0,2023-12-05 00:00:00,2023-12-05 00:00:00,10.0,29.0,3.931826,1.0,...,-0.8708,1.071332,-0.502487,0.767924,0.0044,1.0,1.0,50.448011,0.0,0.05394
50%,2024-04-13 00:00:00,2024-04-21 17:12:21.503000064,3.4,150.0,2024-04-04 00:00:00,2024-04-04 00:00:00,32.0,51.0,5.01728,2.0,...,0.00508,2.797778,0.009764,2.326133,0.012,1.0,2.0,100.44,0.53347,0.344253
75%,2024-05-16 00:00:00,2024-05-23 13:12:39.761499904,3.4,500.0,2024-04-04 00:00:00,2024-04-04 00:00:00,60.0,77.0,6.216606,3.0,...,0.01675,10.88081,0.0264,9.694699,0.0336,3.0,5.0,427.8925,0.936745,1.0054
max,2024-07-03 00:00:00,2024-07-03 23:58:06.645000,3.4,20000.0,2024-04-04 00:00:00,2024-04-04 00:00:00,90.0,90.0,9.903538,10.0,...,1.03547,inf,1.03547,inf,1.03547,89.0,85.0,21792.89,5.149347,2.03547
std,,,0.517323,1776.244728,,,27.923151,26.240553,1.514808,2.092762,...,0.444907,,0.402649,,0.375346,7.636067,10.794759,1325.103478,0.592859,0.407511


## 2. Cohort-Level Features

### Portfolio Concentration Metrics
- Gini coefficient of loan amounts
- Herfindahl-Hirschman Index (HHI)
- Loan amount percentiles (P10, P25, P50, P75, P90, P95)
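
Both concentration measures have standard closed forms. The sketch below uses the sorted-rank formula for the Gini coefficient and the sum of squared shares for HHI; it is a generic illustration, not necessarily how `create_cohort_level_features` computes them:

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = equal, near 1 = concentrated)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    rank_weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * rank_weighted / (n * total) - (n + 1) / n

def hhi(values):
    """Herfindahl-Hirschman Index: sum of squared shares of the total."""
    total = sum(values)
    if total == 0:
        return 0.0
    return sum((x / total) ** 2 for x in values)

equal = [1.0, 1.0, 1.0, 1.0]
concentrated = [0.0, 0.0, 0.0, 10.0]
```

For four equal loans the Gini coefficient is 0 and HHI is 1/4; when one loan holds the whole amount, Gini is 0.75 (the maximum for n = 4) and HHI is 1.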

### Risk Distribution Metrics
- Cohort size (number of loans)
- Value-weighted average loan amount
- Statistical measures: standard deviation, skewness, coefficient of variation
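
A sketch of these distribution measures in plain Python. It uses population (uncorrected) moments, which is an assumption: pandas' `Series.skew` and `Series.std` apply bias corrections, so values from the real pipeline may differ slightly:

```python
from statistics import mean, pstdev

def value_weighted_mean(amounts):
    """Average loan amount where each loan is weighted by its own amount."""
    return sum(a * a for a in amounts) / sum(amounts)

def coefficient_of_variation(amounts):
    """Population standard deviation divided by the mean."""
    return pstdev(amounts) / mean(amounts)

def skewness(amounts):
    """Population skewness: third central moment over variance ** 1.5."""
    mu = mean(amounts)
    m2 = mean((a - mu) ** 2 for a in amounts)
    m3 = mean((a - mu) ** 3 for a in amounts)
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0

symmetric = [1.0, 2.0, 3.0]
right_tailed = [1.0, 1.0, 4.0]
```

A right-tailed cohort (a few large loans among many small ones) gives positive skewness, which is the pattern most cohorts show in the table below.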

In [8]:
# Create cohort-level features
print("Creating cohort-level features...")
cohort_features_df = create_cohort_level_features(
    loans_and_cohort=loans_and_cohort,
    repayments_and_loans=repayments_and_loans,
    decision_time_days=DECISION_TIME_DAYS
)

print(f"Created {len(cohort_features_df.columns)} cohort-level features for {len(cohort_features_df)} cohorts")
print("\nFeature columns:")
for col in sorted(cohort_features_df.columns):
    print(f"  - {col}")

Creating cohort-level features...
Creating cohort-level features...
Creating loan-level features with decision time t=90 days...
Base features dataset: 15600 unique loans
Creating repayment behavior features...
Final loan features dataset: 15600 loans with 29 features
Final cohort features dataset: 7 cohorts with 45 features
Created 45 cohort-level features for 7 cohorts

Feature columns:
  - amount_weighted_avg_roi_90d
  - avg_days_allowlist_to_loan
  - avg_days_since_loan_issuance
  - avg_days_to_first_repayment
  - avg_interest_rate
  - avg_loan_amount
  - avg_loan_amount_x_interest
  - avg_loan_roi_30d
  - avg_loan_roi_60d
  - avg_loan_roi_90d
  - avg_repayment_consistency
  - avg_repayment_velocity_30d
  - avg_repayment_velocity_60d
  - avg_repayment_velocity_90d
  - batch_letter
  - cohort_size
  - loan_amount_cv
  - loan_amount_hhi
  - loan_amount_p25
  - loan_amount_p75
  - loan_amount_p90
  - loan_amount_skewness
  - median_days_to_first_repayment
  - median_interest_rate
  - 



In [9]:
# Display cohort-level features
print("Cohort-level features:")
display(cohort_features_df)

Cohort-level features:


Unnamed: 0,batch_letter,cohort_size,total_loan_amount,avg_loan_amount,median_loan_amount,loan_amount_skewness,avg_interest_rate,median_interest_rate,std_interest_rate,total_repaid_amount,...,pct_positive_roi_90d,pct_loans_totally_repaid,pct_loans_in_billing,pct_loans_normal_repayment,pct_executed,pct_debt_collection,pct_debt_repaid,pct_repaid,avg_loan_amount_x_interest,amount_weighted_avg_roi_90d
0,A,1114,130669.96,117.297989,50.0,4.946587,3.39982,3.4,0.005992,115412.8,...,0.910233,0.79982,0.005386,0.194794,0.194794,0.005386,0.002693,0.797127,398.60311,-0.116761
1,B,791,2489534.0,3147.324905,2250.0,3.005904,2.4,2.4,0.0,1575898.0,...,0.433628,0.404551,0.0,0.595449,0.595449,0.0,0.0,0.404551,7553.579772,-0.366991
2,C,1142,3790750.88,3319.396567,2250.0,2.881781,2.005867,1.7,0.347358,2527218.0,...,0.471979,0.43958,0.0,0.56042,0.56042,0.0,0.0,0.43958,6652.458067,-0.33332
3,D,2246,1074555.97,478.430975,500.0,-2.908027,3.2,3.2,0.0,559611.6,...,0.431879,0.407836,0.0,0.592164,0.592164,0.0,0.0,0.407836,1530.97912,-0.479216
4,E,707,2405450.0,3402.333805,2250.0,2.67387,2.024752,1.7,0.349335,1261806.0,...,0.384724,0.362093,0.0,0.637907,0.637907,0.0,0.0,0.362093,6818.274399,-0.475439
5,F,982,1117859.72,1138.35002,750.0,4.126454,2.4,2.4,0.0,881910.4,...,0.725051,0.716904,0.028513,0.254582,0.254582,0.028513,0.008147,0.708758,2732.040049,-0.211072
6,G,8618,1117711.27,129.694972,50.0,11.054964,3.4,3.4,0.0,989545.0,...,0.916918,0.829195,0.049431,0.121374,0.121374,0.049431,0.090392,0.738803,440.962905,-0.114669


## Feature Summary and Statistics

In [10]:
# Loan-level feature statistics
print("=== LOAN-LEVEL FEATURE STATISTICS ===")
print(f"Total loans: {len(loan_features_df)}")
print(f"Total features: {len(loan_features_df.columns)}")
print("Missing values per feature:")
missing_values = loan_features_df.isnull().sum()
for feature, missing in missing_values[missing_values > 0].items():
    print(f"  {feature}: {missing} ({missing/len(loan_features_df)*100:.1f}%)")

print("\n=== COHORT-LEVEL FEATURE STATISTICS ===")
print(f"Total cohorts: {len(cohort_features_df)}")
print(f"Total features: {len(cohort_features_df.columns)}")
print("Missing values per feature:")
missing_values_cohort = cohort_features_df.isnull().sum()
for feature, missing in missing_values_cohort[missing_values_cohort > 0].items():
    print(f"  {feature}: {missing} ({missing/len(cohort_features_df)*100:.1f}%)")

=== LOAN-LEVEL FEATURE STATISTICS ===
Total loans: 15600
Total features: 29
Missing values per feature:
  repayment_velocity_30d: 8111 (52.0%)
  loan_roi_30d: 8111 (52.0%)
  repayment_velocity_60d: 3897 (25.0%)
  loan_roi_60d: 3897 (25.0%)
  days_to_first_repayment: 1173 (7.5%)
  repayment_consistency_cv: 1173 (7.5%)

=== COHORT-LEVEL FEATURE STATISTICS ===
Total cohorts: 7
Total features: 45
Missing values per feature:


## Save Features to Database

We'll save both loan-level and cohort-level features to separate tables in the database for easy access in modeling.
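
The table names printed below follow the pattern `<level>_features_t<decision time>`. As an illustration of that convention only, here is a standard-library sketch that writes rows to an in-memory SQLite database; the helper names are hypothetical and `save_features_to_database` may work differently:

```python
import sqlite3

def feature_table_name(level, decision_time_days):
    """Build a table name such as loan_features_t90."""
    return f"{level}_features_t{int(decision_time_days)}"

def save_rows(conn, table, rows):
    """Replace `table` with the given list of dicts (all sharing the same keys)."""
    columns = list(rows[0])
    col_sql = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'DROP TABLE IF EXISTS "{table}"')
    conn.execute(f'CREATE TABLE "{table}" ({col_sql})')
    conn.executemany(
        f'INSERT INTO "{table}" ({col_sql}) VALUES ({placeholders})',
        [tuple(row[c] for c in columns) for row in rows],
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
table = feature_table_name("cohort", 90)
save_rows(conn, table, [{"batch_letter": "A", "cohort_size": 1114},
                        {"batch_letter": "B", "cohort_size": 791}])
saved = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
```

Encoding t in the table name lets feature sets for different decision times coexist in the same database.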

In [11]:
# Save features to database
print("Saving features to database...")
save_features_to_database(
    loan_features_df=loan_features_df,
    cohort_features_df=cohort_features_df,
    database_path=DATABASE_PATH,
    decision_time_days=DECISION_TIME_DAYS
)

print("Features saved successfully!")
print(f"Loan-level features saved to: loan_features_t{DECISION_TIME_DAYS}")
print(f"Cohort-level features saved to: cohort_features_t{DECISION_TIME_DAYS}")

Saving features to database...
Saved 15600 loan features to table: loan_features_t90
Saved 7 cohort features to table: cohort_features_t90
Features saved successfully!
Loan-level features saved to: loan_features_t90
Cohort-level features saved to: cohort_features_t90


## Next Steps

The feature engineering is complete. Key outputs:

1. **Loan-level features** (`loan_features_t{DECISION_TIME_DAYS}` table, e.g. `loan_features_t90`): Individual loan characteristics and early behavior signals
2. **Cohort-level features** (`cohort_features_t{DECISION_TIME_DAYS}` table, e.g. `cohort_features_t90`): Portfolio composition and risk metrics

### For Modeling:
- **Strategy A (Loan-level → Aggregate)**: Use loan-level features to predict individual outcomes, then aggregate to cohort level
- **Strategy B (Direct Cohort)**: Use cohort-level features to directly predict cohort ROI
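
For Strategy A, one plausible aggregation is the amount-weighted mean of predicted loan ROIs, which equals total predicted repayment over total lent, minus one. A sketch under that assumption, with a hypothetical `predicted_roi` field:

```python
def cohort_roi_from_loans(loans):
    """Aggregate loan-level ROI predictions into a single cohort ROI."""
    total_amount = sum(loan["loan_amount"] for loan in loans)
    total_repaid = sum(
        loan["loan_amount"] * (1.0 + loan["predicted_roi"]) for loan in loans
    )
    return total_repaid / total_amount - 1.0

# Hypothetical predictions, for illustration only.
predictions = [
    {"loan_amount": 100.0, "predicted_roi": 0.10},
    {"loan_amount": 300.0, "predicted_roi": -0.50},
]
roi = cohort_roi_from_loans(predictions)
```

The large loan dominates: the cohort ROI is -0.35 even though the unweighted mean of the two loan ROIs is -0.20.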

### Key Considerations:
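- All features respect the decision time constraint (only information available up to t = `DECISION_TIME_DAYS` days is used)
- Missing values remain in several loan-level features (at t = 90, `repayment_velocity_30d` and `loan_roi_30d` are missing for 52.0% of loans), and the `repayment_velocity_*` columns contain `inf` values, so the modeling notebook must impute or mask them explicitly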
- Features are saved in database tables for easy access in modeling notebook
- Complex calculations are modularized in `src/features.py` for reusability
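
The `describe()` output above shows `inf` in the repayment velocity columns, and the missing-value summary lists several partially populated features. One possible pre-modeling step, sketched here as an assumption rather than as part of the existing pipeline, is to turn non-finite entries into explicit missing values and keep a flag for each:

```python
import math

def clean_feature(values):
    """Return (cleaned, missing_flags); None, NaN and +/-inf become None and are flagged."""
    cleaned, flags = [], []
    for v in values:
        is_missing = v is None or not math.isfinite(v)
        cleaned.append(None if is_missing else v)
        flags.append(is_missing)
    return cleaned, flags

velocity = [1.38, float("inf"), float("nan"), None, 0.0]
cleaned, missing = clean_feature(velocity)
```

Keeping the flag as its own feature lets a model separate "not observed yet" from a genuine zero.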

Ready for the modeling phase!