In [49]:
import sys
sys.path.append("..")

%load_ext autoreload
%autoreload 2



# Feature Engineering for Cohort Profitability Prediction

This notebook creates features for predicting ROI at horizon H using only information available up to decision time t.

## Key Parameters
- **Decision Time (t)**: 180 days after cohort creation in this run (set by `DECISION_TIME_DAYS` in `src.config`, so it is easy to change)
- **Horizon (H)**: Based on EDA findings, ROI is calculated over the observation period up to `TIME_HORIZON_DAYS` (400 days in this run)
- **Feature Scope**: Only information available at or before time t is used
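
The feature-scope rule can be illustrated with a minimal sketch. It assumes that `h_days` counts days since cohort start, as the repayment table later in this notebook suggests; the toy data and the helper `filter_to_decision_time` are hypothetical and are not part of `src.features`.

```python
import pandas as pd

# Hypothetical stand-in for src.config.DECISION_TIME_DAYS.
DECISION_TIME_DAYS = 180

# Toy repayments; h_days is assumed to be days since cohort start.
repayments = pd.DataFrame(
    {
        "loan_id": ["a", "a", "b", "b"],
        "h_days": [30, 200, 180, 181],
        "repayment_amount": [10.0, 20.0, 5.0, 7.0],
    }
)


def filter_to_decision_time(df: pd.DataFrame, decision_time_days: int) -> pd.DataFrame:
    """Keep only rows observed at or before the decision time t."""
    return df[df["h_days"] <= decision_time_days].copy()


visible = filter_to_decision_time(repayments, DECISION_TIME_DAYS)
print(visible)
```

Features computed from `visible` cannot leak information from after t; only the target should be computed from rows up to the horizon H.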

## Feature Categories
1. **Loan-Level Features**: Individual loan characteristics and early behavior signals
2. **Cohort-Level Features**: Portfolio composition and risk distribution metrics

In [50]:
# Parameters - easily configurable
from src.config import DECISION_TIME_DAYS, TIME_HORIZON_DAYS, DATABASE_PATH

print(f"Decision time set to: {DECISION_TIME_DAYS} days after cohort creation")
print(f"Time horizon set to: {TIME_HORIZON_DAYS} days for target variable")

Decision time set to: 180 days after cohort creation
Time horizon set to: 400 days for target variable


## Data Loading and Preparation

In [51]:
from src.dataset.data_manipulation import load_data

# Load all data
allowlist, loans, repayments, loans_and_cohort, repayments_and_loans = load_data(
    DATABASE_PATH, remove_loans_with_errors=False
)

## Feature Engineering Functions

We'll import feature engineering functions from a dedicated module to keep the notebook clean and functions reusable.

In [52]:
from src.features import (
    create_loan_level_features,
    create_cohort_level_features,
    save_features_to_database
)

## 1. Loan-Level Features

### Loan Characteristics
- Loan amount (raw and log-transformed)
- Annual interest rate
- Loan size decile within cohort

### Temporal Features
- Time since loan issuance at decision time t
- Time between allowlist date and loan creation

### Interaction Terms
- Loan amount × interest rate
- Loan ROI at 60/120/180 days

### Early Repayment Behavior
- Days to first repayment
- Repayment velocity (60/120/180 days)
- Repayment consistency metrics

### Repayment Quality Indicators
- Average repayment amount relative to loan size
- Repayment acceleration/deceleration trends

### Billing Payment Indicators
- Time in billing process
- Is in normal repayment process (boolean)
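
As a rough illustration of a few of the features listed above, the sketch below computes a loan ROI, the average repayment relative to loan size, and a coefficient of variation of repayment amounts on toy data. The formulas are assumptions: ROI as total repaid divided by loan amount minus 1, and CV with a population standard deviation (`ddof=0`). They are not taken from `src.features`, and repayment velocity is left out because its definition is not shown in this notebook.

```python
import pandas as pd

# Toy inputs; column names mirror the notebook, values are made up.
loans = pd.DataFrame({"loan_id": ["a", "b", "c"], "loan_amount": [500.0, 50.0, 100.0]})
repayments = pd.DataFrame(
    {"loan_id": ["a", "b", "b"], "repayment_amount": [48.0, 20.0, 30.44]}
)

# Built-in group aggregations instead of groupby(...).apply(...).
grouped = repayments.groupby("loan_id")["repayment_amount"]
per_loan = pd.DataFrame(
    {
        "num_repayments": grouped.count(),
        "total_repaid_amount": grouped.sum(),
        "mean_repayment": grouped.mean(),
        "std_repayment": grouped.std(ddof=0),  # assumption: population std
    }
).reset_index()

# Left join keeps loans that have no repayments at all.
features = loans.merge(per_loan, on="loan_id", how="left")
features["num_repayments"] = features["num_repayments"].fillna(0).astype(int)
features["total_repaid_amount"] = features["total_repaid_amount"].fillna(0.0)

# Assumed definitions, see the lead-in above.
features["loan_roi"] = features["total_repaid_amount"] / features["loan_amount"] - 1.0
features["avg_repayment_relative"] = (
    features["mean_repayment"].fillna(0.0) / features["loan_amount"]
)
# Stays NaN for loans without repayments.
features["repayment_consistency_cv"] = features["std_repayment"] / features["mean_repayment"]

print(features[["loan_id", "loan_roi", "avg_repayment_relative", "repayment_consistency_cv"]])
```

Under these assumptions a loan with no repayments gets an ROI of -1, a relative repayment of 0, and an undefined (NaN) consistency value.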

In [53]:
loan_features_df = create_loan_level_features(
    loans_and_cohort,
    repayments_and_loans,
    decision_time_days=DECISION_TIME_DAYS,
    time_horizon_days=TIME_HORIZON_DAYS,
)
loan_features_df

Creating loan-level features with decision time t=180 days...
Base features dataset: 24462 unique loans
Creating repayment behavior features...




Final loan features dataset: 24462 loans with 30 features


Unnamed: 0,loan_id,user_id,created_at,updated_at,annual_interest,loan_amount,status_at_decision_time,batch,allowlisted_date,batch_letter,...,repayment_velocity_120d,loan_roi_120d,repayment_velocity_180d,loan_roi_180d,days_to_first_repayment,num_repayments,total_repaid_amount,repayment_consistency_cv,avg_repayment_relative,repayment_at_H
0,0000634b4de08f4d798a4546bd104aa5d3e43af416bd48...,e00cc67f993040157c1a5d15b35d8b6182e567c405fff9...,2024-03-11,2024-06-06 23:45:31.989,2.4,4000.0,executed,9a65c2254d6d2b240f353b95df7061928c7a9869417325...,2023-12-19,F,...,1.382703,-0.98721,1.977320,-0.952050,1.0,24,191.80000,1.450956,0.001998,191.800000
1,00022546590af574f1785cb5e4c17bb1898de7bce40977...,1532d16402c104350db26e145d562e7b9ef392e16e9c99...,2023-12-07,2024-03-20 12:01:00.658,3.2,500.0,debt_collection,4398a3e49d78f4b1b816ced315f34a5da5e830b1f53640...,2023-12-05,D,...,0.406780,-0.90400,0.269663,-0.904000,9.0,1,48.00000,0.000000,0.096000,48.000000
2,000402c18c2931e31e9cd68b5a01d1389337e55572859a...,35bd33ed5eb7a85c88c2b1baf1ec368adc994b9bdc9f5e...,2024-08-12,2024-08-12 11:56:37.160,3.4,50.0,executed,e6a25e071c60243b0c51c698db5302b54ef61338c6747a...,2024-04-04,G,...,,,1.004800,0.004800,0.0,1,50.24000,0.000000,1.004800,50.240000
3,000dca06cc48943ca84d7516f817709f2b7768468a9a02...,445a2b25d6692ec55caf314c6bc998c517ea9022c65735...,2024-06-01,2024-06-03 12:02:32.785,3.4,50.0,repaid,e6a25e071c60243b0c51c698db5302b54ef61338c6747a...,2024-04-04,G,...,0.813548,0.00880,0.413443,0.008800,1.0,2,50.44000,0.984140,0.504400,50.440000
4,000eb39b9c161b1f71e9ad6e36194639ee58fd61a3dad4...,9d4b09514327fecbf514ec885540846ffe6aafc0753e50...,2024-04-24,2024-05-12 20:20:24.473,3.2,500.0,executed,4398a3e49d78f4b1b816ced315f34a5da5e830b1f53640...,2023-12-05,D,...,0.000000,-1.00000,0.000000,-1.000000,,0,0.00000,,0.000000,725.560000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24457,fffa7d663d32bfa90ca35a874ef5b2a842595b7627dd39...,627575c514eec900ec0ac9f1780fb41c92708b3889b58e...,2024-07-27,2024-07-31 23:03:34.911,3.4,50.0,repaid,e6a25e071c60243b0c51c698db5302b54ef61338c6747a...,2024-04-04,G,...,8.493333,0.01920,0.772121,0.019200,2.0,3,50.96000,0.226884,0.339733,50.960000
24458,fffb5b06cc5ef2d4fd3d9321bc797d95b0bdb75ac77215...,4f1efc1e1af62ccdbc89ac564d33c22ed3021c6d3be748...,2024-04-11,2024-04-12 15:31:41.127,3.4,50.0,repaid,e6a25e071c60243b0c51c698db5302b54ef61338c6747a...,2024-04-04,G,...,0.445310,0.00640,0.290867,0.006400,1.0,1,50.32000,0.000000,1.006400,50.320000
24459,fffccf877a6b7745194286d6683b55d9d69ce2a800e64f...,ef9fa866ffbbd757283c5ade094cef617518b80cdb7bbc...,2023-01-23,2023-01-23 15:07:56.905,2.4,3000.0,executed,1d83f7f96a6a3a06b30bc683b94a428225fe072e60959f...,2022-08-29,B,...,,,88.189394,-0.029917,9.0,10,2910.25000,1.071057,0.097008,3184.908196
24460,fffcffd247c02bfc1d42974623254a88eeee39b46dbd6b...,b1862108e0314a10a21ad8b4ea4193016fce49014a3868...,2024-05-29,2024-06-03 21:12:11.047,3.4,100.0,repaid,e6a25e071c60243b0c51c698db5302b54ef61338c6747a...,2024-04-04,G,...,1.571538,0.02150,0.817200,0.021500,1.0,2,102.15000,0.608419,0.510750,102.150000


In [54]:
loan_features_df[loan_features_df.repayment_at_H.isnull()]

Unnamed: 0,loan_id,user_id,created_at,updated_at,annual_interest,loan_amount,status_at_decision_time,batch,allowlisted_date,batch_letter,...,repayment_velocity_120d,loan_roi_120d,repayment_velocity_180d,loan_roi_180d,days_to_first_repayment,num_repayments,total_repaid_amount,repayment_consistency_cv,avg_repayment_relative,repayment_at_H
163,01c101ec6059c04e657b292143a01e1571ef728be57d16...,72b8db5832cc45d74deb3f6cef3c313b56d7f4420887ac...,2024-01-25,2024-05-08 12:00:08.782,3.2,500.0,debt_collection,4398a3e49d78f4b1b816ced315f34a5da5e830b1f53640...,2023-12-05,D,...,0.0,-1.0,0.0,-1.0,,0,0.0,,0.0,
303,03280e5e427d0061dcac018bf704fbdd4cc1706bcfe780...,ef86918bd9d19a17de0aa5dca195663192f7c67c451469...,2023-12-10,2024-03-23 12:00:14.798,3.2,500.0,debt_collection,4398a3e49d78f4b1b816ced315f34a5da5e830b1f53640...,2023-12-05,D,...,0.0,-1.0,0.0,-1.0,,0,0.0,,0.0,
319,0359a0b0bb831c4999576d34e09ec1430d6acd8ca66007...,f409053bdc1e4f974ec698b175219f86d8e20e2e11e1a3...,2024-04-29,2024-05-13 22:00:27.759,3.4,50.0,debt_collection,e6a25e071c60243b0c51c698db5302b54ef61338c6747a...,2024-04-04,G,...,0.0,-1.0,0.0,-1.0,,0,0.0,,0.0,
346,0396fdf9eeea90f2d27fa0bbdbef9b901c3bebdfd9e650...,ad5c5fb0ea986b58ef15a759f12fb787a0be09b5d0edc9...,2024-01-28,2024-05-11 12:00:10.198,3.2,500.0,debt_collection,4398a3e49d78f4b1b816ced315f34a5da5e830b1f53640...,2023-12-05,D,...,0.0,-1.0,0.0,-1.0,,0,0.0,,0.0,
395,0420f5c12efd1db47a9d9cc89d126bae51cabe0afe1a49...,5f0ac893e3720b708df37add617ce6d50a7922d9a3f9b3...,2024-04-06,2024-04-20 22:00:32.506,3.4,300.0,debt_collection,e6a25e071c60243b0c51c698db5302b54ef61338c6747a...,2024-04-04,G,...,0.0,-1.0,0.0,-1.0,,0,0.0,,0.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24162,fd227832c1a53900809dd7751a202c9387d6c101b7dece...,1bcb1991726e9ba206d7b6c6fd7c637fe90dfc79daed22...,2024-04-05,2024-04-19 22:01:00.901,3.4,50.0,debt_collection,e6a25e071c60243b0c51c698db5302b54ef61338c6747a...,2024-04-04,G,...,0.0,-1.0,0.0,-1.0,,0,0.0,,0.0,
24260,fe2be747e8e245cb986450df138cde5f84bb43fe4cd4e6...,f03b2434843255e4acf62262e7cdc6ff7b5adabed7be15...,2023-12-06,2024-03-19 12:02:45.201,3.2,500.0,debt_collection,4398a3e49d78f4b1b816ced315f34a5da5e830b1f53640...,2023-12-05,D,...,0.0,-1.0,0.0,-1.0,,0,0.0,,0.0,
24310,fea4de4c3623caaff497f9a3328a16e9024221c1c343d1...,4ea3540d3005b77a923f7b746cf8a3813d9ac2eb03110d...,2024-03-18,2024-03-18 14:32:28.560,3.2,500.0,executed,4398a3e49d78f4b1b816ced315f34a5da5e830b1f53640...,2023-12-05,D,...,0.0,-1.0,0.0,-1.0,,0,0.0,,0.0,
24417,ffa14460dd4ff0ff7648ae70b653c47e4302e529d59090...,a48cdbe43c1e22589d0ab1daa64f6bf38759c11b8e3878...,2024-02-18,2024-06-01 12:00:08.389,3.2,500.0,debt_collection,4398a3e49d78f4b1b816ced315f34a5da5e830b1f53640...,2023-12-05,D,...,0.0,-1.0,0.0,-1.0,,0,0.0,,0.0,


In [55]:
loan_id = "4dc6209ade5525396a30910e26e006749df5f878e137cb87bb123267f970bce3"
user_id = "3487c5129cdf4c202d16febed9fa29c680e54b40859d72110ad76191d31525b7"
unique_loans = loans_and_cohort[loans_and_cohort.user_id == user_id].sort_values("updated_at")[
    "loan_id"
].unique()

loans_and_cohort[loans_and_cohort.user_id == user_id].sort_values("updated_at")

Unnamed: 0,loan_id,user_id,created_at,updated_at,annual_interest,loan_amount,status,batch,allowlisted_date,batch_letter,cohort_start,created_at_h_days,updated_at_h_days
115695,4dc6209ade5525396a30910e26e006749df5f878e137cb...,3487c5129cdf4c202d16febed9fa29c680e54b40859d72...,2024-04-05,2024-04-05 15:59:24.600,3.4,200.0,executed,e6a25e071c60243b0c51c698db5302b54ef61338c6747a...,2024-04-04,G,2024-04-04,1,1
115696,4dc6209ade5525396a30910e26e006749df5f878e137cb...,3487c5129cdf4c202d16febed9fa29c680e54b40859d72...,2024-04-05,2024-04-05 15:59:24.611,3.4,200.0,executed,e6a25e071c60243b0c51c698db5302b54ef61338c6747a...,2024-04-04,G,2024-04-04,1,1
115694,4dc6209ade5525396a30910e26e006749df5f878e137cb...,3487c5129cdf4c202d16febed9fa29c680e54b40859d72...,2024-04-05,2024-04-05 23:46:14.231,3.4,200.0,executed,e6a25e071c60243b0c51c698db5302b54ef61338c6747a...,2024-04-04,G,2024-04-04,1,1
115693,4dc6209ade5525396a30910e26e006749df5f878e137cb...,3487c5129cdf4c202d16febed9fa29c680e54b40859d72...,2024-04-05,2024-04-06 08:33:50.718,3.4,200.0,repaid,e6a25e071c60243b0c51c698db5302b54ef61338c6747a...,2024-04-04,G,2024-04-04,1,2
115698,8f5dab3bc9019dbf8b4c19a9b12d294963c4af8798dd09...,3487c5129cdf4c202d16febed9fa29c680e54b40859d72...,2024-04-06,2024-04-06 09:09:13.662,3.4,100.0,executed,e6a25e071c60243b0c51c698db5302b54ef61338c6747a...,2024-04-04,G,2024-04-04,2,2
115699,8f5dab3bc9019dbf8b4c19a9b12d294963c4af8798dd09...,3487c5129cdf4c202d16febed9fa29c680e54b40859d72...,2024-04-06,2024-04-06 09:09:13.669,3.4,100.0,executed,e6a25e071c60243b0c51c698db5302b54ef61338c6747a...,2024-04-04,G,2024-04-04,2,2
115697,8f5dab3bc9019dbf8b4c19a9b12d294963c4af8798dd09...,3487c5129cdf4c202d16febed9fa29c680e54b40859d72...,2024-04-06,2024-04-06 09:09:40.726,3.4,100.0,repaid,e6a25e071c60243b0c51c698db5302b54ef61338c6747a...,2024-04-04,G,2024-04-04,2,2
115700,7cc32caae581a1ae76e15cb84cc74ae060e78e12230c7e...,3487c5129cdf4c202d16febed9fa29c680e54b40859d72...,2024-04-06,2024-04-06 14:21:32.197,3.4,100.0,executed,e6a25e071c60243b0c51c698db5302b54ef61338c6747a...,2024-04-04,G,2024-04-04,2,2
115702,7cc32caae581a1ae76e15cb84cc74ae060e78e12230c7e...,3487c5129cdf4c202d16febed9fa29c680e54b40859d72...,2024-04-06,2024-04-06 14:21:32.206,3.4,100.0,executed,e6a25e071c60243b0c51c698db5302b54ef61338c6747a...,2024-04-04,G,2024-04-04,2,2
115701,7cc32caae581a1ae76e15cb84cc74ae060e78e12230c7e...,3487c5129cdf4c202d16febed9fa29c680e54b40859d72...,2024-04-06,2024-04-06 23:53:53.566,3.4,100.0,executed,e6a25e071c60243b0c51c698db5302b54ef61338c6747a...,2024-04-04,G,2024-04-04,2,2
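
The table above shows that `loans_and_cohort` holds several status rows per loan. One way to derive a single status per loan as of the decision time is sketched below. It assumes that `updated_at_h_days` counts days since cohort start and that the latest visible row is the relevant one; the toy data is hypothetical, and this is not necessarily how `create_loan_level_features` builds `status_at_decision_time`.

```python
import pandas as pd

# Hypothetical stand-in for src.config.DECISION_TIME_DAYS.
DECISION_TIME_DAYS = 180

# Toy status history: several rows per loan, one per status update.
history = pd.DataFrame(
    {
        "loan_id": ["x", "x", "x", "y", "y"],
        "updated_at": pd.to_datetime(
            [
                "2024-04-05 15:59",
                "2024-04-05 23:46",
                "2024-04-06 08:33",
                "2024-04-06 09:09",
                "2024-11-01 10:00",
            ]
        ),
        "updated_at_h_days": [1, 1, 2, 2, 211],
        "status": ["executed", "executed", "repaid", "executed", "debt_collection"],
    }
)

# Drop updates that happen after the decision time, then keep the latest row per loan.
visible = history[history["updated_at_h_days"] <= DECISION_TIME_DAYS]
status_at_t = (
    visible.sort_values("updated_at")
    .drop_duplicates("loan_id", keep="last")
    .rename(columns={"status": "status_at_decision_time"})[
        ["loan_id", "status_at_decision_time"]
    ]
    .reset_index(drop=True)
)
print(status_at_t)
```

In this toy example loan `y` is reported as `executed`, because its move to `debt_collection` happens after day 180 and must not be visible to the features.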


In [56]:
repayments_and_loans[repayments_and_loans.loan_id.isin([loan_id])]

Unnamed: 0,date,loan_id,repayment_amount,billings_amount,batch_letter,allowlisted_date,loan_amount,cohort_start,created_at,created_at_h_days,h_days,repayment_total
398227,2024-04-06,4dc6209ade5525396a30910e26e006749df5f878e137cb...,201.42,0.0,G,2024-04-04,200.0,2024-04-04,2024-04-05,1,2,201.42


In [57]:
# Check available columns in loan features
print("Loan features columns:")
for col in loan_features_df.columns:
    print(f"- {col}")


# Show unique statuses
if "status_at_decision_time" in loan_features_df.columns:
    print("\nUnique statuses at decision time:")

Loan features columns:
- loan_id
- user_id
- created_at
- updated_at
- annual_interest
- loan_amount
- status_at_decision_time
- batch
- allowlisted_date
- batch_letter
- cohort_start
- created_at_h_days
- updated_at_h_days
- loan_amount_log
- loan_size_decile
- days_since_loan_issuance
- days_allowlist_to_loan
- loan_amount_x_interest
- repayment_velocity_60d
- loan_roi_60d
- repayment_velocity_120d
- loan_roi_120d
- repayment_velocity_180d
- loan_roi_180d
- days_to_first_repayment
- num_repayments
- total_repaid_amount
- repayment_consistency_cv
- avg_repayment_relative
- repayment_at_H

Unique statuses at decision time:


In [58]:
loan_features_df.describe()



Unnamed: 0,created_at,updated_at,annual_interest,loan_amount,allowlisted_date,cohort_start,created_at_h_days,updated_at_h_days,loan_amount_log,loan_size_decile,...,repayment_velocity_120d,loan_roi_120d,repayment_velocity_180d,loan_roi_180d,days_to_first_repayment,num_repayments,total_repaid_amount,repayment_consistency_cv,avg_repayment_relative,repayment_at_H
count,24462,24462,24462.0,24462.0,24462,24462,24462.0,24462.0,24462.0,24462.0,...,19039.0,19039.0,24462.0,24462.0,23364.0,24462.0,24462.0,23364.0,24462.0,24066.0
mean,2024-01-18 08:41:44.194260480,2024-02-07 11:27:39.983498240,3.028922,954.434037,2023-11-08 17:52:43.796909312,2023-11-08 17:52:43.796909312,70.617366,90.146799,5.552772,2.665604,...,inf,-0.126553,inf,-0.087659,4.974876,7.848663,769.086928,0.661333,0.465634,953.833637
min,2022-08-30 00:00:00,2022-10-10 12:42:27.332963,1.7,10.0,2022-08-29 00:00:00,2022-08-29 00:00:00,0.0,1.0,2.397895,1.0,...,-50.22,-1.0,-0.03897959,-1.01528,-1.0,0.0,-3.82,0.0,-0.005093,-3.82
25%,2023-12-10 00:00:00,2024-02-02 20:15:34.700000,2.4,50.0,2023-12-05 00:00:00,2023-12-05 00:00:00,22.0,43.0,3.931826,1.0,...,0.6115035,0.0048,0.4240582,0.0064,1.0,1.0,50.63,0.0,0.062239,50.869999
50%,2024-04-20 00:00:00,2024-05-02 03:20:00.038500096,3.4,200.0,2024-04-04 00:00:00,2024-04-04 00:00:00,64.5,92.0,5.303305,1.0,...,1.934615,0.0142,1.414338,0.0172,1.0,2.0,104.21,0.613055,0.342675,153.741802
75%,2024-06-13 00:00:00,2024-06-19 18:05:39.643500032,3.4,1000.0,2024-04-04 00:00:00,2024-04-04 00:00:00,113.0,134.0,6.908755,4.0,...,9.213636,0.041699,8.650105,0.0522,5.0,7.0,610.155,0.988,1.0058,851.32
max,2024-10-01 00:00:00,2024-10-01 23:58:18.891000,3.4,31200.0,2024-04-04 00:00:00,2024-04-04 00:00:00,180.0,180.0,10.348205,10.0,...,inf,1.03547,inf,1.4636,174.0,117.0,27611.56,7.839744,2.03547,36504.44
std,,,0.556806,2009.153543,,,53.119436,52.023422,1.606799,2.37711,...,,0.353428,,0.325817,11.572243,13.260091,1789.931512,0.696682,0.409657,2072.840707
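
The summary above reports `inf` for the mean and max of the velocity columns, which makes the mean and standard deviation unusable. A minimal sketch of one common remedy follows: count the infinite entries and replace them with NaN before summarizing. The toy frame is hypothetical, and whether NaN is the right replacement depends on how velocity is defined in `src.features`.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of problem as the velocity columns.
df = pd.DataFrame(
    {
        "repayment_velocity_180d": [0.5, np.inf, 2.0, -np.inf],
        "loan_roi_180d": [-1.0, 0.01, 0.02, 0.0],
    }
)

# Count infinite values per numeric column.
n_inf = np.isinf(df.select_dtypes("number")).sum()
print(n_inf)

# Replace +/-inf with NaN so that describe() yields finite statistics.
cleaned = df.replace([np.inf, -np.inf], np.nan)
print(cleaned.describe())
```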


## 2. Cohort-Level Features

### Portfolio Concentration Metrics
- Gini coefficient of loan amounts
- Herfindahl-Hirschman Index (HHI)
- Loan amount percentiles (P10, P25, P50, P75, P90, P95)

### Risk Distribution Metrics
- Cohort size (number of loans)
- Value-weighted average loan amount
- Statistical measures: standard deviation, skewness, coefficient of variation
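
For reference, the two concentration metrics can be sketched with their textbook definitions: the Gini coefficient from sorted amounts, and the HHI as the sum of squared shares. The functions `gini` and `hhi` below are illustrative, assume non-negative amounts, and may differ in detail from the implementation in `src.features`.

```python
import numpy as np


def gini(amounts) -> float:
    """Gini coefficient of non-negative amounts (0 = perfectly equal)."""
    x = np.sort(np.asarray(amounts, dtype=float))
    n = x.size
    if n == 0 or x.sum() == 0:
        return float("nan")
    ranks = np.arange(1, n + 1)
    return float(2.0 * np.sum(ranks * x) / (n * x.sum()) - (n + 1) / n)


def hhi(amounts) -> float:
    """Herfindahl-Hirschman Index: sum of squared shares of the total."""
    x = np.asarray(amounts, dtype=float)
    shares = x / x.sum()
    return float(np.sum(shares**2))


equal = [100.0, 100.0, 100.0, 100.0]
concentrated = [0.0, 0.0, 0.0, 1.0]
print(gini(equal), hhi(equal))
print(gini(concentrated), hhi(concentrated))
```

With n equal loans the Gini coefficient is 0 and the HHI is 1/n; as the portfolio concentrates in a single loan, the HHI approaches 1 and the Gini coefficient approaches (n - 1)/n.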

In [59]:
# Create cohort-level features
print("Creating cohort-level features...")
cohort_features_df = create_cohort_level_features(
    loans_and_cohort=loans_and_cohort,
    repayments_and_loans=repayments_and_loans,
    decision_time_days=DECISION_TIME_DAYS
)

print(f"Created {len(cohort_features_df.columns)} cohort-level features for {len(cohort_features_df)} cohorts")
print("\nFeature columns:")
for col in sorted(cohort_features_df.columns):
    print(f"  - {col}")

Creating cohort-level features...
Creating cohort-level features...
Creating loan-level features with decision time t=180 days...
Base features dataset: 24462 unique loans
Creating repayment behavior features...




Final loan features dataset: 24462 loans with 30 features
Final cohort features dataset: 7 cohorts with 45 features
Created 45 cohort-level features for 7 cohorts

Feature columns:
  - amount_weighted_avg_roi_180d
  - avg_days_allowlist_to_loan
  - avg_days_since_loan_issuance
  - avg_days_to_first_repayment
  - avg_interest_rate
  - avg_loan_amount
  - avg_loan_amount_x_interest
  - avg_loan_roi_120d
  - avg_loan_roi_180d
  - avg_loan_roi_60d
  - avg_repayment_consistency
  - avg_repayment_velocity_120d
  - avg_repayment_velocity_180d
  - avg_repayment_velocity_60d
  - batch_letter
  - cohort_size
  - loan_amount_cv
  - loan_amount_hhi
  - loan_amount_p25
  - loan_amount_p75
  - loan_amount_p90
  - loan_amount_skewness
  - median_days_to_first_repayment
  - median_interest_rate
  - median_loan_amount
  - median_loan_roi_120d
  - median_loan_roi_180d
  - median_loan_roi_60d
  - median_repayment_velocity_120d
  - median_repayment_velocity_180d
  - median_repayment_velocity_60d
  - pct_d



In [60]:
# Display cohort-level features
print("Cohort-level features:")
display(cohort_features_df)

Cohort-level features:


Unnamed: 0,batch_letter,cohort_size,total_loan_amount,avg_loan_amount,median_loan_amount,loan_amount_skewness,avg_interest_rate,median_interest_rate,std_interest_rate,total_repaid_amount,...,pct_positive_roi_180d,pct_loans_totally_repaid,pct_loans_in_billing,pct_loans_normal_repayment,pct_executed,pct_debt_collection,pct_debt_repaid,pct_repaid,avg_loan_amount_x_interest,amount_weighted_avg_roi_180d
0,A,2018,262093.46,129.87783,50.0,4.91577,3.399901,3.4,0.004452,232929.2,...,0.937066,0.858771,0.017344,0.123885,0.123885,0.017344,0.016848,0.841923,441.468664,-0.111274
1,B,1566,5007890.0,3197.886335,2250.0,3.206658,2.4,2.4,0.0,4057992.0,...,0.66539,0.636015,0.085568,0.266284,0.266284,0.085568,0.031928,0.604087,7674.927203,-0.18968
2,C,2343,7838327.88,3345.423764,2300.0,2.83176,2.012207,1.7,0.348028,6337690.0,...,0.673922,0.658557,0.065301,0.256509,0.256509,0.065301,0.017926,0.640632,6722.861251,-0.191449
3,D,3149,1505098.34,477.96073,500.0,-3.30834,3.2,3.2,0.0,1071774.0,...,0.610352,0.578596,0.27215,0.148619,0.148619,0.27215,0.022229,0.556367,1529.474337,-0.287904
4,E,1348,4421515.79,3280.056224,2250.0,2.872324,2.035979,1.7,0.349849,3281730.0,...,0.60089,0.583086,0.127596,0.280415,0.280415,0.127596,0.031899,0.551187,6608.328148,-0.257782
5,F,1791,2517548.67,1405.666482,750.0,6.151156,2.4,2.4,0.0,2163413.0,...,0.839196,0.826354,0.038526,0.13512,0.13512,0.038526,0.019542,0.806812,3373.599558,-0.140667
6,G,12247,1794891.27,146.557628,50.0,7.716126,3.4,3.4,0.0,1667875.0,...,0.952397,0.864783,0.032743,0.102392,0.102392,0.032743,0.077897,0.786887,498.295935,-0.070765


## Feature Summary and Statistics

In [61]:
# Loan-level feature statistics
print("=== LOAN-LEVEL FEATURE STATISTICS ===")
print(f"Total loans: {len(loan_features_df)}")
print(f"Total features: {len(loan_features_df.columns)}")
print("Missing values per feature:")
missing_values = loan_features_df.isnull().sum()
for feature, missing in missing_values[missing_values > 0].items():
    print(f"  {feature}: {missing} ({missing/len(loan_features_df)*100:.1f}%)")

print("\n=== COHORT-LEVEL FEATURE STATISTICS ===")
print(f"Total cohorts: {len(cohort_features_df)}")
print(f"Total features: {len(cohort_features_df.columns)}")
print("Missing values per feature:")
missing_values_cohort = cohort_features_df.isnull().sum()
for feature, missing in missing_values_cohort[missing_values_cohort > 0].items():
    print(f"  {feature}: {missing} ({missing/len(cohort_features_df)*100:.1f}%)")

=== LOAN-LEVEL FEATURE STATISTICS ===
Total loans: 24462
Total features: 30
Missing values per feature:
  repayment_velocity_60d: 12828 (52.4%)
  loan_roi_60d: 12828 (52.4%)
  repayment_velocity_120d: 5423 (22.2%)
  loan_roi_120d: 5423 (22.2%)
  days_to_first_repayment: 1098 (4.5%)
  repayment_consistency_cv: 1098 (4.5%)
  repayment_at_H: 396 (1.6%)

=== COHORT-LEVEL FEATURE STATISTICS ===
Total cohorts: 7
Total features: 45
Missing values per feature:


## Save Features to Database

We'll save both loan-level and cohort-level features to separate tables in the database for easy access in modeling.
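
The internals of `save_features_to_database` are not shown in this notebook. As a rough sketch of what such a step could look like, the example below writes a toy frame to an in-memory SQLite database with `DataFrame.to_sql`, using a table name that encodes both t and H. The helper `feature_table_name` and the use of SQLite are assumptions made for illustration.

```python
import sqlite3

import pandas as pd


def feature_table_name(prefix: str, decision_time_days: int, time_horizon_days: int) -> str:
    """Hypothetical helper: encode t and H in the table name."""
    return f"{prefix}_t{decision_time_days}_h{time_horizon_days}"


# Toy feature frame; the real one comes from create_loan_level_features.
loan_df = pd.DataFrame({"loan_id": ["a", "b"], "loan_roi_180d": [-0.904, 0.0088]})
table = feature_table_name("loan_features", 180, 400)

# In-memory database for the example; a file path would be used in practice.
conn = sqlite3.connect(":memory:")
try:
    loan_df.to_sql(table, conn, if_exists="replace", index=False)
    restored = pd.read_sql_query(f"SELECT * FROM {table}", conn)
finally:
    conn.close()

print(table)
print(restored)
```

Encoding both parameters in the table name lets feature sets for different decision times and horizons coexist in one database.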

In [None]:
# Save features to database
print("Saving features to database...")
save_features_to_database(
    loan_features_df=loan_features_df,
    cohort_features_df=cohort_features_df,
    database_path=DATABASE_PATH,
    decision_time_days=DECISION_TIME_DAYS,
    time_horizon_days=TIME_HORIZON_DAYS
)

print("Features saved successfully!")
print(f"Loan-level features saved to: loan_features_t{DECISION_TIME_DAYS}_h{TIME_HORIZON_DAYS}")
print(f"Cohort-level features saved to: cohort_features_t{DECISION_TIME_DAYS}_h{TIME_HORIZON_DAYS}")

Saving features to database...
Saved 24462 loan features to table: loan_features_t180_h400
Saved 7 cohort features to table: cohort_features_t180_h400
Features saved successfully!
Loan-level features saved to: loan_features_t180_h400
Cohort-level features saved to: cohort_features_t180_h400




## Next Steps

The feature engineering is complete. Key outputs:

1. **Loan-level features** (`loan_features_t180_h400` table): Individual loan characteristics and early behavior signals
2. **Cohort-level features** (`cohort_features_t180_h400` table): Portfolio composition and risk metrics

### For Modeling:
- **Strategy A (Loan-level → Aggregate)**: Use loan-level features to predict individual outcomes, then aggregate to cohort level
- **Strategy B (Direct Cohort)**: Use cohort-level features to directly predict cohort ROI
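
The aggregation step of Strategy A can be sketched as follows. The example assumes that a loan-level model has already produced a predicted repayment at H for each loan (the column `predicted_repayment_at_H` is hypothetical) and that cohort ROI is defined as total repaid divided by total lent, minus 1.

```python
import pandas as pd

# Toy loan-level predictions; in practice these come from a fitted model.
loan_preds = pd.DataFrame(
    {
        "batch_letter": ["A", "A", "B", "B"],
        "loan_amount": [100.0, 300.0, 50.0, 50.0],
        "predicted_repayment_at_H": [110.0, 270.0, 0.0, 52.0],
    }
)

# Sum lent and predicted repaid amounts per cohort, then form the ROI.
totals = loan_preds.groupby("batch_letter")[
    ["loan_amount", "predicted_repayment_at_H"]
].sum()
totals["predicted_cohort_roi"] = (
    totals["predicted_repayment_at_H"] / totals["loan_amount"] - 1.0
)
print(totals)
```

Summing amounts before dividing weights each loan by its size, which differs from averaging per-loan ROIs when loan sizes vary widely within a cohort.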

### Key Considerations:
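- All features respect the decision time constraint (t=180 days)
- Missing values are left as NaN in several loan-level features (for example, `repayment_velocity_60d` is missing for 52.4% of loans), and the velocity features contain infinite values; both need handling before modeling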
- Features are saved in database tables for easy access in modeling notebook
- Complex calculations are modularized in `src/features.py` for reusability

Ready for the modeling phase!