# Preprocessing and Modeling Dataset Setup

This notebook rebuilds the modeling base with the cleaned Lending Club data, applies light feature engineering, and produces two preprocessed datasets: one tailored for regularized logistic regression and another for XGBoost.

# Setup
Import the libraries and utilities, and set display preferences for the exploratory checkpoints.

In [1]:
import json
from pathlib import Path

import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, MinMaxScaler

pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", 120)

# Load data and rebuild the target

We repeat the eligibility and censoring rules from the exploration notebook:

- keep only finalized outcomes (`Fully Paid` vs `Charged Off`)
- restrict to 36-month loans issued before June 2017 to avoid censoring
- derive the binary target `default_binary` for downstream modeling

In [2]:
data = pd.read_parquet("../data/lending_club_exploration.parquet")
print(f"Raw shape: {data.shape}")
print(data["default_binary"].value_counts(normalize=True).rename("default_rate"))

Raw shape: (1059906, 144)
default_binary
0    0.854119
1    0.145881
Name: default_rate, dtype: float64


# Feature list and light type cleaning

We start from the curated `model_features.json`, then add a few structured tweaks so downstream vectorizers receive consistent inputs.

In [3]:
features_path = Path("../data/model_features.json")
base_features = json.loads(features_path.read_text())


def parse_employment_length(value: str):
    if pd.isna(value):
        return np.nan
    value = str(value).strip().lower()
    if value in {"n/a", "na", "nan"}:
        return np.nan
    if "<" in value:
        return 0.5
    digits = "".join(ch for ch in value if ch.isdigit())
    return float(digits) if digits else np.nan


# harmonize types before splitting columns
data["employment_length_years"] = data["employment_length_years"].apply(parse_employment_length)
data["employment_title"] = data["employment_title"].astype("string")
data["loan_purpose"] = data["loan_purpose"].astype("string")
data["loan_title"] = data["loan_title"].astype("string")
data["home_ownership_status"] = data["home_ownership_status"].astype("string")
data["state"] = data["state"].astype("string")
data["income_verification_status"] = data["income_verification_status"].astype("string")

# credit history timing
data["earliest_credit_line_year"] = data["earliest_credit_line_date"].dt.year
data["credit_history_age_years"] = (
    data["loan_issue_date"] - data["earliest_credit_line_date"]
).dt.days / 365.25

# final feature set swaps out the raw date for engineered variants
feature_columns = [c for c in base_features if c != "earliest_credit_line_date"]
feature_columns += ["earliest_credit_line_year", "credit_history_age_years"]

data[feature_columns].isna().mean().sort_values(ascending=False).head(10)

months_since_recent_inquiry        0.138733
accounts_120days_past_due          0.087083
months_since_oldest_installment    0.081588
employment_title                   0.070113
employment_length_years            0.064802
percent_trades_never_delinquent    0.048835
average_current_balance            0.048735
months_since_oldest_revolving      0.048728
revolving_accounts_count           0.048728
months_since_recent_revolving      0.048728
dtype: float64
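
To make the parsing rule concrete, the sketch below restates `parse_employment_length` and applies it to a few illustrative strings. The sample values (such as `"10+ years"` and `"< 1 year"`) are assumptions about how the raw field might look, not values taken from the dataset.

```python
import numpy as np
import pandas as pd


def parse_employment_length(value):
    """Map a free-text employment length to a number of years (NaN if unknown)."""
    if pd.isna(value):
        return np.nan
    value = str(value).strip().lower()
    if value in {"n/a", "na", "nan"}:
        return np.nan
    if "<" in value:
        # "less than one year" is coded as half a year
        return 0.5
    digits = "".join(ch for ch in value if ch.isdigit())
    return float(digits) if digits else np.nan


# Hypothetical raw strings, chosen only for illustration
samples = ["10+ years", "< 1 year", "3 years", "n/a", None]
parsed = [parse_employment_length(v) for v in samples]
print(parsed)
```

One caveat of the digit-joining step: it assumes string inputs. A value that is already a float, such as `10.0`, is rendered as `"10.0"` and would be parsed as `100.0`, so the function should only be applied once to the raw text column.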

# Column blocks and preprocessing recipes

A few design choices tailored to each model family:

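- **Logistic regression**: median imputation for numeric inputs and a constant placeholder category for categoricals, min-max scaling of numeric inputs after imputation, and a `log1p` transform on a small set of monetary variables to tame heavy tails.
- **XGBoost**: keep numeric inputs unscaled and unimputed (tree splits are invariant to monotonic transformations), but retain the one-hot and text encodings.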

In the following sections, we will train two models: a simple baseline logistic regression (LR), and a stronger baseline using XGBoost (XGB). This is to allow us to compare their performance and better understand the value added by a more complex model.

Note that for the XGBoost model, we do not impute missing numeric values (categorical and text columns still receive a placeholder before encoding). XGBoost handles missing values natively, and missingness can arise through several mechanisms: Missing Completely at Random (MCAR); Missing at Random (MAR), where missingness depends on observed variables, as with fields that were only introduced recently (like the columns dropped earlier); and Missing Not at Random (MNAR), which may encode valuable information about the borrower. For instance, if a borrower chooses not to disclose their income, it is reasonable to suspect that their income distribution differs from that of borrowers who do disclose. Thus, missing values themselves may be informative for the model. Logistic regression requires imputation, which we keep simple because we do not want to spend much effort on the naive baseline; XGBoost does not, so we keep missing values as-is.

Additionally, for the logistic regression model, preprocessing includes feature scaling, as regularized linear models behave better when features are on similar scales. Ideally, we would choose an imputation strategy for each feature (using only the training data) and pick a scaling method, such as log-scaling or standard scaling, for each feature based on its distribution. For the sake of simplicity, we instead apply `MinMaxScaler` to all numeric features, after a `log1p` transform for the heavy-tailed monetary columns.
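
As a rough illustration of why the monetary columns are log-transformed before min-max scaling, the sketch below compares the two recipes on a made-up income vector with one extreme value. The numbers and the `min_max` helper are hypothetical and only mimic what `MinMaxScaler` does on a single column.

```python
import numpy as np

# Hypothetical heavy-tailed incomes (not taken from the dataset)
income = np.array([20_000.0, 40_000.0, 60_000.0, 80_000.0, 2_000_000.0])


def min_max(x):
    """Rescale to [0, 1] using the observed minimum and maximum."""
    return (x - x.min()) / (x.max() - x.min())


scaled_raw = min_max(income)            # min-max scaling only
scaled_log = min_max(np.log1p(income))  # log1p first, then min-max

print(np.round(scaled_raw, 3))
print(np.round(scaled_log, 3))
```

Without the log step, the single outlier pushes every typical income into a narrow band near zero; with it, the typical values are spread over a much wider part of the unit interval.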

Finally, we transform text columns (after basic cleaning to ensure minimal standardization and relevance) into new columns, in a similar way to the one-hot encoding of categorical features: each column is vectorized with TF-IDF over unigrams and bigrams, and only the terms most associated with the target under a chi-squared test are kept. This preserves interpretability. If maximizing performance were the goal, more powerful techniques like embeddings could be considered, or used as a reference to guide the creation of alternative, interpretable features that capture similar information.

In [4]:
def build_text_transformers(text_blocks, k_map=None):
    """
    For each specified text column:
      - Fill missing values with empty strings (necessary for text vectorization).
      - Convert to 1D input (as required by TfidfVectorizer).
      - Extract text features using TF-IDF (with bigrams, limited vocab size for memory/performance).
      - Select top-k features via chi-squared test (default k=50 unless overridden).
    Creates pipelines for each text column and returns them as a list for inclusion in a ColumnTransformer.
    Spaces in feature names are replaced with underscores to avoid issues downstream.
    """
    transformers = []
    to_1d = FunctionTransformer(
        lambda x: x[:, 0] if isinstance(x, np.ndarray) else x.iloc[:, 0],
        validate=False,
        feature_names_out="one-to-one",
    )

    class NoSpaceSelectKBest(SelectKBest):
        def get_feature_names_out(self, input_features=None):
            # Clean up feature names: replace spaces with underscores to maintain compatibility
            names_out = super().get_feature_names_out(input_features)
            return np.array([str(f).replace(" ", "_") for f in names_out])

    for name, (col, max_feat) in text_blocks.items():
        # Use the column-specific k if provided, else default to 50
        k = k_map.get(name, 50) if k_map else 50

        pipe = Pipeline(
            [
                # Explanation:
                # - Impute: Force NA -> '', so all text inputs are valid for TF-IDF
                ("imputer", SimpleImputer(strategy="constant", fill_value="", missing_values=pd.NA)),
                # - 1D: Some vectorizers expect 1D input
                ("to_1d", to_1d),
                # - Tfidf: Main text feature extraction (limit vocab, use bigrams)
                (
                    "tfidf",
                    TfidfVectorizer(
                        max_features=max_feat,
                        ngram_range=(1, 2),
                        stop_words="english",
                        sublinear_tf=True,
                        norm="l2",
                        lowercase=True,
                    ),
                ),
                # - SelectKBest: Cut it down to k most relevant features
                ("select", NoSpaceSelectKBest(chi2, k=k)),
            ]
        )

        transformers.append((name, pipe, [col]))

    return transformers

In [5]:
text_columns = ["employment_title", "loan_title"]
categorical_columns = ["home_ownership_status", "state", "income_verification_status", "loan_purpose"]

log_scale_columns = [
    "loan_amount_requested",
    "annual_income",
    "annual_income_joint",
    "revolving_balance",
    "total_current_balance",
    "total_high_credit_limit",
    "total_balance_excluding_mortgage",
    "total_bankcard_limit",
    "total_installment_balance",
    "total_installment_credit_limit",
    "bankcard_open_to_buy",
]
log_scale_columns = [c for c in log_scale_columns if c in feature_columns]

numeric_columns = [c for c in feature_columns if c not in text_columns + categorical_columns]
linear_numerical_cols = [c for c in numeric_columns if c not in log_scale_columns]

# Shared text blocks and helper
text_blocks = {
    "employment_title_tfidf": ("employment_title", 4096),
    "loan_title_tfidf": ("loan_title", 2048),
}

# Logistic Regression preprocessing
log_numeric_lr = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="median")),
        ("log1p", FunctionTransformer(np.log1p, feature_names_out="one-to-one")),
        ("scaler", MinMaxScaler()),
    ]
)

linear_numeric_lr = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", MinMaxScaler()),
    ]
)

categorical_lr = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="constant", fill_value="nan")),
        ("onehot", OneHotEncoder(handle_unknown="ignore", min_frequency=0.01, sparse_output=False)),
    ]
)

preprocessor_lr = ColumnTransformer(
    transformers=[
        ("log_numeric", log_numeric_lr, log_scale_columns),
        ("linear_numeric", linear_numeric_lr, linear_numerical_cols),
        ("categorical", categorical_lr, categorical_columns),
        *build_text_transformers(text_blocks),
    ],
    remainder="passthrough",
    sparse_threshold=0.0,
    verbose_feature_names_out=False,
)

# XGBoost preprocessing
categorical_xgb = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="constant", fill_value="nan")),
        ("onehot", OneHotEncoder(handle_unknown="ignore", min_frequency=0.01, sparse_output=False)),
    ]
)

preprocessor_xgb = ColumnTransformer(
    transformers=[
        ("categorical", categorical_xgb, categorical_columns),
        *build_text_transformers(text_blocks),
    ],
    remainder="passthrough",
    sparse_threshold=0.0,
    verbose_feature_names_out=False,
)

# Dataset Splitting

For predictive modeling tasks such as this, where the goal is to forecast outcomes for future users, the most relevant way to evaluate model performance is through out-of-time (OOT) validation. Therefore, it is essential to choose data ranges that provide enough samples for robust model training as well as reliable, meaningful evaluation.

In [6]:
data["loan_issue_date"].value_counts().sort_index().tail(7)

loan_issue_date
2016-11-01    26179
2016-12-01    27911
2017-01-01    23965
2017-02-01    20314
2017-03-01    27588
2017-04-01    21416
2017-05-01    24518
Name: count, dtype: int64

The dataset is large enough to set aside the most recent months for an effective out-of-time evaluation: each of the last seven months contains more than 20,000 loans, and the training set remains large even after reserving these periods for validation and testing.

Rather than using a classic 75/25 split or similar, we simply allocate enough data to each held-out period to assess model performance reliably.

In [7]:
test_start = pd.to_datetime("2017-04-01")
validation_start = pd.to_datetime("2017-02-01")

data["dataset"] = None

# Test set: April 2017 onward (April and May 2017)
data.loc[data["loan_issue_date"] >= test_start, "dataset"] = "test"

# Validation set: February and March 2017 (April is excluded)
mask_validation = (data["loan_issue_date"] >= validation_start) & (data["loan_issue_date"] < test_start)
data.loc[mask_validation, "dataset"] = "validation"

# Train + calibration set: before February 2017
mask_train_calib = data["loan_issue_date"] < validation_start
data_train_calibration = data.loc[mask_train_calib].copy()

# Split 5% for calibration and 95% for training
train_indices, calibration_indices = train_test_split(
    data_train_calibration.index,
    test_size=0.05,
    random_state=34,
    shuffle=True,
    stratify=data_train_calibration["default_binary"],
)

data.loc[train_indices, "dataset"] = "train"
data.loc[calibration_indices, "dataset"] = "calibration"

# Check the distribution of the splits
split_counts = data["dataset"].value_counts().rename_axis("split").reset_index(name="count")
split_ranges = []
for split in ["train", "validation", "calibration", "test"]:
    df_split = data[data["dataset"] == split]
    if not df_split.empty:
        min_date = df_split["loan_issue_date"].min()
        max_date = df_split["loan_issue_date"].max()
        count = len(df_split)
        split_ranges.append(
            {"split": split, "count": count, "min_loan_issue_date": min_date, "max_loan_issue_date": max_date}
        )

data_split_info = pd.DataFrame(split_ranges)
display(data_split_info)

| | split | count | min_loan_issue_date | max_loan_issue_date |
|---|---|---|---|---|
| 0 | train | 917766 | 2007-06-01 | 2017-01-01 |
| 1 | validation | 47902 | 2017-02-01 | 2017-03-01 |
| 2 | calibration | 48304 | 2007-07-01 | 2017-01-01 |
| 3 | test | 45934 | 2017-04-01 | 2017-05-01 |


Splitting the data into three sets — train, validation, and test — is a key practice to ensure we get a more reliable estimate of model performance and avoid unpleasant surprises when the model goes into production. Using only train and test sets often leads to indirect model tuning on the test data, which can result in overly optimistic performance estimates and models that do not generalize well to new or unseen scenarios.

By having a dedicated validation set, we can make unbiased choices on model selection and hyperparameter tuning, keeping the test set as a final, untouched benchmark of how the model is expected to perform in real-world applications. 

The calibration set, a stratified random 5% sample of the loans issued before February 2017, is kept out of training so that predicted probabilities can be adjusted after model development if needed.
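
A quick way to confirm the out-of-time property is to check that each period ends before the next one begins. The sketch below does this on a toy frame with hypothetical issue dates; the cutoffs mirror the ones used above, but the rows are invented.

```python
import pandas as pd

# Toy frame with hypothetical issue dates
toy = pd.DataFrame({
    "loan_issue_date": pd.to_datetime(
        ["2016-12-01", "2017-01-01", "2017-02-01", "2017-03-01", "2017-04-01", "2017-05-01"]
    )
})
validation_start = pd.Timestamp("2017-02-01")
test_start = pd.Timestamp("2017-04-01")

# Assign splits by date: later cutoffs overwrite earlier labels
toy["dataset"] = "train"
toy.loc[toy["loan_issue_date"] >= validation_start, "dataset"] = "validation"
toy.loc[toy["loan_issue_date"] >= test_start, "dataset"] = "test"

ranges = toy.groupby("dataset")["loan_issue_date"].agg(["min", "max"])

# Out-of-time property: no temporal overlap between consecutive splits
assert ranges.loc["train", "max"] < ranges.loc["validation", "min"]
assert ranges.loc["validation", "max"] < ranges.loc["test", "min"]
print(ranges)
```

The same two assertions could be run against the real split summary as a guard against accidental leakage when the cutoff dates change.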

In [8]:
data["loan_id"].shape, data["loan_id"].nunique()

((1059906,), 1059906)

In [9]:
X = data[["loan_id"] + feature_columns]
y = data["default_binary"].astype(int)

X_train = X[data["dataset"] == "train"]
y_train = y[data["dataset"] == "train"]

preprocessor_lr.fit(X_train, y_train)
X_train_lr = preprocessor_lr.transform(X_train)
X_lr = preprocessor_lr.transform(X)

preprocessor_xgb.fit(X_train, y_train)
X_train_xgb = preprocessor_xgb.transform(X_train)
X_xgb = preprocessor_xgb.transform(X)

feature_names_lr = preprocessor_lr.get_feature_names_out()
feature_names_xgb = preprocessor_xgb.get_feature_names_out()


def summarize_matrix(matrix, label):
    return {"label": label, "shape": matrix.shape}


summaries = [
    summarize_matrix(X_train_lr, "log_reg_train"),
    summarize_matrix(X_lr, "lr"),
    summarize_matrix(X_train_xgb, "xgb_train"),
    summarize_matrix(X_xgb, "xgb"),
]

pd.DataFrame(summaries)

| | label | shape |
|---|---|---|
| 0 | log_reg_train | (917766, 198) |
| 1 | lr | (1059906, 198) |
| 2 | xgb_train | (917766, 198) |
| 3 | xgb | (1059906, 198) |
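
Both preprocessors are fitted on the training rows only and then applied to the full dataset. One consequence worth keeping in mind, sketched here on invented numbers, is that min-max scaled values in later periods are not guaranteed to stay within [0, 1] when they fall outside the training range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single feature: training period vs. a later period
train_values = np.array([[10.0], [20.0], [30.0]])
later_values = np.array([[5.0], [40.0]])

# Fit on training data only, as done for the preprocessors above
scaler = MinMaxScaler().fit(train_values)

scaled_train = scaler.transform(train_values).ravel()
scaled_later = scaler.transform(later_values).ravel()
print(scaled_train)  # stays within [0, 1]
print(scaled_later)  # can fall below 0 or above 1
```

This is expected behavior rather than a bug: fitting the scaler on all rows would leak information from the validation and test periods into training.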


In [10]:
assert set(feature_names_xgb.tolist()) == set(feature_names_lr.tolist())

processed_model_features = [f for f in feature_names_xgb if f != "loan_id"]

with open("../data/processed_model_features.json", "w") as f:
    json.dump(processed_model_features, f, indent=2)

In [11]:
xgb_features_data = pd.DataFrame(X_xgb, columns=feature_names_xgb)

print(data.shape, xgb_features_data.shape, len(base_features))
xgb_data = pd.merge(
    data.drop(
        columns=[col for col in xgb_features_data.columns if ((col != "loan_id") & (col in data.columns))]
    ),
    xgb_features_data,
    on="loan_id",
    how="left",
)

print(xgb_data.shape)
xgb_data.head()

(1059906, 147) (1059906, 198) 57
(1059906, 292)


Unnamed: 0,loan_id,loan_amount_funded,loan_amount_funded_investors,interest_rate,monthly_payment,loan_grade,loan_subgrade,employment_title,home_ownership_status,income_verification_status,loan_issue_date,loan_status,payment_plan_flag,loan_listing_url,loan_purpose,loan_title,zip_code_first3,state,earliest_credit_line_date,months_since_last_delinquency,months_since_last_public_record,initial_listing_status,outstanding_principal,outstanding_principal_investors,total_payments_received,total_payments_received_investors,total_principal_received,total_interest_received,total_late_fees_received,recoveries_post_chargeoff,collection_recovery_fee,last_payment_date,last_payment_amount,next_payment_date,last_credit_pull_date,fico_score_high_last,fico_score_low_last,collections_12months_excluding_medical,months_since_major_derogatory,policy_code,application_type,annual_income_joint,debt_to_income_ratio_joint,verification_status_joint,total_collection_amount,open_trades_last_6months,active_installment_trades,installment_accounts_opened_12m,installment_accounts_opened_24m,months_since_recent_installment,total_installment_balance,installment_utilization,revolving_trades_opened_12m,revolving_trades_opened_24m,max_balance_bankcard,all_util,total_rev_hi_lim,finance_inquiries,finance_trades_count,inquiries_last_12months,chargeoffs_within_12months,months_since_recent_bankcard_delinquency,months_since_recent_revolving_delinquency,revol_bal_joint,sec_app_fico_range_low,sec_app_fico_range_high,secondary_app_earliest_credit_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,secondary_app_active_installment_trades,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,hardship_plan_flag,hardship_plan_type,hardship_plan_reason,hardship_plan_status,hardship_deferral_months,hardship_plan_monthly_payment,hardship_plan_start_date,hardship_plan_end_date,hardship_payment_plan_start_date,hardship_plan_length_months,hardship_plan_days_pa
st_due,hardship_plan_loan_status,hardship_projected_interest,hardship_payoff_balance,hardship_last_payment_amount,debt_settlement_flag,loan_maturity_date,loan_censored,default_binary,dataset,home_ownership_status_MORTGAGE,home_ownership_status_OWN,home_ownership_status_RENT,home_ownership_status_infrequent_sklearn,state_AL,state_AZ,state_CA,state_CO,state_CT,state_FL,state_GA,state_IL,state_IN,state_LA,state_MA,state_MD,state_MI,state_MN,state_MO,state_NC,state_NJ,state_NV,state_NY,state_OH,state_OR,state_PA,state_SC,state_TN,state_TX,state_VA,state_WA,state_WI,state_infrequent_sklearn,income_verification_status_Not Verified,income_verification_status_Source Verified,income_verification_status_Verified,loan_purpose_car,loan_purpose_credit_card,loan_purpose_debt_consolidation,loan_purpose_home_improvement,loan_purpose_major_purchase,loan_purpose_medical,loan_purpose_other,loan_purpose_small_business,loan_purpose_infrequent_sklearn,accountant,administrator,analyst,architect,attorney,bartender,bus,business_analyst,cashier,chef,clerk,cna,controller,cook,dealer,developer,director,driver,engineer,firefighter,operator,owner,painter,pastor,physician,pilot,police,president,product,professor,program_manager,project,project_manager,receptionist,sales,senior,server,software,software_engineer,sr,stocker,systems,teacher,truck,truck_driver,university,vice,vice_president,vp,welder,bills,business,business_loan,buying,car,car_financing,card,card_consolidation,card_debt,card_payoff,card_refi,card_refinance,card_refinancing,cc,cc_consolidation,citi,consolidate,consolidation,consolidation_loan,credit,credit_card,debt,debt_consolidation,debt_free,engagement,engagement_ring,expenses,financing,home,home_buying,home_improvement,improvement,loan,medical,medical_expenses,motorcycle,moving,moving_relocation,pay_bills,payoff,pool,pool_loan,refi,refinance,refinance_loan,refinancing,relocation,restaurant,ring,small_business,employment_length_years,annual_income,loan_amount_requested,loan_term_mon
ths,fico_score_low,fico_score_high,debt_to_income_ratio,delinquencies_past_2years,accounts_currently_delinquent,delinquent_amount,accounts_30days_past_due,accounts_120days_past_due,accounts_90plus_days_past_due_24m,accounts_ever_120days_past_due,public_records_count,public_records_bankruptcies,tax_liens_count,open_credit_lines,total_credit_lines,revolving_balance,revolving_utilization_rate,revolving_accounts_count,revolving_trades_with_balance,open_revolving_trades,active_revolving_trades,months_since_oldest_revolving,months_since_recent_revolving,installment_accounts_count,months_since_oldest_installment,total_installment_credit_limit,bankcard_accounts_count,active_bankcard_accounts,satisfactory_bankcard_accounts,bankcard_utilization,bankcard_open_to_buy,months_since_recent_bankcard,percent_bankcard_over_75pct_limit,inquiries_last_6months,months_since_recent_inquiry,accounts_opened_past_12months,accounts_opened_past_24months,total_current_balance,total_high_credit_limit,total_balance_excluding_mortgage,total_bankcard_limit,average_current_balance,satisfactory_accounts_count,months_since_recent_account,mortgage_accounts_count,percent_trades_never_delinquent,earliest_credit_line_year,credit_history_age_years
0,1077501,5000.0,4975.0,0.1065,162.87,B,B2,,RENT,Verified,2011-12-01,Fully Paid,n,https://lendingclub.com/browse/loanDetail.action?loan_id=1077501,credit_card,Computer,860xx,AZ,1985-01-01,,,f,0.0,0.0,5863.155187,5833.84,5000.0,863.16,0.0,0.0,0.0,2015-01-01,171.62,NaT,2020-05-01,704.0,700.0,0.0,,1.0,Individual,,,,,,,,,,,,,,,,,,,,0.0,,,,,,NaT,,,,,,,,,N,,,,,,NaT,NaT,NaT,,,,,,,N,2014-11-30 18:00:00,False,0,train,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0,24000.0,5000.0,36.0,735.0,739.0,27.65,0.0,0.0,0.0,,,,,0.0,0.0,0.0,3.0,9.0,13648.0,0.837,,,,,,,,,,,,,,,,,1.0,,,,,,,,,,,,,1985.0,26.913073
1,1077175,2400.0,2400.0,0.1596,84.33,C,C5,,RENT,Not Verified,2011-12-01,Fully Paid,n,https://lendingclub.com/browse/loanDetail.action?loan_id=1077175,small_business,real estate business,606xx,IL,2001-11-01,,,f,0.0,0.0,3005.666844,3005.67,2400.0,605.67,0.0,0.0,0.0,2014-06-01,649.91,NaT,2017-06-01,739.0,735.0,0.0,,1.0,Individual,,,,,,,,,,,,,,,,,,,,0.0,,,,,,NaT,,,,,,,,,N,,,,,,NaT,NaT,NaT,,,,,,,N,2014-11-30 18:00:00,False,0,train,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.279469,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0,12252.0,2400.0,36.0,735.0,739.0,8.72,0.0,0.0,0.0,,,,,0.0,0.0,0.0,2.0,10.0,2956.0,0.985,,,,,,,,,,,,,,,,,2.0,,,,,,,,,,,,,2001.0,10.080767
2,1076863,10000.0,10000.0,0.1349,339.31,C,C1,AIR RESOURCES BOARD,RENT,Source Verified,2011-12-01,Fully Paid,n,https://lendingclub.com/browse/loanDetail.action?loan_id=1076863,other,personel,917xx,CA,1996-02-01,35.0,,f,0.0,0.0,12231.89,12231.89,10000.0,2214.92,16.97,0.0,0.0,2015-01-01,357.48,NaT,2016-04-01,604.0,600.0,0.0,,1.0,Individual,,,,,,,,,,,,,,,,,,,,0.0,,,,,,NaT,,,,,,,,,N,,,,,,NaT,NaT,NaT,,,,,,,N,2014-11-30 18:00:00,False,0,train,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0,49200.0,10000.0,36.0,690.0,694.0,20.0,0.0,0.0,0.0,,,,,0.0,0.0,0.0,10.0,37.0,5598.0,0.21,,,,,,,,,,,,,,,,,1.0,,,,,,,,,,,,,1996.0,15.830253
3,1075269,5000.0,5000.0,0.079,156.46,A,A4,Veolia Transportaton,RENT,Source Verified,2011-12-01,Fully Paid,n,https://lendingclub.com/browse/loanDetail.action?loan_id=1075269,wedding,My wedding loan I promise to pay back,852xx,AZ,2004-11-01,,,f,0.0,0.0,5632.21,5632.21,5000.0,632.21,0.0,0.0,0.0,2015-01-01,161.03,NaT,2017-02-01,564.0,560.0,0.0,,1.0,Individual,,,,,,,,,,,,,,,,,,,,0.0,,,,,,NaT,,,,,,,,,N,,,,,,NaT,NaT,NaT,,,,,,,N,2014-11-30 18:00:00,False,0,train,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.337778,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,36000.0,5000.0,36.0,730.0,734.0,11.2,0.0,0.0,0.0,,,,,0.0,0.0,0.0,9.0,12.0,7963.0,0.283,,,,,,,,,,,,,,,,,3.0,,,,,,,,,,,,,2004.0,7.080082
4,1072053,3000.0,3000.0,0.1864,109.43,E,E1,MKC Accounting,RENT,Source Verified,2011-12-01,Fully Paid,n,https://lendingclub.com/browse/loanDetail.action?loan_id=1072053,car,Car Downpayment,900xx,CA,2007-01-01,,,f,0.0,0.0,3939.135294,3939.14,3000.0,939.14,0.0,0.0,0.0,2015-01-01,111.34,NaT,2014-12-01,689.0,685.0,0.0,,1.0,Individual,,,,,,,,,,,,,,,,,,,,0.0,,,,,,NaT,,,,,,,,,N,,,,,,NaT,NaT,NaT,,,,,,,N,2014-11-30 18:00:00,False,0,train,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.440429,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,48000.0,3000.0,36.0,660.0,664.0,5.35,0.0,0.0,0.0,,,,,0.0,0.0,0.0,4.0,4.0,8221.0,0.875,,,,,,,,,,,,,,,,,2.0,,,,,,,,,,,,,2007.0,4.914442


In [12]:
lr_features_data = pd.DataFrame(X_lr, columns=feature_names_lr)

print(data.shape, lr_features_data.shape, len(base_features))
lr_data = pd.merge(
    data.drop(
        columns=[col for col in lr_features_data.columns if ((col != "loan_id") & (col in data.columns))]
    ),
    lr_features_data,
    on="loan_id",
    how="left",
)

print(lr_data.shape)
lr_data.head()

(1059906, 147) (1059906, 198) 57
(1059906, 292)


Unnamed: 0,loan_id,loan_amount_funded,loan_amount_funded_investors,interest_rate,monthly_payment,loan_grade,loan_subgrade,employment_title,home_ownership_status,income_verification_status,loan_issue_date,loan_status,payment_plan_flag,loan_listing_url,loan_purpose,loan_title,zip_code_first3,state,earliest_credit_line_date,months_since_last_delinquency,months_since_last_public_record,initial_listing_status,outstanding_principal,outstanding_principal_investors,total_payments_received,total_payments_received_investors,total_principal_received,total_interest_received,total_late_fees_received,recoveries_post_chargeoff,collection_recovery_fee,last_payment_date,last_payment_amount,next_payment_date,last_credit_pull_date,fico_score_high_last,fico_score_low_last,collections_12months_excluding_medical,months_since_major_derogatory,policy_code,application_type,annual_income_joint,debt_to_income_ratio_joint,verification_status_joint,total_collection_amount,open_trades_last_6months,active_installment_trades,installment_accounts_opened_12m,installment_accounts_opened_24m,months_since_recent_installment,total_installment_balance,installment_utilization,revolving_trades_opened_12m,revolving_trades_opened_24m,max_balance_bankcard,all_util,total_rev_hi_lim,finance_inquiries,finance_trades_count,inquiries_last_12months,chargeoffs_within_12months,months_since_recent_bankcard_delinquency,months_since_recent_revolving_delinquency,revol_bal_joint,sec_app_fico_range_low,sec_app_fico_range_high,secondary_app_earliest_credit_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,secondary_app_active_installment_trades,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,hardship_plan_flag,hardship_plan_type,hardship_plan_reason,hardship_plan_status,hardship_deferral_months,hardship_plan_monthly_payment,hardship_plan_start_date,hardship_plan_end_date,hardship_payment_plan_start_date,hardship_plan_length_months,hardship_plan_days_pa
st_due,hardship_plan_loan_status,hardship_projected_interest,hardship_payoff_balance,hardship_last_payment_amount,debt_settlement_flag,loan_maturity_date,loan_censored,default_binary,dataset,loan_amount_requested,annual_income,revolving_balance,total_current_balance,total_high_credit_limit,total_balance_excluding_mortgage,total_bankcard_limit,total_installment_credit_limit,bankcard_open_to_buy,employment_length_years,loan_term_months,fico_score_low,fico_score_high,debt_to_income_ratio,delinquencies_past_2years,accounts_currently_delinquent,delinquent_amount,accounts_30days_past_due,accounts_120days_past_due,accounts_90plus_days_past_due_24m,accounts_ever_120days_past_due,public_records_count,public_records_bankruptcies,tax_liens_count,open_credit_lines,total_credit_lines,revolving_utilization_rate,revolving_accounts_count,revolving_trades_with_balance,open_revolving_trades,active_revolving_trades,months_since_oldest_revolving,months_since_recent_revolving,installment_accounts_count,months_since_oldest_installment,bankcard_accounts_count,active_bankcard_accounts,satisfactory_bankcard_accounts,bankcard_utilization,months_since_recent_bankcard,percent_bankcard_over_75pct_limit,inquiries_last_6months,months_since_recent_inquiry,accounts_opened_past_12months,accounts_opened_past_24months,average_current_balance,satisfactory_accounts_count,months_since_recent_account,mortgage_accounts_count,percent_trades_never_delinquent,earliest_credit_line_year,credit_history_age_years,home_ownership_status_MORTGAGE,home_ownership_status_OWN,home_ownership_status_RENT,home_ownership_status_infrequent_sklearn,state_AL,state_AZ,state_CA,state_CO,state_CT,state_FL,state_GA,state_IL,state_IN,state_LA,state_MA,state_MD,state_MI,state_MN,state_MO,state_NC,state_NJ,state_NV,state_NY,state_OH,state_OR,state_PA,state_SC,state_TN,state_TX,state_VA,state_WA,state_WI,state_infrequent_sklearn,income_verification_status_Not Verified,income_verification_status_Source 
Verified,income_verification_status_Verified,loan_purpose_car,loan_purpose_credit_card,loan_purpose_debt_consolidation,loan_purpose_home_improvement,loan_purpose_major_purchase,loan_purpose_medical,loan_purpose_other,loan_purpose_small_business,loan_purpose_infrequent_sklearn,accountant,administrator,analyst,architect,attorney,bartender,bus,business_analyst,cashier,chef,clerk,cna,controller,cook,dealer,developer,director,driver,engineer,firefighter,operator,owner,painter,pastor,physician,pilot,police,president,product,professor,program_manager,project,project_manager,receptionist,sales,senior,server,software,software_engineer,sr,stocker,systems,teacher,truck,truck_driver,university,vice,vice_president,vp,welder,bills,business,business_loan,buying,car,car_financing,card,card_consolidation,card_debt,card_payoff,card_refi,card_refinance,card_refinancing,cc,cc_consolidation,citi,consolidate,consolidation,consolidation_loan,credit,credit_card,debt,debt_consolidation,debt_free,engagement,engagement_ring,expenses,financing,home,home_buying,home_improvement,improvement,loan,medical,medical_expenses,motorcycle,moving,moving_relocation,pay_bills,payoff,pool,pool_loan,refi,refinance,refinance_loan,refinancing,relocation,restaurant,ring,small_business
0,1077501,5000.0,4975.0,0.1065,162.87,B,B2,,RENT,Verified,2011-12-01,Fully Paid,n,https://lendingclub.com/browse/loanDetail.action?loan_id=1077501,credit_card,Computer,860xx,AZ,1985-01-01,,,f,0.0,0.0,5863.155187,5833.84,5000.0,863.16,0.0,0.0,0.0,2015-01-01,171.62,NaT,2020-05-01,704.0,700.0,0.0,,1.0,Individual,,,,,,,,,,,,,,,,,,,,0.0,,,,,,NaT,,,,,,,,,N,,,,,,NaT,NaT,NaT,,,,,,,N,2014-11-30 18:00:00,False,0,train,0.525287,0.627446,0.639799,0.698181,0.711753,0.702243,0.686046,0.70696,0.637477,1.0,0.0,0.5,0.497738,0.027678,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.034884,0.04023,0.093803,0.101562,0.084746,0.088608,0.084746,0.18,0.021505,0.037736,0.176796,0.1,0.06383,0.065574,0.188751,0.020344,0.5,0.125,0.2,0.0625,0.0625,0.006647,0.117647,0.020761,0.016393,0.974,0.65,0.297991,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1077175,2400.0,2400.0,0.1596,84.33,C,C5,,RENT,Not Verified,2011-12-01,Fully Paid,n,https://lendingclub.com/browse/loanDetail.action?loan_id=1077175,small_business,real estate business,606xx,IL,2001-11-01,,,f,0.0,0.0,3005.666844,3005.67,2400.0,605.67,0.0,0.0,0.0,2014-06-01,649.91,NaT,2017-06-01,739.0,735.0,0.0,,1.0,Individual,,,,,,,,,,,,,,,,,,,,0.0,,,,,,NaT,,,,,,,,,N,,,,,,NaT,NaT,NaT,,,,,,,N,2014-11-30 18:00:00,False,0,train,0.357766,0.58562,0.537024,0.698181,0.711753,0.702243,0.686046,0.70696,0.637477,1.0,0.0,0.5,0.497738,0.008729,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.023256,0.045977,0.110389,0.101562,0.084746,0.088608,0.084746,0.18,0.021505,0.037736,0.176796,0.1,0.06383,0.065574,0.188751,0.020344,0.5,0.25,0.2,0.0625,0.0625,0.006647,0.117647,0.020761,0.016393,0.974,0.85,0.088254,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.279469,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1076863,10000.0,10000.0,0.1349,339.31,C,C1,AIR RESOURCES BOARD,RENT,Source Verified,2011-12-01,Fully Paid,n,https://lendingclub.com/browse/loanDetail.action?loan_id=1076863,other,personel,917xx,CA,1996-02-01,35.0,,f,0.0,0.0,12231.89,12231.89,10000.0,2214.92,16.97,0.0,0.0,2015-01-01,357.48,NaT,2016-04-01,604.0,600.0,0.0,,1.0,Individual,,,,,,,,,,,,,,,,,,,,0.0,,,,,,NaT,,,,,,,,,N,,,,,,NaT,NaT,NaT,,,,,,,N,2014-11-30 18:00:00,False,0,train,0.683515,0.672101,0.579923,0.698181,0.711753,0.702243,0.686046,0.70696,0.637477,1.0,0.0,0.295455,0.294118,0.02002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.116279,0.201149,0.023535,0.101562,0.084746,0.088608,0.084746,0.18,0.021505,0.037736,0.176796,0.1,0.06383,0.065574,0.188751,0.020344,0.5,0.125,0.2,0.0625,0.0625,0.006647,0.117647,0.020761,0.016393,0.974,0.7875,0.159895,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1075269,5000.0,5000.0,0.079,156.46,A,A4,Veolia Transportaton,RENT,Source Verified,2011-12-01,Fully Paid,n,https://lendingclub.com/browse/loanDetail.action?loan_id=1075269,wedding,My wedding loan I promise to pay back,852xx,AZ,2004-11-01,,,f,0.0,0.0,5632.21,5632.21,5000.0,632.21,0.0,0.0,0.0,2015-01-01,161.03,NaT,2017-02-01,564.0,560.0,0.0,,1.0,Individual,,,,,,,,,,,,,,,,,,,,0.0,,,,,,NaT,,,,,,,,,N,,,,,,NaT,NaT,NaT,,,,,,,N,2014-11-30 18:00:00,False,0,train,0.525287,0.652669,0.603599,0.698181,0.711753,0.702243,0.686046,0.70696,0.637477,0.263158,0.0,0.477273,0.475113,0.011211,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.104651,0.057471,0.031716,0.101562,0.084746,0.088608,0.084746,0.18,0.021505,0.037736,0.176796,0.1,0.06383,0.065574,0.188751,0.020344,0.5,0.375,0.2,0.0625,0.0625,0.006647,0.117647,0.020761,0.016393,0.974,0.8875,0.050865,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.337778,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1072053,3000.0,3000.0,0.1864,109.43,E,E1,MKC Accounting,RENT,Source Verified,2011-12-01,Fully Paid,n,https://lendingclub.com/browse/loanDetail.action?loan_id=1072053,car,Car Downpayment,900xx,CA,2007-01-01,,,f,0.0,0.0,3939.135294,3939.14,3000.0,939.14,0.0,0.0,0.0,2015-01-01,111.34,NaT,2014-12-01,689.0,685.0,0.0,,1.0,Individual,,,,,,,,,,,,,,,,,,,,0.0,,,,,,NaT,,,,,,,,,N,,,,,,NaT,NaT,NaT,,,,,,,N,2014-11-30 18:00:00,False,0,train,0.408692,0.670565,0.605741,0.698181,0.711753,0.702243,0.686046,0.70696,0.637477,0.894737,0.0,0.159091,0.158371,0.005355,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.046512,0.011494,0.098061,0.101562,0.084746,0.088608,0.084746,0.18,0.021505,0.037736,0.176796,0.1,0.06383,0.065574,0.188751,0.020344,0.5,0.25,0.2,0.0625,0.0625,0.006647,0.117647,0.020761,0.016393,0.974,0.925,0.02388,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.440429,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# Persist preprocessed datasets and metadata

With both datasets processed and at least roughly tailored to each method (logistic regression and XGBoost), there are two possible next steps: (a) continue with feature engineering, combining existing features into new variables with higher predictive potential (for example, through discretization or feature crossing, which can help the model pick up relevant structure and add context about a specific data point); or (b) proceed directly to building an MVP (minimum viable product) with the current features. To avoid spending too much time at this stage and have an MVP ready early, I will follow the MVP path first and defer further feature engineering.

This is often a sensible strategy in real-world settings, where the choice is between getting a first version into production soon and prolonging development for extra feature work. A simple MVP model in production can already deliver value to the business and provide early feedback from real data. Meanwhile, incremental improvements, such as more advanced feature engineering and experimentation, can be developed in parallel and released as a more accurate model in a later update that builds on the live MVP.
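
For reference, the two feature engineering techniques named above (discretization and feature crossing) could look roughly like the sketch below. It is not part of this notebook's pipeline: the toy values, the bin edges, and the new column names `fico_band`, `dti_band`, and `fico_x_dti` are all assumptions chosen for illustration.

```python
import pandas as pd

# Toy frame: column names mirror the dataset, but the values are made up.
toy = pd.DataFrame(
    {
        "fico_score_low": [600, 660, 700, 740, 780, 820],
        "debt_to_income_ratio": [35.0, 28.0, 18.0, 12.0, 8.0, 5.0],
    }
)

# Discretization: bucket FICO scores into bands (bin edges are an assumption).
toy["fico_band"] = pd.cut(
    toy["fico_score_low"],
    bins=[0, 650, 750, 850],
    labels=["low", "mid", "high"],
)

# Discretization: split debt-to-income at an assumed threshold of 20.
toy["dti_band"] = pd.cut(
    toy["debt_to_income_ratio"],
    bins=[-float("inf"), 20.0, float("inf")],
    labels=["dti_ok", "dti_high"],
)

# Feature cross: combine both bands into a single categorical feature.
toy["fico_x_dti"] = toy["fico_band"].astype(str) + "_" + toy["dti_band"].astype(str)

print(toy[["fico_band", "dti_band", "fico_x_dti"]])
```

A crossed column like this would still need one-hot encoding before it reaches the logistic regression dataset, while a tree model such as XGBoost can often recover similar interactions on its own.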

In [13]:
xgb_data.to_parquet("../data/xgb_processed_data.parquet")
lr_data.to_parquet("../data/lr_processed_data.parquet")
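
The cell above writes only the two datasets. One possible way to store metadata next to them is sketched here, under stated assumptions: the helper `build_metadata`, the file name `preprocessing_metadata.json`, and the tiny stand-in frames are hypothetical and do not come from the notebook.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Stand-ins for the notebook's xgb_data and lr_data (tiny, made-up values).
xgb_demo = pd.DataFrame({"annual_income": [0.63, 0.59], "default_binary": [0, 1]})
lr_demo = pd.DataFrame(
    {"annual_income": [0.63, 0.59], "state_CA": [0.0, 1.0], "default_binary": [0, 1]}
)


def build_metadata(frames: dict, target: str = "default_binary") -> dict:
    """Collect row count, target name, and feature names per dataset (hypothetical helper)."""
    return {
        name: {
            "n_rows": int(df.shape[0]),
            "target": target,
            "features": [col for col in df.columns if col != target],
        }
        for name, df in frames.items()
    }


metadata = build_metadata({"xgb": xgb_demo, "lr": lr_demo})

# Written to a temporary directory here; in the notebook this would likely sit in ../data/.
out_dir = Path(tempfile.mkdtemp())
meta_path = out_dir / "preprocessing_metadata.json"  # hypothetical file name
meta_path.write_text(json.dumps(metadata, indent=2))

# Round-trip check: the reloaded metadata should match what was written.
reloaded = json.loads(meta_path.read_text())
print(reloaded["lr"]["features"])
```

Storing the feature list explicitly makes it easier for the modeling notebook to separate model inputs from the raw columns that remain in the saved frames.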

Now we can proceed to the next steps: feature selection and modeling.