# About

This is an experimental notebook for modeling a churn classifier. The functions are a work in progress, so the early ones are a bit messy. They get more organized as I modularize the code.

I went through the following steps:
- `Preparation`: Transform seed data format to be suitable for feature engineering.
- Feature Transformation
    - `Test Feature Transformation Pipeline 1`
        - `Feature Engineering`: Generate features using the following information:
            - Window Features
            - RFM Features
            - Activity Trend Features
                - Slope
                - Statistics
            - I gathered 106 features in total.
        - `Data Split`: Split the data into train, validation, and test sets, with the features (independent variables) separated from the targets (dependent variables).
        - `Feature Processing`: Transform the features (impute, scale), then keep those with a positive Mutual Information score against the target.
        - `Write Transformation Models`: Temporarily dump the transformed dataframes and transformers so I can continue experimenting without having to rerun the notebook.
    - `Complete Feature Transformation Pipeline 1`: Wrap the entire Pipeline 1 into respective functions. Then write the dataframes and transformers again.
- `Model Test`: Try some tree classifiers on one target. The results were horrendous (around 0.5 PR-AUC), so I wondered whether raw features would do better.
- `Test Feature Transformation Pipeline 2`: Use the raw features instead of the transformed ones to see whether AUC improves. Result: similar to the transformed version.
- `Log Results`:
    - Data Pipeline: Wrap the data pipelines above into callable functions that generate and write:
        - Raw features dataframes
        - Transformed features dataframes
        - Transformers (imputer, scaler, selected features)
    - Train & Log Models: Use MLflow to log the best-performing model and its metrics for each target and each type of data.
        - Conclusion: AUC gets worse as the prediction window grows, and the model cannot predict the 90-day target at all. The transformed features mostly beat the raw features, but the resulting PR-AUC is only slightly better than random guessing (0.55).
- `Call Models`: Get the current production model for each target for inference.
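
A note for reading the PR-AUC figures above: unlike ROC-AUC, whose no-skill baseline is 0.5, the PR-AUC of a random scorer sits near the positive-class rate. The sketch below checks this on synthetic labels. The 30% churn rate and the sample size are made-up values for illustration, not numbers from this dataset.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(42)

# Hypothetical churn rate and sample size, chosen only for illustration.
churn_rate = 0.30
n_customers = 20_000

y_true = (rng.random(n_customers) < churn_rate).astype(int)
random_scores = rng.random(n_customers)  # a scorer with no skill

pr_auc = average_precision_score(y_true, random_scores)
roc_auc = roc_auc_score(y_true, random_scores)

print(f"positive rate:            {y_true.mean():.3f}")
print(f"PR-AUC of random scores:  {pr_auc:.3f}")
print(f"ROC-AUC of random scores: {roc_auc:.3f}")
```

So a PR-AUC is best judged against the churn rate of the target it was computed on, which can differ between the 30, 60, and 90 day labels.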

# Preparation

## Libraries

In [2]:
import pandas as pd

In [3]:
import numpy as np

In [4]:
from dotenv import load_dotenv
import os

In [5]:
import maika_eda_pandas as mk

In [6]:
from scipy import stats

In [7]:
from src.core.transforms import (
    transform_transactions_df,
    transform_customers_df,
    get_customers_screenshot_summary_from_transactions_df,
    add_churn_status,
)

In [8]:
import plotly.express as px
import plotly.graph_objects as go

In [9]:
# Features Processing

from sklearn.model_selection import train_test_split

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

from sklearn.preprocessing import StandardScaler

from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import mutual_info_classif

In [10]:
import joblib
import json
from pathlib import Path

In [11]:
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
import xgboost as xgb

In [12]:
from imblearn.over_sampling import SMOTE

In [13]:
import matplotlib.pyplot as plt

In [14]:
from sklearn.metrics import (
    roc_auc_score,
    accuracy_score,
    average_precision_score,
    precision_score,
    recall_score,
    confusion_matrix,
    classification_report
)

In [15]:
import mlflow
from mlflow.models import infer_signature

In [16]:
import tempfile

In [17]:
from sklearn.model_selection import GridSearchCV

In [18]:
from sklearn.metrics import make_scorer

## Environment

In [19]:
load_dotenv()

True

In [20]:
PROJECT_ROOT = Path.cwd().parent

In [21]:
MAX_DATA_DATE = pd.Timestamp('2025-12-31')
MAX_DATA_DATE_STR = MAX_DATA_DATE.strftime("%d_%m_%Y")
TRAIN_SNAPSHOT_DATE = MAX_DATA_DATE - pd.Timedelta(90, 'day')

In [22]:
BASE_GOLD_DIR = PROJECT_ROOT / "data" / "gold" / MAX_DATA_DATE_STR

In [23]:
# This is no longer used. Updated to mlflow instead.
#ARTIFACT_DIR = PROJECT_ROOT / MAX_DATA_DATE_STR / "/src/models/preprocessing"

In [24]:
MLRUNS_DIR = PROJECT_ROOT / "mlruns"

In [25]:
SEED_CUSTOMERS=os.getenv("SEED_CUSTOMERS")
SEED_TRANSACTIONS=os.getenv("SEED_TRANSACTIONS")

In [26]:
targets = [
    "is_churn_30_days",
    "is_churn_60_days",
    "is_churn_90_days",
]

In [27]:
EXPERIMENT_NAME = "churn-lightgbm"

In [28]:
ARTIFACT_DIR = Path(tempfile.mkdtemp())

In [29]:
PREPROCESSING_REF_DIR = (
    BASE_GOLD_DIR / "reference" / "preprocessing"
)
PREPROCESSING_REF_DIR.mkdir(parents=True, exist_ok=True)

In [30]:
mlflow.set_tracking_uri(f"file://{MLRUNS_DIR}")

In [31]:
print("Tracking URI:", mlflow.get_tracking_uri())

Tracking URI: file:///home/hong-mai/Desktop/HONGMAI/Coding/ai-customer-growth-retention/mlruns


In [32]:
from mlflow.tracking import MlflowClient

client = MlflowClient()

## Custom Wrappers

### Feature Engineering

In [None]:
def get_rfm_window_features(customers_df, transactions_df, observed_date):

    rfm_time_windows = ["all_time", "30d", "60d", "90d"]

    for rfm_time_window in rfm_time_windows:

        if rfm_time_window == "all_time":
            filtered_transactions_df = transactions_df
        else:
            # Keep only transactions dated on or before (observed_date - window length)
            days = int(rfm_time_window.strip("d"))
            filtered_transactions_df = transactions_df[
                (transactions_df['transaction_date'] <= observed_date - pd.Timedelta(days=days))
            ]

        # Get a Customers Screenshot Summary DataFrame. It has RFM features and other variables that RFM features depend on.
        summary_modeling_df = get_customers_screenshot_summary_from_transactions_df(
            transactions_df=filtered_transactions_df,
            observed_date=observed_date,
            column_names=["customer_id", "transaction_date", "amount"]
        )

        # Keep only customer_id and the RFM columns we care about
        summary_modeling_df = summary_modeling_df[[
            'customer_id',
            'days_until_observed',
            'period_transaction_count',
            'period_total_amount',
            'period_tenure_days'
        ]]

        # Rename columns in the summary DF, not the main DF
        summary_modeling_df = summary_modeling_df.rename(columns={
            'days_until_observed': f'rfm_recency_{rfm_time_window}',
            'period_transaction_count': f'rfm_frequency_{rfm_time_window}',
            'period_total_amount': f'rfm_monetary_{rfm_time_window}',
            'period_tenure_days': f'tenure_{rfm_time_window}'
        })
        
        # Merge with current data used for modelling.
        customers_df = pd.merge(
            customers_df,
            summary_modeling_df,
            on="customer_id",
            how="left"
        )

    return customers_df
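
To make the RFM columns concrete, here is a minimal sketch on toy data. It assumes recency is the number of days from a customer's last transaction to the observed date, frequency is the transaction count, and monetary is the summed amount. The real definitions live in `get_customers_screenshot_summary_from_transactions_df`, which is not shown in this notebook, so `toy_rfm` below is a hypothetical stand-in and may differ from it.

```python
import pandas as pd


def toy_rfm(transactions_df, observed_date):
    """Hypothetical stand-in for the RFM part of the summary helper."""
    rfm = transactions_df.groupby("customer_id").agg(
        last_transaction=("transaction_date", "max"),
        rfm_frequency=("transaction_date", "count"),
        rfm_monetary=("amount", "sum"),
    )
    # Recency: days between the last transaction and the observed date
    rfm["rfm_recency"] = (observed_date - rfm["last_transaction"]).dt.days
    return rfm.drop(columns="last_transaction").reset_index()


toy_transactions = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "transaction_date": pd.to_datetime(["2025-01-01", "2025-01-20", "2025-01-10"]),
    "amount": [10.0, 15.0, 40.0],
})

toy_rfm_df = toy_rfm(toy_transactions, observed_date=pd.Timestamp("2025-01-31"))
print(toy_rfm_df)
```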

In [None]:
def get_slope_features(customers_df, transactions_df, observed_date, feature_list):

    time_windows = ["all_time", "30d", "60d", "90d"]

    for time_window in time_windows:

        if time_window == "all_time":
            filtered_transactions_df = transactions_df
        else:
            # Keep only transactions dated on or before (observed_date - window length)
            days = int(time_window.strip("d"))
            filtered_transactions_df = transactions_df[
                (transactions_df['transaction_date'] <= observed_date - pd.Timedelta(days=days))
            ]

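    # NOTE: everything below runs after the loop, so slopes are computed only on
    # the last window's filtered transactions ("90d").
    customers_list = filtered_transactions_df['customer_id'].unique()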

    slopes = {}

    for customer_id in customers_list:

        customer_transactions = filtered_transactions_df[filtered_transactions_df['customer_id'] == customer_id]

        x = np.arange(len(customer_transactions)) #time axis
        slopes[customer_id] = {} #initiate value list

        for feature_name in feature_list:
            y = customer_transactions[feature_name].values
            x_valid = x[~np.isnan(y)]
            y_valid = y[~np.isnan(y)]

            if len(y_valid) < 2:
                slopes[customer_id][feature_name] = np.nan
            else:
                slope = np.polyfit(x_valid, y_valid, 1)[0]
                slopes[customer_id][feature_name] = slope

    # Convert dict of dicts into dataframe
    slope_features_df = pd.DataFrame.from_dict(slopes, orient='index')

    # Rename columns to have slope_ prefix
    slope_features_df = slope_features_df.rename(columns={f: f'slope_{f}' for f in slope_features_df.columns})

    # Reset index to have customer_id as a column
    slope_features_df = slope_features_df.reset_index().rename(columns={'index': 'customer_id'})

    # Merge with current data used for modelling.
    customers_df = pd.merge(
        customers_df,
        slope_features_df,
        on="customer_id",
        how="left"
    )

    return customers_df
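
The slope feature is the first-degree coefficient of a least-squares line through a customer's values, with the transaction order (0, 1, 2, ...) as the x-axis. A small sketch with made-up amounts for a single customer:

```python
import numpy as np

# Made-up amounts for one customer, in transaction order; NaN marks a missing value.
amounts = np.array([10.0, np.nan, 30.0, 40.0])
order = np.arange(len(amounts))  # 0, 1, 2, 3

valid = ~np.isnan(amounts)
if valid.sum() < 2:
    # A line needs at least two points
    slope = np.nan
else:
    # polyfit returns [slope, intercept] for degree 1
    slope = np.polyfit(order[valid], amounts[valid], 1)[0]

print(round(slope, 6))
```

Because the x-axis is the row position, the result depends on the transactions being sorted by date within each customer, which `add_transaction_time_features` takes care of.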

In [None]:
def get_transaction_statistics_features(customers_df, transactions_df, observed_date, feature_list):

    time_windows = ["all_time", "30d", "60d", "90d"]

    all_stats_df_list = []

    for time_window in time_windows:

        if time_window == "all_time":
            filtered_transactions_df = transactions_df
        else:
            # Keep only transactions dated on or before (observed_date - window length)
            days = int(time_window.strip("d"))
            filtered_transactions_df = transactions_df[
                (transactions_df['transaction_date'] <= observed_date - pd.Timedelta(days=days))
            ]

        customers_list = filtered_transactions_df['customer_id'].unique()
        stats_dict = {}

        for customer_id in customers_list:

            customer_transactions = filtered_transactions_df[
                filtered_transactions_df['customer_id'] == customer_id
            ]

            stats_dict[customer_id] = {}

            for feature_name in feature_list:

                y = customer_transactions[feature_name].dropna().values

                if len(y) < 2:
                    # Less than 2 observations -> return NaN for all stats
                    stats_dict[customer_id][f"min_{feature_name}"] = np.nan
                    stats_dict[customer_id][f"mean_{feature_name}"] = np.nan
                    stats_dict[customer_id][f"mode_{feature_name}"] = np.nan
                    stats_dict[customer_id][f"max_{feature_name}"] = np.nan
                    for q in [1, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 99]:
                        stats_dict[customer_id][f"q{q}_{feature_name}"] = np.nan
                    continue

                # Compute stats
                stats_dict[customer_id][f"min_{feature_name}"] = np.min(y)
                stats_dict[customer_id][f"mean_{feature_name}"] = np.mean(y)

                # Compute mode safely
                mode_result = stats.mode(y, nan_policy='omit')
                if hasattr(mode_result.mode, "__len__"):
                    # old SciPy: mode is array
                    mode_val = mode_result.mode[0] if len(mode_result.mode) > 0 else np.nan
                else:
                    # new SciPy: mode is scalar
                    mode_val = mode_result.mode if mode_result.count > 0 else np.nan

                stats_dict[customer_id][f"mode_{feature_name}"] = mode_val

                stats_dict[customer_id][f"max_{feature_name}"] = np.max(y)

                # Quantiles
                for q in [1, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 99]:
                    stats_dict[customer_id][f"q{q}_{feature_name}"] = np.percentile(y, q)

        # Convert to dataframe
        stats_df = pd.DataFrame.from_dict(stats_dict, orient='index').reset_index().rename(columns={'index': 'customer_id'})
        all_stats_df_list.append(stats_df)

    # Merge with customers_df (only keep last time_window stats)
    final_stats_df = all_stats_df_list[-1]  # or merge all windows if needed
    customers_df = pd.merge(customers_df, final_stats_df, on='customer_id', how='left')

    return customers_df
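
The per-customer loop above can get slow on a large transactions table. As one possible alternative (not what this notebook uses), a pandas `groupby` can produce the same kind of min / mean / max and percentile columns in a few calls. This sketch covers a single toy feature and skips the mode and the "fewer than 2 observations gives NaN" rule, so it is not a drop-in replacement.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b", "b"],
    "amount": [10.0, 20.0, 60.0, 5.0, 15.0],
})

grouped = toy.groupby("customer_id")["amount"]

# min / mean / max in one pass over the groups
toy_stats_df = grouped.agg(min_amount="min", mean_amount="mean", max_amount="max")

# Percentiles: quantile() with a list returns a (customer_id, q) MultiIndex,
# so unstack() turns the q level into columns.
quantiles = grouped.quantile([0.10, 0.50, 0.90]).unstack()
quantiles.columns = [f"q{int(round(q * 100))}_amount" for q in quantiles.columns]

toy_stats_df = toy_stats_df.join(quantiles).reset_index()
print(toy_stats_df)
```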

In [4]:
def build_training_base(
    seed_customers_path,
    seed_transactions_path,
    train_snapshot_date,
    churn_windows=(30, 60, 90),  # tuple avoids a mutable default argument
):
    """
    Reads raw data, transforms it, limits it to modeling window,
    builds customer modeling table, and adds churn labels.
    """

    # --- Read data ---
    customers_df = pd.read_csv(seed_customers_path)
    transactions_df = pd.read_csv(seed_transactions_path)

    # --- Transform data ---
    transactions_df = transform_transactions_df(transactions_df)
    customers_df = transform_customers_df(customers_df)

    # --- Derive MAX_DATA_DATE internally ---
    max_data_date = transactions_df["transaction_date"].max()

    # --- Limit transactions to snapshot ---
    transactions_modeling_df = transactions_df.loc[
        transactions_df["transaction_date"] <= train_snapshot_date
    ]

    # --- Build customer modeling base ---
    customers_modeling_df = (
        pd.DataFrame({
            "customer_id": transactions_modeling_df["customer_id"].unique()
        })
        .merge(customers_df, on="customer_id", how="inner")
        .drop(columns=["signup_date", "true_lifetime_days", "termination_date"])
    )

    # --- Add churn labels ---
    for nday in churn_windows:
        var_name = f"is_churn_{nday}_days"
        observed_date = max_data_date - pd.Timedelta(days=nday)

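        # NOTE: add_churn_status is given customers_df, not customers_modeling_df,
        # and its result replaces customers_modeling_df on every iteration.
        # Check that the returned frame still carries every is_churn_*_days column.
        customers_modeling_df = add_churn_status(
            transformed_customers_df=customers_df,
            observed_date=observed_date,
            desired_df=None,
        )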
        
        customers_modeling_df = customers_modeling_df.rename(columns={'is_churn': var_name})

    return transactions_modeling_df, customers_modeling_df

In [None]:
def add_transaction_time_features(transactions_df):
    """
    Add time-based and order-based transaction features.

    Parameters
    ----------
    transactions_df : pd.DataFrame
        Must contain: customer_id, transaction_date

    Returns
    -------
    pd.DataFrame
        Copy of transactions_df with added features
    """

    df = transactions_df.sort_values(
        ["customer_id", "transaction_date"]
    ).copy()

    df["customer_transaction_order"] = (
        df.groupby("customer_id").cumcount()
    )

    df["prev_transaction_date"] = (
        df.groupby("customer_id")["transaction_date"].shift(1)
    )

    df["next_transaction_date"] = (
        df.groupby("customer_id")["transaction_date"].shift(-1)
    )

    df["days_since_previous_transaction"] = (
        df["transaction_date"] - df["prev_transaction_date"]
    ).dt.days

    df["days_until_next_transaction"] = (
        df["next_transaction_date"] - df["transaction_date"]
    ).dt.days

    df["first_transaction_date"] = (
        df.groupby("customer_id")["transaction_date"]
        .transform("min")
    )

    df["days_since_first_transaction"] = (
        df["transaction_date"] - df["first_transaction_date"]
    ).dt.days

    return df
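
A toy run of the same `groupby` + `shift` idea shows what these columns look like. The data is made up. Note that each customer's first transaction has no previous one, so its gap is NaN (and likewise the last transaction has no next one), which is one reason imputation is needed later.

```python
import pandas as pd

toy = pd.DataFrame({
    "customer_id": ["a", "b", "a", "a"],
    "transaction_date": pd.to_datetime(
        ["2025-01-10", "2025-01-05", "2025-01-01", "2025-01-25"]
    ),
}).sort_values(["customer_id", "transaction_date"])

# Previous transaction date within each customer
prev_date = toy.groupby("customer_id")["transaction_date"].shift(1)

toy["days_since_previous_transaction"] = (toy["transaction_date"] - prev_date).dt.days
toy["customer_transaction_order"] = toy.groupby("customer_id").cumcount()

print(toy)
```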

### Helpers

In [85]:
def check_nan_in_df_cols(df):
    # Get relative percentage of nulls by column
    null_features_proportion = (
        df.isna().sum() / len(df)
    ).sort_values(ascending=False)

    high_proportion = []
    medium_proportion = []
    low_proportion = []

    for feature, proportion in null_features_proportion.items():
        if proportion >= 0.20:
            high_proportion.append(feature)
        elif 0.05 <= proportion < 0.20:
            medium_proportion.append(feature)
        else:
            low_proportion.append(feature)

    # Build features DataFrame
    features_df = null_features_proportion.reset_index()
    features_df.columns = ["feature", "nan_proportion"]

    features_df["NaN group"] = features_df["feature"].apply(
        lambda f: (
            "High" if f in high_proportion
            else "Medium" if f in medium_proportion
            else "Low"
        )
    )

    # Print the number of features in each NaN group
    print("Total features:", len(df.columns))
    print("Information on NaN values")
    print("====================================")
    print("Number of High Proportion Features:", len(high_proportion))
    print("Number of Medium Proportion Features:", len(medium_proportion))
    print("Number of Low Proportion Features:", len(low_proportion))

    return features_df


In [87]:
def save_X_csv(X_by_target, BASE_GOLD_DIR):

    for target in X_by_target.keys():

        target_dir = BASE_GOLD_DIR / target
        target_dir.mkdir(parents=True, exist_ok=True)

        X_by_target[target].to_csv(
            target_dir / "X_train.csv",
            index=True,
        )

        print(f"[{target}] written to {target_dir}")
    
    return "All data saved successfully."

In [88]:
def save_y_csv(
        X_by_target,
        y,
        BASE_GOLD_DIR
    ):

    for target in X_by_target.keys():  # use the dict's own keys, not the global `targets`
        target_dir = BASE_GOLD_DIR / target
        target_dir.mkdir(parents=True, exist_ok=True)

        # ----------------------------
        # TRAIN labels
        # ----------------------------
        y.loc[
            X_by_target[target].index, target
        ].to_csv(
            target_dir / "y_train.csv",
            header=True,
        )
    
    return "All data saved successfully."

In [None]:
def save_raw_features_csv(df, split, base_gold_dir, index_name='customer_id'):
    
    path = Path(base_gold_dir) / "raw"
    path.mkdir(parents=True, exist_ok=True)

    print("WRITING TO:", path.resolve())

    df.index.name = index_name
    df.to_csv(
        path / f"{split}_features.csv",
        index=True, # keep customer_id
    )

In [90]:
def save_transformed_by_target_csv(X_by_target, split, base_gold_dir, index_name='customer_id'):

    for target, df in X_by_target.items():
        
        base_path = Path(base_gold_dir) / "transformed" / target
        base_path.mkdir(parents=True, exist_ok=True)

        df.index.name = index_name
        df.to_csv(
            base_path / f"X_{split}.csv",
            index=True,  # keep customer_id
        )

In [145]:
def load_transformed(BASE_GOLD_DIR, split, target):
    return pd.read_csv(
        BASE_GOLD_DIR / "transformed" / target / f"X_{split}.csv",
        index_col=0,
    )

### Feature Transformation

In [92]:
def mutual_information_feature_selection(
    X_train,
    y_train,
    target,
    cutoff=0.0,
    random_state=42
):
    """
    Perform mutual information–based feature selection for a given target.

    Returns:
        selected_df: DataFrame with selected features
        mi_scores: DataFrame with MI scores per feature
        selected_features: Index of selected feature names
    """

    assert X_train.index.equals(y_train.index)

    mi_train = mutual_info_classif(
        X_train,
        y_train[target],
        random_state=random_state
    )

    mi_scores = (
        pd.DataFrame(
            mi_train,
            index=X_train.columns,
            columns=["mutual_info"]
        )
        .sort_values(by="mutual_info", ascending=False)
    )

    selected_features = mi_scores.loc[
        mi_scores["mutual_info"] > cutoff
    ].index

    selected_df = X_train[selected_features]

    return selected_df, mi_scores, selected_features
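
A quick sanity check of the selection logic on synthetic data, where one feature drives the label and the other is pure noise. The column names and data are invented for this sketch. With `cutoff=0.0`, only features whose estimated MI is exactly zero get dropped, and since the estimator is noisy, an unrelated feature can still land slightly above zero and survive.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 2_000

# Synthetic data: `signal` drives the label, `noise` is unrelated to it.
signal = rng.normal(size=n)
noise = rng.normal(size=n)
y = (signal + 0.3 * rng.normal(size=n) > 0).astype(int)

X = pd.DataFrame({"noise": noise, "signal": signal})

mi = mutual_info_classif(X, y, random_state=42)
mi_scores = pd.Series(mi, index=X.columns, name="mutual_info").sort_values(
    ascending=False
)

cutoff = 0.0
selected_features = mi_scores[mi_scores > cutoff].index.tolist()

print(mi_scores)
print(selected_features)
```

A slightly higher cutoff would be one way to make the selection stricter.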

### Feature Processing Pipeline

In [94]:
def build_customer_features(
    transactions_modeling_df,
    customers_modeling_df,
    observed_date,
    feature_list=[
        "amount",
        "days_since_previous_transaction",
        "days_until_next_transaction",
        "customer_transaction_order",
        "days_since_first_transaction",
    ],
):
    """
    Build raw customer-level features from transactions and customers data.
    No imputing, scaling, or selection is performed here.
    """

    # 1. Transaction-level features
    transactions_df = add_transaction_time_features(
        transactions_modeling_df
    )

    # 2. RFM window features
    customers_df = get_rfm_window_features(
        customers_df=customers_modeling_df,
        transactions_df=transactions_df,
        observed_date=observed_date,
    )

    # 3. Activity trend (slopes)
    customers_df = get_slope_features(
        customers_df=customers_df,
        transactions_df=transactions_df,
        observed_date=observed_date,
        feature_list=feature_list,
    )

    # 4. Transaction statistics
    customers_df = get_transaction_statistics_features(
        customers_df=customers_df,
        transactions_df=transactions_df,
        observed_date=observed_date,
        feature_list=feature_list,
    )

    return customers_df

In [95]:
def fit_numeric_transformers(
    X_train_numeric_df,
    imputer_params=None,
    scaler_params=None,
):
    """
    Fit numeric imputer and scaler on training data only.

    Returns
    -------
    numeric_imputer : fitted IterativeImputer
    scaler : fitted StandardScaler
    """

    # -------------------------------
    # Defaults
    # -------------------------------
    if imputer_params is None:
        imputer_params = dict(
            estimator=LinearRegression(),
            max_iter=20,
            random_state=42,
        )

    if scaler_params is None:
        scaler_params = {}

    # -------------------------------
    # Imputation (FIT)
    # -------------------------------
    numeric_imputer = IterativeImputer(**imputer_params)
    X_train_numeric_imputed = numeric_imputer.fit_transform(X_train_numeric_df)

    X_train_numeric_imputed_df = pd.DataFrame(
        X_train_numeric_imputed,
        columns=X_train_numeric_df.columns,
        index=X_train_numeric_df.index,
    )

    # -------------------------------
    # Scaling (FIT)
    # -------------------------------
    scaler = StandardScaler(**scaler_params)
    X_train_numeric_imputed_scaled = scaler.fit_transform(
        X_train_numeric_imputed_df
    )

    X_train_numeric_imputed_scaled_df = pd.DataFrame(
        X_train_numeric_imputed_scaled,
        columns=X_train_numeric_df.columns,
        index=X_train_numeric_df.index,
    )

    return (
        numeric_imputer,
        scaler,
    )

In [96]:
def transform_customers_numeric_features(
    X_numeric,
    numeric_imputer,
    scaler,
):
    """
    Apply fitted numeric imputer and scaler.
    """

    X_numeric_imputed = numeric_imputer.transform(X_numeric)
    X_numeric_imputed_df = pd.DataFrame(
        X_numeric_imputed,
        columns=X_numeric.columns,
        index=X_numeric.index,
    )

    X_numeric_scaled = scaler.transform(X_numeric_imputed_df)
    X_numeric_scaled_df = pd.DataFrame(
        X_numeric_scaled,
        columns=X_numeric.columns,
        index=X_numeric.index,
    )

    return X_numeric_scaled_df


In [97]:
def select_features_per_target(
    X_train_transformed_df,
    y_train,
    targets,
    artifact_dir=None,
    cutoff=0.0,
    random_state=42,
):
    """
    Perform feature selection per target using mutual information.
    """

    assert X_train_transformed_df.index.equals(y_train.index), (
        "X_train and y_train must be index-aligned"
    )

    X_train_by_target = {}
    selected_features_by_target = {}
    mi_scores_by_target = {}

    for target in targets:
        X_selected_df, mi_scores, selected_features = (
            mutual_information_feature_selection(
                X_train=X_train_transformed_df,
                y_train=y_train,
                target=target,
                cutoff=cutoff,
                random_state=random_state,
            )
        )

        if artifact_dir is not None:
            with open(
                artifact_dir / f"selected_features_{target}.json",
                "w",
            ) as f:
                json.dump(list(selected_features), f)

        X_train_by_target[target] = X_selected_df
        selected_features_by_target[target] = list(selected_features)
        mi_scores_by_target[target] = mi_scores

        print(f"[{target}] selected {len(selected_features)} features")

    return (
        X_train_by_target,
        selected_features_by_target,
        mi_scores_by_target,
    )

In [98]:
def get_features_per_target(
    X_transformed_df,
    selected_features_by_target
):
    """
    Subset an already-transformed dataframe to the features previously selected for each target.
    """

    X_by_target = {}

    for target, selected_features in selected_features_by_target.items():

        missing_features = set(selected_features) - set(
            X_transformed_df.columns
        )
        if missing_features:
            raise ValueError(
                f"Missing selected features at inference time: {missing_features}"
            )

        X_selected_features = X_transformed_df[selected_features]
        X_by_target[target] = X_selected_features

    return X_by_target

In [99]:
def split_train_test_val(
    customers_modeling_df,
    targets,
    test_size=0.33,
    val_size=0.33,
    random_state=42,
):
    """
    Split customer modeling dataframe into train / val / test sets.

    Parameters
    ----------
    customers_modeling_df : pd.DataFrame
        Must contain customer_id and target columns.
    targets : list[str]
        Target column names.
    test_size : float
        Proportion of the data held out for the combined test + validation split.
    val_size : float
        Proportion of that held-out split used for validation.
    random_state : int

    Returns
    -------
    X_train, X_val, X_test, y_train, y_val, y_test
    """

    # -------------------------------
    # Feature / target separation
    # -------------------------------
    X_df = customers_modeling_df.drop(columns=targets)
    X_df = X_df.set_index("customer_id", drop=True)

    y_df = customers_modeling_df[["customer_id"] + targets]
    y_df = y_df.set_index("customer_id", drop=True)

    # -------------------------------
    # Train / temp split
    # -------------------------------
    X_train, X_temp, y_train, y_temp = train_test_split(
        X_df,
        y_df,
        test_size=test_size,
        random_state=random_state,
    )

    # -------------------------------
    # Test / validation split
    # -------------------------------
    X_test, X_val, y_test, y_val = train_test_split(
        X_temp,
        y_temp,
        test_size=val_size,
        random_state=random_state,
    )

    return X_train, X_val, X_test, y_train, y_val, y_test
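
Because `val_size` is applied to the held-out part rather than to the full data, the defaults (0.33 and 0.33) work out to roughly 67% train, 22% test, and 11% validation. A sketch on a dummy frame of 1,000 rows (made-up size) shows the resulting counts:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({"x": np.arange(1000)})

# First split: train vs. held-out (test + val)
train_part, temp_part = train_test_split(toy, test_size=0.33, random_state=42)

# Second split: the held-out part is divided into test and val
test_part, val_part = train_test_split(temp_part, test_size=0.33, random_state=42)

for name, part in [("train", train_part), ("test", test_part), ("val", val_part)]:
    print(f"{name}: {len(part)} rows ({len(part) / len(toy):.1%})")
```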

In [None]:
def build_and_transform_customer_features_pipeline_train(
    transactions_modeling_df,
    X_train,
    y_train,
    observed_date,
    targets,
    ARTIFACT_DIR=None,
    feature_list=[
        "amount",
        "days_since_previous_transaction",
        "days_until_next_transaction",
        "customer_transaction_order",
        "days_since_first_transaction",
    ],
):
    """
    End-to-end pipeline for TRAIN data.
    """

    # --------------------------------------------------
    # 1. Build raw customer features
    # --------------------------------------------------
    X_train_raw_features_df = build_customer_features(
        transactions_modeling_df=transactions_modeling_df,
        customers_modeling_df=X_train,
        observed_date=observed_date,
        feature_list=feature_list,
    )

    # --------------------------------------------------
    # 2. Numeric transform (impute + scale)
    # --------------------------------------------------
    X_train_raw_features_df = X_train_raw_features_df.set_index("customer_id", drop=False)
    X_train_raw_features_numeric_df = X_train_raw_features_df.select_dtypes(include="number")

    numeric_imputer, scaler = fit_numeric_transformers(
        X_train_raw_features_numeric_df,
        imputer_params=None,
        scaler_params=None,
    )

    X_train_transformed_df = transform_customers_numeric_features(
        X_train_raw_features_numeric_df,
        numeric_imputer,
        scaler,
    )

    # --------------------------------------------------
    # 3. Feature selection per target (EXTRACTED)
    # --------------------------------------------------
    (
        X_train_by_target,
        selected_features_by_target,
        mi_scores_by_target,
    ) = select_features_per_target(
        X_train_transformed_df=X_train_transformed_df,
        y_train=y_train,
        targets=targets,
        artifact_dir=ARTIFACT_DIR,
    )

    # --------------------------------------------------
    # 4. Save transformers ONCE
    # --------------------------------------------------
    if ARTIFACT_DIR is not None:
        joblib.dump(
            numeric_imputer,
            ARTIFACT_DIR / "numeric_imputer.joblib",
        )
        joblib.dump(
            scaler,
            ARTIFACT_DIR / "scaler.joblib",
        )

    X_train_raw_features_df = X_train_raw_features_df.drop(columns=['customer_id'])

    return (
        X_train_raw_features_df,
        X_train_by_target,
        selected_features_by_target,
        mi_scores_by_target,
        numeric_imputer,
        scaler,
    )

In [101]:
def build_and_transform_customer_features_pipeline_test(
    transactions_modeling_df,
    X_test,
    observed_date,
    numeric_imputer,
    scaler,
    selected_features,
    feature_list=[
        "amount",
        "days_since_previous_transaction",
        "days_until_next_transaction",
        "customer_transaction_order",
        "days_since_first_transaction",
    ],
):
    """
    End-to-end pipeline for TEST / VAL / INFERENCE data.

    Steps
    -----
    1. Build raw customer-level features from transactions
    2. Set customer_id as the index (removing it from the feature space)
    3. Select numeric features and enforce training-time column order
    4. Apply fitted numeric transformations (imputer + scaler)
    5. Select precomputed feature subset (STRICT reuse)
    """

    # --------------------------------------------------
    # 1. Build raw customer features
    # --------------------------------------------------
    X_test_features_df = build_customer_features(
        transactions_modeling_df=transactions_modeling_df,
        customers_modeling_df=X_test,
        observed_date=observed_date,
        feature_list=feature_list,
    )

    # --------------------------------------------------
    # 2. Set customer_id as index and REMOVE from features
    # --------------------------------------------------
    if "customer_id" not in X_test_features_df.columns:
        raise ValueError("customer_id column missing after feature building")

    X_test_features_df = X_test_features_df.set_index("customer_id", drop=True)

    # --------------------------------------------------
    # 3. Select numeric features and enforce column order
    # --------------------------------------------------
    X_test_numeric_features_df = X_test_features_df.select_dtypes(include="number")

    # Enforce training-time column order (critical for IterativeImputer)
    X_test_numeric_features_df = X_test_numeric_features_df[
        numeric_imputer.feature_names_in_
    ]

    # --------------------------------------------------
    # 4. Apply fitted numeric transformations (NO FIT)
    # --------------------------------------------------
    X_test_numeric_features_transformed_df = transform_customers_numeric_features(
        X_test_numeric_features_df,
        numeric_imputer,
        scaler,
    )

    # --------------------------------------------------
    # 5. Feature selection (STRICT reuse)
    # --------------------------------------------------
    missing_features = set(selected_features) - set(
        X_test_numeric_features_transformed_df.columns
    )
    if missing_features:
        raise ValueError(
            f"Missing selected features at inference time: {missing_features}"
        )

    X_test_final_df = X_test_numeric_features_transformed_df[selected_features]

    return X_test_final_df

In [102]:
def transform_and_select_for_multiple_targets_test(
    X_test_raw_features_df,
    numeric_imputer,
    scaler,
    selected_features_by_target
):
    """
    Transform already-built raw customer features and select the
    per-target feature subsets (test / val / inference).

    Returns
    -------
    X_by_target : dict[str, pd.DataFrame]
    """

    X_by_target = {}

    # Select numeric features and enforce column order
    X_test_numeric_features_df = X_test_raw_features_df.select_dtypes(include="number")

    # Enforce training-time column order (critical for IterativeImputer)
    X_test_numeric_features_df = X_test_numeric_features_df[
        numeric_imputer.feature_names_in_
    ]

    X_test_transformed_df = transform_customers_numeric_features(
        X_test_numeric_features_df,
        numeric_imputer,
        scaler,
    )

    X_by_target = get_features_per_target(
        X_test_transformed_df,
        selected_features_by_target
    )

    return X_by_target

In [103]:
def build_and_transform_for_multiple_targets(
    transactions_modeling_df,
    X_df,
    observed_date,
    numeric_imputer,
    scaler,
    selected_features_by_target,
):
    """
    Build and transform customer features for multiple targets
    (test / val / inference).

    Returns
    -------
    X_by_target : dict[str, pd.DataFrame]
    """

    X_by_target = {}

    for target, selected_features in selected_features_by_target.items():
        X_by_target[target] = build_and_transform_customer_features_pipeline_test(
            transactions_modeling_df=transactions_modeling_df,
            X_test=X_df,
            observed_date=observed_date,
            numeric_imputer=numeric_imputer,
            scaler=scaler,
            selected_features=selected_features,
            feature_list=[
                "amount",
                "days_since_previous_transaction",
                "days_until_next_transaction",
                "customer_transaction_order",
                "days_since_first_transaction",
            ],
        )

    return X_by_target

### Model

In [105]:
def plot_lgb_feature_importance(
    model,
    importance_type="gain",   # "gain" or "split"
    normalize=False,
    top_n=None,
    title=None,
    height=600,
    as_percent=True
):
    """
    Plot LightGBM feature importance for sklearn API models.
    """

    # --- Extract feature names ---
    if hasattr(model, "feature_name_"):
        features = model.feature_name_
    else:
        raise ValueError("Model does not contain feature names")

    # --- Extract importance correctly ---
    if importance_type == "split":
        importance = model.feature_importances_
    elif importance_type == "gain":
        importance = model.booster_.feature_importance(importance_type="gain")
    else:
        raise ValueError("importance_type must be 'gain' or 'split'")

    df = pd.DataFrame({
        "feature": features,
        "importance": importance
    })

    # Remove zero-importance features (copy so later column updates do not touch a view)
    df = df[df["importance"] > 0].copy()

    # --- Normalize if requested ---
    # Label by the importance type actually plotted, not always "Gain"
    base_label = "Gain" if importance_type == "gain" else "Split Count"
    if normalize:
        total = df["importance"].sum()
        df["importance"] = df["importance"] / total
        if as_percent:
            df["importance"] *= 100
            importance_label = f"Normalized {base_label} (%)"
            text_fmt = ".2f"
        else:
            importance_label = f"Normalized {base_label}"
            text_fmt = ".4f"
    else:
        importance_label = base_label
        text_fmt = ".2f"

    # Sort and keep top N
    df = df.sort_values("importance", ascending=False)
    if top_n is not None:
        df = df.head(top_n)

    # Reverse for horizontal bar chart
    df = df.sort_values("importance", ascending=True)

    if title is None:
        norm_tag = " (Normalized)" if normalize else ""
        title = f"LightGBM Feature Importance ({importance_type.capitalize()}){norm_tag}"

    fig = px.bar(
        df,
        x="importance",
        y="feature",
        orientation="h",
        title=title,
        labels={
            "importance": importance_label,
            "feature": "Feature"
        },
        text=df["importance"]
    )

    fig.update_traces(
        texttemplate=f"%{{text:{text_fmt}}}",
        textposition="outside",
        cliponaxis=False
    )

    fig.update_layout(
        height=height,
        yaxis=dict(categoryorder="total ascending"),
        margin=dict(r=120)
    )

    fig.show()

In [106]:
def evaluate_binary_model(model, X, y, threshold=0.5):
    """
    Evaluate a binary classifier.
    """

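    # The sklearn-API predict() returns class labels; the metrics below need probabilities.
    y_proba = model.predict_proba(X, num_iteration=model.best_iteration_)[:, 1]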
    y_pred = (y_proba >= threshold).astype(int)

    metrics = {
        "roc_auc": roc_auc_score(y, y_proba),
        "pr_auc": average_precision_score(y, y_proba),
        "confusion_matrix": confusion_matrix(y, y_pred)
    }

    return metrics

In [107]:
def show_styled_df_confusion_matrix(cm):

    cm_df = pd.DataFrame(
        cm,
        index=["Actual 0", "Actual 1"],
        columns=["Predicted 0", "Predicted 1"]
    )

    styled_df = (
        cm_df.style
        .background_gradient(cmap="Blues")
        .format("{:.0f}")
    )
    
    return styled_df

In [108]:
def evaluate_model(name, model, X_train, y_train, X_test, y_test, X_val, y_val, threshold=0.5):
    """
    Evaluate a binary classifier on train, validation, and test sets.
    Prints:
    - ROC-AUC
    - PR-AUC (Precision–Recall)
    - Accuracy
    - Confusion Matrix
    - Classification Report
    """
    print(f"\n===== {name} =====")

    for split_name, X, y in [
        ("TRAIN", X_train, y_train),
        ("TEST", X_test, y_test),
        ("VALIDATION", X_val, y_val),
    ]:
        # Predicted probabilities and labels
        y_proba = model.predict_proba(X)[:, 1]
        y_pred = (y_proba >= threshold).astype(int)

        # Metrics
        roc_auc = roc_auc_score(y, y_proba)
        pr_auc = average_precision_score(y, y_proba)
        acc = accuracy_score(y, y_pred)
        cm = confusion_matrix(y, y_pred)
        cm_df = pd.DataFrame(
            cm,
            index=["Actual 0", "Actual 1"],
            columns=["Predicted 0", "Predicted 1"]
        )

        # Print results
        print(f"\n{split_name}")
        print("-" * len(split_name))
        print(f"ROC-AUC:      {roc_auc:.4f}")
        print(f"PR-AUC:       {pr_auc:.4f}")
        print(f"Accuracy:     {acc:.4f}")
        print("\nConfusion Matrix:")
        print(cm_df)
        print("\nClassification Report:")
        print(classification_report(y, y_pred))

In [109]:
def train_lgbm(
    X_train,
    y_train,
    X_val,
    y_val,
    target,
    dataset_version,
):
    param_grid = {
        "num_leaves": [31, 63],
        "learning_rate": [0.05, 0.1],
        "n_estimators": [200, 400],
        "max_depth": [-1, 6],
    }

    model = LGBMClassifier(
        objective="binary",
        random_state=42,
        n_jobs=-1,
    )

    grid = GridSearchCV(
        model,
        param_grid=param_grid,
        scoring="average_precision",
        cv=3,
        verbose=0,
    )

    grid.fit(X_train, y_train[target])

    best_model = grid.best_estimator_

    # ---------- Validation predictions ----------
    val_proba = best_model.predict_proba(X_val)[:, 1]
    val_pred = (val_proba >= 0.5).astype(int)  # explicit threshold

    # ---------- Metrics ----------
    roc_auc = roc_auc_score(y_val[target], val_proba)
    pr_auc = average_precision_score(y_val[target], val_proba)
    precision = precision_score(y_val[target], val_pred)
    recall = recall_score(y_val[target], val_pred)

    cm = confusion_matrix(y_val[target], val_pred)
    cm_df = pd.DataFrame(
        cm,
        index=["actual_0", "actual_1"],
        columns=["pred_0", "pred_1"],
    )

    # ---------- MLflow ----------
    input_example = X_train.iloc[:5]
    signature = infer_signature(
        X_train,
        best_model.predict_proba(X_train)[:, 1],
    )

    mlflow.log_param("dataset_version", dataset_version)
    mlflow.log_param("target", target)
    mlflow.log_params(grid.best_params_)

    mlflow.log_metric("val_roc_auc", roc_auc)
    mlflow.log_metric("val_pr_auc", pr_auc)
    mlflow.log_metric("val_precision", precision)
    mlflow.log_metric("val_recall", recall)

    mlflow.log_text(
        cm_df.to_string(),
        artifact_file=f"confusion_matrix/{dataset_version}_{target}.txt",
    )

    mlflow.lightgbm.log_model(
        best_model,
        name=f"{dataset_version}_{target}",
        input_example=input_example,
        signature=signature,
    )

### Inference

In [33]:
def promote_to_production(run_id):
    client.set_tag(run_id, "stage", "production")

In [34]:
def get_production_runs():
    return mlflow.search_runs(
        filter_string="tags.stage = 'production'",
        search_all_experiments=True,
        output_format="pandas",
    )


In [35]:
def load_production_models():
    prod_runs = get_production_runs()

    models = {}
    metadata = {}

    for _, row in prod_runs.iterrows():
        target = row["params.target"]
        dataset_version = row["params.dataset_version"]
        run_id = row["run_id"]

        model_uri = f"runs:/{run_id}/{dataset_version}_{target}"
        model = mlflow.lightgbm.load_model(model_uri)

        models[target] = model
        metadata[target] = {
            "dataset_version": dataset_version,
            "run_id": run_id,
        }

    return models, metadata

In [36]:
def get_customer_features(
    customer_ids,
    target,
    metadata,
    raw_features_df,
    transformed_features_by_target,
):
    if isinstance(customer_ids, str):
        customer_ids = [customer_ids]

    dataset_version = metadata[target]["dataset_version"]

    if dataset_version == "raw":
        X = raw_features_df.loc[customer_ids]
    elif dataset_version == "transformed":
        X = transformed_features_by_target[target].loc[customer_ids]
    else:
        raise ValueError(f"Unknown dataset version: {dataset_version}")

    return X

In [37]:
def predict_churn(
    customer_id: str,
    horizon_days: int,
    raw_features_df,
    transformed_features_by_target,
    models,
    metadata,
):
    # ------------------
    # Validate horizon
    # ------------------
    if horizon_days not in {30, 60, 90}:
        raise ValueError("horizon_days must be one of {30, 60, 90}")

    target = f"is_churn_{horizon_days}_days"

    if target not in models:
        raise KeyError(f"No production model loaded for target: {target}")

    # ------------------
    # Select features
    # ------------------
    X = get_customer_features(
        customer_ids=[customer_id],
        target=target,
        metadata=metadata,
        raw_features_df=raw_features_df,
        transformed_features_by_target=transformed_features_by_target,
    )

    # ------------------
    # Predict
    # ------------------
    model = models[target]
    churn_prob = float(model.predict_proba(X)[0, 1])

    # ------------------
    # Risk labeling (explicit, adjustable)
    # ------------------
    if churn_prob >= 0.7:
        churn_label = "high_risk"
    elif churn_prob >= 0.4:
        churn_label = "medium_risk"
    else:
        churn_label = "low_risk"

    return {
        "churn_probability": round(churn_prob, 4),
        "churn_label": churn_label,
    }

In [38]:
def predict_churns(
    customer_ids: list[str],
    horizon_days: int,
    raw_features_df,
    transformed_features_by_target,
    models,
    metadata,
):
    # ------------------
    # Validate horizon
    # ------------------
    if horizon_days not in {30, 60, 90}:
        raise ValueError("horizon_days must be one of {30, 60, 90}")

    target = f"is_churn_{horizon_days}_days"

    if target not in models:
        raise KeyError(f"No production model loaded for target: {target}")

    # ------------------
    # Feature extraction (BULK)
    # ------------------
    X = get_customer_features(
        customer_ids=customer_ids,
        target=target,
        metadata=metadata,
        raw_features_df=raw_features_df,
        transformed_features_by_target=transformed_features_by_target,
    )

    # ------------------
    # Predict (BULK)
    # ------------------
    model = models[target]
    churn_probs = model.predict_proba(X)[:, 1]

    # ------------------
    # Risk labeling (vectorized)
    # ------------------
    churn_labels = np.where(
        churn_probs >= 0.7,
        "high_risk",
        np.where(
            churn_probs >= 0.4,
            "medium_risk",
            "low_risk",
        ),
    )

    # ------------------
    # Output (aligned, explicit)
    # ------------------
    return (
        pd.DataFrame(
            {
                "customer_id": customer_ids,
                "churn_probability": churn_probs.round(4),
                "churn_label": churn_labels,
            }
        )
        .set_index("customer_id")
    )

In [39]:
def load_features(
        dataset_version,
        gold_data_version=MAX_DATA_DATE_STR,
        gold_dir="default",
        targets=targets
    ):
    '''
        The service preloads the feature dataframes for faster search.
    '''
    if gold_dir == "default":
        PROJECT_ROOT = Path.cwd().parent
        gold_dir = PROJECT_ROOT / "data" / "gold" / gold_data_version
    
    if dataset_version == "raw":
        feature_df = pd.read_csv(gold_dir / dataset_version / "all_features.csv", index_col=0)
        return feature_df
    elif dataset_version == "transformed":
        X_by_target = {}
        for target in targets:
            feature_df = pd.read_csv(gold_dir / dataset_version / target / "X_all.csv", index_col=0)
            X_by_target[target] = feature_df
        return X_by_target
    else:
        # Fail loudly instead of returning a string the caller might treat as data
        raise ValueError(
            f"Invalid dataset version: {dataset_version!r}. Use 'raw' or 'transformed'."
        )

## Data

### Read all time data

In [44]:
customers_df = pd.read_csv(f"../{SEED_CUSTOMERS}")

In [54]:
transactions_df = pd.read_csv(f"../{SEED_TRANSACTIONS}")

In [None]:
mk.read_data_info(transactions_df)

Number of columns: 3
Column names: ['customer_id', 'transaction_date', 'amount']
Number of rows: 46,704
Data Preview: 

  customer_id transaction_date  amount
0      C00000       2025-09-10  195.78
1      C00000       2025-09-12   50.87
2      C00000       2025-10-01  133.25
3      C00000       2025-10-16   37.44
4      C00000       2025-10-18  101.95



In [None]:
mk.read_data_info(customers_df)

Number of columns: 3
Column names: ['customer_id', 'signup_date', 'true_lifetime_days']
Number of rows: 3,000
Data Preview: 

  customer_id signup_date  true_lifetime_days
0      C00000  2025-08-22                 204
1      C00001  2025-03-07                 365
2      C00002  2025-08-18                  48
3      C00003  2025-09-22                  84
4      C00004  2025-05-28                 113



### Transform all time data

In [55]:
transactions_df = transform_transactions_df(transactions_df)

In [45]:
customers_df = transform_customers_df(customers_df)

### Limit data

In [56]:
transactions_modeling_df = transactions_df[transactions_df['transaction_date'] <= TRAIN_SNAPSHOT_DATE]

In [None]:
customers_modeling_df = pd.merge(
    pd.DataFrame({'customer_id': transactions_modeling_df['customer_id'].unique()}),
    customers_df,
    on='customer_id',
    how='inner'
)


In [None]:
customers_modeling_df = customers_modeling_df.drop(columns=['signup_date', 'true_lifetime_days', 'termination_date'])


In [None]:
customers_modeling_df

| | customer_id |
|---|---|
| 0 | C00000 |
| 1 | C00001 |
| 2 | C00002 |
| 3 | C00004 |
| 4 | C00006 |
| ... | ... |
| 2259 | C02990 |
| 2260 | C02993 |
| 2261 | C02994 |
| 2262 | C02996 |



### Define churn labels

Logic for creating the training set:
- `MAX_DATA_DATE`: the end of the observation period (the last date covered by the data).
- `MAX_DATA_DATE - 90 days`: the observation cutoff for the data used to train the models.

In [None]:
CUTOFF_TRAINING_DATE = MAX_DATA_DATE - pd.Timedelta(90, unit='day')


In [None]:
ndays = [30, 60, 90]
for nday in ndays:
    var_name = f"is_churn_{nday}_days"
    timestamp_date = MAX_DATA_DATE - pd.Timedelta(nday, unit='day')
    customers_modeling_df[var_name] = add_churn_status(transformed_customers_df=customers_df, observed_date=timestamp_date, desired_df=None)
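
A point worth double-checking in the cell above: the labels are computed from `customers_df` but assigned to `customers_modeling_df`, which holds only a subset of customers and gets a fresh index from the merge. If `add_churn_status` returns a Series indexed like `customers_df` (an assumption, since its definition is not shown here), pandas aligns the assignment on the index label rather than on `customer_id`, so labels can land on the wrong customers. The sketch below reproduces that effect on toy data and shows a key-based alternative. The labeling rule (`termination_date <= observed_date`) and all values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for customers_df and customers_modeling_df (hypothetical values)
customers_df = pd.DataFrame({
    "customer_id": ["C0", "C1", "C2", "C3"],
    "termination_date": pd.to_datetime(
        ["2026-03-14", "2025-10-05", "2025-09-18", "2026-01-10"]
    ),
})
# The modeling frame keeps a subset of customers, so its row positions drift
customers_modeling_df = pd.DataFrame({"customer_id": ["C0", "C2", "C3"]})

observed_date = pd.Timestamp("2025-12-01")

# Assumed labeling rule, not taken from add_churn_status
churn_by_row = (customers_df["termination_date"] <= observed_date).astype(int)

# Index-aligned assignment: row 2 (C3) receives the label of customers_df row 2 (C2)
by_index = customers_modeling_df.assign(is_churn=churn_by_row)

# Key-aligned assignment: look each label up by customer_id
churn_by_customer = pd.Series(
    churn_by_row.to_numpy(), index=customers_df["customer_id"].to_numpy()
)
by_key = customers_modeling_df.assign(
    is_churn=customers_modeling_df["customer_id"].map(churn_by_customer)
)

print(by_index)
print(by_key)
```

If the real function behaves this way, mapping (or merging) on `customer_id` keeps each label attached to its customer.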


# Test Feature Transformation Pipeline 1

## Feature Engineering

### Window Features

I add more columns to the transactions data so I can compute per-customer features that depend on neighboring transactions:
- days_since_previous_transaction
- days_until_next_transaction
- customer_transaction_order
- days_since_first_transaction

Technically, I should compute these on the train set only. However, the feature-building functions only use customers from the train set, so it should not matter.

In [81]:
transactions_modeling_df = transactions_modeling_df.sort_values(['customer_id', 'transaction_date'])

In [82]:
transactions_modeling_df['customer_transaction_order'] = transactions_modeling_df.groupby('customer_id').cumcount()

In [83]:
transactions_modeling_df['prev_transaction_date'] = transactions_modeling_df.groupby('customer_id')['transaction_date'].shift(1)
transactions_modeling_df['next_transaction_date'] = transactions_modeling_df.groupby('customer_id')['transaction_date'].shift(-1)

In [84]:
transactions_modeling_df['days_since_previous_transaction'] = (transactions_modeling_df['transaction_date'] - transactions_modeling_df['prev_transaction_date']).dt.days
transactions_modeling_df['days_until_next_transaction'] = (transactions_modeling_df['next_transaction_date'] - transactions_modeling_df['transaction_date']).dt.days

In [85]:
# Get the first transaction date for each customer
transactions_modeling_df['first_transaction_date'] = transactions_modeling_df.groupby('customer_id')['transaction_date'].transform('min')

# Compute days since first transaction
transactions_modeling_df['days_since_first_transaction'] = (
    transactions_modeling_df['transaction_date'] - transactions_modeling_df['first_transaction_date']
).dt.days

In [86]:
transactions_modeling_df

| | customer_id | transaction_date | amount | customer_transaction_order | prev_transaction_date | next_transaction_date | days_since_previous_transaction | days_until_next_transaction | first_transaction_date | days_since_first_transaction |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C00000 | 2025-09-10 | 195.78 | 0 | NaT | 2025-09-12 | NaN | 2.0 | 2025-09-10 | 0 |
| 1 | C00000 | 2025-09-12 | 50.87 | 1 | 2025-09-10 | 2025-10-01 | 2.0 | 19.0 | 2025-09-10 | 2 |
| 2 | C00000 | 2025-10-01 | 133.25 | 2 | 2025-09-12 | NaT | 19.0 | NaN | 2025-09-10 | 21 |
| 12 | C00001 | 2025-03-17 | 66.11 | 0 | NaT | 2025-04-23 | NaN | 37.0 | 2025-03-17 | 0 |
| 13 | C00001 | 2025-04-23 | 38.28 | 1 | 2025-03-17 | 2025-05-22 | 37.0 | 29.0 | 2025-03-17 | 37 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 46670 | C02999 | 2025-09-16 | 8.02 | 41 | 2025-09-14 | 2025-09-16 | 2.0 | 0.0 | 2025-05-26 | 113 |
| 46671 | C02999 | 2025-09-16 | 30.10 | 42 | 2025-09-16 | 2025-09-28 | 0.0 | 12.0 | 2025-05-26 | 113 |
| 46672 | C02999 | 2025-09-28 | 11.59 | 43 | 2025-09-16 | 2025-09-28 | 12.0 | 0.0 | 2025-05-26 | 125 |
| 46673 | C02999 | 2025-09-28 | 103.22 | 44 | 2025-09-28 | 2025-10-02 | 0.0 | 4.0 | 2025-05-26 | 125 |


In [87]:
check_nan_in_df_cols(transactions_modeling_df)

Total features: 10
Information on NaN values
Number of High Proportion Features: 0
Number of Medium Proportion Features: 4
Number of Low Proportion Features: 6


| | feature | nan_proportion | NaN group |
|---|---|---|---|
| 0 | prev_transaction_date | 0.088327 | Medium |
| 1 | next_transaction_date | 0.088327 | Medium |
| 2 | days_since_previous_transaction | 0.088327 | Medium |
| 3 | days_until_next_transaction | 0.088327 | Medium |
| 4 | customer_id | 0.0 | Low |
| 5 | transaction_date | 0.0 | Low |
| 6 | amount | 0.0 | Low |
| 7 | customer_transaction_order | 0.0 | Low |
| 8 | first_transaction_date | 0.0 | Low |
| 9 | days_since_first_transaction | 0.0 | Low |


### RFM Features

RFM can be used to show two kinds of information:
- lifetime behavior
- behavior trends

So I wrote a loop to create RFM features over different time windows: all time, the last 30 days, the last 60 days, and the last 90 days. More windows could be added.
- I also added tenure: the number of days between the first purchase and the observed cutoff date. For a 30-day window, it is the number of days between the first purchase and 30 days before the observed cutoff date.
- Reason: I believe tenure reflects a customer's loyalty. Also, the summary table already has enough data to create this feature easily.
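
As a rough illustration of the idea (not the actual `get_rfm_window_features`, whose body is not shown here), the sketch below computes RFM and tenure per customer from the history available up to a lagged cutoff. It reads each window as "history up to N days before the observed date", following the tenure bullet above. The helper name `rfm_as_of`, the `lag_days` parameter, and the toy values are assumptions.

```python
import pandas as pd

def rfm_as_of(transactions_df, observed_date, lag_days=0):
    """RFM and tenure from transactions dated on or before observed_date - lag_days."""
    cutoff = observed_date - pd.Timedelta(days=lag_days)
    history = transactions_df[transactions_df["transaction_date"] <= cutoff]
    grouped = history.groupby("customer_id")
    suffix = "all_time" if lag_days == 0 else f"{lag_days}d"
    return pd.DataFrame({
        # Days from the last purchase in the window to the observed date
        f"rfm_recency_{suffix}": (observed_date - grouped["transaction_date"].max()).dt.days,
        # Number of purchases in the window
        f"rfm_frequency_{suffix}": grouped["amount"].size(),
        # Total spend in the window
        f"rfm_monetary_{suffix}": grouped["amount"].sum(),
        # Days from the first purchase to the (lagged) cutoff
        f"tenure_{suffix}": (cutoff - grouped["transaction_date"].min()).dt.days,
    })

# Toy data (hypothetical values)
toy_transactions = pd.DataFrame({
    "customer_id": ["A", "A", "B"],
    "transaction_date": pd.to_datetime(["2025-01-01", "2025-03-01", "2025-03-20"]),
    "amount": [10.0, 20.0, 5.0],
})
observed = pd.Timestamp("2025-04-01")

features = pd.DataFrame(index=pd.Index(["A", "B"], name="customer_id"))
for lag in (0, 30):
    # Left join: customers with no history before the cutoff get NaN
    features = features.join(rfm_as_of(toy_transactions, observed, lag))

print(features)
```

Customer B has no transactions before the 30-day cutoff, so its `*_30d` columns come out as NaN, which is the same pattern as the NaN counts that grow with the window size.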

In [89]:
customers_modeling_df = get_rfm_window_features(customers_df=customers_modeling_df, transactions_df=transactions_modeling_df, observed_date=CUTOFF_TRAINING_DATE)

In [90]:
customers_modeling_df

| | customer_id | signup_date | true_lifetime_days | termination_date | is_churn_30_days | is_churn_60_days | is_churn_90_days | rfm_recency_all_time | rfm_frequency_all_time | rfm_monetary_all_time | ... | rfm_monetary_30d | tenure_30d | rfm_recency_60d | rfm_frequency_60d | rfm_monetary_60d | tenure_60d | rfm_recency_90d | rfm_frequency_90d | rfm_monetary_90d | tenure_90d |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C00000 | 2025-08-22 | 204 | 2026-03-14 | 0 | 0 | 0 | 1 | 3 | 379.90 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | C00001 | 2025-03-07 | 365 | 2026-03-07 | 0 | 0 | 0 | 21 | 11 | 620.79 | ... | 585.34 | 138.0 | 61.0 | 10.0 | 585.34 | 138.0 | 100.0 | 6.0 | 226.67 | 99.0 |
| 2 | C00002 | 2025-08-18 | 48 | 2025-10-05 | 1 | 1 | 0 | 6 | 11 | 910.64 | ... | 620.80 | 11.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | C00004 | 2025-05-28 | 113 | 2025-09-18 | 0 | 0 | 0 | 18 | 19 | 2018.94 | ... | 1866.80 | 69.0 | 61.0 | 13.0 | 1451.43 | 55.0 | 95.0 | 6.0 | 663.50 | 21.0 |
| 4 | C00006 | 2025-08-22 | 117 | 2025-12-17 | 1 | 1 | 1 | 28 | 1 | 20.20 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2259 | C02990 | 2025-02-01 | 307 | 2025-12-05 | 1 | 1 | 1 | 4 | 18 | 2207.01 | ... | 1666.78 | 209.0 | 61.0 | 10.0 | 1588.01 | 180.0 | 117.0 | 9.0 | 1479.59 | 124.0 |
| 2260 | C02993 | 2025-03-01 | 134 | 2025-07-13 | 0 | 0 | 0 | 102 | 8 | 1090.93 | ... | 1090.93 | 112.0 | 102.0 | 8.0 | 1090.93 | 112.0 | 102.0 | 8.0 | 1090.93 | 112.0 |
| 2261 | C02994 | 2025-01-30 | 112 | 2025-05-22 | 0 | 0 | 0 | 140 | 20 | 1474.70 | ... | 1474.70 | 102.0 | 140.0 | 20.0 | 1474.70 | 102.0 | 140.0 | 20.0 | 1474.70 | 102.0 |
| 2262 | C02996 | 2025-06-03 | 308 | 2026-04-07 | 1 | 1 | 1 | 4 | 6 | 235.07 | ... | 206.78 | 72.0 | 72.0 | 3.0 | 163.75 | 32.0 | 104.0 | 1.0 | 30.96 | 0.0 |


In [91]:
customers_modeling_df.count()

customer_id               2264
signup_date               2264
true_lifetime_days        2264
termination_date          2264
is_churn_30_days          2264
is_churn_60_days          2264
is_churn_90_days          2264
rfm_recency_all_time      2264
rfm_frequency_all_time    2264
rfm_monetary_all_time     2264
tenure_all_time           2264
rfm_recency_30d           1994
rfm_frequency_30d         1994
rfm_monetary_30d          1994
tenure_30d                1994
rfm_recency_60d           1731
rfm_frequency_60d         1731
rfm_monetary_60d          1731
tenure_60d                1731
rfm_recency_90d           1443
rfm_frequency_90d         1443
rfm_monetary_90d          1443
tenure_90d                1443
dtype: int64

In [92]:
customers_modeling_df.columns

Index(['customer_id', 'signup_date', 'true_lifetime_days', 'termination_date',
       'is_churn_30_days', 'is_churn_60_days', 'is_churn_90_days',
       'rfm_recency_all_time', 'rfm_frequency_all_time',
       'rfm_monetary_all_time', 'tenure_all_time', 'rfm_recency_30d',
       'rfm_frequency_30d', 'rfm_monetary_30d', 'tenure_30d',
       'rfm_recency_60d', 'rfm_frequency_60d', 'rfm_monetary_60d',
       'tenure_60d', 'rfm_recency_90d', 'rfm_frequency_90d',
       'rfm_monetary_90d', 'tenure_90d'],
      dtype='object')

In [93]:
check_nan_in_df_cols(customers_modeling_df)

Total features: 23
Information on NaN values
Number of High Proportion Features: 8
Number of Medium Proportion Features: 4
Number of Low Proportion Features: 11


| | feature | nan_proportion | NaN group |
|---|---|---|---|
| 0 | tenure_90d | 0.362633 | High |
| 1 | rfm_monetary_90d | 0.362633 | High |
| 2 | rfm_frequency_90d | 0.362633 | High |
| 3 | rfm_recency_90d | 0.362633 | High |
| 4 | tenure_60d | 0.235424 | High |
| 5 | rfm_monetary_60d | 0.235424 | High |
| 6 | rfm_frequency_60d | 0.235424 | High |
| 7 | rfm_recency_60d | 0.235424 | High |
| 8 | rfm_frequency_30d | 0.119258 | Medium |
| 9 | tenure_30d | 0.119258 | Medium |

The windowed RFM features are expected to have many NaNs. Transactions are concentrated in the later dates, so many customers have no transactions yet at the earlier window cutoffs.

### Activity Trend Features

Some possible features:
- Number of actions (activity) -> Unavailable
- Slope of transaction features
    - Say customer k has n transactions.
    - For each customer, we fit a linear regression line: y = b0 + b1*x1
        - where y is a feature from the transactions dataset
        - x1 is the time index (starting at 0 on the earliest signup date across all customers)
- Statistics of transaction features
    - Min
    - Mean
    - Mode
    - Max
    - q1
    - q5
    - q10
    - q20
    - q30
    - ...
    - q90
    - q95
    - q99
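
A minimal sketch of both feature families on toy data, assuming an ordinary least-squares fit via `np.polyfit` and pandas quantiles. The helper `customer_slope`, the choice of time origin, and the toy values are hypothetical and not taken from `get_slope_features` or `get_transaction_statistics_features`.

```python
import numpy as np
import pandas as pd

def customer_slope(group, feature, origin):
    """Slope b1 of the least-squares line feature = b0 + b1 * days_since_origin."""
    valid = group.dropna(subset=[feature])
    x = (valid["transaction_date"] - origin).dt.days.to_numpy(dtype=float)
    y = valid[feature].to_numpy(dtype=float)
    # A slope needs at least two distinct time points; otherwise it is undefined
    if np.unique(x).size < 2:
        return np.nan
    b1, _b0 = np.polyfit(x, y, deg=1)
    return b1

# Toy transactions (hypothetical values)
toy_transactions = pd.DataFrame({
    "customer_id": ["A", "A", "A", "B"],
    "transaction_date": pd.to_datetime(
        ["2025-01-01", "2025-01-11", "2025-01-21", "2025-01-05"]
    ),
    "amount": [10.0, 20.0, 30.0, 50.0],
})
origin = toy_transactions["transaction_date"].min()

# One slope per customer
slopes = pd.Series(
    {
        customer_id: customer_slope(group, "amount", origin)
        for customer_id, group in toy_transactions.groupby("customer_id")
    },
    name="slope_amount",
)

# Summary statistics and quantiles per customer
stats = toy_transactions.groupby("customer_id")["amount"].agg(["min", "mean", "max"])
quantiles = (
    toy_transactions.groupby("customer_id")["amount"]
    .quantile([0.1, 0.5, 0.9])
    .unstack()
)
quantiles.columns = [f"q{int(round(q * 100))}_amount" for q in quantiles.columns]

print(slopes)
print(stats.join(quantiles))
```

Customers with a single transaction get a NaN slope, which is one reason the slope columns carry a high share of NaNs.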

#### Slope

In [95]:
customers_modeling_df = get_slope_features(
    customers_df=customers_modeling_df,
    transactions_df=transactions_modeling_df,
    observed_date=CUTOFF_TRAINING_DATE,
    feature_list=[
        'amount',
        'days_since_previous_transaction',
        'days_until_next_transaction',
        'customer_transaction_order',
        'days_since_first_transaction'
    ]
)

In [96]:
customers_modeling_df.count()

customer_id                              2264
signup_date                              2264
true_lifetime_days                       2264
termination_date                         2264
is_churn_30_days                         2264
is_churn_60_days                         2264
is_churn_90_days                         2264
rfm_recency_all_time                     2264
rfm_frequency_all_time                   2264
rfm_monetary_all_time                    2264
tenure_all_time                          2264
rfm_recency_30d                          1994
rfm_frequency_30d                        1994
rfm_monetary_30d                         1994
tenure_30d                               1994
rfm_recency_60d                          1731
rfm_frequency_60d                        1731
rfm_monetary_60d                         1731
tenure_60d                               1731
rfm_recency_90d                          1443
rfm_frequency_90d                        1443
rfm_monetary_90d                  

In [97]:
customers_modeling_df.columns

Index(['customer_id', 'signup_date', 'true_lifetime_days', 'termination_date',
       'is_churn_30_days', 'is_churn_60_days', 'is_churn_90_days',
       'rfm_recency_all_time', 'rfm_frequency_all_time',
       'rfm_monetary_all_time', 'tenure_all_time', 'rfm_recency_30d',
       'rfm_frequency_30d', 'rfm_monetary_30d', 'tenure_30d',
       'rfm_recency_60d', 'rfm_frequency_60d', 'rfm_monetary_60d',
       'tenure_60d', 'rfm_recency_90d', 'rfm_frequency_90d',
       'rfm_monetary_90d', 'tenure_90d', 'slope_amount',
       'slope_days_since_previous_transaction',
       'slope_days_until_next_transaction', 'slope_customer_transaction_order',
       'slope_days_since_first_transaction'],
      dtype='object')

In [98]:
check_nan_in_df_cols(customers_modeling_df)

Total features: 28
Information on NaN values
Number of High Proportion Features: 13
Number of Medium Proportion Features: 4
Number of Low Proportion Features: 11


| | feature | nan_proportion | NaN group |
|---|---|---|---|
| 0 | slope_days_since_previous_transaction | 0.5 | High |
| 1 | slope_days_until_next_transaction | 0.463781 | High |
| 2 | slope_days_since_first_transaction | 0.435954 | High |
| 3 | slope_customer_transaction_order | 0.435954 | High |
| 4 | slope_amount | 0.435954 | High |
| 5 | tenure_90d | 0.362633 | High |
| 6 | rfm_monetary_90d | 0.362633 | High |
| 7 | rfm_frequency_90d | 0.362633 | High |
| 8 | rfm_recency_90d | 0.362633 | High |
| 9 | rfm_recency_60d | 0.235424 | High |

#### Statistics

In [101]:
customers_modeling_df = get_transaction_statistics_features(
    customers_df=customers_modeling_df,
    transactions_df=transactions_modeling_df,
    observed_date=CUTOFF_TRAINING_DATE,
    feature_list=[
        'amount',
        'days_since_previous_transaction',
        'days_until_next_transaction',
        'customer_transaction_order',
        'days_since_first_transaction'
    ]
)

In [102]:
check_nan_in_df_cols(customers_modeling_df)

Total features: 113
Information on NaN values
Number of High Proportion Features: 98
Number of Medium Proportion Features: 4
Number of Low Proportion Features: 11


| | feature | nan_proportion | NaN group |
|---|---|---|---|
| 0 | q60_days_since_previous_transaction | 0.5 | High |
| 1 | q50_days_since_previous_transaction | 0.5 | High |
| 2 | min_days_since_previous_transaction | 0.5 | High |
| 3 | slope_days_since_previous_transaction | 0.5 | High |
| 4 | mean_days_since_previous_transaction | 0.5 | High |
| ... | ... | ... | ... |
| 108 | rfm_recency_all_time | 0.0 | Low |
| 109 | rfm_frequency_all_time | 0.0 | Low |
| 110 | rfm_monetary_all_time | 0.0 | Low |
| 111 | tenure_all_time | 0.0 | Low |


In [232]:
customers_modeling_df.count()

customer_id                         2264
signup_date                         2264
true_lifetime_days                  2264
termination_date                    2264
is_churn_30_days                    2264
                                    ... 
q70_days_since_first_transaction    1476
q80_days_since_first_transaction    1476
q90_days_since_first_transaction    1476
q95_days_since_first_transaction    1476
q99_days_since_first_transaction    1476
Length: 113, dtype: int64

In [233]:
customers_modeling_df.columns

Index(['customer_id', 'signup_date', 'true_lifetime_days', 'termination_date',
       'is_churn_30_days', 'is_churn_60_days', 'is_churn_90_days',
       'rfm_recency_all_time', 'rfm_frequency_all_time',
       'rfm_monetary_all_time',
       ...
       'q20_days_since_first_transaction', 'q30_days_since_first_transaction',
       'q40_days_since_first_transaction', 'q50_days_since_first_transaction',
       'q60_days_since_first_transaction', 'q70_days_since_first_transaction',
       'q80_days_since_first_transaction', 'q90_days_since_first_transaction',
       'q95_days_since_first_transaction', 'q99_days_since_first_transaction'],
      dtype='object', length=113)

In [234]:
customers_modeling_df

| | customer_id | signup_date | true_lifetime_days | termination_date | is_churn_30_days | is_churn_60_days | is_churn_90_days | rfm_recency_all_time | rfm_frequency_all_time | rfm_monetary_all_time | ... | q20_days_since_first_transaction | q30_days_since_first_transaction | q40_days_since_first_transaction | q50_days_since_first_transaction | q60_days_since_first_transaction | q70_days_since_first_transaction | q80_days_since_first_transaction | q90_days_since_first_transaction | q95_days_since_first_transaction | q99_days_since_first_transaction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C00000 | 2025-08-22 | 204 | 2026-03-14 | 0 | 0 | 0 | 1 | 3 | 379.90 | ... | 0.8 | 1.2 | 1.6 | 2.0 | 5.8 | 9.6 | 13.4 | 17.2 | 19.1 | 20.62 |
| 1 | C00001 | 2025-03-07 | 365 | 2026-03-07 | 0 | 0 | 0 | 21 | 11 | 620.79 | ... | 119.0 | 122.4 | 129.2 | 136.0 | 136.8 | 137.6 | 146.0 | 162.0 | 170.0 | 176.40 |
| 2 | C00002 | 2025-08-18 | 48 | 2025-10-05 | 1 | 1 | 0 | 6 | 11 | 910.64 | ... | 4.0 | 5.0 | 8.0 | 11.0 | 15.0 | 15.0 | 22.0 | 27.0 | 32.0 | 36.00 |
| 3 | C00004 | 2025-05-28 | 113 | 2025-09-18 | 0 | 0 | 0 | 18 | 19 | 2018.94 | ... | 36.0 | 43.2 | 50.8 | 55.0 | 59.2 | 60.0 | 65.4 | 83.4 | 91.4 | 96.68 |
| 4 | C00006 | 2025-08-22 | 117 | 2025-12-17 | 1 | 1 | 1 | 28 | 1 | 20.20 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2259 | C02990 | 2025-02-01 | 307 | 2025-12-05 | 1 | 1 | 1 | 4 | 18 | 2207.01 | ... | 199.4 | 211.0 | 215.2 | 220.0 | 225.6 | 229.4 | 231.0 | 232.2 | 234.6 | 236.52 |
| 2260 | C02993 | 2025-03-01 | 134 | 2025-07-13 | 0 | 0 | 0 | 102 | 8 | 1090.93 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2261 | C02994 | 2025-01-30 | 112 | 2025-05-22 | 0 | 0 | 0 | 140 | 20 | 1474.70 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2262 | C02996 | 2025-06-03 | 308 | 2026-04-07 | 1 | 1 | 1 | 4 | 6 | 235.07 | ... | 31.4 | 37.6 | 48.8 | 60.0 | 64.8 | 69.6 | 77.6 | 88.8 | 94.4 | 98.88 |


In [None]:
#customers_modeling_df.to_csv(f"../data/gold/customers_features_{MAX_DATA_DATE.strftime('%d_%m_%Y')}.csv", index=False)

## Data Split

In [103]:
customers_modeling_df = pd.read_csv('../data/gold/customers_features_31_12_2025.csv')

In [104]:
customers_modeling_df = customers_modeling_df.drop(columns=['signup_date', 'true_lifetime_days', 'termination_date'])

In [105]:
X_df = customers_modeling_df.drop(columns=['is_churn_30_days', 'is_churn_60_days', 'is_churn_90_days'])
X_df = X_df.set_index('customer_id', drop=True)

In [106]:
y_df = customers_modeling_df[['customer_id', 'is_churn_30_days', 'is_churn_60_days', 'is_churn_90_days']]
y_df = y_df.set_index('customer_id', drop=True)

In [107]:
X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.33, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.33, random_state=42)

## Feature Processing

Available techniques:
- Filter methods: Evaluate features using statistical properties of the data, not model performance.
- Wrapper methods: Train a model on different combinations of features and keep the subset that performs best.
    - Forward selection
    - Backward elimination
    - Recursive feature elimination
- Embedded methods
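
To make the difference between the first two families concrete, here is a small sketch on synthetic data, assuming scikit-learn's `mutual_info_classif` as the filter score and `RFE` with a logistic regression as the wrapper. The dataset and the choice of three features are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, mutual_info_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data (hypothetical): 6 features, 3 of them informative
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=42
)

# Filter method: score each feature on its own, without training the final model
mi_scores = mutual_info_classif(X, y, random_state=42)
filter_top3 = sorted(range(X.shape[1]), key=lambda i: mi_scores[i], reverse=True)[:3]

# Wrapper method: recursive feature elimination refits the model on shrinking subsets
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)
wrapper_top3 = [i for i, keep in enumerate(rfe.support_) if keep]

print("filter :", filter_top3)
print("wrapper:", wrapper_top3)
```

Embedded methods would instead read feature relevance from the fitted model itself, for example the `feature_importances_` of a tree ensemble.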

### Split to Numeric and Categorical

There aren't any categorical features; I'm only splitting for clarity.

In [108]:
X_train_numeric_df = X_train.select_dtypes(include="number")
X_train_categorical_df = X_train.select_dtypes(exclude="number")

### Impute

My data has a lot of NaNs (they actually carry meaning, though). I don't want the missing values to hurt model performance, so I'm imputing them. I'm using a model so that the imputed values stay within the plausible range of each feature.
I'm using `IterativeImputer` from sklearn. It works as follows:
- Fill the NaN cells with an initial guess (the column mean, since `initial_strategy='mean'`).
- Pick a feature with NaNs and use it as the target.
- Split the data into two sets:
    - Rows where the target feature is non-null (training data)
    - Rows where the target feature is null (prediction input)
- Train the regression model.
- Predict the missing values.
- Move to the next column.
- Iterate (use the newly imputed column values to train a new model).
    - Total models: p x k
    - p: number of columns with at least 1 NaN
    - k: `max_iter` in `IterativeImputer`
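
The steps above can be sketched in plain NumPy. This is a toy illustration of the round-robin idea, not scikit-learn's implementation: `iterative_impute` is a hypothetical helper, it fits ordinary least squares for every column, and it always runs the full number of rounds instead of checking a stopping criterion.

```python
import numpy as np


def iterative_impute(X, max_iter=10):
    """Toy round-robin imputation with one linear model per incomplete column."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initial guess: replace every NaN with the mean of its column.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    cols_with_nan = np.flatnonzero(missing.any(axis=0))
    n_models = 0
    for _ in range(max_iter):
        for j in cols_with_nan:
            # Predictors: all other columns (current filled values) plus an intercept.
            others = np.delete(filled, j, axis=1)
            A = np.column_stack([others, np.ones(len(filled))])
            observed = ~missing[:, j]
            # Train on rows where column j is observed ...
            coef, *_ = np.linalg.lstsq(A[observed], filled[observed, j], rcond=None)
            # ... and overwrite the rows where it was missing.
            filled[~observed, j] = A[~observed] @ coef
            n_models += 1
    return filled, n_models


# Two perfectly related columns (x1 = 2 * x0 + 1), one NaN in each.
x0 = np.arange(10, dtype=float)
X = np.column_stack([x0, 2 * x0 + 1])
X[3, 1] = np.nan  # true value: 7
X[7, 0] = np.nan  # true value: 7
filled, n_models = iterative_impute(X, max_iter=10)
print(filled[3, 1], filled[7, 0], n_models)
```

Here p = 2 columns contain NaNs and k = 10 rounds are run, so 20 models are fitted in total.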

In [121]:
numeric_imputer = IterativeImputer(
    estimator=LinearRegression(),
    max_iter=20,
    random_state=42
)

In [122]:
X_train_numeric_imputed = numeric_imputer.fit_transform(X_train_numeric_df)


[IterativeImputer] Early stopping criterion not reached.



In [128]:
X_train_numeric_imputed_df = pd.DataFrame(
    X_train_numeric_imputed,
    columns=X_train_numeric_df.columns,
    index=X_train_numeric_df.index
)

### Scale

In [135]:
scaler = StandardScaler()

In [136]:
X_train_numeric_imputed_scaled = scaler.fit_transform(X_train_numeric_imputed_df)

X_train_numeric_imputed_scaled_df = pd.DataFrame(
    X_train_numeric_imputed_scaled,
    columns=X_train_numeric_df.columns,
    index=X_train_numeric_df.index
)

### Feature Selection

#### Information Gain

Information Gain: measures how much information a feature provides about the target variable. I estimate it with mutual information (MI).
- Higher information gain -> more useful feature
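
For intuition, mutual information can be computed exactly for discrete variables from their joint frequency table. The sketch below is only an illustration of the quantity: `discrete_mutual_information` is a hypothetical helper, and the estimator behind `mutual_information_feature_selection` (defined elsewhere in this notebook) may work differently for continuous features.

```python
import numpy as np


def discrete_mutual_information(x, y):
    """I(X;Y) in nats for two discrete 1-D arrays, from their joint frequencies."""
    x = np.asarray(x).ravel()
    y = np.asarray(y).ravel()
    x_vals, x_idx = np.unique(x, return_inverse=True)
    y_vals, y_idx = np.unique(y, return_inverse=True)
    # Joint probability table p(x, y).
    joint = np.zeros((len(x_vals), len(y_vals)))
    np.add.at(joint, (x_idx, y_idx), 1)
    joint /= joint.sum()
    # Marginals p(x) and p(y), and their outer product p(x) * p(y).
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    independent = px @ py
    nonzero = joint > 0
    return float(np.sum(joint[nonzero] * np.log(joint[nonzero] / independent[nonzero])))


target = np.array([0, 1, 0, 1, 0, 1, 0, 1])
informative = target.copy()                          # fully determines the target
uninformative = np.array([0, 0, 1, 1, 0, 0, 1, 1])   # independent of the target

mi_informative = discrete_mutual_information(informative, target)
mi_uninformative = discrete_mutual_information(uninformative, target)
print(mi_informative, mi_uninformative)
```

A feature that fully determines a balanced binary target reaches the maximum of ln 2 nats, while an independent feature scores 0.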

In [197]:
target = 'is_churn_30_days'

X_train_numeric_imputed_scaled_selected_df1, mi_scores1, selected_features1 = mutual_information_feature_selection(
    X_train=X_train_numeric_imputed_scaled_df,
    y_train=y_train,
    target=target,
    cutoff=0.0,
    random_state=42
)

In [198]:
mk.distribution_statistics_table(mi_scores1, value_col='mutual_info')

Unnamed: 0,statistic,Index
0,count,106.0
1,non_null,106.0
2,,0.0
3,mean,0.0065
4,mode,0.0
5,std,0.009
6,skew,1.5825
7,kurtosis,2.2522
8,min,0.0
9,max,0.0395


Many of the features have zero information gain (0 is the most common score). I doubt these features will be useful in my tree models, so I'm removing them with a threshold (`cutoff=0.0`): information gain has to be > 0.

## Write Transformation Models

In [194]:
numeric_imputer

0,1,2
,estimator,LinearRegression()
,missing_values,
,sample_posterior,False
,max_iter,20
,tol,0.001
,n_nearest_features,
,initial_strategy,'mean'
,fill_value,
,imputation_order,'ascending'
,skip_complete,False

0,1,2
,fit_intercept,True
,copy_X,True
,tol,1e-06
,n_jobs,
,positive,False


In [195]:
scaler

0,1,2
,copy,True
,with_mean,True
,with_std,True


In [199]:
selected_features1

Index(['q95_days_since_previous_transaction', 'rfm_frequency_all_time',
       'q10_customer_transaction_order', 'q50_days_since_first_transaction',
       'q5_customer_transaction_order', 'q80_customer_transaction_order',
       'tenure_all_time', 'q20_days_since_previous_transaction', 'mode_amount',
       'q95_customer_transaction_order', 'q20_customer_transaction_order',
       'q1_days_until_next_transaction', 'rfm_monetary_30d',
       'q30_customer_transaction_order', 'q90_days_until_next_transaction',
       'max_days_since_first_transaction', 'rfm_frequency_90d',
       'mean_days_since_first_transaction',
       'q90_days_since_previous_transaction',
       'mode_days_since_first_transaction',
       'q30_days_since_previous_transaction',
       'q50_days_since_previous_transaction',
       'q10_days_until_next_transaction', 'q1_amount',
       'max_days_since_previous_transaction',
       'slope_days_until_next_transaction', 'rfm_recency_all_time',
       'q99_customer_trans

In [201]:
ARTIFACT_DIR = Path("../src/models/preprocessing")
ARTIFACT_DIR.mkdir(parents=True, exist_ok=True)

# Save sklearn objects
dump(numeric_imputer, ARTIFACT_DIR / "numeric_imputer.joblib")
dump(scaler, ARTIFACT_DIR / "scaler.joblib")

# Save selected feature names
with open(ARTIFACT_DIR / "selected_features1.json", "w") as f:
    json.dump(list(selected_features1), f)

# Complete Feature Transformation Pipeline 1

## Wrapper

## Resplit Data

In [100]:
X_train, X_val, X_test, y_train, y_val, y_test = split_features_targets(
    customers_modeling_df,
    targets=targets,
)

## Get Features & Fit on Train

In [101]:
(
    X_train_by_target,
    selected_features_by_target,
    mi_scores_by_target,
    numeric_imputer,
    scaler,
) = build_and_transform_customer_features_pipeline_train(
    transactions_modeling_df=transactions_modeling_df,
    X_train=X_train,
    y_train=y_train,
    observed_date=TRAIN_SNAPSHOT_DATE,
    targets=targets,
    ARTIFACT_DIR=ARTIFACT_DIR
)


[IterativeImputer] Early stopping criterion not reached.



[is_churn_30_days] selected 62 features
[is_churn_60_days] selected 60 features
[is_churn_90_days] selected 62 features


In [102]:
with open(
    ARTIFACT_DIR / "selected_features_is_churn_30_days.json"
) as f:
    selected_features_is_churn_30_days = json.load(f)

with open(
    ARTIFACT_DIR / "selected_features_is_churn_60_days.json"
) as f:
    selected_features_is_churn_60_days = json.load(f)

with open(
    ARTIFACT_DIR / "selected_features_is_churn_90_days.json"
) as f:
    selected_features_is_churn_90_days = json.load(f)

## Get Final Features on Test & Val Set

In [103]:
X_test_by_target = build_and_transform_for_multiple_targets(
    transactions_modeling_df=transactions_modeling_df,
    X_df=X_test,
    observed_date=TRAIN_SNAPSHOT_DATE,
    numeric_imputer=numeric_imputer,
    scaler=scaler,
    selected_features_by_target=selected_features_by_target,
)

In [104]:
X_val_by_target = build_and_transform_for_multiple_targets(
    transactions_modeling_df=transactions_modeling_df,
    X_df=X_val,
    observed_date=TRAIN_SNAPSHOT_DATE,
    numeric_imputer=numeric_imputer,
    scaler=scaler,
    selected_features_by_target=selected_features_by_target,
)

## Temp: Write down transformed dataframes

In [107]:
for target in X_train_by_target.keys():

    target_dir = BASE_GOLD_DIR / target
    target_dir.mkdir(parents=True, exist_ok=True)

    # ----------------------------
    # TRAIN
    # ----------------------------
    X_train_by_target[target].to_csv(
        target_dir / "X_train.csv",
        index=True,
    )

    # ----------------------------
    # VALIDATION
    # ----------------------------
    X_val_by_target[target].to_csv(
        target_dir / "X_val.csv",
        index=True,
    )

    # ----------------------------
    # TEST
    # ----------------------------
    X_test_by_target[target].to_csv(
        target_dir / "X_test.csv",
        index=True,
    )

    print(f"[{target}] written to {target_dir}")

[is_churn_30_days] written to ../data/gold/is_churn_30_days
[is_churn_60_days] written to ../data/gold/is_churn_60_days
[is_churn_90_days] written to ../data/gold/is_churn_90_days


In [108]:
for target in targets:
    target_dir = BASE_GOLD_DIR / target
    target_dir.mkdir(parents=True, exist_ok=True)

    # ----------------------------
    # TRAIN labels
    # ----------------------------
    y_train.loc[
        X_train_by_target[target].index, target
    ].to_csv(
        target_dir / "y_train.csv",
        header=True,
    )

    # ----------------------------
    # VALIDATION labels
    # ----------------------------
    y_val.loc[
        X_val_by_target[target].index, target
    ].to_csv(
        target_dir / "y_val.csv",
        header=True,
    )

    # ----------------------------
    # TEST labels
    # ----------------------------
    y_test.loc[
        X_test_by_target[target].index, target
    ].to_csv(
        target_dir / "y_test.csv",
        header=True,
    )

    print(f"[{target}] y_train / y_val / y_test written")

[is_churn_30_days] y_train / y_val / y_test written
[is_churn_60_days] y_train / y_val / y_test written
[is_churn_90_days] y_train / y_val / y_test written


Instead of running this pipeline again, I will just read the existing sets.

# Model Test

## Test on is_churn_30_days

### Read Temp Saved Data

In [147]:
X_train = pd.read_csv(
    BASE_GOLD_DIR / "is_churn_30_days" / "X_train.csv",
    index_col=0,
)
X_val = pd.read_csv(
    BASE_GOLD_DIR / "is_churn_30_days" / "X_val.csv",
    index_col=0,
)
X_test = pd.read_csv(
    BASE_GOLD_DIR / "is_churn_30_days" / "X_test.csv",
    index_col=0,
)

y_train = pd.read_csv(
    BASE_GOLD_DIR / "is_churn_30_days" / "y_train.csv",
    index_col=0,
)
y_val = pd.read_csv(
    BASE_GOLD_DIR / "is_churn_30_days" / "y_val.csv",
    index_col=0,
)
y_test = pd.read_csv(
    BASE_GOLD_DIR / "is_churn_30_days" / "y_test.csv",
    index_col=0,
)

### LightGBM

In [139]:
lgbm_model = lgb.LGBMClassifier(
    objective="binary",
    n_estimators=1000,
    learning_rate=0.05,
    num_leaves=31,
    subsample=0.8,
    colsample_bytree=0.9,
    random_state=42
)

lgbm_model.fit(
    X_train,
    y_train,
    eval_set=[(X_train, y_train), (X_test, y_test)],
    eval_metric="auc"
)


A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().





[LightGBM] [Info] Number of positive: 801, number of negative: 715
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000780 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 14677
[LightGBM] [Info] Number of data points in the train set: 1516, number of used features: 60
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.528364 -> initscore=0.113578
[LightGBM] [Info] Start training from score 0.113578


0,1,2
,boosting_type,'gbdt'
,num_leaves,31
,max_depth,-1
,learning_rate,0.05
,n_estimators,1000
,subsample_for_bin,200000
,objective,'binary'
,class_weight,
,min_split_gain,0.0
,min_child_weight,0.001
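
The column-vector warning above appears because the labels were read back from CSV as a one-column DataFrame of shape `(n, 1)`. A minimal sketch of one way to avoid it, selecting the label column so a 1-D Series is passed instead; the customer IDs and labels below are made up for illustration:

```python
import pandas as pd

# Toy stand-in for a y_*.csv file read with index_col=0: a single label column.
y_frame = pd.DataFrame(
    {"is_churn_30_days": [0, 1, 1, 0]},
    index=pd.Index(["C1", "C2", "C3", "C4"], name="customer_id"),  # hypothetical IDs
)

# Selecting the column by name yields a 1-D Series that keeps the customer_id index.
y_series = y_frame["is_churn_30_days"]
print(y_frame.shape, y_series.shape)
```

Passing `y_series` (or `y_series.to_numpy()`) to `fit` gives the estimator the 1-D shape it expects.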


In [140]:
train_metrics = evaluate_binary_model(lgbm_model, X_train, y_train)
test_metrics  = evaluate_binary_model(lgbm_model, X_test, y_test)
val_metrics   = evaluate_binary_model(lgbm_model, X_val, y_val)

train_metrics, val_metrics, test_metrics

({'roc_auc': 1.0,
  'pr_auc': 1.0,
  'confusion_matrix': array([[715,   0],
         [  0, 801]])},
 {'roc_auc': 0.5089594990674128,
  'pr_auc': 0.5672223127012446,
  'confusion_matrix': array([[47, 61],
         [58, 81]])},
 {'roc_auc': 0.4823934574313643,
  'pr_auc': 0.5464222200184634,
  'confusion_matrix': array([[ 86, 137],
         [117, 161]])})

In [120]:
show_styled_df_confusion_matrix(train_metrics["confusion_matrix"])

Unnamed: 0,Predicted 0,Predicted 1
Actual 0,715,0
Actual 1,0,801


In [None]:
show_styled_df_confusion_matrix(test_metrics["confusion_matrix"])

Unnamed: 0,Predicted 0,Predicted 1
Actual 0,86,137
Actual 1,117,161


In [None]:
show_styled_df_confusion_matrix(val_metrics["confusion_matrix"])

Unnamed: 0,Predicted 0,Predicted 1
Actual 0,86,137
Actual 1,117,161


In [None]:
plot_lgb_feature_importance(lgbm_model, importance_type="gain", normalize=True, top_n=30)

In [None]:
plot_lgb_feature_importance(lgbm_model, importance_type="split", normalize=True, top_n=30)

## Test on other models

In [136]:
log_reg = LogisticRegression(
    max_iter=1000,
    solver="lbfgs",
    n_jobs=-1
)

In [137]:
dt = DecisionTreeClassifier(
    max_depth=6,
    min_samples_leaf=50,
    random_state=42
)

In [138]:
xgb_model = xgb.XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    objective="binary:logistic",
    eval_metric="auc",
    random_state=42,
    n_jobs=-1
)

In [141]:
lgbm_model

0,1,2
,boosting_type,'gbdt'
,num_leaves,31
,max_depth,-1
,learning_rate,0.05
,n_estimators,1000
,subsample_for_bin,200000
,objective,'binary'
,class_weight,
,min_split_gain,0.0
,min_child_weight,0.001


In [143]:
log_reg.fit(X_train, y_train)
dt.fit(X_train, y_train)
xgb_model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False
)


A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().



0,1,2
,objective,'binary:logistic'
,base_score,
,booster,
,callbacks,
,colsample_bylevel,
,colsample_bynode,
,colsample_bytree,0.8
,device,
,early_stopping_rounds,
,enable_categorical,False


In [144]:
evaluate_model("Logistic Regression", log_reg,
               X_train, y_train, X_val, y_val, X_test, y_test)

evaluate_model("Decision Tree", dt,
               X_train, y_train, X_val, y_val, X_test, y_test)

evaluate_model("XGBoost", xgb_model,
               X_train, y_train, X_val, y_val, X_test, y_test)

evaluate_model("LightGBM", lgbm_model,
               X_train, y_train, X_val, y_val, X_test, y_test)


===== Logistic Regression =====

TRAIN
-----
ROC-AUC:  0.5785
Accuracy: 0.5620
              precision    recall  f1-score   support

           0       0.56      0.35      0.43       715
           1       0.56      0.75      0.64       801

    accuracy                           0.56      1516
   macro avg       0.56      0.55      0.54      1516
weighted avg       0.56      0.56      0.54      1516


VALIDATION
----------
ROC-AUC:  0.5154
Accuracy: 0.5668
              precision    recall  f1-score   support

           0       0.51      0.33      0.40       108
           1       0.59      0.75      0.66       139

    accuracy                           0.57       247
   macro avg       0.55      0.54      0.53       247
weighted avg       0.55      0.57      0.55       247


TEST
----
ROC-AUC:  0.4744
Accuracy: 0.4990
              precision    recall  f1-score   support

           0       0.40      0.25      0.30       223
           1       0.54      0.70      0.61       278



# EDA on Feature Sets

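The results are terrible for tree-based models: LightGBM fits the training set perfectly (ROC-AUC 1.0) but only reaches about 0.51 ROC-AUC on validation and 0.48 on test, which is no better than random guessing.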
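
To put the PR-AUC values in context: a classifier that ranks customers at random is expected to score a PR-AUC close to the positive-class rate, not 0.5. A small sketch using the class counts from the LightGBM confusion matrices and the PR-AUC values reported above (rounded to four decimals):

```python
# Class counts come from the confusion matrices (row sums: actual 0, actual 1).
splits = {
    "val": {"negatives": 47 + 61, "positives": 58 + 81, "pr_auc": 0.5672},
    "test": {"negatives": 86 + 137, "positives": 117 + 161, "pr_auc": 0.5464},
}

baseline = {}
for name, s in splits.items():
    # Expected PR-AUC of a random ranking is roughly the share of positives.
    baseline[name] = s["positives"] / (s["positives"] + s["negatives"])
    lift = s["pr_auc"] - baseline[name]
    print(
        f"{name}: positive rate={baseline[name]:.4f}, "
        f"PR-AUC={s['pr_auc']:.4f}, lift={lift:+.4f}"
    )
```

With a positive rate of about 0.56 on validation and 0.55 on test, the LightGBM PR-AUC values sit essentially on the random baseline.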

Quick EDA on Feature Sets to find out what could be the problem.

In [None]:
pd.merge(
    X_train,
    y_train,
    on='customer_id',
    how='inner'
).groupby('is_churn_30_days').mean()

Unnamed: 0_level_0,rfm_frequency_all_time,q60_days_since_first_transaction,q50_days_since_first_transaction,rfm_frequency_90d,max_days_since_previous_transaction,tenure_all_time,q20_days_since_first_transaction,rfm_recency_60d,q20_amount,q30_customer_transaction_order,...,q80_amount,mode_amount,tenure_60d,slope_days_since_first_transaction,q50_amount,q10_days_until_next_transaction,q70_customer_transaction_order,mean_days_since_first_transaction,q5_days_since_previous_transaction,q30_days_since_previous_transaction
is_churn_30_days,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,5.4e-05,-0.014711,-0.023684,-0.013327,-0.006902,0.027902,-0.032333,-0.010626,0.003962,-0.013336,...,0.029693,-0.02138,-0.004131,-0.007349,0.004937,-0.008814,-0.013335,-0.013833,-0.028923,-0.016127
1,-4.8e-05,0.013132,0.021141,0.011896,0.006161,-0.024907,0.028861,0.009485,-0.003537,0.011904,...,-0.026505,0.019085,0.003687,0.00656,-0.004407,0.007868,0.011904,0.012348,0.025818,0.014396


# Test Feature Transformation Pipeline 2

## Train & Evaluate

My hypothesis is that the scaler and the imputer, whose parameters are fitted on the train set, keep the training feature distribution stuck in a specific region, which makes the optimal region farther away (harder to reach).

I'll test my hypothesis by removing the following steps:
- Scaler
- Imputer
- Feature Selection

And just use the raw features.

In [156]:
transactions_modeling_features_df = add_transaction_time_features(transactions_modeling_df)

In [None]:
X_train_ids = pd.DataFrame(X_train.reset_index()['customer_id'])

X_train_raw_features_df = build_customer_features(
    transactions_modeling_df=transactions_modeling_features_df,
    customers_modeling_df=X_train_ids,
    observed_date=TRAIN_SNAPSHOT_DATE
)

X_train_raw_features_df = X_train_raw_features_df.set_index(keys='customer_id', drop=True)

In [None]:
X_test_ids = pd.DataFrame(X_test.reset_index()['customer_id'])

X_test_raw_features_df = build_customer_features(
    transactions_modeling_df=transactions_modeling_features_df,
    customers_modeling_df=X_test_ids,
    observed_date=TRAIN_SNAPSHOT_DATE
)

X_test_raw_features_df = X_test_raw_features_df.set_index(keys='customer_id', drop=True)

In [None]:
X_val_ids = pd.DataFrame(X_val.reset_index()['customer_id'])

X_val_raw_features_df = build_customer_features(
    transactions_modeling_df=transactions_modeling_features_df,
    customers_modeling_df=X_val_ids,
    observed_date=TRAIN_SNAPSHOT_DATE
)

X_val_raw_features_df = X_val_raw_features_df.set_index(keys='customer_id', drop=True)

In [None]:
dt2 = DecisionTreeClassifier(
    max_depth=6,
    min_samples_leaf=50,
    random_state=42
)

dt2.fit(
    X_train_raw_features_df,
    y_train
)

0,1,2
,criterion,'gini'
,splitter,'best'
,max_depth,6
,min_samples_split,2
,min_samples_leaf,50
,min_weight_fraction_leaf,0.0
,max_features,
,random_state,42
,max_leaf_nodes,
,min_impurity_decrease,0.0


In [179]:
xgb_model2 = xgb.XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    objective="binary:logistic",
    eval_metric="auc",
    random_state=42,
    n_jobs=-1
)

xgb_model2.fit(
    X_train_raw_features_df, y_train,
    eval_set=[(X_test_raw_features_df, y_test)],
    verbose=False
)

0,1,2
,objective,'binary:logistic'
,base_score,
,booster,
,callbacks,
,colsample_bylevel,
,colsample_bynode,
,colsample_bytree,0.8
,device,
,early_stopping_rounds,
,enable_categorical,False


In [180]:
lgbm_model2 = lgb.LGBMClassifier(
    objective="binary",
    n_estimators=1000,
    learning_rate=0.05,
    num_leaves=31,
    subsample=0.8,
    colsample_bytree=0.9,
    random_state=42
)

lgbm_model2.fit(
    X_train_raw_features_df,
    y_train,
    eval_set=[(X_train_raw_features_df, y_train), (X_test_raw_features_df, y_test)],
    eval_metric="auc"
)


A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().





[LightGBM] [Info] Number of positive: 801, number of negative: 715
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002775 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 16276
[LightGBM] [Info] Number of data points in the train set: 1516, number of used features: 106
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.528364 -> initscore=0.113578
[LightGBM] [Info] Start training from score 0.113578


0,1,2
,boosting_type,'gbdt'
,num_leaves,31
,max_depth,-1
,learning_rate,0.05
,n_estimators,1000
,subsample_for_bin,200000
,objective,'binary'
,class_weight,
,min_split_gain,0.0
,min_child_weight,0.001


In [185]:
evaluate_model("Decision Tree", dt2,
               X_train_raw_features_df, y_train, X_test_raw_features_df, y_test, X_val_raw_features_df, y_val)

evaluate_model("XGBoost", xgb_model2,
               X_train_raw_features_df, y_train, X_test_raw_features_df, y_test, X_val_raw_features_df, y_val)

evaluate_model("LightGBM", lgbm_model2,
               X_train_raw_features_df, y_train, X_test_raw_features_df, y_test, X_val_raw_features_df, y_val)


===== Decision Tree =====

TRAIN
-----
ROC-AUC:      0.6226
PR-AUC:       0.6195
Accuracy:     0.5956

Confusion Matrix:
          Predicted 0  Predicted 1
Actual 0          271          444
Actual 1          169          632

Classification Report:
              precision    recall  f1-score   support

           0       0.62      0.38      0.47       715
           1       0.59      0.79      0.67       801

    accuracy                           0.60      1516
   macro avg       0.60      0.58      0.57      1516
weighted avg       0.60      0.60      0.58      1516


TEST
----
ROC-AUC:      0.4857
PR-AUC:       0.5422
Accuracy:     0.5190

Confusion Matrix:
          Predicted 0  Predicted 1
Actual 0           58          165
Actual 1           76          202

Classification Report:
              precision    recall  f1-score   support

           0       0.43      0.26      0.32       223
           1       0.55      0.73      0.63       278

    accuracy                        

The performance of the models is roughly the same, which means the feature transformation (imputing, scaling, selecting) doesn't have a clear impact on the performance of these tree models.
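
For the scaler specifically, this is what one would expect: standardization preserves the ordering of values within a feature, and a tree split depends only on that ordering. A minimal sketch on synthetic data (the data and the helper names `gini` and `best_split_partition` are made up for illustration) showing that the best single-threshold split separates the rows identically before and after scaling. It says nothing about the effect of imputation or feature selection.

```python
import numpy as np


def gini(labels):
    """Gini impurity of a binary label array."""
    p = labels.mean()
    return 1.0 - p**2 - (1.0 - p) ** 2


def best_split_partition(x, y):
    """Boolean mask of the rows sent left by the best single-threshold Gini split."""
    order = np.argsort(x, kind="stable")
    xs, ys = x[order], y[order]
    best_score, best_threshold = np.inf, None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no threshold fits between equal values
        score = (i * gini(ys[:i]) + (len(ys) - i) * gini(ys[i:])) / len(ys)
        if score < best_score:
            best_score, best_threshold = score, (xs[i] + xs[i - 1]) / 2
    return x <= best_threshold


rng = np.random.default_rng(42)
x_raw = rng.normal(loc=100.0, scale=50.0, size=200)
y = (x_raw + rng.normal(scale=60.0, size=200) > 100.0).astype(int)
# Same transformation a standard scaler applies: subtract the mean, divide by the std.
x_scaled = (x_raw - x_raw.mean()) / x_raw.std()

same_partition = np.array_equal(
    best_split_partition(x_raw, y), best_split_partition(x_scaled, y)
)
print(same_partition)
```

The threshold value changes with the scale, but the rows on each side of it do not, so the resulting tree makes the same predictions.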

## Investigate Low AUC

In [None]:
pd.merge(
    X_train_raw_features_df,
    y_train,
    on='customer_id',
    how='inner'
).groupby('is_churn_30_days').mean()

Unnamed: 0_level_0,rfm_recency_all_time,rfm_frequency_all_time,rfm_monetary_all_time,tenure_all_time,rfm_recency_30d,rfm_frequency_30d,rfm_monetary_30d,tenure_30d,rfm_recency_60d,rfm_frequency_60d,...,q20_days_since_first_transaction,q30_days_since_first_transaction,q40_days_since_first_transaction,q50_days_since_first_transaction,q60_days_since_first_transaction,q70_days_since_first_transaction,q80_days_since_first_transaction,q90_days_since_first_transaction,q95_days_since_first_transaction,q99_days_since_first_transaction
is_churn_30_days,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,51.317483,11.593007,679.967944,78.202797,72.503925,10.577708,625.898791,71.103611,94.47585,9.545617,...,12.611192,19.426764,25.861314,31.903893,38.479805,44.956934,51.419951,58.016302,61.267153,63.921922
1,56.3196,11.59176,669.620062,74.923845,76.492958,10.738028,608.378549,69.539437,98.929508,9.97377,...,13.737418,20.416193,27.220131,33.342451,39.379431,45.73698,51.989497,58.259081,61.328228,63.903129


My guess is that the feature distributions of the two classes are so similar that the models can't tell them apart. For example, the class means of `rfm_frequency_all_time` are 11.593 and 11.592. In other words, the current features are not useful.

In [189]:
temp_df = X_train_raw_features_df.copy()
temp_df['target'] = y_train['is_churn_30_days']

In [193]:
def kl_divergence_per_feature(df, target_col='target', bins=50):
    features = df.columns.drop(target_col)
    kl_dict = {}

    for col in features:
        # Separate feature by class, drop NaNs
        x0 = df[df[target_col] == 0][col].dropna().values
        x1 = df[df[target_col] == 1][col].dropna().values

        # Skip feature if one class is empty
        if len(x0) == 0 or len(x1) == 0:
            kl_dict[col] = np.nan
            continue

        # Histogram + probability distribution
        min_val = min(x0.min(), x1.min())
        max_val = max(x0.max(), x1.max())

        # If min == max, skip feature (no variance)
        if min_val == max_val:
            kl_dict[col] = 0.0
            continue

        hist0, _ = np.histogram(x0, bins=bins, range=(min_val, max_val), density=True)
        hist1, _ = np.histogram(x1, bins=bins, range=(min_val, max_val), density=True)

        # Smooth zeros
        hist0 += 1e-8
        hist1 += 1e-8

        # Normalize
        p0 = hist0 / hist0.sum()
        p1 = hist1 / hist1.sum()

        # Symmetric KL
        kl0_1 = stats.entropy(p0, p1)
        kl1_0 = stats.entropy(p1, p0)
        kl_avg = 0.5 * (kl0_1 + kl1_0)

        kl_dict[col] = kl_avg

    kl_df = pd.DataFrame.from_dict(kl_dict, orient='index', columns=['KL_divergence'])
    kl_df = kl_df.sort_values('KL_divergence', ascending=False)
    return kl_df

In [199]:
kl_df = kl_divergence_per_feature(temp_df, target_col='target', bins=50)


divide by zero encountered in divide


invalid value encountered in divide



In [201]:
print("KL divergence summary:")
kl_df['KL_divergence'].describe()

KL divergence summary:


count    105.000000
mean       0.212131
std        0.087263
min        0.000000
25%        0.174720
50%        0.201643
75%        0.249194
max        0.585427
Name: KL_divergence, dtype: float64

Nearly half of the features have a KL divergence below 0.2 (the median is 0.2016), which partly supports my guess.
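
For a sense of scale, it helps to compare these values against a reference where the answer is known. The sketch below is a simplified variant of the function above (it smooths raw bin counts instead of densities, and `symmetric_kl` is a hypothetical name) applied to synthetic samples: two drawn from the same distribution and one clearly shifted. The sample sizes mirror the two class sizes in the train set.

```python
import numpy as np


def symmetric_kl(x0, x1, bins=50, eps=1e-8):
    """Symmetric KL divergence between two samples via shared-range histograms."""
    lo = min(x0.min(), x1.min())
    hi = max(x0.max(), x1.max())
    counts0, _ = np.histogram(x0, bins=bins, range=(lo, hi))
    counts1, _ = np.histogram(x1, bins=bins, range=(lo, hi))
    # Smooth empty bins, then normalize to probabilities.
    p0 = (counts0 + eps) / (counts0 + eps).sum()
    p1 = (counts1 + eps) / (counts1 + eps).sum()
    kl_0_1 = np.sum(p0 * np.log(p0 / p1))
    kl_1_0 = np.sum(p1 * np.log(p1 / p0))
    return 0.5 * (kl_0_1 + kl_1_0)


rng = np.random.default_rng(0)
class_0 = rng.normal(0.0, 1.0, size=715)
class_1_same = rng.normal(0.0, 1.0, size=801)     # same distribution as class_0
class_1_shifted = rng.normal(3.0, 1.0, size=801)  # clearly separated

kl_same = symmetric_kl(class_0, class_1_same)
kl_shifted = symmetric_kl(class_0, class_1_shifted)
print(f"same distribution: {kl_same:.3f}, shifted: {kl_shifted:.3f}")
```

With 50 bins and a tiny smoothing constant, sparse tail bins can push the estimate above zero even when both samples come from the same distribution, so the "same distribution" value is a useful floor to compare the per-feature KL values against.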

# Log Results

## MLflow setup

In [None]:
mlflow.set_experiment(EXPERIMENT_NAME)

<Experiment: artifact_location='file:///home/hong-mai/Desktop/HONGMAI/Coding/ai-customer-growth-retention/mlruns/632645343096231848', creation_time=1768534303122, experiment_id='632645343096231848', last_update_time=1768534303122, lifecycle_stage='active', name='churn-lightgbm', tags={}>

## Get Data

In [90]:
# --------------------------------------------------
# Build training base tables
# --------------------------------------------------
transactions_modeling_df, customers_modeling_df = build_training_base(
    seed_customers_path=f"../{SEED_CUSTOMERS}",
    seed_transactions_path=f"../{SEED_TRANSACTIONS}",
    train_snapshot_date=TRAIN_SNAPSHOT_DATE,
    churn_windows=(30, 60, 90),
    )

# --------------------------------------------------
# Split features / targets
# --------------------------------------------------
X_train, X_val, X_test, y_train, y_val, y_test = split_train_test_val(
    customers_modeling_df,
    targets=targets,
    test_size=0.33,
    val_size=0.33,
    random_state=42,
)

Number of columns: 3
Column names: ['customer_id', 'transaction_date', 'amount']
Number of rows: 46,704
Data Preview: 

  customer_id transaction_date  amount
0      C00000       2025-09-10  195.78
1      C00000       2025-09-12   50.87
2      C00000       2025-10-01  133.25
3      C00000       2025-10-16   37.44
4      C00000       2025-10-18  101.95
Number of columns: 3
Column names: ['customer_id', 'signup_date', 'true_lifetime_days']
Number of rows: 3,000
Data Preview: 

  customer_id signup_date  true_lifetime_days
0      C00000  2025-08-22                 204
1      C00001  2025-03-07                 365
2      C00002  2025-08-18                  48
3      C00003  2025-09-22                  84
4      C00004  2025-05-28                 113


In [94]:
# --------------------------------------------------
# TRAIN — build raw & transformed customer-level features
# --------------------------------------------------
(
    X_train_raw_features_df,
    X_train_by_target,
    selected_features_by_target,
    mi_scores_by_target,
    numeric_imputer,
    scaler
) = build_and_transform_customer_features_pipeline_train(
    transactions_modeling_df=transactions_modeling_df,
    X_train=X_train,
    y_train=y_train,
    observed_date=TRAIN_SNAPSHOT_DATE,
    targets=targets,
    ARTIFACT_DIR=None,
    feature_list=[
        "amount",
        "days_since_previous_transaction",
        "days_until_next_transaction",
        "customer_transaction_order",
        "days_since_first_transaction",
    ]
)


[IterativeImputer] Early stopping criterion not reached.



[is_churn_30_days] selected 62 features
[is_churn_60_days] selected 60 features
[is_churn_90_days] selected 62 features


In [174]:
# --------------------------------------------------
# TEST — build raw customer-level features
# --------------------------------------------------
X_test_raw_features_df = build_customer_features(
    transactions_modeling_df=transactions_modeling_df,
    customers_modeling_df=X_test,
    observed_date=TRAIN_SNAPSHOT_DATE,
)
X_test_raw_features_df = X_test_raw_features_df.set_index("customer_id", drop=True)

# --------------------------------------------------
# VAL — build raw customer-level features
# --------------------------------------------------
X_val_raw_features_df = build_customer_features(
    transactions_modeling_df=transactions_modeling_df,
    customers_modeling_df=X_val,
    observed_date=TRAIN_SNAPSHOT_DATE,
)
X_val_raw_features_df = X_val_raw_features_df.set_index("customer_id", drop=True)


In [175]:
# --------------------------------------------------
# TEST - Feature selection per target
# --------------------------------------------------
X_test_by_target = transform_and_select_for_multiple_targets_test(
    X_test_raw_features_df=X_test_raw_features_df,
    numeric_imputer=numeric_imputer,
    scaler=scaler,
    selected_features_by_target=selected_features_by_target
)

# --------------------------------------------------
# VAL — build and transform customer-level features
# --------------------------------------------------
X_val_by_target = transform_and_select_for_multiple_targets_test(
    X_test_raw_features_df=X_val_raw_features_df,
    numeric_imputer=numeric_imputer,
    scaler=scaler,
    selected_features_by_target=selected_features_by_target
)

## Write Data

In [None]:
# -----------------------------
# TARGET
# -----------------------------

y_train.to_csv(BASE_GOLD_DIR / "target" / "y_train.csv")
y_test.to_csv(BASE_GOLD_DIR / "target" / "y_test.csv")
y_val.to_csv(BASE_GOLD_DIR / "target" / "y_val.csv")

temp_df = pd.concat(
    [y_train,
    y_test,
    y_val]
)

temp_df.to_csv(BASE_GOLD_DIR / "target" / "y_all.csv", index=True)

In [217]:
# -----------------------------
# RAW FEATURES
# -----------------------------
save_raw_features_csv(
    X_train_raw_features_df,
    split="train",
    base_gold_dir=BASE_GOLD_DIR,
)

save_raw_features_csv(
    X_val_raw_features_df,
    split="val",
    base_gold_dir=BASE_GOLD_DIR,
)

save_raw_features_csv(
    X_test_raw_features_df,
    split="test",
    base_gold_dir=BASE_GOLD_DIR,
)

WRITING TO: /home/hong-mai/Desktop/HONGMAI/Coding/ai-customer-growth-retention/data/gold/31_12_2025/raw
WRITING TO: /home/hong-mai/Desktop/HONGMAI/Coding/ai-customer-growth-retention/data/gold/31_12_2025/raw
WRITING TO: /home/hong-mai/Desktop/HONGMAI/Coding/ai-customer-growth-retention/data/gold/31_12_2025/raw


In [None]:
# Merge the dataset so inference is easier.
'''
temp_df = pd.concat(
    [
        X_train_raw_features_df,
        X_test_raw_features_df,
        X_val_raw_features_df
    ]
)

temp_df.to_csv(BASE_GOLD_DIR / "raw" / "all_features.csv", index=True)
'''

In [229]:
# -----------------------------
# TRANSFORMED FEATURES (by target)
# -----------------------------
save_transformed_by_target_csv(
    X_train_by_target,
    split="train",
    base_gold_dir=BASE_GOLD_DIR,
)

save_transformed_by_target_csv(
    X_val_by_target,
    split="val",
    base_gold_dir=BASE_GOLD_DIR,
)

save_transformed_by_target_csv(
    X_test_by_target,
    split="test",
    base_gold_dir=BASE_GOLD_DIR,
)

In [None]:
# Merge the dataset so inference is easier.
'''
for target in targets:

    test_df = pd.read_csv(BASE_GOLD_DIR / "transformed" / target / "X_test.csv", index_col=0)
    train_df = pd.read_csv(BASE_GOLD_DIR / "transformed" / target / "X_train.csv", index_col=0)
    val_df = pd.read_csv(BASE_GOLD_DIR / "transformed" / target / "X_val.csv", index_col=0)

    temp_df = pd.concat(
        [
            train_df,
            test_df,
            val_df
        ]
    )

    temp_df.to_csv(BASE_GOLD_DIR / "transformed" / target / "X_all.csv", index=True)
'''

In [259]:
joblib.dump(
    numeric_imputer,
    PREPROCESSING_REF_DIR / "numeric_imputer.joblib",
)

joblib.dump(
    scaler,
    PREPROCESSING_REF_DIR / "scaler.joblib",
)

['/home/hong-mai/Desktop/HONGMAI/Coding/ai-customer-growth-retention/data/gold/31_12_2025/reference/preprocessing/scaler.joblib']

In [262]:
with open(PREPROCESSING_REF_DIR / "selected_features_by_target.json", "w") as f:
    json.dump(selected_features_by_target, f, indent=2)

## Read Data

In [246]:
# --------------------------------------------------
# READ TARGETS
# --------------------------------------------------
y_train = pd.read_csv(BASE_GOLD_DIR / "target" / "y_train.csv")
y_val   = pd.read_csv(BASE_GOLD_DIR / "target" / "y_val.csv")
y_test  = pd.read_csv(BASE_GOLD_DIR / "target" / "y_test.csv")

# --------------------------------------------------
# READ RAW FEATURES
# --------------------------------------------------
X_train_raw = (
    pd.read_csv(BASE_GOLD_DIR / "raw" / "train_features.csv")
    .set_index("customer_id")
)
X_val_raw = (
    pd.read_csv(BASE_GOLD_DIR / "raw" / "val_features.csv")
    .set_index("customer_id")
)
X_test_raw = (
    pd.read_csv(BASE_GOLD_DIR / "raw" / "test_features.csv")
    .set_index("customer_id")
)

In [230]:
# --------------------------------------------------
# READ TRANSFORMED FEATURES
# --------------------------------------------------
X_train_transformed = {t: load_transformed("train", t) for t in targets}
X_val_transformed   = {t: load_transformed("val", t) for t in targets}
X_test_transformed  = {t: load_transformed("test", t) for t in targets}

## Train Models

In [282]:
# --------------------------------------------------
# ORCHESTRATION
# --------------------------------------------------
for dataset_version in ("raw", "transformed"):
    for target in targets:
        # Raw features are shared across targets; transformed features are per target.
        if dataset_version == "raw":
            X_tr, X_v = X_train_raw, X_val_raw
        else:
            X_tr, X_v = X_train_transformed[target], X_val_transformed[target]

        with mlflow.start_run(run_name=f"{dataset_version}_{target}"):
            mlflow.log_param("gold_data_version", BASE_GOLD_DIR.name)
            mlflow.log_param("dataset_version", dataset_version)

            train_lgbm(
                X_train=X_tr,
                y_train=y_train,
                X_val=X_v,
                y_val=y_val,
                target=target,
                dataset_version=dataset_version,
            )

[LightGBM] [Info] Number of positive: 534, number of negative: 476
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003474 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 12403
[LightGBM] [Info] Number of data points in the train set: 1010, number of used features: 106
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.528713 -> initscore=0.114978
[LightGBM] [Info] Start training from score 0.114978
[LightGBM] [Info] Number of positive: 534, number of negative: 477
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003313 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 12799
[LightGBM] [Info] Number of data points in the train set: 1011, number of used features: 106
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.528190 -> initscore=0.112879
[LightGBM] [Info] Start training from score 0.112879

Hint: Inferred schema contains integer column(s). Integer columns in Python cannot represent missing values. If your input data contains missing values at inference time, it will be encoded as floats and will cause a schema enforcement error. The best way to avoid this problem is to infer the model schema based on a realistic data sample (training dataset) that includes missing values. Alternatively, you can declare integer columns as doubles (float64) whenever these columns may have missing values. See `Handling Integers With Missing Values <https://www.mlflow.org/docs/latest/models.html#handling-integers-with-missing-values>`_ for more details.



[LightGBM] [Info] Number of positive: 456, number of negative: 554
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002399 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 12390
[LightGBM] [Info] Number of data points in the train set: 1010, number of used features: 106
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.451485 -> initscore=-0.194672
[LightGBM] [Info] Start training from score -0.194672
[LightGBM] [Info] Number of positive: 456, number of negative: 555
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002163 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 12802
[LightGBM] [Info] Number of data points in the train set: 1011, number of used features: 106
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.451039 -> initscore=-0.196475
[LightGBM] [Info] Start training from score -0.196475

[LightGBM] [Info] Number of positive: 382, number of negative: 628
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001506 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 12394
[LightGBM] [Info] Number of data points in the train set: 1010, number of used features: 106
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.378218 -> initscore=-0.497120
[LightGBM] [Info] Start training from score -0.497120
[LightGBM] [Info] Number of positive: 382, number of negative: 629
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001456 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 12732
[LightGBM] [Info] Number of data points in the train set: 1011, number of used features: 106
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.377844 -> initscore=-0.498711
[LightGBM] [Info] Start training from score -0.498711

[LightGBM] [Info] Number of positive: 534, number of negative: 476
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001465 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 14152
[LightGBM] [Info] Number of data points in the train set: 1010, number of used features: 60
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.528713 -> initscore=0.114978
[LightGBM] [Info] Start training from score 0.114978
[LightGBM] [Info] Number of positive: 534, number of negative: 477
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001781 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 14120
[LightGBM] [Info] Number of data points in the train set: 1011, number of used features: 60
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.528190 -> initscore=0.112879
[LightGBM] [Info] Start training from score 0.112879

Comparing the runs in the MLflow UI:
- 30 and 60 days: the models trained on transformed features perform better than those trained on raw features.
- 90 days: both models perform worse than random guessing.

![image.png](attachment:image.png)
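
For reference when reading this comparison: the PR-AUC of a classifier that scores at random is approximately the positive class rate, so the "random guessing" line differs per target. The sketch below derives those baselines from the positive/negative counts printed in the LightGBM training log above. It assumes the three log groups correspond to the 30, 60, and 90 day targets in that order, and that the validation split has a similar class balance. `random_pr_auc_baseline` is a hypothetical helper, not part of the notebook.

```python
# Positive / negative counts copied from the LightGBM training log.
# ASSUMPTION: the log groups are ordered 30, 60, 90 days.
train_class_counts = {
    "30_days": (534, 476),
    "60_days": (456, 554),
    "90_days": (382, 628),
}


def random_pr_auc_baseline(n_positive: int, n_negative: int) -> float:
    """Approximate PR-AUC of a classifier that scores at random.

    A random ranking has precision close to the positive rate at every
    recall level, so the area under the PR curve is about that rate.
    """
    return n_positive / (n_positive + n_negative)


baselines = {
    target: round(random_pr_auc_baseline(pos, neg), 4)
    for target, (pos, neg) in train_class_counts.items()
}

for target, baseline in baselines.items():
    print(f"{target}: random PR-AUC baseline ~ {baseline}")
```

If the metric being compared is ROC-AUC rather than PR-AUC, the random baseline is 0.5 regardless of class balance.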

# Call Models

## Promote to Production

In [128]:
## 30 days - Transformed Features
promote_to_production('7051a7a96e0442a8b9904bb8f05163c3')
## 60 days - Transformed Features
promote_to_production('85216ba359d243949e4acefd87fd1eed')
## 90 days - Raw Features
promote_to_production('43363f7766614a26b74485d845e90c28')

## Load Data and Models

In [40]:
raw_features_df = load_features("raw")

In [41]:
transformed_features_by_target = load_features("transformed")

In [42]:
models, metadata = load_production_models()

## Inference

In [178]:
predict_churn(
    customer_id="C01552",
    horizon_days=30,
    raw_features_df=raw_features_df,
    transformed_features_by_target=transformed_features_by_target,
    models=models,
    metadata=metadata
)

{'churn_probability': 0.0475, 'churn_label': 'low_risk'}

In [180]:
predict_churn(
    customer_id="C01553",
    horizon_days=30,
    raw_features_df=raw_features_df,
    transformed_features_by_target=transformed_features_by_target,
    models=models,
    metadata=metadata
)

{'churn_probability': 0.9268, 'churn_label': 'high_risk'}
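
The calls above pass both feature sets because the promoted models do not all use the same inputs: the 30 and 60 day models were trained on transformed features, the 90 day model on raw features. A minimal sketch of that routing follows, under stated assumptions: `PRODUCTION_DATASET_VERSION`, `select_feature_row`, the target names, and the feature names are all hypothetical, and plain dicts stand in for the DataFrames. The notebook's actual `predict_churn` may be implemented differently.

```python
# ASSUMPTION: mapping taken from the comments in the promotion cell.
PRODUCTION_DATASET_VERSION = {30: "transformed", 60: "transformed", 90: "raw"}


def select_feature_row(customer_id, horizon_days, raw_features, transformed_by_target):
    """Return the feature row matching the model promoted for this horizon."""
    if horizon_days not in PRODUCTION_DATASET_VERSION:
        raise ValueError(f"No production model for horizon {horizon_days}")

    target = f"is_churn_{horizon_days}_days"  # hypothetical target name
    if PRODUCTION_DATASET_VERSION[horizon_days] == "raw":
        table = raw_features
    else:
        table = transformed_by_target[target]

    if customer_id not in table:
        raise KeyError(f"Unknown customer_id: {customer_id}")
    return table[customer_id]


# Toy data: dicts keyed by customer_id stand in for the DataFrames.
# Feature names and values are made up for illustration.
raw_features = {"C01552": {"recency_days": 12, "n_orders": 7}}
transformed_by_target = {
    "is_churn_30_days": {"C01552": {"recency_days": -0.4}},
    "is_churn_60_days": {"C01552": {"recency_days": -0.4, "n_orders": 1.1}},
}

row_30 = select_feature_row("C01552", 30, raw_features, transformed_by_target)
row_90 = select_feature_row("C01552", 90, raw_features, transformed_by_target)
print(row_30)
print(row_90)
```

Keeping the horizon-to-dataset mapping in one place means a newly promoted model only requires changing that mapping, not the call sites.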

## Log Inference Results

Purpose: store the predictions so they can be compared with other models later.

In [57]:
customer_ids = list(transactions_modeling_df["customer_id"].unique())

In [63]:
# 30 / 60 / 90 day predictions, one probability Series per horizon
churn_probs = {
    horizon_days: (
        predict_churns(
            customer_ids=customer_ids,
            horizon_days=horizon_days,
            raw_features_df=raw_features_df,
            transformed_features_by_target=transformed_features_by_target,
            models=models,
            metadata=metadata,
        )["churn_probability"]
        .rename(f"p_is_churn_{horizon_days}_days")
    )
    for horizon_days in (30, 60, 90)
}

p30, p60, p90 = churn_probs[30], churn_probs[60], churn_probs[90]

In [70]:
# Combine (index-safe)
prediction_df = pd.concat([p30, p60, p90], axis=1)

In [None]:
prediction_df.head()

| customer_id | p_is_churn_30_days | p_is_churn_60_days | p_is_churn_90_days |
|---|---|---|---|
| C00000 | 0.1041 | 0.3694 | 0.287 |
| C00001 | 0.086 | 0.1889 | 0.088 |
| C00002 | 0.9755 | 0.7225 | 0.2212 |
| C00004 | 0.0509 | 0.1205 | 0.0877 |
| C00006 | 0.909 | 0.7637 | 0.7145 |


In [None]:
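output_dir = BASE_GOLD_DIR / "inference" / "classifier_1"
output_dir.mkdir(parents=True, exist_ok=True)  # to_csv does not create missing folders
prediction_df.to_csv(output_dir / "churn_prob.csv")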

In [73]:
output_dir = BASE_GOLD_DIR / "inference" / "classifier_1"
with open(output_dir / "metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)

I wanted to record which set of models produced these predictions, so I wrote the inference results to a dedicated folder with the model metadata attached.
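
As a sketch of how such a folder can be checked later, the snippet below writes a small prediction file next to a `metadata.json` and reads both back. The metadata keys (`run_id`, `dataset_version`), the placeholder run IDs, and the temporary directory are assumptions for illustration. The real `metadata` object comes from `load_production_models()`, and its structure is not shown here.

```python
import csv
import json
import tempfile
from pathlib import Path

# ASSUMPTION: metadata maps each horizon to the model that served it.
# Keys and values are placeholders, not the notebook's real metadata.
metadata = {
    "30": {"run_id": "<run-id-30>", "dataset_version": "transformed"},
    "90": {"run_id": "<run-id-90>", "dataset_version": "raw"},
}
# Two rows taken from the prediction table above.
predictions = [
    {"customer_id": "C00000", "p_is_churn_30_days": 0.1041},
    {"customer_id": "C00001", "p_is_churn_30_days": 0.086},
]

with tempfile.TemporaryDirectory() as tmp:
    output_dir = Path(tmp) / "inference" / "classifier_1"
    output_dir.mkdir(parents=True, exist_ok=True)  # create the folder first

    with open(output_dir / "churn_prob.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(predictions[0]))
        writer.writeheader()
        writer.writerows(predictions)

    with open(output_dir / "metadata.json", "w") as f:
        json.dump(metadata, f, indent=2)

    # Later: read the predictions together with the models that produced them.
    with open(output_dir / "metadata.json") as f:
        loaded_metadata = json.load(f)
    with open(output_dir / "churn_prob.csv", newline="") as f:
        loaded_rows = list(csv.DictReader(f))

print(loaded_metadata["30"]["dataset_version"], len(loaded_rows))
```

Storing the metadata beside the predictions keeps each inference folder self-describing, so results from a later classifier can go into a sibling folder without ambiguity about which models were used.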