PHASE 4
Hazard Modeling: Time-Dependent Churn Risk

In [29]:
# STEP 4.1 — Load Phase 3 Artifact (Immutable)

In [30]:
import pandas as pd

df = pd.read_parquet("phase3_state_with_survival.parquet")
print(df.shape)
df.head()


(37039, 10)


Unnamed: 0,Customer ID,InvoiceDate,recency_days,frequency,monetary_avg,delta_revenue,delta_recency,is_alive,duration,event
0,12346.0,2009-12-14 08:34:00,,0,45.0,,,False,325,1
1,12346.0,2009-12-14 11:00:00,0.0,1,33.75,-22.5,,False,325,1
2,12346.0,2009-12-14 11:02:00,0.0,2,30.0,0.0,0.0,False,325,1
3,12346.0,2009-12-18 10:47:00,3.0,3,28.125,0.0,3.0,False,325,1
4,12346.0,2009-12-18 10:55:00,0.0,4,22.7,-21.5,-3.0,False,325,1


In [31]:
# STEP 4.2 — Choose Time Representation

### STEP 4.2 — Choice of Time Representation

In this phase, we model churn risk using a **discrete-time survival framework**.

The continuous survival duration (measured in days since the last observed transaction)
is discretized into fixed-length 30-day intervals (approximately monthly bins).

This choice is motivated by:

1. Interpretability:  
   Discrete-time hazard directly represents the probability of churn
   in the next time interval, conditional on survival so far.

2. Data characteristics:  
   Transactions occur at irregular gaps, and in a non-contractual setting churn
   is inferred from inactivity rather than observed at an exact instant, so
   interval-level timing fits the data better than continuous-time assumptions.

3. Practical and academic validity:  
   Discrete-time survival analysis is a standard and accepted approach
   in non-contractual customer lifetime modeling and decision science literature.

This representation enables a clear person-period dataset construction
and supports interpretable hazard estimation using generalized linear models.
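
As a minimal sketch of how the discrete-time hazard relates to survival, the snippet below computes S(t) as the running product of (1 - h(k)) over bins k = 0..t. The function name `survival_curve` and the hazard values are illustrative assumptions, not taken from the fitted model.

```python
from itertools import accumulate
from operator import mul


def survival_curve(hazards):
    """Return S(t) = product over k <= t of (1 - h(k)) for per-bin hazards."""
    return list(accumulate((1.0 - h for h in hazards), mul))


# Hypothetical hazards for four consecutive 30-day bins (illustrative only).
example_hazards = [0.0, 0.0, 0.10, 0.10]
example_survival = survival_curve(example_hazards)
print(example_survival)
```

Each entry is the probability of still being alive at the end of that bin, so the curve can only stay flat or decrease.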


In [32]:
# STEP 4.3 — Define Time Bins

In [33]:
df["time_bin"] = (df["duration"] // 30).astype(int)
df["time_bin"].describe()


Unnamed: 0,time_bin
count,37039.0
mean,2.609439
std,4.806882
min,0.0
25%,0.0
50%,0.0
75%,2.0
max,24.0
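
To make the binning rule concrete, here is a small sketch of the `duration // 30` mapping and the day range each bin covers. The helper names `to_time_bin` and `bin_bounds` are hypothetical; only the 30-day width comes from the cell above.

```python
BIN_WIDTH_DAYS = 30  # same width as the `duration // 30` rule above


def to_time_bin(duration_days):
    """Map a duration in days to its zero-based 30-day bin."""
    return int(duration_days // BIN_WIDTH_DAYS)


def bin_bounds(time_bin):
    """Return the half-open day interval [start, end) covered by a bin."""
    start = time_bin * BIN_WIDTH_DAYS
    return start, start + BIN_WIDTH_DAYS


# Customer 12346.0 in the Phase 3 preview has duration 325.
example_bin = to_time_bin(325)
example_bounds = bin_bounds(example_bin)
print(example_bin, example_bounds)
```

Under this rule the maximum observed bin, 24, covers days 720 to 749.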


In [34]:
# STEP 4.4 — Expand to Person-Period Format

In [35]:
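rows = []

# NOTE: df holds several rows per customer (one per customer event), so this
# loop expands every input row, not every customer, into bins 0..time_bin.
for _, row in df.iterrows():
    last_bin = int(row["time_bin"])
    churned = int(row["event"]) == 1
    for t in range(last_bin + 1):
        rows.append({
            "Customer ID": row["Customer ID"],
            "time_bin": t,
            # The event flag is set only in the final bin of a churned row.
            "event": int(t == last_bin and churned),
            # State features are copied unchanged into every bin.
            "recency_days": row["recency_days"],
            "frequency": row["frequency"],
            "monetary_avg": row["monetary_avg"],
            "delta_revenue": row["delta_revenue"],
            "delta_recency": row["delta_recency"]
        })

person_period_df = pd.DataFrame(rows)
person_period_df.head()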


Unnamed: 0,Customer ID,time_bin,event,recency_days,frequency,monetary_avg,delta_revenue,delta_recency
0,12346.0,0,0,,0,45.0,,
1,12346.0,1,0,,0,45.0,,
2,12346.0,2,0,,0,45.0,,
3,12346.0,3,0,,0,45.0,,
4,12346.0,4,0,,0,45.0,,
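
For comparison, a person-period expansion that starts from exactly one row per customer could be sketched as below. This is not the notebook's method: the function name `expand_person_period` and the toy data are hypothetical, and which row should represent each customer is a modeling choice the notebook does not settle.

```python
import pandas as pd


def expand_person_period(customers, id_col="Customer ID"):
    """Expand one row per customer into one row per (customer, time_bin).

    Assumes `customers` has exactly one row per customer, with integer
    `time_bin` (last observed bin) and a 0/1 `event` column.
    """
    customers = customers.reset_index(drop=True)
    # Repeat each customer row (time_bin + 1) times: bins 0..time_bin inclusive.
    repeats = customers["time_bin"].astype(int).to_numpy() + 1
    expanded = customers.loc[customers.index.repeat(repeats)].reset_index(drop=True)
    # Number the repeated rows 0, 1, 2, ... within each customer.
    expanded["period"] = expanded.groupby(id_col).cumcount()
    # The event fires only in the customer's final bin, and only if they churned.
    is_last = expanded["period"] == expanded["time_bin"]
    expanded["event"] = (is_last & (expanded["event"] == 1)).astype(int)
    expanded["time_bin"] = expanded["period"]
    return expanded.drop(columns="period")


# Toy input: hypothetical customers, already reduced to one row each.
toy_customers = pd.DataFrame({
    "Customer ID": [1.0, 2.0],
    "time_bin": [2, 1],
    "event": [1, 0],
})
toy_person_period = expand_person_period(toy_customers)
print(toy_person_period)
```

With one input row per customer, each customer contributes at most one event row, which is the property the sanity check in STEP 4.5 looks for.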


In [36]:
# STEP 4.5 — Sanity Checks

In [37]:
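# Each customer should contribute at most one event row, so the expected maximum is 1.
# A larger value means the expansion emitted an event once per input row
# rather than once per customer.
person_period_df.groupby("Customer ID")["event"].sum().max()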


155

In [38]:
# Hazard base rate by time
person_period_df.groupby("time_bin")["event"].mean().head(10)


Unnamed: 0_level_0,event
time_bin,Unnamed: 1_level_1
0,0.0
1,0.0
2,0.0
3,0.0
4,0.0
5,0.0
6,0.099035
7,0.099374
8,0.087319
9,0.081105


In [39]:
# STEP 4.5.1 — Missing Value Handling

In [40]:
# STEP 4.5.1 — Handle Missing Values (CLV-safe)

from sklearn.impute import SimpleImputer

features = [
    "recency_days",
    "frequency",
    "monetary_avg",
    "delta_revenue",
    "delta_recency",
    "time_bin"
]

X = person_period_df[features]
y = person_period_df["event"]

# Median imputation for numerical stability
imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(X)


In [41]:
# Logistic Regression

In [42]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_imputed, y)


In [43]:
# Quick sanity check
import numpy as np

np.isnan(X_imputed).sum()


np.int64(0)

### Missing Value Handling

Certain customer state variables (e.g., recency and behavioral deltas)
are undefined for early customer events, resulting in missing values.

We apply median imputation to numerical features prior to hazard modeling.
This choice preserves sample size and avoids dropping early-lifecycle rows,
which would skew the sample toward customers with longer histories, while
keeping the discrete-time hazard model interpretable.
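
One optional variant, sketched here under assumptions and not used in this notebook, is to keep a 0/1 flag recording which values were imputed, so the model can tell "undefined for an early event" apart from a true median value. The helper `impute_with_flags` and the toy frame are hypothetical.

```python
import pandas as pd


def impute_with_flags(frame, columns):
    """Median-impute the given columns and add `<col>_missing` 0/1 flags."""
    out = frame.copy()
    for col in columns:
        missing = out[col].isna()
        if missing.any():
            # Record where the value was undefined before filling it in.
            out[f"{col}_missing"] = missing.astype(int)
            out[col] = out[col].fillna(out[col].median())
    return out


# Toy data: the first event of a customer has no recency yet.
toy_state = pd.DataFrame({
    "recency_days": [None, 0.0, 3.0, 5.0],
    "frequency": [0, 1, 2, 3],
})
toy_imputed = impute_with_flags(toy_state, ["recency_days", "frequency"])
print(toy_imputed)
```

Columns without missing values, such as `frequency` here, are left unchanged and get no flag.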


In [44]:
# STEP 4.6 — Fit a Simple Hazard Model (Interpretable)

In [45]:


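# This cell repeats the feature selection and imputation from STEP 4.5.1
# so that the hazard model is prepared and fit in a single cell.
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression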

features = [
    "recency_days",
    "frequency",
    "monetary_avg",
    "delta_revenue",
    "delta_recency",
    "time_bin"
]

X = person_period_df[features]
y = person_period_df["event"]

# Median imputation (CLV-safe)
imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(X)

# Discrete-time hazard model
hazard_model = LogisticRegression(max_iter=1000)
hazard_model.fit(X_imputed, y)


We estimate the discrete-time hazard function using logistic regression,
where the model predicts the probability of churn in the next time interval
conditional on survival up to that interval.

Logistic regression is chosen for its interpretability and stability,
allowing direct inspection of how customer state variables influence
churn risk over time.
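
The link between the linear predictor and the hazard can be sketched as follows: the model's log-odds are passed through the logistic function to give a per-bin churn probability. The intercept and coefficient used here are hypothetical placeholders, not the fitted values.

```python
import math


def hazard(intercept, coefs, x):
    """Logistic hazard: h = 1 / (1 + exp(-(intercept + coefs . x)))."""
    z = intercept + sum(c * v for c, v in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical one-feature model: only `time_bin` enters the predictor.
HYPOTHETICAL_INTERCEPT = -4.0
HYPOTHETICAL_COEFS = [0.25]

example_curve = [
    hazard(HYPOTHETICAL_INTERCEPT, HYPOTHETICAL_COEFS, [t]) for t in range(6)
]
for t, h in enumerate(example_curve):
    print(f"time_bin={t}: hazard={h:.4f}")
```

With a positive coefficient on `time_bin`, the predicted hazard rises monotonically across bins.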


In [46]:
# STEP 4.7 — Inspect Hazard Direction (VERY IMPORTANT)

In [47]:
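# Positive coefficients raise the log-odds of churn in a bin; negative ones lower it.
# The features were not scaled, so coefficient magnitudes are not directly comparable.
coef_df = pd.DataFrame({
    "feature": features,
    "coef": hazard_model.coef_[0]
}).sort_values("coef")

coef_df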



Unnamed: 0,feature,coef
4,delta_recency,-0.002123
2,monetary_avg,-1.5e-05
3,delta_revenue,3e-06
0,recency_days,0.004614
1,frequency,0.006016
5,time_bin,0.257832


In [48]:
hasattr(hazard_model, "coef_")


True

In [49]:
# STEP 4.8 — Save Phase 4 Artifact

In [50]:
person_period_df.to_parquet(
    "phase4_person_period_dataset.parquet",
    index=False
)
