## Maintenance Contract – Cluster Monitoring & Retraining

This notebook implements a “maintenance contract” for the customer segmentation model (M0). Its goal is to monitor when the clustering model degrades over time and to define a clear retraining cadence based on both cluster stability and feature drift.

**Key objectives:**
1. **Simulate Cluster Drift**  
   - Use the Adjusted Rand Index (ARI) to compare the original K-Means segmentation (M0) against fresh re-fits at weekly snapshots.  
   - Identify the week when ARI falls below our 0.8 threshold (≈90 % label agreement), triggering a model retrain.

2. **Assess Feature Stability**  
   - For each continuous feature (e.g. recency, log spend), run two-sample KS tests between week 0 and each subsequent week.  
   - For binary flags (e.g. returning‐customer), apply chi-square tests to detect any shifts in proportions.

3. **Define Maintenance Cadence**  
   - Combine ARI decline and feature-drift signals to recommend a retraining interval (e.g. every 12 weeks).  
   - Highlight which features drive most of the drift—and which segments require the most urgent attention.
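
To make the ARI comparison in objective 1 concrete, here is a minimal sketch on hypothetical labels (eight made-up customers, not notebook data). It illustrates two properties the monitoring relies on: ARI ignores how clusters are numbered, and it drops below 1 as soon as customers change cluster.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical labels for eight customers (illustrative only, not notebook data).
labels_m0 = [0, 0, 1, 1, 2, 2, 3, 3]

# Same partition, but the cluster IDs are permuted: ARI ignores the naming.
labels_renamed = [2, 2, 0, 0, 3, 3, 1, 1]

# One customer (index 1) moves to another cluster: ARI drops below 1.
labels_shifted = [0, 1, 1, 1, 2, 2, 3, 3]

ari_same = adjusted_rand_score(labels_m0, labels_renamed)
ari_shifted = adjusted_rand_score(labels_m0, labels_shifted)

print(f"identical partition, renamed clusters: ARI = {ari_same:.2f}")
print(f"one customer reassigned:               ARI = {ari_shifted:.2f}")
```

This invariance is why a fresh K-Means fit can be compared with M0 directly, even though the two fits may number their clusters differently.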

## Load data

In [1]:
# Libraries
import os
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from dateutil.relativedelta import relativedelta
from scipy.stats import ks_2samp, chi2_contingency
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Set-up environment
pd.options.display.float_format = '{:.2f}'.format
pd.set_option('display.max_colwidth', None)
sns.set_theme(style="whitegrid", context="paper")
os.chdir('/Users/nataschajademinnitt/Documents/5. Data Analysis/segmenting_customers/')
print("Current directory:", os.getcwd())
warnings.filterwarnings("ignore")

Current directory: /Users/nataschajademinnitt/Documents/5. Data Analysis/segmenting_customers


In [3]:
# Load the data
df = pd.read_csv("./data/processed/df_maintenance.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96478 entries, 0 to 96477
Data columns (total 4 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   customer_unique_id        96478 non-null  object 
 1   order_purchase_timestamp  96478 non-null  object 
 2   order_id                  96478 non-null  object 
 3   m_price                   96478 non-null  float64
dtypes: float64(1), object(3)
memory usage: 2.9+ MB


## Functions

In [None]:
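def simulate_ari_drift(df, features, model, scaler, k, max_weeks=24, threshold=0.8):
    """
    Simulate ARI-based maintenance drift.

    For each week w, compare the labels assigned by the initial model with
    a fresh K-Means fit on the same snapshot, stopping once ARI < threshold.
    """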
    results = []
    for w in range(0, min(df['weeks_since'].max(), max_weeks) + 1):
        snap = df[df['weeks_since'] >= w]
        Xw_s = scaler.transform(snap[features].values)
        C_init = model.predict(Xw_s)
        C_new = KMeans(n_clusters=k, random_state=42).fit_predict(Xw_s)
        ari = adjusted_rand_score(C_init, C_new)
        results.append({'weeks_since': w, 'ari': ari})
        if ari < threshold:
            break
    return pd.DataFrame(results)

In [None]:
def plot_ari_drift(df_ari, threshold=0.8, xtick_step=2,
                   xlabel='Weeks since initial fit (T0)',
                   ylabel='Adjusted Rand Index',
                   title='Model Validity Over Time',
                   outpath=None):
    """
    Plot ARI drift over time.
    """
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(df_ari['weeks_since'], df_ari['ari'], marker='o', linestyle='-')
    ax.axhline(threshold, color='red', linestyle='--', label=f'ARI = {threshold}')
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    max_week = df_ari['weeks_since'].max()
    ax.set_xticks(range(0, max_week + 1, xtick_step))
    ax.set_title(title)
    ax.legend()
    plt.tight_layout()
    if outpath:
        plt.savefig(outpath, dpi=150)
    plt.show()

In [None]:
def ks_drift_tests(df, features, max_week=12):
    """
    Perform KS tests for continuous feature drift.
    """
    ks_results = []
    baseline = df[df['weeks_since'] == 0]
    for feat in features:
        data0 = baseline[feat]
        for w in range(1, min(df['weeks_since'].max(), max_week) + 1):
            dataw = df[df['weeks_since'] == w][feat]
            if len(dataw) < 10:
                continue
            stat, pval = ks_2samp(data0, dataw)
            ks_results.append({
                'feature': feat,
                'week': w,
                'ks_stat': stat,
                'pvalue': pval
            })
    return pd.DataFrame(ks_results)

In [None]:
def chi2_drift_tests(df, feature, max_week=12):
    """
    Perform chi-square tests for binary feature drift.
    """
    base = df[df['weeks_since'] == 0][feature].value_counts().sort_index().reindex([0,1], fill_value=0)
    chi_results = []
    for w in range(1, min(df['weeks_since'].max(), max_week) + 1):
        comp = (df[df['weeks_since'] == w][feature]
                .value_counts().sort_index()
                .reindex([0,1], fill_value=0))
        table = pd.DataFrame({'week0': base, f'week{w}': comp})
        chi2, p, dof, expected = chi2_contingency(table.values)
        chi_results.append({'week': w, 'chi2_stat': chi2, 'p_value': p})
    return pd.DataFrame(chi_results).set_index('week')

In [None]:
def plot_feature_drift(
    df_ks,
    df_chi,
    max_week=12,
    save_fig=False,
    fig_path="./plots/feature_drift.png"
):
    """
    Plot KS p-values for continuous features and chi-square p-values for one binary feature.
    """
    weeks = range(1, max_week+1)
    fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(8, 6))

    # Top: KS tests for continuous features
    for feat in df_ks['feature'].unique():
        sub = df_ks[df_ks['feature'] == feat]
        ax1.plot(sub['week'], sub['pvalue'], marker='o', label=feat)
    ax1.axhline(0.05, color='red', linestyle='--', label='α = 0.05')
    ax1.set_ylabel('KS p-value')
    ax1.set_title('Continuous Feature Drift (KS Test)')
    ax1.legend(loc='upper right')

    # Bottom: Chi-square test for binary feature
    ax2.plot(
        df_chi.index, df_chi['p_value'],
        marker='s', color='tab:blue', label='binary feature'
    )
    ax2.axhline(0.05, color='red', linestyle='--', label='α = 0.05')
    ax2.set_xlabel('Weeks since initial training (T0)')
    ax2.set_ylabel('Chi-square p-value')
    ax2.set_title('Binary Feature Drift (χ² Test)')
    ax2.legend(loc='upper right')

    plt.xticks(weeks)
    plt.tight_layout()

    if save_fig:
        os.makedirs(os.path.dirname(fig_path), exist_ok=True)
        fig.savefig(fig_path, dpi=150)

    plt.show()

In [None]:
def sliding_monthly_ari(
    df,
    features,
    best_k,
    base_end,
    n_windows=7,
    random_state=42
):
    """
    Compute ARI between the base-month model (M0) and sliding monthly models
    (M1..M6 with the default n_windows=7).
    
    Parameters
    ----------
    df : pd.DataFrame
        Must contain a datetime 'order_purchase_timestamp' column and the feature columns.
    features : list of str
        Feature columns for clustering.
    best_k : int
        Number of clusters for KMeans.
    base_end : pd.Timestamp
        End date for M0 window (exclusive).
    n_windows : int
        Number of monthly windows to evaluate (including M0).
    random_state : int
        Seed for KMeans.
        
    Returns
    -------
    pd.DataFrame with columns ['model','end_date','ari'], one row per model after M0.
    """
    
    models = []
    scalers = []
    end_dates = []
    
    # 1) Fit models M0..M6 on rolling 12-month windows
    for i in range(n_windows):
        end = base_end + relativedelta(months=i)
        start = end - relativedelta(years=1)
        mask = (df['order_purchase_timestamp'] >= start) & (df['order_purchase_timestamp'] < end)
        window = df.loc[mask, features]
        X = window.values
        
        scaler = StandardScaler().fit(X)
        Xs = scaler.transform(X)
        
        km = KMeans(n_clusters=best_k, random_state=random_state).fit(Xs)
        
        scalers.append(scaler)
        models.append(km)
        end_dates.append(end)
    
    # 2) Compute ARI between M0.predict and Mx.predict on the same X
    results = []
    for i in range(1, n_windows):
        end = end_dates[i]
        start = end - relativedelta(years=1)
        window = df.loc[(df['order_purchase_timestamp'] >= start) & (df['order_purchase_timestamp'] < end), features]
        X = window.values
        
        labels0 = models[0].predict(scalers[0].transform(X))
        labels_i = models[i].predict(scalers[i].transform(X))
        
        ari = adjusted_rand_score(labels0, labels_i)
        results.append({
            'model': f'M{i}',
            'end_date': end,
            'ari': ari
        })
    
    return pd.DataFrame(results)

## Creating DFs

## ARI revised

In [5]:
def biweekly_ari(
    df,
    features,
    best_k,
    lead_time_weeks=12,
    step_weeks=2,
    random_state=42
):
    """
    1) Define M0 end at (max_date - lead_time_weeks).
    2) Fit M0 on customer features aggregated over the prior 12 months.
    3) In steps of `step_weeks` (bi-weekly by default), slide the end date forward:
       - Recompute features over the year ending at the new end date.
       - Fit Mi on the customers present in both the M0 and Mi windows.
       - Compute ARI between the M0 and Mi labels for those shared customers.
    Returns a DataFrame of ARI vs. window-end.
    """
    # ensure datetime
    df['order_purchase_timestamp'] = pd.to_datetime(df['order_purchase_timestamp'])
    max_date   = df['order_purchase_timestamp'].max()
    
    # 1) M0 end and training window
    m0_end = max_date - relativedelta(weeks=lead_time_weeks)
    m0_start = m0_end - relativedelta(years=1)
    win0 = df[(df['order_purchase_timestamp'] >= m0_start) & (df['order_purchase_timestamp'] < m0_end)]
    
    # agg features on win0
    feat0 = (
        win0
        .groupby('customer_unique_id')
        .agg(
            last_purchase = ('order_purchase_timestamp', 'max'),
            frequency     = ('order_id','nunique'),
            m_price       = ('m_price','sum')
        )
        .reset_index()
    )
    feat0['recency']     = (m0_end - feat0['last_purchase']).dt.days
    feat0['f_returning'] = (feat0['frequency'] > 1).astype(int)
    feat0['m_price_log'] = np.log1p(feat0['m_price'])
    feat0 = feat0[['customer_unique_id'] + features].set_index('customer_unique_id')
    
    # scale & fit M0
    X0  = feat0.values
    scaler0 = StandardScaler().fit(X0)
    km0     = KMeans(n_clusters=best_k, random_state=random_state).fit(scaler0.transform(X0))
    
    # prepare loop over future ends
    results = []
    n_steps = lead_time_weeks // step_weeks  # 12/2 = 6
    for i in range(1, n_steps+1):
        end_i = m0_end + relativedelta(weeks=step_weeks*i)
        start_i = end_i - relativedelta(years=1)
        win_i = df[(df['order_purchase_timestamp'] >= start_i) & (df['order_purchase_timestamp'] < end_i)]
        
        feat_i = (
            win_i
            .groupby('customer_unique_id')
            .agg(
                last_purchase = ('order_purchase_timestamp', 'max'),
                frequency     = ('order_id','nunique'),
                m_price       = ('m_price','sum')
            )
            .reset_index()
        )
        feat_i['recency']     = (end_i - feat_i['last_purchase']).dt.days
        feat_i['f_returning'] = (feat_i['frequency'] > 1).astype(int)
        feat_i['m_price_log'] = np.log1p(feat_i['m_price'])
        feat_i = feat_i[['customer_unique_id'] + features].set_index('customer_unique_id')
        
        # intersect customers
        common = feat0.index.intersection(feat_i.index)
        X0_c = feat0.loc[common].values
        Xi_c = feat_i.loc[common].values
        
        # fit Mi
        scaleri = StandardScaler().fit(Xi_c)
        km_i    = KMeans(n_clusters=best_k, random_state=random_state).fit(scaleri.transform(Xi_c))
        
        # predict & ARI
        l0 = km0.predict(scaler0.transform(X0_c))
        li = km_i.predict(scaleri.transform(Xi_c))
        ari = adjusted_rand_score(l0, li)
        
        results.append({
            'step':       i,
            'end_date':   end_i,
            'weeks_from_M0': step_weeks * i,
            'n_common':   len(common),
            'ari':        ari
        })
    
    return pd.DataFrame(results)

### Cluster drift

The Adjusted Rand Index (ARI) measures how far the cluster assignments drift from M0, which in turn informs an appropriate maintenance cadence.

In [9]:
# ARI from Tmax - 12 weeks
df_ari_bi = biweekly_ari(
    df,
    features=['recency','f_returning','m_price_log'],
    best_k=4,
    lead_time_weeks=12,
    step_weeks=2
)
print(df_ari_bi)

   step            end_date  weeks_from_M0  n_common  ari
0     1 2018-06-20 15:00:37              2     63789 0.93
1     2 2018-07-04 15:00:37              4     62563 0.88
2     3 2018-07-18 15:00:37              6     60834 0.80
3     4 2018-08-01 15:00:37              8     59132 0.73
4     5 2018-08-15 15:00:37             10     57356 0.63
5     6 2018-08-29 15:00:37             12     55618 0.57
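
The retraining trigger can be read off this table programmatically. The sketch below is illustrative only: `first_breach_week` and `df_ari_demo` are hypothetical names, and the ARI values are copied from the printed output, which is rounded to two decimals. The unrounded value at week 6 may therefore fall on either side of the 0.8 threshold.

```python
import pandas as pd

# ARI values copied from the printed table; they are rounded to two decimals,
# so the unrounded value at week 6 may sit on either side of 0.8.
df_ari_demo = pd.DataFrame({
    "weeks_from_M0": [2, 4, 6, 8, 10, 12],
    "ari": [0.93, 0.88, 0.80, 0.73, 0.63, 0.57],
})


def first_breach_week(df_ari, threshold=0.8):
    """Return the first week whose ARI is strictly below `threshold`, or None.

    Assumes the rows are already sorted by `weeks_from_M0`.
    """
    below = df_ari.loc[df_ari["ari"] < threshold, "weeks_from_M0"]
    return int(below.iloc[0]) if not below.empty else None


print(first_breach_week(df_ari_demo))  # prints 8 for these rounded values
```

With a strict comparison, week 6 (displayed as 0.80) does not count as a breach here. Running the helper on the unrounded `df_ari_bi` would settle whether the trigger falls at week 6 or week 8.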


In [None]:
# ARI from Tmax - 8 weeks
df_ari_bi = biweekly_ari(
    df,
    features=['recency','f_returning','m_price_log'],
    best_k=4,
    lead_time_weeks=8,
    step_weeks=2
)
print(df_ari_bi)

In [None]:
# Visualise ARI
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(df_ari_bi['weeks_from_M0'], df_ari_bi['ari'], marker='o', linestyle='-')
ax.axhline(0.8, color='red', linestyle='--', label='ARI = 0.8')
ax.set_xlabel('Weeks Since M0 Max')
ax.set_ylabel('ARI')
ax.set_title('M0 validity over time (RFM)')
ax.legend()
plt.tight_layout()
os.makedirs("./plots/maintenance", exist_ok=True)  # ensure the output folder exists
plt.savefig("./plots/maintenance/M0_ARI.png")
plt.show()

Notes:
* Thresholds are defined between 0 and 1 for a model trained on 3 years of data; justify with scaling.
* Test with different T0 values (normally testing forwards rather than backwards).
* Use the recency curve to decide on an overall time frame (1 year for seasonality, or 18 months).
* Automate the check every 2 weeks to trigger retraining.
* Min date and max date need to be fixed: 12 months (train features on this time frame).
* Add one week (T0 - 1 week): M0 trains on 12 months, M1 = 12 months + 1 week.

Monthly windows (max = August 2018):
* M0 = Feb 2017 - Feb 2018 (trained Feb to Feb and predicted on Mx)
* M1 = March 2017 - March 2018 (delta 1 month)
* M2 = April 2017 - April 2018
* ... M6 (fit and predict for each Mx)

Can an older model be as effective as one trained on the new time frame?

ARI: at each step, compare the clusters of M0 and Mx.
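
The monthly windows listed in the notes above can be generated rather than typed by hand. This is a minimal sketch under one stated assumption: the notes only give month-level boundaries, so the windows here are aligned to the first day of the month. `df_windows` is a hypothetical name used for illustration.

```python
import pandas as pd
from dateutil.relativedelta import relativedelta

# Assumption: windows are aligned to the first day of the month; the notes
# only give month-level boundaries.
base_end = pd.Timestamp("2018-02-01")

windows = []
for i in range(7):  # M0..M6
    end = base_end + relativedelta(months=i)   # exclusive end of the window
    start = end - relativedelta(years=1)       # 12-month training window
    windows.append({"model": f"M{i}", "start": start, "end": end})

df_windows = pd.DataFrame(windows)
print(df_windows)
```

Each row gives the half-open interval `[start, end)` that a model Mx would be trained on, the same convention `sliding_monthly_ari` uses for its masks.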


### Feature drift
Kolmogorov–Smirnov (KS) tests are used to assess how the continuous features (recency and m_price_log) shift over time, and a chi-square test is used to assess how the binary feature (f_returning) shifts over time.
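
As a quick illustration of how the two tests behave, the sketch below runs them on synthetic data (generated here, not taken from the notebook): a continuous feature whose mean shifts by one unit, and a binary flag whose proportions barely move. All variable names are illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp, chi2_contingency

rng = np.random.default_rng(0)

# Synthetic stand-ins for a continuous feature (not notebook data):
# week 0 versus a later week whose distribution has shifted by one unit.
week0 = rng.normal(loc=0.0, scale=1.0, size=500)
week_shifted = rng.normal(loc=1.0, scale=1.0, size=500)

ks_stat, ks_p = ks_2samp(week0, week_shifted)
print(f"KS statistic = {ks_stat:.3f}, p-value = {ks_p:.3g}")

# Synthetic counts for a binary flag: rows are [flag = 0, flag = 1],
# columns are [week 0, later week]. The proportions barely change.
table = np.array([[800, 790],
                  [200, 210]])
chi2, chi_p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, p-value = {chi_p:.3f}, dof = {dof}")
```

A p-value below 0.05 flags drift for that feature and week. The KS statistic is the largest gap between the two empirical distribution functions, so it is worth reading alongside the p-value as a measure of how large the shift is.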

In [None]:
# Chi2 drift
df_chi = chi2_drift_tests(df, 'f_returning', max_week=12)

# KS drift
df_ks  = ks_drift_tests(df, ['recency','m_price_log'], max_week=12)

plot_feature_drift(df_ks, df_chi,
                   max_week=12,
                   save_fig=True,
                   fig_path="./plots/maintenance/feature_drift_rfm.png")

Interpretation:
* Recency: The recency distribution differs significantly from week 0 at every weekly snapshot. This is expected, because recency is “days since last order” and shifts by construction as the snapshot date moves. Recency will therefore always drive part of the ARI decay.
* Price: Week 1’s price distribution is not statistically distinguishable from week 0. Shifts in customer spend start to affect the clusters after about 2 weeks.
* f_returning: Extremely stable over time. The fraction of customers who have returned at least once stays essentially constant across all snapshots, which makes it a “safe” feature for segmentation. Because it does not drift, it does not erode the cluster definitions, and it does not justify frequent retraining on its own.