# Automating model retraining

In this notebook we will use drift metrics and model performance metrics to decide whether it's a good time to retrain and deploy a new version of our model.

The workflow is:
1. Schedule this notebook to run every x days.
2. Test whether the retrain conditions are satisfied.
3. Train and deploy a new version if the conditions in step 2 are met.
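
The steps above can be sketched as a small job function. This is a minimal illustration, not part of the MLServe SDK: `run_retrain_job`, `conditions`, and `train_and_deploy` are hypothetical names, and the lambdas in the usage are stand-ins for real checks and a real training routine.

```python
from typing import Callable, Dict


def run_retrain_job(
    conditions: Dict[str, Callable[[], bool]],
    train_and_deploy: Callable[[], str],
) -> Dict[str, object]:
    """Evaluate every retrain condition and retrain if any of them fires."""
    # Step 2: evaluate each named condition once
    results = {name: bool(check()) for name, check in conditions.items()}
    triggered = [name for name, fired in results.items() if fired]

    # Step 3: train and deploy only when at least one condition fired
    deployed_version = train_and_deploy() if triggered else None
    return {
        "results": results,
        "triggered": triggered,
        "deployed_version": deployed_version,
    }


# Usage with stand-in callables (placeholders, not real checks)
report = run_retrain_job(
    conditions={"drift": lambda: False, "accuracy": lambda: True},
    train_and_deploy=lambda: "v2",
)
print(report["triggered"], report["deployed_version"])
```

Keeping the conditions as named callables makes it easy to log which rule caused a retrain when the notebook runs unattended.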

## A note on data privacy
MLServe.com asks you to provide a background dataset (a sample of the input features used during training), which is used to calculate drift and detect outliers. Under the hood, the platform **estimates averages and counts** from this sample and does not store the raw data. The same happens during inference. This lets us provide good data quality estimates without sacrificing privacy.
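
To make the idea concrete, here is a minimal sketch of how averages and counts can be maintained without keeping raw rows. It is an illustration only and an assumption about the general technique, not MLServe's actual implementation; `NumericSummary` and `CategoricalSummary` are hypothetical names.

```python
from collections import Counter


class NumericSummary:
    """Running count and mean for one numeric feature."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0

    def update(self, value: float) -> None:
        # Incremental mean: only the aggregate is kept, not the value itself
        self.count += 1
        self.mean += (value - self.mean) / self.count


class CategoricalSummary:
    """Running category counts for one categorical feature."""

    def __init__(self) -> None:
        self.counts = Counter()

    def update(self, value: str) -> None:
        self.counts[value] += 1


charges = NumericSummary()
for monthly_charge in [20.0, 60.0, 100.0]:
    charges.update(monthly_charge)

contracts = CategoricalSummary()
for contract in ["Month-to-month", "One year", "Month-to-month"]:
    contracts.update(contract)

print(charges.count, charges.mean)
print(dict(contracts.counts))
```

After the loops finish, only the aggregates remain, which is enough to compare training and inference distributions.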

To calculate model accuracy, we have to store the predictions and the true values from the feedback you provide, but there is no way to match them with the input features, as these are not stored on our platform.
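
As a rough sketch of why input features are not needed here, accuracy can be computed from prediction IDs, predicted labels, and true values alone. The function `online_accuracy` and the data layout are assumptions made for illustration; only the `prediction_id` and `true_value` keys mirror the feedback records used later in this notebook.

```python
from typing import Dict, List, Optional


def online_accuracy(predicted: Dict[str, int], feedback: List[dict]) -> Optional[float]:
    """Accuracy over the feedback records whose prediction_id is known."""
    matched = [
        (predicted[f["prediction_id"]], f["true_value"])
        for f in feedback
        if f["prediction_id"] in predicted
    ]
    if not matched:
        return None  # no usable feedback yet, so accuracy is undefined
    correct = sum(1 for pred, true in matched if pred == true)
    return correct / len(matched)


# Only IDs, predictions, and true values are involved; no input features
predicted = {"p1": 1, "p2": 0, "p3": 1, "p4": 0}
feedback = [
    {"prediction_id": "p1", "true_value": 1},
    {"prediction_id": "p2", "true_value": 1},
    {"prediction_id": "p3", "true_value": 1},
    {"prediction_id": "p9", "true_value": 0},  # unknown ID, ignored
]
accuracy = online_accuracy(predicted, feedback)
print(accuracy)
```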

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os

import numpy as np
import pandas as pd
from dotenv import load_dotenv
from mlserve_sdk.client import MLServeClient
from xgboost import XGBClassifier

load_dotenv()

True

In [3]:
def query_to_dataframe(n_samples=1000, missing_frac=0.05, random_state=42):
    """
    Generate synthetic churn dataset for ML benchmarking.

    Parameters
    ----------
    n_samples : int
        Number of rows to generate.
    missing_frac : float
        Fraction of missing values to inject per column (0–1).
    random_state : int
        Seed for reproducibility.

    Returns
    -------
    X : pd.DataFrame
        Feature matrix with categorical & numerical features.
    y : pd.Series
        Binary churn target (0 = no churn, 1 = churn).
    """
    np.random.seed(random_state)

    # Generate synthetic features
    data = {
        "customer_id": np.arange(1, n_samples+1),
        "age": np.random.randint(18, 80, n_samples),
        "tenure_months": np.random.randint(1, 72, n_samples),
        "monthly_charges": np.round(np.random.uniform(20, 120, n_samples), 2),
        "total_charges": np.round(np.random.uniform(20, 8000, n_samples), 2),
        "contract_type": np.random.choice(
            ["Month-to-month", "One year", "Two year"], n_samples, p=[0.6, 0.25, 0.15]
        ),
        "payment_method": np.random.choice(
            ["Electronic check", "Mailed check", "Bank transfer", "Credit card"], n_samples
        ),
        "internet_service": np.random.choice(
            ["DSL", "Fiber optic", "No"], n_samples, p=[0.3, 0.5, 0.2]
        ),
        "gender": np.random.choice(["Male", "Female"], n_samples),
        "has_phone_service": np.random.choice(["Yes", "No"], n_samples, p=[0.9, 0.1]),
        "num_dependents": np.random.poisson(1, n_samples),  # ~0-4 mostly
    }

    X = pd.DataFrame(data)

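    # Inject missing values, using a different random subset of rows for each column
    # (a single shared seed would blank out the same rows in every column)
    if missing_frac > 0:
        for i, col in enumerate(X.columns.drop("customer_id")):
            missing_idx = X.sample(frac=missing_frac, random_state=random_state + i).index
            X.loc[missing_idx, col] = np.nan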

    # Churn probability (synthetic rules + noise)
    prob_churn = (
        0.3 * (X["contract_type"] == "Month-to-month").astype(float) +
        0.25 * (X["internet_service"] == "Fiber optic").astype(float) +
        0.15 * (X["payment_method"] == "Electronic check").astype(float) +
        0.002 * (X["monthly_charges"].fillna(60)) +
        0.01 * (X["num_dependents"].fillna(0) == 0).astype(float) +
        np.random.normal(0, 0.1, n_samples)
    )
    prob_churn = 1 / (1 + np.exp(-prob_churn))  # sigmoid

    y = pd.Series(np.random.binomial(1, prob_churn), name="churn")

    return X, y

In [4]:
X, y = query_to_dataframe()
X.drop(columns=["customer_id"], inplace=True)
for col in ["contract_type", "payment_method", "internet_service", "gender", "has_phone_service"]:
    X[col] = X[col].astype("category")

model = XGBClassifier(enable_categorical=True)
model.fit(X, y)

XGBClassifier(objective='binary:logistic', enable_categorical=True, ...)


In [5]:
USERNAME = os.getenv("USERNAME")
TOKEN = os.getenv("TOKEN")

client = MLServeClient()
client.login(USERNAME, TOKEN)

We start by deploying the baseline model trained above. We then request a few predictions and send back feedback so that performance metrics can be estimated.

In [7]:
try:
    lv = client.get_latest_version("churn")
    next_version = lv["next_version"]
except Exception:
    # No deployed version found (or the lookup failed): start from v1
    next_version = "v1"

print(next_version)

v1


In [8]:
client.deploy(
    model=model,
    name="churn",
    version=next_version,
    features=list(X),
    background_df=X.sample(500),
    metrics={'accuracy':model.score(X, y)},
    task_type='classification'
)

{'predict_url': 'https://mlserve.com/api/v1/predict/churn/v1'}

In [12]:
%%time

TEST_DATA = {
    "features": X.columns.tolist(),
    "inputs": X.values.tolist()
}
preds = client.predict("churn", next_version, TEST_DATA, explain=True)  # the version deployed above
print(f"predictions for {len(X)} records")

predictions for 1000 records
CPU times: user 59.4 ms, sys: 6.22 ms, total: 65.7 ms
Wall time: 1.21 s


In [17]:
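# Simulate feedback for the first 100 predictions. In production, true_value
# would be the observed outcome; here it is random, so the measured accuracy
# will hover around 50%.
test_ids = preds["prediction_ids"][:100]

feedback = []
for tid in test_ids:
    val = np.random.randint(0, 2)  # random true label: 0 or 1
    r = np.random.normal(10, 7)    # random reward
    feedback.append({"prediction_id": tid, "true_value": val, "reward": r})

client.send_feedback(feedback)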

{'status': 'ok', 'updated': 100, 'not_found': []}
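
For larger volumes of feedback, it may be convenient to send records in batches and aggregate the responses. The sketch below assumes only the response shape shown above (`updated` and `not_found`); `send_feedback_in_batches` is a hypothetical helper, and whether the platform requires batching at all is not stated here. The `fake_send` function stands in for `client.send_feedback` so the example runs offline.

```python
from typing import Callable, Dict, List


def send_feedback_in_batches(
    send: Callable[[List[dict]], dict],
    feedback: List[dict],
    batch_size: int = 50,
) -> Dict[str, object]:
    """Send feedback in fixed-size batches and combine the responses."""
    updated = 0
    not_found: List[str] = []
    for start in range(0, len(feedback), batch_size):
        response = send(feedback[start:start + batch_size])
        updated += response.get("updated", 0)
        not_found.extend(response.get("not_found", []))
    return {"updated": updated, "not_found": not_found}


def fake_send(batch: List[dict]) -> dict:
    """Offline stand-in that mimics the response shape shown above."""
    return {"status": "ok", "updated": len(batch), "not_found": []}


records = [{"prediction_id": f"id-{i}", "true_value": i % 2} for i in range(120)]
summary = send_feedback_in_batches(fake_send, records, batch_size=50)
print(summary)
```

In the notebook, `fake_send` would be replaced by `client.send_feedback`; a non-empty `not_found` list is worth logging, since those records do not contribute to the online metrics.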

# Do I need to retrain?

*In this example we use data quality and online metrics to decide whether we need to retrain a model*

In [27]:
lv = client.get_latest_version("churn")

# Rule 1 (drift detection): mean % difference between training and inference data
d = client.get_data_quality(lv["model"], lv["latest_version"], as_dataframe=True)
retrain_drift = d["drift"]["pct_mean_diff"].mean() > 0.1
print("Should retrain (drift)? ->", retrain_drift)

# Rule 2 (online metrics): model accuracy over the last 24 hours
oms = client.get_online_metrics(lv["model"], lv["latest_version"], window_hours=24, as_dataframe=True)
retrain_accuracy = oms.loc[0, "accuracy"] < 0.7
print("Should retrain (accuracy)? ->", retrain_accuracy)

# Retrain if either rule fires (previously the second rule overwrote the first)
retrain = bool(retrain_drift or retrain_accuracy)

Should retrain (drift)? -> False
Should retrain (accuracy)? -> True
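
To give some intuition for the drift rule, here is one plausible reading of a "percent mean difference": the absolute change of the inference mean relative to the training mean. This is an assumption for illustration; the exact definition of `pct_mean_diff` used by the platform (including whether it is signed) is not specified in this notebook, and the function below is hypothetical.

```python
from statistics import fmean
from typing import Optional, Sequence


def pct_mean_diff(train_values: Sequence[float], inference_values: Sequence[float]) -> Optional[float]:
    """Absolute change of the inference mean relative to the training mean."""
    train_mean = fmean(train_values)
    if train_mean == 0:
        return None  # a relative difference is undefined for a zero baseline
    return abs(fmean(inference_values) - train_mean) / abs(train_mean)


train_charges = [50.0, 60.0, 70.0]      # training mean = 60
inference_charges = [60.0, 69.0, 78.0]  # inference mean = 69

drift = pct_mean_diff(train_charges, inference_charges)
should_retrain = drift is not None and drift > 0.1  # same 10% threshold as the rule above
print(round(drift, 2), should_retrain)
```

Under this reading, a threshold of 0.1 means the rule fires when feature means have moved by more than 10% on average.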


In [23]:
try:
    lv = client.get_latest_version("churn")
    model_name, next_version = lv["model"], lv["next_version"]
except Exception:
    # No deployed version found (or the lookup failed): start from v1
    model_name, next_version = "churn", "v1"

print(next_version)

if retrain:
    # Pull fresh training data and prepare it exactly as for the baseline model
    X, y = query_to_dataframe()
    X.drop(columns=["customer_id"], inplace=True)
    for col in ["contract_type", "payment_method", "internet_service", "gender", "has_phone_service"]:
        X[col] = X[col].astype("category")

    model = XGBClassifier(enable_categorical=True)
    model.fit(X, y)

    client.deploy(
        model=model,
        name=model_name,
        version=next_version,
        features=list(X),
        background_df=X.sample(200),
        metrics={"accuracy": model.score(X, y)},  # accuracy on the training data
        task_type="classification",
    )

v2
