# Deployment and Monitoring

This notebook describes how the credit risk models developed in notebooks 00–05 could be operationalized in a production setting. It is not a hands-on implementation; it outlines the main considerations and processes: preprocessing, inference, shadow testing, monitoring, and retraining triggers.

The examples use mocked and derived data to help illustrate key aspects of the pipeline, without accessing production data from LendingClub. The focus remains on process design, workflow, and governance, rather than on specific production details.

In [None]:
import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(42)

## Problem setup: existing production environment

We assume a setting that closely mirrors a real-world credit platform:

- There is an existing production model (the champion) that currently drives approval, pricing, and portfolio composition.

- Observed outcomes (defaults, repayments) are available only for approved applicants.

- Both approval and pricing are policy-driven, creating explicit selection bias in the observed data.

- New models (e.g., logistic regression and XGBoost) are evaluated as challengers, but cannot be deployed blindly without careful validation.

This setup implies that traditional offline evaluation is insufficient. Any deployment strategy must explicitly account for the fact that we do not observe counterfactual outcomes for applicants who would have been rejected under the incumbent policy.

## From offline evaluation to deployment reality

In earlier notebooks, we showed that the challenger models provide improved risk ordering within the approved population. However, deployment requires answering a different set of questions:

- How do we test a new model without disrupting production decisions?

- How do we monitor model behavior when outcomes are only partially observable?

- How do we detect drift, degradation, or unintended policy interactions over time?

To address these constraints, we adopt a shadow deployment framework.

## Shadow testing under selection bias

In shadow mode, the challenger model runs in parallel with the production model, receiving the same inputs and producing predictions, but not influencing decisions.

This creates three conceptual regions:

- Approved by champion, approved by challenger

    - Fully observable outcomes; safe region for direct comparison.

- Approved by champion, rejected by challenger

    - Indicates potential risk reduction; outcomes are observable and informative.

- Rejected by champion, approved by challenger (shadow region)

    - Outcomes are not observed under the current policy, but this region is critical for understanding incremental expansion risk.

The third region is particularly important: it contains applicants the challenger would accept but the production model currently filters out. Because we do not observe outcomes for these applicants, we cannot immediately claim gains or losses. Instead, this region must be monitored indirectly using score distributions, feature drift, stability metrics, and pricing-adjusted risk assumptions, as developed earlier.

One practical approach is to randomly approve a small subset of applicants in this shadow region under the new policy. This yields actual outcome data for the segment at a bounded risk and cost, and improves our ability to evaluate the true impact of a policy change.
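The region logic and the random-approval idea can be sketched as follows. This is a minimal illustration on mocked scores, not the production logic: the column names, the approval cutoffs, and the 5% `EXPLORATION_RATE` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Mocked predicted default probabilities (illustrative only).
n = 10_000
champion_pd = rng.beta(2, 10, size=n)
challenger_pd = np.clip(champion_pd + rng.normal(0, 0.03, size=n), 0.001, 0.999)
apps = pd.DataFrame({"champion_pd": champion_pd, "challenger_pd": challenger_pd})

# Hypothetical cutoffs: approve when the predicted PD is below the cutoff.
CHAMPION_CUTOFF = 0.20
CHALLENGER_CUTOFF = 0.20

champ_ok = apps["champion_pd"] < CHAMPION_CUTOFF
chall_ok = apps["challenger_pd"] < CHALLENGER_CUTOFF

# Label each applicant with its agreement/disagreement region.
apps["region"] = np.select(
    [champ_ok & chall_ok, champ_ok & ~chall_ok, ~champ_ok & chall_ok],
    ["both_approve", "champion_only", "shadow_approve"],
    default="both_reject",
)

# Randomly approve a small share of the shadow region to observe real outcomes.
EXPLORATION_RATE = 0.05  # assumed value; in practice set by risk appetite
in_shadow = apps["region"] == "shadow_approve"
apps["exploration_approve"] = in_shadow & (rng.random(n) < EXPLORATION_RATE)

print(apps["region"].value_counts())
print("exploration approvals:", int(apps["exploration_approve"].sum()))
```

Because the exploration sample is drawn at random within the shadow region, outcomes observed on it can be scaled up to the whole region, which is what makes the sample informative despite its small size.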

## Moving from Shadow Mode to Production Decisions

After a challenger model demonstrates stable and reliable behavior in shadow mode, the key challenge is how to safely transition it into actual credit decision-making. This process requires careful planning to overcome selection bias, outcome delays, and coupled policy effects (approval + pricing).

Before allowing the model to directly influence approvals or pricing, several core questions must be addressed:

- Does the challenger remain stable under real-world traffic and data drift?
- Can we quantify and mitigate risks in segments not observed by the incumbent model (the shadow approval region)?
- Have we validated the end-to-end pipeline—including preprocessing, inference, agreement/disagreement analysis, and calibration—under production constraints?
- Do we have robust online monitoring to promptly detect unexpected shifts or adverse selection?
- Is there a clear rollback plan if portfolio risk rises?

The typical path forward is a progressive, controlled rollout. This means:

- Routing a small, predefined share of applications (e.g., 5–10%) or specific score bands to the challenger for real decisions.
- Observing actual outcomes in these newly accessible regions, while maintaining strict oversight and the ability to revert if necessary.
- Using reject inference techniques cautiously to bound potential risks in populations that still lack observed outcomes (e.g., assigning conservative default rates to newly approved applicants).
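One way to route a fixed share of applications is a deterministic hash of the application identifier, so the same application always lands in the same arm. The sketch below assumes a string identifier and uses a hypothetical `route_to_challenger` helper with an illustrative salt; none of these names come from the earlier notebooks.

```python
import hashlib


def route_to_challenger(application_id: str, share: float = 0.05, salt: str = "rollout-v1") -> bool:
    """Deterministically assign an application to the challenger arm.

    The salted hash maps each id to a pseudo-uniform value in [0, 1);
    ids whose value falls below `share` are routed to the challenger.
    """
    digest = hashlib.sha256(f"{salt}:{application_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x1_0000_0000  # first 32 bits scaled to [0, 1)
    return bucket < share


# Check that the realized share is close to the target on mocked ids.
ids = [f"app-{i}" for i in range(100_000)]
observed_share = sum(route_to_challenger(a) for a in ids) / len(ids)
print(f"observed challenger share: {observed_share:.4f}")
```

Changing the salt reshuffles the assignment, which is useful when a new experiment should not reuse the previous experiment's groups. Raising `share` keeps every previously routed application in the challenger arm, so the rollout can be widened without reassigning anyone.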

Ultimately, moving a new model into production requires more than offline metrics: it demands live, granular outcome monitoring, rapid feedback loops, and a risk-aware deployment playbook. Only after the challenger proves safety and outperformance under these real production conditions should its deployment share be expanded.

## Experiment Design

Offline metrics (AUC, KS, lift) are necessary but not sufficient for production decisions in credit risk. In production, outcomes arrive with delay, approvals are policy-shaped, and profitability depends on pricing and portfolio mix. Experiment design is therefore about answering a causal question as safely as possible:

“If we switch decisioning from the champion to the challenger, what changes in outcomes and risk-adjusted performance should we expect — and how confident are we?”

Below we outline three practical designs, ordered from the strongest causal identification to the most governance-friendly.

### Classical A/B test (when allowed)

A classical A/B test randomizes applicants into two groups:

* Group A (Champion): decisions and pricing follow the current production model.
* Group B (Challenger): decisions and pricing follow the challenger model.

Key properties:

* Randomization ensures the two groups are comparable *ex-ante*.
* We can estimate causal effects on outcomes (default rate, loss proxies, conversion, etc.).
* This is the cleanest way to evaluate the *full policy change* (score + decision threshold + pricing behavior).

Practical constraints in credit:

* Full A/B may be restricted by compliance or risk appetite.
* The riskiest bands may be excluded or downsampled.
* Outcome maturity is slow (e.g., 36 months), so we often rely on interim proxies.

What we typically measure (early + late):

* Early: approval rate, take-up, early delinquency proxies (e.g., 30+ DPD), payment behavior.
* Medium: charge-off signals, roll rates, stability metrics.
* Late: final default definition, cashflow-based metrics (if available).

### Bayesian monitoring (often best for risk)

Credit experiments are usually slow and expensive:

* Defaults are relatively rare (especially in prime bands),
* outcomes take months/years,
* and the business wants *continuous* signals rather than a single “p-value at the end”.

Bayesian monitoring is well-suited because it:

* updates evidence continuously as data comes in,
* produces probability statements aligned with risk governance (e.g., “80% chance challenger reduces default rate by at least 20 bps”),
* supports early stop / early rollback decisions naturally.

Typical setup:

* Define a prior belief about default rates (based on dev/validation results, shadow mode, historical baseline).
* Observe production outcomes as they mature.
* Update posterior distributions for champion and challenger.
* Track decision-relevant probabilities:

  * $P(p_{challenger} < p_{champion})$
  * $P(p_{challenger} < p_{champion} - \delta)$ for a minimum improvement threshold $\delta$
  * Expected loss under each, with uncertainty bounds.

### Mocked Example A — Classical A/B Test (Frequentist)

We run a small controlled A/B test for 4–8 weeks in a “safe” population slice (e.g., middle score bands) to avoid extreme-tail exposure. Applicants are randomly routed to the champion or the challenger, and each model decides approvals and applies its associated pricing policy. Once enough early outcomes have matured (or an early proxy such as 60+ DPD is available), we compare event rates between the two groups.

What we do:

* Create two groups: A (champion), B (challenger).
* Observe $(n_A, d_A)$ and $(n_B, d_B)$, where $n$ is the group size and $d$ is the number of defaults (or proxy events).
* Compute default rates and a statistical significance test for difference in proportions.

Decision framing (example):

* If challenger reduces default rate by at least X bps and the result is statistically significant, we expand rollout.
* If not significant but directionally positive, we extend duration or widen sample.
* If significantly worse (or triggers risk thresholds), rollback immediately.
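Before running such a test, it helps to check how much traffic the comparison needs. The sketch below applies the standard normal-approximation sample-size formula for a two-sided two-proportion z-test. The helper name `n_per_arm`, the 80% power target, and the example rates (13.0% vs. 12.2% and 13.0% vs. 12.5%) are assumptions chosen for illustration.

```python
import math

from scipy import stats


def n_per_arm(p_a: float, p_b: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-arm sample size for a two-sided two-proportion z-test."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # critical value for the test
    z_beta = stats.norm.ppf(power)           # quantile for the desired power
    p_bar = (p_a + p_b) / 2                  # pooled rate under the null
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))
    ) ** 2
    return math.ceil(numerator / (p_a - p_b) ** 2)


n_80bps = n_per_arm(0.130, 0.122)  # detect an 80 bps improvement
n_50bps = n_per_arm(0.130, 0.125)  # detect a 50 bps improvement
print(f"per-arm n for 80 bps: {n_80bps:,}")
print(f"per-arm n for 50 bps: {n_50bps:,}")
```

Under these assumptions, an 80 bps gap at a 13% base rate needs roughly 27,000 applicants per arm, and a 50 bps gap needs considerably more. This is why a short test on a narrow slice often ends in "extend" rather than a clear promote or rollback.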

In [2]:
# ============================================================
# Mocked Example A — Classical A/B Test (Frequentist)
# ============================================================

# --- Mock experiment setup ---
n_total = 50_000
traffic_split = 0.5
n_A = int(n_total * traffic_split)  # Champion
n_B = n_total - n_A  # Challenger

# Assume we're measuring an "early outcome proxy" (e.g., 60+ DPD) within a short window.
# Use plausible rates for an approved population slice.
p_A_true = 0.130  # 13.0% event rate under champion
p_B_true = 0.122  # 12.2% under challenger (improvement of 80 bps)

# Simulate observed events
d_A = np.random.binomial(n_A, p_A_true)
d_B = np.random.binomial(n_B, p_B_true)

pA_hat = d_A / n_A
pB_hat = d_B / n_B
delta = pB_hat - pA_hat  # challenger - champion (negative is good)

# --- Two-proportion z-test (difference in proportions) ---
p_pool = (d_A + d_B) / (n_A + n_B)
se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n_A + 1 / n_B))
z = (pB_hat - pA_hat) / se_pool
p_value = 2 * (1 - stats.norm.cdf(abs(z)))

# --- 95% CI for difference (unpooled/Wald) ---
se_unpooled = np.sqrt(pA_hat * (1 - pA_hat) / n_A + pB_hat * (1 - pB_hat) / n_B)
ci_low = delta - 1.96 * se_unpooled
ci_high = delta + 1.96 * se_unpooled

# --- Decision logic (example) ---
min_improvement = -0.005  # challenger should be at least 50 bps better (delta <= -0.005)
alpha = 0.05  # significance threshold

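# Default: not significant, or significant but below the minimum improvement.
decision = "extend_test"
if (p_value < alpha) and (delta <= min_improvement):
    # Significant and at least 50 bps better: expand rollout.
    decision = "promote_challenger"
elif (p_value < alpha) and (delta > 0):
    # Significantly worse: roll back, matching the decision framing above.
    decision = "rollback_challenger"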

ab_summary = pd.DataFrame(
    [
        {"group": "A_champion", "n": n_A, "events": d_A, "event_rate": pA_hat},
        {"group": "B_challenger", "n": n_B, "events": d_B, "event_rate": pB_hat},
    ]
)

test_stats = pd.DataFrame(
    [
        {
            "delta_B_minus_A": delta,
            "delta_bps": delta * 10_000,
            "z_value": z,
            "p_value": p_value,
            "ci95_low": ci_low,
            "ci95_high": ci_high,
            "decision": decision,
        }
    ]
)

print("=== A/B Summary ===")
display(ab_summary)

print("=== Two-proportion z-test (B - A) ===")
display(test_stats)

=== A/B Summary ===


|   | group | n | events | event_rate |
|---|---|---|---|---|
| 0 | A_champion | 25000 | 3256 | 0.13024 |
| 1 | B_challenger | 25000 | 2964 | 0.11856 |


=== Two-proportion z-test (B - A) ===


|   | delta_B_minus_A | delta_bps | z_value | p_value | ci95_low | ci95_high | decision |
|---|---|---|---|---|---|---|---|
| 0 | -0.01168 | -116.8 | -3.956715 | 7.6e-05 | -0.017465 | -0.005895 | promote_challenger |


### Mocked Example B — Bayesian Monitoring (Sequential / Risk-Friendly)

Before production, we believed the challenger would reduce default rate modestly (e.g., from ~13% to ~12.5%) based on dev metrics and shadow results. We encode that belief as a prior. As production outcomes arrive, we update the posterior continuously and track the probability that challenger is better than champion, with a minimum improvement threshold.

What we do:

* Use Beta priors for default probabilities:

  * $p_A \sim \text{Beta}(\alpha_A, \beta_A)$
  * $p_B \sim \text{Beta}(\alpha_B, \beta_B)$
* Update with observed outcomes:

  * Posterior $p \mid data \sim \text{Beta}(\alpha + d, \beta + n-d)$
* Compute:

  * $P(p_B < p_A)$
  * $P(p_B < p_A - \delta)$ for a practical improvement threshold $\delta$

Stopping rules (example):

* Promote challenger if:

  * $P(p_B < p_A - \delta) > 0.95$ (high confidence of meaningful improvement)
* Continue monitoring if:

  * $0.6 < P(p_B < p_A) < 0.95$
* Rollback if:

  * $P(p_B > p_A + \delta_{harm}) > 0.90$

This aligns better with credit governance because it communicates risk in probabilistic terms and supports safe incremental decisions without waiting for a single fixed “end of test”.

In [4]:
# ============================================================
# Mocked Example B — Bayesian Monitoring (Sequential)
# ============================================================
import numpy as np
import pandas as pd
from scipy import stats  # needed for stats.beta.ppf in beta_ci

np.random.seed(7)


def beta_posterior_params(alpha, beta, n, d):
    return alpha + d, beta + (n - d)


def beta_mean(alpha, beta):
    return alpha / (alpha + beta)


def beta_ci(alpha, beta, q=(0.05, 0.95)):
    return stats.beta.ppf(q[0], alpha, beta), stats.beta.ppf(q[1], alpha, beta)


def prob_B_better(alphaA, betaA, alphaB, betaB, n_mc=200_000, delta=0.0):
    # Monte Carlo probability that pB < pA - delta
    pA = np.random.beta(alphaA, betaA, size=n_mc)
    pB = np.random.beta(alphaB, betaB, size=n_mc)
    return np.mean(pB < (pA - delta))


# --- Priors informed by dev/shadow beliefs ---
# Example belief: champion around 13% and challenger around 12.5%, with moderate confidence.
# Using prior strength ~ 2,000 pseudo-observations.
prior_strength = 2000
pA_prior = 0.130
pB_prior = 0.125

alphaA0 = pA_prior * prior_strength
betaA0 = (1 - pA_prior) * prior_strength
alphaB0 = pB_prior * prior_strength
betaB0 = (1 - pB_prior) * prior_strength

# --- Streaming production results in weekly batches ---
weeks = 10
batch_n_A = 2500
batch_n_B = 2500

# True (unknown) underlying event rates in production
p_A_prod = 0.131
p_B_prod = 0.123

records = []
alphaA, betaA = alphaA0, betaA0
alphaB, betaB = alphaB0, betaB0

# Practical thresholds for decisions
delta_min = 0.005  # meaningful improvement threshold (50 bps)
harm_min = 0.005  # meaningful harm threshold (50 bps)

promote_prob = 0.95
rollback_prob = 0.90

for w in range(1, weeks + 1):
    dA = np.random.binomial(batch_n_A, p_A_prod)
    dB = np.random.binomial(batch_n_B, p_B_prod)

    alphaA, betaA = beta_posterior_params(alphaA, betaA, batch_n_A, dA)
    alphaB, betaB = beta_posterior_params(alphaB, betaB, batch_n_B, dB)

    meanA = beta_mean(alphaA, betaA)
    meanB = beta_mean(alphaB, betaB)
    ciA = beta_ci(alphaA, betaA, q=(0.05, 0.95))
    ciB = beta_ci(alphaB, betaB, q=(0.05, 0.95))

    # Probability challenger is better (any improvement)
    p_better = prob_B_better(alphaA, betaA, alphaB, betaB, n_mc=120_000, delta=0.0)

    # Probability challenger is better by at least delta_min
    p_better_by = prob_B_better(alphaA, betaA, alphaB, betaB, n_mc=120_000, delta=delta_min)

    # Probability challenger is worse by at least harm_min
    p_worse_by = prob_B_better(
        alphaB, betaB, alphaA, betaA, n_mc=120_000, delta=harm_min
    )  # P(pA < pB - harm_min)

    decision = "continue"
    if p_better_by > promote_prob:
        decision = "promote_challenger"
    elif p_worse_by > rollback_prob:
        decision = "rollback_challenger"

    records.append(
        {
            "week": w,
            "batch_n_A": batch_n_A,
            "batch_events_A": dA,
            "posterior_mean_A": meanA,
            "A_ci05": ciA[0],
            "A_ci95": ciA[1],
            "batch_n_B": batch_n_B,
            "batch_events_B": dB,
            "posterior_mean_B": meanB,
            "B_ci05": ciB[0],
            "B_ci95": ciB[1],
            "P(B < A)": p_better,
            "P(B < A - 50bps)": p_better_by,
            "P(B > A + 50bps)": p_worse_by,
            "decision": decision,
        }
    )

bayes_df = pd.DataFrame(records)

print("=== Bayesian Monitoring (weekly updates) ===")
display(bayes_df)

=== Bayesian Monitoring (weekly updates) ===


|   | week | batch_n_A | batch_events_A | posterior_mean_A | A_ci05 | A_ci95 | batch_n_B | batch_events_B | posterior_mean_B | B_ci05 | B_ci95 | P(B < A) | P(B < A - 50bps) | P(B > A + 50bps) | decision |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2500 | 305 | 0.125556 | 0.117527 | 0.133773 | 2500 | 305 | 0.123333 | 0.115368 | 0.131489 | 0.624083 | 0.345083 | 0.149708 | continue |
| 1 | 2 | 2500 | 351 | 0.130857 | 0.124288 | 0.137546 | 2500 | 271 | 0.118 | 0.111721 | 0.124403 | 0.989367 | 0.919058 | 0.000783 | continue |
| 2 | 3 | 2500 | 339 | 0.132105 | 0.126436 | 0.137863 | 2500 | 292 | 0.117684 | 0.112293 | 0.123167 | 0.998717 | 0.975692 | 1.7e-05 | promote_challenger |
| 3 | 4 | 2500 | 315 | 0.130833 | 0.125805 | 0.135931 | 2500 | 296 | 0.117833 | 0.113029 | 0.12271 | 0.9987 | 0.971075 | 8e-06 | promote_challenger |
| 4 | 5 | 2500 | 317 | 0.130138 | 0.125571 | 0.134763 | 2500 | 282 | 0.116966 | 0.112606 | 0.121385 | 0.999633 | 0.982917 | 8e-06 | promote_challenger |
| 5 | 6 | 2500 | 310 | 0.129235 | 0.125028 | 0.133492 | 2500 | 298 | 0.117294 | 0.113261 | 0.121379 | 0.999608 | 0.974125 | 0.0 | promote_challenger |
| 6 | 7 | 2500 | 350 | 0.130615 | 0.126668 | 0.134606 | 2500 | 314 | 0.118359 | 0.114576 | 0.122186 | 0.999917 | 0.984733 | 0.0 | promote_challenger |
| 7 | 8 | 2500 | 369 | 0.132545 | 0.128804 | 0.136325 | 2500 | 289 | 0.118045 | 0.114487 | 0.121643 | 1.0 | 0.998883 | 0.0 | promote_challenger |
| 8 | 9 | 2500 | 302 | 0.131347 | 0.127815 | 0.134913 | 2500 | 298 | 0.118163 | 0.114789 | 0.121573 | 0.999992 | 0.996825 | 0.0 | promote_challenger |
| 9 | 10 | 2500 | 313 | 0.130778 | 0.127418 | 0.134168 | 2500 | 317 | 0.118963 | 0.115738 | 0.12222 | 1.0 | 0.991658 | 0.0 | promote_challenger |


### Shadow Deployment with Targeted Experimentation: A Robust Alternative for High-Risk Rollouts

When redirecting a large portion of traffic is costly or operationally challenging, an alternative is to combine shadow deployment with selective approval of random cases in the shadow region. Instead of fully diverting applicants to the new model, the incumbent and challenger run in parallel (shadow mode), and a small random sample of applicants in the shadow region is approved so that their outcomes can be observed. This makes it possible to estimate conversion rates and risk profiles for each policy region while keeping the added exposure and operational impact small.

This targeted design concentrates the experimental budget on the segments of greatest interest, such as the shadow region where the models disagree, rather than spreading it across regions where outcomes are already observed. It is also easy to explain to non-technical stakeholders: the benefit being tested and the risk controls in place can both be stated explicitly. In environments with high uncertainty or regulatory scrutiny, this transparency helps justify further testing and supports data-driven decisions across all regions of the model's decision space.

### Practical note on outcome latency

Even with A/B or Bayesian monitoring, a 36-month maturity target is slow. In production, we typically:

* monitor early delinquency as an interim proxy (30/60/90+ DPD),
* validate proxy-to-final mapping using historical vintages,
* and use Bayesian monitoring to update confidence as more mature outcomes come in.

This avoids premature decisions while still enabling controlled iteration.
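Validating the proxy-to-final mapping amounts to asking, on matured historical vintages, how often an early delinquency flag ends in final default. The sketch below uses fully mocked data: the column names (`dpd60_early`, `final_default`), the vintages, and the assumed rates are illustrative and not taken from the LendingClub data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Mocked matured loans: vintage, early proxy flag (60+ DPD), final default flag.
n = 20_000
hist = pd.DataFrame(
    {
        "vintage": rng.choice(["2015", "2016", "2017"], size=n),
        "dpd60_early": rng.random(n) < 0.08,
    }
)
# Assumed relationship for illustration: proxy-positive loans default far more often.
p_final = np.where(hist["dpd60_early"], 0.70, 0.07)
hist["final_default"] = rng.random(n) < p_final

# Final default rate conditional on the early proxy, per vintage.
mapping = (
    hist.groupby(["vintage", "dpd60_early"])["final_default"]
    .agg(loans="size", final_default_rate="mean")
    .reset_index()
)
print(mapping)
```

If the conditional rates are stable across vintages, the proxy is a reasonable stand-in for the final outcome during an experiment. If they move materially from one vintage to the next, early proxy results should be read with more caution.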

## Monitoring Plan

Model deployment in credit risk does not end at go-live. Given delayed outcomes, selection bias, and policy coupling, monitoring must focus on leading indicators of degradation, not only on realized defaults. This section outlines a pragmatic monitoring framework that balances robustness, interpretability, and operational feasibility.

### Online KPIs

Online KPIs track how the model behaves in production traffic, before outcomes fully mature.

At a minimum, we monitor:

* Approval rate

  Sudden changes often indicate upstream data issues, threshold misalignment, or drift in applicant mix.

* Score distribution

  Mean, variance, and tail mass of predicted risk over time. Sharp shifts may signal feature drift or preprocessing inconsistencies.

* Segment-level volume

  Distribution of approvals across score bands, grades, or policy buckets. Useful to detect silent policy changes or unintended concentration.

* Shadow disagreement rates (if applicable)

  Size and composition of:

  * champion-approve / challenger-reject
  * champion-reject / challenger-approve

  Growth in these regions should be explainable and stable.

These KPIs are cheap, fast, and should be monitored daily or weekly. In many cases, they catch issues before outcome-based metrics move.
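As an illustration of a daily approval-rate check, the sketch below applies p-chart style control limits: a baseline rate from a reference window and a 3-sigma band that widens on low-volume days. The traffic is mocked, and the 14-day reference window and the 3-sigma rule are assumptions, not a prescribed standard.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Mocked daily traffic: stable 40% approval rate, then a jump to 50% in the last 3 days.
days = pd.date_range("2024-01-01", periods=30, freq="D")
n_apps = rng.integers(1_800, 2_200, size=len(days))
true_rate = np.r_[np.full(27, 0.40), np.full(3, 0.50)]
approved = rng.binomial(n_apps, true_rate)

kpi = pd.DataFrame({"date": days, "applications": n_apps, "approved": approved})
kpi["approval_rate"] = kpi["approved"] / kpi["applications"]

# Baseline rate from a reference window (first 14 days; an assumption).
ref = kpi.iloc[:14]
p0 = ref["approved"].sum() / ref["applications"].sum()

# Binomial standard error per day, so limits depend on that day's volume.
sigma = np.sqrt(p0 * (1 - p0) / kpi["applications"])
kpi["alert"] = (kpi["approval_rate"] - p0).abs() > 3 * sigma

print(kpi.loc[kpi["alert"], ["date", "applications", "approval_rate"]])
```

The same pattern applies to the shadow disagreement rates: replace the approval count with the number of applications falling in each disagreement region.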

### Drift Monitoring

Drift refers to changes in the data-generating process that can degrade model performance. We distinguish three main types:

#### 1. Data (Covariate) Drift

Changes in the distribution of input features:

* income, DTI, credit utilization,
* inquiry behavior,
* geographic or demographic mix.

What to monitor:

* PSI / CSI for top features,
* summary statistics (mean, p95, missingness),
* drift by segment (e.g., high vs low score).

When basic checks are enough:
Small, smooth shifts consistent with seasonality or macro trends.

When to go deeper:
Large or abrupt shifts, or drift concentrated in high-impact features.
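For reference, PSI compares the binned distribution of a feature in a reference sample with its distribution in a current sample: $\text{PSI} = \sum_i (a_i - e_i)\,\ln(a_i / e_i)$, where $e_i$ and $a_i$ are the reference and current shares in bin $i$. The sketch below is one minimal way to compute it; the quantile binning, the `eps` floor for empty bins, and the function name `psi` are implementation choices made here.

```python
import numpy as np


def psi(expected, actual, n_bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a reference and a current sample."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Interior bin edges from reference quantiles; the outer bins are open-ended,
    # so values outside the reference range still fall into the first or last bin.
    inner_edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))[1:-1]
    e_counts = np.bincount(np.searchsorted(inner_edges, expected, side="right"), minlength=n_bins)
    a_counts = np.bincount(np.searchsorted(inner_edges, actual, side="right"), minlength=n_bins)

    # Floor the shares to avoid log(0) and division by zero on empty bins.
    e_share = np.clip(e_counts / len(expected), eps, None)
    a_share = np.clip(a_counts / len(actual), eps, None)
    return float(np.sum((a_share - e_share) * np.log(a_share / e_share)))


rng = np.random.default_rng(3)
reference = rng.normal(0.0, 1.0, size=50_000)
same_dist = rng.normal(0.0, 1.0, size=50_000)
shifted = rng.normal(0.5, 1.0, size=50_000)

psi_same = psi(reference, same_dist)
psi_shifted = psi(reference, shifted)
print(f"PSI (no drift): {psi_same:.4f}")
print(f"PSI (mean shift of 0.5 sd): {psi_shifted:.4f}")
```

Commonly cited rules of thumb treat a PSI below 0.1 as stable and above 0.25 as a significant shift. These are conventions rather than guarantees, and thresholds are better calibrated on the portfolio's own history.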

#### 2. Prediction (Score) Drift

Changes in the distribution of model outputs:

* overall risk level,
* tail behavior,
* separation across bands.

Score drift often reflects either input drift or misalignment between preprocessing and inference.

What to monitor:

* PSI on predicted scores,
* share of population in extreme bands,
* stability of rank ordering across time.

#### 3. Performance Drift (Outcome Drift)

Changes in the relationship between predictions and outcomes:

* calibration decay,
* worsening discrimination,
* unexpected segment-level errors.

Because outcomes are delayed, we rely on:

* early delinquency proxies (e.g., 30/60+ DPD),
* vintage curves,
* rolling cohort analysis.

Key principle:
Do not overreact to noise in small windows — use aggregation and Bayesian smoothing when possible.
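One simple form of Bayesian smoothing is Beta-Binomial shrinkage: each segment's observed rate is pulled toward a portfolio-level prior, more strongly when the segment has few loans. The sketch below uses made-up counts, and the `prior_strength` of 200 pseudo-observations is an assumed tuning parameter.

```python
import pandas as pd

# Mocked early-delinquency counts per segment for one window (illustrative numbers).
seg = pd.DataFrame(
    {
        "segment": ["A", "B", "C", "D"],
        "loans": [4000, 1200, 150, 20],
        "events": [120, 42, 9, 3],
    }
)

# Prior centered on the portfolio-level rate, expressed in pseudo-observations.
prior_mean = seg["events"].sum() / seg["loans"].sum()
prior_strength = 200  # assumed; larger values shrink small segments harder
alpha0 = prior_mean * prior_strength

seg["raw_rate"] = seg["events"] / seg["loans"]
# Posterior mean of Beta(alpha0 + events, beta0 + loans - events),
# where alpha0 + beta0 = prior_strength.
seg["smoothed_rate"] = (alpha0 + seg["events"]) / (prior_strength + seg["loans"])

print(seg)
```

Segment D has only 20 loans, so its raw rate of 15% is pulled most of the way back toward the portfolio rate, while segment A barely moves. Alerting on the smoothed rate reduces false alarms from small windows, at the cost of reacting more slowly to genuine changes in small segments.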

### Early Warning Signals

Some signals warrant immediate investigation even before formal thresholds are breached:

* Sudden approval rate jumps without a business explanation
* Score distribution shifts without corresponding feature drift
* Rapid growth of the shadow approval region
* Localized calibration gaps in specific segments
* Divergence between early delinquency and expected risk

These are not necessarily “model failures”, but they often indicate data pipeline issues, policy misalignment, or regime change.

### When to Escalate Monitoring

Basic monitoring is sufficient when:

* drift is gradual,
* performance proxies are stable,
* changes are explainable by known factors.

Deeper investigation is required when:

* multiple drift signals align,
* changes are asymmetric across segments,
* early delinquency diverges persistently from expectations,
* or business constraints (loss limits, capital usage) are approached.

Escalation may include:

* shadow re-evaluation,
* temporary tightening of thresholds,
* retraining feasibility assessment,
* or targeted experimentation.