# Go/No-Go Diagnostic v2 — Rigorous Signal Detection

**Purpose:** Determine, with defensible statistics, whether the current state $X_t$
(Backward_Bin) carries *any* information about the next state $Y_t$ (Forward_Bin)
beyond the marginal baseline.

**Improvements over v1:**
1. Explicit alignment verification
2. Proper MI permutation test (500 permutations, p-value, z-score)
3. Fixed rebinning bug (backward returns use their own quantile edges)
4. Tuned additive-smoothing conditional baseline (alpha sweep on VAL)
5. Interpolated backoff baseline: $P_{mix}(y|x) = \lambda_x \cdot P_{cond}(y|x) + (1-\lambda_x) \cdot P_{marg}(y)$
6. Multi-step evaluation ($k \in \{1,2,3,5,10\}$) via $A^k$

In [1]:
import numpy as np
import pandas as pd
import torch
from pathlib import Path
from datetime import datetime
import os, warnings

warnings.filterwarnings('ignore')

SEED = 42
np.random.seed(SEED)
EPS = 1e-12  # only for log(p + EPS) in evaluation, NOT for MI

print("Imports ready.")

Imports ready.


## Data Loading
Identical to `TransitionProbMatrix_NEWDATA.ipynb`. Same dataset, same temporal split.

In [2]:
train_df  = pd.read_csv("dataset/train_diagnostic.csv")
labels_df = pd.read_csv("dataset/label_diagnostic.csv")

# Compute forward percent change from Price
train_df["Percent_change_forward"] = (
    train_df["Price"].shift(-1) / train_df["Price"] - 1
) * 100.0

# Drop last row (forward return undefined)
train_df = train_df.iloc[:-1].copy()
labels_df = labels_df.iloc[:-1].copy()

# States: 0-based
s_curr_all = (train_df["Backward_Bin"].values.astype(np.int64) - 1)
y_all      = (labels_df["Forward_Bin"].values.astype(np.int64) - 1)

# Raw percent changes (for rebinning)
pct_backward_all = train_df["Percent_change_backward"].values.astype(np.float64)
pct_forward_all  = train_df["Percent_change_forward"].values.astype(np.float64)

n_samples = len(s_curr_all)
n_states_orig = int(max(s_curr_all.max(), y_all.max()) + 1)

# Temporal split: 70 / 15 / 15
T = n_samples
train_end = int(0.7 * T)
val_end   = int(0.85 * T)

idx_train = np.arange(0,         train_end)
idx_val   = np.arange(train_end, val_end)
idx_test  = np.arange(val_end,   T)

s_train, s_val, s_test = s_curr_all[idx_train], s_curr_all[idx_val], s_curr_all[idx_test]
y_train, y_val, y_test = y_all[idx_train], y_all[idx_val], y_all[idx_test]

print(f"n_samples={n_samples}, n_states_orig={n_states_orig}")
print(f"Train: {len(idx_train)}, Val: {len(idx_val)}, Test: {len(idx_test)}")

n_samples=2368, n_states_orig=55
Train: 1657, Val: 355, Test: 356


## (1) Alignment Verification
Confirm that `Backward_Bin[t]` = bin of return from $P_{t-1} \to P_t$,
and `Forward_Bin[t]` = bin of return from $P_t \to P_{t+1}$.

Also verify: `Forward_Bin[t] == Backward_Bin[t+1]` (the forward return at $t$ IS the backward return at $t+1$).
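
The same cross-check can be written as one vectorized comparison. The sketch below is only an illustration: the price series is made up, and `toy_bin` is a hypothetical 3-bin scheme, not the 55-bin thresholds used in this notebook.

```python
import numpy as np

# Toy price series (made-up values, for illustration only).
prices = np.array([100.0, 101.0, 99.0, 99.5, 102.0, 101.0])

# Backward return at t: P[t-1] -> P[t]; undefined at t = 0.
bwd_pct = np.full(len(prices), np.nan)
bwd_pct[1:] = (prices[1:] / prices[:-1] - 1) * 100.0

# Forward return at t: P[t] -> P[t+1]; undefined at the last index.
fwd_pct = np.full(len(prices), np.nan)
fwd_pct[:-1] = (prices[1:] / prices[:-1] - 1) * 100.0

def toy_bin(pct):
    """Hypothetical 3-bin scheme: 0 = down, 1 = flat, 2 = up."""
    return np.digitize(pct, [-0.75, 0.75])

# The forward bin at t must equal the backward bin at t+1 wherever both exist.
fwd_bin = toy_bin(fwd_pct[:-1])
bwd_bin = toy_bin(bwd_pct[1:])
aligned = bool(np.array_equal(fwd_bin, bwd_bin))
print("aligned:", aligned)
```

If the two columns were shifted against each other by even one row, `aligned` would be expected to come out `False` on any series with varying returns.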

In [3]:
# Python reimplementation of the R bin_return function
def bin_return_py(x):
    """Replicate the R bin_return function exactly. Returns 1-based bin."""
    thresholds = [
        (-np.inf, -10, 1), (-10, -9, 2), (-9, -8, 3), (-8, -7, 4),
        (-7, -6, 5), (-6, -5, 6), (-5, -4.5, 7), (-4.5, -4, 8),
        (-4, -3.5, 9), (-3.5, -3, 10), (-3, -2.5, 11), (-2.5, -2.05, 12),
        (-2.05, -1.85, 13), (-1.85, -1.65, 14), (-1.65, -1.45, 15),
        (-1.45, -1.25, 16), (-1.25, -1.05, 17), (-1.05, -0.95, 18),
        (-0.95, -0.85, 19), (-0.85, -0.75, 20), (-0.75, -0.65, 21),
        (-0.65, -0.55, 22), (-0.55, -0.45, 23), (-0.45, -0.35, 24),
        (-0.35, -0.25, 25), (-0.25, -0.15, 26), (-0.15, -0.05, 27),
        (-0.05, 0.05, 28), (0.05, 0.15, 29), (0.15, 0.25, 30),
        (0.25, 0.35, 31), (0.35, 0.45, 32), (0.45, 0.55, 33),
        (0.55, 0.65, 34), (0.65, 0.75, 35), (0.75, 0.85, 36),
        (0.85, 0.95, 37), (0.95, 1.05, 38), (1.05, 1.25, 39),
        (1.25, 1.45, 40), (1.45, 1.65, 41), (1.65, 1.85, 42),
        (1.85, 2.05, 43), (2.05, 2.55, 44), (2.55, 3.05, 45),
        (3.05, 3.55, 46), (3.55, 4.05, 47), (4.05, 4.55, 48),
        (4.55, 5, 49), (5, 6, 50), (6, 7, 51), (7, 8, 52),
        (8, 9, 53), (9, 10, 54),
    ]
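    # Bins 1-54 are half-open intervals [lo, hi); bin 55 is the open tail above 10.
    for lo, hi, b in thresholds:
        if lo <= x < hi:
            return b
    if x > 10:
        return 55
    # Falls through for x == 10 exactly and for NaN: neither matches any bin.
    return None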

# Show 5 consecutive rows (indices 10-14)
print("=" * 100)
print("ALIGNMENT VERIFICATION (rows 10-14)")
print("=" * 100)
prices = train_df["Price"].values
bwd_pct = train_df["Percent_change_backward"].values
fwd_pct = train_df["Percent_change_forward"].values
bwd_bin_csv = train_df["Backward_Bin"].values  # 1-based from CSV
fwd_bin_csv = labels_df["Forward_Bin"].values   # 1-based from CSV

header = f"{'t':>3} | {'P_t':>9} | {'bwd_pct':>9} | {'bwd_bin':>7} | {'check':>5} | {'fwd_pct':>9} | {'fwd_bin':>7} | {'check':>5}"
print(header)
print("-" * len(header))

for t in range(10, 15):
    # Manually compute returns
    bwd_manual = (prices[t] - prices[t-1]) / prices[t-1] * 100 if t > 0 else float('nan')
    fwd_manual = (prices[t+1] - prices[t]) / prices[t] * 100 if t < len(prices)-1 else float('nan')
    bwd_bin_check = bin_return_py(bwd_pct[t])
    fwd_bin_check = bin_return_py(fwd_pct[t])
    print(f"{t:3d} | {prices[t]:9.2f} | {bwd_pct[t]:9.4f} | {bwd_bin_csv[t]:4d}={bwd_bin_check:3d} | {'OK' if bwd_bin_csv[t]==bwd_bin_check else 'FAIL':>5} | "
          f"{fwd_pct[t]:9.4f} | {fwd_bin_csv[t]:4d}={fwd_bin_check:3d} | {'OK' if fwd_bin_csv[t]==fwd_bin_check else 'FAIL':>5}")

# Key relationship: Forward_Bin[t] should equal Backward_Bin[t+1]
print("\nCross-check: Forward_Bin[t] == Backward_Bin[t+1]?")
matches = 0
total = 0
for t in range(len(fwd_bin_csv) - 1):
    if fwd_bin_csv[t] == bwd_bin_csv[t+1]:
        matches += 1
    total += 1
print(f"  {matches}/{total} match ({matches/total*100:.1f}%)")
if matches == total:
    print("  PERFECT ALIGNMENT CONFIRMED.")
else:
    print(f"  *** MISMATCH in {total-matches} rows! ***")
    # Show first mismatch
    for t in range(len(fwd_bin_csv) - 1):
        if fwd_bin_csv[t] != bwd_bin_csv[t+1]:
            print(f"    First mismatch at t={t}: fwd_bin[t]={fwd_bin_csv[t]}, bwd_bin[t+1]={bwd_bin_csv[t+1]}")
            print(f"    fwd_pct[t]={fwd_pct[t]:.6f}, bwd_pct[t+1]={bwd_pct[t+1]:.6f}")
            break

ALIGNMENT VERIFICATION (rows 10-14)
  t |       P_t |   bwd_pct | bwd_bin | check |   fwd_pct | fwd_bin | check
---------------------------------------------------------------------------
 10 |     57.04 |   -0.0701 |   27= 27 |    OK |    0.4208 |   32= 32 |    OK
 11 |     57.28 |    0.4208 |   32= 32 |    OK |    3.8757 |   47= 47 |    OK
 12 |     59.50 |    3.8757 |   47= 47 |    OK |   -1.0756 |   17= 17 |    OK
 13 |     58.86 |   -1.0756 |   17= 17 |    OK |   -3.1091 |   10= 10 |    OK
 14 |     57.03 |   -3.1091 |   10= 10 |    OK |    0.6663 |   35= 35 |    OK

Cross-check: Forward_Bin[t] == Backward_Bin[t+1]?
  2367/2367 match (100.0%)
  PERFECT ALIGNMENT CONFIRMED.


## Helper Functions

In [4]:
def compute_marginal(y, n_classes):
    """Compute marginal distribution from integer labels."""
    counts = np.bincount(y, minlength=n_classes).astype(np.float64)
    return counts / counts.sum()


def compute_joint_counts(s, y, n_x, n_y):
    """Compute raw joint count matrix C[x, y]."""
    C = np.zeros((n_x, n_y), dtype=np.float64)
    # Unbuffered in-place add, so repeated (x, y) pairs are each counted.
    np.add.at(C, (s, y), 1.0)
    return C


def compute_mi_plugin(s, y, n_x, n_y):
    """Plugin MI estimator from raw counts. No epsilon smoothing.
    MI = sum_{x,y} P_hat(x,y) * log(P_hat(x,y) / (P_hat(x)*P_hat(y)))
    Skips cells where P_hat(x,y) = 0.
    """
    C = compute_joint_counts(s, y, n_x, n_y)
    N = C.sum()
    P_joint = C / N
    P_x = P_joint.sum(axis=1)
    P_y = P_joint.sum(axis=0)
    mi = 0.0
    for i in range(n_x):
        if P_x[i] == 0:
            continue
        for j in range(n_y):
            if P_joint[i, j] > 0 and P_y[j] > 0:
                mi += P_joint[i, j] * np.log(P_joint[i, j] / (P_x[i] * P_y[j]))
    return mi


def mi_permutation_test(s, y, n_x, n_y, n_perm=500):
    """MI permutation test. Returns dict with mi_real, perm stats, p-value, z-score."""
    mi_real = compute_mi_plugin(s, y, n_x, n_y)
    mi_perm = np.zeros(n_perm)
    for i in range(n_perm):
        y_shuf = np.random.permutation(y)
        mi_perm[i] = compute_mi_plugin(s, y_shuf, n_x, n_y)
    p_value = (1 + np.sum(mi_perm >= mi_real)) / (n_perm + 1)
    perm_mean = mi_perm.mean()
    perm_std = mi_perm.std()
    z_score = (mi_real - perm_mean) / perm_std if perm_std > 0 else 0.0
    delta_mi = mi_real - perm_mean
    return {
        "mi_real": mi_real, "perm_mean": perm_mean, "perm_std": perm_std,
        "z_score": z_score, "p_value": p_value,
        "delta_mi_nats": delta_mi, "delta_mi_bits": delta_mi / np.log(2),
    }


def compute_conditional_additive(s, y, n_x, n_y, alpha):
    """P(y|x) = (C[x,y] + alpha) / (C[x,.] + alpha * n_y)."""
    C = compute_joint_counts(s, y, n_x, n_y)
    C_alpha = C + alpha
    P = C_alpha / C_alpha.sum(axis=1, keepdims=True)
    return P


def compute_backoff_baseline(s_train, y_train, s_eval, n_x, n_y, alpha, tau, marginal):
    """Interpolated backoff: P_mix(y|x) = lambda_x * P_cond(y|x) + (1-lambda_x) * P_marg(y)
    where lambda_x = count(x) / (count(x) + tau).
    Returns: (N_eval, n_y) array of predicted distributions.
    """
    C = compute_joint_counts(s_train, y_train, n_x, n_y)
    C_alpha = C + alpha
    P_cond = C_alpha / C_alpha.sum(axis=1, keepdims=True)
    state_counts = C.sum(axis=1)  # count(x)
    lam = state_counts / (state_counts + tau)  # (n_x,)
    # For each eval sample: mix conditional and marginal
    P_mix = np.zeros((len(s_eval), n_y), dtype=np.float64)
    for i, sx in enumerate(s_eval):
        P_mix[i] = lam[sx] * P_cond[sx] + (1 - lam[sx]) * marginal
    return P_mix


def mean_log_likelihood(pred_dist, y_true):
    """Mean log-likelihood: mean(log(P[y_true]))."""
    N = len(y_true)
    probs = pred_dist[np.arange(N), y_true]
    return np.log(probs + EPS).mean()


def accuracy(pred_dist, y_true):
    return (pred_dist.argmax(axis=1) == y_true).mean()


def severity(pred_dist, y_true, n_classes):
    bins = np.arange(n_classes, dtype=np.float64)
    expected = (pred_dist * bins[np.newaxis, :]).sum(axis=1)
    return np.abs(expected - y_true.astype(np.float64)).mean()


print("Helpers defined.")

Helpers defined.


## Rebinning (Fixed)
**Bug fix from v1:** backward returns now use their own quantile edges
(fit on train backward returns), not forward return edges.
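
As a rough illustration of the approach (not the notebook's data), the sketch below fits quantile edges on a synthetic "train" sample only, opens the outer edges, and applies them to a later sample with a wider spread. All arrays and the choice of 5 bins are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
train_returns = rng.normal(0.0, 1.0, size=1000)  # synthetic, illustration only
later_returns = rng.normal(0.0, 2.0, size=200)   # wider spread than train

n_bins = 5
# Edges come from the training sample only, so no later data leaks into them.
edges = np.quantile(train_returns, np.linspace(0, 1, n_bins + 1))
edges[0], edges[-1] = -np.inf, np.inf            # open-ended outer bins

train_bins = np.digitize(train_returns, edges) - 1
later_bins = np.digitize(later_returns, edges) - 1

train_counts = np.bincount(train_bins, minlength=n_bins)
later_counts = np.bincount(later_bins, minlength=n_bins)
print("train counts:", train_counts)
print("later counts:", later_counts)
```

The training sample is split into equally populated bins by construction, while the later sample need not be. Fitting a separate set of edges per series keeps each series equally populated on its own training data.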

In [5]:
def rebin_quantile(pct_values, pct_train_for_edges, n_bins):
    """Rebin using quantile edges fit on the given training data."""
    edges = np.quantile(pct_train_for_edges, np.linspace(0, 1, n_bins + 1))
    edges[0] = -np.inf
    edges[-1] = np.inf
    edges = np.unique(edges)
    actual_n = len(edges) - 1
    bins = np.clip(np.digitize(pct_values, edges) - 1, 0, actual_n - 1)
    return bins, edges, actual_n


def prepare_bins(n_bins, pct_fwd_all, pct_bwd_all, pct_fwd_train, pct_bwd_train,
                 s_orig, y_orig):
    """Prepare rebinned states and labels for a given n_bins.
    For n_bins=55: use original CSV bins.
    Otherwise: quantile-based, with SEPARATE edges for forward and backward.
    """
    if n_bins == 55:
        return s_orig.copy(), y_orig.copy(), 55, "original_fixed"
    else:
        y_new, _, n_y = rebin_quantile(pct_fwd_all, pct_fwd_train, n_bins)
        s_new, _, n_s = rebin_quantile(pct_bwd_all, pct_bwd_train, n_bins)  # FIXED: own edges
        actual_n = max(n_y, n_s)
        return s_new, y_new, actual_n, "quantile"


# Precompute train-only percent changes
pct_fwd_train = pct_forward_all[idx_train]
pct_bwd_train = pct_backward_all[idx_train]  # FIXED: was missing in v1

print("Rebinning functions defined.")

Rebinning functions defined.


## (2) MI Permutation Test — All Bin Counts
500 permutations. Reports MI_real, permutation mean/std, z-score, p-value, deltaMI.
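
Because the p-value uses the add-one estimator $(1 + b)/(n_{perm} + 1)$, where $b$ is the number of permuted MI values at or above the real one, its resolution is limited by the number of permutations. A small sketch of that arithmetic (the helper name `perm_p_value` is introduced here for illustration):

```python
n_perm = 500

def perm_p_value(n_exceed, n_perm):
    """Add-one permutation p-value, the same formula as in mi_permutation_test."""
    return (1 + n_exceed) / (n_perm + 1)

p_floor = perm_p_value(0, n_perm)  # no permutation reaches the real MI
p_four = perm_p_value(4, n_perm)   # four permutations reach the real MI

print(f"{p_floor:.4f}")  # 0.0020
print(f"{p_four:.4f}")   # 0.0100
```

With 500 permutations, a printed p-value of 0.0020 is therefore the smallest value the test can report, and a printed 0.0100 corresponds to 5/501, which is just below the 0.01 threshold.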

In [6]:
N_PERM = 500
N_BINS_LIST = [25, 35, 40, 55]

mi_results = []

for n_bins in N_BINS_LIST:
    s_all, y_all_rb, n_st, method = prepare_bins(
        n_bins, pct_forward_all, pct_backward_all,
        pct_fwd_train, pct_bwd_train, s_curr_all, y_all)
    s_tr = s_all[idx_train]
    y_tr = y_all_rb[idx_train]

    print(f"\nn_bins={n_bins} ({method}): running {N_PERM} permutations...", end=" ")
    res = mi_permutation_test(s_tr, y_tr, n_st, n_st, n_perm=N_PERM)
    res["n_bins"] = n_bins
    res["method"] = method

    # State coverage
    sc = np.bincount(s_tr, minlength=n_st)
    res["avg_samples_per_state"] = sc[sc > 0].mean()
    res["states_lt5"] = int((sc < 5).sum())

    mi_results.append(res)
    print(f"done.")
    print(f"  MI_real={res['mi_real']:.6f}, perm_mean={res['perm_mean']:.6f}, "
          f"perm_std={res['perm_std']:.6f}")
    print(f"  deltaMI={res['delta_mi_nats']:.6f} nats ({res['delta_mi_bits']:.6f} bits)")
    print(f"  z-score={res['z_score']:.2f}, p-value={res['p_value']:.4f}")
    if res['p_value'] < 0.01 and res['delta_mi_nats'] > 0.001:
        print(f"  -> Statistically significant signal (p<0.01, deltaMI>0.001)")
    elif res['p_value'] < 0.01:
        print(f"  -> Statistically detectable but economically weak (deltaMI={res['delta_mi_nats']:.6f})")
    else:
        print(f"  -> NOT significant (p={res['p_value']:.4f})")

df_mi = pd.DataFrame(mi_results)
print("\n" + "=" * 100)
print("MI PERMUTATION TEST SUMMARY")
print("=" * 100)
cols = ["n_bins", "method", "mi_real", "perm_mean", "perm_std",
        "delta_mi_nats", "delta_mi_bits", "z_score", "p_value",
        "avg_samples_per_state", "states_lt5"]
print(df_mi[cols].to_string(index=False, float_format=lambda x: f"{x:.6f}"))
print("=" * 100)


n_bins=25 (quantile): running 500 permutations... done.
  MI_real=0.253800, perm_mean=0.192654, perm_std=0.010699
  deltaMI=0.061146 nats (0.088214 bits)
  z-score=5.72, p-value=0.0020
  -> Statistically significant signal (p<0.01, deltaMI>0.001)

n_bins=35 (quantile): running 500 permutations... done.
  MI_real=0.468648, perm_mean=0.410466, perm_std=0.013763
  deltaMI=0.058182 nats (0.083940 bits)
  z-score=4.23, p-value=0.0020
  -> Statistically significant signal (p<0.01, deltaMI>0.001)

n_bins=40 (quantile): running 500 permutations... done.
  MI_real=0.577981, perm_mean=0.534409, perm_std=0.013769
  deltaMI=0.043572 nats (0.062861 bits)
  z-score=3.16, p-value=0.0020
  -> Statistically significant signal (p<0.01, deltaMI>0.001)

n_bins=55 (original_fixed): running 500 permutations... done.
  MI_real=0.659427, perm_mean=0.628043, perm_std=0.012749
  deltaMI=0.031384 nats (0.045278 bits)
  z-score=2.46, p-value=0.0100
  -> Statistically significant signal (p<0.01, deltaMI>0.001)


## (3) Conditional Baselines — Properly Tuned

Three baselines at $k=1$:
- **A) Marginal**: $P(y)$ from TRAIN
- **B) Additive smoothing**: $P(y|x) = (C[x,y] + \alpha) / (C[x,.] + \alpha \cdot n_{states})$, $\alpha$ tuned on VAL
- **C) Backoff**: $P_{mix}(y|x) = \lambda_x \cdot P_{cond}(y|x) + (1-\lambda_x) \cdot P_{marg}(y)$,
  $\lambda_x = count(x)/(count(x)+\tau)$, with $\alpha$ and $\tau$ tuned on VAL
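
To get a feel for the backoff weight, the sketch below evaluates $\lambda_x = count(x)/(count(x)+\tau)$ for a few illustrative counts at $\tau = 500$. The counts and the two 3-class distributions are made up; for scale, 1657 training rows spread over 55 states average about 30 rows per state.

```python
import numpy as np

tau = 500.0
counts = np.array([0.0, 5.0, 30.0, 100.0, 500.0])  # illustrative train counts per state
lam = counts / (counts + tau)                       # weight on the conditional estimate

for c, l in zip(counts, lam):
    print(f"count(x)={int(c):3d} -> lambda_x={l:.3f}")

# Mixing two valid distributions with weights lambda and 1 - lambda stays valid.
p_cond = np.array([0.7, 0.2, 0.1])  # hypothetical smoothed conditional row
p_marg = np.array([0.3, 0.4, 0.3])  # hypothetical marginal
p_mix = lam[2] * p_cond + (1.0 - lam[2]) * p_marg
print(p_mix, p_mix.sum())
```

At a count of 30 and $\tau = 500$ the conditional row receives a weight of roughly 0.06, so a large tuned $\tau$ means the backoff prediction stays close to the marginal.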

In [7]:
ALPHA_GRID = [0, 1e-3, 1e-2, 1e-1, 0.5, 1.0, 2.0, 5.0, 10.0]
TAU_GRID   = [1, 2, 5, 10, 20, 50, 100, 200, 500]

baseline_results = []

for n_bins in N_BINS_LIST:
    s_all, y_all_rb, n_st, method = prepare_bins(
        n_bins, pct_forward_all, pct_backward_all,
        pct_fwd_train, pct_bwd_train, s_curr_all, y_all)
    s_tr = s_all[idx_train]; s_va = s_all[idx_val]; s_te = s_all[idx_test]
    y_tr = y_all_rb[idx_train]; y_va = y_all_rb[idx_val]; y_te = y_all_rb[idx_test]

    marginal = compute_marginal(y_tr, n_st)

    # --- Baseline A: Marginal ---
    pred_marg_val  = np.tile(marginal, (len(y_va), 1))
    pred_marg_test = np.tile(marginal, (len(y_te), 1))
    ll_marg_val  = mean_log_likelihood(pred_marg_val, y_va)
    ll_marg_test = mean_log_likelihood(pred_marg_test, y_te)

    # --- Baseline B: Additive smoothing, alpha tuned on VAL ---
    best_alpha, best_alpha_ll = None, -np.inf
    for alpha in ALPHA_GRID:
        if alpha == 0:
            # Zero smoothing: skip states with 0 counts (assign marginal)
            C = compute_joint_counts(s_tr, y_tr, n_st, n_st)
            row_sums = C.sum(axis=1)
            P_cond = np.zeros_like(C)
            for i in range(n_st):
                if row_sums[i] > 0:
                    P_cond[i] = C[i] / row_sums[i]
                else:
                    P_cond[i] = marginal
        else:
            P_cond = compute_conditional_additive(s_tr, y_tr, n_st, n_st, alpha)
        pred_val = P_cond[s_va]
        ll_val = mean_log_likelihood(pred_val, y_va)
        if ll_val > best_alpha_ll:
            best_alpha_ll = ll_val
            best_alpha = alpha

    # Evaluate best alpha on test
    if best_alpha == 0:
        C = compute_joint_counts(s_tr, y_tr, n_st, n_st)
        row_sums = C.sum(axis=1)
        P_cond_best = np.zeros_like(C)
        for i in range(n_st):
            P_cond_best[i] = C[i] / row_sums[i] if row_sums[i] > 0 else marginal
    else:
        P_cond_best = compute_conditional_additive(s_tr, y_tr, n_st, n_st, best_alpha)
    ll_additive_test = mean_log_likelihood(P_cond_best[s_te], y_te)
    acc_additive_test = accuracy(P_cond_best[s_te], y_te)

    # --- Baseline C: Backoff, 2-stage tuning ---
    # Stage 1: fix alpha = best from B, tune tau
    best_tau, best_tau_ll = None, -np.inf
    alpha_for_backoff = best_alpha if best_alpha > 0 else 1e-3
    for tau in TAU_GRID:
        pred_val = compute_backoff_baseline(
            s_tr, y_tr, s_va, n_st, n_st, alpha_for_backoff, tau, marginal)
        ll_val = mean_log_likelihood(pred_val, y_va)
        if ll_val > best_tau_ll:
            best_tau_ll = ll_val
            best_tau = tau

    # Stage 2: refine alpha with best tau
    best_backoff_alpha, best_backoff_ll = alpha_for_backoff, best_tau_ll
    for alpha in ALPHA_GRID:
        if alpha == 0:
            alpha = 1e-3  # avoid division issues
        pred_val = compute_backoff_baseline(
            s_tr, y_tr, s_va, n_st, n_st, alpha, best_tau, marginal)
        ll_val = mean_log_likelihood(pred_val, y_va)
        if ll_val > best_backoff_ll:
            best_backoff_ll = ll_val
            best_backoff_alpha = alpha

    # Evaluate backoff on test
    pred_backoff_test = compute_backoff_baseline(
        s_tr, y_tr, s_te, n_st, n_st, best_backoff_alpha, best_tau, marginal)
    ll_backoff_test = mean_log_likelihood(pred_backoff_test, y_te)
    acc_backoff_test = accuracy(pred_backoff_test, y_te)

    delta_additive = ll_additive_test - ll_marg_test
    delta_backoff  = ll_backoff_test - ll_marg_test

    print(f"\nn_bins={n_bins}: marginal_LL={ll_marg_test:.6f}")
    print(f"  B) Additive: alpha={best_alpha}, test_LL={ll_additive_test:.6f}, delta={delta_additive:+.6f}")
    print(f"  C) Backoff:  alpha={best_backoff_alpha}, tau={best_tau}, "
          f"test_LL={ll_backoff_test:.6f}, delta={delta_backoff:+.6f}")

    baseline_results.append({
        "n_bins": n_bins, "method": method,
        "marginal_LL_test": ll_marg_test,
        "additive_alpha": best_alpha,
        "additive_LL_test": ll_additive_test,
        "additive_delta": delta_additive,
        "additive_acc": acc_additive_test,
        "backoff_alpha": best_backoff_alpha,
        "backoff_tau": best_tau,
        "backoff_LL_test": ll_backoff_test,
        "backoff_delta": delta_backoff,
        "backoff_acc": acc_backoff_test,
    })

df_baselines = pd.DataFrame(baseline_results)

print("\n" + "=" * 110)
print("BASELINE COMPARISON (k=1, TEST)")
print("=" * 110)
print(df_baselines.to_string(index=False, float_format=lambda x: f"{x:.6f}"))
print("=" * 110)

# Identify best configuration
best_row = df_baselines.loc[df_baselines["backoff_delta"].idxmax()]
print(f"\nBest backoff delta: {best_row['backoff_delta']:+.6f} nats at n_bins={int(best_row['n_bins'])}")
if best_row['backoff_delta'] > 0:
    print("  Backoff baseline BEATS marginal on test.")
else:
    print("  Backoff baseline does NOT beat marginal on test.")


n_bins=25: marginal_LL=-3.218758
  B) Additive: alpha=10.0, test_LL=-3.216606, delta=+0.002152
  C) Backoff:  alpha=10.0, tau=500, test_LL=-3.217329, delta=+0.001429

n_bins=35: marginal_LL=-3.554692
  B) Additive: alpha=10.0, test_LL=-3.562570, delta=-0.007878
  C) Backoff:  alpha=5.0, tau=200, test_LL=-3.555981, delta=-0.001290

n_bins=40: marginal_LL=-3.688565
  B) Additive: alpha=10.0, test_LL=-3.695031, delta=-0.006467
  C) Backoff:  alpha=10.0, tau=100, test_LL=-3.689438, delta=-0.000873

n_bins=55: marginal_LL=-3.692749
  B) Additive: alpha=1.0, test_LL=-3.891563, delta=-0.198814
  C) Backoff:  alpha=0.001, tau=500, test_LL=-3.690809, delta=+0.001940

BASELINE COMPARISON (k=1, TEST)
 n_bins         method  marginal_LL_test  additive_alpha  additive_LL_test  additive_delta  additive_acc  backoff_alpha  backoff_tau  backoff_LL_test  backoff_delta  backoff_acc
     25       quantile         -3.218758       10.000000         -3.216606        0.002152      0.050562      10.000000   

## (4) Multi-Step Evaluation ($k > 1$)

For stationary baselines:
- **Marginal**: $\hat{\pi}_{t+k} = P_{marg}(y)$ for all $k$.
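- **Backoff**: $\hat{\pi}_{t+k} = e_{x_t} \cdot A^k$, where $A$ is the one-step transition matrix whose row $x$ is the backoff distribution $P_{mix}(\cdot|x)$ and $e_{x_t}$ is the one-hot row vector for the current state.

Note: $A^k$ is the $k$-th matrix power of the one-step matrix $A$, i.e. the $k$-step transition matrix. This requires the rows and columns of $A$ to index the same set of bins, which holds for the original 55 bins.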
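
A toy example of the $k$-step forecast, assuming a made-up 2-state one-step matrix `A` (not estimated from this dataset):

```python
import numpy as np

# Toy one-step transition matrix; each row is a distribution over next states.
A = np.array([[0.9, 0.1],
              [0.5, 0.5]])

powers = {}
for k in (1, 2, 10):
    Ak = np.linalg.matrix_power(A, k)  # k-step transition matrix
    powers[k] = Ak
    print(f"k={k:2d}: from state 0 -> {Ak[0]}, from state 1 -> {Ak[1]}")

# Gap between the two rows' forecasts for state 0; it shrinks as k grows.
row_gap = {k: abs(Ak[0, 0] - Ak[1, 0]) for k, Ak in powers.items()}
print(row_gap)
```

As $k$ grows the rows of $A^k$ approach a common distribution, so the forecast depends less and less on the starting state. When the rows of $A$ are already close to the marginal, as happens with a small $\lambda_x$, that collapse is nearly complete after a single step.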

In [8]:
K_LIST = [1, 2, 3, 5, 10]

# Use the best n_bins configuration from the backoff results
best_nb = int(df_baselines.loc[df_baselines["backoff_delta"].idxmax(), "n_bins"])
best_alpha_ms = float(df_baselines.loc[df_baselines["backoff_delta"].idxmax(), "backoff_alpha"])
best_tau_ms   = float(df_baselines.loc[df_baselines["backoff_delta"].idxmax(), "backoff_tau"])

# Also run for n_bins=55 (original) for comparison
multistep_nbins_list = sorted(set([best_nb, 55]))

multistep_results = []

for n_bins in multistep_nbins_list:
    s_all, y_all_rb, n_st, method = prepare_bins(
        n_bins, pct_forward_all, pct_backward_all,
        pct_fwd_train, pct_bwd_train, s_curr_all, y_all)
    s_tr = s_all[idx_train]; y_tr = y_all_rb[idx_train]

    marginal = compute_marginal(y_tr, n_st)

    # Build transition matrix A for the backoff baseline.
    # Use the hyperparameters tuned for this n_bins in section (3); fall back to
    # the best configuration's values if this n_bins was not tuned there.
    row = df_baselines[df_baselines["n_bins"] == n_bins]
    if len(row) > 0:
        alpha_ms = float(row["backoff_alpha"].values[0])
        tau_ms   = float(row["backoff_tau"].values[0])
    else:
        alpha_ms = best_alpha_ms
        tau_ms   = best_tau_ms

    # Build A: each row A[x,:] = lambda_x * P_cond(y|x) + (1-lambda_x) * P_marg(y)
    C = compute_joint_counts(s_tr, y_tr, n_st, n_st)
    C_alpha = C + alpha_ms
    P_cond = C_alpha / C_alpha.sum(axis=1, keepdims=True)
    state_counts = C.sum(axis=1)
    lam = state_counts / (state_counts + tau_ms)
    A = np.zeros((n_st, n_st), dtype=np.float64)
    for i in range(n_st):
        A[i] = lam[i] * P_cond[i] + (1 - lam[i]) * marginal

    print(f"\nn_bins={n_bins} ({method}), alpha={alpha_ms}, tau={tau_ms}")

    for k in K_LIST:
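        # Pair each test state s_t with the label stored k rows later.
        # NOTE: y_all_rb[t] is already the one-step-ahead bin for s_all[t]
        # (Forward_Bin[t] == Backward_Bin[t+1]), so y_all_rb[t + k] lies k+1 steps
        # ahead of s_t, while A^k is a k-step forecast. Using idx_t + (k - 1) would
        # align the two. The indexing below is kept as run, which is why k=1 here
        # does not reproduce the k=1 test delta from section (3).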
        valid_mask = (idx_test + k) < T
        idx_t = idx_test[valid_mask]
        idx_tk = idx_t + k
        s_t = s_all[idx_t]
        y_tk = y_all_rb[idx_tk]
        n_valid = len(s_t)

        if n_valid < 10:
            print(f"  k={k}: too few samples ({n_valid}), skipping")
            continue

        # Marginal prediction
        pred_marg = np.tile(marginal, (n_valid, 1))
        ll_marg = mean_log_likelihood(pred_marg, y_tk)

        # Backoff k-step: pi_{t+k} = e_{s_t} @ A^k
        Ak = np.linalg.matrix_power(A, k)
        pred_backoff = Ak[s_t]
        ll_backoff = mean_log_likelihood(pred_backoff, y_tk)

        delta = ll_backoff - ll_marg

        multistep_results.append({
            "n_bins": n_bins, "k": k, "n_valid": n_valid,
            "marginal_LL": ll_marg, "backoff_LL": ll_backoff,
            "delta_LL": delta,
        })
        print(f"  k={k:2d}: n={n_valid}, marg_LL={ll_marg:.6f}, "
              f"backoff_LL={ll_backoff:.6f}, delta={delta:+.6f}")

df_multistep = pd.DataFrame(multistep_results)

print("\n" + "=" * 90)
print("MULTI-STEP EVALUATION SUMMARY")
print("=" * 90)
print(df_multistep.to_string(index=False, float_format=lambda x: f"{x:.6f}"))
print("=" * 90)


n_bins=55 (original_fixed), alpha=0.001, tau=500.0
  k= 1: n=355, marg_LL=-3.692661, backoff_LL=-3.687368, delta=+0.005294
  k= 2: n=354, marg_LL=-3.693314, backoff_LL=-3.692620, delta=+0.000694
  k= 3: n=353, marg_LL=-3.694515, backoff_LL=-3.693811, delta=+0.000704
  k= 5: n=351, marg_LL=-3.693570, backoff_LL=-3.692869, delta=+0.000701
  k=10: n=346, marg_LL=-3.698555, backoff_LL=-3.697894, delta=+0.000661

MULTI-STEP EVALUATION SUMMARY
 n_bins  k  n_valid  marginal_LL  backoff_LL  delta_LL
     55  1      355    -3.692661   -3.687368  0.005294
     55  2      354    -3.693314   -3.692620  0.000694
     55  3      353    -3.694515   -3.693811  0.000704
     55  5      351    -3.693570   -3.692869  0.000701
     55 10      346    -3.698555   -3.697894  0.000661


## (5) Final Conclusion

In [9]:
os.makedirs("results/diagnostics_v2", exist_ok=True)
ts = datetime.now().strftime("%Y%m%d_%H%M%S")

df_mi.to_csv(f"results/diagnostics_v2/mi_permutation_{ts}.csv", index=False)
df_baselines.to_csv(f"results/diagnostics_v2/baselines_k1_{ts}.csv", index=False)
df_multistep.to_csv(f"results/diagnostics_v2/multistep_{ts}.csv", index=False)

print("=" * 90)
print("FINAL CONCLUSION — Is there conditional signal beyond marginal?")
print("=" * 90)

# (i) MI permutation test
print("\n(i) MI Permutation Test:")
any_mi_sig = False
for _, row in df_mi.iterrows():
    nb = int(row['n_bins'])
    p = row['p_value']
    delta = row['delta_mi_nats']
    z = row['z_score']
    if p < 0.01 and delta > 0.001:
        print(f"  n_bins={nb}: p={p:.4f}, deltaMI={delta:.6f} nats, z={z:.2f} "
              f"-> SIGNIFICANT signal")
        any_mi_sig = True
    elif p < 0.01:
        print(f"  n_bins={nb}: p={p:.4f}, deltaMI={delta:.6f} nats, z={z:.2f} "
              f"-> Detectable but tiny")
        any_mi_sig = True
    else:
        print(f"  n_bins={nb}: p={p:.4f}, deltaMI={delta:.6f} nats, z={z:.2f} "
              f"-> NOT significant")

# (ii) Tuned backoff baseline
print("\n(ii) Tuned Backoff Baseline (k=1, TEST):")
best_delta_ll = -np.inf
best_nb_ll = None
any_beats_marginal = False
for _, row in df_baselines.iterrows():
    nb = int(row['n_bins'])
    d_add = row['additive_delta']
    d_bk  = row['backoff_delta']
    best_d = max(d_add, d_bk)
    label = 'backoff' if d_bk >= d_add else 'additive'
    if best_d > 0:
        print(f"  n_bins={nb}: best delta={best_d:+.6f} nats ({label}) -> BEATS marginal")
        any_beats_marginal = True
    else:
        print(f"  n_bins={nb}: best delta={best_d:+.6f} nats ({label}) -> does NOT beat marginal")
    if best_d > best_delta_ll:
        best_delta_ll = best_d
        best_nb_ll = nb

# (iii) Multi-step
print("\n(iii) Multi-Step (k>1):")
any_multistep_sig = False
for _, row in df_multistep.iterrows():
    nb = int(row['n_bins'])
    k = int(row['k'])
    d = row['delta_LL']
    if d > 0:
        print(f"  n_bins={nb}, k={k}: delta={d:+.6f} -> BEATS marginal")
        if k > 1:  # k=1 is already covered in (ii); only longer horizons set the flag
            any_multistep_sig = True
    elif k <= 3:  # only report small k
        print(f"  n_bins={nb}, k={k}: delta={d:+.6f}")

# Overall verdict
print("\n" + "=" * 90)
print("VERDICT")
print("=" * 90)

if any_beats_marginal and any_mi_sig:
    print(f"  YES — Conditional signal EXISTS.")
    print(f"  Best improvement: {best_delta_ll:+.6f} nats/sample at n_bins={best_nb_ll}.")
    print(f"  MI permutation test confirms statistical significance.")
    if best_delta_ll < 0.01:
        print(f"  However, the effect is SMALL ({best_delta_ll:.6f} nats).")
        print(f"  This may be too weak for a neural model to reliably capture.")
    else:
        print(f"  Effect size is non-trivial. Neural model should be able to exploit this.")
elif any_mi_sig and not any_beats_marginal:
    print(f"  WEAK — MI shows statistically detectable dependence,")
    print(f"  but even the best-tuned count-based baseline cannot beat marginal on test.")
    print(f"  The signal is too weak to be useful for prediction.")
elif any_beats_marginal and not any_mi_sig:
    print(f"  AMBIGUOUS — Backoff baseline beats marginal on test (delta={best_delta_ll:+.6f}),")
    print(f"  but MI permutation test is not significant. Could be finite-sample artifact.")
else:
    print(f"  NO — No detectable conditional signal.")
    print(f"  MI permutation test: not significant.")
    print(f"  Best tuned baseline: delta={best_delta_ll:+.6f} nats (does not beat marginal).")
    print(f"  The current-state bin carries NO useful information about the next-state bin.")

if any_multistep_sig:
    print(f"\n  Multi-step note: Some k>1 horizons show positive delta,")
    print(f"  suggesting persistence effects at longer horizons.")

print("\n" + "=" * 90)
print(f"All results saved to results/diagnostics_v2/ (timestamp: {ts})")
print("=" * 90)

FINAL CONCLUSION — Is there conditional signal beyond marginal?

(i) MI Permutation Test:
  n_bins=25: p=0.0020, deltaMI=0.061146 nats, z=5.72 -> SIGNIFICANT signal
  n_bins=35: p=0.0020, deltaMI=0.058182 nats, z=4.23 -> SIGNIFICANT signal
  n_bins=40: p=0.0020, deltaMI=0.043572 nats, z=3.16 -> SIGNIFICANT signal
  n_bins=55: p=0.0100, deltaMI=0.031384 nats, z=2.46 -> SIGNIFICANT signal

(ii) Tuned Backoff Baseline (k=1, TEST):
  n_bins=25: best delta=+0.002152 nats (additive) -> BEATS marginal
  n_bins=35: best delta=-0.001290 nats (backoff) -> does NOT beat marginal
  n_bins=40: best delta=-0.000873 nats (backoff) -> does NOT beat marginal
  n_bins=55: best delta=+0.001940 nats (backoff) -> BEATS marginal

(iii) Multi-Step (k>1):
  n_bins=55, k=1: delta=+0.005294 -> BEATS marginal
  n_bins=55, k=2: delta=+0.000694 -> BEATS marginal
  n_bins=55, k=3: delta=+0.000704 -> BEATS marginal
  n_bins=55, k=5: delta=+0.000701 -> BEATS marginal
  n_bins=55, k=10: delta=+0.000661 -> BEATS margin