<a href="https://colab.research.google.com/github/jacquelinedoan/seq_test/blob/main/sequential_test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Reinforcement Learning for Optimal Alpha Spending Function in Sequential Hypothesis Testing**
Rambling by Jacqueline.

## Introduction
Product experiments commonly take the form of a $z$-test comparing the mean of some KPI between the control and test groups, say $\mu_1$ and $\mu_2$ respectively. The test is conducted once, after all data has been collected, with hypotheses
$$H_0: \mu_1 = \mu_2$$
$$H_A: \mu_1 < \mu_2$$

Given significance level $\alpha$, if we find statistical significance, $H_0$ is rejected and we accept $H_A$; the new product is considered a success.
*However, what happens when we do not find significance?*

Say that we decide more data is needed and extend the data collection period. Both the control and test groups grow in sample size. We then conduct the $z$-test a second time at some significance level. *Why don't we continue this process until we reach significance?*

### *Sampling to reach a foregone conclusion*
**Type I Error** (False Positive Error) is the risk of incorrectly rejecting a true $H_0$. In other words, we conclude the new version of the product is a success while it is not, risking shipping a product that does not have a positive effect on our customers.

**Significance level $\alpha$** is also the maximum probability of observing Type I Errors that the experimenter will accept in the long run.

If we conduct the test once, the probability of not making a false positive (FP) error is $1-\alpha$. If we conduct the test $k$ times, each at level $\alpha$, and treat the tests as independent for simplicity, the chance of observing at least one false positive grows with $k$:
$$P(\geq 1 \text{ FP }) = 1 - P(< 1 \text{ FP }) = 1 - (1-\alpha)^k$$

*As $k \to \infty$, this probability approaches 1: test often enough and a false positive is all but guaranteed by chance alone.*
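
To get a feel for how quickly this grows, the formula can be tabulated directly. The sketch below is only an illustration: the helper name `familywise_error` is made up for this example, and it relies on the same independence simplification as the formula.

```python
# Probability of at least one false positive across k tests, each run at level alpha.
# Assumes the k tests are independent (the simplification behind 1 - (1 - alpha)^k).
alpha = 0.05


def familywise_error(alpha: float, k: int) -> float:
    """Return 1 - (1 - alpha)^k."""
    return 1.0 - (1.0 - alpha) ** k


for k in (1, 2, 5, 10, 20, 50):
    print(f"k={k:2d}  P(>=1 FP) = {familywise_error(alpha, k):.3f}")
```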

---
**Formal Example**

Let $X_1, X_2, \dots$ be i.i.d. $N(\mu, \sigma^2)$ with known $\sigma$. Consider a statistical test of
$$H_0: \mu = 0$$
$$H_1: \mu \neq 0$$
Given fixed sample size $n$ and $\alpha=0.05$, we reject $H_0$ iff
$$\left|\sum_{i=1}^nX_i\right| > 1.96\sigma \sqrt{n}$$
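
In practice the looks are taken on accumulating data, so the tests are correlated rather than independent. A small Monte Carlo sketch can show what happens when the fixed-sample rule $|\sum X_i| > 1.96\sigma\sqrt{n}$ is reapplied at every look. The function name, the five looks, and the 20 observations per look are arbitrary choices for illustration, not part of the original setup.

```python
import numpy as np


def peeking_type1_rate(n_looks=5, n_per_look=20, sigma=1.0, z_crit=1.96,
                       n_sims=20000, seed=0):
    """Simulate under H0 (mu = 0) and apply the fixed-sample rule at every look."""
    rng = np.random.default_rng(seed)
    # Each row is one simulated experiment with all observations drawn up front.
    x = rng.normal(0.0, sigma, size=(n_sims, n_looks * n_per_look))
    s = np.cumsum(x, axis=1)
    look_n = n_per_look * np.arange(1, n_looks + 1)   # sample size at each look
    s_at_looks = s[:, look_n - 1]                     # running sums at the looks
    crossed = np.abs(s_at_looks) > z_crit * sigma * np.sqrt(look_n)
    # Rejection rate at the first look only, and at any of the looks.
    return crossed[:, 0].mean(), crossed.any(axis=1).mean()


single, repeated = peeking_type1_rate()
print(f"first look only: {single:.3f}   any of the looks: {repeated:.3f}")
```

The first number should sit near the nominal 0.05, while the second is noticeably larger, which is the inflation that sequential methods are meant to control.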

---


The integral (lol) relationship between $\alpha$ and the FP rate calls for an algorithm to compute it.


---
**Numerical Procedure to Compute Type I Error Rate**

Let $X_1, X_2, \dots$ be i.i.d. $N(\mu, \sigma^2)$ and, without loss of generality, let $\sigma=1$. We conduct a sequence of hypothesis tests
$$H_0: \mu = 0$$
$$H_1: \mu \neq 0$$
and we label each test $k = 1,2, \dots, K$. At each $k$, let $S_k = \sum_{i=1}^k X_i$ and let the boundary value be $b_k$. Sampling is terminated at step $k$ when we observe $|S_k| > b_k$ for the first time.

The boundary values must satisfy the following
$$ \begin{align} P(|S_k| > b_k \text{ for some } k ) &= \alpha \\
\implies P(|S_k| \leq b_k \text{ for all } k ) &= 1 - \alpha
\end{align}
$$
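Furthermore, $f_k$, the (sub-)density of $S_k$ under $H_0$ on the event that sampling has not stopped before look $k$, satisfies the following recursive definition based on convolution, starting from $f_1 = \phi$: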
$$f_k(s) = \int_{-b_{k-1}}^{b_{k-1}}f_{k-1}(u)\phi(s-u)du$$
where $\phi$ is the standard normal pdf. Let $k^*$ be the random variable for the first $k$ at which $|S_k|>b_k$; the probability of stopping at or before $k$ is
$$\begin{align}
P_k &= P(k^* \leq k)\\
&= 1 - P(|S_1|\leq b_1, \dots, |S_k|\leq b_k) \\
& = 1 - \int_{-b_k}^{b_k}f_k(u)du
\end{align}
$$
The exit probability $P(k^* = k)$ is
$$\begin{align}
P_k - P_{k-1} &= P(|S_1|\leq b_1, \dots, |S_{k-1}|\leq b_{k-1}, |S_k|>b_k) \\
& = \int_{-b_{k-1}}^{b_{k-1}}f_{k-1}(u)\left[1-\Phi(b_k-u) + \Phi(-b_k -u)\right] du
\end{align}
$$
where $\Phi$ is the standard normal distribution function. The overall significance of the sequential procedure is
$$\alpha = 1 - \int_{-b_K}^{b_K}f_K(u)du $$
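
The recursion above can be evaluated numerically on a grid. The following is a minimal sketch under stated assumptions: a plain trapezoid rule, a fixed number of grid points per look, and boundaries given on the $S_k$ scale with $\sigma = 1$. The names `phi`, `trapezoid`, and `overall_alpha` are illustrative and not part of the notebook's code.

```python
import numpy as np


def phi(x):
    """Standard normal pdf."""
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)


def trapezoid(y, x):
    """Trapezoid rule along the last axis of y, for grid points x."""
    return np.sum(0.5 * (y[..., 1:] + y[..., :-1]) * np.diff(x), axis=-1)


def overall_alpha(bounds, n_grid=1001):
    """Overall Type I error for boundaries b_1..b_K on the S_k scale (sigma = 1)."""
    # f_1 is the N(0, 1) density, evaluated on the continuation region [-b_1, b_1].
    u = np.linspace(-bounds[0], bounds[0], n_grid)
    f = phi(u)
    for b in bounds[1:]:
        s = np.linspace(-b, b, n_grid)
        # f_k(s) = integral over [-b_{k-1}, b_{k-1}] of f_{k-1}(u) * phi(s - u) du
        f = trapezoid(f[None, :] * phi(s[:, None] - u[None, :]), u)
        u = s
    # alpha = 1 - integral over [-b_K, b_K] of f_K(u) du
    return 1.0 - trapezoid(f, u)


# Reusing the unadjusted z threshold 1.96 at every look means b_k = 1.96 * sqrt(k).
for K in (1, 2, 5):
    b = [1.96 * np.sqrt(k) for k in range(1, K + 1)]
    print(f"K={K}: overall alpha = {overall_alpha(b):.4f}")
```

With a single look the result should match the nominal 0.05, and with more looks at the same threshold it should come out well above 0.05. Solving for boundaries that hit a target $\alpha$ would wrap this function in a root finder.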

---



### Literature
Sequential tests are statistical procedures designed to control this inflation of the Type I error when the data is examined repeatedly. Well-known techniques include group sequential tests, always-valid inference, and corrected-alpha approaches.

In this notebook, I'm interested in group sequential tests and, in particular, in finding the optimal **$\alpha$ spending function**: a function that spreads $\alpha$ over the sequence of $z$-tests as $\alpha_1, \dots, \alpha_K$ with $\sum_k \alpha_k = \alpha$. While there are many formulations of this function, I'm interested in framing the problem as an RL problem and viewing the function as an optimal policy. We view the problem setup stated by Lan and DeMets (1983) through the lens of product experimentation.
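
For reference, two spending functions commonly associated with the Lan-DeMets approach are the O'Brien-Fleming-type $\alpha(t) = 2 - 2\Phi(z_{\alpha/2}/\sqrt{t})$ and the Pocock-type $\alpha(t) = \alpha \ln(1 + (e-1)t)$, where $t \in (0, 1]$ is the fraction of information observed. The sketch below uses these two-sided forms as hand-crafted baselines that a learned policy could be compared against. The function names and the choice of five equally spaced looks are assumptions made for illustration.

```python
import math
from statistics import NormalDist

_N = NormalDist()  # standard normal, from the Python standard library


def obf_spend(t: float, alpha: float = 0.05) -> float:
    """O'Brien-Fleming-type cumulative spend: 2 - 2 * Phi(z_{alpha/2} / sqrt(t))."""
    if t <= 0.0:
        return 0.0
    z = _N.inv_cdf(1.0 - alpha / 2.0)
    return 2.0 - 2.0 * _N.cdf(z / math.sqrt(t))


def pocock_spend(t: float, alpha: float = 0.05) -> float:
    """Pocock-type cumulative spend: alpha * ln(1 + (e - 1) * t)."""
    return alpha * math.log(1.0 + (math.e - 1.0) * t)


K = 5
fractions = [k / K for k in range(1, K + 1)]  # equally spaced information fractions
for name, fn in (("OBF-type", obf_spend), ("Pocock-type", pocock_spend)):
    cumulative = [fn(t) for t in fractions]
    # Incremental spends alpha_1..alpha_K are differences of the cumulative spend.
    increments = [c - p for c, p in zip(cumulative, [0.0] + cumulative[:-1])]
    print(name, [round(a, 4) for a in increments])
```

The O'Brien-Fleming-type function spends very little early and most of the budget at the end, while the Pocock-type function spends more evenly. Both spend exactly $\alpha$ at $t = 1$.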

### Constrained Markov Decision Process (CMDP)
Consider a Decision Maker (DM) who conducts a sequence of two-sample $t$-tests (unknown variance) between the control and test groups with
$$H_0: \mu_C=\mu_T $$
$$H_A: \mu_C < \mu_T$$

Framing the problem as a CMDP, our goal is to maximize power (probability of correct rejection) and the constraint is FPR $ = P(\text{ Reject } H_0 | H_0 \text{ true })\leq \alpha$. Or, formally,
$$\max_\pi E_\pi(\text{Power}) \text{ s.t. } E_\pi(\text{Type I Error rate}) \leq \alpha $$
or equivalently,
$$\max_\pi E_{H_A}(\text{Utility}) \text{ s.t. } E_{H_0}(1\{\text{Reject}\}) \leq \alpha $$
1. **State $s_k$**
The state vector must contain all the information from the past that matters for future decisions for the policy to be Markovian. At each step $k$, the sufficient state vector is
$$s_k = (t_k, \nu_k, n_C, n_T, \alpha_\text{rem})$$
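where
$$t_k=\frac{\bar{X}_T - \bar{X}_C}{\sqrt{s^2_C/n_C + s^2_T/n_T} }$$
is the Welch $t$-statistic at look $k$, $\nu_k$ is its approximate (Welch-Satterthwaite) degrees of freedom, $n_C$ and $n_T$ are the current sample sizes, and $\alpha_\text{rem}$ is the significance budget not yet spent. The belief state $b_k$ is the probability that $H_A$ is true given the data so far.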


2. **Action $a_k$**
      *   Stop and Reject: Declare $H_A$
      *   Continue Sampling and Propose an incremental spend $\Delta\alpha_k \in [0, \alpha_{rem}]$
      *   Stop and Accept: Terminal state - Declare $H_0$
3. **Step Reward**
      * Stop and Reject and $H_A$ is true: $+R_{TP}$
      * Stop and Reject and $H_0$ is true: $-R_{FP}$
      * Stop and Accept and $H_A$ is true: $-R_{FN}$
      * Continue Sampling: $-c$ for sampling cost
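
The reward table above can be written as a small function. This is a sketch only: the numeric magnitudes in `RewardConfig` are placeholders, the action labels are made-up strings, and the zero reward for correctly accepting $H_0$ is an assumption (the list does not state it, though it matches the `u = 0.0` case in the training code further down).

```python
from dataclasses import dataclass


@dataclass
class RewardConfig:
    # Placeholder magnitudes, chosen only for illustration.
    r_tp: float = 1.0   # reward for a true positive
    r_fp: float = 1.0   # penalty for a false positive
    r_fn: float = 1.0   # penalty for a false negative
    c: float = 0.01     # per-look sampling cost


def step_reward(action: str, h_a_true: bool, cfg: RewardConfig = RewardConfig()) -> float:
    """Reward for one decision, given which hypothesis is actually true."""
    if action == "continue":
        return -cfg.c
    if action == "reject":
        return cfg.r_tp if h_a_true else -cfg.r_fp
    if action == "accept":
        # Assumption: correctly accepting H0 earns nothing.
        return -cfg.r_fn if h_a_true else 0.0
    raise ValueError(f"unknown action: {action!r}")


for act in ("reject", "accept", "continue"):
    print(act, step_reward(act, h_a_true=True), step_reward(act, h_a_true=False))
```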



**Primal-Dual (Lagrangian)**
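- Maximize $L(\pi, \lambda) = E_\pi(\text{Utility}) - \lambda\,(C(\pi) - \alpha)$ over $\pi$ using PPO, where $C(\pi)$ is the Type I error rate of $\pi$.
- Update $\lambda \leftarrow [\lambda + \eta_\lambda(\hat{C}(\pi)-\alpha)]_+$, where $\hat{C}$ is the empirical Type I error estimate and $\eta_\lambda$ is the dual step size.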
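
In isolation, the dual step is a projected gradient ascent on $\lambda$. The sketch below shows only that update and the per-episode penalised return. The function names and the step size are illustrative, and the constant $\lambda\alpha$ term of the Lagrangian is dropped because it does not depend on the policy.

```python
def dual_update(lmbda: float, type1_hat: float, alpha: float, eta: float) -> float:
    """Projected dual ascent: lambda <- [lambda + eta * (C_hat - alpha)]_+."""
    return max(0.0, lmbda + eta * (type1_hat - alpha))


def lagrangian_return(utility: float, false_positive: bool, lmbda: float) -> float:
    """Per-episode penalised return: utility - lambda * 1{false positive}."""
    return utility - lmbda * float(false_positive)


lmbda = 1.0
for type1_hat in (0.10, 0.08, 0.05, 0.02):
    lmbda = dual_update(lmbda, type1_hat, alpha=0.05, eta=0.5)
    print(f"type1_hat={type1_hat:.2f} -> lambda={lmbda:.3f}")
```

The multiplier rises while the estimated Type I error exceeds $\alpha$ and falls once it drops below, which is what pushes the policy back toward the feasible region.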

In [None]:
"""
alpha_spend_t_ppo.py

PPO + primal-dual scaffold to learn an alpha-spending function for sequential two-sample t-tests.
- Uses Monte-Carlo conditional inversion to map proposed Δα -> t-boundary (cached).
- Policy outputs a fraction of remaining alpha to spend and a futility-stop probability.
- Episode-level cost: indicator of false positive (reject under H0). Constraint enforced with Lagrange multiplier λ.
- This is a scaffold for experiments — increase MC sizes and training iterations for production.

Run: python alpha_spend_t_ppo.py
"""

import os, math, time, pickle, random
from collections import namedtuple
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

# Simple RNG seeding for reproducibility

SEED = 1234
np.random.seed(SEED)
torch.manual_seed(SEED)
random.seed(SEED)
# -------------------- Config --------------------
DEFAULT_CONFIG = {
    "alpha": 0.05,
    "max_looks": 6,
    "batch_per_look": 5,   # number of samples per group added at each look
    "mc_precompute_n": 2000,
    "mc_lookup_n": 800,
    "cache_path": "/mnt/data/alpha_t_cache.pkl",
    "train_iters": 200,
    "batch_episodes": 64,
    "lr": 3e-4,
    "lambda_lr": 1e-3,
    "ppo_clip": 0.2,
    "device": "cpu",
    "grid_delta_alpha": np.linspace(1e-4, 0.2, 30).tolist(),
    "info_grid": list(range(1,7)),
    # simulation specifics
    "mu0": 0.0,    # control mean
    "delta": 0.5,  # true effect mean shift under H1
    "sigma": 1.0,  # assumed common sd for data generation
    "init_n_per_group": 2,  # initial samples per group at look 1
}

EpisodeResult = namedtuple("EpisodeResult", ["trajectory", "rejected", "true_H", "n_looks", "typeI_flag"])

# -------------------- Boundary cache / MC solver --------------------
class BoundaryCache:
    def __init__(self, path=None):
        self.cache = {}
        self.path = path
        if path and os.path.exists(path):
            try:
                with open(path, "rb") as f:
                    self.cache = pickle.load(f)
                print(f"Loaded cache from {path} ({len(self.cache)} entries)")
            except Exception as e:
                print("Failed loading cache:", e)
                self.cache = {}
    def save(self):
        if not self.path: return
        try:
            with open(self.path, "wb") as f:
                pickle.dump(self.cache, f)
            print(f"Saved cache to {self.path} ({len(self.cache)} entries)")
        except Exception as e:
            print("Failed saving cache:", e)
    def key(self, prev_bounds, info_idx, delta_alpha):
        k = sum(1 for b in prev_bounds if b is not None and np.isfinite(b))
        m = 0 if len(prev_bounds)==0 else int(np.floor(min(prev_bounds)*10.0))
        da_q = round(float(delta_alpha), 5)
        return (k, m, info_idx, da_q)
    def get(self, prev_bounds, info_idx, delta_alpha):
        return self.cache.get(self.key(prev_bounds, info_idx, delta_alpha), None)
    def set(self, prev_bounds, info_idx, delta_alpha, h):
        self.cache[self.key(prev_bounds, info_idx, delta_alpha)] = float(h)

def compute_t_stat_increment(n_new, mu_control, mu_treat, sigma, rng):
    """
    Simulate n_new new observations per group and return the Welch two-sample t-statistic
    together with its Welch-Satterthwaite degrees of freedom. Both are computed on this
    incremental batch only, not on the cumulative sample.
    For MC boundary computation under H0, set mu_control == mu_treat.
    """
    # generate samples
    x_c = rng.normal(loc=mu_control, scale=sigma, size=n_new)
    x_t = rng.normal(loc=mu_treat, scale=sigma, size=n_new)
    # compute sample means and variances
    n1 = len(x_c); n2 = len(x_t)
    m1 = x_c.mean(); m2 = x_t.mean()
    s1 = x_c.var(ddof=1) if n1>1 else 0.0
    s2 = x_t.var(ddof=1) if n2>1 else 0.0
    # Welch t for these incremental batches (not cumulative)
    denom = math.sqrt(s1/n1 + s2/n2) if (s1>0 or s2>0) else 1e-8
    t = (m2 - m1) / denom
    # approximate df (Welch-Satterthwaite) for incremental
    num = (s1/n1 + s2/n2)**2
    den = 0.0
    if n1>1:
        den += (s1/n1)**2 / (n1 - 1)
    if n2>1:
        den += (s2/n2)**2 / (n2 - 1)
    df = num / den if den>0 else max(n1+n2-2, 1.0)
    return float(t), float(df)

def compute_boundary_conditional(prev_bounds, info_idx, delta_alpha, n_mc=2000, rng=None, config=None):
    """
    Approximate the boundary h such that, among simulated null statistics that survived the
    previous looks, a fraction delta_alpha lies at or above h.
    Simplifying assumptions: the statistic at the current look is drawn as Normal(0,1), and
    survival is approximated by requiring the draw to be <= min(prev_bounds). Draws are
    independent across looks, so the correlation between cumulative statistics is ignored.
    info_idx and config are accepted for interface compatibility but are not used here.
    """
    if rng is None:
        rng = np.random.RandomState()
    if delta_alpha <= 1e-12:
        return 10.0
    if delta_alpha >= 1 - 1e-12:
        return -10.0
    batch = max(2048, n_mc)
    if not prev_bounds:
        # No previous boundaries: sample the null statistic directly
        samples = rng.normal(loc=0.0, scale=1.0, size=n_mc)
    else:
        # Survival conditioning via rejection sampling against the smallest previous boundary
        thr = min(prev_bounds)
        survivors = []
        tries = 0
        while len(survivors) < n_mc and tries < 200:
            draws = rng.normal(size=batch)
            survivors.extend(draws[draws <= thr].tolist())
            tries += 1
            if tries>100 and len(survivors)==0:
                # fallback: unconditional normals, drawn from the same rng for reproducibility
                survivors = rng.normal(size=n_mc).tolist()
                break
        samples = np.array(survivors[:n_mc])
    # empirical (1 - delta_alpha) quantile
    h = float(np.quantile(samples, 1.0 - float(delta_alpha)))
    return h

def precompute_boundaries(cache, config):
    deltas = config["grid_delta_alpha"]
    info_grid = config["info_grid"]
    max_prev = config["max_looks"] - 1
    rng = np.random.RandomState(SEED)
    total = len(deltas) * len(info_grid) * (max_prev + 1)
    i = 0
    t0 = time.time()
    for k in range(max_prev + 1):
        prev_bounds = [2.5] * k if k>0 else []
        for info_idx in info_grid:
            for da in deltas:
                key = cache.key(prev_bounds, info_idx, da)
                if key in cache.cache:
                    continue
                h = compute_boundary_conditional(prev_bounds, info_idx, da, n_mc=config["mc_precompute_n"], rng=rng, config=config)
                cache.cache[key] = float(h)
                i += 1
                if i % 200 == 0:
                    print(f"Precompute {i}/{total} elapsed {time.time()-t0:.1f}s")
    cache.save()

def lookup_boundary(cache, prev_bounds, info_idx, delta_alpha, config):
    hit = cache.get(prev_bounds, info_idx, delta_alpha)
    if hit is not None:
        return hit
    # nearest-key heuristic
    k = sum(1 for b in prev_bounds if b is not None and np.isfinite(b))
    candidates = []
    for key, val in cache.cache.items():
        if key[0]==k and key[2]==info_idx:
            candidates.append((abs(key[3]-round(float(delta_alpha),5)), val))
    if candidates:
        candidates.sort(key=lambda x: x[0])
        return float(candidates[0][1])
    # fallback MC
    h = compute_boundary_conditional(prev_bounds, info_idx, delta_alpha, n_mc=config["mc_lookup_n"], rng=np.random.RandomState(), config=config)
    cache.set(prev_bounds, info_idx, delta_alpha, h)
    return h

# -------------------- Simulator (sequential two-sample t-test) --------------------
def run_episode(policy, cache, config, rng=None, quick_approx=False):
    if rng is None:
        rng = np.random.RandomState()
    max_looks = config["max_looks"]
    batch_per_look = config["batch_per_look"]
    alpha = config["alpha"]
    mu0 = config["mu0"]
    delta = config["delta"]
    sigma = config["sigma"]
    # sample true hypothesis
    H1 = (rng.rand() < 0.5)
    mu_t = mu0 + (delta if H1 else 0.0)
    # initialize samples (small initial burn-in)
    n_c = config.get("init_n_per_group", 2)
    n_t = config.get("init_n_per_group", 2)
    # draw initial data
    data_c = list(rng.normal(loc=mu0, scale=sigma, size=n_c))
    data_t = list(rng.normal(loc=mu_t, scale=sigma, size=n_t))
    mean_c = np.mean(data_c); mean_t = np.mean(data_t)
    var_c = np.var(data_c, ddof=1) if n_c>1 else 0.0
    var_t = np.var(data_t, ddof=1) if n_t>1 else 0.0
    alpha_rem = alpha
    prev_boundaries = []
    trajectory = []
    rejected = False
    typeI_flag = False
    Z = 0.0  # we'll keep t-stat in variable name t_stat
    for look in range(1, max_looks+1):
        # compute t-stat (Welch)
        denom = math.sqrt((var_c / n_c) + (var_t / n_t)) if (var_c>0 or var_t>0) else 1e-8
        t_stat = (mean_t - mean_c) / denom
        # approximate df
        num = (var_c/n_c + var_t/n_t)**2
        den = 0.0
        if n_c>1:
            den += (var_c/n_c)**2 / (n_c - 1)
        if n_t>1:
            den += (var_t/n_t)**2 / (n_t - 1)
        nu = num/den if den>0 else max(n_c+n_t-2, 1.0)
        state = np.array([t_stat, nu, n_c, n_t, alpha_rem], dtype=np.float32)
        action, logp, value = policy.act(state)
        delta_alpha = float(action['delta_alpha_frac']) * alpha_rem
        delta_alpha = max(0.0, min(alpha_rem, delta_alpha))
        info_idx = look
        # map to boundary
        if delta_alpha > 0:
            if quick_approx:
                # crude: empirical normal quantile, drawn from the episode rng for reproducibility
                h_t = float(np.quantile(rng.normal(size=10000), 1.0 - delta_alpha))
            else:
                h_t = lookup_boundary(cache, prev_boundaries, info_idx, delta_alpha, config)
        else:
            h_t = 10.0
        # sample next batch
        x_c = rng.normal(loc=mu0, scale=sigma, size=batch_per_look)
        x_t = rng.normal(loc=mu_t, scale=sigma, size=batch_per_look)
        # update data summaries incrementally
        # append then recompute mean/var for simplicity (n small)
        data_c.extend(x_c.tolist())
        data_t.extend(x_t.tolist())
        n_c = len(data_c); n_t = len(data_t)
        mean_c = np.mean(data_c); mean_t = np.mean(data_t)
        var_c = np.var(data_c, ddof=1) if n_c>1 else 0.0
        var_t = np.var(data_t, ddof=1) if n_t>1 else 0.0
        # recompute t_stat after including new batch
        denom = math.sqrt((var_c / n_c) + (var_t / n_t)) if (var_c>0 or var_t>0) else 1e-8
        t_stat = (mean_t - mean_c) / denom
        # check crossing
        if t_stat >= h_t:
            rejected = True
            trajectory.append((state, action, logp, value, 0.0))
            break
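        # futility stop: reuse the bit sampled in policy.act when it is provided, so the stored
        # log-probability matches the executed action; otherwise sample it here
        stop_fut_prob = float(action['stop_fut_prob'])
        stop_now = bool(action['stop_fut']) if 'stop_fut' in action else (rng.rand() < stop_fut_prob)
        if stop_now: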
            rejected = False
            break
        # continue
        alpha_rem -= delta_alpha
        if delta_alpha > 0:
            prev_boundaries.append(h_t)
        trajectory.append((state, action, logp, value, -1.0))
        if alpha_rem <= 1e-12:
            break
    if rejected and not H1:
        typeI_flag = True
    # `look` holds the index of the last look performed, whichever branch ended the episode
    return EpisodeResult(trajectory=trajectory, rejected=rejected, true_H=int(H1), n_looks=look, typeI_flag=typeI_flag)

# -------------------- PPO policy --------------------
class PPOPolicy(nn.Module):
    def __init__(self, obs_dim=5, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden), nn.ReLU())
        self.stop_fut_head = nn.Linear(hidden, 1)
        self.delta_head = nn.Linear(hidden, 1)
        self.value_head = nn.Linear(hidden, 1)
    def forward(self, x):
        h = self.net(x)
        stop_logit = self.stop_fut_head(h).squeeze(-1)
        delta_raw = torch.sigmoid(self.delta_head(h)).squeeze(-1)
        value = self.value_head(h).squeeze(-1)
        stop_prob = torch.sigmoid(stop_logit)
        return stop_prob, delta_raw, value
    def act(self, state):
        st = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
        with torch.no_grad():
            stop_prob, delta_raw, val = self.forward(st)
        stop_prob = float(stop_prob.item())
        delta_raw = float(delta_raw.item())
        stop_fut = 1.0 if np.random.rand() < stop_prob else 0.0
        logp = math.log(stop_prob + 1e-8) if stop_fut==1.0 else math.log(1.0 - stop_prob + 1e-8)
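        # include the sampled stop bit so callers can execute the same action that logp refers to
        action = {'stop_fut_prob': stop_prob, 'delta_alpha_frac': delta_raw, 'stop_fut': stop_fut}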
        return action, logp, float(val.item())

# -------------------- Training loop (primal-dual PPO-like) --------------------
def train_ppo(config, quick=False):
    device = torch.device(config.get("device","cpu"))
    policy = PPOPolicy().to(device)
    optimizer = optim.Adam(policy.parameters(), lr=config["lr"])
    cache = BoundaryCache(path=config.get("cache_path", None))
    # Precompute grid (unless quick)
    if len(cache.cache)==0 and not quick:
        print("Precomputing boundary cache (this may take a while)...")
        precompute_boundaries(cache, config)
    lambda_dual = 1.0
    for it in range(1, config["train_iters"]+1):
        episodes = []
        typeI_count = 0
        h0_count = 0
        for e in range(config["batch_episodes"]):
            ep = run_episode(policy, cache, config, rng=np.random.RandomState(), quick_approx=quick)
            episodes.append(ep)
            if ep.true_H == 0:
                h0_count += 1
            if ep.typeI_flag:
                typeI_count += 1
        # Type I error is conditional on H0, so normalise by the number of H0 episodes
        typeI_hat = typeI_count / max(h0_count, 1)
        lambda_dual = max(0.0, lambda_dual + config["lambda_lr"] * (typeI_hat - config["alpha"]))
        # prepare training batch
        states = []
        old_logps = []
        returns = []
        for ep in episodes:
            # episodic utility: +1 for TP, -1 for FN, -1 - lambda for FP under H0
            if ep.rejected and ep.true_H==1:
                u = 1.0
            elif not ep.rejected and ep.true_H==1:
                u = -1.0
            elif ep.rejected and ep.true_H==0:
                u = -1.0 - lambda_dual
            else:
                u = 0.0
            if len(ep.trajectory)>0:
                s0, a0, logp0, v0, r0 = ep.trajectory[0]
                states.append(s0)
                old_logps.append(logp0)
                returns.append(u)
            else:
                states.append(np.array([0.0,0.0, config.get("init_n_per_group",2), config.get("init_n_per_group",2), config["alpha"]], dtype=np.float32))
                old_logps.append(math.log(0.5))
                returns.append(u)
        states_t = torch.tensor(np.stack(states,axis=0), dtype=torch.float32).to(device)
        returns_t = torch.tensor(returns, dtype=torch.float32).to(device)
        old_logp_t = torch.tensor(old_logps, dtype=torch.float32).to(device)
        stop_prob_t, delta_t, val_t = policy.forward(states_t)
        advantages = (returns_t - val_t.detach()).detach()
        # actor loss (stop_prob only, approximate)
        sampled_bit = (old_logp_t > math.log(0.5)).float()
        curr_logp = sampled_bit * torch.log(stop_prob_t + 1e-8) + (1.0-sampled_bit) * torch.log(1.0 - stop_prob_t + 1e-8)
        ratios = torch.exp(curr_logp - old_logp_t)
        surr1 = ratios * advantages
        surr2 = torch.clamp(ratios, 1.0-config["ppo_clip"], 1.0+config["ppo_clip"]) * advantages
        actor_loss = -torch.mean(torch.min(surr1, surr2))
        delta_loss = -torch.mean(advantages * delta_t)
        critic_loss = torch.mean((val_t - returns_t)**2)
        loss = actor_loss + 0.5 * critic_loss + 0.1 * delta_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if it % 10 == 0 or it==1:
            avg_util = float(np.mean(returns))
            print(f"Iter {it:04d} | typeI_hat={typeI_hat:.3f} | lambda={lambda_dual:.3f} | avg_util={avg_util:.3f}")
    cache.save()
    return policy, cache, lambda_dual

# -------------------- Main (smoke run) --------------------
if __name__ == "__main__":
    cfg = DEFAULT_CONFIG.copy()
    cfg["train_iters"] = 60
    cfg["batch_episodes"] = 48
    cfg["mc_precompute_n"] = 400
    cfg["mc_lookup_n"] = 200
    cfg["cache_path"] = "/content/cache/alpha_t_cache_smoke.pkl"
    policy, cache, lambda_final = train_ppo(cfg, quick=True)
    print("Done. final lambda:", lambda_final)


Iter 0001 | typeI_hat=0.000 | lambda=1.000 | avg_util=-0.375
Iter 0010 | typeI_hat=0.021 | lambda=1.000 | avg_util=-0.396
Iter 0020 | typeI_hat=0.021 | lambda=0.999 | avg_util=-0.375
Iter 0030 | typeI_hat=0.021 | lambda=0.999 | avg_util=-0.417
Iter 0040 | typeI_hat=0.021 | lambda=0.999 | avg_util=-0.167
Iter 0050 | typeI_hat=0.021 | lambda=0.999 | avg_util=-0.292
Iter 0060 | typeI_hat=0.062 | lambda=0.998 | avg_util=-0.354
Saved cache to /content/cache/alpha_t_cache_smoke.pkl (0 entries)
Done. final lambda: 0.9984166666666667
