## Design: 
Suppose our current system uses an $\epsilon$-greedy policy for model selection (mostly choosing GPT-5, occasionally a cheaper model). Define a logging schema for each interaction: what context, action, and probability information must we log to enable unbiased OPE of a new routing policy? Outline the data structure clearly.

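Data to be logged:

1. Context $x_i$
2. Action taken $a_i$
3. Observed reward $r_i$
4. Propensity $p_i$: the probability that the logging policy chose $a_i$ given context $x_i$, logged together with the $\epsilon$ in effect at decision time.

   - If $a_i$ is the greedy action (here GPT-5), then $p_i = 1 - \epsilon + \epsilon/k$
   - If $a_i$ is any other action, then $p_i = \epsilon/k$

   where $k$ is the number of actions (here $k = 2$) and exploration draws uniformly over all $k$ actions, including the greedy one.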
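
The fields listed above can be sketched as a record type. This is a minimal sketch, assuming Python dataclasses; the names `LoggedInteraction` and `epsilon_greedy_propensity`, the action labels, and the example context field are hypothetical and not taken from any existing logging library.

```python
from dataclasses import dataclass
from typing import Any, Dict


def epsilon_greedy_propensity(action: str, greedy_action: str,
                              epsilon: float, k: int) -> float:
    """Probability that epsilon-greedy with uniform exploration picks `action`."""
    explore_share = epsilon / k
    if action == greedy_action:
        return (1.0 - epsilon) + explore_share
    return explore_share


@dataclass(frozen=True)
class LoggedInteraction:
    context: Dict[str, Any]  # x_i: features available at decision time
    action: str              # a_i: model actually chosen
    reward: float            # r_i: observed outcome
    propensity: float        # p_i: logging policy's probability of a_i given x_i
    epsilon: float           # exploration rate in effect when a_i was chosen
    num_actions: int         # k: size of the action set at decision time


# Example record for an exploratory pick of the cheaper model (illustrative values)
record = LoggedInteraction(
    context={"prompt_tokens": 120},
    action="cheap",
    reward=1.0,
    propensity=epsilon_greedy_propensity("cheap", "gpt-5", epsilon=0.3, k=2),
    epsilon=0.3,
    num_actions=2,
)
print(record)
```

Storing the propensity at decision time, rather than recomputing it later from $\epsilon$, keeps the log usable even if the logging policy's parameters change between interactions.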

## Coding: 
Implement IPS and Doubly Robust estimators for off-policy policy value. Use a synthetic logged dataset (e.g., generated by a known policy on a multi-armed bandit) and a candidate target policy. Compare their estimates to the ground-truth value. Then implement the SWITCH estimator: for each instance, decide to use IPS or a model prediction based on whether the importance weight is below a threshold. Show that SWITCH yields lower Mean Squared Error than plain IPS when the target policy is significantly different from the logging policy.
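
For reference, the three estimators can be written as follows for a context-free bandit with $n$ logged tuples $(a_i, r_i, p_i)$. The notation is an assumption chosen to match the logging schema: $\pi$ is the target policy, $\mu$ is the logging policy (so $p_i = \mu(a_i)$), $\hat{q}$ is a reward model, and $\tau$ is the switching threshold. The SWITCH form shown is the commonly stated per-action version, which applies the threshold to each action's weight rather than only to the logged action.

$$\hat{V}_{\mathrm{IPS}} = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(a_i)}{p_i}\, r_i$$

$$\hat{V}_{\mathrm{DR}} = \frac{1}{n} \sum_{i=1}^{n} \left[ \sum_{a} \pi(a)\, \hat{q}(a) + \frac{\pi(a_i)}{p_i} \bigl( r_i - \hat{q}(a_i) \bigr) \right]$$

$$\hat{V}_{\mathrm{SWITCH}} = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{\pi(a_i)}{p_i}\, r_i\, \mathbf{1}\!\left\{ \frac{\pi(a_i)}{p_i} \le \tau \right\} + \sum_{a} \pi(a)\, \hat{q}(a)\, \mathbf{1}\!\left\{ \frac{\pi(a)}{\mu(a)} > \tau \right\} \right]$$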



In [6]:
import random
import numpy as np

# 1. Setup
# Mean rewards for GPT-5 (0.8) and Cheap (0.3)
true_rewards = [0.8, 0.3] 
old_epsilon = 0.3
num_actions = 1000 # Number of logged interactions (not the number of arms)

actions = []
rewards = []
ps = []

# 2. Logging Data (Logging Policy)
for i in range(num_actions):
    # Epsilon-greedy
    if random.random() < old_epsilon:
        action = random.choice(range(len(true_rewards)))
    else:
        action = true_rewards.index(max(true_rewards))
    
    actions.append(action)
    reward = np.random.binomial(1, p=true_rewards[action])
    rewards.append(reward)

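    # Propensity of the logged action under the epsilon-greedy logging policy:
    # the greedy arm gets (1 - eps) plus its share of the uniform exploration
    if action == true_rewards.index(max(true_rewards)):
        p = (1 - old_epsilon) + (old_epsilon / len(true_rewards))
    else:
        p = old_epsilon / len(true_rewards)
    ps.append(p)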

# 3. Direct Method Estimation (q-model)
# Estimate each action's expected reward as its sample mean in the logged data
q_model = [0.0, 0.0]
for a in [0, 1]:
    rewards_a = [rewards[i] for i in range(num_actions) if actions[i] == a]
    if rewards_a:  # keep 0.0 if the action never appears in the log
        q_model[a] = np.mean(rewards_a)

# 4. Target Policy Definition
# Suppose the new policy ALWAYS picks the Cheap model (index 1)
def target_policy_prob(action_idx):
    return 1.0 if action_idx == 1 else 0.0

# 5. Evaluation (IPS and DR)
ips_sum = 0 
dr_sum = 0

for i in range(num_actions):
    # Probability target policy would take the action that WAS taken
    pi_a = target_policy_prob(actions[i])
    
    # Action the target policy would choose (used for the DR baseline)
    # Since it's deterministic, we know it's action 1
    target_action = 1 
    
    # IPS Calculation
    ips_sum += rewards[i] * (pi_a / ps[i])

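    # DR Calculation
    # Term 1: The model's guess for the target action (a valid baseline because
    #         the target policy is deterministic)
    # Term 2: The error correction (only active if logging action == target action)
    # Note: q_model is the per-arm sample mean of this same log, so the corrections
    # sum to zero and DR coincides with the direct-method estimate q_model[1] here.
    error_correction = (pi_a / ps[i]) * (rewards[i] - q_model[actions[i]])
    dr_sum += q_model[target_action] + error_correction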

print(f"True Expected Reward of Target Policy: {true_rewards[1]}")
print(f"IPS Estimated Reward: {ips_sum / num_actions:.4f}")
print(f"DR Estimated Reward:  {dr_sum / num_actions:.4f}")

True Expected Reward of Target Policy: 0.3
IPS Estimated Reward: 0.3267
DR Estimated Reward:  0.3379


In [None]:
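# Constants for the Switch
# If the importance weight pi_a / p exceeds tau, we stop trusting the logged reward
# and use the model only. The target action's weight is 1 / 0.15 (about 6.67), which
# is above tau = 5.0, so every instance below contributes only the model estimate.
tau = 5.0 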

switch_sum = 0

for i in range(num_actions):
    # 1. Target policy info
    target_action = 1  # Our new policy always wants the Cheap Model
    pi_a = 1.0 if actions[i] == target_action else 0.0
    
    baseline_q = q_model[target_action]
    
    weight = pi_a / ps[i]
    
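    if weight <= tau:
        # Weight within the threshold: use the doubly robust term (model baseline
        # plus importance-weighted correction), i.e. a DR-based variant of SWITCH
        correction = weight * (rewards[i] - q_model[actions[i]])
        switch_sum += baseline_q + correction
    else:
        # Weight above the threshold: drop the correction and rely on the model
        switch_sum += baseline_q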

final_switch_reward = switch_sum / num_actions
print(f"Switch Estimated Reward (tau={tau}): {final_switch_reward:.4f}")

Switch Estimated Reward (tau=5.0): 0.3379
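
A single run cannot show a difference in Mean Squared Error, so one option is to repeat the experiment many times and average the squared errors. The following is a minimal sketch, assuming the same two-arm setup (mean rewards 0.8 and 0.3, $\epsilon = 0.3$, a target policy that always picks the cheap model) and a per-arm sample-mean reward model. The function name `simulate_mse`, the seed, and the number of trials are illustrative choices. It uses the per-action form of SWITCH; with `tau = 5.0` the target action's weight of about 6.67 exceeds the threshold, so SWITCH reduces to the model estimate in every trial.

```python
import numpy as np


def simulate_mse(n=1000, trials=2000, epsilon=0.3, tau=5.0, seed=0):
    """Monte Carlo MSE of IPS and SWITCH for a target policy that always picks arm 1."""
    rng = np.random.default_rng(seed)
    true_rewards = np.array([0.8, 0.3])
    k = len(true_rewards)
    greedy = int(np.argmax(true_rewards))
    target_action = 1
    true_value = true_rewards[target_action]

    # Logging propensity of each arm under epsilon-greedy with uniform exploration
    mu = np.full(k, epsilon / k)
    mu[greedy] += 1.0 - epsilon
    target_weight = 1.0 / mu[target_action]

    ips_sq_err, switch_sq_err = [], []
    for _ in range(trials):
        actions = rng.choice(k, size=n, p=mu)
        rewards = rng.binomial(1, true_rewards[actions])
        hit = actions == target_action

        # IPS: logged rewards reweighted by pi(a_i) / p_i (zero off the target arm)
        weights = np.where(hit, target_weight, 0.0)
        ips = float(np.mean(weights * rewards))

        # Reward model: sample mean of the target arm (0.0 if it never appears)
        q_target = float(rewards[hit].mean()) if hit.any() else 0.0

        # SWITCH: keep IPS if the target arm's weight is within tau, else use the model
        switch = ips if target_weight <= tau else q_target

        ips_sq_err.append((ips - true_value) ** 2)
        switch_sq_err.append((switch - true_value) ** 2)

    return float(np.mean(ips_sq_err)), float(np.mean(switch_sq_err))


mse_ips, mse_switch = simulate_mse()
print(f"MSE IPS:    {mse_ips:.5f}")
print(f"MSE SWITCH: {mse_switch:.5f}")
```

As a rough check on what to expect: each IPS term has variance of about $0.3/0.15 - 0.3^2 \approx 1.91$, so its MSE at $n = 1000$ is near $0.0019$, while the model estimate averages roughly 150 Bernoulli(0.3) draws, giving about $0.21/150 \approx 0.0014$. Lowering $\epsilon$ makes the target action rarer in the log and its weight larger, which should widen the gap.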
