## Design: 

Propose a logging schema for Sqwish’s online learning system. It should record everything needed to reproduce training and support later analysis. For each user request, what would you log? (Think: a unique request ID, user context features, the chosen prompt/model, possibly all model outputs, any user click or outcome, timestamps, the probability (propensity) of the chosen action if the policy is stochastic, etc.) Write a structured list of fields and justify each one (why is it needed? For example, the propensity is needed for IPS in OPE).

## Thoughts:

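request_id - Unique key for joining delayed outcomes (clicks, ratings) back to the decision that produced them, and for deduplicating records.

timestamp - Needed to replay events in order when reproducing training, and to measure the delay between action and outcome.

user context features - The input the policy saw at decision time. Needed to retrain the policy and to train the reward model ($q$) for Doubly Robust estimation.

action taken (chosen model or prompt) - Needed for the IPS calculation, which checks whether the logged action matches the one the target policy would choose.

reward (click or outcome) - Needed for the IPS estimate and as the training label for the reward model.

propensity - The probability the logging policy assigned to the chosen action. Needed for the IPS calculation.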
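The field list above could be captured in a small record type. The following is a minimal sketch, assuming JSON lines as the storage format. The class name `BanditLogRecord` and the example feature and action names are hypothetical, not an existing Sqwish schema.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Dict, Optional


@dataclass
class BanditLogRecord:
    """One logged decision. Field names are illustrative assumptions."""

    context: Dict[str, float]       # user context features seen by the policy
    action: str                     # chosen prompt/model identifier
    propensity: float               # probability the logging policy gave this action
    reward: Optional[float] = None  # click/outcome; None until it is observed
    request_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: float = field(default_factory=time.time)

    def __post_init__(self) -> None:
        # IPS divides by the propensity, so zero or invalid values must be rejected.
        if not 0.0 < self.propensity <= 1.0:
            raise ValueError("propensity must be in (0, 1]")


# Hypothetical example values, for illustration only.
record = BanditLogRecord(
    context={"is_new_user": 1.0, "query_length": 12.0},
    action="prompt_v2",
    propensity=0.5,
)
print(json.dumps(asdict(record)))
```

Leaving `reward` optional reflects that the outcome often arrives after the decision is logged; it can be joined back later through `request_id`.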


## Analysis: 
You have deployed a bandit that optimizes prompts. After a week, you examine results and see overall user satisfaction went up 5%. However, for new users (first-time visitors) satisfaction dropped. How would you investigate this? Outline an experiment or analysis using logs to diagnose why the policy might be underperforming for new users (maybe it over-explored or didn’t personalize properly for cold-start users). What changes to the algorithm could you consider (e.g. epsilon-greedy for new users until enough data)?

## Solution:

Segment the logs by user type: derive a new-user flag from the user features, or log it as a separate field, and then repeat the analysis on the new-user subgroup and compare it with returning users.

Hypothesis 1: Cold Start/Exploration. If the agent follows a "Greedy" policy learned mostly from returning users, it might be showing new users content that requires historical context they don't have.

Hypothesis 2: Feature Shift. Perhaps the features for "New Users" are sparse (all zeros), leading the model to default to a "global average" action that is poorly suited or irrelevant to a first-time visitor.
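A first pass at checking both hypotheses is a per-segment summary of the logs. The sketch below uses a handful of synthetic rows purely to show the shape of the analysis. The column names (`is_new_user`, `n_nonzero_features`, and so on) are assumptions about what the logs contain.

```python
import pandas as pd

# Synthetic toy rows for illustration only; real logs would be loaded instead.
logs = pd.DataFrame({
    "is_new_user":        [1, 1, 1, 0, 0, 0],
    "action":             ["p1", "p2", "p1", "p2", "p2", "p1"],
    "propensity":         [0.5, 0.5, 0.9, 0.9, 0.9, 0.1],
    "reward":             [0, 1, 0, 1, 1, 0],
    "n_nonzero_features": [0, 1, 0, 5, 4, 6],
})

# Per-segment outcome, propensity, and feature sparsity.
summary = logs.groupby("is_new_user").agg(
    n=("reward", "size"),
    mean_reward=("reward", "mean"),
    mean_propensity=("propensity", "mean"),
    mean_nonzero_features=("n_nonzero_features", "mean"),
)

# Share of each action within each segment.
action_share = (
    logs.groupby("is_new_user")["action"]
    .value_counts(normalize=True)
    .unstack(fill_value=0.0)
)

print(summary)
print(action_share)
```

If new users show near-zero `mean_nonzero_features` and a single dominant action, that points toward Hypothesis 2. A propensity profile that differs sharply from returning users would instead suggest looking at how much the policy explored for that segment.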

Algorithmic Changes:

Epsilon-Greedy Reset: For new users, use a higher $\epsilon$ (more exploration) until enough interactions have been logged, so their specific preferences are learned faster.

Contextual Fallback: If user_history is empty, use a specialized "Onboarding Policy" rather than the general optimization policy.
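The epsilon-greedy change could look roughly like the sketch below. The function name `choose_action`, the thresholds, and the epsilon values are all hypothetical and would need tuning. The function also returns the propensity of the sampled action so it can be logged for later IPS estimation.

```python
import numpy as np


def choose_action(q_values, n_user_events, rng,
                  eps_new=0.3, eps_default=0.05, min_events=5):
    """Epsilon-greedy with a larger epsilon for users with little history.

    Returns (action, propensity). All default values are illustrative.
    """
    q_values = np.asarray(q_values, dtype=float)
    n_actions = q_values.size

    # Users with fewer than min_events logged interactions count as "new".
    eps = eps_new if n_user_events < min_events else eps_default

    # Spread eps uniformly over all actions, put the rest on the greedy one.
    probs = np.full(n_actions, eps / n_actions)
    probs[int(np.argmax(q_values))] += 1.0 - eps

    action = int(rng.choice(n_actions, p=probs))
    return action, float(probs[action])


rng = np.random.default_rng(0)
action, propensity = choose_action([0.1, 0.9], n_user_events=0, rng=rng)
print(action, propensity)
```

An onboarding policy for users with empty history could be plugged in at the same point by swapping `q_values` for the fallback policy's scores.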

## Coding: 
Using an open bandit dataset (e.g. the Open Bandit Pipeline’s logged data if available, or simulate one), perform an off-policy evaluation of a hypothetical new policy. For example, use logged data from a uniform random policy on a classification task as bandit feedback. Define a new deterministic policy (like always choose arm 1 for certain feature values and arm 2 otherwise). Use IPS and Doubly Robust to estimate the new policy’s reward from the logs. Compare that to the actual reward if you run the new policy on the dataset (if ground truth available). This exercise solidifies understanding of OPE’s value and limitations (if the policy is very different, IPS variance will be high).
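For reference, the two estimators named above can be written as follows. The notation is an assumption chosen for this note: $\mu$ is the logging policy, $\pi$ the target policy, $\hat{q}$ a fitted reward model, and $(x_i, a_i, r_i)$ the logged context, action, and reward.

$$\hat{V}_{\text{IPS}}(\pi) = \frac{1}{n}\sum_{i=1}^{n} \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)}\, r_i$$

$$\hat{V}_{\text{DR}}(\pi) = \frac{1}{n}\sum_{i=1}^{n}\left[\hat{q}\bigl(x_i, \pi(x_i)\bigr) + \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)}\bigl(r_i - \hat{q}(x_i, a_i)\bigr)\right]$$

For a deterministic target policy, $\pi(a_i \mid x_i)$ is 1 when the logged action equals $\pi(x_i)$ and 0 otherwise. Under uniform random logging over two arms, $\mu(a_i \mid x_i) = 0.5$, so every matching sample receives weight 2 and every non-matching sample receives weight 0.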

import numpy as np
from sklearn.linear_model import LogisticRegression

# 1. Setup Simulation Parameters
np.random.seed(42)
n_samples = 2000
n_features = 2

# Generate random contexts (x)
X = np.random.uniform(0, 1, size=(n_samples, n_features))

# Define 'Ground Truth' reward probability (Unknown to the agent)
# Action 0 likes Feature 1; Action 1 likes Feature 0
def get_true_reward_prob(context, action):
    if action == 0:
        return 0.8 if context[1] > 0.5 else 0.2
    else:
        return 0.8 if context[0] > 0.5 else 0.2

# 2. LOGGING DATA: Uniform Random Policy (Behavior Policy)
actions_logged = []
rewards_logged = []
propensities_logged = []

for i in range(n_samples):
    # Logging Policy: p(a=0) = 0.5, p(a=1) = 0.5
    action = np.random.choice([0, 1])
    prob = 0.5
    
    # Observe Reward
    r_prob = get_true_reward_prob(X[i], action)
    reward = np.random.binomial(1, r_prob)
    
    actions_logged.append(action)
    rewards_logged.append(reward)
    propensities_logged.append(prob)

# 3. DEFINE TARGET POLICY (Hypothetical New Policy)
# Logic: Use Action 1 if Feature 0 > 0.5, else Action 0
def target_policy(context):
    return 1 if context[0] > 0.5 else 0

# 4. GROUND TRUTH CALCULATION (What would actually happen?)
true_rewards = [get_true_reward_prob(X[i], target_policy(X[i])) for i in range(n_samples)]
ground_truth_val = np.mean(true_rewards)

# 5. OPE ESTIMATION: IPS
ips_scores = []
for i in range(n_samples):
    pi_a = 1.0 if actions_logged[i] == target_policy(X[i]) else 0.0
    ips_scores.append(rewards_logged[i] * (pi_a / propensities_logged[i]))
ips_val = np.mean(ips_scores)

# 6. OPE ESTIMATION: Doubly Robust (DR)
# Step A: Train a Reward Model (Direct Method)
# We train a model to predict reward given (context, action)
q_models = {}
for a in [0, 1]:
    mask = np.array(actions_logged) == a
    model = LogisticRegression().fit(X[mask], np.array(rewards_logged)[mask])
    q_models[a] = model

# Step B: Calculate DR
dr_scores = []
for i in range(n_samples):
    target_a = target_policy(X[i])
    actual_a = actions_logged[i]
    pi_a = 1.0 if actual_a == target_a else 0.0
    
    # Model predictions
    q_target = q_models[target_a].predict_proba(X[i:i+1])[0][1]
    q_actual = q_models[actual_a].predict_proba(X[i:i+1])[0][1]
    
    # DR formula: q_target + (indicator / propensity) * (reward - q_actual)
    # Use a separate name for the per-sample score so it does not shadow the final estimate.
    dr_score = q_target + (pi_a / propensities_logged[i]) * (rewards_logged[i] - q_actual)
    dr_scores.append(dr_score)
dr_val = np.mean(dr_scores)

# --- FINAL RESULTS ---
print(f"Ground Truth Value:  {ground_truth_val:.4f}")
print(f"IPS Estimate:        {ips_val:.4f} (Error: {abs(ips_val-ground_truth_val):.4f})")
print(f"DR Estimate:         {dr_val:.4f} (Error: {abs(dr_val-ground_truth_val):.4f})")


Ground Truth Value:  0.6503
IPS Estimate:        0.6580 (Error: 0.0077)
DR Estimate:         0.6572 (Error: 0.0069)
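A point estimate alone does not show how much the IPS estimate would move on another sample, which is the limitation the exercise mentions. A minimal sketch of two common diagnostics follows: the standard error of the IPS scores and the Kish effective sample size of the importance weights. The helper name `ips_with_diagnostics` and the four-row input are hypothetical.

```python
import numpy as np


def ips_with_diagnostics(rewards, target_probs, propensities):
    """Return (IPS estimate, standard error, effective sample size)."""
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(target_probs, dtype=float) / np.asarray(propensities, dtype=float)

    scores = weights * rewards
    estimate = scores.mean()
    std_error = scores.std(ddof=1) / np.sqrt(scores.size)

    # Kish effective sample size: drops as the weights become more uneven.
    # Undefined if every weight is zero (no logged action matches the target).
    ess = weights.sum() ** 2 / (weights ** 2).sum()
    return estimate, std_error, ess


# Toy input: uniform logging over two arms, deterministic target policy.
est, se, ess = ips_with_diagnostics(
    rewards=[1, 0, 1, 1],
    target_probs=[1.0, 0.0, 1.0, 0.0],
    propensities=[0.5, 0.5, 0.5, 0.5],
)
print(f"IPS: {est:.4f}, SE: {se:.4f}, ESS: {ess:.1f}")
```

With uniform logging and a deterministic target, roughly half the samples get zero weight, so the effective sample size is about half the log. A target policy that rarely agrees with the logged actions would shrink it further and widen the standard error.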
