## A/B Test Analysis

### Test Design and Sample Size Determination

We begin with a classic A/B test to compare the performance of the two button designs. When designing the test, we must carefully consider our tolerance for different types of errors:

- **Type I Error (False Positive)**: Incorrectly detecting an effect when none exists
- **Type II Error (False Negative)**: Failing to detect a real effect

In our case, given that:
1. Button design changes are low-cost and easily reversible
2. Missing a 25% improvement represents significant lost opportunity
3. Implementation costs are minimal

We conclude that Type II errors are more costly than Type I errors. This leads us to select:
- Confidence Level: 95% (α = 0.05)
- Statistical Power: 90% (β = 0.10)
- Minimum Detectable Effect (MDE): 25% relative improvement
- Baseline CTR: 7%

Using these parameters in a [sample size calculator](https://www.evanmiller.org/ab-testing/sample-size.html), we determine that we need **4,664 samples per variant** to achieve the desired statistical properties.
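As a cross-check, the sample size can be approximated in a few lines of Python. This is a minimal sketch that assumes a two-sided z-test in which the null-hypothesis variance is based on the baseline rate alone; that formula is our assumption about how the calculator works, and the function name `sample_size_per_variant` is ours.

```python
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.90):
    """Approximate per-variant sample size for a two-sided two-proportion z-test."""
    delta = baseline * relative_mde          # absolute MDE (0.0175 here)
    p_alt = baseline + delta                 # CTR of the new variant under H1
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Standard deviation of the difference under H0 (both arms at the baseline)
    sd_null = (2 * baseline * (1 - baseline)) ** 0.5
    # Standard deviation of the difference under H1 (arms at baseline and p_alt)
    sd_alt = (baseline * (1 - baseline) + p_alt * (1 - p_alt)) ** 0.5
    return round(((z_alpha * sd_null + z_beta * sd_alt) / delta) ** 2)


n_per_variant = sample_size_per_variant(baseline=0.07, relative_mde=0.25)
print(f"Required samples per variant: {n_per_variant}")
```

With the parameters listed above, this approximation should land on the same order as the calculator's figure; other formulas (for example, ones using a pooled variance) give somewhat larger values.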

**Experiment Duration**
  - Running time: 2 weeks
  - Approximately 10,000 visitors expected
  - The experiment will stop when the required number of samples has been collected

In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
import os
from pathlib import Path
from scipy.stats import beta, norm
from models.visualizations import plot_cumulative_reward
from statsmodels.stats.proportion import test_proportions_2indep, confint_proportions_2indep

In [3]:
# Fix the working directory
os.chdir(Path.cwd().parent)

In [4]:
# Read the simulation data
results_ab = pd.read_csv("data/ab_test_results.csv")
results_ab = pd.Series(results_ab.values[0], index=results_ab.columns)
results_ab

algorithm           ab_test
true_ctr_a           0.0702
true_ctr_b           0.1021
bandit_ctr_a       0.066582
bandit_ctr_b            0.1
button_a_clicks         313
button_a_views         4701
button_b_clicks         470
button_b_views         4700
dtype: object

### Defining the Hypotheses

Let `δ = CTR_B - CTR_A` represent the difference in click-through rates and `MDE = 0.0175` the absolute minimum detectable effect (25% of the 7% baseline CTR).

**Null Hypothesis (H₀)**: δ ≤ MDE  
"The new design (B) does not provide a meaningful improvement over the current design (A)"

**Alternative Hypothesis (H₁)**: δ > MDE  
"The new design (B) provides an improvement of at least 1.75 percentage points over the current design (A)"

In [48]:
MDE = 0.0175
CONFIDENCE_LEVEL = 0.95

_, p_value = test_proportions_2indep(count1=results_ab.loc["button_b_clicks"],
                                        nobs1=results_ab.loc["button_b_views"],
                                        count2=results_ab.loc["button_a_clicks"],
                                        nobs2=results_ab.loc["button_a_views"],
                                        value=MDE,
                                        compare="diff",
                                        alternative="larger")

lower, upper = confint_proportions_2indep(count1=results_ab.loc["button_b_clicks"],
                                        nobs1=results_ab.loc["button_b_views"],
                                        count2=results_ab.loc["button_a_clicks"],
                                        nobs2=results_ab.loc["button_a_views"],
                                        compare="diff",
                                        alpha= 1-CONFIDENCE_LEVEL)

print("\nTest Results:")
print(f"P-value: {p_value:.4f}")
print(f"95% Confidence Interval for CTR_B - CTR_A: [{lower:.3f}, {upper:.3f}]")

# Interpret results: reject H₀ when the p-value falls below α = 1 - confidence level
if p_value < 1 - CONFIDENCE_LEVEL:
    print("\nConclusion: Reject H₀")
    print("Evidence suggests the new design provides relative improvement greater than 25%")
else:
    print("\nConclusion: Fail to reject H₀")
    print("Insufficient evidence for meaningful improvement")


Test Results:
P-value: 0.0026
95% Confidence Interval for CTR_B - CTR_A: [0.022, 0.045]

Conclusion: Reject H₀
Evidence suggests the new design provides improvement greater than 25%
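The reported p-value can be approximated by hand with a Wald z-statistic. This is a minimal sketch that uses the observed counts from the A/B test results and an unpooled standard error; it is an illustrative approximation and not necessarily the method the library applies by default, so small differences in the last digits are expected.

```python
from statistics import NormalDist

# Observed counts from the A/B test results table
clicks_a, views_a = 313, 4701
clicks_b, views_b = 470, 4700
MDE = 0.0175  # absolute minimum detectable effect

ctr_a = clicks_a / views_a
ctr_b = clicks_b / views_b
delta_hat = ctr_b - ctr_a

# Unpooled (Wald) standard error of the difference in proportions
se = (ctr_a * (1 - ctr_a) / views_a + ctr_b * (1 - ctr_b) / views_b) ** 0.5

# H0: delta <= MDE versus H1: delta > MDE, so the difference is shifted by MDE
z = (delta_hat - MDE) / se
p_value_wald = 1 - NormalDist().cdf(z)

print(f"Observed difference: {delta_hat:.4f}")
print(f"z statistic: {z:.2f}")
print(f"One-sided p-value: {p_value_wald:.4f}")
```

Shifting the observed difference by the MDE is what distinguishes this test from the usual "is B different from A" comparison: the evidence has to clear the 1.75 percentage point bar, not merely zero.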


#### Interpretation of Results

The statistical analysis provides strong evidence in favor of the new button design (B):

1. **P-value**
   - The p-value of 0.0026 is well below our significance level of 0.05
   - If the true difference were no larger than the MDE, a result at least this extreme would be very unlikely
   - We therefore reject the null hypothesis

2. **Confidence Interval**
   - The 95% confidence interval [0.022, 0.045] tells us that:
     - At the lower bound, CTR increases by 2.2 percentage points
     - At the upper bound, CTR increases by 4.5 percentage points
   - Notably, the entire interval lies above our minimum detectable effect of 0.0175
   - This means even our most conservative estimate exceeds our target improvement
   - Also note that the true difference in CTRs (0.0319, about 3.2 percentage points) lies inside this interval

3. **Practical Significance**
   - The findings suggest a substantial improvement in user engagement
   - They are both statistically significant and practically meaningful

Given the above, implementing the new button design (B) appears to be a well-supported decision.

Under repeated sampling, a 95% confidence interval constructed this way would contain the true difference in CTRs about 95% of the time.
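The repeated-sampling interpretation of a 95% confidence interval can be illustrated with a small simulation. The sketch below assumes the true CTRs from the simulation data (0.0702 and 0.1021), 4,700 views per variant, and a simple Wald interval; the helper names and the seed are ours, and the number of replications is kept small so that it runs quickly.

```python
import random
from statistics import NormalDist

TRUE_CTR_A, TRUE_CTR_B = 0.0702, 0.1021
VIEWS_PER_ARM = 4700
N_EXPERIMENTS = 300   # kept small for speed; more replications tighten the estimate
Z_95 = NormalDist().inv_cdf(0.975)

rng = random.Random(42)


def simulate_clicks(ctr, views):
    """Draw the number of clicks in `views` independent Bernoulli trials."""
    return sum(rng.random() < ctr for _ in range(views))


def wald_interval(clicks_a, clicks_b, views):
    """95% Wald confidence interval for CTR_B - CTR_A with equal arm sizes."""
    p_a, p_b = clicks_a / views, clicks_b / views
    se = (p_a * (1 - p_a) / views + p_b * (1 - p_b) / views) ** 0.5
    diff = p_b - p_a
    return diff - Z_95 * se, diff + Z_95 * se


true_diff = TRUE_CTR_B - TRUE_CTR_A
covered = 0
for _ in range(N_EXPERIMENTS):
    lower, upper = wald_interval(
        simulate_clicks(TRUE_CTR_A, VIEWS_PER_ARM),
        simulate_clicks(TRUE_CTR_B, VIEWS_PER_ARM),
        VIEWS_PER_ARM,
    )
    covered += lower <= true_diff <= upper

coverage = covered / N_EXPERIMENTS
print(f"Share of intervals containing the true difference: {coverage:.2f}")
```

The share should sit near 0.95, with some noise from the small number of replications. Any single interval either contains the true difference or does not; the 95% describes the procedure, not one realised interval.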

Next steps: recommend rolling out the new button design and document the findings clearly.

<hr>

## Thompson Sampling Analysis - Implementation 1: Initial Exploration Period

### Approach Overview
Thompson Sampling (TS) is a probabilistic algorithm that balances exploration and exploitation in decision-making. We'll examine two variations of TS, starting with an implementation that emphasizes early exploration.

### Design Choices
1. **Initial Exploration Phase**
   - First 600 visitors randomly assigned to variants
   - Provides unbiased baseline data
   - Ensures sufficient initial learning period

2. **Prior Selection**
   - Using uninformative Beta(1,1) priors for both variants
   - Represents complete initial uncertainty

3. **Algorithm Phases**
   - Phase 1: Random allocation (n=600)
   - Phase 2: Thompson Sampling takes over
   - Transition based on collected data

4. **Experiment Duration**
   - Running time: 2 weeks
   - Matches classic A/B test duration
   - Allows fair comparison between methods
   - Approximately 10,000 visitors expected


This approach allows us to:
- Build knowledge from scratch without historical assumptions
- Monitor current user behavior objectively
- Transition smoothly to data-driven allocation
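The design choices above can be sketched as a short simulation. This is an illustrative sketch, not the code that produced the results in this notebook: the function name `run_thompson_sampling`, the seed, and the use of the true CTRs to simulate clicks are assumptions made for the example.

```python
import random


def run_thompson_sampling(true_ctrs, n_visitors=10_000, n_explore=600, seed=0):
    """Thompson Sampling for two variants with an initial random-allocation phase."""
    rng = random.Random(seed)
    alphas = [1, 1]  # Beta(1, 1) priors: one pseudo-click per variant
    betas = [1, 1]   # and one pseudo-non-click per variant
    views = [0, 0]

    for t in range(n_visitors):
        if t < n_explore:
            # Phase 1: assign the visitor to a variant uniformly at random
            arm = rng.randrange(2)
        else:
            # Phase 2: sample a plausible CTR from each posterior, show the larger
            draws = [rng.betavariate(alphas[i], betas[i]) for i in range(2)]
            arm = draws.index(max(draws))

        clicked = rng.random() < true_ctrs[arm]
        views[arm] += 1
        # Conjugate update: clicks increment alpha, non-clicks increment beta
        alphas[arm] += int(clicked)
        betas[arm] += int(not clicked)

    return views, alphas, betas


views, alphas, betas = run_thompson_sampling(true_ctrs=[0.0702, 0.1021])
print(f"Views  A: {views[0]}, B: {views[1]}")
print(f"Posterior A: Beta({alphas[0]}, {betas[0]})")
print(f"Posterior B: Beta({alphas[1]}, {betas[1]})")
```

Because each draw comes from the current posterior, a variant that looks worse is still shown occasionally while its posterior remains wide, which is how the algorithm keeps exploring after the random phase ends.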

In [5]:
# Read the simulation data
results_TS_min_exp = pd.read_csv("data/TS_min_exp_results.csv")
results_TS_min_exp = pd.Series(results_TS_min_exp.values[0], index=results_TS_min_exp.columns)
results_TS_min_exp

algorithm          TS_min_exp
true_ctr_a             0.0702
true_ctr_b             0.1021
bandit_ctr_a         0.066176
bandit_ctr_b         0.103873
button_a_clicks            36
button_a_views            544
button_b_clicks           920
button_b_views           8857
a_alpha                    37
a_beta                    509
b_alpha                   921
b_beta                   7938
dtype: object

In [50]:
# Monte Carlo simulation for posterior probability analysis
n_samples = 10000

# Generate samples from final posterior distributions
sample_a = np.random.beta(results_TS_min_exp.loc["a_alpha"], 
                         results_TS_min_exp.loc["a_beta"], 
                         n_samples)
sample_b = np.random.beta(results_TS_min_exp.loc["b_alpha"], 
                         results_TS_min_exp.loc["b_beta"], 
                         n_samples)

# Calculate probabilities of interest
prob_b_better = (sample_b > sample_a).mean()
prob_b_better_mde = (sample_b > (sample_a + MDE)).mean()

print(f"\nProbability Analysis:")
print(f"P(B > A): {prob_b_better:.3f}")
print(f"P(B > A + MDE): {prob_b_better_mde:.3f}")

# Calculate credible interval for the difference
diff_samples = sample_b - sample_a
credible_interval = np.percentile(diff_samples, [2.5, 97.5])
print(f"\n95% Credible Interval for (B-A): [{credible_interval[0]:.3f}, {credible_interval[1]:.3f}]")

# 95% credible interval for button B posterior
lower, upper = beta.ppf([0.025, 0.975], a=results_TS_min_exp.loc["b_alpha"], b=results_TS_min_exp.loc["b_beta"])
print(f"\n95% Credible Interval for (B): [{lower:.3f}, {upper:.3f}]")

# 95% credible interval for button A posterior
lower, upper = beta.ppf([0.025, 0.975], a=results_TS_min_exp.loc["a_alpha"], b=results_TS_min_exp.loc["a_beta"])
print(f"\n95% Credible Interval for (A): [{lower:.3f}, {upper:.3f}]")


Probability Analysis:
P(B > A): 0.998
P(B > A + MDE): 0.944

95% Credible Interval for (B-A): [0.013, 0.057]


### Interpretation of Results

The analysis provides strong evidence supporting the effectiveness of Button B and the Thompson Sampling algorithm:

1. **Performance**
  - 99.8% probability that Button B outperforms Button A
  - 94.4% probability that the improvement exceeds our minimum detectable effect (1.75 percentage points)
  - CTR estimates are almost identical to those of the A/B test method
  - These high probabilities indicate strong evidence for meaningful improvement

2. **Credible Interval**
  - The 95% credible interval for the difference [0.013, 0.057] indicates:
    - At the lower bound, a 1.3 percentage point increase in CTR
    - At the upper bound, a 5.7 percentage point increase in CTR
  - The interval for Button B's CTR [0.098, 0.110] contains the true value (0.102)
    - Demonstrates algorithm's accuracy in estimating true performance

3. **Decision Confidence**
  - The lower bound (1.3 percentage points) is slightly below our MDE (1.75 percentage points)
  - However, the high probability of exceeding MDE (94.4%) suggests this is not concerning
  - The narrow credible interval for Button B indicates precise estimation

4. **Algorithm Performance**
  - Successfully identified the better variant
  - Provided accurate CTR estimates
  - Generated sufficient evidence for decision-making
  - Balanced exploration and exploitation effectively

These results strongly support implementing Button B, with high confidence in both the statistical and practical significance of the improvement.
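As a sanity check on the Monte Carlo estimates, the same quantities can be approximated in closed form. The sketch below assumes the difference of the two Beta posteriors is roughly normal, which is reasonable at these sample sizes but ignores the skew of Button A's posterior, so small discrepancies with the sampled values are expected. The helper `beta_mean_var` is ours.

```python
from statistics import NormalDist

MDE = 0.0175


def beta_mean_var(a, b):
    """Mean and variance of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var


# Final posterior parameters from the TS_min_exp results
mean_a, var_a = beta_mean_var(37, 509)
mean_b, var_b = beta_mean_var(921, 7938)

# Normal approximation to the posterior of the difference B - A
diff = NormalDist(mu=mean_b - mean_a, sigma=(var_a + var_b) ** 0.5)

approx_prob_b_better = 1 - diff.cdf(0)
approx_prob_b_better_mde = 1 - diff.cdf(MDE)
approx_lower, approx_upper = diff.inv_cdf(0.025), diff.inv_cdf(0.975)

print(f"P(B > A) ~ {approx_prob_b_better:.3f}")
print(f"P(B > A + MDE) ~ {approx_prob_b_better_mde:.3f}")
print(f"Approximate 95% interval for (B-A): [{approx_lower:.3f}, {approx_upper:.3f}]")
```

Most of the variance in the difference comes from Button A, which received only 544 views, so the width of the interval is driven by the arm the algorithm stopped showing.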

<hr>

## Thompson Sampling Analysis - Implementation 2: Informed Priors with Smaller Sample Size

### Approach Overview
In this second implementation, we leverage our historical knowledge while maintaining Thompson Sampling's adaptive properties. This approach demonstrates how prior information can be incorporated into the decision-making process.

### Design Choices
1. **Informed Prior for Button A**
  - Beta(6, 78) prior reflects historical 7% CTR
  - Parameters chosen to:
    - Center around known performance
    - Allow sufficient variance for exploration (not too strong to dominate new data)
    - Avoid over-confidence in historical data

2. **Algorithm Configuration**
  - No initial exploration period
  - Thompson Sampling active from start
  - Relies on prior knowledge to guide early decisions

3. **Experiment Duration**
  - Running time: 1 week
  - Approximately 5,000 visitors expected
  - Shorter timeline than previous implementations
  - Tests algorithm's efficiency with time constraints

This implementation aims to:
- Leverage existing knowledge effectively
- Test algorithm performance under time constraints

In [54]:
# Read the simulation data
results_TS_priors = pd.read_csv("data/TS_priors_results.csv")
results_TS_priors = pd.Series(results_TS_priors.values[0], index=results_TS_priors.columns)
results_TS_priors

algorithm          TS_priors
true_ctr_a            0.0702
true_ctr_b            0.1021
bandit_ctr_a         0.07619
bandit_ctr_b        0.099872
button_a_clicks           24
button_a_views           315
button_b_clicks          468
button_b_views          4686
a_alpha                   30
a_beta                   369
b_alpha                  469
b_beta                  4219
dtype: object
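The posterior parameters in the table follow directly from the conjugate Beta-Binomial update. The sketch below assumes Button B started from a Beta(1, 1) prior, which is consistent with the reported `b_alpha` and `b_beta` but is not stated explicitly above; the helper name `update_beta` is ours.

```python
def update_beta(prior_alpha, prior_beta, clicks, views):
    """Conjugate update: add clicks to alpha and non-clicks to beta."""
    return prior_alpha + clicks, prior_beta + (views - clicks)


# Button A: informed Beta(6, 78) prior, prior mean 6 / 84 (about 7.1%)
a_alpha, a_beta = update_beta(6, 78, clicks=24, views=315)
# Button B: assumed uninformative Beta(1, 1) prior
b_alpha, b_beta = update_beta(1, 1, clicks=468, views=4686)

print(f"Posterior A: Beta({a_alpha}, {a_beta})")
print(f"Posterior B: Beta({b_alpha}, {b_beta})")

# The prior acts like 84 pseudo-observations, pulling A's estimate toward 7%
raw_ctr_a = 24 / 315
posterior_mean_a = a_alpha / (a_alpha + a_beta)
print(f"Raw CTR A: {raw_ctr_a:.4f}, posterior mean A: {posterior_mean_a:.4f}")
```

The update reproduces the reported Beta(30, 369) and Beta(469, 4219) posteriors. With only 315 real views of Button A, the 84 pseudo-observations in the prior still carry noticeable weight, which is why the posterior mean sits slightly below the raw CTR.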

In [55]:
# Monte Carlo simulation for posterior probability analysis
n_samples = 10000

# Generate samples from final posterior distributions
sample_a = np.random.beta(results_TS_priors.loc["a_alpha"], 
                         results_TS_priors.loc["a_beta"], 
                         n_samples)
sample_b = np.random.beta(results_TS_priors.loc["b_alpha"], 
                         results_TS_priors.loc["b_beta"], 
                         n_samples)

# Calculate probabilities of interest
prob_b_better = (sample_b > sample_a).mean()
prob_b_better_mde = (sample_b > (sample_a + MDE)).mean()

print(f"\nProbability Analysis:")
print(f"P(B > A): {prob_b_better:.3f}")
print(f"P(B > A + MDE): {prob_b_better_mde:.3f}")

# Calculate credible interval for the difference
diff_samples = sample_b - sample_a
credible_interval = np.percentile(diff_samples, [2.5, 97.5])
print(f"\n95% Credible Interval for (B-A): [{credible_interval[0]:.3f}, {credible_interval[1]:.3f}]")

# 95% credible interval for button B posterior
lower, upper = beta.ppf([0.025, 0.975], a=results_TS_priors.loc["b_alpha"], b=results_TS_priors.loc["b_beta"])
print(f"\n95% Credible Interval for (B): [{lower:.3f}, {upper:.3f}]")

# 95% credible interval for button A posterior
lower, upper = beta.ppf([0.025, 0.975], a=results_TS_priors.loc["a_alpha"], b=results_TS_priors.loc["a_beta"])
print(f"\n95% Credible Interval for (A): [{lower:.3f}, {upper:.3f}]")


Probability Analysis:
P(B > A): 0.955
P(B > A + MDE): 0.714

95% Credible Interval for (B-A): [-0.005, 0.050]


### Interpretation of Results
The analysis reveals interesting insights about leveraging historical knowledge in Thompson Sampling:

1. **Prior Effectiveness**
  - Button A's estimated CTR (7.62%) slightly overestimates the true CTR (7.02%) but remains close
  - The credible interval for A [0.051, 0.103] shows higher uncertainty than in the previous implementation
  - Prior knowledge successfully guided early decisions towards the better choice

2. **Performance**
  - 95.5% probability that Button B outperforms A
  - 71.4% probability of exceeding the MDE (1.75 percentage points)
  - Less definitive than previous implementation, but achieved in half the time

3. **Sample Allocation**
  - Strong preference for Button B (4,686 views vs 315)
  - Algorithm quickly identified promising variant
  - Efficient resource allocation despite shorter runtime

4. **Credible Intervals**
  - Button B's credible interval [0.092, 0.109] contains true CTR (0.102)
  - Difference interval [-0.005, 0.050] shows more uncertainty about the true difference between CTRs

5. **Time-Efficiency Trade-off**
  - Achieved reasonable certainty in one week
  - Less definitive than two-week implementation
  - Demonstrates value of informed priors in accelerating decisions

This implementation shows promise for scenarios where:
- Historical data is reliable
- Quick decisions are needed
- Some uncertainty is acceptable

<hr>

### Learning Progress Analysis

In [6]:


# Load the per-trial decision logs saved for each algorithm
dec_1 = pd.read_csv("data/TS_min_exp_decisions.csv")
dec_1 = dec_1.values.flatten()

dec_2 = pd.read_csv("data/TS_priors_decisions.csv")
dec_2 = dec_2.values.flatten()

dec_3 = pd.read_csv("data/ab_simulation_decisions.csv")
dec_3 = dec_3.values.flatten()

dec_4 = pd.read_csv("data/UCB1_decisions.csv")
dec_4 = dec_4.values.flatten()


In [22]:
plot_cumulative_reward(true_ctrs=[results_TS_min_exp.loc["true_ctr_a"], results_TS_min_exp.loc["true_ctr_b"]],
                       decisions=[dec_1, dec_2, dec_3, dec_4],
                       alg_names=["TS_min_exp", "TS_priors", "ab_test", "UCB 1"],
                       save_path="data/figures"
                       )

![Cumulative Rewards](../data/figures/cumulative_reward.png)

The cumulative rewards graph reveals important insights about the performance and characteristics of each implementation:

1. **Early Stage Behavior (0-600 trials)**
   - A/B test and TS with minimum exploration (TS_min_exp) track closely due to their random allocation phase
   - UCB1 and TS with priors (TS_priors) show similar aggressive exploration patterns
   - All algorithms exhibit expected initial volatility

2. **Mid-Stage Adaptation (600-2000 trials)**
   - TS_min_exp diverges positively from A/B test after exploration phase
   - UCB1 demonstrates strong performance, possibly benefiting from its confidence bounds approach
   - TS_priors maintains competitive performance despite informed priors for Button A

3. **Convergence Phase (2000+ trials)**
   - All adaptive algorithms (TS_min_exp, TS_priors, UCB1) converge toward optimal rate (0.10)
   - A/B test plateaus at suboptimal performance (~0.085)
   - TS implementations show more stable convergence compared to UCB1's oscillations

4. **Key Observations**
   - Adaptive algorithms clearly outperform static A/B testing
   - Both TS implementations achieve similar final performance through different paths
   - UCB1's more aggressive exploration leads to higher variance
   - The cost of A/B test's fixed allocation strategy is clearly visible in lost reward

5. **Long-term Efficiency**
   - All adaptive methods eventually identify and exploit the better option
   - TS_priors achieves comparable performance with half the trials
   - The gap between adaptive and static approaches widens over time


Between roughly trial 300 and trial 600, the adaptive algorithms already seem to capture clicks that the static allocation would otherwise lose.
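The lost reward can also be quantified directly from the final allocation counts. The sketch below computes expected regret, defined here as the number of views sent to Button A multiplied by the gap between the true CTRs. This is our own back-of-the-envelope measure rather than a figure produced elsewhere in the notebook; UCB1 is omitted because no results table is shown for it, and the runs differ in length, so the per-visitor value is the fairer comparison.

```python
TRUE_CTR_A, TRUE_CTR_B = 0.0702, 0.1021
gap = TRUE_CTR_B - TRUE_CTR_A  # expected clicks lost per view of Button A

# (views of A, views of B) taken from the results tables above
allocations = {
    "ab_test": (4701, 4700),
    "TS_min_exp": (544, 8857),
    "TS_priors": (315, 4686),
}

expected_regret = {}
for name, (views_a, views_b) in allocations.items():
    regret = views_a * gap
    expected_regret[name] = regret
    per_visitor = regret / (views_a + views_b)
    print(f"{name:>10}: {regret:6.1f} expected lost clicks "
          f"({per_visitor:.4f} per visitor)")
```

Under this measure the A/B test gives up roughly 150 expected clicks by design, because half of its traffic goes to the weaker variant for the whole run, while both Thompson Sampling runs lose far fewer.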

In [None]:
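def running_win_rate(decisions):
    """Cumulative reward divided by the number of trials so far."""
    return np.cumsum(decisions) / (np.arange(len(decisions)) + 1)

# Win rates over the first 5,000 trials for every algorithm
names = ["TS_min_exp", "TS_priors", "ab_test", "UCB1"]
win_rates_5000 = {name: running_win_rate(dec[:5000])
                  for name, dec in zip(names, [dec_1, dec_2, dec_3, dec_4])}

# Win rates over the full runs of the two 2-week experiments
win_rates_full = {"TS_min_exp": running_win_rate(dec_1),
                  "ab_test": running_win_rate(dec_3)}

# Final win rates used in the relative comparisons below
ts_full, ab_full = win_rates_full["TS_min_exp"][-1], win_rates_full["ab_test"][-1]
ts_priors_5000, ts_min_exp_5000 = win_rates_5000["TS_priors"][-1], win_rates_5000["TS_min_exp"][-1]

print(f"% Difference between TS_min_exp - ab_test at 10000 samples: {(ts_full - ab_full) / ab_full * 100:.2f}%")
print(f"% Difference between TS_priors - TS_min_exp at 5000 samples: {(ts_priors_5000 - ts_min_exp_5000) / ts_min_exp_5000 * 100:.2f}%")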

<hr>

### Comparative Analysis of Posterior Evolution

![Posterior Grid](../data/figures/TS_min_exp_posterior_grid.png)

![Posterior Grid 2](../data/figures/TS_priors_posterior_grid.png)

#### Early Stage (50 iterations)
- TS_priors exhibits higher density for Button A, reflecting our informative prior (7% CTR)
- In both cases, there is significant overlap between distributions, indicating high uncertainty

#### Early-Mid Stage (150 iterations)
- TS_min_exp shows slight overestimation of both CTRs due to random sampling
- TS_priors demonstrates more accurate estimation as it collects more samples for B

#### Mid Stage (500 iterations)
- TS_min_exp concludes its exploration phase with higher certainty about Button A
- TS_priors shows more confidence in Button B's distribution
- Both implementations begin showing clear separation between variants

#### Convergence Stage (1500-5000 iterations)
1. Distribution Characteristics:
   - Both implementations converge to similar shapes
   - Button B consistently shows higher, narrower peaks
   - Button A distributions maintain wider spread due to fewer samples

2. Key Differences:
   - TS_priors achieves tighter distributions with fewer total samples
   - Final variance differences reflect adaptive sample allocation
   - Both successfully identify the superior variant, but through different paths

The evolution demonstrates how different initialization strategies (random vs informed) ultimately lead to similar conclusions, with TS_priors potentially offering faster convergence at the cost of some exploration.

<hr>

## Conclusions

Our analysis demonstrates several key findings:

1. **Algorithm Performance and Implementation Trade-offs**
   - Classic A/B testing reached highly significant results, but at the cost of lost opportunity
   - Thompson Sampling effectively balanced exploration/exploitation
   - Adaptive methods seem preferable over classic A/B testing for the task
   - Both TS implementations reached similar conclusions through different paths
   - TS with minimum exploration: More thorough but requires longer runtime
   - TS with informed priors: Faster decisions with acceptable confidence
   - Choice depends on business constraints and prior knowledge reliability

2. **Statistical Insights**
   - High confidence in Button B's superiority across all methods
   - Adaptive methods achieved higher cumulative rewards
   - Prior knowledge can effectively accelerate decision-making

3. **Practical Implications**
   - For this use case, Thompson Sampling offers clear advantages over A/B testing
   - When historical data is reliable, informed priors can reduce testing time
   - Monitoring posterior evolution and cumulative rewards provides insight into decision confidence