## Exercise 1: Calculating Required Sample Size
You are planning an A/B test to evaluate the impact of a new email subject line on the open rate. Based on past data, you expect a standardized effect size (Cohen's d) of 0.3. The exercise pairs this with an increase in the open rate from 20% to 23%, although that change on its own corresponds to a much smaller standardized effect (Cohen's h of about 0.07). You aim for an 80% chance (power = 0.8) of detecting this effect if it exists, with a 5% significance level (α = 0.05).

- Calculate the required sample size per group using Python’s statsmodels library.
- What sample size is needed for each group to ensure your test is properly powered?

In [6]:
from statsmodels.stats.power import TTestIndPower

# Define the parameters
effect_size = 0.3
alpha = 0.05
power = 0.8

# Calculate the sample size
analysis = TTestIndPower()
sample_size = analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power)
print(f'Required sample size per group: {sample_size:.2f}')

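Required sample size per group: 175.38

Rounding up, each group needs 176 recipients for the test to reach 80% power at an effect size of 0.3.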
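
As a cross-check on the scenario, the sketch below converts the 20% to 23% open-rate change into Cohen's h and feeds it into a normal-approximation sample-size formula. The helper names `cohens_h` and `approx_n_per_group` are hypothetical, and the formula is an approximation rather than the noncentral-t calculation that `TTestIndPower` performs.

```python
import math
from statistics import NormalDist


def cohens_h(p1, p2):
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p2)) - 2 * math.asin(math.sqrt(p1))


def approx_n_per_group(effect_size, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group (two-sided, equal groups)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return 2 * (z_alpha + z_power) ** 2 / effect_size ** 2


h = cohens_h(0.20, 0.23)
n_for_h = approx_n_per_group(h)      # effect implied by 20% -> 23%
n_for_d = approx_n_per_group(0.3)    # effect size used in the exercise

print(f"Cohen's h for 20% -> 23%: {h:.3f}")
print(f"Approximate n per group at that h: {math.ceil(n_for_h)}")
print(f"Approximate n per group at 0.3:   {math.ceil(n_for_d)}")
```

Under these assumptions, detecting a three-point lift in open rate needs thousands of recipients per group rather than a couple of hundred. If statsmodels is available, `statsmodels.stats.proportion.proportion_effectsize` should give the same arcsine-based effect size.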


## Exercise 2: Understanding the Relationship Between Effect Size and Sample Size
Using the same A/B test setup as in Exercise 1, you want to explore how changing the expected effect size impacts the required sample size.

- Calculate the required sample size for the following effect sizes: 0.2, 0.4, and 0.5, keeping the significance level and power the same.
- How does the sample size change as the effect size increases? Explain why this happens.

In [9]:
# Solve for the per-group sample size at each effect size,
# reusing analysis, alpha, and power from Exercise 1
for effect_size in (0.2, 0.4, 0.5):
    sample_size = analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power)
    print(f'Required sample size per group (effect size is {effect_size}): {sample_size:.2f}')

Required sample size per group (effect size is 0.2): 393.41
Required sample size per group (effect size is 0.4): 99.08
Required sample size per group (effect size is 0.5): 63.77


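The required sample size falls as the effect size grows: about 394 per group at 0.2, 100 at 0.4, and 64 at 0.5. It scales roughly with the inverse square of the effect size, so halving the effect size from 0.4 to 0.2 roughly quadruples the sample. A smaller effect is harder to distinguish from random noise, so more observations are needed to reach the same statistical power.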
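
The inverse-square relationship can be read off the usual normal-approximation formula for a two-sided, two-sample comparison with equal group sizes. This is an approximation, not the noncentral-t calculation that `TTestIndPower` performs, so it runs about one observation below the values printed above:

$$
n \approx \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2}{d^2}
$$

With $\alpha = 0.05$ and power $1 - \beta = 0.8$, $z_{0.975} \approx 1.960$ and $z_{0.8} \approx 0.842$, so $n \approx 15.70 / d^2$: about 392 for $d = 0.2$, 98 for $d = 0.4$, and 63 for $d = 0.5$.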

## Exercise 3: Exploring the Impact of Statistical Power
Imagine you are conducting an A/B test where you expect a small effect size of 0.2. You initially plan for a power of 0.8 but wonder how increasing or decreasing the desired power level impacts the required sample size.

- Calculate the required sample size for power levels of 0.7, 0.8, and 0.9, keeping the effect size at 0.2 and significance level at 0.05.
- How does the required sample size change with different levels of statistical power? Why is this understanding important when designing A/B tests?

In [10]:
# Define the fixed parameters
effect_size = 0.2
alpha = 0.05

# Solve for the per-group sample size at each power level
analysis = TTestIndPower()
for power in (0.7, 0.8, 0.9):
    sample_size = analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power)
    print(f'Required sample size per group (power is {power}): {sample_size:.2f}')

Required sample size per group (power is 0.7): 309.56
Required sample size per group (power is 0.8): 393.41
Required sample size per group (power is 0.9): 526.33


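Higher statistical power increases the chance of detecting a real effect (a lower Type II error rate), but it costs more data: at an effect size of 0.2, the requirement rises from about 310 per group at 0.7 power to 394 at 0.8 and 527 at 0.9. Moving from 0.8 to 0.9 power alone adds roughly a third more participants. This matters when designing A/B tests because the power target sets the trade-off between the risk of missing a real improvement and the traffic and time the test consumes.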
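
The same trade-off can be read in the other direction: fix the sample size and ask how much power it buys. A minimal sketch using the normal approximation, which is an assumption here (the statsmodels calculation uses the noncentral t-distribution and differs slightly); `approx_power` is a hypothetical helper name.

```python
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test with equal groups.

    Ignores the negligible chance of rejecting in the wrong direction.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    shift = effect_size * (n_per_group / 2) ** 0.5
    return NormalDist().cdf(shift - z_alpha)


# Per-group sizes taken from the output above, rounded up
powers = {n: approx_power(0.2, n) for n in (310, 394, 527)}
for n, achieved in powers.items():
    print(f"n = {n} per group -> approximate power {achieved:.2f}")
```

If traffic caps the test at a fixed number of users, a calculation like this shows how much power is being given up before the test starts.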

## Exercise 4: Implementing Sequential Testing
You are running an A/B test on two versions of a product page to increase the purchase rate. You plan to monitor the results weekly and stop the test early if one version shows a significant improvement.

- Define your stopping criteria.
- Decide how you would implement sequential testing in this scenario.
- At the end of week three, Version B has a p-value of 0.02. What would you do next?

- Stopping Criteria: I will use adjusted significance boundaries based on an alpha spending function, with the number of weekly looks fixed before the test starts. The test stops early only if the p-value at a weekly check falls below the predefined threshold for that specific look.
- Implementation: I will apply the O’Brien-Fleming approach. Its boundaries are very strict at early looks and relax toward the end, which allows early stopping for efficacy while controlling the overall Type I error rate at $5\%$.
- Action at Week 3: I would compare the p-value of $0.02$ with the boundary pre-specified for that look, not with $0.05$. Whether it crosses depends on how many looks were planned. With five equally spaced looks, the nominal O’Brien-Fleming threshold at the third look is roughly $0.008$, so $0.02$ would not cross it and the test should continue. Only if week three is the final of three planned looks (nominal threshold roughly $0.045$) would I stop and implement Version B.
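
The week-three decision depends on the planned number of looks, which the exercise does not state. The sketch below assumes five equally spaced weekly looks (a hypothetical choice) and uses the Lan-DeMets O'Brien-Fleming-type spending function to compute the cumulative alpha spent by each look. The nominal p-value threshold at a look can never exceed the cumulative alpha spent by then, so this gives an upper bound rather than the exact boundary.

```python
from statistics import NormalDist


def obf_alpha_spent(t, alpha=0.05):
    """Cumulative two-sided alpha spent at information fraction t (0 < t <= 1),
    using the Lan-DeMets O'Brien-Fleming-type spending function."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / t ** 0.5))


planned_looks = 5    # hypothetical: not given in the exercise
observed_p = 0.02    # p-value reported at the end of week three

for week in range(1, planned_looks + 1):
    spent = obf_alpha_spent(week / planned_looks)
    print(f"week {week}: cumulative alpha spent = {spent:.4f}")

spent_week3 = obf_alpha_spent(3 / planned_looks)
# If the p-value exceeds everything spent so far, the boundary cannot have been crossed.
cannot_stop_yet = observed_p > spent_week3
print("continue the test" if cannot_stop_yet else "check the exact boundary")
```

Under this five-look assumption, a p-value of 0.02 at week three is above the total alpha spent so far, so the test keeps running. Exact boundaries require a group-sequential design tool; this sketch only bounds them.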

## Exercise 5: Applying Bayesian A/B Testing
You’re testing a new feature in your app, and you want to use a Bayesian approach. Initially, you believe the new feature has a 50% chance of improving user engagement. After collecting data, your analysis suggests a 65% probability that the new feature is better.

- Describe how you would set up your prior belief.
- After collecting data, how does the updated belief (posterior distribution) influence your decision?
- What would you do if the posterior probability was only 55%?


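- Prior: A 50% belief that the feature helps amounts to having no preference between the variants, so I would give both the same weakly informative prior (for a conversion-style metric, for example, Beta(1, 1) on each rate).

- Do not implement yet: A posterior probability of 65% (or 55%) still leaves a 35–45% chance that the new feature is no better, which is too high to justify rolling it out.

- Continue data collection: More data narrows the posterior distribution, so the probability that the feature is better can move toward a pre-agreed decision threshold (for example, 95%) or fall away from it.

- Check Expected Loss: In Bayesian testing, we don't just look at the probability of winning; we also evaluate the "Expected Loss", meaning how much we stand to lose if Version B is actually worse. If the potential downside is significant, we should require a stricter threshold, such as 99%.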
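
The posterior probability and expected loss mentioned above can be estimated by simulation. A minimal sketch, assuming engagement is a yes/no outcome per user, a Beta(1, 1) prior on each rate, and made-up counts chosen so the result lands in the neighbourhood of the exercise's 65%; `posterior_summary` is a hypothetical helper name.

```python
import random


def posterior_summary(conv_a, n_a, conv_b, n_b, draws=50_000, seed=42):
    """Monte Carlo estimate of P(B > A) and the expected loss of shipping B,
    using independent Beta(1, 1) priors on each engagement rate."""
    rng = random.Random(seed)
    wins_b = 0
    loss_if_ship_b = 0.0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins_b += rate_b > rate_a
        # Loss is the engagement given up in draws where A is actually better
        loss_if_ship_b += max(rate_a - rate_b, 0.0)
    return wins_b / draws, loss_if_ship_b / draws


# Hypothetical data: 120 of 1000 engaged on A, 126 of 1000 engaged on B
prob_b_better, expected_loss_b = posterior_summary(120, 1000, 126, 1000)
print(f"P(B better than A): {prob_b_better:.2f}")
print(f"Expected loss if B ships: {expected_loss_b:.4f}")
```

A decision rule could then combine both numbers, for example shipping only when the probability clears the agreed threshold and the expected loss is below what the business can tolerate.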

## Exercise 6: Implementing Adaptive Experimentation
You’re running a test with three different website layouts to increase user engagement. Initially, each layout gets 33% of the traffic. After the first week, Layout C shows higher engagement.

- Explain how you would adjust the traffic allocation after the first week.
- Describe how you would continue to adapt the experiment in the following weeks.
- What challenges might you face with adaptive experimentation, and how would you address them?


1. Traffic Adjustment: I would use an algorithm like Thompson Sampling to shift more traffic toward Layout C. Instead of 33%, Layout C might receive 70-80% of the traffic based on its probability of being the optimal choice.

2. Continuous Adaptation: I would maintain a balance between Exploration and Exploitation. The system will continuously monitor data; if Layout C remains the leader, its share stays high. If Layout B starts performing better, the algorithm will dynamically shift traffic back to B.

3. Challenges: A major challenge is the Novelty Effect, where a layout performs well initially just because it's new. I would address this by keeping a "holdout" or ensuring a minimum traffic floor for all versions to observe long-term stability.
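
The reallocation step can be sketched as follows. This estimates the share of traffic Thompson Sampling would send to each layout by repeatedly drawing from each layout's Beta posterior and counting how often it comes out on top, then reserves a minimum floor for every layout. The week-one counts, the 10% floor, and the helper name `thompson_shares` are all illustrative assumptions.

```python
import random


def thompson_shares(successes, trials, floor=0.10, draws=20_000, seed=7):
    """Traffic shares proportional to each arm's posterior probability of being
    best (Beta(1, 1) priors), with a guaranteed minimum share per arm."""
    rng = random.Random(seed)
    arms = list(successes)
    best_counts = dict.fromkeys(arms, 0)
    for _ in range(draws):
        sampled = {
            arm: rng.betavariate(1 + successes[arm],
                                 1 + trials[arm] - successes[arm])
            for arm in arms
        }
        best_counts[max(sampled, key=sampled.get)] += 1
    # Reserve the floor for every arm, then split the rest by win frequency
    free_share = 1 - floor * len(arms)
    return {arm: floor + free_share * best_counts[arm] / draws for arm in arms}


# Hypothetical week-one results: engaged users out of visitors per layout
week_one_engaged = {"A": 100, "B": 105, "C": 125}
week_one_visitors = {"A": 1000, "B": 1000, "C": 1000}

shares = thompson_shares(week_one_engaged, week_one_visitors)
for layout, share in shares.items():
    print(f"Layout {layout}: {share:.0%} of next week's traffic")
```

Rerunning this each week with cumulative counts lets the allocation drift back toward another layout if it overtakes C, while the floor keeps every layout observable in case the early lead was a novelty effect.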