# Exercises XP



**Exercise 1: Calculating Required Sample Size**

You are planning an A/B test to evaluate the impact of a new email subject line on the open rate. Based on past data, you expect a small standardized effect size of 0.3 (treated here as an increase from 20% to 23% in the open rate). You aim for an 80% chance (power = 0.8) of detecting this effect if it exists, with a 5% significance level (α = 0.05).

Calculate the required sample size per group using Python’s statsmodels library.
What sample size is needed for each group to ensure your test is properly powered?

In [None]:
from statsmodels.stats.power import TTestIndPower

# Define the parameters
effect_size = 0.3
alpha = 0.05
power = 0.8

# Calculate the sample size
analysis = TTestIndPower()
sample_size = analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power)
print(f'Required sample size per group: {sample_size:.2f}')

Required sample size per group: 175.38
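Since a sample size must be a whole number, 175.38 rounds up to 176 users per group. One caveat is worth checking: 0.3 is a standardized effect size, and a raw lift from 20% to 23% is much smaller once it is standardized. The sketch below uses only the standard library, with Cohen's h and a normal approximation in place of the t-test power calculation from `statsmodels`. The helper names `cohens_h` and `n_per_group_normal` are made up for illustration. Under these assumptions, h for 20% versus 23% comes out at roughly 0.07, which would call for a few thousand users per group rather than 176.

```python
import math
from statistics import NormalDist


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * (math.asin(math.sqrt(p2)) - math.asin(math.sqrt(p1)))


def n_per_group_normal(effect_size: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-group sample size for a two-sided test with equal group sizes.

    Normal approximation (an assumption of this sketch), rounded up.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


h = cohens_h(0.20, 0.23)
print(f"Cohen's h for 20% -> 23%: {h:.3f}")
print(f"Approximate n per group at that h: {n_per_group_normal(h)}")
print(f"Approximate n per group at effect size 0.3: {n_per_group_normal(0.3)}")
```

The normal approximation lands slightly below the t-based figure printed above, which is expected because the t-distribution has heavier tails.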


**Exercise 2: Understanding the Relationship Between Effect Size and Sample Size**

Using the same A/B test setup as in Exercise 1, you want to explore how changing the expected effect size impacts the required sample size.

Calculate the required sample size for the following effect sizes: 0.2, 0.4, and 0.5, keeping the significance level and power the same.
How does the sample size change as the effect size increases? Explain why this happens.

In [None]:
# Effect size: 0.2
effect_size = 0.2
alpha = 0.05
power = 0.8

# Calculate the sample size
analysis = TTestIndPower()
sample_size = analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power)
print(f'Required sample size per group: {sample_size:.2f}')

Required sample size per group: 393.41


In [None]:
# Effect size: 0.4
effect_size = 0.4
alpha = 0.05
power = 0.8

# Calculate the sample size
analysis = TTestIndPower()
sample_size = analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power)
print(f'Required sample size per group: {sample_size:.2f}')

Required sample size per group: 99.08


In [None]:
# Effect size: 0.5
effect_size = 0.5
alpha = 0.05
power = 0.8

# Calculate the sample size
analysis = TTestIndPower()
sample_size = analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power)
print(f'Required sample size per group: {sample_size:.2f}')

Required sample size per group: 63.77


**Answer**

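As the expected effect size grows, the required sample size shrinks: a larger difference between the groups is easier to distinguish from random noise, so fewer observations are needed to detect it. The drop is steep because the required sample size scales roughly with the inverse square of the effect size. Doubling the effect size from 0.2 to 0.4 cuts the requirement from about 393 to about 99 per group, roughly a quarter.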
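The inverse-square relationship can be made explicit with a short sketch. It uses a normal approximation from the standard library rather than `TTestIndPower`, so the absolute numbers sit slightly below the outputs above. The function name `approx_n_per_group` is an illustrative assumption.

```python
from statistics import NormalDist


def approx_n_per_group(effect_size, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group (two-sided, equal groups)."""
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return 2 * (z_total / effect_size) ** 2


baseline = approx_n_per_group(0.2)
for d in (0.2, 0.3, 0.4, 0.5):
    n = approx_n_per_group(d)
    # The ratio depends only on (0.2 / d) ** 2, not on alpha or power.
    print(f"d={d}: n per group ~ {n:.1f} ({n / baseline:.2f}x the d=0.2 requirement)")
```

Because alpha and power enter only through a constant factor, the ratio between two sample sizes depends solely on the ratio of the effect sizes.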

**Exercise 3: Exploring the Impact of Statistical Power**

Imagine you are conducting an A/B test where you expect a small effect size of 0.2. You initially plan for a power of 0.8 but wonder how increasing or decreasing the desired power level impacts the required sample size.

Calculate the required sample size for power levels of 0.7, 0.8, and 0.9, keeping the effect size at 0.2 and significance level at 0.05.
Question: How does the required sample size change with different levels of statistical power? Why is this understanding important when designing A/B tests?

In [None]:
# Power level: 0.7
effect_size = 0.2
alpha = 0.05
power = 0.7

# Calculate the sample size
analysis = TTestIndPower()
sample_size = analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power)
print(f'Required sample size per group: {sample_size:.2f}')

Required sample size per group: 309.56


In [None]:
# Power level: 0.8
effect_size = 0.2
alpha = 0.05
power = 0.8

# Calculate the sample size
analysis = TTestIndPower()
sample_size = analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power)
print(f'Required sample size per group: {sample_size:.2f}')

Required sample size per group: 393.41


In [None]:
# Power level: 0.9
effect_size = 0.2
alpha = 0.05
power = 0.9

# Calculate the sample size
analysis = TTestIndPower()
sample_size = analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power)
print(f'Required sample size per group: {sample_size:.2f}')

Required sample size per group: 526.33


**Answer**

As the power level increases, so does the required sample size. Higher power means a higher probability of detecting a true effect if one exists, and that extra certainty costs data: moving from 0.7 to 0.9 power raises the requirement from about 310 to about 526 users per group.

Understanding this trade-off matters when designing A/B tests because it supports:
**Avoiding underpowered tests**,
**Resource efficiency**, and
**Informed decision-making**.
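The same trade-off can be read in the other direction: given a fixed sample size, how much power does the test have? The sketch below is a normal approximation using only the standard library, not the `statsmodels` calculation, and `approx_power` is a hypothetical helper name.

```python
import math
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test with equal groups."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the assumed effect size.
    noncentrality = effect_size * math.sqrt(n_per_group / 2)
    # Probability of rejecting in the direction of the true effect; the tiny
    # chance of rejecting in the opposite direction is ignored.
    return z.cdf(noncentrality - z_alpha)


for n in (100, 310, 394, 527, 800):
    print(f"n per group = {n}: power ~ {approx_power(0.2, n):.2f}")
```

Running this with an effect size of 0.2 shows diminishing returns: each additional step in power costs more users than the previous one.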

**Exercise 4: Implementing Sequential Testing**

You are running an A/B test on two versions of a product page to increase the purchase rate. You plan to monitor the results weekly and stop the test early if one version shows a significant improvement.

Define your stopping criteria.
Decide how you would implement sequential testing in this scenario.
At the end of week three, Version B has a p-value of 0.02. What would you do next?


**Answer**

Before starting the A/B test, you must define the stopping criteria: the minimum test duration, the minimum sample size, the number of planned interim looks, and how large the improvement must be to matter. Because results are checked every week, you should use a method designed for repeated looks (such as alpha spending or a Bayesian approach) so that early checks do not inflate the false-positive rate. At the end of week three, a p-value of 0.02 justifies stopping only if it falls below the pre-defined sequential threshold for that look, which is usually stricter than 0.05. If it does, you can launch Version B; if not, you should continue to the next planned look or the planned end of the test.
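One way to turn this into concrete thresholds is an O'Brien-Fleming-type alpha-spending function. The sketch below makes several assumptions that are not in the exercise: four equally spaced weekly looks (`PLANNED_LOOKS` is hypothetical), an overall α of 0.05, and a conservative simplification in which the p-value at each look is compared with the alpha newly spent at that look. Exact group-sequential boundaries are somewhat less strict and require numerical integration, which dedicated software handles.

```python
import math
from statistics import NormalDist


def obf_cumulative_alpha(information_fraction, alpha=0.05):
    """O'Brien-Fleming-type spending: total alpha spent by a given fraction of the test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    return 2 * (1 - z.cdf(z_alpha / math.sqrt(information_fraction)))


def per_look_thresholds(n_looks, alpha=0.05):
    """Conservative p-value threshold per look: the alpha newly spent at that look."""
    spent = [obf_cumulative_alpha(k / n_looks, alpha) for k in range(1, n_looks + 1)]
    return [spent[0]] + [spent[k] - spent[k - 1] for k in range(1, n_looks)]


PLANNED_LOOKS = 4  # hypothetical: one look per week for four weeks
thresholds = per_look_thresholds(PLANNED_LOOKS)
for week, threshold in enumerate(thresholds, start=1):
    print(f"week {week}: stop early if p <= {threshold:.4f}")

week, p_value = 3, 0.02  # the situation described in the exercise
decision = "stop" if p_value <= thresholds[week - 1] else "continue"
print(f"week {week}, p = {p_value}: {decision}")
```

Under these assumptions the early looks have very strict thresholds and most of the alpha is saved for the final analysis, so a p-value of 0.02 in week three is not enough to stop.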

**Exercise 5: Applying Bayesian A/B Testing**

You’re testing a new feature in your app, and you want to use a Bayesian approach. Initially, you believe the new feature has a 50% chance of improving user engagement. After collecting data, your analysis suggests a 65% probability that the new feature is better.

Describe how you would set up your prior belief.
After collecting data, how does the updated belief (posterior distribution) influence your decision?
What would you do if the posterior probability was only 55%?

**Answer**

In a Bayesian A/B test, I would start with a neutral prior that reflects a 50/50 belief that the new feature could be better or worse, meaning I don’t favor either version before seeing data. After collecting data, the updated belief (posterior) guides decisions: a 65% probability that the feature is better suggests it is promising but usually not strong enough to fully launch unless it meets a pre-defined confidence threshold. If the posterior probability were only 55%, I would treat the result as inconclusive and either collect more data or stop the test for futility, since the evidence for improvement is weak.
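A minimal sketch of this workflow, assuming a binary engagement metric and a Beta-Binomial model: both versions start from the same uniform Beta(1, 1) prior, which gives a 50% prior probability that the new feature is better, and the posterior probability is estimated by Monte Carlo sampling. The counts, the 95% launch threshold, and the function names are hypothetical and chosen only for illustration.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, prior=(1, 1), draws=50_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under independent Beta posteriors."""
    rng = random.Random(seed)
    prior_alpha, prior_beta = prior
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(prior_alpha + conv_a, prior_beta + n_a - conv_a)
        rate_b = rng.betavariate(prior_alpha + conv_b, prior_beta + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


def decide(posterior_prob, launch_at=0.95):
    """Hypothetical decision rule: launch only above a pre-defined threshold."""
    if posterior_prob >= launch_at:
        return "launch"
    return "inconclusive: collect more data or stop for futility"


print(f"Prior only, no data: {prob_b_beats_a(0, 0, 0, 0):.2f}")
posterior = prob_b_beats_a(conv_a=120, n_a=1000, conv_b=140, n_b=1000)  # made-up counts
print(f"After data: {posterior:.2f} -> {decide(posterior)}")
print(f"Posterior of 0.65 -> {decide(0.65)}")
print(f"Posterior of 0.55 -> {decide(0.55)}")
```

With this rule, posteriors of 65% and 55% both fall short of the launch threshold, which matches the reasoning above.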

**Exercise 6: Implementing Adaptive Experimentation**

You’re running a test with three different website layouts to increase user engagement. Initially, each layout gets 33% of the traffic. After the first week, Layout C shows higher engagement.

Explain how you would adjust the traffic allocation after the first week.
Describe how you would continue to adapt the experiment in the following weeks.
What challenges might you face with adaptive experimentation, and how would you address them?

**Answer**

After the first week, I would move some traffic toward Layout C since it shows higher engagement, but I wouldn’t send all users there yet. I would keep a smaller but steady amount of traffic on Layouts A and B so I can continue learning and make sure the early result wasn’t just noise. Over the next weeks, I would keep updating the results and gradually send more traffic to the layout that continues to perform best, while setting clear rules for when to stop the test or pause a layout that performs poorly.

Adaptive experiments can be harder to manage because early results can be misleading, traditional statistics don’t always apply, and traffic or user behavior can change over time. To handle this, I would move traffic gradually, keep minimum traffic levels for all layouts, use methods designed for adaptive testing, and set guardrails to protect user experience.
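The reallocation step could be sketched with Thompson-sampling-style weights plus a traffic floor. This is one of several possible adaptive methods, not the one the exercise prescribes. It assumes a binary engagement metric, uniform Beta(1, 1) priors, a 10% minimum share per layout, and made-up first-week counts. The function name `allocate_traffic` is hypothetical.

```python
import random


def allocate_traffic(successes, trials, floor=0.10, draws=20_000, seed=0):
    """Traffic shares: a fixed floor per layout plus a share proportional to
    the estimated probability that the layout is the best one."""
    rng = random.Random(seed)
    names = list(successes)
    best_counts = dict.fromkeys(names, 0)
    for _ in range(draws):
        # Draw one plausible engagement rate per layout from its Beta posterior.
        sampled = {
            name: rng.betavariate(1 + successes[name], 1 + trials[name] - successes[name])
            for name in names
        }
        best_counts[max(sampled, key=sampled.get)] += 1
    free_share = 1 - floor * len(names)  # traffic left after guaranteeing the floors
    return {name: floor + free_share * best_counts[name] / draws for name in names}


# Hypothetical week-one results: engaged users out of visitors per layout.
week_one_successes = {"A": 100, "B": 105, "C": 130}
week_one_trials = {"A": 1000, "B": 1000, "C": 1000}

shares = allocate_traffic(week_one_successes, week_one_trials)
for name, share in shares.items():
    print(f"Layout {name}: {share:.1%} of next week's traffic")
```

Repeating the calculation each week with cumulative counts moves traffic gradually toward the best layout, while the floor keeps enough data flowing to A and B to catch an early result that was only noise.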