#Top

[![Open in GitHub](https://img.shields.io/badge/Open%20Folder%20in-GitHub-181717?logo=github&logoColor=white)](https://github.com/lindsayalexandra14/lindsayalexandra14/tree/main/templates/ab_testing)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17cZWoW5lvq5EGTlWNSdCw3sjgT8vLxNI)

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/AB%20Testing%20Large%20Sample%20Size.png)

**summary**
*   This hypothetical experiment compares two landing pages (control vs. treatment)
*   The sample is about 32,000 users per group (64,050 in total)
*   I use a z-test for two proportions, which is appropriate for large sample sizes
*   I test whether the treatment performed better than the control (a one-sided test) because the team is interested in moving forward with the treatment
*   The test shows that the treatment performed significantly better than the control (at alpha = 0.05). The standardized effect size is negligible by Cohen's benchmarks (Cohen's h = 0.061), although the relative uplift is 16.10%. The observed power (about 100%) exceeds the desired 80%
*   Because the result is statistically significant and adequately powered, I recommend moving forward with implementing the treatment

**tl;dr for results**

*   Skip to "Results Summary" at the end





#Setup

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/Setup.png)

##Import Libraries

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/Import%20Libraries.png)

In [None]:
import textwrap

import numpy as np
from statsmodels.stats.power import NormalIndPower, TTestIndPower
from statsmodels.stats.proportion import (
    confint_proportions_2indep,
    proportion_confint,
    proportion_effectsize,
    proportions_ztest,
)

#Test Design

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/Test%20Design.png)

##Parameters

In [None]:
alpha = 0.05               # Significance level
power = 0.80               # Statistical power (probability of detecting an effect when it exists; 0.8 is standard)
control = 0.14             # Baseline conversion rate
effect = 0.05              # Desired relative effect (e.g., 5% lift over baseline)
mde = control * effect     # Minimum Detectable Effect (MDE)
  # Smallest absolute difference between the two proportions that you want to detect
  # e.g., 5% of a 14% baseline = 0.007 (14.0% -> 14.7%); going from 23% to 24% is a 1-point MDE
treatment = control + mde  # Treatment rate (baseline plus the effect)
print(f"Control: {control:.4f}")
print(f"Treatment: {treatment:.4f}")

Control: 0.1400
Treatment: 0.1470


In [None]:
p_1=treatment
p_2=control
p1_label = "Treatment"
p2_label = "Control"

alternative = "larger"  # direction of the test, stated relative to p1:
# "larger":    p1 is greater than p2
# "smaller":   p1 is less than p2
# "two-sided": p1 is different from p2

hypothesis_dict = {
    "larger":  f"{p1_label} ({p_1:.4f}) is larger than {p2_label} ({p_2:.4f})",
    "smaller":     f"{p1_label} ({p_1:.4f}) is smaller than {p2_label} ({p_2:.4f})",
    "two-sided":f"{p1_label} ({p_1:.4f}) is different from {p2_label} ({p_2:.4f})"
}

# Get message based on the 'alternative'
hypothesis = hypothesis_dict.get(alternative, "Invalid alternative")

print("Hypothesis:", hypothesis)


Hypothesis: Treatment (0.1470) is larger than Control (0.1400)


##Effect size

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/Effect%20Size.png)

In [None]:
# Cohen's h (standardized effect size for proportions)
effect_size = proportion_effectsize(p_1, p_2)

print(f"Control: {control:.4f}")
print(f"Treatment: {treatment:.4f}")
print(f"Minimum Detectable Effect (MDE): {mde:.3f}")
print('Effect size for p_1={0:.4f} p_2={1:.4f} is: {2:0.3f}'.format(p_1,p_2,effect_size))

Control: 0.1400
Treatment: 0.1470
Minimum Detectable Effect (MDE): 0.007
Effect size for p_1=0.1470 p_2=0.1400 is: 0.020


Cohen's h benchmarks:

0.2 = small effect

0.5 = medium effect

0.8 = large effect

If the effect is tiny, it will require a very large sample size to detect.

*   The effect is translated into Cohen's h
*   It quantifies how big the difference between two proportions is on a standardized scale
*   The same absolute difference (like +2 percentage points) is a larger standardized effect at a baseline of 5% than at a baseline of 50%
*   It puts differences on a common scale, so effect sizes can be compared fairly across experiments
*   It speaks to practical meaning (vs. just statistical significance)
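
To make the bullets above concrete, here is a minimal sketch of Cohen's h computed by hand with only the standard library. The helper name `cohens_h` is my own and not part of the notebook; the arcsine formula is the usual definition and should agree with `proportion_effectsize`. The two extra baselines (5% and 50%) are illustrative values, not experiment data.

```python
import math

def cohens_h(p1, p2):
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Planned effect from the test design: 14.0% -> 14.7%
h_design = cohens_h(0.147, 0.14)

# The same +2 percentage point difference at two illustrative baselines
h_low_baseline = cohens_h(0.07, 0.05)
h_mid_baseline = cohens_h(0.52, 0.50)

print(f"Design effect size: {h_design:.3f}")
print(f"+2 points at a 5% baseline:  h = {h_low_baseline:.3f}")
print(f"+2 points at a 50% baseline: h = {h_mid_baseline:.3f}")
```

The +2 point difference at the 5% baseline produces the larger h, which is why h is more comparable across experiments than a raw difference.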

##Sample Size

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/Sample%20Size.png)

Calculate minimum sample size for each group (cell) for one-sided and two-sided tests:
*   A one-sided test is used when you want to test if one group performs specifically better or worse than the other (a directional hypothesis).
*   A two-sided test is used when you want to test if there is any difference between the groups, regardless of direction — whether one is better or worse.

In [None]:
# initialize power analysis
power_analysis = NormalIndPower()

# minimum sample size per group for the planned test direction
# (`alternative` is one-sided here: "larger")
n_one_sided = power_analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power,
                                         ratio=1, alternative=alternative)
print('Sample size/Number needed in each group for one-sided test: {:.3f}'.format(n_one_sided))

# minimum sample size per group for a two-sided test
n_two_sided = power_analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power,
                                         ratio=1, alternative='two-sided')
print('Sample size/Number needed in each group for two-sided test: {:.3f}'.format(n_two_sided))
# round these up to the next whole user when planning the experiment


Sample size/Number needed in each group for one-sided test: 31011.454
Sample size/Number needed in each group for two-sided test: 39369.563
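
As a sanity check on the numbers above, the sketch below uses the textbook normal-approximation formula for equal group sizes, n = 2 * ((z_alpha + z_power) / h)^2, with only the standard library. The function name `sample_size_per_group` is hypothetical. I am assuming this closed form is close to what `solve_power` solves numerically; for the two-sided case it ignores the negligible far tail, so tiny differences are expected.

```python
import math
from statistics import NormalDist

def sample_size_per_group(h, alpha=0.05, power=0.80, two_sided=False):
    """Normal-approximation sample size per group, assuming equal group sizes."""
    tail = alpha / 2 if two_sided else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail)   # critical value of the test
    z_power = NormalDist().inv_cdf(power)      # quantile for the desired power
    return 2 * ((z_alpha + z_power) / h) ** 2

# Cohen's h for the planned effect (14.0% -> 14.7%)
h = 2 * math.asin(math.sqrt(0.147)) - 2 * math.asin(math.sqrt(0.14))

n_one = sample_size_per_group(h)
n_two = sample_size_per_group(h, two_sided=True)

# Round up, since a group cannot contain a fraction of a user
print(f"One-sided: {math.ceil(n_one)} per group")
print(f"Two-sided: {math.ceil(n_two)} per group")
```

The two-sided test needs more users because its critical value is larger for the same alpha.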


#Results

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/Results.png)

##Data

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/Data.png)

###Import Data

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/Import%20Data.png)

From a dataset:

In [None]:
#df_data=

Manually input:

In [None]:
n_observations_control = 32000
n_observations_treatment = 32050

conversions_control = 4300
conversions_treatment = 5000

In [None]:
print(p1_label) # set above in test design
print(p2_label)

Treatment
Control


##Conversion Rates:

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/Conversion%20Rates.png)

In [None]:
conv_rate_control = (conversions_control / n_observations_control)
conv_rate_treatment = (conversions_treatment / n_observations_treatment)

p1=conv_rate_treatment #assign p1 vs. p2, test alternative references p1
p2=conv_rate_control

c1=conversions_treatment
c2=conversions_control

n1=n_observations_treatment
n2=n_observations_control

uplift = (p1 - p2) / p2  # relative uplift of p1 over p2

abs_diff = p1 - p2  # difference in percentage points; the sign shows the direction

count = np.array([conversions_treatment, conversions_control])
nobs = np.array([n_observations_treatment, n_observations_control])

print("The conversion rate in our control group is: "+"{:.2%}".format(conv_rate_control))
print("The conversion rate in our treatment group is: "+"{:.2%}".format(conv_rate_treatment))
print("")
print("The relative uplift is: "+"{:.2%}".format(uplift))
print(("Absolute difference: "+"{:.2%}".format(abs_diff)))

The conversion rate in our control group is: 13.44%
The conversion rate in our treatment group is: 15.60%

The relative uplift is: 16.10%
Absolute difference: 2.16%


Check parameters and change if needed:

In [None]:
print(alternative)
print(alpha)
print(power)

larger
0.05
0.8


In [None]:
result_hypothesis_dict = {
    "larger":  f"{p1_label} ({p1:.4f}) is greater than {p2_label} ({p2:.4f})",
    "smaller":     f"{p1_label} ({p1:.4f}) is less than {p2_label} ({p2:.4f})",
    "two-sided":f"{p1_label} ({p1:.4f}) is different from {p2_label} ({p2:.4f})"
}

result_hypothesis = result_hypothesis_dict.get(alternative, "Invalid alternative")

print("Result Hypothesis:", result_hypothesis)

Result Hypothesis: Treatment (0.1560) is greater than Control (0.1344)


##Effect Size:

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/Effect%20Size.png)

In [None]:
effect_size = proportion_effectsize(p1, p2)
print("The effect size is "+"{:.2}".format(effect_size))

The effect size is 0.061


In [None]:
def interpret_h(h):
    abs_h = abs(h)
    if abs_h < 0.2:
        return "negligible"
    elif abs_h < 0.5:
        return "small"
    elif abs_h < 0.8:
        return "medium"
    else:
        return "large"

h = effect_size
interpretation = interpret_h(h)
print(f"Effect size interpretation: {interpretation}")

Effect size interpretation: negligible


##z-Test

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/ztest.png)

Run test:

In [None]:
stat, pval = proportions_ztest(count, nobs, alternative = alternative)
print(f"Z-statistic: {stat:.4f}")

Z-statistic: 7.7696
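
For readers who want to see what the test computes, here is a minimal sketch of the pooled two-proportion z-statistic using the counts entered above. The variable names are my own, and I am assuming the default pooled-variance form of `proportions_ztest`; this is a cross-check, not a replacement for the library call.

```python
import math

conversions_treatment, n_treatment = 5000, 32050
conversions_control, n_control = 4300, 32000

p_treatment = conversions_treatment / n_treatment
p_control = conversions_control / n_control

# Pooled proportion under the null hypothesis of no difference
p_pooled = (conversions_treatment + conversions_control) / (n_treatment + n_control)

# Standard error of the difference under the null hypothesis
se_pooled = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_treatment + 1 / n_control))

z_manual = (p_treatment - p_control) / se_pooled
print(f"Manual z-statistic: {z_manual:.4f}")
```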


##P-value

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/Pvalue.png)

In [None]:
p_value=pval
print('The P-Value is {0:0.4f}'.format(p_value))

The P-Value is 0.0000
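
The p-value is not literally zero; it is just too small to show at four decimals. A minimal sketch, assuming the z-statistic reported above and a standard normal reference distribution, prints it in scientific notation instead. The variable names are illustrative.

```python
from statistics import NormalDist

z = 7.7696  # z-statistic copied from the output above

# One-sided ("larger") p-value: P(Z >= z), which equals Phi(-z) by symmetry
p_one_sided = NormalDist().cdf(-z)

# Two-sided p-value, shown only for comparison
p_two_sided = 2 * p_one_sided

print(f"One-sided p-value: {p_one_sided:.2e}")
print(f"Two-sided p-value: {p_two_sided:.2e}")
```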


In [None]:
# Format the confidence level with :.0% rather than int(), which truncates
# floating-point results (e.g., alpha = 0.07 would display 92% instead of 93%)
if p_value < alpha:
    pvalue_message = (f"Because the p-value ({p_value:.3f}) is less than alpha ({alpha:.3f}), "
                      f"this result is statistically significant at the {1 - alpha:.0%} confidence level.")
else:
    pvalue_message = (f"Because the p-value ({p_value:.3f}) is greater than or equal to alpha ({alpha:.3f}), "
                      f"this result is not statistically significant at the {1 - alpha:.0%} confidence level.")

# Wrap text
wrapped_pvalue_message = textwrap.fill(pvalue_message, width=80)

print(wrapped_pvalue_message)


Because the p-value (0.000) is less than alpha (0.050), this result is
statistically significant at the 95% confidence level.


##Confidence Interval

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/Confidence%20Interval.png)

In [None]:
# Confidence interval for each group's conversion rate
ci_low_1, ci_upp_1 = proportion_confint(c1, n1, alpha=alpha, method='normal')
ci_low_2, ci_upp_2 = proportion_confint(c2, n2, alpha=alpha, method='normal')

print(f"{1 - alpha:.0%} Confidence interval ({p1_label}): ({ci_low_1:.4f}, {ci_upp_1:.4f})")
print(f"{1 - alpha:.0%} Confidence interval ({p2_label}): ({ci_low_2:.4f}, {ci_upp_2:.4f})")

95% Confidence interval (Treatment): (0.1520, 0.1600)
95% Confidence interval (Control): (0.1306, 0.1381)


If you repeated the experiment many times under the same conditions, about 95% of the confidence intervals calculated this way would contain the true population conversion rate for that group.
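
The per-group intervals can be reproduced by hand. This is a minimal sketch of the normal-approximation (Wald) interval, which I am assuming is what `method='normal'` computes; the helper name `wald_interval` is hypothetical.

```python
import math
from statistics import NormalDist

def wald_interval(successes, n, alpha=0.05):
    """Normal-approximation confidence interval for a single proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - alpha / 2)            # two-sided critical value
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)    # z times the standard error
    return p_hat - margin, p_hat + margin

low_t, upp_t = wald_interval(5000, 32050)   # treatment counts from above
low_c, upp_c = wald_interval(4300, 32000)   # control counts from above

print(f"Treatment: ({low_t:.4f}, {upp_t:.4f})")
print(f"Control:   ({low_c:.4f}, {upp_c:.4f})")
```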

In [None]:
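# Pass alpha explicitly so the interval follows the significance level set in the test design
lower, upper = confint_proportions_2indep(c1, n1, c2, n2, method='wald', compare='diff', alpha=alpha)

print(f"{1 - alpha:.0%} CI for difference in proportions ({p1_label} - {p2_label}): ({lower:.4f}, {upper:.4f})")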


95% CI for difference in proportions (Treatment - Control): (0.0162, 0.0271)


The confidence interval for the difference is the range of values for the true difference in proportions (e.g., conversion rates) that is plausible given your sample data.

It consists of:

A point estimate of the difference (p₁ - p₂)

A range around that estimate (here, [0.0162, 0.0271])

An associated confidence level (e.g., 95%), meaning:

If we repeated this experiment many times, about 95% of the intervals constructed this way would contain the true difference.

If 0 is not in the confidence interval, the difference in proportions is statistically significant in a two-sided test at the corresponding alpha (0.05 for a 95% interval).

This means there is evidence of a real difference between the two groups.
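
A minimal sketch of the Wald interval for the difference, using the counts entered earlier, shows where the bounds come from. Unlike the z-test, it uses the unpooled standard error; I am assuming this matches `method='wald'` in `confint_proportions_2indep`. The variable names are illustrative.

```python
import math
from statistics import NormalDist

alpha = 0.05
p_treatment, n_treatment = 5000 / 32050, 32050
p_control, n_control = 4300 / 32000, 32000

diff = p_treatment - p_control

# Unpooled standard error: each group contributes its own variance
se_diff = math.sqrt(
    p_treatment * (1 - p_treatment) / n_treatment
    + p_control * (1 - p_control) / n_control
)

z_crit = NormalDist().inv_cdf(1 - alpha / 2)
lower_diff = diff - z_crit * se_diff
upper_diff = diff + z_crit * se_diff

print(f"Difference: {diff:.4f}")
print(f"Interval:   ({lower_diff:.4f}, {upper_diff:.4f})")
print("Includes zero:", lower_diff <= 0 <= upper_diff)
```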

###Interpretation:

In [None]:
lower_ci = lower
upper_ci = upper

includes_zero = lower_ci <= 0 <= upper_ci
confidence_level_percent = (1 - alpha) * 100

if includes_zero:
    significance_message = (
        "Because the interval includes 0, this result is not statistically "
        f"significant at the {confidence_level_percent:.0f}% confidence level."
    )
else:
    significance_message = (
        "Because the interval does not include 0, this result is statistically "
        f"significant at the {confidence_level_percent:.0f}% confidence level."
    )


wrapped_significance_message = textwrap.fill(significance_message, width=80)
print(wrapped_significance_message)


Because the interval does not include 0, this result is statistically
significant at the 95% confidence level.


##Statistical Power

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/Statistical%20Power.png)

In [None]:
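# Use the normal-based power class so the power calculation matches the z-test
# and the NormalIndPower sample size calculation in the test design
analysis = NormalIndPower()
# ratio = nobs2 / nobs1, so the control group size is n1 * ratio = n2
observed_power = analysis.power(effect_size=effect_size, nobs1=n1, alpha=alpha, ratio=n2/n1, alternative=alternative)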


desired_power = power

if observed_power >= desired_power:
    power_message = (
        "The observed power ({:.1%}) is sufficient (>= {:.2f})."
        .format(observed_power, desired_power)
    )
else:
    power_message = (
        "The observed power ({:.1%}) is insufficient (< {:.0%}), meaning there was a "
        "higher chance we failed to detect a true difference due to limited sample size. "
        "As a result, we cannot be statistically confident in it without further data."
        .format(observed_power, desired_power)
    )

print(textwrap.fill(power_message, width=80))


The observed power (100.0%) is sufficient (>= 0.80).
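
Power computed from the observed effect is largely determined by the p-value, so a useful companion check is the power to detect the planned effect (14.0% vs. 14.7%) at the sample sizes actually collected. The sketch below is a hedged, standard-library version of the one-sided normal-approximation power formula; the helper names `cohens_h` and `one_sided_power` are my own and not part of the notebook.

```python
import math
from statistics import NormalDist

def cohens_h(p1, p2):
    """Cohen's h for two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

def one_sided_power(h, n1, n2, alpha=0.05):
    """Normal-approximation power of a one-sided ("larger") two-sample test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    n_effective = n1 * n2 / (n1 + n2)   # equals n / 2 when the groups are equal
    return NormalDist().cdf(h * math.sqrt(n_effective) - z_alpha)

n_treatment, n_control = 32050, 32000

# Power for the effect that was actually observed
power_observed = one_sided_power(cohens_h(5000 / 32050, 4300 / 32000),
                                 n_treatment, n_control)

# Power for the effect the test was designed to detect
power_planned = one_sided_power(cohens_h(0.147, 0.14), n_treatment, n_control)

print(f"Power for the observed effect: {power_observed:.1%}")
print(f"Power for the planned effect:  {power_planned:.1%}")
```

If the planned-effect power is at or above the desired level, the experiment collected enough data for the effect it was designed around, regardless of how large the observed effect turned out to be.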


# Results Summary

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/Results%20Summary.png)

In [None]:
print("The conversion rate in our control group is: "+"{:.2%}".format(conv_rate_control))
print("The conversion rate in our treatment group is: "+"{:.2%}".format(conv_rate_treatment))
print("")
print("The relative uplift is: "+"{:.2%}".format(uplift))
print(("Absolute difference: "+"{:.2%}".format(abs_diff)))
print("The effect size is "+"{:.2}".format(effect_size))
print(f"Effect size interpretation: {interpretation}")
print("")
print("Result Hypothesis:", result_hypothesis)
print("")
print(f"Z-statistic: {stat:.4f}")
print("")
print(wrapped_pvalue_message)
print("")
print(f"{1 - alpha:.0%} Confidence interval ({p1_label}): ({ci_low_1:.4f}, {ci_upp_1:.4f})")
print(f"{1 - alpha:.0%} Confidence interval ({p2_label}): ({ci_low_2:.4f}, {ci_upp_2:.4f})")
print("")
print(f"95% CI for difference in proportions ({p1_label} - {p2_label}): ({lower:.4f}, {upper:.4f})")
print("")
print(wrapped_significance_message)
print("")
print(textwrap.fill(power_message, width=80))

The conversion rate in our control group is: 13.44%
The conversion rate in our treatment group is: 15.60%

The relative uplift is: 16.10%
Absolute difference: 2.16%
The effect size is 0.061
Effect size interpretation: negligible

Result Hypothesis: Treatment (0.1560) is greater than Control (0.1344)

Z-statistic: 7.7696

Because the p-value (0.000) is less than alpha (0.050), this result is
statistically significant at the 95% confidence level.

95% Confidence interval (Treatment): (0.1520, 0.1600)
95% Confidence interval (Control): (0.1306, 0.1381)

95% CI for difference in proportions (Treatment - Control): (0.0162, 0.0271)

Because the interval does not include 0, this result is statistically
significant at the 95% confidence level.

The observed power (100.0%) is sufficient (>= 0.80).


#Recommendation

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/Recommendation.png)

Given the statistical significance, sufficient power, and the high business impact of a 16.10% relative uplift in conversion rate, I recommend moving forward with implementing the treatment.