#Top

[![Open in GitHub](https://img.shields.io/badge/Open%20Folder%20in-GitHub-181717?logo=github&logoColor=white)](https://github.com/lindsayalexandra14/lindsayalexandra14/tree/main/templates/ab_testing)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1snrYT_vN5-u8aqR6B2cAufMZHavFjpXp)

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/AB%20Testing%20Large%20Sample%20Size.png)

**summary**
*   This hypothetical experiment tests two landing pages (control vs. treatment)
*   The sample size is about 32,000 users per group
*   I use a z-test for two proportions, which is appropriate for large samples
*   The test is one-sided: I am testing whether the treatment outperforms the control, because the team is interested in moving forward with the treatment
*   The treatment converted significantly better than the control (at alpha = 0.05). The standardized effect size is very small (Cohen's h = 0.06), and the test had more than the desired statistical power (>80%)
*   I recommend moving forward with implementing the treatment

**tl;dr for results**

*   Skip to "Results Summary" at the end





#Setup

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/Setup.png)

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/Install%20Packages.png)

In [None]:
install.packages('pwr')
install.packages('glue')

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



##Import Libraries

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/Import%20Libraries.png)

In [None]:
library(pwr)
library(glue)

#Test Design

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/Test%20Design.png)

##Parameters

In [None]:
alpha <- 0.05             # Significance level
power <- 0.80             # Statistical power (probability of detecting an effect when it exists; 0.8 is standard)
control <- 0.14           # Baseline conversion rate
effect <- 0.05            # Desired relative effect (e.g., 5% lift over baseline)
mde <- control * effect   # Minimum Detectable Effect (MDE)
  # The smallest absolute difference between the two proportions that you want to detect
  # e.g., 5% of a 16% baseline = 0.008; or moving 23% to 24% is an MDE of 0.01 (1 percentage point)
treatment <- control + mde  # Treatment rate (baseline plus effect)
print(paste('Control:', control))
print(paste('Treatment:', treatment))

[1] "Control: 0.14"
[1] "Treatment: 0.147"


In [None]:
p_1=treatment
p_2=control
p1_label = "Treatment"
p2_label = "Control"

alternative = "greater" # direction of the test, stated in reference to p1:
# "greater":   p1 is greater than p2
# "less":      p1 is less than p2
# "two.sided": p1 is different from p2

hypothesis <- switch(alternative,
  greater = sprintf("%s (%.4f) is greater than %s (%.4f)", p1_label, p_1, p2_label, p_2),
  less = sprintf("%s (%.4f) is less than %s (%.4f)", p1_label, p_1, p2_label, p_2),
  two.sided = sprintf("%s (%.4f) is different from %s (%.4f)", p1_label, p_1, p2_label, p_2),
)

cat("Hypothesis:",hypothesis)


Hypothesis: Treatment (0.1470) is greater than Control (0.1400)

##Effect size

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/Effect%20Size.png)

In [None]:
# Cohen's h (standardized effect size for proportions)

effect_size = ES.h(treatment, control)

cat(sprintf("Control = %.4f\n", control))
cat(sprintf("Treatment= %.4f\n", treatment))
cat(sprintf("Minimum Detectable Effect (MDE): %.3f\n", mde))
cat(sprintf("Effect Size (Cohen's h): %.3f\n", effect_size))

Control = 0.1400
Treatment= 0.1470
Minimum Detectable Effect (MDE): 0.007
Effect Size (Cohen's h): 0.020


Cohen's h benchmarks:

0.2 = small effect

0.5 = medium effect

0.8 = large effect

If the effect is tiny, it will require a very large sample size to detect.

*   The effect is translated into Cohen’s h
*   It quantifies how big the difference between two proportions is, on a standardized scale
*   The same absolute difference (e.g., +2 percentage points) is a different-sized effect on a 5% baseline than on a 50% baseline
*   Cohen’s h puts differences on a common scale, so effect sizes can be compared fairly across experiments
*   It speaks to practical meaning (vs. just statistical significance)
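
To make the baseline point concrete, here is a minimal sketch that computes Cohen's h for the same +2 percentage point lift on a 5% baseline and on a 50% baseline. The helper name `cohens_h` and the example rates are illustrative assumptions, not part of the experiment; the formula is the arcsine transformation that `ES.h` applies.

```r
# Cohen's h via the arcsine transformation (hypothetical helper name)
cohens_h <- function(p1, p2) {
  2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))
}

# Same absolute lift (+2 percentage points) on two illustrative baselines
h_low_baseline  <- cohens_h(0.07, 0.05)
h_high_baseline <- cohens_h(0.52, 0.50)

cat(sprintf("h at 5%% baseline:  %.3f\n", h_low_baseline))
cat(sprintf("h at 50%% baseline: %.3f\n", h_high_baseline))
```

The lift on the 5% baseline yields the larger h, even though the absolute difference is identical in both cases.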

##Sample Size

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/Sample%20Size.png)

Calculate minimum sample size for each group (cell) for one-sided and two-sided tests:
*   A one-sided test is used when you want to test if one group performs specifically better or worse than the other (a directional hypothesis).
*   A two-sided test is used when you want to test if there is any difference between the groups, regardless of direction — whether one is better or worse.

In [None]:
# determine the minimum number of samples for each group

# pwr.2p.test requires inputting the effect size
result1 <- pwr.2p.test(h=effect_size, sig.level=alpha, power=power,alternative=alternative)

result2 <- pwr.2p.test(h=effect_size, sig.level=alpha, power=power,alternative='two.sided')

if (alternative %in% c('greater', 'less')) {
  power_prop_alternative <- 'one.sided'
} else {
  power_prop_alternative <- 'two.sided'
}

# or power.prop.test requires inputting the second proportion
result3 <- power.prop.test(p1=p_1,p2=p_2, sig.level=alpha, power=power,alternative=power_prop_alternative)

result4 <- power.prop.test(p1=p_1,p2=p_2, sig.level=alpha, power=power,alternative='two.sided')

In [None]:
# Inputting effect
cat("Inputting effect:\n")
cat(paste("(alternative)", alternative, ": n =", round(result1$n)), "\n")
cat(paste("(alternative)", 'two.sided', ": n =", round(result2$n)), "\n\n")

# Inputting proportion
cat("Inputting proportion:\n")
cat(paste("(alternative)", alternative, ": n =", round(result3$n)), "\n")
cat(paste("(alternative)", 'two.sided', ": n =", round(result4$n)), "\n")


Inputting effect:
(alternative) greater : n = 31011 
(alternative) two.sided : n = 39370 

Inputting proportion:
(alternative) greater : n = 31015 
(alternative) two.sided : n = 39374 
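
As a sanity check, the `pwr.2p.test` figures can be roughly reproduced by hand. This is a sketch that assumes the usual normal approximation for Cohen's h, where n per group is about 2 × (z for alpha + z for power)² / h², using alpha/2 for the two-sided case and ignoring the negligible far tail. The variable names are illustrative.

```r
# Design inputs from the Parameters cell
alpha     <- 0.05
power     <- 0.80
control   <- 0.14
treatment <- control * (1 + 0.05)   # 5% relative lift over baseline

# Cohen's h (arcsine transformation)
h_design <- 2 * asin(sqrt(treatment)) - 2 * asin(sqrt(control))

# Normal-approximation sample size per group
n_one_sided <- 2 * (qnorm(1 - alpha) + qnorm(power))^2 / h_design^2
n_two_sided <- 2 * (qnorm(1 - alpha / 2) + qnorm(power))^2 / h_design^2

cat("one-sided n per group:", ceiling(n_one_sided), "\n")
cat("two-sided n per group:", ceiling(n_two_sided), "\n")
```

Both values should land very close to the `pwr.2p.test` results above. The two-sided test needs more users because alpha is split across both tails.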


#Results

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/Results.png)

##Data

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/Data.png)

###Import Data

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/Import%20Data.png)

From a dataset:

In [None]:
#df_data=

Manually input:

In [None]:
n_observations_control <- 32000
n_observations_treatment <- 32050

conversions_control <- 4300
conversions_treatment <- 5000

n1 <- n_observations_treatment
n2 <- n_observations_control

In [None]:
print(p1_label) # set above in test design
print(p2_label)

[1] "Treatment"
[1] "Control"


##Conversion Rates:

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/Conversion%20Rates.png)

In [None]:
conv_rate_control = (conversions_control / n_observations_control)
conv_rate_treatment = (conversions_treatment / n_observations_treatment)

p1=conv_rate_treatment #assign p1 vs. p2, test alternative references p1
p2=conv_rate_control

c1=conversions_treatment
c2=conversions_control

n1=n_observations_treatment
n2=n_observations_control

In [None]:
print(glue("Control Conversion Rate: {round(conv_rate_control * 100, 2)}%"))
print(glue("Treatment Conversion Rate: {round(conv_rate_treatment * 100, 2)}%"))

Control Conversion Rate: 13.44%
Treatment Conversion Rate: 15.6%


Check parameters and change if needed:

In [None]:
print(alternative)
print(alpha)
print(power)

[1] "greater"
[1] 0.05
[1] 0.8


In [None]:
result_hypothesis <- switch(alternative,
  greater = sprintf("%s (%.4f) is greater than %s (%.4f)", p1_label, p1, p2_label, p2),
  less = sprintf("%s (%.4f) is less than %s (%.4f)", p1_label, p1, p2_label, p2),
  two.sided = sprintf("%s (%.4f) is different from %s (%.4f)", p1_label, p1, p2_label, p2),
)

cat("Result Hypothesis:",result_hypothesis)

Result Hypothesis: Treatment (0.1560) is greater than Control (0.1344)

##Effect Size:

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/Effect%20Size.png)

In [None]:
# Uplift
uplift = (p1 - p2) / p2

# Absolute Difference
abs_diff = abs(p1 - p2)

# Cohen's h function
proportion_effectsize <- function(p1, p2) {
  2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))
}

h <- proportion_effectsize(p1, p2)

# Interpret effect size
interpret_h <- function(h) {
  if (abs(h) < 0.2) return("negligible")
  if (abs(h) < 0.5) return("small")
  if (abs(h) < 0.8) return("medium")
  return("large")
}
cat(sprintf("Absolute difference: %.3f (%.1f%%)\n", abs_diff, abs_diff * 100))
print(glue("Uplift: {round(uplift * 100, 2)}%"))
cat(sprintf("Cohen's h: %.3f\n", h))
cat(sprintf("Effect size interpretation: %s\n", interpret_h(h)))

Absolute difference: 0.022 (2.2%)
Uplift: 16.1%
Cohen's h: 0.061
Effect size interpretation: negligible


##z-Test

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/ztest.png)

Run test:

In [None]:
x <- c(c1, c2)  # successes
n <- c(n1, n2)  # totals

# Run two-proportion test
test_result <- prop.test(x = x, n = n, alternative = alternative, correct = FALSE)
# correction not needed with large sample size
print(test_result)


	2-sample test for equality of proportions without continuity correction

data:  x out of n
X-squared = 60.366, df = 1, p-value = 3.938e-15
alternative hypothesis: greater
95 percent confidence interval:
 0.01705418 1.00000000
sample estimates:
   prop 1    prop 2 
0.1560062 0.1343750 
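
`prop.test` reports a chi-squared statistic rather than a z statistic. Without continuity correction, the two are linked: X-squared is the square of the pooled two-proportion z statistic. Here is a minimal sketch of the z-test by hand, using the counts entered above; the variable names `p_pool`, `se_pool`, and `p_one_sided` are illustrative.

```r
# Observed counts from the Data cell (group 1 = Treatment, group 2 = Control)
c1 <- 5000; n1 <- 32050
c2 <- 4300; n2 <- 32000

p1 <- c1 / n1
p2 <- c2 / n2

# Pooled proportion and standard error under the null hypothesis p1 == p2
p_pool  <- (c1 + c2) / (n1 + n2)
se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

z <- (p1 - p2) / se_pool
p_one_sided <- pnorm(z, lower.tail = FALSE)   # alternative = "greater"

cat(sprintf("z = %.3f, z^2 = %.3f, p-value = %.3e\n", z, z^2, p_one_sided))
```

The squared z should agree with the X-squared value printed by `prop.test`, and the one-sided p-value should agree with its p-value.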



##P-value

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/Pvalue.png)

In [None]:
p_value <- test_result$p.value

print(sprintf("p-value: %.4f", p_value))

pvalue_message <- if (p_value < alpha) {
  sprintf("Because the p-value (%.3f) is less than alpha (%.3f), this result is statistically significant at the %.0f%% confidence level.",
          p_value, alpha, (1 - alpha) * 100)
} else {
  sprintf("Because the p-value (%.3f) is greater than or equal to alpha (%.3f), this result is not statistically significant at the %.0f%% confidence level.",
          p_value, alpha, (1 - alpha) * 100)
}

cat(strwrap(pvalue_message, width = 80), sep = "\n")


[1] "p-value: 0.0000"
Because the p-value (0.000) is less than alpha (0.050), this result is
statistically significant at the 95% confidence level.


##Confidence Interval

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/Confidence%20Interval.png)

In [None]:
conf_int <- test_result$conf.int

print(paste((1-alpha)*100,"%"))
print(paste("confidence interval: [",conf_int[1],",",conf_int[2],']'))

[1] "95 %"
[1] "confidence interval: [ 0.0170541815700391 , 1 ]"


A confidence interval is the range of values that the true difference in proportions (e.g., conversion rates) could plausibly fall within, given your sample data.

You have a point estimate of the difference (e.g., p₁ - p₂)

A range (e.g., [-0.012, 0.035]) where that true difference likely lies

An associated confidence level (e.g., 95%), meaning:

If we repeated this experiment many times, about 95% of the intervals calculated this way would contain the true difference.

If 0 is not in the confidence interval, the difference in proportions is statistically significant at the specified confidence level.

This means there is evidence of a real difference between the two groups.
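
For the one-sided test run here (`alternative = "greater"`), `prop.test` reports only a lower bound for p₁ - p₂, with the upper bound fixed at 1. The sketch below reproduces that lower bound by hand, assuming the unpooled standard error that `prop.test` uses for its interval when `correct = FALSE`. The variable names are illustrative.

```r
# Observed counts from the Data cell
c1 <- 5000; n1 <- 32050   # Treatment
c2 <- 4300; n2 <- 32000   # Control
alpha <- 0.05

p1 <- c1 / n1
p2 <- c2 / n2

# Unpooled standard error of the difference in proportions
se_diff <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# One-sided interval: all of alpha sits in the lower tail
lower_bound <- (p1 - p2) - qnorm(1 - alpha) * se_diff

cat(sprintf("One-sided 95%% CI for p1 - p2: [%.4f, 1]\n", lower_bound))
```

A two-sided interval would use `qnorm(1 - alpha / 2)` instead and would report both a lower and an upper bound.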

###Interpretation:

In [None]:
lower_ci <- conf_int[1]
upper_ci <- conf_int[2]

if (alternative == "less") {
  if (upper_ci < 0) {
    ci_message <- sprintf("Significant: %s converts LESS than %s (CI upper bound < 0) by %.2f%% points", p1_label, p2_label, upper_ci * 100)
  } else {
    ci_message <- sprintf("Not significant: Cannot conclude %s < %s (CI includes 0 or more)",p1_label,p2_label)
  }
} else if (alternative == "greater") {
  if (lower_ci > 0) {
    ci_message <- sprintf("Significant: %s converts MORE than %s (CI lower bound > 0) by %.2f%% points", p1_label, p2_label, lower_ci * 100)
  } else {
    ci_message <- sprintf("Not significant: Cannot conclude %s > %s (CI includes 0 or less)",p1_label,p2_label)
  }
} else {
  if (lower_ci > 0 || upper_ci < 0) { #two.sided
    ci_message <- "Significant: Conversion rates are different (CI does not include 0)\n"
  } else {
    ci_message <- "Not significant: No evidence of a difference (CI includes 0)\n"
  }
}

wrapped_ci_message <- paste(strwrap(ci_message, width = 80), collapse = "\n")
cat(wrapped_ci_message, "\n")

Significant: Treatment converts MORE than Control (CI lower bound > 0) by 1.71%
points 


In [None]:
includes_zero <- 0
if (lower_ci <= 0 && upper_ci >= 0) {
  includes_zero <- 1
}

confidence_level_percent <- (1 - alpha) * 100

if (includes_zero == 1) {
  significance_message <- paste(
    "Because the interval includes 0, this result is not statistically",
    "significant at the confidence level of", confidence_level_percent, "%."
  )
} else {
  significance_message <- paste(
    "Because the interval does not include 0, this result is statistically",
    "significant at the confidence level of", confidence_level_percent, "%."
  )
}

wrapped_significance_message <- paste(strwrap(significance_message, width = 80), collapse = "\n")
cat(wrapped_significance_message, "\n")


Because the interval does not include 0, this result is statistically
significant at the confidence level of 95 %. 


Confidence Interval of % conversion:

In [None]:
se_p1 <- sqrt(p1 * (1 - p1) / n1)
lower_ci_p1 <- p1 - 1.96 * se_p1
upper_ci_p1 <- p1 + 1.96 * se_p1

cat(sprintf("%s 95%% CI: %.4f to %.4f\n",p1_label, lower_ci_p1, upper_ci_p1))

se_p2 <- sqrt(p2 * (1 - p2) / n2)
lower_ci_p2 <- p2 - 1.96 * se_p2
upper_ci_p2 <- p2 + 1.96 * se_p2

cat(sprintf("%s 95%% CI: %.4f to %.4f\n",p2_label, lower_ci_p2, upper_ci_p2))


Treatment 95% CI: 0.1520 to 0.1600
Control 95% CI: 0.1306 to 0.1381


If you repeated your experiment or data collection many times under the same conditions, about 95% of the confidence intervals calculated this way would contain the true population conversion rate.

##Statistical Power

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/Statistical%20Power.png)

In [None]:
# Alternatively, power.prop.test() takes the two proportions directly.
# It accepts a single per-group n (equal group sizes assumed) rather than n1 and n2,
# and its alternative must be "two.sided" or "one.sided".
# power_result <- power.prop.test(
#   n = (2 * n1 * n2) / (n1 + n2),  # Effective per-group sample size
#   p1 = p1,                 # Observed proportion in group 1
#   p2 = p2,                 # Observed proportion in group 2
#   sig.level = alpha,       # Significance level (e.g., 0.05)
#   alternative = power_prop_alternative  # "two.sided" or "one.sided"
# )

# print(power_result)

# Effective sample size (harmonic mean for unequal n)
n_effective <- (2 * n1 * n2) / (n1 + n2)

# Calculate power
power_result <- pwr.2p.test(h = h, n = n_effective, sig.level = alpha, alternative = alternative)
print(power_result)



     Difference of proportion power calculation for binomial distribution (arcsine transformation) 

              h = 0.06144038
              n = 32024.98
      sig.level = 0.05
          power = 1
    alternative = greater

NOTE: same sample sizes
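
The reported power of 1 reflects the observed effect (h of about 0.061), which is roughly three times the effect the test was designed to detect (h of about 0.020). As a rough sketch, assuming the normal approximation for a one-sided test, the power can be computed by hand for both effect sizes. The helper names `cohens_h` and `power_greater` are illustrative.

```r
alpha <- 0.05
n1 <- 32050; n2 <- 32000
n_effective <- (2 * n1 * n2) / (n1 + n2)   # harmonic mean, as above

cohens_h <- function(p1, p2) 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))

# One-sided ("greater") power under the normal approximation
power_greater <- function(h, n, alpha) {
  pnorm(h * sqrt(n / 2) - qnorm(1 - alpha))
}

h_observed <- cohens_h(5000 / n1, 4300 / n2)   # observed conversion rates
h_design   <- cohens_h(0.147, 0.14)            # planned rates from Test Design

power_observed <- power_greater(h_observed, n_effective, alpha)
power_design   <- power_greater(h_design, n_effective, alpha)

cat(sprintf("Power at observed h (%.3f): %.3f\n", h_observed, power_observed))
cat(sprintf("Power at design h (%.3f):   %.3f\n", h_design, power_design))
```

Against the planned effect, the achieved sample size gives power a little above the 80% target, which is consistent with collecting slightly more than the minimum one-sided sample size per group.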



In [None]:
# Extract the power (as a proportion) and convert to percent
power_pct <- round(power_result$power * 100, 1)

if (power_result$power < 0.8) {
  power_sentence <- sprintf(
    "Our test was underpowered (e.g., only ~%s%% power), meaning there was \na higher chance we failed to detect a true difference due to limited sample size. \nAs a result, we cannot be statistically \nconfident in it without further data and cannot give a confident \nestimate in incremental revenue from the test.",
    power_pct
  )
} else {
  power_sentence <- sprintf(
    "Our test was adequately powered (e.g., ~%s%% power), meaning we had a \nstrong chance of detecting a true difference if one existed.",
    power_pct
  )
}

cat("Result Power:", power_pct, "%\n\n")
cat(power_sentence, "\n")


Result Power: 100 %

Our test was adequately powered (e.g., ~100% power), meaning we had a 
strong chance of detecting a true difference if one existed. 


# Results Summary

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/Results%20Summary.png)

In [None]:
print(glue("Control Conversion Rate: {round(conv_rate_control * 100, 2)}%"))
print(glue("Treatment Conversion Rate: {round(conv_rate_treatment * 100, 2)}%"))
print(paste("Result Hypothesis:",result_hypothesis))
cat("\n")
cat(sprintf("Absolute difference: %.3f (%.1f%%)\n", abs_diff, abs_diff * 100))
print(glue("Uplift: {round(uplift * 100, 2)}%"))
cat(sprintf("Cohen's h: %.3f\n", h))
cat(sprintf("Effect size interpretation: %s\n", interpret_h(h)))
print(test_result)
print(sprintf("p-value: %.4f", p_value))
cat(strwrap(pvalue_message, width = 80), sep = "\n")
cat("\n")
wrapped_ci_message <- paste(strwrap(ci_message, width = 80), collapse = "\n")
cat(wrapped_ci_message, "\n")
cat("\n")
cat(wrapped_significance_message, "\n")
cat("\n")
cat(sprintf("%s 95%% CI: %.4f to %.4f\n",p1_label, lower_ci_p1, upper_ci_p1))
cat(sprintf("%s 95%% CI: %.4f to %.4f\n",p2_label, lower_ci_p2, upper_ci_p2))
cat("\n")
print(paste("confidence interval for diff: [",conf_int[1],",",conf_int[2],']'))
cat("\n")
cat("Result Power:", power_pct, "%\n\n")
cat(power_sentence, "\n")

Control Conversion Rate: 13.44%
Treatment Conversion Rate: 15.6%
[1] "Result Hypothesis: Treatment (0.1560) is greater than Control (0.1344)"

Absolute difference: 0.022 (2.2%)
Uplift: 16.1%
Cohen's h: 0.061
Effect size interpretation: negligible

	2-sample test for equality of proportions without continuity correction

data:  x out of n
X-squared = 60.366, df = 1, p-value = 3.938e-15
alternative hypothesis: greater
95 percent confidence interval:
 0.01705418 1.00000000
sample estimates:
   prop 1    prop 2 
0.1560062 0.1343750 

[1] "p-value: 0.0000"
Because the p-value (0.000) is less than alpha (0.050), this result is
statistically significant at the 95% confidence level.

Significant: Treatment converts MORE than Control (CI lower bound > 0) by 1.71%
points 

Because the interval does not include 0, this result is statistically
significant at the confidence level of 95 %. 

Treatment 95% CI: 0.1520 to 0.1600
Control 95% CI: 0.1306 to 0.1381

[1] "confidence interval for diff: [ 0.0

***performance:***  
With 95% confidence, the treatment's conversion rate is higher than the control's by at least 1.71 percentage points (based on the lower CI bound of 0.0171). This supports the hypothesis that the treatment is better than the control.

***significance:***  
Because the p-value (< 0.001) is less than alpha (0.050), and the 95% confidence interval for the difference does not contain 0, this result is statistically significant at the 95% confidence level. The standardized effect size is small (Cohen's h = 0.06), but the business impact is high based on domain knowledge.

***power:***  
Our test was adequately powered (~100% power vs. 80% desired), meaning we had a strong chance of detecting a true difference if one existed.

#Recommendation

![Alt text](https://github.com/lindsayalexandra14/ds_portfolio/raw/main/2_images/templates/notebook/headers/beige/Recommendation.png)

Due to the significance, power, and high business impact, I recommend moving forward with implementing the treatment.

In [None]:
pkgs <- installed.packages()[, c("Package", "Version")]
lines <- paste(pkgs[, "Package"], pkgs[, "Version"], sep = "==")
writeLines(lines, "requirements_r.txt")

In [None]:
# Print to console
cat(lines, sep = "\n")

glue==1.8.0
IRdisplay==1.1
IRkernel==1.3.2
pbdZMQ==0.3-14
pwr==1.3-0
repr==1.1.7
askpass==1.2.1
backports==1.5.0
base64enc==0.1-3
bit==4.6.0
bit64==4.6.0-1
blob==1.2.4
boot==1.3-31
brew==1.0-10
brio==1.1.5
broom==1.0.9
bslib==0.9.0
cachem==1.1.0
callr==3.7.6
cellranger==1.1.0
class==7.3-23
cli==3.6.5
clipr==0.8.0
cluster==2.1.8.1
codetools==0.2-20
commonmark==2.0.0
conflicted==1.2.0
cpp11==0.5.2
crayon==1.5.3
credentials==2.0.2
curl==6.4.0
data.table==1.17.8
DBI==1.2.3
dbplyr==2.5.0
desc==1.4.3
devtools==2.4.5
diffobj==0.3.6
digest==0.6.37
downlit==0.4.4
dplyr==1.1.4
dtplyr==1.3.1
ellipsis==0.3.2
evaluate==1.0.4
fansi==1.0.6
farver==2.1.2
fastmap==1.2.0
fontawesome==0.5.3
forcats==1.0.0
foreign==0.8-90
fs==1.6.6
gargle==1.5.2
generics==0.1.4
gert==2.1.5
ggplot2==3.5.2
gh==1.5.0
gitcreds==0.1.2
glue==1.8.0
googledrive==2.1.1
googlesheets4==1.1.1
gtable==0.3.6
haven==2.5.5
highr==0.11
hms==1.1.3
htmltools==0.5.8.1
htmlwidgets==1.6.4
httpuv==1.6.16
httr==1.4.7
httr2==1.2.1
ids==1.0.1
ini=