# A/B Testing

## Hypotheses
 
### Null Hypothesis:
Both buttons are equally good. Any difference is just random luck.

### Alternative Hypothesis: 

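The buttons perform differently: the treatment's click rate is not equal to the control's. The p-value computed later is two-sided, so the test looks for a difference in either direction, not only for the treatment being better.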

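We assume the null hypothesis (H₀) is TRUE, then check whether our data contradicts it. It works like a courtroom trial:

- H₀: The defendant is innocent.
- Hₐ: The defendant is guilty. We need STRONG evidence to reject innocence.

If the evidence is weak, the verdict is "not guilty": we fail to reject H₀. That is not the same as proving the buttons are equally good.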
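As a compact restatement, the two hypotheses can be written symbolically. The symbols $p_C$ and $p_T$ (the true click rates of control and treatment) are introduced here for illustration and do not appear elsewhere in these notes. The two-sided form is an assumption based on the two-sided p-value computed later.

```latex
H_0:\; p_T = p_C
\qquad\text{versus}\qquad
H_a:\; p_T \neq p_C
```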



## Proportions Test (The Foundation)

The problem:

Amazon wants to test a new "Buy Now" button color. 
Control (Blue): 
- 1000 users saw it 
- 120 clicked

Treatment (Orange): 
- 1000 users saw it 
- 140 clicked 

##### Why Pool?

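Under H₀ there is no difference, so the blue and orange groups are both measuring the same underlying click rate. Pooling combines them into one larger sample, which gives a better estimate of that rate: pooled proportion = total clicks / total users.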
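A small sketch of what pooling does when the groups are NOT the same size. The numbers are hypothetical and chosen only for illustration (they are not the button data). The point is that the pooled proportion is an average of the two rates weighted by group size, which can differ from the plain average of the rates.

```python
# Hypothetical, deliberately unbalanced groups (not the button data).
clicks_a, users_a = 30, 200      # 15.0% click rate
clicks_b, users_b = 96, 800      # 12.0% click rate

rate_a = clicks_a / users_a
rate_b = clicks_b / users_b

# Pooled proportion: all clicks over all users.
p_pooled = (clicks_a + clicks_b) / (users_a + users_b)

# The same quantity, written as an average of the rates weighted by group size.
p_weighted = (users_a * rate_a + users_b * rate_b) / (users_a + users_b)

# Plain average of the two rates, which ignores group size.
p_simple = (rate_a + rate_b) / 2

print(f"Pooled:        {p_pooled:.4f}")
print(f"Weighted mean: {p_weighted:.4f}")
print(f"Simple mean:   {p_simple:.4f}")
```

With equal group sizes, as in the button test, the pooled proportion and the simple mean coincide, which is why 13% sits exactly halfway between 12% and 14%.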

In [2]:
import numpy as np

#Define the data 
clicks_control = 120
users_control = 1000

clicks_treatment = 140
users_treatment = 1000

#Calculate individual proportions 
p_control = clicks_control/users_control
p_treatment = clicks_treatment/users_treatment

print("=" * 60)
print("STEP 2: POOLED PROPORTION")
print("=" * 60)


print(f"\nControl (Blue) click rate: {p_control:.4f} ({p_control*100:.2f}%)")
print(f"Treatment (Orange) click rate: {p_treatment:.4f} ({p_treatment*100:.2f}%)")
print(f"Observed difference: {(p_treatment - p_control):.4f} ({(p_treatment - p_control)*100:.2f}%)")


#Calculate pooled proportion
total_clicks = clicks_control + clicks_treatment
total_users = users_control + users_treatment

p_pooled = total_clicks/total_users

print(f"\n--- Pooled Proportion Calculation ---")
print(f"Total clicks: {total_clicks}")
print(f"Total users: {total_users}")
print(f"Pooled proportion: {p_pooled:.4f} ({p_pooled*100:.2f}%)")

print(f"\nInterpretation:")
print(f"If H₀ is true (no difference between buttons), our best estimate")
print(f"of the TRUE click rate is {p_pooled*100:.2f}%")


STEP 2: POOLED PROPORTION

Control (Blue) click rate: 0.1200 (12.00%)
Treatment (Orange) click rate: 0.1400 (14.00%)
Observed difference: 0.0200 (2.00%)

--- Pooled Proportion Calculation ---
Total clicks: 260
Total users: 2000
Pooled proportion: 0.1300 (13.00%)

Interpretation:
If H₀ is true (no difference between buttons), our best estimate
of the TRUE click rate is 13.00%


### Standard Error

Question: If the true rate is 13%, how much would we expect two samples of 1000 to differ just by chance?

Answer: The Standard Error (SE) tells us the "typical variation" in the difference between the two groups' click rates.

Formula: SE = sqrt(p*(1-p)*(1/n1 + 1/n2))

p = 0.13
n1 = n2 = 1000

SE = sqrt(0.13*0.87*0.002)
SE = sqrt(0.0002262)
SE ≈ 0.015 = 1.5 percentage points

##### What does this mean?
If there's NO real difference, the two groups' click rates would typically differ by about 1.5 percentage points just by chance.
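One way to sanity-check the formula is to simulate many A/B tests in which both groups truly share the same click rate, then look at how much the observed rates differ. This is only a sketch: the seed, the number of simulated tests, and the variable names are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # fixed seed so reruns give the same numbers

true_rate = 0.13      # assumed common click rate under H0 (the pooled estimate)
n_users = 1000        # users per group, as in the button test
n_sims = 20_000       # number of simulated A/B tests (arbitrary choice)

# Each simulated test draws click counts for two groups with the SAME true rate.
clicks_a = rng.binomial(n_users, true_rate, size=n_sims)
clicks_b = rng.binomial(n_users, true_rate, size=n_sims)
diffs = (clicks_b - clicks_a) / n_users

se_formula = np.sqrt(true_rate * (1 - true_rate) * (1 / n_users + 1 / n_users))
se_simulated = diffs.std()

# How often does pure chance produce a gap of at least 2 points (20 clicks)?
share_extreme = np.mean(np.abs(clicks_b - clicks_a) >= 20)

print(f"SE from formula:    {se_formula:.4f}")
print(f"SE from simulation: {se_simulated:.4f}")
print(f"Share of simulated tests with a gap of 2+ points: {share_extreme:.3f}")
```

The standard deviation of the simulated differences should land close to the formula's value, and the last line gives a feel for how often a 2-point gap shows up when nothing real is going on.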

In [3]:
import numpy as np

# Define the data (same as before)
clicks_control = 120      # Blue button clicks
users_control = 1000      # Blue button users

clicks_treatment = 140    # Orange button clicks  
users_treatment = 1000    # Orange button users

# Calculate individual proportions
p_control = clicks_control / users_control
p_treatment = clicks_treatment / users_treatment

# Calculate pooled proportion (from Step 2)
total_clicks = clicks_control + clicks_treatment
total_users = users_control + users_treatment
p_pooled = total_clicks / total_users

print("=" * 60)
print("STEP 3: STANDARD ERROR")
print("=" * 60)

print(f"\nFrom previous steps:")
print(f"Control click rate: {p_control*100:.2f}%")
print(f"Treatment click rate: {p_treatment*100:.2f}%")
print(f"Pooled proportion: {p_pooled*100:.2f}%")
print(f"Observed difference: {(p_treatment - p_control)*100:.2f}%")

# Calculate Standard Error
# Formula: SE = sqrt(p_pooled * (1 - p_pooled) * (1/n1 + 1/n2))

print(f"\n--- Standard Error Calculation ---")

# Break it down step by step
variance_part = p_pooled * (1 - p_pooled)
print(f"Step 1 - Variance part: {p_pooled:.4f} * (1 - {p_pooled:.4f}) = {variance_part:.6f}")

sample_size_part = (1/users_control) + (1/users_treatment)
print(f"Step 2 - Sample size part: 1/{users_control} + 1/{users_treatment} = {sample_size_part:.6f}")

inside_sqrt = variance_part * sample_size_part
print(f"Step 3 - Multiply them: {variance_part:.6f} * {sample_size_part:.6f} = {inside_sqrt:.8f}")

standard_error = np.sqrt(inside_sqrt)
print(f"Step 4 - Take square root: sqrt({inside_sqrt:.8f}) = {standard_error:.6f}")

print(f"\nSTANDARD ERROR: {standard_error:.6f} ({standard_error*100:.4f}%)")

print(f"\nInterpretation:")
print(f"If there's NO real difference between buttons, we'd typically expect")
print(f"to see about {standard_error*100:.2f}% variation between two groups just by random chance.")
print(f"\nOur observed difference is {(p_treatment - p_control)*100:.2f}%")
print(f"Compared to typical random variation of {standard_error*100:.2f}%")

STEP 3: STANDARD ERROR

From previous steps:
Control click rate: 12.00%
Treatment click rate: 14.00%
Pooled proportion: 13.00%
Observed difference: 2.00%

--- Standard Error Calculation ---
Step 1 - Variance part: 0.1300 * (1 - 0.1300) = 0.113100
Step 2 - Sample size part: 1/1000 + 1/1000 = 0.002000
Step 3 - Multiply them: 0.113100 * 0.002000 = 0.00022620
Step 4 - Take square root: sqrt(0.00022620) = 0.015040

STANDARD ERROR: 0.015040 (1.5040%)

Interpretation:
If there's NO real difference between buttons, we'd typically expect
to see about 1.50% variation between two groups just by random chance.

Our observed difference is 2.00%
Compared to typical random variation of 1.50%


#### Z-Score
##### The question: How unusual is our observed 2% difference compared to typical random variation?

##### Answer:

The z-score measures how many standard errors the observed difference lies from zero (the difference expected under H₀):

Observed difference = 14% - 12% = 2 percentage points

z = observed difference / SE = 2% / 1.504% ≈ 1.33

What does this mean? Our difference is 1.33 standard errors away from zero.

Intuition: 

- z = 0 : No difference at all 
- z = 1 : Small difference (within normal random variation)
- z = 2 : Noticeable difference (getting suspicious)
- z = 3+ : Strong difference 
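The steps so far (pooled proportion, standard error, z-score) can be folded into one helper. This is a minimal sketch, and the function name `two_proportion_z` is made up for illustration. It is applied to the button data from above and to the larger data set (400 of 10,000 vs. 500 of 10,000) used in the cell that follows.

```python
import math

def two_proportion_z(clicks_c, users_c, clicks_t, users_t):
    """Pooled two-proportion z-score for (treatment rate - control rate)."""
    p_c = clicks_c / users_c
    p_t = clicks_t / users_t
    p_pooled = (clicks_c + clicks_t) / (users_c + users_t)
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / users_c + 1 / users_t))
    return (p_t - p_c) / se

z_small = two_proportion_z(120, 1000, 140, 1000)
z_large = two_proportion_z(400, 10_000, 500, 10_000)

print(f"1,000 users per group:  z = {z_small:.2f}")   # 1.33
print(f"10,000 users per group: z = {z_large:.2f}")   # 3.41
```

The larger test has a smaller absolute difference (1 point instead of 2) but a much larger z-score, because the standard error shrinks as the sample size grows.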

In [5]:
import numpy as np

# Define the data (a NEW, larger test than the button example above:
# 10,000 users per group, 4% vs 5% click rate)
clicks_control = 400       # Control clicks
users_control = 10000      # Control users

clicks_treatment = 500     # Treatment clicks
users_treatment = 10000    # Treatment users

# Calculate individual proportions
p_control = clicks_control / users_control
p_treatment = clicks_treatment / users_treatment

# Calculate pooled proportion (from Step 2)
total_clicks = clicks_control + clicks_treatment
total_users = users_control + users_treatment
p_pooled = total_clicks / total_users

# Calculate standard error (from Step 3)
variance_part = p_pooled * (1 - p_pooled)
sample_size_part = (1/users_control) + (1/users_treatment)
standard_error = np.sqrt(variance_part * sample_size_part)

print("=" * 60)
print("STEP 4: Z-SCORE")
print("=" * 60)

print(f"\nFrom previous steps:")
print(f"Control click rate: {p_control*100:.2f}%")
print(f"Treatment click rate: {p_treatment*100:.2f}%")
print(f"Observed difference: {(p_treatment - p_control)*100:.2f}%")
print(f"Standard Error: {standard_error*100:.2f}%")

# Calculate Z-Score
# Formula: z = (observed_difference - expected_difference) / standard_error
# Under H₀, expected_difference = 0

print(f"\n--- Z-Score Calculation ---")

observed_difference = p_treatment - p_control
expected_difference_under_H0 = 0  # Under null hypothesis, difference should be 0

print(f"Observed difference: {observed_difference:.6f} ({observed_difference*100:.2f}%)")
print(f"Expected difference under H₀: {expected_difference_under_H0}")
print(f"Standard Error: {standard_error:.6f}")

z_score = (observed_difference - expected_difference_under_H0) / standard_error

print(f"\nZ-score = (observed - expected) / SE")
print(f"Z-score = ({observed_difference:.6f} - {expected_difference_under_H0}) / {standard_error:.6f}")
print(f"Z-score = {observed_difference:.6f} / {standard_error:.6f}")
print(f"\n Z-SCORE: {z_score:.4f}")

print(f"\n--- Interpretation ---")
print(f"The observed difference ({observed_difference*100:.2f}%) is {z_score:.2f} standard errors away from zero.")

# Visual interpretation
if abs(z_score) < 1:
    interpretation = "SMALL - Within normal random variation"
    decision = "Not unusual at all"
elif abs(z_score) < 2:
    interpretation = "MODERATE - Somewhat unusual, but could still be chance"
    decision = "Not quite enough evidence"
elif abs(z_score) < 3:
    interpretation = "LARGE - Pretty unusual, likely not just chance"
    decision = "Strong evidence"
else:
    interpretation = "VERY LARGE - Extremely unusual, almost certainly not chance"
    decision = "Very strong evidence"

print(f"\nZ-score magnitude: {interpretation}")
print(f"Decision: {decision}")

# Reference guide
print(f"\n--- Z-Score Reference Guide ---")
print(f"│z│ < 1.0  → Within typical random variation")
print(f"│z│ ≈ 1.5  → Somewhat unusual")
print(f"│z│ ≈ 1.96 → At the edge (corresponds to p=0.05)")
print(f"│z│ ≈ 2.58 → Strong evidence (corresponds to p=0.01)")
print(f"│z│ > 3.0  → Very strong evidence")
print(f"\nOur z-score: {z_score:.4f}")

STEP 4: Z-SCORE

From previous steps:
Control click rate: 4.00%
Treatment click rate: 5.00%
Observed difference: 1.00%
Standard Error: 0.29%

--- Z-Score Calculation ---
Observed difference: 0.010000 (1.00%)
Expected difference under H₀: 0
Standard Error: 0.002932

Z-score = (observed - expected) / SE
Z-score = (0.010000 - 0) / 0.002932
Z-score = 0.010000 / 0.002932

 Z-SCORE: 3.4110

--- Interpretation ---
The observed difference (1.00%) is 3.41 standard errors away from zero.

Z-score magnitude: VERY LARGE - Extremely unusual, almost certainly not chance
Decision: Very strong evidence

--- Z-Score Reference Guide ---
│z│ < 1.0  → Within typical random variation
│z│ ≈ 1.5  → Somewhat unusual
│z│ ≈ 1.96 → At the edge (corresponds to p=0.05)
│z│ ≈ 2.58 → Strong evidence (corresponds to p=0.01)
│z│ > 3.0  → Very strong evidence

Our z-score: 3.4110


#### P-Value and Statistical Significance

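P-value: If H₀ were true (no real difference), what's the probability of seeing a result at least this extreme just by random chance?

For the larger test (400 of 10,000 vs. 500 of 10,000), we got a z-score of 3.41.

The code below labels this data an email test comparing two subject lines. If Subject A and Subject B were equally good, what's the chance that we'd see a difference of 1 percentage point (or bigger, in either direction) just by luck?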
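For a quick cross-check that does not depend on SciPy, the two-sided tail probability of a standard normal can be computed with the standard library's `math.erfc`. This is a sketch under that assumption. The helper name `two_sided_p_value` is hypothetical, and the z-scores are the rounded values quoted in the text.

```python
import math

def two_sided_p_value(z):
    """Two-sided tail probability of a standard normal at |z|."""
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

for label, z in [("button test, z = 1.33", 1.33), ("larger test, z = 3.41", 3.41)]:
    p = two_sided_p_value(z)
    verdict = "below" if p < 0.05 else "not below"
    print(f"{label}: p = {p:.4f} ({verdict} 0.05)")
```

Run on both examples, this shows the contrast between them: the button test's z of 1.33 gives a p-value well above 0.05, while the larger test's z of 3.41 gives one far below it.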

In [8]:
import numpy as np 
from scipy import stats 

#Email test data 
clicks_control = 400
users_control = 10000

clicks_treatment = 500
users_treatment = 10000

p_control = clicks_control/users_control
p_treatment = clicks_treatment/users_treatment 

total_clicks = clicks_control + clicks_treatment
total_users = users_control + users_treatment 

p_pooled = total_clicks/total_users

variance_part = p_pooled * (1-p_pooled)
sample_size_part = (1/users_control) + (1/users_treatment)
standard_error = np.sqrt(variance_part * sample_size_part)

observed_difference = p_treatment - p_control
z_score = observed_difference / standard_error

print("=" * 60)
print("STEP 5: P-VALUE CALCULATION")
print("=" * 60)

print(f"\nFrom previous steps:")
print(f"Control open rate: {p_control*100:.2f}%")
print(f"Treatment open rate: {p_treatment*100:.2f}%")
print(f"Observed difference: {observed_difference*100:.2f}%")
print(f"Z-score: {z_score:.4f}")

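# Two-sided p-value: probability of |Z| >= |z_score| under a standard normal.
# norm.sf(x) equals 1 - norm.cdf(x) but stays accurate for large |z|.
p_value = 2 * stats.norm.sf(abs(z_score))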

print(f"\n--- P-Value Calculation ---")
print(f"Z-score: {z_score:.4f}")
print(f"P-VALUE: {p_value:.6f} ({p_value*100:.4f}%)")


STEP 5: P-VALUE CALCULATION

From previous steps:
Control open rate: 4.00%
Treatment open rate: 5.00%
Observed difference: 1.00%
Z-score: 3.4110

--- P-Value Calculation ---
Z-score: 3.4110
P-VALUE: 0.000647 (0.0647%)


In [11]:
print(f"\n--- Interpretation ---")
print(f"If H₀ were true (no difference between subjects),")
print(f"there's only a {p_value*100:.4f}% chance of seeing")
print(f"a difference of {abs(observed_difference)*100:.2f}% or larger by random chance.")

# Decision making
alpha = 0.05  # Significance level

print(f"\n--- Decision Making ---")
print(f"Significance level (α): {alpha}")
print(f"P-value: {p_value:.6f}")

if p_value < alpha:
    print(f"\n REJECT H₀")
    print(f"P-value ({p_value:.6f}) < α ({alpha})")
    print(f"Conclusion: There IS a statistically significant difference.")
    print(f"Decision: Subject B is better. LAUNCH IT!")
else:
    print(f"\n FAIL TO REJECT H₀")
    print(f"P-value ({p_value:.6f}) ≥ α ({alpha})")
    print(f"Conclusion: Not enough evidence of a difference.")
    print(f"Decision: Keep testing or abandon.")

# Caution on interpretation: 1 - p_value is NOT the probability that the
# difference is real, so we do not report it as a "confidence" figure.
print(f"\n--- What the P-Value Means ---")
print(f"A p-value of {p_value:.6f} means data this extreme would be rare if H₀ were true.")
print(f"It is not the probability that H₀ is true or false.")

# Visual guide
print(f"\n--- P-Value Interpretation Guide ---")
print(f"p < 0.001  → Very strong evidence")
print(f"p < 0.01   → Strong evidence")
print(f"p < 0.05   → Moderate evidence (standard threshold)")
print(f"p < 0.10   → Weak evidence")
print(f"p ≥ 0.10   → Little to no evidence")
print(f"\nOur p-value: {p_value:.6f} → Very strong evidence!")


--- Interpretation ---
If H₀ were true (no difference between subjects),
there's only a 0.0647% chance of seeing
a difference of 1.00% or larger by random chance.

--- Decision Making ---
Significance level (α): 0.05
P-value: 0.000647

 REJECT H₀
P-value (0.000647) < α (0.05)
Conclusion: There IS a statistically significant difference.
Decision: Subject B is better. LAUNCH IT!

--- What the P-Value Means ---
A p-value of 0.000647 means data this extreme would be rare if H₀ were true.
It is not the probability that H₀ is true or false.

--- P-Value Interpretation Guide ---
p < 0.001  → Very strong evidence
p < 0.01   → Strong evidence
p < 0.05   → Moderate evidence (standard threshold)
p < 0.10   → Weak evidence
p ≥ 0.10   → Little to no evidence

Our p-value: 0.000647 → Very strong evidence!
