#### Statistical Significance 

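This code calculates the statistical significance of the number of improved models under various experimental contexts. It applies a one-sided sign test: ties are excluded, and the count of "better" outcomes among the remaining pairs is compared against the null hypothesis p = 0.5 using a normal approximation with continuity correction.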

In [2]:
import math
from scipy.stats import norm

# Given values from the figure and text
total_pairs = 20000
ties = 12739
n_better = 4488
n_worse = 2773

# Check arithmetic consistency
assert n_better + n_worse + ties == total_pairs, "Sum of better, worse, and ties does not match total pairs."

# Effective sample size (excluding ties)
N_prime = n_better + n_worse

# Under H0, p=0.5
p = 0.5
mu = N_prime * p
sigma = math.sqrt(N_prime * p * (1 - p))

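# One-sided test of P(X >= n_better) with continuity correction:
# subtract 0.5 from the observed count before standardizing.
Z = (n_better - 0.5 - mu) / sigma

# One-tailed p-value from the Z-score.
# Note: for a Z this large, norm.cdf(Z) rounds to 1.0 in double precision,
# so this difference evaluates to exactly 0.0 rather than the true tail probability.
p_value = 1 - norm.cdf(Z)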

print("Effective sample size (N'):", N_prime)
print("Observed 'better' count:", n_better)
print("Observed 'worse' count:", n_worse)
print("Mean (mu):", mu)
print("Std dev (sigma):", sigma)
print("Z-score:", Z)
print("p-value:", p_value)

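# Confirm that the result is statistically significant.
# For a Z-score this large, the p-value should be extremely small.
# Caveat: p_value computed as 1 - norm.cdf(Z) is exactly 0.0 here, so the
# threshold below is met through floating-point underflow, not through an
# exact evaluation of the tail probability.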
if p_value < 1e-100:
    print("The p-value is effectively zero, confirming extremely strong statistical significance.")
else:
    print("The p-value is not as small as expected. Check calculations.")

Effective sample size (N'): 7261
Observed 'better' count: 4488
Observed 'worse' count: 2773
Mean (mu): 3630.5
Std dev (sigma): 42.60575078554537
Z-score: 20.114655514784403
p-value: 0.0
The p-value is effectively zero, confirming extremely strong statistical significance.
The computed Z-score does not match the reported value closely. Check calculations.
The computed p-value confirms the previously reported extremely small p-value.
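
The recorded p-value of 0.0 is a floating-point artifact: `1 - norm.cdf(Z)` loses the tail once `norm.cdf(Z)` rounds to 1.0. A minimal sketch of a tail-stable alternative, assuming the same counts as the cell above, uses SciPy's survival function `norm.sf` and its logarithm `norm.logsf`. The names `p_value_naive`, `p_value_sf`, and `log10_p` are illustrative and do not appear in the original cell.

```python
import math
from scipy.stats import norm

# Counts copied from the cell above
n_better = 4488
n_worse = 2773

# Same sign-test setup: ties excluded, p = 0.5 under H0
N_prime = n_better + n_worse
mu = 0.5 * N_prime
sigma = math.sqrt(N_prime * 0.25)
Z = (n_better - 0.5 - mu) / sigma

# Original formulation: cancels to 0.0 when cdf(Z) rounds to 1.0
p_value_naive = 1 - norm.cdf(Z)

# Survival function evaluates the upper tail directly
p_value_sf = norm.sf(Z)

# Log-space version stays finite even when sf itself would underflow
log10_p = norm.logsf(Z) / math.log(10)

print("1 - cdf(Z):", p_value_naive)
print("sf(Z):", p_value_sf)
print("log10 p-value:", log10_p)
```

With these inputs the survival-function value should be tiny but nonzero; a hand estimate from the Gaussian tail asymptotic puts it near 1e-90. If the cell were switched to `norm.sf`, a hard-coded threshold such as `1e-100` would therefore need revisiting.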

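
The normal approximation can also be cross-checked against the exact binomial tail it approximates. The sketch below assumes the same sign-test setup (ties excluded, p = 0.5 under the null) and uses `scipy.stats.binom.sf`. It is an illustrative check rather than part of the original analysis, and `p_exact` is a hypothetical name.

```python
import math
from scipy.stats import binom

# Counts copied from the cell above
n_better = 4488
n_worse = 2773
N_prime = n_better + n_worse

# Exact one-sided tail for X ~ Binomial(N_prime, 0.5):
# P(X >= n_better) = P(X > n_better - 1) = sf(n_better - 1)
p_exact = binom.sf(n_better - 1, N_prime, 0.5)

print("Exact binomial p-value:", p_exact)
if p_exact > 0:
    # Report the order of magnitude only when the tail is representable
    print("log10 of exact p-value:", math.log10(p_exact))
```

Order-of-magnitude agreement between the exact tail and the normal approximation would support the conclusion above. At tails this extreme, the two can still differ noticeably in relative terms.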