## Power analysis

We want to run a power analysis to determine how many samples we need to reach a power of 80% at a significance level of $\alpha = 0.01$, the value used in the code below. Our difficulty lies in choosing the effect size. Most online sources we could find rely on a prior data set from which the within-group and total sums of squares are computed. The resulting effect size is a value between 0 and 1, but it is generally skewed towards lower values, where 0.15 is already considered a large effect. We cannot find a way to convert the differences in expectation and variance between the different queues into a meaningful effect size measure.

The code we use for this is in the cell below; see the [documentation](https://www.statsmodels.org/dev/_modules/statsmodels/stats/power.html#FTestAnovaPower) for the package.

In [1]:
import statsmodels.stats.power as smp
for i in range(1, 10):
    x = i * 0.1
    sample_size = smp.FTestAnovaPower().solve_power(effect_size=x, alpha=0.01, k_groups=3, power=0.8, nobs=None)
    print("Effect size: %.2f, Sample size: %.2f" % (x, sample_size))


Effect size: 0.10, Sample size: 1392.68
Effect size: 0.20, Sample size: 351.64
Effect size: 0.30, Sample size: 158.88
Effect size: 0.40, Sample size: 91.44
Effect size: 0.50, Sample size: 60.25
Effect size: 0.60, Sample size: 43.33
Effect size: 0.70, Sample size: 33.15
Effect size: 0.80, Sample size: 26.56
Effect size: 0.90, Sample size: 22.06


As the output shows, the required sample size depends strongly on the effect size, so getting a good estimate is very important. We hope you can help us find a way to estimate the effect size without generating a toy data set.
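One possible route, sketched here under stated assumptions, is to compute the effect size directly from hypothesized group means and standard deviations instead of from data. As far as we can tell from the statsmodels documentation, the `effect_size` argument of `FTestAnovaPower` is Cohen's f (the between-group standard deviation divided by the within-group standard deviation), which is related to the sums-of-squares ratio $\eta^2$ by $\eta^2 = f^2 / (1 + f^2)$. The helper `cohens_f` and the numbers below are illustrative and not part of our analysis. The sketch assumes equal group sizes and uses the average of the group variances as the within-group variance, which is only an approximation when the variances differ.

```python
from math import sqrt


def cohens_f(means, sds):
    """Cohen's f for equally sized groups, from hypothesized means and SDs.

    Assumption: the within-group variance is taken as the plain average
    of the group variances (equal group sizes).
    """
    k = len(means)
    grand_mean = sum(means) / k
    var_between = sum((m - grand_mean) ** 2 for m in means) / k
    var_within = sum(s ** 2 for s in sds) / k
    return sqrt(var_between / var_within)


# Illustrative values only (hypothetical group means and standard deviations).
means = [3.0, 5.0, 8.0]
sds = [2.0, 4.0, 3.0]

f = cohens_f(means, sds)
eta_squared = f ** 2 / (1 + f ** 2)  # share of total variance that lies between groups
print(f"Cohen's f: {f:.3f}, eta squared: {eta_squared:.3f}")
```

The resulting `f` could then be passed as `effect_size` to `solve_power` in place of the grid of values used in the cell above.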

Power:
- rho -> effect size
    - can't get analytical solutions for mmnP and mdn
- Distribution: what do they expect us to use, and do we have to justify which distributions we use?
- Analysis: is our current design valid?
- Can we choose an arbitrary rho, since we have to pick a point at which to compare the distributions of the expectation?
# Using the Central Limit Theorem to transform random means


In [2]:
import plotly.express as px 
import numpy as np
import scipy.stats as stats
import statsmodels.stats.power as smp

In [3]:
mu1, sigma1 = 3, 2
mu2, sigma2 = 5, 4
mu3, sigma3 = 8, 3

# plotting initial probability density functions
x = np.linspace(-5, 15, 1000)
y1 = stats.norm.pdf(x, mu1, sigma1)
y2 = stats.norm.pdf(x, mu2, sigma2)
y3 = stats.norm.pdf(x, mu3, sigma3)

fig = px.line(title='Unscaled Normal Distributions')
fig.add_scatter(x=x, y=y1, name='Normal Distribution 1')
fig.add_scatter(x=x, y=y2, mode='lines', name='Normal Distribution 2')
fig.add_scatter(x=x, y=y3, mode='lines', name='Normal Distribution 3')
fig.show()

We now rescale the distributions so that the first distribution has mean 0 and variance 1, and apply the same transformation to the other distributions.

In [4]:
transformed_mu1, transformed_sigma1 = 0, 1
transformed_mu2, transformed_sigma2 = (mu2-mu1)/sigma1, sigma2/sigma1
transformed_mu3, transformed_sigma3 = (mu3-mu1)/sigma1, sigma3/sigma1

x = np.linspace(-5, 15, 1000)
y1 = stats.norm.pdf(x, transformed_mu1, transformed_sigma1)
y2 = stats.norm.pdf(x, transformed_mu2, transformed_sigma2)
y3 = stats.norm.pdf(x, transformed_mu3, transformed_sigma3)

fig = px.line(title='Rescaled Normal Distributions')
fig.add_scatter(x=x, y=y1, name='Standard Normal Distribution')
fig.add_scatter(x=x, y=y2, mode='lines', name='Transformed Normal Distribution 2')
fig.add_scatter(x=x, y=y3, mode='lines', name='Transformed Normal Distribution 3')
fig.show()

Time to investigate some properties.
First, do the overlapping areas have the same probability?

In [5]:
def integrate_overlap(mu1, sigma1, mu2, sigma2, start, end, step):
    x = np.arange(start, end, step)
    y1 = stats.norm.pdf(x, mu1, sigma1)
    y2 = stats.norm.pdf(x, mu2, sigma2)
    return np.trapz(np.minimum(y1, y2), x)

baseline12 = integrate_overlap(mu1, sigma1, mu2, sigma2, -10, 10, 0.001)
baseline13 = integrate_overlap(mu1, sigma1, mu3, sigma3, -10, 10, 0.001)

transformed12 = integrate_overlap(transformed_mu1, transformed_sigma1, transformed_mu2, transformed_sigma2, -10, 10, 0.001)
transformed13 = integrate_overlap(transformed_mu1, transformed_sigma1, transformed_mu3, transformed_sigma3, -10, 10, 0.001)

print(f"Baseline Overlap 1-2: {baseline12:.2f}")
print(f"Baseline Overlap 1-3: {baseline13:.2f}")
print(f"Transformed Overlap 1-2: {transformed12:.2f}")
print(f"Transformed Overlap 1-3: {transformed13:.2f}")

Baseline Overlap 1-2: 0.61
Baseline Overlap 1-3: 0.31
Transformed Overlap 1-2: 0.61
Transformed Overlap 1-3: 0.31
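The matching values are what a change of variables predicts. As a short derivation (the symbols $f_i$, $g_i$ and $z$ are introduced here for illustration), let $f_i$ be the density of distribution $i$ and substitute $z = (x - \mu_1)/\sigma_1$, so that $dx = \sigma_1\,dz$. The density of the transformed variable is $g_i(z) = \sigma_1 f_i(\mu_1 + \sigma_1 z)$, which is the normal density with mean $(\mu_i - \mu_1)/\sigma_1$ and standard deviation $\sigma_i/\sigma_1$. Then

$$
\int \min\bigl(f_1(x), f_i(x)\bigr)\,dx
= \int \min\bigl(f_1(\mu_1 + \sigma_1 z),\, f_i(\mu_1 + \sigma_1 z)\bigr)\,\sigma_1\,dz
= \int \min\bigl(g_1(z), g_i(z)\bigr)\,dz .
$$

The overlap is therefore unchanged by any transformation that shifts and rescales all distributions in the same way.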


Now we investigate whether our power is unchanged.
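As a quick sanity check before the full analysis, the sketch below draws one sample per group, computes $\eta^2 = SS_{between} / SS_{total}$ on the raw data and again after shifting and rescaling every observation by the first group's mean and standard deviation, and compares the two. It uses only the standard library so that it stands alone. The helper `eta_squared`, the seed and the sample size of 300 are illustrative choices. If the effect size is the same, the power at a given sample size and $\alpha$ is the same as well.

```python
import random
from statistics import fmean


def eta_squared(groups):
    """Share of the total sum of squares that lies between the groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = fmean(all_values)
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    return ss_between / ss_total


rng = random.Random(0)  # fixed seed so the run is reproducible
params = [(3, 2), (5, 4), (8, 3)]  # (mean, standard deviation) per group
raw = [[rng.gauss(mu, sigma) for _ in range(300)] for mu, sigma in params]

# Apply the same shift and scale to every observation in every group.
shift, scale = params[0]
rescaled = [[(v - shift) / scale for v in g] for g in raw]

eta_raw = eta_squared(raw)
eta_rescaled = eta_squared(rescaled)
print(f"eta squared raw: {eta_raw:.4f}, rescaled: {eta_rescaled:.4f}")
```

Both sums of squares scale by the same factor, so the ratio agrees up to floating-point error.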

In [30]:
# power analysis for the distributions
def solve_power(n, alpha, effect_size):
    return smp.FTestAnovaPower().solve_power(effect_size=effect_size, alpha=alpha, k_groups=3, nobs=n)




def simulate_effect_sizes(mu1, sigma1, mu2, sigma2, mu3, sigma3):

    '''generate random samples and calculate the effect size for each sample'''
    effect_sizes = []
    for _ in range(1000):
        sample1 = np.random.normal(mu1, sigma1, 300)
        sample2 = np.random.normal(mu2, sigma2, 300)
        sample3 = np.random.normal(mu3, sigma3, 300)

        pooled = np.concatenate([sample1, sample2, sample3])  # eta squared = SS_between / SS_total
        ss_between = sum(s.size * (s.mean() - pooled.mean())**2 for s in (sample1, sample2, sample3))
        effect_sizes.append(ss_between / np.sum((pooled - pooled.mean())**2))
    return effect_sizes


def calculate_mu_sigma(lam, mu, n):
    rho = lam*n/(mu*n)
    mu = rho/(lam-mu) *n
    sigma = 1
    return rho, mu, sigma


# effect_sizes = []
# effect_sizes_deviation = []
# rhos = [[],[],[]]
# for mu in np.arange(0.9, .99, 0.01):
#     for lam in np.arange(0.9, mu, 0.01):
#         rho1, mu1, sigma1 = calculate_mu_sigma(lam, mu, 1)
#         rho2, mu2, sigma2 = calculate_mu_sigma(lam ,mu , 2)
#         rho4, mu4, sigma4 = calculate_mu_sigma(lam, mu, 4)
#         effects =simulate_effect_sizes(mu1, sigma1, mu2, sigma2, mu3, sigma3, ) 
#         effect_sizes.append(np.mean(effects))
#         rhos[0].append(rho1)
#         rhos[1].append(rho2)
#         rhos[2].append(rho4)
def calculate_effect_sizes_and_rhos():
    effect_sizes = []
    rhos = [[], [], []]
    original_mu =  

    for mu in np.arange(0.9, 0.99, 0.01):
        for lam in np.arange(0.9, mu, 0.01):
            rho1, mu1, sigma1 = calculate_mu_sigma(lam, mu, 1)
            rho2, mu2, sigma2 = calculate_mu_sigma(lam, mu, 2)
            rho4, mu4, sigma4 = calculate_mu_sigma(lam, mu, 4)
            
            # Assuming simulate_effect_sizes takes two sets of parameters
            effects = simulate_effect_sizes(mu1, sigma1, original_mu, sigma2)
            effect_sizes.append(np.mean(effects))
            
            rhos[0].append(rho1)
            rhos[1].append(rho2)
            rhos[2].append(rho4)

    return effect_sizes, rhos

# Call the function to get effect sizes and rhos
effect_sizes, rhos = calculate_effect_sizes_and_rhos()



alpha = 0.01 
n_range = np.arange(5, 50, 5)
power = np.ndarray(shape=(len(effect_sizes), len(n_range)))
for i,effect_size in enumerate(effect_sizes):
    for j,n in enumerate(n_range):
        power[i, j] = solve_power(n, alpha, effect_size)

# plotting rho against effect size for each number of servers
fig = px.line(title='Effect Size vs. Rho')
fig.add_scatter(x=rhos[0], y=effect_sizes, name='1 Server')
fig.add_scatter(x=rhos[1], y=effect_sizes, name='2 Servers')
fig.add_scatter(x=rhos[2], y=effect_sizes, name='4 Servers')

SyntaxError: invalid syntax (1073085664.py, line 46)