## Power Analysis

Power analysis depends on four variables:

1. Effect size
2. Significance level
3. Power
4. Sample size

Power analysis consists of calculating any one of them when the other three are known.
For example, it tells us the sample size needed to detect an effect of a given size, with a given power, at a given significance level.
Conversely, we can use effect size, significance level, and sample size to calculate power, which is the probability of a true positive: detecting an effect when it is in fact present.
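To make the "solve for one given the other three" idea concrete, here is a minimal sketch using the textbook normal approximation for a two-sided, two-sample test with equal group sizes. It is only an approximation (the exact t-test calculation gives slightly larger sample sizes), and the function names `approx_n_per_group` and `approx_power` are made up for this illustration.

```python
import math
from statistics import NormalDist

_z = NormalDist().inv_cdf  # standard normal quantile function


def approx_n_per_group(effect_size, alpha=0.05, power=0.95):
    """Approximate per-group sample size (normal approximation, equal groups)."""
    n = 2 * ((_z(1 - alpha / 2) + _z(power)) / effect_size) ** 2
    return math.ceil(n)


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power for the same test, ignoring the far rejection tail."""
    shift = effect_size * math.sqrt(n_per_group / 2)
    return NormalDist().cdf(shift - _z(1 - alpha / 2))


# Example: a 5-unit difference in means with a standard deviation of 3
d = 5 / 3
n = approx_n_per_group(d)           # solve for sample size
print(n, round(approx_power(d, n), 3))  # then solve back for power
```

The same three inputs go in each time; only the unknown changes.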

In [1]:
from sample_size import TTestInd_sampleSize 
import numpy as np

In this example, I will generate some fake data to be used for an independent-samples t-test. I will then run my sampleSize function (based on statsmodels) on it to determine the sample size needed to detect the effect, when it is present, with a given probability (power). Then, I will simulate a bunch of t-tests on data sampled from the same distributions with that sample size, and see how frequently we reject the null.

In [2]:
# Fake data being drawn for two groups from two different gaussian distributions
group1 = np.random.normal(10, 3, 50)
group2 = np.random.normal(15, 3, 50)
print(f'Group 1:\n {group1}')
print(f'Group 2:\n {group2}')

Group 1:
 [12.11691372 10.36894347 13.43720668 11.8866219  10.82192313  7.12392969
  2.50964502 15.63633562  7.54383709 10.49838412  8.83401507  6.29253542
  9.86078013 11.84799924 10.05574408  5.64204932 17.64512766  6.59261195
 10.38271681  8.50807759 12.632553    6.9306526   6.65450317 11.07522337
 14.86591563  8.59463654  5.18102537  8.57500053 17.3062279  10.48180088
  9.59907445 10.25349628  9.73901364  6.90517474 10.22368715  9.69305301
 12.09257413  8.60951899 10.32700848  8.23204587  5.33635271 11.16373777
  9.14016431 11.28165625 12.2076035   7.7654185  11.00169739  8.15101644
  8.90852985  8.55263671]
Group 2:
 [14.69705427 22.09243121 22.24248907 11.39464979 15.83295736  9.49490043
 17.22659644 15.84380874 11.79632582 10.53947189 16.01018034 11.13548514
  9.50383527 22.21537013 11.48370646 14.24049975 15.67871062 15.00783001
  9.70397104 12.15835552 14.25643688 14.7708116  14.25792219 13.72236548
 21.58372642 18.19333862 12.614468   16.78158282 13.11331245 13.45719532
 18.5

In [3]:
# Calculating sample size for t-test
sample_size = TTestInd_sampleSize(
    group1, 
    group2,
    power=0.95,
    alpha=0.05
)
print(f'Desired sample size for each group is: {sample_size}')

Desired sample size for each group is: 10
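The source of `TTestInd_sampleSize` is not shown in this notebook. As a rough sketch of how such a helper could work, assuming it estimates Cohen's d from the two samples (pooled standard deviation) and hands it to statsmodels' `TTestIndPower.solve_power`, something like the following would do. The names `cohens_d` and `sketch_sample_size` are hypothetical, and the real helper may round or estimate the effect differently.

```python
import math

import numpy as np
from statsmodels.stats.power import TTestIndPower


def cohens_d(a, b):
    """Absolute standardized mean difference using the pooled sample SD."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    return abs(a.mean() - b.mean()) / math.sqrt(pooled_var)


def sketch_sample_size(a, b, power=0.95, alpha=0.05):
    """Per-group n for a two-sided independent t-test (assumed approach)."""
    n = TTestIndPower().solve_power(
        effect_size=cohens_d(a, b),
        nobs1=None,            # the unknown we are solving for
        alpha=alpha,
        power=power,
        ratio=1.0,             # equal group sizes
        alternative="two-sided",
    )
    # solve_power returns a real number; round up to whole observations
    return math.ceil(float(np.squeeze(n)))
```

Because the effect size is estimated from the pilot data, the result will shift a little from run to run along with the random samples.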


In [4]:
# Little utility function that I wrote to generate fake data on demand.
def gen_samples():
    # Reads the module-level sample_size computed above
    # (no `global` statement is needed for read-only access).
    sample1 = np.random.normal(10, 3, sample_size)
    sample2 = np.random.normal(15, 3, sample_size)
    return sample1, sample2

In [5]:
from statsmodels.stats.weightstats import ttest_ind

# simulating 1000 t-tests
simulated_tests = [ttest_ind(*gen_samples()) for _ in range(1000)]
# pulling p-values out of these t-tests
pvalues = [res[1] for res in simulated_tests]
# counting the number of times we were able to detect an effect (p < alpha)
numreject = sum(p < 0.05 for p in pvalues)
# inferring number of times that we failed to detect an effect
numfail = len(pvalues) - numreject
print(f'Rejected: {numreject}\nFailed: {numfail}')

Rejected: 951
Failed: 49


__NOTE: because the data are randomly generated, the numbers you get when you rerun this analysis will not match these exactly, but they should be close.__
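How close is "close"? If each simulated test rejects with probability equal to the power, the rejection count is binomial, so a rough range can be sketched with the normal approximation to the binomial. The helper name `rejection_rate_interval` is made up for this illustration.

```python
import math


def rejection_rate_interval(power, n_sims, z=1.96):
    """Approximate 95% range for the observed rejection rate over n_sims tests."""
    se = math.sqrt(power * (1 - power) / n_sims)  # binomial standard error
    return power - z * se, power + z * se


low, high = rejection_rate_interval(0.95, 1000)
print(f"Expect roughly {low * 1000:.0f} to {high * 1000:.0f} rejections out of 1000")
```

With a power of 0.95 and 1000 simulations, that is roughly 936 to 964 rejections, so a count such as 951 sits comfortably inside the expected range.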

So, with our suggested sample size for each group, we:

    Successfully rejected: 95.1% of the time
    Failed to reject: 4.9% of the time

This seems in line with the power that we set in the sample size calculation, given the following:

1. Power represents the probability of a true positive
2. Every simulated test is a true-positive case, because we explicitly sampled the two groups from two different distributions
3. We set our power to 0.95
4. 95.1% of the true positives were successfully detected

So, with 1000 t-tests run on data sampled from two different distributions, we detected the effect approximately 95% of the time, which is what we set ourselves up for when setting the power to 0.95. Therefore, it seems that this method of determining sample sizes for statistical tests works as intended.