In [56]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, chi2
import pandas as pd

# Question 1

From https://www.stat.ubc.ca/~rollin/stats/ssize/b2.html, we get

* **10% vs 15%**: $n=686$
* **25% vs 30%**: $n=1251$
* **40% vs 45%**: $n=1534$
* **45% vs 50%**: $n=1565$
* **55% vs 60%**: $n=1534$

where $n$ is the sample size per group. Let $p=(p_1+p_2)/2$ in each case. The observed pattern here is that as $p(1-p)$ increases/decreases, so too does $n$.

In the limit of large $N$, a binomial distribution converges to a Gaussian distribution with mean $Np$ and standard deviation $\sqrt{Np(1-p)}$. The estimator $\hat{p}$ (the observed proportion in a binomial sample) thus has mean $p$ and standard error $\sqrt{p(1-p)/N}$. When comparing two populations, the power depends on the ratio of the difference $p_1-p_2$ to the standard error of that difference, which is approximately $\sqrt{2p(1-p)/N}$ when the groups have equal size and similar proportions. To keep the power constant for a fixed difference, this standard error must stay constant: thus, as $p(1-p)$ increases, so too must $N$. This explains why the sample size varies even though the difference we are trying to detect remains constant: the standard errors themselves depend on the magnitude of $p$.
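The pattern can be checked numerically. The sketch below uses a common normal-approximation formula for comparing two proportions. The settings $\alpha=0.05$ (two-sided) and 80% power are assumptions, since the calculator's settings are not recorded above, and `n_per_group` is a hypothetical helper name.

```python
from math import ceil, sqrt
from scipy.stats import norm

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided test of two proportions."""
    z_a = norm.ppf(1 - alpha / 2)   # critical value for the two-sided test
    z_b = norm.ppf(power)           # quantile matching the target power (assumed 80%)
    p_bar = (p1 + p2) / 2           # average proportion, as defined in the text
    num = z_a * sqrt(2 * p_bar * (1 - p_bar)) + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return ceil(num**2 / (p1 - p2) ** 2)

pairs = [(0.10, 0.15), (0.25, 0.30), (0.40, 0.45), (0.45, 0.50), (0.55, 0.60)]
sizes = [n_per_group(p1, p2) for p1, p2 in pairs]
for (p1, p2), n in zip(pairs, sizes):
    p_bar = (p1 + p2) / 2
    print(f'{p1:.2f} vs {p2:.2f}: p(1-p) = {p_bar * (1 - p_bar):.4f}, n = {n}')
```

Under these assumptions, $n$ rises and falls together with $p(1-p)$, which peaks at $p=0.5$.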

# Question 2

From https://www.stat.ubc.ca/~rollin/stats/ssize/n2.html

* **Group 1**: $n=142$
* **Group 2**: $n=142$

## Part A

The two groups have the same required value of $n$: this follows from the fact that at $\alpha=0.05$ (two-sided), under the normal approximation, we have

$$\text{Power} = \Phi\left(\frac{\Delta}{\sigma} \sqrt{\frac{N}{2}} - 1.96\right)$$

where $\Phi$ is the standard normal CDF, $N$ is the sample size per group, $\Delta$ is the difference between the means of the two populations, and $\sigma$ is the standard deviation of each population. In each case:

* **Group 1:** $\Delta/\sigma = (54-44)/30 = 1/3$
* **Group 2:** $\Delta/\sigma = (21-18)/9 = 1/3$

Since $\Delta/\sigma$ is the same for both groups, the same value of $N$ gives the same power.
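As a quick numerical check of this argument, the sketch below evaluates the normal-approximation power with the standard normal CDF (`norm.cdf`) and inverts it for the per-group sample size. The 80% target power is an assumption (it is not stated above), and the helper names are hypothetical.

```python
from math import ceil, sqrt
from scipy.stats import norm

def power_two_means(delta, sigma, n, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test with n per group."""
    z_a = norm.ppf(1 - alpha / 2)
    return norm.cdf(delta / sigma * sqrt(n / 2) - z_a)

def n_two_means(delta, sigma, alpha=0.05, power=0.80):
    """Smallest n per group that reaches the target power (normal approximation)."""
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    return ceil(2 * ((z_a + z_b) * sigma / delta) ** 2)

# (difference in means, standard deviation) for each group, from the text
groups = {'Group 1': (54 - 44, 30), 'Group 2': (21 - 18, 9)}
results = {}
for name, (delta, sigma) in groups.items():
    n = n_two_means(delta, sigma)
    results[name] = (n, power_two_means(delta, sigma, n))
    print(f'{name}: n = {n}, power = {results[name][1]:.3f}')
```

Only the ratio $\Delta/\sigma$ enters either function, so the two groups necessarily produce identical results.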

## Part B

Assuming populations are normally distributed, the power at $\alpha=0.05$ (two-sided) is approximately given by

$$\text{Power} = \Phi\left(\frac{\Delta}{\sigma} \sqrt{\frac{N}{2}} - 1.96\right)$$

where $\Phi$ is the standard normal CDF and $N$ is the sample size per group.

If one wishes to decrease $N$ but maintain constant power, there is only one option: increase $\Delta/\sigma$. This can be done in two ways:

1. Increase $\Delta$. In other words, assume (a priori) that the effect you expect to see is larger than you otherwise would have believed. If the true effect is smaller than assumed, the study will be underpowered, which makes the chances of rejecting the null hypothesis slimmer (less likely to see statistically significant results).

2. Decrease $\sigma$. In other words, assume (a priori) that the populations you are dealing with are fairly homogeneous (i.e. exhibit little variance). Once again, if the true variance is larger than assumed, the chances of rejecting the null hypothesis become slimmer (less likely to see statistically significant results).
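The trade-off in both options can be made concrete with a small sketch. The planning value $\Delta/\sigma = 0.5$ and the 80% target power are hypothetical choices for illustration; $\Delta/\sigma = 1/3$ is the value from Part A, treated here as the truth.

```python
from math import ceil, sqrt
from scipy.stats import norm

z_a = norm.ppf(0.975)      # two-sided alpha = 0.05
z_b = norm.ppf(0.80)       # assumed 80% target power

d_assumed = 0.5            # HYPOTHETICAL optimistic Delta/sigma used for planning
d_true = 1 / 3             # Delta/sigma from Part A, taken here as the truth

# Planning with the optimistic effect size gives a smaller study
n_small = ceil(2 * ((z_a + z_b) / d_assumed) ** 2)

# Power actually achieved if the true effect size is the smaller one
power_actual = norm.cdf(d_true * sqrt(n_small / 2) - z_a)

print(f'n per group = {n_small}, actual power = {power_actual:.2f}')
```

Under these assumptions the planned sample shrinks well below the 142 per group found earlier, but the achieved power falls short of the target if the optimistic assumption is wrong.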

# Question 3

The p-value corresponds to the probability of getting a **difference of means** in a pre-specified **extreme** region given that there is *no difference between the group before and after intervention* (the null). In this particular example, if it is supposed that the intervention will *increase the magnitude of the variable measured*, the **extreme** region will be specified by a **one-tailed test**.

* Note that while group 1 had a smaller mean change than group 2, the standard error (of the mean change) was smaller for group 1 than it was for group 2. Mathematically, when the standard error is $\sigma$, the probability under the null of getting a result $\Delta \mu$ or greater is half of the two-tailed p-value $p$:

$$\frac{p}{2} = \int_{\Delta \mu}^{\infty} \mathcal{N}(x; 0,\sigma)\,dx = \int_{\Delta \mu/\sigma}^{\infty} \mathcal{N}(z; 0,1)\,dz = 1-\Phi(\Delta \mu/\sigma)$$

where $\Phi$ is the standard normal CDF (`norm.cdf` below).

In [40]:
# Check: two-tailed p-value for Group 2 (ratio 12/8 = 1.5)
2*(1-norm.cdf(12/8))

0.13361440253771617

Thus the quantity of relevance is $\Delta \mu/\sigma$.

* **Group 1**: $\Delta \mu/\sigma=2$
* **Group 2**: $\Delta \mu/\sigma=1.5$

Thus the p-value for group 1 is smaller.
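The same comparison can be run for both groups at once. This is a minimal sketch using only the ratios listed above; `norm.sf` is the survival function $1-\Phi$.

```python
from scipy.stats import norm

# Ratios of mean change to standard error, taken from the list above
z_scores = {'Group 1': 2.0, 'Group 2': 1.5}

# Two-tailed p-value: twice the upper-tail probability
p_values = {name: 2 * norm.sf(z) for name, z in z_scores.items()}

for name, p in p_values.items():
    print(f'{name}: p = {p:.4f}')
```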

The paired t-test within each group yields the statistics needed to test the null that the changes are the same in the two groups. The results obtained from the tests, $(\Delta \mu_1, \sigma_{\Delta \mu_1})$ and $(\Delta \mu_2, \sigma_{\Delta \mu_2})$, can be compared with each other (assuming normality) to determine if there is sufficient evidence to suggest that *they are not the same* (in other words, testing the null that they are the same).
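One way to carry out such a comparison is a z-test on the difference of the two mean changes, sketched below. It assumes the two groups are independent and the estimates are approximately normal. The Group 2 inputs (12 and 8) come from the check above; the Group 1 inputs are hypothetical placeholders chosen only to match the ratio $\Delta\mu/\sigma = 2$, because the actual values are not reproduced here. `compare_changes` is a hypothetical helper name.

```python
from math import sqrt
from scipy.stats import norm

def compare_changes(d1, se1, d2, se2):
    """Two-tailed z-test of the null that two independent mean changes are equal."""
    z = (d1 - d2) / sqrt(se1**2 + se2**2)   # standard errors add in quadrature
    return z, 2 * norm.sf(abs(z))

# Group 1 values (10, 5) are HYPOTHETICAL placeholders; Group 2 values are 12 and 8
z, p = compare_changes(10, 5, 12, 8)
print(f'z = {z:.3f}, p = {p:.3f}')
```

With real inputs, a small p-value here would be evidence that the two changes differ; a significant result in one group and a non-significant result in the other does not establish that on its own.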

# Question 4

I could use the website, or I could just code it myself... First the table that I used:

In [52]:
arr_obs = np.array([[57,23],[24,83]])
arr_obs

array([[57, 23],
       [24, 83]])

Columns represent levels of loneliness and rows represent high/low time spent on social media. Now we get our expected array (assuming no relationship) and use this to compute a $\chi^2$ value:

In [53]:
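# Row and column totals of the observed table, each as a 1 x 2 array
n_row = np.expand_dims(np.sum(arr_obs,axis=1),axis=0)
n_col = np.expand_dims(np.sum(arr_obs,axis=0),axis=0)
n = np.sum(arr_obs)
# Expected counts under independence: (row total * column total) / n
arr_exp = n_row.T@n_col / n
# Pearson chi-squared statistic (no continuity correction)
c2 = np.sum((arr_obs-arr_exp)**2 / arr_exp)
print(f'chi2 = {c2:.4f}')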

chi2 = 44.4346


This corresponds to a p-value of

In [54]:
# Upper-tail probability for 1 degree of freedom (2x2 table)
p = 1 - chi2(1).cdf(c2)
print(f'p={p:.2e}')

p=2.63e-11


which is significant at the $\alpha=0.05$ level. The conclusion of the study is that there is evidence to suggest that the amount of time one spends on social media is related to reported levels of loneliness.
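As a cross-check on the hand-rolled calculation, SciPy's `chi2_contingency` computes the same quantities. Passing `correction=False` turns off the Yates continuity correction that SciPy applies to 2x2 tables by default, so the statistic is comparable to the one above.

```python
import numpy as np
from scipy.stats import chi2_contingency

arr_obs = np.array([[57, 23], [24, 83]])

# Returns the statistic, p-value, degrees of freedom and expected counts
stat, p_val, dof, expected = chi2_contingency(arr_obs, correction=False)
print(f'chi2 = {stat:.4f}, dof = {dof}, p = {p_val:.2e}')
```

Leaving the correction on would give a somewhat smaller statistic, which is worth keeping in mind when comparing against other calculators.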

# Question 5

Since the prevalence is 40% and $n=1000$ we have

* $TP + FN = 400$
* $TN + FP = 600$

Now since $\text{Sensitivity} \equiv TP/(TP+FN)$ and $\text{Specificity} \equiv TN/(TN+FP)$ we get

* $TP = \text{Sens} \cdot (TP+FN) = 0.95 \cdot 400 = 380$
* $FN = 20$
* $TN = \text{Spec} \cdot (TN+FP) = 0.90 \cdot 600 = 540$
* $FP = 60$

So our 2x2 table is

In [61]:
pd.DataFrame(np.array([[380,60],[20,540]]),
            columns=['Has Disease', 'No Disease'],
            index = ['Test Positive', 'Test Negative'])

Unnamed: 0,Has Disease,No Disease
Test Positive,380,60
Test Negative,20,540


The ppv and npv are given by

In [67]:
sens = 0.95
spec = 0.9
prev = 0.4
ppv = sens*prev/(sens*prev + (1-spec)*(1-prev))
npv = spec*(1-prev)/((1-sens)*prev + spec*(1-prev))
print(f'Positive Predictive Value: {100*ppv:.2f}%')
print(f'Negative Predictive Value: {100*npv:.2f}%')

Positive Predictive Value: 86.36%
Negative Predictive Value: 96.43%


Now consider the **new population**:

Since the prevalence is 5% and $n=2000$ we have

* $TP + FN = 100$
* $TN + FP = 1900$

Now since $\text{Sensitivity} \equiv TP/(TP+FN)$ and $\text{Specificity} \equiv TN/(TN+FP)$ we get

* $TP = \text{Sens} \cdot (TP+FN) = 0.95 \cdot 100 = 95$
* $FN = 5$
* $TN = \text{Spec} \cdot (TN+FP) = 0.90 \cdot 1900 = 1710$
* $FP = 190$

In [69]:
pd.DataFrame(np.array([[95,190],[5,1710]]),
            columns=['Has Disease', 'No Disease'],
            index = ['Test Positive', 'Test Negative'])

Unnamed: 0,Has Disease,No Disease
Test Positive,95,190
Test Negative,5,1710


In [70]:
sens = 0.95
spec = 0.9
prev = 0.05
ppv = sens*prev/(sens*prev + (1-spec)*(1-prev))
npv = spec*(1-prev)/((1-sens)*prev + spec*(1-prev))
print(f'Positive Predictive Value: {100*ppv:.2f}%')
print(f'Negative Predictive Value: {100*npv:.2f}%')

Positive Predictive Value: 33.33%
Negative Predictive Value: 99.71%
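The counts and predictive values for both populations can be generated from the same three inputs. The sketch below is one way to organise the calculation; `confusion_counts` is a hypothetical helper name, and the counts are expected values under random sampling.

```python
def confusion_counts(sens, spec, prev, n):
    """Expected 2x2 counts for a test applied to n randomly sampled people."""
    diseased = prev * n
    healthy = n - diseased
    tp = sens * diseased        # diseased and test positive
    fn = diseased - tp          # diseased and test negative
    tn = spec * healthy         # healthy and test negative
    fp = healthy - tn           # healthy and test positive
    return tp, fp, fn, tn

summary = {}
for label, prev, n in [('Population 1', 0.40, 1000), ('Population 2', 0.05, 2000)]:
    tp, fp, fn, tn = confusion_counts(0.95, 0.90, prev, n)
    ppv = tp / (tp + fp)        # P(disease | test positive)
    npv = tn / (tn + fn)        # P(no disease | test negative)
    summary[label] = (ppv, npv)
    print(f'{label}: TP={tp:.0f} FP={fp:.0f} FN={fn:.0f} TN={tn:.0f} '
          f'PPV={100*ppv:.2f}% NPV={100*npv:.2f}%')
```

Computing the predictive values from counts rather than from the closed-form expression makes it easy to see that PPV is the fraction of the "Test Positive" row that has the disease, and NPV is the fraction of the "Test Negative" row that does not.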


**In the first population, both predictive values exceed 85% (provided the person being tested is randomly sampled), so the test is informative for both ruling in and ruling out disease; it is somewhat better at ruling out disease. In the second population, however, due to the low prevalence of the disease, the positive predictive value is very low; this is because the likelihood of having the disease is low to begin with. As such, *when randomly sampling* in the second population, a positive test is not a reliable indication that a patient has the disease. However, a negative test is a very good indicator that somebody doesn't have the disease**.

* **Note**: in the second population, one could simply assume that no one has the disease; this would give an npv of 95%. While this seems good, it only works because of the low prevalence of disease in the population.

* **Note**: in reality, the situation is more complicated. Patients who get tested for disease are not *randomly sampled* from the population; they are likely receiving the test because they have prior reason to believe they have the illness. In such a scenario, the pre-test probability is higher than the population prevalence, and the predictive values would have to be recomputed with Bayes' theorem using that higher prior.
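To illustrate the last note, the sketch below applies Bayes' theorem with the test's sensitivity and specificity over a range of pre-test probabilities. The intermediate values (10% and 20%) are hypothetical; 5% and 40% are the two prevalences used above. `ppv_from_prior` is a hypothetical helper name.

```python
def ppv_from_prior(prior, sens=0.95, spec=0.90):
    """Posterior probability of disease after a positive test (Bayes' theorem)."""
    true_pos = sens * prior                 # P(test positive and diseased)
    false_pos = (1 - spec) * (1 - prior)    # P(test positive and healthy)
    return true_pos / (true_pos + false_pos)

# 10% and 20% are HYPOTHETICAL pre-test probabilities for illustration
priors = [0.05, 0.10, 0.20, 0.40]
ppvs = [ppv_from_prior(p) for p in priors]
for prior, ppv in zip(priors, ppvs):
    print(f'pre-test probability {100*prior:.0f}%: PPV = {100*ppv:.2f}%')
```

The positive predictive value climbs steadily as the pre-test probability rises, so a positive result means more for a patient tested on clinical suspicion than for one screened at random.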