In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
from math import sqrt
import matplotlib.pyplot as plt
import scipy.stats

### Hypothesis test for the sample proportion

In this notebook we demonstrate how to carry out a hypothesis test for a proportion. If the conditions that make the sampling distribution of the sample proportion nearly normal hold, we can apply the same techniques we applied in [this notebook](./02_Hypothesis_testing_based_on_the_normal_model.ipynb) for the population mean. There are, however, some changes in the way the standard error is calculated.

The sample proportion can be represented as the average of the successes (1) and failures (0) in a sample of independent trials, each of which succeeds with a probability equal to the population proportion.
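As a small illustration of this representation, coding successes as 1 and failures as 0 makes the sample proportion equal to the mean of the array. The outcomes below are made up for the example.

```python
import numpy as np

# Hypothetical outcomes of 5 independent trials (1 = success, 0 = failure)
outcomes = np.array([1, 0, 1, 1, 0])

# The mean of the 0/1 values is the same as successes divided by trials
p_hat_mean = np.mean(outcomes)
p_hat_count = np.sum(outcomes) / len(outcomes)
print(p_hat_mean, p_hat_count)
```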

#### The sampling distribution of the sample proportion

If observations are independent and the success-failure condition is met (both n\*p>10 and n\*(1-p)>10, where n is the sample size and p is the population proportion, estimated by the sample proportion when it is unknown), the sampling distribution of the sample proportion is nearly normal and centered on the population proportion. The standard deviation of the sample proportion, or standard error, is calculated as:

SE = sqrt(p\*(1-p)/n)
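The condition and the formula can be wrapped in two small helpers. This is a minimal sketch: the function names `success_failure_ok` and `standard_error` are assumptions for illustration, and the values passed in are illustrative. The strict `>` follows the condition as stated above; some textbooks use "at least 10" instead.

```python
from math import sqrt

def success_failure_ok(n, p, threshold=10):
    """Return True if expected successes and failures both exceed the threshold."""
    return n * p > threshold and n * (1 - p) > threshold

def standard_error(n, p):
    """Standard error of the sample proportion: sqrt(p*(1-p)/n)."""
    return sqrt(p * (1 - p) / n)

# Illustrative values: n = 100 trials with p = 0.5
print(success_failure_ok(100, 0.5))
print(standard_error(100, 0.5))
```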

In this example we are assuming that the population proportion is p=0.7. We simulate the sampling of 10000 samples of size 10000 from that population and plot the sampling distribution. 

In [None]:
P = 0.7
NUMBER_SAMPLES = 10000
SIZE_SAMPLE = 10000

p_sample = []
for i in range(NUMBER_SAMPLES):
    p_sample.append(np.mean(np.random.binomial(size=SIZE_SAMPLE, n=1, p=P)))

fig, ax = plt.subplots()
mean = np.mean(p_sample)
std = np.std(p_sample)
weights = np.ones_like(p_sample)/float(len(p_sample))
ax.hist(p_sample, bins=50, weights=weights)
ax.set_xlabel('mean = ' + str(mean) + ', \nstd = ' + str(std))

se = sqrt(P*(1-P)/SIZE_SAMPLE)  # theoretical standard error from the population proportion
print('Standard error from population proportion = ' + str(se))

#### Hypothesis test

*A new law is to be proposed, and the government wants to determine whether the majority of the population would support it. To find out, they took a random sample of 100 individuals from the population and asked them whether they would be in favor of the new law. They obtained 53% support. Does the government have enough evidence to approve the new law?*

This is a hypothesis test for the population proportion. The null and alternative hypotheses are:

```
H0: p = p0 = 0.5
HA: p > p0
```

Let's first determine whether the sampling distribution is nearly normal. The observations are independent and the size of the sample is smaller than 10% of the population. With respect to the success-failure condition, in a hypothesis test it has to be checked using the null value p0:

In [None]:
p0 = 0.5
n = 100

print('p0 * n = ' + str(p0*n))
print('(1-p0) * n = ' + str((1-p0)*n))

Both values are higher than 10, so we can conclude that the sampling distribution is nearly normal. The standard error can be calculated now, using p0 since this is a hypothesis test:

In [None]:
SE = sqrt(p0*(1-p0)/n)
print('SE = ' + str(SE))

The sample proportion is ps=0.53. In order to obtain a p-value we first calculate the Z-score:

In [None]:
ps = 0.53
Z = (ps-p0)/SE
print('Z = ' + str(Z))

Since this is a one-sided test, we use the upper tail of the normal distribution to obtain the p-value:

In [None]:
p_value = 1 - scipy.stats.norm.cdf(Z)
print('p-value = ' + str(p_value))

The p-value is higher than the significance level of 0.05, so we cannot reject the null hypothesis. We do not have enough evidence to support the affirmation that the majority of the population supports the new law.
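The steps above can be collected into a single helper. This is a minimal sketch: the function name `one_sided_proportion_z_test` is an assumption for illustration and not part of any library. It uses `scipy.stats.norm.sf`, the survival function, which equals `1 - cdf` but is more accurate for very small tail areas.

```python
from math import sqrt
import scipy.stats

def one_sided_proportion_z_test(p_sample, p_null, n):
    """Upper-tail z-test for a proportion; returns (Z, p_value)."""
    se = sqrt(p_null * (1 - p_null) / n)   # SE uses the null value
    z = (p_sample - p_null) / se
    p_value = scipy.stats.norm.sf(z)       # upper-tail area, same as 1 - cdf(z)
    return z, p_value

# Numbers from the example: 53% support in a sample of 100, null value 0.5
z_stat, p_val = one_sided_proportion_z_test(0.53, 0.5, 100)
print('Z = ' + str(z_stat) + ', p-value = ' + str(p_val))
```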

#### Choose the sample size

How large should the sample be to make sure that we have enough evidence of a significant majority if that majority actually exists?

We can use the margin of error of the confidence interval to determine the size of the sample. Given the sample in the previous example, the confidence interval is given by:

In [None]:
lower = ps - 1.96*SE # 1.96 is the critical value z* for a 95% confidence level
higher = ps + 1.96*SE

print([lower, higher])
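The interval above reuses the standard error computed from p0. A common convention for confidence intervals is to compute the standard error from the sample proportion instead, since no null value is assumed. The sketch below shows that variant with the same numbers; the variable names are illustrative.

```python
from math import sqrt

ps_obs = 0.53   # sample proportion from the example
n_obs = 100     # sample size from the example

se_sample = sqrt(ps_obs * (1 - ps_obs) / n_obs)   # SE based on the sample proportion
ci_lower = ps_obs - 1.96 * se_sample
ci_upper = ps_obs + 1.96 * se_sample
print([ci_lower, ci_upper])
```

With these numbers the two versions are nearly identical, and 0.5 lies inside both.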

The null hypothesis value (0.5) is within that confidence interval. That's the reason why we couldn't reject the null hypothesis. We need a sample size that ensures that the margin of error (the 1.96\*SE part of the confidence interval) leaves 0.5 out of the confidence interval:

In [None]:
m = ps - p0
print('target margin of error = ' + str(m))

Let's use that value (about 0.03) as the target margin of error and solve the margin-of-error expression for n:

```
m = 1.96*sqrt(p0*(1-p0)/n)
n = p0*(1-p0)/(m/1.96)**2
```

In that expression we can use p0, or we can use p=0.5 to account for the worst-case scenario, since p\*(1-p) is largest at p=0.5. In our example p0 is already 0.5.

In [None]:
n = p0*(1-p0)/(m/1.96)**2
print('sample size = ' + str(n))
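Since a sample size has to be a whole number, the result would normally be rounded up. A minimal sketch of that step follows; the function name `required_sample_size` and its default arguments are assumptions for illustration.

```python
from math import ceil

def required_sample_size(margin, p=0.5, z=1.96):
    """Smallest integer n whose margin of error z*sqrt(p*(1-p)/n) is at most `margin`."""
    return ceil(p * (1 - p) / (margin / z) ** 2)

# Target margin of error of 0.03 with the worst-case proportion p = 0.5
print(required_sample_size(0.03))
```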

Let's now repeat the hypothesis test, assuming that the sample proportion is still 0.53:

In [None]:
SE = sqrt(p0*(1-p0)/n)
print('SE = ' + str(SE))

Z = (ps-p0)/SE
print('Z = ' + str(Z))

p_value = 1 - scipy.stats.norm.cdf(Z)
print('p-value = ' + str(p_value))

In this case the p-value is lower than 0.05, so we can reject the null hypothesis.

#### Simulation for hypothesis testing of the proportion

If the sample is small and the success-failure condition does not hold, we can apply a simulation method to perform a hypothesis test for the proportion.

Let's assume a setting in which 20% of the population with a given chronic disease will not survive longer than one year. A group of researchers investigated a new treatment and claim that it reduces this probability. The test can be specified as:

```
H0: p = p0 = 0.2
HA: p < p0
```

The researchers base their claim about the efficacy of the treatment on the fact that they tested it on 40 patients and only 5 of them did not survive the first year.

In this setting it is easy to check that the success-failure condition does not hold, since N\*P0 is below 10:

In [None]:
N = 40
P0 = 0.2

print(N*P0)
print(N*(1-P0))

In [None]:
SUCCESSES = 5
p_sample = SUCCESSES/N
print(p_sample)

Given a sample of size n and p0, we can simulate a p_sim by producing n random trials with probability p0:

In [None]:
def calculate_p_sim(n, p0):
    return np.sum(np.random.uniform(size=n) <= p0)/float(n)

print(calculate_p_sim(N, P0))
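The same quantity can also be drawn directly from a binomial distribution, which produces many simulated proportions in one vectorized call. This is a sketch of an alternative, not the method used in the cells of this notebook; the variable names and the seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only to make the sketch reproducible

n_patients = 40     # sample size from the example
p_null = 0.2        # null value from the example
n_sims = 10000      # number of simulated samples (illustrative)

# Each draw is the number of successes in one simulated sample of n_patients trials
sim_counts = rng.binomial(n_patients, p_null, size=n_sims)
sim_props = sim_counts / n_patients
print(sim_props[:5])
```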

We can now approximate the null distribution by generating many p_sim values with this method. The p-value is calculated as the number of p_sim values that support the alternative hypothesis (those equal to or lower than p_sample) divided by the total number of p_sim values.

In [None]:
SIZE_SAMPLE = 10000

p_sim_sample = [calculate_p_sim(N, P0) for i in range(SIZE_SAMPLE)]

fig, ax = plt.subplots()
weights = np.ones_like(p_sim_sample)/float(len(p_sim_sample))
ax.hist(p_sim_sample, bins=50, weights=weights)
ax.set_title('p_sim distribution')

p_value = np.sum(np.array(p_sim_sample) <= p_sample)/len(p_sim_sample)  # include values equal to p_sample
print('p-value = ' + str(p_value))

plt.axvline(x=p_sample, color='red')

The p-value is not lower than a significance level of 0.05, so we fail to reject the null hypothesis.
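As a cross-check on the simulation, the lower-tail probability can be computed exactly from the binomial distribution with `scipy.stats.binom.cdf`. This is a minimal sketch using the numbers of the example; the variable names are illustrative.

```python
import scipy.stats

n_patients = 40        # sample size from the example
p_null = 0.2           # null value from the example
observed_deaths = 5    # patients who did not survive the first year

# P(X <= 5) for X ~ Binomial(40, 0.2): exact lower-tail p-value
exact_p_value = scipy.stats.binom.cdf(observed_deaths, n_patients, p_null)
print('exact p-value = ' + str(exact_p_value))
```

The exact value is about 0.16, well above 0.05, and the simulated p-value should land close to it.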

When using the simulation method we have to take into account that large p-values may be overestimated. Additionally, when the one-tail proportion is doubled to obtain a two-sided p-value and the sample proportion is very close to the null value, the result may be larger than 1. In that case the p-value is taken to be 1.