In [15]:
import numpy as np
import random
from scipy.stats import norm

# Problem 3b {-}

In [12]:
lam = 1
n = 20
alpha = 0.05
m = 10 ** 4

num_rejects = 0

for _ in range(m):
    sample = np.random.poisson(lam=lam, size=n)
    error = norm.ppf(alpha / 2) * (lam / n) ** (1 / 2)
    sample_mean = np.mean(sample)
    
    if lam < sample_mean + error or lam > sample_mean - error:
        # Confidence interval doesn't capture lam -> reject null hypothesis
        num_rejects += 1
        
print('Estimated type I error rate: ', num_rejects / m)

Estimated type I error rate:  0.0531


# Problem 4 {-}

We can perform a permutation test to test our null hypothesis that | mu_A - mu_B | = 0 against the alternative hypothesis that | mu_A - mu_B | > 0, where mu_A and mu_B are the true mean soil pH values for locations A and B respectively. The absolute difference of the sample means | mu_A - mu_B | is our test statistic, and we reject the null hypothesis if our estimated p-value is small enough (< 0.05).

In [22]:
K = 10 ** 5

A = [7.58, 8.52, 8.01, 7.99, 7.93, 7.89, 7.85, 7.82, 7.80]
B = [7.85, 7.73, 8.53, 7.40, 7.35, 7.30, 7.27, 7.27, 7.23]

s_obs = abs(np.mean(A) - np.mean(B))

combined = A + B
p_value = 0

for _ in range(K):
    # Get a permutation sample 
    random.shuffle(combined)
    a = combined[:len(A)]
    b = combined[len(A):]
    s = abs(np.mean(a) - np.mean(b))
    
    if s > s_obs:
        p_value += 1
        
p_value /= K

print('Estimated p-value: ', p_value)

Estimated p-value:  0.03499


We get an estimated p-value of 0.03499, which is small enough for us to reject our null hypothesis. Thus we have strong evidence against our null hypothesis; in other words, we have strong evidence that the true mean soil pH values differ for the two locations.

# Prolem 5a {-}

We will use the Wald test for comparing means of two populations as described in lecture 12 page 3. We can find the p-value by solving W(X) = z_(1-alpha/2) as given in page 7 of lecture 12.

In [14]:
twain = [0.225, 0.262, 0.217, 0.24, 0.23, 0.229, 0.235, 0.217]
snodgrass = [0.209, 0.205, 0.196, 0.21, 0.202, 0.207, 0.224, 0.223, 0.22, 0.201]

se_hat = (np.var(twain, ddof=1) / len(twain) + np.var(snodgrass, ddof=1) / len(snodgrass)) ** (1 / 2)
W = abs(np.mean(twain) - np.mean(snodgrass) / se_hat)
p_value = 2 * norm.cdf(-W)
print('p-value: ', p_value)

p-value:  3.3122956254677125e-265


Since the p-value is very small (close to zero), we have a very strong evidence against our null hypothesis that the two populations hae the same mean of the proportion of three-letter words. Thus we conclude that it is very unlikely that the essays written by Snodgrass were actually written by Mark Twain.

# Problem 5b {-}

We will use the absolute difference of the sample means as a test statistic.

In [20]:
K = 10 ** 5

s_obs = abs(np.mean(twain) - np.mean(snodgrass))

combined = twain + snodgrass
p_value = 0

for _ in range(K):
    # Get a permutation sample 
    random.shuffle(combined)
    t = combined[:len(twain)]
    s = combined[len(twain):]
    s = abs(np.mean(t) - np.mean(snodgrass))
    
    if s > s_obs:
        p_value += 1
        
p_value /= K

print('Estimated p-value: ', p_value)

Estimated p-value:  0.00062


We get a very small estimated p-value of 0.00062, meaning we have a very strong evidence against our null hypothesis that the two populations hae the same mean of the proportion of three-letter words. Thus we conclude the same as in part a; it is very unlikely that the essays written by Snodgrass were actually written by Mark Twain.