# (Statistical) Hypothesis Testing / Simulating p-values, etc.

- Review samples (empirical distributions) and populations (distributions) 
- Review ***bootstrap sampling distributions***
- Review/Introduce ***sampling distributions***
    - under ***null*** (and ***alternative***) ***hypotheses***
- ***p-values*** and ***simulating p-values***
- ***Evidence*** against the ***null hypothesis***
    - Contrasted and compared against ***formal hypothesis testing*** and ***Type I*** and ***Type II errors*** 
- ***Nonparametric*** `stats.wilcoxon` and ***parametric*** `stats.ttest_1samp` ***theoretical p-values***
    - both distinct from the ***simulated parametric p-values*** above...
- What about when $\bar x$ is a ***proportion***?


In [1]:
import pandas as pd
ab = pd.read_csv("../../amazonbooks.csv", encoding="ISO-8859-1")#.dropna()
#print(ab.shape)
#ab.isnull().sum()
ab_noNaN = ab.drop(['Weight_oz','Width','Height'], axis=1).dropna()
ab_noNaN['Pub year'] = ab_noNaN['Pub year'].astype(int)
ab_noNaN['NumPages'] = ab_noNaN['NumPages'].astype(int)
ab_noNaN['Hard_or_Paper'] = ab_noNaN['Hard_or_Paper'].astype("category")
#print(ab_noNaN.shape)
#ab_noNaN.dtypes

In [2]:
# SAMPLE
import plotly.express as px
px.histogram(ab_noNaN, x='Thick', nbins=20)

## Statistics and Parameters

- sample mean $\bar x = \frac{1}{n}\sum_{i=1}^n x_i \longrightarrow \text{ estimates the population mean parameter } \mu$
- sample variance $s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i-\bar x)^2 \longrightarrow \text{ estimates the population variance parameter } \sigma^2$
    - sample standard deviation $s = \sqrt{s^2}$
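The formulas above can be checked numerically. The following is a minimal sketch using a small illustrative array `x` (ten values chosen for the example, not the full data set); note that NumPy's `var` and `std` default to a denominator of $n$, so `ddof=1` is needed to match the $n-1$ denominator in $s^2$.

```python
import numpy as np

# Illustrative values only (an assumption for this example, not the full sample)
x = np.array([0.8, 0.7, 0.3, 1.6, 1.4, 1.1, 0.7, 0.7, 0.9, 1.0])

n = len(x)
xbar = x.sum() / n                       # sample mean
s2 = ((x - xbar) ** 2).sum() / (n - 1)   # sample variance with n-1 denominator
s = np.sqrt(s2)                          # sample standard deviation

# NumPy equivalents: ddof=1 selects the n-1 denominator
same_mean = np.isclose(xbar, x.mean())
same_var = np.isclose(s2, x.var(ddof=1))
same_sd = np.isclose(s, x.std(ddof=1))
print(xbar, s2, s, same_mean, same_var, same_sd)
```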


In [3]:
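# POPULATION

import numpy as np
from scipy import stats

# Parameters of the hypothesized normal population
mu, sd_sigma = 1, 0.33
# Keep the frozen distribution object (no .rvs call here) so that
# both .pdf() and .rvs() can be used on it later
my_theoretical_population = stats.norm(loc=mu, scale=sd_sigma)

support = np.linspace(-1, 3, 100)

df = pd.DataFrame({'x': support,
                   'density': my_theoretical_population.pdf(support)})
px.line(df, x='x', y='density')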

# Null Hypothesis

#### $H_0: \mu=1; \sigma = 0.33; \text{ normal distribution; independent random sampling; a model for book thickness}$


### $H_0: \mu=1; \sigma = 0.33; \text{ normal distribution; independent random sampling}$

### $H_0: \mu=1; \sigma = 0.33; \text{ normal distribution}$

# Alternative Hypothesis

## $H_A: \text{ not } H_0$

- we do not prove $H_0$ false
   - we only give evidence AGAINST $H_0$
- we do not prove $H_0$ true either

In [7]:
ab_noNaN.Thick#.mean()

0      0.8
1      0.7
2      0.3
3      1.6
4      1.4
      ... 
320    1.1
321    0.7
322    0.7
323    0.9
324    1.0
Name: Thick, Length: 319, dtype: float64

# Sampling distribution of $\bar x$ under $H_0$

In [14]:
n = 319  # sample size (length of ab_noNaN.Thick)
reps = 100000  # number of simulated samples
simulated_test_statistics_under_H0 = np.zeros(reps)
for i in range(reps):
    # draw a sample of size n from the H0 population and record its mean
    simulated_test_statistics_under_H0[i] = \
        my_theoretical_population.rvs(size=n).mean()

df = pd.DataFrame({'xbar_under_H0': simulated_test_statistics_under_H0})
fig = px.histogram(df, x='xbar_under_H0')

observed_test_statistic = .96#ab_noNaN.Thick.mean()
# mark the observed test statistic on the simulated sampling distribution
fig.add_vline(x=observed_test_statistic)
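The loop above can also be written without an explicit Python loop. The following is a sketch, not the notebook's own code: it assumes NumPy's `default_rng` generator with an arbitrary seed and uses fewer repetitions to keep the `(reps, n)` array small. It also compares the spread of the simulated means with $\sigma/\sqrt{n}$, the standard deviation of $\bar x$ when sampling independently from a normal population.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, chosen only for reproducibility
mu, sigma, n = 1, 0.33, 319      # H0 parameters and sample size from the notes
reps = 10_000                    # fewer repetitions than above (assumption)

# Each row is one simulated sample of size n drawn under H0
samples = rng.normal(loc=mu, scale=sigma, size=(reps, n))
xbars = samples.mean(axis=1)     # one simulated sample mean per row

theoretical_se = sigma / np.sqrt(n)
print(xbars.mean(), xbars.std(ddof=1), theoretical_se)
```

The simulated means should center near `mu`, with a standard deviation close to `theoretical_se`.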

In [20]:
df = pd.DataFrame({'xbar_under_H0': simulated_test_statistics_under_H0})
fig = px.histogram(df, x='xbar_under_H0')

observed_test_statistic = .96#ab_noNaN.Thick.mean()
fig.add_vline(x=observed_test_statistic)
# mirror image of the observed statistic on the other side of mu:
# anything at least this far from mu, in either direction, is "as or more extreme"
fig.add_vline(x=mu+abs(observed_test_statistic-mu))#1.04

In [21]:
mu+abs(observed_test_statistic-mu)

1.04

In [24]:
(abs(simulated_test_statistics_under_H0-mu) >= abs(observed_test_statistic-mu)).sum()/reps

0.0321

# p-value

## The probability of getting a test statistic "as or more extreme" than the observed test statistic if $H_0$ is true
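This definition translates directly into a proportion of simulated test statistics. Below is a minimal sketch; the helper name `simulated_two_sided_p_value` is hypothetical (introduced only for illustration), and the simulation settings (seed, number of repetitions) are assumptions.

```python
import numpy as np

def simulated_two_sided_p_value(simulated, observed, null_value):
    """Proportion of simulated statistics at least as far from null_value as observed."""
    simulated = np.asarray(simulated)
    as_or_more_extreme = np.abs(simulated - null_value) >= abs(observed - null_value)
    return as_or_more_extreme.mean()

# Small hand-checkable case: two of the four values are at least 0.3 away from 1
toy_p = simulated_two_sided_p_value([0.5, 1.0, 1.5, 1.2], observed=0.7, null_value=1)

# Simulated sampling distribution of the mean under H0 (assumed seed and reps)
rng = np.random.default_rng(0)
mu, sigma, n, reps = 1, 0.33, 319, 10_000
xbars = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
p = simulated_two_sided_p_value(xbars, observed=0.96, null_value=mu)
print(toy_p, p)
```

Because the p-value is itself estimated by simulation, it varies slightly from run to run; more repetitions reduce that variation.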

![](https://www.jcpcarchives.org/userfiles/values-of-p-Inference.jpg)

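- Simulated versus theoretical p-values: a simulated p-value is a proportion of simulated test statistics, while the tests below compute theoretical p-values

- Parametric versus nonparametric: `stats.ttest_1samp` is parametric (it assumes a normally distributed population), while `stats.wilcoxon` is nonparametric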
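To illustrate the simulated versus theoretical distinction, here is a sketch under the fully specified $H_0$ used earlier ($\mu=1$, $\sigma=0.33$, normal, $n=319$). With $\sigma$ treated as known, $\bar x$ is normal with standard deviation $\sigma/\sqrt{n}$ under $H_0$, so a theoretical two-sided p-value follows from the normal tail area. This is a z-test calculation, which is not one of the tests run in these notes; the seed and number of repetitions are assumptions.

```python
import numpy as np
from scipy import stats

mu, sigma, n = 1, 0.33, 319
observed = 0.96

# Theoretical p-value: standardize the observed mean, then double the tail area
z = (observed - mu) / (sigma / np.sqrt(n))
theoretical_p = 2 * stats.norm.sf(abs(z))

# Simulated p-value for comparison (arbitrary seed, 10,000 repetitions)
rng = np.random.default_rng(1)
xbars = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)
simulated_p = (np.abs(xbars - mu) >= abs(observed - mu)).mean()

print(theoretical_p, simulated_p)
```

The two values should agree closely, with the simulated one fluctuating around the theoretical one.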

$H_0: \text{the median (of the population the sample was drawn from) is } 1 \text{ and the population is symmetric about its median}$ 

In [27]:
stats.wilcoxon(ab_noNaN.Thick-1)

WilcoxonResult(statistic=13323.5, pvalue=1.8400347866201877e-06)

$H_0: \mu=1 \text{ and the population is normally distributed}$

In [28]:
stats.ttest_1samp(ab_noNaN.Thick, 1)

TtestResult(statistic=-4.721193844327131, pvalue=3.5207570807323726e-06, df=318)
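Both calls can be tried without the data file. The sketch below uses a synthetic sample (an assumption, not the book data) and also recomputes the one-sample t statistic by hand as $(\bar x - \mu_0)/(s/\sqrt{n})$ to show what `stats.ttest_1samp` reports.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)                       # arbitrary seed
sample = rng.normal(loc=0.85, scale=0.33, size=319)  # synthetic stand-in data
mu0 = 1                                              # hypothesized value

# Nonparametric: signed-rank test on the differences from mu0
w = stats.wilcoxon(sample - mu0)
# Parametric: one-sample t-test of the mean against mu0
t = stats.ttest_1samp(sample, mu0)

# The t statistic by hand, using the n-1 sample standard deviation
n = len(sample)
t_by_hand = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))

print(w.pvalue, t.pvalue, t.statistic, t_by_hand)
```

Both results expose the p-value as the `.pvalue` attribute, so the two tests can be compared directly on the same sample.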