<table align="center">
   <td align="center"><a target="_blank" href="https://colab.research.google.com/github/ds5110/summer-2021/blob/master/08b-hypothesis-testing.ipynb">
<img src="https://github.com/ds5110/summer-2021/raw/master/colab.png"  style="padding-bottom:5px;" />Run in Google Colab</a></td>
</table>

# 08b Hypothesis testing


In [None]:
import statsmodels.api as sm
import numpy as np

# Standard error for linear regression

Recall that the standard error in using the sample mean $\bar{y}$ to estimate the population mean $\mu$ is given by
$$
\mathrm{SE}(\bar{y})^2 = \mathrm{Var}(\bar{y}) = \frac{\sigma^2}{N}
$$
where $\sigma$ is the standard deviation of each of the $N$ observations.
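
The formula can be checked with a quick simulation. The sketch below assumes a normally distributed population and uses illustrative values for `mu`, `sigma`, `N` and `n_trials` (none of these come from the notebook's data). It compares the spread of many simulated sample means with $\sigma / \sqrt{N}$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N = 5.0, 2.0, 50   # illustrative population parameters (assumptions)
n_trials = 20000              # number of simulated samples (assumption)

# Draw n_trials independent samples of size N and take the mean of each one
sample_means = rng.normal(mu, sigma, size=(n_trials, N)).mean(axis=1)

# The standard deviation of the sample means estimates the standard error
empirical_se = sample_means.std(ddof=1)
theoretical_se = sigma / np.sqrt(N)

print(f"Empirical SE:   {empirical_se:.4f}")
print(f"Theoretical SE: {theoretical_se:.4f}")
```

The two numbers should agree closely, and the agreement tightens as `n_trials` grows.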

Similar expressions can be obtained for estimating the coefficients $\beta_0$ and $\beta_1$ of linear regression

$$
y = \beta_0 + \beta_1 X + \epsilon
$$

For linear regression, the 95% confidence interval for $\beta_1$ is centered on the estimate $\hat{\beta}_1$ and is approximately

$$
\hat{\beta}_1 \pm 2 \, \mathrm{SE}(\hat{\beta}_1)
$$
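
As a rough illustration, the interval can be computed by hand for simple linear regression. The sketch below relies on the standard result $\mathrm{SE}(\hat{\beta}_1)^2 = \sigma^2 / \sum_i (x_i - \bar{x})^2$, with $\sigma$ estimated by the residual standard error. The data are synthetic, and the true coefficients and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
beta0_true, beta1_true, sigma = 2.0, 0.5, 1.0   # illustrative values (assumptions)
x = rng.uniform(0, 10, size=n)
y = beta0_true + beta1_true * x + rng.normal(0, sigma, size=n)

# Least-squares estimates for simple linear regression
x_bar, y_bar = x.mean(), y.mean()
sxx = np.sum((x - x_bar) ** 2)
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / sxx
beta0_hat = y_bar - beta1_hat * x_bar

# Estimate sigma with the residual standard error, RSE = sqrt(RSS / (n - 2))
residuals = y - (beta0_hat + beta1_hat * x)
rse = np.sqrt(np.sum(residuals ** 2) / (n - 2))

# Standard error of the slope and the approximate 95% confidence interval
se_beta1 = rse / np.sqrt(sxx)
ci_low, ci_high = beta1_hat - 2 * se_beta1, beta1_hat + 2 * se_beta1

print(f"beta1_hat = {beta1_hat:.4f}, SE = {se_beta1:.4f}")
print(f"Approximate 95% CI: [{ci_low:.4f}, {ci_high:.4f}]")
```

The factor of 2 is a convenient approximation to the exact quantile of the t-distribution with $n - 2$ degrees of freedom.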

# Hypothesis tests

Standard errors can also be used to perform hypothesis tests on the coefficients. The most common hypothesis test involves testing the null hypothesis of

H0 : There is no relationship between X and Y : $\beta_1 = 0$

versus the alternative hypothesis

Ha : There is some relationship between X and Y : $\beta_1 \neq 0$

* To test the null hypothesis, we need to determine whether our estimate $\hat{\beta}_1$ is sufficiently far from zero that we're confident $\beta_1$ is not zero.
* Use the t-statistic $t = \hat{\beta}_1 / \mathrm{SE}(\hat{\beta}_1)$, which measures how many standard errors $\hat{\beta}_1$ is from zero, and compute the probability $p$ of observing a value of $|t|$ or larger, assuming $\beta_1 = 0$.
* This probability is called the "p-value".
* We reject the null hypothesis, that is, we declare a relationship to exist between X and Y, if the p-value is small enough.
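
The steps above can be sketched in a few lines of standard-library Python. The helper names `t_statistic` and `two_sided_p_value_normal` are hypothetical, and the estimate and standard error are made-up numbers. The p-value here uses the standard normal distribution as a large-sample stand-in for the t-distribution, which is an approximation rather than the exact test.

```python
import math

def t_statistic(beta_hat, se, beta_null=0.0):
    """Number of standard errors between the estimate and the null value."""
    return (beta_hat - beta_null) / se

def two_sided_p_value_normal(t):
    """Two-sided p-value P(|Z| >= |t|) for a standard normal Z.

    Large-sample approximation: the exact test uses a t-distribution.
    """
    return math.erfc(abs(t) / math.sqrt(2))

# Hypothetical estimate and standard error, for illustration only
beta1_hat, se_beta1 = 0.5, 0.2

t = t_statistic(beta1_hat, se_beta1)
p = two_sided_p_value_normal(t)
print(f"t = {t:.2f}, p = {p:.4f}")
```

With these numbers the estimate lies 2.5 standard errors from zero and the p-value is about 0.012, so the null hypothesis would be rejected at the 5% level. Packages such as statsmodels use the t-distribution with $n - 2$ degrees of freedom instead, so their p-values differ slightly when $n$ is small.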



# Advertising dataset

In [None]:
import pandas as pd

# Load the Advertising dataset; the first CSV column holds the row index
url = "https://www.statlearning.com/s/Advertising.csv"
df = pd.read_csv(url, index_col=0)

Compare the next cell to p68 of ISLR, 1st Edition.

In [None]:
# Advertising dataset: sales vs TV
X = df[['TV']].copy()
X = sm.add_constant(X)
y = df['sales'].copy()

# Fit regression model
results = sm.OLS(y, X).fit()

# Inspect the results
print(results.summary())

The next cell computes residual standard error. Compare to p69 of ISLR.

In [None]:
print("Parameters:", results.params.to_dict())
print("Standard errors:", results.bse.to_dict())
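# Residual sum of squares: squared differences between observed and fitted values
rss = np.square(y - results.predict()).sum()
n = df.shape[0]
# Divide by n - 2 because two coefficients (intercept and slope) were estimated
print("Residual Standard Error (RSE): {:.2f}".format(np.sqrt(rss / (n-2))))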