<table align="center">
   <td align="center"><a target="_blank" href="https://colab.research.google.com/github/ds5110/summer-2021/blob/master/08b-hypothesis-testing.ipynb">
<img src="https://github.com/ds5110/summer-2021/raw/master/colab.png"  style="padding-bottom:5px;" />Run in Google Colab</a></td>
</table>

# 08b Hypothesis testing


In [1]:
import statsmodels.api as sm
import numpy as np



# Standard error for linear regression

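Recall that the standard error of the sample mean $\hat{\mu} = \bar{y}$, used as an estimate of the population mean $\mu$, satisfies
$$
\mathrm{SE}(\hat{\mu})^2 = \mathrm{Var}(\bar{y}) = \frac{\sigma^2}{N}
$$
where $\sigma$ is the standard deviation of each observation $y_i$ and $N$ is the sample size.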
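
A quick simulation can make this formula concrete. The sketch below assumes normally distributed data with hypothetical values $\sigma = 2$ and $N = 25$ (neither comes from the notebook's data) and compares the spread of many simulated sample means with $\sigma / \sqrt{N}$.

```python
import numpy as np

# Hypothetical settings, chosen for illustration only
rng = np.random.default_rng(0)
mu, sigma, n = 5.0, 2.0, 25
n_repeats = 10_000

# Draw many independent samples of size n and record each sample mean
samples = rng.normal(loc=mu, scale=sigma, size=(n_repeats, n))
sample_means = samples.mean(axis=1)

# The standard deviation of the sample means should be close to sigma / sqrt(n)
empirical_se = sample_means.std(ddof=1)
theoretical_se = sigma / np.sqrt(n)

print("Empirical SE:   {:.3f}".format(empirical_se))
print("Theoretical SE: {:.3f}".format(theoretical_se))
```

The two numbers should agree to within simulation noise, and the agreement tightens as `n_repeats` grows.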

Similar expressions can be obtained for the standard errors of the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ of the linear regression coefficients in

$$
y = \beta_0 + \beta_1 X + \epsilon
$$

For linear regression, we can use the standard error to compute an approximate 95% confidence interval for $\beta_1$, centered on its estimate $\hat{\beta}_1$:

$$
\hat{\beta}_1 \pm 2 \, \mathrm{SE}(\hat{\beta}_1)
$$
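
As a worked example, the sketch below applies this rule to the TV coefficient and standard error that statsmodels reports for the Advertising fit later in this notebook. The numbers are copied by hand from that output, and the factor of 2 is a rough stand-in for the exact t-distribution quantile.

```python
# Values copied from the statsmodels fit of sales on TV (see the output further down)
beta1_hat = 0.047536640433019764
se_beta1 = 0.0026906071877968716

# Approximate 95% confidence interval: estimate +/- 2 standard errors
lower = beta1_hat - 2 * se_beta1
upper = beta1_hat + 2 * se_beta1

print("Approximate 95% CI for beta_1: [{:.3f}, {:.3f}]".format(lower, upper))
# prints: Approximate 95% CI for beta_1: [0.042, 0.053]
```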





# Hypothesis test for linear regression

Standard errors can also be used to perform hypothesis tests. The most common hypothesis test involves testing the null hypothesis:

$$
H_0 : \text{There is no relationship between X and Y}
$$

versus the alternative hypothesis:

$$
H_a : \text{There is a relationship between X and Y}
$$

Mathematically, these are:

$$
H_0 : \beta_1 = 0
$$

versus

$$
H_a : \beta_1 \neq 0
$$

* If $H_0$ is true, then $\beta_1 = 0$ and the model reduces to $y = \beta_0 + \epsilon$, i.e., $y$ is not related to $x$.
* To test $H_0$, we need to determine whether our estimate $\hat{\beta}_1$ of $\beta_1$ is sufficiently far from zero that we're confident it's not zero.
* Use a t-statistic for $\hat{\beta}_1$ and set $\beta_1 = 0$ consistent with $H_0$:
$$
t = \frac{\hat{\beta}_1 - 0}{\mathrm{SE}(\hat{\beta}_1)}
$$
* Compute the probability $p$ of observing a value as large as $|t|$ or larger, assuming $H_0$ is true.
* This is called the "p-value".
* We reject the null hypothesis if the p-value is sufficiently small, that is, if the probability of observing such a value of $t$ by chance is too small.
* When the p-value is too small, we reject $H_0$ and accept $H_a$, that is, we conclude that $x$ and $y$ are related because $\beta_1 \neq 0$.
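
The steps above can be sketched numerically. The example below uses the TV estimate and standard error from the Advertising fit reported later in this notebook, and assumes `scipy` is available (it is a dependency of statsmodels). The 198 residual degrees of freedom are $N - 2$ for the 200 observations.

```python
from scipy import stats

# Values copied from the statsmodels fit of sales on TV
beta1_hat = 0.047536640433019764
se_beta1 = 0.0026906071877968716
df_resid = 200 - 2  # N observations minus the two fitted coefficients

# t-statistic under H_0: beta_1 = 0
t_stat = (beta1_hat - 0) / se_beta1

# Two-sided p-value: probability of a value at least as extreme as |t| under H_0
p_value = 2 * stats.t.sf(abs(t_stat), df_resid)

print("t-statistic: {:.2f}".format(t_stat))
print("p-value: {:.3g}".format(p_value))
```

The t-statistic is about 17.67, and the resulting p-value is far below any conventional threshold, so $H_0$ is rejected for the TV coefficient.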

# Advertising dataset

In [2]:
import pandas as pd
url = "https://www.statlearning.com/s/Advertising.csv"
 
df = pd.read_csv(url, index_col=0)

## Hypothesis test

* The p-values are sufficiently small that we reject $H_0$ for both $\beta_0$ and $\beta_1$
  * Sales and TV advertising are related to one another
  * $\beta_1$ and $\beta_0$ are not zero.

## Confidence intervals

* The 95% confidence interval for $\beta_1$ is $[0.042, 0.053]$

Compare results of the next cell to p68 of ISLR.

In [7]:
# Advertising dataset: sales vs TV
X = df[['TV']].copy()
X = sm.add_constant(X)
y = df['sales']

# Fit regression model
results = sm.OLS(y, X).fit()

# Inspect the results
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                  sales   R-squared:                       0.612
Model:                            OLS   Adj. R-squared:                  0.610
Method:                 Least Squares   F-statistic:                     312.1
Date:                Thu, 15 Jul 2021   Prob (F-statistic):           1.47e-42
Time:                        10:49:45   Log-Likelihood:                -519.05
No. Observations:                 200   AIC:                             1042.
Df Residuals:                     198   BIC:                             1049.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          7.0326      0.458     15.360      0.0

# Assessing model accuracy

How accurately can we predict $y$?

* We've rejected $H_0$ and computed confidence intervals for $\beta_0$ and $\beta_1$
* But we still don't know the true (population) values for $\beta_0$ and $\beta_1$

**Residual sum of squares (RSS)**

$$
\text{RSS} = \sum_{i=1}^N (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2
$$

**Residual standard error (RSE)**

$$
\text{RSE} = \sqrt{\frac{\text{RSS}}{N-2}}
$$

* $\mathrm{RSE}$ is an estimate of the standard deviation of $\epsilon$ for the regression model $y = \beta_0 + \beta_1 X + \epsilon$
* In other words, it's an estimate of the average amount the response will deviate from the true regression line.

The next cell computes residual standard error. Compare to p69 of ISLR.

In [4]:
print("Parameters:", results.params.to_dict())
print("Standard errors:", results.bse.to_dict())
rss = np.square(y - results.predict()).sum()
n = df.shape[0]
print("Residual Sum of Squares (RSS): {:.2f}".format(rss))
print("Residual Standard Error (RSE): {:.2f}".format(np.sqrt(rss / (n-2))))

Parameters: {'const': 7.032593549127698, 'TV': 0.047536640433019764}
Standard errors: {'const': 0.4578429402734786, 'TV': 0.0026906071877968716}
Residual Sum of Squares (RSS): 2102.53
Residual Standard Error (RSE): 3.26


**$\mathrm{R}^2$ statistic**

$$
\mathrm{R}^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}
$$

where

$$
\mathrm{TSS} = \sum_{i=1}^N (y_i - \bar{y})^2
$$

* $R^2$ measures the proportion of variability in $y$ that can be explained with $x$
* Whereas RSE depends on the units of $y$, $R^2$ is unitless and varies from 0 to 1
* For simple linear regression (a single predictor), $R^2$ is the square of the sample correlation between $y$ and $x$
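
The link between $R^2$ and correlation in the single-predictor case can be checked numerically. The sketch below uses synthetic data (the slope, intercept, and noise level are arbitrary assumptions) and `np.polyfit` for the least-squares fit rather than statsmodels.

```python
import numpy as np

# Synthetic data for illustration; the coefficients and noise level are arbitrary
rng = np.random.default_rng(42)
x_sim = rng.uniform(0, 300, size=200)
y_sim = 7.0 + 0.05 * x_sim + rng.normal(scale=3.0, size=200)

# Least-squares fit of y = b0 + b1 * x (polyfit returns the highest degree first)
b1, b0 = np.polyfit(x_sim, y_sim, deg=1)
y_hat = b0 + b1 * x_sim

rss = np.sum((y_sim - y_hat) ** 2)         # residual sum of squares
tss = np.sum((y_sim - y_sim.mean()) ** 2)  # total sum of squares
r_squared = 1 - rss / tss

# Sample correlation between x and y
r = np.corrcoef(x_sim, y_sim)[0, 1]

print("R^2:           {:.4f}".format(r_squared))
print("Correlation^2: {:.4f}".format(r ** 2))
```

The two printed values coincide up to floating-point error. This identity holds for a least-squares fit with an intercept and one predictor; it does not carry over directly to multiple regression.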

Q: What constitutes a "good" value for $R^2$?



# Multiple linear regression
 
Q: How do we choose between TV, radio and newspaper with the advertising dataset?

* Recall that, dollar for dollar, the univariate models indicated that radio wins.
* The univariate models suggest that an additional \$1000 spent on radio is associated with an increase in sales of around 200 units.
* In contrast, these models suggest that an additional \$1000 spent on TV or newspaper is associated with a far smaller increase in sales.


In [6]:
# Multiple regression of sales on all three advertising media
# (sm and pd were imported in earlier cells)
predictors = ['TV', 'radio', 'newspaper']  # avoid shadowing the built-in vars()

# Target variable
y = df["sales"]

# Design matrix: a column of ones for the intercept, then the predictors
X = pd.DataFrame(1, index=df.index, columns=['Intercept'])
X = X.join(df[predictors])

mod = sm.OLS(y, X)    # Describe model
res = mod.fit()       # Fit model
print(res.summary())  # Summarize model

                            OLS Regression Results                            
Dep. Variable:                  sales   R-squared:                       0.897
Model:                            OLS   Adj. R-squared:                  0.896
Method:                 Least Squares   F-statistic:                     570.3
Date:                Thu, 15 Jul 2021   Prob (F-statistic):           1.58e-96
Time:                        09:10:57   Log-Likelihood:                -386.18
No. Observations:                 200   AIC:                             780.4
Df Residuals:                     196   BIC:                             793.6
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.9389      0.312      9.422      0.0

# Some important questions

1. Is at least one of the predictors $x_i$ useful in predicting the response?
2. Do all the predictors help to explain $y$, or is only a subset of the predictors useful?
3. How well does the model fit the data?
4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

# Qualitative predictors