<table align="center">
   <td align="center"><a target="_blank" href="https://colab.research.google.com/github/ds5110/summer-2021/blob/master/08b-hypothesis-testing.ipynb">
<img src="https://github.com/ds5110/summer-2021/raw/master/colab.png"  style="padding-bottom:5px;" />Run in Google Colab</a></td>
</table>

# 08b Hypothesis testing


In [None]:
import statsmodels.api as sm
import numpy as np

# Standard error for linear regression

Recall that the standard error of the sample mean $\hat{\mu} = \bar{y}$, used as an estimate of the population mean $\mu$, satisfies
$$
\mathrm{SE}(\hat{\mu})^2 = \mathrm{Var}(\bar{y}) = \frac{\sigma^2}{N}
$$
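
As a quick sanity check on the $\sigma^2 / N$ formula, the following sketch simulates many samples and compares the empirical variance of the sample mean with the theoretical value. The population parameters, sample size and number of repeats are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0          # population standard deviation (assumed for illustration)
sample_size = 25     # N in the formula above
n_repeats = 20000    # number of simulated samples (assumed)

# Draw many samples and record the sample mean of each one
samples = rng.normal(loc=10.0, scale=sigma, size=(n_repeats, sample_size))
sample_means = samples.mean(axis=1)

# Compare the empirical variance of the sample mean with sigma^2 / N
empirical_var = sample_means.var(ddof=1)
theoretical_var = sigma ** 2 / sample_size
print("Empirical Var(sample mean): {:.4f}".format(empirical_var))
print("sigma^2 / N:                {:.4f}".format(theoretical_var))
```

The two numbers should agree closely, and the agreement improves as the number of simulated samples grows.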

Similar expressions can be obtained for estimating the coefficients $\beta_0$ and $\beta_1$ of linear regression

$$
y = \beta_0 + \beta_1 X + \epsilon
$$

For linear regression, we can use $\mathrm{SE}$ to compute an approximate 95% confidence interval for $\beta_1$, centered on the estimate $\hat{\beta}_1$:

$$
\hat{\beta}_1 \pm 2 \, \mathrm{SE}(\hat{\beta}_1)
$$
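
The sketch below computes these standard errors by hand for a simple linear regression. It uses synthetic data (an assumption, not the Advertising dataset) and the standard textbook formulas $\mathrm{SE}(\hat{\beta}_1)^2 = \sigma^2 / \sum_i (x_i - \bar{x})^2$ and $\mathrm{SE}(\hat{\beta}_0)^2 = \sigma^2 \left[ 1/N + \bar{x}^2 / \sum_i (x_i - \bar{x})^2 \right]$, with $\sigma$ estimated from the residuals. The variable names are chosen for illustration only.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the Advertising dataset)
rng = np.random.default_rng(0)
n_sim = 200
x_sim = rng.uniform(0, 300, size=n_sim)
y_sim = 7.0 + 0.05 * x_sim + rng.normal(scale=3.0, size=n_sim)

# Least-squares estimates for simple linear regression
x_bar, y_bar = x_sim.mean(), y_sim.mean()
sxx = np.sum((x_sim - x_bar) ** 2)
beta1_hat = np.sum((x_sim - x_bar) * (y_sim - y_bar)) / sxx
beta0_hat = y_bar - beta1_hat * x_bar

# Estimate sigma from the residuals, using N - 2 degrees of freedom
residuals = y_sim - (beta0_hat + beta1_hat * x_sim)
sigma_hat = np.sqrt(np.sum(residuals ** 2) / (n_sim - 2))

# Standard errors of the coefficient estimates
se_beta1 = sigma_hat / np.sqrt(sxx)
se_beta0 = sigma_hat * np.sqrt(1.0 / n_sim + x_bar ** 2 / sxx)

# Approximate 95% confidence interval for beta_1
ci_beta1 = (beta1_hat - 2 * se_beta1, beta1_hat + 2 * se_beta1)
print("beta1_hat = {:.4f}, SE = {:.4f}".format(beta1_hat, se_beta1))
print("Approximate 95% CI for beta_1: [{:.4f}, {:.4f}]".format(*ci_beta1))
```

The factor of 2 is a rounded stand-in for the exact t-distribution quantile, which is why the interval is only approximately 95%.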





# Hypothesis test for linear regression

Standard errors can also be used to perform hypothesis tests. The most common hypothesis test involves testing the null hypothesis:

$$
H_0 : \text{There is no relationship between X and Y}
$$

versus the alternative hypothesis:

$$
H_a : \text{There is a relationship between X and Y}
$$

Mathematically, these are:

$$
H_0 : \beta_1 = 0
$$

versus

$$
H_a : \beta_1 \neq 0
$$

* If we accept $H_0$, then $\beta_1 = 0$ and the model reduces to $y = \beta_0 + \epsilon$, i.e., $y$ is not related to $x$.
* To test $H_0$, we need to determine whether our estimate $\hat{\beta}_1$ of $\beta_1$ is sufficiently far from zero that we're confident $\beta_1$ is not zero.
* Use a t-statistic for $\hat{\beta}_1$ and set $\beta_1 = 0$, consistent with $H_0$:
$$
t = \frac{\hat{\beta}_1 - 0}{\mathrm{SE}(\hat{\beta}_1)}
$$
* Compute the probability $p$ of observing a t-statistic at least as large in absolute value as $|t|$, assuming $H_0$ is true.
* This is called the "p-value".
* We reject the null hypothesis if the p-value is sufficiently small, that is, if a t-statistic this extreme would be too improbable under $H_0$.
* When the p-value is small enough (a common threshold is 0.05), we reject $H_0$ and accept $H_a$, that is, we conclude that $x$ and $y$ are related because $\beta_1 \neq 0$.
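
The steps above can be sketched as follows. This is a minimal illustration on synthetic data (an assumption), using `scipy.stats.t` with $N - 2$ degrees of freedom for the two-sided p-value; the 0.05 threshold is a conventional choice, not something fixed by the method.

```python
import numpy as np
from scipy import stats

# Synthetic data with a true slope of 0.5 (an assumption for illustration)
rng = np.random.default_rng(1)
n_sim = 100
x_sim = rng.normal(size=n_sim)
y_sim = 2.0 + 0.5 * x_sim + rng.normal(size=n_sim)

# Least-squares fit and the standard error of the slope
sxx = np.sum((x_sim - x_sim.mean()) ** 2)
beta1_hat = np.sum((x_sim - x_sim.mean()) * (y_sim - y_sim.mean())) / sxx
beta0_hat = y_sim.mean() - beta1_hat * x_sim.mean()
residuals = y_sim - beta0_hat - beta1_hat * x_sim
se_beta1 = np.sqrt(np.sum(residuals ** 2) / (n_sim - 2) / sxx)

# t-statistic under H0: beta_1 = 0, and its two-sided p-value
t_stat = (beta1_hat - 0.0) / se_beta1
p_value = 2 * stats.t.sf(np.abs(t_stat), df=n_sim - 2)

alpha = 0.05  # significance threshold (a conventional choice)
print("t = {:.2f}, p-value = {:.3g}".format(t_stat, p_value))
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```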

# Advertising dataset

In [None]:
import pandas as pd
url = "https://www.statlearning.com/s/Advertising.csv"
 
df = pd.read_csv(url, index_col=0)

In [None]:
# Advertising dataset: sales vs TV
X = df[['TV']].copy()
X = sm.add_constant(X)
y = df['sales']

# Fit regression model
results = sm.OLS(y, X).fit()

# Inspect the results
print(results.summary())

## Hypothesis test

* The p-values are sufficiently small that we reject $H_0$ for both $\beta_0$ and $\beta_1$
  * Sales and TV advertising are related to one another
  * $\beta_1$ and $\beta_0$ are not zero.

## Confidence intervals

* The 95% confidence interval for $\beta_1$ is [0.042, 0.053]

Compare results of the next cell to p68 of ISLR.

# Advertising dataset -- the other predictors

In [None]:
# Advertising dataset: sales vs radio
X = df[['radio']].copy()
X = sm.add_constant(X)
y = df['sales']

# Fit regression model
results = sm.OLS(y, X).fit()

# Inspect the results
print(results.summary())

In [None]:
# Advertising dataset: sales vs newspaper
X = df[['newspaper']].copy()
X = sm.add_constant(X)
y = df['sales']

# Fit regression model
results = sm.OLS(y, X).fit()

# Inspect the results
print(results.summary())

# Assessing model accuracy

How accurately can we predict $y$?

* We've rejected $H_0$ and computed confidence intervals for $\beta_0$ and $\beta_1$
* But we still don't know the true (population) values for $\beta_0$ and $\beta_1$

**Residual sum of squares (RSS)**

$$
\text{RSS} = \sum_{i=1}^N (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2
$$

**Residual standard error (RSE)**

$$
\text{RSE} = \sqrt{\frac{\text{RSS}}{N-2}}
$$

* $\mathrm{RSE}$ is an estimate of the standard deviation of $\epsilon$ for the regression model $y = \beta_0 + \beta_1 X + \epsilon$
* In other words, it's an estimate of the average amount the response will deviate from the true regression line.
* These are the results for sales vs TV:

```
              Intercept   TV
Coefficient   7.0325      0.0475
Std. error    0.4578      0.0027
t-statistic   15.36       17.67
p-value       < 0.0001    < 0.0001
```

The next cell computes RSE. Compare to p69 of ISLR.

In [None]:
# The next 4 lines are here just to be sure we don't redefine X & y by accident
X = df[['TV']].copy()
X = sm.add_constant(X)
y = df['sales']
results = sm.OLS(y, X).fit()

print("Parameters:", results.params.to_dict())
print("Standard errors:", results.bse.to_dict())

# RSS
rss = np.square(y - results.predict()).sum()

# RSE
n = df.shape[0]
rse = np.sqrt(rss / (n-2))
print("Residual Sum of Squares (RSS): {:.2f}".format(rss))
print("Residual Standard Error (RSE): {:.2f}".format(rse))

**$\mathrm{R}^2$ statistic**

RSE depends on the units of measurement, which makes it difficult to define what we mean by a big or small RSE. 

Introducing $\mathrm{R}^2$:

$$
\mathrm{R}^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}
$$

where

$$
\mathrm{TSS} = \sum_{i=1}^N (y_i - \bar{y})^2
$$

* $R^2$ measures the proportion of variability in $y$ that can be explained with $x$
* Whereas RSE depends on the units of $y$, $R^2$ is unitless and varies from 0 to 1
* For linear regression, $R^2$ is also the sample squared correlation between $y$ and $x$
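
The last bullet can be checked numerically. The following sketch uses synthetic data (an assumption) to compute $R^2$ from RSS and TSS and compare it with the squared sample correlation between $x$ and $y$.

```python
import numpy as np

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(2)
n_sim = 150
x_sim = rng.uniform(0, 10, size=n_sim)
y_sim = 1.0 + 2.0 * x_sim + rng.normal(scale=4.0, size=n_sim)

# Fit a straight line; np.polyfit returns the slope first, then the intercept
beta1_hat, beta0_hat = np.polyfit(x_sim, y_sim, deg=1)
y_hat = beta0_hat + beta1_hat * x_sim

# R^2 from its definition
rss = np.sum((y_sim - y_hat) ** 2)
tss = np.sum((y_sim - y_sim.mean()) ** 2)
r_squared = 1 - rss / tss

# Squared sample correlation between x and y
corr_squared = np.corrcoef(x_sim, y_sim)[0, 1] ** 2

print("R^2:                 {:.6f}".format(r_squared))
print("Squared correlation: {:.6f}".format(corr_squared))
```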

Q: What constitutes a "good" value for $R^2$?



# Multiple linear regression
 
Q: How do we choose between TV, radio and newspaper with the advertising dataset?

* Recall that, dollar for dollar, the univariate models indicated that radio wins.
* The univariate models suggest that an additional \$1000 spent on radio is associated with an increase in sales of around 200 units.
* In contrast, the univariate models suggest that an additional \$1000 spent on TV or newspaper is associated with a far smaller increase in sales.


In [None]:
# Python-style implementation
import statsmodels.api as sm

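predictors = ['TV', 'radio', 'newspaper']  # avoid shadowing the built-in vars()

# Target variable
y = df["sales"]

# Design matrix: an explicit intercept column plus the three predictors
X = pd.DataFrame(1, index=df.index, columns=['Intercept'])
X = X.join(df[predictors])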

mod = sm.OLS(y, X)    # Describe model
res = mod.fit()       # Fit model
print(res.summary())  # Summarize model

# Some important questions

1. Is at least one of the predictors $x_i$ useful in predicting the response?
2. Do all the predictors help to explain $y$, or is only a subset of the predictors useful?
3. How well does the model fit the data?
4. Given a set of predictor values, what response value should we predict,
and how accurate is our prediction?

## 1. Is there at least one important predictor?

Mathematically:

$$
H_0 : \beta_1 = \beta_2 = \cdots = \beta_p = 0
$$

versus the alternative

$$
H_a : \beta_i \neq 0 \text{ for at least one } i
$$

This hypothesis test is performed with the F-statistic

$$
F = \frac{(\mathrm{TSS} - \mathrm{RSS}) / p}{\mathrm{RSS} / (N - p - 1)}
$$

where $\mathrm{TSS} = \sum (y_i - \bar{y})^2$ is the "Total Sum of Squares", $p$ is the number of predictors and $N$ is the number of observations.

If the modeling assumptions are correct, along with $H_0$, then the expected values of both the numerator and the denominator equal $\sigma^2$.

So when there is no relationship between the response and the predictors, you expect $F \approx 1$.

On the other hand, if $H_a$ is correct, then you expect $F > 1$.

* For the result above, $F \gg 1$, so it's clear that there's a relationship.
* When $N$ is large, an $F$ only slightly greater than 1 can still provide evidence against $H_0$.
* When $N$ is small, you need a larger value of $F$ to reject $H_0$.
* In general, use the p-value for the F-statistic to test $H_0$.
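
As a rough illustration, the F-statistic and its p-value can be computed directly from TSS and RSS. The sketch below assumes synthetic data with $p = 3$ predictors and uses the standard form of the statistic, whose denominator has $N - p - 1$ degrees of freedom; `scipy.stats.f.sf` gives the upper-tail probability.

```python
import numpy as np
from scipy import stats

# Synthetic data with p = 3 predictors (an assumption for illustration);
# the third predictor has a true coefficient of zero
rng = np.random.default_rng(3)
n_sim, p_sim = 200, 3
X_sim = rng.normal(size=(n_sim, p_sim))
y_sim = 3.0 + X_sim @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n_sim)

# Least-squares fit with an intercept column
design = np.column_stack([np.ones(n_sim), X_sim])
beta_hat, *_ = np.linalg.lstsq(design, y_sim, rcond=None)

rss = np.sum((y_sim - design @ beta_hat) ** 2)
tss = np.sum((y_sim - y_sim.mean()) ** 2)

# F-statistic with (p, N - p - 1) degrees of freedom, and its p-value
f_stat = ((tss - rss) / p_sim) / (rss / (n_sim - p_sim - 1))
p_value = stats.f.sf(f_stat, p_sim, n_sim - p_sim - 1)

print("F = {:.2f}, p-value = {:.3g}".format(f_stat, p_value))
```

The same value appears as "F-statistic" in a statsmodels summary, so this is mainly useful for seeing where that number comes from.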



## 2. What about a subset of all the predictors?

* Each individual predictor in the result above has a p-value and t-statistic.
* These are equivalent to the F-test for omitting only one variable from the model
* In the multivariate model, there's no evidence that newspapers are important.
* Compare this to the conclusion you might draw from univariate regression.
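* **BUT** don't skip the F-test and rely only on the individual t-statistics when $p \gg 1$: even if $H_0$ is true, you expect about 5% of the t-statistics to appear "significant" at the 0.05 level by chance.
* The F-statistic adjusts for the number of predictors $p$, so it doesn't suffer from this problem.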
* Various statistics can be used to select the "best" model from a large number of candidate models, including:
  * Mallows's $C_p$
  * Akaike Information Criterion (AIC)
  * Bayesian Information Criterion (BIC)
  * Adjusted $R^2$
* We can't consider all possible models when $p$ is large
  * If $p=2$, there are $2^2 = 4$ possible models
  * If $p=30$, there are $2^{30}$, i.e., more than a billion
* The standard approaches to model selection
  * Forward selection
    * start with the null model (intercept only, as under $H_0$)
    * add the variable that produces the lowest RSS
    * continue until you reach some stopping criterion
  * Backward selection
    * start with all $p$ variables
    * remove the variable with the largest p-value
    * continue until you reach some stopping criterion
  * Mixed selection
    * a combination of the two other approaches
    * start with the null model and add variables one by one
    * remove any variable whose p-value rises above a threshold
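
Forward selection as described above can be sketched in a few lines of NumPy. This is an illustrative implementation on synthetic data, not a library routine: the helper names `rss_for` and `forward_selection` are hypothetical, and the stopping criterion (a fixed maximum number of variables, `max_vars`) is just one simple assumption among many possible choices.

```python
import numpy as np

def rss_for(X, y, columns):
    """RSS of the least-squares fit using an intercept plus the given columns."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in columns])
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    return np.sum((y - design @ beta_hat) ** 2)

def forward_selection(X, y, max_vars):
    """Greedily add the predictor that gives the lowest RSS at each step."""
    selected = []
    remaining = list(range(X.shape[1]))
    # Hypothetical stopping criterion: stop after max_vars predictors
    while remaining and len(selected) < max_vars:
        best = min(remaining, key=lambda j: rss_for(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data: only columns 2 and 0 influence the response (an assumption)
rng = np.random.default_rng(4)
n_sim = 300
X_sim = rng.normal(size=(n_sim, 5))
y_sim = 1.0 + 3.0 * X_sim[:, 2] + 1.0 * X_sim[:, 0] + rng.normal(size=n_sim)

order = forward_selection(X_sim, y_sim, max_vars=2)
print("Selection order (column indices):", order)
```

Each step refits one model per remaining predictor, so the procedure fits on the order of $p^2$ models rather than all $2^p$.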

## 3. How well does the model fit? 

* RSE and $R^2$ are the most common metrics for model fit
* $R^2$ for a multivariate model is the squared correlation between $y$ and $\hat{y}$
* In fact, the least-squares fit maximizes $R^2$ among all possible linear models.

$$
\mathrm{RSE} = \sqrt{ \frac{1}{N-p-1} \mathrm{RSS} }
$$

Thus, whereas $R^2$ never decreases as you increase $p$, RSE can increase when the decrease in RSS is small relative to the increase in $p$.
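
To see how the two metrics respond to extra predictors, the sketch below fits a model with one real predictor and then refits it after adding three pure-noise predictors. Adding predictors never lowers $R^2$, while RSE can move in either direction because its denominator $N - p - 1$ shrinks along with RSS. The data and the helper `fit_metrics` are assumptions made for illustration.

```python
import numpy as np

def fit_metrics(design, y):
    """Return (R^2, RSE) for a least-squares fit; design includes the intercept column."""
    n_obs, k = design.shape  # k = p + 1 columns, so n_obs - k = N - p - 1
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ beta_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - rss / tss, np.sqrt(rss / (n_obs - k))

# Synthetic data: one real predictor plus three unrelated ones (an assumption)
rng = np.random.default_rng(5)
n_sim = 50
x_sim = rng.normal(size=n_sim)
y_sim = 2.0 + 1.5 * x_sim + rng.normal(size=n_sim)
noise = rng.normal(size=(n_sim, 3))  # predictors unrelated to y

base = np.column_stack([np.ones(n_sim), x_sim])
bigger = np.column_stack([base, noise])

r2_base, rse_base = fit_metrics(base, y_sim)
r2_big, rse_big = fit_metrics(bigger, y_sim)

print("p = 1: R^2 = {:.4f}, RSE = {:.4f}".format(r2_base, rse_base))
print("p = 4: R^2 = {:.4f}, RSE = {:.4f}".format(r2_big, rse_big))
```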

## 4. Which response value should we predict and how accurate is the prediction?

* You can compute confidence intervals for the prediction
* Confidence intervals for the prediction quantify the uncertainty in the average response; they don't consider the noise $\epsilon$
* Prediction intervals include the impact of $\epsilon$, so they are always wider than the corresponding confidence intervals
* The statsmodels [OLSResults.get_prediction() API reference docs](https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLSResults.get_prediction.html) describe how to calculate both


# Qualitative predictors