# Multiple Regression

This section is based partly on Freedman, D.A., 2009. [_Statistical Models: Theory and Practice_, Revised Edition](http://www.amazon.com/Statistical-Models-Practice-David-Freedman/dp/0521743850/), Cambridge University Press.

Statistical models, and regression in particular, are used primarily for three purposes:

1. _Description_: to summarize data
2. _Prediction_: to predict future data
3. _Causal Inference_: to predict what would happen in response to an intervention

It is straightforward to check whether a regression model is a good summary of _existing_ data, although there is some subtlety in determining whether the summary is _good enough_.  How to measure goodness of fit appropriately is not always obvious, and adequacy of fit depends on the use of the summary.

Prediction is harder than description because it involves _extrapolation_: how can one tell what the future will bring? Why should the future be like the past? Is the system under study stable (i.e., _stationary_) or are its properties changing with time?

However, the hardest of these tasks is causal inference. The biggest difficulty in drawing causal inferences is _confounding_, especially when the data arise from _observational studies_ rather than _randomized, controlled experiments_. (_Natural experiments_ lie somewhere in between; there are few that approach the utility of
a randomized controlled experiment, but John Snow's study of the communication of cholera is a notable exception.)

_Confounding_ happens when one factor or variable manifests in the outcome in a way that cannot be distinguished from the _treatment_.

_Stratification_ (e.g., _cross tabulation_) can help reduce confounding. So can modeling&mdash;in some cases, but not in others. 
For modeling to help, it is generally necessary for the structure of the model to 
correspond to how the data were actually generated.
Unfortunately, most models in science, especially social science, are chosen out of habit or computational
convenience, not because they have a basis in the science itself.
This often produces misleading results, and the misleading impression that those results have small
uncertainties.

## Some notation

If $\{x_i\}_{i=1}^n$ is a list of numbers, 
$$
\bar{x} \equiv \frac{1}{n} \sum_{i=1}^n x_i
$$
is the _mean_ of the list;
$$
\mbox{var} x \equiv \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2
$$
is the _variance_ of the list;
$$
s_x \equiv \sqrt{\mbox{var} x}
$$
is the SD or _standard deviation_ of the list;
and
$$
   z_i \equiv \frac{x_i - \bar{x}}{s_x}
$$
is _$x_i$ in standard units_.
With $\bar{y}$ and $s_y$ defined analogously for the list $\{y_i\}_{i=1}^n$, and if $s_x$
and $s_y$ are nonzero,

$$
   r_{xy} \equiv \frac{1}{n} \sum_{i=1}^n \frac{x_i - \bar{x}}{s_x} \cdot \frac{y_i-\bar{y}}{s_y}
$$
is the _correlation of $x$ and $y$_.
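
The definitions above can be sketched in a few lines of Python with NumPy. This is an illustration only: the function names and the two short lists are made up for the example. Note that the variance here divides by $n$, as in the definition above, not by $n-1$.

```python
import numpy as np

def mean(x):
    """Mean of the list: (1/n) * sum of x_i."""
    x = np.asarray(x, dtype=float)
    return x.sum() / len(x)

def var(x):
    """Variance of the list, dividing by n (not n - 1)."""
    x = np.asarray(x, dtype=float)
    return ((x - mean(x)) ** 2).sum() / len(x)

def sd(x):
    """Standard deviation: square root of the variance."""
    return np.sqrt(var(x))

def standard_units(x):
    """Each x_i in standard units: (x_i - mean) / SD. Requires SD > 0."""
    x = np.asarray(x, dtype=float)
    return (x - mean(x)) / sd(x)

def corr(x, y):
    """Correlation: mean of the products of x and y in standard units."""
    return (standard_units(x) * standard_units(y)).sum() / len(x)

# Made-up lists, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

print(mean(x), var(x), sd(x), corr(x, y))
```

For these lists, $\bar{x} = 3$, $\mbox{var}\, x = 2$, and $r_{xy} = 0.8$, which can be checked by hand.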


## Bivariate regression

Suppose we observe pairs $\{(x_i, y_i)\}_{i=1}^n$.
What straight line $y = a + bx$ comes closest to fitting these data, in the least-squares sense?
That is, what values $a$ and $b$ minimize
$$
   \mbox{mean squared error} = \mbox{MSE} \equiv \frac{1}{n}\sum_{i=1}^n \left ( y_i - (a + bx_i) \right )^2?
$$
The solution is $b = r_{xy} \frac{s_y}{s_x}$ and $a = \bar{y} - b \bar{x}$.
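
As a numerical sanity check (not a proof), the closed-form slope and intercept can be compared with a generic least-squares fit. The sketch below assumes NumPy; the helper name `least_squares_line` and the data are made up for illustration, and `np.polyfit` is used only as an independent reference.

```python
import numpy as np

def least_squares_line(x, y):
    """Return (a, b) for the least-squares line y = a + b x,
    using b = r * s_y / s_x and a = ybar - b * xbar."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    s_x, s_y = x.std(), y.std()          # NumPy's default divides by n
    r = np.mean((x - x.mean()) / s_x * (y - y.mean()) / s_y)
    b = r * s_y / s_x
    a = y.mean() - b * x.mean()
    return a, b

# Made-up data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

a, b = least_squares_line(x, y)

# np.polyfit returns coefficients from highest degree to lowest: (slope, intercept)
b_ref, a_ref = np.polyfit(x, y, deg=1)

print(a, b)
print(a_ref, b_ref)
```

For these data the two approaches agree up to rounding: the intercept is $0.6$ and the slope is $0.8$.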

_Proof._
The MSE is 
$$
  \mbox{MSE} = \frac{1}{n}\sum_{i=1}^n \left ( y_i - (a + bx_i) \right )^2 = 
  \frac{1}{n}\sum_{i=1}^n \left ( y_i^2 - 2y_i(a + bx_i) + (a+bx_i)^2 \right ) =
  \frac{1}{n}\sum_{i=1}^n \left ( y_i^2 - 2y_i(a + bx_i) + a^2+2abx_i + b^2x_i^2 \right ).
$$

This is quadratic in both $a$ and $b$, and has positive leading coefficients in both.
Hence, its minimum occurs at a stationary point with respect to both $a$ and $b$.
We can differentiate inside the sum and solve for the stationary point:

$$
   0 = \frac{\partial \mbox{MSE}}{\partial b} = -\frac{2}{n} \sum_{i=1}^n (y_i - (a+b x_i))\cdot x_i.
$$

Multiplying by $-n/2$ and expanding the sum gives

$$
   0 = \sum_{i=1}^n y_i x_i - a \sum_{i=1}^n x_i - b \sum_{i=1}^n x_i^2.
$$
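
The derivation can be completed along the following lines. This is a sketch that uses only the definitions in the notation section; the shorthand $\overline{xy} \equiv \frac{1}{n}\sum_{i=1}^n x_i y_i$ and $\overline{x^2} \equiv \frac{1}{n}\sum_{i=1}^n x_i^2$ is introduced here for convenience. Setting the derivative with respect to $a$ to zero gives

$$
   0 = \frac{\partial \mbox{MSE}}{\partial a} = -\frac{2}{n} \sum_{i=1}^n (y_i - (a+b x_i)) = -2(\bar{y} - a - b\bar{x}),
$$

so $a = \bar{y} - b\bar{x}$. Dividing the stationarity condition for $b$ by $n$ and substituting this value of $a$ gives

$$
   0 = \overline{xy} - (\bar{y} - b\bar{x})\bar{x} - b\,\overline{x^2}
     = (\overline{xy} - \bar{x}\bar{y}) - b(\overline{x^2} - \bar{x}^2).
$$

Now $\overline{x^2} - \bar{x}^2 = \mbox{var} x = s_x^2$ and $\overline{xy} - \bar{x}\bar{y} = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = r_{xy} s_x s_y$, so, provided $s_x \ne 0$,

$$
   b = \frac{r_{xy} s_x s_y}{s_x^2} = r_{xy} \frac{s_y}{s_x}.
$$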

## Multiple Regression

Multiple linear regression is a commonly used tool in most branches of science, as well as economics.
It is frequently misinterpreted.

We will start with an introduction to the linear algebra of constructing the least-squares estimate for
linear regression, then discuss the features and limitations of regression, especially the limitations
of drawing causal inferences from regression models.

### Notation

The basic relationship for linear regression is the equation

$$
   Y = X \beta + \epsilon.
$$

Here, $Y$ is an $n$-vector of data, called the _dependent variable_, the _response_, the _explained variables_,
the _modeled variables_, or the _left-hand side_.
The matrix $X \in {\mathcal M}(n,p)$ is $n$ by $p$, with $n \ge p$ and $\mbox{rank}(X)=p$ (full rank),
so that the columns of $X$ are linearly independent.

The vector $\beta$ is a $p$-vector of _parameters_, _model parameters_, or _coefficients_.  
The usual goal of regression is to estimate $\beta$.

The vector $\epsilon$ is an $n$-vector of _error_, _disturbance_, or _noise_.

We will usually assume that $\epsilon$ is random, which makes $Y$ random as well.
We will usually treat $X$ as fixed rather than random.

There is an observation $Y_i$ for each _unit_ of observation, a row of $X$ for each observation,
and a column of $X$ for each parameter (element of $\beta$).
The columns correspond to _explanatory variables_, _independent variables_, _covariates_, _control variables_,
or _right-hand side variables_.

We have observations of $Y$, which are _assumed_ to be values of $X\beta + \epsilon$.
The value of $\beta$ is unknown.
The value of $\epsilon$ cannot be observed.

The standard assumption in multiple regression is that the noise terms $\epsilon$ are random, with

+ $\{\epsilon_i\}$ iid (independent, identically distributed)
+ ${\mathbb E}\epsilon_i = 0$
+ $\mbox{var}\epsilon_i = \sigma^2$, generally an unknown constant

It is also standard to assume that if $X$ is random, $X$ and $\epsilon$ are independent.
Regardless, the value of $X$ is observed.

Since $X$ is observed, we can find $\mbox{rank}(X)$.
But there is no way to test whether the main assumptions are true, that is, whether
+ $Y = X\beta + \epsilon$
+ $\{ \epsilon_i \}$ are iid with mean 0 and finite variance
+ $X$ and $\epsilon$ are independent.

It is common to _condition_ on $X$; that is, to treat it as fixed rather than random.
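
To make the notation concrete, here is a minimal simulation sketch of data that satisfy the assumptions above by construction. Everything specific in it is an assumption chosen for illustration: the sizes `n` and `p`, the value of `beta`, the value of `sigma`, the way `X` is generated, and the choice of a normal distribution for the errors (the assumptions require only iid errors with mean 0 and variance $\sigma^2$).

```python
import numpy as np

rng = np.random.default_rng(12345)   # arbitrary seed, for reproducibility

n, p = 50, 3                         # n observations, p parameters (n >= p)

# Design matrix: a column of ones plus two made-up covariates.
# X is generated once and then treated as fixed.
X = np.column_stack([
    np.ones(n),
    rng.uniform(0, 10, size=n),
    rng.uniform(-1, 1, size=n),
])

beta = np.array([1.0, 2.0, -0.5])    # hypothetical "true" parameter vector
sigma = 1.5                          # hypothetical error SD

# iid errors with mean 0 and variance sigma^2 (normality is an extra assumption)
epsilon = rng.normal(0.0, sigma, size=n)

# The regression equation: Y = X beta + epsilon
Y = X @ beta + epsilon

# X is observed, so its rank can be checked directly
rank = np.linalg.matrix_rank(X)
print(X.shape, Y.shape, rank)
```

In real applications only `X` and `Y` are available; `beta`, `sigma`, and `epsilon` are visible here only because the data were simulated.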

## Ordinary least squares

The ordinary least squares (OLS) estimate of $\beta$ is

$$ 
\hat{\beta} \equiv \left ( X'X  \right )^{-1}X' Y.
$$

The estimate $\hat{\beta}$ is a $p$-vector, just like $\beta$.

The _residuals_ or _fitting errors_ are
$$
  e \equiv Y - X \hat{\beta}.
$$
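
A minimal NumPy sketch of these two formulas follows. The helper name `ols` and the toy data are made up for illustration. The sketch solves the normal equations $X'X \hat{\beta} = X'Y$ rather than forming $(X'X)^{-1}$ explicitly, which gives the same estimate when $X$ has full rank; `np.linalg.lstsq` is used as an independent check.

```python
import numpy as np

def ols(X, Y):
    """Return the OLS estimate beta_hat = (X'X)^{-1} X'Y and the residuals e.
    Assumes X has full column rank."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Solve X'X beta_hat = X'Y instead of inverting X'X
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    e = Y - X @ beta_hat
    return beta_hat, e

# Toy data, for illustration only: an intercept column and one covariate
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
Y = np.array([1.0, 3.0, 2.0, 5.0])

beta_hat, e = ols(X, Y)

# Independent check using NumPy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(beta_hat)    # estimate from the normal equations
print(beta_lstsq)  # estimate from np.linalg.lstsq
print(e)           # residuals
```

For these toy data both components of $\hat{\beta}$ equal $1.1$, up to rounding.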

<hr />
__Theorem.__

1. $e \perp X$ (i.e., $X'e = {\mathbf 0}$: the fitting errors are orthogonal to $X$)
1. $\min_{\gamma \in \Re^p} \| Y - X\gamma\|^2 = \| Y - X \hat{\beta}\|^2$ ($\hat{\beta}$ solves the least squares fitting problem)
<hr />

This is a special case of the _Projection Theorem_.

__Proof.__

We prove the first part by direct calculation:
$X'e = X' (Y - X \hat{\beta})$, so

$$ \left ( X'X \right )^{-1} X' e = \left ( X'X \right )^{-1} X' Y - \left ( X'X \right )^{-1} X'X \hat{\beta}
= \hat{\beta} - \hat{\beta} = {\mathbf 0}.
$$

Since $\left ( X'X \right )^{-1}$ is invertible, it follows that $X'e = {\mathbf 0}$.

For the second part, write $\gamma = \hat{\beta} + (\gamma - \hat{\beta})$.
Then

$$
\| Y - X \gamma \|^2 = \| Y - X \hat{\beta} - X(\gamma - \hat{\beta}) \|^2 = \| e - X (\gamma - \hat{\beta})\|^2
$$
$$
= \| e \|^2 - 2 e' X(\gamma - \hat{\beta}) + \| X(\gamma - \hat{\beta}) \|^2
= \| e \|^2 + \| X(\gamma - \hat{\beta}) \|^2
\ge \| e \|^2
$$
since $e'X = {\mathbf 0}$.
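
Both parts of the theorem are purely algebraic: they hold for any $Y$ and any full-rank $X$, whether or not the regression model is true. The sketch below illustrates this numerically with arbitrary random inputs (all sizes, seeds, and variable names are made up for the example). It is a check on one data set, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)       # arbitrary seed

n, p = 20, 3
X = rng.normal(size=(n, p))          # arbitrary design matrix (full rank with probability 1)
Y = rng.normal(size=n)               # arbitrary data; no model is assumed

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta_hat

# Part 1: the residuals are orthogonal to every column of X
orth = X.T @ e

# Part 2: no other gamma gives a smaller sum of squared fitting errors.
# Try 1000 random perturbations of beta_hat.
rss_hat = np.sum(e ** 2)
gammas = beta_hat + rng.normal(size=(1000, p))
rss_gamma = np.sum((Y[:, None] - X @ gammas.T) ** 2, axis=0)

# The proof shows ||Y - X gamma||^2 = ||e||^2 + ||X (gamma - beta_hat)||^2
excess = np.sum((X @ (gammas - beta_hat).T) ** 2, axis=0)

print(np.max(np.abs(orth)))          # near zero, up to rounding error
print(np.all(rss_gamma >= rss_hat))
print(np.allclose(rss_gamma, rss_hat + excess))
```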

<hr />
__Theorem.__
The ordinary least squares estimate $\hat{\beta}$ is _conditionally unbiased_: ${\mathbb E} (\hat{\beta} | X) = \beta$.

__Proof.__
By calculation:

$$ 
\hat{\beta} = (X'X)^{-1} X'Y = (X'X)^{-1} X'(X\beta + \epsilon)
$$
$$
= (X'X)^{-1}X'X\beta + (X'X)^{-1}X'\epsilon
= \beta + (X'X)^{-1}X' \epsilon.
$$

Now,
$$ 
{\mathbb E} \left ( (X'X)^{-1} X' \epsilon | X \right ) = (X'X)^{-1} X' {\mathbb E}(\epsilon | X).
$$

Since $\epsilon$ and $X$ are independent, 
$$
   {\mathbb E} (\epsilon | X) = {\mathbb E}\epsilon = 0.
$$

Hence,
$$
{\mathbb E} (\hat{\beta}|X) = {\mathbb E}(\beta + (X'X)^{-1}X'\epsilon | X) = \beta + {\mathbf 0} = \beta.
$$
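
Conditional unbiasedness can be illustrated by simulation: hold $X$ fixed, draw many independent error vectors, and average the resulting estimates. The sketch below is an illustration under stated assumptions, not part of the proof. The design matrix, the value of `beta`, the value of `sigma`, the number of replications, and the normal error distribution are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(2024)    # arbitrary seed

n, reps = 30, 20000
X = np.column_stack([np.ones(n), np.linspace(0, 1, n)])   # fixed design matrix
beta = np.array([2.0, -1.0])         # hypothetical true parameter
sigma = 1.0                          # hypothetical error SD

# A = (X'X)^{-1} X', a p-by-n matrix, so that beta_hat = A Y
A = np.linalg.solve(X.T @ X, X.T)

# Each row of eps is one independent draw of the n-vector of errors
eps = rng.normal(0.0, sigma, size=(reps, n))

# Each row of Y is one simulated data set with the same X
Y = X @ beta + eps

# Each row of beta_hats is the OLS estimate from one simulated data set
beta_hats = Y @ A.T

avg = beta_hats.mean(axis=0)
print(avg)   # should be close to beta, up to simulation error
```

Individual estimates vary around $\beta$ from one simulated data set to the next; unbiasedness says only that their average over repeated draws of $\epsilon$ is $\beta$.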

## Computational example

We drop an object from an unknown height $h$.
We measure the height $Y$ of the object at times $0$, $1$, $2$, and $3$ seconds.