# ECON5280 Chapter 4 Statistics and Regression

<font size="5">Junlong Feng</font>

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/junlong-feng/econ5280/main?filepath=Lecture4_Stats_Reg.ipynb)

## Outline

* Motivation: Many methods for causal inference we study this semester can be transformed into linear models. We thus want to know how to estimate the unknown parameters in those models.
* Estimation of coefficients in a linear model: Method of moments (MM) estimator, unbiasedness, consistency, and asymptotic normality.
* Inference: Testing and confidence interval.

## 1. Estimation and Inference

<font size="2">  *Throughout the semester, we adopt a frequentist perspective. If you're a Bayesian, please bear with me, as some techniques are also useful to Bayesians.*</font>

A statistician/econometrician (frequentist) views the world from the following perspective: 

* There are some random variables $(Y,X,\varepsilon)$. They have the following relationship:
  $$
  Y=g(X,\varepsilon)
  $$
  We call $g$, together with the joint distribution of $(X,\varepsilon)$, the data generating process (DGP) of $Y$.

* Nature applies this DGP to every entity $i$ (for example, every person). 

* The statistician/econometrician (i.e., us) collects an i.i.d. sample $\{(Y_{i},X_{i}):i=1,...,n\}$. However, we observe neither $\varepsilon_{i}$ nor the function $g$.

* Our goal is to design tools that use the random sample to *guess* the unknown function $g$ or some functionals about it.

Suppose we are interested in a functional of $g$, called $\theta$. For instance, $X\in\{0,1\}$ and $\theta$ is the average treatment effect: $\theta\equiv \mathbb{E}(g(1,\varepsilon)-g(0,\varepsilon))$. An econometrician needs to do three things:

* Identification of $\theta$ (population level): Is $\theta$ unique?
  * If two different values of $\theta$ can generate the same population distribution, there is no hope of telling which one is the truth.
* Estimation (sample level): Given a random sample, can we construct a guess ($\hat{\theta}$) that approximates the unknown $\theta$?
* Inference (sample level): Two styles. 
  * Testing: I know $\theta\neq \hat{\theta}$ but given $\hat{\theta}$, can I at least reject some crazy values for $\theta$?
    * Suppose I get $\hat{\theta}=2$ and I know my guess is good. Then it does not make much sense to believe $\theta=10000$.
  * Confidence set: Instead of having a single number $\hat{\theta}$, I want a range constructed from my sample such that I know that the range contains the true unknown $\theta$ with high probability.
    * This is possible even without a sample; I know for sure $\theta \in (-\infty,\infty)$.
    * But that is not useful. We'll develop better tools to get a more informative range.

## 2. Linear Models and Least Squares

Throughout this chapter, we focus on the following linear model for an i.i.d. sample $\{Y_{i},X_{i}:i=1,...,n\}$:
$$
Y_{i}=X_{i}'\beta+\varepsilon_{i},\ \ \ \ \mathbb{E}(X_{i}\varepsilon_{i})=0.
$$
Or, stacking all $n$ observations into columns:
$$
Y=X\beta+\varepsilon,\ \ \ \ \mathbb{E}(X_{i}\varepsilon_{i})=0,\forall i.
$$

* $X_{i}$ is a $k\times 1$ vector which usually contains the constant $1$. 
  * Including the constant 1 implies $\mathbb{E}(\varepsilon_{i})=0$, because $\mathbb{E}(\varepsilon_{i})$ is then one of the components of $\mathbb{E}(X_{i}\varepsilon_{i})=0$.
* $X$ is an $n\times k$ matrix whose $i$-th row is $X_{i}'$. $Y$ and $\varepsilon$ are $n\times 1$ vectors.
* In this chapter, we don't care about the meaning of $\beta$ and only study how to estimate it.

### 2.1 An MM Estimator

The only conditions we can use are the model and $\mathbb{E}(X_{i}\varepsilon_{i})=0$. Recall that $\mathbb{E}$ is the **population** mean. Why don't we use a **sample analogue** to mimic it? WLLN says that the sample average approximates expectation very well. So our chain of logic runs as follows:

* Our model tells us that $\mathbb{E}(X_{i}Y_{i}-X_{i}X_{i}'\beta)=0$. This is called a **moment condition** or **moment equation** for $\beta$.  Expanding it, we have

$$
\mathbb{E}(X_{i}X_{i}')\beta=\mathbb{E}(X_{i}Y_{i})\implies\beta=[\mathbb{E}(X_{i}X_{i}')]^{-1}\mathbb{E}(X_{i}Y_{i}),
$$

provided that $\mathbb{E}(X_{i}X_{i}')$ is full-rank; $\mathbb{E}(X_{i}X_{i}')$ being full-rank is called the *identification* condition for $\beta$. It has to be satisfied in order that $\beta$ is uniquely determined by the population distribution of $(Y_{i},X_{i})$. $\mathbb{E}(X_{i}X_{i}')$ being full-rank has another name in plain words: perfect multicollinearity does not exist. Perfect multicollinearity means that there exists one variable in vector $X_{i}$ that can be written as a linear combination of the other variables.

  * For instance, let $X_{i}=(1,M_{i},F_{i})'$, where $M_{i}$ and $F_{i}$ are dummies for male and female respectively. Assuming there are only two genders, $M_{i}+F_{i}$ is always equal to $1$. Then

$$
\mathbb{E}\begin{pmatrix}X_{i}X_{i}'\end{pmatrix}
=\mathbb{E}\begin{pmatrix}1&1-F_{i}&F_{i}\\
1-F_{i}&1-2F_{i}+F_{i}^{2}&F_{i}-F_{i}^{2}\\
F_{i}&F_{i}-F_{i}^{2}&F_{i}^{2}\end{pmatrix}.
$$

It is clear that column 2 plus column 3 is equal to column 1, so $\mathbb{E}\begin{pmatrix}X_{i}X_{i}'\end{pmatrix}$ is NOT full rank. This is called a *dummy variable trap*. Dropping any of the 3 variables (1, $M$ or $F$) solves the problem.
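
The rank failure can also be seen numerically. The following is a minimal sketch on simulated data; the variable names (`Fem`, `Mal`, `X_trap`, `X_ok`) and the 50/50 gender split are assumptions made only for this illustration.

```r
# Hypothetical illustration: a regressor vector (1, M_i, F_i) with M_i + F_i = 1.
set.seed(1)
n <- 200
Fem <- rbinom(n, 1, 0.5)   # female dummy (called Fem because F is shorthand for FALSE in R)
Mal <- 1 - Fem             # male dummy
X_trap <- cbind(cons = 1, Mal, Fem)

# Sample analogue of E(X_i X_i'): a 3 x 3 matrix whose rank is only 2.
Sxx_trap <- crossprod(X_trap) / n
rank_trap <- qr(Sxx_trap)$rank

# Dropping one of the three variables restores full rank (here: 2 columns, rank 2).
X_ok <- cbind(cons = 1, Fem)
rank_ok <- qr(crossprod(X_ok) / n)$rank

print(c(rank_trap = rank_trap, rank_ok = rank_ok))
```

Calling `solve(Sxx_trap)` would fail because the matrix is singular, which is the sample counterpart of the identification failure.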

*  Now coming back to (4), the population formula for $\beta$: we cannot directly compute $\mathbb{E}$ because we do not know the distribution. Instead, we use a sample analogue: replace the population mean $\mathbb{E}$ by its sample counterpart $\frac{1}{n}\sum_{i=1}^{n}$. After the replacement, it is no longer the true $\beta$ that satisfies the equation. We call the value that satisfies the sample version of the moment condition $\hat{\beta}$:

$$
  \begin{align*}
  \hat{\beta}\equiv &\left[\frac{1}{n}\sum_{i=1}^{n}X_{i}X_{i}'\right]^{-1}\left[\frac{1}{n}\sum_{i=1}^{n}X_{i}Y_{i}\right]\\
  =& (X'X)^{-1}(X'Y).
  \end{align*}
$$

* $\hat{\beta}$ is called the **method of moments (MM) estimator** of $\beta$.  
  * An estimator is **random** because it's a function of random variables.
  * This estimator has another name: Ordinary Least Squares (OLS). That name comes from a different perspective (developed independently by Legendre and Gauss about 200 years ago) which is widely adopted in traditional econometric texts. We did not derive OLS in that way since it's not general enough to nest other estimators we'll learn this semester.

**IMPORTANT**. The MM estimators form a large **class of estimators** which is the most powerful technique developed by statisticians and econometricians. Its general steps are: i) construct moment equations, ii) replace $\mathbb{E}$ with the sample analogue, and iii) solve the sample version of the moment equations.  You'll see it's so powerful that it handles nearly all econometric problems, no matter how complicated they are. And in doing so you don't even need to think. Just do steps i), ii) and iii) and you are done.
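
To see the three steps at work outside the linear model, here is a small sketch that estimates a mean and a variance by MM. The data, the true values 3 and 4, and the names `mu_hat` and `sigma2_hat` are made up for illustration.

```r
set.seed(2)
y <- rnorm(1000, mean = 3, sd = 2)   # hypothetical data; in practice mu and sigma^2 are unknown

# Step i): moment equations E(Y - mu) = 0 and E((Y - mu)^2 - sigma2) = 0.
# Step ii): replace E with the sample average.
# Step iii): solve the two sample equations for mu and sigma2.
mu_hat <- mean(y)
sigma2_hat <- mean((y - mu_hat)^2)

print(c(mu_hat = mu_hat, sigma2_hat = sigma2_hat))
```

Note that `sigma2_hat` divides by $n$, not $n-1$, because step ii) uses the plain sample average.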

In [None]:
### Verify that the function lm in R gives the same estimates as the formula
set.seed(5280) # fix the seed so that the numbers are reproducible
n=100; e=rnorm(n,0,1); x=rnorm(n,0,1); y=0.5+2*x+e; cons=rep(1,n)
X=cbind(cons,x); y=matrix(y,n,1)
## Method 1: Use formula
bhat=solve(t(X)%*%X)%*%(t(X)%*%y)
print(bhat)
## Method 2: Use R routine
model=lm(y~x)
summary(model) # the Estimate column should match bhat

#### 2.1.1 Fitted Values and Residuals

The linear model essentially estimates a *line* (or, more accurately, a hyperplane) to fit the data points of $Y$. We use the following terminology:

* **Fitted or predicted value**: $\hat{Y}\equiv X\hat{\beta}=X(X'X)^{-1}X'Y\equiv PY$.
* **Residual**: $\hat{\varepsilon}\equiv Y-X\hat{\beta}=(I-X(X'X)^{-1}X')Y\equiv MY$.
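
A quick numerical check of these definitions, as a sketch on simulated data (the DGP and all object names are assumptions for this example). It also checks two standard facts: $P$ is idempotent ($PP=P$), and the residuals are orthogonal to the regressors ($X'\hat{\varepsilon}=0$).

```r
set.seed(3)
n <- 50
x <- rnorm(n); e <- rnorm(n)
y <- 0.5 + 2 * x + e
X <- cbind(cons = 1, x)

P <- X %*% solve(crossprod(X)) %*% t(X)   # P = X (X'X)^{-1} X'
M <- diag(n) - P                          # M = I - P

yhat <- P %*% y   # fitted values
ehat <- M %*% y   # residuals

# Compare with lm and check the algebraic properties (all should be numerically zero).
fit <- lm(y ~ x)
err_fitted <- max(abs(yhat - fitted(fit)))
err_resid  <- max(abs(ehat - resid(fit)))
err_idem   <- max(abs(P %*% P - P))
err_orth   <- max(abs(crossprod(X, ehat)))
print(c(err_fitted, err_resid, err_idem, err_orth))
```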

## 3. Statistical Properties of OLS

We are interested in three properties: unbiasedness, consistency, and asymptotic normality. The first two are related to how good the point estimate is. The third is for inference which we'll motivate in the next section. We've seen the definitions of consistency and asymptotic normality in Chapter 3. We now first define and show unbiasedness.

### 3.1 Unbiasedness

**Definition**. An estimator $\hat{\beta}$ is an unbiased estimator of a parameter $\beta$ if $\mathbb{E}(\hat{\beta})=\beta$.

* The expectation here makes sense because, recall, estimators are random.

* We cannot calculate the expectation directly because the distribution of $\hat{\beta}$ is determined by the distribution of the sample, which we do not know.

* The meaning of unbiasedness is different from consistency. Unbiasedness is a finite sample notion which does not need $n\to\infty$. 

Unbiasedness of OLS holds under a stronger assumption than (2): $\mathbb{E}(\varepsilon_{i}|X_{i})=0$.

Proof. We are done once we show that $\mathbb{E}(\hat{\beta}|X)=\beta$ because then $\mathbb{E}(\hat{\beta})=\mathbb{E}[\mathbb{E}(\hat{\beta}|X)]=\beta$ by the law of iterated expectations (LIE).

$$
\begin{align*}
\mathbb{E}(\hat{\beta}|X)=\mathbb{E}(\beta+(X'X)^{-1}X'\varepsilon|X)=\beta+(X'X)^{-1}X'\mathbb{E}(\varepsilon|X)=\beta.
\end{align*}
$$
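
The first equality in the proof relies on a decomposition of $\hat{\beta}$ that is worth writing out once. Substituting the model $Y=X\beta+\varepsilon$ into the estimator formula gives

$$
\begin{align*}
\hat{\beta}=(X'X)^{-1}X'Y=(X'X)^{-1}X'(X\beta+\varepsilon)=\beta+(X'X)^{-1}X'\varepsilon.
\end{align*}
$$

The last step of the proof uses $\mathbb{E}(\varepsilon|X)=0$. This follows from $\mathbb{E}(\varepsilon_{i}|X_{i})=0$ because the observations are i.i.d., so $\mathbb{E}(\varepsilon_{i}|X_{1},...,X_{n})=\mathbb{E}(\varepsilon_{i}|X_{i})$.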

In [None]:
### Verify Unbiasedness
n=100; nrep=500 # Create nrep samples and in each sample estimate beta=(0.5,2).
bhat=matrix(0,2,nrep)
for (sim in 1:nrep){
e=rnorm(n,0,1); x=rnorm(n,0,1); y=0.5+2*x+e; cons=rep(1,n)
X=cbind(cons,x); y=as.matrix(y,n,1)
bhat[,sim]=solve(t(X)%*%X)%*%(t(X)%*%y)
}
rowMeans(bhat) # Take average which mimics E(bhat).

### 3.2 Consistency

By $X'X=\sum_{i}X_{i}X_{i}'$ and $X'\varepsilon=\sum_{i}X_{i}\varepsilon_{i}$, we have 
$$
\begin{align*}
\hat{\beta}=&\beta+\left(\sum_{i}X_{i}X_{i}'\right)^{-1}\left(\sum_{i}X_{i}\varepsilon_{i}\right)\\
=&\beta+\left(\frac{1}{n}\sum_{i}X_{i}X_{i}'\right)^{-1}\left(\frac{1}{n}\sum_{i}X_{i}\varepsilon_{i}\right).
\end{align*}
$$
We establish consistency by the following argument:

* By WLLN, $\frac{1}{n}\sum_{i}X_{i}X_{i}'\to_{p}\mathbb{E}(X_{i}X_{i}')$ and $\frac{1}{n}\sum_{i}X_{i}\varepsilon_{i}\to_{p}\mathbb{E}(X_{i}\varepsilon_{i})$.
* By invertibility of $\mathbb{E}(X_{i}X_{i}')$ and by continuous mapping theorem, $(\frac{1}{n}\sum_{i}X_{i}X_{i}')^{-1}\to_{p}[\mathbb{E}(X_{i}X_{i}')]^{-1}$.
* By continuous mapping theorem again, $(\frac{1}{n}\sum_{i}X_{i}X_{i}')^{-1}(\frac{1}{n}\sum_{i}X_{i}\varepsilon_{i})\to_{p}[\mathbb{E}(X_{i}X_{i}')]^{-1}\mathbb{E}(X_{i}\varepsilon_{i})$, where the right hand side is 0 by $\mathbb{E}(X_{i}\varepsilon_{i})=0$.
* Therefore, $\hat{\beta}\to_{p}\beta$.

In [None]:
### (Not rigorously) Verify Consistency
set.seed(5280)
n0=c(50,100,1000); # Create three samples with n=50, 100, 1000.
diff=matrix(0,2,3)
for (sim in 1:3){
  n=n0[sim]
e=rnorm(n,0,1); x=rnorm(n,0,1); y=0.5+2*x+e; cons=rep(1,n)
X=cbind(cons,x); y=as.matrix(y,n,1)
diff[,sim]=solve(t(X)%*%X)%*%(t(X)%*%y)-matrix(c(0.5,2),2,1) # difference between bhat and b
}
colSums(abs(diff))

### 3.3 Asymptotic Normality

Again, by $\hat{\beta}-\beta=\left(\frac{1}{n}\sum_{i}X_{i}X_{i}'\right)^{-1}\left(\frac{1}{n}\sum_{i}X_{i}\varepsilon_{i}\right)$, we have
$$
\begin{align*}
\sqrt{n}(\hat{\beta}-\beta)=&\left(\frac{1}{n}\sum_{i}X_{i}X_{i}'\right)^{-1}\left(\frac{1}{\sqrt{n}}\sum_{i}X_{i}\varepsilon_{i}\right)\\
\text{(by i.i.d. and by $\mathbb{E}(X_{i}\varepsilon_{i})=0$)  }=&\left(\frac{1}{n}\sum_{i}X_{i}X_{i}'\right)^{-1}\left(\sqrt{n}\cdot\left[\frac{1}{n}\sum_{i}X_{i}\varepsilon_{i}-\mathbb{E}(X_{i}\varepsilon_{i})\right]\right).\\
\end{align*}
$$
We establish asymptotic normality by the following argument:

* By WLLN, CMT, and invertibility of $\mathbb{E}(X_{i}X_{i}')$, $\left(\frac{1}{n}\sum_{i}X_{i}X_{i}'\right)^{-1}\to_{p}[\mathbb{E}(X_{i}X_{i}')]^{-1}$.
* By CLT, $\sqrt{n}\cdot\left[\frac{1}{n}\sum_{i}X_{i}\varepsilon_{i}-\mathbb{E}(X_{i}\varepsilon_{i})\right]\to_{d}N(0,V(X_{i}\varepsilon_{i})).$
* By $\mathbb{E}(X_{i}\varepsilon_{i})=0$, $V(X_{i}\varepsilon_{i})=\mathbb{E}(\varepsilon_{i}^{2}X_{i}X_{i}')$.
* By CMT (Slutsky's theorem), $\sqrt{n}(\hat{\beta}-\beta)\to_{d}N(0,[\mathbb{E}(X_{i}X_{i}')]^{-1}\mathbb{E}(\varepsilon_{i}^{2}X_{i}X_{i}')[\mathbb{E}(X_{i}X_{i}')]^{-1})$.
* For simplicity we denote $\Sigma\equiv [\mathbb{E}(X_{i}X_{i}')]^{-1}\mathbb{E}(\varepsilon_{i}^{2}X_{i}X_{i}')[\mathbb{E}(X_{i}X_{i}')]^{-1}$.

Asymptotic normality says, although we know nothing about the distribution of data $\{(Y_{i},X_{i}):i=1,...,n\}$, and thus know nothing about the exact distribution of $\hat{\beta}$, we can approximate its distribution by $N(\beta,\Sigma/n)$ when $n$ is large.

* The variance $\Sigma$ is called the **asymptotic variance** of $\hat{\beta}$. 
* We usually call a formula like $\Sigma$ a **sandwich formula**: $[\mathbb{E}(X_{i}X_{i}')]^{-1}$ is the bread and $\mathbb{E}(\varepsilon_{i}^{2}X_{i}X_{i}')$ is the meat.
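
The normal approximation can be checked (not rigorously) by simulation. The sketch below assumes a specific design, $x_{i}\sim N(0,1)$ independent of $\varepsilon_{i}\sim N(0,1)$ with true slope 2, in which $\mathbb{E}(X_{i}X_{i}')=I$ and $\mathbb{E}(\varepsilon_{i}^{2}X_{i}X_{i}')=I$, so $\Sigma=I$. The object names are hypothetical.

```r
set.seed(4)
n <- 200; nrep <- 2000
z_slope <- numeric(nrep)   # stores sqrt(n) * (slope estimate - true slope)
for (r in 1:nrep) {
  x <- rnorm(n); e <- rnorm(n)
  y <- 0.5 + 2 * x + e
  X <- cbind(1, x)
  bhat <- solve(crossprod(X), crossprod(X, y))
  z_slope[r] <- sqrt(n) * (bhat[2] - 2)
}

# If the approximation is good, z_slope behaves like a draw from N(0, 1).
mc_mean <- mean(z_slope)               # should be near 0
mc_var  <- var(z_slope)                # should be near Sigma_22 = 1
mc_tail <- mean(abs(z_slope) > 1.96)   # should be near 0.05
print(c(mean = mc_mean, var = mc_var, tail = mc_tail))
```

A histogram, `hist(z_slope, freq = FALSE)` with `curve(dnorm(x), add = TRUE)` on top, gives a visual version of the same check.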

#### 3.3.1 Estimating $\Sigma$

The asymptotic variance is unknown. It would be useful to estimate it. There are two difficulties: i) $\mathbb{E}$ is not computable because again, we don't know the distribution of data, ii) $\varepsilon_{i}$ is unknown. For i), we know how to deal with it: We can simply replace all $\mathbb{E}$ by sample averages and WLLN guarantees consistency. For ii), we can replace $\varepsilon_{i}$ with the residuals $\hat{\varepsilon}_{i}$.

* **Asymptotic variance estimator**: $\hat{\Sigma}\equiv (\sum_{i}X_{i}X_{i}'/n)^{-1}(\sum_{i}X_{i}X_{i}'\hat{\varepsilon}_{i}^{{2}}/n)(\sum_{i}X_{i}X_{i}'/n)^{-1}$.
  * It can be shown that $\hat{\Sigma}\to_{p}\Sigma$.

**Standard error of $\hat{\beta}$**.

* The diagonal elements in $\hat{\Sigma}$ are estimators of the asymptotic variances of each $\hat{\beta}_{j}$ in the $k\times 1$ vector $\hat{\beta}$. Denote them by $\hat{\Sigma}_{jj}$s.
* We call $\sqrt{\hat{\Sigma}_{jj}}/\sqrt{n}$ the **standard error** of $\hat{\beta}_{j}$, denoted by $se_{j}$. **Important**: $se_{j}\to_{p} 0$.
* By consistency of $\hat{\Sigma}$ and the asymptotic distribution of $\hat{\beta}$, we have $(\hat{\beta}_{j}-\beta_{j})/se_{{j}}\to_{d}N(0,1)$. This is the foundation of inference.
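
The estimator $\hat{\Sigma}$ and the standard errors can be computed by hand in a few lines. This is a sketch under an assumed heteroskedastic DGP (the error variance $0.5+x_{i}^{2}$ is made up for illustration); `bread` and `meat` are just names for the two pieces of the sandwich.

```r
set.seed(5)
n <- 500
x <- rnorm(n)
e <- rnorm(n) * sqrt(0.5 + x^2)   # hypothetical heteroskedastic error with E(e | x) = 0
y <- 0.5 + 2 * x + e
X <- cbind(cons = 1, x)

bhat <- solve(crossprod(X), crossprod(X, y))
ehat <- as.vector(y - X %*% bhat)          # residuals replace the unknown errors

bread <- solve(crossprod(X) / n)           # (sum_i X_i X_i' / n)^{-1}
meat  <- crossprod(X * ehat) / n           # sum_i X_i X_i' ehat_i^2 / n
Sigma_hat <- bread %*% meat %*% bread
se <- sqrt(diag(Sigma_hat)) / sqrt(n)      # standard errors of each coefficient

print(cbind(estimate = as.vector(bhat), se = se))
```

If the `sandwich` package is installed, `sqrt(diag(vcovHC(lm(y ~ x), type = "HC0")))` should give the same numbers, since HC0 is this formula divided by $n$.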

## 4. Inference about $\beta$

We got $\hat{\beta}$, but we also know $\hat{\beta}\neq \beta$. We know $\hat{\beta}$ is consistent and unbiased. Suppose our estimate is 2.2. These properties only tell us that $\beta$ should be reasonably close to 2.2. But how close is close? Given an estimate equal to 2.2, if I ask you "do you think the true $\beta$ can be 2", what would you say? How about I ask "do you think the true $\beta$ can be 2000"? Alternatively, I may ask you to give me a reasonable range for $\beta$ given the point estimate is 2.2. 

It turns out the first two questions can be answered by testing, and the third can be answered by a confidence interval.

### 4.1 Testing

Setup:

* A candidate value of $\beta_{j}$ is called a **null hypothesis**. Let's denote it by $\mathbb{H}_{0}:\beta_{j}=\beta^{0}_{j}$, where $\beta^{0}_{j}$ is a **known** number.
* Against the null, we can form an **alternative hypothesis**, denoted by $\mathbb{H}_{1}:\beta_{j}\neq \beta^{0}_{j}$.
  * This is a two-sided hypothesis. We only consider such hypotheses in this semester.
* Based on $\hat{\beta}_{j}$, we decide whether or not to reject the null hypothesis. But we may of course make a mistake. There are two kinds of mistakes:
  * Type I error: Reject $\mathbb{H}_{0}$ but $\mathbb{H}_{0}$ is true. 
  * Type II error: Fail to reject $\mathbb{H}_{0}$ but $\mathbb{H}_{0}$ is false.
  * Since $\hat{\beta}_{j}$ is random, we can, at best, control the **probability** of making these mistakes.
  * The probability of making Type I error is called **size** or **level**. One minus the probability of making Type II error is called **power**.
  * One cannot reduce size and increase power at the same time using one single approach (you'll see the reason later). Then should we care more about size or power?
  * We want to have definite control of size, or, prob. of Type I error. **This is because we usually form the hypothesis in a way that we are more afraid of Type I error**.
    * For instance, I caught a cold and I want to test whether it is flu. The test may make two kinds of mistakes: I don't have flu but the test says positive, and I have flu but the test says negative. To the hospital/society, they worry more about the latter because flu is contagious. So under this logic they will design the test to make "flu positive" as the null, and try to reject it. Then the more worrisome mistake becomes a Type I error. By controlling the probability of making Type I error, the risk is under control.
    * If the null is rejected, we say the test is **significant**.

Having this in mind, we now have an idea that we are going to build a test procedure under which we can approximate the probability of Type I error to control it. Such approximation of distribution is via CLT.

Recall that we have $(\hat{\beta}_{j}-\beta_{j})/se_{{j}}\to_{d}N(0,1)$. So, although we don't know the exact distribution of $\hat{\beta}_{j}$ or $(\hat{\beta}_{j}-\beta_{j})/se_{j}$, we know the latter can be well approximated by a standard normal distribution. Now let's see how to use it.

Logic of testing:

* For a given null value $\beta^{0}_{j}$, since we know $\hat{\beta}_{j}$ is a good estimator, we should reject the null hypothesis, i.e., reject $\beta_{j}=\beta^{0}_{j}$ if $\hat{\beta}_{j}$ is **too far** from $\beta^{0}_{j}$, or, equivalently, if $T_{n}\equiv|(\hat{\beta}_{j}-\beta^{0}_{j})/se_{j}|$ is too large. Under the null, $\beta_{j}=\beta^{0}_{j}$, so $T_{n}$ is approximately distributed as $|Z|$ with $Z\sim N(0,1)$.
* To determine how large is large, we want to find a threshold $c>0$: if $T_{n}>c$, we reject. If $T_{n}<c$, we do not reject. Then how do we choose $c$? We choose $c$ so that the size (i.e., prob. of Type I error) is controlled at $\alpha$, where $\alpha\in (0,1)$ is usually a small number, say, 0.01, 0.05, 0.1.
  * $c$ is called the **critical value**.
* By definition, $\alpha$ is the probability of rejecting while the null is true. Translating this to math, we want to find a $c$ such that $\Pr(T_{n}>c)=\alpha$. We don't know the LHS probability but by CLT we know it is approximately equal to $\Pr(Z>c\text{ or }Z<-c)$  where $Z\sim N(0,1)$. Hence, we choose $c=z_{1-\alpha/2}$, where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$-th quantile of standard normal distribution: $\Pr(Z\leq z_{1-\alpha/2})=1-\alpha/2$.
  * So the critical value is affected by $\alpha$. A smaller $\alpha$ leads to a larger critical value.
  * Commonly used critical values: 1.64 ($\alpha=0.1$), **1.96 ($\alpha=0.05$)**, and 2.58 ($\alpha=0.01$).
* From this we can see one cannot reduce size and increase power by choosing $c$ alone: a smaller $c$ makes it easier to reject (harder to accept) so more likely to make a Type I error (less likely to make a Type II error).
* Power can be increased by another parameter: $n$. Under the alternative (a Type II error can only happen when the alternative is true), $\beta_{j}\neq \beta^{0}_{j}$, so $T_{n}=|(\hat{\beta}_{j}-\beta_{j})/se_{j}+(\beta_{j}-\beta^{0}_{j})/se_{j}|$. The first term is approximately standard normal, while the second diverges because $se_{j}\to_{p}0$. Therefore, as $n\to \infty$, $T_{n}$ diverges to infinity and exceeds any fixed $c$ with probability approaching one. So we choose $c$ to control Type I error, and Type II error will be small if we have a sufficiently large sample.
  * Unfortunately, we cannot calculate Type II error for a **fixed** sample size. The finite sample power is always an issue in practice. That's why statisticians/econometricians/economists do not say "accept the null" when the test is not significant; we may make a Type II error. They just say "cannot reject the null".
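
The size and power claims above can be illustrated by simulation. The following is a sketch only: `t_stat` and `reject_rate` are hypothetical helpers written for this example, the DGP has true slope 2, and the false null value 2.2 is an arbitrary choice.

```r
# Hypothetical helper: T_n for H0: beta_j = b0, using the sandwich standard error.
t_stat <- function(y, X, j, b0) {
  n <- nrow(X)
  bhat <- solve(crossprod(X), crossprod(X, y))
  ehat <- as.vector(y - X %*% bhat)
  bread <- solve(crossprod(X) / n)
  Sigma_hat <- bread %*% (crossprod(X * ehat) / n) %*% bread
  se_j <- sqrt(Sigma_hat[j, j] / n)
  abs((bhat[j] - b0) / se_j)
}

# Hypothetical helper: rejection frequency over nrep simulated samples of size n.
reject_rate <- function(n, b0, alpha = 0.05, nrep = 1000) {
  crit <- qnorm(1 - alpha / 2)   # critical value z_{1 - alpha/2}
  rej <- logical(nrep)
  for (r in 1:nrep) {
    x <- rnorm(n); e <- rnorm(n)
    y <- 0.5 + 2 * x + e         # the true slope is 2
    rej[r] <- t_stat(y, cbind(1, x), j = 2, b0 = b0) > crit
  }
  mean(rej)
}

set.seed(6)
size_hat    <- reject_rate(n = 200, b0 = 2)     # null is true: estimates the size
power_small <- reject_rate(n = 50,  b0 = 2.2)   # null is false, small sample
power_large <- reject_rate(n = 800, b0 = 2.2)   # null is false, large sample
print(c(size = size_hat, power_n50 = power_small, power_n800 = power_large))
```

The first number should be close to $\alpha=0.05$, while the rejection frequency under the false null should rise toward one as $n$ grows.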

This test is called the (Student's) $t$ test. Historically, people assumed the data are normal, in which case the test statistic has a Student's $t$ distribution. Today, we still call it a $t$ test but no longer use the $t$ distribution to find the critical values. We never make distributional assumptions about the data but instead use an asymptotic approximation to the distribution.

**Important**. Recall that by construction $se_{j}\to 0$ as $n\to\infty$. That means, you can literally reject almost any null when $n$ is sufficiently large! Having statistical significance is not a big deal at all.

* This result looks striking, but is natural if we really understand what **significance** means. For instance, suppose our estimate is $0.01$. We want to test whether the true parameter is 0. Obviously 0.01 is not equal to 0 mathematically. But you also have to take sampling error into consideration. When your sample is small, you don't have enough info, so you're not very confident and you're afraid you got 0.01 by chance. So you say: well, maybe I cannot reject 0, because although $0.01\neq 0$, my info is not enough, so 0 may still be the true parameter. However, when you have, let's say, 100 million data points and you still get an estimate equal to 0.01, then you may confidently say that the truth is not 0, because you got this number based on such rich info.

#### 4.1.1 $p$-value

The $p$-value is closely related to testing. It's the approximate probability (**under the null hypothesis**) that the random $T_{n}$ is greater than its realized value $t$ (a nonrandom number) computed from our data set. 

Logic of $p$-value:

- If $T_{n}>t$ has a small probability under the null hypothesis, then it means $t$ must be large.
- $T_{n}$ is approximately the absolute value of a standard normal under the null hypothesis, so its realization should not be far from 0.
- Now that the realization $t$ is big, the observed data set is a rare event.
- But we don't believe that we are so unlucky: we have one data set and this data set is a rare event! 
- So we think maybe it's because the null hypothesis is false in the first place. So we **reject** the null if this probability is **too small**.

Formally, $p$-value is defined as $\Pr(|Z|>t)$ where $Z$ is standard normal and we reject at level-$\alpha$ if $\Pr(|Z|>t)<\alpha$.
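
In code, the $p$-value is a one-liner. The sketch below uses a hypothetical helper `p_value` and made-up realized values of $T_{n}$ to show that rejecting when the $p$-value is below $\alpha$ is the same decision as rejecting when $t$ exceeds the critical value.

```r
# Hypothetical helper: two-sided p-value Pr(|Z| > t) for a realized statistic t >= 0.
p_value <- function(t) 2 * (1 - pnorm(t))

alpha <- 0.05
crit <- qnorm(1 - alpha / 2)       # z_{1 - alpha/2}
t_obs <- c(0.5, 1.5, 2.5, 3.5)     # made-up realized values of T_n
p_obs <- p_value(t_obs)

# The two decision rules agree for every realized value.
same_decision <- all((p_obs < alpha) == (t_obs > crit))
print(data.frame(t = t_obs, p = round(p_obs, 4), reject = p_obs < alpha))
```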

### 4.2 Confidence Set

Testing has many drawbacks. For instance, it's inefficient: You have to run a test for each null you're interested in. Since there is usually a continuum of possible values for the parameter (e.g., $-\infty$ to $\infty$), you can never exhaust the possibilities. 

A smarter way is to construct a confidence set that answers all testing questions like "Could this value be the true parameter?" once and for all. 

**Definition**. A confidence set of $\beta_{j}$ is a **random set** $CS_{1-\alpha}$ such that $\Pr(\beta_{j}\in CS_{1-\alpha})=1-\alpha$.

* Randomness, and thus the probability $\Pr$, is with respect to $CS_{1-\alpha}$, not $\beta_{j}$.  
* Note that there is no null hypothesis involved! Once you have a confidence set, you know at once that any null hypothesis falling outside $CS_{1-\alpha}$ will be rejected at level-$\alpha$ without really doing these tests.

We can construct a confidence set for $\beta_{j}$ easily: Since $(\hat{\beta}_{j}-\beta_{j})/se_{j}\to_{d}N(0,1)$, we know that $\Pr(|\hat{\beta}_{j}-\beta_{j}|/se_{j}\leq z_{1-\alpha/2})\approx\Pr(-z_{1-\alpha/2}\leq Z\leq z_{1-\alpha/2})=1-\alpha$. Therefore, a $(1-\alpha)$ confidence interval is $[\hat{\beta}_{j}-se_{j}\times z_{1-\alpha/2},\hat{\beta}_{j}+se_{j}\times z_{1-\alpha/2}]$.

* Confidence sets are more and more popular in economics. Top journals are abandoning reporting $p$-values and marking significance. Instead, you just need to report the se of your estimator; with it, people can calculate the confidence set for any $\alpha$ they want and do any test they want.
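
Here is a sketch that builds the interval by hand for one simulated sample and checks the link with testing: a null value is rejected at level $\alpha$ exactly when it falls outside the interval. The DGP, the list of null values, and the object names are assumptions for this example.

```r
set.seed(7)
n <- 500
x <- rnorm(n); e <- rnorm(n)
y <- 0.5 + 2 * x + e
X <- cbind(cons = 1, x)

# Point estimates and sandwich standard errors.
bhat <- solve(crossprod(X), crossprod(X, y))
ehat <- as.vector(y - X %*% bhat)
bread <- solve(crossprod(X) / n)
Sigma_hat <- bread %*% (crossprod(X * ehat) / n) %*% bread
se <- sqrt(diag(Sigma_hat) / n)

# (1 - alpha) confidence intervals: estimate -/+ z_{1 - alpha/2} * se.
alpha <- 0.05
z <- qnorm(1 - alpha / 2)
ci <- cbind(lower = as.vector(bhat) - z * se, upper = as.vector(bhat) + z * se)
print(ci)

# Duality with testing, checked for the slope on a few made-up null values.
null_values <- c(1.5, 1.9, 2, 2.1, 2.5)
t_vals <- abs((bhat[2] - null_values) / se[2])
outside <- null_values < ci[2, "lower"] | null_values > ci[2, "upper"]
duality_holds <- all((t_vals > z) == outside)
print(duality_holds)
```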

In [None]:
library(sandwich) # library to calculate heteroskedasticity-robust standard errors
library(lmtest)
# DGP
set.seed(5280) # fix the seed so that the numbers are reproducible
n=500; e=rnorm(n,0,1); x1=rnorm(n,0,1); x2=rnorm(n,0,1); y=0.5+x1+x2+e
model=lm(y~x1+x2)
## Get Hetero-robust variance matrix. Any of HC0-HC3 is okay. HC0 and 1 are commonly used.
V=vcovHC(model,type="HC0") # named V to avoid masking the function cov()
## First test the coefficient on x1 is 0 using t-test
ttest1=coef(model)[2]/sqrt(V[2,2]) # denominator is the corresponding se.
abs(ttest1)>qnorm(0.975) # TRUE means we reject at the 5% level
## Or you can use command coeftest:
coeftest(model,vcov.=vcovHC(model,type="HC0"))
## 5. Using OLS in Real World Causal Problems

In this semester, we almost never directly ***assume*** the causal relationship between $X$ and $Y$ is linear. Instead, linearity will naturally arise in many cases by construction. Whenever we do have a linear representation, we can use OLS and the inferential methods to study the parameters we are interested in.