# ECON5280 Lecture 4 Statistics

<font size="5">Junlong Feng</font>

## Outline

* Motivation: We care about the true parameters (the dice of God). We want to learn them from a random sample.
* Estimation and Inference: What does statistics do?
* Estimation of Population Mean and Variance: Method of moments (MM) estimator, unbiasedness, and consistency.
* Inference about Population Mean: Asymptotic normality. Testing and confidence interval.

## 1. Estimation and Inference

<font size="2">  *Throughout the semester, we adopt a frequentist perspective. If you're a Bayesian, please bear with me, as some of the techniques will also be useful to Bayesians.*</font>

A statistician/econometrician (frequentist) views the world from the following perspective: 

* There are some random variables $(Y,X,\varepsilon)$. They have the following relationship:
  $$
  Y=g(X,\varepsilon)
  $$
  We call $g$, together with the distribution of $(X,\varepsilon)$, the data generating process (DGP) of $Y$.

* Nature applies this DGP to all the entity $i$s (or, people). 

* The statistician/econometrician (i.e., us) collects an i.i.d. sample $\{(Y_{i},X_{i}):i=1,...,n\}$. However, $\varepsilon_{i}$ is not observed. Meanwhile, $g$ is also unknown to us.

* Our goal is to design tools that use the random sample to *guess* the unknown function $g$.

Of course, if $g$ is too complicated, it's hopeless. Hence, we often make assumptions about $g$ to make life easy. One example is to assume $g$ is *parametric*, that is, $g(X,\varepsilon)=h(X,\varepsilon;\theta)$ where the form of $h$ is known. The only unknown part is the vector of parameters $\theta$. For example, $Y=X'\beta+\varepsilon$. Here $h$ is a linear function and $\theta=\beta$.
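
To make the setup concrete, here is a minimal sketch of nature applying a linear DGP and the econometrician observing only $(Y_{i},X_{i})$. The value of `beta`, the normal distributions, and the seed are all assumptions made for illustration; nothing in the model requires them.

```python
import random

rng = random.Random(0)  # hypothetical seed, only for reproducibility
beta = 2.0              # hypothetical true parameter, unknown to the econometrician

def draw_observation():
    """Nature draws (X, eps), applies Y = X * beta + eps, and reveals only (Y, X)."""
    x = rng.gauss(0.0, 1.0)
    eps = rng.gauss(0.0, 1.0)  # generated by nature but never recorded in the data
    y = x * beta + eps
    return y, x

# The i.i.d. sample {(Y_i, X_i): i = 1, ..., n} with n = 5
sample = [draw_observation() for _ in range(5)]
for y, x in sample:
    print(f"Y = {y:.3f}, X = {x:.3f}")
```

Everything the econometrician does later has to be a function of `sample` alone, since `beta` and `eps` are hidden.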

Now let's assume the model is parametric. Given the model, an econometrician needs to do three things:

* Identification of $\theta$ (population level): Is $\theta$ unique?
  * If two $\theta$s can generate the same sample we observe, no hope to tell which is which.
  * Let's talk more about it in future lectures.
* Estimation (sample level): Given a random sample, can we construct a guess ($\hat{\theta}$) that approximates the unknown $\theta$?
* Inference (sample level): Two styles. 
  * Testing: I know $\theta\neq \hat{\theta}$ but given $\hat{\theta}$, can I at least reject some crazy values for $\theta$?
    * Suppose I get $\hat{\theta}=2$ and I know my guess is good. Then it does not make much sense to still believe $\theta=10000$.
  * Confidence set: Instead of having a single number $\hat{\theta}$, I want a range constructed from my sample such that I know that the range contains the true unknown $\theta$ with high probability.
    * This is possible even without a sample; I know for sure $\theta \in (-\infty,\infty)$.
    * But that is not useful. We'll develop better tools to get a more informative range.

## 2. Estimation of Population Mean and Variance

### 2.1 An MM Estimator

Suppose we have a random variable $Y$. **Without loss of generality**, we can write:
$$
Y=\mu+\sigma\varepsilon,\mathbb{E}(\varepsilon)=0,\mathbb{V}(\varepsilon)\equiv\mathbb{E}(\varepsilon^{2})=1.
$$
You can easily verify that the mean of $Y$ is $\mu$, i.e., $\mathbb{E}(Y)=\mu$, and the variance of $Y$ is $\sigma^{2}$. 

* Note that there is **NO ASSUMPTION** behind this model except that we assume the mean and variance of $Y$ exist.
* We can also check **identification**, i.e., uniqueness of $(\mu,\sigma)$ in population (taking $\sigma,\sigma_{1}>0$). Suppose there exists another $(\mu_{1},\sigma_{1})\neq(\mu,\sigma)$ such that $Y=\mu_{1}+\sigma_{1}\varepsilon'$ and $\mathbb{E}(\varepsilon')=0,\mathbb{E}(\varepsilon^{'2})=1$ as well. Then $\mathbb{E}(Y)=\mu_{1}\neq\mu=\mathbb{E}(Y)$ and/or $\mathbb{V}(Y)=\sigma_{1}^{2}\neq \sigma^{2}=\mathbb{V}(Y)$, a contradiction.

Now can we calculate $(\mu,\sigma)$? No, because we don't know the distribution of $Y$: Recall that $\mu=\mathbb{E}(Y)=\int yf_{Y}(y)dy$ and $\sigma^{2}=\int y^{2}f_{Y}(y)dy-\mu^{2}$ and we don't know the density $f_{Y}$.

So instead, let us consider how to guess $(\mu,\sigma)$, or, to estimate $(\mu,\sigma)$. Suppose we have an i.i.d. sample of $Y$: $\{Y_{i}:i=1,...,n\}$. By i.i.d., all the $Y_{i}$s share the same mean and variance. So they all follow:
$$
Y_{i}=\mu+\sigma\varepsilon_{i},\mathbb{E}(\varepsilon_{i})=0,\mathbb{E}(\varepsilon_{i}^{2})=1.
$$
**IMPORTANT**. The only conditions we can use are $\mathbb{E}(\varepsilon)=0$ and $\mathbb{E}(\varepsilon^{2}-1)=0$. Recall that $\mathbb{E}$ is the **population** mean. Why don't we use a **sample analogue** to mimic it? The natural sample analogue of a population mean is *the sample average*. So our chain of logic runs as follows:

* Our model tells us that $\mathbb{E}(Y-\mu)=0$ and $\mathbb{E}[(Y-\mu)^{2}-\sigma^{2}]=0$. They are called **moment equations** or **moment conditions** for $(\mu,\sigma)$. But we cannot solve for $(\mu,\sigma)$ by them because we don't know the distribution of $Y$.

* So instead, we use a sample analogue: estimate the population mean $\mathbb{E}$ by its sample counterpart $\sum_{i=1}^{n}/n$. So we propose an estimator based on the moment equations, called a **method of moments** (MM) estimator:
  $$
  \begin{align*}
  \frac{\sum_{i=1}^{n}(Y_{i}-\hat{\mu})}{n}&=0,\\
  \frac{\sum_{i=1}^{n}[(Y_{i}-\hat{\mu})^{2}-\hat{\sigma}^{2}]}{n}&=0.
  \end{align*}
  $$
  Two equations, two unknowns. Solving them, we get $\hat{\mu}=\frac{\sum_{i=1}^{n}Y_{i}}{n}\equiv\bar{Y}$ and $\hat{\sigma}^{2}=\frac{\sum_{i}(Y_{i}-\bar{Y})^{2}}{n}$.

* $(\hat{\mu},\hat{\sigma}^{2})$ are obviously not $(\mu,\sigma^{2})$; they solve different equations. $(\hat{\mu},\hat{\sigma}^{2})$  is called an **estimator** of $(\mu,\sigma^{2})$.

  * An estimator is **random** because it's a function of random variable $Y_{i}$s.
  * Suppose we have a data set and already know the realizations of the $Y_{i}$s (or simply, we see the numbers), then we can calculate the realization of $\hat{\mu}$ by substituting the numbers into its formula. The obtained number is called an **estimate**, which is **nonrandom**. 
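
The distinction between an estimator (a formula) and an estimate (a number) can be sketched in a few lines. The function name `mm_estimates`, the seed, and the simulated normal data below are assumptions for illustration; the formulas are the solutions of the two sample moment equations above.

```python
import random

def mm_estimates(sample):
    """Solve the two sample moment equations for (mu_hat, sigma2_hat)."""
    n = len(sample)
    # First equation: the average of (Y_i - mu_hat) is zero.
    mu_hat = sum(sample) / n
    # Second equation: the average of (Y_i - mu_hat)^2 - sigma2_hat is zero.
    # Note the division by n, not n - 1.
    sigma2_hat = sum((y - mu_hat) ** 2 for y in sample) / n
    return mu_hat, sigma2_hat

rng = random.Random(0)  # hypothetical seed
# Hypothetical data: mu = 1 and sigma = 2, so sigma^2 = 4
sample = [rng.gauss(1.0, 2.0) for _ in range(1000)]

mu_hat, sigma2_hat = mm_estimates(sample)  # these two numbers are the estimates
print(mu_hat, sigma2_hat)
```

The function `mm_estimates` plays the role of the estimator: feed it a different sample and it returns a different pair of numbers.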

**IMPORTANT**. The above derivation seems unnecessarily long and complicated; why don't we just use the sample average $\bar{Y}$ to mimic $\mathbb{E}(Y)$ in the first place, without even introducing $\varepsilon$? The reason is that we hope to find a unified approach, starting from the simplest example here, that handles all the estimation problems in this semester. The MM estimators form a large **class of estimators** and are among the most powerful techniques developed by statisticians and econometricians. The general steps are: i) construct moment equations (e.g. $\mathbb{E}(Y-\mu)=\sigma\mathbb{E}(\varepsilon)=0$), ii) replace $\mathbb{E}$ with the sample analogue, and iii) solve the sample version of the moment equations. Although it seems unnecessary in this simple example, you'll see it's so powerful that it handles nearly all econometric problems, no matter how complicated they are. And in doing so you don't even need to think. Just do steps i), ii) and iii) and you are done.

### 2.2 Properties of $(\hat{\mu},\hat{\sigma}^{2})$.

We are interested in three properties: unbiasedness, consistency, and asymptotic normality. The first two are related to how good the point estimate is. The third is for inference which we'll motivate in the next section.

**Definition**. An estimator $\hat{\theta}$ is unbiased for some parameter $\theta$ if $\mathbb{E}(\hat{\theta})=\theta$.

* The expectation here makes sense because recall that estimators are random.

* We cannot calculate the expectation directly because the distribution of $\hat{\theta}$ is determined by the distribution of the sample, which we do not know.

* The meaning of unbiasedness is different from consistency. Consistency says if we have one data set, and its size gets larger and larger, then the estimator will be close to the true parameter with higher and higher probability. 

  * One spreadsheet, number of rows increases.

  Unbiasedness, on the other hand, roughly says the following: if we have more and more data sets and calculate $\hat{\theta}$ from each and every one of them, then the average of all these $\hat{\theta}$s approaches the true parameter.

  * Multiple spreadsheets with **fixed** number of rows.
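
The two spreadsheet pictures can be mimicked by simulation. This is only a sketch: the true values `MU` and `SIGMA`, the normal distribution, the seed, and the sample sizes are all assumptions chosen for illustration, and $\hat{\theta}$ is taken to be the sample average.

```python
import random

rng = random.Random(42)  # hypothetical seed
MU, SIGMA = 1.0, 2.0     # hypothetical true values, known only because we simulate

def sample_mean(n):
    """Draw a fresh data set with n rows and return its sample average."""
    return sum(rng.gauss(MU, SIGMA) for _ in range(n)) / n

# Unbiasedness: MANY spreadsheets, each with a FIXED number of rows (n = 5).
means = [sample_mean(5) for _ in range(20_000)]
avg_of_means = sum(means) / len(means)

# Consistency: ONE spreadsheet whose number of rows keeps growing.
ys = [rng.gauss(MU, SIGMA) for _ in range(100_000)]
errors = {n: abs(sum(ys[:n]) / n - MU) for n in (10, 1_000, 100_000)}

print(avg_of_means)  # average over spreadsheets, to be compared with MU
print(errors)        # estimation error of the single growing spreadsheet
```

Each individual mean computed from 5 rows can be far from `MU`; it is only their average across spreadsheets that settles near it. In the second experiment there is a single estimate at each size, and its error tends to shrink as rows are added.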

Now we check if our method of moments (MM) estimator $(\hat{\mu},\hat{\sigma}^{2})$ is consistent and unbiased for $(\mu,\sigma^{2})$.

* Consistency. 
  * $\hat{\mu}$ is consistent. By WLLN, $\hat{\mu}\to_{p}\mu$.
  * $\hat{\sigma}^{2}$ is also consistent. $\hat{\sigma}^{2}=\sum_{i}(Y_{i}-\bar{Y})^{2}/n=\sum_{i}Y^{2}_{i}/n-\bar{Y}^{2}\to_{p}\mathbb{E}(Y_{i}^{2})-(\mathbb{E}(Y))^{2}=\sigma^{2}$. The last step (convergence in probability) holds by WLLN, by continuity of quadratic function, and by CMT.
  * $\hat{\sigma}\equiv\sqrt{\sum_{i}(Y_{i}-\bar{Y})^{2}/n}$ is also consistent for $\sigma$ immediately by CMT because $\sqrt{\cdot}$ is a continuous function.
* Unbiasedness. 
  * $\hat{\mu}$ is unbiased. Although we cannot directly calculate $\mathbb{E}(\hat{\mu})$, we can do the following to verify unbiasedness: $\mathbb{E}(\hat{\mu})=\mathbb{E}(\sum_{i}Y_{i}/n)=\mathbb{E}(\sum_{i}Y_{i})/n=\sum_{i}\mathbb{E}(Y_{i})/n=n\cdot\mathbb{E}(Y_{i})/n=\mu$.
  * $\hat{\sigma}^{2}$ is biased: $\mathbb{E}(\hat{\sigma}^{2})=\mathbb{E}(Y_{i}^{2})-\mathbb{E}[(\sum_{i}Y_{i}/n)^{2}]<\mathbb{E}(Y_{i}^{2})-(\mathbb{E}\sum_{i}Y_{i}/n)^{2}=\sigma^{2}$, where the inequality follows from Jensen's inequality and is strict as long as $\sigma^{2}>0$, because $\bar{Y}$ is then not degenerate.
    * $\hat{\sigma}$ is also biased.
    * Bias in $\hat{\sigma}^{2}$ and $\hat{\sigma}$ is not a big issue in most cases.
* Asymptotic normality. We'll leave this to next section.
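
The size of the bias in $\hat{\sigma}^{2}$ can be worked out exactly. The short derivation below is a supplement to the Jensen argument; it uses only $\mathbb{E}(Y_{i}^{2})=\sigma^{2}+\mu^{2}$ and $\mathbb{V}(\bar{Y})=\sigma^{2}/n$ (by i.i.d.):

$$
\begin{align*}
\mathbb{E}(\hat{\sigma}^{2})&=\mathbb{E}\Big(\frac{\sum_{i}Y_{i}^{2}}{n}\Big)-\mathbb{E}(\bar{Y}^{2})\\
&=(\sigma^{2}+\mu^{2})-(\mathbb{V}(\bar{Y})+\mu^{2})\\
&=\sigma^{2}-\frac{\sigma^{2}}{n}=\frac{n-1}{n}\sigma^{2}.
\end{align*}
$$

So the bias is $-\sigma^{2}/n$, which vanishes as $n\to\infty$. Rescaling $\hat{\sigma}^{2}$ by $n/(n-1)$ would remove it, which is one way to see why the bias is usually not a big issue.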

## 3. Inference about $\mu$

We have $\hat{\mu}$, but we also know $\hat{\mu}\neq \mu$. We know $\hat{\mu}$ is consistent and unbiased, but if our estimate is, say, 2.2, these properties only tell us that $\mu$ should be reasonably close to 2.2. But how close is close? Given 2.2, can you reject some crazy candidate value of $\mu$, or can you propose a range which likely covers $\mu$? We answer the first question using testing and the second using a confidence set.

### 3.1 Testing

Setup:

* A candidate value of $\mu$ is called a **null hypothesis**. Let's denote it by $\mathbb{H}_{0}:\mu=\mu^{0}$, where $\mu^{0}$ is a **known** number.
* Against the null, we can form an **alternative hypothesis**, denoted by $\mathbb{H}_{1}:\mu\neq \mu^{0}$.
  * This is a two-sided hypothesis. We only consider such hypotheses in this semester.
* Based on $\hat{\mu}$, we want to accept or reject one of the two hypotheses. But we may of course make a mistake. There are two mistakes:
  * Type I error: Reject $\mathbb{H}_{0}$ but $\mathbb{H}_{0}$ is true. 
  * Type II error: Not reject $\mathbb{H}_{0}$ but $\mathbb{H}_{0}$ is false.
  * Since $\hat{\mu}$ is random, we can, at best, control the **probability** of making these mistakes.
  * The probability of making Type I error is called **size** or **level**. One minus the probability of making Type II error is called **power**.
  * One cannot reduce size and increase power at the same time using one single approach (you'll see the reason later). Then should we care more about size or power?
  * We want to have definite control of size, or, prob. of Type I error. **This is because we usually form the hypothesis in a way that we are more afraid of Type I error**.
    * For instance, you go and test for Covid. The test may make two kinds of mistakes: you are negative but it says positive, and you are positive but the test says negative. To the hospital/society, they worry more about the latter. So under this logic they will design the test to make "Covid positive" as the null, and try to reject it. Then the more worrisome mistake becomes a Type I error. By controlling the probability of making Type I error, the risk is under control.
    * If the null is rejected, we say the test is **significant**.

Having this in mind, we now have an idea that we are going to build a test procedure under which we can calculate the probability of Type I error and then we can control it. To calculate the probability, one needs the distribution, but we don't know the distribution of $\hat{\mu}$, so we invoke CLT to approximate its distribution.

Since $\hat{\mu}$ is the sample average, CLT says that $\sqrt{n}(\hat{\mu}-\mu)/\sigma\to_{d}N(0,1)$. We say $\hat{\mu}$ is **asymptotically normal**. Moreover, we also know that $\hat{\sigma}\to_{p}\sigma$, so by Slutsky's theorem, $\sqrt{n}(\hat{\mu}-\mu)/\hat{\sigma}\to_{d}N(0,1)$. This says, although we don't know the exact distribution of $\hat{\mu}$ or $\sqrt{n}(\hat{\mu}-\mu)/\hat{\sigma}$, we know the latter can be well approximated by a standard normal distribution. Now let's see how to use it.
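
A quick Monte Carlo check can make this approximation tangible. The sketch below is only an illustration under stated assumptions: the data are Exponential(1) (mean 1, clearly non-normal), and the seed, sample size `n`, and number of replications `reps` are arbitrary choices.

```python
import math
import random

def t_ratio(sample, mu):
    """Compute sqrt(n) * (mu_hat - mu) / sigma_hat with the MM estimators."""
    n = len(sample)
    mu_hat = sum(sample) / n
    sigma_hat = math.sqrt(sum((y - mu_hat) ** 2 for y in sample) / n)
    return math.sqrt(n) * (mu_hat - mu) / sigma_hat

rng = random.Random(1)  # hypothetical seed
n, reps = 200, 2000     # hypothetical sample size and number of replications

# Each replication draws a fresh Exponential(1) sample, whose true mean is 1.
ratios = [t_ratio([rng.expovariate(1.0) for _ in range(n)], 1.0) for _ in range(reps)]

# If the ratio were exactly N(0,1), about 5% of draws would fall outside [-1.96, 1.96].
share_outside = sum(abs(r) > 1.96 for r in ratios) / reps
print(share_outside)
```

If the printed share is in the neighbourhood of 0.05, the standard normal is doing a reasonable job for this sample size, even though the underlying data are far from normal.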

Logic of testing:

* For a given null value $\mu^{0}$, since we know $\hat{\mu}$ is a good estimator and close to $\mu$, we should reject the null hypothesis, i.e., reject $\mu=\mu^{0}$ if $\hat{\mu}$ is **too far** from $\mu^{0}$, or, equivalently, if $T_{n}\equiv|\sqrt{n}(\hat{\mu}-\mu^{0})/\hat{\sigma}|$ is too large. 
* To determine how large is large, we want to find a threshold $c>0$: if $T_{n}>c$, we reject. If $T_{n}\leq c$, we do not reject. Then how do we choose $c$? We choose $c$ so that the size (i.e., prob. of Type I error) is controlled at $\alpha$, where $\alpha\in (0,1)$ is usually a small number, say, 0.01, 0.05, 0.1.
  * $c$ is called the **critical value**.
* By definition, $\alpha$ is the probability of rejecting while the null is true. Translating this to math, we want to find a $c$ such that $\Pr(T_{n}>c)=\alpha$ under the null. We don't know the LHS probability, but by CLT we know it is approximately equal to $\Pr(Z>c\text{ or }Z<-c)$ where $Z\sim N(0,1)$. Hence, we choose $c=z_{1-\alpha/2}$, where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$-th quantile of the standard normal distribution: $\Pr(Z\leq z_{1-\alpha/2})=1-\alpha/2$.
  * So the critical value is affected by $\alpha$. A smaller $\alpha$ leads to a larger critical value.
  * Commonly used critical values: 1.64 ($\alpha=0.1$), **1.96 ($\alpha=0.05$)**, and 2.58 ($\alpha=0.01$).
* From this we can see one cannot reduce size and increase power by choosing $c$ alone: a smaller $c$ makes it easier to reject (harder to accept) so more likely to make a Type I error (less likely to make a Type II error).
* Power can be increased by another parameter: $n$. This is because under the alternative (a Type II error can only occur when the alternative is true), $\mu\neq \mu^{0}$, so $T_{n}=|\sqrt{n}(\hat{\mu}-\mu)/\hat{\sigma}+\sqrt{n}(\mu-\mu^{0})/\hat{\sigma}|$. As $n\to \infty$, the first term is approximately standard normal while the second diverges, so $T_{n}$ exceeds any fixed $c$ with probability approaching one. So we choose $c$ to control Type I error, and Type II error will be small if we have a sufficiently large sample.
  * Unfortunately, we cannot calculate Type II error for a **fixed** sample size. The finite sample power is always an issue in practice. That's why statisticians/econometricians/economists do not say "accept the null" when the test is not significant; we may make a Type II error. They just say "cannot reject the null".
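
The testing logic above can be sketched as a small function. The name `t_test` and the eight data points are hypothetical; the critical value comes from `statistics.NormalDist` in the Python standard library (3.8+). With only eight observations the asymptotic approximation is of course crude, so the example is meant to exercise the arithmetic, not to be taken as a serious analysis.

```python
import math
from statistics import NormalDist

def t_test(sample, mu0, alpha=0.05):
    """Two-sided test of H0: mu = mu0 using the normal critical value."""
    n = len(sample)
    mu_hat = sum(sample) / n
    sigma_hat = math.sqrt(sum((y - mu_hat) ** 2 for y in sample) / n)
    t_stat = abs(math.sqrt(n) * (mu_hat - mu0) / sigma_hat)  # T_n
    crit = NormalDist().inv_cdf(1 - alpha / 2)               # z_{1 - alpha/2}
    return t_stat, crit, t_stat > crit                       # reject iff T_n > c

# Hypothetical data set
sample = [2.0, 2.5, 1.5, 2.2, 2.8, 1.9, 2.1, 2.4]

print(t_test(sample, mu0=2.2))      # a null value close to the sample mean
print(t_test(sample, mu0=10000.0))  # a "crazy" null value
print(t_test(sample, mu0=2.2, alpha=0.01)[1])  # smaller alpha, larger critical value
```

When the third returned value is `False`, the appropriate reading is "cannot reject the null", not "accept the null".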

This test is called the (Student's) $t$ test. Historically, people assumed the data are normal, in which case $T_{n}$ follows a Student's $t$ distribution. Today, we still call it a $t$ test but no longer use the $t$ distribution to find the critical values. We never make distributional assumptions about the data but instead use an asymptotic approximation to the distribution.

#### 3.1.1 $p$-value

The $p$-value is closely related to testing. It's the approximate probability (**under the null hypothesis**) of the random $T_{n}$ being greater than its realized value $t$ (a nonrandom number) based on our data set. 

Logic of $p$-value:

- If $T_{n}>t$ has a small probability under the null hypothesis, then it means $t$ must be large.
- $T_{n}$ is approximately the absolute value of a standard normal under the null hypothesis, so its realization should be close to 0.
- Now that the realization $t$ is big, the observed data set is a rare event.
- But we don't believe that we are so unlucky: we have one data set and this data set is a rare event! 
- So we think maybe it's because the null hypothesis is false in the first place. So we **reject** the null if this probability is **too small**.

Formally, $p$-value is defined as $\Pr(|Z|>t)$ where $Z$ is standard normal and we reject at level-$\alpha$ if $\Pr(|Z|>t)<\alpha$.
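
As a minimal sketch, the formal definition translates directly into code. The function name `p_value` is hypothetical; the normal CDF comes from `statistics.NormalDist` in the Python standard library.

```python
from statistics import NormalDist

def p_value(t):
    """Two-sided p-value Pr(|Z| > t) for a realized statistic t >= 0."""
    return 2 * (1 - NormalDist().cdf(t))

for t in (0.5, 1.96, 3.0):
    print(t, p_value(t))
```

Rejecting when the $p$-value is below $\alpha$ is the same decision as rejecting when $t>z_{1-\alpha/2}$, since $\Pr(|Z|>t)$ is decreasing in $t$ and equals $\alpha$ exactly at $t=z_{1-\alpha/2}$.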

### 3.2 Confidence Set

Testing has many drawbacks. For instance, it's inefficient: you have to run a test for each null you're interested in. Since there is usually a continuum of possible values for the parameter (e.g., $-\infty$ to $\infty$), you can never exhaust the possibilities. 

A smarter way is to construct a confidence set that answers all testing questions like "Could this value be the true parameter?" once and for all. 

**Definition**. Denote the true parameter by $\theta$. A confidence set is a **random set** $CS_{1-\alpha}$ such that $\Pr(\theta\in CS_{1-\alpha})=1-\alpha$.

* Randomness, and thus the probability $\Pr$, is with respect to $CS_{1-\alpha}$, not $\theta$.  
* Note that there is no null hypothesis involved! Once you have a confidence set, you know at once that any null hypothesis falling outside $CS_{1-\alpha}$ will be rejected at level-$\alpha$ without really doing these tests.

For the population mean (or **any parameter whose estimator is asymptotically normal**), we can construct a confidence set easily: Since $\sqrt{n}(\hat{\mu}-\mu)/\hat{\sigma}\to_{d}N(0,1)$, we know that $\Pr(\sqrt{n}|\hat{\mu}-\mu|/\hat{\sigma}\leq z_{1-\alpha/2})\approx\Pr(-z_{1-\alpha/2}\leq Z\leq z_{1-\alpha/2})=1-\alpha$. Therefore, a valid confidence set is $CS_{1-\alpha}=[\hat{\mu}-z_{1-\alpha/2}\hat{\sigma}/\sqrt{n},\hat{\mu}+z_{1-\alpha/2}\hat{\sigma}/\sqrt{n}]$.

* As said before, no null is needed in the derivation.

* $\hat{\sigma}/\sqrt{n}$ is called the **standard error (se)** of $\hat{\mu}$, i.e., an estimate of the standard deviation of $\hat{\mu}$: $\mathbb{V}(\hat{\mu})=\sigma^{2}/n$ so $\sigma_{\hat{\mu}}=\sigma/\sqrt{n}$ and we know $\hat{\sigma}\to_{p} \sigma$.

  * I know this sounds confusing. There are several things going on:
    * Asymptotic variance and s.d. of $\hat{\mu}$: $\sigma^{2},\sigma$.
      * Estimator of the asymptotic variance: $\hat{\sigma}^{2},\hat{\sigma}$.

    * Variance and s.d. of $\hat{\mu}$: $\sigma^{2}/n$ (by i.i.d.), $\sigma/\sqrt{n}$.
      * Estimator of the variance and s.d. of $\hat{\mu}$: $\hat{\sigma}^{2}/n$, $\hat{\sigma}/\sqrt{n}$.

      * $\hat{\sigma}/\sqrt{n}$ is called the se of $\hat{\mu}$.

  * The $(1-\alpha)$ confidence set of $\mu$ is $[\hat{\mu}-z_{1-\alpha/2}se,\hat{\mu}+z_{1-\alpha/2}se]$.

* Confidence sets are more and more popular in economics. Top journals are moving away from reporting $p$-values and marking significance. Instead, you just need to report the se of your estimator; with it, people can calculate the confidence set for any $\alpha$ they want and do any test they want.
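
To illustrate that the randomness lies in the set rather than in $\mu$, here is a rough sketch that builds the interval $[\hat{\mu}\pm z_{1-\alpha/2}\,se]$ and checks by simulation how often it covers the truth. The function name `confidence_set`, the normal data, the seed, and the simulation sizes are assumptions for illustration.

```python
import math
import random
from statistics import NormalDist

def confidence_set(sample, alpha=0.05):
    """Return [mu_hat - z * se, mu_hat + z * se] with se = sigma_hat / sqrt(n)."""
    n = len(sample)
    mu_hat = sum(sample) / n
    sigma_hat = math.sqrt(sum((y - mu_hat) ** 2 for y in sample) / n)
    se = sigma_hat / math.sqrt(n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return mu_hat - z * se, mu_hat + z * se

rng = random.Random(7)        # hypothetical seed
reps, n, mu = 1000, 100, 1.0  # hypothetical simulation design; mu is the fixed truth

covered = 0
for _ in range(reps):
    # Every replication produces a DIFFERENT interval; mu never changes.
    lo, hi = confidence_set([rng.gauss(mu, 2.0) for _ in range(n)])
    covered += lo <= mu <= hi

coverage = covered / reps
print(coverage)  # to be compared with the nominal 1 - alpha = 0.95
```

A null value $\mu^{0}$ is rejected at level $\alpha$ by the test in Section 3.1 exactly when it lies outside the interval returned by `confidence_set`.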