# ECON5280 Lecture 3 Probability

<font size="5">Junlong Feng</font>

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/junlong-feng/econ5280/main?filepath=Lecture3_Probability.ipynb)

## Outline

* Motivation: We treat our sample as random.
* Probability distribution: Terminology and notation.
* Expectation, variance, covariance and correlation: Basic tools.
* Large sample theory: Advanced tools.

## 1. Probability Distribution

<font size="2">  *We'll not adopt the rigorous approach to define random variable and probability distribution.*</font>

### 1.1 Random Variable

* A (real) random variable $X$ takes values in a set $\Omega\subset \mathbb{R}$. Which value it takes is random.
  * If the set $\Omega$ is discrete, e.g. $\{1,2,3\}$, $X$ is **discrete**. If $\Omega$ is a continuum, $X$ is **continuous**.
* A subset of $\Omega$ is called an **event**. 
  + For instance, let $\Omega=\{1,2,3\}$ represent tomorrow's weather: $1=rainy$, $2=sunny$ and $3=other$. Then $\{1\}$ is the event that tomorrow is rainy. $\{1,2\}$ means that tomorrow is rainy or sunny.
  + If two events cannot happen simultaneously, we say they are **mutually exclusive**.
* Probability (measure) is a nonnegative **function** mapping the set of all possible events (the $\sigma$-algebra generated by $\Omega$ if you prefer a more rigorous treatment) to the interval $[0,1]$.
  + A nonnegative function must satisfy certain properties to be a probability measure. To save time, these are left for further reading and are not required.

### 1.2 Probability Function, Cumulative Distribution Function, and Density

How do we describe the behavior of a random variable? For a non-random variable, as long as we know its value, we know everything about it. For a random variable, since it takes on multiple values, we need to know two things: i) All possible values that it can take, and ii) (roughly) the probability of each singleton event.

#### 1.2.1 Discrete Random Variables and Probability Function 

We start from the simplest case where $\Omega$ is a discrete set. Let $\Omega\equiv \{x_{1},\ldots,x_{J}\}$. A singleton event is an event that contains only one element of $\Omega$, e.g. $\{x_{2}\}$.

The probability distribution of $X$ is the set of $\Pr(X=x_{j})$ for all $j=1,...,J$. Once we know the distribution, we can calculate the probability of any event. For instance, $\Pr(X=x_{1}\ or\ x_{2})=\Pr(X=x_{1})+\Pr(X=x_{2})$.
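
As a small illustration in R, take the weather example from Section 1.1. The probabilities 0.3, 0.5 and 0.2 below are made up for illustration only. The probability of an event is the sum of the singleton probabilities it contains:

```r
## Hypothetical probability function on Omega = {1, 2, 3}
## (1 = rainy, 2 = sunny, 3 = other); the numbers are made up.
omega <- c(1, 2, 3)
prob  <- c(0.3, 0.5, 0.2)

## A valid probability function is nonnegative and sums to one.
stopifnot(all(prob >= 0), isTRUE(all.equal(sum(prob), 1)))

## Pr(X = 1 or 2): add up the singleton probabilities in the event {1, 2}.
event <- c(1, 2)
pr_event <- sum(prob[omega %in% event])
pr_event
```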

A useful discrete distribution:

* Bernoulli, denoted by $Ber(p)$: $\Omega$ only contains $2$ elements, usually normalized to $\{0,1\}$. $\Pr(X=1)=p$.

  * Bernoulli distribution describes any random variable that has two outcomes: You get a job at Morgan Stanley next year or not, Covid-19 ends next year or not, etc.

  * In R, you can generate $n$ Bernoulli-distributed random variables by

In [None]:
n=10
p=0.5
## rbinom(n, size, prob): with size = 1, each draw is Bernoulli(p).
## A larger size gives a binomial distribution, which we do not talk about in this course.
X=rbinom(n,1,p) 
X

#### 1.2.2 Continuous Random Variables, Cumulative Distribution Function, and Density

When $\Omega$ is a continuum, e.g. $[0,\infty)$, $(-\infty,\infty)$, and $[0,1]$, we say $X$ is continuous. The first thing you should always remember is that the probability of a singleton event for a continuous $X$ is always 0:
$$
\Pr(X=x)=0,\forall x\in\Omega.
$$
Therefore, it makes no sense to define the distribution of a continuous random variable in this way; $X$ and $Y$ may follow different distributions but they both have 0 probability for any singleton event.

* The reason is that probability is a measure. Let $\Omega=[0,1]$, whose length is 1, and suppose for illustration that the probability of a subset of $\Omega$ is its length. Then what is the length of a singleton, say $\{0.2\}$? A point has zero length, so the probability of that event is 0.

However, events like $[0.2,0.3)$ or $(-\infty,2)$ may have nonzero probability. Any function that lets us calculate the probabilities of all possible events can be used to represent the distribution. One such function is the *cumulative distribution function* (CDF):
$$
F_{X}(x)\equiv \Pr(X\leq x),\forall x\in \mathbb{R}.
$$

* For practical purposes, it does not matter whether you define the CDF as $\Pr(X\leq x)$ or $\Pr(X<x)$ **for continuous random variables**. This is because $\Pr(X\leq x)=\Pr(X<x\text{ or }X=x)=\Pr(X<x)+\Pr(X=x)=\Pr(X<x)$.
* You can verify that once you know $F_X$, you can calculate the probability of any event:
  * $\Pr(X\in (-1,3))=F_{X}(3)-F_{X}(-1)$. $\Pr(X>2)=1-F_{X}(2)$.
* $x$ in the definition **is not necessarily in $\Omega$**. For example, if $\Omega=[0,1]$, then $\Pr(X\in [0,1])=1$, so $1=\Pr(X\in [0,1])\leq \Pr(X\leq 2)=F_{X}(2)\leq 1$. Hence $F_{X}(2)=1$, which is still well-defined.
* $F_{X}(\cdot)$ is a continuous function on $\mathbb{R}$ if $X$ is continuous.
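
The calculations above can be sketched in R with the built-in CDFs `pnorm()` (normal) and `punif()` (uniform). The choice of a standard normal and a uniform on $[0,1]$ is only for illustration:

```r
## Standard normal: Omega = (-Inf, Inf), CDF given by pnorm().
pr_interval <- pnorm(3) - pnorm(-1)   # Pr(-1 < X < 3) = F(3) - F(-1)
pr_tail     <- 1 - pnorm(2)           # Pr(X > 2)      = 1 - F(2)

## Uniform on [0, 1]: the CDF is still defined outside Omega.
F_at_2       <- punif(2)    # Pr(X <= 2)  = 1
F_at_minus_1 <- punif(-1)   # Pr(X <= -1) = 0

c(pr_interval, pr_tail, F_at_2, F_at_minus_1)
```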

The CDF of a continuous random variable and the probability function of a discrete random variable look quite different. You can, however, define the CDF for a discrete random variable as well (but now you need to be careful about the equality sign):
$$
F_{X}(x)\equiv\Pr(X\leq x)=\sum_{x_{j}\leq x}\Pr(X=x_{j}),\text{  $X$ is discrete}.
$$
It seems unfair: discrete random variables have two ways to describe their distribution while continuous random variables have only one. To make it fair, we introduce a concept that parallels the singleton probability of discrete random variables: the density function.

Roughly, the density is the derivative of the CDF and thus:
$$
F_{X}(x)\equiv \Pr(X\leq x)=\int_{s\leq x}f_{X}(s)ds,\text{    $X$ is continuous}.
$$
If you compare the two equations above, you can see density for a continuous $X$ plays a similar role as probability for a discrete $X$.

However, from calculus, we know not all continuous functions are differentiable. So, unlike the probability function of a discrete random variable, which always exists, not all continuous random variables have densities.

* A continuous random variable whose density exists is called **absolutely continuous**.
* The *Cantor distribution* is a continuous distribution that does not have a density.

Just like the CDF, if you know the density of a random variable, you can calculate the probability of any event. So the density (when it exists) contains the same information as the CDF, and thus is another way to represent a distribution.

* $\Pr(X\in (a,b))=\int_{a}^{b}f_{X}(s)ds$. $\Pr(X=x)=\int_{x}^{x}f_{X}(s)ds=0$ provided $f_{X}(x)<\infty$.
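
A minimal numerical check of the link between density and CDF, assuming a standard normal $X$ (an illustrative choice) and using base R's `integrate()`, `dnorm()` and `pnorm()`:

```r
mu <- 0; sigma <- 1
a <- -1; b <- 3

## Pr(a < X < b) by numerically integrating the density over (a, b).
pr_density <- integrate(dnorm, lower = a, upper = b, mean = mu, sd = sigma)$value
## The same probability from the CDF.
pr_cdf <- pnorm(b, mu, sigma) - pnorm(a, mu, sigma)
c(pr_density, pr_cdf)

## The density is (roughly) the derivative of the CDF:
## compare a central finite difference of the CDF at x0 with the density at x0.
x0 <- 0.7; h <- 1e-5
slope <- (pnorm(x0 + h) - pnorm(x0 - h)) / (2 * h)
c(slope, dnorm(x0))
```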

A continuous distribution we'll use throughout the semester:

* Normal (or Gaussian), denoted by $N(\mu,\sigma^{2})$. $\Omega=(-\infty,\infty)$. $f_{X}(x)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2}(\frac{x-\mu}{\sigma})^{2}\right)$.

In [None]:
library(ggplot2)
library(latex2exp)
library(ggpointdensity)

## Normal densities
x=seq(-7,7,length=100)
f1=dnorm(x,mean=0,sd=1)
f2=dnorm(x,mean=2,sd=1)
f3=dnorm(x,mean=0,sd=2)

## Put the three density function values and x into a dataset
data=data.frame(f1=f1,f2=f2,f3=f3,x=x)

## Plot the three densities in one graph.
p=ggplot(data=data)+
  geom_line(aes(x=x,y=f1,colour="N(0,1)"))+
  geom_line(aes(x=x,y=f2,colour="N(2,1)"),linetype="dashed")+
  geom_line(aes(x=x,y=f3,colour="N(0,4)"),linetype="dotdash")+
  xlab('x')+
  ylab('density')+
  labs(colour="Legend")+
  theme(
    legend.position = c(.95, .95),
    legend.justification = c("right", "top"),
    legend.box.just = "right",
    legend.margin = margin(6, 6, 6, 6)
  )
p

### 1.3 Joint, Conditional and Marginal Distributions

  For our purposes, we will only review these three concepts for a continuous variable whose density exists.

  Let $X$ represent years of schooling and $Y$ monthly income. Both are treated as continuous for now.

  * Years of schooling is sometimes treated as continuous and sometimes discrete. This is the art side of economics and applied econometrics.

$\Pr(X>12)$ is the probability that someone has post-high-school education. To calculate it in practice, we count the number of people with post-high-school education **in the full population**, regardless of their income, and divide it by the size of the population.

$\Pr(Y>8K)$ is the probability that someone's monthly income is above 8K. To calculate it in practice, we count the number of such people **in the full population**, regardless of their education level, and divide it by the size of the population.

$\Pr(X>12,Y>8K)$ is the probability that someone has post-high-school education AND a monthly income above 8K. We count the number of people **in the full population** who meet the two criteria **at the same time**, and divide it by the size of the population.

$\Pr(X>12|Y>8K)$ is the probability that someone with income above 8K has post-high-school education. We restrict attention to the **sub-population** with income above 8K, count the number of people in it who have post-high-school education, and divide it by the size of this sub-population.

  * The first two are called **marginal probabilities**.
  * The third one is the **joint probability**.
  * The last one is a **conditional probability**.

  We always have $Joint=Marginal\times Conditional$. Below are some properties of the three:
$$
\begin{align*}
  F_{X,Y}(x,y)\equiv &\Pr(X\leq x, Y\leq y)=\Pr(X\leq x|Y\leq y)F_{Y}(y)=\Pr(Y\leq y|X\leq x)F_{X}(x),\\
  F_{Y|X}(y|x)\equiv &\Pr(Y\leq y|X=x)\equiv \lim_{\epsilon\to 0}\Pr\left(Y\leq y|X\in (x-\epsilon,x+\epsilon)\right), f_{Y|X}(y|x)=\partial_{y}F_{Y|X}(y|x),\\
  f_{X,Y}(x,y)\equiv &f_{X|Y}(x|y)f_{Y}(y)=f_{Y|X}(y|x)f_{X}(x),\\
  F_{X}(x)=&\int_{y\in\Omega_{Y}}F_{X|Y}(x|y)f_{Y}(y)dy,\\
  f_{X}(x)=&\int_{y\in\Omega_{Y}}f_{X,Y}(x,y)dy.
  \end{align*}
$$

  * The first and the third equations are the multiplication (product) rule, from which Bayes' rule follows.
  * The second equation defines the conditional probability and density when the conditioning event has 0 probability.
  * The last two equalities say that **the joint distribution always contains no less information than the marginals**. That is, you can always recover the marginals, and thus the conditionals, from the joint, but not always *vice versa*.
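
The counting descriptions above and the identity $Joint=Marginal\times Conditional$ can be sketched with a simulated population. The data-generating process below (uniform schooling, income equal to half of schooling plus an exponential shock) is entirely made up for illustration:

```r
set.seed(5280)
N <- 100000
## Hypothetical population: schooling (years) and monthly income (in K).
school <- runif(N, min = 6, max = 20)
income <- 0.5 * school + rexp(N, rate = 1 / 3)

high_edu <- school > 12
high_inc <- income > 8

marg_inc <- mean(high_inc)              # Pr(Y > 8K): counted in the full population
joint    <- mean(high_edu & high_inc)   # Pr(X > 12, Y > 8K): counted in the full population
cond     <- mean(high_edu[high_inc])    # Pr(X > 12 | Y > 8K): counted in the sub-population with Y > 8K

## Joint = Marginal x Conditional holds exactly for these population frequencies.
c(joint = joint, marginal_times_conditional = marg_inc * cond)
```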

An important concept is **independence**.

**Definition**. $X$ and $Y$ are independent if and only if $F_{X,Y}(x,y)=F_{X}(x)F_{Y}(y)$ **for all $(x,y)$**. Or if they have densities, $X$ and $Y$ are independent if and only if  $f_{X,Y}(x,y)=f_{X}(x)f_{Y}(y)$ **for all** $(x,y)$.

  * In this course, we denote independence by $\perp$. $X\perp Y$ is equivalent to $Y\perp X$.
  * If $X$ and $Y$ are independent, then their marginals contain the same info as the joint by definition.
  * If $X$ and $Y$ are independent, $g(X)$ and $h(Y)$ are also independent for any measurable functions $g$ and $h$.

## 2. Expectation, Variance, Covariance and Correlation

In this section we discuss both discrete and continuous random variables, but we will use the same notation: density $f_{X}(x)$ means the density function when $X$ is continuous and means $\Pr(X=x)$ when $X$ is discrete. $\int dx$ means integral when $X$ is continuous and means $\sum_{j}$ when $X$ is discrete.

### 2.1 Marginals

  * Expectation (or mean): $\mathbb{E}(X)\equiv\int_{\Omega}xf_{X}(x)dx$. Sometimes denoted by $\mu$.
    * The probability-weighted average of all possible outcomes.
    * **Intuition:** Your expectation of some future outcome is always affected by the probability of each possible event, and is seldom the simple average.
    * $\mathbb{E}(aX+bY)=a\mathbb{E}(X)+b\mathbb{E}(Y)$.
    * $\mathbb{E}(XY)\neq \mathbb{E}(X)\mathbb{E}(Y)$ in general.
  * Variance: $\mathbb{V}(X)\equiv \mathbb{E}[(X-\mu_{X})^{2}]$. Sometimes denoted by $\sigma_{X}^{2}$.
    * The probability-weighted average of the squared Euclidean distance from the mean to each outcome.
    * **Intuition:** It's a squared distance to the mean, so it measures dispersion, inequality, or risk.
      * Consider countries A and B with equal mean household income. A has a larger income variance. What does this say about income inequality?
    * A related concept is the standard deviation (s.d.) $\sigma_{X}=\sqrt{\mathbb{V}(X)}$ .
    * $\mathbb{V}(X)=\mathbb{E}(X^{2})-\mu_{X}^{2}$.
    * $\mathbb{V}(aX)=a^{2}\mathbb{V}(X)$.
  * Covariance: $cov(X,Y)=cov(Y,X)\equiv\mathbb{E}[(X-\mu_{X})(Y-\mu_{Y})]$.
    + **Intuition:** By how much do $X$ and $Y$ co-vary? One increases, the other increases as well? (Positive.) Decreases? (Negative.) Stays unchanged? (Zero).
    + $cov(X,Y)=\mathbb{E}(XY)-\mu_{X}\mu_{Y}$.
    + $cov(aX,Y)=a\cdot cov(X,Y)$, $cov(X,X)=\mathbb{V}(X)$.
    + $cov(aX+bY,Z)=a\cdot cov(X,Z)+b\cdot cov(Y,Z)$.
    + $\mathbb{V}(aX+bY)=a^{2}\mathbb{V}(X)+b^{2}\mathbb{V}(Y)+2ab\cdot cov(X,Y)$.
  * Correlation: $\rho (X,Y)=\rho(Y,X)\equiv \frac{cov(X,Y)}{\sigma_{X}\sigma_{Y}}$. Sometimes denoted by $\rho_{XY}$.
    + **Intuition:** A large covariance may arise because $X$ and $Y$ co-vary a lot, or because $X$ and/or $Y$ themselves are volatile (large s.d.). So we divide by the s.d.'s to reflect their *net* comovement.
    + $\rho_{XY}\in [-1,1]$. Can you prove it?
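
Several of the identities above can be checked numerically. The sketch below uses simulated data (the normal draws and the constants `a`, `b` are arbitrary choices). Sample variances and covariances computed by `var()` and `cov()` obey the same algebra as their population counterparts, so the two sides agree up to rounding:

```r
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)   # correlated with x by construction
a <- 2; b <- -3

## V(aX + bY) = a^2 V(X) + b^2 V(Y) + 2ab cov(X, Y)
lhs <- var(a * x + b * y)
rhs <- a^2 * var(x) + b^2 * var(y) + 2 * a * b * cov(x, y)
c(lhs, rhs)

## cov(X, X) = V(X)
c(cov(x, x), var(x))

## Correlation is the covariance divided by the two s.d.'s.
c(cor(x, y), cov(x, y) / (sd(x) * sd(y)))
```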

All four quantities above are i) deterministic and ii) built on expectations. However, the expectation itself may not exist: the integral of $xf(x)$ may diverge when $x\in (-\infty,\infty)$.

  * Fat-tailed distributions are often a concern. Check out the Cauchy distribution.
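
To get a feel for what a nonexistent mean looks like, one can compare running averages of normal and Cauchy draws. This is only a simulation sketch (the helper `running_mean` is defined here, and the exact numbers depend on the seed). Typically the normal running average settles near 0 while the Cauchy one keeps jumping:

```r
set.seed(2)
n <- 10000
## Running average of the first 1, 2, ..., n draws.
running_mean <- function(z) cumsum(z) / seq_along(z)

rm_normal <- running_mean(rnorm(n))     # the mean exists (equals 0)
rm_cauchy <- running_mean(rcauchy(n))   # the mean does not exist

## Compare the running averages at a few sample sizes.
idx <- c(10, 100, 1000, 10000)
data.frame(n = idx, normal = rm_normal[idx], cauchy = rm_cauchy[idx])
```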

### 2.2 Conditionals

Conditional expectation is perhaps the most important quantity in this course.

  * $\mathbb{E}(Y|X=x)\equiv \int y f_{Y|X}(y|x)dy$.

    * $x$ here is just a parameter; the integral does not do anything to $X$.
    * On the other hand, the randomness of the variable in front of the conditional sign $|$ is eliminated by the integral.
    * $\mathbb{E}(Y|X=x)$ is a **nonrandom** number.

  * $\mathbb{E}(Y|X)\equiv \int yf_{Y|X}(y|X)dy$.

    * Again, the expectation eliminates the randomness of $Y$ by the integral.
    * **But, the randomness of $X$ is left untouched**.
    * $\mathbb{E}(Y|X)$ is a **random variable**. Its randomness **only comes from $X$**.

  * $\mathbb{E}(aY+bZ|X)=a\mathbb{E}(Y|X)+b\mathbb{E}(Z|X)$, but in general $\mathbb{E}(Y|aX+bZ)\neq \mathbb{E}(Y|aX)+\mathbb{E}(Y|bZ)$.

  * **Law of iterated expectation**: 
    $$
    \begin{align*}
    &\text{(General form)}\ \ \ \ \mathbb{E}[\mathbb{E}(Y|f(X))|g(X)]=\mathbb{E}[\mathbb{E}(Y|g(X))|f(X)]=\mathbb{E}(Y|g(X))\ \text{if $g$ is a function of $f$},\\
    &\text{(Special case 1)}\ \ \ \ \mathbb{E}\left[\mathbb{E}(Y|X_{1},X_{2})|X_{1}\right]=\mathbb{E}\left[\mathbb{E}(Y|X_{1})|X_{1},X_{2}\right]=\mathbb{E}(Y|X_{1}),\\
    &\text{(Special case 2)}\ \ \ \ \mathbb{E}\left[\mathbb{E}(Y|X)\right]=\mathbb{E}(Y).\\
    \end{align*}
    $$

    + **Intuition**: The coarser information set dominates. Taking the unconditional expectation shrinks a large information set (the probability cloud) to a single point (the expectation is nonrandom). The conditioning variables tell you where the shrinkage should stop, i.e., randomness from the conditioning variables is kept. $\mathbb{E}\left[\mathbb{E}(Y|X_{1},X_{2})|X_{1}\right]$ says you first keep the randomness in $X_{1}$ and $X_{2}$ and then eliminate the randomness in $X_{2}$. $\mathbb{E}\left[\mathbb{E}(Y|X_{1})|X_{1},X_{2}\right]$ says you first keep the randomness in $X_{1}$ only. The randomness of $X_{2}$ has then already been eliminated, so although the outer expectation asks you to keep the randomness of $X_{1}$ and $X_{2}$, $X_{2}$ is already gone. Either way, only the randomness of $X_{1}$ is kept, so both equal $\mathbb{E}(Y|X_{1})$.
      + A useful application: $\mathbb{E}(X|X)=X$.

  * **Conditional mean independence**: $Y$ is conditionally mean independent of $X$ if $\mathbb{E}(Y|X)=\mathbb{E}(Y)$ almost surely.

    * $\mathbb{E}(Y|X)=\mathbb{E}(Y)$ in general does not imply $\mathbb{E}(X|Y)=\mathbb{E}(X)$.
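
The distinction between $\mathbb{E}(Y|X=x)$ and $\mathbb{E}(Y|X)$, and special case 2 of the law of iterated expectation, can be sketched with sample analogs. The example assumes a discrete $X$ taking values in $\{1,2,3,4\}$ and a made-up relationship $Y=2X+\text{noise}$, with group means standing in for conditional expectations:

```r
set.seed(3)
n <- 5000
## Made-up example: X is a discrete regressor, Y depends on X plus noise.
x <- sample(1:4, n, replace = TRUE)
y <- 2 * x + rnorm(n)

## Sample analog of E(Y | X = x): one nonrandom number per value of x.
cond_mean <- tapply(y, x, mean)
cond_mean

## Sample analog of E(Y | X): evaluate the function above at each observation's own X.
## This is a random variable whose randomness comes only from X.
e_y_given_x <- cond_mean[as.character(x)]

## Law of iterated expectation (special case 2): E[E(Y|X)] = E(Y).
c(mean(e_y_given_x), mean(y))
```

The two averages coincide (up to rounding) because the average of group means, weighted by group frequencies, is the overall mean.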

### 2.3 Independence, Conditional Mean Independence, and Zero Correlation

So far we have learned three different notions characterizing the relationship between two random variables: independence, conditional mean independence, and zero correlation. How are they related?

We have
$$
\begin{align*}
X\perp Y\equiv Y\perp X&\implies \mathbb{E}(Y|X)=\mathbb{E}(Y)\  \& \ \mathbb{E}(X|Y)=\mathbb{E}(X),\\
\mathbb{E}(Y|X)=\mathbb{E}(Y)\  \text{or} \ \mathbb{E}(X|Y)=\mathbb{E}(X)&\implies \rho(X,Y)=0\Leftrightarrow cov(X,Y)=0.
\end{align*}
$$

* A useful trick to prove the second implication:
  $$
  \mathbb{E}(XY)=\mathbb{E}[\mathbb{E}(XY|X)]=\mathbb{E}(X\mathbb{E}(Y|X))=\mathbb{E}(X\mu_{Y})=\mu_{X}\mu_{Y}.
  $$

* A generalization which follows the same proof as above:
  $$
  \mathbb{E}(Y|X)=\mathbb{E}(Y)\implies \mathbb{E}(Y\cdot g(X))=\mathbb{E}(Y)\mathbb{E}(g(X)).
  $$
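
The implications above do not reverse in general. A standard counterexample, sketched here by simulation under the assumption $X\sim N(0,1)$ and $Y=X^{2}$: by symmetry $\mathbb{E}(X|Y)=0=\mathbb{E}(X)$, so $cov(X,Y)=0$, yet $Y$ is a deterministic function of $X$ and $\mathbb{E}(Y|X)=X^{2}\neq\mathbb{E}(Y)$.

```r
set.seed(4)
n <- 100000
x <- rnorm(n)
y <- x^2        # Y is a function of X, so X and Y are NOT independent

## Sample correlation between X and Y: close to zero.
cor(x, y)

## The dependence shows up once we look beyond linear co-movement in X itself:
cor(abs(x), y)
```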

A special case in which the reverse implication also holds is the joint normal distribution.

**Definition**. If $X$ and $Y$ are normal and $aX+bY$ is also normal for all $a$ and $b$, then the joint distribution of $(X,Y)$ is joint-normal.

**Theorem**. If $(X,Y)$ is joint-normal, then $\rho_{XY}=0\Leftrightarrow X\perp Y$.

In [None]:
library(ggplot2)
library(mvtnorm)
library(plotly)
library(latex2exp)
library(ggpointdensity)
library(weights)

x=seq(-4,4,length=100)
y=seq(-4,4,length=100)
## A joint normal distribution is fully characterized by its means, variances, and covariances
sigma_pos=matrix(c(1,0.5,0.5,1),ncol=2) # X and Y positively correlated
sigma_neg=matrix(c(1,-0.5,-0.5,1),ncol=2) # X and Y negatively correlated
sigma_ind=matrix(c(1,0,0,1),ncol=2) # X and Y independent
mu=c(0,0) # mean

f_pos=function(x,y){dmvnorm(cbind(x,y),mu,sigma_pos)}
f_neg=function(x,y){dmvnorm(cbind(x,y),mu,sigma_neg)}
f_ind=function(x,y){dmvnorm(cbind(x,y),mu,sigma_ind)}
z_pos=outer(x,y,f_pos)
z_neg=outer(x,y,f_neg)
z_ind=outer(x,y,f_ind)
pp_pos=plot_ly(x=x,y=y,z=z_pos)%>% add_surface()
pp_neg=plot_ly(x=x,y=y,z=z_neg)%>% add_surface()
pp_ind=plot_ly(x=x,y=y,z=z_ind)%>% add_surface()
pp_ind
# pp_neg
# pp_pos

### 2.4 Joint

For a joint distribution of a vector of random variables $X=(X_{1},...,X_{k})'$, 

* The mean $\mu_{X}$ is a $k\times 1$ vector $(\mu_{X_{1}},\ldots,\mu_{X_{k}})'$. 
* The variance $\mathbb{V}(X)$ (usually called the *variance-covariance matrix* or simply *covariance matrix*) is defined as $\mathbb{E}[(X-\mu_{X})(X-\mu_{X})']$. It is a $k\times k$ symmetric positive semi-definite matrix: (why?)

$$
\begin{pmatrix}\sigma_{X_{1}}^{2}&cov(X_{1},X_{2})&\cdots&cov(X_{1},X_{k})\\
cov(X_{2},X_{1})&\sigma_{X_{2}}^{2}&\cdots&cov(X_{2},X_{k})\\
\vdots&\vdots&\ddots&\vdots\\
cov(X_{k},X_{1})&cov(X_{k},X_{2})&\cdots&\sigma_{X_{k}}^{2}\end{pmatrix}.
$$

+ Can you come up with a sufficient condition so that the covariance matrix is positive definite?
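
A quick numerical look at these properties, using a sample covariance matrix from made-up data (three simulated variables, the third constructed from the first two plus noise). This is a check on one example, not a proof:

```r
set.seed(6)
n <- 500
## Made-up data: three variables, the third correlated with the first two.
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 - 0.5 * x2 + rnorm(n)
X  <- cbind(x1, x2, x3)

S <- cov(X)      # 3 x 3 sample covariance matrix
S

## Symmetry: S equals its transpose.
isSymmetric(unname(S))

## Positive semi-definiteness: no eigenvalue is negative (up to rounding).
ev <- eigen(S, symmetric = TRUE)$values
ev
```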

## 3. Large Sample Theory

For almost the entire semester (except for panel data), we will assume our sample $\{Y_{i},X_{i}\}_{i=1,...,n}$ to be *i.i.d.*, that is, **independent and identically distributed**.

* i.i.d. is assumed **across $i$**, that is, we assume Alice and Bob are independent; knowing Alice's income, age, and education does not help us to know Bob's income, age, and education.
* Independence is **NOT** assumed **within** $i$, that is, Alice's age and education are allowed to be correlated with Alice's income.
* Formally, $(Y_{i},X_{i})\perp (Y_{j},X_{j})$ for all $i\neq j$, and all $(Y_{i},X_{i})$ have the same distribution.

In this semester, we will **never** make any assumptions about the exact distribution of $(Y_{i},X_{i})$. For instance, you will never hear me saying that income is normally distributed etc. Then how do we handle the randomness? What's the point of learning all those probability theories?

* Recall the definition of mean, variance, covariance.... They are all built on $\mathbb{E}$, which is built on the density. Without knowing the exact distribution of the sample, we don't know the density, and thus we don't know the exact value of any of the quantities we introduced.

Luckily, it turns out that regardless of the *true* distribution of the sample, which is unknown, we can approximate the population mean by the sample average, and approximate the distribution of the (suitably centered and scaled) sample average by a normal distribution, as long as the sample size is sufficiently large. The former result is called *the weak law of large numbers* (WLLN) and the latter is called *the central limit theorem* (CLT).

### 3.1 WLLN

For an arbitrary vector $X$, define its Euclidean distance from 0 by $\|X\|\equiv \sqrt{X'X}$.

**Definition (Convergence in Probability)**. A sequence of $k\times 1$ random vectors $W_{n}$ is said to converge to a vector $w$ **in probability** if **for all** $\delta>0$, $\lim_{n\to\infty}\Pr(\|W_{n}-w\|<\delta)=1$. The vector $w$ is called the **probability limit**, or p-limit, of $W_{n}$. Convergence in probability is denoted by $W_{n}\to_{p}w$ or equivalently $(W_{n}-w)\to_{p}0$. We also say that *$W_{n}$ is consistent for $w$*.

* If $W_{n}$ is not random, convergence is defined as in calculus: for any $\delta>0$, $\|W_{n}-w\|<\delta$ for sufficiently large $n$.
* Now $W_{n}$ is random, so $\|W_{n}-w\|<\delta$ is an event that does not necessarily happen. However, consistency says that the probability that it happens approaches 1 as $n$ grows.

**Theorem (WLLN)**. Suppose we have $n$ i.i.d. observations of a $k\times 1$ random vector: $(X_{ij})$, $i=1,...,n$, $j=1,...,k$. Suppose for each $j=1,...,k$, $\mathbb{E}(X_{ij})=\mu_{j}$ and $\mathbb{V}(X_{ij})=\sigma_{j}^{2}<\infty$. Then the $k\times 1$ vector of sample averages $\bar{X}\equiv (\bar{X}_{1},...,\bar{X}_{k})'$ satisfies $\bar{X}\to_{p}\mu_{X}$, where $\mu_{X}\equiv (\mu_{1},...,\mu_{k})'$.

Before we move on, what is the intuition behind WLLN? For simplicity, suppose $k=1$. 

* First, WLLN says a random variable is close to a nonrandom number. How is that possible? What makes a variable nonrandom?
  * 0 variance.
* Then, what is the variance of $\bar{X}$?
  * By independence, $\mathbb{V}(\bar{X})=\mathbb{V}\left(\frac{X_{1}+\cdots +X_{n}}{n}\right)=\frac{1}{n^{2}}(n\sigma_{X}^{2})=\frac{\sigma_{X}^{2}}{n}\to 0$.
* So, the fundamental driving force of convergence in probability in WLLN is that the variance of $\bar{X}$ shrinks to 0 as $n$ goes to $\infty$. In this way, the random $\bar{X}$ becomes more and more like a nonrandom number as $n$ increases.
* Now, what happens if we multiply $\bar{X}$ by $\sqrt{n}$? You can verify that the variance of $\sqrt{n}\bar{X}$ stays constant at $\sigma_{X}^{2}$ no matter what $n$ is. This brings up our next theorem, the CLT.
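
The variance calculation above can be checked by simulation. The sketch below assumes Bernoulli(0.5) data, so $\sigma_{X}^{2}=0.25$. As a shortcut, the sum of $n$ Bernoulli draws is generated in one call as a binomial draw with `size = n`; `reps` is the (arbitrary) number of simulated samples per sample size.

```r
set.seed(7)
reps  <- 5000                 # number of simulated samples per n
sizes <- c(10, 100, 1000)
p     <- 0.5                  # Bernoulli(p): sigma^2 = p(1-p) = 0.25

sim <- sapply(sizes, function(n) {
  ## Sample average of n Bernoulli(p) draws, repeated `reps` times.
  xbar <- rbinom(reps, size = n, prob = p) / n
  c(var_xbar = var(xbar), var_scaled = var(sqrt(n) * xbar))
})
colnames(sim) <- paste0("n=", sizes)
sim
```

The first row should shrink roughly like $0.25/n$, while the second row should stay near $0.25$ for every $n$.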

### 3.2 CLT

**Definition (Convergence in Distribution)**. A sequence of $k\times 1$ random vectors $W_{n}$ is said to converge to a random vector $W$ **in distribution** if $\Pr(W_{n}\in A)\to \Pr(W\in A)$ for all sets $A\subseteq \mathbb{R}^{k}$ whose boundary $\partial A$ satisfies $\Pr(W\in\partial A)=0$. Convergence in distribution is denoted by $W_{n}\to_{d}W$.

* The convergence in the definition is deterministic: It's convergence of a nonrandom function $\Pr$ to another nonrandom function.

**Theorem (CLT)**. Suppose we have $n$ i.i.d. observations of a $k\times 1$ random vector: $(X_{ij})$, $i=1,...,n$, $j=1,...,k$. Suppose for each $j=1,...,k$, $\mathbb{E}(X_{ij})=\mu_{j}$ and $\mathbb{V}(X_{ij})=\sigma_{j}^{2}<\infty$. Then the $k\times 1$ vector of sample averages $\bar{X}\equiv (\bar{X}_{1},...,\bar{X}_{k})'$ satisfies $\sqrt{n}(\bar{X}-\mu_{X})\to_{d}N(0,\Sigma)$, where $\mu_{X}\equiv (\mu_{1},...,\mu_{k})'$ and $\Sigma\equiv \mathbb{V}(X_{i})$ is the $k\times k$ covariance matrix.

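* CLT says that if we center the sample average at $\mu_{X}$ and scale it up by $\sqrt{n}$, the resulting sequence converges in distribution to a normal distribution (to the standard normal when $k=1$ and we further divide by $\sigma_{X}$).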
  * From the last part of [Section 3.1](#3.1 WLLN), this is because the variance no longer shrinks to 0.
  * A formal proof is demanding. Be prepared to see characteristic function/moment generating function in the proof. Not required.

In [None]:
library(scales)
library(ggplot2)
library(mvtnorm)
library(latex2exp)
library(ggpointdensity)
library(weights)
## WLLN
set.seed(12)
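e=0.1 # tolerance epsilon in Pr(|sample average - 0.5| < e)
m=0   # sample averages (filled in the loop below)
n=0   # sample size used for each sample average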
  for (b in 1:200){
      m[1+10*(b-1)]=mean(rbinom(10,1,0.5))
      m[2+10*(b-1)]=mean(rbinom(20,1,0.5))
      m[3+10*(b-1)]=mean(rbinom(30,1,0.5))
      m[4+10*(b-1)]=mean(rbinom(40,1,0.5))
      m[5+10*(b-1)]=mean(rbinom(50,1,0.5))
      m[6+10*(b-1)]=mean(rbinom(60,1,0.5))
      m[7+10*(b-1)]=mean(rbinom(70,1,0.5))
      m[8+10*(b-1)]=mean(rbinom(80,1,0.5))
      m[9+10*(b-1)]=mean(rbinom(90,1,0.5))
      m[10+10*(b-1)]=mean(rbinom(100,1,0.5))
         
      n[1+10*(b-1)]=10
      n[2+10*(b-1)]=20
      n[3+10*(b-1)]=30
      n[4+10*(b-1)]=40
      n[5+10*(b-1)]=50
      n[6+10*(b-1)]=60
      n[7+10*(b-1)]=70
      n[8+10*(b-1)]=80
      n[9+10*(b-1)]=90
      n[10+10*(b-1)]=100
      
}
data1=data.frame(m=m,n=n)
data2=data1[order(n),]
mm=data2$m
prop=0
for (i in 1:10){
  prop[i]=sum((abs(mm[(1+200*(i-1)):(200+200*(i-1))]-0.5)<e))/200
}
#name=paste0("Pr(|μ-0.5|<ε)=",prop,", n=",n)
name=paste0("Pr=",prop,", n=",n)
datap=data.frame(data1,name)
datap=datap[order(n),]

## Figure 1 for WLLN
ggplot(data1,aes(n,m))+geom_bin2d()+scale_fill_gradient(low = "#efedf5", high = "#756bb1",trans="log10")+
scale_x_continuous(breaks = seq(0, 100, 10))+
  geom_hline(yintercept = 0.4,linetype="dashed")+
  geom_hline(yintercept = 0.6,linetype="dashed")+labs(colour=TeX('$|\\bar{X}-0.5|\\leq\\epsilon$'))+ylab("Sample averages")

## Figure 2 for WLLN
ggplot(datap, aes(n, m)) +geom_bin2d()+scale_fill_gradient(low = "#efedf5", high = "#756bb1",trans="log10")+
  facet_wrap(~name, ncol=5)+geom_hline(yintercept = 0.4,linetype="dashed")+
  scale_x_continuous(breaks = seq(0, 100, 20))+
  geom_hline(yintercept = 0.6,linetype="dashed")+labs(colour=TeX('$|\\bar{X}-0.5|\\leq\\epsilon$'))+ylab("Sample averages")

### CLT
namen=paste0("n=",n)
data3=data.frame(data1,namen) # passed to geom_histogram and stat_function
data3$m=sqrt(data3$n)*(data3$m-0.5)/0.5
data3$namen=factor(data3$namen,levels=c("n=10","n=20","n=30","n=40","n=50","n=60","n=70","n=80","n=90","n=100"))
ggplot(data3, aes(m, color=namen)) +
  theme_bw() +
  geom_histogram(aes(y=after_stat(density)), binwidth = 0.5,
                 color="white",fill = "#cdc7e0", size = 1) +
  stat_function(fun = function(x) dnorm(x, mean = 0, sd = 1),
                color = "gray61", size = 1)+
  facet_wrap(~namen, ncol=5)+xlab(TeX('$\\frac{\\sqrt{n}(\\bar{X}-\\mu_{X})}{\\sigma_{X}}$'))

### 3.3 Some Useful Properties

We will use the following properties from time to time. 

* If $X_{n}\to_{p}X$, then $X_{n}\to_{d}X$. 
* (Slutsky's theorem) If $X_{n}\to _{d}X$ and $Y_{n}\to_{p}y$, where $y$ is nonrandom, then
  * $X_{n}+Y_{n}\to_{d}X+y$.
  * $X_{n}Y_{n}\to_{d}yX$.
  * $X_{n}/Y_{n}\to_{d}X/y$ provided that $y\neq 0$.
* (Continuous mapping theorem, CMT) If the function $g$ is continuous almost everywhere (more precisely, on a set in which the limit $X$ falls with probability 1), then
  * $X_{n}\to_{d}X\implies g(X_{n})\to_{d}g(X)$.
  * $X_{n}\to_{p} X\implies g(X_{n})\to_{p}g(X)$. 
* (Delta method) Suppose $X_{n}$ is a $k\times 1$ random vector, $X$ is a nonrandom $k\times 1$ vector, and $g:\mathbb{R}^{k}\mapsto\mathbb{R}^{l}$. Then $\sqrt{n}(X_{n}-X)\to_{d}N(0,\Sigma)\implies\sqrt{n}(g(X_{n})-g(X))\to_{d}N(0,\partial_{X}'g(X)\Sigma\partial_{X}g(X))$, provided that the Jacobian $\partial_{X}g(X)$ exists and is not zero.
  * Don't worry if this looks too abstract. We'll see examples later when we study least squares.
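
In the meantime, here is a scalar simulation sketch of the delta method. All choices are illustrative assumptions: the data are exponential with mean $\mu=2$ (so $\sigma^{2}=4$), $X_{n}$ is the sample average, and $g(x)=\log(x)$, so the predicted limiting variance is $g'(\mu)^{2}\sigma^{2}=(1/\mu)^{2}\mu^{2}=1$.

```r
set.seed(8)
reps <- 5000                 # number of simulated samples
n    <- 400                  # sample size
mu   <- 2                    # exponential with mean 2 has variance sigma^2 = 4
g    <- function(x) log(x)   # g'(mu) = 1 / mu

## Draw `reps` samples of size n and compute sqrt(n) * (g(Xbar) - g(mu)) for each.
xbar <- replicate(reps, mean(rexp(n, rate = 1 / mu)))
stat <- sqrt(n) * (g(xbar) - g(mu))

## Simulated variance versus the delta-method prediction g'(mu)^2 * sigma^2.
delta_var <- (1 / mu)^2 * mu^2
c(simulated_var = var(stat), delta_method_var = delta_var)
```

The simulated variance should be close to, but not exactly, the prediction, since $n$ and `reps` are finite.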