#  Moment matching

* The central limit theorem states that the distribution of a sum $X_1+\cdots +X_n$ of independent, identically distributed random variables can be approximated with a normal distribution $\mathcal{N}(\mu, \sigma)$ for some $\mu$ and $\sigma$.
* Moment matching is a heuristic way to guess these parameters $\mu$ and $\sigma$ from observations.

## I. Empirical parameter estimation

Let $x_1, \ldots, x_n$ be the observations. Then we can define the following point estimates:

\begin{align*}
\hat{\mu}&=\frac{1}{n}\cdot \sum_{i=1}^n x_i\\
\hat{\sigma}&=\sqrt{\frac{1}{n}\cdot \sum_{i=1}^n (x_i-\hat{\mu})^2}
\end{align*}

The central limit theorem again assures that $\hat{\mu}$ is very close to the true mean $\mu$ when the number of observations $n$ is large. For the same reason, $\hat{\sigma}$ is very close to the true standard deviation $\sigma$.
For smaller datasets, statisticians use modified formulae that are likely to be closer to the true values.
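
The two point estimates can be sketched in a few lines of Python. This is only an illustration: the `observations` list and the helper name `point_estimates` are hypothetical, and `statistics.stdev` is shown as one common small-sample modification (it divides by $n-1$ instead of $n$), not necessarily the one the text has in mind.

```python
import math
import statistics


def point_estimates(xs):
    """Return (mu_hat, sigma_hat) computed with the 1/n formulas."""
    n = len(xs)
    mu_hat = sum(xs) / n
    sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in xs) / n)
    return mu_hat, sigma_hat


# Hypothetical observations, chosen only for illustration.
observations = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mu_hat, sigma_hat = point_estimates(observations)

# One common small-sample modification divides by n - 1 instead of n
# (Bessel's correction); statistics.stdev implements that variant.
sigma_corrected = statistics.stdev(observations)

print(mu_hat, sigma_hat, sigma_corrected)
```

For any dataset with at least two distinct values the corrected estimate is slightly larger than $\hat{\sigma}$, and the difference vanishes as $n$ grows.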

## II. Moment matching for the sum

To find the approximation for the sum 

\begin{align*}
S=X_1+\cdots+X_n
\end{align*}

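given point estimates $\hat{\mu}$ and $\hat{\sigma}$, we first compute the theoretical mean and variance of the sum, assuming that the $X_i$ are independent and share the same mean $\mu$ and standard deviation $\sigma$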

\begin{align*}
\mathbf{E}(S) &= \mathbf{E}\left(\sum_{i=1}^n X_i\right) =\sum_{i=1}^n \mathbf{E}(X_i)=n\mu\\
\mathbf{D}(S) &= \mathbf{D}\left(\sum_{i=1}^n X_i\right) =\sum_{i=1}^n \mathbf{D}(X_i)=n\sigma^2\enspace.
\end{align*}

Now we substitute theoretical parameters with their point estimates and get 

\begin{align*}
S\approx\mathcal{N}(n\hat{\mu}, \sqrt{n}\hat{\sigma})\enspace.
\end{align*}
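
As a rough sketch, the approximation for $S$ can be turned into a distribution object with `statistics.NormalDist` from the Python standard library. The helper name `sum_approximation` and the numeric values of `mu_hat`, `sigma_hat` and `n` are hypothetical and serve only as an example.

```python
import math
from statistics import NormalDist


def sum_approximation(mu_hat, sigma_hat, n):
    """Normal approximation N(n * mu_hat, sqrt(n) * sigma_hat) for S."""
    return NormalDist(mu=n * mu_hat, sigma=math.sqrt(n) * sigma_hat)


# Hypothetical point estimates and number of summands.
mu_hat, sigma_hat, n = 5.0, 2.0, 100

s_dist = sum_approximation(mu_hat, sigma_hat, n)
print(s_dist.mean, s_dist.stdev)

# Approximate probability that the sum falls into [480, 520],
# i.e. within one standard deviation of the mean.
p = s_dist.cdf(520) - s_dist.cdf(480)
print(round(p, 3))
```

Note that the second parameter is a standard deviation, which matches the $\mathcal{N}(\mu, \sigma)$ convention used here.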

## III. Moment matching for the average

To find the approximation for the average 

\begin{align*}
 A=\frac{X_1+\cdots+X_n}{n}
\end{align*} 
 
given point estimates $\hat{\mu}$ and $\hat{\sigma}$, we first compute the theoretical mean and variance for the average

\begin{align*}
\mathbf{E}(A) &= \mathbf{E}\left(\frac{1}{n}\cdot\sum_{i=1}^n X_i\right) =\frac{1}{n}\cdot \sum_{i=1}^n \mathbf{E}(X_i)=\mu\\
\mathbf{D}(A) &= \mathbf{D}\left(\frac{1}{n}\cdot \sum_{i=1}^n X_i\right) =\frac{1}{n^2}\cdot \sum_{i=1}^n \mathbf{D}(X_i)=\frac{\sigma^2}{n}\enspace.
\end{align*}

Now we substitute theoretical parameters with their point estimates and get 

\begin{align*}
A\approx\mathcal{N}\left(\hat{\mu}, \frac{\hat{\sigma}}{\sqrt{n}}\right)\enspace.
\end{align*}
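
The scaling $\sigma/\sqrt{n}$ can be checked with a small simulation. The sketch below assumes, purely for illustration, that each $X_i$ is a fair six-sided die roll, so that the true $\mu$ and $\sigma$ are known and can stand in for the point estimates; the values of `n`, `repetitions` and the seed are arbitrary choices.

```python
import math
import random
import statistics
from statistics import NormalDist

# Assumption for illustration: each X_i is a fair six-sided die roll,
# so the true parameters are known exactly.
faces = [1, 2, 3, 4, 5, 6]
mu = statistics.fmean(faces)      # mean of a single roll
sigma = statistics.pstdev(faces)  # standard deviation of a single roll

n = 50              # number of terms in each average
repetitions = 2000  # number of simulated averages

# Moment-matched approximation for the average A.
a_dist = NormalDist(mu=mu, sigma=sigma / math.sqrt(n))

rng = random.Random(0)  # fixed seed so that the run is repeatable
averages = [
    statistics.fmean([rng.choice(faces) for _ in range(n)])
    for _ in range(repetitions)
]

print("predicted:", a_dist.mean, a_dist.stdev)
print("simulated:", statistics.fmean(averages), statistics.pstdev(averages))
```

The simulated mean and standard deviation of the averages should land close to the predicted values, with some sampling noise that depends on `repetitions`.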

## IV. Theoretical approximation

The convergence of distributions is an elusive concept that can be simplified into a practical criterion:

A sequence of one-dimensional distributions $\mathcal{D}_1,\mathcal{D}_2,\ldots$ converges to a distribution $\mathcal{D}$ if for any range $[a,b]$ and any tolerance $\delta>0$ the corresponding probabilities are linked by

\begin{align*}
|\Pr[x\gets \mathcal{D}: x\in [a,b]]-\Pr[x\gets \mathcal{D}_n: x\in [a,b]]|\leq \delta 
\end{align*}

for large enough $n$. 
This implies that we can use the distribution $\mathcal{D}$ to approximate probabilities

\begin{align*}
\Pr[x\gets \mathcal{D}: x\in [a,b]] - \delta \leq \Pr[x\gets \mathcal{D}_n: x\in [a,b]] \leq \Pr[x\gets \mathcal{D}: x\in [a,b]] + \delta
\end{align*}

when $n$ is large enough. 
Note that $n$ can depend on the endpoints $a$ and $b$, i.e. the approximation is not universal.
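
To make the criterion concrete, the sketch below measures the gap between the two probabilities for one specific sequence. As an assumption made only for this example, $\mathcal{D}_n$ is the distribution of the standardised number of heads in $n$ fair coin flips and $\mathcal{D}$ is the standard normal distribution; the function names are hypothetical.

```python
import math
from statistics import NormalDist


def binomial_interval_probability(n, a, b):
    """Exact Pr[a <= Z_n <= b], where Z_n is the standardised number
    of heads in n fair coin flips."""
    mean = n / 2
    stdev = math.sqrt(n) / 2
    favourable = sum(
        math.comb(n, k)
        for k in range(n + 1)
        if a <= (k - mean) / stdev <= b
    )
    return favourable / 2 ** n


def gap(n, a, b):
    """|Pr_D[a, b] - Pr_{D_n}[a, b]| with D the standard normal."""
    limit = NormalDist()
    normal_p = limit.cdf(b) - limit.cdf(a)
    return abs(normal_p - binomial_interval_probability(n, a, b))


for n in (10, 1000):
    print(n, gap(n, -1.0, 1.0))
```

Because $\mathcal{D}_n$ is discrete in this example, the gap need not shrink monotonically in $n$; the criterion only requires that it eventually stays below $\delta$.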

## V. Practical approximation

In practice we want to know in which range the sum $S$ and the average $A$ are most likely to lie.
For that we blindly assume that the approximation error $\delta$ is negligible for our calculations.
As the density of the normal distribution is highest around the mean, we can consider intervals centred at the mean that extend one, two, or three standard deviations to each side.

### Approximation for sum $S$

* 68.3% confidence interval
$[n\hat{\mu}-1\sqrt{n}\hat{\sigma}, n\hat{\mu}+1\sqrt{n}\hat{\sigma}]$

* 95.4% confidence interval
$[n\hat{\mu}-2\sqrt{n}\hat{\sigma}, n\hat{\mu}+2\sqrt{n}\hat{\sigma}]$

* 99.7% confidence interval
$[n\hat{\mu}-3\sqrt{n}\hat{\sigma}, n\hat{\mu}+3\sqrt{n}\hat{\sigma}]$

### Approximation for average $A$

* 68.3% confidence interval
$\left[\hat{\mu}-\frac{\hat{\sigma}}{\sqrt{n}}, \hat{\mu}+\frac{\hat{\sigma}}{\sqrt{n}}\right]$

* 95.4% confidence interval
$\left[\hat{\mu}-\frac{2\hat{\sigma}}{\sqrt{n}}, \hat{\mu}+\frac{2\hat{\sigma}}{\sqrt{n}}\right]$

* 99.7% confidence interval
$\left[\hat{\mu}-\frac{3\hat{\sigma}}{\sqrt{n}}, \hat{\mu}+\frac{3\hat{\sigma}}{\sqrt{n}}\right]$


Traditionally one reports 68% confidence intervals to show the variability.
This is usually expressed as number $\pm$ standard deviation (e.g. $55\pm 3$ %).
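
The intervals for $S$ and $A$ and the probability mass they cover under a normal distribution can be computed as follows. This is a minimal sketch: the helper names `k_sigma_interval` and `coverage` and the values of `mu_hat`, `sigma_hat` and `n` are hypothetical.

```python
import math
from statistics import NormalDist


def k_sigma_interval(center, spread, k):
    """Interval [center - k * spread, center + k * spread]."""
    return center - k * spread, center + k * spread


def coverage(k):
    """Probability that a normal variable lies within k standard
    deviations of its mean."""
    std_normal = NormalDist()
    return std_normal.cdf(k) - std_normal.cdf(-k)


# Hypothetical point estimates and sample size.
mu_hat, sigma_hat, n = 5.0, 2.0, 100

for k in (1, 2, 3):
    sum_interval = k_sigma_interval(n * mu_hat, math.sqrt(n) * sigma_hat, k)
    avg_interval = k_sigma_interval(mu_hat, sigma_hat / math.sqrt(n), k)
    print(f"{100 * coverage(k):.1f}%  S in {sum_interval}  A in {avg_interval}")

# Reporting the average in the "number ± standard deviation" style.
print(f"{mu_hat:.1f} ± {sigma_hat / math.sqrt(n):.1f}")
```

The coverage values depend only on $k$, not on the estimates, which is why the same three percentages appear for both the sum and the average.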