In [1]:
from common.setup_notebook import set_css_style
set_css_style()

# <center> Moments of a distribution and related quantities

In the following, we will use $X$ to represent a random variable living in sample space (the space of all possible values it can assume) $\Omega$.

In the discrete case, the probability of each value $x_i$ will be represented as $p_i = P(X = x_i)$; in the continuous case $p(x)$ will be the probability density function, so that $P(a \leq X \leq b) = \int_a^b \text{d} x \ p(x)$.

Let's start with the mean and the variance, and then we'll give the general definitions.

## Expected Value

The **expected value**, or **expectation**, or **mean value**, is defined, in the *continuous* case, as

$$
\mathbb{E}[X] = \int_\Omega \text{d} x \ x p(x) \ .
$$

Similarly, in the *discrete* case,

$$
\mathbb{E}[X] = \sum_{i=1}^N p_i x_i \ .
$$

The expectation is the average of all the possible values the random variable can assume, where each value is weighted by its probability.

The expected value is typically indicated with $\mu$. Note that the arithmetic mean, computed on a sample of the data, is an estimator of the (theoretical) expected value: it stands in for it when the whole population is not known.
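As a minimal sketch of the two ideas above, the snippet below computes the expected value of a fair six-sided die from its probabilities and compares it with the arithmetic mean of a simulated sample. The die, the sample size and the seed are illustrative assumptions, not part of the original text.

```python
import random

# Discrete distribution of a fair six-sided die: values x_i with probabilities p_i.
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

# Expected value: probability-weighted sum of the possible values.
expected_value = sum(p * x for p, x in zip(probs, values))

# Arithmetic mean of a finite sample, used as an estimator of the expected value.
random.seed(0)  # fixed seed so that the run is reproducible
sample = random.choices(values, weights=probs, k=100_000)
sample_mean = sum(sample) / len(sample)

print(f"expected value: {expected_value:.4f}")
print(f"sample mean:    {sample_mean:.4f}")
```

The sample mean fluctuates around the expected value and gets closer to it as the sample grows.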

### Linearity of the expected value

The expected value is a linear operator:

$$
\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b \mathbb{E}[Y]
$$

*Proof*

We will prove this in the continuous case; the discrete case is analogous, with sums in place of integrals.

$$
\begin{align}
\mathbb{E}[aX + bY] &= \int\limits_{\Omega_X} \int\limits_{\Omega_Y} \text{d} x \ \text{d} y \ (ax + by) p(x, y) \\
&= a \int\limits_{\Omega_X} \int\limits_{\Omega_Y} \text{d} x \ \text{d} y \ x p(x, y) + b \int\limits_{\Omega_X} \int\limits_{\Omega_Y} \text{d} x \ \text{d} y \ y p(x, y) \\
&= a \int\limits_{\Omega_X} \text{d} x \ x p(x) + b \int\limits_{\Omega_Y} \text{d} y \ y p(y) \\
&= a\mathbb{E}[X] + b \mathbb{E}[Y]
\end{align}
$$

The third step uses the marginals $p(x) = \int\limits_{\Omega_Y} \text{d} y \ p(x, y)$ and, analogously, $p(y) = \int\limits_{\Omega_X} \text{d} x \ p(x, y)$: integrating the joint PDF over all the possible values of one random variable eliminates the dependence on it.
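The proof never assumes that $X$ and $Y$ are independent. A small sketch on a discrete joint distribution illustrates this; the joint probabilities and the constants `a` and `b` are made up for the example, and `expectation` is a hypothetical helper.

```python
# Hypothetical joint distribution p(x, y) of two dependent binary variables.
joint = {
    (0, 0): 0.4,
    (0, 1): 0.1,
    (1, 0): 0.1,
    (1, 1): 0.4,
}
a, b = 2.0, -3.0


def expectation(f, joint):
    """E[f(X, Y)] as a probability-weighted sum over the joint distribution."""
    return sum(p * f(x, y) for (x, y), p in joint.items())


# Left-hand side: expectation of the linear combination.
lhs = expectation(lambda x, y: a * x + b * y, joint)

# Right-hand side: linear combination of the individual expectations.
e_x = expectation(lambda x, y: x, joint)
e_y = expectation(lambda x, y: y, joint)
rhs = a * e_x + b * e_y

print(lhs, rhs)
```

The two printed numbers agree up to floating-point rounding, even though $X$ and $Y$ are correlated in this example.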

## Variance and standard deviation

The variance is the expected value of the squared difference from the expectation:

$$
Var[X] = \mathbb{E}[(X - \mathbb{E}[X])^2] =  \int_{\Omega_X} \text{d} x \ (x - \mathbb{E}[X])^2 p(x)
$$

The variance is the second moment around the mean. It is typically indicated as $\sigma^2$, where $\sigma$ is the **standard deviation**, a measure of how far values typically spread from the mean.

### Rewriting the variance

We can also write the variance as

$$
Var[X] = \mathbb{E}[X^2] - \big(\mathbb{E}[X]\big)^2
$$

*Proof*

$$
\begin{align}
Var[X] &= \mathbb{E}[(X - \mu)^2] \\
&= \int_{\Omega_X} \text{d}x \ (x^2 - 2 \mu x + \mu^2) p(x) \\
&= \int_{\Omega_X} \text{d}x \ x^2 p(x) -2 \mu \int_{\Omega_X} \text{d}x \ x p(x) + \mu^2 \int_{\Omega_X} \text{d} x p(x) \\
&= \mathbb{E}[X^2] - 2 \mu^2 + \mu^2 \\
&= \mathbb{E}[X^2] - \big(\mathbb{E}[X]\big)^2
\end{align}
$$
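The identity can be checked on a small discrete example. The sketch below, which assumes a fair six-sided die as the distribution, computes the variance both from the definition and from the rewritten form.

```python
# Fair six-sided die (illustrative choice of distribution).
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

# Mean: first raw moment.
mu = sum(p * x for p, x in zip(probs, values))

# Variance from the definition: E[(X - mu)^2].
var_definition = sum(p * (x - mu) ** 2 for p, x in zip(probs, values))

# Variance from the rewritten form: E[X^2] - (E[X])^2.
second_raw = sum(p * x ** 2 for p, x in zip(probs, values))
var_shortcut = second_raw - mu ** 2

print(var_definition, var_shortcut)
```

Both expressions give $35/12$ up to rounding. In floating-point arithmetic the rewritten form can lose precision when the mean is large compared with the standard deviation, so the definition is usually the safer choice numerically.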

### The variance is not linear

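Scaling a random variable by a constant $a$ scales its variance by $a^2$, not by $a$. In fact, using the linearity of the expectation,

$$
\begin{align}
Var[aX] &= \mathbb{E}[(aX)^2] - \big( \mathbb{E}[aX] \big)^2 \\
&= a^2 \mathbb{E}[X^2] - a^2 \mu^2 \\
&= a^2 Var[X]
\end{align}
$$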
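The quadratic scaling $Var[aX] = a^2 Var[X]$ implied by the derivation above can be checked numerically. This is a minimal sketch using the standard library's `statistics.pvariance` (population variance); the data and the constant `a` are arbitrary illustrative values.

```python
import statistics

# Arbitrary illustrative data, treated as the whole population.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
a = 3.0

# Population variance of X and of the scaled variable aX.
var_x = statistics.pvariance(data)
var_ax = statistics.pvariance([a * x for x in data])

print(var_x, var_ax, a ** 2 * var_x)
```

The variance of the scaled data matches $a^2$ times the original variance, not $a$ times it.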

## General definitions of moments


The $n$-th **raw moment** is the expected value of the $n$-th power of the random variable:

$$
\boxed{\mu_n' = \int \text{d} x \ x^n p(x)}
$$

The expected value is then the first raw moment.


The $n$-th **central moment** around the mean is defined as

$$
\boxed{\mu_n = \int \text{d} x (x-\mu)^n p(x)}
$$

The variance is the second central moment around the mean.

Moments are standardised (normalised) by dividing by the appropriate power of the standard deviation. The $n$-th **standardised moment** is the $n$-th central moment divided by the $n$-th power of the standard deviation:

$$
\boxed{\tilde \mu_n = \frac{\mu_n}{\sigma^n}}
$$
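The three boxed definitions translate almost literally into code for a discrete distribution, with sums replacing the integrals. The function names and the example distribution (the number of heads in two fair coin tosses) are assumptions made for this sketch.

```python
def raw_moment(values, probs, n):
    """n-th raw moment: E[X^n]."""
    return sum(p * x ** n for p, x in zip(probs, values))


def central_moment(values, probs, n):
    """n-th central moment: E[(X - mu)^n], with mu the first raw moment."""
    mu = raw_moment(values, probs, 1)
    return sum(p * (x - mu) ** n for p, x in zip(probs, values))


def standardised_moment(values, probs, n):
    """n-th standardised moment: central moment divided by sigma^n."""
    sigma = central_moment(values, probs, 2) ** 0.5
    return central_moment(values, probs, n) / sigma ** n


# Number of heads in two fair coin tosses.
values = [0, 1, 2]
probs = [0.25, 0.5, 0.25]

mean = raw_moment(values, probs, 1)
variance = central_moment(values, probs, 2)
third_standardised = standardised_moment(values, probs, 3)
fourth_standardised = standardised_moment(values, probs, 4)

print(mean, variance, third_standardised, fourth_standardised)
```

By construction the first central moment is always zero and the second standardised moment is always one, which is why the interesting standardised moments start at order three.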

## Skewness

The **skewness** is the third standardised moment:

$$
\gamma = \frac{\mathbb{E}[(X-\mu)^3]}{\sigma^3}
$$

The skewness quantifies how asymmetrical a distribution is around the mean: it is zero for a perfectly symmetrical shape. It is positive if the distribution is skewed to the right, that is, if the right tail is heavier than the left one; it is negative if it is skewed to the left, meaning the left tail is heavier than the right one.
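The sign convention can be seen on three tiny data sets. The sketch below applies the definition directly to the data (population moments, no bias correction); the helper name and the data are illustrative assumptions.

```python
def skewness(data):
    """Third standardised moment of the data, using population moments."""
    n = len(data)
    mu = sum(data) / n
    m2 = sum((x - mu) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mu) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]             # one large value stretches the right tail
left_skewed = [-x for x in right_skewed]    # mirror image of the previous data set

print(skewness(symmetric))
print(skewness(right_skewed))
print(skewness(left_skewed))
```

Mirroring the data around zero flips the sign of the skewness and leaves its magnitude unchanged.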

## Kurtosis

The **kurtosis** is the fourth standardised moment:

$$
\kappa = \frac{\mu_4}{\sigma^4}
$$

It measures how heavy the tails of a distribution are compared with those of a Gaussian with the same $\sigma$.
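A Gaussian has kurtosis 3, which serves as the reference point: a uniform distribution (no tails) has kurtosis 1.8, and a Laplace distribution (heavier tails) has kurtosis 6. The sketch below estimates these from simulated samples; the sample size, the seed and the choice of distributions are assumptions made for illustration.

```python
import random


def kurtosis(data):
    """Fourth standardised moment of the data, using population moments."""
    n = len(data)
    mu = sum(data) / n
    m2 = sum((x - mu) ** 2 for x in data) / n  # second central moment
    m4 = sum((x - mu) ** 4 for x in data) / n  # fourth central moment
    return m4 / m2 ** 2


random.seed(0)  # fixed seed so that the run is reproducible
n = 200_000

gaussian = [random.gauss(0.0, 1.0) for _ in range(n)]
uniform = [random.uniform(-1.0, 1.0) for _ in range(n)]
# The difference of two independent exponential draws follows a Laplace distribution.
laplace = [random.expovariate(1.0) - random.expovariate(1.0) for _ in range(n)]

k_gaussian = kurtosis(gaussian)
k_uniform = kurtosis(uniform)
k_laplace = kurtosis(laplace)

print(f"uniform:  {k_uniform:.2f}")
print(f"gaussian: {k_gaussian:.2f}")
print(f"laplace:  {k_laplace:.2f}")
```

The estimates are ordered as the tail weights suggest: uniform below Gaussian below Laplace.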

> TODO excess kurtosis and expand

## Further results

### Variance of a matrix of constants times a random vector

In general, with a matrix of constants $\mathbf{X}$ and a vector of observations (random variables) $\mathbf{a}$, using the linearity of the expected value so that $\mathbb{E}[\mathbf{X a}] = \mathbf{X} \mathbb{E}[\mathbf{a}]$, we have

$$
\begin{align}
    Var[\mathbf{X a}] &= \mathbb{E}[(\mathbf{X a} - \mathbb{E}[\mathbf{X a}])^2] \\
                      &= \mathbb{E}[(\mathbf{X a} - \mathbb{E}[\mathbf{X a}])(\mathbf{X a} - \mathbb{E}[\mathbf{X a}])^t] \\ 
                      &= \mathbb{E}[(\mathbf{X a} - \mathbf{X}\mathbb{E}[\mathbf{a}])(\mathbf{X a} - \mathbf{X}\mathbb{E}[\mathbf{a}])^t] \\
                      &= \mathbb{E}[(\mathbf{X a} - \mathbf{X}\mathbb{E}[\mathbf{a}])((\mathbf{X a})^t - (\mathbf{X}\mathbb{E}[\mathbf{a}])^t)] \\
                      &= \mathbb{E}[\mathbf{Xa}\mathbf{a}^t\mathbf{X}^t - \mathbf{Xa} \mathbb{E}[\mathbf{a}]^t \mathbf{X}^t - \mathbf{X} \mathbb{E}[\mathbf{a}]\mathbf{a}^t\mathbf{X}^t + \mathbf{X} \mathbb{E}[\mathbf{a}] \mathbb{E}[\mathbf{a}]^t\mathbf{X}^t] \\
                      &=  \mathbf{X} \mathbb{E}[\mathbf{a}\mathbf{a}^t] \mathbf{X}^t - \mathbf{X} \mathbb{E}[\mathbf{a}] \mathbb{E}[\mathbf{a}]^t \mathbf{X}^t - \mathbf{X} \mathbb{E}[\mathbf{a}] \mathbb{E}[\mathbf{a}^t] \mathbf{X}^t + \mathbf{X} \mathbb{E}[\mathbf{a}] \mathbb{E}[\mathbf{a}^t] \mathbf{X}^t \\
                      &= \mathbf{X} \mathbb{E}[\mathbf{a}\mathbf{a}^t] \mathbf{X}^t - 2 \mathbf{X} \mathbb{E}[\mathbf{a}] \mathbb{E}[\mathbf{a}]^t \mathbf{X}^t + \mathbf{X} \mathbb{E}[\mathbf{a}] \mathbb{E}[\mathbf{a}^t] \mathbf{X}^t \\
                      &= \mathbf{X} (\mathbb{E}[\mathbf{a} \mathbf{a}^t] - \mathbb{E}[\mathbf{a}] \mathbb{E}[\mathbf{a}^t]) \mathbf{X}^t \\
                      &= \mathbf{X} Var[\mathbf{a}] \mathbf{X}^t
\end{align}
$$
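The same identity holds for sample covariance matrices, which makes it easy to check with NumPy. In this sketch the matrix `X`, the `mixing` matrix used to correlate the components of $\mathbf{a}$, and the sample size are all made-up illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 10_000 draws of a 3-dimensional random vector a,
# stored one draw per column; the mixing matrix correlates the components.
mixing = np.array([[1.0, 0.0, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.2, 0.3, 1.0]])
a_samples = mixing @ rng.normal(size=(3, 10_000))

# Matrix of constants X (2 x 3), applied to every draw of a.
X = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 3.0]])

# np.cov treats each row as a variable and each column as an observation.
cov_a = np.cov(a_samples)        # 3 x 3 sample covariance of a
cov_Xa = np.cov(X @ a_samples)   # 2 x 2 sample covariance of X a
predicted = X @ cov_a @ X.T      # X Var[a] X^t

print(np.allclose(cov_Xa, predicted))
```

Because the sample covariance is built from the same linear operations as the derivation, the two matrices agree up to floating-point rounding, not just approximately in the large-sample limit.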

> TODO plots