In [1]:
%run ../../common/import_all.py

from common.setup_notebook import set_css_style, setup_matplotlib, config_ipython
config_ipython()
setup_matplotlib()
set_css_style()

# Covariance and correlation

## Covariance

Given the random variables $X$ and $Y$ with respective means $\mu_x$ and $\mu_y$, their *covariance* is defined as

$$
\text{cov}(X, Y) = \mathbb{E}[(X - \mu_x)(Y - \mu_y)]
$$

It measures how the two variables vary together: a positive covariance means that when $X$ grows, $Y$ tends to grow as well, and a negative covariance means that when $X$ grows, $Y$ tends to decrease.
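As a quick illustration of the sign of the covariance, here is a minimal sketch on synthetic data. The helper `covariance`, the seed and the slopes are made up for this example; the expectation is estimated with a sample mean (dividing by $n$).

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

x = rng.normal(size=10_000)
y_up = 2.0 * x + rng.normal(size=x.size)     # tends to grow when x grows
y_down = -2.0 * x + rng.normal(size=x.size)  # tends to decrease when x grows


def covariance(a, b):
    """Sample estimate of E[(A - mu_a)(B - mu_b)], dividing by n."""
    return np.mean((a - a.mean()) * (b - b.mean()))


cov_up = covariance(x, y_up)
cov_down = covariance(x, y_down)
print(cov_up, cov_down)
```

The first value should come out positive and the second negative. `np.cov(x, y_up, bias=True)[0, 1]` should return the same estimate as `covariance(x, y_up)`.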

## Correlation

*Correlation* is quantified by a *correlation coefficient*, which comes in several definitions depending on what exactly is being measured; it is always a sort of normalised covariance. 
The direct counterpart of the covariance is Pearson's definition, in which the correlation coefficient is the covariance normalised by the product of the standard deviations of the two variables:

$$
\rho_{xy} = \frac{\text{cov}(x, y)}{\sigma_x \sigma_y} = \frac{\mathbb{E}[(x - \mu_x)(y - \mu_y)]}{\sigma_x \sigma_y} \ ,
$$

and it can also be written as 

$$
\begin{align}
    \rho_{xy} &= \frac{\mathbb{E}[(xy - x \mu_y - \mu_x y + \mu_x \mu_y)]}{\sigma_x \sigma_y} \\
    &= \frac{\mathbb{E}[xy] - \mu_x\mu_y - \mu_y\mu_x + \mu_x\mu_y}{\sigma_x \sigma_y} \\
    &= \frac{\mathbb{E}[xy] - \mu_x\mu_y}{\sigma_x \sigma_y} \ .
\end{align}			
$$
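The equivalence of the two forms can be checked numerically. The sketch below uses synthetic data (the distribution parameters are arbitrary assumptions) and compares both expressions with NumPy's `np.corrcoef`.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
x = rng.normal(loc=3.0, scale=2.0, size=5_000)
y = 0.5 * x + rng.normal(size=x.size)

sigma_x, sigma_y = x.std(), y.std()  # population standard deviations (ddof=0)

# Definition: covariance divided by the product of the standard deviations
rho_def = np.mean((x - x.mean()) * (y - y.mean())) / (sigma_x * sigma_y)

# Expanded form: (E[xy] - mu_x mu_y) / (sigma_x sigma_y)
rho_expanded = (np.mean(x * y) - x.mean() * y.mean()) / (sigma_x * sigma_y)

# Reference value from NumPy
rho_numpy = np.corrcoef(x, y)[0, 1]

print(rho_def, rho_expanded, rho_numpy)
```

The three values should agree up to floating-point error, provided the means and the standard deviations use the same normalisation (here, dividing by $n$).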

The correlation coefficient has these properties:

* $-1 \leq \rho_{xy} \leq 1$
* It is symmetric: $\rho_{xy} = \rho_{yx}$
* If the variables are independent, then $\rho_{xy} = 0$ (but the converse is not true)

### Independence and correlation

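Let's expand on the last point. We said that if two random variables are [independent](../concepts/independence.ipynb), then the correlation coefficient is zero. This follows directly from the definition above: independence means that the joint distribution factorises, $P(x, y) = P(x) P(y)$, so the double integral splits into a product of two single integrals (bear in mind [Fubini's theorem](https://en.wikipedia.org/wiki/Fubini's_theorem)) and the numerator $\mathbb{E}[XY] - \mu_x \mu_y$ vanishes: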

$$
\mathbb{E}[XY] = \int_{\Omega_X } \int_{\Omega_Y} \text{d} x \text{d} y \ xy P(x,y) = \int_{\Omega_X } \int_{\Omega_Y} \text{d} x \text{d} y \ xy P(x) P(y) = \mu_x \mu_y \ .
$$

The converse is not true: a zero correlation coefficient does not imply independence. See this Q&A on [Cross Validated](https://stats.stackexchange.com/questions/12842/covariance-and-independence#) for a well-explained counter-example.
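A standard textbook counter-example (not necessarily the one discussed in the linked Q&A) can be checked exactly. In this sketch, $X$ is assumed to take the values $-1, 0, 1$ with equal probability and $Y = X^2$; the helper `expect` is made up for the illustration. The covariance is zero, yet the joint probability does not factorise.

```python
from fractions import Fraction

# Toy distribution (an assumption for this example):
# X takes -1, 0, 1 with probability 1/3 each, and Y = X**2.
p = Fraction(1, 3)
joint = {(x, x * x): p for x in (-1, 0, 1)}  # P(X = x, Y = y)


def expect(f):
    """Exact expectation of f(X, Y) under the joint distribution."""
    return sum(prob * f(x, y) for (x, y), prob in joint.items())


mu_x = expect(lambda x, y: x)
mu_y = expect(lambda x, y: y)
cov = expect(lambda x, y: (x - mu_x) * (y - mu_y))

# Marginal and joint probabilities at the point (0, 0)
p_x0 = sum(prob for (x, _), prob in joint.items() if x == 0)
p_y0 = sum(prob for (_, y), prob in joint.items() if y == 0)
p_x0_y0 = joint.get((0, 0), Fraction(0))

print("cov =", cov)
print("P(X=0, Y=0) =", p_x0_y0, "but P(X=0) P(Y=0) =", p_x0 * p_y0)
```

Since $Y$ is completely determined by $X$, the two variables are as dependent as they can be, but the symmetry of $X$ around zero makes the covariance (and hence the correlation) vanish.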

### Correlation and the relation between variables

<img src="../../imgs/correlation-ex.png" width="600" align="center"/>

Correlation quantifies how consistently $y$ grows when $x$ grows. It is not a measure of the slope of the linear relation between $x$ and $y$. This is well illustrated in the figure above (from Wikipedia's [page](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient)), which shows several sets of $(x, y)$ data points together with their correlation coefficients. 

In the center figure, the variance of $y$ is 0, so the correlation is undefined (the denominator $\sigma_x \sigma_y$ vanishes). In the bottom row, the relation between the variables is not linear, and the correlation coefficient does not capture it.
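The same points can be reproduced with a few lines of NumPy. This is only a sketch on noiseless synthetic data, with slopes chosen arbitrarily: the correlation coefficient ignores the magnitude of the slope, and it misses a purely quadratic relation on a symmetric range.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)  # symmetric grid around zero

rho_shallow = np.corrcoef(x, 0.1 * x)[0, 1]    # gentle positive slope
rho_steep = np.corrcoef(x, 10.0 * x)[0, 1]     # steep positive slope
rho_negative = np.corrcoef(x, -3.0 * x)[0, 1]  # negative slope
rho_parabola = np.corrcoef(x, x**2)[0, 1]      # non-linear relation

print(rho_shallow, rho_steep, rho_negative, rho_parabola)
```

Both positive slopes should give a coefficient of 1 (up to rounding), the negative slope gives -1, and the parabola gives a value close to 0 even though $y$ is a deterministic function of $x$.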