### <span style="color:LightGreen">Correlation vs. Independence</span>

Last time we discussed the correlation and covariance between two variables.
The **correlation between x and y** (this quantity is often called the covariance) is defined as:
$$ \Large
\text{Corr}_{xy} \equiv \langle\left( x - \overline{x}\right) \left( y - \overline{y}\right)\rangle \; .
$$

A useful combination of the correlation and variances is the **correlation coefficient**,

$$ \Large
\rho_{xy} \equiv \frac{\text{Corr}_{xy}}{\sigma_x \sigma_y} \; .
$$
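
As a quick sanity check of the two definitions above, the following sketch computes $\rho_{xy}$ directly from the averages and compares it with `np.corrcoef`. The toy dataset (a noisy linear relation), the seed, and the helper name `correlation_coefficient` are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed, only for reproducibility

# Hypothetical toy data: y depends linearly on x, plus some noise.
x = rng.normal(loc=0.0, scale=2.0, size=10_000)
y = 3.0 * x + rng.normal(loc=0.0, scale=1.0, size=x.size)


def correlation_coefficient(x, y):
    """Return rho_xy = <(x - xbar)(y - ybar)> / (sigma_x * sigma_y)."""
    corr_xy = np.mean((x - x.mean()) * (y - y.mean()))
    return corr_xy / (x.std() * y.std())


rho_by_hand = correlation_coefficient(x, y)
rho_numpy = np.corrcoef(x, y)[0, 1]  # off-diagonal element of the 2x2 matrix
print(rho_by_hand, rho_numpy)
```

The two numbers should agree to floating-point precision, since `np.corrcoef` evaluates the same ratio.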

It is tempting to think that if two variables are uncorrelated, they are independent.  However, that is not the case.

Let's take the simple case of two random variables X and Y where
$X = x$ and $Y = x^2$.  Clearly these two variables are not independent.  If you know any value of x, you can determine y exactly (and knowing y fixes x up to a sign).  Let's use the machinery from last week to calculate the correlation coefficient with x sampled from a Gaussian centered at $x=0$.

In [None]:
import scipy.stats
from scipy.stats import moment
from scipy.stats import poisson
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
import pandas as pd
import os
import subprocess
from random import gauss


ntries = 10000
gaussarray = np.array([])
gaussarray_sq = np.array([])

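# Sample x from a Gaussian with mean 0 and standard deviation 2, storing x and x^2.
for _ in range(ntries):
  value = gauss(0, 2)
  # This cut is effectively always true here; it is tightened to x > 0 in the next cell.
  if value > -100:
    gaussarray = np.append(gaussarray,value)
    gaussarray_sq = np.append(gaussarray_sq,value*value)


# The off-diagonal entries of the printed 2x2 matrix are the correlation coefficient.
print(np.corrcoef(gaussarray, gaussarray_sq))
plt.scatter(gaussarray, gaussarray_sq, s=10)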


The plot looks sensible, but the correlation coefficient is approximately zero.  Running more events isn't going to give us the strong correlation we might expect, even though we can clearly see that given a value of X, Y is uniquely determined.

Let's take a look at this mathematically for our case of $X = x$ and $Y = x^2$, with x distributed symmetrically around the origin.  In that case:
$$
\begin{aligned}
\bar{X} &= 0 \\
\bar{Y} &= \langle x^2 \rangle
\end{aligned}
$$
Plugging in what we know from last time ($\sigma_x^2 = \langle x^2 \rangle - \bar{x}^2$, with $\bar{x} = 0$):
$$
\bar{Y} = \sigma^2_x
$$
so the correlation between X and Y is:
$$
\begin{aligned}
\text{Corr}_{xy} &= \langle \left( x - \overline{x} \right) \left( x^2 - \overline{x^2} \right) \rangle \\
&= \langle x \left( x^2 - \sigma_x^2 \right) \rangle \\
&= \langle x^3 \rangle - \sigma_x^2 \langle x \rangle = 0 \; .
\end{aligned}
$$
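
The cancellation can also be cross-checked with exact moments rather than random samples. The sketch below assumes a Gaussian with $\sigma = 2$ (as in the sampling code) and uses the general form $\text{Corr}_{xy} = \langle x^3 \rangle - \langle x \rangle \langle x^2 \rangle$ for $y = x^2$; the helper name `corr_x_xsq` and the shifted comparison case with mean 1 are illustrative assumptions.

```python
from scipy.stats import norm


def corr_x_xsq(dist):
    """<(x - xbar)(x^2 - <x^2>)> = <x^3> - <x><x^2> for a frozen scipy distribution."""
    return dist.moment(3) - dist.moment(1) * dist.moment(2)


centered = norm(loc=0, scale=2)  # symmetric about the origin
shifted = norm(loc=1, scale=2)   # hypothetical comparison: same width, mean moved to 1

corr_centered = corr_x_xsq(centered)
corr_shifted = corr_x_xsq(shifted)
print(corr_centered, corr_shifted)
```

For the centered Gaussian both $\langle x \rangle$ and $\langle x^3 \rangle$ vanish, so the result is zero. Once the mean is shifted away from the origin the result becomes $2 \bar{x} \sigma_x^2$, which is nonzero.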

This works because the distribution of x is symmetric about $x=0$, so both $\langle x \rangle$ and $\langle x^3 \rangle$ vanish.  Let's look at the same Gaussian centered at the origin, but now keeping only the values with $x>0$.



In [None]:
ntries = 10000
gaussarray = np.array([])
gaussarray_sq = np.array([])

# Same sampling as before, but keep only x > 0 (about half of the ntries samples).
for _ in range(ntries):
  value = gauss(0, 2)
  if value > 0:
    gaussarray = np.append(gaussarray,value)
    gaussarray_sq = np.append(gaussarray_sq,value*value)


print(np.corrcoef(gaussarray, gaussarray_sq))
plt.scatter(gaussarray, gaussarray_sq, s=10)
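
With the $x>0$ requirement the correlation coefficient is large. For a Gaussian centered at zero and truncated to $x>0$, the half-Gaussian moments ($\langle x \rangle = \sigma\sqrt{2/\pi}$, $\langle x^2 \rangle = \sigma^2$, $\langle x^3 \rangle = 2\sigma^3\sqrt{2/\pi}$, $\langle x^4 \rangle = 3\sigma^4$) give $\rho_{xy} = 1/\sqrt{\pi - 2} \approx 0.94$, independent of $\sigma$. The sketch below compares that value with a simulation; it uses vectorized NumPy sampling instead of the loop above, and the seed and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=2)  # arbitrary seed, only for reproducibility

# Draw from the same Gaussian (mean 0, sigma 2) and keep only x > 0.
x = rng.normal(loc=0.0, scale=2.0, size=200_000)
x_pos = x[x > 0]

rho_sim = np.corrcoef(x_pos, x_pos**2)[0, 1]
rho_exact = 1.0 / np.sqrt(np.pi - 2.0)  # analytic value for a half-Gaussian
print(rho_sim, rho_exact)
```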


We have seen that if two variables are uncorrelated they are not necessarily independent.  However, if two variables are *independent* then they are uncorrelated.  This means that the correlation between two variables does not directly tell you how much extra information the second quantity provides.

If we just had the datasets themselves and didn't know the $Y = x^2$ relationship set out above, we might calculate the correlation coefficient and think that the two variables are independent.  The plots however make clear that if you know $x$ you directly know $Y$.


$\rho_{xy}$ is known as the Pearson correlation coefficient and is sensitive to the degree of linear correlation.  There are other measures of the correlation (e.g. Spearman's rho).  The Pearson correlation coefficient is very common.  Here is a paper from ATLAS where the observable is $\rho$ between the magnitude of the correlations between particles and the average momentum of the particles produced in the collision of two nuclei: https://arxiv.org/pdf/2205.00039.pdf.
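
To illustrate the difference between the two measures, here is a minimal sketch comparing Pearson's and Spearman's coefficients for $Y = x^2$, using `scipy.stats.pearsonr` and `scipy.stats.spearmanr`. Spearman's rho is based on ranks, so it equals 1 for any strictly increasing relationship. The seed and sample size are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(seed=3)  # arbitrary seed, only for reproducibility
x = rng.normal(loc=0.0, scale=2.0, size=50_000)

# Symmetric sample: y = x^2 is not monotonic in x, so both measures are near zero.
pearson_full, _ = pearsonr(x, x**2)
spearman_full, _ = spearmanr(x, x**2)

# x > 0 only: y = x^2 is strictly increasing but not linear.
x_pos = x[x > 0]
pearson_pos, _ = pearsonr(x_pos, x_pos**2)
spearman_pos, _ = spearmanr(x_pos, x_pos**2)

print("full sample:", pearson_full, spearman_full)
print("x > 0 only :", pearson_pos, spearman_pos)
```

Note that neither coefficient detects the dependence in the symmetric case, so a value near zero from either one is not evidence of independence.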

### <span style="color:LightGreen">Uncertainties on the Mean</span>

Given a dataset of a finite size, there are uncertainties on any of the associated moments.  We'll discuss the special case of counting experiments using the Poisson distribution.  

Last week we discussed:

$$\Large P(k; \lambda) = \frac{e^{-\lambda} \lambda^k}{k!}$$

The Poisson distribution is a useful model for describing the statistics of event-counting rates in (uncorrelated) counting measurements (which are ubiquitous in astronomy and particle physics).

It's useful to note that the Poisson distribution is defined for integer values of $k$, but $\lambda$ of course doesn't need to be an integer.

$\lambda$ is the probability times the number of trials ($\lambda = np$), so it is the expectation value (the mean) of the distribution.  But what about the variance (recall $\sigma_x^2 \equiv \langle\left( x - \overline{x} \right)^2\rangle$ and the standard deviation is $\sigma_x$)?  It requires a bit of work to show, but for a Poisson distribution $\sigma^2 = \lambda$, so the standard deviation is $\sqrt{\lambda}$.
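
The statement $\sigma^2 = \lambda$ can be checked numerically by summing over the probability mass function. This is only a sketch: the non-integer rate `lam = 4.5` and the cutoff at $k = 100$ (beyond which the probabilities are negligible for this rate) are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import poisson

lam = 4.5               # hypothetical rate; it does not need to be an integer
k = np.arange(0, 100)   # integer counts; the tail beyond k = 100 is negligible here
pmf = poisson.pmf(k, lam)

mean_k = np.sum(k * pmf)                # <k>
var_k = np.sum((k - mean_k)**2 * pmf)   # <(k - <k>)^2>
print(mean_k, var_k, np.sqrt(var_k))
```

Both the mean and the variance come out equal to `lam`, so the standard deviation is `sqrt(lam)`.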




In [None]:
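mu = 4  # expected number of counts (lambda)
R = poisson.rvs(mu, size=1000)  # 1000 simulated counting experiments
# Use one bin per integer count so the histogram has no artificial gaps.
plt.hist(R, bins=np.arange(R.max() + 2) - 0.5)
# For a Poisson distribution the sample mean and variance should both be close to mu.
print("mean: ", np.mean(R), " & variance: ", np.var(R))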

This means that the relative uncertainty on the mean of a counting experiment is $\sqrt{\lambda}/\lambda = 1/\sqrt{\lambda}$.  For a counting experiment that records $N$ events, the relative uncertainty therefore shrinks as $1/\sqrt{N}$.  This is really fundamental to much of experimental physics.
If you want 10% precision you need:
$$1/\sqrt{N} = 0.1 \quad \Rightarrow \quad N = 100$$
if you want 1% precision you need:
$$1/\sqrt{N} = 0.01 \quad \Rightarrow \quad N = 10000$$

Precision improves quickly at first as you add data, but then much more slowly: each factor of 10 improvement in precision costs a factor of 100 more events.
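
The $1/\sqrt{N}$ scaling can be seen directly in simulated counting experiments. The sketch below draws many Poisson counts for two expected values and measures the spread relative to the mean; the seed, the number of repetitions, and the dictionary name `relative_spread` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(seed=4)  # arbitrary seed, only for reproducibility

relative_spread = {}
for n_expected in (100, 10_000):
    # Repeat the same counting experiment many times.
    counts = rng.poisson(lam=n_expected, size=100_000)
    relative_spread[n_expected] = counts.std() / counts.mean()
    print(n_expected, relative_spread[n_expected], 1 / np.sqrt(n_expected))
```

The measured relative spread should be close to 0.1 for 100 expected events and close to 0.01 for 10000 expected events.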
