# Sample Variance and Variance of Sample Means

Let $\mu_p$ be the population mean and $\sigma_p^2$ the population variance. 

If you draw many samples of size $n$, the means of those samples follow a distribution with mean $\mu_p$ and variance $\sigma^2_p/n$. 
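
A short derivation of that variance, as a sketch under the assumption (not stated explicitly above) that the draws $x_1, \dots, x_n$ are independent and each has variance $\sigma_p^2$. Writing the sample mean as $\mu_s = \sum_{i=1}^{n} x_i/n$:

$$
\operatorname{Var}(\mu_s) = \operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(x_i) = \frac{n\,\sigma_p^2}{n^2} = \frac{\sigma_p^2}{n}.
$$

The second equality is where independence is used: without it, covariance terms between the $x_i$ would also appear.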

If you draw one sample of size $n$, then the variance within that sample is $s^2 =\sum_{i=1}^{n} |x_i - \mu_s|^2/(n-1)$, where $\mu_s$ is the sample mean $\mu_s = \sum_{i=1}^{n} x_i/n$. Dividing by $n-1$ rather than $n$ makes $s^2$ an unbiased estimate of $\sigma_p^2$.
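
As a small check of the formula, the sketch below computes $s^2$ by hand and compares it with NumPy. The toy array `x` is a hypothetical example chosen only for illustration; `np.var` divides by $n$ unless `ddof=1` is passed.

```python
import numpy as np

# Hypothetical toy sample, chosen only for illustration.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)

mu_s = x.sum() / n                        # sample mean
s2 = np.sum((x - mu_s) ** 2) / (n - 1)    # sample variance, dividing by n - 1

s2_numpy = np.var(x, ddof=1)              # same quantity via NumPy
biased = np.var(x)                        # default ddof=0 divides by n instead

print(s2, s2_numpy, biased)
```

The first two printed values agree; the third is smaller by a factor of $(n-1)/n$.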

## Example: Data Matrix, Covariance Matrix
Say you have a data matrix with $n$ samples and $p$ features, $\mathbf{X}\in\mathbb{R}^{n \times p}$. 

You average the rows of $\mathbf{X}$ to get the sample mean of each feature. How far might that mean be from the population mean? For a feature with population standard deviation $\sigma_p$, the standard error is $\sigma_p/\sqrt{n}$.  
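
In practice $\sigma_p$ is usually unknown, so one common approach is to plug in the sample standard deviation of each column. A minimal sketch, assuming synthetic Gaussian data (the shape, seed, and distribution parameters are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)    # arbitrary seed
n, p = 200, 3                     # hypothetical matrix shape
X = rng.normal(loc=1.0, scale=2.0, size=(n, p))

col_means = X.mean(axis=0)                      # sample mean of each feature
std_err = X.std(axis=0, ddof=1) / np.sqrt(n)    # estimated standard error per feature

print(col_means)
print(std_err)
```

Here the true standard error is $2/\sqrt{200} \approx 0.141$, and the estimates should land close to that value.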

You want to know how the features covary, so you look at the covariance between the columns within the sample. Provided each column of $\mathbf{X}$ has first been centered by subtracting its mean, the covariance matrix is $\frac{\mathbf{X}^T\mathbf{X}}{n-1}$.
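
The sketch below compares the matrix formula with `np.cov` on synthetic data whose column means are deliberately non-zero (the data and offsets are made up for illustration). It also shows that skipping the centering step gives a different matrix.

```python
import numpy as np

rng = np.random.default_rng(0)    # arbitrary seed
n, p = 100, 3                     # hypothetical matrix shape
# Shift each column so that its mean is clearly non-zero.
X = rng.normal(size=(n, p)) + np.array([5.0, -2.0, 0.5])

Xc = X - X.mean(axis=0)               # center each column
cov_centered = Xc.T @ Xc / (n - 1)    # formula applied to centered data
cov_raw = X.T @ X / (n - 1)           # same formula without centering

# rowvar=False tells NumPy that columns are the variables (features).
cov_numpy = np.cov(X, rowvar=False)

print(np.allclose(cov_centered, cov_numpy))
print(np.allclose(cov_raw, cov_numpy))
```

Only the centered version matches `np.cov`; the uncentered product also contains the outer product of the column means.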

In [3]:
import numpy as np

def var_1(sample):
    """
    Estimate the population variance from the variance within a single sample.
    """
    sample = np.asarray(sample)
    mean = np.mean(sample)
    n = len(sample)
    # Divide by n - 1 so that the estimate is unbiased.
    population_variance = np.sum((sample - mean)**2)/(n-1)
    
    return population_variance

def var_2(samples):
    """
    Estimate the population variance from the variance of the means of a
    collection of equal-sized samples.
    """
    n = len(samples[0])  # size of each sample
    means = [np.mean(sample) for sample in samples]
    # The variance of the sample means is population variance / n, so rescale by n.
    population_variance = n*np.var(means)
    
    return population_variance


N = 5    # size of each sample (n in the text above)
var = 1  # true population variance
samples = [np.sqrt(var)*np.random.randn(N) for i in range(500000)]

# via intra-sample variance (compute the per-sample estimates once, then summarize)
intra_vars = [var_1(sample) for sample in samples]
pvar1 = np.mean(intra_vars)
pvar1_std = np.std(intra_vars)

# via sample means
pvar2 = var_2(samples)

print("True Population Variance:\n%1.3f\n" % var)

print("Estimated via intra-sample variance:\n %1.3f (+/- %1.3f)\n" % (pvar1,pvar1_std))

print("Estimated via variance of sample means:\n %1.3f\n" % pvar2)

True Population Variance:
1.000

Estimated via intra-sample variance:
 1.000 (+/- 0.708)

Estimated via variance of sample means:
 1.002
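
The spread of $\pm 0.708$ reported for the intra-sample estimate can be checked against theory. As a sketch, assuming the samples are Gaussian (as they are here, since they come from `np.random.randn`), the variance of $s^2$ is $2\sigma_p^4/(n-1)$, so its standard deviation is

$$
\sqrt{\operatorname{Var}(s^2)} = \sigma_p^2\sqrt{\frac{2}{n-1}} = 1 \cdot \sqrt{\frac{2}{4}} \approx 0.707
$$

for $n = 5$ and $\sigma_p^2 = 1$, which agrees with the simulated value. A single sample of size 5 therefore gives a very noisy variance estimate, even though the average over many samples is accurate.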

