## Class Exercise: Day 2 - Probability and Statistics

### Problem B: Analyzing the Data

The mean and variance of any data set sampled from a probability density function (PDF) can be calculated using:

Mean: $ \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i $

Variance (of the PDF): $ \sigma_{x}^{2} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \left( \frac{1}{N} \sum_{i=1}^{N} x_i \right)^2 = \overline{x^2} - \bar{x}^2 $
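As a minimal sketch of these two formulas, the snippet below evaluates the mean and both forms of the variance with explicit sums. The data are synthetic (a normal distribution with an arbitrary mean and width, chosen only for illustration) and the variable names are hypothetical; none of this comes from the course data files or the template.

```python
import numpy as np

# Synthetic stand-in for a data file (an assumption, not the course data).
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)

N = x.size
mean = np.sum(x) / N                          # (1/N) * sum(x_i)
var_direct = np.sum((x - mean) ** 2) / N      # (1/N) * sum((x_i - mean)^2)
var_moments = np.sum(x ** 2) / N - mean ** 2  # mean of squares minus squared mean

print(f'mean = {mean}')
print(f'variance (direct form)  = {var_direct}')
print(f'variance (moments form) = {var_moments}')
```

Both variance forms use the $\frac{1}{N}$ normalization shown above, which is also the default convention of `np.var()` and `np.std()` (`ddof=0`).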

It is also useful to quantify the accuracy of the mean estimate that is generated. 

## Background

The accuracy of a mean estimate can be quantified using the standard deviation of the mean (also called the standard error). The standard error describes the *dispersion of the mean for many subsets of the data*. There are two ways to determine the standard error:

1. It is possible to estimate the standard error using the variance (of the PDF) from a single data set of size N. The variance of the mean is $\sigma_{\bar{x}}^{2} = \frac{\sigma_x^2}{N}$, so the standard error is $\sigma_{\bar{x}} = \frac{\sigma_x}{\sqrt{N}}$.

2. If many independent data sets of length N are used to estimate the mean, there are many estimates of the true mean (all grouped around the true mean, $\mu$). From these estimates, we can directly calculate the variance of those estimates of the mean; its square root is the standard error.
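The two approaches can be compared side by side on synthetic data. This is only a sketch under stated assumptions: the samples are drawn from a standard normal distribution, the number of subsets and the subset size are arbitrary, and the variable names are hypothetical rather than taken from the template.

```python
import numpy as np

# Synthetic, independent samples (an assumption, not the course data).
rng = np.random.default_rng(1)
data = rng.normal(loc=0.0, scale=1.0, size=20_000)

num_sets = 20
subsets = data.reshape(num_sets, -1)  # 20 rows, each an independent data set
N = subsets.shape[1]                  # samples per data set (1000 here)

subset_means = subsets.mean(axis=1)
subset_stds = subsets.std(axis=1)

# Way 1: one estimate per data set, sigma_x / sqrt(N)
stderr_from_std = subset_stds / np.sqrt(N)

# Way 2: the spread of the independent estimates of the mean
stderr_from_means = subset_means.std()

print(f'Way 1 (averaged over data sets): {stderr_from_std.mean()}')
print(f'Way 2 (std. dev. of the means):  {stderr_from_means}')
```

With only 20 estimates of the mean, the second value is itself noisy, so the two results should agree only approximately.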

## Your Tasks
In the Jupyter notebook, add to the template in order to do the following for all three data files:

1. Using all the data, calculate the mean value without using the `np.mean()` function.

2. Using all the data, calculate the standard deviation without using the `np.std()` function. Compare this to the standard deviation calculated using the built-in `np.std()` function.
   
3. Divide the data into 20 equally sized sets (✅). For each of the 20 data sets:

   - Estimate the mean value, the standard deviation (which describes the dispersion of the data), and the standard error (which describes the dispersion of the means).
   - Calculate the standard error in two ways:
       1. using the standard deviation of the PDF data and the sample size
       2. using the standard deviation of the 20 mean values. You may use the built-in functions.
   - Plot a histogram of the 20 mean values. ✅

Answer the following questions as a comment at the end of problem_b.ipynb:

  - Does the distribution of the mean values match your expectation based on your estimates of the standard error?
  - How do the estimates of the standard error and the distribution of means change if you divide the same sets of data into 40 equally sized sets instead of 20? 80? 100? 1000?


Please submit your Jupyter notebook files to Canvas upon completion.

## Questions

 - Does the distribution of the mean values match your expectation
 based on your estimates of the standard error?

 - Would your expectations be better or worse matched if you divided
   the same sets of data into 40 equally sized sets instead of 20?


In [None]:
from matplotlib import pyplot as plt
import numpy as np

def problem_b(pdf, numSets, label):
    """
    Calculate the mean and standard deviation of a vector of data sampled
    from a PDF.
    Divide that vector of data into subsets of equal size and calculate the
    mean and standard deviation of each subset, as well as the standard
    deviation of the set of mean values.

    Parameters
    ----------
    pdf     - a vector of numbers that is the data
    numSets - the number of subsets to analyze
    label   - string to use when describing the data
    """

    # B.1 Using all the data, calculate the mean value
    size = pdf.size
    my_mean = None
    mean = None

    # B.2 Using all the data, calculate the standard deviation
    my_std_dev = None
    std_dev = None

    # B.3 Divide the data set into numSets equally sized subsets
    subsets = np.array_split(pdf, numSets)

    set_size = subsets[0].size

    # the mean of each subset provides an independent estimate of the
    # mean of this PDF

    # calculate the mean of each subset
    means = None

    # the standard deviation of each subset provides an independent
    # estimate of the standard deviation of this PDF

    # calculate the standard deviation of each subset
    stds = None

    # the standard error can be determined in two ways:

    # 1) by the relationship between the standard deviation and standard error
    #    for a sample size N

    # calculate the standard error of the mean for each subset
    stde = None

    # 2) by evaluating the standard deviation of a set of independent estimates
    #    of the mean

    # calculate summary statistics across the subsets
    mean_of_subset_means = None
    mean_of_subset_std_devs = None
    std_err_of_the_subset_std_devs = None

    # Calculate the standard deviation of the set of means
    std_of_subset_means = None

    # Print the results
    print(f'For PDF{label} data broken into {numSets} subsets:')
    print(f'  Mean of entire data set: {mean}')
    print(f'  Standard deviation of entire data set: {std_dev}')
    print()
    print(f'  Mean of means of subsets: {mean_of_subset_means}')
    print(f'  Mean of the std. dev. of each subset: {mean_of_subset_std_devs}')
    print(f'  Standard error from the subset std. devs.: {std_err_of_the_subset_std_devs}')
    print(f'  Standard deviation of the subset means: {std_of_subset_means}')

    print(f'Max variation of subset means from the "true" mean: {np.max(np.abs(means - mean))}')

    # generate a gaussian around the true mean with a width based on the
    # std. dev. of the subset means, spanning +/- 3 sigma (6 sigma in total)
    six_sigma = 3 * std_of_subset_means
    gauss_xs = np.linspace(mean - six_sigma, mean + six_sigma, 500)
    gauss_ys = 10 * np.exp(-0.5 * (gauss_xs-mean)**2 / std_of_subset_means**2)
    
    # Plot the results
    plt.figure()
    plt.hist(means, numSets)
    plt.xlabel('Mean')
    plt.ylabel('Frequency')
    plt.plot(gauss_xs, gauss_ys)
    # plt.xlim(mean - 0.1, mean + 0.1)
    plt.vlines(mean, 0, 10, colors=['black'], linestyle='dashed')
    plt.show()

In [None]:
num_sets = 1000
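# NOTE: np.fromfile reads raw binary float64 values. If these files were
# written with np.save (the usual source of .npy files), use np.load instead.
pdf = np.fromfile('pdfa.npy')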
problem_b(pdf, num_sets, 'A')
pdf = np.fromfile('pdfb.npy')
problem_b(pdf, num_sets, 'B')
pdf = np.fromfile('pdfc.npy')
problem_b(pdf, num_sets, 'C')