# Confidence limits

In the exercise on constructing histograms we found that the probability distribution function for the sum of a large number of (identically distributed) uniform random variables was a normal distribution.  I then told you that this result holds more generally and that it is known as the central limit theorem.  To be clear, this theorem states:

$$
\lim_{n \rightarrow \infty} P\left(\frac{\frac{S_n}{n} - \mu}{ \sigma/\sqrt{n} } \le z \right) = \Phi(z)
$$

where $S_n = X_1 + X_2 + X_3 + \dots + X_n$ and where $X_1, X_2, X_3, \dots, X_n$ are all independent and identically distributed random variables.  $\mu$ and $\sigma$, meanwhile, are the expectation and standard deviation (the square root of the variance) of the random variable, $X_1$.  If either of these quantities is infinite then the central limit theorem does not hold.  Last but not least, $\Phi(z)$ is the cumulative probability distribution function for a normal random variable with an expectation of zero and a variance of 1.0.  This material is covered in more detail in the following video: https://www.youtube.com/watch?v=-XJe3s_BCKw
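The statement above can be checked numerically.  The following is a minimal sketch rather than part of the exercise code: the function names `standardised_mean` and `empirical_cdf`, the seed, and the values of `n` and `ntrials` are all my own illustrative choices.  It uses uniform random variables on $[0,1)$, for which $\mu = 1/2$ and $\sigma = \sqrt{1/12}$, and uses `NormalDist` from Python's standard library to evaluate $\Phi(z)$.

```python
import math
import random
from statistics import NormalDist

def standardised_mean(n, rng):
    """Return (S_n/n - mu) / (sigma/sqrt(n)) for n uniform variables on [0, 1)."""
    mu = 0.5                        # expectation of one uniform variable
    sigma = math.sqrt(1.0 / 12.0)   # standard deviation of one uniform variable
    s_n = sum(rng.random() for _ in range(n))
    return (s_n / n - mu) / (sigma / math.sqrt(n))

def empirical_cdf(z, n, ntrials, rng):
    """Fraction of ntrials standardised means that are less than or equal to z."""
    hits = sum(1 for _ in range(ntrials) if standardised_mean(n, rng) <= z)
    return hits / ntrials

rng = random.Random(12345)  # arbitrary fixed seed so that the run is repeatable
for z in (-1.0, 0.0, 1.0):
    estimate = empirical_cdf(z, n=30, ntrials=20000, rng=rng)
    # Print z, the simulated probability and the value of Phi(z)
    print(z, round(estimate, 3), round(NormalDist().cdf(z), 3))
```

If the theorem holds, the second and third numbers on each printed line should be close to each other, and they should get closer as `n` and `ntrials` are increased.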

The aim of this exercise is to investigate the implications of the central limit theorem by looking at the issue of confidence limits and error bars.  As always, though, we begin with some code that I have written for you.  Press shift and enter on the cell below now.




In [None]:
from scipy.stats import norm
import math
%matplotlib notebook
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.animation as anim
from matplotlib import rc
from IPython.display import HTML
import random
rc('animation', html='none')

class plotobj(object):
    def __init__(self,ngen,expectation,climit):
        self.ngen = ngen
        self.expectation = expectation
        self.climit = climit
        self.fig = plt.figure()
        self.ymin = expectation - 2*climit
        self.ymax = expectation + 2*climit
        self.ax = plt.axes(xlim=(0, self.ngen),ylim=(self.ymin,self.ymax))
        x1data,y1data,y2data,y3data = [], [], [], []
        for i in range(0,self.ngen):
            x1data.append(i)
            y1data.append(self.expectation)
            y2data.append(self.expectation+self.climit)
            y3data.append(self.expectation-self.climit)
        
        line1 = self.ax.plot(x1data,y1data,'-')
        line2 = self.ax.plot(x1data,y2data,'-')
        line3 = self.ax.plot(x1data,y3data,'-')
        
    def setup(self):
        self.xdata=[]
        self.ydata=[]
        self.line, = self.ax.plot([],[],'.')
        return self.line
    
    def run(self,data):
        t,y = data
        if y>self.ymax :
            self.ymax = y
        if y<self.ymin :
            self.ymin =y
        self.ax.set_ylim(self.ymin,self.ymax)
        self.xdata.append(t)
        self.ydata.append(y)
        self.line.set_data(self.xdata, self.ydata)
        return self.line
    
def raw( ngen, myvar, p1 ):
    cnt = 0
    while cnt < ngen :
        cnt += 1
        yield cnt, myvar(p1)
    
def dynamicplot( ngen, myvar, p1, expectation, climit ):
    myplot = plotobj( ngen, expectation, climit )
    return anim.FuncAnimation(fig=myplot.fig, func=myplot.run, frames=raw( ngen, myvar, p1 ), interval=10, 
                              init_func=myplot.setup, blit=False, repeat=False)

# Introduction

You are hopefully aware that when we perform scientific investigations we perform our experiments multiple times in order to confirm our results.  This is not the sole reason for repeating experiments, however.  Whenever we perform an experiment we assume that the result, $Y$, is a random variable with some underlying probability distribution function.  That is to say, we assume that the outcome of our experiment is random.  The aim of the experiment is thus not to get a single value of $Y$.  Instead we want to get some information about the probability distribution function of the random variable, $Y$, that underlies our experiment.  Now, as we have seen in all the exercises on the law of large numbers, we can get information on this distribution function by generating multiple random variables, adding them together and dividing by the total number of random variables we have generated.  This procedure gives us an estimate of the expectation.  Furthermore, I have just described how the central limit theorem gives us information on the probability distribution function for this estimate.

The key word in the above is estimate.  The point is that the "expectation" we get from performing an experiment multiple times is not the true expectation for a random variable, as the law of large numbers tells us that our estimate only converges to the true expectation in the limit of an infinite number of experiments.  What we thus need to do is get an estimate of how close to the true expectation our estimate lies.  The central limit theorem and the business of confidence limits that is described in the video above provide us with a method for doing just this.  Interpreting what these confidence limits represent is (I think) rather difficult though, so in this exercise we are going to work through a short example that I hope will help you to understand confidence limits better.
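The averaging procedure described above can be sketched as follows.  This is an illustration only and it makes some assumptions of its own: the helper name `estimate_expectation` is hypothetical, the seed is arbitrary, and the "experiment" is the roll of a fair six-sided die, which has a true expectation of 3.5.

```python
import random

def estimate_expectation(draw, nsamples):
    """Average nsamples values returned by draw() to estimate the expectation."""
    total = 0.0
    for _ in range(nsamples):
        total += draw()
    return total / nsamples

rng = random.Random(2024)          # arbitrary fixed seed so that the run is repeatable

def roll():
    """One 'experiment': roll a fair die (true expectation 3.5)."""
    return rng.randint(1, 6)

for nsamples in (10, 1000, 100000):
    # The estimate is itself random, so it differs from 3.5 for any finite nsamples
    print(nsamples, estimate_expectation(roll, nsamples))
```

The estimates printed for small values of `nsamples` will typically be further from 3.5 than the estimates printed for large values, but none of them has to be exactly 3.5.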

To begin this exercise I want you to write yet another random number generator in the cell below.  This random number generator should take an integer, $N$, as input and should return the random variable:

$$
Y = \frac{1}{N} \sum_{i=1}^N X_i 
$$

where each $X_i$ is one of $N$ independent uniform random variables that lie between 0 and 1.

In [None]:
# This function needs to be modified because at
# the moment it only returns 1 which is not very random
# Do not change the name of this function.
def randno(N):
    return 1

We know that the value of the random variable $Y$ should converge to 0.5 as $N$ increases, as it is the mean of $N$ identically distributed uniform random variables that each have an expectation of 0.5.  Furthermore, we can use the equation for the error bars from the video https://www.youtube.com/watch?v=-XJe3s_BCKw to get an estimate for the confidence limit.  In the cell below write a function that takes the number of random variables generated, $N$, and the level of confidence required as input and which then uses the central limit theorem to calculate and return an estimate for the size of the error bar.  Notice you can get the inverse normal function, $\Phi^{-1}(z)$, using the function:

norm.ppf(z)
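To see what the inverse normal function does, the short sketch below evaluates $\Phi^{-1}$ at a few probabilities and then checks that applying $\Phi$ recovers the probability we started from.  So that it runs without SciPy, it uses `NormalDist` from Python's standard library as a stand-in: `inv_cdf` plays the same role as `norm.ppf` and `cdf` plays the same role as `norm.cdf`.  The variable name `std_normal` and the probabilities chosen are my own.

```python
from statistics import NormalDist

# Standard normal distribution: expectation 0 and variance 1
std_normal = NormalDist(mu=0.0, sigma=1.0)

for p in (0.5, 0.9, 0.95, 0.975):
    z = std_normal.inv_cdf(p)   # same role as norm.ppf(p)
    # Applying the cumulative distribution function to z returns p again
    print(p, round(z, 4), round(std_normal.cdf(z), 4))
```

For example, the value returned for a probability of 0.975 is approximately 1.96, which means that 97.5 % of the probability for a standard normal random variable lies below 1.96.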

In [None]:
# This function needs to be modified so that it 
# returns the lim % confidence limit when N experiments
# have been performed.  Do not change the name of this function
def confidencelimit(lim,N):
    return 1.0

Now that we have these two functions I have, in the cell below, written a routine that will generate and plot 20 instances of your random variable $Y$ with $N=20$, together with lines that show the 90 % confidence limit around the true expectation (0.5).  Press shift and enter on the cell below and then answer the questions below the cell in your notes on this exercise.

In [None]:
Nsets = 20     # Number of instances of random variable Y to generate 
N = 20         # Number of uniform random variables that are added together to get Y
expectation = 0.5 # The expectation for a uniform random variable that lies between 0 and 1
climit = 0.9      # Confidence limit we are interested in
dynamicplot( Nsets, randno, N, expectation, confidencelimit(climit,N) )

- Are all the values of $Y$ that you obtain within the error bars?  If they are all within the error bars, run the cell above again until you find that some of the estimates are outside the confidence limits.
- Explain, making reference to the central limit theorem, why some of the values of $Y$ that you obtain are outside the error bars.
- How many of our 20 estimates of $Y$ would we expect to lie within the 90 % confidence limit?
- What happens when different values are used for the confidence limit and when different numbers of experiments are performed?  Does the number of $Y$ values within the 90 % confidence limit increase or decrease when $N$ is increased to 50?
- What happens to the size of the error bars as $N$ is increased?