# Data Analysis Homework 1

In [None]:
from __future__ import division
from IPython.display import HTML
from IPython.display import display
from scipy.special import erf
from scipy.special import erfc
from math import factorial as factorial
from random import seed
from random import randint
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Question 1: Mean, Standard Deviation and Standard Error

12 measurements of the sensitivity of a photo diode circuit (in Amps/Watt) are: 
<bf>
11.45, 10.91, 11.60, 10.59, 10.32, 10.34, 11.00, 10.94, 11.67, 11.67, 11.06, and 10.57. 
<bf>
Calculate:
<blockquote>
<bf>
(i) The Mean.
<bf>
(ii) The Standard Deviation.
<bf>
(iii) The Standard Error.
<bf>
(iv) How would you report the result?

### (i) Calculate the Mean

In [None]:
data = [11.45, 10.91, 11.60, 10.59, 10.32, 10.34, 11.00, 10.94, 11.67, 11.67, 11.06, 10.57]

def one_i(data):
    '''Return the mean'''
    mean = np.mean(data)

    return mean

# Please print output of functions to make marking easier
one_i(data)

### (ii) Calculate the sample Standard Deviation.

In [None]:
data = [11.45, 10.91, 11.60, 10.59, 10.32, 10.34, 11.00, 10.94, 11.67, 11.67, 11.06, 10.57]

def one_ii(data):
    '''Return the standard deviation'''
    standard_deviation = np.std(data,ddof = 1)

    return(standard_deviation)
    
one_ii(data)

### (iii) Calculate the Standard Error.

In [None]:
data = [11.45, 10.91, 11.60, 10.59, 10.32, 10.34, 11.00, 10.94, 11.67, 11.67, 11.06, 10.57]

def one_iii(data):
    '''Return the standard error'''
    st_dev = one_ii(data)
    n = len(data)
    standard_error = st_dev/np.sqrt(n)

    return(standard_error)

one_iii(data)

### (iv) How would you report the result?

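The sensitivity of the photo diode is 11.01 $\pm$ 0.14 Amps/Watt. The error is quoted to two significant figures because its first significant digit is 1, and the mean is rounded to the same decimal place.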
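
As a rough illustration of the rounding behind this result, the sketch below formats a mean and standard error using a common convention: one significant figure for the error, or two when its leading digit is 1. The helper name `report` and the convention itself are assumptions made for illustration, not part of the assignment.

```python
import math

def report(mean, error):
    """Format mean +/- error (hypothetical helper, assumes error > 0)."""
    exponent = math.floor(math.log10(error))      # power of ten of the leading digit
    leading_digit = int(error / 10**exponent)     # first significant digit of the error
    sig_figs = 2 if leading_digit == 1 else 1     # assumed convention, see the text above
    decimals = max(sig_figs - 1 - exponent, 0)    # decimal places needed for the error
    return f"{mean:.{decimals}f} +/- {error:.{decimals}f}"

print(report(11.01, 0.1437))
```

The mean is formatted with the same number of decimal places as the error, so the two always end at the same digit.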

## Question 2: Error in the error

Consider a set of measurements with the standard error calculated to be $\alpha=0.987654321$.  Here we address the question of how many significant figures should be quoted.  

Required:
<blockquote>

(i) Using pandas or any other software package, make a CSV file with four columns.  The first column should be $N$, the number of measurements on which $\alpha$ is based.  In the second column write $\alpha$ to the nine significant figures quoted above. The third and fourth columns should be ${\displaystyle \alpha\left(1-\frac{1}{\sqrt{2N-2}}\right)}$    and  ${\displaystyle \alpha\left(1+\frac{1}{\sqrt{2N-2}}\right)}$, respectively.  As we are interested in the variation over a large dynamic range, choose values for $N$ such as 2, 3, 5, 10, 20, 30, etc. 
<bf>
(ii) Verify the statement from Section 2.7.1 that the number of data points, $N$, needs to approach a few tens of thousands before the second significant figure in the error can be quoted, i.e. when the values in the three columns become equal to the second significant figure. Use the model that you constructed in the previous part of the question and make appropriate comments using the data.
<bf>
(iii) Repeat the analysis for the case where $\alpha=0.123456789$, i.e. the first significant digit of the error is 1. Make appropriate comments.
<bf>
(iv) How many data points must be collected before the third significant figure can be quoted?

### (i) Using pandas or any other software package, make a CSV file with four columns.  The first column should be $N$, the number of measurements on which $\alpha$ is based.  In the second column write $\alpha$ to the nine significant figures quoted above. The third and fourth columns should be ${\displaystyle \alpha\left(1-\frac{1}{\sqrt{2N-2}}\right)}$    and  ${\displaystyle \alpha\left(1+\frac{1}{\sqrt{2N-2}}\right)}$, respectively.  As we are interested in the variation over a large dynamic range, choose values for $N$ such as 2, 3, 5, 10, 20, 30, etc. 

In [None]:
# Note: this is one way in which this can be done efficiently, although there are many other ways
# of coding the solution

N_range = np.logspace(1,10, base=5, num=10).astype(int)
alpha = 0.987654321*np.ones(len(N_range))

def make_dataframe(N_range, alpha, outname):
    'Creates a pandas dataframe and saves the output as a csv file given a range of N values, and a given alpha'
    
    plus_fn = lambda N: 1+(1/np.sqrt(2*N - 2))
    minus_fn = lambda N: 1-(1/np.sqrt(2*N - 2))
    
    # Note: using the numpy library where possible is often a good idea since its built-in
    # vectorised operations act on whole arrays at once. Calling the lambda functions on a numpy
    # array therefore carries out the same operation on each element very efficiently.
    
    alpha_plus = alpha*plus_fn(N_range)
    alpha_minus = alpha*minus_fn(N_range)
    
    # create dictionary
    data = {'N':N_range, 'alpha': alpha, 'alpha minus': alpha_minus, 'alpha plus': alpha_plus}
    
    # create dataframe from dictionary
    df = pd.DataFrame(data)
    df.to_csv(outname)
    
    return df

df = make_dataframe(N_range, alpha, 'two_i.csv')

df

### (ii) Verify the statement from Section 2.7.1 that the number of data points, $N$, needs to approach a few tens of thousands before the second significant figure in the error can be quoted, i.e. when the values in the three columns become equal to the second significant figure. Use the model that you constructed in the previous part of the question and make appropriate comments using the data.

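From the table above it can be seen that the values in the final three columns, rounded to two significant figures, only begin to agree at N = 78125; in the N = 15625 row the lower value still rounds to 0.98. This supports the statement that tens of thousands of data points are needed before the second significant figure can be quoted.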
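
One way to make "agree to the second significant figure" concrete is to round each of the three values and compare them. The sketch below does this with the standard library only; the helper names `bounds` and `agree` are hypothetical, and rounding (rather than truncation) is an assumed convention.

```python
import math

def bounds(alpha, n):
    """Return alpha*(1 - f), alpha and alpha*(1 + f), with f = 1/sqrt(2n - 2)."""
    f = 1 / math.sqrt(2 * n - 2)
    return alpha * (1 - f), alpha, alpha * (1 + f)

def agree(alpha, n, sig_figs):
    """True if all three values round to the same string at sig_figs significant figures."""
    rounded = {f"{value:.{sig_figs}g}" for value in bounds(alpha, n)}
    return len(rounded) == 1

for n in (15625, 78125):
    print(n, agree(0.987654321, n, 2))
```

Comparing formatted strings avoids floating-point equality tests on the rounded values.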

### (iii) Repeat the analysis for the case where $\alpha=0.123456789$, i.e. the first significant digit of the error is 1. Make appropriate comments.

In [None]:
N_range = np.logspace(1,10, base=5, num=10).astype(int)
alpha = 0.123456789*np.ones(len(N_range))

df = make_dataframe(N_range, alpha, 'two_iii.csv')
df

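It can be seen from the table above that the values in the final three columns agree to two significant figures from ~ N = 15000 (the N = 15625 row), roughly five times fewer data points than in the previous case. When the first significant digit of the error is 1, a second significant figure is therefore justified much sooner.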

### (iv) How many data points must be collected before the third significant figure can be quoted?

In [None]:
# From the above table the point at which the numbers agree to three significant figures is somewhere
# between ~ 2000000 and 9500000. I therefore adjust my search range to cover this region

N_range = np.logspace(9,10, base=5, num=10).astype(int)
alpha = 0.123456789*np.ones(len(N_range))

df = make_dataframe(N_range, alpha, 'two_iv.csv')
df

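It can be seen from the table above that the final three columns agree to three significant figures at ~ N = 4700000, i.e. several million data points must be collected before the third significant figure can be quoted.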
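
The threshold can also be estimated directly instead of by scanning a table. Assuming the values are rounded to three significant figures, $\alpha=0.123456789$ rounds to 0.123, so the upper value $\alpha\left(1+1/\sqrt{2N-2}\right)$ must stay below 0.1235; the corresponding lower bound is far less restrictive. The function name `min_n` is hypothetical.

```python
def min_n(alpha, upper_edge):
    """Value of N above which alpha*(1 + 1/sqrt(2N - 2)) stays below upper_edge."""
    f_max = (upper_edge - alpha) / alpha   # largest allowed fractional error
    return 1 + 1 / (2 * f_max**2)          # invert f = 1/sqrt(2N - 2)

print(f"N > {min_n(0.123456789, 0.1235):.3g}")
```

Under these assumptions the estimate comes out at about four million, which is consistent with the table scan landing on the next grid point above it.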

## Question 3: Confidence limits for a Gaussian Distribution

|Centred on Mean | Measurements within range | Measurements outside range |
| --- | --- | --- |
| $\pm\sigma$ | 68% | 32% |
| $\pm1.65\sigma$ | 90% | 10% |
| $\pm2\sigma$ | 95% | 5% |
| $\pm2.58\sigma$ | 99% | 1% |
| $\pm3\sigma$ | 99.7% | 0.3% |

(i) Verify the results of the above table for the fraction of the data which lies within different ranges of a Gaussian probability distribution function. 
<bf>
(ii) What fraction of the data lies outside the following ranges from the mean? 
<blockquote>
<bf>
(a) $\pm4\sigma$ 
<bf>
(b) $\pm5\sigma$.  
</blockquote>

(iii) What is the (symmetric) range within which the following fractions of the data lie, leaving your answer in terms of $\sigma$? 
<blockquote>
<bf>
(a) 50% 
<bf>
(b) 99.9%.

### (i) Verify the results of the above table for the fraction of the data which lies within different ranges of a Gaussian probability distribution function. You must return your answers as an ARRAY OF PERCENTAGES.

In [None]:
def three_i():
    '''Return the measurements in range as an array. '''
    
    measurements_in_range = []
    sigma_multiples = [1,1.65,2,2.58,3]
    mean = 10
    st_dev = 3
    
    for i in sigma_multiples:
        x_1 = ((mean+(i*st_dev))-mean)/((np.sqrt(2)*st_dev))
        x_2 = ((mean-(i*st_dev))-mean)/((np.sqrt(2)*st_dev))
        measurements_element = (0.5*(1+erf(x_1)))-(0.5*(1+erf(x_2)))
        measurements_in_range.append(measurements_element*100)
    
    return measurements_in_range

three_i()

### (ii) What fraction of the data lies outside the following ranges from the mean? You must return your answer as a PERCENTAGE.
<blockquote>
<bf>
(a) $\pm4\sigma$ 
    
<bf>
(b) $\pm5\sigma$.  
</blockquote>


In [None]:
def three_iia():
    '''Return the fraction of measurements outside the range as a PERCENTAGE'''
    
    mean = 10
    st_dev = 3
    x_1 = ((mean+(4*st_dev))-mean)/((np.sqrt(2)*st_dev))
    x_2 = ((mean-(4*st_dev))-mean)/((np.sqrt(2)*st_dev))
    four_sigma = (0.5*(1+erf(x_1)))-(0.5*(1+erf(x_2)))
    outside_range = (1-four_sigma)*100
    
    return outside_range
   

def three_iib():
    '''Return the fraction of measurements outside the range as a PERCENTAGE'''
    
    mean = 10
    st_dev = 3
    x_1 = ((mean+(5*st_dev))-mean)/((np.sqrt(2)*st_dev))
    x_2 = ((mean-(5*st_dev))-mean)/((np.sqrt(2)*st_dev))
    five_sigma = (0.5*(1+erf(x_1)))-(0.5*(1+erf(x_2)))
    outside_range = (1-five_sigma)*100
    
    return outside_range

three_iia(), three_iib()
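
For tail fractions this small, `1 - erf(...)` subtracts two nearly equal numbers, so it can help to use the complementary error function instead. The sketch below uses `math.erfc` from the standard library (the notebook also imports `scipy.special.erfc`); the function name `percent_outside` is hypothetical.

```python
import math

def percent_outside(k):
    """Percentage of a Gaussian distribution lying outside mean +/- k*sigma."""
    return 100 * math.erfc(k / math.sqrt(2))

for k in (4, 5):
    print(k, percent_outside(k))
```

The result does not depend on the mean or standard deviation chosen, since only the multiple of $\sigma$ enters.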

### (iii) What is the (symmetric) range within which the following fractions of the data lie, leaving your answer in terms of $\sigma$?
<blockquote>
<bf>
(a) 50% 
<bf>
(b) 99.9%.

In [None]:
def phi(x):
    'Cumulative distribution function for the standard normal distribution'
    # Note: this function is needed to account for the difference
    # in error function definitions between Hughes and Hase and the
    # scipy library
    return (1.0 + erf(x / np.sqrt(2.0))) / 2.0

In [None]:
def three_iiia():
    '''Return the multiple of sigma'''
    
    test_range = np.arange(0,10,0.001)
    for i in test_range:
        solution = i
        if phi(i)-phi(-i) > 0.5:
            break
    
    return solution
    

def three_iiib():
    '''Return the multiple of sigma'''
    
    test_range = np.arange(0,10,0.001)
    for i in test_range:
        solution = i
        if phi(i)-phi(-i) > 0.999:
            break
    return solution

three_iiia(), three_iiib()
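
The scan above resolves the answer only to the step size of 0.001. As a cross-check, the same multiples of $\sigma$ can be obtained by inverting the cumulative distribution function. This sketch uses `statistics.NormalDist` from the standard library (Python 3.8+); `sigma_multiple` is a hypothetical name.

```python
from statistics import NormalDist

def sigma_multiple(fraction):
    """Return k such that `fraction` of a Gaussian lies within mean +/- k*sigma."""
    # a central fraction p leaves (1 - p)/2 in each tail
    return NormalDist().inv_cdf(0.5 + fraction / 2)

print(sigma_multiple(0.5), sigma_multiple(0.999))
```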

## Question 4: Identifying a Potential Outlier

Seven successive measurements of the charge stored on a capacitor (all in $\mu C$) are: 
<bf>
45.7, 53.2, 48.4, 45.1, 51.4, 62.1 and 49.3. 
<bf>
The sixth reading appears anomalously large. 

Required:
<blockquote>
(i) Apply Chauvenet’s criterion to ascertain whether this data point should be rejected. In the comment, you must state 'ACCEPT' or 'REJECT'. 
</blockquote>
<blockquote>
(ii) Having decided whether to keep 6 or 7 data points, calculate:
<bf>
<blockquote>
(a) The Mean
</blockquote>
<bf>
<blockquote>
(b) Standard Deviation
</blockquote>
<bf>
<blockquote>
(c) Error of the Charge.

### (i) Apply Chauvenet’s criterion to ascertain whether this data point should be rejected. In the comment, you must state 'ACCEPT' or 'REJECT'. 

In [None]:
def four_i():
    '''Your function must return the probability of an outlier, n_out and your comment to ACCEPT or REJECT'''
    
    data = [45.7, 53.2, 48.4, 45.1, 51.4, 62.1, 49.3]
    
    suspicious_position = 5 # position in the data point array of suspicious result
    
    mean = np.mean(data)
    std = np.std(data, ddof=1)
    
    # Chauvenet's criterion uses a window that is symmetric about the mean and
    # extends out to the suspect point, so work with the absolute deviation
    deviation = abs(data[suspicious_position] - mean)
    z = deviation/std # deviation in units of the standard deviation
    
    P_out = 1 - (phi(z) - phi(-z)) # probability of finding a value outside mean +/- deviation
    # note: phi is the CDF of the standard normal distribution (standard deviation of 1), so the
    # deviation must be divided by the standard deviation before it is passed in
    n_out = P_out*len(data)
    
    comment = 'ACCEPT'
    if n_out < 0.5:
        comment = 'REJECT'
        
    return P_out, n_out, comment

four_i()

### (iia) Having decided whether to keep 6 or 7 datapoints, calculate the mean.

In [None]:
def four_iia():
    '''Your function should return the mean'''
    data_points_1 = [45.7, 53.2, 48.4, 45.1, 51.4, 49.3]
    mean = np.mean(data_points_1)
    return mean

four_iia()

### (iib) Having decided whether to keep 6 or 7 datapoints, calculate the standard deviation.

In [None]:
def four_iib():
    '''Your function should return the standard deviation'''

    data_points_2 = [45.7, 53.2, 48.4, 45.1, 51.4, 49.3]
    standard_deviation = np.std(data_points_2,ddof = 1)
    return standard_deviation

four_iib()

### (iic) Having decided whether to keep 6 or 7 datapoints, calculate the standard error.

In [None]:
def four_iic():
    '''Your function should return the standard error'''

    data_points_3 = np.array([45.7, 53.2, 48.4, 45.1, 51.4, 49.3])
    st_dev = four_iib()
    standard_error = st_dev/(np.sqrt(len(data_points_3)))
    
    return standard_error

four_iic()

## Question 5: Poisson and Gaussian

Required:
<bf>
<blockquote>
(i) Plot a histogram of a Poisson distribution with mean 35.  
<bf>
</blockquote>
<blockquote>
(ii) Using the same axes plot the continuous function of a Gaussian with a mean of 35, and standard deviation $\sqrt{35}$.  
</blockquote>
<bf>
<blockquote>
(iii) Comment on the similarities and differences between the distributions.

### (i) Plot a histogram of a Poisson distribution with mean 35.  
<bf>
### (ii) Using the same axes plot the continuous function of a Gaussian with a mean of 35, and standard deviation $\sqrt{35}$. 

In [None]:
def five_i_and_ii():
    '''This function should plot the appropriate histograms'''

    mean = 35
    
    fig, ax1 = plt.subplots()
    
    # Poisson distribution
    poisson_data = np.random.poisson(lam=35,size=100000)
    color = 'tab:blue'
    # plot normalised histogram with integer bins
    # in order to avoid the bins beginning on the integers, the bins are shifted by 0.5
    ax1.hist(poisson_data, density=True, bins = np.arange(0.5,70.5), label = 'Poisson', color=color)
    ax1.set_ylabel('Poisson units')
    ax1.set_ylim(0)
    
    ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis
    
    # Gaussian distribution
    st_dev = np.sqrt(35)
    gaussian = lambda i: (1/(st_dev*np.sqrt(2*np.pi)))*np.exp((-(i-mean)**2)/(2*st_dev**2))
    data = np.arange(0,70) # range over which the Gaussian will be plotted
    gaussian_data = gaussian(data)
    color = 'tab:red'
    ax2.plot(data, gaussian_data, label = 'Gaussian', color=color)
    ax2.set_ylabel('Gaussian units')
    ax2.set_ylim(0)
    
    fig.legend(bbox_to_anchor=(0.45, 0.4, 0.4, 0.5)) # set location of legend
    fig.tight_layout()
    plt.show()
    
five_i_and_ii()

### (iii) Comment on the similarities and differences between the distributions.

Similarities:
- The peaks have almost the same height
- The overall shape is the same when a large number of points is drawn (note that the peak of the Poisson distribution sits slightly to the left of the Gaussian peak: the Poisson distribution cannot take negative values, so it has a longer tail on the right)

Differences:
- The Gaussian distribution is continuous, whereas the Poisson is discrete
- The Poisson distribution only extends over non-negative integers, whereas the Gaussian distribution extends over all real numbers
- The Poisson distribution is slightly asymmetric (positively skewed, as mentioned above) whereas the Gaussian is symmetric, although the Poisson distribution tends to a Gaussian as its mean increases
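
These points can be checked numerically without sampling. The sketch below evaluates the Poisson probability mass function and the Gaussian density directly using only the standard library; the function names are hypothetical.

```python
import math

MEAN = 35

def poisson_pmf(k, lam=MEAN):
    """Probability of observing k counts for a Poisson distribution of mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def gaussian_pdf(x, mu=MEAN, sigma=math.sqrt(MEAN)):
    """Gaussian probability density with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

# peak heights of the two distributions
print(poisson_pmf(35), gaussian_pdf(35))
# 20 counts either side of the mean: the Gaussian is symmetric, the Poisson is not
print(poisson_pmf(15), poisson_pmf(55))
print(gaussian_pdf(15), gaussian_pdf(55))
```

The comparison far from the mean shows the asymmetry most clearly, because the Poisson probability in the upper tail is larger than in the lower tail.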


## Question 6: Bumps that go missing

(Some of this text is from a Physics World article: http://physicsworld.com/cws/article/news/2016/apr/19/theorizing-about-the-lhcs-750-gev-bump)
<bf>
Last year, the LHC's ATLAS and CMS experiments both reported a small 'bump' in their data that denoted an excess of photon pairs with a combined mass of around 750 GeV. As this unexpected bump could be the first hint of a new massive particle that is not predicted by the Standard Model of particle physics, the data generated hundreds of theory papers that attempt to explain the signal.  Taking into account what is known as the look-elsewhere effect, CMS says it has seen an excess with a statistical significance of 1.6$\sigma$, while ATLAS reports a significance of about 2$\sigma$, corresponding, respectively, to a roughly 1 in 10 and 1 in 20 chance that the result is a fluke.

While these levels are far below the 5$\sigma$ 'gold standard' that must be met to claim a discovery, the fact that both collaborations saw a bump at the same energy has excited theoretical physicists.
<bf>
From the Cern Courier, October 2016:
<bf>

![title](figure_1A.JPG)

> In a dramatic parallel session, both ATLAS and CMS revealed that their 2016 data do not confirm the previous hints of a diphoton resonance at 750 GeV (figure 1); apparently, those hints were nothing more than tantalising statistical fluctuations.

### (i) Write two sentences explaining what the 'look-elsewhere effect' is.

The look-elsewhere effect is the observation that, when a search covers a large amount of data, you are increasingly likely to find at least one result that lies many standard deviations from the mean purely by chance. For example, in normally distributed data there is approximately a 0.3% chance that a given point lies more than three standard deviations from the mean, so among 1,000,000 data points, such as those taken at the Large Hadron Collider, about 3,000 are expected at the three sigma level. Such an excess therefore does not necessarily mean that you have discovered a new particle; it may simply be a random fluctuation.
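
To put a number on this, the chance that at least one of $n$ independent bins fluctuates beyond a given significance is $1-(1-p)^n$, where $p$ is the two-sided tail probability for a single bin. The sketch below assumes independent, Gaussian-distributed bins; the function names are hypothetical.

```python
import math

def tail_probability(n_sigma):
    """Two-sided probability of a fluctuation larger than n_sigma."""
    return math.erfc(n_sigma / math.sqrt(2))

def chance_of_at_least_one(n_sigma, n_bins):
    """Probability that at least one of n_bins independent bins exceeds n_sigma."""
    return 1 - (1 - tail_probability(n_sigma))**n_bins

for n_bins in (1, 100, 1000):
    print(n_bins, chance_of_at_least_one(3, n_bins))
```

A fluctuation that is rare in any single bin becomes likely somewhere once enough bins are examined.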

### (ii) There  are typically more than 100 papers a year from these detectors, each of which has up to 10 histograms, each of which has 50-100 bins.  Assuming that there are $10^5$ bins in a year, how many 2, 3, 4 and 5$\sigma$ events will there be?

In [None]:
def six_ii():
    '''Your function should return the number of 2,3,4,5 sigma events'''
    bins = 10**5
    
    # function to calculate the number of bins at a given sigma significance
    num_bins = lambda bins, sigma: (1-(phi(sigma)-phi(-sigma)))*bins
    
    two_sigma = num_bins(bins,2)
    three_sigma = num_bins(bins,3)
    four_sigma = num_bins(bins,4)
    five_sigma = num_bins(bins,5)

    return two_sigma, three_sigma, four_sigma, five_sigma

six_ii()

### (iii) What are the chances of two 2$\sigma$ events at the same energy?

Although this question asked for a calculation, it is fine to explain your answer logically in words.

We take the question to mean: what is the chance of finding two 2 sigma bins at the same energy, where the two significant bins are in different plots? Around 4500 such bins are expected across 1000 histograms (100 papers with 10 histograms each), i.e. 4-5 significant bins per histogram. Additionally, there are 499500 ways of choosing 2 different histograms out of the 1000, and for any given pair there is a 16-20% chance that each contains a significant bin at the same energy. It is therefore almost certain that two 2 sigma events will be found at the same energy somewhere in the data.
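
The 16-20% figure can be checked with a rough estimate. Assuming two histograms with identical binning, independent bins and 100 bins each (the upper end of the range quoted in the question), a given bin is significant in both histograms with probability $p^2$, where $p\approx 0.0455$ is the two-sided 2$\sigma$ tail probability. The function name below is hypothetical.

```python
import math

def coincidence_chance(n_sigma, n_bins):
    """Chance that two histograms share at least one bin beyond n_sigma."""
    p = math.erfc(n_sigma / math.sqrt(2))   # two-sided tail probability per bin
    return 1 - (1 - p**2)**n_bins           # at least one bin significant in both

print(coincidence_chance(2, 100))
```

With fewer bins per histogram the chance for a single pair is smaller, but the very large number of possible pairs still makes a coincidence somewhere in the data close to certain.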

## Coding Exercise

Choose one of the distributions we discussed in the context of the Central Limit theorem: either the uniform distribution, the triangular distribution or a Gaussian distribution. They should span the interval 0 to 1.
<bf>
Write code that allows you to choose numbers at random from this distribution. Then,

<blockquote>
<bf>
(i) Choose 1,000 numbers at random, and plot a histogram of their occurrences.
</blockquote>
<bf>
<blockquote>
(ii) Choose 2 numbers from  the distribution at random, and average them.  Repeat this 1,000 times and plot a histogram.
<bf>
</blockquote>
<blockquote>
(iii) Do the same for the sum of 3, 4 and 5 numbers, and make the corresponding plots.
<bf>
</blockquote>
<blockquote>
(iv) Comment on your results.

In [None]:
data = np.random.uniform(size = 1000)

def averaging(data, num_to_average):
    data = np.array(data) # ensure data is in a numpy array format
    averaged_data = []
    for i in range(1000):
        # select the desired number of indices at random to determine the positions of the elements to be averaged
        indices = np.random.randint(0, len(data), size=num_to_average)
        averaged_data.append(np.mean(data[indices]))
    return averaged_data

plt.hist(data, bins='auto')
plt.title('Uniformly distributed')
plt.show()

plt.figure()
nums_to_average = [2,3,4,5]

for i in nums_to_average:
    averaged_data = averaging(data,i)
    plt.hist(averaged_data, bins='auto')
    plt.title('Uniformly distributed with {} averaged points'.format(i))
    plt.show()

### (iv) Comment on your results.

The distribution increasingly approaches a normal distribution as more points are averaged over. This is a consequence of the Central Limit Theorem.
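
A quantitative check of this is that the mean of $n$ independent uniform numbers on [0, 1] has standard deviation $1/\sqrt{12n}$, so the histograms should narrow as $n$ grows. The sketch below draws fresh numbers for every average using the standard library; the seed and the function name `sample_means` are arbitrary choices for illustration.

```python
import math
import random
import statistics

def sample_means(n, repeats=1000, rng=None):
    """Return `repeats` averages, each taken over n fresh uniform numbers on [0, 1)."""
    rng = rng or random.Random()
    return [sum(rng.random() for _ in range(n)) / n for _ in range(repeats)]

rng = random.Random(0)  # arbitrary seed so that the run is repeatable
for n in (1, 2, 3, 4, 5):
    measured = statistics.stdev(sample_means(n, rng=rng))
    expected = 1 / math.sqrt(12 * n)
    print(n, round(measured, 3), round(expected, 3))
```

The measured and expected widths should agree to within a few percent for 1,000 repeats.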