# Moderne Methoden der Datenanalyse SS2021
# Practical Exercise 6

## Exercise 6.1: Hypothesis Testing

"Is this a new discovery or just a statistical fluctuation?" Statistics offers some methods to give a quantitative answer. But these methods should not be used blindly. In particular, one should know exactly what the obtained numbers mean and what they don't mean.

### Exercise 6.1.1 (obligatory to solve either 6.1.1 or 6.1.2)

The following table shows the number of winners in a horse race for different track numbers:

| track             | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  |
|-------------------|----|----|----|----|----|----|----|----|
| number of winners | 29 | 19 | 18 | 25 | 17 | 10 | 15 | 11 |

Use a $\chi ^2$ test to check the hypothesis that the track number has *no* influence on the chance to win. Define a significance level, e.g., $\alpha = 5 \, \%$ or $\alpha = 1 \, \%$, *before* you perform the test.
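
A minimal sketch of such a test with `scipy`, assuming that under the null hypothesis every track is equally likely to produce the winner, so that the expected count per track is the total number of races divided by the number of tracks. The helper name `chi2_uniform_test` and its return values are illustrative and not part of the exercise template.

```python
import numpy as np
from scipy.stats import chi2


def chi2_uniform_test(counts, alpha=0.05):
    """Chi-square test of observed counts against a flat expectation.

    Returns (chi2 value, degrees of freedom, p-value, reject flag).
    """
    counts = np.asarray(counts, dtype=float)
    # assumption: equal expected count for every track under the null hypothesis
    expected = counts.sum() / counts.size
    chi2_value = np.sum((counts - expected) ** 2 / expected)
    # one constraint (the total number of races) is taken from the data
    ndf = counts.size - 1
    # survival function = probability to find a chi2 at least this large
    p_value = chi2.sf(chi2_value, ndf)
    return chi2_value, ndf, p_value, p_value < alpha


nWin = [29, 19, 18, 25, 17, 10, 15, 11]
chi2_value, ndf, p_value, reject = chi2_uniform_test(nWin, alpha=0.05)
print("chi2 = {0:.2f}, ndf = {1}, p-value = {2:.4f}".format(chi2_value, ndf, p_value))
```

The hypothesis is rejected if the p-value is smaller than the significance level chosen beforehand; the same p-value can lead to different decisions for $\alpha = 5 \, \%$ and $\alpha = 1 \, \%$.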

In [None]:
# pick your poison
from ROOT import TMath
from scipy.stats import chi2

def exercise6_1_1(confidenceLevel):
  
    # numbers given in the exercise
    nTracks = 8
    nWin = [29, 19, 18, 25, 17, 10, 15, 11]
    
    return

### Exercise 6.1.2 (obligatory to solve either 6.1.1 or 6.1.2)

In a counting experiment, 5 events are observed while $\mu _\mathrm{B} = 1.8$ background events are expected. Is this a significant ($= 3 \sigma$) excess? Calculate the probability of observing 5 or more events when the expectation value is 1.8 using Poisson statistics.
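
A possible sketch with `scipy` and `numpy`, assuming a one-sided conversion of the p-value into a Gaussian significance (other conventions exist). The variable names, the number of toy experiments, and the random seed are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import norm, poisson

mu_b = 1.8   # expected background
n_obs = 5    # observed events

# P(n >= n_obs | mu_b); sf(k) returns P(n > k), hence the n_obs - 1
p_value = poisson.sf(n_obs - 1, mu_b)

# assumption: one-sided Gaussian tail, i.e. p = 1 - Phi(Z)
z_value = norm.isf(p_value)

# optional cross-check with toy experiments (seed and size chosen arbitrarily)
rng = np.random.default_rng(1)
toys = rng.poisson(mu_b, size=200_000)
p_toys = np.mean(toys >= n_obs)

print("p-value (exact) = {0:.4f}".format(p_value))
print("p-value (toys)  = {0:.4f}".format(p_toys))
print("significance    = {0:.2f} sigma".format(z_value))
```

Comparing `z_value` with 3 answers the question about the $3 \sigma$ excess; the toy estimate should agree with the exact value within its statistical precision.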

In [None]:
import numpy as np

# pick your poison
from ROOT import gRandom, TMath, Math
from scipy.stats import poisson, norm

In [None]:
def exercise6_1_2():
  
    # numbers given in the exercise
    nBackground = 1.8
    nObserved = 5

    # calculate the probability to observe 5 or more events
  
    # optional: make toy experiments
    
    return

## Exercise 6.2: Parameter Estimation

### Exercise 6.2.1 (voluntary)

Consider the following set of values, which approximately follow a Gaussian distribution (see also `exercise_6_2_1.csv`):


<table id="sometable">
  <tr style="border-bottom: 1px solid darkgray">
    <th> x<sub>i</sub> </th><th> y<sub>i</sub> </th><th style="padding-right: 40px;"> &sigma;<sub>i</sub></th>
    <th> x<sub>i</sub> </th><th> y<sub>i</sub> </th><th style="padding-right: 40px;"> &sigma;<sub>i</sub></th>
    <th> x<sub>i</sub> </th><th> y<sub>i</sub> </th><th style="padding-right: 40px;"> &sigma;<sub>i</sub></th>
    <th> x<sub>i</sub> </th><th> y<sub>i</sub> </th><th style="padding-right: 40px;"> &sigma;<sub>i</sub></th>
  </tr> <tr>
    <td>0.46</td> <td>0.19</td> <td style="padding-right: 40px;">0.05</td>   <td>0.69</td><td>0.27</td><td style="padding-right: 40px;">0.06</td>
    <td>0.71</td> <td>0.28</td> <td style="padding-right: 40px;">0.05</td>   <td>1.04</td><td>0.62</td><td>0.01</td>
  </tr> <tr>
    <td>1.11</td> <td>0.68</td> <td style="padding-right: 40px;">0.05</td>   <td>1.14</td><td>0.70</td><td style="padding-right: 40px;">0.07</td>
    <td>1.17</td> <td>0.74</td> <td style="padding-right: 40px;">0.08</td>   <td>1.20</td><td>0.81</td><td>0.09</td>
  </tr> <tr>
    <td>1.31</td> <td>0.93</td> <td style="padding-right: 40px;">0.10</td>   <td>2.03</td><td>2.49</td><td style="padding-right: 40px;">0.03</td>
    <td>2.14</td> <td>2.73</td> <td style="padding-right: 40px;">0.04</td>   <td>2.52</td><td>3.57</td><td>0.01</td>
  </tr> <tr>
    <td>3.24</td> <td>3.90</td> <td style="padding-right: 40px;">0.07</td>   <td>3.46</td><td>3.55</td><td style="padding-right: 40px;">0.03</td>
    <td>3.81</td> <td>2.87</td> <td style="padding-right: 40px;">0.03</td>   <td>4.06</td><td>2.24</td><td>0.01</td>
  </tr> <tr>
    <td>4.93</td> <td>0.65</td> <td style="padding-right: 40px;">0.10</td>   <td>5.11</td><td>0.39</td><td style="padding-right: 40px;">0.07</td>
    <td>5.26</td> <td>0.33</td> <td style="padding-right: 40px;">0.05</td>   <td>5.38</td><td>0.26</td><td>0.08</td>
  </tr>    
</table>  


The data are to be described by the function

$$ y(x) = a _\mathrm{1} \cdot \exp \left\{ -\frac{1}{2} \left( \frac{ x-a _\mathrm{2} }{ a _\mathrm{3} } \right)^2 \right\} $$

with $\sigma _i$ being the uncertainty on $y _i$.

 - Determine the values of the three parameters $a _\mathrm{1}$, $a _\mathrm{2}$, and $a _\mathrm{3}$ as well as their uncertainties.
 
 - Afterwards, use the transformation $z = \ln (y)$ to obtain the function $z(x) = b _\mathrm{1} + b _\mathrm{2} x + b _\mathrm{3} x^2$, which is linear in its parameters. Determine the new parameters $b _\mathrm{1}$, $b _\mathrm{2}$, and $b _\mathrm{3}$ and their uncertainties in two ways and compare the results: first, by fitting this new function; second, by a calculation using the results for $a _\mathrm{j}$ and the transformation $z = \ln (y)$.
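
The two steps above can be sketched with `scipy.optimize.curve_fit` as follows. Expanding $\ln y(x)$ gives $b _\mathrm{1} = \ln a _\mathrm{1} - a _\mathrm{2}^2 / (2 a _\mathrm{3}^2)$, $b _\mathrm{2} = a _\mathrm{2} / a _\mathrm{3}^2$, and $b _\mathrm{3} = -1 / (2 a _\mathrm{3}^2)$. The sketch assumes first-order error propagation ($\sigma _z = \sigma _y / y$ for the data, a Jacobian for the parameters) and start values read off the table by eye; all variable names are illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit

# data from the table above: columns x_i, y_i, sigma_i
data = np.array([
    [0.46, 0.19, 0.05], [0.69, 0.27, 0.06], [0.71, 0.28, 0.05], [1.04, 0.62, 0.01],
    [1.11, 0.68, 0.05], [1.14, 0.70, 0.07], [1.17, 0.74, 0.08], [1.20, 0.81, 0.09],
    [1.31, 0.93, 0.10], [2.03, 2.49, 0.03], [2.14, 2.73, 0.04], [2.52, 3.57, 0.01],
    [3.24, 3.90, 0.07], [3.46, 3.55, 0.03], [3.81, 2.87, 0.03], [4.06, 2.24, 0.01],
    [4.93, 0.65, 0.10], [5.11, 0.39, 0.07], [5.26, 0.33, 0.05], [5.38, 0.26, 0.08],
])
x, y, sigma = data.T


def gauss(x, a1, a2, a3):
    return a1 * np.exp(-0.5 * ((x - a2) / a3) ** 2)


def parabola(x, b1, b2, b3):
    return b1 + b2 * x + b3 * x ** 2


# Gaussian fit; start values are a rough guess from the table (assumption)
a_opt, a_cov = curve_fit(gauss, x, y, p0=[4.0, 3.0, 1.0],
                         sigma=sigma, absolute_sigma=True)

# way 1: fit z = ln(y) directly, with sigma_z = sigma_y / y (first order)
z, sigma_z = np.log(y), sigma / y
b_fit, b_fit_cov = curve_fit(parabola, x, z, sigma=sigma_z, absolute_sigma=True)

# way 2: calculate b from a and propagate the covariance with the Jacobian db/da
a1, a2, a3 = a_opt
b_calc = np.array([np.log(a1) - a2 ** 2 / (2 * a3 ** 2),
                   a2 / a3 ** 2,
                   -1.0 / (2 * a3 ** 2)])
jac = np.array([
    [1.0 / a1, -a2 / a3 ** 2, a2 ** 2 / a3 ** 3],
    [0.0, 1.0 / a3 ** 2, -2.0 * a2 / a3 ** 3],
    [0.0, 0.0, 1.0 / a3 ** 3],
])
b_calc_cov = jac @ a_cov @ jac.T

for name, val, cov in (("fit ", b_fit, b_fit_cov), ("calc", b_calc, b_calc_cov)):
    print(name, np.round(val, 3), "+-", np.round(np.sqrt(np.diag(cov)), 3))
```

The two sets of $b _\mathrm{j}$ need not agree exactly: the logarithm changes the relative weights of the data points, and the propagated uncertainties are only a first-order approximation.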

#### ROOT approach

In [None]:
from ROOT import TF1, TGraphErrors, TCanvas
import numpy as np

def exercise6_2_1(verbose = 0):
    # read exercise_6_2_1.csv or type the values explicitly
    
    # fit a Gaussian function
    
    #print(" --------  Fit Parameters ---------- ")
    #print(" a1 =     {0:.3f} +/- {1:.3f} ".format(a1, err_a1))
    #print(" a2 =     {0:.3f} +/- {1:.3f} ".format(a2, err_a2))
    #print(" a3 =     {0:.3f} +/- {1:.3f} ".format(a3, err_a3))
    #print(" chi2 =   {0:.3f} ".format(chi))
 

    # now with log
    
    # calculation
    
    
    #print(" --------  Fit Parameters ---------- ")
    #print(" b1 =     {0:.3f} +/- {1:.3f} ".format(b1, err_b1))
    #print(" b2 =     {0:.3f} +/- {1:.3f} ".format(b2, err_b2))
    #print(" b3 =     {0:.3f} +/- {1:.3f} ".format(b3, err_b3))
    #print(" chi2 =   {0:.3f} ".format(chi2))
  
    #print("\n --------  Estimated Parameters ---------- ")
    #print(" b1 =     {0:.3f} +/- {1:.3f} ".format(b1est, err_b1est))
    #print(" b2 =     {0:.3f} +/- {1:.3f} ".format(b2est, err_b2est))
    #print(" b3 =     {0:.3f} +/- {1:.3f} ".format(b3est, err_b3est))

  
    # draw graphs

    return

#### Python approach

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize

In [None]:
def exercise_6_2_1():
    # read exercise_6_2_1.csv or type the values explicitly

    # fit a Gaussian function

    #print("Fit Result:")
    #print("a1 = {0:.3f} +- {3:.3f}\na2 = {1:.3f} +- {4:.3f}\na3 = {2:.3f} +- {5:.3f}".
    #      format(a1, a2, a3, a1err, a2err, a3err))
    
    
    # now with log
        
    # calculation
    

    #print("Fit Result:")
    #print("b1 = {0:.3f} +- {3:.3f}\nb2 = {1:.3f} +- {4:.3f}\nb3 = {2:.3f} +- {5:.3f}".
    #      format(*bopt, np.sqrt(bcov[0, 0]), np.sqrt(bcov[1, 1]), np.sqrt(bcov[2, 2])))
    #print("Transformed Result:")
    #print("b1 = {0:.3f} +- {3:.3f}\nb2 = {1:.3f} +- {4:.3f}\nb3 = {2:.3f} +- {5:.3f}".
    #      format(b1, b2, b3, b1err, b2err, b3err))
    
    # plot
    
    return

### Exercise 6.2.2 (obligatory)

This exercise aims at constructing the error band around a function $f(x)$ fitted to data points $(x, y)$, i.e., the errors on the fitted parameters are propagated to an error on the value of the function at each value of $x$. (If you do not want to use `ROOT` for the fit in this exercise, you can instead use a fitting routine provided by `scipy`, e.g., `scipy.optimize.curve_fit`; see the Python approach.)

Let us attack the problem in steps:

 - First, define a function for our problem: a straight line $f(x) = a + bx$ with parameters $a$ and $b$.
 
 - Next, consider `npoints=11` data points in the interval `[xof, xof+npoints-1]` with `xof=10` and y(x) = x. To simulate measurement errors, shift the data points in y-direction by a random shift, drawn from a Gaussian distribution with $\sigma = 0.5$ and $\mu = 0.0$.

 - Now fit the straight line defined above to your data points.
 
   Draw the data points and the fit result. Print the correlation coefficient of the errors on the parameters $a$ and $b$.
   
   Try to give an intuitive argument why the two parameters $a$ (axis intercept) and $b$ (slope) are strongly correlated.
   
 - Next, construct the error band around the fit function: write a function `make_band(f, cov, withCorr=0)`, which takes the function $f$ and the covariance matrix as input arguments and calculates, for each value of $x$, the error on $y$, $\Delta _y (x)$. As a first approach, use the simple formula for error propagation, which, in this case, results in $\Delta _y (x) ^2 = \Delta _a ^2 + (x \cdot \Delta _b )^2$ if correlations are neglected.
 
   Draw the curves $y(x) \pm \Delta_y (x)$, i.e., the "error band" on top of your data and fit result. Does this look correct?
   
 - Derive the appropriate formula for error propagation taking into account the correlation of the errors. Re-calculate $\Delta _y (x)$ and plot the corresponding error band. Compare with the above result obtained without correlations.
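
The steps above can be sketched with `scipy.optimize.curve_fit` as follows. With correlations, error propagation for $f(x) = a + bx$ gives $\Delta _y (x) ^2 = \Delta _a ^2 + (x \cdot \Delta _b )^2 + 2 x \, \mathrm{cov}(a, b)$. The helper names `line` and `band`, the random seed, and the use of `absolute_sigma=True` are choices made for this illustration and not prescribed by the exercise.

```python
import numpy as np
from scipy.optimize import curve_fit


def line(x, a, b):
    return a + b * x


def band(x, pcov, use_correlation=False):
    """Uncertainty on a + b*x from the 2x2 covariance matrix of (a, b)."""
    var = pcov[0, 0] + x ** 2 * pcov[1, 1]
    if use_correlation:
        # off-diagonal term of the error propagation formula
        var = var + 2.0 * x * pcov[0, 1]
    return np.sqrt(var)


npoints, xof, sigma = 11, 10, 0.5
rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only

# true relation y(x) = x, smeared with a Gaussian of width sigma
x = xof + np.arange(npoints, dtype=float)
y = x + rng.normal(0.0, sigma, size=npoints)

popt, pcov = curve_fit(line, x, y, sigma=np.full(npoints, sigma),
                       absolute_sigma=True)

# correlation coefficient of the errors on a and b
rho = pcov[0, 1] / np.sqrt(pcov[0, 0] * pcov[1, 1])

dy_uncorr = band(x, pcov)
dy_corr = band(x, pcov, use_correlation=True)

print("a = {0:.3f}, b = {1:.3f}, rho = {2:.3f}".format(popt[0], popt[1], rho))
print("band width at the central point: {0:.3f} (no corr.), {1:.3f} (corr.)"
      .format(dy_uncorr[npoints // 2], dy_corr[npoints // 2]))
```

The band can then be drawn, for instance, with `plt.fill_between(x, line(x, *popt) - dy_corr, line(x, *popt) + dy_corr, alpha=0.3)` from `matplotlib`.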
  
To get a better idea of what happens here and what the effect of the correlations is, you might want to repeat the whole exercise setting `xof=-5.`, meaning now that the mean of the $x$-values of the data points is approximately at $0$. Alternatively, you may choose the parametrisation of the function as `y(x) = a' + b(x-(xof+(npoints-1)/2))`.
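
As a worked illustration of why shifting the $x$-values matters, assume $N$ data points with equal uncertainty $\sigma$ (as in this exercise). The least-squares covariance matrix of $(a, b)$ is then

$$ V = \frac{\sigma^2}{N \sum x_i^2 - \left( \sum x_i \right)^2} \begin{pmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & N \end{pmatrix} , $$

so that the correlation coefficient depends only on the $x$-values:

$$ \rho _{ab} = -\frac{\sum x_i}{\sqrt{N \sum x_i^2}} = -\frac{\bar{x}}{\sqrt{\overline{x^2}}} . $$

For `xof=10` and `npoints=11` one has $\bar{x} = 15$ and $\overline{x^2} = 235$, hence $\rho _{ab} = -15 / \sqrt{235} \approx -0.98$. For `xof=-5` one has $\bar{x} = 0$ and therefore $\rho _{ab} = 0$. Inserting $V$ into the full propagation formula gives

$$ \Delta _y (x) ^2 = \frac{\sigma^2}{N} + (x - \bar{x})^2 \, \Delta _b ^2 , $$

i.e., the band is narrowest at the mean of the $x$-values, where only the uncertainty $\sigma / \sqrt{N}$ of the average remains.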

#### ROOT approach

In [None]:
from ROOT import TF1, gRandom, TGraphErrors, TAxis, TVirtualFitter, TMatrixD, TH1, TCanvas, TFile, TLegend
from array import array
from math import fabs, sqrt

In [None]:
# Evaluates error band with/without taking into account the correlation
def make_band_r (f, cov, withCorr=0):
    # As first approach, use the simple formula for error propagation
    
    # Derive the appropriate formula taking into account the correlation of the errors
    
    return

In [1]:
def drawUncertaintyBand(npoints=11, xof=10, sigma=0.5):
    # First: Define a function (e.g. a straight line)
  

    # Next: `npoints` data points in the interval [xof, xof+npoints-1] with y(x) = x.
    # Shift the data points in y-direction

    
    # Store the values in a TGraphErrors
  

    # Now fit the straight line defined above to your data points

    
    # Access the correlation matrix
    # Hint: you can use the code in the appendix (see bottom of this notebook!)
    
    
    # Define above the function `make_band` and draw the "error band" on top of your data and fit result
    # w/o correlations  

    # w/ correlations
    
    
    return

#### Python approach

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize

In [None]:
# First: Define a function (e.g. a straight line)
def linear(x, *p):
    # straight line f(x) = a + b*x with p = (a, b)
    return p[0] + p[1] * x

In [None]:
def make_band_p(function, popt, pcov, range, use_correlation=False):
    # As first approach, use the simple formula for error propagation
    
    # Derive the appropriate formula taking into account the correlation of the errors
    
    return

In [None]:
def exercise_6_2_2(xof=10):
    # Next: `npoints` data points in the interval [xof, xof+npoints-1] with y(x) = x.
    # Shift the data points in y-direction

    # Now fit the straight line defined above to your data points

    # Define above the function `make_band` and draw the "error band" on top of your data and fit result
    # w/o correlation

    # w/ correlation
    
    return

### Appendix: Accessing the covariance matrix

#### Example code for pyROOT / JupyROOT: