<hr style="height: 1px;">
<i>This notebook was authored by the 8.S50x Course Team, Copyright 2022 MIT All Rights Reserved.</i>
<hr style="height: 1px;">
<br>

<h1>Lesson 12: Hypothesis Testing Part 2</h1>


<a name='section_12_0'></a>
<hr style="height: 1px;">


## <h2 style="border:1px; border-style:solid; padding: 0.25em; color: #FFFFFF; background-color: #90409C">L12.0 Overview</h2>


<h3>Navigation</h3>

<table style="width:100%">
    <tr>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#section_12_1">L12.1 Looking at Higgs Data</a></td>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#exercises_12_1">L12.1 Exercises</a></td>
    </tr>
    <tr>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#section_12_2">L12.2 Fitting the Higgs Data and Introducing the f-test</a></td>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#exercises_12_2">L12.2 Exercises</a></td>
    </tr>
    <tr>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#section_12_3">L12.3 Computation Using the f-statistic</a></td>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#exercises_12_3">L12.3 Exercises</a></td>
    </tr>
    <tr>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#section_12_4">L12.4 Fitting the Higgs Signal Part 1</a></td>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#exercises_12_4">L12.4 Exercises</a></td>
    </tr>
    <tr>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#section_12_5">L12.5 Fitting the Higgs Signal Part 2</a></td>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#exercises_12_5">L12.5 Exercises</a></td>
    </tr>
    <tr>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#section_12_6">L12.6 Fitting Using Higher Order Polynomials</a></td>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#exercises_12_6">L12.6 Exercises</a></td>
    </tr>
    <tr>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#section_12_7">L12.7 Building Interpolated Distributions</a></td>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#exercises_12_7">L12.7 Exercises</a></td>
    </tr>
    <tr>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#section_12_8">L12.8 An Example Fitting Z Boson Data</a></td>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#exercises_12_8">L12.8 Exercises</a></td>
    </tr>
</table>


<h3>Learning Objectives</h3>

In this class, we will formalize how we actually make a scientific hypothesis and test it!

This lesson covers the Higgs boson discovery and the f-statistic. Specifically, we discuss the Higgs boson discovery, which involved the use of the f-statistic when testing whether the data collected at the Large Hadron Collider were consistent with the presence of the Higgs boson. Furthermore, we explain how the f-statistic is used when fitting the Higgs data and how it tells us whether higher-order polynomials meaningfully improve the fit.

The lesson also covers several topics related to function choice, including how to build interpolated distributions and how to deal with non-analytic forms. We explain how different functions can be used to model different types of data and how to choose the best function for a particular dataset.

<h3>Data</h3>

>description: Higgs to di-photon<br>
>source: https://zenodo.org/record/8035284 <br>
>attribution: Harris, Philip based on Staufen, Christian (CMS Collaboration), DOI:10.5281/zenodo.8035284  

In [None]:
#>>>RUN: L12.0-runcell00

#If you are in a Google Colab environment, run this cell to import the data for this notebook.
#Otherwise, if you have downloaded the course repository, you do not have to run this cell.

!git init
!git remote add -f origin https://github.com/mitx-8s50/nb_LEARNER/
!git config core.sparseCheckout true
!echo 'data/L12' >> .git/info/sparse-checkout
!git pull origin main

<h3>Importing Libraries</h3>

Before beginning, run the cell below to import the relevant libraries for this notebook.


In [None]:
#>>>RUN: L12.0-runcell01

!pip install george #or use #conda install -c conda-forge george
!pip install lmfit


In [None]:
#>>>RUN: L12.0-runcell02

import numpy as np                #https://numpy.org/doc/stable/ 
import lmfit                      #https://lmfit.github.io/lmfit-py/ 
import matplotlib.pyplot as plt   #https://matplotlib.org/3.5.3/api/_as_gen/matplotlib.pyplot.html
from scipy import stats           #https://docs.scipy.org/doc/scipy/reference/stats.html
from scipy import interpolate     #https://docs.scipy.org/doc/scipy/reference/interpolate.html
import george                     #https://george.readthedocs.io/en/latest/
from george import kernels        #https://george.readthedocs.io/en/latest/user/kernels

<h3>Setting Default Figure Parameters</h3>

The following code cell sets default values for figure parameters.


In [None]:
#>>>RUN: L12.0-runcell03

#set plot resolution
%config InlineBackend.figure_format = 'retina'

#set default figure parameters
plt.rcParams['figure.figsize'] = (9,6)

medium_size = 12
large_size = 15

plt.rc('font', size=medium_size)          # default text sizes
plt.rc('xtick', labelsize=medium_size)    # xtick labels
plt.rc('ytick', labelsize=medium_size)    # ytick labels
plt.rc('legend', fontsize=medium_size)    # legend
plt.rc('axes', titlesize=large_size)      # axes title
plt.rc('axes', labelsize=large_size)      # x and y labels
plt.rc('figure', titlesize=large_size)    # figure title


<a name='section_12_1'></a>
<hr style="height: 1px;">

## <h2 style="border:1px; border-style:solid; padding: 0.25em; color: #FFFFFF; background-color: #90409C">L12.1 Looking at Higgs Data</h2>  

| [Top](#section_12_0) | [Previous Section](#section_12_0) | [Exercises](#exercises_12_1) | [Next Section](#section_12_2) |

*The material in this section is discussed in the video **<a href="https://courses.mitxonline.mit.edu/learn/course/course-v1:MITxT+8.S50.2x+1T2025/block-v1:MITxT+8.S50.2x+1T2025+type@sequential+block@seq_LS12/block-v1:MITxT+8.S50.2x+1T2025+type@vertical+block@vert_LS12_vid1" target="_blank">HERE</a>.** You are encouraged to watch that video and use this notebook concurrently.*

<h3>Slides</h3>

Run the code below to view the slides for this section, which are discussed in the related video. You can also open the slides in a separate window <a href="https://mitx-8s50.github.io/slides/L12/slides_L12_02.html" target="_blank">HERE</a>.

In [None]:
#>>>RUN: L12.1-slides

from IPython.display import IFrame
IFrame(src='https://mitx-8s50.github.io/slides/L12/slides_L12_02.html', width=970, height=550)

<h3>Reading the Higgs Data</h3>

Let's start the process of fitting some Higgs boson data, specifically decays of the Higgs boson to two photons collected using the CMS detector at the LHC in 2011. First, let's look at the data. 

Note that the file `out.txt` is opened in the editor in the video but the file that is opened in software and then plotted is `out_2011.txt`.

In [None]:
#>>>RUN: L12.1-runcell01

import numpy as np
import csv
import matplotlib.pyplot as plt

#Read the Higgs data; we will fit polynomials to it with lmfit in the next section
x = []
y = []
y_err = []

#change the filename here

#2012 data
#filename = 'out.txt'
#filename = 'out2.txt'
#filename = 'out3.txt'
#filename = 'out4.txt'
#filename = 'out5.txt'

#2011 data
filename = 'out_2011.txt' #TO MATCH THE VIDEO, USE THIS DATA SET
#filename = 'out2_2011.txt'
#filename = 'out3_2011.txt'
#filename = 'out4_2011.txt'
#filename = 'out5_2011.txt'


label='data/L12/' + filename
with open(label,'r') as csvfile:
    plots = csv.reader(csvfile, delimiter=' ')
    for row in plots:
        if float(row[1]) > 150 or float(row[1]) < 110:
            continue
        x.append(float(row[1]))
        y.append(float(row[2]))
        #add poisson uncertainties                                                                                                 
        y_err.append(np.sqrt(float(row[2])))

#weights for the fit are the inverse of the uncertainties
weights = 1./np.array(y_err)

#Now we plot it. 
plt.title(filename)
plt.errorbar(x,y,y_err, lw=2,fmt=".k", capsize=0)
plt.xlabel(r"$m_{\gamma\gamma}$") #raw string avoids an invalid '\g' escape warning
plt.ylabel("$N_{events}$")
plt.show()

<a name='exercises_12_1'></a>     

| [Top](#section_12_0) | [Restart Section](#section_12_1) | [Next Section](#section_12_2) |

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Ex-12.1.1</span>

In the code cell above, we plotted 2011 data from the highest purity category. Now, try plotting data from the other categories. Consider the following files, which represent the different categories:

<pre>
#2012 data
#filename = 'out.txt'
#filename = 'out2.txt'
#filename = 'out3.txt'
#filename = 'out4.txt'
#filename = 'out5.txt'

#2011 data
filename = 'out_2011.txt' #THIS IS SHOWN IN THE VIDEO
#filename = 'out2_2011.txt'
#filename = 'out3_2011.txt'
#filename = 'out4_2011.txt'
#filename = 'out5_2011.txt'
</pre>

In general, which of the following statements give accurate characterizations of the data? Select all that apply. Note that the uncertainties in the data are shown by the vertical error bars.

A) The data from higher number categories has larger uncertainties compared to lower categories.\
B) The 2011 data has larger uncertainties compared to the 2012 data.\
C) The data from higher number categories are flatter compared to lower categories.\
D) There appear to be other features (some resembling bumps) beyond the Higgs mass ($m_{\gamma\gamma}\approx$125) in several of the datasets.\
E) The data all follow a perfectly smooth falling trend with no suggestions of bumps or other features.

Hint: You can plot one file at a time by changing which `filename` line is uncommented in the code cell above, OR you can write a function to load all of the files and view them all at once. You may find the command `plot(axs[0,0],"data/L12/out.txt")` useful.
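
As a minimal sketch of the kind of helper the hint refers to (the name `plot`, its signature, and the `m_min`/`m_max` arguments are assumptions made for illustration, not a function defined elsewhere in this notebook), the function below reads one file with the same column layout and mass window as the code cell above and draws it on a given axis:

```python
import csv
import os
import numpy as np

def plot(ax, path, m_min=110.0, m_max=150.0):
    """Read one spectrum file and draw it on `ax` with Poisson error bars.

    Assumes the layout used in the cell above: space-delimited rows with
    the diphoton mass in column 1 and the event count in column 2.
    """
    x, y = [], []
    with open(path, 'r') as csvfile:
        for row in csv.reader(csvfile, delimiter=' '):
            mass = float(row[1])
            if mass < m_min or mass > m_max:
                continue  # keep only the mass window used above
            x.append(mass)
            y.append(float(row[2]))
    x, y = np.array(x), np.array(y)
    y_err = np.sqrt(y)  # Poisson uncertainties
    ax.errorbar(x, y, y_err, lw=2, fmt=".k", capsize=0)
    ax.set_title(os.path.basename(path))
    return x, y, y_err
```

With a grid created by `fig, axs = plt.subplots(2, 5, figsize=(20, 8))`, calling `plot(axs[0,0], "data/L12/out.txt")`, `plot(axs[0,1], "data/L12/out2.txt")`, and so on places one category in each panel.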


<a name='section_12_2'></a>
<hr style="height: 1px;">

## <h2 style="border:1px; border-style:solid; padding: 0.25em; color: #FFFFFF; background-color: #90409C">L12.2 Fitting Higgs Data and Introducing the f-test</h2>  


| [Top](#section_12_0) | [Previous Section](#section_12_1) | [Exercises](#exercises_12_2) | [Next Section](#section_12_3) |

*The material in this section is discussed in the video **<a href="https://courses.mitxonline.mit.edu/learn/course/course-v1:MITxT+8.S50.2x+1T2025/block-v1:MITxT+8.S50.2x+1T2025+type@sequential+block@seq_LS12/block-v1:MITxT+8.S50.2x+1T2025+type@vertical+block@vert_LS12_vid2" target="_blank">HERE</a>.** You are encouraged to watch that video and use this notebook concurrently.*

<h3>Slides</h3>

Run the code below to view the slides for this section, which are discussed in the related video. You can also open the slides in a separate window <a href="https://mitx-8s50.github.io/slides/L12/slides_L12_03.html" target="_blank">HERE</a>.

In [None]:
#>>>RUN: L12.2-slides

from IPython.display import IFrame
IFrame(src='https://mitx-8s50.github.io/slides/L12/slides_L12_03.html', width=970, height=550)

<h3>Fitting Higgs Data</h3>

The Higgs data looks like a falling distribution, and it's not obvious what to fit this with. Let's just fit it with a bunch of polynomial functions, and see how it works. 

In [None]:
#>>>RUN: L12.2-runcell01

import lmfit 

#TO MATCH THE VIDEO, USE DATA 'out_2011.txt' 
#IF NEEDED, RERUN L12.1-runcell01 WITH THIS FILE
#filename = 'out_2011.txt' 

def pol0(x,p0):
    pols=[p0]
    y = np.polyval(pols,x)
    return y

def pol1(x,p0,p1):
    pols=[p0,p1]
    y = np.polyval(pols,x)
    return y

def pol2(x, p0, p1,p2):
    pols=[p0,p1,p2]
    y = np.polyval(pols,x)
    return y

def pol3(x, p0, p1,p2,p3):
    pols=[p0,p1,p2,p3]
    y = np.polyval(pols,x)
    return y

def pol4(x, p0, p1,p2,p3,p4):
    pols=[p0,p1,p2,p3,p4]
    y = np.polyval(pols,x)
    return y

def pol5(x, p0, p1,p2,p3,p4,p5):
    pols=[p0,p1,p2,p3,p4,p5]
    y = np.polyval(pols,x)
    return y

def fitModel(iX,iY,iWeights,iFunc):
    model  = lmfit.Model(iFunc)
    p = model.make_params(p0=0,p1=0,p2=0,p3=0,p4=0,p5=0)
    result = model.fit(data=iY,params=p,x=iX,weights=iWeights)
    #result = lmfit.minimize(binnedLikelihood, params, args=(iX,iY,(iY**0.5),iFunc))
    output = model.eval(params=result.params,x=iX)
    return output

result0 = fitModel(x,y,weights,pol0)
result1 = fitModel(x,y,weights,pol1)
result2 = fitModel(x,y,weights,pol2)
result3 = fitModel(x,y,weights,pol3)
result4 = fitModel(x,y,weights,pol4)
result5 = fitModel(x,y,weights,pol5)

plt.errorbar(x,y,y_err, lw=2,fmt=".k", capsize=0,label="data")
plt.plot(x,result0,label="pol0")
plt.plot(x,result1,label="pol1")
plt.plot(x,result2,label="pol2")
plt.plot(x,result3,label="pol3")
plt.plot(x,result4,label="pol4")
plt.plot(x,result5,label="pol5")
plt.xlabel(r"$m_{\gamma\gamma}$")
plt.ylabel("$N_{events}$")
plt.legend()
plt.show()

#res0.plot()
#result1.plot()
#result2.plot()
#result3.plot()
#result4.plot()
#result5.plot()

Let's look at one of the higher order polynomials. 

In [None]:
#>>>RUN: L12.2-runcell02

plt.errorbar(x,y,y_err, lw=2,fmt=".k", capsize=0,label="data")
plt.plot(x,result5,label="pol5")
plt.xlabel(r"$m_{\gamma\gamma}$")
plt.ylabel("$N_{events}$")
plt.legend()
plt.show()

You can see that the fit is starting to pick up some of the fluctuations. Adding even higher order terms will pick up even more fluctuations. The f-test can tell us when we are adding too many polynomial terms. To see this, let's look at the residuals. 

In [None]:
#>>>RUN: L12.2-runcell03

def residual(iY,iFunc,iYErr):
    resid = (iY-iFunc)/iYErr
    tmp_vals, tmp_bin_edges = np.histogram(resid, bins=10,range=[-7,7])
    tmp_bin_centers = 0.5*(tmp_bin_edges[1:] + tmp_bin_edges[:-1])
    print("Mean:",resid.mean(),"\tSTD:",resid.std())
    return tmp_bin_centers,tmp_vals

delta_p0,delta_y0 = residual(y,result0,y_err)
delta_p1,delta_y1 = residual(y,result1,y_err)
delta_p5,delta_y5 = residual(y,result5,y_err)
plt.errorbar(delta_p0,delta_y0,yerr=delta_y0**0.5,label="pol0",marker='.',drawstyle = 'steps-mid')
plt.errorbar(delta_p1,delta_y1,yerr=delta_y1**0.5,label="pol1",marker='.',drawstyle = 'steps-mid')
plt.errorbar(delta_p5,delta_y5,yerr=delta_y5**0.5,label="pol5",marker='.',drawstyle = 'steps-mid')
plt.legend()
plt.show()

<h3>f-test (Chow-test)</h3>

To generalize the t-test, the statistician Ronald Fisher developed the <a href="https://en.wikipedia.org/wiki/F-test" target="_blank">f-test</a>. It has become very useful in physics due to the work of Gregory Chow at MIT in the late 1950s, who developed the Chow test as a way to quantify how well a fit is behaving. To understand the Chow test, let's delve into the f-test. 

The f-test is used when you want to compare several distributions with each other. Imagine, for example, that you have $K$ groups of data, each with $m$ points. If these samples are all drawn from a Gaussian distribution with mean $\mu$ and variance $\sigma^{2}$, i.e. $\mathcal{N}(x,\mu,\sigma)$, we can define a new statistic, conceptually, as 

$$
\begin{equation}
 f = \frac{\rm variance~across~samples}{\rm variance~within~each~sample}  
\end{equation}
$$

which should be close to 1 if the samples are all from the same underlying distribution, but will deviate from 1 if the samples are from different distributions. 

Let's say we have two distributions $a$ and $b$ with $n_{a}$ and $n_{b}$ degrees of freedom, respectively. We can then write the f-statistic as the ratio of their variances: 

$$
\begin{equation}
 f = \frac{\frac{S^{2}_{a}}{n_{a}} }{ \frac{S^{2}_{b}}{n_{b}} }\\
\end{equation}
$$

More generally, we can write the f-statistic as the variance between distinct groups over the average variance within the individual groups. For a total of $N$ data points in $K$ groups, each with $n_{i}$ points and mean $\bar{x}_{i}$, and with overall mean $\bar{x}$, we can write the f-statistic as 

$$
\begin{equation}
 \sigma^{2}_{\rm group} = \frac{1}{K-1}\sum_{i=1}^{K} n_{i} \left(\bar{x}_{i}-\bar{x}\right)^2 \\
 \sigma^2 = \frac{1}{N-K}\sum_{i=1}^{K}\sum_{j=1}^{n_{i}}\left(x_{ij}-\bar{x}_{i}\right)^{2}\\ 
 f = \frac{  \sigma^{2}_{\rm group} }{\sigma^{2}}
\end{equation}
$$

The idea is that the numerator and denominator are both $\chi^{2}$ distributed variables with $K-1$ degrees of freedom on top, and $N-K$ degrees of freedom on the bottom. This statistic is most powerful for checking if the variances are consistent with being from the same distribution or a different distribution. 
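
To make these definitions concrete, here is a minimal sketch that evaluates $\sigma^{2}_{\rm group}$, $\sigma^{2}$, and $f$ by hand for simulated Gaussian groups and cross-checks the result against `scipy.stats.f_oneway`. The number of groups, the group size, and the random seed are arbitrary choices made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)   # arbitrary seed, for reproducibility only
K, m = 4, 25                      # K groups with m points each (illustrative values)
groups = [rng.normal(loc=0.0, scale=1.0, size=m) for _ in range(K)]

N = K * m
grand_mean = np.mean(np.concatenate(groups))

# variance across groups, with K-1 degrees of freedom
sigma2_group = sum(len(g) * (g.mean() - grand_mean)**2 for g in groups) / (K - 1)
# average variance within groups, with N-K degrees of freedom
sigma2 = sum(np.sum((g - g.mean())**2) for g in groups) / (N - K)

f_manual = sigma2_group / sigma2
# probability of finding an f-statistic at least this large
p_manual = stats.f.sf(f_manual, K - 1, N - K)

# cross-check with the one-way ANOVA routine in scipy
f_scipy, p_scipy = stats.f_oneway(*groups)
print("manual:", f_manual, p_manual)
print("scipy :", f_scipy, p_scipy)
```

Because all groups here are drawn from the same distribution, repeating this with different seeds should give values of $f$ scattered around 1; shifting the mean of one group pushes $f$ upward.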

From the above formulas, it has been derived that the f-statistic follows the 
<a href="https://en.wikipedia.org/wiki/F-distribution" target="_blank">f-distribution</a>, which has a very complex form that we will write here once for posterity. 


$$
\begin{equation}
f(x,d_{1}=K-1,d_{2}=N-K) = \frac{1}{\beta\left(\frac{d_{1}}{2},\frac{d_{2}}{2}\right)}\left(\frac{d_{1}}{d_{2}}\right)^{\frac{d_{1}}{2}}x^{\frac{d_{1}}{2}-1}\left(1+\frac{d_{1}}{d_{2}}x\right)^{-\frac{d_{1}+d_{2}}{2}}\\
\end{equation}
$$

where $\beta(x,y)$ is the <a href="https://en.wikipedia.org/wiki/Beta_function" target="_blank">$\beta$-function</a>.
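
As a quick check of this expression, the sketch below writes the density out term by term and compares it with `scipy.stats.f.pdf`. The function name `f_pdf` and the values of `d1` and `d2` are illustrative assumptions.

```python
import numpy as np
from scipy import special, stats

def f_pdf(x, d1, d2):
    """f-distribution density written out from the formula above (valid for x > 0)."""
    norm = 1.0 / special.beta(d1 / 2.0, d2 / 2.0)
    return (norm * (d1 / d2)**(d1 / 2.0) * x**(d1 / 2.0 - 1.0)
            * (1.0 + d1 * x / d2)**(-(d1 + d2) / 2.0))

x_check = np.linspace(0.1, 5.0, 50)
d1, d2 = 3, 36   # illustrative values, e.g. K=4 groups and N=40 points

# True if the hand-written formula agrees with scipy's implementation
print(np.allclose(f_pdf(x_check, d1, d2), stats.f.pdf(x_check, d1, d2)))
```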

So, why do we care about the f-test? 

Recall that to get a good fit you want the fit residuals to look like a Gaussian distribution. Unfortunately, it's often the case that the fit residuals are not Gaussian. Let's say you fit a line to a distribution, and the fit is not good. 

Well, then you can try fitting a more complicated function, how about a quadratic? Suppose that the residuals look better. 

What about a 3rd order polynomial? 

When do we know where to stop? This is where the f-statistic can help us. The idea is that we can compare the fit residuals from each function. If the fit residual ratio has a high likelihood given an f-distribution, then we know that the additional polynomial is not needed. 

More generally, the f-distribution helps you determine whether a new fit is actually better than the previous one. One way is to see if the $\chi^{2}$ is better, but what if the $\chi^{2}$ is approximately the same? Moreover, what if your $\chi^{2}$ was originally good, but the residuals were not Gaussian? Comparing the residuals of the two fits can tell us if our new fit function is actually better. The f-test helps us quantify this. 

Let's say we want to compare the variances of two fits. 
If the variances are from the same underlying distribution, their ratio will follow the f-distribution. This comes up when we are trying to figure out if our fit is actually working. 

So, as we go to higher order polynomials, we get progressively smaller standard deviations of the residuals. The issue is when to stop. Let's now compute the f-statistic and compare it to our samples. For two fit functions, a lower-order $f_{1}$ and a higher-order $f_{2}$, the f-statistic is defined as the ratio of two variance estimates built from the fit residuals: 

$$
\begin{equation}
 f = \frac{  \sigma^{2}_{\rm group} }{\sigma^{2}} \\
 \sigma^{2}_{\rm group} = \frac{ \sum_{i=1}^{N} \left(y_{i}- f_{1}(x_{i})\right)^{2} - \sum_{i=1}^{N} \left(y_{i}-f_{2}(x_{i})\right)^{2}}{\Delta^{2\rightarrow 1}_{\rm dof}} \\
 \sigma^{2} = \frac{1}{N - n_{\rm f_{2}~dof} }\sum_{i=1}^{N} \left(y_{i}- f_{2}(x_{i})\right)^{2}
\end{equation}
$$

where $\Delta^{2\rightarrow 1}_{\rm dof}$ is the number of additional floating parameters in $f_{2}$ relative to $f_{1}$, and $n_{\rm f_{2}~dof}$ is the number of floating parameters in $f_{2}$. In other words, the improvement in the summed squared residuals per added parameter is compared with the average squared residual that remains after the higher-order fit. If the extra terms only absorb statistical fluctuations, this ratio follows the f-distribution. In the next section, we will compute the f-statistic for a few instances.
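
Before turning to the Higgs data, the same comparison can be tried on a toy example. The sketch below is only an illustration under stated assumptions: the data are synthetic (a quadratic plus Gaussian noise), the fits are unweighted and done with `np.polyfit` rather than `lmfit`, and the denominator uses the residuals of the higher-order fit, which is the usual convention for comparing nested fits. The names `rss` and `f_nested` are hypothetical helpers.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)    # arbitrary seed
x_toy = np.linspace(0.0, 1.0, 40)
y_toy = 1.0 + 2.0 * x_toy - 3.0 * x_toy**2 + rng.normal(scale=0.1, size=x_toy.size)

def rss(order):
    """Sum of squared residuals of an unweighted polynomial fit of the given order."""
    coeffs = np.polyfit(x_toy, y_toy, order)
    return np.sum((y_toy - np.polyval(coeffs, x_toy))**2)

def f_nested(order1, order2):
    """f-statistic and p-value for going from order1 to order2 (order2 > order1)."""
    n1, n2 = order1 + 1, order2 + 1          # number of floating parameters
    r1, r2 = rss(order1), rss(order2)
    sigma2_group = (r1 - r2) / (n2 - n1)     # improvement per added parameter
    sigma2 = r2 / (x_toy.size - n2)          # remaining variance of the higher-order fit
    f = sigma2_group / sigma2
    return f, stats.f.sf(f, n2 - n1, x_toy.size - n2)

for low, high in [(0, 1), (1, 2), (2, 3)]:
    f_val, p_val = f_nested(low, high)
    print(f"order {low} -> {high}: f = {f_val:.3g}, p = {p_val:.3g}")
```

Since the toy data are truly quadratic, the step from 1st to 2nd order should give a very small p-value, while the step from 2nd to 3rd order can only pick up noise.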


<a name='exercises_12_2'></a>     

| [Top](#section_12_0) | [Restart Section](#section_12_2) | [Next Section](#section_12_3) |

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Ex-12.2.1</span>

The f-distribution can be calculated in Python using `stats.f.pdf(x,d1,d2)`, where `d1` and `d2` denote the degrees of freedom for the numerator (`dfn` in the code) and denominator (`dfd`), respectively. Run the code below to plot the f-distribution and vary the degrees of freedom in order to answer the question below. Note that "positively skewed" means a distribution that has a long tail extending out to larger values.

Which of the following statements accurately describes the characteristics of the f-distribution as the degrees of freedom are varied? Select all that apply:

A) The f-distribution is always symmetric, regardless of the degrees of freedom.\
B) The f-distribution is positively skewed when the degrees of freedom are small.\
C) The shape of the f-distribution becomes more symmetric when both the numerator and denominator degrees of freedom increase.\
D) The f-distribution becomes more concentrated around 0 as the degrees of freedom increase.\
E) The shape of the f-distribution is not affected by the degrees of freedom.



In [None]:
#>>>EXERCISE: L12.2.1

#change these values to plot different degrees of freedom
#degrees of freedom in numerator/denominator
dfn, dfd = 5, 20  

fig, ax = plt.subplots(figsize=(8, 4))

x_vals = np.linspace(0, 4, 500)
y_vals = stats.f.pdf(x_vals, dfn, dfd)

ax.plot(x_vals, y_vals, 'r-', lw=2, alpha=0.6, label='f pdf')

#plot area corresponding to p-value < 0.05
plt.fill_between(x_vals, 0, y_vals, where=(x_vals >= stats.f.ppf(0.95, dfn, dfd)), 
                 color='red', alpha=0.2, label = '0.05 > p' )

plt.xlim(0,4)
plt.ylim(0,)

plt.legend()
plt.title('f-distribution (dfn=' + str(dfn) + ', dfd=' + str(dfd) + ')')
plt.xlabel('f value')
plt.ylabel('Probability density')
plt.show()

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Ex-12.2.2</span>

Let's think more about the degrees of freedom, while considering the case where we are comparing the fits of two different functions to a set of data. For instance, consider 100 data points that are fit by a zeroth-order polynomial (with one degree of freedom), compared to a first-order polynomial (with two degrees of freedom). What are the corresponding degrees of freedom in the numerator, `dfn`, and denominator, `dfd`, of the f-test? Enter your answer as a list of two numbers: `[dfn, dfd]`

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Ex-12.2.3</span>

Now we will use the f-test to compare two polynomial fits to a set of data (as we will do in the next section). Run the code in your notebook (not shown here), which does the following:

The code first generates random data, `x_e` and `y_e`, with a slight positive correlation (a slope of 0.5 plus Gaussian noise). Then it defines a function `fitModel_e` for fitting a polynomial to the data using `lmfit`. It computes the sum of squared residuals using `residual2_e`, then performs an f-test using the function `ftest_e`. Finally, it calculates a p-value and plots the data and the corresponding f-distribution.

Your specific objective will be to compare the fits of a zeroth-order polynomial and first-order polynomial. Run the code, and try increasing the number of data points that are fit (the default is `data_len = 10`). Also try varying the correlation coefficient. Which of the following best describes the results (select all that apply):

A) As we increase the number of points that we sample, the p-value decreases, indicating that it is increasingly unlikely that the first-order polynomial is a better fit by chance alone.\
B) As we increase the number of points that we sample, the correlation between the data becomes more apparent, and the first-order polynomial fit begins to perform better than the zeroth order fit.\
C) As we increase the correlation coefficient for a fixed number of data points, the p-value decreases because the first-order fit performs better than the zeroth order fit.\
D) Even if the correlation coefficient is very small, the f-test will always show that the first-order polynomial is a better fit, as long as we sample enough data points.\
E) The degrees of freedom in the numerator is always 1 when we are comparing two adjacent polynomials.

In [None]:
#>>>EXERCISE: L12.2.3

import lmfit
import numpy as np
import matplotlib.pyplot as plt

# Generate random data with slight positive slope
#------------------------------------------------
np.random.seed(0)
data_len = 10
x_e = np.random.random(size=data_len)
y_e = 0.5 * x_e + np.random.normal(loc=0, scale=0.5, size=data_len)


# Here we define a fit function and fit the polynomials
#------------------------------------------------------ 
def fitModel_e(iX,iY,iFunc):
    model  = lmfit.Model(iFunc)
    p = model.make_params(p0=0,p1=0,p2=0,p3=0,p4=0,p5=0)
    result = model.fit(data=iY,params=p,x=iX)
    #result = lmfit.minimize(binnedLikelihood, params, args=(iX,iY,(iY**0.5),iFunc))
    output = model.eval(params=result.params,x=iX)
    return output

#polynomial 1 (zeroth order)
ndof1_e = 1
result0_e = fitModel_e(x_e,y_e,pol0)

#polynomial 2 (first order)
ndof2_e = 2
result1_e = fitModel_e(x_e,y_e,pol1)


# Here we define the residuals and f-test
#----------------------------------------
#returns sum of the squared residuals
def residual2_e(iY,iFunc):
    residval = (iY-iFunc)
    return np.sum(residval**2)
    
def ftest_e(iY,f1,f2,indof1,indof2):
    r1=residual2_e(iY,f1)
    r2=residual2_e(iY,f2)
    sigma2group=(r1-r2)/(indof2-indof1)
    sigma2=r2/(len(iY)-indof2)
    return sigma2group/sigma2


# Calculate the F-statistic and p-value
#--------------------------------------
#the number of degrees of freedom used for the numerator is (ndof2-ndof1)
#the number of degrees of freedom used for the denominator is (len(x_e)-ndof2)
dfn_e = ndof2_e-ndof1_e
dfd_e = len(x_e)-ndof2_e
f10_e=ftest_e(y_e,result0_e,result1_e,ndof1_e,ndof2_e)
p_value_e = 1 - stats.f.cdf(f10_e, dfn=dfn_e, dfd=dfd_e)

print('dfn, dfd:', dfn_e, dfd_e)
print('F-statistic:', f10_e)
print('p-value:', p_value_e)


# Plots 
#--------------------------------
# Plot the data and fit functions
plt.scatter(x_e, y_e, color='blue', label='Data')
plt.plot(x_e, result0_e, color='red', label='Zeroth Order Fit')
plt.plot(x_e, result1_e, color='green', label='First Order Fit')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Polynomial Fit Comparison')
plt.legend()
plt.show()


#Let's see what the f-distribution looks like
x_vals_e = np.linspace(0, 10, 500)
y_vals_e = stats.f.pdf(x_vals_e, dfn_e, dfd_e)

plt.axvline(x=f10_e, color='black', linestyle='--', label='f-stat')
plt.plot(x_vals_e, y_vals_e, 'r-', lw=2, alpha=0.6, label='f pdf')

#plot area corresponding to p-value < 0.05
plt.fill_between(x_vals_e, 0, y_vals_e, where=(x_vals_e >= stats.f.ppf((1-0.05), dfn_e, dfd_e)), 
                 color='red', alpha=0.2, label = '0.05 > p' )

plt.xlim(0,10)
plt.ylim(1e-3,)
plt.yscale('log')
plt.legend()
plt.title('f-distribution (dfn=' + str(dfn_e) + ', dfd=' + str(dfd_e) + ')')
plt.xlabel('f value')
plt.ylabel('Probability density')
plt.show()


<a name='section_12_3'></a>
<hr style="height: 1px;">

## <h2 style="border:1px; border-style:solid; padding: 0.25em; color: #FFFFFF; background-color: #90409C">L12.3 Computation Using the f-statistic</h2>  

| [Top](#section_12_0) | [Previous Section](#section_12_2) | [Exercises](#exercises_12_3) | [Next Section](#section_12_4) |

*The material in this section is discussed in the video **<a href="https://courses.mitxonline.mit.edu/learn/course/course-v1:MITxT+8.S50.2x+1T2025/block-v1:MITxT+8.S50.2x+1T2025+type@sequential+block@seq_LS12/block-v1:MITxT+8.S50.2x+1T2025+type@vertical+block@vert_LS12_vid3" target="_blank">HERE</a>.** You are encouraged to watch that video and use this notebook concurrently.*

<h3>Overview</h3>

Now, let's use the f-statistic formalism to investigate the significance of fitting the Higgs background with polynomials of increasing order. The goal is to determine at what point increasing the number of terms in the polynomial no longer makes a meaningful improvement in the fit.

The code below plots the f-statistic found when comparing the fits each time the polynomial order increases by 1. In addition, the blue line shows the p-value corresponding to those f-statistics.


In [None]:
#>>>RUN: L12.3-runcell01

import scipy.stats as stats 

#TO MATCH THE VIDEO, USE DATA 'out_2011.txt' 
#IF NEEDED, RERUN L12.1-runcell01 (AND THEN L12.2-runcell01) WITH THIS FILE
#filename = 'out_2011.txt' 

#returns the sum of the squared (unweighted) residuals; iYErr is accepted but not used
def residual2(iY,iFunc,iYErr):
    residval = (iY-iFunc)
    return np.sum(residval**2)
    
def ftest(iY,iYerr,f1,f2,ndof1,ndof2):
    r1=residual2(iY,f1,iYerr)
    r2=residual2(iY,f2,iYerr)
    sigma2group=(r1-r2)/(ndof2-ndof1)
    sigma2=r2/(len(iY)-ndof2)
    return sigma2group/sigma2

f10=ftest(y,y_err,result0,result1,1,2)
f21=ftest(y,y_err,result1,result2,2,3)
f32=ftest(y,y_err,result2,result3,3,4)
f43=ftest(y,y_err,result3,result4,4,5)
f54=ftest(y,y_err,result4,result5,5,6)

xrange=np.linspace(0,300,100)
farr=1-stats.f.cdf(xrange,1,len(y)-5) #number of bins - 5 floating parameters
fig, ax = plt.subplots(figsize=(9,6))

ax.axvline(x=f10,linewidth=3,c='b',label='0 to 1')
ax.axvline(x=f21,linewidth=3,c='g',label='1 to 2')
ax.axvline(x=f32,linewidth=3,c='purple',label='2 to 3')
ax.axvline(x=f43,linewidth=3,c='yellow',label='3 to 4')
ax.axvline(x=f54,linewidth=3,c='orange',label='4 to 5')

ax.set_yscale('log')
plt.plot(xrange,farr,label='f(1,N)')
plt.legend()
plt.xlabel('f-statistic')
plt.ylabel('p-value')
plt.show()

xrange=np.linspace(0,3,100)
farr=1-stats.f.cdf(xrange,1,len(y)-5) 
fig, ax = plt.subplots(figsize=(9,6))
ax.axvline(x=f32,linewidth=3,c='purple',label='2 to 3')
ax.axvline(x=f43,linewidth=3,c='yellow',label='3 to 4')
ax.axvline(x=f54,linewidth=3,c='orange',label='4 to 5')
ax.set_yscale('log')
plt.xlabel('f-statistic')
plt.plot(xrange,farr,label='f(1,N)')
plt.ylabel('p-value')
plt.legend()
plt.show()


#PERFORM F-TEST BELOW
#print f-stat and p-value
f10=ftest(y,y_err,result0,result1,1,2)
p_val10 = 1 - stats.f.cdf(f10, dfn=1, dfd=len(y)-2)
print('f-stat 0 to 1:', f10)
print('p-value 1 better than 0 by chance:', p_val10)
print()
f21=ftest(y,y_err,result1,result2,2,3)
p_val21 = 1 - stats.f.cdf(f21, dfn=1, dfd=len(y)-3)
print('f-stat 1 to 2:', f21)
print('p-value 2 better than 1 by chance:', p_val21)
print()
f32=ftest(y,y_err,result2,result3,3,4)
p_val32 = 1 - stats.f.cdf(f32, dfn=1, dfd=len(y)-4)
print('f-stat 2 to 3:', f32)
print('p-value 3 better than 2 by chance:', p_val32)
print()
f43=ftest(y,y_err,result3,result4,4,5)
p_val43 = 1 - stats.f.cdf(f43, dfn=1, dfd=len(y)-5)
print('f-stat 3 to 4:', f43)
print('p-value 4 better than 3 by chance:', p_val43)
print()
f54=ftest(y,y_err,result4,result5,5,6)
p_val54 = 1 - stats.f.cdf(f54, dfn=1, dfd=len(y)-6)
print('f-stat 4 to 5:', f54)
print('p-value 5 better than 4 by chance:', p_val54)

We see that the f-statistic found when increasing the polynomial order from 2 to 3 already has a relatively large probability, which means that there is a high probability that any improvements in the fit found by adding a 3rd order term are due to random chance. So, we likely only need a 2nd order polynomial. Let's check the $\chi^{2}$ value as well. 

In [None]:
#>>>RUN: L12.3-runcell02

def chi2(iY,iFunc,iYErr,iNDOF):
    resid = (iY-iFunc)/iYErr
    chi2value = np.sum(resid**2)
    print("Mean:",resid.mean(),"\tSTD:",resid.std())
    chi2prob=1-stats.chi2.cdf(chi2value,len(iY)-iNDOF)
    print("chi2 prob:",chi2prob)
    return chi2value/(len(iY)-iNDOF)

chi2value=chi2(y,result2,y_err,3) #chisquare of a 2nd order polynomial (3 floating parameters)
print("Normalized chi2:",chi2value)

plt.errorbar(x,y,y_err, lw=2,fmt=".k", capsize=0,label="data")
plt.plot(x,result2,label="pol2")
plt.xlabel(r"$m_{\gamma\gamma}$")
plt.ylabel("$N_{events}$")
plt.legend()
plt.show()

Overall, this shows that the 2nd order polynomial gives a good fit!
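As a cross-check of the logic used in this section, here is a minimal, self-contained sketch of a nested-model f-test on synthetic data. It uses the textbook form of the F statistic, which may differ in detail from the `ftest` helper defined earlier in this notebook. The names `f_stat_nested`, `chi2_of_order`, and the toy data are assumptions made only for this illustration.

```python
import numpy as np
from scipy import stats

def f_stat_nested(chi2_lo, chi2_hi, npar_lo, npar_hi, npoints):
    """Textbook F statistic for nested fits (assumed form, not the notebook's ftest)."""
    num = (chi2_lo - chi2_hi) / (npar_hi - npar_lo)
    den = chi2_hi / (npoints - npar_hi)
    return num / den

# Toy data: a true 2nd order polynomial plus Gaussian noise
rng = np.random.default_rng(0)
x_toy = np.linspace(-1, 1, 40)
err_toy = np.full_like(x_toy, 0.1)
y_toy = 1.0 + 0.5 * x_toy + 2.0 * x_toy**2 + rng.normal(0, 0.1, x_toy.size)

def chi2_of_order(order):
    # np.polyfit expects weights of 1/sigma for Gaussian uncertainties
    coeffs = np.polyfit(x_toy, y_toy, order, w=1.0 / err_toy)
    return np.sum(((y_toy - np.polyval(coeffs, x_toy)) / err_toy) ** 2)

pvals = {}
for lo in range(4):
    hi = lo + 1
    f = f_stat_nested(chi2_of_order(lo), chi2_of_order(hi), lo + 1, hi + 1, x_toy.size)
    # one extra parameter, so dfn=1; dfd is points minus parameters of the larger model
    pvals[(lo, hi)] = stats.f.sf(f, 1, x_toy.size - (hi + 1))
    print(f"order {lo} -> {hi}: F = {f:.2f}, p-value = {pvals[(lo, hi)]:.3g}")
```

Because the toy data are quadratic by construction, the step from 1st to 2nd order should give a tiny p-value, while higher-order steps are only fitting noise.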

<h3>Challenge Question</h3>

How about if we just compare the $\chi^{2}$ probabilities of the fit? Could we have made the decision of what order polynomial to use with a $\chi^{2}$ test? 

ANSWER: Yes (with a caveat). In this instance, we see a pretty dramatic jump in the $\chi^{2}$ value of the fit when going from 1st to 2nd order, and very little change when adding a 3rd order term. However, we have to be careful! The f-test is testing something different; it is not testing when the $\chi^{2}$ reaches a good value, it is testing when adding more polynomial terms no longer gains you a meaningful improvement.  

In [None]:
#>>>RUN: L12.3-runcell03

print('chisq 1st order fit')
print("Normalized chi2:",chi2(y,result1,y_err,2))
print()

print('chisq 2nd order fit')
print("Normalized chi2:",chi2(y,result2,y_err,3))
print()

print('chisq 3rd order fit')
print("Normalized chi2:",chi2(y,result3,y_err,4))
print()

<a name='exercises_12_3'></a>     

| [Top](#section_12_0) | [Restart Section](#section_12_3) | [Next Section](#section_12_4) |


### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Ex-12.3.1</span>

The p-value for the f-distribution that compares the 0th order polynomial (i.e. a flat line) to the 1st order polynomial (i.e. a linear slope) is smaller than $10^{-100}$. Which of the following statements best characterizes how to interpret this:

A) There is a $10^{-100}$ probability that the 0th order model provides a better fit to the data purely by chance.\
B) There is a $10^{-100}$ probability that the 1st order model provides a better fit to the data purely by chance.\
C) A 0th degree polynomial is a $10^{100}$ times better fit to the data than a 1st degree polynomial.\
D) A 1st degree polynomial is a $10^{100}$ times better fit to the data than a 0th degree polynomial.\
E) None of the above.

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Ex-12.3.2</span>

Complete the code below to randomly sample data from a Gaussian distribution and fit with various polynomials, then compute the f-test to compare the polynomial fits to this Gaussian data.

What is the order of the polynomial at which the f-test first tells you to stop? In other words, what is the lowest order polynomial that the f-test indicates is a good fit? Hint: the f-test comparing the next higher order polynomial will not show a significant improvement (if any).

Consider the options below:

A) 0\
B) 1\
C) 2\
D) 3\
E) 4\
F) 5


In [None]:
#>>>EXERCISE: L12.3.2
# Use this cell for drafting your solution (if desired),
# then enter your solution in the interactive problem online to be graded.

np.random.seed(10)
gausrandom = np.random.normal(0,10,1000)
y_test,bin_edges = np.histogram(gausrandom,bins=30,range=(-30,30))
x_test = 0.5*(bin_edges[1:] + bin_edges[:-1])

x_test = x_test[y_test > 0]
y_test = y_test[y_test > 0]
y_test_err = np.sqrt(y_test)
weights_test = 1./np.sqrt(y_test)

result0_test = fitModel(x_test,y_test,weights_test,pol0)
result1_test = fitModel(x_test,y_test,weights_test,pol1)
result2_test = fitModel(x_test,y_test,weights_test,pol2)
result3_test = fitModel(x_test,y_test,weights_test,pol3)
result4_test = fitModel(x_test,y_test,weights_test,pol4)
result5_test = fitModel(x_test,y_test,weights_test,pol5)

plt.errorbar(x_test,y_test,y_test_err, lw=2,fmt=".k", capsize=0,label="data")
plt.plot(x_test,result0_test,label="pol0")
plt.plot(x_test,result1_test,label="pol1")
plt.plot(x_test,result2_test,label="pol2")
plt.plot(x_test,result3_test,label="pol3")
plt.plot(x_test,result4_test,label="pol4")
plt.plot(x_test,result5_test,label="pol5")
plt.xlabel(r"$m_{\gamma\gamma}$")
plt.ylabel("$N_{events}$")
plt.legend()
plt.show()

#PERFORM F-TEST BELOW
#YOUR CODE HERE

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Ex-12.3.3</span>

Now compute the $\chi^{2}$ of all of your fits. What do you conclude? Which order polynomial fit would you use? Choose from the options below:

A) 0\
B) 1\
C) 2\
D) 3\
E) 4\
F) 5\
G) none of the above


In [None]:
#>>>EXERCISE: L12.3.3
# Use this cell for drafting your solution (if desired),
# then enter your solution in the interactive problem online to be graded.

###test the chi2

<a name='section_12_4'></a>
<hr style="height: 1px;">

## <h2 style="border:1px; border-style:solid; padding: 0.25em; color: #FFFFFF; background-color: #90409C">L12.4 Fitting the Higgs Signal Part 1</h2>  

| [Top](#section_12_0) | [Previous Section](#section_12_3) | [Exercises](#exercises_12_4) | [Next Section](#section_12_5) |

*The material in this section is discussed in the video **<a href="https://courses.mitxonline.mit.edu/learn/course/course-v1:MITxT+8.S50.2x+1T2025/block-v1:MITxT+8.S50.2x+1T2025+type@sequential+block@seq_LS12/block-v1:MITxT+8.S50.2x+1T2025+type@vertical+block@vert_LS12_vid4" target="_blank">HERE</a>.** You are encouraged to watch that video and use this notebook concurrently.*

<h3>Overview</h3>

With all of these pieces together, I would like to compute the significance of the Higgs boson discovery in one of its main channels. To do this, we are going to use all of the tools that we have been discussing. Let's first look at the data. For the Higgs boson data, there are 2 years of data, each with 5 categories. Here is what all of them look like. 

In [None]:
#>>>RUN: L12.4-runcell01

def load(iLabel,iRange=False):
    x = np.array([])
    y = np.array([])
    label=iLabel
    with open(label,'r') as csvfile:
        plots = csv.reader(csvfile, delimiter=' ')
        for row in plots:
            if not iRange and (float(row[1]) > 150 or float(row[1]) < 110):
                continue
            x = np.append(x,float(row[1]))
            y = np.append(y,float(row[2]))
            #add poisson uncertainties                                                                                                 
    weights = 1./y**0.5 
    return x,y,y**0.5,weights

def plot(ax,iLabel):
    x,y,y_err,weights=load(iLabel)
    #Now we plot it. 
    ax.errorbar(x,y,y_err, lw=2,fmt=".k", capsize=0,label=iLabel)
    #ax.x_label("$m_{\gamma\gamma}$")
    #ax.y_label("$N_{events}$")
    ax.legend()
    #ax.show()
    
fig, axs = plt.subplots(2, 3, figsize=(12,8))
#2012 data    
plot(axs[0,0],"data/L12/out.txt")
plot(axs[0,1],"data/L12/out2.txt")
plot(axs[0,2],"data/L12/out3.txt")
plot(axs[1,0],"data/L12/out4.txt")
plot(axs[1,1],"data/L12/out5.txt")
plt.show()

fig, axs = plt.subplots(2, 3, figsize=(12,8))
#2011 data    
plot(axs[0,0],"data/L12/out_2011.txt")
plot(axs[0,1],"data/L12/out2_2011.txt")
plot(axs[0,2],"data/L12/out3_2011.txt")
plot(axs[1,0],"data/L12/out4_2011.txt")
plot(axs[1,1],"data/L12/out5_2011.txt")
plt.show()


As you can see by comparing the size of the uncertainties in the above plots, there are many more events in the 2012 data. Let's take the category with the smallest uncertainties (`out.txt`), and perform an f-test on it. We will neglect including the signal for now, but we will get back to that in a sec. The code cell below also includes an additional 2012 dataset (`out2.txt`). Note that these are different datasets than the one we were investigating previously.

In [None]:
#>>>RUN: L12.4-runcell02

def fitAll(iLabel,iPlot=False):
    x,y,y_err,weights=load(iLabel)
    result0 = fitModel(x,y,weights,pol0)
    result1 = fitModel(x,y,weights,pol1)
    result2 = fitModel(x,y,weights,pol2)
    result3 = fitModel(x,y,weights,pol3)
    result4 = fitModel(x,y,weights,pol4)
    result5 = fitModel(x,y,weights,pol5)

    if iPlot:
        plt.errorbar(x,y,y_err, lw=2,fmt=".k", capsize=0,label="data")
        plt.plot(x,result0,label="pol0")
        plt.plot(x,result1,label="pol1")
        plt.plot(x,result2,label="pol2")
        plt.plot(x,result3,label="pol3")
        plt.plot(x,result4,label="pol4")
        plt.plot(x,result5,label="pol5")
        plt.xlabel(r"$m_{\gamma\gamma}$")
        plt.ylabel("$N_{events}$")
        plt.legend()
        plt.show()
    return x,y,y_err,result0,result1,result2,result3,result4,result5

def ftestAll(iLabel):
    x,y,y_err,result0,result1,result2,result3,result4,result5=fitAll(iLabel)
    f10=ftest(y,y_err,result0,result1,1,2)
    f21=ftest(y,y_err,result1,result2,2,3)
    f32=ftest(y,y_err,result2,result3,3,4)
    f43=ftest(y,y_err,result3,result4,4,5)
    f54=ftest(y,y_err,result4,result5,5,6)
    #denominator degrees of freedom: number of points minus parameters of the higher order fit
    print("f 1 to 0:",1-stats.f.cdf(f10,1,len(y)-2))
    print("f 2 to 1:",1-stats.f.cdf(f21,1,len(y)-3))
    print("f 3 to 2:",1-stats.f.cdf(f32,1,len(y)-4))
    print("f 4 to 3:",1-stats.f.cdf(f43,1,len(y)-5))
    print("f 5 to 4:",1-stats.f.cdf(f54,1,len(y)-6))
    print()
    
print('2012 Data: Category 1')
fitAll("data/L12/out.txt",True)
ftestAll("data/L12/out.txt")

print('2012 Data: Category 2')
fitAll("data/L12/out2.txt",True)
ftestAll("data/L12/out2.txt")

So, from this it looks like a 4th order polynomial is the first to give an f-test probability above roughly 5% for both the category with the largest yield and the one with the second largest yield. Therefore, a 4th order polynomial seems like a reasonable choice to use as our background function. Let's proceed with a signal function.   
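The selection rule described above can be written as a few lines of code. This is only a sketch: `pick_order` is a hypothetical helper, the 5% threshold is the rough cut quoted in the text, and the p-values in the example are made-up numbers, not the output of `ftestAll`.

```python
def pick_order(pvalues, alpha=0.05):
    """Return the lowest order whose next-order f-test p-value exceeds alpha.

    pvalues[k] is the f-test p-value for going from order k to order k+1.
    """
    for order, p in enumerate(pvalues):
        if p > alpha:
            return order
    # every added term was significant: go beyond the orders that were tested
    return len(pvalues)

# Made-up p-values for the steps 0->1, 1->2, 2->3, 3->4, 4->5
example_pvalues = [1e-30, 1e-12, 3e-4, 0.01, 0.4]
chosen_order = pick_order(example_pvalues)
print("chosen polynomial order:", chosen_order)  # prints 4
```

When several categories are fit with a common background function, one reasonable convention is to take the largest order selected in any single category.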

<h3>Hypothesis Test for the Higgs Signal</h3>

Now, to fit a Higgs signal, what we want to do is a hypothesis test like we did above. Except now, we will cast our hypothesis slightly differently than before. 

**Null Hypothesis:** There is a Higgs signal in $m_{\gamma\gamma}$ at a specific mass $m_{0}$, with a fixed width of 1.2 GeV. 

**Alternative Hypothesis:** The Higgs signal is not there. 

The reason for the fixed width is that we know the actual width of the Higgs from theoretical predictions is a very tiny value of 4 MeV. However, we also know that the detector has a finite resolution for measuring the energy of photons, and we determine this resolution using particles of known energy in other regions of the data. In this case, we actually use $Z\rightarrow ee$. The use of electrons instead of photons to find the resolution works because both electrons and photons deposit the bulk of their energy in the electromagnetic calorimeters (described in Section 9.3). Furthermore, the photon actually converts into an electron-positron pair, which then deposits its energy in the EM calorimeters.

The known photon resolution results in a $m_{\gamma\gamma}$ width of 1.2 GeV for a Higgs mass around 125 GeV.

In this case, we are also going to fix the Higgs mass at 125 GeV. That way, the only signal parameter we are varying in the fit is the amplitude (i.e. the number of Higgs particles), and we can quote the significance by taking $-2\Delta\log(\mathcal{L})$ and noting that this should follow a $\chi^{2}$ distribution with 1 degree of freedom.
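The conversion from $-2\Delta\log(\mathcal{L})$ to a p-value can be sketched as follows. The function name `significance_from_q` and the example values of `q` are assumptions for illustration. The sketch relies on Wilks' theorem (one extra free parameter gives a $\chi^{2}$ distribution with 1 degree of freedom) and on the fact that the tail of that distribution equals the two-sided Gaussian tail at $\sqrt{q}$.

```python
import numpy as np
from scipy import stats

def significance_from_q(q):
    """Convert q = -2*Delta(log L) for one extra parameter into a p-value and |Z|."""
    p_value = stats.chi2.sf(q, 1)   # tail probability of a chi2 with 1 degree of freedom
    z_two_sided = np.sqrt(q)        # equivalent two-sided Gaussian significance
    return p_value, z_two_sided

results_q = {}
for q in [1.0, 4.0, 9.0]:
    results_q[q] = significance_from_q(q)
    p, z = results_q[q]
    print(f"q = {q:.1f}: p-value = {p:.4g}, |Z| = {z:.1f}")
```

Using `stats.chi2.sf` rather than `1 - stats.chi2.cdf` gives the same number for moderate `q`, but avoids rounding to zero when the p-value becomes very small.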


In [None]:
#>>>RUN: L12.4-runcell03

def sigpol3(x,p0,p1,p2,p3,amp,mass,sigma):
    bkg=pol3(x,p0,p1,p2,p3)
    sig=amp*stats.norm.pdf(x,mass,sigma)
    return sig+bkg

def sigpol4(x,p0,p1,p2,p3,p4,amp,mass,sigma):
    bkg=pol4(x,p0,p1,p2,p3,p4)
    sig=amp*stats.norm.pdf(x,mass,sigma)
    return sig+bkg

def sigpol5(x,p0,p1,p2,p3,p4,p5,amp,mass,sigma):
    bkg=pol5(x,p0,p1,p2,p3,p4,p5)
    sig=amp*stats.norm.pdf(x,mass,sigma)
    return sig+bkg

#this function uses fixed iM
def fitModel_new(iX,iY,iWeights,iM,iFunc):
    model  = lmfit.Model(iFunc)
    p = model.make_params(p0=0,p1=0,p2=0,p3=0,p4=0,p5=0,amp=0,mass=iM,sigma=1.2)
    try:
        p["mass"].vary=False
        p["sigma"].vary=False
    except KeyError:
        #background-only functions have no mass or sigma parameter
        pass
    result = model.fit(data=iY,params=p,x=iX,weights=iWeights)
    output = model.eval(params=result.params,x=iX)
    return output,result.residual

def fitSig(iLabel,iM,SBfunc,Bfunc,iPlot=False):
    x,y,y_err,weights=load(iLabel)
    resultSB,likeSB=fitModel_new(x,y,weights,iM,SBfunc)
    resultB, likeB =fitModel_new(x,y,weights,iM,Bfunc)
    if iPlot:
        plt.errorbar(x,y,y_err, lw=2,fmt=".k", capsize=0,label="data")
        plt.plot(x,resultSB,label="S+B")
        plt.plot(x,resultB, label="B")
        plt.xlabel(r"$m_{\gamma\gamma}$")
        plt.ylabel("$N_{events}$")
        plt.legend()
        plt.show()
    return np.sum(likeB**2)-np.sum(likeSB**2)

NLL=fitSig("data/L12/out.txt",125,sigpol4,pol4,True)
print("out.txt  2NLL:",NLL,"p-value",1-stats.chi2.cdf(NLL,1))

NLL=fitSig("data/L12/out2.txt",125,sigpol4,pol4,True)
print("out2.txt 2NLL:",NLL,"p-value",1-stats.chi2.cdf(NLL,1))

NLL=fitSig("data/L12/out3.txt",125,sigpol4,pol4,True)
print("out3.txt 2NLL:",NLL,"p-value",1-stats.chi2.cdf(NLL,1))

NLL=fitSig("data/L12/out4.txt",125,sigpol4,pol4,True)
print("out4.txt 2NLL:",NLL,"p-value",1-stats.chi2.cdf(NLL,1))

NLL=fitSig("data/L12/out5.txt",125,sigpol4,pol4,True)
print("out5.txt 2NLL:",NLL,"p-value",1-stats.chi2.cdf(NLL,1))

**Note:** Run the cell above before moving on to the next section. The result is discussed in the video for the next section, but you are meant to think through the result here before proceeding.



<a name='exercises_12_4'></a>     

| [Top](#section_12_0) | [Restart Section](#section_12_4) | [Next Section](#section_12_5) |

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Ex-12.4.1</span>

Which of the following statements accurately describes how to interpret the p-value in the above calculation? Select all that apply:

A) The p-value represents the probability that a random fluctuation in the data enables us to fit a peak plus background.\
B) The p-value represents the probability that a peak plus background is the best fit.\
C) The p-value represents the probability that the data are best modeled by a background fit only.\
D) The p-value represents the probability of realizing a particular negative log-likelihood ratio, based on a chi-squared distribution.\
E) None of the above.


### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Ex-12.4.2</span>

Which category shows the most significant evidence for the Higgs signal?

A) dataset 1: out.txt\
B) dataset 2: out2.txt\
C) dataset 3: out3.txt\
D) dataset 4: out4.txt\
E) dataset 5: out5.txt



### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Ex-12.4.3</span>

When physicists searched for the Higgs boson, they did this blind. That means they did not look at the data. However, they did do f-tests, and even $\chi^{2}$ goodness of fit tests. Knowing that the f-test for the dataset `out.txt` with a 4th order polynomial background is good, complete the code below to compute the normalized $\chi^{2}$ for that fit (background only). Report the value as a number with precision 1e-3.


In [None]:
#>>>EXERCISE: L12.4.3

def fitModel_background(iX,iY,iWeights,iFunc):
    model  = lmfit.Model(iFunc)
    p = model.make_params(p0=0,p1=0,p2=0,p3=0,p4=0,p5=0)
    result = model.fit(data=iY,params=p,x=iX,weights=iWeights)
    output = model.eval(params=result.params,x=iX)
    return output

def fitAll_background(iLabel):
    x,y,y_err,weights=load(iLabel)
    result4_background = fitModel_background(x,y,weights,pol4)
    return x,y,y_err,result4_background


def chi2testAll_background(iLabel):
    #YOUR CODE HERE
    return

chi2testAll_background("data/L12/out.txt")

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Ex-12.4.4</span>

As mentioned above, when physicists searched for the Higgs boson, they performed a blind analysis and did not look at the data. However, even looking only at the significance of the fits, and not the data, could ruin the blind analysis. Which of the following describes how analyzing chi-squared fit values could result in unblinding?

A) If the chi-squared values for the background-only fit are large in some region, this could indicate that a strong signal is present.\
B) If the chi-squared values for both the signal-plus-background fit and the background-only fit are both small, then there is definitely no signal present.\
C) We can always analyze the significance of background-only fits because the influence of the signal will not be present.


<a name='section_12_5'></a>
<hr style="height: 1px;">

## <h2 style="border:1px; border-style:solid; padding: 0.25em; color: #FFFFFF; background-color: #90409C">L12.5 Fitting the Higgs Signal Part 2</h2>  

| [Top](#section_12_0) | [Previous Section](#section_12_4) | [Exercises](#exercises_12_5) | [Next Section](#section_12_6) |

*The material in this section is discussed in the video **<a href="https://courses.mitxonline.mit.edu/learn/course/course-v1:MITxT+8.S50.2x+1T2025/block-v1:MITxT+8.S50.2x+1T2025+type@sequential+block@seq_LS12/block-v1:MITxT+8.S50.2x+1T2025+type@vertical+block@vert_LS12_vid5" target="_blank">HERE</a>.** You are encouraged to watch that video and use this notebook concurrently.*

<h3>Overview</h3>

Based on the analysis in the previous section, we see a fairly significant Higgs bump at 125, especially in the `out.txt` and `out2.txt` datasets!

However, those fits were made assuming a fixed value of the mass. Now, let's scan a range of masses and make the so-called p-value plot. This is just a plot of the significance as a function of mass. What we will do is move through the mass values and perform the same p-value calculation as we did previously for a fixed signal location of 125, and then plot the results as a function of mass. 

Note that this code takes a while to run because it is fitting all 5 datasets for 120 different values of the mass.
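Before running the full scan, the idea can be sketched quickly on a toy spectrum. The sketch below is not the `lmfit`-based procedure used in this notebook: it assumes a linear background and a fixed-width Gaussian signal, so both fits reduce to weighted linear least squares. The data are synthetic and deliberately noise-free, and all names (`x_toy`, `weighted_chi2`, `scan_masses`, ...) are hypothetical.

```python
import numpy as np
from scipy import stats

# Synthetic, noise-free spectrum: linear background plus a Gaussian bump at 125
x_toy = np.arange(110.0, 150.5, 0.5)
bkg_true = 1000.0 - 5.0 * (x_toy - 110.0)
sig_true = 60.0 * stats.norm.pdf(x_toy, 125.0, 1.2)
y_toy = bkg_true + sig_true
err_toy = np.sqrt(y_toy)  # Poisson-like uncertainties

def weighted_chi2(columns):
    """Chi2 of the best weighted linear combination of the given template columns."""
    A = np.column_stack(columns) / err_toy[:, None]
    b = y_toy / err_toy
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.sum((b - A @ coef) ** 2)

bkg_columns = [np.ones_like(x_toy), x_toy - 130.0]
chi2_b = weighted_chi2(bkg_columns)  # background-only fit

scan_masses = np.arange(115.0, 146.0, 1.0)
scan_pvalues = []
for m in scan_masses:
    # signal template with fixed mass and width; only its amplitude floats
    chi2_sb = weighted_chi2(bkg_columns + [stats.norm.pdf(x_toy, m, 1.2)])
    q = max(chi2_b - chi2_sb, 0.0)
    scan_pvalues.append(stats.chi2.sf(q, 1))
scan_pvalues = np.array(scan_pvalues)

print("lowest p-value at mass", scan_masses[np.argmin(scan_pvalues)])
```

In this toy the amplitude is allowed to be negative, so dips in the data would also produce small p-values; a real search has to decide how to treat that case.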

In [None]:
#>>>RUN: L12.5-runcell01

def pvalueCalc(iLabel,pMass,iSBFunc,iBFunc):
    NLL=fitSig(iLabel,pMass,iSBFunc,iBFunc,False)
    NLLp = 1-stats.chi2.cdf(NLL,1)
    return NLLp

def pvaluePlot(iLabel,iSBFunc,iBFunc):
    pvalue = np.array([])
    massrange=np.linspace(110,150,120)
    for pMass in massrange:
        pvalue = np.append(pvalue,pvalueCalc(iLabel,pMass,iSBFunc,iBFunc))
    return massrange,pvalue

m0,p0 = pvaluePlot("data/L12/out.txt",sigpol4,pol4)
m1,p1 = pvaluePlot("data/L12/out2.txt",sigpol4,pol4)
m2,p2 = pvaluePlot("data/L12/out3.txt",sigpol4,pol4)
m3,p3 = pvaluePlot("data/L12/out4.txt",sigpol4,pol4)
m4,p4 = pvaluePlot("data/L12/out5.txt",sigpol4,pol4)

plt.plot(m0,p0,label="Category 1")
plt.plot(m1,p1,label="Category 2")
plt.plot(m2,p2,label="Category 3")
plt.plot(m3,p3,label="Category 4")
plt.plot(m4,p4,label="Category 5")
plt.ylim((0.0001,1))
plt.xlabel(r"$m_{\gamma\gamma}$")
plt.ylabel("p-value")
plt.yscale("log")
plt.legend()
plt.show()

<a name='exercises_12_5'></a>     

| [Top](#section_12_0) | [Restart Section](#section_12_5) | [Next Section](#section_12_6) |


### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Ex-12.5.1</span>

In this section, we have fit the data with a 4th order polynomial, and in the next section we will examine fitting the data with a 5th order polynomial.

In this question, consider what the p-value plot (from runcell `L12.5-runcell01`) would look like using a 3rd order polynomial. Try to make this plot in the code cell provided.

Which of the following best describes the features of the 3rd order fit, compared to the 4th order fit?

A) The Higgs bump is enhanced (i.e. it has an even lower p-value).\
B) The Higgs bump completely disappears.\
C) The Higgs bump remains, but it is not the strongest feature (i.e. the one with the lowest p-value) in the plot.\
D) The Higgs bump is not as prominent compared to other features (i.e. the p-values for other features are now closer to that for the Higgs bump).


In [None]:
#>>>EXERCISE: L12.5.1
# Use this cell for drafting your solution (if desired),
# then enter your answer in the interactive problem online to be graded.
# You may find that copying and editing some lines from L12.5-runcell01 will be helpful



### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Ex-12.5.2</span>

What do you expect the p-value plot would look like using a 5th order polynomial (don't try it yet)? Select all answers from the following that you think are possible:

A) The Higgs bump is enhanced.\
B) The Higgs bump completely disappears.\
C) The Higgs bump remains, but it is not the strongest feature in the plot.\
D) The Higgs bump is not as prominent as it was before, but it is still the strongest feature.




<a name='section_12_6'></a>
<hr style="height: 1px;">

## <h2 style="border:1px; border-style:solid; padding: 0.25em; color: #FFFFFF; background-color: #90409C">L12.6 Fitting Using Higher Order Polynomials</h2>  

| [Top](#section_12_0) | [Previous Section](#section_12_5) | [Exercises](#exercises_12_6) | [Next Section](#section_12_7) |

*The material in this section is discussed in the video **<a href="https://courses.mitxonline.mit.edu/learn/course/course-v1:MITxT+8.S50.2x+1T2025/block-v1:MITxT+8.S50.2x+1T2025+type@sequential+block@seq_LS12/block-v1:MITxT+8.S50.2x+1T2025+type@vertical+block@vert_LS12_vid6" target="_blank">HERE</a>.** You are encouraged to watch that video and use this notebook concurrently.*

### Challenge Question

Let's answer the last question from the previous exercise by finding the Higgs boson p-value significance plot with a 5th order polynomial. Then, we'll narrow in on the category 1 data (`out.txt`) and see how the fits with a 4th and 5th order polynomial compare. 

In [None]:
#>>>RUN: L12.6-runcell01

#THESE TWO FUNCTIONS WERE DEFINED PREVIOUSLY
def pol5(x, p0, p1,p2,p3,p4,p5):
    pols=[p0,p1,p2,p3,p4,p5]
    y = np.polyval(pols,x)
    return y

def sigpol5(x,p0,p1,p2,p3,p4,p5,amp,mass,sigma):
    bkg=pol5(x,p0,p1,p2,p3,p4,p5)
    sig=amp*stats.norm.pdf(x,mass,sigma)
    return sig+bkg

m0,p0 = pvaluePlot("data/L12/out.txt",sigpol5,pol5)
m1,p1 = pvaluePlot("data/L12/out2.txt",sigpol5,pol5)
m2,p2 = pvaluePlot("data/L12/out3.txt",sigpol5,pol5)
m3,p3 = pvaluePlot("data/L12/out4.txt",sigpol5,pol5)
m4,p4 = pvaluePlot("data/L12/out5.txt",sigpol5,pol5)

plt.plot(m0,p0,label="Category 1")
plt.plot(m1,p1,label="Category 2")
plt.plot(m2,p2,label="Category 3")
plt.plot(m3,p3,label="Category 4")
plt.plot(m4,p4,label="Category 5")
plt.ylim((0.0001,1))
plt.xlabel(r"$m_{\gamma\gamma}$")
plt.ylabel("p-value")
plt.yscale("log")
plt.legend()
plt.show()

In [None]:
#>>>RUN: L12.6-runcell02

#answer
def pol5(x, p0, p1,p2,p3,p4,p5):
    pols=[p0,p1,p2,p3,p4,p5]
    y = np.polyval(pols,x)
    return y

def sigpol5(x,p0,p1,p2,p3,p4,p5,amp,mass,sigma):
    bkg=pol5(x,p0,p1,p2,p3,p4,p5)
    sig=amp*stats.norm.pdf(x,mass,sigma)
    return sig+bkg

NLL=fitSig("data/L12/out.txt",125,sigpol5,pol5,True)


m0_pol4,p0_pol4 = pvaluePlot("data/L12/out.txt",sigpol4,pol4)
m0_pol5,p0_pol5 = pvaluePlot("data/L12/out.txt",sigpol5,pol5)

plt.plot(m0_pol4,p0_pol4,label="Category 1 4th order")
plt.plot(m0_pol5,p0_pol5,label="Category 1 5th order")
plt.ylabel("p-value")
plt.yscale("log")
plt.legend()
plt.show()

#The 5th order polynomial is less sensitive since there are more degrees of freedom

<h3>Combining p-values</h3>

Now, if we have 5 subsets of the data, each giving a p-value at a specific mass point, how do we combine these p-values to get the significance for the entire data sample? The strategy is to realize that these are each independent experiments. 

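We start by relating the individual p-values to their corresponding $\chi^2$ distribution. If there is no signal, each p-value is uniformly distributed between 0 and 1, in which case $-2\log(p_{i})$ follows a $\chi^{2}$ distribution with 2 degrees of freedom: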

$$
\begin{equation}
\chi^{2}_{\nu=2} = -2 \log(p_{i})
\end{equation}
$$

Now, let's say we have $n$ measurements each with probability $p_{i}$ for the i-th category. If we take the $-2\log(p_{i})$ and sum them for all of the categories, we have a sum of $n$ $\chi^{2}$ distributions for 2 degrees of freedom. This is just a $\chi^{2}_{\nu=2n}$ distribution. 

$$
\begin{equation}
\chi^{2}_{\nu=2n} = -2 \sum_{i=1}^{n} \log(p_{i})
\end{equation}
$$

From this relation, we can immediately get the combined p-value: it is the probability that a $\chi^{2}_{\nu=2n}$ distribution gives a value at least as large as the summed $-2\log(p_{i})$.  Let's see this in action!
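This combination rule is known as Fisher's method, and SciPy provides it as `scipy.stats.combine_pvalues`. The short sketch below applies the formula by hand and compares with the SciPy result. The helper `fisher_combine` and the p-values are made up for illustration.

```python
import numpy as np
from scipy import stats

def fisher_combine(pvalues):
    """Combine independent p-values: -2*sum(log p) follows a chi2 with 2n degrees of freedom."""
    pvalues = np.asarray(pvalues, dtype=float)
    statistic = -2.0 * np.sum(np.log(pvalues))
    return stats.chi2.sf(statistic, 2 * pvalues.size)

toy_pvalues = [0.02, 0.10, 0.30, 0.45, 0.60]  # made-up numbers, one per category
p_manual = fisher_combine(toy_pvalues)
stat_scipy, p_scipy = stats.combine_pvalues(toy_pvalues, method="fisher")
print("manual:", p_manual, " scipy:", p_scipy)

# Sanity check: combining a single p-value returns that same p-value
print("single p-value:", fisher_combine([0.3]))
```

The single-value check works because a $\chi^{2}$ distribution with 2 degrees of freedom has tail probability $e^{-x/2}$, which is exactly why $-2\log(p_{i})$ is the natural quantity to sum.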

In [None]:
#>>>RUN: L12.6-runcell03

def pvalueCalc(iLabel,pMass,iSBFunc,iBFunc):
    logp=0
    for pLabel in iLabel:
        NLL=fitSig(pLabel,pMass,iSBFunc,iBFunc,False)
        NLLp = 1.-stats.chi2.cdf(NLL,1)
        logp = logp - 2.*np.log(NLLp)
    pPVal  = 1-stats.chi2.cdf(logp,2*len(iLabel))
    return pPVal

files=["data/L12/out.txt","data/L12/out2.txt","data/L12/out3.txt","data/L12/out4.txt","data/L12/out5.txt"]
mC,pC = pvaluePlot(files,sigpol4,pol4)

for pVal in range(4):
    sigmas = 1-stats.norm.cdf(pVal+1)
    plt.axhline(y=sigmas, color='r', linestyle='-')
plt.plot(m0,p0,label="Category 1")
plt.plot(m1,p1,label="Category 2")
plt.plot(m2,p2,label="Category 3")
plt.plot(m3,p3,label="Category 4")
plt.plot(m4,p4,label="Category 5")
plt.plot(mC,pC,label="Combined Category")
plt.ylim((0.0001,1))
plt.xlabel(r"$m_{\gamma\gamma}$")
plt.ylabel("p-value")
plt.yscale("log")
plt.legend()
plt.show()


Note that the plot above also has horizontal lines at the p-values corresponding to 1, 2, and 3 standard deviations (the line for 4 standard deviations falls below the plotted range).
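For reference, the positions of those horizontal lines come from the one-sided Gaussian tail probability, which is the same quantity as `1-stats.norm.cdf(n)` in the cell above. A minimal sketch (the dictionary name `sigma_to_p` is an assumption):

```python
from scipy import stats

sigma_to_p = {}
for n_sigma in range(1, 5):
    sigma_to_p[n_sigma] = stats.norm.sf(n_sigma)  # one-sided tail beyond n_sigma
    print(f"{n_sigma} sigma -> p = {sigma_to_p[n_sigma]:.3g}")

# 1 sigma -> p = 0.159
# 2 sigma -> p = 0.0228
# 3 sigma -> p = 0.00135
# 4 sigma -> p = 3.17e-05
```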

Because the plot with all of the categories overlaid is quite messy, let's just plot the combined p-values alone. Notice that there are several "bumps" in individual categories, some reaching close to 2$\sigma$, which are significantly reduced when combining all of the categories together.

In [None]:
#>>>RUN: L12.6-runcell05

for pVal in range(4):
    sigmas = 1-stats.norm.cdf(pVal+1)
    plt.axhline(y=sigmas, color='r', linestyle='-')
    
plt.plot(mC,pC,label="Combined Category")
plt.ylim((0.0001,1))
plt.xlabel(r"$m_{\gamma\gamma}$")
plt.ylabel("p-value")
plt.yscale("log")
plt.legend()
plt.show()

Ok, now that we have done that, let's compare our result for the 8TeV measurement (2012 data) with that in the original paper, which can be found <a href="https://arxiv.org/pdf/1407.0558.pdf" target="_blank">here</a>. The plot of p-value versus mass, comparable to what is shown above, is Fig. 18 on page 40. The red curve for the 8TeV data goes down to about $3\times10^{-5}$, corresponding to 4 standard deviations, whereas we found a minimum p-value equivalent to only 3 standard deviations. The reason is that the analysis in the paper is more complicated. To compare our result with something closer in significance, look at Fig. 2 on page 10 of the Higgs discovery <a href="https://arxiv.org/pdf/1207.7235.pdf" target="_blank">paper</a>.

The published Higgs analysis is more complicated because the combination of different categories that we did here is naive. We are assuming that all of the categories contribute equally. In reality, we know from simulation of the signal that the relative contributions of the categories are not equal. As a consequence, the technique used in this section really only works to get an "approximate" sensitivity. The better way to do this is to simultaneously fit all categories, taking into account the known relative signal strength of each one. 

<a name='exercises_12_6'></a>     

| [Top](#section_12_0) | [Restart Section](#section_12_6) | [Next Section](#section_12_7) |


### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Ex-12.6.1</span>

Noting that the p-value on the y-axis of the plots shown previously is just a translation of the likelihood, and the p-value is computed from a $\chi^{2}$ distribution, convert from $\chi^{2}$ probability back to 2$\log(\mathcal{L})$, and from this compute the best fit mass for the Higgs boson with 1 standard deviation uncertainty. Use the results of the combined data, denoted `mC` and `pC`. As always, the "best fit mass" is the one corresponding to the lowest $\chi^2$ and the $1\sigma$ uncertainty corresponds to the range over which the $\chi^2$ is within 1 of the minimum value. You may choose to perform this computation in the code cell provided.

How does the best fit mass that you find from this analysis compare to the accepted value of the Higgs?

Report your answer as a list of two numbers with precision 1e-1:`[best fit mass, uncertainty]`

**Note:** You can answer this question by simply looking at the $\chi^2$ for individual masses in the `mC` array, but a more accurate answer (less sensitive to the details of the binning) can be found by fitting a parabola to a few points right around the minimum. Use whatever method you like!
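The parabola technique mentioned in the note can be illustrated on a toy curve. The sketch below does not use `mC` or `pC`; the arrays `m_toy` and `chi2_toy` are synthetic, with a minimum placed at 125.3 and a width of 0.5 purely for illustration.

```python
import numpy as np

# Synthetic chi2 curve: minimum value 4 at m = 125.3, rising by 1 at +-0.5
m_toy = np.linspace(124.0, 127.0, 13)
chi2_toy = 4.0 + ((m_toy - 125.3) / 0.5) ** 2

# Fit a parabola a*u^2 + b*u + c in the centered variable u = m - m_ref
# (centering keeps the polynomial fit numerically well behaved)
m_ref = m_toy.mean()
a, b, c = np.polyfit(m_toy - m_ref, chi2_toy, 2)

m_best = m_ref - b / (2.0 * a)   # vertex of the parabola
sigma_m = 1.0 / np.sqrt(a)       # distance at which chi2 rises by 1
print(f"best fit = {m_best:.2f}, 1 sigma = {sigma_m:.2f}")  # best fit = 125.30, 1 sigma = 0.50
```

With real scan output, only the few points closest to the minimum should enter the fit, since the curve is only approximately parabolic there.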

In [None]:
#>>>EXERCISE: L12.6.1
# Use this cell for drafting your solution (if desired),
# then enter your solution in the interactive problem online to be graded.

#The strategy here is to realize that p-value plot is also 2*Log(L) of our best fit,
#thus, we just need to go 1 standard deviation from the minimum in likelihood

<a name='section_12_7'></a>
<hr style="height: 1px;">

## <h2 style="border:1px; border-style:solid; padding: 0.25em; color: #FFFFFF; background-color: #90409C">L12.7 Building Interpolated Distributions</h2>  

| [Top](#section_12_0) | [Previous Section](#section_12_6) | [Exercises](#exercises_12_7) | [Next Section](#section_12_8) |

*The material in this section is discussed in the video **<a href="https://courses.mitxonline.mit.edu/learn/course/course-v1:MITxT+8.S50.2x+1T2025/block-v1:MITxT+8.S50.2x+1T2025+type@sequential+block@seq_LS12/block-v1:MITxT+8.S50.2x+1T2025+type@vertical+block@vert_LS12_vid8" target="_blank">HERE</a>.** You are encouraged to watch that video and use this notebook concurrently.*

<h3>Slides</h3>

Run the code below to view the slides for this section, which are discussed in the related video. You can also open the slides in a separate window <a href="https://mitx-8s50.github.io/slides/L12/slides_L12_08.html" target="_blank">HERE</a>.

In [None]:
#>>>RUN: L12.7-slides

from IPython.display import IFrame
IFrame(src='https://mitx-8s50.github.io/slides/L12/slides_L12_08.html', width=970, height=550)

<h3>Overview</h3>

To find the right fit function, we used a library of functions to profile and find the one that best fit the data. Thanks to the f-statistic, we don't need to just throw one function at the problem, we can throw many. In fact, modern searches aiming for the most sensitivity will use a whole library of functions to fit a signal. For a more detailed analysis of how you would do this, look at this <a href="https://arxiv.org/pdf/1408.6865.pdf" target="_blank">paper</a>. 

However, we can also build our function by just matching it to the data. We will discuss two ways to do this, starting with a spline interpolated function. 

In [None]:
#>>>RUN: L12.7-runcell01

#Now let's load some data and build a spline interpolation from it
x,y,y_err,weights=load("data/L12/out_2011.txt")


from scipy import interpolate
tck = interpolate.splrep(x, y) #setup the spline
x2 = np.linspace(110, 150) #range
y2 = interpolate.splev(x2, tck)#apply the spline

plt.plot(x, y, 'go',label='data')
plt.plot(x2, y2, 'b',label='spline')
plt.xlabel(r"$m_{\gamma\gamma}$")
plt.ylabel("$N_{events}$")
plt.legend()
plt.show()

So what happened is that we took the data points themselves and used them to generate a smooth function that we can evaluate anywhere. This is done by chunking up the data into little pieces, fitting a polynomial (cubic by default in `splrep`) to each, and then joining those fits into a single smooth function. 
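A small standalone sketch makes two properties of this spline explicit: with the default arguments and no weights, `splrep` builds a cubic spline that passes exactly through the input points. The arrays `x_knots` and `y_knots` are synthetic stand-ins for the data, chosen only for illustration.

```python
import numpy as np
from scipy import interpolate

# Synthetic falling spectrum sampled at 9 points
x_knots = np.linspace(110.0, 150.0, 9)
y_knots = 100.0 * np.exp(-(x_knots - 110.0) / 20.0)

# Defaults: degree k=3, and smoothing s=0 (pure interpolation) when no weights are given
tck_toy = interpolate.splrep(x_knots, y_knots)

# Evaluate the spline on a finer grid than the original points
x_fine = np.linspace(110.0, 150.0, 200)
y_fine = interpolate.splev(x_fine, tck_toy)

mismatch = np.max(np.abs(interpolate.splev(x_knots, tck_toy) - y_knots))
print("spline degree:", tck_toy[2])
print("max mismatch at the data points:", mismatch)
```

Because the spline reproduces every point exactly, it also reproduces every statistical fluctuation, which is where the wiggles in the plot above come from.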

We can do whatever we want with the resulting function. For example, this spline fit has a lot of wiggles which probably do not result from the underlying physics which creates our background. As one way to address that, let's smooth this function out by convolving it with a Gaussian distribution. 

In [None]:
#>>>RUN: L12.7-runcell02

#spline convolve by hand
def splineconvolve(tck,f2,x,iMin=-15,iMax=15,iN=500):
    step=float((iMax-iMin))/float(iN)
    pInt=0
    for i0 in range(iN):
        pX   = i0*step+iMin
        pVal = interpolate.splev(x-pX,tck)*f2(pX)
        pInt += pVal*step
    return pInt

def gaussian(x,mean=0,sigma=1):#try changing the sigma value
    return 1./(sigma * np.sqrt(2 * np.pi)) * np.exp( - (x - mean)**2 / (2 * sigma**2)) 

fig, ax = plt.subplots()
x_in=np.linspace(110, 150, 10)
#x_in=np.linspace(115, 145, 10)
#x_in=np.linspace(122, 142, 10)
conv_out=[]
for val in x_in:
    pConv_out=splineconvolve(tck,gaussian,val)
    conv_out.append(pConv_out)

#now we can plot it
plt.plot(x, y, 'go',label='data')
plt.plot(x2, y2, 'b',label='spline')
plt.plot(x_in,conv_out,c='orange',label='convolution')
plt.xlabel(r"$m_{\gamma\gamma}$")
plt.ylabel("$N_{events}$")
plt.legend()
plt.show()

As you probably expected, the convolution smoothed out those artificial looking wiggles. You can also see one of the issues with this method, namely that it has so-called "edge" effects. These occur where the Gaussian used in the convolution extends beyond the region where there is data, so the spline is being evaluated outside the range it was built on. Try one of the commented-out `x_in=np.linspace` lines to see what happens when you convolve over a narrower data range.
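
The sketch below restates the Riemann-sum convolution in vectorized form and checks it on two toy inputs where the answer is known analytically. The helper `convolve_at`, the toy `peak` and `flat` functions, and the choice to treat "no data" as zero are assumptions made for illustration; in the cell above, the spline is instead extrapolated beyond the data range, which distorts the edges in a different way.

```python
import numpy as np

def gaussian(x, mean=0.0, sigma=1.0):
    return np.exp(-(x - mean) ** 2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

def convolve_at(f, kernel, x, lo=-15.0, hi=15.0, n=3000):
    """Riemann-sum estimate of the integral of f(x - t) * kernel(t) over t."""
    t, dt = np.linspace(lo, hi, n, endpoint=False, retstep=True)
    return np.sum(f(x - t) * kernel(t)) * dt

# 1) Smearing widens a peak: the widths add in quadrature, sqrt(2**2 + 1**2)
def peak(m):
    return gaussian(m, mean=125.0, sigma=2.0)

smeared_peak = convolve_at(peak, gaussian, 125.0)
expected_peak = gaussian(125.0, mean=125.0, sigma=np.sqrt(5.0))

# 2) Edge effect: a flat shape that is assumed to be zero outside 110-150
def flat(m):
    return np.where((m >= 110.0) & (m <= 150.0), 1.0, 0.0)

center = convolve_at(flat, gaussian, 130.0)  # kernel fully inside the range
edge = convolve_at(flat, gaussian, 110.0)    # half of the kernel falls outside
print(smeared_peak, expected_peak, center, edge)
```

At the center the flat shape is returned unchanged, while at the boundary roughly half of the kernel's weight has nothing to multiply, so the smoothed value drops to about one half.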

In fact, interpolation with all sorts of functions, not just polynomials, has recently become very popular. Just to show you another one that is used often, let's try a Gaussian process, which uses Gaussian distributions to fit the data piece-wise. Conceptually, this strategy is like the f-test: you keep adding Gaussians to fit the data until it is well described. See <a href="https://en.wikipedia.org/wiki/Gaussian_process" target="_blank">here</a> for a much deeper explanation of how this works.

One advantage of this second method is that it outputs an uncertainty band around the fitted output. This band is determined by using the uncertainties in the data points to find the range of functions that give a reasonable fit.
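
To make the origin of the uncertainty band concrete, here is a bare-bones Gaussian process regression written directly in NumPy. This is a sketch under stated assumptions, not what `george` does internally: the helper names `matern52` and `gp_predict`, the toy data, the constant prior mean, and the plain length-scale parameterization are all chosen here for illustration. The predicted mean and variance follow from conditioning a multivariate Gaussian on the measured points, with the data uncertainties added to the diagonal of the covariance matrix.

```python
import numpy as np

def matern52(xa, xb, amp2, length):
    """Matern-5/2 covariance between two sets of 1D points."""
    r = np.abs(xa[:, None] - xb[None, :]) / length
    return amp2 * (1 + np.sqrt(5) * r + 5 * r**2 / 3) * np.exp(-np.sqrt(5) * r)

def gp_predict(x, y, y_err, x_new, amp2, length):
    """Posterior mean and variance, assuming a constant prior mean of mean(y)."""
    mu = np.mean(y)
    K = matern52(x, x, amp2, length) + np.diag(y_err**2)  # data covariance
    Ks = matern52(x_new, x, amp2, length)                 # cross covariance
    mean = mu + Ks @ np.linalg.solve(K, y - mu)
    var = amp2 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mean, var

# Toy falling spectrum with Poisson-like uncertainties (illustrative only)
x_toy = np.linspace(110, 150, 21)
y_toy = 100 * np.exp(-(x_toy - 110) / 20)
err_toy = np.sqrt(y_toy)

x_new = np.array([130.0, 200.0])  # one point among the data, one far away
mean, var = gp_predict(x_toy, y_toy, err_toy, x_new, np.var(y_toy), 5.0)
print(mean, np.sqrt(var))
```

Far from the data the variance returns to the prior value `amp2`, while among the data points it shrinks to a level set by `y_err`, which is what produces a band like the one in the plot.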

In [None]:
#>>>RUN: L12.7-runcell03

import george
from george import kernels

kernel = np.var(y) * kernels.Matern52Kernel(5.0)
#kernel = np.var(y) * kernels.Matern52Kernel(125.0)
gp = george.GP(kernel)
gp.compute(x, y_err)
x_pred = np.linspace(110, 150, 100)
pred, pred_var = gp.predict(y, x_pred, return_var=True)

plt.fill_between(x_pred, pred - np.sqrt(pred_var), pred + np.sqrt(pred_var),color="k", alpha=0.2,label="Gaussian process")
plt.plot(x_pred, pred, "k", lw=1.5, alpha=0.5)
plt.errorbar(x, y, yerr=y_err, fmt=".k", capsize=0)
plt.plot(x2, y2, 'b',label='spline')
plt.plot(x_in,conv_out,c='orange',label='convolution')
plt.xlabel(r"$m_{\gamma\gamma}$")
plt.ylabel("$N_{events}$")
plt.legend()
plt.show()

The comparison to the convolved spline shows that this second procedure also has an issue with generating wiggles. We could again smooth this output by convolving it with a Gaussian, but there is another option. As with the spline, there are parameters that control how the fit averages over the data points. Try uncommenting the second `kernel = np.var(y)` line to see an example of how this works. The difference is that you are effectively changing the size of the interpolation window, which makes the resulting output distribution more or less smooth. Note that even the alternative Gaussian process might still retain some unphysical wiggles. 
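
The effect of the kernel argument can be seen from the kernel alone. The sketch below evaluates the Matern-5/2 correlation between two points a fixed distance apart for two length scales. It assumes that the number passed to `Matern52Kernel` acts as a squared length scale, so that 5.0 and 125.0 correspond to lengths of $\sqrt{5}$ and $\sqrt{125}$; check the `george` documentation before relying on that mapping. The function name `matern52_corr` and the separation of 2 are illustrative choices.

```python
import numpy as np

def matern52_corr(delta, length):
    """Matern-5/2 correlation between two points separated by delta."""
    r = np.abs(delta) / length
    return (1 + np.sqrt(5) * r + 5 * r**2 / 3) * np.exp(-np.sqrt(5) * r)

delta_m = 2.0  # separation between neighbouring points (toy value)
corr_short = matern52_corr(delta_m, np.sqrt(5.0))    # assumed to mimic 5.0
corr_long = matern52_corr(delta_m, np.sqrt(125.0))   # assumed to mimic 125.0
print(f"short length scale: {corr_short:.3f}, long length scale: {corr_long:.3f}")
```

A larger length scale makes neighbouring points more strongly correlated, so the fitted curve cannot change as quickly from one point to the next, which is the smoothing described above.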

<a name='exercises_12_7'></a>     

| [Top](#section_12_0) | [Restart Section](#section_12_7) | [Next Section](#section_12_8) |

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Ex-12.7.1</span>

As we saw, when we interpolate data we obtain a function that can be used on any set of data points within the interpolated range. We can also augment the interpolation by convolving it with other functions, the most useful of which is a Gaussian.

Consider how this can be useful in the context of fitting and hypothesis testing. Which of the following describes the usefulness of creating new functions in this manner? Select all that apply:

A) We can use a Gaussian convolution to smooth out features of the data when creating a background model.\
B) We can use the interpolation to create a background model of the data, then add a signal to the interpolation and perform a fit.\
C) We can use Gaussian convolution to enhance the resolution of the interpolated function, allowing for more precise analysis of the data.\
D) We can exploit the smoothness of the interpolated function combined with Gaussian convolution to detect and remove outliers in the data before fitting.

<a name='section_12_8'></a>
<hr style="height: 1px;">

## <h2 style="border:1px; border-style:solid; padding: 0.25em; color: #FFFFFF; background-color: #90409C">L12.8 An Example Fitting Z Boson Data (Ungraded)</h2>     


| [Top](#section_12_0) | [Previous Section](#section_12_7) |

*The material in this section is discussed in the video **<a href="https://courses.mitxonline.mit.edu/learn/course/course-v1:MITxT+8.S50.2x+1T2025/block-v1:MITxT+8.S50.2x+1T2025+type@sequential+block@seq_LS12/block-v1:MITxT+8.S50.2x+1T2025+type@vertical+block@vert_LS12_vid9" target="_blank">HERE</a>.** You are encouraged to watch that video and use this notebook concurrently.*

<h3>Slides</h3>

Run the code below to view the slides for this section, which are discussed in the related video. You can also open the slides in a separate window <a href="https://mitx-8s50.github.io/slides/L12/slides_L12_09.html" target="_blank">HERE</a>.

In [None]:
#>>>RUN: L12.8-slides

from IPython.display import IFrame
IFrame(src='https://mitx-8s50.github.io/slides/L12/slides_L12_09.html', width=970, height=550)

<h3>Overview</h3>

Now, we will move a bit beyond just fitting analytic functions. We do not always have a nice function that we can use in a fit to describe our data; sometimes we have a simulated shape instead. What we will do in this section is create an interpolated function (blue line) from simulated data (blue histogram), and then convolve our interpolation with a Gaussian function. Ultimately, we will tune the parameters of this interpolation+Gaussian model in order to fit the actual data (black points).

For high energy physics, simulated shapes often come from so-called "Monte-Carlo" simulations. In this approach, we construct distributions by randomly sampling many millions of events generated using some sort of model. The resulting simulated events can then be treated like data.
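
As a toy illustration of the idea (not the simulation used to produce the course files), the sketch below draws events from an assumed Gaussian-shaped peak plus a flat component and bins them into a histogram that can then be treated like data. The peak position, width, event counts, and binning are all made-up values.

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed toy model: a peak near 91 plus a flat continuum between 50 and 160
n_peak, n_flat = 100_000, 20_000
m_peak = rng.normal(loc=91.0, scale=3.0, size=n_peak)
m_flat = rng.uniform(50.0, 160.0, size=n_flat)
masses = np.concatenate([m_peak, m_flat])

# Bin the simulated events; sqrt(N) is the usual Poisson uncertainty per bin
counts, edges = np.histogram(masses, bins=55, range=(50.0, 160.0))
centers = 0.5 * (edges[:-1] + edges[1:])
errors = np.sqrt(counts)

print("most populated bin is centered at", centers[np.argmax(counts)])
```

The arrays `centers`, `counts`, and `errors` play the same role as the `x`, `y`, and `y_err` columns loaded from a file elsewhere in this notebook.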

The data set below (black points) represents another Higgs boson channel, the Higgs decay to 4 leptons. However, in this example, we will try to fit the peak at 90, corresponding to the Z boson, which is very apparent. Again, we will try to fit our interpolation+Gaussian convolved function (blue line), which we derived from simulated data.

In [None]:
#>>>RUN: L12.8-runcell01

x,y_data,y_err,weights=load("data/L12/data.txt",True)
x,y_mc,y_mc_err,_=load("data/L12/zz_narrow.txt",True)

tck = interpolate.splrep(x, y_mc)
# y_shift=y_mc*3
# tck = interpolate.splrep(x, y_shift)

x2 = np.linspace(50, 160,1000)
y2 = interpolate.splev(x2, tck)
#y2 = interpolate.splev(x2, tck)*2.5

plt.errorbar(x,y_data,yerr=y_err,marker='.',linestyle = 'None', color = 'black')
plt.plot(x2, y2, 'b')
plt.plot(x,y_mc,drawstyle = 'steps-mid')
plt.xlabel(r"$m_{4\ell}$")
plt.ylabel("$N_{events}$")
plt.show()

The plot above shows the number of simulated events as a histogram, along with a spline fit to that histogram. Clearly, a few things are off. Most obviously, the number of events is different between the data and the simulation. Secondly, the shapes don't look exactly the same. To see the latter effect more clearly, try replacing the `tck = interpolate.splrep` line with the two commented-out lines below it.
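
The normalization mismatch can be quantified before doing any full fit. For a model of the form data ≈ amp × template with per-bin uncertainties, minimizing $\chi^2$ with respect to `amp` has a closed-form solution. The sketch below uses made-up arrays and a hypothetical helper `best_scale`; it is not part of the course code.

```python
import numpy as np

def best_scale(data, template, sigma):
    """Scale a that minimizes sum(((data - a * template) / sigma)**2)."""
    w = 1.0 / sigma**2
    return np.sum(w * data * template) / np.sum(w * template**2)

# Made-up template and data: the data is the template scaled by 3 plus offsets
template = np.array([1.0, 4.0, 10.0, 4.0, 1.0])
data = 3.0 * template + np.array([0.5, -0.5, 0.0, 0.5, -0.5])
sigma = np.sqrt(data)

amp = best_scale(data, template, sigma)
chi2 = np.sum(((data - amp * template) / sigma) ** 2)
print(f"amp = {amp:.3f}, chi2 = {chi2:.3f}")
```

A scale factor like this only fixes the overall event count; differences in shape need extra freedom in the model, such as a smearing width.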

Oftentimes, when we want to do a precision fit, we will rely on our simulated samples to extract the signal. To do so, we allow the shape to be modified in a number of different ways. One approach is to numerically convolve the shape with a Gaussian distribution, which smears it out and makes it wider. This is what we will do in the fit shown below. 

In [None]:
#>>>RUN: L12.8-runcell02

#spline convolve by hand (uses the global tck spline and the gaussian function defined earlier)
def splineconvolvegaus(x,mean,sigma,iMin=-15,iMax=15,iN=500):
    step=float(iMax-iMin)/float(iN)
    pInt=0
    for i0 in range(iN):
        pX   = i0*step+iMin
        pVal = interpolate.splev(x-pX,tck)*gaussian(pX,mean,sigma)
        pInt += pVal*step
    return pInt

def gausconv(x,mean,sigma,amp,a,b):
    #Try fitting just this val first
    val=splineconvolvegaus(x,mean,sigma)*amp
    
    #Next, try uncommenting this line to change the function
    #val=a + b*x + val
    return val

model  = lmfit.Model(gausconv)
p = model.make_params(mean=0,sigma=1.0,amp=3.0,a=1,b=0)
#try adjusting the sigma value below, and also try fixing it or letting it float in the fit
#p["sigma"].value=0.1
#p["sigma"].vary=False
result = model.fit(data=y_data,params=p,x=x,weights=weights)
lmfit.report_fit(result)
result.plot()
plt.show()

This fit doesn't look particularly good. Let's try some other options. In the following code, we first try adding a linear background and then turning off the smearing. 

In [None]:
#>>>RUN: L12.8-runcell03

#same convolution as before, now with the spline and the kernel passed in as arguments
def splineconvolvegaus(tck,f2,x,mean,sigma,iMin=-15,iMax=15,iN=500):
    step=float(iMax-iMin)/float(iN)
    pInt=0
    for i0 in range(iN):
        pX   = i0*step+iMin
        pVal = interpolate.splev(x-pX,tck)*f2(pX,mean,sigma)
        pInt += pVal*step
    return pInt

def gausconv(x,mean,sigma,sig,baseline,slope):
    val=splineconvolvegaus(tck,gaussian,x,mean,sigma)
    output = baseline+sig*val+slope*x
    return output

model  = lmfit.Model(gausconv)
p = model.make_params(mean=0,sigma=1,sig=2,baseline=2,slope=0)
p["mean"].vary = False
p["sigma"].vary = True
result = model.fit(data=y_data, params=p, x=x, weights=weights)
lmfit.report_fit(result)
result.plot()
plt.show()

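#Now let's turn off the smearing of the simulated shape by fixing sigma to a very small value
#(sigma is well below the integration step of 0.06, so the kernel acts like a spike at zero and its normalization is absorbed by sig)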
p["sigma"].value = 0.01
p["sigma"].vary = False
result = model.fit(data=y_data, params=p, x=x, weights=weights)
lmfit.report_fit(result)
result.plot()
plt.show()

<h3>Going beyond</h3>

This is just the start of constructing a fit function. The next set of tools we can explore is how to build deep learning algorithms to model the data. As we have gone through this and the previous lectures, we have been abstracting and standardizing the tools more and more to get to the point of being able to automate the whole procedure. This is what we will start in the following sections of the course. 
