# Lab 7 and 8

## Jennifer Vo, lab partners with William Olsen

### Lab 7: Event Selection Optimization

## Introduction and Selection

__You and your lab partner should pick different pT (transverse momentum) samples for this lab. In each pT sample, there are dedicated training samples for event selection optimization. All studies should be carried out by normalizing Higgs and QCD samples in each pT sample to give expected yields accordingly (See Dataset descriptions).__

In this lab, my partner and I will be optimizing the event selections in our LHC training samples. I will be working with the low pT training sample, which comes in two files: the QCD background dataset and the Higgs boson signal dataset. There are 100k total events (jets) in each dataset. The expected yield for the Higgs boson signal is 100 jets and the expected yield for the QCD background is 20,000 jets.
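
To make the normalization concrete, here is a minimal sketch of the per-event weights implied by the numbers above. The variable names are my own choice for illustration; the sample size and expected yields are the ones quoted in this section.

```python
# Per-event weights that scale each simulated sample to its expected yield.
# Variable names are illustrative; the numbers are those quoted in the text.
n_generated = 100_000      # simulated jets in each file

expected_higgs = 100       # expected Higgs signal yield
expected_qcd = 20_000      # expected QCD background yield

# Each simulated jet stands for this many expected events.
weight_higgs = expected_higgs / n_generated
weight_qcd = expected_qcd / n_generated

print(f"Higgs weight per jet: {weight_higgs}, QCD weight per jet: {weight_qcd}")
```

Multiplying any raw event count from a sample by its weight converts it to an expected yield, which is what the significance calculations later in the lab need.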

In a previous lab (Lab 5), I identified the variables that have discriminating power between the Higgs boson signal data and the QCD background data. In this lab, I will plot more histograms to work through the optimization process comprehensively.

First, let's import the required libraries for this lab and load the datasets from the two files into objects that I can work with.

In [2]:
%matplotlib inline
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import scipy
from scipy import stats
import h5py
import math
import pickle

#this sets the size of the plots to something useful
plt.rcParams["figure.figsize"] = (7,4)

In [3]:
# Load the QCD background sample (pickle was imported in the cell above).
# The with-block closes the file once loading is done.
with open("qcd250-500.pkl", "rb") as infile:
    qcd_dict = pickle.load(infile)

# Load the Higgs signal sample the same way.
with open("higgs250-500.pkl", "rb") as infile:
    higgs_dict = pickle.load(infile)

## Part 1
__Make a stacked histogram plot for the feature variable: mass__

First, I will be working with the variable "mass". This variable is the invariant mass of the jet, computed from the combined four-momenta of the jet's constituent particles produced in the collision. To build the stacked histogram, I pass the mass values from the QCD background and the Higgs signal samples to `plt.hist` as two separate datasets, which are then stacked on top of each other and distinguished by color.

In [None]:
# Expected yields used to normalize each sample (see the introduction).
n_qcd_expected = 20000
n_higgs_expected = 100

# Use every event in each sample, not just a subset.
mass_qcd = np.asarray(qcd_dict['mass'])
mass_higgs = np.asarray(higgs_dict['mass'])

# Per-event weights so that each histogram sums to its expected yield.
weights = [np.full(mass_qcd.size, n_qcd_expected / mass_qcd.size),
           np.full(mass_higgs.size, n_higgs_expected / mass_higgs.size)]

# Passing a list of arrays with stacked=True draws the signal on top of the background.
plt.hist([mass_qcd, mass_higgs], bins=30, weights=weights, histtype='bar',
         stacked=True, label=['QCD background', 'Higgs signal'])
plt.xlabel('mass (GeV)', fontsize=15)
plt.ylabel('Expected events', fontsize=15)
plt.legend()
plt.show()


- Evaluate expected significance without any event selection.
    - Use Poisson statistics for significance calculation
    - Compare the exact significance to the approximation $N_{Higgs}/\sqrt{N_{QCD}}$. If they are equivalent, explain your findings.
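
The significance comparison asked for above can be sketched as follows. This is one possible way to set it up, assuming the expected yields from the introduction (20,000 QCD jets and 100 Higgs jets) and treating the background-only count as a Poisson variable; the variable names are mine.

```python
import numpy as np
from scipy import stats

n_qcd = 20_000.0    # expected background yield (from the introduction)
n_higgs = 100.0     # expected signal yield (from the introduction)

# Count I would expect to observe if the signal is present.
n_obs = n_qcd + n_higgs

# p-value: probability that a background-only Poisson(n_qcd) count
# fluctuates up to n_obs or more. sf(k) = P(X > k), hence the "- 1".
p_value = stats.poisson.sf(n_obs - 1, n_qcd)

# Convert the one-sided p-value to an equivalent Gaussian sigma.
sigma_exact = stats.norm.isf(p_value)

# Large-count approximation from the prompt.
sigma_approx = n_higgs / np.sqrt(n_qcd)

print(f"Poisson: {sigma_exact:.3f} sigma, approximation: {sigma_approx:.3f} sigma")
```

With a background this large, a Poisson distribution is close to a Gaussian of width $\sqrt{N_{QCD}}$, so I would expect the two numbers to come out close to each other.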

## Part 2
Identify mass cuts to optimize the expected significance.
- Try different mass cuts systematically
- Evaluate expected significance for each set of mass cuts
- Identify the set of mass cuts which give you the highest significance.
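
A systematic scan over mass cuts could look like the sketch below. It is only an illustration under stated assumptions: the helper names `expected_significance` and `scan_mass_window` are hypothetical, windows with no simulated background are skipped, and the arrays at the bottom are toy stand-ins (a Gaussian peak on a flat background), not the lab samples.

```python
import numpy as np
from scipy import stats

def expected_significance(n_sig, n_bkg):
    """Gaussian-equivalent sigma of seeing n_sig + n_bkg events on a Poisson(n_bkg) background."""
    p_value = stats.poisson.sf(n_sig + n_bkg - 1, n_bkg)
    return stats.norm.isf(p_value)

def scan_mass_window(mass_sig, mass_bkg, w_sig, w_bkg, edges):
    """Try every window [lo, hi) built from `edges`; return (best_sigma, lo, hi)."""
    best = (-np.inf, None, None)
    for i, lo in enumerate(edges[:-1]):
        for hi in edges[i + 1:]:
            # Weighted (expected) yields inside the window.
            n_sig = w_sig * np.count_nonzero((mass_sig >= lo) & (mass_sig < hi))
            n_bkg = w_bkg * np.count_nonzero((mass_bkg >= lo) & (mass_bkg < hi))
            if n_bkg <= 0:   # assumption: skip windows with no simulated background
                continue
            sigma = expected_significance(n_sig, n_bkg)
            if sigma > best[0]:
                best = (sigma, lo, hi)
    return best

# Toy stand-in data, NOT the lab samples.
rng = np.random.default_rng(0)
toy_sig = rng.normal(125.0, 10.0, size=20_000)
toy_bkg = rng.uniform(0.0, 250.0, size=20_000)
w_sig = 100 / toy_sig.size       # scale to 100 expected signal events
w_bkg = 20_000 / toy_bkg.size    # scale to 20,000 expected background events

best_sigma, best_lo, best_hi = scan_mass_window(
    toy_sig, toy_bkg, w_sig, w_bkg, np.arange(0.0, 251.0, 10.0)
)
print(f"best window: [{best_lo:.0f}, {best_hi:.0f}) GeV, {best_sigma:.2f} sigma")
```

For the real samples, I would pass `np.asarray(higgs_dict['mass'])` and `np.asarray(qcd_dict['mass'])` in place of the toy arrays and choose the grid of edges from the mass range seen in the Part 1 histogram.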

## Part 3
Make two sets of stacked histogram plots for the rest of the features
- Set A without any event selection
    - Can you identify another feature as discriminative as mass? (i.e. equal or better significance after feature cut)
- Set B with your optimal mass cuts
    - Can you identify another feature to further improve your expected significance?
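
Since the same stacked plot is needed for every feature, a small helper keeps the two sets of plots consistent. The function below is a sketch; `plot_stacked` and its parameters are names I made up, and the default yields are the expected yields from the introduction.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_stacked(ax, x_qcd, x_higgs, n_qcd=20_000, n_higgs=100, bins=50, xlabel=""):
    """Draw a stacked QCD + Higgs histogram on `ax`, each sample scaled to its expected yield."""
    x_qcd = np.asarray(x_qcd, dtype=float)
    x_higgs = np.asarray(x_higgs, dtype=float)
    # Per-event weights so each sample sums to its expected yield.
    weights = [
        np.full(x_qcd.size, n_qcd / x_qcd.size),
        np.full(x_higgs.size, n_higgs / x_higgs.size),
    ]
    counts, edges, _ = ax.hist(
        [x_qcd, x_higgs], bins=bins, weights=weights,
        stacked=True, label=["QCD background", "Higgs signal"],
    )
    ax.set_xlabel(xlabel)
    ax.set_ylabel("Expected events")
    ax.legend()
    return counts, edges
```

For Set A, I would loop over the feature names (for example `for name in qcd_dict.keys()`) and call `plot_stacked(ax, qcd_dict[name], higgs_dict[name], xlabel=name)` on a fresh axes each time. For Set B, the same call applies after first selecting only the events that pass the mass cuts; in that case `n_qcd` and `n_higgs` should be replaced by the expected yields that survive the cuts.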

## Part 4
Optimize event selections using multiple features (if necessary)
- Find a set of feature cuts which achieve high expected significance.
- Compare significance (before/after event selection) derived in your pT samples to your lab partner. Describe your findings.
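
Combining cuts on several features comes down to AND-ing one boolean mask per feature. The sketch below shows the idea on a tiny hand-made example; `selection_mask` is a hypothetical helper, `feature_x` is a placeholder feature name rather than one from the datasets, and at least one cut must be supplied.

```python
import numpy as np

def selection_mask(features, cuts):
    """Boolean mask of events with lo <= features[name] < hi for every entry in `cuts`.

    `features` maps feature name -> array of values (a dict or a DataFrame both work);
    `cuts` maps feature name -> (lo, hi) and must contain at least one entry.
    """
    masks = []
    for name, (lo, hi) in cuts.items():
        values = np.asarray(features[name])
        masks.append((values >= lo) & (values < hi))
    # An event is kept only if it passes every individual cut.
    return np.logical_and.reduce(masks)

# Tiny hand-made example, NOT lab data. "feature_x" is a placeholder name.
toy = {
    "mass": np.array([100.0, 120.0, 130.0, 200.0]),
    "feature_x": np.array([0.1, 0.5, 0.2, 0.3]),
}
cuts = {"mass": (110.0, 140.0), "feature_x": (0.0, 0.4)}

mask = selection_mask(toy, cuts)
weight_qcd = 20_000 / 100_000   # per-event weight from the introduction
print(mask, weight_qcd * np.count_nonzero(mask))
```

Applying the same `cuts` to the signal and background samples gives the two expected yields after selection, which can then go into the significance calculation from Part 1.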

## Part 5
Bonus (optional):
- Plot 2-dimensional plots using the top two most discriminative features
    - Can you find a curve or a linear combination in this 2D plane which gives even better sensitivity?

Extended reading: Lab 7 is a classification problem using multi-dimensional features in supervised machine learning. We can use popular machine learning tools to develop an optimal classifier which can maximize information by using all features. Interested students can read https://scikit-learn.org/stable/supervised_learning.html
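
For the linear-combination question, one standard starting point is Fisher's linear discriminant, which I am swapping in here as an illustration; it is not prescribed by the lab. It picks the direction in the 2D plane that maximizes the separation of the two sample means relative to their spread. It does not maximize the Poisson significance directly, so a cut on the projected value would still need to be scanned. The function names and the toy Gaussian samples below are assumptions for the sketch.

```python
import numpy as np

def fisher_direction(sig, bkg):
    """Unit vector along the Fisher linear discriminant for two (n_events, 2) arrays."""
    mean_diff = sig.mean(axis=0) - bkg.mean(axis=0)
    # Sum of the two sample covariance matrices (within-class scatter).
    within = np.cov(sig, rowvar=False) + np.cov(bkg, rowvar=False)
    w = np.linalg.solve(within, mean_diff)
    return w / np.linalg.norm(w)

def separation(sig, bkg, w):
    """Squared difference of projected means divided by the summed projected variances."""
    s, b = sig @ w, bkg @ w
    return (s.mean() - b.mean()) ** 2 / (s.var(ddof=1) + b.var(ddof=1))

# Toy stand-in samples with two correlated features, NOT the lab data.
rng = np.random.default_rng(1)
cov = [[1.0, 0.8], [0.8, 1.0]]
toy_sig = rng.multivariate_normal([1.0, 0.0], cov, size=5000)
toy_bkg = rng.multivariate_normal([0.0, 0.0], cov, size=5000)

w = fisher_direction(toy_sig, toy_bkg)
x_axis = np.array([1.0, 0.0])
print("separation along Fisher direction:", separation(toy_sig, toy_bkg, w))
print("separation along first feature only:", separation(toy_sig, toy_bkg, x_axis))
```

With the real data, the two columns would be the two most discriminative features, and the projection `features @ w` becomes a single new variable that can be cut on like any other feature.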