# 2. Monte Carlo weighting and event selections

### 2.1 Monte Carlo weighting

Today we will see how the MC simulation is weighted so that it agrees with data.

The problem arises because simulated samples cannot always be produced with a number of events normalized to the cross section of the process. The number of MC events produced is therefore essentially arbitrary and depends on the computing power available at the time of production.

For example, the QCD-multijet background is never produced via MC simulation but always estimated in a data-driven way (we will do it in the next sessions), for two reasons:
* the process is extremely difficult to model
* it has a huge cross section

Simulating it would therefore require a huge amount of computing power, which we do not have.

In the end, this kind of procedure leads to situations in which:
* one MC event may represent more than 50 or even 100 events in data (if we do not have enough computing power to do a 1:1 production)
* one MC event represents a fraction of an event in data (if more MC events were produced than are expected in data)

For example:

The cross section of HH production via gluon fusion is $\sigma = 31.05$ fb.

Therefore, at an integrated luminosity of $\mathcal{L} = 59.7$ fb$^{-1}$ we would expect a number of events $N_{exp} = \sigma\cdot\mathcal{L} \approx 1854$.

In our MC simulation we have $N_{MC} = 26940$ events, so we have to weight the MC events by a factor that makes their number equal to the one expected at the LHC. The scale factor is $w = N_{exp} / N_{MC} \approx 0.07$.
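The arithmetic of this example can be checked with a few lines of Python. This is only a sketch: the function name `luminosity_weight` and its argument names are made up for illustration, while the numbers are the ones quoted above.

```python
def luminosity_weight(sigma_fb, lumi_fb_inv, n_mc):
    """Return (N_exp, w) for a cross section in fb and a luminosity in fb^-1."""
    n_exp = sigma_fb * lumi_fb_inv  # expected number of events in data
    w = n_exp / n_mc                # weight carried by each MC event
    return n_exp, w

n_exp, w = luminosity_weight(sigma_fb=31.05, lumi_fb_inv=59.7, n_mc=26940)
print(f"N_exp = {n_exp:.1f}")  # N_exp = 1853.7
print(f"w = {w:.4f}")          # w = 0.0688
```

Each MC event here stands for about 7% of a data event, which is the "fraction of an event" situation described above.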

This is what we can call a *luminosity scaling weight*. In our case this scaling has to be done manually, and every histogram you produce must be scaled by this weight. Luckily for you, this weight is *almost* calculated already, and only the following scaling needs to be done:

```python
histo.Scale(lumi / h_eff.GetBinContent(1))
```

where `h_eff` is a histogram contained in the SKIM `.root` files (you might have already noticed it). This is not a proper weight but a scaling, so it does not have to enter the uncertainty evaluation (see the following paragraphs to understand what I am talking about).
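One way to see why a global scaling is harmless for the uncertainty evaluation: multiplying a histogram by a constant multiplies bin contents and bin errors by the same factor, so the relative uncertainty of each bin stays the same. The sketch below shows this with plain Python lists. All numbers are invented for illustration, and `n_eff` simply stands in for `h_eff.GetBinContent(1)`.

```python
def scale_bins(contents, errors, factor):
    """Multiply bin contents and bin errors by one constant factor."""
    scaled_contents = [c * factor for c in contents]
    scaled_errors = [e * factor for e in errors]
    return scaled_contents, scaled_errors

# Hypothetical inputs, not taken from the SKIMS
contents = [120.0, 80.0, 20.0]
errors = [12.0, 8.0, 4.0]
lumi = 50.0
n_eff = 200.0  # stand-in for h_eff.GetBinContent(1)

new_contents, new_errors = scale_bins(contents, errors, lumi / n_eff)
for c, e, nc, ne in zip(contents, errors, new_contents, new_errors):
    print(f"relative error before: {e / c:.3f}, after: {ne / nc:.3f}")
```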

This is not the only weight we have to apply: several other weights, related to different sources, must be applied too. They are:

| Branch name | Meaning | Samples |
|---|---|---|
| `MC_weigth` | MC simulations weight | all MC samples |
| `PUReweight` | reweighting due to pile-up | all MC samples |
| `L1pref_weight` | correction to known trigger inefficiency | all MC samples |
| `trigSF` | trigger efficiency scale factors | all MC samples |
| `VBFtrigSF` | VBF trigger efficiency scale factors | all MC samples, only events with isVBFtrigger == 1 |
| `IdAndIsoAndFakeSF_deep_pt` | tauID, eleID, and muID efficiency scale factors | all MC samples |
| `DYscale_MTT` | correction to bad parton shower modelling in DY LO datasets | DY sample only |
| `prescaleWeight` | correction of prescaling techniques | all MC samples |
| `PUjetID_SF` | pile-up jetID efficiency scale factor | all MC samples |
| `bTagweightReshape` | b tagging efficiency | all MC samples |

All of these weights have to be applied when filling the histograms, and it is very important to make the histogram 'save' the weights!
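The per-event weight is the product of all the weights that apply to that event. A minimal sketch of this bookkeeping follows, with an event represented as a plain dictionary. The helper names `branches_for` and `event_weight`, the flags `is_dy_sample` and `use_vbf_trig_sf`, and the example values are assumptions made for illustration. Only a subset of the branches in the table is used to keep it short.

```python
def branches_for(event, is_dy_sample, use_vbf_trig_sf=False):
    """List the weight branches that apply to this event."""
    names = ["PUReweight", "trigSF", "PUjetID_SF"]  # illustrative subset of the table
    if is_dy_sample:
        names.append("DYscale_MTT")  # DY sample only
    if use_vbf_trig_sf and event["isVBFtrigger"] == 1:
        names.append("VBFtrigSF")    # only events with isVBFtrigger == 1
    return names

def event_weight(event, names):
    """Multiply together the values of the listed weight branches."""
    w = 1.0
    for name in names:
        w *= event[name]
    return w

# Hypothetical event, values invented for illustration
event = {"PUReweight": 2.0, "trigSF": 0.5, "PUjetID_SF": 1.0,
         "DYscale_MTT": 4.0, "VBFtrigSF": 0.25, "isVBFtrigger": 1}

print(event_weight(event, branches_for(event, is_dy_sample=False)))  # 1.0
print(event_weight(event, branches_for(event, is_dy_sample=True)))   # 4.0
```

Keeping the list of branches separate from the multiplication makes it easy to switch a single weight on or off per sample.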

As you know from statistics, when you have to calculate the uncertainty on a number of events it makes a big difference whether the events are the result of a counting experiment or of a weighting procedure. In the first case the distribution is Poissonian and the uncertainty is the square root of the number of entries (i.e. $\sqrt{N}$). On the other hand, when a histogram is the result of a sum and weighting procedure, the uncertainty on the number of entries is the sum in quadrature of all the weights (i.e. $\sqrt{\sum w^2}$).
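The difference between the two cases is easy to see numerically. In the sketch below (the function name and the numbers are made up for illustration), 400 MC events with weight 0.5 each give a yield of 200 with an uncertainty of 10, whereas a naive $\sqrt{200}$ would give about 14.1.

```python
import math

def weighted_yield_and_uncertainty(weights):
    """Return (sum of weights, sqrt of the sum of squared weights)."""
    total = sum(weights)
    uncertainty = math.sqrt(sum(w * w for w in weights))
    return total, uncertainty

weights = [0.5] * 400  # hypothetical sample: 400 MC events, each with w = 0.5
total, uncertainty = weighted_yield_and_uncertainty(weights)
print(total, uncertainty)  # 200.0 10.0

# With all weights equal to 1 the formula reduces to the Poisson sqrt(N)
print(weighted_yield_and_uncertainty([1.0] * 25))  # (25.0, 5.0)
```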

Therefore, it is very important for the histogram you create to save the weight corresponding to the entry you are filling it with. You do it like this:

```python
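import ROOT

h = ROOT.TH1F(...)
h.Sumw2()  # store the sum of squared weights per bin, so errors are sqrt(sum w^2)
for ev in range(tree.GetEntries()):
    tree.GetEntry(ev)
    # product of all the weights that apply to this event
    w = tree.<weight1> * tree.<weight2> * ...
    h.Fill(tree.<variable>, w)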
```

### 2.2 To do now

Apply the weights to the histograms you created for today. You can reuse the same code and just modify it to account for all the weights.

**Ignore `VBFtrigSF` for the moment**

Whether or not you have done the work using `ROOT`, this step is pretty straightforward.

I would suggest you create a duplicate of the script you have, so you can keep both the weighted and the non-weighted cases for when you write the report. (I think putting this part in the report is a good way to show that you understand how an analysis proceeds.)

### 2.3 Event selections

While the code you just modified is running, as I do not expect it to be super-fast, we can go through the following part, where I describe the event selection procedure.

What we call the `baseline` selection is already applied in the SKIMS that I provided you. This selection corresponds to these cuts on events:
* third lepton veto $\rightarrow$ `nleps == 0`
* at least two b jet candidates $\rightarrow$ `nbjetscand > 1`
* $p_{\text{T}}$ of both $\tau$ leptons greater than 25 GeV $\rightarrow$ `dau1/2_pt > 25`
* absolute pseudorapidity of both $\tau$ leptons smaller than 2.1 $\rightarrow$ `abs(dau1/2_eta) < 2.1`
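Although the SKIMS already contain only events passing these cuts, writing them as a function is a handy cross-check and a template for the later selections. The sketch below uses the cut expressions exactly as listed, with an event represented as a plain dictionary. It assumes that the shorthand `dau1/2_pt` and `dau1/2_eta` stands for the branches `dau1_pt`, `dau2_pt`, `dau1_eta`, and `dau2_eta`. The function name and the example values are made up.

```python
def passes_baseline(ev):
    """Baseline selection, transcribed from the cut expressions in the list above."""
    return (
        ev["nleps"] == 0                   # third lepton veto
        and ev["nbjetscand"] > 1           # b jet candidates
        and ev["dau1_pt"] > 25 and ev["dau2_pt"] > 25  # tau pT in GeV
        and abs(ev["dau1_eta"]) < 2.1 and abs(ev["dau2_eta"]) < 2.1
    )

# Hypothetical event, values invented for illustration
event = {"nleps": 0, "nbjetscand": 2,
         "dau1_pt": 30.0, "dau2_pt": 40.0,
         "dau1_eta": -1.5, "dau2_eta": 2.0}
print(passes_baseline(event))  # True
```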

On top of this `baseline` selection you have to apply all the other selections that define the different categories of the analysis: resolved `2b0j`, resolved `1b1j`, `boosted`, and `VBF`. The specific selections are listed below.

Category `2b0j`:
* not being boosted
* not being VBF
* not passing VBF selections
* both b jets passing medium btag WP

Category `1b1j`:
* not being boosted
* not being VBF
* not passing VBF selections
* one b jet passing medium btag WP and the other not

Category `boosted`:
* being boosted
* not being VBF
* not passing VBF selections
* both b jets passing loose btag WP

Category `VBF`:
* being VBF
* mass of the VBF jets larger than 500 GeV
* angular separation of the VBF jets larger than 3
* having at least one b jet passing medium working point
* `(((dau1_pt > 25 && dau2_pt > 25 && (dau1_pt <= 40 || dau2_pt <= 40)) && VBFjj_mass > 800 && VBFjet1_pt > 140 && VBFjet2_pt > 60) || isVBFtrigger==0)`

The related branch names are:
* being boosted $\rightarrow$ `isBoosted` (0/1 bool)
* being VBF $\rightarrow$ `isVBF` (0/1 bool)
* b jets tagging $\rightarrow$ `bjet1_bID_deepFlavor` and `bjet2_bID_deepFlavor` ([0,1] continuous)
* VBF jets mass $\rightarrow$ `VBFjj_mass`
* VBF jets angular separation $\rightarrow$ `VBFjj_deltaEta`

The b tag working points are:
* `loose` = 0.0494
* `medium` =  0.2770
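The category definitions above can be turned into a single function that returns the category name of an event. This is only a sketch under stated assumptions: an event is a plain dictionary keyed by branch name, "passing" a working point is taken to mean a score strictly greater than the threshold, "not being VBF" is read as `isVBF == 0`, and "not passing VBF selections" is read as failing the full `VBF` category selection. The function names and the example event are made up for illustration.

```python
LOOSE_WP = 0.0494
MEDIUM_WP = 0.2770

def passes_vbf(ev):
    """Full VBF category selection, transcribed from the list above."""
    b1_medium = ev["bjet1_bID_deepFlavor"] > MEDIUM_WP
    b2_medium = ev["bjet2_bID_deepFlavor"] > MEDIUM_WP
    trigger_ok = (
        (ev["dau1_pt"] > 25 and ev["dau2_pt"] > 25
         and (ev["dau1_pt"] <= 40 or ev["dau2_pt"] <= 40)
         and ev["VBFjj_mass"] > 800
         and ev["VBFjet1_pt"] > 140 and ev["VBFjet2_pt"] > 60)
        or ev["isVBFtrigger"] == 0
    )
    return (ev["isVBF"] == 1 and ev["VBFjj_mass"] > 500
            and ev["VBFjj_deltaEta"] > 3
            and (b1_medium or b2_medium) and trigger_ok)

def categorize(ev):
    """Return '2b0j', '1b1j', 'boosted', 'VBF', or None if no category matches."""
    if passes_vbf(ev):
        return "VBF"
    if ev["isVBF"] == 1:
        return None  # flagged as VBF but failing the VBF category cuts
    b1 = ev["bjet1_bID_deepFlavor"]
    b2 = ev["bjet2_bID_deepFlavor"]
    if ev["isBoosted"] == 1:
        return "boosted" if (b1 > LOOSE_WP and b2 > LOOSE_WP) else None
    n_medium = (b1 > MEDIUM_WP) + (b2 > MEDIUM_WP)
    if n_medium == 2:
        return "2b0j"
    if n_medium == 1:
        return "1b1j"
    return None

# Hypothetical event, values invented for illustration
example = {
    "isBoosted": 0, "isVBF": 0, "isVBFtrigger": 0,
    "bjet1_bID_deepFlavor": 0.90, "bjet2_bID_deepFlavor": 0.10,
    "VBFjj_mass": 0.0, "VBFjj_deltaEta": 0.0,
    "dau1_pt": 45.0, "dau2_pt": 50.0,
    "VBFjet1_pt": 0.0, "VBFjet2_pt": 0.0,
}
print(categorize(example))  # 1b1j
```

With this reading the four categories are mutually exclusive, and an event that matches none of them is simply dropped.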

### 2.4 For next time

Write code that creates the four categories explained above and populates them with the events from the SKIMS.

Remember to apply the correct weights to the correct events.

Plot some interesting kinematic variables for each category. I leave it to you to decide what can be interesting.

**WORK TOGETHER, I WOULD LIKE ALL OF YOU TO ARRIVE WITH THE SAME LEVEL OF KNOWLEDGE OF WHAT YOU HAVE DONE!**

**AS ALWAYS, I AM AVAILABLE FOR ANY PROBLEM OR DOUBT. CONTACT ME ON SKYPE. BUT I ASK YOU TO TRY AND FIGURE OUT DOUBTS TOGETHER BEFORE ASKING ME**