## Calculating the invariant mass

In this section we begin the data analysis by calculating the invariant masses of the muon pairs detected in the collision events. The analysis is done in the Python programming language.

The data used in the analysis was collected by the CMS detector in 2011. A CSV file containing only some of the collision events and some of the information has been derived from the original data. The original data is saved in the AOD format, which can be read with the ROOT program. Open the link http://opendata.cern.ch/record/17 and check how large the original data file is in the section _Dataset characteristics_.

Only collision events with exactly two detected muons have been selected from the original data file into the CSV file. The selection was done with code similar to that at http://opendata.cern.ch/record/552. In practice the code selects the wanted values from the original file and writes them to the CSV file. You can get an example of a CSV file by opening http://opendata.cern.ch/record/545 and downloading one of the CSV files from the bottom of the page.

The CSV file used in this exercise is already saved in the same repository as this notebook file. Now let's read the file with Python and start the analysis!

### Initialisation and getting the data

In the code cell below the needed Python modules _pandas_, _numpy_ and _matplotlib.pyplot_ are imported and named _pd_, _np_ and _plt_. Modules are files that contain functions and commands for the Python language. They are imported because not everything needed in the exercise can be done with Python's built-in functions.

The data file from the repository is also read and saved to the variable named `ds`. __Don't change the name of the variable.__ The file is read with the function `read_csv()` from the pandas module, so in the code the function has to be prefixed with a reference to the pandas module (which we named _pd_).

First we want to find out how many collision events (in this case, data rows) there are in the data file. Add the code needed to print the number of rows of the imported file to the code cell below. In Python, printing is done with the `print()` function, with the thing to be printed written inside the brackets. The length of an object can be determined with the `len()` function, with the variable whose length is wanted written inside the brackets.
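As a minimal sketch of `print()` and `len()` working together, the example below builds a tiny stand-in table in memory instead of reading the real file. The column names and values are made up for illustration only.

```python
import io
import pandas as pd

# A made-up three-row CSV text standing in for the real data file.
csv_text = "Run,Event,Q1,Q2\n1,10,1,-1\n1,11,-1,1\n2,12,1,1\n"
example = pd.read_csv(io.StringIO(csv_text))

# len() of a DataFrame is its number of rows (the header row is not counted).
number_of_rows = len(example)
print(number_of_rows)
```

The same two functions work unchanged on the variable `ds` once the real file has been read.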

You can run the code cell by clicking it active and then pressing CTRL + ENTER. Feel free to test different solutions for printing the length of the file.

After you have printed the number of rows in the data file, you can move on to the next section.

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Read the data file 'DoubleMuRun2011A.csv' and save it to the variable 'ds'
ds = pd.read_csv('./DoubleMuRun2011A.csv')
print(len(ds))
# Add your own code to print the number of collision events in the datafile!

#### What does the file look like?

The file was saved as a _DataFrame_ structure (practically a table) of the _pandas_ module in a variable called `ds`. Next, print the first five rows of the file to see what it looks like. The `print()` function prints the variable given inside the brackets. With _variablename_`.head()` you get the first five rows of the data file; replace _variablename_ with the name of your variable.
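The behaviour of `.head()` can be tried out on a small table first. The sketch below uses a made-up eight-row DataFrame (the values are not from the real data file).

```python
import pandas as pd

# A made-up table with eight rows standing in for the real data.
example = pd.DataFrame({
    "Event": range(100, 108),
    "E1": [10.5 + i for i in range(8)],
})

first_rows = example.head()   # the first five rows by default
print(first_rows)
print(example.head(3))        # head() also accepts the number of rows wanted
```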

Write code that prints the first five rows of the data file and run the code cell by clicking it active and pressing CTRL + ENTER.

The "\\" symbols in the output indicate that a row does not fit on the screen and continues on the following rows of the output. The first row shows which information about the muon pairs the file contains. For example, E1 is the energy of the first muon and E2 the energy of the second. The different values are listed here:

- Run = number of the run where data has been collected from
- Event = number of the collision event
- Type = type of the muon: a global muon (G) has been measured both in the silicon tracker and in the muon chambers, a tracker muon (T) only in the silicon tracker (these classifications are hypotheses, since the type cannot be known with certainty)
- E = energy of the muon
- px, py, pz = components of the momentum of the muon
- pt = transverse momentum, that is, the component of the muon's momentum perpendicular to the particle beams
- eta = $\eta$ = pseudorapidity, a coordinate describing an angle (see Image 8)
- phi = $\phi$ = azimuthal angle, also a coordinate describing an angle (see Image 8)
- Q = electrical charge of the muon
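The momentum-related values in the list are not independent of each other. The sketch below uses the standard collider definitions (stated here as an assumption, since the list above does not spell them out) to compute pt, phi and eta from made-up px, py and pz values.

```python
import numpy as np

# Made-up momentum components of one particle, in GeV.
px, py, pz = 3.0, 4.0, 12.0

pt = np.sqrt(px**2 + py**2)   # transverse momentum
phi = np.arctan2(py, px)      # azimuthal angle in radians
eta = np.arcsinh(pz / pt)     # pseudorapidity

print(pt, phi, eta)
```

A particle moving exactly perpendicular to the beams (pz = 0) has eta = 0, and eta grows in magnitude as the direction approaches the beam line.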

#### Calculating the invariant mass

Next, calculate the invariant mass of the muon pair in each event from the values in the data file. You have to write the equation only once, since the code evaluates it automatically for every row of the file.

For example, if you wanted to sum the electric charges of the two muons in each event and save the results in a variable _charges_, you could do it with the following code:
```
charges = ds.Q1 + ds.Q2
```

So you have to tell in the code that Q1 and Q2 refer to values in the variable `ds`. This is done by writing the variable name and a dot in front of the wanted value, as in the example above.
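To see the row-by-row behaviour in isolation, here is a small sketch with a made-up three-event table. It also shows the bracket notation, which pandas accepts as an alternative to the dot notation.

```python
import pandas as pd

# Made-up charges of the two muons in three events.
example = pd.DataFrame({"Q1": [1, -1, 1], "Q2": [-1, -1, 1]})

# example.Q1 and example["Q1"] refer to the same column.
charges = example.Q1 + example.Q2
charges_alt = example["Q1"] + example["Q2"]

# One sum per row, even though the equation was written only once.
print(charges.tolist())
```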

The equation for the invariant mass contains square root, cosine and hyperbolic cosine terms. These are available in the _numpy_ module, which we named _np_. You get a square root with the function `np.sqrt()`, a cosine with `np.cos()` and a hyperbolic cosine with `np.cosh()`. Whatever is under the square root or inside the brackets in the equation goes inside the brackets of the function.
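The numpy functions act on whole columns at once. The sketch below illustrates this with made-up values and, as a possible cross-check of your own result, computes the invariant mass in a different way: from the summed energies and momentum components, $M^2 = (E_1 + E_2)^2 - |\vec{p}_1 + \vec{p}_2|^2$. This is not the equation asked for in the exercise, and the column names `E1`, `px1`, ... are assumed here.

```python
import numpy as np
import pandas as pd

# Two made-up events; the column names E1, px1, ... are assumptions.
example = pd.DataFrame({
    "E1": [45.0, 5.0], "px1": [45.0, 3.0], "py1": [0.0, 4.0], "pz1": [0.0, 0.0],
    "E2": [45.0, 5.0], "px2": [-45.0, 3.0], "py2": [0.0, -4.0], "pz2": [0.0, 0.0],
})

# numpy functions are applied to every row of a column at once.
print(np.cosh(example.pz1).tolist())

# Invariant mass from the summed four-momentum: M^2 = E^2 - |p|^2.
energy = example.E1 + example.E2
p_squared = ((example.px1 + example.px2)**2
             + (example.py1 + example.py2)**2
             + (example.pz1 + example.pz2)**2)
# np.maximum guards against tiny negative values caused by rounding.
mass_check = np.sqrt(np.maximum(energy**2 - p_squared, 0.0))

print(mass_check.tolist())
```

For real muons the two approaches should agree closely, because the muon mass is small compared with the energies involved.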

__Write code below__ that calculates the invariant mass of the muon pair in each collision event in the data file. Save the calculated values in the variable `invariant_mass`, which is already written in the code cell. Don't change the name of the variable.

After running, the code prints the first five calculated values. The output also tells whether the values are correct; this is checked by a small piece of code at the end of the cell.

You can get help from the theory part.

In [5]:
invariant_mass =
print('The first five values calculated (in units GeV):')
print(invariant_mass[0:5])

# Rest of the code is for checking if the values are correct. You don't have to change that.
if 14.31 <= invariant_mass.values[4] <= 14.32:
    print('Invariant mass values are correct!')
else:
    print('Calculated values are not yet correct. Please check the calculation one more time.')
    print("Remember: don't change the name of the variable invariant_mass.")

## Making the histogram

Next let's make a histogram of the calculated invariant mass values. A histogram describes how the values are distributed, that is, how many values fall in each bin. Image 9 shows a histogram of how the amount of cash in a wallet is distributed for a random group of people. One can see from the histogram that, for example, the most common amount of cash was 10–15 euros (12 persons had this amount).

<figure>
    <img src="../images/histogram.png" alt="image missing" style="height: 350px" />
    <figcaption>Image 9: An example histogram from the distribution of the amount of cash.</figcaption>
</figure>

#### Creating the histogram

Histograms can be created in Python with the _matplotlib.pyplot_ module, which was imported earlier and named _plt_. The function `plt.hist()` creates a histogram from the parameters given inside the brackets. The parameters are documented at https://matplotlib.org/devdocs/api/_as_gen/matplotlib.pyplot.hist.html.

Now only the first three parameters are needed: the variable whose values the histogram is created from (_x_), the number of bins (_bins_) and the lower and upper limits of the bins (_range_).
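The three parameters can be tried out on random numbers before using the real data. The sketch below is only an illustration: the values, the bin count and the range are made up and are not the ones to use for the Z boson.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up values: 1000 random numbers scattered around 50.
rng = np.random.default_rng(seed=0)
values = rng.normal(loc=50, scale=5, size=1000)

# x = values, bins = 40 bins, range = only values between 30 and 70.
counts, edges, _ = plt.hist(values, bins=40, range=(30, 70))

plt.xlabel('Value')
plt.ylabel('Number of entries')
plt.title('Example histogram of made-up values')
# plt.show()
```

`plt.hist()` also returns the bin contents and the bin edges, which is handy for checking that the range and the number of bins are what you intended.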

Write code that creates a histogram of the calculated invariant mass values. Because this exercise focuses on the Z boson, set the range wisely to show the values near the mass of the Z boson. Use the Z boson mass that you looked up earlier from the Particle Data Group as a reference.

Try to find the number of bins that gives the clearest histogram. You can try different values and see how they affect the histogram.

The code already contains lines for naming the axes and the title of the histogram. There are also comments marked with # symbols. These comments don't affect the functionality of the code.

In [7]:
# Write here the code that creates the histogram.


#plt.show()

### Question 3

Describe the histogram. What information can you get from it?

## The histogram of the whole data

As an example, let's also create a histogram of all the invariant masses in the data file, without limiting the range to the neighbourhood of the Z boson peak.

Plot the whole dataset DoubleMuRun2011A.csv as a histogram. Make the y-axis logarithmic and put the base-10 logarithms of the invariant masses, $\log_{10}(\text{mass})$, on the x-axis. A value on the x-axis is converted back to a mass by raising 10 to that power. For example, for the x-axis value 0.5:

$$
\log_{10}(\text{mass}) = 0.5
$$

$$
10^{\log_{10}(\text{mass})} = 10^{0.5}
$$

$$
\text{mass} = 10^{0.5} \approx 3.162 \, \text{GeV}
$$
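The steps of such a plot can be sketched with made-up masses, as below. The random masses, the axis range and the labels are assumptions for illustration. The weight is the one suggested in the exercise cell, where the factor 1/(mass · ln 10) is the derivative of $\log_{10}(\text{mass})$ with respect to the mass.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up masses in GeV, spread evenly in log10 between 0.3 and 300 GeV.
rng = np.random.default_rng(seed=0)
masses = 10 ** rng.uniform(np.log10(0.3), np.log10(300), size=10000)

no_bins = 500
log_masses = np.log10(masses)
# The weight suggested in the exercise cell.
weights = no_bins / np.log(10) / masses

counts, edges, _ = plt.hist(log_masses, bins=no_bins, range=(-0.5, 2.5),
                            weights=weights)
plt.yscale('log')
plt.xlabel('log10(invariant mass [GeV])')
plt.ylabel('Weighted number of events')
# plt.show()

# Reading the axis: x = 0.5 corresponds to a mass of 10**0.5 GeV.
print(10 ** 0.5)
```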

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Read the data and calculate the invariant mass
ds = pd.read_csv('../Data/DoubleMuRun2011A.csv')
invariant_mass_1 = 

no_bins = 500
# Calculate the logarithms of the masses and the weights.

# Plot the weighted histogram. Use weight = no_bins/np.log(10)/invariant_mass_1

# Name the labels and the title.
#plt.show()

### Question 4

Compare the histogram that you created with the histogram published by the CMS experiment in Image 10 below.

<figure>
    <img src="../images/CMShistogram.png" alt="image missing" style="height: 350px" />
    <figcaption>Image 10: The histogram of the invariant masses published by the CMS experiment. &copy; <a href="https://arxiv.org/abs/1206.4071">CMS Collaboration</a> [5]</figcaption>
</figure>

### Final exercise

Your final exercise is to try to replicate the famous plot below. You can use random numbers from numpy to generate the distributions. Your goal is to generate two distributions, plot them as histograms and sum them together. Try to include the labels etc. in the plot as well. You don't have to worry about the uncertainty bands!

<figure>
    <img src="../images/higgsgammagamma.png" alt="image missing" style="height: 350px" />
    <figcaption>Image 11: Try to replicate this plot using random data</figcaption>
</figure>
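One possible way to combine two randomly generated distributions is sketched below. All shapes and numbers are assumptions chosen for illustration (a falling exponential as the background and a narrow Gaussian peak as the signal); they are not fitted to the published plot.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=1)

# Assumed shapes: a falling background and a narrow peak on top of it.
background = 100 + rng.exponential(scale=30, size=20000)
signal = rng.normal(loc=125, scale=2, size=500)

# Use the same bin edges for both so that the counts can be added bin by bin.
bin_edges = np.linspace(100, 160, 31)
bkg_counts, _ = np.histogram(background, bins=bin_edges)
sig_counts, _ = np.histogram(signal, bins=bin_edges)
total_counts = bkg_counts + sig_counts

centres = 0.5 * (bin_edges[:-1] + bin_edges[1:])
plt.step(centres, bkg_counts, where='mid', label='Background only')
plt.errorbar(centres, total_counts, yerr=np.sqrt(total_counts), fmt='k.',
             label='Signal + background')
plt.xlabel('Invariant mass [GeV]')
plt.ylabel('Events per bin')
plt.legend()
# plt.show()
```

Computing the bin contents with `np.histogram()` first keeps the summing step explicit; the error bars here are simply the square roots of the bin contents.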

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

[1] P. Mouche, *Overall view of the LHC. Vue d'ensemble du LHC*, 2014.
Url: [https://cds.cern.ch/record/1708847](https://cds.cern.ch/record/1708847).

[2] M. Brice, *View of an open LHC interconnection. Vue d'une interconnection ouverte*, 2005.
Url: [https://cds.cern.ch/record/905940](https://cds.cern.ch/record/905940).

[3] CMS Collaboration, *Detector Drawings*, 2012.
Url: [https://cds.cern.ch/record/1433717](https://cds.cern.ch/record/1433717).

[4] M. Lapka, D. Barney, E. Quigg et al., *Interactive slice of CMS detector*, 2010.
Url: [https://cms-docdb.cern.ch/cgi-bin/PublicDocDB/ShowDocument?docid=4172](https://cms-docdb.cern.ch/cgi-bin/PublicDocDB/ShowDocument?docid=4172).

[5] CMS Collaboration, *Performance of CMS muon reconstruction in pp collision events at $\sqrt{s} =$ 7 TeV*, 2012.
Url: [arXiv:1206.4071](https://arxiv.org/abs/1206.4071).