# 6: Histograms

In nearly every High Energy Physics analysis, histograms have their place:
- plotting of variables
- calculating efficiency correction tables (weights)
- performing binned fits

and many more.

Scikit-HEP offers libraries for working with histograms in Python efficiently:
[hist](https://hist.readthedocs.io/en/latest/) is a user-friendly analysis library for histograms,
built directly on top of the workhorse [boost-histogram](https://boost-histogram.readthedocs.io/en/latest/),
which can also be used on its own. hist simply provides more functionality.

Both are written by the same authors and can often be used together.

Both also work well with mplhep, the plotting library.

In [None]:
# jupyter magic to load the previous data sets
%store -r bkg_df
%store -r mc_df
%store -r data_df

import boost_histogram as bh
import hist
import mplhep
import numpy as np

In [None]:
# Let's get started with a simple example

# Compose axes however you like; this is a 2D histogram
h = bh.Histogram(bh.axis.Regular(2, 0, 1),
                 bh.axis.Regular(4, 0.0, 1.0))

# Filling can be done with arrays, one per dimension
h.fill([.3, .5, .2],
       [.1, .4, .9])

# NumPy array view into histogram counts, no overflow bins
counts = h.view()
variances = h.variances()
mplhep.hist2dplot(h)

## Axes

A central part of a histogram is its axes: they define the binning and other traits of each axis.

A Hist (*this refers by default to hist.Hist, but usually also applies to bh.Histogram, as the former inherits
from the latter*) can have multiple axes of different types.

All axes are described [here](https://hist.readthedocs.io/en/latest/user-guide/axes.html#axes).



The most important types are:


### Regular

This is an axis with lower and upper limits, **regularly** split into n bins.

```
axis_reg = hist.axis.Regular(nbins, lower, upper, name=name)
```

### Variable

A variable axis lets you set the bin edges arbitrarily using an array-like object.
```
axis_var = hist.axis.Variable([0, 0.5, 3.1, 3.4], name="eta")
```

## Axis Name

In hist, an axis can have a name (boost-histogram axes only carry generic `metadata`), which can be used as the identifier
when working with the histogram (instead of plain integer indices).

In [None]:
start, stop = data_df['Jpsi_M'].min(), data_df['Jpsi_M'].max()
axis1 = hist.axis.Regular(bins=50, start=start, stop=stop, name="mass")

To create a histogram, we pass one or more axes to `hist.Hist`.

In [None]:
data_h = hist.Hist(axis1)

In [None]:
data_h.fill(data_df['Jpsi_M'])

In [None]:
mc_h = hist.Hist(axis1).fill(mc_df['Jpsi_M'])  # we can also chain the commands

### Compatibility with mplhep

Along with bh and hist, the [Unified Histogram Interface](https://github.com/scikit-hep/uhi) was born. It lets histogram objects describe themselves so that a library such as mplhep knows how to plot them.

In short, mplhep and hist work together seamlessly:

In [None]:
mplhep.histplot(data_h)

### Plotting with hist


hist itself also provides plotting functionality.

In [None]:
data_h.plot1d()

In [None]:
data_h.plot1d()
mc_h.plot1d()

In [None]:
mc_df.columns

## Multiple dimensions

Histograms can be multidimensional. Let's add a second dimension.

In [None]:
start, stop = data_df['BDT'].min(), data_df['BDT'].max()
axis_bdt = hist.axis.Regular(bins=20, start=start, stop=stop, name="BDT")

In [None]:
mc_h2d = hist.Hist(axis1, axis_bdt).fill(BDT=mc_df['BDT'], mass=mc_df['Jpsi_M']) # using names

In [None]:
data_h2d = hist.Hist(axis1, axis_bdt)
data_h2d.fill(data_df['Jpsi_M'], data_df['BDT']) # order based

In [None]:
mplhep.hist2dplot(data_h2d)

## Access Bins

hist lets you access the bins of your Hist in various ways. Besides the normal access by index, you can use locations (also supported by boost-histogram), complex numbers, and dictionaries.

In [None]:
# Access by bin number
data_h2d[35, 5]

## Getting Density

If you want the density of an existing histogram, `.density()` returns it as an array without the underflow and overflow bins.

A histogram is a count, so it is an **integral over a density**. To obtain the density, divide by the area of the bin; this gives the "average density" in a bin.

In [None]:
data_h2d.density()

## Projecting axes

We can also project onto a certain axis.

In [None]:
data_h2d.project("mass")  # returns a 1D histogram over the mass axis

## Accessing everything relevant

Hist is transparent and lets us access many of its internals.

In [None]:
data_h2d.axes

In [None]:
data_h2d.axes['mass']

In [None]:
data_h2d.axes['mass'].edges

In [None]:
data_h2d.axes['mass'].centers  # bin centers

In [None]:
data_h2d.axes['mass'].widths  # bin widths

### Multidimensional

All of these attributes are also available on `axes` directly, in a form ready to be broadcast: each array has the shape (1, ..., N, ..., 1).

In [None]:
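# Tuples of arrays, one entry per axis, shaped for broadcasting
data_h2d.axes.edges
data_h2d.axes.centers
data_h2d.axes.widths
# Multiply the per-axis widths; broadcasting yields one area per 2D bin
width_mass, width_bdt = data_h2d.axes.widths
areas = width_mass * width_bdt
print(f"areas = {areas}")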

**Exercise**: can you obtain the density?

Hint: use `h.values()` or `h.view()` on your histogram `h` (`view()` returns a NumPy view into the storage, not a copy).

## Arithmetics

We can use histograms to do math! Histograms can be added to or multiplied with each other, or combined with scalars.

We can find the ratio between two histograms by dividing them.

In [None]:
data_df_bdt = data_df.query("BDT > 0.9")

data_bdt_h2d = hist.Hist(axis1, axis_bdt)
data_bdt_h2d.fill(data_df_bdt['Jpsi_M'], data_df_bdt['BDT']) # order based

In [None]:
ratio = data_bdt_h2d.project("mass") / data_h2d.project("mass")

In [None]:
ratio.plot1d()

In [None]:
ratio_large = ratio * 10
ratio_large.plot1d()

**Exercise**: use subtraction to "remove" the signal from the data using the BDT-cut histogram. The result should be the same as filling with only "BDT <= 0.9".

## Weights

Weights are an essential part of HEP histograms, and hist fully supports them. We can simply pass an array of weights when filling the histogram.

We first need to set the storage type to `Weight` so that the histogram keeps track of the weights.

In [None]:
weight = np.random.normal(1., 0.1, size=mc_df.shape[0])
storage = hist.storage.Weight()
mc_h2d = hist.Hist(axis1, axis_bdt, storage=storage).fill(
    BDT=mc_df['BDT'], mass=mc_df['Jpsi_M'], weight=weight
)  # using names

In [None]:
mc_h2d

In [None]:
mc_h2d.variances()

**Exercise**: implement a function that calculates a weighted chi2 using two histograms.