# Part 2: boost-histogram plans and Hist

Run the code with us through Binder, altering examples and asking "what if" questions along the way :)

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/henryiii/histogram-tutorial/master?filepath=talk_2_bhn_hist.ipynb)

## Boost-histogram numbafication plans

In HEP, an "event loop" is a loop that processes collision events one at a time. To speed up HEP data analysis, the Scikit-HEP ecosystem has long aimed for a fully Numba-enabled event loop. The Awkward Array and Vector parts are mostly developed, which leaves histogramming as the one missing step. We plan to let boost-histogram (BH) fill from inside a Numba loop without stepping back through Python. Currently, we can achieve the following ([example from the Vector documentation](https://vector.readthedocs.io/en/latest/usage/structure.html)):

```python
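import numba as nb
import numpy as np
import vector

# Assumed to exist already: `awkarray` (an Awkward Array of vector records)
# and `hist` (a histogram with one axis for the mass).


@nb.njit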
def compute_masses(awkarray):
    out = np.empty(len(awkarray), np.float64)
    for i, event in enumerate(awkarray):
        total = vector.obj(px=0.0, py=0.0, pz=0.0, E=0.0)
        for vec in event:
            total = total + vec
        out[i] = total.mass
    return out

out = compute_masses(awkarray)
hist.fill(out)
```

Our goal is to make this work:

```python
@nb.njit
def compute_masses(hist, awkarray):
    for event in awkarray:
        total = vector.obj(px=0.0, py=0.0, pz=0.0, E=0.0)
        for vec in event:
            total = total + vec
        hist.fill(total.mass)

compute_masses(hist, awkarray)
```

## Hist

Hist extends boost-histogram, including features like:

* Named axes and labels
    * You can even *force* the use of names everywhere with `NamedHist`
* Fancy Jupyter reprs
* UHI+: faster, easier to type indexing additions
* QuickConstruct: a system to reduce the typing when making histograms
* Stack of Histograms, including from categorical axes
* Pie plots
* Loading tables from Pandas
* Compute profiles from existing histograms
* Sorting categorical axes

### Name shortcuts in Hist

Hist lets you name boost-histogram axes. Names are unique identifiers within a histogram, and methods such as `.fill()` and `.project()` accept them in place of positional indices. In particular, hist provides `NamedHist`, which requires name-based access for histograms whose axes are all named.

In [None]:
import numpy as np
import boost_histogram as bh
import matplotlib.pyplot as plt
import pandas as pd


from hist import axis, Hist, Stack

# named axes
reg_axis = axis.Regular(10, -3, 3, overflow=False, underflow=False, name="X", label="x [unit]")
var_axis = axis.Variable(range(-5, 6), name="Y", label="y [unit]")
int_axis = axis.Integer(-3, 3, overflow=True, underflow=True, name="Z", label="z [units]")

In [None]:
# histograms with named axes
h = Hist(reg_axis, var_axis, int_axis)

print("Name of axis 0: \t" + h.axes[0].name + ";")
print("Label of axis 1: \t" + h.axes[1].label + ".")

In [None]:
# Normal access
h.fill(np.random.randn(100), np.random.randn(100), np.random.randn(100))
h_2d = h.project(0, 1)

# Named access (safer and more readable)
h.fill(X=np.random.randn(100), Y=np.random.randn(100), Z=np.random.randn(100))
h_2d = h.project("X", "Y")

### Hist Repr

Hist has custom reprs for display in a Jupyter notebook, and they support dark mode.

In [None]:
h_2d.project("X")

In [None]:
h_2d

In [None]:
h

Besides the fancy repr, users can view the data explicitly with `.plot()` (which uses mplhep as the backend) and `.show()` (which uses histoprint).

### UHI+

Unified Histogram Indexing (UHI) is one of the most important features of hist; it gives HEP users convenient shortcuts for accessing bins. For example, to access the central bin of a 2D histogram, we can write:

In [None]:
# boost-histogram UHI
print(h_2d[5, 5])
print(h_2d[{0: 5, 1: 5}])

# hist UHI+
print(h_2d[{"X": 5, "Y": 5}])
print(h_2d[{"X": bh.loc(0), "Y": bh.loc(0)}])
print(h_2d[.8j, .5j])  # same as bh.loc(.8), bh.loc(.5); note that x = 0.8 falls in X bin 6

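UHI+ also supports rebinning with the same `j` suffix: used as a slice step, `2j` is shorthand for `bh.rebin(2)`. In any other position the suffix still denotes a data coordinate, so `-.5j` selects the bin containing -0.5.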

In [None]:
h_2d[:, -.5j]

In [None]:
h_2d[0:10:2j, -.5j]

### Quick Construct Shortcuts

Besides the standard boost-histogram constructor, hist provides QuickConstruct: start from `Hist.new`, chain one call per axis, and finish with a storage type.

In [None]:
unnamed_hist = (
    Hist.new.Reg(50, -5, 5, flow=False)
    .Var(range(-25, 30))
    .Int(-3, 3, flow=True)
    .Double()
)

named_hist = (
    Hist.new.Reg(50, -5, 5, flow=False, name="X", label="x [unit]")
    .Var(range(-25, 30), name="Y", label="y [unit]")
    .Int(-3, 3, flow=True, name="Z", label="z [units]")
    .Double()
)

In [None]:
unnamed_hist.fill(np.random.randn(100), 5*np.random.randn(100), np.random.randn(100)).project(0, 1)

In [None]:
named_hist.fill(X=np.random.randn(100), Y=5*np.random.randn(100), Z=np.random.randn(100)).project("X", "Y")

### Hist Stack

In [None]:
named_hist_copy = (
    Hist.new.Reg(50, -5, 5, flow=False, name="X", label="x [unit]")
    .Var(range(-25, 30), name="Y", label="y [unit]")
    .Int(-3, 3, flow=True, name="Z", label="z [units]")
    .Double()
).fill(X=.5*np.random.randn(100)+3*np.ones(100), Y=5*np.random.randn(100), Z=np.random.randn(100))

s = Stack(named_hist.project(0), named_hist_copy.project("X"))
s.plot()
plt.show()

In [None]:
s.plot(stack=True, histtype='fill')
plt.show()

In [None]:
# Fill with data tagged with quality="good" or "bad"
h = Hist.new.Reg(50, -5, 5, name="x").StrCat(["good", "bad"], name="quality").Double().fill(
    x=np.random.randn(100), quality=["good", "good", "good", "good", "bad"]*20
)

# Turn an existing axis into a stack
s = h.stack("quality")

s.plot(color=["indianred", "steelblue"])
plt.legend()
plt.show()

In [None]:
s.plot(stack=True, histtype='fill')
plt.legend()
plt.show()

In [None]:
s[::-1].plot(stack=True, histtype='fill')
plt.legend()
plt.show()

In [None]:
print(s[0].name)
s[0]

In [None]:
print(s[1].name)
s[1]

### Pandas support in Hist

You can read dicts or pandas DataFrames directly into Hist. The following dataset contains PyPI download counts.

In [None]:
data = pd.read_csv(
    "results-20210227-133657 - results-20210227-133657.csv",
    usecols=("cpu", "num_downloads", "python_version", "pip_version", "glibc_version", "policy"),
    converters={
        "python_version": str,
        "pip_version": lambda x: int(x.split(".")[0]),
        "glibc_version": lambda x: int(float(x.split("-")[0]) % 1 * 100),
    },
)

pd_hist_4d = Hist.from_columns(
    data,
    ("cpu", "python_version", "pip_version", "policy"),
    weight="num_downloads",
)

Now we use Hist's `plot_pie`:

In [None]:
fig, axs = plt.subplots(2, 3, figsize=(9, 6))
for i, py in enumerate(["2.6", "2.7", "3.6", "3.7", "3.8", "3.9"]):
    ax = axs.flatten()[i]
    ph = pd_hist_4d.project("python_version", "pip_version")[py, :]
    ph.plot_pie(ax=ax, normalize=True, autopct='%1.0f%%', pctdistance=.8)
    ax.set_title(f"Python {py} {int(ph.sum()) // 1000000:,} M")

plt.tight_layout()
plt.show()

### Other Shortcuts

We can get the density of an existing histogram via `.density()`.

In [None]:
named_hist = (
    Hist.new.Reg(50, -3, 3, flow=False, name="X", label="x [unit]")
    .Var(range(-25, 30), name="Y", label="y [unit]")
    .Int(-3, 3, flow=True, name="Z", label="z [units]")
    .Double()
)
named_hist.fill(X=np.random.randn(100), Y=5*np.random.randn(100), Z=np.random.randn(100))
named_hist.project("X")[25:30].density()

In [None]:
xy = np.array([[-2, 1.5], [-2, 1.5], [0.0, -2.0], [0.0, -2.0], [0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
h = Hist(axis.Regular(5, -5, 5, name="x"), axis.Regular(5, -5, 5, name="y")).fill(*xy.T)
h_profile = h.profile("y")
h.values()

In [None]:
h_profile.values()