# Histogrammar basic tutorial

Histogrammar is a Python package for making histograms from NumPy arrays and from pandas and Spark dataframes. (There is also a Scala backend for Histogrammar.)

This basic tutorial shows how to:
- make histograms with NumPy arrays and pandas dataframes,
- plot them,
- make multi-dimensional histograms,
- use the various histogram types,
- make many histograms at once,
- store and retrieve them.

Enjoy!

In [None]:
%%capture
# install histogrammar (if not installed yet)
import sys

!"{sys.executable}" -m pip install histogrammar

In [None]:
import histogrammar as hg

In [None]:
import pandas as pd
import numpy as np
import matplotlib

## Data loading
Let's first load some data!

In [None]:
# open a pandas dataframe for use below
from histogrammar import resources
df = pd.read_csv(resources.data("test.csv.gz"), parse_dates=["date"])

In [None]:
df.head()

## Let's fill a histogram!

Histogrammar treats histograms as objects. You will see this has various advantages.

Let's fill a simple histogram with a numpy array.

In [None]:
# this creates a histogram with 100 equal-width bins in the half-open range [-5, 5)
hist1 = hg.Bin(num=100, low=-5, high=5)

In [None]:
# filling it with one data point:
hist1.fill(0.5)

In [None]:
print(hist1.entries)

In [None]:
# filling the histogram with an array:
hist1.fill.numpy(np.random.normal(size=10000))

In [None]:
print(hist1.entries)

In [None]:
# let's plot it
hist1.plot.matplotlib()

In [None]:
# Alternatively, you can call this to make the same histogram:
# hist1 = hg.Histogram(num=100, low=-5, high=5)

Histogrammar also supports "sparse" histograms, which are open-ended. Bins in a sparse histogram only get created and filled if the corresponding data points are encountered.

A sparse histogram has a bin-width, and optionally a bin-origin parameter. Sparse histograms are nice if you don't want to restrict the range, for example for tracking data distributions over time, which may have large, sudden outliers.
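
To make the idea concrete, here is a minimal sketch of sparse binning in plain Python. It is not Histogrammar's implementation: the helper name `sparse_fill` and the sample ages are made up for illustration. A bin is identified by the integer `floor((x - origin) / bin_width)` and only exists once a value lands in it.

```python
import math
from collections import defaultdict


def sparse_fill(values, bin_width, origin=0.0):
    """Count values per bin, creating a bin only when a value lands in it."""
    counts = defaultdict(int)
    for x in values:
        # integer index of the bin that contains x
        index = math.floor((x - origin) / bin_width)
        counts[index] += 1
    return dict(counts)


ages = [23, 27, 31, 38, 64, 250]  # made-up data; 250 is a deliberate outlier
counts = sparse_fill(ages, bin_width=10)

for index in sorted(counts):
    low = index * 10
    print(f"[{low}, {low + 10}): {counts[index]}")
```

The outlier at 250 creates one extra bin instead of forcing a wider fixed range, and the empty bins in between take up no space.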

In [None]:
hist2 = hg.SparselyBin(binWidth=10)

In [None]:
hist2.fill.numpy(df['age'].values)

In [None]:
hist2.plot.matplotlib()

In [None]:
# Alternatively, you can call this to make the same histogram:
# hist2 = hg.SparselyHistogram(binWidth=10)

## Filling from a dataframe

Importing histogrammar adds histogram-making functions to pandas (and Spark) dataframes. Their names all start with "hg_", for example hg_Bin or hg_SparselyBin.

Let's make the same 1d (sparse) histogram directly from a (pandas) dataframe.

In [None]:
hist3 = df.hg_SparselyBin(binWidth=10, origin=0, quantity='age')
hist3.plot.matplotlib()

Note that the column "age" is picked by setting quantity="age", and also that the filling step is done automatically.

In [None]:
# Alternatively, do:
hist3 = hg.SparselyBin(binWidth=10, quantity='age')
hist3.fill.numpy(df)
# ... where hist3 automatically picks up column age from the dataframe, 
# ... but needs to be filled by calling fill.numpy() explicitly.

### Handy functions

For any 1-dimensional histogram, you can extract the bin entries, edges and centers as follows:

In [None]:
# full range of bin entries, and those in a specified range:
print(hist3.bin_entries(), hist3.bin_entries(low=30, high=80))

In [None]:
# full range of bin edges, and those in a specified range:
print(hist3.bin_edges(), hist3.bin_edges(low=31, high=71))

In [None]:
# full range of bin centers, and those in a specified range:
print(hist3.bin_centers(), hist3.bin_centers(low=31, high=80))

In [None]:
# histograms with the same binning can be added together
hsum = hist2 + hist3
print(hsum.entries)

In [None]:
# and scaled by a constant factor
hsum *= 4
print(hsum.entries)
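
The two cells above add two histograms and scale the result. Conceptually, adding means summing the counts bin by bin, and scaling means multiplying every count by the same factor. The sketch below illustrates this with plain dictionaries. It is not Histogrammar's code, and `add_counts` and `scale_counts` are hypothetical helper names.

```python
from collections import Counter


def add_counts(a, b):
    """Bin-by-bin sum of two histograms stored as {bin_index: count}."""
    total = Counter(a)
    total.update(b)  # Counter.update adds counts instead of replacing them
    return dict(total)


def scale_counts(counts, factor):
    """Multiply every bin count by the same factor."""
    return {index: count * factor for index, count in counts.items()}


first = {2: 5, 3: 7}   # made-up bin counts
second = {3: 1, 4: 2}

combined = add_counts(first, second)
scaled = scale_counts(combined, 4)

print(sum(combined.values()))
print(sum(scaled.values()))
```

Bin-by-bin addition only makes sense when both histograms use the same bin definition, which is why the two histograms being added share the same bin width.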

There are also:
- IrregularlyBin histograms, with irregular bin edges, and
- CentrallyBin histograms, which are defined by bin centers instead of bin edges and are open-ended on both sides.
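
The difference between the two can be sketched in a few lines of plain Python. This is an illustration only: the function names and the `edges` values are made up. Reporting values outside the outermost edges as `None` is a simplification of this sketch, not a statement about Histogrammar's behaviour.

```python
from bisect import bisect_right


def irregular_bin_index(x, edges):
    """Index of the interval [edges[i], edges[i + 1]) containing x, or None."""
    if x < edges[0] or x >= edges[-1]:
        return None  # outside the binned range (simplification of this sketch)
    return bisect_right(edges, x) - 1


def central_bin_index(x, centers):
    """Index of the center closest to x, so every x gets a bin."""
    return min(range(len(centers)), key=lambda i: abs(x - centers[i]))


edges = [0, 18, 30, 50, 65, 120]
centers = [15, 25, 35, 45, 55, 65, 75, 85, 95]

print(irregular_bin_index(42, edges))   # 2, the interval [30, 50)
print(central_bin_index(42, centers))   # 3, the center 45
print(central_bin_index(300, centers))  # 8, the last center: open-ended
```

Assigning each value to its nearest center is what makes a center-based histogram open-ended: even a far outlier still has a closest center.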

In [None]:
hist4 = df.hg_CentrallyBin(centers=[15, 25, 35, 45, 55, 65, 75, 85, 95], quantity='age')
hist4.plot.matplotlib()

Note the slightly different plotting style for CentrallyBin histograms.

## Multi-dimensional histograms

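Let's make a multi-dimensional histogram. In Histogrammar, a multi-dimensional histogram is composed recursively: one histogram is passed as the `value` argument of another, so that every bin of the outer histogram holds an inner histogram.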

We will use histograms with irregular binning in this example.
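
As a rough mental model, such a nested histogram behaves like a dictionary of dictionaries: the outer key is the bin of the first variable, the inner key the bin of the second. The sketch below shows this in plain Python with made-up edges and points. `bin_index` and `fill_nested` are hypothetical helpers, not part of Histogrammar.

```python
from bisect import bisect_right
from collections import defaultdict


def bin_index(x, edges):
    """Index of the interval [edges[i], edges[i + 1]) that contains x."""
    return bisect_right(edges, x) - 1


def fill_nested(points, outer_edges, inner_edges):
    """Each outer bin (longitude) holds its own inner histogram (latitude)."""
    nested = defaultdict(lambda: defaultdict(int))
    for longitude, latitude in points:
        outer = bin_index(longitude, outer_edges)
        inner = bin_index(latitude, inner_edges)
        nested[outer][inner] += 1
    return {outer: dict(inner) for outer, inner in nested.items()}


longitude_edges = [-200, -100, 0, 100, 200]
latitude_edges = [-100, -50, 0, 50, 100]
# made-up (longitude, latitude) pairs, all inside the edges above
points = [(-150, 10), (-120, 20), (50, -60), (150, 75)]

nested = fill_nested(points, longitude_edges, latitude_edges)
print(nested)
```

Adding a third dimension repeats the same step: the two-dimensional structure becomes the content of each bin of yet another outer histogram.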

In [None]:
edges1 = [-100, -75, -50, -25, 0, 25, 50, 75, 100]
edges2 = [-200, -150, -100, -50, 0, 50, 100, 150, 200]

In [None]:
hist1 = hg.IrregularlyBin(edges=edges1, quantity='latitude')
hist2 = hg.IrregularlyBin(edges=edges2, quantity='longitude', value=hist1)

# for 3 dimensions or higher, simply pass the 2-dim histogram as the value argument
hist3 = hg.SparselyBin(binWidth=10, quantity='age', value=hist2)

In [None]:
hist2.fill.numpy(df)
hist2.plot.matplotlib()

In [None]:
# number of dimensions per histogram
print(hist1.n_dim, hist2.n_dim, hist3.n_dim)

## Histogram types

So far we have covered the histogram types: 
- Bin histograms: with a fixed range and even-sized bins,
- SparselyBin histograms: open-ended and with a fixed bin-width,
- IrregularlyBin histograms: using irregular bin edges,
- CentrallyBin histograms: open-ended and using bin centers.

All of these process numeric variables only.

### Categorical variables

For categorical variables, use the Categorize histogram:
- Categorize histograms: accept categorical variables such as strings and booleans.



In [None]:
histy = hg.Categorize('eyeColor')
histx = hg.Categorize('favoriteFruit', value=histy)

In [None]:
histx.fill.numpy(df)
histx.plot.matplotlib()

In [None]:
# show the datatype(s) of the histogram
print(histx.datatype)

Categorize histograms also accept booleans:

In [None]:
histy = df.hg_Categorize('isActive')
histy.plot.matplotlib()

In [None]:
print(histy.bin_entries())

In [None]:
print(histy.bin_labels())
# histy.bin_centers() will work as well for Categorize histograms

### Other histogram types

There are several more histogram types:
- Minimize, Maximize: keep track of the minimum or maximum value of a numeric distribution,
- Average, Deviate: keep track of the mean (Average), or of the mean and standard deviation (Deviate), of a numeric distribution,
- Sum: keep track of the sum of a numeric distribution,
- Stack: keep track of how many data points pass certain thresholds,
- Bag: works like a dict; it keeps track of all unique values encountered in a column, and can also do this for vectors of numbers. For strings, Bag works just like the Categorize histogram.

In [None]:
hmin = df.hg_Minimize('latitude')
hmax = df.hg_Maximize('longitude')
print(hmin.min, hmax.max)

In [None]:
havg = df.hg_Average('latitude')
hdev = df.hg_Deviate('longitude')
print(havg.mean, hdev.mean, hdev.variance)

In [None]:
hsum = df.hg_Sum('age')
print(hsum.sum)

In [None]:
# let's illustrate the Stack histogram with longitude distribution
# first we plot the regular distribution
hl = df.hg_SparselyBin(25, 'longitude')
hl.plot.matplotlib()

In [None]:
# Stack counts how many data points are greater than or equal to each of the provided thresholds
thresholds = [-200, -150, -100, -50, 0, 50, 100, 150, 200]

In [None]:
hs = df.hg_Stack(thresholds=thresholds, quantity='longitude')
print(hs.thresholds)
print(hs.bin_entries())

Stack histograms are useful to make efficiency curves.
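
An efficiency curve gives, for each threshold, the fraction of data points that pass it. The sketch below shows the arithmetic in plain Python on made-up values. `stack_counts` is a hypothetical helper that mimics the "greater than or equal to" counting described above; it is not Histogrammar's implementation.

```python
def stack_counts(values, thresholds):
    """For each threshold, count the values greater than or equal to it."""
    return [sum(1 for x in values if x >= t) for t in thresholds]


values = [-120, -30, 10, 45, 80, 160]  # made-up data
thresholds = [-200, -100, 0, 100, 200]

counts = stack_counts(values, thresholds)
# efficiency = fraction of all values that pass each threshold
efficiencies = [count / len(values) for count in counts]

for t, eff in zip(thresholds, efficiencies):
    print(f"x >= {t}: {eff:.2f}")
```

Because every count includes all values above the higher thresholds as well, the resulting curve never increases as the threshold goes up.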

With all these histograms you can make multi-dimensional histograms. For example, you can evaluate the mean and standard deviation of one feature as a function of bins of another feature. (A "profile" plot, similar to a box plot.) 

In [None]:
hav = hg.Deviate('age')
hlo = hg.SparselyBin(25, 'longitude', value=hav)
hlo.fill.numpy(df)

In [None]:
hlo.bins

In [None]:
hlo.plot.matplotlib()
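
The profile idea, the mean and standard deviation of one feature per bin of another, can be sketched in plain Python as follows. The data and the helper name `profile` are made up for illustration. For clarity the sketch stores all values per bin in a list, whereas a streaming aggregator would keep running sums instead.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev


def profile(xs, ys, bin_width):
    """Mean and (population) standard deviation of y per sparse bin of x."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[math.floor(x / bin_width)].append(y)
    return {index: (mean(v), pstdev(v)) for index, v in sorted(groups.items())}


longitude = [-30, -10, 5, 20, 60, 70]  # made-up values
age = [20, 40, 30, 50, 25, 35]

result = profile(longitude, age, bin_width=25)
for index, (avg, std) in result.items():
    print(f"[{index * 25}, {(index + 1) * 25}): mean={avg:.1f}, std={std:.1f}")
```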

### Convenience functions

There are several convenience functions to make such composed histograms. These are:
- Profile: Convenience function for creating binwise averages.
- SparselyProfile: Convenience function for creating sparsely binned binwise averages.
- ProfileErr: Convenience function for creating binwise averages and variances.
- SparselyProfileErr: Convenience function for creating sparsely binned binwise averages and variances.
- TwoDimensionallyHistogram: Convenience function for creating a conventional, two-dimensional histogram.
- TwoDimensionallySparselyHistogram: Convenience function for creating a sparsely binned, two-dimensional histogram.

In [None]:
# For example, call this convenience function to make the same histogram as above:
hlo = df.hg_SparselyProfileErr(25, 'longitude', 'age')
hlo.plot.matplotlib()

### Summary of histograms

Here you can find the list of all available histograms and aggregators and how to use each one: 

https://histogrammar.github.io/histogrammar-docs/specification/1.0/

The most useful aggregators are the following. Tinker with them to get familiar; building up an analysis is easier when you know "there's an app for that."

**Simple counters:**

  * [`Count`](../../specification/#count-sum-of-weights): just counts. Every aggregator has an `entries` field, but `Count` _only_ has this field.
  * [`Average`](../../specification/#average-mean-of-a-quantity) and [`Deviate`](../../specification/#deviate-mean-and-variance): add mean and variance, cumulatively.
  * [`Minimize`](../../specification/#minimize-minimum-value) and [`Maximize`](../../specification/#maximize-maximum-value): lowest and highest value seen.

**Histogram-like objects:**

  * [`Bin`](../../specification/#bin-regular-binning-for-histograms) and [`SparselyBin`](../../specification/#sparselybin-ignore-zeros): split a numerical domain into uniform bins and redirect aggregation into those bins.
  * [`Categorize`](../../specification/#categorize-string-valued-bins-bar-charts): split a string-valued domain by unique values; good for making bar charts (which are histograms with a string-valued axis).
  * [`CentrallyBin`](../../specification/#centrallybin-fully-partitioning-with-centers) and [`IrregularlyBin`](../../specification/#irregularlybin-fully-partitioning-with-edges): split a numerical domain into arbitrary subintervals, usually for separate plots like particle pseudorapidity or collision centrality.

**Collections:**

  * [`Label`](../../specification/#label-directory-with-string-based-keys), [`UntypedLabel`](../../specification/#untypedlabel-directory-of-different-types), and [`Index`](../../specification/#index-list-with-integer-keys): bundle objects under string-based keys (`Label` and `UntypedLabel`) or in an ordered array, effectively with integer-based keys (`Index`). `Label` and `Index` hold objects of a single type; `UntypedLabel` can hold objects of any types.
  * [`Branch`](../../specification/#branch-tuple-of-different-types): the fourth case, an ordered array of any types. A `Branch` is useful as a "cable splitter": for instance, it can feed the same data to a histogram and to `Minimize` and `Maximize` aggregators, so that the minimum and maximum value are tracked alongside the binned counts.
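
The "cable splitter" idea can be illustrated without the library: one object forwards every value to several independent aggregators. The classes below (`Counts`, `Minimum`, `Maximum`, `Splitter`) are hypothetical stand-ins written for this sketch. They are not Histogrammar classes and do not show its API.

```python
class Counts:
    """Histogram with fixed-width bins stored as {bin_index: count}."""

    def __init__(self, bin_width):
        self.bin_width = bin_width
        self.bins = {}

    def fill(self, x):
        index = int(x // self.bin_width)
        self.bins[index] = self.bins.get(index, 0) + 1


class Minimum:
    """Tracks the smallest value seen."""

    def __init__(self):
        self.value = float("inf")

    def fill(self, x):
        self.value = min(self.value, x)


class Maximum:
    """Tracks the largest value seen."""

    def __init__(self):
        self.value = float("-inf")

    def fill(self, x):
        self.value = max(self.value, x)


class Splitter:
    """Forwards every value to all of its aggregators (the 'cable splitter')."""

    def __init__(self, *aggregators):
        self.aggregators = aggregators

    def fill(self, x):
        for aggregator in self.aggregators:
            aggregator.fill(x)


hist, lowest, highest = Counts(bin_width=10), Minimum(), Maximum()
splitter = Splitter(hist, lowest, highest)
for x in [23, 27, 31, 64]:  # made-up data
    splitter.fill(x)

print(hist.bins, lowest.value, highest.value)
```

A single pass over the data fills all three aggregators, which is the point of bundling them.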





## Making many histograms at once

There is a convenient method to make many histograms in one go: `hg_make_histograms`.

By default, automatic binning is applied to make the histograms.

More details on how to use this function are found in the advanced tutorial.

In [None]:
hists = df.hg_make_histograms()

In [None]:
print(hists.keys())

In [None]:
h = hists['transaction']
h.plot.matplotlib()

In [None]:
h = hists['date']
h.plot.matplotlib()

In [None]:
# you can also select which features to histogram, and make multi-dimensional histograms
hists = df.hg_make_histograms(features=['longitude:age'])

In [None]:
hist = hists['longitude:age']
hist.plot.matplotlib()

## Storage

Histograms can easily be stored in, and retrieved from, the JSON format.

In [None]:
# storage
hist.toJsonFile('long_age.json')

In [None]:
# retrieval
factory = hg.Factory()
hist2 = factory.fromJsonFile('long_age.json')
hist2.plot.matplotlib()

In [None]:
# to store many histograms at once:

In [None]:
%%script false --no-raise-error

# we can store the histograms if we want to
import json
from histogrammar.util import dumper

# store
with open('histograms.json', 'w') as outfile:
    json.dump(hists, outfile, default=dumper)

# and load again
# (json.load returns plain dictionaries, i.e. the JSON representation of each histogram)
with open('histograms.json') as handle:
    hists2 = json.load(handle)

In [None]:
print(hists.keys())

## Advanced tutorial

The advanced tutorial shows:
- How to work with Spark dataframes.
- More details on the method for making many histograms in one go, for example how to set bin specifications.
