# Histogrammar advanced tutorial

Histogrammar is a Python package for making histograms from numpy arrays and from pandas and Spark dataframes. (There is also a Scala backend for Histogrammar.)

This advanced tutorial shows how to:
- work with Spark dataframes,
- make many histograms at once, which is one of the nice features of Histogrammar, and configure how this is done, for example by setting bin specifications or by using a dedicated time axis.


Enjoy!

In [None]:
%%capture
# install histogrammar (if not installed yet)
import sys

!"{sys.executable}" -m pip install histogrammar

In [None]:
import histogrammar as hg

In [None]:
import pandas as pd
import numpy as np
import matplotlib

## Data generation
Let's first load some data!

In [None]:
# open a pandas dataframe for use below
from histogrammar import resources
df = pd.read_csv(resources.data("test.csv.gz"), parse_dates=["date"])

In [None]:
df.head()

## What about Spark DataFrames?

No problem! We can easily perform the same steps on a Spark DataFrame. One important thing to note is that we need to include two jar files when we create our Spark session. Spark uses these to create the histograms with Histogrammar. The jar files are downloaded automatically the first time you run this command.

In [None]:
# check that pyspark is installed; it is needed for histogramming of spark dataframes
try:
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col
    from pyspark import __version__ as pyspark_version
    pyspark_installed = True
except ImportError:
    print("pyspark needs to be installed for this example")
    pyspark_installed = False

In [None]:
# these are the jar files for spark 3.X (Scala 2.12)
# for spark 2.X, "_2.12" is changed into "_2.11" in both jar names, as done below.

if pyspark_installed:
    # parse the full major version rather than only the first character of the version string
    scala = '2.12' if int(pyspark_version.split('.')[0]) >= 3 else '2.11'
    hist_jar = f'io.github.histogrammar:histogrammar_{scala}:1.0.20'
    hist_spark_jar = f'io.github.histogrammar:histogrammar-sparksql_{scala}:1.0.20'

    spark = SparkSession.builder.config(
        "spark.jars.packages", f'{hist_spark_jar},{hist_jar}'
    ).getOrCreate()

    sdf = spark.createDataFrame(df)
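
The selection of the Scala suffix above can be factored into a small helper that is easy to check without starting a Spark session. This is a minimal sketch: `histogrammar_packages` is a hypothetical name, not part of Histogrammar, and it only mirrors the rule used in the cell above (Spark 3 and later map to Scala 2.12, Spark 2 to 2.11) with the same jar version 1.0.20.

```python
def histogrammar_packages(pyspark_version, jar_version="1.0.20"):
    """Return a value for spark.jars.packages (hypothetical helper)."""
    # Parse the major version from the full version string.
    major = int(pyspark_version.split(".")[0])
    # Same rule as in the cell above: Spark >= 3 -> Scala 2.12, else 2.11.
    scala = "2.12" if major >= 3 else "2.11"
    hist_jar = f"io.github.histogrammar:histogrammar_{scala}:{jar_version}"
    hist_spark_jar = f"io.github.histogrammar:histogrammar-sparksql_{scala}:{jar_version}"
    return f"{hist_spark_jar},{hist_jar}"


packages = histogrammar_packages("3.5.1")
print(packages)
```

The returned string has the same form as the value passed to `"spark.jars.packages"` in the session builder above.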

## Filling histograms with spark

Filling histograms with spark dataframes is just as simple as it is with pandas dataframes.

In [None]:
# example: filling from a pandas dataframe
hist = hg.SparselyHistogram(binWidth=100, quantity='transaction')
hist.fill.numpy(df)
hist.plot.matplotlib();

In [None]:
# for spark you will need this spark column function:
if pyspark_installed:
    from pyspark.sql.functions import col

Let's make the same histogram but from a spark dataframe. There are just two differences:
- When declaring a histogram, always set quantity to `col('column_name')` instead of `'column_name'`.
- When filling the histogram from a dataframe, use the `fill.sparksql()` method instead of `fill.numpy()`.

In [None]:
# example: filling from a spark dataframe
if pyspark_installed:
    hist = hg.SparselyHistogram(binWidth=100, quantity=col('transaction'))
    hist.fill.sparksql(sdf)
    hist.plot.matplotlib();

Apart from these two differences, all functionality is the same between pandas and spark histograms!

As with pandas, we can also do this directly from the dataframe:

In [None]:
if pyspark_installed:
    h2 = sdf.hg_SparselyProfileErr(25, col('longitude'), col('age'))
    h2.plot.matplotlib();

In [None]:
if pyspark_installed:
    h3 = sdf.hg_TwoDimensionallySparselyHistogram(25, col('longitude'), 10, col('latitude'))
    h3.plot.matplotlib();

All examples below also work with spark dataframes.

## Making many histograms at once

Histogrammar has a nice method, `hg_make_histograms()`, to make many histograms in one go.

By default automagical binning is applied to make the histograms.

In [None]:
hists = df.hg_make_histograms()

In [None]:
# histogrammar has made histograms of all features, using an automated binning.
hists.keys()

In [None]:
h = hists['transaction']
h.plot.matplotlib();

In [None]:
# you can select which features you want to histogram with features=:
hists = df.hg_make_histograms(features = ['longitude', 'age', 'eyeColor'])

In [None]:
# you can also make multi-dimensional histograms
# here longitude is the first axis of each histogram.
hists = df.hg_make_histograms(features = ['longitude:age', 'longitude:age:eyeColor'])
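
Feature strings for multi-dimensional histograms are column names joined by colons, so they can be generated rather than typed out. A small sketch, in which `axis_names` and `with_first_axis` are hypothetical helpers and not part of Histogrammar:

```python
def axis_names(feature):
    """Split a feature string such as 'longitude:age' into its axes."""
    return feature.split(":")


def with_first_axis(first, columns):
    """Build 2d feature strings that all share the same first axis."""
    return [f"{first}:{c}" for c in columns if c != first]


features = with_first_axis("longitude", ["longitude", "age", "eyeColor"])
print(features)                              # ['longitude:age', 'longitude:eyeColor']
print(axis_names("longitude:age:eyeColor"))  # ['longitude', 'age', 'eyeColor']
```

A list built this way can be passed as `features=` in the same way as the hand-written lists above.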

### Working with timestamps

In [None]:
# Working with a dedicated time axis, make histograms of each feature over time.
hists = df.hg_make_histograms(time_axis="date")

In [None]:
hists.keys()

In [None]:
h2 = hists['date:age']
h2.plot.matplotlib();

Histogrammar does not support pandas' timestamps natively, but converts timestamps into nanoseconds since 1970-1-1.

In [None]:
h2.bin_edges()

The datatype shows the datetime though:

In [None]:
h2.datatype

In [None]:
# convert these back to timestamps with:
pd.Timestamp(h2.bin_edges()[0])
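
The nanosecond convention can be checked with pandas alone. The sketch below uses an arbitrary example date rather than an actual bin edge of `h2`:

```python
import pandas as pd

ts = pd.Timestamp("2014-01-04")

# .value is the number of nanoseconds since 1970-01-01 (the Unix epoch)
ns = ts.value
print(ns)

# an integer passed to pd.Timestamp is interpreted as nanoseconds since the epoch
print(pd.Timestamp(ns))
```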

In [None]:
# For the time axis, you can set the binning specifications with time_width and time_offset:
hists = df.hg_make_histograms(time_axis="date", time_width='28d', time_offset='2014-1-4', features=['date:isActive', 'date:age'])

In [None]:
hists['date:isActive'].plot.matplotlib();
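
To see what `time_width` and `time_offset` mean, the bin arithmetic can be written out by hand. The sketch below assumes that the time axis is a sparse, evenly spaced binning in which bin `i` covers `[offset + i * width, offset + (i + 1) * width)`. `time_bin` is a hypothetical helper that uses only the standard library and does not call Histogrammar.

```python
from datetime import datetime, timedelta

width = timedelta(days=28)     # corresponds to time_width='28d'
offset = datetime(2014, 1, 4)  # corresponds to time_offset='2014-1-4'


def time_bin(t):
    """Return (index, start, end) of the 28-day bin that contains t."""
    index = (t - offset) // width  # floor division, also correct before the offset
    start = offset + index * width
    return index, start, start + width


print(time_bin(datetime(2014, 1, 4)))    # first bin, index 0
print(time_bin(datetime(2014, 2, 15)))   # 42 days after the offset, index 1
print(time_bin(datetime(2013, 12, 25)))  # before the offset, index -1
```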

### Setting binning specifications

In [None]:
# histogram selections. Each feature listed here gets its own one-dimensional histogram.
features=[
    'date', 'latitude', 'longitude', 'age', 'eyeColor', 'favoriteFruit', 'transaction'
]

# Specify your own binning specifications for individual features or combinations thereof.
# This bin specification uses open-ended ("sparse") histograms; unspecified features get
# auto-binned. The time-axis binning, when specified here, needs to be in nanoseconds.
bin_specs={
    'longitude': {'binWidth': 10.0, 'origin': 0.0},
    'latitude': {'edges': [-100, -75, -25, 0, 25, 75, 100]},
    'age': {'num': 100, 'low': 0, 'high': 100},
    'transaction': {'centers': [-1000, -500, 0, 500, 1000, 1500]},
    'date': {'binWidth': pd.Timedelta('4w').value, 'origin': pd.Timestamp('2015-1-1').value}
}


# these binning specifications make:
# - a sparse histogram for: longitude and date
# - an irregularly binned histogram for: latitude
# - a closed-range, evenly spaced histogram for: age
# - a histogram defined by bin centers for: transaction
hists = df.hg_make_histograms(features=features, bin_specs=bin_specs)
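
Each kind of specification can be read as a different rule for mapping a value to a bin. The following plain-Python sketch spells those rules out for the specs above. It does not call Histogrammar, the function names are hypothetical, and the library's handling of values that fall exactly on a boundary or outside the given range may differ.

```python
import math
from bisect import bisect_right


def sparse_index(x, binWidth, origin=0.0):
    """Open-ended binning: any value gets an integer bin index."""
    return math.floor((x - origin) / binWidth)


def regular_index(x, num, low, high):
    """Closed-range binning: num equal bins in [low, high), else None."""
    if not low <= x < high:
        return None
    return int(num * (x - low) / (high - low))


def irregular_index(x, edges):
    """Irregular binning: index of the interval [edges[i], edges[i + 1])."""
    if not edges[0] <= x < edges[-1]:
        return None
    return bisect_right(edges, x) - 1


def nearest_center(x, centers):
    """Center-based binning: the value goes to the closest center."""
    return min(centers, key=lambda c: abs(x - c))


print(sparse_index(23.4, binWidth=10.0, origin=0.0))            # 2, i.e. [20, 30)
print(regular_index(42, num=100, low=0, high=100))              # 42
print(irregular_index(-30, [-100, -75, -25, 0, 25, 75, 100]))   # 1, i.e. [-75, -25)
print(nearest_center(700, [-1000, -500, 0, 500, 1000, 1500]))   # 500
```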

In [None]:
hists.keys()

In [None]:
hists['transaction'].plot.matplotlib();

In [None]:
# all available bin specifications are (just examples):

bin_specs = {'x': {'binWidth': 1, 'origin': 0},                   # SparselyBin histogram
             'y': {'num': 10, 'low': 0.0, 'high': 2.0},           # Bin histogram
             'x:y': [{}, {'num': 5, 'low': 0.0, 'high': 1.0}],    # SparselyBin vs Bin histograms
             'a': {'edges': [0, 2, 10, 11, 21, 101]},             # IrregularlyBin histogram
             'b': {'centers': [1, 6, 10.5, 16, 20, 100]},         # CentrallyBin histogram
             'c': {'max': True},                                  # Maximize histogram
             'd': {'min': True},                                  # Minimize histogram
             'e': {'sum': True},                                  # Sum histogram
             'z': {'deviate': True},                              # Deviate histogram
             'f': {'average': True},                              # Average histogram
             'a:f': [{'edges': [0, 10, 101]}, {'average': True}], # IrregularlyBin vs Average histograms
             'g': {'thresholds': [0, 2, 10, 11, 21, 101]},        # Stack histogram 
             'h': {'bag': True},                                  # Bag histogram
             }

# to set binning specs for a specific 2d histogram, you can do this:
# if these are not provided, the 1d binning specifications are picked up for 'a:f'
bin_specs = {'a:f': [{'edges': [0, 10, 101]}, {'average': True}]}
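
The fallback described in the comment above can be sketched as a lookup: a spec given for the full feature string wins, and otherwise each axis uses its own 1d spec. `resolve_bin_specs` is a hypothetical helper written for illustration and is not Histogrammar's implementation. The empty dict it returns for an axis without a spec is an assumption of this sketch, standing for "leave this axis to automatic binning".

```python
def resolve_bin_specs(feature, bin_specs):
    """Return one spec dict per axis of a feature string such as 'a:f'."""
    axes = feature.split(":")
    if len(axes) > 1 and feature in bin_specs:
        # a dedicated multi-dimensional spec takes precedence
        return list(bin_specs[feature])
    # otherwise fall back to the 1d spec of each axis; {} means "no spec given"
    return [bin_specs.get(axis, {}) for axis in axes]


specs_1d = {"a": {"edges": [0, 2, 10, 11, 21, 101]}, "f": {"average": True}}
specs_2d = {**specs_1d, "a:f": [{"edges": [0, 10, 101]}, {"average": True}]}

print(resolve_bin_specs("a:f", specs_1d))  # uses the 1d specs of 'a' and 'f'
print(resolve_bin_specs("a:f", specs_2d))  # uses the dedicated 'a:f' spec
```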

In [None]:
# For example 
features = ['latitude:age', 'longitude:age', 'age', 'longitude']

bin_specs = {
    'latitude': {'binWidth': 25},
    'longitude': {'edges': [-100, -75, -25, 0, 25, 75, 100]},
    'age': {'deviate': True},
    'longitude:age': [{'binWidth': 25}, {'average': True}],
}

hists = df.hg_make_histograms(features=features, bin_specs=bin_specs)

In [None]:
h = hists['latitude:age']
h.bins

In [None]:
hists['longitude:age'].plot.matplotlib();