Utilities to make Bokeh plots involving at least one categorical variable.
Type Name Latest commit message Commit time
bokeh_catplot Added horizontal options for ECDF. This is v. 0.1.6 release. Nov 11, 2019
images Images for README Jun 15, 2019
notebooks Changed the API for ECDF types, now using style kwarg. Jul 17, 2019
.gitignore Initial commit Jun 13, 2019
LICENSE Initial commit Jun 13, 2019
MANIFEST.in Setup files Jun 15, 2019
README.md
requirements.txt Added colorcet requirement. Oct 14, 2019
setup.cfg Setup files Jun 15, 2019
setup.py Added horizontal options for ECDF. This is v. 0.1.6 release. Nov 11, 2019


Bokeh-catplot

A utility that uses Bokeh to generate plots where one or more variables are categorical. It generates box plots, jitter plots, histograms, and ECDFs.

Motivation

Bokeh is a powerful Python-based plotting library. HoloViews is an excellent high-level plotting package that can use Bokeh to render plots. I suspect that most of the functionality in this module will eventually be incorporated into HoloViews. This module exists because HoloViews lacks some important functionality.

These are all relatively minor fixes for HoloViews, which will likely gain this functionality in the near future.

Nonetheless, ECDFs and horizontal strip plots are important visualizations, and this module provides a high-level way to render them using Bokeh.

Installation

You can install bokeh-catplot using pip. You will need to have a recent version of Bokeh and all of its dependencies installed.

pip install bokeh_catplot

Usage

Bokeh-catplot is imported as bokeh_catplot. While I'm at it, I'll import some sample data.

import bokeh.io
import bokeh.layouts
import bokeh.palettes
import bokeh.plotting
import bokeh_catplot
from bokeh.sampledata import autompg

Sample data

To demonstrate usage, I will use the automobile fuel efficiency sample data set.

df = autompg.autompg_clean

df.head()
mpg cyl displ hp weight accel yr origin name mfr
0 18.0 8 307.0 130 3504 12.0 70 North America chevrolet chevelle malibu chevrolet
1 15.0 8 350.0 165 3693 11.5 70 North America buick skylark 320 buick
2 18.0 8 318.0 150 3436 11.0 70 North America plymouth satellite plymouth
3 16.0 8 304.0 150 3433 12.0 70 North America amc rebel sst amc
4 17.0 8 302.0 140 3449 10.5 70 North America ford torino ford

Importantly, this data set is tidy; each row represents a single observation and each column a variable associated with an observation. Bokeh-catplot assumes that any inputted data frame is in tidy format. In the fuel efficiency example, the columns differ in character. For example, 'mpg' contains a quantitative measurement of the miles per gallon of each car. The 'origin' column is categorical in the sense that it is not quantitative, but is a descriptor of the automobile that takes on a few discrete values.

Functionality

Bokeh-catplot generates plots from tidy data frames where one or more columns contain categorical data and the column of interest in the plot is quantitative.

There are four types of plots that Bokeh-catplot can generate.

  • Plots with a categorical axis

    • Box plots: bokeh_catplot.box()
    • Strip plots: bokeh_catplot.strip()
  • Plots without a categorical axis

    • Histograms: bokeh_catplot.histogram()
    • ECDFs: bokeh_catplot.ecdf()

The first three arguments of each of these functions are necessary to build the plot. They are:

  • data: A tidy data frame
  • cats: The name of a column, or a list of column names, in the data frame to be treated as categorical variables in the plot. If None, a single box, strip, histogram, or ECDF is plotted.
  • val: The column of the data frame to be treated as the quantitative variable.
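To make the roles of data, cats, and val concrete, here is a minimal sketch of the grouping they imply. It is not Bokeh-catplot's implementation; group_values is a hypothetical helper, the records are plain dictionaries rather than a data frame, and the Asia and Europe rows are made-up values for illustration. It uses only the standard library.

```python
from collections import defaultdict

# Hypothetical tidy records: one dict per observation, one key per variable.
records = [
    {"origin": "North America", "cyl": 8, "mpg": 18.0},
    {"origin": "North America", "cyl": 8, "mpg": 15.0},
    {"origin": "Asia", "cyl": 4, "mpg": 24.0},
    {"origin": "Europe", "cyl": 4, "mpg": 26.0},
    {"origin": "Asia", "cyl": 4, "mpg": 27.0},
]


def group_values(data, cats, val):
    """Collect the quantitative column `val` for each combination of `cats`."""
    if cats is None:
        # No categorical variable: every observation lands in a single group.
        return {(): [row[val] for row in data]}
    if isinstance(cats, str):
        cats = [cats]
    groups = defaultdict(list)
    for row in data:
        key = tuple(row[c] for c in cats)
        groups[key].append(row[val])
    return dict(groups)


by_origin = group_values(records, "origin", "mpg")
by_origin_cyl = group_values(records, ["origin", "cyl"], "mpg")
```

Each key of the resulting dictionary corresponds to one box, strip, histogram, or ECDF, and the groups appear in the order they are first encountered in the data.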

Here are the four default plots for cats = 'origin' and val = 'mpg'.

p_box = bokeh_catplot.box(df, 'origin', 'mpg', title='box')
p_strip = bokeh_catplot.strip(df, 'origin', 'mpg', title='strip')
p_histogram = bokeh_catplot.histogram(df, 'origin', 'mpg', title='histogram')
p_ecdf = bokeh_catplot.ecdf(df, 'origin', 'mpg', title='ecdf')

bokeh.io.show(
    bokeh.layouts.column([p_box, p_strip, p_histogram, p_ecdf])
)

Additional kwargs

Each of the four functions also has the following additional optional keyword arguments.

  • palette: A list of hex colors to use for coloring the markers for each category. By default, it uses the default color scheme of Vega-Lite.
  • order: If specified, the ordering of the categories to use on the categorical axis and legend (if applicable). Otherwise, the order of the inputted data frame is used.
  • p: If specified, the bokeh.plotting.Figure object to use for the plot. If not specified, a new figure is created.

The respective plots also have kwargs that are specific to them. Examples highlighting some, but not all, customizations follow.

Any kwargs not in the function call signature are passed to bokeh.plotting.figure() when the figure is instantiated.

Box plots

We can also have horizontal box plots.

p = bokeh_catplot.box(
    data=df,
    cats='origin',
    val='mpg',
    horizontal=True,
    order=['Asia', 'Europe', 'North America']
)

bokeh.io.show(p)

We can independently specify properties of the marks using box_kwargs, whisker_kwargs, median_kwargs, and outlier_kwargs. For example, say we wanted our colors to be Betancourt red, and that we wanted the outliers to also be that color and use diamond glyphs.

p = bokeh_catplot.box(
    data=df,
    cats='origin',
    val='mpg',
    whisker_caps=True,
    outlier_marker='diamond',
    box_kwargs=dict(fill_color='#7C0000'),
    whisker_kwargs=dict(line_color='#7C0000', line_width=2),
)

bokeh.io.show(p)
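For readers unfamiliar with what the box, whisker, median, and outlier marks represent, the following sketch computes them for one category. It assumes the common Tukey convention (whiskers extend to the most extreme points within 1.5 times the interquartile range of the box); I have not confirmed that Bokeh-catplot uses exactly this rule, and box_stats is a hypothetical helper, not part of the package.

```python
import statistics


def box_stats(values):
    """Summary statistics for one box, assuming Tukey's 1.5 * IQR whisker rule."""
    data = sorted(values)
    # Quartiles by linear interpolation between data points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }


# Hypothetical measurements with one extreme value.
stats = box_stats([15.0, 16.0, 17.0, 18.0, 18.0, 19.0, 20.0, 45.0])
```

Here the value 45.0 falls outside the upper fence, so it would be drawn as an outlier glyph rather than being covered by the whisker.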

We can have multiple categories and specify a custom palette.

bkp = bokeh.palettes.d3['Category20c'][20]
palette = bkp[:3] + bkp[4:7] + bkp[8:11]

p = bokeh_catplot.box(
    data=df,
    cats=['origin', 'cyl'],
    val='mpg',
    palette=palette,
    horizontal=True,
    y_axis_label='# of cylinders',
)

p.yaxis.axis_label_text_font_style = 'bold'

bokeh.io.show(p)

Strip plots

We can make a strip plot with dash markers and add some transparency.

p = bokeh_catplot.strip(
    data=df,
    cats='origin',
    val='mpg',
    marker='dash',
    marker_kwargs=dict(alpha=0.3)
)

bokeh.io.show(p)

The problem with strip plots is that they can have trouble with overlapping data points. A common approach to deal with this is to "jitter," or place the glyphs with small random displacements along the categorical axis. I do that here, allowing for hover tools that give more information about the respective data points.

p = bokeh_catplot.strip(
    data=df,
    cats='origin',
    val='mpg',
    jitter=True,
    marker_kwargs=dict(alpha=0.5),
    tooltips=[('year', '@yr'), ('model', '@name')]
)

bokeh.io.show(p)
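The idea behind jitter can be sketched in a few lines. This is an illustration only: jitter_positions is a hypothetical helper, and the uniform distribution and the width of 0.4 are my assumptions, not necessarily what Bokeh-catplot or Bokeh do internally.

```python
import random


def jitter_positions(categories, order, width=0.4, seed=None):
    """Axis position for each point: its category index plus a small random offset."""
    rng = random.Random(seed)
    index = {cat: i for i, cat in enumerate(order)}
    return [
        index[cat] + rng.uniform(-width / 2, width / 2)
        for cat in categories
    ]


origins = ["Asia", "Asia", "Europe", "North America", "North America"]
positions = jitter_positions(
    origins, ["Asia", "Europe", "North America"], seed=0
)
```

Because each offset stays within half the width of its category's center, points remain visually attached to their category while no longer sitting exactly on top of one another.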

Strip-box plots

Even while plotting all of the data, we sometimes want to graphically display summary statistics, in which case overlaying a box plot and a jitter plot is useful. To populate an existing Bokeh figure with new glyphs from another catplot, pass in the p kwarg. You should be careful, though, because you need to make sure the cats, val, and horizontal arguments exactly match.

# Make a box plot
p = bokeh_catplot.box(
    data=df,
    cats='origin',
    val='mpg',
    horizontal=True,
    box_kwargs=dict(fill_color='gray', fill_alpha=0.5),
    display_points=False
)

# Overlay a jitter plot
p = bokeh_catplot.strip(
    data=df,
    cats='origin',
    val='mpg',
    p=p,
    horizontal=True,
    jitter=True,
    marker_kwargs=dict(alpha=0.5),
    tooltips=[('year', '@yr'), ('model', '@name')]
)

bokeh.io.show(p)

Histograms

We can plot normalized histograms using the density kwarg, and we'll make the plot a little wider to make room for the legend.

p = bokeh_catplot.histogram(
    data=df,
    cats='origin',
    val='mpg',
    density=True,
    width=550
)

bokeh.io.show(p)
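As a sketch of what a normalized histogram means, the function below scales the bar heights so that the total bar area is one. I am assuming this is the normalization the density kwarg applies (it is the usual convention); density_histogram is a hypothetical helper, and values outside the bin edges are simply ignored.

```python
def density_histogram(values, bin_edges):
    """Bar heights normalized so that the total bar area is one."""
    n_bins = len(bin_edges) - 1
    counts = [0] * n_bins
    for x in values:
        for i in range(n_bins):
            is_last = i == n_bins - 1
            # Bins are half-open [left, right), except the last, which is closed.
            in_bin = bin_edges[i] <= x < bin_edges[i + 1]
            if in_bin or (is_last and x == bin_edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    widths = [bin_edges[i + 1] - bin_edges[i] for i in range(n_bins)]
    return [c / (total * w) for c, w in zip(counts, widths)]


heights = density_histogram(
    [15.0, 16.0, 17.0, 18.0, 18.0], [14.0, 16.0, 18.0, 20.0]
)
```

Dividing by both the number of points and the bin width is what lets histograms of categories with different numbers of observations be compared on the same axis.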

ECDFs

An empirical cumulative distribution function, or ECDF, is a convenient way to visualize a univariate probability distribution. Consider a measurement x in a set of measurements X. The ECDF evaluated at x is defined as

ECDF(x) = fraction of data points in X that are ≤ x.
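The definition translates directly into code. The sketch below uses only the standard library and the first five mpg values from the sample table; ecdf and ecdf_points are hypothetical helper names, not functions from Bokeh-catplot.

```python
def ecdf(data, x):
    """Fraction of points in `data` that are less than or equal to `x`."""
    return sum(1 for value in data if value <= x) / len(data)


def ecdf_points(data):
    """(x, y) pairs for drawing an ECDF as dots, one dot per measurement."""
    ordered = sorted(data)
    n = len(ordered)
    return [(value, (i + 1) / n) for i, value in enumerate(ordered)]


mpg = [18.0, 15.0, 18.0, 16.0, 17.0]
fraction_at_16 = ecdf(mpg, 16.0)
points = ecdf_points(mpg)
```

Note that when values are tied, the dots are stacked at the same x with y values equal to rank divided by n, so only the topmost dot of a tie equals ECDF(x).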

By default, ECDFs are colored by category. We may wish to display formal ECDFs (staircases).

p = bokeh_catplot.ecdf(
    data=df,
    cats='origin',
    val='mpg',
    formal=True,
)

bokeh.io.show(p)

We can also display empirical complementary cumulative distribution functions (ECCDFs) using the complementary kwarg.

ECCDF(x) = 1 - ECDF(x)

p = bokeh_catplot.ecdf(
    data=df,
    cats='origin',
    val='mpg',
    complementary=True
)

bokeh.io.show(p)

Instead of plotting a separate ECDF for each category, we can put all of the categories together on one ECDF and color the points by the categorical variable by using the kind='colored' kwarg.

p = bokeh_catplot.ecdf(
    data=df,
    cats='origin',
    val='mpg',
    kind='colored'
)

bokeh.io.show(p)

We can also display confidence intervals for the ECDFs acquired by bootstrapping using the conf_int kwarg.

p = bokeh_catplot.ecdf(
    data=df,
    cats='origin',
    val='mpg',
    formal=True,
    conf_int=True,
)

bokeh.io.show(p)
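To give a feel for what a bootstrapped confidence interval of an ECDF is, here is a minimal sketch for a single value of x. The percentile method, the 95% level, and the number of replicates are my assumptions for illustration; the helper names are hypothetical and this is not a description of how Bokeh-catplot computes its intervals.

```python
import random


def ecdf(data, x):
    """Fraction of points in `data` that are less than or equal to `x`."""
    return sum(1 for value in data if value <= x) / len(data)


def bootstrap_ecdf_interval(data, x, n_reps=1000, ptiles=(2.5, 97.5), seed=None):
    """Percentile bootstrap interval for the ECDF evaluated at a single point x."""
    rng = random.Random(seed)
    n = len(data)
    # Each replicate resamples the data with replacement and re-evaluates the ECDF.
    reps = sorted(ecdf(rng.choices(data, k=n), x) for _ in range(n_reps))
    # Nearest-rank percentiles of the sorted bootstrap replicates.
    low = reps[min(n_reps - 1, int(ptiles[0] / 100 * n_reps))]
    high = reps[min(n_reps - 1, int(ptiles[1] / 100 * n_reps))]
    return low, high


mpg = [18.0, 15.0, 18.0, 16.0, 17.0]
low, high = bootstrap_ecdf_interval(mpg, 16.5, seed=42)
```

Repeating this over a grid of x values gives the lower and upper envelopes of a confidence band around the ECDF.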

Why no stacked bar graphs?

Stacked bar graphs are useful for displaying relative count data, but unfortunately, their utility is largely restricted to that. All four functions in Bokeh-catplot handle arbitrary scalar-valued quantitative data (including negative values), and including stacked bar capabilities would introduce a restriction on what kinds of data we could effectively handle.

Beyond that, there are better choices than stacked bar. To demonstrate, consider making a stacked bar plot of the counts of cars with each number of cylinders from each region of origin.

count_df = df.groupby(['origin'])['cyl'].value_counts().unstack().reset_index().fillna(0)
count_df.columns = count_df.columns.astype(str)
stackers = ['3', '4', '5', '6', '8']

p = bokeh.plotting.figure(
    width=500, 
    height=300,
    x_range=['Asia', 'Europe', 'North America']
)
p.y_range.start = 0
p.vbar_stack(
    stackers=stackers, 
    width=0.5, 
    x='origin', 
    color=["#4e79a7", "#f28e2b", "#e15759", "#76b7b2", "#59a14f"], 
    source=count_df,
    legend=[bokeh.core.properties.value(s) for s in stackers]
)
p.legend.orientation = 'horizontal'
p.legend.location = 'top_left'

bokeh.io.show(p)

To get the actual count of each category (number of cylinders) in the stacks, you need to take the difference between the top and bottom of each segment. Compare that with a strip plot containing the same information.

p = bokeh_catplot.strip(
    data=count_df.melt(id_vars='origin', value_name='count'),
    cats='origin',
    val='count',
    color_column='cyl',
    width=500,
    show_legend=True,
    marker_kwargs=dict(size=10)
)
p.legend[0].orientation = 'horizontal'
p.legend[0].location = 'top_left'

bokeh.io.show(p)

In this case, we can immediately read off the number of cars with the respective number of cylinders. Because, in my opinion, a strip plot is better than a stacked bar for almost all applications, we do not include stacked bar plots.
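The extra work a stacked bar demands of the reader can be spelled out in a few lines. The counts below are hypothetical numbers chosen for illustration, not the actual counts in the data set.

```python
from itertools import accumulate

# Hypothetical counts of cars per number of cylinders for a single region.
cylinders = ["3", "4", "5", "6", "8"]
counts = [4, 69, 0, 6, 0]

# A stacked bar only shows the cumulative top of each segment on the axis.
tops = list(accumulate(counts))
bottoms = [0] + tops[:-1]

# Recovering each count means subtracting bottom from top, segment by segment.
recovered = [top - bottom for top, bottom in zip(tops, bottoms)]
```

In a strip plot the values in counts are plotted directly, so the subtraction step never falls to the reader.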

Why no bar graphs?

I strongly prefer strip plots (with jitter) to box plots and ECDFs to histograms. Why? Because in jitter plots and ECDFs, you are plotting all of your data. In practice, these are the only two types of visualizations I use for data with a categorical axis (though I'll sometimes overlay a jitter plot on a box plot to show some summary statistics).

A bar graph is the antithesis of plotting all of your data. You distill all of the information in the data set down to one or two summary statistics, and then use giant glyphs to show them. You should plot all of your data, so you shouldn't make bar graphs. Bokeh-catplot will help you do this.
