# Intentional omissions

Bokeh-catplot is intentionally limited in scope. Only four types of plots (albeit with allowance for a fair amount of configurability) are available. For example, we could allow for two quantitative variables and make scatter plots colored by the categorical variable, two-dimensional ECDFs, hex-bin plots, contour plots, etc. All of these plots would be useful, but this package does not aspire to be nearly as complete as, say, HoloViews.

Nonetheless, a few kinds of plots that do fall into the one quantitative/*n* categorical class are also left out. Here, we address why those are *not* included.

## Why no stacked bar graphs?

Stacked bar graphs are useful for displaying relative count data, but unfortunately, their utility is somewhat restricted to that. All four functions in Bokeh-catplot handle arbitrary scalar-valued quantitative data (including negative values), and allow for arbitrarily many measurements per category. A stacked bar graph either requires one non-negative quantitative value per category or requires a count operation on the data points, which has a very specific, possibly ambiguous meaning. So, a stacked bar graph would necessitate restrictions on allowed data types beyond those allowed by the other four kinds of plots.

Beyond that, there are often better choices than stacked bar graphs. To demonstrate, consider making a stacked bar plot of the counts of cars with each number of cylinders from each region of origin.

In [1]:
import bokeh_catplot
import bokeh.sampledata.autompg

import colorcet

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()

df = bokeh.sampledata.autompg.autompg_clean

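# Count cars for each (origin, number of cylinders) pair,
# with one column per number of cylinders
count_df = (
    df.groupby(["origin"])["cyl"]
    .value_counts()
    .unstack()
    .reset_index()
    .fillna(0)
)
# Bokeh data sources expect string column names
count_df.columns = count_df.columns.astype(str)
stackers = ["3", "4", "5", "6", "8"]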

p = bokeh.plotting.figure(
    frame_width=500,
    frame_height=250,
    y_range=["North America", "Europe", "Asia"],
    x_axis_label="count",
)
p.x_range.start = 0
p.hbar_stack(
    stackers=stackers,
    height=0.5,
    y="origin",
    color=colorcet.b_glasbey_category10[:5],
    source=count_df,
    legend_label=stackers,
)

p.ygrid.grid_line_alpha = 0
p.legend.title = "cylinders"

bokeh.io.show(p)

To get the actual count for each category (number of cylinders) in a stack, you need to take the difference between the right and left edges of its segment. Compare that with a strip plot containing the same information.

In [2]:
p = bokeh_catplot.strip(
    data=count_df.melt(id_vars='origin', var_name='cyl', value_name='count'),
    val='count',
    cats='origin',
    color_column='cyl',
    frame_width=500,
    show_legend=True,
    marker_kwargs=dict(size=10)
)

p.legend.title = 'cylinders'

bokeh.io.show(p)

In this case, we can immediately read off the number of cars with the respective number of cylinders.

## Why no bar graphs?

I strongly prefer strip plots (with jitter) to box plots and ECDFs to histograms. Why? Because in strip plots and ECDFs, you are **plotting all of your data**. In practice, these are the only two types of visualizations for data with a categorical axis I use (though I'll sometimes overlay a jitter on a box plot to show some of the summary statistics).
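To see what "plotting all of your data" means for an ECDF, here is a minimal sketch using only NumPy. The function name `ecdf_vals` and the measurements are made up for illustration; this is not necessarily how Bokeh-catplot computes its ECDFs internally.

```python
import numpy as np


def ecdf_vals(data):
    """Return x and y values for plotting an ECDF as dots.

    Hypothetical helper for illustration only.
    """
    # Every measurement appears exactly once on the x-axis
    x = np.sort(np.asarray(data, dtype=float))

    # The y value is the fraction of measurements less than or equal to x
    y = np.arange(1, len(x) + 1) / len(x)

    return x, y


# Made-up measurements
measurements = [2.1, 0.4, 3.3, 1.7, 0.9]
x, y = ecdf_vals(measurements)

for xi, yi in zip(x, y):
    print(f"{xi:.1f}  {yi:.1f}")
```

Each of the five measurements gets its own point, so nothing is binned or summarized away as it would be in a histogram.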

A bar graph is the antithesis of plotting all of your data. You distill all of the information in the data set down to one or two summary statistics, and then use giant glyphs to show them. You should plot all of your data, so you shouldn't make bar graphs. Bokeh-catplot will not help you practice bad plotting.

So why does bokeh-catplot have box-and-whisker plots? One may argue that it is nonetheless valuable to plot summary statistics, which is what box plots do. In that case, at least five summary statistics are plotted (the ends of the whiskers, the ends of the box, and the median). While this is still not plotting all of the data, it is still better than a dynamite graph (bar graph with error bars), which shows at most three summary statistics (height of bar, and lower and upper bound of confidence interval). But still, why does bokeh-catplot enable box plots, but not bar graphs?

The answer is that there are many ways to specify the summary statistic used in bar graphs. We could choose the height of the bar to be the mean of the data and the error bars to have a length given by the standard error of the mean. We could have the height of the bar be the median and the error bars be a possibly asymmetric confidence interval obtained by bootstrapping the median. And there are many more possibilities.
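As a rough illustration of this ambiguity, the sketch below computes both of the summaries just described for the same data set using NumPy. The data are made up (drawn from a log-normal distribution), and the number of bootstrap replicates and the 95% confidence level are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(3252)

# Made-up, skewed measurements for a single category (illustration only)
data = rng.lognormal(mean=0.0, sigma=1.0, size=50)

# Choice 1: bar height is the mean; error bars are
# plus or minus one standard error of the mean
mean = np.mean(data)
sem = np.std(data, ddof=1) / np.sqrt(len(data))
mean_bar = (mean - sem, mean, mean + sem)

# Choice 2: bar height is the median; error bars are a 95% percentile
# bootstrap confidence interval, which may be asymmetric
bs_medians = np.array(
    [np.median(rng.choice(data, size=len(data))) for _ in range(2000)]
)
ci_low, ci_high = np.percentile(bs_medians, [2.5, 97.5])
median_bar = (ci_low, np.median(data), ci_high)

print("mean with SEM:             ", mean_bar)
print("median with 95% boot. CI:  ", median_bar)
```

Both results are legitimate "bar with error bars" summaries of the same measurements, yet they give different bar heights and different error bars, so a reader cannot interpret the graph without knowing which choice was made.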

Conversely, if we stick to the widely used (almost universally used, as far as I can tell) Tukey specification of a box-and-whisker plot, we are only plotting *percentiles* of the data. These assume no underlying statistical model, so the plots are unambiguous.
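For concreteness, here is a sketch of one common reading of the Tukey convention: the box spans the 25th to 75th percentiles, the line in the box is the median, and each whisker extends to the most extreme data point within 1.5 times the interquartile range of the box. The function name `tukey_box_stats` is hypothetical, and this is not necessarily how Bokeh-catplot computes its boxes.

```python
import numpy as np


def tukey_box_stats(data):
    """Summary values for a Tukey-style box-and-whisker plot.

    Hypothetical helper for illustration only.
    """
    x = np.asarray(data, dtype=float)

    # Edges of the box and the median are percentiles of the data
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1

    # Whiskers end at the most extreme data points that lie
    # within 1.5 IQR of the edges of the box
    lower_whisker = x[x >= q1 - 1.5 * iqr].min()
    upper_whisker = x[x <= q3 + 1.5 * iqr].max()

    # Points beyond the whiskers are plotted individually as outliers
    outliers = x[(x < lower_whisker) | (x > upper_whisker)]

    return {
        "lower_whisker": lower_whisker,
        "q1": q1,
        "median": median,
        "q3": q3,
        "upper_whisker": upper_whisker,
        "outliers": outliers,
    }


# Made-up measurements with one extreme value
stats = tukey_box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(stats)
```

No distributional assumption enters anywhere in this calculation; the only inputs are the measurements themselves and the fixed 1.5 IQR rule.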