# Intentional omissions

iqplot is intentionally limited in scope. It handles only data sets with a single quantitative variable, and it provides only four types of plots (albeit with a fair amount of configurability). Nonetheless, there are a few other plots that fall into the one-quantitative-variable class. Here, we address why those are *not* included.

## Why no stacked bar graphs?

Stacked bar graphs are useful for displaying relative count data, but unfortunately, their utility is largely restricted to that. All four functions in iqplot handle arbitrary scalar-valued quantitative data (including negative values), and allow for arbitrarily many measurements per category. A stacked bar graph either requires one non-negative quantitative value per category or requires a count operation on the data points, which has a very specific, possibly ambiguous meaning. So, a stacked bar graph would necessitate restrictions on allowed data types beyond those imposed by the other four kinds of plots.

Beyond that, there are often better choices than stacked bar. To demonstrate, consider making a stacked bar plot of the counts of cars with each number of cylinders from each region of origin.

In [1]:
import numpy as np

import iqplot
import bokeh.sampledata.autompg

import colorcet

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()

df = bokeh.sampledata.autompg.autompg_clean

count_df = (
    df.groupby(["origin"])["cyl"]
    .value_counts()
    .unstack()
    .reset_index()
    .fillna(0)
)
count_df.columns = count_df.columns.astype(str)
stackers = ["3", "4", "5", "6", "8"]

p = bokeh.plotting.figure(
    frame_width=500,
    frame_height=250,
    y_range=["North America", "Europe", "Asia"],
    x_axis_label="count",
)
p.x_range.start = 0
p.hbar_stack(
    stackers=stackers,
    height=0.5,
    y="origin",
    color=colorcet.b_glasbey_category10[:5],
    source=count_df,
    legend_label=stackers,
)

p.ygrid.grid_line_alpha = 0
p.legend.title = "cylinders"

bokeh.io.show(p)

To get the actual count of each category (number of cylinders) in a stack, you need to take the difference between the right and left edges of its segment. Compare that with a strip plot containing the same information.

In [2]:
p = iqplot.strip(
    data=count_df.melt(id_vars="origin", value_name="count"),
    q="count",
    cats="origin",
    color_column="cyl",
    frame_width=500,
    show_legend=True,
    marker_kwargs=dict(size=10),
)

p.legend.title = "cylinders"

bokeh.io.show(p)

In this case, we can immediately read off the number of cars with the respective number of cylinders.

## Why no bar graphs?

I strongly prefer strip plots (with jitter) to box plots and ECDFs to histograms. Why? Because in strip plots and ECDFs, you are **plotting all of your data**. In practice, these are the only two types of visualizations I use for data with a categorical axis (though I'll sometimes overlay a jitter on a box plot to show some of the summary statistics).

A bar graph is the antithesis of plotting all of your data. You distill all of the information in the data set down to one or two summary statistics, and then use giant glyphs to show them. You should plot all of your data, so you shouldn't make bar graphs. iqplot will not help you practice bad plotting.

So why does iqplot have box-and-whisker plots? One may argue that it is nonetheless valuable to plot summary statistics, which is what box plots do. In that case, at least five summary statistics are plotted (the ends of the whiskers, the ends of the box, and the median). While this is still not plotting all of the data, it is still better than a dynamite graph (bar graph with error bars), which shows at most three summary statistics (height of bar, and lower and upper bound of confidence interval). But still, why does iqplot enable box plots, but not bar graphs?

The answer is that there are many ways to specify the summary statistic used in bar graphs. We could choose the height of the bar to be the mean of the data and the error bars to have a length given by the standard error of the mean. We could have the height of the bar be the median and the error bars be a possibly asymmetric confidence interval obtained by bootstrapping the median. And there are many more possibilities.
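To make the ambiguity concrete, here is a minimal sketch that computes the two specifications mentioned above for the same data. The measurements are made up for illustration, and the bootstrap uses a simple percentile interval with an arbitrarily chosen number of replicates; only the standard library is used so that it runs without extra dependencies.

```python
import random
import statistics

# Hypothetical, skewed measurements (made up for illustration)
data = [1.2, 1.5, 1.7, 2.0, 2.1, 2.4, 3.0, 4.8, 9.5]

# Specification 1: bar height is the mean, error bar is the
# standard error of the mean
mean = statistics.mean(data)
sem = statistics.stdev(data) / len(data) ** 0.5

# Specification 2: bar height is the median, error bar is a
# (possibly asymmetric) bootstrap percentile confidence interval
rng = random.Random(3252)  # fixed seed for reproducibility
n_reps = 2000  # arbitrary choice for this sketch
boot_medians = sorted(
    statistics.median(rng.choices(data, k=len(data)))
    for _ in range(n_reps)
)
median = statistics.median(data)
ci_low = boot_medians[int(0.025 * n_reps)]
ci_high = boot_medians[int(0.975 * n_reps) - 1]

print(f"mean +/- SEM:    {mean:.2f} +/- {sem:.2f}")
print(f"median (95% CI): {median:.2f} ({ci_low:.2f}, {ci_high:.2f})")
```

For skewed data like these, the two specifications give bars of different heights with differently shaped error bars, so a reader cannot interpret the bar graph without knowing which choice was made.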

In contrast, if we stick to the widely used (almost universally used, as far as I can tell) Tukey specification of a box-and-whisker plot, we are only plotting *percentiles* of the data. These assume no underlying statistical model, so the plots are unambiguous.
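As a rough illustration, the values drawn in a Tukey-style box plot can be computed without any model assumptions. This sketch is not iqplot's implementation. It assumes the common convention that whiskers extend to the most extreme data points within 1.5 times the interquartile range of the box edges, and it assumes percentiles are computed by linear interpolation. The function names and the data are hypothetical.

```python
def percentile(sorted_data, pct):
    """Percentile by linear interpolation between order statistics."""
    pos = (len(sorted_data) - 1) * pct / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_data) - 1)
    return sorted_data[lo] + (pos - lo) * (sorted_data[hi] - sorted_data[lo])


def tukey_box_stats(data):
    """Values drawn in a Tukey-style box-and-whisker plot."""
    x = sorted(data)
    q1, median, q3 = (percentile(x, p) for p in (25, 50, 75))
    iqr = q3 - q1

    # Whiskers reach the most extreme points within 1.5 IQR of the box
    inside = [v for v in x if q1 - 1.5 * iqr <= v <= q3 + 1.5 * iqr]

    return {
        "whisker_low": inside[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_high": inside[-1],
        "outliers": [v for v in x if v < inside[0] or v > inside[-1]],
    }


# Hypothetical measurements (made up for illustration)
data = [1.2, 1.5, 1.7, 2.0, 2.1, 2.4, 3.0, 4.8, 9.5]
box_stats = tukey_box_stats(data)
print(box_stats)
```

Every quantity here is read directly off the sorted data, which is why two people following the same specification will draw the same box plot.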

## Why no violin plots?

Similarly to histograms, [violin plots](https://en.wikipedia.org/wiki/Violin_plot) are a way to visualize the probability density function (pdf) of a quantitative variable. Violin plots accomplish this using [kernel density estimation](https://en.wikipedia.org/wiki/Kernel_density_estimation) (KDE), a procedure that constructs a smooth function to approximate the pdf of a random variable. Just as binning must be specified for a histogram, the kernel and its bandwidth must be specified to compute a KDE. So, like histograms, violin plots require specification of arbitrary parameters.

The two big shortcomings of histograms are:

1. They break the rule of plotting all of your data.
2. The choice of binning is arbitrary.

Violin plots suffer from both of these shortcomings and have the additional complication that they assign nonzero density to values beyond the extremes of the measured data, even into unphysical territory.
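Both issues can be seen in a small hand-rolled Gaussian KDE. This is a sketch under stated assumptions: the kernel is Gaussian, the bandwidths are arbitrary choices, and the strictly positive measurements are made up. The function `gaussian_kde` is defined here for illustration and is not part of iqplot.

```python
import math


def gaussian_kde(data, bandwidth):
    """Return a function that evaluates a Gaussian KDE at a point x."""
    norm = len(data) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        total = sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )
        return total / norm

    return density


# Hypothetical strictly positive measurements (made up for illustration)
data = [0.2, 0.4, 0.5, 0.9, 1.3, 2.0, 3.1]

for bandwidth in (0.1, 0.5, 1.0):
    density = gaussian_kde(data, bandwidth)
    # Evaluate at an unphysical negative value and at a gap in the data
    at_negative = density(-0.2)
    at_gap = density(1.6)
    print(f"bandwidth={bandwidth}: f(-0.2)={at_negative:.4f}, f(1.6)={at_gap:.4f}")
```

The estimated density at the negative value is nonzero for every bandwidth, even though no measurement is negative, and the density in the gap between data points changes substantially with the bandwidth.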

I view histograms as an auxiliary feature of iqplot to visualize pdfs, with ECDFs being far more powerful for visualizing distributions. As such, I did not extend the functionality to include another pdf visualizer which is, in my view, not any better.
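For comparison, an ECDF needs no binning, kernel, or bandwidth: each measurement contributes exactly one point. The following is a minimal sketch of the computation, not iqplot's implementation, with a hypothetical helper `ecdf` and made-up data.

```python
def ecdf(data):
    """Return sorted values and ECDF heights, one point per measurement."""
    x = sorted(data)
    n = len(x)
    # The ECDF at the i-th smallest value is the fraction of
    # measurements less than or equal to it
    y = [(i + 1) / n for i in range(n)]
    return x, y


# Hypothetical measurements (made up for illustration)
data = [2.1, 1.5, 9.5, 1.2, 3.0]
x, y = ecdf(data)

for xi, yi in zip(x, y):
    print(f"{xi:4.1f}  {yi:.1f}")
```

Because there are no parameters to choose, the same data always produce the same ECDF.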

For this reason, I do not include any other KDE-based plots, such as ridgeline plots.

## Why no rug plots?

I just complained about the shortcomings of histograms, so why don't I address their violation of the "plot all your data" rule by allowing for [rug plots](https://en.wikipedia.org/wiki/Rug_plot) with histograms? That's a good question, and it is a feature worth adding, but it is not implemented yet.

## Why no extended box plots?

The box in a box-and-whisker plot contains the middle two quartiles of the quantitative data. One can add more boxes containing different percentile ranges, and such plots are called extended box plots. I did not include this functionality because I view box plots as annotations of well-defined, visually interpretable summary statistics. Extending the box plots becomes challenging because annotation or textual description of the edges of all of the boxes is necessary. If more percentiles are needed, they may be added to a plot, e.g., with dashes. Here is an example where we want to add the 10th and 90th percentiles of the data in red.

In [3]:
# Make a box plot
p = iqplot.box(
    data=df,
    q="mpg",
    cats="origin",
    box_kwargs=dict(line_color="gray", fill_alpha=0),
    median_kwargs=dict(line_color="gray"),
    display_points=False,
    frame_width=500,
)

# Overlay a jitter plot
p = iqplot.strip(
    data=df,
    q="mpg",
    cats="origin",
    p=p,
    jitter=True,
    marker_kwargs=dict(alpha=0.5),
    tooltips=[("year", "@yr"), ("model", "@name")],
)

# Add 10th and 90th percentiles
df_10 = df.groupby("origin")["mpg"].quantile(0.1).reset_index()
df_90 = df.groupby("origin")["mpg"].quantile(0.9).reset_index()
p.dash(
    source=df_10,
    x="mpg",
    y="origin",
    angle=np.pi / 2,
    color="tomato",
    size=40,
    line_width=2,
)
p.dash(
    source=df_90,
    x="mpg",
    y="origin",
    angle=np.pi / 2,
    color="tomato",
    size=40,
    line_width=2,
)

bokeh.io.show(p)