# Basic usage

Bokeh-catplot is imported as `bokeh_catplot`. As with any Bokeh plot, displaying plots in a notebook also requires importing `bokeh.io` and calling `bokeh.io.output_notebook()`. Finally, we will use the automobile fuel efficiency sample data set that is included in Bokeh to demonstrate the usage of Bokeh-catplot.

In [1]:
import bokeh_catplot

import bokeh.sampledata.autompg

import bokeh.io
import bokeh.layouts
import bokeh.palettes
bokeh.io.output_notebook()

So that we understand the data set, we first take a look at it.

In [2]:
df = bokeh.sampledata.autompg.autompg_clean

df.head()

| | mpg | cyl | displ | hp | weight | accel | yr | origin | name | mfr |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | North America | chevrolet chevelle malibu | chevrolet |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | North America | buick skylark 320 | buick |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | North America | plymouth satellite | plymouth |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | North America | amc rebel sst | amc |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | North America | ford torino | ford |


Importantly, this data set is [tidy](https://en.wikipedia.org/wiki/Tidy_data); each row represents a single observation and each column a variable associated with an observation. Bokeh-catplot assumes that any inputted data frame is in tidy format. In the fuel efficiency example, the columns differ in character. For example, `'mpg'` contains a quantitative measurement of the miles per gallon of each car. The `'origin'` column is **categorical** in the sense that it is not quantitative, but is a descriptor of the automobile that takes on a few discrete values.
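
To make the notion of tidy data concrete, here is a minimal sketch that reshapes a small wide table (one column per origin) into tidy records with one observation per row. The helper name `to_tidy` and the Europe and Asia values are made up for illustration and are not part of Bokeh-catplot; with Pandas, `DataFrame.melt()` performs the same kind of reshaping.

```python
# Hypothetical wide-format data: one column per origin, values are mpg.
wide = {
    "North America": [18.0, 15.0],
    "Europe": [26.0, 25.0],
    "Asia": [24.0, 27.0],
}


def to_tidy(wide, cat_name, val_name):
    """Return one record per observation: a categorical column and a value column."""
    return [
        {cat_name: category, val_name: value}
        for category, values in wide.items()
        for value in values
    ]


tidy = to_tidy(wide, cat_name="origin", val_name="mpg")

for row in tidy:
    print(row)
```

Each resulting record has exactly one quantitative entry (`'mpg'`) and one categorical entry (`'origin'`), which is the layout the plotting functions expect.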

## Quick start

Bokeh-catplot generates plots from tidy data frames where some columns may contain categorical data and the column of interest in the plot is quantitative. We call this "one quantitative/*n* categorical," or "1QNC."

There are four types of plots that Bokeh-catplot generates.

- **Plots with a categorical axis**
    + Box plots: `bokeh_catplot.box()`
    + Strip plots: `bokeh_catplot.strip()`
- **Plots without a categorical axis**
    + Histograms: `bokeh_catplot.histogram()`
    + [ECDFs](https://en.wikipedia.org/wiki/Empirical_distribution_function): `bokeh_catplot.ecdf()`

If you are unfamiliar with ECDFs, they are discussed [below](#ECDFs).

The first seven arguments are the same for all plots. They are:

- `data`: A tidy data frame
- `val`: The column of the data frame to be treated as the quantitative variable.
- `cats`: A list of columns in the data frame that are to be considered as categorical variables in the plot. If `None`, a single box, strip, histogram, or ECDF is plotted.
- `val_axis`: The axis, *x* or *y*, along which the quantitative variable varies. The default is `'x'`.
- `palette`: A list of hex colors to use for coloring the markers for each category. By default, it uses the Glasbey Category 10 color palette from [colorcet](https://colorcet.holoviz.org/).
- `order`: If specified, the ordering of the categories to use on the categorical axis and legend (if applicable). Otherwise, the order of the inputted data frame is used.
- `p`: If specified, the `bokeh.plotting.Figure` object to use for the plot. If not specified, a new figure is created.

If `data` is given as a Numpy array, it is the only required argument. If `data` is given as a Pandas DataFrame, `val` must also be supplied. All other arguments are optional and have reasonably set defaults.

The respective plots also have kwargs that are specific to them. Examples highlighting some, but not all, customizations are in the following sections.

Any extra kwargs not in the function call signature are passed to `bokeh.plotting.figure()` when the figure is instantiated.

Here are the four default plots for `cats = 'origin'` and `val = 'mpg'`.

In [3]:
p_box = bokeh_catplot.box(data=df, val='mpg', cats='origin', title='box')
p_strip = bokeh_catplot.strip(data=df, val='mpg', cats='origin', title='strip')
p_histogram = bokeh_catplot.histogram(data=df, val='mpg', cats='origin', title='histogram')
p_ecdf = bokeh_catplot.ecdf(data=df, val='mpg', cats='origin', title='ecdf')

bokeh.io.show(
    bokeh.layouts.gridplot([p_box, p_strip, p_histogram, p_ecdf], ncols=1)
)

## Plots with a single data set

You can also generate plots from a single Numpy array without specifying categories and values. Note that when `data` is specified as a Numpy array, the string used for the `val` argument is used as the axis label.

In [4]:
# MPG data for all cars as Numpy array
data = df['mpg'].values

p_box = bokeh_catplot.box(data=data, val='mpg', title='box')
p_strip = bokeh_catplot.strip(data=data, val='mpg', title='strip')
p_histogram = bokeh_catplot.histogram(data=data, val='mpg', title='histogram')
p_ecdf = bokeh_catplot.ecdf(data=data, val='mpg', title='ecdf')

bokeh.io.show(
    bokeh.layouts.gridplot([p_box, p_strip, p_histogram, p_ecdf], ncols=1)
)

## Fine-tuning of plots

In the following, we investigate each of the four kinds of plots and explore some, but not all, of the configuration options. Refer to the API reference for details about possible keyword arguments.

### Box plots

We can also make vertical box plots by specifying `val_axis='y'`. We also demonstrate the `order` kwarg to specify the ordering of the categorical variables.

In [5]:
p = bokeh_catplot.box(
    data=df,
    val='mpg',
    cats='origin',
    val_axis='y',
    order=['Asia', 'Europe', 'North America']
)

bokeh.io.show(p)
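
For readers who want to see what a box plot summarizes, here is a minimal sketch using only the standard library. It assumes the common Tukey convention: the box spans the first to third quartile, the whiskers reach the most extreme points within 1.5 times the interquartile range of the box, and anything beyond is flagged as an outlier. The text here does not state which convention Bokeh-catplot uses, so treat that as an assumption; the function name `box_summary` and the sample values are hypothetical.

```python
import statistics


def box_summary(values, whisker_factor=1.5):
    """Quartiles, whisker ends, and outliers under the (assumed) Tukey convention."""
    data = sorted(values)
    # method="inclusive" interpolates linearly between data points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker_factor * iqr
    high_fence = q3 + whisker_factor * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    outliers = [x for x in data if x < low_fence or x > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Hypothetical mpg values with one unusually efficient car.
summary = box_summary([15.0, 16.0, 17.0, 18.0, 18.0, 19.0, 20.0, 46.0])
print(summary)
```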

We can independently specify properties of the marks using `box_kwargs`, `whisker_kwargs`, `median_kwargs`, and `outlier_kwargs`. For example, say we wanted our colors to be [Betancourt red](https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html#step_four:_build_a_generative_model19), and that we wanted the outliers to also be that color and use diamond glyphs. We can also put caps on the whiskers using `whisker_caps=True`.

In [6]:
p = bokeh_catplot.box(
    data=df,
    val='mpg',
    cats='origin',
    whisker_caps=True,
    outlier_marker='diamond',
    box_kwargs=dict(fill_color='#7C0000'),
    whisker_kwargs=dict(line_color='#7C0000', line_width=2),
)

bokeh.io.show(p)

We can have multiple categories by specifying `cats` as a list. We will also specify a custom palette.

In [7]:
bkp = bokeh.palettes.d3['Category20c'][20]
palette = bkp[:3] + bkp[4:7] + bkp[8:11]

p = bokeh_catplot.box(
    data=df,
    val='mpg',
    cats=['origin', 'cyl'],
    palette=palette,
    y_axis_label='# of cylinders',
)

p.yaxis.axis_label_text_font_style = 'bold'

bokeh.io.show(p)
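
As a rough sketch of what multiple categories mean for the plot, the snippet below groups observations by each unique combination of the categorical values, so that every `(origin, cyl)` pair gets its own group of `mpg` values. The records are hypothetical, and the grouping is only an illustration of the idea, not the library's internal code.

```python
from collections import defaultdict

# Hypothetical tidy records of (origin, cyl, mpg).
records = [
    ("North America", 8, 18.0),
    ("North America", 8, 15.0),
    ("North America", 6, 21.0),
    ("Europe", 4, 26.0),
    ("Asia", 4, 24.0),
    ("Asia", 4, 27.0),
]

# One group per unique combination of the two categorical variables.
groups = defaultdict(list)
for origin, cyl, mpg in records:
    groups[(origin, cyl)].append(mpg)

for key, values in groups.items():
    print(key, values)
```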

### Strip plots

We can make a strip plot with dash markers and add some transparency.

In [8]:
p = bokeh_catplot.strip(
    data=df,
    val='mpg',
    cats='origin',
    marker='dash',
    marker_kwargs=dict(alpha=0.3)
)

bokeh.io.show(p)

A problem with strip plots is that data points can overlap and obscure one another. A common remedy is to "jitter," that is, to place the glyphs with small random displacements along the categorical axis. We do that here, and also add hover tooltips that give more information about the respective data points.

In [9]:
p = bokeh_catplot.strip(
    data=df,
    val='mpg',
    cats='origin',
    jitter=True,
    marker_kwargs=dict(alpha=0.5),
    tooltips=[('year', '@yr'), ('model', '@name')],
    frame_width=500,
)

bokeh.io.show(p)

Note that in this plot, we used the `frame_width` kwarg to make the plot wider. Any kwargs that can be passed into `bokeh.plotting.figure()` can be used.
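
The jittering described above can be sketched in a few lines. This is only an illustration of the idea: each category is mapped to an integer slot on the categorical axis and a small uniform random offset is added. The uniform distribution, the `width` value, and the function name `jitter_positions` are assumptions; the text here does not say how Bokeh-catplot draws its displacements.

```python
import random


def jitter_positions(categories, order, width=0.3, seed=None):
    """Map each category to its integer slot plus a uniform offset in [-width, width]."""
    rng = random.Random(seed)
    slot = {cat: i for i, cat in enumerate(order)}
    return [slot[cat] + rng.uniform(-width, width) for cat in categories]


order = ["Asia", "Europe", "North America"]
cats = ["Asia", "Asia", "Europe", "North America", "North America"]
positions = jitter_positions(cats, order, width=0.3, seed=0)

for cat, pos in zip(cats, positions):
    print(f"{cat:>13s}: {pos:+.3f}")
```

Keeping `width` below 0.5 ensures that points from neighboring categories never cross into each other's slot.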

### Strip-box plots

Even while plotting all of the data, we sometimes want to graphically display summary statistics, in which case overlaying a box plot and a jittered strip plot is useful. To populate an existing Bokeh figure with new glyphs from another catplot, pass in the `p` kwarg. Be careful, though: the `val`, `cats`, and `val_axis` arguments must exactly match those used for the existing plot.

In [10]:
# Make a box plot
p = bokeh_catplot.box(
    data=df,
    val='mpg',
    cats='origin',
    box_kwargs=dict(line_color='gray', fill_alpha=0),
    median_kwargs=dict(line_color='gray'),
    display_points=False,
    frame_width=500,
)

# Overlay a jitter plot
p = bokeh_catplot.strip(
    data=df,
    val='mpg',
    cats='origin',
    p=p,
    jitter=True,
    marker_kwargs=dict(alpha=0.5),
    tooltips=[('year', '@yr'), ('model', '@name')]
)

bokeh.io.show(p)

### Histograms

We can plot normalized histograms using the `density` kwarg, and we'll make the plot a little wider to accommodate the legend.

In [11]:
p = bokeh_catplot.histogram(
    data=df,
    val='mpg',
    cats='origin',
    density=True,
    frame_width=550,
)

bokeh.io.show(p)
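
As a sketch of what normalization means here, the snippet below assumes the usual convention for density histograms (the one `numpy.histogram` follows with `density=True`): each bar height is the bin count divided by the total count times the bin width, so the bar areas sum to one. The function name `density_histogram`, the bin edges, and the sample values are hypothetical.

```python
def density_histogram(values, edges):
    """Return bar heights such that sum(height * bin_width) equals 1."""
    counts = [0] * (len(edges) - 1)
    for x in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            # Bins are half-open [left, right); the last bin also includes its right edge.
            if edges[i] <= x < edges[i + 1] or (is_last and x == edges[i + 1]):
                counts[i] += 1
                break
    n = sum(counts)
    return [c / (n * (edges[i + 1] - edges[i])) for i, c in enumerate(counts)]


mpg = [15.0, 16.0, 17.0, 18.0, 18.0, 24.0, 26.0, 27.0]  # hypothetical values
edges = [10.0, 20.0, 30.0]
heights = density_histogram(mpg, edges)
area = sum(h * (edges[i + 1] - edges[i]) for i, h in enumerate(heights))
print(heights, area)
```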

### ECDFs

An empirical cumulative distribution function, or ECDF, is a convenient way to visualize a univariate probability distribution. Consider a measurement *x* in a set of measurements *X*. The ECDF evaluated at *x* is defined as

> ECDF(*x*) = fraction of data points in *X* that are ≤ *x*.

By default, the ECDFs are plotted as dots, where the *y*-value of a given dot is the fraction of data points that are less than or equal to the corresponding *x*-value. We may instead wish to display ECDFs as staircases, as is also traditionally done. (Note, though, that in this case, we cannot have hover tooltips.) To do this, we use the `style='staircase'` kwarg.

In [12]:
p = bokeh_catplot.ecdf(
    data=df,
    val='mpg',
    cats='origin',
    style='staircase',
)

bokeh.io.show(p)
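
To make the ECDF definition concrete, here is a minimal sketch that evaluates it directly using the standard library. The function name `ecdf` is hypothetical and this is not the library's implementation; in particular, it says nothing about how tied data points are drawn in the plot.

```python
from bisect import bisect_right


def ecdf(values, x):
    """Fraction of data points in `values` that are less than or equal to x."""
    data = sorted(values)
    # bisect_right counts how many sorted entries are <= x.
    return bisect_right(data, x) / len(data)


mpg = [18.0, 15.0, 18.0, 16.0, 17.0]  # mpg of the first five rows shown earlier

for x in (10.0, 16.5, 18.0):
    print(x, ecdf(mpg, x))
```

The result is 0 below the smallest measurement, rises by 1/*n* at each data point, and reaches 1 at the largest measurement, which is exactly the staircase shape.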

We can also display empirical complementary cumulative distribution functions (ECCDFs) using the `complementary` kwarg.

> ECCDF(*x*) = 1 − ECDF(*x*)

In [13]:
p = bokeh_catplot.ecdf(
    data=df,
    val='mpg',
    cats='origin',
    complementary=True
)

bokeh.io.show(p)

Instead of plotting a separate ECDF for each category, we can put all of the categories together on one ECDF and color the points by the categorical variable by using the `kind='colored'` kwarg.

In [14]:
p = bokeh_catplot.ecdf(
    data=df,
    val='mpg',
    cats='origin',
    kind='colored'
)

bokeh.io.show(p)
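
The idea behind a single ECDF colored by category can be sketched as follows: all measurements are pooled and sorted regardless of category, each is given its height in the pooled ECDF, and the category label is kept only to choose the color. The observations are hypothetical and contain no ties, and the snippet is an illustration rather than the library's code.

```python
# Hypothetical (mpg, origin) observations.
observations = [
    (18.0, "North America"),
    (26.0, "Europe"),
    (24.0, "Asia"),
    (15.0, "North America"),
    (27.0, "Asia"),
]

# Sort all observations together; the i-th smallest value gets height i / n.
pooled = sorted(observations)
n = len(pooled)
colored = [(x, (i + 1) / n, origin) for i, (x, origin) in enumerate(pooled)]

for x, y, origin in colored:
    print(f"x={x:5.1f}  y={y:.1f}  color by: {origin}")
```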

We can also display confidence intervals for the ECDFs, acquired by bootstrapping, using the `conf_int` kwarg.

In [15]:
p = bokeh_catplot.ecdf(
    data=df,
    val='mpg',
    cats='origin',
    style='staircase',
    conf_int=True,
)

bokeh.io.show(p)
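
For intuition about bootstrapped confidence intervals, here is a minimal sketch of a percentile bootstrap for the ECDF at a single point. The number of replicates, the 95% level, the nearest-rank percentile lookup, and the function names are all assumptions made for illustration; the text here does not specify the settings Bokeh-catplot uses.

```python
import random
from bisect import bisect_right


def ecdf_at(sorted_data, x):
    """Fraction of entries in the sorted list that are <= x."""
    return bisect_right(sorted_data, x) / len(sorted_data)


def bootstrap_ecdf_conf_int(values, x, n_reps=1000, level=0.95, seed=None):
    """Percentile bootstrap confidence interval for ECDF(x)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample the data with replacement and recompute ECDF(x) each time.
    reps = sorted(
        ecdf_at(sorted(rng.choices(values, k=n)), x) for _ in range(n_reps)
    )
    tail = (1 - level) / 2
    # Nearest-rank approximations of the lower and upper percentiles.
    low = reps[int(tail * (n_reps - 1))]
    high = reps[int((1 - tail) * (n_reps - 1))]
    return low, high


mpg = [15.0, 16.0, 17.0, 18.0, 18.0, 24.0, 26.0, 27.0]  # hypothetical values
point = ecdf_at(sorted(mpg), 18.0)
low, high = bootstrap_ecdf_conf_int(mpg, x=18.0, seed=0)
print(f"ECDF(18) = {point:.3f}, approximate 95% interval: [{low:.3f}, {high:.3f}]")
```

Repeating this at every *x* of interest gives a band around the staircase like the one in the plot.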