Overhaul of categorical distribution plots #410

Merged
mwaskom merged 44 commits into master from new_boxish_plots on Jan 22, 2015

4 participants
@mwaskom
Owner

mwaskom commented Dec 29, 2014

TLDR: The boxplot and violinplot APIs are changing, for the better, but in a way that will be mildly disruptive. There is also a new function, stripplot.

There are some examples below, but to really see these functions in action, check out the new API docs, which take advantage of automated figure collection for docstring examples:

boxplot | violinplot | stripplot

Changes/enhancements to boxplot and violinplot

This PR updates and unifies the API for boxplot and violinplot. Both functions maintain backwards-compatibility in terms of the kind of data they accept, but the syntax has changed. These functions are now invoked with x, y parameters that are either vectors of data or names of variables in a long-form DataFrame passed to the new data parameter. You can still pass wide-form DataFrames or arrays to data, but it is no longer the first positional argument.

In other words, instead of doing

sns.boxplot(tips.total_bill, groupby=tips.day)

You would now do

sns.boxplot("day", "total_bill", data=tips)

seaborn-boxplot-2
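
To make the wide-form versus long-form distinction concrete, here is a minimal sketch in plain Python, without pandas. The sample values and the helper name wide_to_long are made up for illustration and are not part of seaborn. Wide-form keeps one column per group; long-form keeps one row per observation, with the group stored as its own variable, which is the shape the x, y, and data parameters work with.

```python
# Hypothetical sample data: one list of observations per group (wide-form).
wide = {
    "Thur": [27.2, 22.8, 17.4],
    "Fri": [28.9, 12.5],
}


def wide_to_long(wide, group_name, value_name):
    """Turn {group: [values]} into one record per observation (long-form)."""
    return [
        {group_name: group, value_name: value}
        for group, values in wide.items()
        for value in values
    ]


long_rows = wide_to_long(wide, "day", "total_bill")
for row in long_rows:
    print(row)
```

In the long-form result, "day" and "total_bill" play the role of the variable names that would be passed as x and y.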

Existing code that uses these functions will probably break, but it can be easily updated. I don't like this kind of disruptive API change, but in this case the new API has a lot of virtues, and creating a smoother upgrade path would have been too complicated to reasonably handle.

The upshot of this is that both functions now work seamlessly in the context of a FacetGrid. Additionally, by using named variables and a data object, it's much easier to apply transformations to the data in the body of the seaborn call. It also just generally decreases the cognitive overhead of remembering that the API for boxplot/violinplot is different from that for regplot and friends.

To sweeten this change, there are a variety of other enhancements (and a few other API breaks):

  • Added a hue argument to boxplot and violinplot, which allows nested grouping of the plot elements by a third categorical variable. For violinplot, this nesting can also be accomplished by splitting the violins when there are two levels of the hue variable. To make this functionality feasible, the ability to specify where the plots will be drawn in data coordinates has been removed. These plots are now drawn at set positions, like (and identical to) barplot and pointplot.
sns.violinplot("day", "total_bill", "smoker", data=tips, palette="Set1", split=True)

seaborn-violinplot-4

  • These plots now accept ordered categorical-type variables as input, and infer the orientation of the plot from which argument gets the category. Additionally, the order of the categories will determine the order of the plot elements:
sns.violinplot("orbital_period", "method", data=planets.query("orbital_period < 1000"))

seaborn-violinplot-10

  • Added a palette parameter to boxplot/violinplot. The color parameter still exists, but no longer does double-duty in accepting the name of a seaborn palette. palette supersedes color so that it can be used with a FacetGrid.
  • Added the scale and scale_hue parameters to violinplot. These control how the width of the violins is scaled. The default is scale="area", which is different from how the violins used to be drawn. Use scale="width" to get the old behavior. You can also use scale="count" to scale by the number of observations in each bin.
  • Used a different style for the box kind of interior plot in violinplot, which shows the whisker range in addition to the quartiles. Use inner='quartile' to get the old style.
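
The three scale options can be illustrated with a small sketch in plain Python. This is only a sketch of the idea, not seaborn's implementation: the function name scale_violins, the density curves, and the counts are all invented, and each curve is assumed to come from a KDE that integrates to 1 on a shared grid.

```python
def scale_violins(densities, counts, scale="area"):
    """Rescale density curves so no violin is wider than 1.

    densities: one list of density values per violin (assumed equal area).
    counts: number of observations behind each violin.
    """
    global_max = max(max(d) for d in densities)
    max_count = max(counts)
    scaled = []
    for density, n in zip(densities, counts):
        own_max = max(density)
        if scale == "area":
            # One shared divisor, so curves with equal area keep equal area.
            factor = 1 / global_max
        elif scale == "width":
            # Every violin reaches the full width.
            factor = 1 / own_max
        elif scale == "count":
            # Only the violin with the most observations reaches full width.
            factor = (n / max_count) / own_max
        else:
            raise ValueError(f"unknown scale: {scale!r}")
        scaled.append([v * factor for v in density])
    return scaled


# Hypothetical curves: a narrow, peaked group and a broad, flat one.
densities = [[0.0, 0.1, 0.8, 0.1, 0.0], [0.2, 0.2, 0.2, 0.2, 0.2]]
counts = [40, 20]

for mode in ("area", "width", "count"):
    widths = [max(d) for d in scale_violins(densities, counts, mode)]
    print(mode, [round(w, 2) for w in widths])
```

With equal-area scaling, the flat group ends up much narrower than the peaked one, which is why the new default can look quite different from the old equal-width drawing.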

New stripplot function

This PR also introduces the stripplot function, which draws a scatterplot where one of the variables is categorical. This plot has the same API as boxplot and violinplot. It is useful both on its own and when composed with one of these other plot kinds to show both the observations and underlying distribution.

sns.violinplot("total_bill", "day", data=tips, inner=None)
sns.stripplot("total_bill", "day", data=tips, jitter=True)

seaborn-stripplot-10
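
As a rough illustration of what jitter does, the sketch below places each observation at the integer slot of its category and adds a small random offset along the categorical axis. The helper name jittered_positions, the uniform distribution, and the width of 0.1 are assumptions chosen for the example, not seaborn's documented behavior.

```python
import random


def jittered_positions(categories, order, width=0.1, seed=0):
    """Map each observation's category to its integer slot plus a small offset."""
    rng = random.Random(seed)
    slots = {name: i for i, name in enumerate(order)}
    return [slots[c] + rng.uniform(-width, width) for c in categories]


days = ["Thur", "Fri", "Fri", "Sat", "Thur"]
positions = jittered_positions(days, order=["Thur", "Fri", "Sat"])
for day, pos in zip(days, positions):
    print(f"{day}: {pos:.3f}")
```

Without the offset, observations with similar values in the same category would sit on top of each other; the jitter only spreads them along the categorical axis and leaves the data values untouched.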

Backend details

For the aficionados, this PR involves a complete rewrite of the code for these functions. It's much better organized, abstracted, and tested. That means it will be easier to keep these functions on a common API going forward, and to add enhancements with more confidence that they won't lead to regressions.

Next up will probably be to bring the barplot/pointplot code into this framework, which in some ways is better and more robust than what those run on. Also coming soon.... swarmplot.

mwaskom added some commits Dec 27, 2014

Make infer_orient a regular method
This was only really a staticmethod for early testing convenience.
Implement much of new violinplot
This still needs a lot of cleaning up and testing. Most, but not all of it
is here.
More messy violinplot code
Trying to handle cases with 0 or 1 observations in a bin, but
not every option works currently.
Add high-level control over point aesthetics to stripplot
This commit also sucked in the new comments in the violinplot
kde estimation method
Add matplotlib plot_directive locally
Matplotlib 1.5 will introduce the 'close-figs' option to the plot directive
which will allow the kind of example style  I want to write in. I'm bringing
that file in here locally.

mwaskom added some commits Jan 19, 2015

Prune some plot tests
This was causing issues with the old version of matplotlib so I am just
killing it for now.

@mwaskom mwaskom changed the title from WIP: Overhaul of boxplot-like plots to WIP: Overhaul of categorical distribution plots Jan 19, 2015

@mwaskom mwaskom changed the title from WIP: Overhaul of categorical distribution plots to Overhaul of categorical distribution plots Jan 19, 2015

Owner

mwaskom commented Jan 19, 2015

This is mostly done from my perspective (modulo #423), but I'm looking for testers to try it out on some real data and find any weird corner-cases before I merge.

Contributor

phobson commented Jan 19, 2015

Due to the repeated values, I wouldn't be surprised if this was supposed to fail. If that's the case, should it fail more gracefully? Difficult to debug presently.

from io import StringIO

import numpy as np
import pandas
import matplotlib.pyplot as plt
import seaborn

strfile = """\
category,epazone,parameter,station,qual,res
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",outflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",outflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",outflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",outflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",outflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",outflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",outflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",outflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",outflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",outflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",outflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",outflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.000000094994903
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",outflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",outflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.000000094994903
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",inflow,ND,2.0
Wetland Basin,4,"Lead, Dissolved",outflow,ND,2.0
Wetland Basin,6,"Lead, Dissolved",inflow,ND,0.5
Wetland Basin,6,"Lead, Dissolved",outflow,ND,0.5
Wetland Basin,6,"Lead, Dissolved",inflow,=,0.8199999928474426
Wetland Basin,6,"Lead, Dissolved",outflow,=,2.5999999046325684
Wetland Basin,6,"Lead, Dissolved",inflow,=,1.5199999809265137
Wetland Basin,6,"Lead, Dissolved",outflow,=,7.449999809265137
Wetland Basin,6,"Lead, Dissolved",inflow,ND,0.5
Wetland Basin,6,"Lead, Dissolved",outflow,ND,0.5
Wetland Basin,6,"Lead, Dissolved",inflow,=,15.899999618530273
Wetland Basin,6,"Lead, Dissolved",outflow,=,2.190000057220459
Wetland Basin,6,"Lead, Dissolved",inflow,=,3.5899999141693115
Wetland Basin,6,"Lead, Dissolved",outflow,=,6.840000152587891
Wetland Basin,6,"Lead, Dissolved",inflow,=,1.5199999809265137
Wetland Basin,6,"Lead, Dissolved",outflow,=,3.9600000381469727
Wetland Basin,6,"Lead, Dissolved",inflow,=,1.5399999618530273
Wetland Basin,6,"Lead, Dissolved",outflow,=,3.2200000286102295
Wetland Basin,7,"Lead, Dissolved",inflow,=,0.5600000023841858
Wetland Basin,7,"Lead, Dissolved",outflow,=,2.0899999141693115
Wetland Basin,7,"Lead, Dissolved",inflow,=,0.46000000834465027
Wetland Basin,7,"Lead, Dissolved",outflow,=,0.550000011920929
Wetland Basin,7,"Lead, Dissolved",inflow,=,0.7799999713897705
Wetland Basin,7,"Lead, Dissolved",outflow,=,0.5899999737739563
Wetland Basin,7,"Lead, Dissolved",inflow,=,0.3700000047683716
Wetland Basin,7,"Lead, Dissolved",outflow,=,0.27000001072883606
Wetland Basin,7,"Lead, Dissolved",inflow,=,0.23999999463558197
Wetland Basin,7,"Lead, Dissolved",outflow,=,0.3700000047683716
Wetland Basin,7,"Lead, Dissolved",inflow,=,0.6399999856948853
Wetland Basin,7,"Lead, Dissolved",outflow,=,0.8399999737739563
Wetland Basin,7,"Lead, Dissolved",inflow,=,0.5699999928474426
Wetland Basin,7,"Lead, Dissolved",outflow,=,1.2899999618530273
"""
df = pandas.read_csv(StringIO(strfile))
df['logres'] = np.log(df['res'])
fig, ax = plt.subplots()
seaborn.violinplot(x='epazone', y='logres', hue='station', data=df, ax=ax, split=True)

Again, this is all very awesome.
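
The thread does not record the actual traceback or the fix, so the following is only a guess at why repeated values are awkward here. In the data above, every epazone 4 outflow reading is exactly 2.0. Assuming the violin density comes from a Gaussian KDE with a rule-of-thumb bandwidth (Scott's rule is used below as a stand-in), a group with no spread yields a bandwidth of zero, which a KDE cannot work with. The guard function can_fit_kde is hypothetical.

```python
import math
import statistics


def scott_bandwidth(values):
    """Rule-of-thumb 1-D KDE bandwidth: sample std times n ** (-1/5)."""
    return statistics.stdev(values) * len(values) ** (-1 / 5)


def can_fit_kde(values):
    """Hypothetical guard: a smooth density needs some spread in the data."""
    return len(values) > 1 and len(set(values)) > 1


# Zone 4 outflow from the data above: all 15 readings are exactly 2.0.
zone4_outflow = [math.log(2.0)] * 15
# Zone 7 outflow readings (rounded), which do vary.
zone7_outflow = [math.log(v) for v in (2.09, 0.55, 0.59, 0.27, 0.37, 0.84, 1.29)]

for name, group in [("zone 4", zone4_outflow), ("zone 7", zone7_outflow)]:
    print(name, "bandwidth:", scott_bandwidth(group), "fit KDE:", can_fit_kde(group))
```

A check along these lines would let a plotting function fall back to drawing a single line or point for a degenerate group instead of failing deep inside the density estimate.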

Owner

mwaskom commented Jan 19, 2015

Ding ding, we have a winner. Fixed with bb8af70

Owner

mwaskom commented Jan 19, 2015

Your dataset is also doing some weird things with the area scaling when I remove the hue nesting, but I'm not entirely sure it's a "bug" or what the right way to fix it would be.

Contributor

phobson commented Jan 19, 2015

👏 that fixed it all on my end with all 361 datasets just like that one:
total suspended solids_bioretention

@shoyer shoyer referenced this pull request in pandas-dev/pandas Jan 21, 2015

Closed

Feature request: Categorical plotting #9069

mwaskom added a commit that referenced this pull request Jan 22, 2015

Merge pull request #410 from mwaskom/new_boxish_plots
Overhaul of categorical distribution plots

@mwaskom mwaskom merged commit ee59253 into master Jan 22, 2015

1 check passed

continuous-integration/travis-ci The Travis CI build passed

@mwaskom mwaskom deleted the new_boxish_plots branch Jan 22, 2015

@mwaskom mwaskom referenced this pull request Mar 9, 2015

Merged

Unify categorical plots #466

3 of 4 tasks complete

sjobeek commented Mar 13, 2015

Ohhhhh man, @mwaskom you are my hero. Can't wait to play around with these.

Categorical, horizontal violinplots from long dataframes, compatible with FacetGrid... yum.

Phlya commented May 21, 2015

Not sure if this is the right place for the comment, but I just want to say that combining e.g. violinplot with stripplot when providing a hue argument causes each hue to be repeated in the legend: once for the violins and once for the points of the stripplot. Not a big deal, but I think a way to not add the points to the legend would be a good idea. If they overlay the violins and have the same colour, mentioning them in the legend is not really necessary.
An example. Yes, it still looks OK, but with more hues it would be more cluttered, I think.
figure_1

P.S.
It's absolutely awesome that such complex and good-looking plots can be produced with just 2 lines of code! Great job, @mwaskom and everyone else contributing!
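
One possible workaround for the repeated legend entries is to filter them by hand. The sketch below assumes the handles and labels would come from matplotlib's ax.get_legend_handles_labels() and that the filtered result would be passed back to ax.legend(handles, labels). The helper name unique_legend_entries is hypothetical, and plain strings stand in for the matplotlib artists.

```python
def unique_legend_entries(handles, labels):
    """Keep the first handle seen for each label, preserving order."""
    seen = {}
    for handle, label in zip(handles, labels):
        seen.setdefault(label, handle)
    return list(seen.values()), list(seen.keys())


# Stand-in strings; with matplotlib these would be artist objects.
handles = ["violin-yes", "violin-no", "points-yes", "points-no"]
labels = ["Yes", "No", "Yes", "No"]

new_handles, new_labels = unique_legend_entries(handles, labels)
print(new_handles, new_labels)
```

Because the violins are drawn first in the two-line example, keeping the first handle per label retains the violin entries and drops the duplicate point entries.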

@pyup-bot pyup-bot referenced this pull request in mayou36/raredecay May 18, 2017

Closed

Pin seaborn to latest version 0.7.1 #10

@pyup-bot pyup-bot referenced this pull request in tnir/pandas Feb 2, 2018

Open

Pin seaborn to latest version 0.8.1 #14
