Seaborn should respect categorical order when sorting pd.Categorical objects #361

Closed
shoyer opened this issue Nov 11, 2014 · 48 comments · Fixed by #548

Comments

@shoyer
Contributor

shoyer commented Nov 11, 2014

For example, this adaptation of the "Grouped boxplots" example should work (if using pandas 0.15 or higher) even without specifying x_order:

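import pandas as pd
import seaborn as sns
sns.set(style="ticks")

tips = sns.load_dataset("tips")
days = ["Thur", "Fri", "Sat", "Sun"]
# Store the categorical on the column that is actually plotted ("day")
tips["day"] = pd.Categorical(tips["day"], categories=days)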

g = sns.factorplot("day", "total_bill", "sex", tips, kind="box",
                   palette="PRGn", aspect=1.25)
g.despine(offset=10, trim=True)
g.set_axis_labels("Day", "Total Bill")

If you are using a pandas method to do the sorting, then this is a pandas bug.

@wrobstory

I think it should probably respect the index order by default, e.g. if I have a sorted-ascending DataFrame, I should have a sorted-ascending bar chart.

@shoyer
Contributor Author

shoyer commented Nov 11, 2014

Generally I agree, but I was assuming @mwaskom had his reasons. If this is just a side effect of the fact that he's using np.unique, he should try the pandas unique method instead, which does not sort (and is also a little faster).
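
The difference between the two calls can be seen on a small Series. This is only a minimal sketch; the sample values are made up for illustration and are not taken from the seaborn code base.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: weekday labels in a non-alphabetical order.
s = pd.Series(["Thur", "Fri", "Sat", "Thur", "Sun", "Fri"])

# np.unique always returns its result sorted.
sorted_levels = list(np.unique(s))

# Series.unique returns values in order of first appearance.
appearance_levels = list(s.unique())

print(sorted_levels)
print(appearance_levels)
```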


@mwaskom
Owner

mwaskom commented Nov 12, 2014

I definitely agree with @shoyer about categorical types. I'm not sure I feel all that strongly about alphabetical sorting (though to @wrobstory's point, without having any way to know whether a column is intentionally sorted, it seemed best for the default to do something predictable). But it sounds easy to just change np.unique(Series) to Series.unique() and pick up the main goal here (categorical columns) and maybe also reduce some surprise.

To go along with this I think the test datasets should be updated to use Categorical where appropriate to demonstrate the functionality.

@mwaskom
Owner

mwaskom commented Nov 14, 2014

I'm in favor of this, but I'm gonna kick it to the 0.6 cycle because it will take a little thinking. I want to figure out the best way to make this behavior as consistent as possible across the package.

@olgabot
Contributor

olgabot commented Dec 4, 2014

Just wanted to chime in here in support of sticking to the categorical order, for example here:

image

The alphabetical sort doesn't make sense because then the length bins are out of order, putting the bins with 0's and 5's first, rather than the natural size ordering.

@shoyer
Contributor Author

shoyer commented Dec 4, 2014

@olgabot The sorting order for bins will also be fixed upstream when I finish adding an Interval type to pandas (which will sort properly), instead of using string labels.

@yoshiserry

Hi all, is there a workaround for this currently? I love the power of the col= parameter, where you can create a graph for every value of a column, but I want to be able to plot Jan, Feb, Mar in order.

@shoyer
Contributor Author

shoyer commented Dec 5, 2014

@yoshiserry Yes, use the col_order argument.
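
One way to build the list passed as col_order is sketched below. The DataFrame, its column names, and the month_order variable are hypothetical; only the col_order argument itself comes from the thread.

```python
import pandas as pd

# Hypothetical data with month labels that would sort badly alphabetically.
df = pd.DataFrame({
    "month": ["Mar", "Jan", "Feb", "Jan", "Mar"],
    "sales": [3, 1, 2, 4, 5],
})

calendar_months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                   "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Keep calendar order, but only for months that occur in the data.
present = set(df["month"])
month_order = [m for m in calendar_months if m in present]
print(month_order)

# The list can then be handed to seaborn, for example:
#   sns.FacetGrid(df, col="month", col_order=month_order)
```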

@jseabold
Contributor

Do the changes in #409 look reasonable?

It provides a compatibility function that uses pandas sort and unique, which 1) handles NA in a consistent manner, because np.sort will fail for object-dtype Series with NA, and 2) handles ordered and unordered Categoricals. If an ordered factor/Categorical is passed in, then it sorts on that order. If not, it does a lexicographical sort because of how unique works.

I want to get some feedback before adding a few more tests and touching other parts of the code base. Should this approach replace all the np.sort calls right now? I don't have a good sense of whether this would ever not be desired.

@mwaskom
Owner

mwaskom commented Dec 26, 2014

Thanks for taking a crack at it.

The open issue is really whether things should be lexicographically sorted or in the order that they appear in the dataframe (so just what the straight pandas .unique() method returns). I think that @wrobstory finding the sorting surprising was what originally motivated this chain of issues.

I originally had a mild preference for consistent behavior (sorts are at least predictable once you expect them), but if it requires a fair amount of complexity to make sorting work correctly across a range of pandas versions, it's possible that might weigh in favor of not sorting.

@mwaskom
Owner

mwaskom commented Dec 26, 2014

I'd been letting this issue fester as it required a hard decision so thanks for poking at it :]

@wrobstory

Yep- I would not expect seaborn to sort the data for me unless explicitly asked to do so. I think there are lots of cases where I've already munged the dataframe to get the exact ordering I want, and expect it to be plotted 1:1.

@jseabold
Contributor

I think that's probably fair in some cases, though at first glance I don't really see how a sorted index would affect the order in, e.g., a boxplot. Would it? Is the suggestion to keep the order of the first instances of each level in a factor over observations? This would require some serious doing to work around behavior of unique no?

These issues are kind of unrelated though. The status quo right now is to sort. This should be done correctly. Then there's the question of whether or not to sort at all by default and in which cases it makes sense not to, right?

@mwaskom
Owner

mwaskom commented Dec 26, 2014

Is the suggestion to keep the order of the first instances of each level in a factor over observations? This would require some serious doing to work around behavior of unique no?

I don't think I follow, the default behavior of Series.unique() differs from np.unique() in that it doesn't sort:

In [23]: pd.Series(["foo", "bar", "buz"]).unique()
Out[23]: array(['foo', 'bar', 'buz'], dtype=object)

These issues are kind of unrelated though. The status quo right now is to sort. This should be done correctly.

Sure, but I definitely want to deal with this for the 0.6 release, so it doesn't make sense for you to put a lot of work into a good solution to preserve the status quo if it's just gonna get stripped out for a bunch of simpler code that just calls .unique().

@jseabold
Contributor

Oh ok. That's news to me. Should have checked my priors.

@jseabold
Contributor

So concrete steps for #409. Change np.sort to the pandas compatibility sort and preserve np.unique vs. pandas.unique? I think this preserves the status quo and makes sort work as you'd expect it to. I'd prefer to punt on sorting vs. index preservation because I don't have my head around the current code base well enough.

@shoyer
Contributor Author

shoyer commented Dec 29, 2014

Now that we have ordered categoricals in pandas, I think automatically sorting would be OK for Seaborn. But generally I would agree with @wrobstory that respecting input order is less surprising. It's also certainly much less awkward to manually sort a column with pandas if desired than to tell Seaborn not to sort. So I'm +0 for .unique() rather than sorting.

@mwaskom
Owner

mwaskom commented Jan 21, 2015

The strange behavior of unordered pandas categoricals sort of defeats the utility of relying on .unique(), @shoyer, cf this issue thread: pandas-dev/pandas#9148

@jankatins

IMO if categorical variables (dtype "category") are to be plotted, the categories should be used directly instead of unique(). E.g. if I plot a Likert scale via bar plots, I expect the complete scale to be shown, not only the bars for categories which are used. That was also, once upon a time, the reasoning for why unique returned (all) categories in the order they were specified (but this was changed in pandas-dev/pandas#8559 (comment) to only return unique values because that's what the contract on unique said...).

@mwaskom
Owner

mwaskom commented Jan 22, 2015

Interesting, I think that's a reasonable point @JanSchulz

@mwaskom
Owner

mwaskom commented Jan 22, 2015

So to make explicit what we want to happen on a categorical axis:

  • For objects with ordered categorical datatype, show all categories in the correct category order
  • For objects with unordered categorical datatype, show all categories in the order they appear in the Series
  • Otherwise, show all unique values in the order they appear in the Series/array/list

Also this has to happen by inspecting the object attributes, not with any special pandas functions, because seaborn has to run on pandas < 0.15.

Does that sound right? If so, smarter pandas folk, what is the cleanest way to go about doing this?

@shoyer
Contributor Author

shoyer commented Jan 22, 2015

@mwaskom Here is my suggested implementation:

import numpy as np
import pandas as pd

def get_categories(values):
    if hasattr(values, 'categories'):
        # values is a pd.Categorical
        return np.asarray(values.categories)
    else:
        return pd.unique(values)

This satisfies your conditions 1 and 3, but not 2: unordered Categoricals will still display values in the order of the categories. It's definitely possible to fix that case, but it's also trickier and the order is somewhat ambiguous if not all categories appear in the data (I guess those could go to the end?). Might not be worth worrying about.
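
A short usage sketch of the suggested function follows. The Likert-style labels are hypothetical sample data, and the function body is repeated only so the block runs on its own.

```python
import numpy as np
import pandas as pd

def get_categories(values):
    # Same logic as the suggestion above, repeated for self-containment.
    if hasattr(values, "categories"):
        return np.asarray(values.categories)
    return pd.unique(values)

# Hypothetical ordered scale where "disagree" never occurs in the data.
scale = ["disagree", "neutral", "agree"]
likert = pd.Categorical(["agree", "agree", "neutral"],
                        categories=scale, ordered=True)

# Categorical input: every category is returned, in category order.
categorical_levels = list(get_categories(likert))

# Plain (non-categorical) input: unique values in order of appearance.
plain_levels = list(get_categories(pd.Series(["b", "a", "b"])))

print(categorical_levels)
print(plain_levels)
```

Note that a Series with dtype "category" has no .categories attribute of its own (it lives under .cat), so with this sketch such a Series falls through to pd.unique.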

@mwaskom
Owner

mwaskom commented Jan 22, 2015

I guess it would be good to be consistent with Pandas, but I think it's better to use the DataFrame order and drop categories with no observations than the reverse (exactly for the reason you say: where to put them is undefined).

@shoyer
Contributor Author

shoyer commented Jan 23, 2015

So I've been playing with scatter plots of categorical numeric data today (pretty easy when combined with hue on FacetGrid), and I do agree pretty strongly with @JanSchulz that all categories should be plotted.

Here's a synthetic version of my data:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def cut_diverging(array, n=9):
    mag = max(-array.min(), array.max())
    return pd.cut(array, np.linspace(-mag, mag, num=(n + 1)))

rs = np.random.RandomState(0)
df = pd.DataFrame({'x': rs.rand(100),
                   'y': rs.rand(100),
                   'z': 1 + rs.randn(100)})
df['z_cat'] = cut_diverging(df.z, 9)

categories = df.z_cat.cat.categories
palette = sns.color_palette('RdBu_r', 9)

g = sns.FacetGrid(df, hue='z_cat', hue_order=categories,
                  palette=palette, aspect=1.3, size=3)
g.map(plt.scatter, 'x', 'y', s=50)
g.add_legend()

image

Two issues are evident in this plot:

  1. The labels in the legend are wrong (notice that the distribution should be mostly positive numbers). This is because I'm using labels in hue_order that don't appear in the data. (This should probably fail more loudly, but that's a separate issue.)
  2. Even though I went to the trouble of obtaining centered divergent categories, the color map is not centered because not every category is found.

Plotting all categories (but not bothering to order them) would solve each of these problems.

As a side note, perhaps a utility function like cut_diverging belongs in seaborn?
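
The gap between "all categories" and "categories found in the data" can be checked without plotting. The sketch below reuses cut_diverging from the comment above on a small, hypothetical all-positive array, so the negative bins stay empty.

```python
import numpy as np
import pandas as pd

def cut_diverging(array, n=9):
    # Same helper as above: n bins placed symmetrically around zero.
    mag = max(-array.min(), array.max())
    return pd.cut(array, np.linspace(-mag, mag, num=(n + 1)))

# Hypothetical data chosen to avoid bin edges (apart from the maximum).
values = np.array([0.5, 0.8, 2.0, 2.9])
binned = cut_diverging(values, 9)

# Every bin is a category, whether or not any value falls into it.
all_bins = binned.categories
# Only the bins that actually contain data.
used_bins = binned.remove_unused_categories().categories

print(len(all_bins), len(used_bins))
```

A palette with one color per entry of all_bins stays centered on zero; one built from used_bins does not.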

@mwaskom
Owner

mwaskom commented Jan 23, 2015

I think this is actually just a bug in FacetGrid.add_legend and isn't specifically related to the category issue.

@jankatins

@mwaskom Not sure what you expected, but IMO these are the main advantages for plotting (and maybe statistical libs):

  • I can easily change the order of displayed categorical data (e.g. try to order a variable with values like "one", "two", "three", or Likert scales -> already possible with keyword arguments in seaborn)
  • I can include empty categories at the right place (not sure if that's possible in seaborn yet).
  • In ggplot I was also looking forward to using it for facets: instead of passing around the full labels (which would need an additional "context" or so), each facet would simply take this information from the categorical variable itself. This would then solve the up-to-now problem in ggplot that displaying a categorical (e.g. string) variable in a faceted plot would omit bars in some facets where they were not present in the groupby dataframe.

@mwaskom
Owner

mwaskom commented Jan 23, 2015

Sure, I mean I get that. Like I said, I'm excited about them! I think my reservations have to do with the distinction between ordered and unordered categoricals, which seems to add a fair amount of complexity and confusion (I guess mostly I don't understand the point of unordered categoricals -- all of the examples you mention have to do with the ordered kind).

@shoyer
Contributor Author

shoyer commented Jan 23, 2015

AFAIK there is no guarantee that unique() outputs data in any particular order, so this could change (the docstring doesn't mention any order).

Let's fix this: pandas-dev/pandas#9346

@mwaskom
Owner

mwaskom commented Jan 23, 2015

Maybe what's confusing is that "unordered" categoricals aren't really unordered, they're just default (lex sort) ordered.

@jankatins

I will prepare the change discussed above (e.g. if one constructs a categorical with ordered=False, the categories are not sorted), but I suspect that we understand different things by "a categorical is ordered": the only thing this means is that values are sortable (according to the order specified in the categories), not anything about the order of the categories. It is more like the order on int (e.g. 1 < 2): if you could take this away (ordered=False), 1 < 2 would throw an exception, like it does when comparing two arbitrary objects.

Examples of unordered categoricals: countries, treatment vs. non-treatment, ... I'm undecided whether lexicographic sorting in this case (at least initially) would be helpful or not. IMO it would not harm :-)

Anyway: I think seaborn will need a doc sentence on this in any case, because in most cases the workflow wouldn't use the Categorical constructor directly but

df = ...
df["vcat"] = df.string_var.astype("category") # sorts because categorical defaults to "ordered=True"
df.vcat.cat.ordered = False # categories are not reevaluated/sorted in order of appearance
[plot it...]

What will IMO never happen is that if you have an ordered categorical and take away the order, the categories will then be sorted "as appearing", as this would mean that the categories have to be computed on the fly, since the order depends on the current order of the rows.
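
The "ordered means comparable" distinction can be sketched as follows. This assumes a recent pandas release, where astype("category") yields sorted categories with ordered=False by default (the ordered=True default mentioned in the snippet above describes pandas at the time of this comment) and where the order is switched on with as_ordered() rather than by assigning to .cat.ordered. The sample values are hypothetical.

```python
import pandas as pd

s = pd.Series(["b", "a", "c", "a"]).astype("category")

# Categories are inferred and sorted, independent of row order.
print(list(s.cat.categories))

# Unordered: inequality comparisons are rejected with a TypeError.
try:
    s < "b"
    unordered_comparison_failed = False
except TypeError:
    unordered_comparison_failed = True

# Ordered: the same comparison follows the category order.
ordered = s.cat.as_ordered()
less_than_b = (ordered < "b").tolist()

print(unordered_comparison_failed, less_than_b)
```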

@jankatins

IMO the docs should read something like this:

The categories are either ordered like "var.unique()" or, in case the variable is of dtype "category", in the same order as the categories ("var.cat.categories"). You can change the order by re-sorting the data (all dtypes except category), by changing the order of the categories of a variable of dtype category (var.cat.reorder_categories([....neworder...])), or by supplying [... keyword args...].
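
As a minimal illustration of the reorder_categories route mentioned in that doc sentence (the "low"/"medium"/"high" labels are hypothetical sample data):

```python
import pandas as pd

s = pd.Series(["low", "high", "medium", "low"], dtype="category")

# Inferred categories come out in lexicographic order.
default_order = list(s.cat.categories)

# Impose a meaningful order without touching the data itself.
reordered = s.cat.reorder_categories(["low", "medium", "high"], ordered=True)
new_order = list(reordered.cat.categories)

# Sorting now follows the category order rather than the alphabet.
sorted_values = reordered.sort_values().tolist()

print(default_order)
print(new_order)
print(sorted_values)
```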

jankatins added a commit to jankatins/pandas that referenced this issue Jan 23, 2015
…False)

In mwaskom/seaborn#361 it was discussed
that lexicographical sorting the categories is only appropriate if an
order is specified/implied. If this is explicitly not done, e.g. with
`Categorical(..., ordered=False)` then the order should be taken
from the order of appearance, similar to the current `Series.unique()`
implementation.
@jankatins

Just something I wrote in one of the linked bug reports: I think that "order of appearance" doesn't make sense as a default for plotting categoricals. I can't see many cases where the "order of appearance" of a variable has any logical meaning, apart from when one has explicitly sorted that variable (in which case it is easier to sort the unique values) or the date case below:

  • If it is sorted by the categorical variable, then that order has a meaning, but I could have done that much more cheaply by sorting the unique values themselves or supplying a custom ordering via kwargs
  • If the frame is "unsorted" (or by some ID/Timestamp), the order of (almost) any other var including the categorical one is random and has no meaning in a categorical plot.
  • The only case is when you sort by date and have a categorical "month", which has the month as a name ("Jan","Apr",...). But that would be much better handled by converting to a Categorical, where the order of the categories has a defined meaning.

IMO the last is much less often the case than the "it's random" case above.

So, this would be my vote: for all cases where a categorical variable is expected:

  • If it is dtype category, take the categories and plot "as is" (order and including unused)
  • If it is a string/int: take unique and sort it [maybe convert to categorical for easier codepaths?], then plot it. Add a warning that the string was converted to categorical and that one can do that by hand and change the order there.
  • If one supplies an ordered list of values, it overrides both cases.
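
The three rules in this vote could be sketched roughly as below. The name proposed_order is hypothetical, the warning and the optional conversion to categorical are left out, and this is an illustration of the proposal only, not the behavior seaborn ended up with.

```python
import pandas as pd

def proposed_order(values, order=None):
    """Hypothetical sketch of the rules proposed in this comment."""
    # Rule 3: an explicitly supplied order overrides everything else.
    if order is not None:
        return list(order)
    series = pd.Series(values)
    # Rule 1: dtype category, use the categories as-is (unused ones included).
    if isinstance(series.dtype, pd.CategoricalDtype):
        return list(series.cat.categories)
    # Rule 2: anything else, take the unique values and sort them.
    return sorted(series.dropna().unique())

cat = pd.Categorical(["Sat", "Thur"], categories=["Thur", "Fri", "Sat"])
print(proposed_order(cat))
print(proposed_order(["b", "a", "b"]))
print(proposed_order(["b", "a", "b"], order=["b", "a"]))
```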

mwaskom added a commit that referenced this issue Mar 8, 2015
This will include levels that appear in the `category` list, but that
do not appear in the data.

See #361
@mwaskom
Owner

mwaskom commented Mar 8, 2015

For the record here's the function I ended up using to determine a list of category levels from an arbitrary vector object:

import pandas as pd


def categorical_order(values, order=None):
    """Return a list of unique data values.

    Determine an ordered list of levels in ``values``.

    Parameters
    ----------
    values : list, array, Categorical, or Series
        Vector of "categorical" values
    order : list-like, optional
        Desired order of category levels to override the order determined
        from the ``values`` object.

    Returns
    -------
    order : list
        Ordered list of category levels

    """
    if order is None:
        if hasattr(values, "categories"):
            order = values.categories
        else:
            try:
                order = values.cat.categories
            except (TypeError, AttributeError):
                try:
                    order = values.unique()
                except AttributeError:
                    order = pd.unique(values)

    return list(order)

IMHO there has to be a lot of complexity in this little function to handle the various options and failure modes of working with pandas "categorical" data, but it appears to get the job done...
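
To make the branches concrete, here is a usage sketch that runs the function on the input types it is meant to handle. The function is repeated with its docstring shortened so the block is self-contained, and the weekday values are hypothetical sample data.

```python
import numpy as np
import pandas as pd

def categorical_order(values, order=None):
    """Condensed copy of the function above."""
    if order is None:
        if hasattr(values, "categories"):
            # pd.Categorical exposes its categories directly.
            order = values.categories
        else:
            try:
                # Series with dtype "category".
                order = values.cat.categories
            except (TypeError, AttributeError):
                try:
                    # Other Series: order of first appearance.
                    order = values.unique()
                except AttributeError:
                    # Plain arrays.
                    order = pd.unique(values)
    return list(order)

days = ["Thur", "Fri", "Sat", "Sun"]
cat = pd.Categorical(["Sat", "Thur", "Sat"], categories=days)

from_categorical = categorical_order(cat)
from_cat_series = categorical_order(pd.Series(cat))
from_plain_series = categorical_order(pd.Series(["Sat", "Thur", "Sat"]))
from_array = categorical_order(np.array(["Sat", "Thur", "Sat"]))
overridden = categorical_order(cat, order=["Sun", "Sat"])

for result in (from_categorical, from_cat_series, from_plain_series,
               from_array, overridden):
    print(result)
```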

@shoyer
Contributor Author

shoyer commented Mar 8, 2015

This is indeed a helpful reference -- thanks @mwaskom!

On my TODO list is cleaning that up a little bit, at least removing that last try/except clause -- pd.unique should be able to handle pandas categoricals (and other types with a .unique method) directly.

@mwaskom
Owner

mwaskom commented Mar 8, 2015

The thing I got stuck on for a while was figuring out that obj.cat raises TypeError if obj is a Series but AttributeError otherwise (and apparently that distinction only takes effect on Python 3). Streamlining that might be helpful to others.

@mwaskom
Owner

mwaskom commented Mar 8, 2015

Or maybe it was that hasattr(obj, "cat") raises a TypeError only on Python 3 -- anyway, "determine if obj has categories" remains a bit fraught.

@shoyer
Contributor Author

shoyer commented Mar 8, 2015

I will double check, but I'm pretty sure this is at least consistent now between Python 2 and 3 on master. Any invalid use should now raise TypeError. I suppose there is a reasonable case that that should be AttributeError instead.

@mwaskom
Owner

mwaskom commented Mar 9, 2015

Sorry, took a little bit, but I've reproduced the issue. The following code returns False on Python 2.7 but raises TypeError on Python 3.4:

import pandas as pd
x = ["a", "c", "c", "b", "a", "d"]
hasattr(pd.Series(x), "cat")
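
The Python 3 side of this can be reproduced without pandas, since hasattr there only treats AttributeError as "attribute missing" and lets other exceptions propagate. The two classes below are hypothetical stand-ins for an accessor that raises on invalid use; they are not pandas code.

```python
class RaisesTypeError:
    """Stand-in for an accessor that rejects invalid use with TypeError."""
    @property
    def cat(self):
        raise TypeError("Can only use .cat accessor with a 'category' dtype")

class RaisesAttributeError:
    """Stand-in for an accessor that rejects invalid use with AttributeError."""
    @property
    def cat(self):
        raise AttributeError("Can only use .cat accessor with a 'category' dtype")

# AttributeError is swallowed by hasattr and reported as False.
attribute_error_result = hasattr(RaisesAttributeError(), "cat")

# On Python 3, any other exception escapes from hasattr.
try:
    hasattr(RaisesTypeError(), "cat")
    type_error_propagated = False
except TypeError:
    type_error_propagated = True

print(attribute_error_result, type_error_propagated)
```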

shoyer added a commit to shoyer/pandas that referenced this issue Mar 9, 2015
`AttributeError` is really the appropriate error to raise for an invalid
attribute. In particular, it is necessary to ensure that tests like
`hasattr(s, 'cat')` work consistently on Python 2 and 3: on Python 2,
`hasattr(s, 'cat')` will return `False` even if a `TypeError` was raised, but
Python 3 more strictly requires `AttributeError`.

This is an unfortunate trap that we should avoid. See this discussion in
Seaborn for a full report:
mwaskom/seaborn#361 (comment)

Note that technically, this is an API change, since these accessors (all but
`.str`, I think) raised TypeError in the last release.

This also suggests another possibility for testing for Series with a
Categorical dtype (GH8814): just use `hasattr(s, 'cat')` (at least for Python
2 or pandas >=0.16).

CC mwaskom jorisvandenbossche JanSchulz
@mwaskom mwaskom mentioned this issue Mar 9, 2015
mwaskom added a commit that referenced this issue Mar 13, 2015
This will include levels that appear in the `category` list, but that
do not appear in the data.

See #361
mwaskom added a commit that referenced this issue May 9, 2015
This fixes #472.

This also changes the default `hue_order` to use the same `category_order`
rules as elsewhere in seaborn (cf #361).
@mwaskom
Owner

mwaskom commented May 10, 2015

Alright with #548 I think categorical variables in seaborn should work as articulated in this thread and uniformly across the package.

Please open an issue if you find something that does not behave as expected.
