Setting hue in pairplot on a string column with only 1 item in a category throws exception: ValueError: dataset input should have multiple elements. #1627

Closed
adam-ah opened this issue Dec 12, 2018 · 13 comments

@adam-ah

adam-ah commented Dec 12, 2018

seaborn 0.9.0, installed via pip.

I have 10 rows and am trying to create a pairplot. The plot works fine until I set the hue to a string (object) column that has 4 categories with a breakdown of (4, 3, 2, 1).
Stack trace below:

ValueError                                Traceback (most recent call last)
<ipython-input-62-cc1e7428015b> in <module>()
      8 # ds.dtypes
      9 # ds
---> 10 sns.pairplot(ds, vars=['mark', 'days', 'hours', 'refs'], hue='topic')
     11 # dataset.iloc[0:10,[8,19]]
     12 

/Users/piglet/Library/Python/2.7/lib/python/site-packages/seaborn/axisgrid.pyc in pairplot(data, hue, hue_order, palette, vars, x_vars, y_vars, kind, diag_kind, markers, height, aspect, dropna, plot_kws, diag_kws, grid_kws, size)
   2109             diag_kws.setdefault("shade", True)
   2110             diag_kws["legend"] = False
-> 2111             grid.map_diag(kdeplot, **diag_kws)
   2112     
   2113     # Maybe plot on the off-diagonals

/Users/piglet/Library/Python/2.7/lib/python/site-packages/seaborn/axisgrid.pyc in map_diag(self, func, **kwargs)
   1397                     color = fixed_color
   1398                 
-> 1399                 func(data_k, label=label_k, color=color, **kwargs)
   1400             
   1401             self._clean_axis(ax)

/Users/piglet/Library/Python/2.7/lib/python/site-packages/seaborn/distributions.pyc in kdeplot(data, data2, shade, vertical, kernel, bw, gridsize, cut, clip, legend, cumulative, shade_lowest, cbar, cbar_ax, cbar_kws, ax, **kwargs)
    689         ax = _univariate_kdeplot(data, shade, vertical, kernel, bw,
    690                                  gridsize, cut, clip, legend, ax,
--> 691                                  cumulative=cumulative, **kwargs)
    692     
    693     return ax

/Users/piglet/Library/Python/2.7/lib/python/site-packages/seaborn/distributions.pyc in _univariate_kdeplot(data, shade, vertical, kernel, bw, gridsize, cut, clip, legend, ax, cumulative, **kwargs)
    292                               "only implemented in statsmodels."
    293                               "Please install statsmodels.")
--> 294         x, y = _scipy_univariate_kde(data, bw, gridsize, cut, clip)
    295 
    296     # Make sure the density is nonnegative

/Users/piglet/Library/Python/2.7/lib/python/site-packages/seaborn/distributions.pyc in _scipy_univariate_kde(data, bw, gridsize, cut, clip)
    364     """Compute a univariate kernel density estimate using scipy."""
    365     try:
--> 366         kde = stats.gaussian_kde(data, bw_method=bw)
    367     except TypeError:
    368         kde = stats.gaussian_kde(data)

/Users/piglet/Library/Python/2.7/lib/python/site-packages/scipy/stats/kde.pyc in __init__(self, dataset, bw_method)
    167         self.dataset = atleast_2d(dataset)
    168         if not self.dataset.size > 1:
--> 169             raise ValueError("`dataset` input should have multiple elements.")
    170 
    171         self.d, self.n = self.dataset.shape

ValueError: `dataset` input should have multiple elements.
@CRiddler
Contributor

Would you be able to share the dataset and the relevant code that produce this error, or create a fake dataset and write a snippet that also reproduces it?

As of right now, my best guess is that one of the values in the "topic" column of your dataset only appears once. When the data then get subset by this column, you end up with an array that has 1 value, which fails the dataset.size > 1 check in scipy's gaussian_kde and raises the error.
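
The guess above can be checked in isolation. A minimal sketch, assuming scipy is installed: passing a single value to scipy.stats.gaussian_kde should raise the same ValueError as in the traceback, with no seaborn involved. The value 65.0 is an arbitrary placeholder.

```python
from scipy import stats

# A hue group with a single row reaches gaussian_kde as a one-element array.
single_value_group = [65.0]  # arbitrary placeholder value

message = None
try:
    stats.gaussian_kde(single_value_group)
except ValueError as err:
    # Expected to match the last frame of the traceback in the report.
    message = str(err)

print(message)
```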

@adam-ah
Author

adam-ah commented Dec 19, 2018

Yes, the problem is exactly what you just described. It should still plot nonetheless, even if a category appears only once.

Here is a full sample:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

dataset = pd.read_csv('ce.csv', sep='\t')

dataset

# sns.pairplot(dataset, hue='e') # Throws error

sns.pairplot(dataset) # Works

Content of ce.csv:

a b c d e
75 20 35 13 sl
90 7 10 24 ov
87.5 6 24 16 ov
73 10 90 14 po
67.5 14 25 12 po
75 14 30 16 po
65 7 30 13 mo
72.5 4 15 15 ov
75 7 20 10 ov
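
To see which hue categories are affected, the rows can be counted per category. This is a sketch using only the standard library; the rows are copied from the listing above and split on whitespace here, whereas the sample reads the file with sep='\t'.

```python
from collections import Counter

# Rows copied from the ce.csv listing; whitespace-separated for readability.
CE_CSV = """\
a b c d e
75 20 35 13 sl
90 7 10 24 ov
87.5 6 24 16 ov
73 10 90 14 po
67.5 14 25 12 po
75 14 30 16 po
65 7 30 13 mo
72.5 4 15 15 ov
75 7 20 10 ov
"""

rows = [line.split() for line in CE_CSV.splitlines()]
header, records = rows[0], rows[1:]
hue_index = header.index("e")

# Count how many rows fall into each hue category.
group_sizes = Counter(record[hue_index] for record in records)

# Categories with fewer than two rows cannot get a KDE on the diagonal.
singletons = sorted(name for name, size in group_sizes.items() if size < 2)

print(dict(group_sizes))
print(singletons)
```

In this listing, "sl" and "mo" each appear once, so either of them is enough to trigger the error when hue='e' is set.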

@CRiddler
Contributor

It should still plot nonetheless

I wouldn't agree with this statement. The issue here isn't with seaborn; it is with the underlying math. On the diagonal, the plot attempts to draw a kernel density estimate (kde plot). To do this, it needs to calculate the kde, which is why the traceback ends in scipy.stats.kde. For it to still plot, seaborn would need to drop that data from the plot, which would cause more confusion: users would wonder why no error was raised yet part of their plot is missing.

Since we can't plot a kde on the diagonal, you can tell pairplot to draw a histogram on the diagonal.

sns.pairplot(dataset, hue='e', diag_kind='hist')

If I don't specify the diag_kind argument, it will try to draw a kde and create an error.

@adam-ah
Author

adam-ah commented Dec 20, 2018

I wouldn't agree with this statement.

I must have misread your argument because surely you aren't saying that a plotting library should throw a random exception from a third party library because it was asked to color in dots?

You can throw a warning, saying that the diagram diagonal isn't using colours, because of whatever reason.

You can throw an error, saying that 'kde' diagonal requires at least two elements in each group (for whatever reason), and say that you can use the 'hist' parameter to draw a histogram instead of kde.

But what you cannot do is to let a third party library throw an arbitrary exception because seaborn didn't check the required parameters the library needs to do its thing.
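
The "informative error" option could look roughly like the sketch below. The helper name check_kde_groups and its min_size parameter are hypothetical and not part of seaborn; the point is only that the group sizes are validated before the KDE routine is ever called.

```python
from collections import Counter


def check_kde_groups(hue_values, min_size=2):
    """Raise an informative error if any hue group is too small for a KDE.

    Hypothetical helper for illustration; not part of seaborn.
    """
    sizes = Counter(hue_values)
    too_small = sorted(str(name) for name, n in sizes.items() if n < min_size)
    if too_small:
        raise ValueError(
            "diag_kind='kde' needs at least {} observations in every hue "
            "group, but these groups have fewer: {}. "
            "Pass diag_kind='hist' to draw histograms instead.".format(
                min_size, ", ".join(too_small)
            )
        )


try:
    check_kde_groups(["ov", "ov", "po", "po", "sl"])
except ValueError as err:
    print(err)
```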

The issue here isn't with seaborn, the issue is with the underlying math.

Just like we cannot blame calling toString() on a null value, or blame sqrt() when passing in a negative value, we cannot blame a third party library when we are feeding in incorrect parameters.

EDIT: With the .pairplot API the user never actually asked for a kde plot, only for a pair-plot, and seaborn decided to use a kde plot; it's difficult to reason that it is a user error either. The data can be plotted and colored in, so overall it would be hard to conclude that it is not a bug.

@CRiddler
Contributor

You can throw an error, saying that 'kde' diagonal requires at least two elements in each group (for whatever reason), and say that you can use the 'hist' parameter to draw a histogram instead of kde.

100% agree with you here. I do agree that the error could be handled earlier and be more informative for the user. What I was attempting to convey is that seaborn shouldn't silently skip drawing a specific plot. I assumed this is what you meant by

It should be plotted nonetheless

However, it seems I misinterpreted what you meant by that, but I think we're both on the same page now. The diag_kind argument of pairplot does default to "auto", which simply means "if hue is None, plot a hist on the diagonal, else plot a kde". Maybe another check could go into this determination to see whether any of the hue groupings have only 1 datapoint?

@adam-ah
Author

adam-ah commented Dec 20, 2018

'auto' implies that it would do something meaningful so I would add an extra check, fall back to a plottable diagonal from kde (if / when necessary), and throw a warning.
An error would not be graceful degradation and would cause issues in an automated setting when seaborn is not run interactively.
So, check, warn & fallback.
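
A minimal sketch of "check, warn & fallback", assuming the "auto" rule described earlier in the thread. The helper resolve_diag_kind and its min_size parameter are hypothetical; this is not seaborn's implementation. An explicit 'kde' or 'hist' request is passed through unchanged, and only "auto" falls back.

```python
import warnings
from collections import Counter


def resolve_diag_kind(diag_kind, hue_values=None, min_size=2):
    """Resolve 'auto' to 'hist' or 'kde', falling back to 'hist' with a warning.

    Hypothetical helper for illustration; not seaborn's implementation.
    """
    if diag_kind != "auto":
        # An explicit choice is left untouched.
        return diag_kind
    if hue_values is None:
        return "hist"

    # Check: find hue groups too small for a kernel density estimate.
    sizes = Counter(hue_values)
    too_small = sorted(str(name) for name, n in sizes.items() if n < min_size)
    if too_small:
        # Warn and fall back instead of failing inside the KDE routine.
        warnings.warn(
            "Hue groups {} have fewer than {} observations; drawing "
            "histograms on the diagonal instead of KDEs.".format(
                too_small, min_size
            ),
            UserWarning,
        )
        return "hist"
    return "kde"


# Emits a UserWarning about 'sl', then prints "hist".
print(resolve_diag_kind("auto", ["sl", "ov", "ov", "po", "po"]))
```

The result could then be passed on explicitly, for example as the diag_kind argument of sns.pairplot, so the choice stays visible to the caller.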

@mwaskom
Owner

mwaskom commented Dec 20, 2018

Duplicate of #1502

mwaskom marked this as a duplicate of #1502 on Dec 20, 2018
mwaskom closed this as completed on Dec 20, 2018

@adam-ah
Author

adam-ah commented Dec 20, 2018

That was an odd move to close it down after discussing the problem and the solution @mwaskom
The other bug cannot be found with the same keywords, talks about many different problems, and does not recommend a solution.

@mwaskom
Owner

mwaskom commented Dec 20, 2018

Nevertheless, it’s the same issue, so there don’t need to be two separate threads about it.

@adam-ah
Author

adam-ah commented Dec 20, 2018

If it really is the same, why did it take days and half a dozen comments to identify the problem, why is there a proposed solution here but not on the other bug, and why is the other bug talking about "LinAlgError: singular matrix" and this one about "ValueError: dataset"?

If it is the same, why do people on the other bug claim that it works on Python 2 but not Python 3, indicating an issue totally unrelated to this one?

It might be the same in your eyes, but I doubt anyone else would readily say this thread is talking about the same problem and the same solutions as the other thread, because it is not.

EDIT: just because two issues may have the same code fix, it doesn't mean they are the same, obviously.

@mwaskom
Owner

mwaskom commented Dec 20, 2018

What are you trying to accomplish here, exactly?

@adam-ah
Author

adam-ah commented Dec 20, 2018

I thought it would be obvious: keep the issue open until someone wants to fix it (create a PR), or simply let other users discover the active issue by the keywords they would see and let them use the workaround @CRiddler used.
This issue is far from "Closed" at this point, in my view.

@mwaskom
Owner

mwaskom commented Dec 20, 2018

Well, you've failed in that aim, but succeeded in irritating me to the point that I've moved this issue to the bottom of the stack of things I choose to spend time working on. Congrats!

Repository owner locked as resolved and limited conversation to collaborators Dec 20, 2018