BUG: scatter plot with categorical data raises KeyError #16199

Closed
jorisvandenbossche opened this Issue May 2, 2017 · 10 comments

jorisvandenbossche commented May 2, 2017

import pandas as pd
df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': pd.Categorical(['a', 'b', 'a', 'c'])})
df.plot(x='x', y='y', kind='scatter')

raises KeyError: 'y', while the column certainly exists, which can be very confusing.
Without the scatter (just df.plot(x='x', y='y')), it raises the more informative TypeError: Empty 'DataFrame': no numeric data to plot

Contributor

stangirala commented May 2, 2017

@jorisvandenbossche Seems like a simple fix that would check the types of x and y at https://github.com/pandas-dev/pandas/blob/master/pandas/plotting/_core.py#L831?

Contributor

TomAugspurger commented May 2, 2017

@stangirala it may have to be earlier, somewhere around https://github.com/pandas-dev/pandas/blob/master/pandas/plotting/_core.py#L335

The issue is that we drop non-numeric columns fairly early on, and by the time you get to https://github.com/pandas-dev/pandas/blob/master/pandas/plotting/_core.py#L831, the data passed to that method no longer has the categorical column.

The cleanest way might be to modify MPLPlot._compute_plot_data to check if self.x and self.y are not in numeric_data before setting self.data = numeric_data.
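
A minimal sketch of the mechanism described above, assuming the early filtering behaves like `DataFrame.select_dtypes(include="number")`. The helper `numeric_data_checked` and its error message are hypothetical, written only to illustrate the suggested check; this is not the code in `_core.py`.

```python
import pandas as pd


def numeric_data_checked(df, x, y):
    """Return the numeric columns of df, failing early if x or y was dropped.

    Hypothetical helper for illustration; not the pandas implementation.
    """
    numeric_data = df.select_dtypes(include="number")
    dropped = [col for col in (x, y) if col not in numeric_data.columns]
    if dropped:
        raise ValueError(
            f"scatter requires numeric columns; non-numeric or missing: {dropped}"
        )
    return numeric_data


df = pd.DataFrame({"x": [1, 2, 3, 4], "y": pd.Categorical(["a", "b", "a", "c"])})

# Filtering to numeric columns silently drops the categorical 'y' ...
numeric_only = df.select_dtypes(include="number")
print(list(numeric_only.columns))  # ['x']

# ... so a later lookup of 'y' on the filtered frame raises KeyError.
try:
    numeric_only["y"]
except KeyError as exc:
    print("KeyError:", exc)

# With a check before the filtered frame is used, the failure is explicit.
try:
    numeric_data_checked(df, "x", "y")
except ValueError as exc:
    print("ValueError:", exc)
```

Checking before the filtered frame replaces the original is what lets the message name the offending column instead of surfacing as a lookup failure later.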

Contributor

stangirala commented May 2, 2017

@TomAugspurger I see. But it looks like PlanePlot has checks on x and y and not MPLPlot, https://github.com/pandas-dev/pandas/blob/master/pandas/plotting/_core.py#L770. The code sample above passes the check in MPLPlot at https://github.com/pandas-dev/pandas/blob/master/pandas/plotting/_core.py#L335 because x is numeric. So the check would go in PlanePlot, right?

Member

jorisvandenbossche commented May 2, 2017

@stangirala Another interesting area to contribute to, though a much bigger issue, is better support for categorical data (the above example could actually work).

Contributor

stangirala commented May 2, 2017

@jorisvandenbossche Maybe you can open a new issue for it? :D It seems like we would have a flag for most plots that would wholesale convert categorical data to a label-to-integer mapping?

But I don't think it would make much sense for a scatter plot that requires an inherent ordering. But having equivalent categorical plots is a good idea, for example parallel_sets, #12341

stangirala added a commit to stangirala/pandas that referenced this issue May 3, 2017

BUG: Categorical scatter plot has KeyError pandas-dev#16199
Appropriately handles categorical data for dataframe scatter plots which
currently raises KeyError for categorical data

@jreback jreback added this to the 0.20.1 milestone May 3, 2017

Member

jorisvandenbossche commented May 3, 2017

There are already some issues about it, e.g. #12341

> But I don't think it would make much sense for a scatter plot that requires an inherent ordering.

Do you mean that it wouldn't make sense for a scatter plot to support categorical data? I think it could make sense: you would basically use the underlying codes of the categorical as the values to plot.
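
As a rough illustration of the "underlying codes" idea, the sketch below replaces a categorical column with its integer codes so the frame becomes all numeric. The helper `with_category_codes` is hypothetical and not part of pandas; it only shows what such a conversion could look like.

```python
import pandas as pd


def with_category_codes(df, column):
    """Return (copy of df with `column` replaced by its integer codes, labels).

    Hypothetical helper for illustration; not part of pandas.
    """
    out = df.copy()
    labels = list(out[column].cat.categories)
    out[column] = out[column].cat.codes
    return out, labels


df = pd.DataFrame({"x": [1, 2, 3, 4], "y": pd.Categorical(["a", "b", "a", "c"])})
plot_df, labels = with_category_codes(df, "y")

print(plot_df["y"].tolist())  # [0, 1, 0, 2]
print(labels)                 # ['a', 'b', 'c']
```

Since `plot_df` contains only numeric columns, it should pass the numeric filtering discussed earlier, and `labels` could be used for the axis tick labels. Note that missing values in a categorical are coded as -1, so they would need separate handling.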

Contributor

stangirala commented May 3, 2017

Oh, I meant: would someone want a scatter plot for categorical data when, most of the time, the categories don't have an ordering? In that case, wouldn't someone rather use a box plot, for example, assuming the y data is numeric and x is categorical?

Contributor

TomAugspurger commented May 3, 2017

supporting categorical in scatter would be nice when you have a single observation per category. More like a dot plot.

Contributor

stangirala commented May 3, 2017

@TomAugspurger I see, a dot plot would make sense. I don't see an issue for this, do I open one?

Contributor

TomAugspurger commented May 3, 2017

@jreback jreback modified the milestones: 0.20.1, 0.20.2 May 5, 2017

@jorisvandenbossche jorisvandenbossche modified the milestones: Next Major Release, 0.20.2 May 6, 2017

stangirala added a commit to stangirala/pandas that referenced this issue Jun 11, 2017

BUG: Categorical scatter plot has KeyError pandas-dev#16199
Appropriately handles categorical data for dataframe scatter plots which
currently raises KeyError for categorical data

@jreback jreback modified the milestones: 0.20.3, Next Major Release Jun 12, 2017

TomAugspurger added a commit that referenced this issue Jun 12, 2017

BUG: Categorical scatter plot has KeyError #16199 (#16208)
* BUG: Categorical scatter plot has KeyError #16199

Appropriately handles categorical data for dataframe scatter plots which
currently raises KeyError for categorical data

* Add to whatsnew

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Jul 6, 2017

BUG: Categorical scatter plot has KeyError pandas-dev#16199 (pandas-dev#16208)

* BUG: Categorical scatter plot has KeyError pandas-dev#16199

Appropriately handles categorical data for dataframe scatter plots which
currently raises KeyError for categorical data

* Add to whatsnew

(cherry picked from commit 11d274f)

TomAugspurger added a commit that referenced this issue Jul 7, 2017

BUG: Categorical scatter plot has KeyError #16199 (#16208)
* BUG: Categorical scatter plot has KeyError #16199

Appropriately handles categorical data for dataframe scatter plots which
currently raises KeyError for categorical data

* Add to whatsnew

(cherry picked from commit 11d274f)