[Bug] Plotting categorical columns includes empty categories #3704

Closed
Yazan-Sharaya opened this issue Jun 2, 2024 · 3 comments

Yazan-Sharaya commented Jun 2, 2024

A reproducible code example that demonstrates the problem

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

countries = ['US', 'Canada', 'Spain', 'US', 'Canada', 'Sweden', 'Jordan', 'Netherlands', 'US', 'Spain']
df = pd.DataFrame(countries, columns=['Countries'])
df['Countries'] = df['Countries'].astype('category')

filtered_df = df[df['Countries'] == 'US'].copy()

sns.countplot(filtered_df, x='Countries')
plt.show()

The output that you are seeing (an image of a plot, or the error message)

[Image: myplot]

A clear explanation of why you think something is wrong

When plotting a categorical column, the resulting plot contains every category declared on the column, including categories that no longer appear in the data.
I couldn't find any direct information in the documentation about this. However, I found the following example at https://seaborn.pydata.org/tutorial/categorical.html#categorical-scatterplots. Specifically the part that contains

sns.catplot(data=tips.query("size != 3"), x="size", y="total_bill", native_scale=True)

There, the result has an empty column at size=3. Still, I'm not sure this should happen when a new dataframe is created from the original one without certain categories.
I understand that this could be more of a pandas issue than a seaborn one, but I felt it should be mentioned or documented more clearly.
There are a couple of easy workarounds for this currently:

filtered_df['Countries'] = filtered_df['Countries'].astype('string')
# Or
filtered_df['Countries'] = filtered_df['Countries'].cat.remove_unused_categories()
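
To make the cause visible without plotting, here is a minimal pandas-only sketch built on the same data as the example above. It assumes that the extra bars come from the categories still declared on the column's dtype after filtering; the variable names `declared`, `present`, and `trimmed_categories` are only illustrative.

```python
import pandas as pd

countries = ['US', 'Canada', 'Spain', 'US', 'Canada', 'Sweden', 'Jordan', 'Netherlands', 'US', 'Spain']
df = pd.DataFrame(countries, columns=['Countries'])
df['Countries'] = df['Countries'].astype('category')

filtered_df = df[df['Countries'] == 'US'].copy()

# Filtering rows keeps the categorical dtype, so all six categories stay declared.
declared = list(filtered_df['Countries'].cat.categories)

# Only the values that actually occur in the remaining rows.
present = list(filtered_df['Countries'].unique())

# Dropping unused categories leaves just the ones that occur.
trimmed = filtered_df['Countries'].cat.remove_unused_categories()
trimmed_categories = list(trimmed.cat.categories)

print(declared)
print(present)
print(trimmed_categories)
```

Comparing `declared` with `present` before plotting is a quick way to check whether a filtered column still carries categories that would show up as empty positions.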

The specific versions of seaborn and matplotlib that you are working with

  • Python: 3.12
  • seaborn: 0.13.2
  • matplotlib: 3.9.0
  • This isn't specific to any version combinations though, because I observed the same behaviour with the oldest supported Python version (3.7)
Owner

mwaskom commented Jun 3, 2024

This is fully intentional; you can dig up the original thread on the introduction of categorical support, where it was discussed at length.


jhncls commented Jun 4, 2024

@Yazan-Sharaya
Sometimes, for consistency between plots, you want to see the unused categories. And sometimes, as in your case, you don't want to see them.

One way to show only the used categories is to change the dataframe:

filtered_df['Countries'] = filtered_df['Countries'].cat.remove_unused_categories()

An alternative way uses order= to restrict the plot to the desired categories:

sns.countplot(filtered_df, x='Countries', order=filtered_df['Countries'].unique())
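
Note that `unique()` returns values in order of first appearance, which may differ from the declared category order. If a consistent order across plots matters, a small helper can build the list for `order=`. This is only a sketch: `used_categories` is a hypothetical name, not part of pandas or seaborn, and the sample data is made up for illustration.

```python
import pandas as pd


def used_categories(series: pd.Series) -> list:
    """Return the categories that occur in `series`, in declared category order.

    Hypothetical helper for illustration; assumes `series` has a categorical dtype.
    """
    present = set(series.dropna().unique())
    return [cat for cat in series.cat.categories if cat in present]


# Made-up sample: four declared categories, one of them ('Jordan') never used.
s = pd.Series(pd.Categorical(
    ['US', 'Spain', 'US', 'Canada'],
    categories=['Canada', 'Jordan', 'Spain', 'US'],
))
subset = s[s != 'Canada']

order = used_categories(subset)          # declared order, unused categories dropped
appearance = list(subset.unique())       # order of first appearance

print(order)
print(appearance)
```

The resulting list can then be passed as `order=` to `sns.countplot`, which leaves the dataframe itself unchanged.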

Owner

mwaskom commented Jun 4, 2024

Thanks @jhncls

mwaskom closed this as completed Jun 4, 2024