ENH: Allow easy selection of ordered/unordered categorical columns #46941

Open
richierocks opened this issue May 4, 2022 · 2 comments
@richierocks

Is your feature request related to a problem?

I'd like to be able to easily select only ordered categorical columns, or only unordered categorical columns, from a dataframe.

Example

Here's an example dataset:

import pandas as pd
import numpy.random as npr

n_obs = 20
eye_colors = ["blue", "brown"]
people = pd.DataFrame({
    "eye_color": npr.choice(eye_colors, size=n_obs),
    "age": npr.randint(20, 60, size=n_obs)
})
people["age_group"] = pd.cut(people["age"], [20, 30, 40, 50, 60], right=False)
people["eye_color"] = pd.Categorical(people["eye_color"], eye_colors)

Here, eye_color is an unordered categorical column, age_group is an ordered categorical column, and age is numeric. I want just the age_group column.

My best attempt at selecting ordered categorical columns is

categories = people.select_dtypes("category")
categories[[col for col in categories.columns if categories[col].cat.ordered]]

This solution feels overly complicated for such a simple task.

Describe the solution you'd like

There are a few options for what nicer code might look like.

If ordered and unordered categoricals had different dtypes (as in R with factor vs. ordered), then I could just write people.select_dtypes("ordered"). Unfortunately, this would be a breaking change for any existing code that assumes ordered categoricals have the category dtype.

If dataframe-level .cat.* methods existed, I could write something like

is_ordered = people.cat.ordered # should return [False, pd.NA, True]
people.loc[:, is_ordered & pd.notnull(is_ordered)]
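As far as I know, no dataframe-level .cat accessor exists in pandas, so the two lines above are a wish rather than working code. A rough sketch of what such an accessor might compute can be built from people.dtypes using existing API only. The small fixed dataset and the variable names below are illustrative assumptions, not part of the proposal.

```python
import pandas as pd

# Small deterministic stand-in for the example dataset above.
people = pd.DataFrame({
    "eye_color": pd.Categorical(["blue", "brown", "brown"],
                                categories=["blue", "brown"]),
    "age": [25, 37, 52],
})
people["age_group"] = pd.cut(people["age"], [20, 30, 40, 50, 60], right=False)

# Hypothetical equivalent of `people.cat.ordered`: True/False for
# categorical columns, pd.NA for everything else.
is_ordered = pd.Series(
    [bool(dt.ordered) if isinstance(dt, pd.CategoricalDtype) else pd.NA
     for dt in people.dtypes],
    index=people.columns,
    dtype="boolean",
)

# Treat pd.NA (non-categorical) as "not selected" before indexing.
ordered_cols = people.loc[:, is_ordered.fillna(False).to_numpy(dtype=bool)]
unordered_cols = people.loc[:, (~is_ordered).fillna(False).to_numpy(dtype=bool)]
```

Using the nullable "boolean" dtype keeps the three-way distinction (ordered, unordered, not categorical) in a single mask, at the cost of an explicit fillna before selecting columns.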

A variation on this might be to have more specialized equivalents of .api.types.is_categorical_dtype(), perhaps .api.types.is_ordered_categorical_dtype() and .api.types.is_unordered_categorical_dtype().
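To make this third option concrete, here is a minimal sketch of the two helpers written as ordinary user functions. They are not part of pandas.api.types; the names simply mirror the ones suggested above, and unlike is_categorical_dtype() this sketch does not accept dtype strings such as "category".

```python
import pandas as pd


def _as_dtype(arr_or_dtype):
    # Accept a Series, Index, or Categorical, or a dtype object directly.
    if isinstance(arr_or_dtype, (pd.Series, pd.Index, pd.Categorical)):
        return arr_or_dtype.dtype
    return arr_or_dtype


def is_ordered_categorical_dtype(arr_or_dtype) -> bool:
    # Hypothetical helper: categorical AND ordered.
    dtype = _as_dtype(arr_or_dtype)
    return isinstance(dtype, pd.CategoricalDtype) and bool(dtype.ordered)


def is_unordered_categorical_dtype(arr_or_dtype) -> bool:
    # Hypothetical helper: categorical AND not ordered.
    dtype = _as_dtype(arr_or_dtype)
    return isinstance(dtype, pd.CategoricalDtype) and not dtype.ordered


# Small deterministic stand-in for the example dataset above.
people = pd.DataFrame({
    "eye_color": pd.Categorical(["blue", "brown", "brown"],
                                categories=["blue", "brown"]),
    "age": [25, 37, 52],
})
people["age_group"] = pd.cut(people["age"], [20, 30, 40, 50, 60], right=False)

ordered = people[[c for c in people.columns
                  if is_ordered_categorical_dtype(people[c])]]
unordered = people[[c for c in people.columns
                    if is_unordered_categorical_dtype(people[c])]]
```

Both helpers return False for non-categorical input, so a numeric column such as age ends up in neither selection.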

API breaking implications

The first option mentioned above has API breaking changes; the other two options do not.

Additional context

I asked the internet for better solutions; no response so far.

@richierocks added the Enhancement and Needs Triage labels on May 4, 2022
@simonjayhawkins added the API Design and Categorical labels and removed the Needs Triage label on May 5, 2022
@samukweku
Contributor

The specialised functions seem like the better route to take.

@ShaopengLin

ShaopengLin commented Apr 9, 2023

I am new to pandas. Is there a large performance overhead for apply on a DataFrame? If not, the specialized version could stay consistent with the input of is_categorical_dtype(). A similar boolean array could then be obtained with df.apply(pd.api.types.is_ordered_categorical_dtype), though it would lack the pd.NA that signals non-categorical columns.

To retrieve the columns, we could then simply do this:
people.loc[:, people.apply(pd.api.types.is_ordered_categorical_dtype)]
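Since pd.api.types.is_ordered_categorical_dtype does not exist yet, the line above cannot be run as is. The sketch below substitutes a hypothetical user-defined predicate (is_ordered_categorical, an assumed name) and compares the apply route with a mask built from people.dtypes, which looks only at column metadata. No timings are claimed here; with a column-wise apply the function is called once per column, so any overhead would be expected to depend on the number of columns rather than the number of rows.

```python
import pandas as pd

# Small deterministic stand-in for the example dataset in the issue.
people = pd.DataFrame({
    "eye_color": pd.Categorical(["blue", "brown", "brown"],
                                categories=["blue", "brown"]),
    "age": [25, 37, 52],
})
people["age_group"] = pd.cut(people["age"], [20, 30, 40, 50, 60], right=False)


def is_ordered_categorical(column: pd.Series) -> bool:
    # Hypothetical predicate standing in for the proposed pandas function.
    dtype = column.dtype
    return isinstance(dtype, pd.CategoricalDtype) and bool(dtype.ordered)


# Route 1: column-wise apply; each column is passed in as a Series.
mask_apply = people.apply(is_ordered_categorical)

# Route 2: inspect the dtypes directly, without touching the column data.
mask_dtypes = people.dtypes.map(
    lambda dt: isinstance(dt, pd.CategoricalDtype) and bool(dt.ordered)
)

selected = people.loc[:, mask_apply.to_numpy(dtype=bool)]
```

Both masks are plain True/False per column, so, as noted, they cannot distinguish an unordered categorical from a non-categorical column.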
