ENH: Allow easy selection of ordered/unordered categorical columns #46941

richierocks · 2022-05-04T17:43:14Z

Is your feature request related to a problem?

I'd like to be able to easily select only ordered categorical columns, or only unordered categorical columns, from a dataframe.

Example

Here's an example dataset:

import pandas as pd
import numpy.random as npr

n_obs = 20
eye_colors = ["blue", "brown"]
people = pd.DataFrame({
    "eye_color": npr.choice(eye_colors, size=n_obs),
    "age": npr.randint(20, 60, size=n_obs)
})
people["age_group"] = pd.cut(people["age"], [20, 30, 40, 50, 60], right=False)
people["eye_color"] = pd.Categorical(people["eye_color"], eye_colors)

Here, eye_color is an unordered categorical column, age_group is an ordered categorical column, and age is numeric. I want just the age_group column.

My best attempt at selecting ordered categorical columns is

categories = people.select_dtypes("category")
categories[[col for col in categories.columns if categories[col].cat.ordered]]

This solution feels overly complicated for such a simple task.
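
For comparison, a slightly shorter workaround is to test each column's dtype directly instead of going through select_dtypes and the .cat accessor. This is only a sketch: it uses a small fixed frame rather than the random one above, the variable names are illustrative, and it relies on nothing beyond pd.CategoricalDtype and its ordered attribute.

```python
import pandas as pd

# Small deterministic stand-in for the `people` frame above.
people = pd.DataFrame({"age": [23, 37, 41, 58]})
people["eye_color"] = pd.Categorical(
    ["blue", "brown", "brown", "blue"], categories=["blue", "brown"]
)
people["age_group"] = pd.cut(people["age"], [20, 30, 40, 50, 60], right=False)

# One boolean per column: True only for ordered categorical dtypes.
is_ordered = [
    isinstance(dtype, pd.CategoricalDtype) and bool(dtype.ordered)
    for dtype in people.dtypes
]
ordered_cols = people.loc[:, is_ordered]
print(list(ordered_cols.columns))
```

This is still a comprehension over columns, so it does not remove the need for a dedicated API; it only avoids the intermediate select_dtypes frame.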

Describe the solution you'd like

There are a few options for what nicer code might look like.

If ordered and unordered categoricals had different dtypes (as in R, with factor vs. ordered), then I could just write people.select_dtypes("ordered"). Unfortunately, this would be a breaking change for any existing code that makes assumptions about the dtype of ordered categoricals.

If dataframe-level .cat.* methods existed, I could write something like

is_ordered = people.cat.ordered # should return [False, pd.NA, True]
people.loc[:, is_ordered & pd.notnull(is_ordered)]
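
Until something like this exists, the same three-valued result can be approximated with a small helper. The sketch below is an assumption about how such an attribute might behave, not pandas API: ordered_flags is a hypothetical name, and it returns a nullable boolean Series with pd.NA for non-categorical columns.

```python
import pandas as pd

def ordered_flags(df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: True/False for categorical columns, pd.NA otherwise."""
    flags = [
        bool(dtype.ordered) if isinstance(dtype, pd.CategoricalDtype) else pd.NA
        for dtype in df.dtypes
    ]
    return pd.Series(flags, index=df.columns, dtype="boolean")

# Small deterministic stand-in for the `people` frame.
people = pd.DataFrame({
    "eye_color": pd.Categorical(["blue", "brown", "blue"], categories=["blue", "brown"]),
    "age": [25, 34, 52],
})
people["age_group"] = pd.cut(people["age"], [20, 30, 40, 50, 60], right=False)

is_ordered = ordered_flags(people)
# Treat pd.NA (non-categorical) as "not ordered" before using it as a mask.
ordered_cols = people.loc[:, is_ordered.fillna(False).to_numpy(dtype=bool)]
```

Filling pd.NA with False before masking plays the same role as the `& pd.notnull(...)` step in the snippet above.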

A variation on this might be to have more specialized equivalents of pd.api.types.is_categorical_dtype(), perhaps pd.api.types.is_ordered_categorical_dtype() and pd.api.types.is_unordered_categorical_dtype().
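
As a rough sketch of what such predicates could look like, the functions below are written as plain user code. They are hypothetical and do not exist in pandas; the names simply follow the ones proposed above, and the choice to accept either a dtype or an object with a dtype attribute is an assumption.

```python
import pandas as pd

def _categorical_dtype_of(arr_or_dtype):
    """Return the CategoricalDtype behind the argument, or None."""
    if isinstance(arr_or_dtype, pd.CategoricalDtype):
        return arr_or_dtype
    dtype = getattr(arr_or_dtype, "dtype", None)
    return dtype if isinstance(dtype, pd.CategoricalDtype) else None

def is_ordered_categorical_dtype(arr_or_dtype) -> bool:
    # Hypothetical: True only for categoricals with ordered=True.
    dtype = _categorical_dtype_of(arr_or_dtype)
    return dtype is not None and bool(dtype.ordered)

def is_unordered_categorical_dtype(arr_or_dtype) -> bool:
    # Hypothetical: True only for categoricals that are not ordered.
    dtype = _categorical_dtype_of(arr_or_dtype)
    return dtype is not None and not dtype.ordered

sizes = pd.Series(
    pd.Categorical(["s", "l", "m"], categories=["s", "m", "l"], ordered=True)
)
colors = pd.Series(pd.Categorical(["red", "blue", "red"]))
counts = pd.Series([1, 2, 3])

print(is_ordered_categorical_dtype(sizes), is_unordered_categorical_dtype(colors))
```

Both predicates return False for non-categorical input, so they cannot distinguish "unordered" from "not categorical" on their own; that is the information the pd.NA in the dataframe-level variant would carry.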

API breaking implications

The first option mentioned above is an API-breaking change; the other two options are not.

Additional context

I asked the internet for better solutions; no response so far.

samukweku · 2022-05-05T23:11:15Z

The specialised ideas seem a better route to take.

ShaopengLin · 2023-04-09T15:58:57Z

I am new to pandas. Is there a large performance overhead for apply on a DataFrame? If not, then the specialized version can stay consistent with the input of is_categorical_dtype(). A similar boolean array can be obtained with df.apply(pd.api.types.is_ordered_categorical_dtype), though it will not contain pd.NA to signal non-categorical columns.

To retrieve the columns we can then simply do this:
people.loc[:, people.apply(pd.api.types.is_ordered_categorical_dtype)]
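
To make this concrete, here is a minimal sketch that defines a local stand-in for the proposed function (hypothetical, not part of pandas) and builds the mask two ways: with a column-wise apply, and by looking only at df.dtypes. The second form never passes column data to the function; whether the difference matters in practice has not been measured here.

```python
import pandas as pd

def is_ordered_categorical_dtype(arr_or_dtype) -> bool:
    """Hypothetical stand-in for the proposed function; not part of pandas."""
    if isinstance(arr_or_dtype, pd.CategoricalDtype):
        dtype = arr_or_dtype
    else:
        dtype = getattr(arr_or_dtype, "dtype", None)
    return isinstance(dtype, pd.CategoricalDtype) and bool(dtype.ordered)

# Small deterministic stand-in for the `people` frame.
people = pd.DataFrame({
    "eye_color": pd.Categorical(["blue", "brown", "blue"], categories=["blue", "brown"]),
    "age": [25, 34, 52],
})
people["age_group"] = pd.cut(people["age"], [20, 30, 40, 50, 60], right=False)

# Column-wise apply: the function receives each column as a Series.
mask_apply = people.apply(is_ordered_categorical_dtype)

# Alternative: inspect only the dtype objects, one per column.
mask_dtypes = pd.Series(
    [is_ordered_categorical_dtype(dtype) for dtype in people.dtypes],
    index=people.columns,
)

ordered_cols = people.loc[:, mask_dtypes]
```

Both masks are plain boolean Series indexed by column name, so non-categorical columns show up as False rather than pd.NA.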

richierocks added the Enhancement and Needs Triage labels on May 4, 2022

simonjayhawkins added the API Design and Categorical labels and removed the Needs Triage label on May 5, 2022

ShaopengLin mentioned this issue Apr 8, 2023

D4 issue 46941 - ENH: Allow easy selection of ordered/unordered categorical columns LingSu-dev/d01w23-team-timbits#29
