ENH: Improved `CategoricalDtype` subtype handling. #48515

randolf-scholz · 2022-09-12T13:00:05Z

Feature Type

Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas

Problem Description

Internally, categories already distinguish different subtypes. Consider for example:

import pandas as pd
s = pd.Series(["foo", "bar"], dtype=object)
print(s.astype("category"))
print(s.astype("string").astype("category"))

In the first case, the resulting dtype's .categories is Index(['bar', 'foo'], dtype='object'); in the latter case it is Index(['bar', 'foo'], dtype='string').

However, handling of these subtypes is currently a bit awkward. The proposed features are quality-of-life improvements for working with such data, mainly:

Allow direct casting to categories of specific subtype via .astype("category[<type>]")
Ensure round tripping subtypes when serializing in formats that support categorical types.

import pandas as pd

df = pd.DataFrame({"col": ["foo", "bar"]}, dtype=object)
df = df.astype("string").astype("category")
df.to_parquet("test.parquet")
print(df["col"].dtype.categories)  # categories before the round trip
df = pd.read_parquet("test.parquet")
print(df["col"].dtype.categories)  # categories after the round trip

Feature Description

Make CategoricalDtype a typing.Generic parametrized by a scalar type. (⇝ relevant for pandas-stubs)
The fallback should be category[object] (cf. Defaults for Generics? python/mypy#4236 (comment))
Allow type casting .astype("category[<type>]")
- series.astype("category[string]") should behave equivalently to series.astype("string").astype("category")
Allow usage in constructor methods such as read_csv(file, dtype=...) and DataFrame(..., dtype=...)
Ensure category subtypes are maintained through serialization and loading
- In particular, when reading parquet/feather format. (⇝ interoperability with pyarrow's dictionary type)
Allow type checking series.dtype == "category[string]".
- Possibly series.dtype == "string" and pd.api.types.is_string_dtype(series) should evaluate to True if the dtype is category[string], since category acts only as a kind of wrapper and things like Series.str accessor are still applicable. (needs discussion)
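A minimal sketch of how the proposed .astype("category[<type>]") could be emulated with existing functionality. The helper name astype_category_alias and the regular expression are assumptions made for illustration; neither is part of pandas.

```python
import re

import pandas as pd

# Hypothetical alias syntax from the proposal: "category[<subtype>]".
_CATEGORY_ALIAS = re.compile(r"category\[(?P<subtype>.+)\]")


def astype_category_alias(series: pd.Series, alias: str) -> pd.Series:
    """Emulate the proposed series.astype("category[<subtype>]") (hypothetical helper)."""
    if alias == "category":
        return series.astype("category")
    match = _CATEGORY_ALIAS.fullmatch(alias)
    if match is None:
        raise ValueError(f"not a category alias: {alias!r}")
    # Cast the values to the subtype first, then encode them as categories.
    return series.astype(match["subtype"]).astype("category")


s = pd.Series(["foo", "bar"], dtype=object)
result = astype_category_alias(s, "category[string]")
print(result.dtype.categories.dtype)
```

By construction this is the same two-step cast as series.astype("string").astype("category"), so it only illustrates the intended equivalence, not a possible performance gain.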

Alternative Solutions

Existing functionality is to manually cast as .astype(<type>).astype("category") whenever necessary, or to explicitly construct an instance of CategoricalDtype, which however requires a-priori knowledge of the categories.
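The two existing workarounds can be compared side by side. This is a sketch assuming a recent pandas version in which pd.Index accepts dtype="string"; the sample data is made up.

```python
import pandas as pd

s = pd.Series(["foo", "bar", "foo"], dtype=object)

# Explicit dtype: the categories must be known up front. Passing them as a
# typed Index fixes the subtype of the resulting CategoricalDtype.
dtype = pd.CategoricalDtype(pd.Index(["bar", "foo"], dtype="string"))
explicit = s.astype("string").astype(dtype)

# Two-step cast: the categories are inferred from the data.
inferred = s.astype("string").astype("category")

print(explicit.dtype.categories.dtype)
print(inferred.dtype.categories.dtype)
```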

Additional Context

Allowing direct casting to category[<type>] when using read_csv should bring minor performance benefits.


tehunter · 2022-09-14T14:20:53Z

+1, in particular the "Ensure round tripping subtypes when serializing in formats that support categorical types". Doing a pd.testing.assert_frame_equal fails on a round-trip to parquet with the CategoricalDtype string-type.

brurosa · 2022-10-20T13:36:27Z

take

topper-123 · 2023-05-10T21:33:17Z

xref #50041 (similar but different: about getting an array out of a categorical, in the same dtype as the categories).

topper-123 · 2023-05-10T21:40:25Z

I'm not super sure what I think about this. I can see its utility, but I'm worried about the complexity, e.g. "category[string]" isn't a dtype currently, and it takes more to define a CategoricalDtype than the underlying type. For example:

CategoricalDtype(["a", "b"]) is not the same as CategoricalDtype(["a", "b", "c"]), but both would be "category[string]". This would mean it is not usable for a lot of things that we currently use dtypes for, while being similar. That similarity will confuse users IMO.

I'm -1 on this concrete idea (but I acknowledge the problem you're describing is real).
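The dtype-identity point can be checked against the existing API. A small sketch; the variable names are illustrative:

```python
import pandas as pd

small = pd.CategoricalDtype(["a", "b"])
large = pd.CategoricalDtype(["a", "b", "c"])

# Two categorical dtypes with different categories are distinct dtypes.
print(small == large)  # False

# The plain "category" alias already compares equal to both of them.
print(small == "category", large == "category")  # True True
```

The existing "category" string already behaves like a family of dtypes in comparisons, which is the kind of similarity-without-identity the concern is about.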

randolf-scholz · 2023-05-11T08:15:38Z

@topper-123 Well, this is how pyarrow's dictionary type works: it simply saves a dictionary [index_type -> value_type], and then stores index-values (typically int32) instead of the actual values.

To me it makes a lot of sense to have it this way, and I don't see what's supposedly confusing about it. In fact, allowing the dtype to be specified as category[string] potentially allows us to be far more precise in how categories are handled, particularly in what happens when data is added. Writing category[string] basically says that any string is considered a valid category, while writing CategoricalDtype(["a", "b"]) means that only the strings "a" and "b" are valid categories, and trying, e.g., to add or write a value of "c" to the table should in principle raise a ValueError.

joining x=Series(["a", "b"], dtype=CategoricalDtype(["a", "b"])) on y=Series(["a", "b", "c"], dtype=CategoricalDtype(["a", "b", "c"])) is ok, because all categories in x are contained in y.
joining y on x should raise an error, since the category "c" is not in x.
joining x=Series(["a", "b"], dtype="category[string]") on y=Series(["a", "b", "c"], dtype="category[string]") or vice versa are both ok, since any string is a valid category.
symmetric merge operations should consider the union of the categories. (where union of string with any literal list should result in string again.)
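The rules in this list can be modelled in plain Python. This is only a sketch of the proposed semantics: the CategorySpec class and its accepts and union methods are hypothetical and do not exist in pandas. A spec with categories=None stands for an open dtype such as category[string].

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class CategorySpec:
    """Hypothetical model of a categorical dtype: a subtype plus optional fixed categories."""

    subtype: str
    categories: Optional[FrozenSet[str]] = None  # None means any value of the subtype

    def accepts(self, other: "CategorySpec") -> bool:
        """Return True if every category valid under `other` is also valid under `self`."""
        if self.subtype != other.subtype:
            return False
        if self.categories is None:
            return True  # open spec: any value of the subtype is a valid category
        if other.categories is None:
            return False  # a closed spec cannot hold arbitrary values
        return other.categories <= self.categories

    def union(self, other: "CategorySpec") -> "CategorySpec":
        """Combine two specs, as a symmetric merge would."""
        if self.subtype != other.subtype:
            raise TypeError("subtypes differ")
        if self.categories is None or other.categories is None:
            return CategorySpec(self.subtype)  # open absorbs any literal list
        return CategorySpec(self.subtype, self.categories | other.categories)


x = CategorySpec("string", frozenset({"a", "b"}))
y = CategorySpec("string", frozenset({"a", "b", "c"}))
any_string = CategorySpec("string")

print(y.accepts(x))  # joining x on y: allowed
print(x.accepts(y))  # joining y on x: rejected, "c" is not a category of x
print(any_string.accepts(any_string))  # open on open: allowed
print(x.union(y).categories == y.categories)  # union of the closed category sets
```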

But please notice, one of the major points of this issue is that even if we disallow writing category[string], it should still be possible to specify/restrict the subtype used in CategoricalDtype; currently, all too often it defaults to object, even if all categories are strings. The CategoricalDtype should imo move toward compatibility with pyarrow's dictionary type.

topper-123 · 2023-05-11T11:23:43Z

There was at one point discussion about adding a Categorical-like dtype that didn't have fixed categories. I think the idea was that the categories would be part of the array instead of the dtype and be dict-like instead of an array. It didn't go anywhere, unfortunately IMO, but that would be very similar to what you're describing I think, e.g. having a dtype of EncodedDtype("string") or similar, which would be translatable from "encoded[string]".

I agree it would be nice to specify a categorical with a guaranteed dtype, but not necessarily spelled out categories. I don't think I'd be positive doing it in string form only ("category[string]") without a corresponding actual dtype.

Just riffing a bit on your idea, but maybe we could accept a subtype argument to CategoricalDtype. We already accept pd.Series(..., dtype=pd.CategoricalDtype()) to get whatever is in the array portion of the series as the categories. Would it be a stretch to also allow pd.CategoricalDtype(subtype="string"), which would then be the actual dtype for "category[string]"?

topper-123 · 2023-05-11T11:26:28Z

Found it: #20899.

randolf-scholz · 2023-05-11T12:13:52Z

Having a string alias is really nice to have, for example if you want to write table schemas in a config file (json/toml/yaml). I do this quite often when I need to write a pipeline for data that comes via .csv from somewhere. I create a config file containing the dictionary with column-name -> dtype mapping and pass that to read_csv. Right now, categoricals are really a sore point here because they completely break this workflow, while all other data types are fine.
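A sketch of the config-driven workflow described here, with today's workaround for categoricals. The column names, the inline CSV, and the "category[string]" alias in the config are assumptions for illustration; pandas does not currently accept that alias, so the sketch splits it into a read dtype and a follow-up cast.

```python
import io
import json
import re

import pandas as pd

# Schema as it might appear in a JSON config file (hypothetical contents).
config = json.loads('{"id": "int64", "label": "category[string]"}')
csv_file = io.StringIO("id,label\n1,foo\n2,bar\n3,foo\n")

alias_pattern = re.compile(r"category\[(.+)\]")
read_dtypes = {}
categorical_columns = []
for column, alias in config.items():
    match = alias_pattern.fullmatch(alias)
    if match:
        # Read the column with its subtype, and remember to encode it afterwards.
        read_dtypes[column] = match.group(1)
        categorical_columns.append(column)
    else:
        read_dtypes[column] = alias

df = pd.read_csv(csv_file, dtype=read_dtypes)
df = df.astype({column: "category" for column in categorical_columns})
print(df.dtypes)
print(df["label"].dtype.categories.dtype)
```

With a native alias, the splitting loop and the second cast would be unnecessary and the config could be passed to read_csv directly.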

randolf-scholz · 2023-05-11T12:23:46Z

@topper-123 Also, as mentioned in this thread, CategoricalDtype already internally kind of does have a subtype: every CategoricalDtype has a property .categories which is an instance of Index, and Index has a dtype. Here one could also get some speed/memory benefits, for example when using string[pyarrow] instead of object.

topper-123 · 2023-05-11T14:32:18Z

My point was to make that possible, i.e. pandas_dtype("category[string]") would return CategoricalDtype(subtype=StringDtype()). So it would fix this issue with this being different than other string -> dtype conversions.

randolf-scholz added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 12, 2022

github-actions bot assigned brurosa Oct 20, 2022

brurosa removed their assignment Oct 31, 2022

topper-123 added Categorical Categorical Data Type Closing Candidate May be closeable, needs more eyeballs and removed Needs Triage Issue that has not been reviewed by a pandas team member labels May 10, 2023

topper-123 mentioned this issue May 11, 2023

ENH: allow better categorical dtype strings, e.g. category[string] #53190

Closed

6 tasks

topper-123 removed the Closing Candidate May be closeable, needs more eyeballs label May 12, 2023

VladimirFokow mentioned this issue Feb 6, 2024

ENH: show category[dtype] #57281

Closed

3 tasks
