ENH: Improved CategoricalDtype subtype handling. #48515

Open
2 of 9 tasks
randolf-scholz opened this issue Sep 12, 2022 · 10 comments
Labels
Categorical (Categorical Data Type), Enhancement

Comments

@randolf-scholz
Contributor

randolf-scholz commented Sep 12, 2022

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Internally, categoricals already distinguish different subtypes. Consider, for example:

import pandas as pd
s = pd.Series(["foo", "bar"], dtype=object)
print(s.astype("category"))
print(s.astype("string").astype("category"))

In the first case, the resulting dtype's categories are Index(['bar', 'foo'], dtype='object'); in the latter case they are Index(['bar', 'foo'], dtype='string').

However, handling these subtypes is currently a bit awkward. The proposed features are quality-of-life improvements for working with such data, mainly:

  1. Allow direct casting to categories of specific subtype via .astype("category[<type>]")
  2. Ensure subtypes round-trip when serializing to formats that support categorical types.

For the second point, the two printed indexes below should have the same dtype:

import pandas as pd
df = pd.DataFrame({"col": ["foo", "bar"]}, dtype=object)
df = df.astype("string").astype("category")
df.to_parquet("test.parquet")
print(df["col"].dtype.categories)  # categories before writing
df = pd.read_parquet("test.parquet")
print(df["col"].dtype.categories)  # categories after reading back

Feature Description

  • Make CategoricalDtype a typing.Generic parametrized by a scalar type. (⇝ relevant for pandas-stubs)
  • The fallback should be category[object] (cf. "Defaults for Generics?", python/mypy#4236)
  • Allow type casting .astype("category[<type>]")
    • series.astype("category[string]") should behave equivalently to series.astype("string").astype("category")
  • Allow usage in constructor methods such as read_csv(file, dtype=...) and DataFrame(..., dtype=...)
  • Ensure category subtypes are maintained through serialization and loading
    • In particular, when reading parquet/feather format. (⇝ interoperability with pyarrow's dictionary type)
  • Allow type checking series.dtype == "category[string]".
    • Possibly series.dtype == "string" and pd.api.types.is_string_dtype(series) should evaluate to True if the dtype is category[string], since category acts only as a kind of wrapper and things like Series.str accessor are still applicable. (needs discussion)
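The casting part of the list above can be emulated today with a small helper. This is only a sketch: the name `astype_category` and the regex are made up for illustration, and the `category[<type>]` alias is the proposed syntax, not something pandas parses itself.

```python
import re

import pandas as pd

# Hypothetical alias syntax from this proposal; pandas does not understand it.
_ALIAS = re.compile(r"^category\[(?P<subtype>.+)\]$")


def astype_category(series: pd.Series, alias: str) -> pd.Series:
    """Cast to a categorical whose categories use the subtype named in `alias`."""
    if alias == "category":
        return series.astype("category")
    match = _ALIAS.match(alias)
    if match is None:
        raise ValueError(f"not a category alias: {alias!r}")
    # Same two-step cast as series.astype(<type>).astype("category").
    return series.astype(match["subtype"]).astype("category")


s = pd.Series(["foo", "bar"], dtype=object)
result = astype_category(s, "category[string]")
print(result.cat.categories.dtype)
```

The helper only covers `.astype`; it does nothing for `read_csv`, serialization, or dtype comparison, which are the other points in the list.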

Alternative Solutions

The existing options are to cast manually with .astype(<type>).astype("category") whenever necessary, or to construct a CategoricalDtype instance explicitly, which requires a priori knowledge of the categories.
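For reference, a minimal sketch of the two existing workarounds, reusing the data from the problem description. The variable names are illustrative only.

```python
import pandas as pd

s = pd.Series(["foo", "bar"], dtype=object)

# Workaround 1: cast to the subtype first, then to category.
via_cast = s.astype("string").astype("category")

# Workaround 2: build the dtype explicitly. The categories must be known
# up front, and passing an Index keeps that Index's dtype as the subtype.
explicit_dtype = pd.CategoricalDtype(pd.Index(["bar", "foo"], dtype="string"))
via_dtype = s.astype(explicit_dtype)

print(via_cast.cat.categories.dtype)
print(via_dtype.cat.categories.dtype)
```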

Additional Context

Allowing direct casting to category[<type>] when using read_csv should bring minor performance benefits.

@randolf-scholz randolf-scholz added the Enhancement and Needs Triage (Issue that has not been reviewed by a pandas team member) labels Sep 12, 2022
@tehunter
Contributor

tehunter commented Sep 14, 2022

+1, in particular the "Ensure round tripping subtypes when serializing in formats that support categorical types". Doing a pd.testing.assert_frame_equal fails on a round-trip to parquet with the CategoricalDtype string-type.

@brurosa

brurosa commented Oct 20, 2022

take

@brurosa brurosa removed their assignment Oct 31, 2022
@topper-123
Contributor

xref #50041 (similar but different: it is about getting an array out of a categorical, in the same dtype as the categories).

@topper-123
Contributor

topper-123 commented May 10, 2023

I'm not super sure what I think about this. I can see its utility, but I'm worried about the complexity, e.g. "category[string]" isn't a dtype currently, and it takes more than the underlying type to define a CategoricalDtype. For example:

CategoricalDtype(["a", "b"]) is not the same as CategoricalDtype(["a", "b", "c"]), but both would be "category[string]". This would mean it is not usable for a lot of things that we currently use dtypes for, while looking similar. That similarity will confuse users IMO.

I'm -1 on this concrete idea (but I acknowledge the problem you're describing is real).
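The dtype comparison described in this comment can be checked directly. A small sketch with illustrative variable names:

```python
import pandas as pd

two = pd.CategoricalDtype(["a", "b"])
three = pd.CategoricalDtype(["a", "b", "c"])

# Different category sets make different dtypes ...
print(two == three)

# ... yet both match the generic "category" alias ...
print(two == "category", three == "category")

# ... and both store their categories in an Index of the same dtype.
print(two.categories.dtype == three.categories.dtype)
```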

@topper-123 topper-123 added the Categorical (Categorical Data Type) and Closing Candidate (May be closeable, needs more eyeballs) labels and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label May 10, 2023
@randolf-scholz
Contributor Author

@topper-123 Well, this is how pyarrow's dictionary type works: it simply saves a dictionary [index_type -> value_type], and then stores index-values (typically int32) instead of the actual values.

To me it makes a lot of sense to have it this way, and I don't see what's supposedly confusing about it. In fact, allowing the dtype to be specified as category[string] potentially lets us be much more precise about how categories are handled, particularly when data is added. Writing category[string] basically says that any string is a valid category, while writing CategoricalDtype(["a", "b"]) means that only the strings "a" and "b" are valid categories, so trying to add or write a value of "c" to the table should in principle raise a ValueError.

  • joining x=Series(["a", "b"], dtype=CategoricalDtype(["a", "b"])) on y=Series(["a", "b", "c"], dtype=CategoricalDtype(["a", "b", "c"])) is ok, because all categories in x are contained in y
  • joining y on x should raise an error, since the category "c" is not in x.
  • joining x=Series(["a", "b"], dtype="category[string]") on y=Series(["a", "b", "c"], dtype="category[string]"), or vice versa, is ok in both directions, since any string is a valid category.
  • symmetric merge operations should consider the union of the categories. (where union of string with any literal list should result in string again.)
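Two of these behaviours can already be observed on plain `pd.Categorical` arrays. The sketch below does not implement the proposed join semantics; it only shows that fixed categories reject new values and that `pandas.api.types.union_categoricals` combines category sets. The exception type raised on assignment has differed between pandas versions, so both candidates are caught.

```python
import pandas as pd
from pandas.api.types import union_categoricals

x = pd.Categorical(["a", "b"], categories=["a", "b"])
y = pd.Categorical(["a", "b", "c"], categories=["a", "b", "c"])

# Fixed categories: assigning a value outside them is rejected.
try:
    x[0] = "c"
    rejected = False
except (TypeError, ValueError):
    rejected = True
print(rejected)

# Symmetric combination: the result uses the union of both category sets.
combined = union_categoricals([x, y])
print(list(combined.categories))
```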

But please note that one of the major points of this issue is that, even if we disallow writing category[string], it should still be possible to specify or restrict the subtype used in CategoricalDtype; currently, all too often it defaults to object, even if all categories are strings. The CategoricalDtype should IMO move toward compatibility with pyarrow's dictionary type.

@topper-123
Contributor

topper-123 commented May 11, 2023

There was at one point discussion about adding a Categorical-like dtype that didn't have fixed categories. I think the idea was that the categories would be part of the array instead of the dtype and be dict-like instead of an array. It didn't go anywhere, unfortunately IMO, but that would be very similar to what you're describing I think, e.g. having a dtype of EncodedDtype("string") or similar, which would be translatable from "encoded[string]".

I agree it would be nice to specify a categorical with a guaranteed dtype, but not necessarily spelled out categories. I don't think I'd be positive doing it in string form only ("category[string]") without a corresponding actual dtype.

Just riffing a bit on your idea, but maybe we could accept a subtype argument to CategoricalDtype. We already accept pd.Series(..., dtype=pd.CategoricalDtype()) to get whatever is in the array portion of the series as the categories. Would it be a stretch to also allow pd.CategoricalDtype(subtype="string"), which would then be the actual dtype for "category[string]"?

@topper-123
Contributor

Found it: #20899.

@randolf-scholz
Contributor Author

randolf-scholz commented May 11, 2023

Having a string alias is really nice to have, for example if you want to write table schemas in a config file (json/toml/yaml). I do this quite often when I need to write a pipeline for data that comes via .csv from somewhere. I create a config file containing the dictionary with column-name -> dtype mapping and pass that to read_csv. Right now, categoricals are really a sore point here because they completely break this workflow, while all other data types are fine.
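A sketch of how such a config-driven pipeline has to handle categoricals at the moment. The column names, the inline CSV, and the `category[...]` alias in the config are assumptions for illustration; since pandas does not parse that alias, the loader splits it into a subtype for read_csv and a follow-up cast.

```python
import io
import json

import pandas as pd

# Schema as it might appear in a JSON config file (hypothetical columns).
config = json.loads('{"id": "int64", "label": "category[string]"}')
csv = io.StringIO("id,label\n1,foo\n2,bar\n3,foo\n")

PREFIX, SUFFIX = "category[", "]"
read_dtypes, categorical_cols = {}, []
for column, alias in config.items():
    if alias.startswith(PREFIX) and alias.endswith(SUFFIX):
        # Read the column with the subtype first; cast to category afterwards.
        read_dtypes[column] = alias[len(PREFIX):-len(SUFFIX)]
        categorical_cols.append(column)
    else:
        read_dtypes[column] = alias

df = pd.read_csv(csv, dtype=read_dtypes)
df = df.astype({column: "category" for column in categorical_cols})
print(df.dtypes)
```

With a built-in alias, the loop would disappear and the config could be passed to read_csv unchanged.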

@randolf-scholz
Contributor Author

randolf-scholz commented May 11, 2023

@topper-123 Also, as mentioned in this thread, CategoricalDtype already internally kind of does have a subtype: every CategoricalDtype has a property .categories which is an instance of Index, and Index has a dtype. Here one could also get some speed/memory benefits, for example when using string[pyarrow] instead of object.

@topper-123
Contributor

My point was to make that possible, i.e. pandas_dtype("category[string]") would return CategoricalDtype(subtype=StringDtype()). That would fix the issue of this alias behaving differently from other string -> dtype conversions.

4 participants