
Categorical dtype doesn't survive groupby of first, max, min, value_counts etc.: unwanted coercion to object #18502

Closed
dcolascione opened this issue Nov 26, 2017 · 1 comment

@dcolascione commented Nov 26, 2017

Code Sample, a copy-pastable example if possible


    In [1]: df=pd.DataFrame(dict(payload=[-1,-2,-1,-2], col=pd.Categorical(["foo", "bar", "bar", "qux"], ordered=True)));df
    Out[1]: 
       col  payload
    0  foo       -1
    1  bar       -2
    2  bar       -1
    3  qux       -2

    In [2]: df.groupby("payload").first().col.dtype
    Out[2]: dtype('O')

Problem description

Grouping shouldn't coerce a categorical into object. Categorical dtypes should be preserved as long as possible for efficiency and correctness.

Expected Output

The result dtype should be CategoricalDtype(categories=['bar', 'foo', 'qux'], ordered=True), just like it is here:

    In [6]: df.groupby("payload").head().col.dtype
    Out[6]: CategoricalDtype(categories=['bar', 'foo', 'qux'], ordered=True)
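Until this is fixed, a workaround (my own sketch, not part of this issue's resolution) is to re-apply the source column's dtype after the aggregation:

```python
import pandas as pd

df = pd.DataFrame(dict(
    payload=[-1, -2, -1, -2],
    col=pd.Categorical(["foo", "bar", "bar", "qux"], ordered=True),
))

result = df.groupby("payload").first()
# Re-apply the original column's dtype in case the aggregation coerced it to object.
result["col"] = result["col"].astype(df["col"].dtype)
```

On affected pandas versions this round-trips through object, so it costs a copy, but it restores both the categories and the `ordered` flag.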

#### Output of ``pd.show_versions()``

<details>
    INSTALLED VERSIONS
    ------------------
    commit: None
    python: 3.4.3.final.0
    python-bits: 64
    OS: Linux
    OS-release: 4.4.0-98-generic
    machine: x86_64
    processor: x86_64
    byteorder: little
    LC_ALL: None
    LANG: en_US.UTF-8
    LOCALE: en_US.UTF-8

    pandas: 0.21.0
    pytest: 3.2.5
    pip: 9.0.1
    setuptools: 36.5.0
    Cython: 0.20.1post0
    numpy: 1.13.3
    scipy: 0.13.3
    pyarrow: None
    xarray: 0.9.6
    IPython: 6.2.0
    sphinx: None
    patsy: None
    dateutil: 2.6.1
    pytz: 2017.3
    blosc: None
    bottleneck: None
    tables: 3.1.1
    numexpr: 2.6.4
    feather: None
    matplotlib: 1.3.1
    openpyxl: None
    xlrd: None
    xlwt: None
    xlsxwriter: None
    lxml: None
    bs4: 4.2.1
    html5lib: 0.999
    sqlalchemy: 0.8.4
    pymysql: None
    psycopg2: None
    jinja2: 2.7.2
    s3fs: 0.1.2
    fastparquet: None
    pandas_gbq: None
    pandas_datareader: None

</details>
@jreback (Contributor) commented Nov 26, 2017

I suppose. This would require a fair amount of work, as these routines are in Cython and don't currently do this kind of dtype preservation. It would need to work for the non-numerical filters (first, last, min, max). If you want to submit a PR, it would be accepted.

Note: Python 3.4 is no longer supported, FYI.
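For context (my own illustration, not from the thread): ordered categoricals support min/max using the declared category order, which is why preserving the dtype through first/last/min/max is meaningful rather than cosmetic:

```python
import pandas as pd

# Ordered categories: bar < foo < qux
s = pd.Series(pd.Categorical(["foo", "bar", "qux"],
                             categories=["bar", "foo", "qux"],
                             ordered=True))

# min/max follow category order, not lexicographic string order.
print(s.min(), s.max())
```

If the dtype is coerced to object, these comparisons fall back to string ordering and the category ordering is lost.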

@jreback jreback added this to the Next Major Release milestone Nov 26, 2017

@jreback jreback changed the title Categorical dtype doesn't survive groupby of first, max, min, etc.: unwanted coercion to object Categorical dtype doesn't survive groupby of first, max, min, value_counts etc.: unwanted coercion to object Jan 27, 2018

@jreback jreback modified the milestones: Contributions Welcome, 0.25.0 May 28, 2019

jreback added a commit to jreback/pandas that referenced this issue May 29, 2019

BUG: preserve categorical & sparse types when grouping / pivot
preserve dtypes when applying a ufunc to a sparse dtype

closes pandas-dev#18502
closes pandas-dev#23743

