Why pd.BooleanDtype() is casted to Float64 by groupby/last? #33071

ghuname · 2020-03-27T16:21:42Z

Code Sample, a copy-pastable example if possible

>>> import pandas as pd
>>>
>>> df = pd.DataFrame({'a': ['x', 'x', 'y', 'y'], 'b': ['x', 'x', 'y', 'y'], 'c': [False, False, True, False]})
>>> df['d'] = df.c.astype(pd.BooleanDtype())
>>>
>>> df.dtypes
a     object
b     object
c       bool
d    boolean
dtype: object
>>>
>>> df.groupby(['a', 'b']).c.last()
a  b
x  x    False
y  y    False
Name: c, dtype: bool
>>>
>>> df.groupby(['a', 'b']).d.last()
a  b
x  x    0.0
y  y    0.0
Name: d, dtype: float64
>>>

Problem description

df.groupby(['a', 'b']).c.last() returns False values with dtype bool, but df.groupby(['a', 'b']).d.last() returns 0.0 values with dtype float64.
Why the difference?

Expected Output

I expect both calls to return False values rather than floats.
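
Until a fix is available, one possible workaround is to cast the grouped result back to the nullable boolean dtype. The snippet below is only a sketch: it assumes the float64 result contains nothing but 0.0 and 1.0 (which is what the affected versions produce for boolean input), and the variable name `result` is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": ["x", "x", "y", "y"],
        "b": ["x", "x", "y", "y"],
        "d": pd.array([False, False, True, False], dtype="boolean"),
    }
)

result = df.groupby(["a", "b"])["d"].last()

# On affected versions the result comes back as float64, so cast it back to
# the nullable boolean dtype. On versions that already preserve the dtype,
# this branch is skipped. (Assumption: the floats are only 0.0 or 1.0.)
if str(result.dtype) != "boolean":
    result = result.astype("boolean")

print(result.dtype)
```

The dtype check keeps the snippet harmless on versions where the groupby result is already boolean.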

Output of `pd.show_versions()`

python : 3.7.4.final.0
pandas : 1.0.3


dsaxton · 2020-03-28T02:19:20Z

Thanks. Looks like the bug exists at least for min and max as well:

In [1]: import pandas as pd                                                                                                   

In [2]: df = pd.DataFrame({"a": [1, 2], "b": pd.array([True, False])})                                                        

In [3]: df.dtypes                                                                                                             
Out[3]: 
a      int64
b    boolean
dtype: object

In [4]: df.groupby("a")["b"].min()                                                                                            
Out[4]: 
a
1    1.0
2    0.0
Name: b, dtype: float64

In [5]: df.groupby("a")["b"].max()                                                                                            
Out[5]: 
a
1    1.0
2    0.0
Name: b, dtype: float64
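
To see at a glance which reductions keep the nullable dtype on a given pandas version, a small survey loop may help. This is a rough sketch, not part of pandas: `survey_reductions` is a hypothetical helper, and the list of reduction names is just the set discussed in this thread plus `first`.

```python
import pandas as pd


def survey_reductions(frame, key, column, names):
    """Map each reduction name to the result dtype, or to the error type raised."""
    grouped = frame.groupby(key)[column]
    report = {}
    for name in names:
        try:
            # Look up the groupby method by name and record the result dtype.
            report[name] = str(getattr(grouped, name)().dtype)
        except (TypeError, NotImplementedError) as exc:
            # Record the failure instead of stopping the survey.
            report[name] = type(exc).__name__
    return report


df = pd.DataFrame({"a": [1, 2], "b": pd.array([True, False], dtype="boolean")})
report = survey_reductions(df, "a", "b", ["first", "last", "min", "max"])
for name, outcome in report.items():
    print(name, outcome)
```

The printed dtypes depend on the installed pandas version, so no expected output is shown here.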

simonjayhawkins · 2020-03-28T12:33:13Z

This also occurs for Int64Dtype; see #32194.

>>> import pandas as pd
>>> pd.__version__
'1.1.0.dev0+999.gc47e9ca8b'
>>>
>>> df = pd.DataFrame(
...     {"a": ["x", "x", "y", "y"], "b": ["x", "x", "y", "y"], "c": [0, 1, 2, 3]}
... )
>>> df["d"] = df.c.astype(pd.Int64Dtype())
>>>
>>> df.dtypes
a    object
b    object
c     int64
d     Int64
dtype: object
>>>
>>>
>>> df.groupby(["a", "b"]).c.last()
a  b
x  x    1
y  y    3
Name: c, dtype: int64
>>>
>>>
>>> df.groupby(["a", "b"]).d.last()
a  b
x  x    1.0
y  y    3.0
Name: d, dtype: float64
>>>

ghuname · 2020-03-29T17:24:15Z

> Thanks. Looks like the bug exists at least for min and max as well:

You are welcome. I am glad that I can contribute (at least by testing) to the development of such a marvel as pandas.

Best regards.

ghuname · 2020-03-29T17:27:49Z

In the dataframe below, lag has the Int64 dtype, and I had to cast it to int to make quantile work:

df_window_lag_sum.assign(lag=lambda x: x.lag.astype(int)).groupby(['cgrp', 'topic']).lag.quantile(quantily_cut_off_value)

Regards.

dsaxton · 2020-03-29T22:51:23Z

@ghuname Can you provide a small reproducible example for this problem?

ghuname · 2020-03-30T12:34:13Z

> @ghuname Can you provide a small reproducible example for this problem?

Here you go:

Python 3.7.4 (default, Aug  9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> from pandas import Timestamp
>>> df_dict = {'dtime': {0: Timestamp('2020-03-28 14:15:00'),
...   1: Timestamp('2020-03-28 14:15:00'),
...   2: Timestamp('2020-03-28 14:15:00')},
...  'cgrp': {0: 'grp1',
...   1: 'grp2',
...   2: 'grp3'},
...  'topic': {0: 'top1',
...   1: 'top1',
...   2: 'top1'},
...  'lag': {0: 18, 1: 1, 2: 83}}
>>> df = pd.DataFrame(df_dict)
>>> df.cgrp = df.cgrp.astype(pd.StringDtype())
>>> df.topic = df.topic.astype(pd.StringDtype())
>>> df.lag = df.lag.astype(pd.Int64Dtype())
>>> df
                dtime  cgrp topic  lag
0 2020-03-28 14:15:00  grp1  top1   18
1 2020-03-28 14:15:00  grp2  top1    1
2 2020-03-28 14:15:00  grp3  top1   83
>>> df.dtypes
dtime    datetime64[ns]
cgrp             string
topic            string
lag               Int64
dtype: object
>>> df.groupby(['cgrp', 'topic']).lag.quantile(0.5)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\User\Miniconda3\lib\site-packages\pandas\core\groupby\groupby.py", line 1913, in quantile
    interpolation=interpolation,
  File "C:\Users\User\Miniconda3\lib\site-packages\pandas\core\groupby\groupby.py", line 2291, in _get_cythonized_result
    func(**kwargs)  # Call func to modify indexer values in place
  File "pandas\_libs\groupby.pyx", line 720, in pandas._libs.groupby.__pyx_fused_cpdef
TypeError: No matching signature found
>>> df.assign(lag=lambda x: x.lag.astype(int)).groupby(['cgrp', 'topic']).lag.quantile(0.5)
cgrp  topic
grp1  top1     18.0
grp2  top1      1.0
grp3  top1     83.0
Name: lag, dtype: float64
>>>
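
The astype(int) workaround above only works when the column has no missing values. A variant that should also tolerate pd.NA is to cast to float64 first, so that missing values become NaN. This is a sketch on made-up data (the second `grp1` row with a missing lag is an assumption added for illustration), not a statement about how pandas will eventually handle this case.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "cgrp": ["grp1", "grp1", "grp2", "grp2"],
        # Nullable integer column with one missing value (illustrative data).
        "lag": pd.array([18, None, 1, 83], dtype="Int64"),
    }
)

# astype(int) would raise on the missing value; float64 maps pd.NA to NaN,
# which the grouped quantile then ignores.
medians = (
    df.assign(lag=lambda x: x["lag"].astype("float64"))
    .groupby("cgrp")["lag"]
    .quantile(0.5)
)
print(medians)
```

The result is a float64 Series either way, which matches what quantile returns for plain int64 input in the session above.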

dsaxton · 2020-03-30T17:42:08Z

Thanks @ghuname, I've opened a separate issue / PR for this bug.

ghuname · 2020-04-06T11:47:04Z

You are welcome.

dsaxton · 2020-04-06T12:25:25Z

@ghuname We can leave this issue open; it should be closed automatically if / when the associated PR is merged.

dsaxton mentioned this issue Mar 28, 2020

BUG: Don't cast nullable Boolean to float in groupby #33089

Merged

simonjayhawkins added the Dtype Conversions, ExtensionArray, and Groupby labels Mar 28, 2020

simonjayhawkins mentioned this issue Mar 28, 2020

Calling sum with min_count on SeriesGroupBy with dtype Int64 gives large negative value rather than pd.NA #32861

Closed

dsaxton mentioned this issue Mar 29, 2020

SeriesGroupBy.quantile doesn't work for nullable integers #33136

Closed

ghuname closed this as completed Apr 6, 2020

dsaxton reopened this Apr 6, 2020

jreback added this to the 1.1 milestone Apr 7, 2020

simonjayhawkins closed this as completed in #33089 Apr 7, 2020

simonjayhawkins modified the milestones: 1.1, 1.0.4 May 26, 2020
