Why pd.BooleanDtype() is casted to Float64 by groupby/last? #33071

Closed
ghuname opened this issue Mar 27, 2020 · 9 comments · Fixed by #33089
Labels
Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays. Groupby
Milestone

Comments

@ghuname

ghuname commented Mar 27, 2020

Code Sample, a copy-pastable example if possible

>>> import pandas as pd
>>>
>>> df = pd.DataFrame({'a': ['x', 'x', 'y', 'y'], 'b': ['x', 'x', 'y', 'y'], 'c': [False, False, True, False]})
>>> df['d'] = df.c.astype(pd.BooleanDtype())
>>>
>>> df.dtypes
a     object
b     object
c       bool
d    boolean
dtype: object
>>>
>>> df.groupby(['a', 'b']).c.last()
a  b
x  x    False
y  y    False
Name: c, dtype: bool
>>>
>>> df.groupby(['a', 'b']).d.last()
a  b
x  x    0.0
y  y    0.0
Name: d, dtype: float64
>>>

Problem description

df.groupby(['a', 'b']).c.last() returns a bool Series (False), but df.groupby(['a', 'b']).d.last() returns a float64 Series (0.0).
Why the difference?

Expected Output

I expect both results to be False, with the boolean dtype preserved.

Output of pd.show_versions()

python : 3.7.4.final.0
pandas : 1.0.3

@dsaxton
Member

dsaxton commented Mar 28, 2020

Thanks. Looks like the bug exists at least for min and max as well:

In [1]: import pandas as pd                                                                                                   

In [2]: df = pd.DataFrame({"a": [1, 2], "b": pd.array([True, False])})                                                        

In [3]: df.dtypes                                                                                                             
Out[3]: 
a      int64
b    boolean
dtype: object

In [4]: df.groupby("a")["b"].min()                                                                                            
Out[4]: 
a
1    1.0
2    0.0
Name: b, dtype: float64

In [5]: df.groupby("a")["b"].max()                                                                                            
Out[5]: 
a
1    1.0
2    0.0
Name: b, dtype: float64
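
To see at a glance which reductions keep the nullable dtype on a given pandas installation, a small report helper can be handy. This is only a sketch: `reduction_dtypes` is a hypothetical name, not part of pandas, and the list of reductions is an assumption chosen to match the ones mentioned in this thread.

```python
import pandas as pd


def reduction_dtypes(df, key, col, ops=("first", "last", "min", "max")):
    """Hypothetical helper: map each groupby reduction name to its result dtype name."""
    grouped = df.groupby(key)[col]
    # Call each reduction by name and record the dtype of the resulting Series.
    return {op: str(getattr(grouped, op)().dtype) for op in ops}


df = pd.DataFrame({"a": [1, 2], "b": pd.array([True, False])})
report = reduction_dtypes(df, "a", "b")
print(report)
```

On an affected version the report shows float64 for the reductions above; which entries read boolean depends on the installed pandas version.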

@simonjayhawkins
Member

This also occurs for Int64Dtype; see #32194.

>>> import pandas as pd
>>> pd.__version__
'1.1.0.dev0+999.gc47e9ca8b'
>>>
>>> df = pd.DataFrame(
...     {"a": ["x", "x", "y", "y"], "b": ["x", "x", "y", "y"], "c": [0, 1, 2, 3]}
... )
>>> df["d"] = df.c.astype(pd.Int64Dtype())
>>>
>>> df.dtypes
a    object
b    object
c     int64
d     Int64
dtype: object
>>>
>>>
>>> df.groupby(["a", "b"]).c.last()
a  b
x  x    1
y  y    3
Name: c, dtype: int64
>>>
>>>
>>> df.groupby(["a", "b"]).d.last()
a  b
x  x    1.0
y  y    3.0
Name: d, dtype: float64
>>>
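
Until the dtype is preserved by the reduction itself, one possible workaround is to cast the result back to the column's original dtype. The sketch below assumes the float64 result only contains values that are valid in the original dtype (0/1 for boolean, whole numbers for Int64); `last_keep_dtype` is a hypothetical helper name, not a pandas function.

```python
import pandas as pd


def last_keep_dtype(df, keys, col):
    """Hypothetical helper: groupby(...).last() that restores the column's dtype."""
    original = df[col].dtype
    result = df.groupby(keys)[col].last()
    # On affected versions the result comes back as float64; cast it back.
    if result.dtype != original:
        result = result.astype(original)
    return result


df = pd.DataFrame(
    {"a": ["x", "x", "y", "y"], "b": ["x", "x", "y", "y"], "c": [0, 1, 2, 3]}
)
df["d"] = df["c"].astype(pd.Int64Dtype())
df["e"] = pd.array([False, False, True, False], dtype="boolean")

int_last = last_keep_dtype(df, ["a", "b"], "d")
bool_last = last_keep_dtype(df, ["a", "b"], "e")
print(int_last.dtype, bool_last.dtype)
```

The cast is skipped when the dtype already matches, so the helper should be harmless on versions where the bug is fixed.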

@ghuname
Author

ghuname commented Mar 29, 2020

Thanks. Looks like the bug exists at least for min and max as well:

You are welcome. I am glad that I can contribute (at least by testing) to the development of such a marvel as pandas.

Best regards.

@ghuname
Author

ghuname commented Mar 29, 2020

In the DataFrame below, lag has Int64 dtype, which I had to cast to int in order to make quantile work:

df_window_lag_sum.assign(lag=lambda x: x.lag.astype(int)).groupby(['cgrp', 'topic']).lag.quantile(quantily_cut_off_value)

Regards.

@dsaxton
Member

dsaxton commented Mar 29, 2020

@ghasemnaddaf Can you provide a small reproducible example for this problem?

@ghuname
Author

ghuname commented Mar 30, 2020

@ghasemnaddaf Can you provide a small reproducible example for this problem?

Here you go:

Python 3.7.4 (default, Aug  9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> from pandas import Timestamp
>>> df_dict = {'dtime': {0: Timestamp('2020-03-28 14:15:00'),
...   1: Timestamp('2020-03-28 14:15:00'),
...   2: Timestamp('2020-03-28 14:15:00')},
...  'cgrp': {0: 'grp1',
...   1: 'grp2',
...   2: 'grp3'},
...  'topic': {0: 'top1',
...   1: 'top1',
...   2: 'top1'},
...  'lag': {0: 18, 1: 1, 2: 83}}
>>> df = pd.DataFrame(df_dict)
>>> df.cgrp = df.cgrp.astype(pd.StringDtype())
>>> df.topic = df.topic.astype(pd.StringDtype())
>>> df.lag = df.lag.astype(pd.Int64Dtype())
>>> df
                dtime  cgrp topic  lag
0 2020-03-28 14:15:00  grp1  top1   18
1 2020-03-28 14:15:00  grp2  top1    1
2 2020-03-28 14:15:00  grp3  top1   83
>>> df.dtypes
dtime    datetime64[ns]
cgrp             string
topic            string
lag               Int64
dtype: object
>>> df.groupby(['cgrp', 'topic']).lag.quantile(0.5)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\User\Miniconda3\lib\site-packages\pandas\core\groupby\groupby.py", line 1913, in quantile
    interpolation=interpolation,
  File "C:\Users\User\Miniconda3\lib\site-packages\pandas\core\groupby\groupby.py", line 2291, in _get_cythonized_result
    func(**kwargs)  # Call func to modify indexer values in place
  File "pandas\_libs\groupby.pyx", line 720, in pandas._libs.groupby.__pyx_fused_cpdef
TypeError: No matching signature found
>>> df.assign(lag=lambda x: x.lag.astype(int)).groupby(['cgrp', 'topic']).lag.quantile(0.5)
cgrp  topic
grp1  top1     18.0
grp2  top1      1.0
grp3  top1     83.0
Name: lag, dtype: float64
>>>    
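
A variation on the astype(int) workaround above, sketched under the assumption that the column may also contain missing values: astype(int) raises on pd.NA, whereas casting to float64 turns missing values into NaN, which the groupby quantile skips. The data here is made up for illustration and is not from the original report.

```python
import pandas as pd

# Illustrative data (not from the original report): one lag value is missing.
df = pd.DataFrame(
    {
        "cgrp": ["grp1", "grp1", "grp2"],
        "topic": ["top1", "top1", "top1"],
        "lag": pd.array([18, None, 83], dtype="Int64"),
    }
)

# float64 can represent the missing value as NaN; a plain int dtype cannot.
lag_float = df["lag"].astype("float64")

medians = (
    df.assign(lag=lag_float)
    .groupby(["cgrp", "topic"])["lag"]
    .quantile(0.5)
)
print(medians)
```

The result is a float64 Series, the same as with the astype(int) workaround, so nothing is lost by going through float64 first.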

@dsaxton
Member

dsaxton commented Mar 30, 2020

Thanks @ghuname, I've opened a separate issue / PR for this bug.

@ghuname
Author

ghuname commented Apr 6, 2020

You are welcome.

@ghuname ghuname closed this as completed Apr 6, 2020
@dsaxton
Member

dsaxton commented Apr 6, 2020

@ghuname We can leave this issue open; it should be closed automatically if / when the associated PR is merged.

@dsaxton dsaxton reopened this Apr 6, 2020
@jreback jreback added this to the 1.1 milestone Apr 7, 2020
@simonjayhawkins simonjayhawkins modified the milestones: 1.1, 1.0.4 May 26, 2020