
ValueError when grouping with max/min as aggregation functions (pandas-1.0.1) #32077

Closed
zking1219 opened this issue Feb 18, 2020 · 9 comments
Labels
good first issue Groupby Needs Tests Unit test(s) needed to prevent regressions Regression Functionality that used to work in a prior pandas version

Comments

@zking1219

Code Sample

import pandas as pd # 1.0.1
import numpy as np # 1.18.1

# Simple test case that fails
df_simple = pd.DataFrame({'key': [1,1,2,2,3,3], 'data' : [10,20,30,40,50,60],
                          'good_string' : ['cat','dog','cat','dog','fish','pig'],
                          'bad_string' : ['cat',np.nan,np.nan, np.nan, np.nan, np.nan]})

df_simple_max = df_simple.groupby(['key'], as_index=False, sort=False).max()

Traceback (most recent call last):
  line 17, in <module>
    df_simple_max = df_simple.groupby(['key'], as_index=False, sort=False).max()

  File "/usr/local/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py", line 1378, in f
    return self._cython_agg_general(alias, alt=npfunc, **kwargs)

  File "/usr/local/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/generic.py", line 1004, in _cython_agg_general
    how, alt=alt, numeric_only=numeric_only, min_count=min_count

  File "/usr/local/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/generic.py", line 1099, in _cython_agg_blocks
    agg_block: Block = block.make_block(result)

  File "/usr/local/anaconda3/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 273, in make_block
    return make_block(values, placement=placement, ndim=self.ndim)

  File "/usr/local/anaconda3/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 3041, in make_block
    return klass(values, ndim=ndim, placement=placement)

  File "/usr/local/anaconda3/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 2589, in __init__
    super().__init__(values, ndim=ndim, placement=placement)

  File "/usr/local/anaconda3/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 125, in __init__
    f"Wrong number of items passed {len(self.values)}, "

ValueError: Wrong number of items passed 1, placement implies 2
-------------

# Add one more legitimate string value to the 'bad_string' column and it works
df_simple = pd.DataFrame({'key': [1,1,2,2,3,3], 'data' : [10,20,30,40,50,60],
                          'good_string' : ['cat','dog','cat','dog','fish','pig'],
                          'bad_string' : ['cat','dog',np.nan, np.nan, np.nan, np.nan]})

df_simple_max = df_simple.groupby(['key'], as_index=False, sort=False).max()

Problem description

Unless I've misunderstood something fundamental about the max and min aggregate functions, they shouldn't error out when a Series in the DataFrame is of type object and contains all but one NaN value. Notice in the example above that adding just one more non-NaN value to the offending Series avoids the ValueError.

Expected Output

No ValueError; the aggregated DataFrame is returned.
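A possible workaround on affected versions (a sketch, not from the thread): select only the columns that are valid for max() before grouping, which sidesteps the mostly-NaN object column that triggers the error.

```python
import numpy as np
import pandas as pd

df_simple = pd.DataFrame({
    "key": [1, 1, 2, 2, 3, 3],
    "data": [10, 20, 30, 40, 50, 60],
    "good_string": ["cat", "dog", "cat", "dog", "fish", "pig"],
    "bad_string": ["cat", np.nan, np.nan, np.nan, np.nan, np.nan],
})

# Aggregate only the columns we actually need; excluding the mostly-NaN
# object column avoids the ValueError on affected pandas versions.
cols = ["key", "data", "good_string"]
df_simple_max = df_simple[cols].groupby("key", as_index=False, sort=False).max()
print(df_simple_max)
```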

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.3.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 19.2.0
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : en_US.UTF-8
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.0.1
numpy            : 1.18.1
pytz             : 2019.3
dateutil         : 2.6.1
pip              : 20.0.2
setuptools       : 45.2.0.post20200210
Cython           : 0.29.15
pytest           : 5.3.5
hypothesis       : 5.4.1
sphinx           : 2.4.0
blosc            : None
feather          : None
xlsxwriter       : 1.2.7
lxml.etree       : 4.5.0
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.1
IPython          : 7.12.0
pandas_datareader: None
bs4              : 4.8.2
bottleneck       : 1.3.1
fastparquet      : None
gcsfs            : None
lxml.etree       : 4.5.0
matplotlib       : 3.1.3
numexpr          : 2.7.1
odfpy            : None
openpyxl         : 3.0.3
pandas_gbq       : None
pyarrow          : None
pytables         : None
pytest           : 5.3.5
pyxlsb           : None
s3fs             : 0.4.0
scipy            : 1.4.1
sqlalchemy       : 1.3.13
tables           : 3.6.1
tabulate         : None
xarray           : None
xlrd             : 1.2.0
xlwt             : 1.3.0
xlsxwriter       : 1.2.7
numba            : 0.43.1
@MarcoGorelli
Member

Thanks @zking1219 for the report - this used to work in 0.25.3

@MarcoGorelli MarcoGorelli added Bug Regression Functionality that used to work in a prior pandas version Groupby labels Feb 18, 2020
@zking1219
Author

@MarcoGorelli I forgot to mention that it worked in 0.24.2 and then not in 1.0.0. That it still works in 0.25.3 definitely narrows down where the bug was introduced.

@dsaxton
Member

dsaxton commented Feb 18, 2020

Is this the same issue as #31802?

@zking1219
Author

Could be. I don't know why changing the number of NaNs in my object Series causes the ValueError. It might be the same reason the function in #31802 fails with a similar ValueError, or it might not.

@jorisvandenbossche jorisvandenbossche added this to the 1.0.2 milestone Feb 19, 2020
@TomAugspurger
Contributor

I'm reasonably sure this is the same root cause as #31802. See #31802 (comment)

@dz0

dz0 commented Jun 2, 2021

I am still getting this in 1.2.4

$ pip show pandas
Name: pandas
Version: 1.2.4

Traceback for the same example:

  File "/home/jurgis/.config/JetBrains/PyCharm2020.3/scratches/scratch_76.py", line 9, in <module>
    df_simple_max = df_simple.groupby(['key'], as_index=False, sort=False).max()
  File "/home/jurgis/miniconda3/envs/debitum-portfolio/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 1676, in max
    return self._agg_general(
  File "/home/jurgis/miniconda3/envs/debitum-portfolio/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 1024, in _agg_general
    result = self._cython_agg_general(
  File "/home/jurgis/miniconda3/envs/debitum-portfolio/lib/python3.8/site-packages/pandas/core/groupby/generic.py", line 1015, in _cython_agg_general
    agg_mgr = self._cython_agg_blocks(
  File "/home/jurgis/miniconda3/envs/debitum-portfolio/lib/python3.8/site-packages/pandas/core/groupby/generic.py", line 1118, in _cython_agg_blocks
    new_mgr = data.apply(blk_func, ignore_failures=True)
  File "/home/jurgis/miniconda3/envs/debitum-portfolio/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 425, in apply
    applied = b.apply(f, **kwargs)
  File "/home/jurgis/miniconda3/envs/debitum-portfolio/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 380, in apply
    return self._split_op_result(result)
  File "/home/jurgis/miniconda3/envs/debitum-portfolio/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 416, in _split_op_result
    result = self.make_block(result)
  File "/home/jurgis/miniconda3/envs/debitum-portfolio/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 286, in make_block
    return make_block(values, placement=placement, ndim=self.ndim)
  File "/home/jurgis/miniconda3/envs/debitum-portfolio/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 2742, in make_block
    return klass(values, ndim=ndim, placement=placement)
  File "/home/jurgis/miniconda3/envs/debitum-portfolio/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 142, in __init__
    raise ValueError(
ValueError: Wrong number of items passed 1, placement implies 2

@MarcoGorelli MarcoGorelli reopened this Jun 2, 2021
@wakandan

wakandan commented Jun 5, 2021

I tried downgrading pandas to version 1.2.3, but that doesn't fix it.

@mroeschke mroeschke removed this from the 1.0.2 milestone Jul 28, 2021
@mroeschke
Member

This appears to work on master. Could use a test:

In [10]: import pandas as pd # 1.0.1
    ...: import numpy as np # 1.18.1
    ...:
    ...: # Simple test case that fails
    ...: df_simple = pd.DataFrame({'key': [1,1,2,2,3,3], 'data' : [10,20,30,40,50,60],
    ...:                           'good_string' : ['cat','dog','cat','dog','fish','pig'],
    ...:                           'bad_string' : ['cat',np.nan,np.nan, np.nan, np.nan, np.nan]})
    ...:
    ...: df_simple_max = df_simple.groupby(['key'], as_index=False, sort=False).max()
<ipython-input-10-daeaf092ad7a>:9: FutureWarning: Dropping invalid columns in DataFrameGroupBy.max is deprecated. In a future version, a TypeError will be raised. Before calling .max, select only columns which should be valid for the function.
  df_simple_max = df_simple.groupby(['key'], as_index=False, sort=False).max()

In [11]: df_simple_max
Out[11]:
   key  data good_string
0    1    20         dog
1    2    40         dog
2    3    60         pig

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions labels Jul 28, 2021
@mroeschke
Member

Actually, given the new deprecation (non-numeric columns need to be de-selected before calling max), this behavior will be removed in 2.0, and aggregating only numeric columns already has good test coverage. Closing as this behavior is deprecated.

8 participants