
ValueError when grouping with max/min as aggregation functions (pandas-1.0.1) #32077

Closed
zking1219 opened this issue Feb 18, 2020 · 9 comments
Labels
good first issue Groupby Needs Tests Unit test(s) needed to prevent regressions Regression Functionality that used to work in a prior pandas version

Comments

@zking1219

Code Sample

import pandas as pd # 1.0.1
import numpy as np # 1.18.1

# Simple test case that fails
df_simple = pd.DataFrame({'key': [1,1,2,2,3,3], 'data' : [10,20,30,40,50,60],
                          'good_string' : ['cat','dog','cat','dog','fish','pig'],
                          'bad_string' : ['cat',np.nan,np.nan, np.nan, np.nan, np.nan]})

df_simple_max = df_simple.groupby(['key'], as_index=False, sort=False).max()

Traceback (most recent call last):
  line 17, in <module>
    df_simple_max = df_simple.groupby(['key'], as_index=False, sort=False).max()

  File "/usr/local/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py", line 1378, in f
    return self._cython_agg_general(alias, alt=npfunc, **kwargs)

  File "/usr/local/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/generic.py", line 1004, in _cython_agg_general
    how, alt=alt, numeric_only=numeric_only, min_count=min_count

  File "/usr/local/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/generic.py", line 1099, in _cython_agg_blocks
    agg_block: Block = block.make_block(result)

  File "/usr/local/anaconda3/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 273, in make_block
    return make_block(values, placement=placement, ndim=self.ndim)

  File "/usr/local/anaconda3/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 3041, in make_block
    return klass(values, ndim=ndim, placement=placement)

  File "/usr/local/anaconda3/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 2589, in __init__
    super().__init__(values, ndim=ndim, placement=placement)

  File "/usr/local/anaconda3/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 125, in __init__
    f"Wrong number of items passed {len(self.values)}, "

ValueError: Wrong number of items passed 1, placement implies 2
-------------

# Add one more legitimate string value to the 'bad_string' column and it works
df_simple = pd.DataFrame({'key': [1,1,2,2,3,3], 'data' : [10,20,30,40,50,60],
                          'good_string' : ['cat','dog','cat','dog','fish','pig'],
                          'bad_string' : ['cat','dog',np.nan, np.nan, np.nan, np.nan]})

df_simple_max = df_simple.groupby(['key'], as_index=False, sort=False).max()

Problem description

Unless I've misunderstood something fundamental about the max and min aggregate functions, they shouldn't error out when a Series in the DataFrame is of type object and contains all but one NaN value. Notice in the example above that adding just one more non-NaN value to the offending Series avoids the ValueError.

Expected Output

No ValueError; the aggregated DataFrame is returned.
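A possible workaround on affected versions (a sketch, not from the thread): select only the columns that are valid for max() before grouping, which sidesteps the mostly-NaN object column that triggers the error.

```python
import numpy as np
import pandas as pd

df_simple = pd.DataFrame({
    "key": [1, 1, 2, 2, 3, 3],
    "data": [10, 20, 30, 40, 50, 60],
    "good_string": ["cat", "dog", "cat", "dog", "fish", "pig"],
    "bad_string": ["cat", np.nan, np.nan, np.nan, np.nan, np.nan],
})

# Aggregate only the columns we actually need; excluding the mostly-NaN
# object column avoids the ValueError on affected pandas versions.
cols = ["key", "data", "good_string"]
df_simple_max = df_simple[cols].groupby("key", as_index=False, sort=False).max()
print(df_simple_max)
```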

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.3.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 19.2.0
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : en_US.UTF-8
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.0.1
numpy            : 1.18.1
pytz             : 2019.3
dateutil         : 2.6.1
pip              : 20.0.2
setuptools       : 45.2.0.post20200210
Cython           : 0.29.15
pytest           : 5.3.5
hypothesis       : 5.4.1
sphinx           : 2.4.0
blosc            : None
feather          : None
xlsxwriter       : 1.2.7
lxml.etree       : 4.5.0
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.1
IPython          : 7.12.0
pandas_datareader: None
bs4              : 4.8.2
bottleneck       : 1.3.1
fastparquet      : None
gcsfs            : None
lxml.etree       : 4.5.0
matplotlib       : 3.1.3
numexpr          : 2.7.1
odfpy            : None
openpyxl         : 3.0.3
pandas_gbq       : None
pyarrow          : None
pytables         : None
pytest           : 5.3.5
pyxlsb           : None
s3fs             : 0.4.0
scipy            : 1.4.1
sqlalchemy       : 1.3.13
tables           : 3.6.1
tabulate         : None
xarray           : None
xlrd             : 1.2.0
xlwt             : 1.3.0
xlsxwriter       : 1.2.7
numba            : 0.43.1
@MarcoGorelli
Member

Thanks @zking1219 for the report - this used to work in 0.25.3

@MarcoGorelli MarcoGorelli added Bug Regression Functionality that used to work in a prior pandas version Groupby labels Feb 18, 2020
@zking1219
Author

@MarcoGorelli I forgot to mention that it worked in 0.24.2 and then not in 1.0.0. That it still works in 0.25.3 definitely narrows down where the bug was introduced.

@dsaxton
Member

dsaxton commented Feb 18, 2020

Is this the same issue as #31802?

@zking1219
Author

Could be. I don't know why changing the number of NaNs in my object Series causes the ValueError. It might be the same reason the function in #31802 fails with a similar ValueError, or it might not.

@jorisvandenbossche jorisvandenbossche added this to the 1.0.2 milestone Feb 19, 2020
@TomAugspurger
Contributor

I'm reasonably sure this is the same root cause as #31802. See #31802 (comment)

@dz0

dz0 commented Jun 2, 2021

I am still getting this in 1.2.4

$ pip show pandas
Name: pandas
Version: 1.2.4

Traceback for the same example:

  File "/home/jurgis/.config/JetBrains/PyCharm2020.3/scratches/scratch_76.py", line 9, in <module>
    df_simple_max = df_simple.groupby(['key'], as_index=False, sort=False).max()
  File "/home/jurgis/miniconda3/envs/debitum-portfolio/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 1676, in max
    return self._agg_general(
  File "/home/jurgis/miniconda3/envs/debitum-portfolio/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 1024, in _agg_general
    result = self._cython_agg_general(
  File "/home/jurgis/miniconda3/envs/debitum-portfolio/lib/python3.8/site-packages/pandas/core/groupby/generic.py", line 1015, in _cython_agg_general
    agg_mgr = self._cython_agg_blocks(
  File "/home/jurgis/miniconda3/envs/debitum-portfolio/lib/python3.8/site-packages/pandas/core/groupby/generic.py", line 1118, in _cython_agg_blocks
    new_mgr = data.apply(blk_func, ignore_failures=True)
  File "/home/jurgis/miniconda3/envs/debitum-portfolio/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 425, in apply
    applied = b.apply(f, **kwargs)
  File "/home/jurgis/miniconda3/envs/debitum-portfolio/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 380, in apply
    return self._split_op_result(result)
  File "/home/jurgis/miniconda3/envs/debitum-portfolio/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 416, in _split_op_result
    result = self.make_block(result)
  File "/home/jurgis/miniconda3/envs/debitum-portfolio/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 286, in make_block
    return make_block(values, placement=placement, ndim=self.ndim)
  File "/home/jurgis/miniconda3/envs/debitum-portfolio/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 2742, in make_block
    return klass(values, ndim=ndim, placement=placement)
  File "/home/jurgis/miniconda3/envs/debitum-portfolio/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 142, in __init__
    raise ValueError(
ValueError: Wrong number of items passed 1, placement implies 2

@MarcoGorelli MarcoGorelli reopened this Jun 2, 2021
@wakandan

wakandan commented Jun 5, 2021

I tried downgrading pandas to version 1.2.3, but that doesn't fix it.

@mroeschke mroeschke removed this from the 1.0.2 milestone Jul 28, 2021
@mroeschke
Member

This appears to work on master. Could use a test:

In [10]: import pandas as pd # 1.0.1
    ...: import numpy as np # 1.18.1
    ...:
    ...: # Simple test case that fails
    ...: df_simple = pd.DataFrame({'key': [1,1,2,2,3,3], 'data' : [10,20,30,40,50,60],
    ...:                           'good_string' : ['cat','dog','cat','dog','fish','pig'],
    ...:                           'bad_string' : ['cat',np.nan,np.nan, np.nan, np.nan, np.nan]})
    ...:
    ...: df_simple_max = df_simple.groupby(['key'], as_index=False, sort=False).max()
<ipython-input-10-daeaf092ad7a>:9: FutureWarning: Dropping invalid columns in DataFrameGroupBy.max is deprecated. In a future version, a TypeError will be raised. Before calling .max, select only columns which should be valid for the function.
  df_simple_max = df_simple.groupby(['key'], as_index=False, sort=False).max()

In [11]: df_simple_max
Out[11]:
   key  data good_string
0    1    20         dog
1    2    40         dog
2    3    60         pig

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions labels Jul 28, 2021
@mroeschke
Member

Actually, given the new deprecation (non-numeric columns need to be de-selected before calling max), this behavior will be removed in 2.0, and aggregating only numeric columns already has good test coverage. Closing as this behavior is deprecated.

8 participants