Getting an error when I try to call an aggregate on a groupby object #409

eyadsibai · 2019-01-13T13:17:58Z

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
Modin installed from (source or binary): binary
Modin version: 0.2.5
Python version: 3.6.8
Exact command to reproduce:

import modin.pandas as pd
df = pd.read_csv(file)
df.groupby(col).size()

Describe the problem

Getting an error when I try to call .size or other aggregates like .count on a groupby object

Source code / logs

~/miniconda3/envs/elo-kaggle/lib/python3.6/site-packages/modin/pandas/dataframe.py in __repr__(self)
    122         num_cols = 30
    123 
--> 124         result = repr(self._build_repr_df(num_rows, num_cols))
    125         if len(self.index) > num_rows or len(self.columns) > num_cols:
    126             # The split here is so that we don't repr pandas row lengths.

~/miniconda3/envs/elo-kaggle/lib/python3.6/site-packages/modin/pandas/dataframe.py in _build_repr_df(self, num_rows, num_cols)
     93 
     94         if len(self.columns) <= num_cols:
---> 95             head_front = head.to_pandas()
     96             # Creating these empty to make the concat logic simpler
     97             head_back = pandas.DataFrame()

~/miniconda3/envs/elo-kaggle/lib/python3.6/site-packages/modin/data_management/query_compiler/pandas_query_compiler.py in to_pandas(self)
   2047             df = pandas.DataFrame(dtype_dict, self.index)
   2048         else:
-> 2049             df.index = self.index
   2050             df.columns = self.columns
   2051         return df

~/miniconda3/envs/elo-kaggle/lib/python3.6/site-packages/pandas/core/generic.py in __setattr__(self, name, value)
   4387         try:
   4388             object.__getattribute__(self, name)
-> 4389             return object.__setattr__(self, name, value)
   4390         except AttributeError:
   4391             pass

pandas/_libs/properties.pyx in pandas._libs.properties.AxisProperty.__set__()

~/miniconda3/envs/elo-kaggle/lib/python3.6/site-packages/pandas/core/series.py in _set_axis(self, axis, labels, fastpath)
    387         object.__setattr__(self, '_index', labels)
    388         if not fastpath:
--> 389             self._data.set_axis(axis, labels)
    390 
    391     def _set_subtyp(self, is_all_dates):

~/miniconda3/envs/elo-kaggle/lib/python3.6/site-packages/pandas/core/internals.py in set_axis(self, axis, new_labels)
   3321             raise ValueError(
   3322                 'Length mismatch: Expected axis has {old} elements, new '
-> 3323                 'values have {new} elements'.format(old=old_len, new=new_len))
   3324 
   3325         self.axes[axis] = new_labels

ValueError: Length mismatch: Expected axis has 186 elements, new values have 31 elements

The text was updated successfully, but these errors were encountered:

devin-petersohn · 2019-01-13T18:18:11Z

Hi @eyadsibai, thanks for posting this issue!

We have recently made some fixes to that code and have a pre-release out. Would you be able to tell me if pip install --pre modin fixes the issue?

eyadsibai · 2019-01-13T18:55:38Z

Hi @devin-petersohn the issue remains.

devin-petersohn · 2019-01-14T00:00:37Z

Looking through the stack, it is a problem with __repr__. I haven't been able to reproduce the issue locally.

Does print(df) after read_csv also give the same error?

Would you be able to share the dtypes and shape of the DataFrame so I can better attempt to reproduce this in order to fix it? Thanks!

eavidan · 2019-01-14T07:44:39Z

hi @devin-petersohn , the following reproduces the error.
gets ValueError: Length mismatch: Expected axis has 9 elements, new values have 3 elements.
seems related to the shape of the aggregated series. looking into it

import pandas as pd
df = pd.DataFrame({'A': [2, 2, 2, 3, 4],
                    'B': [5, 6, 7, 8, 9],
                    'C': ['a', 'b', 'c', 'd', 'e']})

df = df.groupby('A').size()
print(df)


import modin.pandas as pd
df = pd.DataFrame({'A': [2, 2, 2, 3, 4],
                    'B': [5, 6, 7, 8, 9],
                    'C': ['a', 'b', 'c', 'd', 'e']})

df = df.groupby('A').size()
print(df)

eavidan · 2019-01-14T16:49:11Z

The aggregation is done correctly, but on all columns atomically in PandasQueryCompiler.groupby_agg()

result_data = self.map_across_full_axis(axis, func_prepared)

we end up with 3 different column (in the example above) with the same aggregated values [3, 1, 1]
so later in PandasQueryCompiler._post_process_apply() setting the new index series_result.index = index fails since there are 3 columns instead of 1.

@devin-petersohn any suggestions for an elegant solution?

devin-petersohn · 2019-01-14T17:29:58Z

I see what is happening now thanks @eavidan!

The deeper problem with size is that each partition is outputting its own size and we don't aggregate those correctly. It probably shouldn't even trigger a compute on the partitions since we know the size before we even trigger the full compute.

As for count the internal issue is that when you pass a Series or column name to groupby, pandas will make that the new index by default (as_index=False disables this behavior). When that happens, if the column was in the DataFrame being grouped, it is removed from the result.

We correctly remove the result from the columns list, but the result still has the column internally. It looks like we need to drop the column before we do the operation. It should happen in DataFrameGroupBy in the constructor.

* Resolves modin-project#409 * Removes a grouped column from the result to match pandas * Changes the way we compute `size` to match pandas * Adds consisntecy between the DataFrame being computed on and the result

…413) * Add a fix to Groupby for aggregations by a column from the DataFrame * Resolves #409 * Removes a grouped column from the result to match pandas * Changes the way we compute `size` to match pandas * Adds consisntecy between the DataFrame being computed on and the result * Additional updates for `groupby` + agg: * Computing columns more directly now. We reset the index or columns and use those indices to compute that actual index externally. This is more correct (and was actually being computed previously, but incorrectly). * Adding **kwargs to `modin.pandas.groupby.DataFrameGroupby.rank` * Adding tests for string + integer inter-operations * Cleaning up and making some code more consistent * Fix lint and version issue * Fix lint * Making sort correct for python2 * Fix lint * Python2 and Python3 compat fix

devin-petersohn added the bug 🦗 Something isn't working label Jan 14, 2019

devin-petersohn mentioned this issue Jan 15, 2019

Add a fix to Groupby for aggregations by a column from the DataFrame #413

Merged

2 tasks

devin-petersohn closed this as completed in #413 Jan 20, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Getting an error when I try to call an aggregate on a groupby object #409

Getting an error when I try to call an aggregate on a groupby object #409

eyadsibai commented Jan 13, 2019

devin-petersohn commented Jan 13, 2019

eyadsibai commented Jan 13, 2019

devin-petersohn commented Jan 14, 2019

eavidan commented Jan 14, 2019

eavidan commented Jan 14, 2019

devin-petersohn commented Jan 14, 2019

Getting an error when I try to call an aggregate on a groupby object #409

Getting an error when I try to call an aggregate on a groupby object #409

Comments

eyadsibai commented Jan 13, 2019

System information

Describe the problem

Source code / logs

devin-petersohn commented Jan 13, 2019

eyadsibai commented Jan 13, 2019

devin-petersohn commented Jan 14, 2019

eavidan commented Jan 14, 2019

eavidan commented Jan 14, 2019

devin-petersohn commented Jan 14, 2019