Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Getting an error when I try to call an aggregate on a groupby object #409

Closed
eyadsibai opened this issue Jan 13, 2019 · 6 comments · Fixed by #413
Closed

Getting an error when I try to call an aggregate on a groupby object #409

eyadsibai opened this issue Jan 13, 2019 · 6 comments · Fixed by #413
Labels
bug 🦗 Something isn't working

Comments

@eyadsibai
Copy link

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • Modin installed from (source or binary): binary
  • Modin version: 0.2.5
  • Python version: 3.6.8
  • Exact command to reproduce:
import modin.pandas as pd
df = pd.read_csv(file)
df.groupby(col).size()

Describe the problem

Getting an error when I try to call .size or other aggregates like .count on a groupby object

Source code / logs

~/miniconda3/envs/elo-kaggle/lib/python3.6/site-packages/modin/pandas/dataframe.py in __repr__(self)
    122         num_cols = 30
    123 
--> 124         result = repr(self._build_repr_df(num_rows, num_cols))
    125         if len(self.index) > num_rows or len(self.columns) > num_cols:
    126             # The split here is so that we don't repr pandas row lengths.

~/miniconda3/envs/elo-kaggle/lib/python3.6/site-packages/modin/pandas/dataframe.py in _build_repr_df(self, num_rows, num_cols)
     93 
     94         if len(self.columns) <= num_cols:
---> 95             head_front = head.to_pandas()
     96             # Creating these empty to make the concat logic simpler
     97             head_back = pandas.DataFrame()

~/miniconda3/envs/elo-kaggle/lib/python3.6/site-packages/modin/data_management/query_compiler/pandas_query_compiler.py in to_pandas(self)
   2047             df = pandas.DataFrame(dtype_dict, self.index)
   2048         else:
-> 2049             df.index = self.index
   2050             df.columns = self.columns
   2051         return df

~/miniconda3/envs/elo-kaggle/lib/python3.6/site-packages/pandas/core/generic.py in __setattr__(self, name, value)
   4387         try:
   4388             object.__getattribute__(self, name)
-> 4389             return object.__setattr__(self, name, value)
   4390         except AttributeError:
   4391             pass

pandas/_libs/properties.pyx in pandas._libs.properties.AxisProperty.__set__()

~/miniconda3/envs/elo-kaggle/lib/python3.6/site-packages/pandas/core/series.py in _set_axis(self, axis, labels, fastpath)
    387         object.__setattr__(self, '_index', labels)
    388         if not fastpath:
--> 389             self._data.set_axis(axis, labels)
    390 
    391     def _set_subtyp(self, is_all_dates):

~/miniconda3/envs/elo-kaggle/lib/python3.6/site-packages/pandas/core/internals.py in set_axis(self, axis, new_labels)
   3321             raise ValueError(
   3322                 'Length mismatch: Expected axis has {old} elements, new '
-> 3323                 'values have {new} elements'.format(old=old_len, new=new_len))
   3324 
   3325         self.axes[axis] = new_labels

ValueError: Length mismatch: Expected axis has 186 elements, new values have 31 elements
@devin-petersohn
Copy link
Collaborator

Hi @eyadsibai, thanks for posting this issue!

We have recently made some fixes to that code and have a pre-release out. Would you be able to tell me if pip install --pre modin fixes the issue?

@eyadsibai
Copy link
Author

Hi @devin-petersohn the issue remains.

@devin-petersohn
Copy link
Collaborator

Looking through the stack, it is a problem with __repr__. I haven't been able to reproduce the issue locally.

Does print(df) after read_csv also give the same error?

Would you be able to share the dtypes and shape of the DataFrame so I can better attempt to reproduce this in order to fix it? Thanks!

@devin-petersohn devin-petersohn added the bug 🦗 Something isn't working label Jan 14, 2019
@eavidan
Copy link
Collaborator

eavidan commented Jan 14, 2019

hi @devin-petersohn , the following reproduces the error.
gets ValueError: Length mismatch: Expected axis has 9 elements, new values have 3 elements.
seems related to the shape of the aggregated series. looking into it

import pandas as pd
df = pd.DataFrame({'A': [2, 2, 2, 3, 4],
                    'B': [5, 6, 7, 8, 9],
                    'C': ['a', 'b', 'c', 'd', 'e']})

df = df.groupby('A').size()
print(df)


import modin.pandas as pd
df = pd.DataFrame({'A': [2, 2, 2, 3, 4],
                    'B': [5, 6, 7, 8, 9],
                    'C': ['a', 'b', 'c', 'd', 'e']})

df = df.groupby('A').size()
print(df)

@eavidan
Copy link
Collaborator

eavidan commented Jan 14, 2019

The aggregation is done correctly, but on all columns atomically in PandasQueryCompiler.groupby_agg()

result_data = self.map_across_full_axis(axis, func_prepared)

we end up with 3 different column (in the example above) with the same aggregated values [3, 1, 1]
so later in PandasQueryCompiler._post_process_apply() setting the new index series_result.index = index fails since there are 3 columns instead of 1.

@devin-petersohn any suggestions for an elegant solution?

@devin-petersohn
Copy link
Collaborator

I see what is happening now thanks @eavidan!

The deeper problem with size is that each partition is outputting its own size and we don't aggregate those correctly. It probably shouldn't even trigger a compute on the partitions since we know the size before we even trigger the full compute.

As for count the internal issue is that when you pass a Series or column name to groupby, pandas will make that the new index by default (as_index=False disables this behavior). When that happens, if the column was in the DataFrame being grouped, it is removed from the result.

We correctly remove the result from the columns list, but the result still has the column internally. It looks like we need to drop the column before we do the operation. It should happen in DataFrameGroupBy in the constructor.

devin-petersohn added a commit to devin-petersohn/modin that referenced this issue Jan 17, 2019
* Resolves modin-project#409
* Removes a grouped column from the result to match pandas
* Changes the way we compute `size` to match pandas
* Adds consisntecy between the DataFrame being computed on and the result
devin-petersohn added a commit that referenced this issue Jan 20, 2019
…413)

* Add a fix to Groupby for aggregations by a column from the DataFrame

* Resolves #409
* Removes a grouped column from the result to match pandas
* Changes the way we compute `size` to match pandas
* Adds consisntecy between the DataFrame being computed on and the result

* Additional updates for `groupby` + agg:

* Computing columns more directly now. We reset the index or columns and
  use those indices to compute that actual index externally. This is
  more correct (and was actually being computed previously, but
  incorrectly).
* Adding **kwargs to `modin.pandas.groupby.DataFrameGroupby.rank`
* Adding tests for string + integer inter-operations
* Cleaning up and making some code more consistent

* Fix lint and version issue

* Fix lint

* Making sort correct for python2

* Fix lint

* Python2 and Python3 compat fix
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug 🦗 Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants