Add a fix to Groupby for aggregations by a column from the DataFrame #413

Merged
merged 7 commits on Jan 20, 2019

Conversation

devin-petersohn
Collaborator

What do these changes do?

Related issue number

  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • passes black --check modin/

@eavidan
Collaborator

eavidan commented Jan 16, 2019

Hey @devin-petersohn,
The fix looks good; the one thing I notice is that we still have an issue when we apply an aggregation function and need to drop some columns.
For instance, with sum, the code below throws ValueError: Length mismatch: Expected axis has 2 elements, new values have 1 elements.

import modin.pandas as pd
df = pd.DataFrame({'A': [2, 2, 2, 3, 4],
                    'B': [5, 6, 7, 8, 9],
                    'C': ['a', 'b', 'c', 'd', 'e']})

df = df.groupby('A').sum()
print(df)

This seems to come from the way we try to set the new columns of the new_manager in DataFrameGroupBy._apply_agg_function():

new_manager.columns = new_columns

But the new columns in this case are [B] while the current ones are [B, C], so the validation in PandasQueryCompiler._set_columns() throws the error.
Maybe we could drop the non-relevant columns instead of setting the relevant ones, since in the aggregation case the new columns will always already exist in the DF (see the sketch below).
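A minimal sketch of the drop-instead-of-set idea at the plain pandas level (current_columns and new_columns here are illustrative stand-ins for the manager's state, not Modin's actual internals):

import pandas

# Columns the manager currently holds vs. the columns the aggregation produced.
current_columns = pandas.Index(['B', 'C'])
new_columns = pandas.Index(['B'])

# Assigning new_columns directly raises the length-mismatch ValueError whenever
# the aggregation dropped columns; dropping the missing columns instead always
# works, because the surviving columns are already present in the result.
columns_to_drop = current_columns.difference(new_columns)
print(list(columns_to_drop))  # ['C']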

@devin-petersohn
Collaborator Author

Hey @eavidan, that sounds good. Dropping the non-numeric columns is what we try to do, but we assume that numeric operations can only happen on numeric columns, while sum also works on strings (concatenation). Code here.

Truthfully I need to spend some time on the tests for this code path because we aren't testing enough of the edge cases. I should add that to this PR so we can make sure that this and all future updates to this code are covered well.
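For reference, a plain pandas illustration of the string-concatenation behavior mentioned above (expected output shown for the pandas versions of this era; behavior may differ in later releases):

import pandas

df = pandas.DataFrame({'A': [2, 2, 3], 'C': ['a', 'b', 'c']})

# When the column being summed holds strings, pandas concatenates them within
# each group instead of treating the operation as numeric-only.
print(df.groupby('A')['C'].sum())
# A
# 2    ab
# 3     c
# Name: C, dtype: object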

@@ -2368,12 +2368,12 @@ def _post_process_apply(self, result_data, axis, try_scale=True):
    else:
        columns = self.columns
else:
    # See above explanation for checking the lengths of columns
    columns = internal_columns
Collaborator Author

I moved this for consistency with the code block above.

grouped_df = df.groupby(by=by, axis=axis, **groupby_args)
try:
    return agg_func(grouped_df, **agg_args)
# This happens when the partition is filled with non-numeric data and a
# numeric operation is done. We need to build the index here to avoid issues
# with extracting the index.
except DataError:
-   return pandas.DataFrame(index=grouped_df.count().index)
+   return pandas.DataFrame(index=grouped_df.size().index)
Collaborator Author

Changed to size, which has lower overhead than count.
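A quick plain pandas comparison of the two (illustrative only): size() counts rows per group, while count() computes per-column non-null counts, so size() is the cheaper way to recover just the group index.

import pandas

df = pandas.DataFrame({'A': [2, 2, 3], 'B': [5, 6, 7], 'C': ['a', 'b', 'c']})
grouped = df.groupby('A')

# Both produce the same group index, but size() does it without touching the
# data columns at all.
print(grouped.size().index.equals(grouped.count().index))  # True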

@@ -215,7 +296,6 @@ def test_single_group_row_groupby():
    test_take(ray_groupby, pandas_groupby)


-@pytest.mark.skip(reason="See Modin issue #21.")
Collaborator Author

I was able to add this test back; it was working even before this PR.

* Resolves modin-project#409
* Removes a grouped column from the result to match pandas
* Changes the way we compute `size` to match pandas (see the sketch after this list)
* Adds consistency between the DataFrame being computed on and the result
* Computing columns more directly now. We reset the index or columns and
  use those indices to compute the actual index externally. This is
  more correct (and was actually being computed previously, but
  incorrectly).
* Adding **kwargs to `modin.pandas.groupby.DataFrameGroupBy.rank`
* Adding tests for string + integer inter-operations
* Cleaning up and making some code more consistent
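A quick illustration of the `size` parity mentioned in the list above (uses the example data from earlier in this thread; not part of the PR's test suite):

import pandas
import modin.pandas as pd

data = {'A': [2, 2, 2, 3, 4], 'B': [5, 6, 7, 8, 9]}

# After this change, groupby(...).size() should return per-group row counts
# with the same index and values as pandas.
print(pd.DataFrame(data).groupby('A').size())
print(pandas.DataFrame(data).groupby('A').size())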

@devin-petersohn
Collaborator Author

@eavidan, I fixed the issue for the general case and added some test cases for larger and variable dtype DataFrames.
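A sketch of the kind of parity check being described, comparing Modin against pandas on a larger, mixed-dtype frame (illustrative only; the actual tests added in this PR live in Modin's test suite and use its own helpers):

import pandas
import modin.pandas as pd

data = {'A': [2, 2, 2, 3, 4] * 20,
        'B': list(range(100)),
        'C': ['a', 'b', 'c', 'd', 'e'] * 20,
        'D': [1.5, 2.5, 3.5, 4.5, 5.5] * 20}

modin_result = pd.DataFrame(data).groupby('A').sum()
pandas_result = pandas.DataFrame(data).groupby('A').sum()

# The grouped column should become the index, and the remaining columns and
# index should line up with what pandas produces.
assert list(modin_result.columns) == list(pandas_result.columns)
assert list(modin_result.index) == list(pandas_result.index)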

@williamma12
Collaborator

Great PR! Just one minor change.


func_prepared = self._prepare_method(lambda df: groupby_agg_builder(df))
result_data = self.map_across_full_axis(axis, func_prepared)
-return self._post_process_apply(result_data, axis, try_scale=False)
+if axis == 0:
Collaborator

Can this if...else block be moved into the else statement that gets run only if the result is not a Series (line 2519)?

Collaborator Author

We don't have any way of knowing if len(columns) == 0 before this. This is how we check.

Collaborator

Ah sorry, my mistake

@eavidan
Collaborator

eavidan commented Jan 20, 2019

@devin-petersohn great PR and nice variety of tests.
Pandas accepts a numeric_only keyword as one of sum's kwargs, which lets you state whether to sum only numeric columns.
It seems that the kwargs we pass to the lambda function in _apply_agg_function() do not reach pandas.sum() at the partition level when we map_across_full_axis.
So we still get inconsistencies in some cases, e.g.:

df = pd.DataFrame({'A': [2, 2, 2, 3, 4],
                    'B': [5, 6, 7, 8, 9],
                    'C': ['a', 'b', 'c', 'd', 'e']})

df = df.groupby('A').sum(numeric_only=True)
print(df)

This returns the C column, whereas pandas does not (numeric_only is set to True by default).

I have been scratching my head about this one and cannot seem to find the root cause in the code; I am probably missing something. The issue also persists in max() with numeric_only.

@devin-petersohn
Collaborator Author

devin-petersohn commented Jan 20, 2019

Thanks @eavidan, I see what is happening. Pandas will ignore numeric_only if there are no numeric columns in the DataFrame.

import pandas
df = pandas.DataFrame({'A': [2, 2, 2, 3, 4],
                    'B': [5, 6, 7, 8, 9],
                    'C': ['a', 'b', 'c', 'd', 'e']})

df = df[["A", "C"]].groupby('A').sum(numeric_only=True)
print(df)

It is strange behavior. It affects us because a partition may contain only non-numeric data, in which case pandas ignores numeric_only for that partition.

For now, since this fixes something very broken, we can merge this. I will create a new PR to handle the kwargs differently.

Successfully merging this pull request may close these issues.

Getting an error when I try to call an aggregate on a groupby object