Add a fix to Groupby for aggregations by a column from the DataFrame #413

Merged
merged 7 commits on Jan 20, 2019

Conversation

devin-petersohn
Collaborator

What do these changes do?

Related issue number

  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • passes black --check modin/

@eavidan
Collaborator

eavidan commented Jan 16, 2019

Hey @devin-petersohn,
The fix looks good; the one thing I notice is that we still have an issue when we apply an aggregation function and need to drop some columns.
For instance, with sum, the code below throws ValueError: Length mismatch: Expected axis has 2 elements, new values have 1 elements.

import modin.pandas as pd
df = pd.DataFrame({'A': [2, 2, 2, 3, 4],
                    'B': [5, 6, 7, 8, 9],
                    'C': ['a', 'b', 'c', 'd', 'e']})

df = df.groupby('A').sum()
print(df)

This seems to come from the way we try to set the new columns of the new_manager in DataFrameGroupBy._apply_agg_function():

new_manager.columns = new_columns

But the new columns in this case are [B] while the current ones are [B, C], so the validation in PandasQueryCompiler._set_columns() throws the error.
Maybe we could drop the non-relevant columns instead of setting the relevant ones, since in the aggregation case the new columns will always already exist in the DF (see the sketch below).
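A minimal sketch of the drop-instead-of-set idea at the plain pandas level (current_columns and new_columns here are illustrative stand-ins for the manager's state, not Modin's actual internals):

import pandas

# Columns the manager currently holds vs. the columns the aggregation produced.
current_columns = pandas.Index(['B', 'C'])
new_columns = pandas.Index(['B'])

# Assigning new_columns directly raises the length-mismatch ValueError whenever
# the aggregation dropped columns; dropping the missing columns instead always
# works, because the surviving columns are already present in the result.
columns_to_drop = current_columns.difference(new_columns)
print(list(columns_to_drop))  # ['C']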

@devin-petersohn
Collaborator Author

Hey @eavidan, that sounds good. Dropping the non-numeric columns is what we try to do, but we assume that numeric operations can only happen on numeric columns, while sum also works on strings (concatenation). Code here.

Truthfully I need to spend some time on the tests for this code path because we aren't testing enough of the edge cases. I should add that to this PR so we can make sure that this and all future updates to this code are covered well.
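For reference, a plain pandas illustration of the string-concatenation behavior mentioned above (expected output shown for the pandas versions of this era; behavior may differ in later releases):

import pandas

df = pandas.DataFrame({'A': [2, 2, 3], 'C': ['a', 'b', 'c']})

# When the column being summed holds strings, pandas concatenates them within
# each group instead of treating the operation as numeric-only.
print(df.groupby('A')['C'].sum())
# A
# 2    ab
# 3     c
# Name: C, dtype: object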

@@ -2368,12 +2368,12 @@ def _post_process_apply(self, result_data, axis, try_scale=True):
    else:
        columns = self.columns
else:
    # See above explanation for checking the lengths of columns
    columns = internal_columns
Collaborator Author

I moved this for consistency with the code block above.

grouped_df = df.groupby(by=by, axis=axis, **groupby_args)
try:
    return agg_func(grouped_df, **agg_args)
# This happens when the partition is filled with non-numeric data and a
# numeric operation is done. We need to build the index here to avoid issues
# with extracting the index.
except DataError:
-   return pandas.DataFrame(index=grouped_df.count().index)
+   return pandas.DataFrame(index=grouped_df.size().index)
Collaborator Author

Changed to size, which has lower overhead than count.
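A quick plain pandas comparison of the two (illustrative only): size() counts rows per group, while count() computes per-column non-null counts, so size() is the cheaper way to recover just the group index.

import pandas

df = pandas.DataFrame({'A': [2, 2, 3], 'B': [5, 6, 7], 'C': ['a', 'b', 'c']})
grouped = df.groupby('A')

# Both produce the same group index, but size() does it without touching the
# data columns at all.
print(grouped.size().index.equals(grouped.count().index))  # True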

@@ -215,7 +296,6 @@ def test_single_group_row_groupby():
    test_take(ray_groupby, pandas_groupby)


-@pytest.mark.skip(reason="See Modin issue #21.")
Collaborator Author

I was able to add this test back; it was working even before this PR.

* Resolves modin-project#409
* Removes a grouped column from the result to match pandas
* Changes the way we compute `size` to match pandas (see the sketch after this list)
* Adds consistency between the DataFrame being computed on and the result
* Computing columns more directly now. We reset the index or columns and
  use those indices to compute the actual index externally. This is
  more correct (and was actually being computed previously, but
  incorrectly).
* Adding **kwargs to `modin.pandas.groupby.DataFrameGroupBy.rank`
* Adding tests for string + integer inter-operations
* Cleaning up and making some code more consistent
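A quick illustration of the `size` parity mentioned in the list above (uses the example data from earlier in this thread; not part of the PR's test suite):

import pandas
import modin.pandas as pd

data = {'A': [2, 2, 2, 3, 4], 'B': [5, 6, 7, 8, 9]}

# After this change, groupby(...).size() should return per-group row counts
# with the same index and values as pandas.
print(pd.DataFrame(data).groupby('A').size())
print(pandas.DataFrame(data).groupby('A').size())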

@devin-petersohn
Collaborator Author

@eavidan, I fixed the issue for the general case and added some test cases for larger and variable dtype DataFrames.
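A sketch of the kind of parity check being described, comparing Modin against pandas on a larger, mixed-dtype frame (illustrative only; the actual tests added in this PR live in Modin's test suite and use its own helpers):

import pandas
import modin.pandas as pd

data = {'A': [2, 2, 2, 3, 4] * 20,
        'B': list(range(100)),
        'C': ['a', 'b', 'c', 'd', 'e'] * 20,
        'D': [1.5, 2.5, 3.5, 4.5, 5.5] * 20}

modin_result = pd.DataFrame(data).groupby('A').sum()
pandas_result = pandas.DataFrame(data).groupby('A').sum()

# The grouped column should become the index, and the remaining columns and
# index should line up with what pandas produces.
assert list(modin_result.columns) == list(pandas_result.columns)
assert list(modin_result.index) == list(pandas_result.index)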

@williamma12
Collaborator

Great PR! Just one minor change.


func_prepared = self._prepare_method(lambda df: groupby_agg_builder(df))
result_data = self.map_across_full_axis(axis, func_prepared)
-return self._post_process_apply(result_data, axis, try_scale=False)
+if axis == 0:
Collaborator

Can this if...else block be moved into the else statement that gets run only if the result is not a Series (line 2519)?

Collaborator Author

We don't have any way of knowing if len(columns) == 0 before this. This is how we check.

Collaborator

Ah sorry, my mistake

@eavidan
Collaborator

eavidan commented Jan 20, 2019

@devin-petersohn great PR and nice variety of tests.
Pandas accepts a numeric_only keyword as one of sum's kwargs, which lets you state whether to sum only numeric columns.
It seems that the kwargs we pass to the lambda function in _apply_agg_function() do not reach pandas.sum() at the partition level when we map_across_full_axis.
So we still get inconsistencies in some cases, e.g.:

df = pd.DataFrame({'A': [2, 2, 2, 3, 4],
                    'B': [5, 6, 7, 8, 9],
                    'C': ['a', 'b', 'c', 'd', 'e']})

df = df.groupby('A').sum(numeric_only=True)
print(df)

This returns the C column, whereas pandas does not (numeric_only is set to True by default).

I have been scratching my head about this one and cannot seem to find the root cause in the code; I am probably missing something. The issue also persists in max() with numeric_only.

@devin-petersohn
Collaborator Author

devin-petersohn commented Jan 20, 2019

Thanks @eavidan, I see what is happening. Pandas will ignore numeric_only if there are no numeric columns in the DataFrame.

import pandas
df = pandas.DataFrame({'A': [2, 2, 2, 3, 4],
                    'B': [5, 6, 7, 8, 9],
                    'C': ['a', 'b', 'c', 'd', 'e']})

df = df[["A", "C"]].groupby('A').sum(numeric_only=True)
print(df)

It is strange behavior. It affects us because a partition may contain only non-numeric data, in which case pandas ignores numeric_only for that partition.

For now, since this fixes something very broken, we can merge this. I will create a new PR to handle the kwargs differently.

Successfully merging this pull request may close these issues.

Getting an error when I try to call an aggregate on a groupby object