Add a fix to Groupby for aggregations by a column from the DataFrame #413
Conversation
Merged build finished. Test PASSed.
hey @devin-petersohn, this seems to come from the way we try to set the new columns of the …, but the new columns in this case are …
Hey @eavidan, that sounds good. I think dropping the non-numeric columns is what we try to do, but truthfully I need to spend some time on the tests for this code path because we aren't testing enough of the edge cases. I should add that to this PR so we can make sure that this and all future updates to this code are covered well.
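The column-dropping behavior under discussion can be reproduced in plain pandas (a minimal sketch; the DataFrame here is invented for illustration):

```python
import pandas

df = pandas.DataFrame({'key': [1, 1, 2],
                       'num': [10, 20, 30],
                       'txt': ['a', 'b', 'c']})
# With numeric_only=True, pandas silently drops the object-dtype
# column 'txt' before aggregating, so only 'num' survives.
result = df.groupby('key').sum(numeric_only=True)
print(result.columns.tolist())  # ['num']
```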
```diff
@@ -2368,12 +2368,12 @@ def _post_process_apply(self, result_data, axis, try_scale=True):
         else:
             columns = self.columns
     else:
         # See above explanation for checking the lengths of columns
         columns = internal_columns
```
I moved this for consistency with the code block above.
```diff
         grouped_df = df.groupby(by=by, axis=axis, **groupby_args)
         try:
             return agg_func(grouped_df, **agg_args)
         # This happens when the partition is filled with non-numeric data and a
         # numeric operation is done. We need to build the index here to avoid issues
         # with extracting the index.
         except DataError:
-            return pandas.DataFrame(index=grouped_df.count().index)
+            return pandas.DataFrame(index=grouped_df.size().index)
```
Changed to `size`, which has lower overhead than `count`.
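The practical difference: `size` computes a single row count per group, while `count` computes per-column non-null counts; both expose the full set of group keys on their index, which is all the fallback above needs. A small sketch (data invented):

```python
import pandas

df = pandas.DataFrame({'A': [1, 1, 2], 'C': ['x', 'y', 'z']})
grouped = df.groupby('A')
# size() returns one row count per group, regardless of column dtypes,
# so its index is always the complete set of group keys.
empty = pandas.DataFrame(index=grouped.size().index)
print(empty.shape)  # (2, 0)
```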
```diff
@@ -215,7 +296,6 @@ def test_single_group_row_groupby():
     test_take(ray_groupby, pandas_groupby)


-@pytest.mark.skip(reason="See Modin issue #21.")
```
Was able to add this test back. It was working before this PR also.
* Resolves modin-project#409
* Removes a grouped column from the result to match pandas
* Changes the way we compute `size` to match pandas
* Adds consistency between the DataFrame being computed on and the result
* Computing columns more directly now. We reset the index or columns and use those indices to compute the actual index externally. This is more correct (and was actually being computed previously, but incorrectly).
* Adding **kwargs to `modin.pandas.groupby.DataFrameGroupby.rank`
* Adding tests for string + integer inter-operations
* Cleaning up and making some code more consistent
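One item above, forwarding **kwargs through `DataFrameGroupby.rank`, can be illustrated with the equivalent pandas call (the data here is invented):

```python
import pandas

df = pandas.DataFrame({'A': [1, 1, 2], 'B': [3, 1, 2]})
# Keyword arguments such as method= and ascending= are forwarded to
# the underlying per-group rank computation.
ranks = df.groupby('A')['B'].rank(method='min', ascending=False)
print(ranks.tolist())  # [1.0, 2.0, 1.0]
```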
force-pushed from e3b630e to ae67460 (compare)
@eavidan, I fixed the issue for the general case and added some test cases for larger and variable dtype DataFrames.
Great PR! Just one minor change.
```diff
         func_prepared = self._prepare_method(lambda df: groupby_agg_builder(df))
         result_data = self.map_across_full_axis(axis, func_prepared)
         return self._post_process_apply(result_data, axis, try_scale=False)
         if axis == 0:
```
Can this `if...else` block be moved into the `else` statement that gets run only if the result is not a series (line 2519)?
We don't have any way of knowing if `len(columns) == 0` before this. This is how we check.
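A small illustration of why the check cannot happen earlier: whether an applied function collapses the frame to a Series is only known after it runs (the functions here are invented examples, not Modin code):

```python
import pandas

df = pandas.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

for func in (lambda d: d.sum(), lambda d: d + 1):
    result = func(df)
    # Only after applying the function do we learn whether the result
    # kept its columns (a DataFrame) or reduced away (a Series).
    print(isinstance(result, pandas.Series))
```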
Ah sorry, my mistake
@devin-petersohn great PR and nice variety of tests.

```python
df = pd.DataFrame({'A': [2, 2, 2, 3, 4],
                   'B': [5, 6, 7, 8, 9],
                   'C': ['a', 'b', 'c', 'd', 'e']})
df = df.groupby('A').sum(numeric_only=True)
print(df)
```

This returns with the … I have been scratching my head about this one; I cannot seem to find the root cause for this in the code, so I am probably missing something. This issue also persists in …
Thanks @eavidan, I see what is happening. Pandas will ignore …

```python
import pandas

df = pandas.DataFrame({'A': [2, 2, 2, 3, 4],
                       'B': [5, 6, 7, 8, 9],
                       'C': ['a', 'b', 'c', 'd', 'e']})
df = df[["A", "C"]].groupby('A').sum(numeric_only=True)
print(df)
```

It is strange behavior. This affects us because a partition may have only non-numeric data, which pandas will treat as all non-numeric. For now, since this fixes something very broken, we can merge this. I will create a new PR to handle the …
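The partition-level consequence described above can be sketched directly: when every aggregable column in a slice is non-numeric, `numeric_only=True` leaves nothing to aggregate and pandas returns an empty frame carrying only the group index (data invented for illustration):

```python
import pandas

df = pandas.DataFrame({'A': [2, 2, 2, 3, 4],
                       'C': ['a', 'b', 'c', 'd', 'e']})
# 'C' is object dtype, so numeric_only=True drops it; the result has
# zero columns but still one row per group key.
result = df.groupby('A').sum(numeric_only=True)
print(result.shape)  # (3, 0)
```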
Changes the way we compute `size` to match pandas

What do these changes do?

Related issue number

* `git diff upstream/master -u -- "*.py" | flake8 --diff`
* `black --check modin/`