BUG: Fix regression in groupby.rolling.corr/cov when other is same size as each group #44824

Merged: 2 commits, Dec 10, 2021
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v1.3.5.rst
@@ -21,7 +21,7 @@ Fixed regressions
- Fixed performance regression in :func:`read_csv` (:issue:`44106`)
- Fixed regression in :meth:`Series.duplicated` and :meth:`Series.drop_duplicates` when Series has :class:`Categorical` dtype with boolean categories (:issue:`44351`)
- Fixed regression in :meth:`.GroupBy.sum` with ``timedelta64[ns]`` dtype containing ``NaT`` failing to treat that value as NA (:issue:`42659`)
-
- Fixed regression in :meth:`.RollingGroupby.cov` and :meth:`.RollingGroupby.corr` where computing with ``other`` of the same shape as each group would incorrectly return superfluous groups in the result (:issue:`42915`)

.. ---------------------------------------------------------------------------

12 changes: 7 additions & 5 deletions pandas/core/window/rolling.py
@@ -747,8 +747,11 @@ def _apply_pairwise(
target = self._create_data(target)
result = super()._apply_pairwise(target, other, pairwise, func)
# 1) Determine the levels + codes of the groupby levels
if other is not None:
# When we have other, we must reindex (expand) the result
if other is not None and not all(
len(group) == len(other) for group in self._grouper.indices.values()
):
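The guard added above can be sketched in isolation. In this minimal sketch, the `needs_expansion` helper and its `group_indices` argument are hypothetical names standing in for the inline condition and `self._grouper.indices` in pandas:

```python
# Simplified sketch of the condition added in this PR: the
# "reindex/expand" path is only taken when `other` is given and at
# least one group's length differs from len(other).
def needs_expansion(group_indices, other):
    # group_indices: mapping of group key -> row positions, playing the
    # role of self._grouper.indices
    return other is not None and not all(
        len(idx) == len(other) for idx in group_indices.values()
    )
```

When every group has the same length as `other`, the expansion path is skipped and the groupby labels are simply repeated, which is what restores the pre-regression output for GH 42915.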
Member

how does this fix relate to the changes in the PR that caused the regression? (maybe that should be put in the issue rather than responded to here)

I'm not familiar with this section of the codebase, but this fix looks very specific to the reported case.

for instance, varying the OP example in the issue slightly

df_a[:-1].groupby(level=0).rolling(2).cov(df_b)

on 1.2.5

                value
idx1 idx1 idx2       
1    1    1       NaN
          2       0.5
          3       0.5
          4       0.5
          5       0.5
2    2    1       NaN
          2       0.5
          3       0.5
          4       0.5

on 1.3.x/master/this PR

                value
idx1 idx1 idx2       
1    1    1       NaN
          2       0.5
          3       0.5
          4       0.5
          5       0.5
     2    1       NaN
          2       NaN
          3       NaN
          4       NaN
2    1    1       NaN
          2       NaN
          3       NaN
          4       NaN
          5       NaN
     2    1       NaN
          2       0.5
          3       0.5
          4       0.5

which now seems inconsistent with the df_a.groupby(level=0).rolling(2).cov(df_b) case

@mroeschke (Member Author), Dec 10, 2021

In the OP's example, len(other) == len(all groups).

In your df_a[:-1] twist, len(other) != len(all groups), and comparing to the equivalent groupby(level=0).apply(lambda x: x.rolling(2).cov(df_b)) case, the result is

In [15]: def func(x):
    ...:     return getattr(x.rolling(2), f)(df_b)

In [16]: df_a = pd.DataFrame(
    ...:     {"value": range(10), "idx1": [1] * 5 + [2] * 5, "idx2": [1, 2, 3, 4, 5] * 2}
    ...: ).set_index(["idx1", "idx2"])
    ...: df_b = pd.DataFrame({"value": range(5), "idx2": [1, 2, 3, 4, 5]}).set_index("idx2")

In [18]: f = "cov"

In [19]: df_a[:-1].groupby(level=0).apply(func)
Out[19]:
           value
idx1 idx2
1    1       NaN
     2       0.5
     3       0.5
     4       0.5
     5       0.5
2    1       NaN
     2       0.5
     3       0.5
     4       0.5

But we already have a similar unit test to the twist, window/test_groupby.py:TestRolling:test_rolling_corr_cov, where len(other) != len(all groups), tested against the equivalent groupby(...).apply(lambda x: x.rolling(...).cov(...)) case. The result is that other reindexes each group with other.index, which "expands" the result.
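The GH 42915 case under discussion can be reproduced directly. This is a sketch of the fixed behavior, assuming a pandas version that includes this PR (>= 1.3.5); `other` has the same length (5 rows) as each of the two groups:

```python
import numpy as np
import pandas as pd

# The GH 42915 setup: two groups of 5 rows each, and an `other` of 5 rows.
df_a = pd.DataFrame(
    {"value": range(10), "idx1": [1] * 5 + [2] * 5, "idx2": [1, 2, 3, 4, 5] * 2}
).set_index(["idx1", "idx2"])
df_b = pd.DataFrame({"value": range(5), "idx2": [1, 2, 3, 4, 5]}).set_index("idx2")

result = df_a.groupby(level=0).rolling(2).cov(df_b)
# With the fix, the result has 10 rows (no superfluous cross-group rows).
# Each rolling window of 2 covers consecutive integer pairs on both sides,
# so the covariance is NaN for the first row of each group and 0.5 after.
```

Before the fix, the same call returned 20 rows, pairing each group with every group label, with null values filling the spurious combinations.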

Member Author

Bottom line:

I do acknowledge this fix seems a bit brittle.

The behavior where len(other) != len(all groups) is a bit opaque to me, because comparing to the equivalent groupby.apply(lambda x: x.rolling(...)) expression doesn't appear to have a discernible pattern.

Member

I agree that comparison to apply might not be appropriate - in general, the built-in op should produce results that are potentially better than apply, because apply has to be agnostic to the operation.

That said, the current output on master from the example in #44824 (comment) doesn't make sense to me. It appears to me that pandas is creating a record with (idx1, idx1, ...) being (1, 2, ...) and then declaring the value to be null because the record (that pandas created) is inconsistent.

Member Author

That seems to make sense to me since the second level was never involved in the groupby.

If you have a chance, could you evaluate whether this test behavior seems correct? It currently tries to match the equivalent groupby.apply version, but I am not sure whether there is a more sensible result.

def test_rolling_corr_cov_other_diff_size_as_groups(self, f):

@rhshadrach (Member), Dec 12, 2021

What would the expected behavior of frame1.cov(other=frame2) be? Currently this argument doesn't exist, but if it did, I think the expectation would be to first align frame2 to frame1's index and then compute the covariance.

Under this assumption, I think a more sensible result in the test linked to above would be to do the same alignment for each window during the rolling operation. This also has the benefit of making

frame.groupby("A").rolling(window=2).cov(other=frame)

be the same as

frame.groupby("A").rolling(window=2).cov()

However if the answer to the first question is to have null values for rows in frame2 that don't exist in frame1, then the current behavior of the test above is the most sensible.
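The alignment idea described above can be sketched with a plain reindex followed by Series.cov. The frames and values here are illustrative, not taken from the PR; this is what a hypothetical frame1.cov(other=frame2) might do under the align-first interpretation:

```python
import pandas as pd

frame1 = pd.DataFrame({"value": [1.0, 2.0, 4.0]}, index=[1, 2, 3])
frame2 = pd.DataFrame({"value": [1.0, 3.0]}, index=[2, 3])

# Align frame2 to frame1's index first, then compute the covariance.
# Series.cov drops pairs where either side is NaN, so the unmatched
# index label 1 simply does not contribute.
aligned = frame2.reindex(frame1.index)
cov = frame1["value"].cov(aligned["value"])
```

Under the other interpretation (null values for rows of frame2 absent from frame1), the unmatched rows would instead propagate NaN into the result rather than being dropped.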

Member Author

Thanks. I agree that it probably makes sense for index alignment to happen between the group and other before computing covariance.

# GH 42915
# len(other) != len(any group), so must reindex (expand) the result
# from flex_binary_moment to a "transform"-like result
# per groupby combination
old_result_len = len(result)
@@ -770,10 +773,9 @@
codes, levels = factorize(labels)
groupby_codes.append(codes)
groupby_levels.append(levels)

else:
# When we evaluate the pairwise=True result, repeat the groupby
# labels by the number of columns in the original object
# pairwise=True or len(other) == len(each group), so repeat
# the groupby labels by the number of columns in the original object
groupby_codes = self._grouper.codes
# error: Incompatible types in assignment (expression has type
# "List[Index]", variable has type "List[Union[ndarray, Index]]")
32 changes: 31 additions & 1 deletion pandas/tests/window/test_groupby.py
@@ -122,8 +122,33 @@ def test_rolling_quantile(self, interpolation):
expected.index = expected_index
tm.assert_frame_equal(result, expected)

@pytest.mark.parametrize("f, expected_val", [["corr", 1], ["cov", 0.5]])
def test_rolling_corr_cov_other_same_size_as_groups(self, f, expected_val):
# GH 42915
df = DataFrame(
{"value": range(10), "idx1": [1] * 5 + [2] * 5, "idx2": [1, 2, 3, 4, 5] * 2}
).set_index(["idx1", "idx2"])
other = DataFrame({"value": range(5), "idx2": [1, 2, 3, 4, 5]}).set_index(
"idx2"
)
result = getattr(df.groupby(level=0).rolling(2), f)(other)
expected_data = ([np.nan] + [expected_val] * 4) * 2
expected = DataFrame(
expected_data,
columns=["value"],
index=MultiIndex.from_arrays(
[
[1] * 5 + [2] * 5,
[1] * 5 + [2] * 5,
list(range(1, 6)) * 2,
],
names=["idx1", "idx1", "idx2"],
),
)
tm.assert_frame_equal(result, expected)

@pytest.mark.parametrize("f", ["corr", "cov"])
def test_rolling_corr_cov(self, f):
def test_rolling_corr_cov_other_diff_size_as_groups(self, f):
g = self.frame.groupby("A")
r = g.rolling(window=4)

@@ -138,6 +163,11 @@ def func(x):
expected["A"] = np.nan
tm.assert_frame_equal(result, expected)

@pytest.mark.parametrize("f", ["corr", "cov"])
def test_rolling_corr_cov_pairwise(self, f):
g = self.frame.groupby("A")
r = g.rolling(window=4)

result = getattr(r.B, f)(pairwise=True)

def func(x):