BUG: Fix regression in groupby.rolling.corr/cov when other is same size as each group #44824

mroeschke · 2021-12-09T00:12:42Z

closes BUG: df.groupy().rolling().cov() does not group properly and the cartesian product of all groups is returned instead #42915
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

…ze as each group

jreback · 2021-12-09T00:20:40Z

nice ping on greenish

mroeschke · 2021-12-09T06:15:57Z

greenish

simonjayhawkins · 2021-12-09T11:42:41Z

pandas/core/window/rolling.py

-            # When we have other, we must reindex (expand) the result
+        if other is not None and not all(
+            len(group) == len(other) for group in self._grouper.indices.values()
+        ):


how does this fix relate to the changes in the PR that caused the regression? (maybe should put that in the issue rather than respond here)

I not familiar with this section of the codebase but this fix looks very specific to the reported case.

for instance, varying the OP example in the issue slightly

df_a[:-1].groupby(level=0).rolling(2).cov(df_b)

on 1.2.5

value idx1 idx1 idx2 1 1 1 NaN 2 0.5 3 0.5 4 0.5 5 0.5 2 2 1 NaN 2 0.5 3 0.5 4 0.5

on 1.3.x/master/this PR

value idx1 idx1 idx2 1 1 1 NaN 2 0.5 3 0.5 4 0.5 5 0.5 2 1 NaN 2 NaN 3 NaN 4 NaN 2 1 1 NaN 2 NaN 3 NaN 4 NaN 5 NaN 2 1 NaN 2 0.5 3 0.5 4 0.5

which now seems inconsistent with the df_a.groupby(level=0).rolling(2).cov(df_b) case

In OPs example, len(other) == len(all groups).

In your df_a[:-1] twist, len(other) != len(all groups), and comparing to the equivalent groupby(level=0).apply(lambda x.rolling(2).cov(df_b)) case the result is

In [15]: def func(x): ...: return getattr(x.rolling(2), f)(df_b) In [16]: df_a = pd.DataFrame({"value": range(10), "idx1": [1] * 5 + [2] * 5, "idx2": [1, 2, 3, 4, 5] * 2}).set_index(["idx1", "idx2 ...: "]) ...: df_b = pd.DataFrame({"value": range(5), "idx2": [1, 2, 3, 4, 5]}).set_index("idx2") In [18]: f = "cov" In [19]: df_a[:-1].groupby(level=0).apply(func) Out[19]: value idx1 idx2 1 1 NaN 2 0.5 3 0.5 4 0.5 5 0.5 2 1 NaN 2 0.5 3 0.5 4 0.5

But we already have a similar unit test to the twist, window/test_groupby.py:TestRolling:test_rolling_corr_cov where len(other) != len(all groups) and tests against the equivalent groupby(...).apply(lambda x.rolling(...).cov(...)) case. The result is that other is reindexing each group with other.index which "expands the result"

Bottom line:

I do acknowledge this fix seems a bit brittle.

The behavior where len(other) != len(all groups) is a bit opaque to me because comparing to the groupby.apply(lambda x.rolling) equivalent expression doesn't to have a discernable pattern to me.

I agree that comparison to apply might not be appropriate - in general the built-in op should produce results that are potentially better than apply because apply has be agnostic to the operation.

That said, the current output on master from the example in #44824 (comment) doesn't make sense to me. It appears to me that pandas is creating a record with (idx1, idx1, ...) being (1, 2, ...) and then declaring the value to be null because the record (that pandas created) is inconsistent.

That seems to make sense to me since the second level was never involved in the groupby.

If you have a change could you evaluate if this test behavior seems correct? It currently tries to match the equivalent groupby.apply version but I am not sure if there is a more sensible result.

pandas/pandas/tests/window/test_groupby.py

Line 151 in 193ca73

def test_rolling_corr_cov_other_diff_size_as_groups(self, f):

What would the expected behavior of frame1.cov(other=frame2) be? Currently this argument doesn't exist, but if it did, I think the expectation would be to first align frame2 to frame1's index and then compute the covariance.

Under this assumption, I think a more sensible result in the test linked to above would be to do the same alignment for each window during the rolling operation. This also has the benefit of making

frame.groupby("A").rolling(window=2).cov(other=frame)

be the same as

frame.groupby("A").rolling(window=2).cov()

However if the answer to the first question is to have null values for rows in frame2 that don't exist in frame1, then the current behavior of the test above is the most sensible.

Thanks. I agree that it's probably make sense index alignment to happen between the group and other before computing covariance.

jreback · 2021-12-10T23:55:52Z

thanks @mroeschke

jreback · 2021-12-10T23:55:59Z

@meeseeksdev backport 1.3.x

…corr/cov when other is same size as each group

lumberbot-app · 2021-12-10T23:56:06Z

Something went wrong ... Please have a look at my logs.

…hen other is same size as each group (#44848) Co-authored-by: Matthew Roeschke <emailformattr@gmail.com>

BUG: Fix regression in groupby.rolling.corr/cov when other is same si…

1ec3f9f

…ze as each group

mroeschke added this to the 1.3.5 milestone Dec 9, 2021

mroeschke added Regression Functionality that used to work in a prior pandas version Window rolling, ewma, expanding labels Dec 9, 2021

simonjayhawkins reviewed Dec 9, 2021

View reviewed changes

Merge remote-tracking branch 'upstream/master' into reg/groupby_roll_cov

4b855b9

jreback merged commit ea2dc44 into pandas-dev:master Dec 10, 2021

meeseeksmachine mentioned this pull request Dec 10, 2021

Backport PR #44824 on branch 1.3.x (BUG: Fix regression in groupby.rolling.corr/cov when other is same size as each group) #44848

Merged

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Dec 10, 2021

Backport PR pandas-dev#44824: BUG: Fix regression in groupby.rolling.…

4a1ab51

…corr/cov when other is same size as each group

lithomas1 pushed a commit that referenced this pull request Dec 11, 2021

Backport PR #44824: BUG: Fix regression in groupby.rolling.corr/cov w…

766ea00

…hen other is same size as each group (#44848) Co-authored-by: Matthew Roeschke <emailformattr@gmail.com>

mroeschke deleted the reg/groupby_roll_cov branch December 11, 2021 03:23

mroeschke mentioned this pull request Dec 12, 2021

BUG: groupby.rolling.cov/corr(other) with len(group) != len(other) introduces extra NaN group results #44861

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Fix regression in groupby.rolling.corr/cov when other is same size as each group #44824

BUG: Fix regression in groupby.rolling.corr/cov when other is same size as each group #44824

mroeschke commented Dec 9, 2021

jreback commented Dec 9, 2021

mroeschke commented Dec 9, 2021

simonjayhawkins Dec 9, 2021

mroeschke Dec 10, 2021 •

edited

mroeschke Dec 10, 2021

rhshadrach Dec 11, 2021

mroeschke Dec 12, 2021

rhshadrach Dec 12, 2021 •

edited

mroeschke Dec 12, 2021

jreback commented Dec 10, 2021

jreback commented Dec 10, 2021

lumberbot-app bot commented Dec 10, 2021

BUG: Fix regression in groupby.rolling.corr/cov when other is same size as each group #44824

BUG: Fix regression in groupby.rolling.corr/cov when other is same size as each group #44824

Conversation

mroeschke commented Dec 9, 2021

jreback commented Dec 9, 2021

mroeschke commented Dec 9, 2021

simonjayhawkins Dec 9, 2021

Choose a reason for hiding this comment

mroeschke Dec 10, 2021 • edited

Choose a reason for hiding this comment

mroeschke Dec 10, 2021

Choose a reason for hiding this comment

rhshadrach Dec 11, 2021

Choose a reason for hiding this comment

mroeschke Dec 12, 2021

Choose a reason for hiding this comment

rhshadrach Dec 12, 2021 • edited

Choose a reason for hiding this comment

mroeschke Dec 12, 2021

Choose a reason for hiding this comment

jreback commented Dec 10, 2021

jreback commented Dec 10, 2021

lumberbot-app bot commented Dec 10, 2021

mroeschke Dec 10, 2021 •

edited

rhshadrach Dec 12, 2021 •

edited