Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix regression in groupby.rolling.corr/cov when other is same size as each group #44824

Merged
merged 2 commits into from Dec 10, 2021

Conversation

mroeschke
Copy link
Member

@mroeschke mroeschke added this to the 1.3.5 milestone Dec 9, 2021
@mroeschke mroeschke added Regression Functionality that used to work in a prior pandas version Window rolling, ewma, expanding labels Dec 9, 2021
@jreback
Copy link
Contributor

jreback commented Dec 9, 2021

nice ping on greenish

@mroeschke
Copy link
Member Author

greenish

# When we have other, we must reindex (expand) the result
if other is not None and not all(
len(group) == len(other) for group in self._grouper.indices.values()
):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does this fix relate to the changes in the PR that caused the regression? (maybe should put that in the issue rather than respond here)

I not familiar with this section of the codebase but this fix looks very specific to the reported case.

for instance, varying the OP example in the issue slightly

df_a[:-1].groupby(level=0).rolling(2).cov(df_b)

on 1.2.5

                value
idx1 idx1 idx2       
1    1    1       NaN
          2       0.5
          3       0.5
          4       0.5
          5       0.5
2    2    1       NaN
          2       0.5
          3       0.5
          4       0.5

on 1.3.x/master/this PR

                value
idx1 idx1 idx2       
1    1    1       NaN
          2       0.5
          3       0.5
          4       0.5
          5       0.5
     2    1       NaN
          2       NaN
          3       NaN
          4       NaN
2    1    1       NaN
          2       NaN
          3       NaN
          4       NaN
          5       NaN
     2    1       NaN
          2       0.5
          3       0.5
          4       0.5

which now seems inconsistent with the df_a.groupby(level=0).rolling(2).cov(df_b) case

Copy link
Member Author

@mroeschke mroeschke Dec 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In OPs example, len(other) == len(all groups).

In your df_a[:-1] twist, len(other) != len(all groups), and comparing to the equivalent groupby(level=0).apply(lambda x.rolling(2).cov(df_b)) case the result is

In [15]:         def func(x):
    ...:             return getattr(x.rolling(2), f)(df_b)


In [16]: df_a = pd.DataFrame({"value": range(10), "idx1": [1] * 5 + [2] * 5, "idx2": [1, 2, 3, 4, 5] * 2}).set_index(["idx1", "idx2
    ...: "])
    ...: df_b = pd.DataFrame({"value": range(5), "idx2": [1, 2, 3, 4, 5]}).set_index("idx2")

In [18]: f = "cov"

In [19]: df_a[:-1].groupby(level=0).apply(func)
Out[19]:
           value
idx1 idx2
1    1       NaN
     2       0.5
     3       0.5
     4       0.5
     5       0.5
2    1       NaN
     2       0.5
     3       0.5
     4       0.5

But we already have a similar unit test to the twist, window/test_groupby.py:TestRolling:test_rolling_corr_cov where len(other) != len(all groups) and tests against the equivalent groupby(...).apply(lambda x.rolling(...).cov(...)) case. The result is that other is reindexing each group with other.index which "expands the result"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bottom line:

I do acknowledge this fix seems a bit brittle.

The behavior where len(other) != len(all groups) is a bit opaque to me because comparing to the groupby.apply(lambda x.rolling) equivalent expression doesn't to have a discernable pattern to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that comparison to apply might not be appropriate - in general the built-in op should produce results that are potentially better than apply because apply has be agnostic to the operation.

That said, the current output on master from the example in #44824 (comment) doesn't make sense to me. It appears to me that pandas is creating a record with (idx1, idx1, ...) being (1, 2, ...) and then declaring the value to be null because the record (that pandas created) is inconsistent.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems to make sense to me since the second level was never involved in the groupby.

If you have a change could you evaluate if this test behavior seems correct? It currently tries to match the equivalent groupby.apply version but I am not sure if there is a more sensible result.

def test_rolling_corr_cov_other_diff_size_as_groups(self, f):

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would the expected behavior of frame1.cov(other=frame2) be? Currently this argument doesn't exist, but if it did, I think the expectation would be to first align frame2 to frame1's index and then compute the covariance.

Under this assumption, I think a more sensible result in the test linked to above would be to do the same alignment for each window during the rolling operation. This also has the benefit of making

frame.groupby("A").rolling(window=2).cov(other=frame)

be the same as

frame.groupby("A").rolling(window=2).cov()

However if the answer to the first question is to have null values for rows in frame2 that don't exist in frame1, then the current behavior of the test above is the most sensible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I agree that it's probably make sense index alignment to happen between the group and other before computing covariance.

@jreback jreback merged commit ea2dc44 into pandas-dev:master Dec 10, 2021
@jreback
Copy link
Contributor

jreback commented Dec 10, 2021

thanks @mroeschke

@jreback
Copy link
Contributor

jreback commented Dec 10, 2021

@meeseeksdev backport 1.3.x

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Dec 10, 2021
…corr/cov when other is same size as each group
@lumberbot-app
Copy link

lumberbot-app bot commented Dec 10, 2021

Something went wrong ... Please have a look at my logs.

lithomas1 pushed a commit that referenced this pull request Dec 11, 2021
…hen other is same size as each group (#44848)

Co-authored-by: Matthew Roeschke <emailformattr@gmail.com>
@mroeschke mroeschke deleted the reg/groupby_roll_cov branch December 11, 2021 03:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Regression Functionality that used to work in a prior pandas version Window rolling, ewma, expanding
Projects
None yet
4 participants