Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TST/CLN: roll_sum/mean/var/skew/kurt: simplification for non-monotonic indices #36933

Merged
merged 1 commit into from
Oct 25, 2020

Conversation

twoertwein
Copy link
Member

@twoertwein twoertwein commented Oct 7, 2020

  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff

The removed for-loop doesn't seem to be necessary (I hope this code is tested by an existing test).

I feel like I'm missing an obvious reason why these for-loops are needed: looking at the code I don't think we need them and the tests also pass.

@twoertwein twoertwein changed the title roll_sum_variable: simplification for non-monotonic indices CLN: roll_sum_variable: simplification for non-monotonic indices Oct 7, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i don't remember why this is like this for the non-monotonic case.

can you see what tests are actually hitting this (it could be not very many / nothing).

@jreback jreback added the Window rolling, ewma, expanding label Oct 7, 2020
@twoertwein
Copy link
Member Author

twoertwein commented Oct 7, 2020

there are exactly two tests that cover at least the sum part (mean/var/skew/kurtosis do not seem to be covered!).

The tests are test_indexer_constructor_arg and test_indexer_accepts_rolling_args, both from pandas/tests/window/test_base_indexer.py.

I will mark this PR as a draft and look into tests for these functions on the weekend.

I assume that the easiest way to trigger the non-monotonic branch for variable windows is by having a dataframe with a datetime index that is not sorted, or?

edit: I think the only way to trigger the variable non-monotonic part is through using a BaseIndexer with rolling. A ValueError is thrown for non-monotonic datetime.

@twoertwein twoertwein marked this pull request as draft October 7, 2020 17:43
@twoertwein twoertwein marked this pull request as ready for review October 8, 2020 16:04
@twoertwein twoertwein marked this pull request as draft October 8, 2020 17:02
@twoertwein
Copy link
Member Author

I printed the values I set to zero on master: they are zero. I assume the testcase has some randomness/platform-specific behavior?

@twoertwein twoertwein changed the title CLN: roll_sum_variable: simplification for non-monotonic indices TST/CLN: roll_sum/mean/var/skew/kurt_variable: simplification for non-monotonic indices Oct 8, 2020
@jreback
Copy link
Contributor

jreback commented Oct 9, 2020

cc @mroeschke

@twoertwein twoertwein marked this pull request as ready for review October 9, 2020 22:20
@twoertwein twoertwein changed the title TST/CLN: roll_sum/mean/var/skew/kurt_variable: simplification for non-monotonic indices TST/CLN: roll_sum/mean/var/skew/kurt: simplification for non-monotonic indices Oct 10, 2020
@jreback
Copy link
Contributor

jreback commented Oct 10, 2020

cc @mroeschke

@mroeschke
Copy link
Member

While we're addressing this

  1. Could you name all the is_monotonic_* variables to is_monotonic_increasing_* just for clarity?
  2. Could you see if we have tests for a monotonically decreasing index (and add some tests if there are none)? That's what should hit this code path

@twoertwein
Copy link
Member Author

  • Could you name all the is_monotonic_* variables to is_monotonic_increasing_* just for clarity?

Will do

  • Could you see if we have tests for a monotonically decreasing index (and add some tests if there are none)? That's what should hit this code path

I think the not is_monotonic_bounds branch calculates the requested statistic separately for each window (naive calculation but with cython) and is probably meant for non-monotonic bounds. If decreasing indices are common (they fall currently in the not is_monotonic_bounds branch), we could have a far better alternative: re-use the is_monotonic_bounds branch but switch the add_/remove_ calls?

@mroeschke
Copy link
Member

If you trace back the usage of is_monotonic_bounds, it checks the entire bounds at the start for monotonically increasing indexes.

I didn't implement the non-monotonic bound rolling case, but happy to change it if there a more efficient way to do it.

@twoertwein
Copy link
Member Author

twoertwein commented Oct 10, 2020

I didn't implement the non-monotonic bound rolling case

Isn't that already handled by the current not is_monotonic_bounds branch? Since that code calculates the statistic for each window completely independently, it can work with any windows.

I would have expected that rolling is much slower for decreasing indices as it processes windows independently of each other. At least in this simple test, that is not the case :)

df = pd.DataFrame({'values': pd.np.random.rand(1000)})
df_reverse = pd.DataFrame({'values': df['values'][::-1]}, index=df.index[::-1])

%timeit df.rolling(window=10).mean()
645 µs ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df_reverse.rolling(window=10).mean()
653 µs ± 21.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

@twoertwein
Copy link
Member Author

kurt/skew are sometimes different between increasing/decreasing indices (at an order of 1-e8 for kurt and 1-e10 for skew). I would like to belief that the decreasing indices are more accurate as we set the accumulated value to zero instead of removing the to be deleted values.

@mroeschke is it okay to 1) use deterministic values for testing (instead of np.random.rand) 2) and then "fine-tune" the maximal difference to avoid failing the tests?

@twoertwein
Copy link
Member Author

should be good now :) @mroeschke @jreback

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks fine to me. please add a whatsnew note (1.2. bug fixes in rolling).

can you also hand calculate at least mean/sum in your test (i know you hard-code which is good) to make sure that we like those results. Just not 100% trusting 1.1.3

@twoertwein
Copy link
Member Author

okay, I will manually compute the expected results.

About whatsnew: this PR doesn't add any new features and doesn't fix any bugs. Just tests and avoiding for-loops for non-increasing indices.

@jreback
Copy link
Contributor

jreback commented Oct 17, 2020

okay, I will manually compute the expected results.

About whatsnew: this PR doesn't add any new features and doesn't fix any bugs. Just tests and avoiding for-loops for non-increasing indices.

so results in 1.1.3 are ok, we just regressed somehow on master but haven't since haven't released 1.2 this is ok. great.

ping when ready. #37166 is going to rebase after this is in.

@twoertwein
Copy link
Member Author

rebased and added note that the expected statistics for sum/mean have been verified.

@twoertwein
Copy link
Member Author

@jreback green'ish (unrelated CI failure)

@mroeschke
Copy link
Member

Mind merging master and fixing up the code checks error?

@twoertwein
Copy link
Member Author

@mroeschke I think the code check errors are not caused by this PR. I rebased it now, let's see whether that fixes it.

@twoertwein
Copy link
Member Author

@mroeschke green except two unrelated windows CI failures

@mroeschke mroeschke merged commit d592e5e into pandas-dev:master Oct 25, 2020
@mroeschke
Copy link
Member

Thanks @twoertwein. Awesome find and patch!

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Oct 26, 2020
kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Window rolling, ewma, expanding
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants