Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix a bug in 'timedelta_range' that produced an extra point on a edge case (fix #30353) #33498

Merged
merged 17 commits into from
May 9, 2020

Conversation

hasB4K
Copy link
Member

@hasB4K hasB4K commented Apr 12, 2020

The issue from #30353 came actually from timedelta_range.

import pandas as pd

def mock_timedelta_range(start=None, end=None, periods=None, freq=None, name=None, closed=None):
    epoch = pd.Timestamp(0)
    if start is not None:
        start = epoch + pd.Timedelta(start)
    if end is not None:
        end = epoch + pd.Timedelta(end)
    res = pd.date_range(start=start, end=end, periods=periods, freq=freq, name=name, closed=closed)
    res -= epoch
    res.freq = freq
    return res

print(mock_timedelta_range("1day", "10day", freq="2D"))
print(pd.timedelta_range("1day", "10day", freq="2D"))

The outputs from mock_timedelta_range and from pd.timedelta_range are supposed to equivalent, but are not on pandas v1.0.0:
Outputs without this PR:

TimedeltaIndex(['1 days', '3 days', '5 days', '7 days', '9 days'], dtype='timedelta64[ns]', freq='2D')
TimedeltaIndex(['1 days', '3 days', '5 days', '7 days', '9 days', '11 days'], dtype='timedelta64[ns]', freq='2D')

Outputs with this PR:

TimedeltaIndex(['1 days', '3 days', '5 days', '7 days', '9 days'], dtype='timedelta64[ns]', freq='2D')
TimedeltaIndex(['1 days', '3 days', '5 days', '7 days', '9 days'], dtype='timedelta64[ns]', freq='2D')

It also solve an issue (that fail on master) related to this comment: #13022 (comment)

import pandas as pd
rng = pd.timedelta_range(start='1d', periods=10, freq='d')
df = pd.Series(range(10), index=rng)
df.resample('2D').count()

Outputs with this PR:

1 days    2
3 days    2
5 days    2
7 days    2
9 days    2
Freq: 2D, dtype: int64

On the firsts commits I just fixed the issue in _generate_regular_range, then I decided to do a refactor and to use the same code that generate ranges in core/arrays/_ranges.py for date_range and timedelta_range.

Linked with #10887.


@hasB4K hasB4K changed the title Resample fix timedelta WIP: fix #30353 invalid extra bins when resampling timedelta (bug from timedelta_range) Apr 12, 2020
@hasB4K hasB4K changed the title WIP: fix #30353 invalid extra bins when resampling timedelta (bug from timedelta_range) BUG: Fix a bug in 'timedelta_range' that produced an extra point on a edge case (fix #30353) Apr 12, 2020
pandas/core/resample.py Outdated Show resolved Hide resolved
@hasB4K hasB4K changed the title BUG: Fix a bug in 'timedelta_range' that produced an extra point on a edge case (fix #30353) WIP: BUG: Fix a bug in 'timedelta_range' that produced an extra point on a edge case (fix #30353) Apr 12, 2020
@hasB4K hasB4K changed the title WIP: BUG: Fix a bug in 'timedelta_range' that produced an extra point on a edge case (fix #30353) BUG: Fix a bug in 'timedelta_range' that produced an extra point on a edge case (fix #30353) Apr 12, 2020
@pep8speaks
Copy link

pep8speaks commented Apr 12, 2020

Hello @hasB4K! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-05-09 16:24:27 UTC

@hasB4K hasB4K force-pushed the resample-fix-timedelta branch 2 times, most recently from ffe1caf to 89a9af5 Compare April 14, 2020 00:46
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the refactor looks reasonable - @jbrockmendel any thoughts?

pandas/tests/resample/test_base.py Show resolved Hide resolved
@WillAyd WillAyd added Timedelta Timedelta data type Datetime Datetime data dtype labels Apr 14, 2020
@jbrockmendel
Copy link
Member

A few comments on the refactor, but i really like the idea of re-using generate_regular_range

Comment on lines +425 to +579
- Bug in :func:`timedelta_range` that produced an extra point on a edge case (:issue:`30353`, :issue:`33498`)
- Bug in :meth:`DataFrame.resample` that produced an extra point on a edge case (:issue:`30353`, :issue:`13022`, :issue:`33498`)
- Bug in :meth:`DataFrame.resample` that ignored the ``loffset`` argument when dealing with timedelta (:issue:`7687`, :issue:`33498`)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope this is okay @WillAyd 🙂. I'll be happy to change this if you disagree with this changelog.

BTW, regarding loffset, as mentioned earlier (https://github.com/pandas-dev/pandas/pull/33498/files#r407231964):

Useful because otherwise tests/resample/test_base.py::test_resample_loffset_arg_type was always failing.
However, this piece of code is questionable since #31809 will deprecate loffset anyway.

pandas/core/arrays/_ranges.py Show resolved Hide resolved
pandas/core/arrays/datetimes.py Show resolved Hide resolved
pandas/core/arrays/timedeltas.py Outdated Show resolved Hide resolved
pandas/core/arrays/datetimes.py Show resolved Hide resolved
pandas/tests/resample/test_base.py Show resolved Hide resolved
if isinstance(expected.index, TimedeltaIndex):
msg = "DataFrame are different"
with pytest.raises(AssertionError, match=msg):
tm.assert_frame_equal(result_agg, expected)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this is testing incorrect behavior? thats pretty weird

Copy link
Member Author

@hasB4K hasB4K Apr 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel Yeah, that was the case. That was testing an incorrect behavior... 😕 And this was the case for two reasons:

Here an example of the values when setting a breakpoint with PyCharm line 222 in pandas/tests/resample/test_base.py on master:

The expected result (with loffset and without an extra bin):
expected

The actual result (without loffset and with an extra bin):
result_agg

I think that was the reason of the presence of this TODO:

# GH 13022, 7687 - TODO: fix resample w/ TimedeltaIndex

IMO, the presence of this if was just to skip the tests of Timedeltas when resampling without skipping the tests of Timestamps and Periods parametrized with @all_ts.

@jreback jreback added this to the 1.1 milestone May 1, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm a couple of questions.

@hasB4K
Copy link
Member Author

hasB4K commented May 1, 2020

lgtm a couple of questions.

Thank you! 🙂
I have just answered on those and fixed the issue on the test you mentioned earlier.

Let me know if you need anything else!

@hasB4K hasB4K requested a review from jreback May 1, 2020 14:30
@jreback jreback merged commit 7c3f662 into pandas-dev:master May 9, 2020
@jreback
Copy link
Contributor

jreback commented May 9, 2020

thanks @hasB4K very nice!

@hasB4K hasB4K deleted the resample-fix-timedelta branch May 9, 2020 20:41
rhshadrach pushed a commit to rhshadrach/pandas that referenced this pull request May 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Timedelta Timedelta data type
Projects
None yet
6 participants