Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fixes rounding error in Timestamp.floor() #19240

Merged
merged 2 commits into from Feb 7, 2018

Conversation

cbertinato
Copy link
Contributor

@codecov
Copy link

codecov bot commented Jan 15, 2018

Codecov Report

Merging #19240 into master will increase coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #19240      +/-   ##
==========================================
+ Coverage    91.6%   91.62%   +0.02%     
==========================================
  Files         150      150              
  Lines       48750    48742       -8     
==========================================
+ Hits        44657    44661       +4     
+ Misses       4093     4081      -12
Flag Coverage Δ
#multiple 89.99% <100%> (+0.02%) ⬆️
#single 41.76% <50%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/indexes/datetimelike.py 97.05% <100%> (-0.05%) ⬇️
pandas/plotting/_converter.py 66.95% <0%> (+1.73%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 54f1b3e...1901fc8. Read the comment docs.

@jreback jreback changed the title BUG: Fixes rounding error in Timestamp.floor() BUG: Fixes rounding error in Timestamp.floor() Jan 15, 2018
@jreback jreback added Timeseries Dtype Conversions Unexpected or buggy dtype conversions labels Jan 15, 2018
@@ -462,6 +461,7 @@ Numeric
- Bug in :class:`DatetimeIndex`, :class:`TimedeltaIndex` addition and subtraction of zero-dimensional integer arrays gave incorrect results (:issue:`19012`)
- Bug in :func:`Series.__add__` adding Series with dtype ``timedelta64[ns]`` to a timezone-aware ``DatetimeIndex`` incorrectly dropped timezone information (:issue:`13905`)
- Bug in :func:`Timedelta.__floordiv__` and :func:`Timedelta.__rfloordiv__` dividing by many incompatible numpy objects was incorrectly allowed (:issue:`18846`)
- Bug in :func:`Timestamp.floor` where time stamps far in the future and past were not rounded correctly (:issue:`19206`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see below, need to add for ceil and for DatetimeIndex floor/ceil as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need tests for DTI as well (these are in pandas/tests/indexes/datetimes/test_ops pls parametrize

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect you need a separate fix there as well.

divisor = 10 ** int(np.log10(unit / 1e7))
else:
divisor = 1

if unit < 1000 and unit % 1000 != 0:
# for nano rounding, work with the last 6 digits separately
# due to float precision
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this case be incorporated as well in the same divisor check?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe set both divisor & buff in a single if that handles really small, really large and other units (divisor=1, buff=1)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you do this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a simple way of doing this. We would essentially have to move the logic into the arithmetic, which I think would make the code more opaque. I can move the divisor check into the case were unit >= 1000 and refactor some of it to make it a bit cleaner.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's ok, just trying to make it more clear.

@@ -695,6 +695,27 @@ def test_round(self):
expected = Timestamp('20130101')
assert result == expected

# GH 19206
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment here

result = dt.floor('10ns')
expected = Timestamp('1823-01-01 00:00:01.000000010')
assert result == expected

# ceil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we test for ceil as well. If you can parametrize this all the better (if not ok too).

Also ok with splitting this giant test to test_round, test_floor, test_ceil

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still working on other change requests, but just wanted to get feedback on refactoring for timestamp tests. There are so many tests that it seemed like a mixed approach of splitting tests so that failed assertions could be easily traced, and parametrizing to consolidate common code was the way to go.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see below. we need to parameterize as much as possible to cover the cases. and you can still run them individually as well (optionally you can pass ids to create them)

assert result == expected

dt = Timestamp('1823-01-01 00:00:01')
result = dt.floor('1s')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls test for datetimeindex as well

result = dt.round('D')
expected = Timestamp('20130202')
assert result == expected
TStestcase = namedtuple('Tstestcase',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

divisor = 10 ** int(np.log10(unit / 1e7))
else:
divisor = 1

if unit < 1000 and unit % 1000 != 0:
# for nano rounding, work with the last 6 digits separately
# due to float precision
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you do this?

@@ -462,6 +461,7 @@ Numeric
- Bug in :class:`DatetimeIndex`, :class:`TimedeltaIndex` addition and subtraction of zero-dimensional integer arrays gave incorrect results (:issue:`19012`)
- Bug in :func:`Series.__add__` adding Series with dtype ``timedelta64[ns]`` to a timezone-aware ``DatetimeIndex`` incorrectly dropped timezone information (:issue:`13905`)
- Bug in :func:`Timedelta.__floordiv__` and :func:`Timedelta.__rfloordiv__` dividing by many incompatible numpy objects was incorrectly allowed (:issue:`18846`)
- Bug in :func:`Timestamp.floor` where time stamps far in the future and past were not rounded correctly (:issue:`19206`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need tests for DTI as well (these are in pandas/tests/indexes/datetimes/test_ops pls parametrize

@@ -462,6 +461,7 @@ Numeric
- Bug in :class:`DatetimeIndex`, :class:`TimedeltaIndex` addition and subtraction of zero-dimensional integer arrays gave incorrect results (:issue:`19012`)
- Bug in :func:`Series.__add__` adding Series with dtype ``timedelta64[ns]`` to a timezone-aware ``DatetimeIndex`` incorrectly dropped timezone information (:issue:`13905`)
- Bug in :func:`Timedelta.__floordiv__` and :func:`Timedelta.__rfloordiv__` dividing by many incompatible numpy objects was incorrectly allowed (:issue:`18846`)
- Bug in :func:`Timestamp.floor` where time stamps far in the future and past were not rounded correctly (:issue:`19206`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect you need a separate fix there as well.

@cbertinato
Copy link
Contributor Author

Probably a separate issue, but there appears to be no warning if you specify a Timestamp with a string that causes the underlying int64 to overflow. This works:

In [135]: pd.Timestamp(9223372036854775807)
Out[135]: Timestamp('2262-04-11 23:47:16.854775807')

In [136]: pd.Timestamp(9223372036854775808)
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-136-49b799c43d55> in <module>()
----> 1 pd.Timestamp(9223372036854775808)
pandas/_libs/tslib.pyx in pandas._libs.tslib.Timestamp.__new__ (pandas/_libs/tslib.c:10051)()
pandas/_libs/tslib.pyx in pandas._libs.tslib.convert_to_tsobject (pandas/_libs/tslib.c:27665)()
OverflowError: Python int too large to convert to C long

but this is silent:

In [138]: pd.Timestamp('2262-04-11 23:47:16.854775807')
Out[138]: Timestamp('2262-04-11 23:47:16.854775807')

In [139]: pd.Timestamp('2262-04-11 23:47:16.854775808')
Out[139]: Timestamp('2262-04-11 23:47:16.854775')

Shouldn't this throw an exception?

@pep8speaks
Copy link

pep8speaks commented Jan 25, 2018

Hello @cbertinato! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on February 07, 2018 at 14:29 Hours UTC

@jreback
Copy link
Contributor

jreback commented Jan 26, 2018

@cbertinato this is all correct

In [1]: pd.Timestamp.max
Out[1]: Timestamp('2262-04-11 23:47:16.854775807')

In [2]: pd.Timestamp.max.value
Out[2]: 9223372036854775807

In [3]: pd.Timestamp(pd.Timestamp.max.value)
Out[3]: Timestamp('2262-04-11 23:47:16.854775807')

In [4]: pd.Timestamp(pd.Timestamp.max.value + 1)
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-4-cef6abffd699> in <module>()
----> 1 pd.Timestamp(pd.Timestamp.max.value + 1)

~/pandas/pandas/_libs/tslibs/timestamps.pyx in pandas._libs.tslibs.timestamps.Timestamp.__new__()
    579             tz, tzinfo = tzinfo, None
    580 
--> 581         ts = convert_to_tsobject(ts_input, tz, unit, 0, 0)
    582 
    583         if ts.value == NPY_NAT:

~/pandas/pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion.convert_to_tsobject()
    287         else:
    288             ts = ts * cast_from_unit(None, unit)
--> 289             obj.value = ts
    290             dt64_to_dtstruct(ts, &obj.dts)
    291     elif is_float_object(ts):

OverflowError: Python int too large to convert to C long

In [5]: pd.Timestamp('2262-04-11 23:47:16.854775808')
Out[5]: Timestamp('2262-04-11 23:47:16.854775')

In [6]: pd.Timestamp('2262-04-11 23:47:16.854775808').value
Out[6]: 9223372036854775000

[138] & [139] might be some nanosecond rounding; I suppose you could make an issue about it

# GH19206
# to deal with round-off when unit is large
if unit >= 1e9:
divisor = 10 ** int(np.log10(unit / 1e7))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably pull out all of this rounding logic to a separate function that can be called on a scalar & and array, so we can use it both for Timestamp and DTI (could do it in this PR or as a followup)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's a good idea. We can do it here while this is still open.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great!

'1823-01-01 00:00:01.000000020')
])
def test_round_ops(self, test_input, rounder, freq, expected):
for tz in self.tz:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure why tz was not made into a fixture, but I guess ok here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take care of that too. I don't know why either. Must have overlooked it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!


@pytest.mark.parametrize('test_input, rounder, freq, expected, kw', [
('20130101 09:10:11', 'floor', 'D', '20130101', {}),
# GH 19206 - times far in the future and past rounding incorrectly
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could just pass None for kw (and then turn it in it into an empty dict in the test)

…r dates far in the future and past (GH19206)
@cbertinato
Copy link
Contributor Author

Moved the function common to both Timestamps and DTIs out of the Timestamp round function, but left it in timestamp.pyx. Not sure whether to move it out of there completely. Also moved all tests into the refactored test modules.

'1823-01-01 00:00:01.000000020'),
('1823-01-01 00:00:01', 'floor', '1s', '1823-01-01 00:00:01'),
('1823-01-01 00:00:01', 'ceil', '1s', '1823-01-01 00:00:01')])
def test_ceil_floor_edge(self, tz, test_input, rounder, freq, expected):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a NaT and another element here (the 2nd can be the same as the first)

@@ -93,6 +93,22 @@ def test_round_frequencies(self, freq, expected):
result = stamp.round(freq=freq)
assert result == expected

@pytest.mark.parametrize('test_input, rounder, freq, expected', [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a NaT here as well

@jreback jreback added this to the 0.23.0 milestone Feb 7, 2018
@jreback jreback merged commit 5052842 into pandas-dev:master Feb 7, 2018
@jreback
Copy link
Contributor

jreback commented Feb 7, 2018

thanks @cbertinato nice patch! keep em coming!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Timeseries
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rounding error in Timestamp.floor()
3 participants