BUG/PERF: offsets.apply doesnt preserve nanosecond #7697

Merged
merged 1 commit into from Jul 25, 2014

Member

sinhrks commented Jul 8, 2014

The main fix is to preserve nanosecond information, which can be lost during offset.apply, but it also includes:

  • Support for dateutil timezones
  • A small performance improvement, even though v0.14.1 is expected to take longer than v0.14.0, because the perf test in v0.14 doesn't perform the timestamp conversion, which was fixed in #7502.
    NOTE: This caches Tick.delta because it was calculated 3 times repeatedly. Could this cause any side effects?

Before

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
timeseries_year_incr                         |   0.0164 |   0.0103 |   1.5846 |
timeseries_year_apply                        |   0.0153 |   0.0094 |   1.6356 |
timeseries_day_incr                          |   0.0187 |   0.0053 |   3.5075 |
timeseries_day_apply                         |   0.0164 |   0.0033 |   4.9048 |

Target [d0076db] : PERF: Improve index.min and max perf
Base   [da0f7ae] : RLS: 0.14.0 final

After the fix

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
timeseries_year_incr                         |   0.0150 |   0.0087 |   1.7339 |
timeseries_year_apply                        |   0.0126 |   0.0073 |   1.7283 |
timeseries_day_incr                          |   0.0130 |   0.0053 |   2.4478 |
timeseries_day_apply                         |   0.0107 |   0.0033 |   3.2143 |

Target [64dd021] : BUG: offsets.apply doesnt preserve nanosecond
Base   [da0f7ae] : RLS: 0.14.0 final

@jreback jreback and 1 other commented on an outdated diff Jul 8, 2014

pandas/tseries/offsets.py
@@ -26,15 +27,17 @@
# convert to/from datetime/timestamp to allow invalid Timestamp ranges to pass thru
def as_timestamp(obj):
+    if isinstance(obj, Timestamp):
+        return obj
+    if type(obj) == date:
@jreback

jreback Jul 8, 2014

Contributor

shouldn't this be isinstance(obj, date)? not sure if you need to include (np.datetime64,datetime,date)

@sinhrks

sinhrks Jul 9, 2014

Member

Because `datetime` is a subclass of `date`. Modified to use `isinstance`.

>>> dt = datetime.datetime(2011, 1, 1)
>>> isinstance(dt, datetime.date)
True
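
The difference between the two checks can be reproduced with the standard library alone. The following is a minimal sketch, not code from this PR; `is_plain_date` is a hypothetical helper name used only for illustration.

```python
import datetime


def is_plain_date(obj):
    """Return True only for datetime.date itself, not for its subclasses."""
    # Exact type comparison: a datetime.datetime instance is rejected here.
    return type(obj) is datetime.date


d = datetime.date(2011, 1, 1)
dt = datetime.datetime(2011, 1, 1)

# isinstance accepts subclasses, so both objects count as dates.
both_are_dates = isinstance(d, datetime.date) and isinstance(dt, datetime.date)

# The exact-type check matches only the plain date.
only_date_is_plain = is_plain_date(d) and not is_plain_date(dt)
```

An exact-type check is what distinguishes a plain date from a datetime; `isinstance` alone cannot tell them apart.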
@sinhrks

sinhrks Jul 9, 2014

Member

Ah, Timestamp can accept datetime.date. Thus this check seems not to be required at all.

@jreback jreback and 1 other commented on an outdated diff Jul 8, 2014

pandas/tseries/offsets.py
        return Timestamp(obj)
    except (OutOfBoundsDatetime):
        pass
    return obj
-def as_datetime(obj):
+def as_datetime(obj, warn=False):
@jreback

jreback Jul 8, 2014

Contributor

what does 'warn' do? why is it needed

@sinhrks

sinhrks Jul 9, 2014

Member

If an OutOfBoundsDatetime error occurs, the result will be a normal datetime (nanosecond will be reset). warn is passed to Timestamp.to_pydatetime so that it shows a warning that says so.

@jreback

jreback Jul 22, 2014

Contributor

I don't see this warning being actually used anywhere? (I see you calling it), but it doesn't seem to do anything

@jreback

jreback Jul 24, 2014

Contributor

still not clear on the use of warn (and does it do anything)?

@sinhrks

sinhrks Jul 24, 2014

Member

Ah, correct. Originally warn was intended to be passed to to_pydatetime, but the current logic doesn't require it, because as_datetime is only used once and that call always needs to show the warning. Fixed.
https://github.com/pydata/pandas/blob/master/pandas/tslib.pyx#L425

jreback added this to the 0.15.0 milestone Jul 8, 2014

Contributor

jreback commented Jul 21, 2014

@sinhrks revisit?

Member

sinhrks commented Jul 21, 2014

@jreback Rebased. Is anything else required?

Contributor

jreback commented Jul 21, 2014

  • can you show an example of where this fails (in current master)
  • can you point to where Tick is calculated several times (currently)?
Member

sinhrks commented Jul 22, 2014

Affected offset (reset nanosecond)

If the apply logic includes a datetime conversion, the nanosecond will be lost.

  • CustomBusinessDay
  • CustomBusinessMonthEnd
  • CustomBusinessMonthBegin
  • MonthBegin
  • BusinessMonthBegin
  • MonthEnd
  • BusinessMonthEnd
  • YearBegin
  • BYearBegin
  • YearEnd
  • BYearEnd
  • QuarterBegin
  • BQuarterBegin
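
Why a datetime conversion drops the nanosecond can be illustrated without pandas: `datetime.datetime` only has microsecond resolution. The sketch below is an assumption-laden illustration, not the PR's implementation. All function names are hypothetical, and timestamps are modelled as plain integer nanoseconds since the epoch.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def ns_to_datetime(ns_value):
    """Convert nanoseconds since the epoch to a datetime (sub-microsecond part is dropped)."""
    return EPOCH + timedelta(microseconds=ns_value // 1000)


def datetime_to_ns(dt):
    """Convert a naive datetime back to nanoseconds since the epoch."""
    return ((dt - EPOCH) // timedelta(microseconds=1)) * 1000


def shift_one_day_lossy(ns_value):
    # Round-trips through datetime, so the nanosecond part is discarded.
    return datetime_to_ns(ns_to_datetime(ns_value) + timedelta(days=1))


def shift_one_day_preserving(ns_value):
    # Remember the nanosecond part first, then add it back after the shift.
    nano = ns_value % 1000
    return shift_one_day_lossy(ns_value) + nano


start = 1_388_566_800_000_000_005  # 2014-01-01 09:00:00 plus 5 ns
lossy = shift_one_day_lossy(start)
kept = shift_one_day_preserving(start)
```

Saving the nanosecond before the conversion and adding it back afterwards is the same general idea as the `nano` handling shown in the diff for this PR.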

Affected offset (dateutil support)

tz.localize raises AttributeError: 'tzfile' object has no attribute 'localize' if tz is a dateutil timezone.

  • BQuarterBegin
  • QuarterEnd
  • BQuarterEnd
  • LastWeekOfMonth
  • FY5253Quarter
  • FY5253
  • WeekOfMonth
  • Easter
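
One way to avoid this kind of AttributeError is to call `localize` only when the timezone object provides it. The sketch below uses only the standard library; `localize_naive` and `FakeLocalizingZone` are hypothetical names, and this is not claimed to be how `tslib._localize_pydatetime` is implemented.

```python
from datetime import datetime, timedelta, timezone


def localize_naive(dt, tz):
    """Attach tz to a naive datetime, whether or not tz has a localize method."""
    localize = getattr(tz, "localize", None)
    if localize is not None:
        # pytz-style timezone objects provide localize().
        return localize(dt)
    # Other tzinfo objects (e.g. datetime.timezone) have no localize().
    return dt.replace(tzinfo=tz)


class FakeLocalizingZone:
    """Hypothetical stand-in for a pytz-style timezone, for illustration only."""

    def __init__(self, tz):
        self._tz = tz

    def localize(self, dt):
        return dt.replace(tzinfo=self._tz)


naive = datetime(2014, 7, 8, 9, 0)
plus_nine = timezone(timedelta(hours=9))

via_replace = localize_naive(naive, plus_nine)
via_localize = localize_naive(naive, FakeLocalizingZone(plus_nine))
```

Both calls return the same aware datetime, so callers do not need to know which kind of timezone object they were given.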
Member

sinhrks commented Jul 22, 2014

can you point to where Tick is calculated several times (currently)?

Caused by these:

  • Check hasattr

https://github.com/pydata/pandas/blob/master/pandas/tslib.pyx#784

  • Twice again in nanosecond conversion (hasattr and actual addition)

https://github.com/pydata/pandas/blob/master/pandas/tslib.pyx#784

Contributor

jreback commented Jul 22, 2014

not sure what you mean (about Tick being calculated more than once). The 2nd time is simply int64 addition, AFAICT. delta_to_nanoseconds is necessary. What do you think this should change to?

Member

sinhrks commented Jul 22, 2014

Yeah, all operations are necessary. What I meant is that delta is recalculated every time it is accessed, so it may be better to change it to cache_readonly
https://github.com/pydata/pandas/blob/master/pandas/tseries/offsets.py#L2016.

Actually, this doesn't affect performance much, so it is fine to leave it as a normal property.

Contributor

jreback commented Jul 22, 2014

I think that delta and nanos could be cache_readonly. Once you create an offset they are not changed (I don't think). You can try that (but separate PR). I am not sure how to test that, maybe just trace the code and see.
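
The effect of caching a derived attribute can be sketched with the standard library. This example uses `functools.cached_property` (Python 3.8+) as a stand-in for pandas' own `cache_readonly`, which is a separate implementation; `HypotheticalTick` and its attributes are made up for illustration and are not the pandas class.

```python
from functools import cached_property


class HypotheticalTick:
    """Illustrative stand-in for a fixed-frequency offset; not the pandas class."""

    def __init__(self, n, nanos_per_unit):
        self.n = n
        self.nanos_per_unit = nanos_per_unit
        self.delta_computations = 0  # counts how often delta is really computed

    @cached_property
    def delta(self):
        # Computed on first access, then stored on the instance.
        self.delta_computations += 1
        return self.n * self.nanos_per_unit


tick = HypotheticalTick(n=3, nanos_per_unit=1_000_000_000)
values = [tick.delta, tick.delta, tick.delta]
```

Three accesses trigger a single computation. This is only safe if `n` never changes after construction, which is the assumption raised in the comment above. Note also that `functools.cached_property` does not by itself block assignment to the attribute.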

Contributor

jreback commented Jul 22, 2014

@sinhrks otherwise this looks ok. It just changes a lot of code so trying to review.

Member

sinhrks commented Jul 22, 2014

OK, changed it back to a normal property.

Contributor

jreback commented Jul 22, 2014

It seems apply_wraps is on every apply except the one in DateOffset. Maybe add a note there explaining why (or is that right?)

@jreback jreback commented on the diff Jul 22, 2014

pandas/tseries/offsets.py
+ if self.normalize:
+ # normalize_date returns normal datetime
+ result = tslib.normalize_date(result)
+ result = Timestamp(result)
+
+ # nanosecond may be deleted depending on offset process
+ if not self.normalize and nano != 0:
+ if not isinstance(self, Nano) and result.nanosecond != nano:
+ if result.tz is not None:
+ # convert to UTC
+ value = tslib.tz_convert_single(result.value, 'UTC', result.tz)
+ else:
+ value = result.value
+ result = Timestamp(value + nano)
+
+ if tz is not None and result.tzinfo is None:
@jreback

jreback Jul 22, 2014

Contributor

shouldn't this be result= tslib._localize_pydatetime(result, tz) as well here?

@sinhrks

sinhrks Jul 24, 2014

Member

Because this code path only handles Timestamp, there is no need to handle datetime here.

@jreback

jreback Jul 24, 2014

Contributor

hmm, maybe add a note (or you can simply use the other routine). I found it confusing.

@sinhrks

sinhrks Jul 24, 2014

Member

OK. Modified to use tslib._localize_pydatetime(result, tz) to avoid any confusion.

Member

sinhrks commented Jul 24, 2014

DateOffset needs apply_wraps. I missed it because I mistakenly thought that DateOffset cannot be used by itself (#7375). Fixed and added tests.

Contributor

jreback commented Jul 24, 2014

ok, ping when green

Member

sinhrks commented Jul 24, 2014

@jreback now green.

@jreback jreback added a commit that referenced this pull request Jul 25, 2014

@jreback jreback Merge pull request #7697 from sinhrks/offsetnano
BUG/PERF: offsets.apply doesnt preserve nanosecond
415fbfc

@jreback jreback merged commit 415fbfc into pandas-dev:master Jul 25, 2014

1 check passed

continuous-integration/travis-ci The Travis CI build passed
Details
Contributor

jreback commented Jul 25, 2014

@sinhrks thanks for this...cleans up a large amount of code.....

sinhrks deleted the sinhrks:offsetnano branch Jul 25, 2014
