Implement business_start/end cases for shift_months #18489

Merged
merged 5 commits into pandas-dev:master from jbrockmendel:tslibs-offsets6 Nov 27, 2017

2 participants
@jbrockmendel
Member

jbrockmendel commented Nov 25, 2017

Unifies apply_index implementations for MonthEnd/MonthBegin, plus extends them to BMonthEnd and BMonthBegin.

Unifies onOffset implementations for QuarterEnd/BQuarterEnd, plus extends them to QuarterBegin/BQuarterBegin.

Implements a cdef version of monthrange.
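For readers unfamiliar with the business_start/business_end cases named in the title, here is a minimal scalar sketch of the idea in plain Python. It is not the Cython code from this PR: the function names and the "start"/"end" option strings are assumptions made for illustration, and only weekends (no holidays) are treated as non-business days.

```python
import calendar
from datetime import date


def last_business_day(year, month):
    """Day-of-month of the last weekday (Mon-Fri) in the given month."""
    first_weekday, days_in_month = calendar.monthrange(year, month)
    # Weekday of the last calendar day, 0=Monday ... 6=Sunday.
    last_weekday = (first_weekday + days_in_month - 1) % 7
    # Step back 1 day from Saturday (5) and 2 days from Sunday (6).
    return days_in_month - max(last_weekday - 4, 0)


def first_business_day(year, month):
    """Day-of-month of the first weekday (Mon-Fri) in the given month."""
    first_weekday, _ = calendar.monthrange(year, month)
    # Saturday (5) -> day 3, Sunday (6) -> day 2, otherwise day 1.
    return 1 + (7 - first_weekday if first_weekday > 4 else 0)


def shift_month(d, months, day_opt):
    """Shift `d` by `months` and snap the day according to `day_opt`.

    Hypothetical helper; the option names mirror the PR title.
    """
    # A zero-based month count lets divmod handle the year rollover.
    year, month0 = divmod(d.year * 12 + (d.month - 1) + months, 12)
    month = month0 + 1
    if day_opt == "start":
        day = 1
    elif day_opt == "end":
        day = calendar.monthrange(year, month)[1]
    elif day_opt == "business_start":
        day = first_business_day(year, month)
    elif day_opt == "business_end":
        day = last_business_day(year, month)
    else:
        raise ValueError(f"unknown day_opt: {day_opt!r}")
    return date(year, month, day)
```

For example, `shift_month(date(2017, 11, 25), 1, "business_end")` lands on Friday 2017-12-29, because 31 December 2017 is a Sunday.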

@jbrockmendel

Member

jbrockmendel commented Nov 25, 2017

(will run asv shortly)

from numpy cimport int64_t, int32_t

@jreback

jreback Nov 25, 2017

Contributor

can we call this calendar.pxd ?

@jbrockmendel

jbrockmendel Nov 25, 2017

Member

We should avoid overlap with stdlib names. The thought behind ccalendar is by analogy with chardet --> cchardet

@jreback

jreback Nov 25, 2017

Contributor

i don't think that's actually a problem, I guess ccalendar is ok.

@cython.boundscheck(False)
@cython.cdivision
cdef int dayofweek(int y, int m, int d) nogil:
    """Sakamoto's method, from wikipedia"""

@jreback

jreback Nov 25, 2017

Contributor

can you add the link

@jreback

jreback Nov 25, 2017

Contributor

doc-string
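As background for the docstring request, Sakamoto's method can be sketched in plain Python as follows. This is an illustration rather than the Cython implementation under review; the final shift to a Monday=0 convention (matching `datetime.date.weekday()`) is an assumption about the intended return value.

```python
def dayofweek(y, m, d):
    """Day of the week for a Gregorian date, 0=Monday ... 6=Sunday.

    Uses Sakamoto's method: a 12-entry month offset table plus the usual
    leap-year corrections, with January and February counted as part of
    the previous year.
    """
    month_offsets = [0, 3, 2, 5, 0, 3, 5, 1, 4, 6, 2, 4]
    if m < 3:
        y -= 1
    # Sakamoto's formula gives 0=Sunday ... 6=Saturday.
    sunday_based = (y + y // 4 - y // 100 + y // 400
                    + month_offsets[m - 1] + d) % 7
    # Shift so that Monday is 0 (assumed convention, see lead-in).
    return (sunday_based + 6) % 7
```

The result can be cross-checked against `datetime.date(y, m, d).weekday()` for any valid date.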

@cython.wraparound(False)
@cython.boundscheck(False)
cpdef monthrange(int64_t year, Py_ssize_t month):
    cdef:

@jreback

jreback Nov 25, 2017

Contributor

doc-string

cdef int is_leapyear(int64_t year) nogil:
    """Returns 1 if the given year is a leap year, 0 otherwise."""

@jreback

jreback Nov 25, 2017

Contributor

doc-string

from numpy cimport int64_t, int32_t
cpdef monthrange(int64_t year, Py_ssize_t month)

@jreback

jreback Nov 25, 2017

Contributor

so monthrange is actually used in a few places. are you changing those to use this one? (similarly for the other methods you defined), e.g. it's defined in tslib.pyx now (as well as normalize_date)

@jbrockmendel

jbrockmendel Nov 25, 2017

Member

It is. I figured I'd wait to update the other usages until after you pointed it out, to keep the diff smaller for the first round.

@jreback

jreback Nov 25, 2017

Contributor

no, I'd rather see the far-reaching changes (if any) now. Ideally in a PR you change a small number of things, but clean them up everywhere (e.g. replace usages of monthrange). sure, it's not always possible, but then it's not 'half-done'

# ----------------------------------------------------------------------
# Constants
# Slightly more performant cython lookups than a 2D table

@jreback

jreback Nov 25, 2017

Contributor

then add a comment that the first 12 entries are for non-leap years and the second 12 are for leap years.
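The flat-table layout being discussed can be sketched in plain Python like this. The table and function names other than `is_leapyear` are illustrative assumptions, not the identifiers used in the PR.

```python
# Flat 24-entry table instead of a 2 x 12 table.
# First 12 entries: non-leap years. Last 12 entries: leap years.
DAYS_PER_MONTH = [
    31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31,
    31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31,
]


def is_leapyear(year):
    """Return 1 if `year` is a Gregorian leap year, 0 otherwise."""
    return int((year % 4 == 0 and year % 100 != 0) or year % 400 == 0)


def get_days_in_month(year, month):
    """Number of days in `month` (1-12) of `year` via one flat lookup."""
    # The leap-year flag selects which half of the table to read.
    return DAYS_PER_MONTH[12 * is_leapyear(year) + month - 1]
```

Because `is_leapyear` returns 0 or 1, multiplying it by 12 picks the half of the table to read, so a single index computation replaces a two-dimensional lookup.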

@jbrockmendel
Member

jbrockmendel commented Nov 25, 2017

OK, I've addressed all the comments locally, am running benchmarks (several times) and fixed the travis flake problem. Doing some extra cleanup and parametrization of the tests to make sure the affected offset methods are covered, will update later.

@@ -929,8 +929,9 @@ def name(self):
        if self.isAnchored:
            return self.rule_code
        else:
            month = liboffsets._int_to_month[self.n]

@jreback

jreback Nov 25, 2017

Contributor

prob should de-privatize these in offsets (_int_to_month)

@jbrockmendel
Member

jbrockmendel commented Nov 26, 2017

OK, just confirmed: the existing tests do not hit the new apply_index methods, and the speedup is pretty enormous.

@jbrockmendel jbrockmendel referenced this pull request Nov 26, 2017

Merged

Move frequencies functions to cython #17746

2 of 4 tasks complete
@jreback

Contributor

jreback commented Nov 26, 2017

what is the perf issue here?

@jbrockmendel

Member

jbrockmendel commented Nov 26, 2017

> what is the perf issue here?

I'm still running asvs to try to pin it down. AFAICT something in ccalendar must be less performant than the version in src/datetime

taskset 4 asv continuous -f 1.1 -E virtualenv master HEAD -b offset -b timeseries -b period -b timedelta
[...]
       before           after         ratio
     [be66ef83]       [b244bab9]
+      2.50±0.3ms       4.81±0.4ms     1.93  period.Algorithms.time_drop_duplicates('series')
+     2.81±0.02ms      5.27±0.07ms     1.88  period.Algorithms.time_value_counts('series')
+        398±20ms          460±1ms     1.16  offset.SemiMonthOffset.time_begin_apply_index
+         401±1ms          461±2ms     1.15  offset.SemiMonthOffset.time_end_apply_index
+         406±6ms          458±2ms     1.13  offset.SemiMonthOffset.time_begin_incr_rng
-      22.8±0.5μs      20.4±0.06μs     0.89  timeseries.DatetimeAccessor.time_dt_accessor
-         126±6μs        112±0.3μs     0.89  timeseries.DatetimeIndex.time_unique
-           1.56s            1.38s     0.89  offset.ApplyIndex.time_apply_series(<BusinessYearBegin: month=1>)
-        68.0±3μs       60.1±0.2μs     0.88  period.PeriodUnaryMethods.time_now('min')
-      79.1±0.1μs       68.6±0.2μs     0.87  period.PeriodProperties.time_start_time('M')
-        25.7±2ms         22.0±0ms     0.86  timeseries.DatetimeIndex.time_to_pydatetime
-         180±5ns        154±0.6ns     0.85  timedelta.TimedeltaProperties.time_timedelta_nanoseconds
-           1.61s            1.37s     0.85  offset.ApplyIndex.time_apply_series(<BusinessYearEnd: month=12>)
-     11.3±0.07μs      9.55±0.08μs     0.85  offset.YearBegin.time_timeseries_year_apply
-        24.8±2μs       20.8±0.1μs     0.84  offset.SemiMonthOffset.time_begin_decr
-           25.2s            21.1s     0.84  gil.nogil_datetime_fields.time_period_to_datetime
-      26.9±0.2μs      22.1±0.09μs     0.82  period.PeriodUnaryMethods.time_now('M')
-           1.74s            1.42s     0.82  offset.ApplyIndex.time_apply_series(<BusinessQuarterEnd: startingMonth=3>)
-        1.04±0μs          848±2ns     0.81  period.PeriodProperties.time_minute('min')
-           9.62s            7.76s     0.81  gil.nogil_datetime_fields.time_datetime_to_period
-      9.17±0.3ms      7.31±0.09ms     0.80  timeseries.DatetimeIndex.time_dti_tz_factorize
-        15.3±1ms      12.2±0.06ms     0.80  timeseries.DatetimeIndex.time_infer_freq_none
-           1.46s       14.5±0.3ms     0.01  offset.ApplyIndex.time_apply_series(<BusinessMonthEnd>)
-           1.48s      13.8±0.08ms     0.01  offset.ApplyIndex.time_apply_series(<BusinessMonthBegin>)
-           1.50s      11.9±0.09ms     0.01  offset.ApplyIndex.time_apply_index(<BusinessMonthBegin>)
-           1.54s       12.3±0.2ms     0.01  offset.ApplyIndex.time_apply_index(<BusinessMonthEnd>)

b244bab9 is after updating all the cimports as suggested, except for one in fields that I reverted to try to track down the regression.

@codecov

codecov bot commented Nov 26, 2017

Codecov Report

Merging #18489 into master will decrease coverage by 0.01%.
The diff coverage is 95.23%.

@@            Coverage Diff             @@
##           master   #18489      +/-   ##
==========================================
- Coverage   91.33%   91.32%   -0.02%     
==========================================
  Files         163      163              
  Lines       49801    49773      -28     
==========================================
- Hits        45487    45454      -33     
- Misses       4314     4319       +5
Flag Coverage Δ
#multiple 89.12% <95.23%> (-0.02%) ⬇️
#single 40.72% <28.57%> (-0.05%) ⬇️
Impacted Files Coverage Δ
pandas/tseries/offsets.py 96.94% <95.23%> (-0.09%) ⬇️
pandas/core/indexes/interval.py 92.69% <0%> (-0.53%) ⬇️
pandas/core/indexes/datetimes.py 95.52% <0%> (-0.2%) ⬇️
pandas/core/generic.py 95.73% <0%> (-0.06%) ⬇️
pandas/core/indexes/base.py 96.4% <0%> (-0.02%) ⬇️
pandas/core/frame.py 97.8% <0%> (-0.01%) ⬇️
pandas/core/strings.py 98.46% <0%> (-0.01%) ⬇️
pandas/core/internals.py 94.47% <0%> (-0.01%) ⬇️
pandas/core/panel.py 97.14% <0%> (+0.28%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d101064...d3ff628. Read the comment docs.

@codecov

codecov bot commented Nov 26, 2017

Codecov Report

Merging #18489 into master will decrease coverage by 50.56%.
The diff coverage is 28.57%.

@@             Coverage Diff             @@
##           master   #18489       +/-   ##
===========================================
- Coverage   91.33%   40.77%   -50.57%     
===========================================
  Files         163      163               
  Lines       49801    49796        -5     
===========================================
- Hits        45487    20304    -25183     
- Misses       4314    29492    +25178
Flag Coverage Δ
#multiple ?
#single 40.77% <28.57%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/tseries/offsets.py 44.04% <28.57%> (-52.99%) ⬇️
pandas/tools/hashing.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/io/formats/style.py 0% <0%> (-100%) ⬇️
pandas/parser.py 0% <0%> (-100%) ⬇️
pandas/lib.py 0% <0%> (-100%) ⬇️
pandas/io/json/json.py 0% <0%> (-100%) ⬇️
pandas/types/concat.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.08%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.28%) ⬇️
... and 112 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d101064...c503c53. Read the comment docs.

@jbrockmendel

Member

jbrockmendel commented Nov 26, 2017

Separating the ccalendar stuff from the new apply_index and onOffset methods. The newly-pushed commit should be an unambiguous win.

asv continuous -f 1.1 -E virtualenv master HEAD -b offset -b period -b timedelta -b timeseries
[...]
     [d1010643]       [d3ff6286]
+     27.5±0.09μs         36.5±2μs     1.33  offset.CBDayHolidays.time_custom_bday_cal_decr
+         499±1ms         612±20ms     1.23  offset.ApplyIndex.time_apply_series(<QuarterEnd: startingMonth=3>)
+     18.5±0.04μs       22.2±0.8μs     1.20  offset.Day.time_timeseries_day_apply
+         467±2ms          554±2ms     1.19  offset.SemiMonthOffset.time_end_apply_index
+      22.1±0.1μs       25.0±0.2μs     1.13  offset.SemiMonthOffset.time_end_decr_n
+      24.3±0.1μs       27.5±0.1μs     1.13  offset.CBDayHolidays.time_custom_bday_cal_incr_n
+         911±4ns      1.02±0.06μs     1.12  period.PeriodProperties.time_is_leap_year('M')
+     21.0±0.09μs         23.4±1μs     1.11  period.PeriodUnaryMethods.time_now('M')
+       107±0.2ns         118±10ns     1.10  timedelta.TimedeltaProperties.time_timedelta_seconds
-           1.73s            1.53s     0.89  timeseries.Iteration.time_iter_periodindex
-     3.93±0.04ms      3.47±0.02ms     0.88  period.PeriodIndexConstructor.time_from_pydatetime('D')
-     7.60±0.07μs      6.60±0.02μs     0.87  timeseries.DatetimeIndex.time_timestamp_tzinfo_cons
-           1.62s            1.41s     0.87  offset.ApplyIndex.time_apply_series(<BusinessYearEnd: month=12>)
-         530±9μs        459±0.6μs     0.87  period.Algorithms.time_value_counts('index')
-         536±7ms          458±3ms     0.85  offset.SemiMonthOffset.time_begin_apply_index
-      1.03±0.1μs          866±7ns     0.84  period.PeriodProperties.time_day('min')
-        133±10μs       111±0.05μs     0.83  timeseries.DatetimeIndex.time_unique
-     1.05±0.04μs          873±1ns     0.83  period.PeriodProperties.time_dayofyear('min')
-      20.1±0.3μs      16.4±0.09μs     0.82  timeseries.AsOf.time_asof_single_early
-      4.39±0.1ms      3.31±0.01ms     0.75  timeseries.ToDatetime.time_cache_true_with_unique_seconds_and_unit
-           24.0s            14.2s     0.59  gil.nogil_datetime_fields.time_datetime_to_period
-           1.46s       20.8±0.8ms     0.01  offset.ApplyIndex.time_apply_index(<BusinessMonthEnd>)
-           1.55s       20.9±0.1ms     0.01  offset.ApplyIndex.time_apply_series(<BusinessMonthBegin>)
-           1.61s       19.6±0.4ms     0.01  offset.ApplyIndex.time_apply_index(<BusinessMonthBegin>)
-           1.79s       21.6±0.2ms     0.01  offset.ApplyIndex.time_apply_series(<BusinessMonthEnd>)
-         668±2μs      6.93±0.02μs     0.01  offset.OnOffset.time_on_offset(<BusinessQuarterBegin: startingMonth=3>)
-         628±2μs      6.10±0.04μs     0.01  offset.OnOffset.time_on_offset(<QuarterBegin: startingMonth=3>)

Running again. The 100x improvements are real. The slowdowns I expect to be noise.

@jreback

Contributor

jreback commented Nov 26, 2017

what path was BusinessMonthEnd for example taking before that made it a lot slower? all in python?

@jbrockmendel

Member

jbrockmendel commented Nov 26, 2017

The base class apply_index raises, at which point it falls back to pointwise addition.
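The fallback described here can be sketched in a few lines of plain Python. Every class and function name below is a hypothetical stand-in, not pandas API. The sketch shows only the dispatch: in pandas the vectorized branch would operate on whole arrays, which is where a speedup of the size reported above would come from.

```python
from datetime import date, timedelta


class Offset:
    """Hypothetical base class with no vectorized path."""

    def __init__(self, n=1):
        self.n = n

    def apply(self, d):
        raise NotImplementedError

    def apply_index(self, dates):
        # Subclasses without a vectorized path inherit this.
        raise NotImplementedError("no vectorized implementation")


class DayOffset(Offset):
    """Implements only the scalar path."""

    def apply(self, d):
        return d + timedelta(days=self.n)


class FastDayOffset(DayOffset):
    """Also implements the batch path."""

    def apply_index(self, dates):
        delta = timedelta(days=self.n)
        return [d + delta for d in dates]


def add_offset(dates, offset):
    """Try the batch path first, else fall back to one call per element."""
    try:
        return offset.apply_index(dates), "vectorized"
    except NotImplementedError:
        return [offset.apply(d) for d in dates], "pointwise"
```

Both offsets give the same dates; only the path taken differs, which is why a missing `apply_index` shows up as a performance problem rather than a wrong result.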

@jreback jreback added the Performance label Nov 27, 2017

@jreback jreback added this to the 0.22.0 milestone Nov 27, 2017

@jreback

Contributor

jreback commented Nov 27, 2017

this needs a performance note in the whatsnew for 0.22.0. ping when pushed, as this is green already.

@jreback

Contributor

jreback commented Nov 27, 2017

add an additional line (even though you have one for #18218). in general any PR that is perf related should get an entry (or you can include that PR number on an existing entry). cleaning/reorg PRs that don't touch the user don't need one.

@jbrockmendel

Member

jbrockmendel commented Nov 27, 2017

ping

@jreback

Contributor

jreback commented Nov 27, 2017

thanks!

@jreback jreback merged commit f745e52 into pandas-dev:master Nov 27, 2017

0 of 3 checks passed

ci/circleci Your tests are queued behind your running builds
Details
continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
Details
continuous-integration/travis-ci/pr The Travis CI build is in progress
Details

@jbrockmendel jbrockmendel referenced this pull request Nov 28, 2017

Merged

standalone implementation of ccalendar #18540

0 of 4 tasks complete

@jbrockmendel jbrockmendel deleted the jbrockmendel:tslibs-offsets6 branch Dec 8, 2017
