Custom-business-days offsets very slow #6584

Closed
bjonen opened this issue Mar 10, 2014 · 17 comments · Fixed by #6592
Labels
Datetime (Datetime data dtype), Frequency (DateOffsets), Performance (Memory or execution speed performance)

@bjonen
Contributor

bjonen commented Mar 10, 2014

Custom business days are currently significantly slower (by roughly a factor of 4) than pd.offsets.BusinessDay(), even without specifying any custom holidays:

import datetime as dt

import pandas as pd

date = dt.datetime(2011,1,1)
cday = pd.offsets.CustomBusinessDay()
%timeit date + pd.offsets.BusinessDay() #  6.59 µs
%timeit date + cday                     #  26.1 µs

Profiling pd.offsets.CustomBusinessDay.apply shows that only around 13% of the time is spent in np.busday_offset. The majority of time is spent casting the dates from datetime to datetime64 etc.

I'm not so familiar with the code but one idea would be to work in datetime by default and try to stick with it as much as possible. The method could then look something like this:

    def apply(self, other):
        if not isinstance(other, dt.datetime):
            other = other.astype(dt.datetime)

        if self.n <= 0:
            roll = 'forward'
        else:
            roll = 'backward'

        dt_str = other.strftime('%Y-%m-%d')
        result = np.busday_offset(dt_str,self.n,roll=roll,busdaycal=self.busdaycalendar)
        result = result.astype(dt.datetime)
        if not self.normalize:
            result = dt.datetime.combine(result,other.time())

        if self.offset:
            result = result + self.offset

        return result

While this might not be a perfect comparison because I left out some conversion code, the changes yield a sizable speedup.

%timeit date + cday      # 14.9 µs

Ultimately I would like to have a UsBday offset. The code looks like this:

import datetime as dt

import numpy as np
import pandas as pd

class UsBday(object):
    def __init__(self):
        self.gen_cal()

    def gen_cal(self):
        holidays = []
        for hlday in ['new_years','mlk_day',
                      'presidents_day','good_friday',
                      'memorial_day','independence_day',
                      'labor_day','thanksgiving','xmas_holiday']:
            hlday_func = getattr(self,hlday)
            tmp_holidays = [ hlday_func(year) for year in 
                             range(1950,dt.datetime.today().year+2) ]
            holidays.extend(tmp_holidays)
        self.cal = np.busdaycalendar(holidays=holidays)
        self.bday_offset = pd.offsets.CustomBusinessDay(holidays=holidays)

    def nxt_bday(self,dt):
        nxt_bday = np.busday_offset(
            dt.strftime('%Y-%m-%d'),offsets=0,
            roll='forward',busdaycal=self.cal)
        return nxt_bday

    # - holiday definitions - #
    @staticmethod
    def new_years(year):
        cand = dt.datetime(year,1,1)
        dt_str = cand.strftime('%Y-%m-%d')
        res = np.busday_offset(dt_str,offsets=0,roll='forward')
        return res

    @staticmethod
    def mlk_day(year):
        # third monday in January
        dt_str = dt.datetime(year,1,1).strftime('%Y-%m-%d')
        res = np.busday_offset(dt_str, 2, roll='forward', weekmask='Mon')
        return res

    @staticmethod
    def presidents_day(year):
        # third monday February
        dt_str = dt.datetime(year,2,1).strftime('%Y-%m-%d')
        res = np.busday_offset(dt_str, 2, roll='forward', weekmask='Mon')
        return res

    @staticmethod
    def good_friday(year):
        from dateutil.easter import easter
        easter_sun = easter(year)
        gr_fr = easter_sun - 2 * pd.offsets.Day()
        return gr_fr.strftime('%Y-%m-%d')

    @staticmethod
    def memorial_day(year):
        # final Monday of May
        # roll June 1 forward to a Monday, then step back one Monday
        dt_str = dt.datetime(year,6,1).strftime('%Y-%m-%d')
        res = np.busday_offset(dt_str, -1, roll='forward', weekmask='Mon')
        return res

    @staticmethod
    def independence_day(year):
        # July 4th
        cand = dt.datetime(year,7,4).strftime('%Y-%m-%d')
        return cand

    @staticmethod
    def labor_day(year):
        # first Monday in September
        res = np.busday_offset(str(year) + '-09',offsets=0,
                               roll='forward',weekmask='Mon')
        return res

    @staticmethod
    def thanksgiving(year):
        # fourth thursday in november
        # first Thursday on or after Nov 1, plus three more Thursdays
        res = np.busday_offset(str(year) + '-11',offsets=3,
                               roll='forward',weekmask='Thu')
        return res

    @staticmethod
    def xmas_holiday(year):
        cand = dt.datetime(year,12,25)
        while not np.is_busday(cand.strftime('%Y-%m-%d')):
            if cand.weekday() == 6:
                cand += pd.offsets.BDay()
            elif cand.weekday() == 5:
                cand -= pd.offsets.BDay()
            else:
                cand += pd.offsets.BDay()
        return cand.strftime('%Y-%m-%d')

us_bday = UsBday()
us_offset = us_bday.bday_offset
%timeit date + us_offset       # 26.1 µs

# Using numpy directly 
%timeit us_bday.nxt_bday(date) # 7.85 µs

My first intuition when I noticed that custom business days are slower was that this is due to the large list of holidays passed to numpy. The timings at the end of the code block, however, show that adding a custom business day with realistic holidays does not alter the performance by much. The main speed difference results from interfacing with numpy and is therefore a Pandas issue.

I know that CustomBusinessDays is in experimental mode. I hope this feedback can help improve it because I think it is an important feature in Pandas. Also perhaps it would be nice to ship certain custom calendars, for example for the US, directly with Pandas.

@jreback
Contributor

jreback commented Mar 10, 2014

this was contributed several pandas releases ago and has not had many comments...

you always want to work in np.datetime64, as it's just integer-based and is MUCH faster than datetimes. That said, I haven't really looked at this in detail, and I am sure perf could be improved.

Pls profile and see if you can figure out where and submit a PR!

@jreback jreback added this to the 0.15.0 milestone Mar 10, 2014
@jreback
Contributor

jreback commented Mar 10, 2014

#5148 might have an impact on this as well

@bjonen
Contributor Author

bjonen commented Mar 10, 2014

Ok thanks, I'll have a look at it.

@rockg
Contributor

rockg commented Mar 10, 2014

And I think we shouldn't have a UsBday offset, but rather start implementing calendars that can be added to the BusinessDay offset. This way we don't have a ton of date classes lying around.

Further, it would be nice to have some rule factory that would take generic rules and be able to develop the offset from that. For example, the below are rules that I have for US holidays (Code is relative to a date composed of Month, Day). I implemented a different offset scheme where 0, -1, +1 actually have some meaning and 3 day means to make it a 3 day weekend where Saturday means a Friday holiday and Sunday means a Monday holiday. So -1d+3Mon will be interpreted as the 3rd Monday from the reference date. From this it's very easy to create a holiday curve by just going through each rule and each year. This way there aren't many holiday functions, but rules in a simple table. What do you think?

usdRules = [ { 'Month': 1, 'Day': 1,  'Code': '3day', 'Name': 'New Years Day'},
            { 'Month': 1, 'Day': 1, 'Code': '-1d+3Mon', 'Name': 'Dr. Martin Luther King, Jr.'}, 
            { 'Month': 2, 'Day': 1,  'Code': '-1d+3Mon', 'Name': "President's Day"},
            { 'Month': 5, 'Day': 24, 'Code': '+1Mon',     'Name': 'Memorial Day'},
            { 'Month': 7, 'Day': 4,  'Code': '3day',    'Name': 'July 4th'},
            { 'Month': 9, 'Day': 1,  'Code': '-1d+1Mon', 'Name': 'Labor Day'},
            { 'Month': 10, 'Day': 1, 'Code': '-1d+2Mon', 'Name': 'Columbus Day'},
            { 'Month': 11, 'Day': 11, 'Code': '3day',    'Name': 'Veterans Day'},
            { 'Month': 11,'Day': 1,  'Code': '-1d+4Thu', 'Name': 'Thanksgiving'},
            { 'Month': 12,'Day': 25, 'Code': '3day', 'Name': 'Christmas'}
            ]
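
One possible reading of these rule codes, sketched with the standard library only. This is not rockg's implementation: `apply_rule` and `holidays_for_year` are hypothetical names, and the sketch assumes that `+NDay` means "the N-th such weekday strictly after the (shifted) reference date" and that `3day` moves a Saturday holiday to Friday and a Sunday holiday to Monday.

```python
import datetime as dt
import re

WEEKDAYS = {"Mon": 0, "Tue": 1, "Wed": 2, "Thu": 3, "Fri": 4, "Sat": 5, "Sun": 6}
# Optional day shift such as "-1d", then "+<n><weekday>", e.g. "-1d+3Mon".
RULE_RE = re.compile(r"(?:([+-]?\d+)d)?\+(\d+)([A-Z][a-z]{2})")


def apply_rule(ref, code):
    """Resolve one rule code relative to the reference date `ref`."""
    if code == "3day":
        # Observe Saturday holidays on Friday and Sunday holidays on Monday.
        if ref.weekday() == 5:
            return ref - dt.timedelta(days=1)
        if ref.weekday() == 6:
            return ref + dt.timedelta(days=1)
        return ref
    match = RULE_RE.fullmatch(code)
    if match is None:
        raise ValueError(f"unsupported rule code: {code!r}")
    day_shift, nth, weekday = match.groups()
    start = ref + dt.timedelta(days=int(day_shift or 0))
    # Days until the first requested weekday strictly after `start` (1..7).
    first = (WEEKDAYS[weekday] - start.weekday() - 1) % 7 + 1
    return start + dt.timedelta(days=first + 7 * (int(nth) - 1))


def holidays_for_year(rules, year):
    """Apply every rule to its (Month, Day) reference date in `year`."""
    return sorted(
        apply_rule(dt.date(year, r["Month"], r["Day"]), r["Code"]) for r in rules
    )


rules = [
    {"Month": 1, "Day": 1, "Code": "3day"},        # New Year's Day
    {"Month": 1, "Day": 1, "Code": "-1d+3Mon"},    # third Monday of January
    {"Month": 5, "Day": 24, "Code": "+1Mon"},      # last Monday of May
    {"Month": 11, "Day": 1, "Code": "-1d+4Thu"},   # fourth Thursday of November
]
holidays_2012 = holidays_for_year(rules, 2012)
```

Under these assumptions the four rules resolve to January 2, January 16, May 28 and November 22 for 2012. Keeping the rules as plain data means a new calendar needs only a new table, not new functions.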

@jreback
Contributor

jreback commented Mar 10, 2014

It's actually a bit more complicated: there are some holidays that are year dependent, e.g. presidents' deaths.

This could certainly be implemented via a Holiday type of class that could be registered and then used by BusinessDay (and alleviate the need for CustomBusinessDay), or that could be the impl.

@rockg
Contributor

rockg commented Mar 10, 2014

That could be easily supported by having an optional start/end date parameter (for example, if a holiday changes date or for the example you describe) or if it's only in a given year, have an optional year field.
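
A minimal sketch of how such optional fields might look, using only the standard library. `HolidayRule`, `start_year`, `end_year` and `only_year` are hypothetical names chosen for illustration, not part of pandas or of the code discussed in this thread, and the holidays in the usage lines are placeholders.

```python
import datetime as dt
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HolidayRule:
    """A fixed-date holiday with an optional validity window (hypothetical)."""
    name: str
    month: int
    day: int
    start_year: Optional[int] = None   # first year the rule applies, if bounded
    end_year: Optional[int] = None     # last year the rule applies, if bounded
    only_year: Optional[int] = None    # set for one-off closures

    def applies_to(self, year):
        if self.only_year is not None:
            return year == self.only_year
        if self.start_year is not None and year < self.start_year:
            return False
        if self.end_year is not None and year > self.end_year:
            return False
        return True

    def date_in(self, year):
        """Return the holiday date in `year`, or None if the rule is inactive."""
        if not self.applies_to(year):
            return None
        return dt.date(year, self.month, self.day)


# Placeholder rules, for illustration only.
fixed = HolidayRule("Example fixed holiday", 7, 4)
recent = HolidayRule("Example holiday introduced later", 6, 19, start_year=2021)
one_off = HolidayRule("Example one-off closure", 1, 2, only_year=2007)
```

A rule that changes date over time could then be expressed as two entries with adjacent, non-overlapping year ranges.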

@jreback
Contributor

jreback commented Mar 10, 2014

sure...you can do lots with holiday!

need someone to write the Holidays class (and supporting machinery), to integrate with BusinessDay/CustomBusinessDay. A BusinessDay is really just a set of custom holidays (though impl makes it easier to use a weekday filter), but same idea.
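
To illustrate the idea that a business-day offset is just a weekday filter plus a holiday set, here is a minimal sketch using only NumPy's `np.busdaycalendar` and `np.busday_offset`. The variable names are illustrative, and the single Christmas entry is a sample holiday, not a complete calendar.

```python
import numpy as np

# Monday-Friday filter with no holidays: the plain business-day case.
plain = np.busdaycalendar(weekmask="1111100")

# Same weekday filter plus one holiday: the custom business-day case.
with_xmas = np.busdaycalendar(weekmask="1111100", holidays=["2012-12-25"])

start = np.datetime64("2012-12-24")  # a Monday

next_plain = np.busday_offset(start, 1, roll="forward", busdaycal=plain)
next_custom = np.busday_offset(start, 1, roll="forward", busdaycal=with_xmas)
```

Here `next_plain` is 2012-12-25, while `next_custom` skips the holiday and lands on 2012-12-26.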

@rockg
Contributor

rockg commented Mar 10, 2014

I have most of this done. I can commit what I have and we can go from there. @bjonen is that okay?

@jreback
Contributor

jreback commented Mar 10, 2014

@rockg awesome!

can you show an example?

@rockg
Contributor

rockg commented Mar 10, 2014

Here are some:

In [1]: import HolidayCalendar, ApplyOffsetRule

In [2]: cal = HolidayCalendar('US')

In [3]: import datetime

In [7]: holidays = cal.holidayCurve(datetime.datetime(2012,1,1), datetime.datetime(2012,12,31))

In [8]: for h in holidays:
   ...:     print h
   ...:    
2012-01-02 00:00:00
2012-01-16 00:00:00
2012-02-20 00:00:00
2012-05-28 00:00:00
2012-07-04 00:00:00
2012-09-03 00:00:00
2012-10-08 00:00:00
2012-11-12 00:00:00
2012-11-22 00:00:00
2012-12-25 00:00:00

In [10]: ApplyOffsetRule(datetime.datetime(2012,12,24), '+1b', 'US')
Out[10]: datetime.datetime(2012, 12, 26, 0, 0)

In [11]: ApplyOffsetRule(datetime.datetime(2012,12,24), '+1b')
Out[11]: datetime.datetime(2012, 12, 25, 0, 0)

ApplyOffsetRule parses the date rules to create the holidays, and it can also apply holidays to other rules such as +1CustomBusinessDay ('+1b' above). HolidayCalendar('US') is a stored object that just contains the rules as an attribute. We can very easily pass this into CustomBusinessDay(calendar=HolidayCalendar('US')) or CustomBusinessDay(calendar=USHolidayCalendar).

@bjonen
Contributor Author

bjonen commented Mar 10, 2014

I investigated the issue a bit further. We have to deal with two cases, incrementing either

  1. np.datetime64 -> np.datetime64 or
  2. dt.datetime -> dt.datetime

np_dt = np.datetime64('now')
dt_dt = np_dt.astype(dt.datetime)

# ad 1)
def np_dt_incr(np_dt):
    np_day = np_dt.astype('datetime64[D]')
    np_time = np_dt - np_day
    np_day_incr = np.busday_offset(np_day,1)
    np_dt_incr = np_day_incr + np_time
    return np_dt_incr

np_dt_incr(np_dt)
Out[133]: numpy.datetime64('2014-03-11T16:46:01+0100')

%timeit np_dt_incr(np_dt)
100000 loops, best of 3: 13.8 µs per loop

# ad 2)
def datetime_dt_incr(date):
    np_dt = np.datetime64(date.date())
    np_incr_dt = np.busday_offset(np_dt,1)
    dt_date = np_incr_dt.astype(dt.datetime)
    dt_full = dt.datetime.combine(dt_date,date.time())
    return dt_full
datetime_dt_incr(dt_dt)
Out[135]: datetime.datetime(2014, 3, 11, 15, 46, 1)

%timeit datetime_dt_incr(dt_dt)
100000 loops, best of 3: 8.06 µs per loop

So if we split up the two cases we get around 14 µs with numpy dates and 8 µs with datetime.

The reason is that subtracting and then adding back time in `np_dt_incr` is very costly. I looked around but couldn't find a better way to do this. That means, right now it is better to stick to datetime to do the time adding and subtracting. 

Maybe we should have a separate thread regarding the introduction of holiday calendars. 

@jreback
Contributor

jreback commented Mar 10, 2014

I never got why the back-and-forth between np.datetime64 and datetimes here...

I didn't really look into the details. But if datetimes are faster then use them. Just make sure it passes the current tests!

pls submit a PR when ready

@bjonen
Contributor Author

bjonen commented Mar 10, 2014

The reason is np.busday_offset only takes 'datetime64[D]' as input.
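
A short sketch of that constraint, assuming a reasonably recent NumPy: a timestamp with finer-than-day resolution is rejected, so the time of day has to be split off before the call and added back afterwards. The variable names are illustrative.

```python
import numpy as np

stamp = np.datetime64("2014-03-10T15:46:01")  # second resolution, a Monday

# Passing a non-day resolution directly fails the safe cast to datetime64[D].
try:
    np.busday_offset(stamp, 1)
    accepted = True
except TypeError:
    accepted = False

# Workaround: truncate to the day, shift, then restore the time of day.
day = stamp.astype("datetime64[D]")
time_of_day = stamp - day                  # a timedelta64 in seconds
result = np.busday_offset(day, 1) + time_of_day
```

The subtraction and re-addition of `time_of_day` are the extra steps that make the pure-NumPy path more expensive than it first appears.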

@jreback
Contributor

jreback commented Mar 10, 2014

ahh ok.....so maybe convert to and from that (that's what it's doing, I guess, then).....

@jreback
Contributor

jreback commented Mar 10, 2014

In [1]: ts = Timestamp('20130101')

In [2]: np.array(ts.value,dtype='datetime64[D]')
Out[2]: array(1356998400000000000L, dtype='datetime64[D]')

In [3]: %timeit np.array(ts.value,dtype='datetime64[D]')
1000000 loops, best of 3: 930 ns per loop

In [5]: Timestamp(np.array(ts.value,dtype='datetime64[D]').item())
Out[5]: Timestamp('2013-01-01 00:00:00', tz=None)

In [6]: %timeit Timestamp(np.array(ts.value,dtype='datetime64[D]').item())
100000 loops, best of 3: 3.62 µs per loop

not 100% sure what conversions are happening low-level, but you can
prob get around using astype and just directly manipulate the values
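
One way the "directly manipulate the values" idea could look, as a sketch rather than what pandas does: treat the timestamp as nanoseconds since the epoch (like `Timestamp.value`), split it into a whole-day count and a remainder with integer arithmetic, and hand only the day count to `np.busday_offset`. The function name `add_business_days_ns` is hypothetical, and time zones are ignored.

```python
import numpy as np

NS_PER_DAY = 86_400 * 10**9


def add_business_days_ns(value_ns, n):
    """Shift an epoch-nanosecond timestamp by n business days (hypothetical)."""
    # Integer split into whole days since the epoch and time of day in ns.
    days, time_ns = divmod(value_ns, NS_PER_DAY)
    day64 = np.datetime64(days, "D")
    # Same roll convention as the apply method quoted earlier in the thread.
    roll = "forward" if n <= 0 else "backward"
    shifted = np.busday_offset(day64, n, roll=roll)
    # Back to integer days, then reattach the time of day.
    return int(shifted.astype("int64")) * NS_PER_DAY + time_ns


# 2013-01-01 12:00:00 (a Tuesday) expressed as epoch nanoseconds.
noon_ns = (15706 * 86_400 + 12 * 3_600) * 10**9
next_ns = add_business_days_ns(noon_ns, 1)
```

Whether this is actually faster than the datetime-based path would need to be measured; the point is only that no `astype` round trip through `datetime` objects is required.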

@bjonen
Contributor Author

bjonen commented Mar 14, 2014

@rockg Regarding the holiday calendar. Your approach works for me.

I looked around a bit for existing calendars that we could use (e.g. http://www.mozilla.org/en-US/projects/calendar/holidays/). We could extract the holidays from the .ical files without having to worry about the exact rules. However, most of the calendars do not go back very far, and that matters for most of my use cases.

@cancan101
Contributor

Not in an ideal format, but this data goes back a long time: http://www.nyse.com/pdfs/closings.pdf

4 participants