PERF: making DatetimeIndex.date more performant #18058

Closed
jreback opened this Issue Nov 1, 2017 · 2 comments

@jreback
Contributor

commented Nov 1, 2017

We can substantially speed up DatetimeIndex.date with a small tweak in the code (idea from a Stack Overflow question):

In [44]: rng = pd.date_range('2000-04-03', periods=200000, freq='2H')

In [45]: %timeit rng.date
480 ms ± 12.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [46]: %timeit rng.normalize().to_pydatetime()
94.7 ms ± 1.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [47]: rng.normalize().to_pydatetime()
Out[47]: 
array([datetime.datetime(2000, 4, 3, 0, 0),
       datetime.datetime(2000, 4, 3, 0, 0),
       datetime.datetime(2000, 4, 3, 0, 0), ...,
       datetime.datetime(2045, 11, 19, 0, 0),
       datetime.datetime(2045, 11, 19, 0, 0),
       datetime.datetime(2045, 11, 19, 0, 0)], dtype=object)

In [48]: rng.date
Out[48]: 
array([datetime.date(2000, 4, 3), datetime.date(2000, 4, 3),
       datetime.date(2000, 4, 3), ..., datetime.date(2045, 11, 19),
       datetime.date(2045, 11, 19), datetime.date(2045, 11, 19)], dtype=object)

So [47] and [48] are nearly identical; the only difference is that [47] holds datetime.datetime objects and [48] holds datetime.date objects.

If we allowed ints_to_pydatetime to create date objects (this just needs a simple function pointer), around https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/tslib.pyx#L140, then this would work. In other words:

@property
def date(self):
    return self.normalize().to_pydate()

where .to_pydate() is essentially .to_pydatetime() with an additional argument, say kind='date', that ints_to_pydatetime would handle by creating date rather than datetime objects.

This bypasses the Python-level iteration over the index, which creates many intermediate Python objects.
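To make the "function pointer" idea concrete, here is a minimal pure-Python sketch of the dispatch, not the actual Cython code. The names ints_to_py, _make_date and _make_datetime are hypothetical, None stands in for NaT, and the input is assumed to be nanoseconds since the epoch with the minimum int64 as the NaT sentinel:

```python
from datetime import date, datetime, timedelta

EPOCH = datetime(1970, 1, 1)
NAT = -2**63  # sentinel for a missing value (minimum int64)


def _make_datetime(dt):
    # Hypothetical helper: keep the full datetime.
    return dt


def _make_date(dt):
    # Hypothetical helper: keep only the calendar date.
    return date(dt.year, dt.month, dt.day)


def ints_to_py(values, kind="datetime"):
    """Convert epoch nanoseconds to datetime or date objects."""
    # Choose the constructor once, outside the loop, instead of
    # re-checking `kind` for every element.
    if kind == "date":
        make = _make_date
    elif kind == "datetime":
        make = _make_datetime
    else:
        raise ValueError("kind must be 'date' or 'datetime'")

    result = []
    for value in values:
        if value == NAT:
            result.append(None)  # stand-in for NaT in this sketch
        else:
            # Sub-microsecond precision is dropped here.
            dt = EPOCH + timedelta(microseconds=value // 1000)
            result.append(make(dt))
    return result
```

The point is only that the per-element work stays the same for both kinds; the choice between date and datetime is made a single time before the loop.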

@jreback jreback added this to the Next Major Release milestone Nov 1, 2017

@jreback jreback changed the title PERF: makeing DatetimeIndex.date more performant PERF: making DatetimeIndex.date more performant Nov 1, 2017

@tmnhat2001

Contributor

commented Nov 7, 2017

Here are the changes made following the steps outlined above. In datetimes.py:

@property
def date(self):
    """
    Returns numpy array of python datetime.date objects (namely, the date
    part of Timestamps without timezone information).
    """
    return self._maybe_mask_results(libalgos.arrmap_object(
        self.asobject.values, lambda x: x.date()))

@property
def date_new(self):
    return self.normalize().to_pydate()

def to_pydate(self):
    return libts.ints_to_pydatetime(self.asi8, kind="date")

In tslib.pyx, I added a function that creates the date objects and a check that selects which constructor function to use; tz is considered only if kind is not "date":

cdef inline object create_date_from_ts(
        int64_t value, pandas_datetimestruct dts,
        object tz, object freq):
    """ convenience routine to construct a datetime.date from its parts """
    return date(dts.year, dts.month, dts.day)

def ints_to_pydatetime(ndarray[int64_t] arr, tz=None, freq=None, box=False, kind="datetime"):
    # convert an i8 repr to an ndarray of datetimes or Timestamp (if box ==
    # True)

    cdef:
        Py_ssize_t i, n = len(arr)
        ndarray[int64_t] trans, deltas
        pandas_datetimestruct dts
        object dt
        int64_t value
        ndarray[object] result = np.empty(n, dtype=object)
        object (*func_create)(int64_t, pandas_datetimestruct, object, object)

    if kind == "date":
        func_create = create_date_from_ts
    else:
        if box and is_string_object(freq):
            from pandas.tseries.frequencies import to_offset
            freq = to_offset(freq)

        if box:
            func_create = create_timestamp_from_ts
        else:
            func_create = create_datetime_from_ts

    if tz is not None and kind != "date":
        if is_utc(tz):
            for i in range(n):
                value = arr[i]
                if value == NPY_NAT:
                    result[i] = NaT
                else:
                    dt64_to_dtstruct(value, &dts)
                    result[i] = func_create(value, dts, tz, freq)
        elif is_tzlocal(tz) or is_fixed_offset(tz):
            for i in range(n):
                value = arr[i]
                if value == NPY_NAT:
                    result[i] = NaT
                else:
                    dt64_to_dtstruct(value, &dts)
                    dt = create_datetime_from_ts(value, dts, tz, freq)
                    dt = dt + tz.utcoffset(dt)
                    if box:
                        dt = Timestamp(dt)
                    result[i] = dt
        else:
            trans, deltas, typ = get_dst_info(tz)

            for i in range(n):

                value = arr[i]
                if value == NPY_NAT:
                    result[i] = NaT
                else:

                    # Adjust datetime64 timestamp, recompute datetimestruct
                    pos = trans.searchsorted(value, side='right') - 1
                    if treat_tz_as_pytz(tz):
                        # find right representation of dst etc in pytz timezone
                        new_tz = tz._tzinfos[tz._transition_info[pos]]
                    else:
                        # no zone-name change for dateutil tzs - dst etc
                        # represented in single object.
                        new_tz = tz

                    dt64_to_dtstruct(value + deltas[pos], &dts)
                    result[i] = func_create(value, dts, new_tz, freq)
    else:
        for i in range(n):

            value = arr[i]
            if value == NPY_NAT:
                result[i] = NaT
            else:
                dt64_to_dtstruct(value, &dts)
                result[i] = func_create(value, dts, None, freq)

    return result
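One case that may be worth a test before this goes into a PR: as written, to_pydate passes self.asi8 with no tz, and the tz branch is skipped when kind is "date". If the integer values of a tz-aware index are UTC-based, the resulting dates would then be UTC calendar dates rather than local ones. The following stdlib-only sketch shows how the two can differ; it does not touch pandas, and the UTC+5 offset is an arbitrary choice for illustration:

```python
from datetime import datetime, timedelta, timezone

# Local midnight in a zone five hours ahead of UTC (arbitrary example offset).
local_tz = timezone(timedelta(hours=5))
local_midnight = datetime(2000, 4, 3, 0, 0, tzinfo=local_tz)

# The same instant expressed in UTC falls on the previous calendar day.
utc_view = local_midnight.astimezone(timezone.utc)

local_date = local_midnight.date()
utc_date = utc_view.date()

print(local_date)  # 2000-04-03
print(utc_date)    # 2000-04-02
```

If that reading of the code is right, converting to local wall-clock values before building the date objects would avoid the mismatch.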

A quick comparison using timeit:

In [1]: import pandas as pd

In [2]: rng = pd.date_range('2000-04-03', periods=200000, freq='2H')

In [3]: %timeit rng.date
555 ms ± 14.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit rng.date_new
90.4 ms ± 5.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: %timeit rng.normalize().to_pydatetime()
121 ms ± 3.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: rng.date
Out[6]:
array([datetime.date(2000, 4, 3), datetime.date(2000, 4, 3),
       datetime.date(2000, 4, 3), ..., datetime.date(2045, 11, 19),
       datetime.date(2045, 11, 19), datetime.date(2045, 11, 19)], dtype=object)

In [7]: rng.date_new
Out[7]:
array([datetime.date(2000, 4, 3), datetime.date(2000, 4, 3),
       datetime.date(2000, 4, 3), ..., datetime.date(2045, 11, 19),
       datetime.date(2045, 11, 19), datetime.date(2045, 11, 19)], dtype=object)

In [8]: rng.normalize().to_pydatetime()
Out[8]:
array([datetime.datetime(2000, 4, 3, 0, 0),
       datetime.datetime(2000, 4, 3, 0, 0),
       datetime.datetime(2000, 4, 3, 0, 0), ...,
       datetime.datetime(2045, 11, 19, 0, 0),
       datetime.datetime(2045, 11, 19, 0, 0),
       datetime.datetime(2045, 11, 19, 0, 0)], dtype=object)
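
Expressed as ratios, using only the mean timings reported in [3], [4] and [5] and ignoring the run-to-run spread:

```python
# Mean timings in milliseconds, copied from the %timeit output above.
old_date_ms = 555.0        # rng.date
new_date_ms = 90.4         # rng.date_new
to_pydatetime_ms = 121.0   # rng.normalize().to_pydatetime()

speedup_vs_old = old_date_ms / new_date_ms
speedup_vs_pydatetime = to_pydatetime_ms / new_date_ms

print(f"date_new vs date: {speedup_vs_old:.1f}x faster")
print(f"date_new vs normalize().to_pydatetime(): {speedup_vs_pydatetime:.1f}x faster")
```

So date_new is roughly 6x faster than the current date on this machine, and somewhat faster than the normalize().to_pydatetime() workaround.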
@jreback

Contributor Author

commented Nov 7, 2017

If you put this in a PR, we can take a look.

tmnhat2001 added a commit to tmnhat2001/pandas that referenced this issue Nov 8, 2017

@jreback jreback modified the milestones: Next Major Release, 0.22.0 Nov 12, 2017

tmnhat2001 added a commit to tmnhat2001/pandas that referenced this issue Nov 14, 2017

tmnhat2001 added a commit to tmnhat2001/pandas that referenced this issue Nov 18, 2017

tmnhat2001 added a commit to tmnhat2001/pandas that referenced this issue Nov 22, 2017

jreback added a commit that referenced this issue Nov 22, 2017

tmnhat2001 added a commit to tmnhat2001/pandas that referenced this issue Nov 24, 2017

@tmnhat2001 tmnhat2001 referenced this issue Nov 24, 2017

Merged

Improve DatetimeIndex.time performance #18461

3 of 3 tasks complete