# Pandas conversion of Timedelta is very slow #18092

Closed
opened this Issue Nov 3, 2017 · 11 comments

Projects
None yet
3 participants

### Stanpol commented Nov 3, 2017

#### Code Sample, a copy-pastable example if possible

```a=pd.Series([pd.Timedelta(days=x) for x in np.random.randint(0, 10, 60000)])

# Let's convert Timedelta to days with Pandas
%time a.dt.days
# CPU times: user 457 ms, sys: 4.08 ms, total: 461 ms
# Wall time: 464 ms

# Let's convert Timedelta to days by division
%time (a / np.timedelta64(1, 'D')).astype(np.int64)
# CPU times: user 3.19 ms, sys: 1.79 ms, total: 4.98 ms
# Wall time: 3.18 ms

# Make sure results are the same
((a / np.timedelta64(1, 'D')).astype(np.int64)==a.dt.days).value_counts()
# True    60000
# dtype: int64
```

#### Problem description

For large Series it takes very long time to do a simple conversion, this should be optimised.

#### Expected Output

.dt.days should be as quick as dividing by np.timedelta64(1, 'D')

## INSTALLED VERSIONS

commit: None

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.4.0
Cython: 0.26
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None

Contributor

### jreback commented Nov 3, 2017

 so the issue is that https://github.com/pandas-dev/pandas/blob/master/pandas/core/indexes/timedeltas.py#L383 is not done in a vectorized way, IOW needs to simply construct the returned arrays and then use `_maybe_mask_results`, rather than doing this in a list-comp with individual Timedelta construction. you can't actually divide by `np.timedelta64` (well strictly speaking for days you can but not for any other unit), so rather fix the general problem. want to have a go at a PR?

Member

### WillAyd commented Nov 4, 2017

 I took a look at this and re-factored the _get_field function to look as follows: ``````def _get_field(self, m): def map_attr(x): nonlocal m return getattr(Timedelta(x), m) values = self.asi8 hasnans = self.hasnans if hasnans: vfunc = np.vectorize(map_attr, otypes=[np.float]) else: vfunc = np.vectorize(map_attr, otypes=[np.int]) result = vfunc(values) return Index(result, name=self.name) `````` However, I didn't really see any tangible performance improvement. The np.vectorize docs mention that the function is for convenience and not necessarily performance, as it's essentially a for loop. @jreback - is the code above that I provided in line with what you were expecting? If that's the case I'm not sure that is really the root cause from my initial tests
Contributor

### jreback commented Nov 4, 2017

 no this is not vectorization, see how other things are done in datetimes.py
Member

### WillAyd commented Nov 4, 2017 • edited

 Thanks Jeff. I refactored to the below and did see significant speed improvements: ``````def _get_field(self, m): freqs = { 'days' : 'D', 'seconds' : 's', 'microseconds' : 'us', 'nanoseconds' : 'ns' } freq = freqs[m] values = self.asi8 hasnans = self.hasnans result = libts.get_date_field(values, freq) if hasnans: result = self._maybe_mask_results(result, convert='float64') return Index(result, name=self.name) `````` Do you have a point of view as to where to the put the freqs dict I have above? I've included it in the method here for visibility, but I was thinking it could be better served as a classmethod to map the properties to their appropriate frequency codes
Contributor

### jreback commented Nov 4, 2017

 @Stanpol great! you shouldn't need the freq dict, simply pass in the correct value in the accessor itself (which I think is what it was doing). and pls add some asv's for this.
Member

### WillAyd commented Nov 4, 2017

 Thanks for the tip on the accessors - that's easy enough. One issue I'm seeing now though is that I might need to be careful with handling dates vs time deltas. I noticed the test_fields method in test_timedelta.py is failing with the below: ``````E Index values are different (100.0 %) E [left]: Int64Index([2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype='int64') E [right]: Int64Index([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype='int64') `````` The difference of one day I assume is due to me using the get_date_field method in libts while passing in Timedelta objects. Any tips on how to best handle that? I'll take a look at adding some asv's as you suggest
Contributor

### jreback commented Nov 4, 2017

 what does this have to do with dates?
Contributor

### jreback commented Nov 4, 2017

 you shouldn't be using this `get_date_field`, you are solely calling routings on timedeltas.
Member

### WillAyd commented Nov 7, 2017

 My mistake. Looking further into this I noticed that all of the logic for converting between days, hours, minutes, seconds, etc... is contained within the `_ensure_components` method of the `_Timedelta` class in timedeltas.pyx, so it's difficult to avoid some kind of individual Timedelta construction as the code currently stands. I'll keep plugging at it but I haven't done much in C / Cython before so it may be slow going on my end to figure out how to make timedelta field access work similar to date objects. If anyone else out there has thoughts on how to tackle then by all means. FWIW here's what I tried to implement in fields.pyx to mimic what exists for dates. I did this solely to check performance so there isn't any error handling. My very un-scientific tests weren't showing any improvement over existing code. ``````@cython.wraparound(False) @cython.boundscheck(False) def get_timedelta_field(ndarray[int64_t] tdindex, object field): cdef: Py_ssize_t i, count = 0 ndarray[int32_t] out count = len(tdindex) out = np.empty(count, dtype='i4') for i in range(count): if tdindex[i] == NPY_NAT: out[i] = -1 continue out[i] = getattr(_Timedelta(microseconds = tdindex[i] / 1000), field) return out ``````
Contributor

### jreback commented Nov 8, 2017

 note that #18161 was merged so the code you are working is slightly moved around but substantially the same
Member

### WillAyd commented Nov 9, 2017

 Thanks for the heads up. Already revised - hope to have something over in the next few days

### WillAyd added a commit to WillAyd/pandas that referenced this issue Nov 11, 2017

``` PERF: Vectorized Timedelta property access (pandas-dev#18092) ```
``` 4e51ee0 ```

### WillAyd added a commit to WillAyd/pandas that referenced this issue Nov 11, 2017

``` PERF: Vectorized Timedelta property access (pandas-dev#18092) ```
``` 993ea6b ```

### WillAyd added a commit to WillAyd/pandas that referenced this issue Nov 11, 2017

``` PERF: Vectorized Timedelta property access (pandas-dev#18092) ```
``` e369c1f ```

### WillAyd added a commit to WillAyd/pandas that referenced this issue Nov 11, 2017

``` PERF: Vectorized Timedelta property access (pandas-dev#18092) ```
``` 9003258 ```

### WillAyd added a commit to WillAyd/pandas that referenced this issue Nov 11, 2017

``` PERF: Vectorized Timedelta property access (pandas-dev#18092) ```
``` 1be6911 ```

### WillAyd added a commit to WillAyd/pandas that referenced this issue Nov 11, 2017

``` PERF: Vectorized Timedelta property access (pandas-dev#18092) ```
``` 3c5d782 ```

Merged

### WillAyd added a commit to WillAyd/pandas that referenced this issue Nov 11, 2017

``` PERF: Vectorized Timedelta property access (pandas-dev#18092) ```
``` 20b7477 ```

### WillAyd added a commit to WillAyd/pandas that referenced this issue Nov 11, 2017

``` PERF: Vectorized Timedelta property access (pandas-dev#18092) ```
``` 35cb876 ```

### WillAyd added a commit to WillAyd/pandas that referenced this issue Nov 11, 2017

``` PERF: Vectorized Timedelta property access (pandas-dev#18092) ```
``` fb543ec ```

### WillAyd added a commit to WillAyd/pandas that referenced this issue Nov 12, 2017

``` PERF: Vectorized Timedelta property access (pandas-dev#18092) ```
``` 08a13fb ```

### jreback added a commit that referenced this issue Nov 12, 2017

``` PERF: Vectorized Timedelta property access (#18092) (#18225) ```
``` cd80f08 ```

### No-Stream added a commit to No-Stream/pandas that referenced this issue Nov 28, 2017

``` PERF: Vectorized Timedelta property access (pandas-dev#18092) (pandas… ```
`…-dev#18225)`
``` de4aa61 ```