Pandas conversion of Timedelta is very slow #18092
Code Sample, a copy-pastable example if possible
    import numpy as np
    import pandas as pd

    a = pd.Series([pd.Timedelta(days=x) for x in np.random.randint(0, 10, 60000)])

    # Let's convert Timedelta to days with Pandas
    %time a.dt.days
    # CPU times: user 457 ms, sys: 4.08 ms, total: 461 ms
    # Wall time: 464 ms

    # Let's convert Timedelta to days by division
    %time (a / np.timedelta64(1, 'D')).astype(np.int64)
    # CPU times: user 3.19 ms, sys: 1.79 ms, total: 4.98 ms
    # Wall time: 3.18 ms

    # Make sure results are the same
    ((a / np.timedelta64(1, 'D')).astype(np.int64) == a.dt.days).value_counts()
    # True    60000
    # dtype: int64
For a large Series, this simple conversion takes a very long time (464 ms versus 3.18 ms for 60,000 elements in the timings above). This should be optimised.
.dt.days should be as quick as dividing by np.timedelta64(1, 'D')
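One caveat with the division workaround: `.astype(np.int64)` truncates toward zero, whereas `.dt.days` floors, so the two disagree for negative timedeltas that are not whole days. A minimal sketch of a floor-based variant follows. The helper name `days_via_floor_division` and the constant `NS_PER_DAY` are made up for illustration and are not part of pandas, and the sketch assumes the Series contains no NaT values.

```python
import numpy as np
import pandas as pd

# Nanoseconds in one day (illustrative constant, not a pandas name).
NS_PER_DAY = 86_400_000_000_000


def days_via_floor_division(series):
    """Hypothetical helper: whole days of a timedelta Series, floored.

    Assumes no NaT values; NaT would come through as a meaningless integer.
    """
    # Normalise to nanosecond resolution, then reinterpret as int64 counts.
    nanos = series.to_numpy().astype("timedelta64[ns]").astype("int64")
    # Integer floor division rounds toward negative infinity, like .dt.days.
    return nanos // NS_PER_DAY


s = pd.Series([
    pd.Timedelta(days=2),
    pd.Timedelta(hours=-1),          # .days is -1, not 0
    pd.Timedelta(days=1, hours=23),
])
fast = days_via_floor_division(s)
truncated = (s / np.timedelta64(1, "D")).astype(np.int64)
```

Here `fast` agrees with `s.dt.days` on all three rows, while `truncated` gives 0 for the second row.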
so the issue is that https://github.com/pandas-dev/pandas/blob/master/pandas/core/indexes/timedeltas.py#L383 is not done in a vectorized way, IOW needs to simply construct the returned arrays and then use
you can't actually divide by
want to have a go at a PR?
I took a look at this and re-factored the _get_field function to look as follows:
However, I didn't really see any tangible performance improvement. The np.vectorize docs mention that the function is for convenience and not necessarily performance, as it's essentially a for loop.
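To illustrate the point about `np.vectorize`, here is a small sketch using plain NumPy rather than the actual `_get_field` code, which is not shown in this thread. The function `days_scalar` and the constant `NS_PER_DAY` are illustrative names only. Both approaches give the same answer, but the first calls a Python function once per element while the second is a single array operation.

```python
import numpy as np

# Nanoseconds in one day (illustrative constant).
NS_PER_DAY = 86_400_000_000_000

rng = np.random.default_rng(0)
# Random whole-day timedeltas, stored at nanosecond resolution.
deltas = rng.integers(0, 10, 1000).astype("timedelta64[D]").astype("timedelta64[ns]")
nanos = deltas.astype("int64")


def days_scalar(ns):
    # Per-element Python function: invoked once for every value.
    return ns // NS_PER_DAY


# np.vectorize wraps the scalar function but still loops in Python.
looped = np.vectorize(days_scalar, otypes=[np.int64])(nanos)

# A true array operation: one call, executed in compiled code.
vectorized = nanos // NS_PER_DAY
```

Timing the two (for example with `%timeit`) should show the gap; exact numbers will depend on the machine and NumPy version.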
@jreback - is the code above in line with what you were expecting? If so, my initial tests suggest that it may not really be the root cause.
Thanks Jeff. I refactored to the below and did see significant speed improvements:
Do you have a point of view as to where to put the freqs dict I have above? I've included it in the method here for visibility, but I was thinking it could be better served as a classmethod to map the properties to their appropriate frequency codes.
Thanks for the tip on the accessors - that's easy enough. One issue I'm seeing now though is that I might need to be careful with handling dates vs time deltas. I noticed the test_fields method in test_timedelta.py is failing with the below:
The difference of one day I assume is due to me using the get_date_field method in libts while passing in Timedelta objects. Any tips on how to best handle that?
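One way to see where an off-by-one could come from, using only public pandas API rather than the internal `get_date_field` routine: if the nanosecond integers behind a timedelta are read as timestamps since the epoch, a delta of n days lands on 1970-01-01 plus n days, whose day-of-month is n + 1. This is a sketch of that assumption, not a confirmed diagnosis of the failing test.

```python
import numpy as np
import pandas as pd

deltas = pd.to_timedelta([0, 3, 10], unit="D")

# Raw nanosecond counts backing the timedeltas.
nanos = deltas.to_numpy().astype("timedelta64[ns]").astype("int64")

# Reinterpret the same integers as nanoseconds since 1970-01-01.
as_dates = pd.to_datetime(nanos, unit="ns")

day_of_month = as_dates.day.to_numpy()   # calendar field of the date
true_days = deltas.days.to_numpy()       # days component of the timedelta
```

For these small values `day_of_month` is exactly `true_days + 1`; for larger deltas the calendar field would wrap at month boundaries and diverge further.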
I'll take a look at adding some asv's as you suggest
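For reference, an asv benchmark is typically a class with a `setup` method and one or more methods whose names start with `time_`. The sketch below shows roughly what a benchmark for this issue could look like; the class and method names are hypothetical and are not taken from the pandas benchmark suite.

```python
import numpy as np
import pandas as pd


class TimedeltaDaysAccessor:
    """Hypothetical asv-style benchmark comparing the two conversions."""

    def setup(self):
        # Same shape of data as the example in the issue: 60,000 whole-day deltas.
        rng = np.random.default_rng(0)
        self.series = pd.Series(
            pd.to_timedelta(rng.integers(0, 10, 60_000), unit="D")
        )

    def time_dt_days(self):
        # The accessor reported as slow.
        self.series.dt.days

    def time_divide_by_day(self):
        # The division workaround used as the baseline.
        (self.series / np.timedelta64(1, "D")).astype(np.int64)
```

asv discovers `time_*` methods automatically and calls `setup` before timing each one.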
My mistake. Looking further into this I noticed that all of the logic for converting between days, hours, minutes, seconds, etc... is contained within the
I'll keep plugging away at it, but I haven't done much in C / Cython before, so it may be slow going on my end to figure out how to make timedelta field access work similarly to date objects. If anyone else out there has thoughts on how to tackle this, then by all means.
FWIW here's what I tried to implement in fields.pyx to mimic what exists for dates. I did this solely to check performance so there isn't any error handling. My very un-scientific tests weren't showing any improvement over existing code.