pd.Series.asof performance regression: 14x slower #14461

Closed
laudney opened this Issue Oct 20, 2016 · 9 comments

Comments

Projects
None yet
3 participants
Contributor

laudney commented Oct 20, 2016

pd.Series.asof takes 14x longer in 0.19.0 than 0.18.1

In[14]: pd.__version__
Out[14]: '0.18.1'
In[15]: s = pd.Series([1, 1, 1, np.nan, np.nan])
In[16]: %timeit s.asof(4)
The slowest run took 8.25 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8.96 µs per loop

In[8]: pd.__version__
Out[8]: '0.19.0'
In[9]: s = pd.Series([1, 1, 1, np.nan, np.nan])
In[10]: %timeit s.asof(4)
10000 loops, best of 3: 125 µs per loop

Yes, I have noticed that in our benchmarks as well, but forgot to open an issue. Thanks for reporting!
If you want to have a look what could be the cause, very welcome!

jorisvandenbossche added this to the 0.19.1 milestone Oct 20, 2016

Contributor

jreback commented Oct 20, 2016

well < 0.19.0 was pretty much broken for non-monotonic indexes and when nulls existed. so you get correct or you can have a tiny differential in perf.

Not sure how this actually matters anyhow except if you are calling this in a loop, which is completely non-idiomatic.

Contributor

laudney commented Oct 21, 2016

Not sure how this actually matters anyhow except if you are calling this in a loop, which is completely non-idiomatic.

pd.Series.asof is extremely flexible. If I pass in a value before the start of the index, it returns NA. If I pass in a value after the end of the index, it returns the very last non-NA value.

Now in one of my cases I have several thousand time series and for a given date, I want to get the asof value for each one and I currently do this in a loop. The date can be anything like a weekend or a holiday.

I've thought about creating a giant DataFrame, reindex it to include every single calendar date, then fillna(method='ffill) but I still need to deal with the dates before the start or after the end of the index with if statements.

And if those time series cover different date/time ranges, the resulting DataFrame is extremely sparse.

Maybe there is a better and more idiomatic way for this particular case?

In other cases, asof has to be called on individual Series without the possibility of concatenating them into a giant DataFrame. The speed of the call is therefore very imporant.

Contributor

jreback commented Oct 21, 2016

So this is a general method which can handle both Series and DataFrames. an implementation that handles both directly in cython could certainly be done. I would encourage a pull-request to do this.

For example the nulls are pre-computed here, which makes the code much simpler, but is not necessary when iterating (which you can only do in a performant way in cython, comparing nulls as you go).

well < 0.19.0 was pretty much broken for non-monotonic indexes and when nulls existed. so you get correct or you can have a tiny differential in perf.

Non-monotonic indices are not involved I think, as it just raises for that (the error raise is added in 0.19.0, but that is of course not the perf issue). I rather think this is a case where the generalization of the method for both Series and DataFrame has impacted the performance for (certain) Series only cases.

With a few small adaptions in the current code I can bring it back to 0.18 performance for series (not calculating all nulls in advance + indexing the values instead of Series (in case of Series you return a scalar, so don't need to index the object itself with iloc))

Contributor

jreback commented Oct 21, 2016

@jorisvandenbossche the non-monotonic checks actually are important, they do take some small amount of time. Again when iterating you can do these in-line.

@laudney would welcome a pull-request.

@laudney I think it would be rather simple to fix the performance issue. You have to compare the implemention of Series.asof of 0.18 with the current implementation (the differences are the pre-computations of the nulls and no longer working with the underlying array but with the series itself. Both give a performance degradation). As @jreback said, pull request is very welcome!

Contributor

laudney commented Oct 22, 2016

@jreback @jorisvandenbossche Let me take a look but no promise though!

laudney referenced this issue Oct 22, 2016

Merged

PERF: performance regression in Series.asof #14476

2 of 2 tasks complete
Contributor

laudney commented Oct 22, 2016

@jreback @jorisvandenbossche please check my pull request #14476

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment