pd.Series.asof performance regression: 14x slower #14461
Comments
|
Yes, I have noticed that in our benchmarks as well, but forgot to open an issue. Thanks for reporting! |
jorisvandenbossche added the Performance Regression label on Oct 20, 2016
jorisvandenbossche added this to the 0.19.1 milestone on Oct 20, 2016
|
Well, before 0.19.0 this was pretty much broken for non-monotonic indexes and when nulls existed, so you can either get correct results or have a tiny differential in perf. Not sure how this actually matters anyhow, except if you are calling this in a loop, which is completely non-idiomatic. |
pd.Series.asof is extremely flexible. If I pass in a value before the start of the index, it returns NA. If I pass in a value after the end of the index, it returns the very last non-NA value. In one of my cases I have several thousand time series, and for a given date I want to get the asof value for each one; I currently do this in a loop. The date can be anything, such as a weekend or a holiday. I've thought about creating a giant DataFrame, reindexing it to include every single calendar date, then calling fillna(method='ffill'), but I would still need to handle the dates before the start or after the end of the index with if statements. And if those time series cover different date/time ranges, the resulting DataFrame is extremely sparse. Maybe there is a better and more idiomatic way for this particular case? In other cases, asof has to be called on individual Series without the possibility of concatenating them into a giant DataFrame. The speed of the call is therefore very important. |
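For readers following along, the behaviour described above can be reproduced with a small example. This is only an illustrative sketch: the series `s` and `t`, the dates, and the dictionary `series_by_name` are made up for the example and are not taken from the reporter's data.

```python
import numpy as np
import pandas as pd

# Made-up series: two valid values followed by two NaNs.
s = pd.Series(
    [1.0, 2.0, np.nan, np.nan],
    index=pd.to_datetime(["2016-10-03", "2016-10-04", "2016-10-05", "2016-10-06"]),
)
# A second made-up series covering a later, non-overlapping date range.
t = pd.Series(
    [10.0, 11.0],
    index=pd.to_datetime(["2016-10-10", "2016-10-11"]),
)

# A date before the start of the index gives NaN.
before = s.asof(pd.Timestamp("2016-10-01"))

# A date on a NaN row falls back to the last non-NA value at or before it.
mid = s.asof(pd.Timestamp("2016-10-05"))

# A date after the end of the index (here a Saturday) gives the last non-NA value.
weekend = s.asof(pd.Timestamp("2016-10-08"))

# The loop pattern from the comment: one asof call per series for a given date.
series_by_name = {"s": s, "t": t}
date = pd.Timestamp("2016-10-08")
snapshot = pd.Series({name: ser.asof(date) for name, ser in series_by_name.items()})
```

With one scalar lookup per series, the per-call overhead of `asof` is paid thousands of times, which is why a slowdown in the scalar path is noticeable in this use case.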
|
So this is a general method which can handle both Series and DataFrames. An implementation that handles both directly in Cython could certainly be done. I would encourage a pull request to do this. For example, the nulls are pre-computed here, which makes the code much simpler, but is not necessary when iterating (which you can only do in a performant way in Cython, comparing nulls as you go). |
Non-monotonic indices are not involved, I think, as it just raises for that (the error was added in 0.19.0, but that is of course not the perf issue). I rather think this is a case where the generalization of the method for both Series and DataFrame has impacted the performance for (certain) Series-only cases. With a few small adaptations in the current code I can bring it back to 0.18 performance for Series (not calculating all nulls in advance + indexing the values instead of the Series (in case of Series you return a scalar, so don't need to index the object itself with
|
@jorisvandenbossche the non-monotonic checks actually are important, they do take some small amount of time. Again when iterating you can do these in-line. @laudney would welcome a pull-request. |
|
@laudney I think it would be rather simple to fix the performance issue. You have to compare the implementation of Series.asof in 0.18 with the current implementation. The differences are the pre-computation of the nulls and no longer working with the underlying array but with the Series itself; both give a performance degradation. As @jreback said, a pull request is very welcome! |
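As a rough illustration of the idea discussed in this thread (work on the underlying array and check for nulls only as needed, instead of pre-computing them for the whole Series), a scalar lookup could be sketched as below. The helper `asof_scalar` is hypothetical and is not the pandas implementation nor the code from the pull request; it assumes a monotonically increasing index and float values.

```python
import numpy as np
import pandas as pd


def asof_scalar(series, where):
    """Hypothetical sketch of a scalar as-of lookup; not pandas' own code.

    Assumes series.index is sorted in increasing order and the values are floats.
    """
    values = series.values  # plain NumPy array, avoids Series indexing overhead
    # Position of the last index label that is <= where (-1 if there is none).
    pos = series.index.searchsorted(where, side="right") - 1
    # Walk backwards past NaNs, checking nulls only where needed.
    while pos >= 0 and np.isnan(values[pos]):
        pos -= 1
    return values[pos] if pos >= 0 else np.nan


# Made-up data for a quick comparison with Series.asof.
s = pd.Series(
    [1.0, 2.0, np.nan, np.nan],
    index=pd.to_datetime(["2016-10-03", "2016-10-04", "2016-10-05", "2016-10-06"]),
)
before = asof_scalar(s, pd.Timestamp("2016-10-01"))
inside = asof_scalar(s, pd.Timestamp("2016-10-05"))
after = asof_scalar(s, pd.Timestamp("2016-10-08"))
```

The sketch skips the monotonicity check mentioned earlier in the thread, so it should not be read as a drop-in replacement for `Series.asof`.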
|
@jreback @jorisvandenbossche Let me take a look but no promise though! |
laudney referenced this issue on Oct 22, 2016: PERF: performance regression in Series.asof #14476 (Merged)
|
@jreback @jorisvandenbossche please check my pull request #14476 |
laudney commented Oct 20, 2016
pd.Series.asof takes 14x longer in 0.19.0 than 0.18.1