New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pd.Series.asof performance regression: 14x slower #14461

Closed
laudney opened this Issue Oct 20, 2016 · 9 comments

Comments

Projects
None yet
3 participants
@laudney
Contributor

laudney commented Oct 20, 2016

pd.Series.asof takes 14x longer in 0.19.0 than 0.18.1

In[14]: pd.__version__
Out[14]: '0.18.1'
In[15]: s = pd.Series([1, 1, 1, np.nan, np.nan])
In[16]: %timeit s.asof(4)
The slowest run took 8.25 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8.96 µs per loop

In[8]: pd.__version__
Out[8]: '0.19.0'
In[9]: s = pd.Series([1, 1, 1, np.nan, np.nan])
In[10]: %timeit s.asof(4)
10000 loops, best of 3: 125 µs per loop
@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Oct 20, 2016

Member

Yes, I have noticed that in our benchmarks as well, but forgot to open an issue. Thanks for reporting!
If you want to have a look what could be the cause, very welcome!

Member

jorisvandenbossche commented Oct 20, 2016

Yes, I have noticed that in our benchmarks as well, but forgot to open an issue. Thanks for reporting!
If you want to have a look what could be the cause, very welcome!

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Oct 20, 2016

Contributor

well < 0.19.0 was pretty much broken for non-monotonic indexes and when nulls existed. so you get correct or you can have a tiny differential in perf.

Not sure how this actually matters anyhow except if you are calling this in a loop, which is completely non-idiomatic.

Contributor

jreback commented Oct 20, 2016

well < 0.19.0 was pretty much broken for non-monotonic indexes and when nulls existed. so you get correct or you can have a tiny differential in perf.

Not sure how this actually matters anyhow except if you are calling this in a loop, which is completely non-idiomatic.

@laudney

This comment has been minimized.

Show comment
Hide comment
@laudney

laudney Oct 21, 2016

Contributor

Not sure how this actually matters anyhow except if you are calling this in a loop, which is completely non-idiomatic.

pd.Series.asof is extremely flexible. If I pass in a value before the start of the index, it returns NA. If I pass in a value after the end of the index, it returns the very last non-NA value.

Now in one of my cases I have several thousand time series and for a given date, I want to get the asof value for each one and I currently do this in a loop. The date can be anything like a weekend or a holiday.

I've thought about creating a giant DataFrame, reindex it to include every single calendar date, then fillna(method='ffill) but I still need to deal with the dates before the start or after the end of the index with if statements.

And if those time series cover different date/time ranges, the resulting DataFrame is extremely sparse.

Maybe there is a better and more idiomatic way for this particular case?

In other cases, asof has to be called on individual Series without the possibility of concatenating them into a giant DataFrame. The speed of the call is therefore very imporant.

Contributor

laudney commented Oct 21, 2016

Not sure how this actually matters anyhow except if you are calling this in a loop, which is completely non-idiomatic.

pd.Series.asof is extremely flexible. If I pass in a value before the start of the index, it returns NA. If I pass in a value after the end of the index, it returns the very last non-NA value.

Now in one of my cases I have several thousand time series and for a given date, I want to get the asof value for each one and I currently do this in a loop. The date can be anything like a weekend or a holiday.

I've thought about creating a giant DataFrame, reindex it to include every single calendar date, then fillna(method='ffill) but I still need to deal with the dates before the start or after the end of the index with if statements.

And if those time series cover different date/time ranges, the resulting DataFrame is extremely sparse.

Maybe there is a better and more idiomatic way for this particular case?

In other cases, asof has to be called on individual Series without the possibility of concatenating them into a giant DataFrame. The speed of the call is therefore very imporant.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Oct 21, 2016

Contributor

So this is a general method which can handle both Series and DataFrames. an implementation that handles both directly in cython could certainly be done. I would encourage a pull-request to do this.

For example the nulls are pre-computed here, which makes the code much simpler, but is not necessary when iterating (which you can only do in a performant way in cython, comparing nulls as you go).

Contributor

jreback commented Oct 21, 2016

So this is a general method which can handle both Series and DataFrames. an implementation that handles both directly in cython could certainly be done. I would encourage a pull-request to do this.

For example the nulls are pre-computed here, which makes the code much simpler, but is not necessary when iterating (which you can only do in a performant way in cython, comparing nulls as you go).

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Oct 21, 2016

Member

well < 0.19.0 was pretty much broken for non-monotonic indexes and when nulls existed. so you get correct or you can have a tiny differential in perf.

Non-monotonic indices are not involved I think, as it just raises for that (the error raise is added in 0.19.0, but that is of course not the perf issue). I rather think this is a case where the generalization of the method for both Series and DataFrame has impacted the performance for (certain) Series only cases.

With a few small adaptions in the current code I can bring it back to 0.18 performance for series (not calculating all nulls in advance + indexing the values instead of Series (in case of Series you return a scalar, so don't need to index the object itself with iloc))

Member

jorisvandenbossche commented Oct 21, 2016

well < 0.19.0 was pretty much broken for non-monotonic indexes and when nulls existed. so you get correct or you can have a tiny differential in perf.

Non-monotonic indices are not involved I think, as it just raises for that (the error raise is added in 0.19.0, but that is of course not the perf issue). I rather think this is a case where the generalization of the method for both Series and DataFrame has impacted the performance for (certain) Series only cases.

With a few small adaptions in the current code I can bring it back to 0.18 performance for series (not calculating all nulls in advance + indexing the values instead of Series (in case of Series you return a scalar, so don't need to index the object itself with iloc))

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Oct 21, 2016

Contributor

@jorisvandenbossche the non-monotonic checks actually are important, they do take some small amount of time. Again when iterating you can do these in-line.

@laudney would welcome a pull-request.

Contributor

jreback commented Oct 21, 2016

@jorisvandenbossche the non-monotonic checks actually are important, they do take some small amount of time. Again when iterating you can do these in-line.

@laudney would welcome a pull-request.

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Oct 22, 2016

Member

@laudney I think it would be rather simple to fix the performance issue. You have to compare the implemention of Series.asof of 0.18 with the current implementation (the differences are the pre-computations of the nulls and no longer working with the underlying array but with the series itself. Both give a performance degradation). As @jreback said, pull request is very welcome!

Member

jorisvandenbossche commented Oct 22, 2016

@laudney I think it would be rather simple to fix the performance issue. You have to compare the implemention of Series.asof of 0.18 with the current implementation (the differences are the pre-computations of the nulls and no longer working with the underlying array but with the series itself. Both give a performance degradation). As @jreback said, pull request is very welcome!

@laudney

This comment has been minimized.

Show comment
Hide comment
@laudney

laudney Oct 22, 2016

Contributor

@jreback @jorisvandenbossche Let me take a look but no promise though!

Contributor

laudney commented Oct 22, 2016

@jreback @jorisvandenbossche Let me take a look but no promise though!

@laudney laudney referenced this issue Oct 22, 2016

Merged

PERF: performance regression in Series.asof #14476

2 of 2 tasks complete
@laudney

This comment has been minimized.

Show comment
Hide comment
@laudney

laudney Oct 22, 2016

Contributor

@jreback @jorisvandenbossche please check my pull request #14476

Contributor

laudney commented Oct 22, 2016

@jreback @jorisvandenbossche please check my pull request #14476

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment