Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: pd.Index.is_all_dates doesn't require full-scan inference #6341

Merged

Conversation

immerrr
Copy link
Contributor

@immerrr immerrr commented Feb 13, 2014

While working on #6328 I've stumbled upon this piece that deserves optimization.

Current implementation Index.is_all_dates invokes lib.infer_dtype which does a full-scan dtype inference which is not not always necessary. If I read cython sources correctly, new code is equivalent to the old one and the performance increase comes from short-circuiting the check on the first non-datetime element.

This change cuts roughly 50% of runtime of the following snippet:

import numpy as np
import pandas as pd

NUM_ELEMENTS = 1e7
s = pd.Series(np.arange(NUM_ELEMENTS),
              index=['a%s' % i for i in np.arange(NUM_ELEMENTS)])

print("pd.version", pd.__version__)
%timeit s[:int(NUM_ELEMENTS/2)]

On recent master:

$ ipython test_index_is_all_dates.ipy 
pd.version 0.13.1-108-g87b4308
10 loops, best of 3: 55.5 ms per loop

On this branch:

$ ipython test_index_is_all_dates.ipy 
pd.version 0.13.1-109-g3b43e2b
10 loops, best of 3: 28 ms per loop

@jreback
Copy link
Contributor

jreback commented Feb 13, 2014

fyi...this is a cached attribute so only hit once, but it is valid

the infer_dtype routines does do this optimization however

@jreback jreback added this to the 0.14.0 milestone Feb 13, 2014
@immerrr
Copy link
Contributor Author

immerrr commented Feb 13, 2014

fyi...this is a cached attribute so only hit once, but it is valid

I know that. Unfortunately, the cache doesn't help in case of slicing: the result is always a new object (unless the slice was itself cached), the cache is not calculated yet and the value is required during construction unconditionally (fastpath=True doesn't help).

the infer_dtype routines does do this optimization however

I'm not sure what you mean. lib.infer_dtype needs to know the exact type and thus is forced to scan all the elements. is_all_dates is only interested if all of the series elements are datetimes and the first element which doesn't fit is enough to have the answer.

@jreback
Copy link
Contributor

jreback commented Feb 13, 2014

look at the impl of infer_dtype it fails fast. (as does is_datetime_array)

@immerrr
Copy link
Contributor Author

immerrr commented Feb 13, 2014

Hmmm... can it be that you confuse these two cases:

In [50]: arr_s = np.array(map(str, range(10000)))

In [51]: arr_o = np.array(map(str, range(10000)), dtype=object)

In [52]: timeit pd.lib.infer_dtype(arr_s)
1000000 loops, best of 3: 351 ns per loop

In [53]: timeit pd.lib.infer_dtype(arr_o)
10000 loops, best of 3: 42 µs per loop

In [54]: pd.lib.infer_dtype(arr_s)
Out[54]: 'string'

In [55]: pd.lib.infer_dtype(arr_o)
Out[55]: 'string'

Because, I cannot see how infer_dtype can return 'string' for arr_o without scanning the whole array at line 124 and returning the value at line 125:

    elif PyString_Check(val):
        if is_string_array(values):
            return 'string'

@jreback
Copy link
Contributor

jreback commented Feb 13, 2014

no that is correct. The is_string_array like all of the array functions fails fast. Of course they DO scan the entire array; but some cases are very optimized, e.g. if its already dtype = datetime64[ns] then it doesn't even need to do anything.

Pls run a perf-check with the change and see if anything changes. and post the results

test_perf.sh

@immerrr
Copy link
Contributor Author

immerrr commented Feb 13, 2014

Here's the result. Is there a benchmark for slice indexing?

@jreback
Copy link
Contributor

jreback commented Feb 13, 2014

look at vb_suite/indexing.py

so you prob need to create a benchmark which tests this change....nothing is obviously improved (but its a worthwhile change though)

you can use the example from above that you did

@immerrr
Copy link
Contributor Author

immerrr commented Feb 14, 2014

Done and updated the gist.

@jreback
Copy link
Contributor

jreback commented Feb 14, 2014

nice!

ok just need an entry in release notes (ref this pr as their is no associated issue) and good 2 go

The patch avoids scanning the whole slice in Index.is_all_dates check.
@immerrr
Copy link
Contributor Author

immerrr commented Feb 14, 2014

Done.

@jreback jreback merged commit e3f666f into pandas-dev:master Feb 14, 2014
@jreback
Copy link
Contributor

jreback commented Feb 14, 2014

@immerrr thanks.....

sorry about being pendantic....but expect a lot of contributions from you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants