
PERF: ensure_string_array with non-numpy input array #37371

Merged · 7 commits · Oct 26, 2020

Conversation

@topper-123 (Contributor) commented Oct 23, 2020

Currently, if the input array to ensure_string_array is not a numpy array (e.g. it is an ExtensionArray), the function constantly switches between Python-level and Cython-level code, which is slow.

This PR fixes that by ensuring we always have a numpy array, avoiding the round trips to Python-level code.
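
The gist of the change, as a simplified sketch rather than the exact diff, is to normalize the input up front so that the conversion loop only ever indexes into a plain numpy object array (the helper name below is hypothetical):

import numpy as np

def _ensure_object_ndarray(arr):
    # Sketch of the normalization this PR adds: ExtensionArrays expose
    # .to_numpy(); anything else that is not already an ndarray falls back
    # to np.array(..., dtype="object").
    if hasattr(arr, "to_numpy"):
        return arr.to_numpy()
    if not isinstance(arr, np.ndarray):
        return np.array(arr, dtype="object")
    return arr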

>>> n = 50_000
>>> cat = pd.Categorical([*['a', pd.NA] * n])
>>> %timeit cat.astype("string")
447 ms ± 11.9 ms per loop  # master
5.43 ms ± 80.5 µs per loop  # this PR

xref #35519, #36464.

@jreback added labels on Oct 23, 2020: Categorical (Categorical Data Type), Performance (Memory or execution speed performance), Strings (String extension data type and string data)
@jreback added this to the 1.2 milestone on Oct 23, 2020
@jreback (Contributor) left a comment


wow! assume we have sufficient benchmarks?

pls either add a whatsnew (or, since this is similar to others that you have recently done for this type of function, it's ok to just add this PR number there). ping on green.
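
For reference, a benchmark covering this path could look roughly like the following asv-style sketch (hypothetical class and method names, not an existing entry in the pandas benchmark suite):

import pandas as pd

class CategoricalAstypeString:
    # Times Categorical -> "string" conversion, the case sped up by this PR.
    def setup(self):
        n = 50_000
        self.cat = pd.Categorical(["a", pd.NA] * n)

    def time_astype_string(self):
        self.cat.astype("string")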

    for i in range(n):
-       val = arr[i]
+       arr_val = arr[i]
Contributor:

does arr_val and result_val need to be in cdef?
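
For context, declaring those temporaries in the cdef block would look roughly like this (a sketch; result_val is assumed from the surrounding function body, it does not appear in the snippet above):

cdef:
    Py_ssize_t i = 0, n = len(arr)
    object arr_val, result_val  # loop temporaries typed as plain Python objects

for i in range(n):
    arr_val = arr[i]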

@jorisvandenbossche (Member)

Specifically for Categorical, we can actually get something (possibly) faster by special casing it to astype the categories only:

In [16]: n = 50_000
    ...: cat = pd.Categorical([*['a', pd.NA] * n])

In [17]: %timeit cat.astype("string")
313 ms ± 8.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [18]: %timeit cat.categories.array.astype("string").take(cat.codes, allow_fill=True)
1.86 ms ± 50.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

(this is also something that can work in general for other dtypes as well)
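
Wrapped up as a (hypothetical) helper, the suggestion amounts to converting only the small categories array and then expanding it back out via the codes:

import pandas as pd

def categorical_astype_via_categories(cat, dtype):
    # astype only the unique categories (cheap), then broadcast back to the
    # full length with take(); allow_fill=True maps code -1 (missing) to the
    # target dtype's NA value.
    return cat.categories.array.astype(dtype).take(cat.codes, allow_fill=True)

# e.g. categorical_astype_via_categories(cat, "string")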

@topper-123 (Contributor, Author) commented Oct 24, 2020

The underlying problem is in the string conversion, so should be fixed there:

In [1]: n = 50_000
   ...: cat = pd.Categorical([*['a', pd.NA] * n])

In [2]: %timeit pd.array(cat, dtype="string")  # same underlying problem
508 ms ± 9.6 ms per loop  # master
6.03 ms ± 184 µs per loop  # this PR

@topper-123 (Contributor, Author)

I think the failures are unrelated. I'll update later with the whatsnew and timing runs, if needed.

@topper-123 (Contributor, Author)

> pls either add a whatsnew (or, since this is similar to others that you have recently done for this type of function, it's ok to just add this PR number there). ping on green.

I've added this in the whatsnew, but this actually fixes a performance regression coming from #36464 (the change to pandas/_libs/lib.pyx).

@jreback (Contributor) commented Oct 26, 2020

> Specifically for Categorical, we can actually get something (possibly) faster by special casing it to astype the categories only: [...]

possibly but this is a separate issue and orthogonal here.

@@ -651,6 +651,13 @@ cpdef ndarray[object] ensure_string_array(
     cdef:
         Py_ssize_t i = 0, n = len(arr)

+    from pandas.core.dtypes.common import is_extension_array_dtype
+
+    if is_extension_array_dtype(arr):
Contributor:

can we avoid doing this, as it circularizes things; you can check if it has a .to_numpy method, or maybe @jbrockmendel has a better way here.

Contributor (Author):

Ok, I've changed to use hasattr(arr, "to_numpy").

@jreback merged commit 3c8c4c9 into pandas-dev:master on Oct 26, 2020
@jreback (Contributor) commented Oct 26, 2020

thanks @topper-123

@topper-123 deleted the ensure_string_array_perf branch on October 26, 2020 at 17:49
    if hasattr(arr, "to_numpy"):
        arr = arr.to_numpy()
    elif not isinstance(arr, np.ndarray):
        arr = np.array(arr, dtype="object")
Member:

this should probably be asarray

not a big deal, but my preference would have been to type the input arr as an ndarray and handle non-ndarray in the calling function
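
A minimal illustration of the np.asarray point: unlike np.array, it does not copy when the input is already an object-dtype ndarray:

import numpy as np

arr = np.array(["a", "b"], dtype=object)
copied = np.array(arr, dtype="object")    # np.array copies by default
same = np.asarray(arr, dtype="object")    # asarray returns the input unchanged
assert same is arr and copied is not arr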
