PERF: directly astype with numpy if series is already nansafe #8732

Closed
jreback opened this issue Nov 4, 2014 · 2 comments
Labels
Performance (memory or execution speed performance), Strings (string extension data type and string data)

Comments

jreback commented Nov 4, 2014

from SO

The null check is pretty cheap. If there are no nulls, we can bypass the nansafe path and use the underlying numpy routine directly. That should be a nice speedup.

```
In [13]: arr = np.random.randint(1,10,size=1000000)

In [14]: s = Series(arr)

In [15]: s.notnull().all()
Out[15]: True

In [16]: %timeit s.notnull().all()
1000 loops, best of 3: 1.35 ms per loop

In [17]: %timeit s.astype(str)
1 loops, best of 3: 2.52 s per loop

In [18]: %timeit s.values.astype(str)
10 loops, best of 3: 37.7 ms per loop
```
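
For anyone wanting to rerun the comparison outside IPython, here is a rough sketch using the standard `timeit` module. The helper name `best_ms` and the reduced array size are my own choices, not part of the original report, and the absolute numbers will differ by machine and by pandas/numpy version.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the 1,000,000 elements in the report so the script finishes quickly.
arr = np.random.randint(1, 10, size=100_000)
s = pd.Series(arr)


def best_ms(fn, repeat=3):
    """Best wall-clock time of `fn` over `repeat` single runs, in milliseconds."""
    return min(timeit.repeat(fn, number=1, repeat=repeat)) * 1e3


timings = {
    "s.notnull().all()": best_ms(lambda: s.notnull().all()),
    "s.astype(str)": best_ms(lambda: s.astype(str)),
    "s.values.astype(str)": best_ms(lambda: s.values.astype(str)),
}

for label, ms in timings.items():
    print(f"{label:<22} {ms:10.3f} ms")
```

No expected output is shown on purpose: the ratio between the last two lines is exactly what this issue is about, and it has changed across releases.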

jreback added the Performance, Good as first PR, and Strings labels on Nov 4, 2014
vikram pushed a commit to vikram/pandas that referenced this issue Nov 29, 2014
vikram commented Nov 29, 2014

The time is actually not spent checking for nulls, but in ensuring that every element returned is a string.

If you call s.values.astype(str), what you get back is an object holding int. This is numpy doing the conversion, whereas pandas iterates over each item and calls str(item) on it.
So if you call s.astype(str), you get back an object holding str.
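
A quick way to see what each path actually returns is to inspect the dtypes directly. This is only a sketch with a three-element Series of my own choosing. On current numpy the array result seems to be a fixed-width string dtype rather than an object array of ints, so the description above may be version dependent; the pandas result dtype also varies between releases.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.array([1, 2, 3]))

via_numpy = s.values.astype(str)   # conversion done by numpy
via_pandas = s.astype(str)         # conversion done by pandas

# Print dtypes rather than assume them, since they differ across versions.
print("numpy result dtype: ", via_numpy.dtype)
print("pandas result dtype:", via_pandas.dtype)
print("pandas element type:", type(via_pandas.iloc[0]))
```

The difference that matters for this issue is the container: numpy packs the strings into a fixed-width buffer, while pandas ends up with individual Python `str` objects.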

https://github.com/pydata/pandas/blob/master/pandas/lib.pyx#L866

So I don't think it can be fixed if we still want to return an object holding str.

Potentially https://github.com/pydata/pandas/blob/master/pandas/lib.pyx#L843
can be improved. If the array doesn't have nulls and is not datelike (the is_datelike check),
then instead of iterating, we can just return arr.astype(new_dtype).
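
Roughly, the idea could look like the sketch below. `astype_str_sketch` is a hypothetical helper written for this thread, not the code in lib.pyx; the `is_datelike` test here is a simple dtype-kind check, and the final `.astype(object)` is my assumption about how to keep the "object holding str" result. I have only checked that the fast path matches `str(item)` for small integers.

```python
import numpy as np
import pandas as pd


def astype_str_sketch(arr):
    """Hypothetical illustration only, not pandas' actual implementation."""
    arr = np.asarray(arr)
    is_datelike = arr.dtype.kind in "mM"   # timedelta64 / datetime64
    mask = pd.isna(arr)

    if not is_datelike and not mask.any():
        # Fast path: let numpy convert in bulk, then box into an object
        # array so the elements are ordinary Python str objects.
        return arr.astype(str).astype(object)

    # Slow path: per-element str(), leaving missing values untouched.
    return np.array(
        [x if is_null else str(x) for x, is_null in zip(arr, mask)],
        dtype=object,
    )


fast = astype_str_sketch(np.array([1, 2, 3]))
slow = astype_str_sketch(np.array([1.0, np.nan]))
print(fast, fast.dtype)
print(slow, slow.dtype)
```

Whether the fast path is a net win depends on how much the boxing step costs relative to the per-item loop, which would need measuring.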

I can sort out a pull request if there is interest.

jbrockmendel commented

s.values.astype(str) is now slightly slower than s.astype(str) (331ms vs 309ms locally). Closing.
