Series StringMethods very slow #2602

Closed
jim22k opened this Issue Dec 27, 2012 · 1 comment

jim22k commented Dec 27, 2012

I understand the benefit of the Series.str methods, which automatically handle NA, but the implementation seems really slow.

>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series(['abcdefg', np.nan]*500000)
>>> timeit s.str[:5]
1 loops, best of 3: 2.55 s per loop
>>> timeit s.map(lambda row: row[:5], na_action='ignore')
1 loops, best of 3: 558 ms per loop

Looking at the code, the difference seems to be that Series.map with na_action='ignore' uses vectorized code to filter out the NA values, while Series.str uses the _na_map function, which runs a try/except for each item in the Series (non-vectorized).
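
To make the contrast concrete, here is a rough sketch of the two strategies as I understand them. The function names `slice_per_item` and `slice_masked` are made up for illustration and are not the actual pandas internals; the sketch also assumes a pandas version that provides `pd.isna`.

```python
import numpy as np
import pandas as pd


def slice_per_item(values, stop):
    """Hypothetical per-element strategy: guard every call with try/except."""
    out = np.empty(len(values), dtype=object)
    for i, x in enumerate(values):
        try:
            out[i] = x[:stop]
        except TypeError:
            # NaN is a float and cannot be sliced, so pass it through unchanged.
            out[i] = x
    return out


def slice_masked(values, stop):
    """Hypothetical mask-first strategy: find NA once, then touch only valid items."""
    values = np.asarray(values, dtype=object)
    mask = pd.isna(values)  # one vectorized pass over the whole array
    out = np.empty(len(values), dtype=object)
    out[mask] = np.nan
    out[~mask] = [x[:stop] for x in values[~mask]]
    return out
```

Both functions should return the same object array; the second one just avoids raising and catching an exception for every NA entry.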

Can I make a request to eliminate the _na_map in favor of something more like Series.map(na_action='ignore')?

wesm closed this in 016b320 Dec 28, 2012

Owner

wesm commented Dec 28, 2012

Thanks for pointing this out. I changed the impl and am getting ~282ms now vs. 1.76s originally on your example.
