string getitem methods are slow #4694

Closed
hayd opened this issue Aug 27, 2013 · 12 comments
Labels: Performance (memory or execution speed performance), Strings (string extension data type and string data)

Comments

@hayd
Contributor

hayd commented Aug 27, 2013

related #2802

It seems that str[1] is significantly slower than .apply(lambda x: x[1])

See this Stack Overflow answer: http://stackoverflow.com/a/18473330/1240268
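A minimal sketch for reproducing the comparison, assuming a Series of short ASCII strings that are all long enough to index at position 1. The sample data, its size, and the helper name `time_call` are made up for illustration, and the timings will vary with the pandas version and the machine:

```python
import timeit

import pandas as pd

# Hypothetical test data: 100,000 short strings, all long enough to index at 1.
s = pd.Series(["abcdef", "ghijkl", "mnopqr", "stuvwx"] * 25_000)


def time_call(func, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` single runs."""
    return min(timeit.repeat(func, number=1, repeat=repeat))


via_str = s.str[1]
via_apply = s.apply(lambda x: x[1])

# Both spellings should produce the same values when no string is too short.
print(via_str.tolist() == via_apply.tolist())

print("str[1]:", time_call(lambda: s.str[1]))
print("apply: ", time_call(lambda: s.apply(lambda x: x[1])))
```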

@cpcloud
Member

cpcloud commented Aug 27, 2013

Couple of reasons:

  1. the mapped function is actually lambda x: x[i] if len(x) > i else np.nan
  2. isnull is called to compute a mask for mapping over in lib.map_infer_mask (which is in inference.pyx)
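The two costs listed above can be emulated in plain Python. This is a rough sketch, not the pandas source: `guarded_get`, `map_with_mask`, and the use of `None` as the missing-value marker are stand-ins invented for illustration.

```python
NAN = float("nan")


def guarded_get(i):
    """Build the bounds-checked getter described in point 1."""
    return lambda x: x[i] if len(x) > i else NAN


def map_with_mask(values, func):
    """Emulate point 2: compute a null mask first, then map only non-null items."""
    mask = [v is None for v in values]  # stand-in for isnull
    return [NAN if is_null else func(v) for v, is_null in zip(values, mask)]


data = ["abc", "d", None, "xyz"]

# str[1]-style path: a mask pass plus a length check on every element.
safe = map_with_mask(data, guarded_get(1))

# apply-style path: no mask and no length check, so it does less work but
# raises on short strings or missing values.
try:
    unsafe = [x[1] for x in data]
except (IndexError, TypeError) as exc:
    unsafe = exc
```

The extra work buys safety: the guarded path returns NaN for `"d"` and for the missing value, while the bare indexing path stops at the first string that is too short.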

looks like the perf hit is about 2x, might be able to squash that by moving string methods to cython

@hayd
Contributor Author

hayd commented Aug 27, 2013

Ah, that'll do it (I guess it's only sometimes apply doesn't care about errors?).

maybe cythonizing these is the way forward, I guess even with object dtype you get some perf improvement.

@jreback
Contributor

jreback commented Aug 27, 2013

these methods could be much faster (there is an issue out there about this) if you basically push everything down to native C calls (e.g. strcmp and the like), or maybe add a nice C library to the mix

just cythonizing doesn't help much

but this would be a bit of work

@cpcloud
Member

cpcloud commented Aug 27, 2013

wonder if this is worth looking into: http://bstring.sourceforge.net/

@jtratner
Contributor

If you're going to C level, better to use a C library that handles strings / unicode for you so we don't have to worry as much about all the gotchas with C strings.

@cpcloud
Member

cpcloud commented Aug 27, 2013

darn, bstring doesn't support unicode

@cpcloud
Member

cpcloud commented Aug 27, 2013

Converting these functions to C without breakage is going to be very difficult. You'll probably have to use ICU and have a compatibility layer between Cython (PyICU might make this a bit easier) and ICU.

We definitely cannot use the C standard library string functions since they don't handle Unicode.
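A small illustration of the problem, using Python's own `bytes` type as a stand-in for the byte buffer a C routine such as strlen would see. The sample string is arbitrary, and UTF-8 is assumed as the encoding:

```python
text = "na\u00efve"  # "naïve", with a precomposed i-diaeresis
raw = text.encode("utf-8")  # the bytes a byte-oriented C routine would see

# Code point count versus byte count: the accented letter takes two bytes.
print(len(text), len(raw))

# Indexing by code point gives the character a user expects...
third_char = text[2]

# ...while indexing by byte lands in the middle of a multi-byte sequence.
third_byte = raw[2:3]
try:
    third_byte.decode("utf-8")
    decodes_cleanly = True
except UnicodeDecodeError:
    decodes_cleanly = False
```

Any C-level replacement for `str[i]` would need to index by code point, not by byte, to match the existing behaviour.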

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Feb 18, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 1, 2015
@brandon-rhodes
Contributor

Is anyone working on this currently? Would I be duplicating effort if I were to look into possible quick wins for at least getting these slow str routines a bit faster?

@jreback
Contributor

jreback commented May 9, 2016

@brandon-rhodes don't think so. would be great!

prob DO need some asv benchmarks for these.
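A sketch of what such a benchmark might look like, following the usual asv conventions (a `setup` method, `time_`-prefixed methods, and `params` / `param_names` attributes). The class name, parameter values, and sample data are hypothetical and not taken from the pandas benchmark suite:

```python
import pandas as pd


class StrGetItem:
    """Hypothetical asv benchmark comparing the two spellings from this issue."""

    # asv runs setup() and each time_* method once per parameter value.
    params = [10_000, 100_000]
    param_names = ["n"]

    def setup(self, n):
        # Every string is long enough to index at position 1.
        self.s = pd.Series(["abcdef"] * n)

    def time_str_getitem(self, n):
        self.s.str[1]

    def time_apply_lambda(self, n):
        self.s.apply(lambda x: x[1])
```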

@datapythonista datapythonista modified the milestones: Contributions Welcome, Someday Jul 8, 2018
@3vts
Contributor

3vts commented Feb 15, 2020

Is this still an ongoing effort? I would like to give it a try

@brandon-rhodes
Contributor

@3vts Feel free to give it a try! I did not wind up with time to make progress on it, and my guess is that the project I was on that needed the extra performance found a workaround. To be honest, I had, alas, forgotten all about it in the intervening years.

@mroeschke mroeschke removed this from the Someday milestone Oct 13, 2022
@mroeschke
Member

This is fairly fast with the new pyarrow string type, which has a lot of benefits over the object-based string implementation, so closing for now. Can reopen if there are specific hotspots that are addressable.

8 participants