Correct tokenization with multi-character split
#9585
Conversation
* Fixes #9538
* Adds new test cases for previously-failing uses
* Tested on Python 2.7 with Unicode and non-Unicode strings, and on Python 3.5
What's the performance impact of this change?

Code paths for the previously-supported use cases in Python 2 aren't changed. Benchmarking Python 3 on a ~1MB text string with substantial replacement done, using the default (single-character) split:

New version:

Old version:

So no real impact. This should be expected, as most of the work is done by the underlying string operations. I'll update shortly with a comparison using a multi-character split.
* Python 2, multi-character (4), non-Unicode, new version:
* Python 2, single-character, non-Unicode, old version:
* Python 2, multi-character (4), Unicode, old OR new version:
* Python 2, multi-character (50), Unicode, old OR new version:
* Python 2, single-character, Unicode, old OR new version:
* Python 3, multi-character (4), new version:
* Python 3, multi-character (50), new version:
* Python 3, single-character, old version:
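For reference, a benchmark along these lines could be set up roughly as follows. This is only a sketch: the corpus size, filter characters, split tokens, and the `tokenize` helper are assumptions for illustration, not the exact code benchmarked above.

```python
import random
import string
import timeit

# Build a roughly 1MB comma-separated text of short random words.
words = [''.join(random.choices(string.ascii_lowercase, k=5)) for _ in range(10)]
text = ','.join(random.choice(words) for _ in range(200_000))


def tokenize(text, split):
    # Map each filtered character's code point to the split string,
    # then split on it; a dict works for multi-character split tokens.
    table = {ord(c): split for c in '!,.;'}
    return [t for t in text.translate(table).split(split) if t]


# Compare single-character vs multi-character split tokens.
print(timeit.timeit(lambda: tokenize(text, ' '), number=3))
print(timeit.timeit(lambda: tokenize(text, '-->'), number=3))
```

Any difference between the two timings would mostly reflect the cost of building and scanning longer intermediate strings, not the translation table itself.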
The takeaway here is that using multi-character splits is definitely slower, but probably just as a result of building/reading longer strings rather than as a result of these changes. Users should probably prefer single-character splits where possible. There's a bit of weirdness in the numbers (a couple of results weren't what I expected). Let me know if there is any other information I can provide.
Thanks for running the benchmark. The changes look good to me.
The issue was caused by incorrect usage of `maketrans` with a multi-character split string. The Python 3 implementation was updated to use a dictionary for translation of all strings. Python 2 doesn't support this for non-Unicode strings, so I added an additional case that uses `replace` instead.
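The dictionary-based approach can be sketched roughly as follows. The function and filter set are illustrative stand-ins, not the exact Keras implementation: the point is that `str.maketrans(x, y)` requires `len(x) == len(y)`, so each filtered character can only map to a single character, whereas a dict passed to `str.translate` can map a code point to a string of any length.

```python
filters = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
split = '--'  # a multi-character split token

# A dict mapping code points to strings; this is what the two-string
# form of str.maketrans cannot express for a multi-character split.
translate_dict = {ord(c): split for c in filters}


def text_to_word_sequence(text, lower=True):
    """Illustrative sketch: tokenize text using a multi-character split."""
    if lower:
        text = text.lower()
    text = text.translate(translate_dict)
    return [t for t in text.split(split) if t]


print(text_to_word_sequence('hello,world!'))  # ['hello', 'world']
```

On Python 2, `str.translate` for byte strings only accepts a 256-character mapping table, which is why the non-Unicode path falls back to repeated `replace` calls instead.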