Correct tokenization with multi-character `split` #9585

connor-bracewell · 2018-03-07T17:40:54Z

Fixes Tokenization crashes when split string has more than one character #9538
Adds new test cases for previously-failing uses
Tested on Python 2.7 with Unicode and non-Unicode strings, and on Python 3.5

Issue was caused by incorrect usage of maketrans with a multi-character split string. The Python3 implementation was updated to use a dictionary for translation of all strings. Python2 doesn't support this for non-Unicode strings, so I added an additional case that uses replace instead.

* Fixes #9538 * Adds new test cases for previously-failing uses * Tested on Python 2.7 with Unicode and non-Unicode strings, and on Python 3.5

fchollet · 2018-03-07T21:36:54Z

What's the performance impact of this change?

connor-bracewell · 2018-03-07T22:29:19Z

Code paths for the previously-supported use cases in Python2 aren't changed.

Benchmarking Python3 on a ~1MB text string with substantial replacement done, using the default (single-character) split gives:

New Version:

---------------------------------------------- benchmark: 1 tests ---------------------------------------------
Name (time in ms)         Min      Max     Mean  StdDev   Median     IQR  Outliers      OPS  Rounds  Iterations
---------------------------------------------------------------------------------------------------------------
test                  36.5339  46.0856  39.8815  2.7605  38.7632  4.4961       6;0  25.0743      20           1
---------------------------------------------------------------------------------------------------------------

Old Version:

---------------------------------------------- benchmark: 1 tests ---------------------------------------------
Name (time in ms)         Min      Max     Mean  StdDev   Median     IQR  Outliers      OPS  Rounds  Iterations
---------------------------------------------------------------------------------------------------------------
test                  38.6402  43.9231  40.6444  1.4219  40.2049  1.5261       8;1  24.6037      23           1
---------------------------------------------------------------------------------------------------------------

So no real impact.

This should be expected as most of the work will be done by translate in both the new and old versions, while this change only affects the input to maketrans.

I'll update shortly with a comparison for using multi-character split values vs. single-character on both versions.

connor-bracewell · 2018-03-07T23:27:57Z

Python2, multi-character(4), non-Unicode, new version:

---------------------------------------------- benchmark: 1 tests ---------------------------------------------
Name (time in ms)         Min      Max     Mean  StdDev   Median     IQR  Outliers      OPS  Rounds  Iterations
---------------------------------------------------------------------------------------------------------------
test                  55.7990  73.9241  63.7629  6.0415  65.3000  9.1802       6;0  15.6831      15           1
---------------------------------------------------------------------------------------------------------------

Python2, single-character, non-Unicode, old version:

---------------------------------------------- benchmark: 1 tests ---------------------------------------------
Name (time in ms)         Min      Max     Mean  StdDev   Median     IQR  Outliers      OPS  Rounds  Iterations
---------------------------------------------------------------------------------------------------------------
test                  45.1980  54.1379  48.7416  2.8647  47.3619  4.6909       6;0  20.5164      18           1
---------------------------------------------------------------------------------------------------------------

Python2, multi-character(4), Unicode, old OR new version

------------------------------------------------ benchmark: 1 tests -----------------------------------------------
Name (time in ms)          Min       Max      Mean  StdDev    Median      IQR  Outliers     OPS  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------
test                  216.8522  243.5811  228.5402  9.9451  229.2888  11.8578       2;0  4.3756       5           1
-------------------------------------------------------------------------------------------------------------------

Python2, multi-character(50), Unicode, old OR new version

------------------------------------------------ benchmark: 1 tests -----------------------------------------------
Name (time in ms)          Min       Max      Mean   StdDev    Median     IQR  Outliers     OPS  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------
test                  271.2440  303.5090  288.3976  11.4776  288.5911  9.7762       2;0  3.4674       5           1
-------------------------------------------------------------------------------------------------------------------

Python2, single-character, Unicode, old OR new version

------------------------------------------------ benchmark: 1 tests ------------------------------------------------
Name (time in ms)          Min       Max      Mean   StdDev    Median      IQR  Outliers     OPS  Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------
test                  211.6311  237.6649  223.2730  10.1113  225.0071  14.2470       2;0  4.4788       5           1
--------------------------------------------------------------------------------------------------------------------

Python3, multi-character(4), new version:

----------------------------------------------- benchmark: 1 tests -----------------------------------------------
Name (time in ms)          Min       Max      Mean  StdDev    Median     IQR  Outliers     OPS  Rounds  Iterations
------------------------------------------------------------------------------------------------------------------
test                  204.7108  212.9554  209.4802  3.0667  209.5287  3.5677       2;0  4.7737       5           1
------------------------------------------------------------------------------------------------------------------

Python3, multi-character(50), new version:

------------------------------------------------ benchmark: 1 tests ------------------------------------------------
Name (time in ms)          Min       Max      Mean   StdDev    Median      IQR  Outliers     OPS  Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------
test                  334.6331  377.7274  354.4465  15.8379  354.4966  19.1653       2;0  2.8213       5           1
--------------------------------------------------------------------------------------------------------------------

Python3, single-character, old version

---------------------------------------------- benchmark: 1 tests ---------------------------------------------
Name (time in ms)         Min      Max     Mean  StdDev   Median     IQR  Outliers      OPS  Rounds  Iterations
---------------------------------------------------------------------------------------------------------------
test                  38.7690  49.8361  42.1896  3.2401  40.5851  4.7968       5;0  23.7026      23           1
---------------------------------------------------------------------------------------------------------------

The takeaway here is that using multi-character splits is definitely slower, but probably just as a result of building/reading longer strings rather than as the result of these changes. Users should probably prefer to use single-character split values where possible (and seeing as this issue wasn't previously discovered, I suspect most already are).

There's a bit of weirdness (I expected the replace implementation for multi-character split on Python2 to be slower, not faster, than the translate Python3 implementation), but I don't suspect it will have user impact.

Let me know if there is any other information I can provide.

fchollet

Thanks for running the benchmark. The changes look good to me.

Correct tokenization with multi-character split

0cb8b03

* Fixes #9538 * Adds new test cases for previously-failing uses * Tested on Python 2.7 with Unicode and non-Unicode strings, and on Python 3.5

fchollet approved these changes Mar 7, 2018

View reviewed changes

fchollet merged commit 70ad0d6 into keras-team:master Mar 7, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Correct tokenization with multi-character `split` #9585

Correct tokenization with multi-character `split` #9585

connor-bracewell commented Mar 7, 2018

fchollet commented Mar 7, 2018

connor-bracewell commented Mar 7, 2018

connor-bracewell commented Mar 7, 2018 •

edited

fchollet left a comment

Correct tokenization with multi-character split #9585

Correct tokenization with multi-character split #9585

Conversation

connor-bracewell commented Mar 7, 2018

fchollet commented Mar 7, 2018

connor-bracewell commented Mar 7, 2018

connor-bracewell commented Mar 7, 2018 • edited

fchollet left a comment

Choose a reason for hiding this comment

Correct tokenization with multi-character `split` #9585

Correct tokenization with multi-character `split` #9585

connor-bracewell commented Mar 7, 2018 •

edited