Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correct tokenization with multi-character split #9585

Merged
merged 1 commit into from
Mar 7, 2018
Merged

Correct tokenization with multi-character split #9585

merged 1 commit into from
Mar 7, 2018

Conversation

connor-bracewell
Copy link
Contributor

Issue was caused by incorrect usage of maketrans with a multi-character split string. The Python3 implementation was updated to use a dictionary for translation of all strings. Python2 doesn't support this for non-Unicode strings, so I added an additional case that uses replace instead.

* Fixes #9538
* Adds new test cases for previously-failing uses
* Tested on Python 2.7 with Unicode and non-Unicode strings, and on Python 3.5
@fchollet
Copy link
Member

fchollet commented Mar 7, 2018

What's the performance impact of this change?

@connor-bracewell
Copy link
Contributor Author

Code paths for the previously-supported use cases in Python2 aren't changed.

Benchmarking Python3 on a ~1MB text string with substantial replacement done, using the default (single-character) split gives:

New Version:

---------------------------------------------- benchmark: 1 tests ---------------------------------------------
Name (time in ms)         Min      Max     Mean  StdDev   Median     IQR  Outliers      OPS  Rounds  Iterations
---------------------------------------------------------------------------------------------------------------
test                  36.5339  46.0856  39.8815  2.7605  38.7632  4.4961       6;0  25.0743      20           1
---------------------------------------------------------------------------------------------------------------

Old Version:

---------------------------------------------- benchmark: 1 tests ---------------------------------------------
Name (time in ms)         Min      Max     Mean  StdDev   Median     IQR  Outliers      OPS  Rounds  Iterations
---------------------------------------------------------------------------------------------------------------
test                  38.6402  43.9231  40.6444  1.4219  40.2049  1.5261       8;1  24.6037      23           1
---------------------------------------------------------------------------------------------------------------

So no real impact.

This should be expected as most of the work will be done by translate in both the new and old versions, while this change only affects the input to maketrans.

I'll update shortly with a comparison for using multi-character split values vs. single-character on both versions.

@connor-bracewell
Copy link
Contributor Author

connor-bracewell commented Mar 7, 2018

Python2, multi-character(4), non-Unicode, new version:

---------------------------------------------- benchmark: 1 tests ---------------------------------------------
Name (time in ms)         Min      Max     Mean  StdDev   Median     IQR  Outliers      OPS  Rounds  Iterations
---------------------------------------------------------------------------------------------------------------
test                  55.7990  73.9241  63.7629  6.0415  65.3000  9.1802       6;0  15.6831      15           1
---------------------------------------------------------------------------------------------------------------

Python2, single-character, non-Unicode, old version:

---------------------------------------------- benchmark: 1 tests ---------------------------------------------
Name (time in ms)         Min      Max     Mean  StdDev   Median     IQR  Outliers      OPS  Rounds  Iterations
---------------------------------------------------------------------------------------------------------------
test                  45.1980  54.1379  48.7416  2.8647  47.3619  4.6909       6;0  20.5164      18           1
---------------------------------------------------------------------------------------------------------------

Python2, multi-character(4), Unicode, old OR new version

------------------------------------------------ benchmark: 1 tests -----------------------------------------------
Name (time in ms)          Min       Max      Mean  StdDev    Median      IQR  Outliers     OPS  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------
test                  216.8522  243.5811  228.5402  9.9451  229.2888  11.8578       2;0  4.3756       5           1
-------------------------------------------------------------------------------------------------------------------

Python2, multi-character(50), Unicode, old OR new version

------------------------------------------------ benchmark: 1 tests -----------------------------------------------
Name (time in ms)          Min       Max      Mean   StdDev    Median     IQR  Outliers     OPS  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------
test                  271.2440  303.5090  288.3976  11.4776  288.5911  9.7762       2;0  3.4674       5           1
-------------------------------------------------------------------------------------------------------------------

Python2, single-character, Unicode, old OR new version

------------------------------------------------ benchmark: 1 tests ------------------------------------------------
Name (time in ms)          Min       Max      Mean   StdDev    Median      IQR  Outliers     OPS  Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------
test                  211.6311  237.6649  223.2730  10.1113  225.0071  14.2470       2;0  4.4788       5           1
--------------------------------------------------------------------------------------------------------------------

Python3, multi-character(4), new version:

----------------------------------------------- benchmark: 1 tests -----------------------------------------------
Name (time in ms)          Min       Max      Mean  StdDev    Median     IQR  Outliers     OPS  Rounds  Iterations
------------------------------------------------------------------------------------------------------------------
test                  204.7108  212.9554  209.4802  3.0667  209.5287  3.5677       2;0  4.7737       5           1
------------------------------------------------------------------------------------------------------------------

Python3, multi-character(50), new version:

------------------------------------------------ benchmark: 1 tests ------------------------------------------------
Name (time in ms)          Min       Max      Mean   StdDev    Median      IQR  Outliers     OPS  Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------
test                  334.6331  377.7274  354.4465  15.8379  354.4966  19.1653       2;0  2.8213       5           1
--------------------------------------------------------------------------------------------------------------------

Python3, single-character, old version

---------------------------------------------- benchmark: 1 tests ---------------------------------------------
Name (time in ms)         Min      Max     Mean  StdDev   Median     IQR  Outliers      OPS  Rounds  Iterations
---------------------------------------------------------------------------------------------------------------
test                  38.7690  49.8361  42.1896  3.2401  40.5851  4.7968       5;0  23.7026      23           1
---------------------------------------------------------------------------------------------------------------

The takeaway here is that using multi-character splits is definitely slower, but probably just as a result of building/reading longer strings rather than as the result of these changes. Users should probably prefer to use single-character split values where possible (and seeing as this issue wasn't previously discovered, I suspect most already are).

There's a bit of weirdness (I expected the replace implementation for multi-character split on Python2 to be slower, not faster, than the translate Python3 implementation), but I don't suspect it will have user impact.

Let me know if there is any other information I can provide.

Copy link
Member

@fchollet fchollet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for running the benchmark. The changes look good to me.

@fchollet fchollet merged commit 70ad0d6 into keras-team:master Mar 7, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants