Parser performance with unicode #73

Closed
zmoazeni opened this Issue Mar 19, 2013 · 10 comments

3 participants

@zmoazeni

Hi folks, I'm really liking parslet, but I'm having trouble trying to optimize my parser. All the information is added to this gist.

I have a crude CSS parser, and its performance with unicode characters is really confusing. Given this example CSS file, it parses the file in roughly 6 seconds on my machine.

I generated it from the following:

File.open("input-good-perf.css", "w") {|f| 5000.times {|i| f.puts("/* foo - #{i} */"); f.puts("#bar { border: #{i}px}")}};

However, when we introduce a single unicode character per line, it takes about 1 minute and 12 seconds to parse!

I generated that file from the following:

File.open("input-bad-perf.css", "w") {|f| 5000.times {|i| f.puts("/* foo – #{i} */"); f.puts("#bar { border: #{i}px}")}};

(Note that is using a unicode "–", U+2013, an "en dash")

And finally, if I combine the two, I get mixed results:

File.open("both.css", "w") {|f| f.puts("/* foo – 0 */"); f.puts("#bar { border: 0px}"); 4999.times {|i| f.puts("/* foo – #{i+1} */"); f.puts("#bar { border: #{i+1}px}")}};

That is a single unicode character with all the rest ASCII, and the same number of lines. That file parses in 45 seconds. (I didn't upload this file.)

I also included the css_runner.rb I'm using against these files. And I verified that the contents of the files are read in as UTF-8.

(rdb:1) contents.encoding
#<Encoding:UTF-8>

Any idea what is causing the performance to tank so dramatically with unicode? I'm running MRI 1.9.3-p385 on a Macbook Air.

@zmoazeni

Tonight I decided to install ruby v2.0.0-p0 and see if that changed the perf numbers at all. It did, for the worse:

  • only_ascii.css: went from ~6sec in 1.9.3 to ~13sec in 2.0.
  • both.css (1 unicode character in the entire file from above): went from 45sec in 1.9.3 to ~3min in 2.0!

Crazy. I decided to also do some ruby-prof profiling for both files in both ruby versions, and I've uploaded the PNGs below.

The % ratios look the same between ruby versions (e.g. the wall % time in 1.9.3 vs 2.0.0 doesn't look much different for the same file). However "only-ascii" vs "both" (1 unicode character) shows a big difference in String#index and String#slice. String#index takes up 10% of the time in 1.9.3-ascii and 46% of the time in 1.9.3-both.

I generated these by running:

gem install ruby-prof
brew install graphviz
ruby-prof --mode=wall -p dot --file=both-prof.dot -m 3 ./test/just_parse2.rb tmp/both.css
dot -T png -o both-prof-1-9-3-p385-min-3.png both-prof.dot

Attached profiling graphs:

  • both-prof-1-9-3-p385-min-3.png
  • both-prof-2-0-0-p0-min-3.png
  • only-ascii-prof-1-9-3-p385-min-3.png
  • only-ascii-prof-2-0-0-p0-min-3.png

@zmoazeni

I did some more testing without parslet and added it to a gist. String#index behaves quite differently when there's a unicode character in the mix.
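A minimal standalone benchmark (my own sketch, not the gist's exact code) makes the difference easy to reproduce: two otherwise identical strings, one ending in a single multibyte character, searched with String#index from a character offset.

```ruby
require "benchmark"

ascii = "a" * 100_000
multi = ("a" * 99_999) + "–"  # a single U+2013 makes the string multibyte

# With a byte-compatible (all-ASCII) string, the character offset maps to a
# byte offset in constant time. With any multibyte character present, MRI
# must walk the string to translate the offset, so each call is O(n).
Benchmark.bm(10) do |x|
  x.report("ascii")     { 1_000.times { ascii.index("zzz", 50_000) } }
  x.report("multibyte") { 1_000.times { multi.index("zzz", 50_000) } }
end
```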

@zmoazeni

Received an immediate response on https://bugs.ruby-lang.org/issues/8129

When all the characters in a string are ASCII characters (single bytes), the byte index for any given character can be calculated in constant time.
When the string contains multibyte characters, finding the byte index given a character index becomes O(n).
If you need fast character indexing, try splitting the string into an array of characters.
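For anyone following along, the array-of-characters suggestion looks like this (a toy illustration, not parslet code): pay the O(n) split once, then every character access is plain O(1) array indexing.

```ruby
str = "/* foo – 0 */"

chars = str.chars        # one-time O(n) split into characters
chars[7]                 # => "–" (O(1) array access, multibyte or not)
chars[3..5].join         # => "foo"
```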

@charliesome

@zmoazeni or as nobu suggested, transcode to UTF-32LE first as it is a fixed width encoding.
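To sketch nobu's suggestion (my own illustration): in a fixed-width encoding like UTF-32LE, every character is 4 bytes, so character-to-byte offset translation is constant time. The catch, which bites below, is that both operands of String#index must share a compatible encoding.

```ruby
s = "foo – bar".encode("UTF-32LE")

# Mixing encodings raises an error, because a UTF-8 pattern
# can't be matched against a UTF-32LE string:
begin
  s.index("–")  # pattern literal is UTF-8
rescue Encoding::CompatibilityError => e
  puts e.message
end

s.index("–".encode("UTF-32LE"))  # => 4, both operands UTF-32LE
```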

@zmoazeni

Thanks @charliesome - I accidentally missed nobu's comment when updating this thread. UTF-32LE doesn't work for my use case since it isn't a compatible encoding with UTF-8: I'm seeing "incompatible character encodings: UTF-32LE and UTF-8 (Encoding::CompatibilityError)". So I'm going to work on a pull request updating parslet to not depend on String#index, probably by trying the String#chars approach first.

@nobu

I recommend using StringScanner.
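A minimal sketch of the StringScanner approach (hypothetical regexps, not the ones parslet ended up using): the scanner keeps an internal byte-offset cursor and matches regexps at that position, so it never needs to translate character indices at all.

```ruby
require "strscan"

# The scanner advances its own cursor as each pattern matches,
# so no character-index arithmetic is ever needed.
scanner  = StringScanner.new("#bar { border: 12px}")
selector = scanner.scan(/#\w+/)     # => "#bar"
scanner.skip(/\s*\{\s*/)            # consume " { "
property = scanner.scan(/[\w-]+/)   # => "border"
```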

@zmoazeni

@nobu Thanks for the tip!

@zmoazeni zmoazeni added a commit to zmoazeni/parslet that referenced this issue Mar 21, 2013
@zmoazeni zmoazeni Optimizes multibyte parsing
String#index and String#[] have terrible performance, O(n), when multibyte
characters are included in the string. So this uses the stdlib
StringScanner and uses regexp for all matching.

Relevant links:

* https://bugs.ruby-lang.org/issues/8129
* #73 - kschiess#73
ad50b0e
@zmoazeni

Going to keep the discussion here, since the rest of the comments are here. With the PR #74, all 3 files I mentioned above parse in roughly the same time.

@zmoazeni zmoazeni added a commit to zmoazeni/parslet that referenced this issue Mar 21, 2013
@zmoazeni zmoazeni Optimizes multibyte parsing
String#index and String#[] have terrible performance, O(n), when multibyte
characters are included in the string. So this uses the stdlib
StringScanner and uses regexp for all matching.

Relevant links:

* https://bugs.ruby-lang.org/issues/8129
* #73 - kschiess#73

Big thanks to @nobu and @charliesome for their help and suggestions
d099533
@zmoazeni

Closing this now that #74 is merged.

@zmoazeni zmoazeni closed this Mar 24, 2013
@sideci-sample sideci-sample pushed a commit to sideci-sample/sideci-sample-csscss that referenced this issue Apr 23, 2014
@zmoazeni zmoazeni Using my optimized parslet branch
Waiting for this to be merged upstream before releasing this as a gem.

Related:

* kschiess/parslet#73
* kschiess/parslet#74
2b6dfac