[JRUBY-6668] StringScanner#scan_until spins forever on UTF-8 data #174

Closed
sgonyea opened this Issue May 17, 2012 · 8 comments


@sgonyea

While running the test suite of the Ruby library 'mustache' (https://github.com/defunkt/mustache), one test in particular fails:

https://github.com/defunkt/mustache/blob/master/test/mustache_test.rb#L510-522

JRuby hangs when calling StringScanner#scan_until here:

https://github.com/defunkt/mustache/blob/master/lib/mustache/parser.rb#L231

You can reproduce the issue with the following:

require 'strscan'
regex = /(^[ \t]*)?\{\{/
text = "<h1>中文 test</h1>\n\n{{> utf8_partial}}\n"
text.force_encoding 'BINARY'
scanner = StringScanner.new(text)
scanner.scan_until(regex) # Fans spin up, and this method never returns.

This happens regardless of whether JRuby is in 1.8 or 1.9 mode. I am running this test like so:

JRUBY_OPTS=--1.9 ruby -I"lib:test" test/mustache_test.rb -n test_utf8 -v

I've also run it with: JRUBY_OPTS="--1.9 LC_ALL=en_US.UTF-8"

It appears that this affects UTF-8 characters. If I replace the Chinese characters with "foo bar", then there is no problem.
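
For comparison, here is a small sketch of how the two inputs differ once they are forced to BINARY. It never calls scan_until, so it should be safe to run on an affected build. `describe` is a hypothetical helper added for illustration and is not part of the original report.

```ruby
# Compare the failing input with the ASCII-only variant that works.
# `describe` is a hypothetical helper, not part of mustache or JRuby.
utf8_text  = "<h1>中文 test</h1>\n\n{{> utf8_partial}}\n"
ascii_text = "<h1>foo bar test</h1>\n\n{{> utf8_partial}}\n"

def describe(str)
  # Work on a copy so the caller's string keeps its original encoding.
  binary = str.dup.force_encoding('BINARY')
  {
    binary:     binary.encoding == Encoding::BINARY,
    ascii_only: binary.ascii_only?,
    # Bytes >= 0x80 only occur in multibyte UTF-8 sequences.
    high_bytes: binary.bytes.count { |b| b >= 0x80 }
  }
end

p describe(utf8_text)   # six high bytes: two characters, three bytes each
p describe(ascii_text)  # no high bytes, ASCII-only
```

The only difference between the two strings after force_encoding is the presence of bytes above 0x7F, which fits the observation that the ASCII-only text does not trigger the hang.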

I moved this issue here, as JIRA was butchering the UTF-8:

http://jira.codehaus.org/browse/JRUBY-6668?jwupdated=35361&focusedCommentId=299014#comment-299014

@headius
JRuby Team member

Confirmed on master.

@headius
JRuby Team member

I suspect this is due to missing encoding logic in StringScanner.

@sgonyea

Yeah, I traced its execution to inside Joni:

https://github.com/jruby/joni/blob/master/src/org/joni/Matcher.java#L460-464

But I wasn't sure if it was a JRuby or a Joni bug. (http://jira.codehaus.org/browse/JRUBY-6668#comment-298976)

Thanks for checking into it.

@sgonyea

Though I'm inclined to call this a Joni bug as well. The code in Matcher.java probably shouldn't always assume that enc.length will be positive, given that it seems to return -1 in some cases.
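
To illustrate the hazard in general terms, here is a Ruby sketch. It is not Joni's code: `char_length` and `safe_advance` are hypothetical names, and the -1 return value simply mirrors what was observed above. A loop that advances by a per-character length can stall or move backwards if that length is ever zero or negative, so a guard can fall back to a one-byte step.

```ruby
# Hypothetical stand-in for "length of the character starting at pos".
# Returns -1 for a byte that cannot start a UTF-8 character.
# This is an illustration only, NOT Joni's implementation.
def char_length(bytes, pos)
  case bytes[pos]
  when 0x00..0x7F then 1
  when 0xC2..0xDF then 2
  when 0xE0..0xEF then 3
  when 0xF0..0xF4 then 4
  else -1 # continuation byte or invalid lead byte
  end
end

# Walk the byte array, always moving forward by at least one byte,
# and record the step taken at each position.
def safe_advance(bytes)
  pos = 0
  steps = []
  while pos < bytes.length
    len = char_length(bytes, pos)
    len = 1 if len < 1 # defensive check: never stall or step backwards
    steps << len
    pos += len
  end
  steps
end

p safe_advance("a\xE4\xB8\xADb".bytes) # well-formed input
p safe_advance("a\xADb".bytes)         # stray continuation byte
```

Without the `len < 1` guard, the second call would subtract one from `pos` at the stray byte and then revisit the same bytes indefinitely.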

@headius added a commit that closed this issue May 17, 2012
Fix #174
[JRUBY-6668] StringScanner#scan_until spins forever on UTF-8 data

We were not preparing the regex properly. Added that, and the
given example completes normally.
c550df6
@headius closed this in c550df6 May 17, 2012
@headius
JRuby Team member

FWIW, Joni is a little flaky when you feed it data that's not encoded like the pattern expects, and we've run into many cases where it will get stuck looping forever. It probably could use more defensive checks, but they might take away from raw speed...

@sgonyea

Wow, you are awesome. That was fast.

@headius
JRuby Team member

I got lucky :)

@headius
JRuby Team member

I didn't think about 1.8 mode with this patch, and ended up causing it to be more strict than it's supposed to be. Fix coming to revert the logic for 1.8 mode only.
