Valid UTF-8 input can cause infinite loop in JONI #17

haozhun · 2015-03-18T22:39:20Z

In #7, @electrum identified a location that can cause an infinite loop in JONI. It was marked as won't fix because the input can be sanitized beforehand and JONI assumes that its input is always valid.

When the pattern is "\uD800", it can be pre-sanitized, as you suggested in #7. But what if the pattern is "\\uD800"? How can the user sanitize that?

If JONI is willing to add a check, it would be the same fix as for #7: check whether the return value of enc.length is negative in OptExactInfo.concatStr.
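To make the sanitizing question concrete, here is a rough sketch of what a caller-side pre-check could look like. It is not part of JONI; `SurrogateEscapeScanner` and `findSurrogateEscapes` are hypothetical names, and the sketch conservatively flags any `\uXXXX` escape in the surrogate range U+D800..U+DFFF, paired or not.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of JONI: reports the offsets of \uXXXX
// escapes in a regex pattern whose value lies in U+D800..U+DFFF.
final class SurrogateEscapeScanner {

    static List<Integer> findSurrogateEscapes(String pattern) {
        List<Integer> offsets = new ArrayList<>();
        int i = 0;
        while (i < pattern.length()) {
            if (pattern.charAt(i) != '\\' || i + 1 >= pattern.length()) {
                i++;
                continue;
            }
            // A backslash always consumes the next character, so an escaped
            // backslash followed by "uD800" is not misread as an escape.
            if (pattern.charAt(i + 1) == 'u' && i + 6 <= pattern.length()) {
                int value = parseHex4(pattern, i + 2);
                if (value >= 0xD800 && value <= 0xDFFF) {
                    offsets.add(i);
                }
                if (value >= 0) {
                    i += 6; // skip the whole \uXXXX escape
                    continue;
                }
            }
            i += 2;
        }
        return offsets;
    }

    // Returns the value of the four hex digits starting at 'from', or -1 if malformed.
    private static int parseHex4(String s, int from) {
        int value = 0;
        for (int k = from; k < from + 4; k++) {
            int digit = Character.digit(s.charAt(k), 16);
            if (digit < 0) {
                return -1;
            }
            value = (value << 4) | digit;
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(findSurrogateEscapes("A\\uD800")); // escape starts at offset 1
        System.out.println(findSurrogateEscapes("A\\u0041")); // nothing to report
    }
}
```

A caller could reject the pattern, or fall back to a different encoding, whenever the returned list is non-empty. This only covers the `\uXXXX` form; other escape syntaxes would need their own handling.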

haozhun · 2015-03-26T00:52:34Z

In addition, \uD800\uDC00, which is a legal sequence, will also result in an infinite loop, because JONI considers every \uXXXX to be a code point on its own.
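For reference, this is how the two escapes would combine into a single code point if they were treated as a UTF-16 surrogate pair. It is a minimal sketch using only `java.lang.Character`; `SurrogatePairDemo` and its methods are hypothetical names, and it does not reflect how JONI parses escapes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows that a high/low surrogate pair denotes
// one supplementary code point, which has a well-formed UTF-8 encoding.
final class SurrogatePairDemo {

    // Returns the combined code point for a valid pair, or -1 otherwise.
    static int combine(char first, char second) {
        if (Character.isSurrogatePair(first, second)) {
            return Character.toCodePoint(first, second);
        }
        return -1;
    }

    // Number of bytes the code point occupies when encoded as UTF-8.
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        int cp = combine((char) 0xD800, (char) 0xDC00);
        System.out.printf("U+%X encodes to %d UTF-8 bytes%n", cp, utf8Length(cp));
    }
}
```

The pair maps to U+10000, which is a four-byte UTF-8 sequence, whereas neither half has a valid UTF-8 encoding by itself.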

guyboertje · 2016-04-26T14:26:55Z

@haozhun - can you show some jruby or java code that illustrates the endless loop?

headius · 2016-05-02T17:49:40Z

Note that in the past year we did add the ability to interrupt joni when it's stuck looping on bad input (or just large input/slow regex).
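As an illustration of the caller side, here is a generic sketch of bounding a search with a timeout and an interrupt. It uses only `java.util.concurrent`; `BoundedSearch`, `runWithTimeout`, `stuckSearch` and the `TIMED_OUT` sentinel are hypothetical, and the stand-in loop takes the place of a matcher call that honors interruption. Joni's own interruptible entry points are not shown here.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical wrapper: runs a search on a worker thread and interrupts it
// if it does not finish within the given time.
final class BoundedSearch {

    static final int TIMED_OUT = -2; // hypothetical sentinel value

    static int runWithTimeout(Callable<Integer> search, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Integer> future = executor.submit(search);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker thread
            return TIMED_OUT;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("caller was interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("search failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }

    // Stand-in for a search that never terminates but checks for interruption.
    static int stuckSearch() throws InterruptedException {
        while (true) {
            if (Thread.interrupted()) {
                throw new InterruptedException();
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(runWithTimeout(BoundedSearch::stuckSearch, 100));
    }
}
```

This only helps if the code being run actually checks the interrupt flag; a loop that never checks it would keep spinning after the timeout.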

@haozhun Can you propose a patch? @lopex would probably be the best one to review such a change.

haozhun · 2016-05-03T20:40:27Z

Java code that illustrates the infinite loop. It can be mitigated by using NonStrict... instead, as shown in the commented-out line.

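    import java.nio.charset.StandardCharsets;
    import org.jcodings.specific.NonStrictUTF8Encoding;
    import org.jcodings.specific.UTF8Encoding;
    import org.joni.Matcher;
    import org.joni.Option;
    import org.joni.Regex;
    import org.joni.Syntax;

    public static void main(String[] args)
    {
        // The pattern bytes spell the regex escape \uD800 (a lone surrogate), not a raw surrogate.
        byte[] pattern = "A\\uD800".getBytes(StandardCharsets.UTF_8);
        byte[] str = "AB".getBytes(StandardCharsets.UTF_8);
        Regex regex = new Regex(pattern, 0, pattern.length, Option.NEGATE_SINGLELINE, UTF8Encoding.INSTANCE, Syntax.Java);
        // Mitigation: build the regex with the non-strict encoding instead.
        // Regex regex = new Regex(pattern, 0, pattern.length, Option.NEGATE_SINGLELINE, NonStrictUTF8Encoding.INSTANCE, Syntax.Java);
        Matcher matcher = regex.matcher(str);
        int result = matcher.search(0, str.length, Option.DEFAULT);
        System.out.println(result);
    }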

Patch: #21

guyboertje · 2016-05-04T09:49:49Z

Ahh I see, this does not apply to JRuby (checked 1.7.24) because there is a range check.

raises RegexpError: invalid Unicode range: /A\uD800/

To avoid endless loop described here: jruby/joni#17 (cherry picked from commit 21619a0255e1facf7e1aaa5879ca36956b98e45a) GitOrigin-RevId: 7c698742fa33d97047e98b7bbee9e5307844712b

haozhun mentioned this issue Mar 26, 2015

Check the validity of all mbc created #21

Open

haozhun mentioned this issue Dec 6, 2017

Regular expression failure for certain strings prestodb/presto#8711

Closed

lopex mentioned this issue Oct 2, 2018

Unable to find org.jcodings.specific.BaseUTF8Encoding.mbcCaseFold jruby/jcodings#25

Closed

lopex mentioned this issue Apr 25, 2019

Implement approximate length and other length routines for proper broken character processing jruby/jcodings#26

Open

SergeyZh pushed a commit to JetBrains/intellij-community that referenced this issue Feb 11, 2020

TextMate: use non-strict utf-8 encoding (IDEA-232576)

ce70f6f

To avoid endless loop described here: jruby/joni#17 GitOrigin-RevId: 21619a0255e1facf7e1aaa5879ca36956b98e45a
