Valid UTF-8 input can cause infinite loop in JONI #17

Open
haozhun opened this issue Mar 18, 2015 · 5 comments

Comments

@haozhun
Contributor

haozhun commented Mar 18, 2015

In #7, @electrum identified a location that can cause an infinite loop in JONI. It was marked as won't fix because the input can be sanitized beforehand and JONI assumes that the input is always valid.

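When the pattern is "\uD8000", the raw surrogate is present in the pattern text itself, so it can be pre-sanitized, as you suggested in #7. But what if the pattern is "\\uD800", where the surrogate only appears once the regex parser expands the escape? How can the user sanitize that?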
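
To illustrate why byte-level sanitizing does not catch the escaped form, here is a minimal sketch using only the JDK (no JONI calls). The class and method names are made up for this example. It assumes "sanitizing" means checking that the pattern bytes are strict UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Illustrative only: EscapedSurrogateCheck and isStrictUtf8 are hypothetical names.
public class EscapedSurrogateCheck {
    // Returns true when the bytes decode as strict UTF-8 with no malformed input.
    static boolean isStrictUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Six ASCII characters: backslash, 'u', 'D', '8', '0', '0'.
        byte[] escaped = "\\uD800".getBytes(StandardCharsets.UTF_8);
        // The three bytes a lenient encoder would emit for the lone surrogate U+D800.
        byte[] rawSurrogate = {(byte) 0xED, (byte) 0xA0, (byte) 0x80};

        System.out.println(isStrictUtf8(escaped));      // true: nothing to sanitize
        System.out.println(isStrictUtf8(rawSurrogate)); // false: a validator can reject this
    }
}
```

The escaped pattern is plain ASCII, so a validator that only looks at bytes has nothing to reject.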

If JONI is willing to add a check, it would be the same fix as for #7: check whether the return value of enc.length is negative in OptExactInfo.concatStr.
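
The kind of check meant here can be sketched roughly as follows. This is not JONI's code: `strictLength` is a simplified, hypothetical stand-in for an encoding's length lookup, and `copyValidPrefix` is a hypothetical loop that advances by that length. The only point is the guard on a negative return value.

```java
// Illustrative only: all names here are hypothetical, not taken from JONI or jcodings.
public class LengthGuardSketch {
    // Byte length of the character starting at p, or -1 if it is not valid.
    // Simplified: continuation bytes are not fully validated.
    static int strictLength(byte[] b, int p, int end) {
        int lead = b[p] & 0xFF;
        int len;
        if (lead < 0x80) len = 1;
        else if (lead >= 0xC2 && lead <= 0xDF) len = 2;
        else if (lead >= 0xE0 && lead <= 0xEF) len = 3;
        else if (lead >= 0xF0 && lead <= 0xF4) len = 4;
        else return -1;
        if (p + len > end) return -1;
        // ED A0..BF xx would encode a surrogate (U+D800..U+DFFF), which strict UTF-8 forbids.
        if (lead == 0xED && (b[p + 1] & 0xFF) >= 0xA0) return -1;
        return len;
    }

    // Copies whole characters from src to dst and returns how many bytes were copied.
    static int copyValidPrefix(byte[] src, int end, byte[] dst) {
        int p = 0;
        while (p < end) {
            int len = strictLength(src, p, end);
            // The guard: without it, a negative len would step p backwards
            // and the loop could cycle forever.
            if (len < 0) break;
            System.arraycopy(src, p, dst, p, len);
            p += len;
        }
        return p;
    }

    public static void main(String[] args) {
        byte[] src = {'A', (byte) 0xED, (byte) 0xA0, (byte) 0x80, 'B'};
        byte[] dst = new byte[src.length];
        System.out.println(copyValidPrefix(src, src.length, dst)); // prints 1
    }
}
```

Whether to stop, skip, or throw on a negative length is a design choice for the maintainers; the sketch simply stops.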

@haozhun
Contributor Author

haozhun commented Mar 26, 2015

In addition, \uD800\uDC00, which is a legal sequence, will also result in an infinite loop, because JONI treats every \uXXXX escape as a separate code point.
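
For reference, this is how the pair is supposed to combine into a single code point, shown with JDK calls only. The class and method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: SurrogatePairDemo and combine are hypothetical names.
public class SurrogatePairDemo {
    // Combines a high and a low surrogate into the supplementary code point they encode.
    static int combine(char high, char low) {
        if (!Character.isSurrogatePair(high, low)) {
            throw new IllegalArgumentException("not a valid surrogate pair");
        }
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        int cp = combine((char) 0xD800, (char) 0xDC00);
        System.out.printf("U+%X%n", cp); // prints U+10000

        // The combined code point has a perfectly valid four-byte UTF-8 encoding.
        byte[] utf8 = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length); // prints 4
    }
}
```

Taken separately, each half is a lone surrogate with no valid UTF-8 encoding, which is why treating the two escapes as independent code points hits the same problem as a single \uD800.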

@guyboertje

@haozhun - can you show some JRuby or Java code that illustrates the endless loop?

@headius
Member

headius commented May 2, 2016

Note that in the past year we did add the ability to interrupt joni when it's stuck looping on bad input (or just large input/slow regex).

@haozhun Can you propose a patch? @lopex would probably be the best one to review such a change.

@haozhun
Contributor Author

haozhun commented May 3, 2016

Java code that illustrates the infinite loop. This can be mitigated by using NonStrict... instead, as illustrated in the commented-out code.

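    import java.nio.charset.StandardCharsets;
    import org.jcodings.specific.UTF8Encoding;
    // import org.jcodings.specific.NonStrictUTF8Encoding; // only needed for the mitigation
    import org.joni.Matcher;
    import org.joni.Option;
    import org.joni.Regex;
    import org.joni.Syntax;
    public class Repro // class name is illustrative
    {
        public static void main(String[] args)
        {
            byte[] pattern = "A\\uD800".getBytes(StandardCharsets.UTF_8); // regex-level escape, plain ASCII bytes
            byte[] str = ("AB").getBytes(StandardCharsets.UTF_8);
            Regex regex = new Regex(pattern, 0, pattern.length, Option.NEGATE_SINGLELINE, UTF8Encoding.INSTANCE, Syntax.Java);
            // Regex regex = new Regex(pattern, 0, pattern.length, Option.NEGATE_SINGLELINE, NonStrictUTF8Encoding.INSTANCE, Syntax.Java);
            Matcher matcher = regex.matcher(str);
            int result = matcher.search(0, str.length, Option.DEFAULT);
            System.out.println(result);
        }
    }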

Patch: #21

@guyboertje

Ahh I see, this does not apply to JRuby (checked 1.7.24) because there is a range check.

It raises RegexpError: invalid Unicode range: /A\uD800/
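
A Java caller who cannot patch JONI could apply a similar range check before compiling. The following is only a rough sketch under assumptions: the class and method names are hypothetical, it recognizes only the four-hex-digit \uXXXX form, and it does not model the rest of the regex syntax. It rejects every surrogate escape, including valid pairs, since both cases are reported above as causing the loop.

```java
// Illustrative only: SurrogateEscapeCheck and hasSurrogateEscape are hypothetical names.
public class SurrogateEscapeCheck {
    // Returns true if the pattern contains a \uXXXX escape in the surrogate range D800..DFFF.
    static boolean hasSurrogateEscape(String pattern) {
        int i = 0;
        while (i < pattern.length()) {
            // Ordinary characters and a trailing backslash are skipped one at a time.
            if (pattern.charAt(i) != '\\' || i + 1 >= pattern.length()) {
                i++;
                continue;
            }
            if (pattern.charAt(i + 1) == 'u' && i + 6 <= pattern.length()) {
                try {
                    int value = Integer.parseInt(pattern.substring(i + 2, i + 6), 16);
                    if (value >= 0xD800 && value <= 0xDFFF) {
                        return true;
                    }
                } catch (NumberFormatException e) {
                    // Not four hex digits; treat it as some other escape.
                }
            }
            // Skip the backslash and the character it escapes, so that an
            // escaped backslash followed by "uD800" is not misread.
            i += 2;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasSurrogateEscape("A\\uD800"));   // prints true
        System.out.println(hasSurrogateEscape("A\\u0041"));   // prints false
        System.out.println(hasSurrogateEscape("A\\\\uD800")); // prints false
    }
}
```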

SergeyZh pushed a commit to JetBrains/intellij-community that referenced this issue Feb 11, 2020
To avoid endless loop described here: jruby/joni#17

GitOrigin-RevId: 21619a0255e1facf7e1aaa5879ca36956b98e45a
SergeyZh pushed a commit to JetBrains/intellij-community that referenced this issue Feb 12, 2020
To avoid endless loop described here: jruby/joni#17

(cherry picked from commit 21619a0255e1facf7e1aaa5879ca36956b98e45a)

GitOrigin-RevId: 7c698742fa33d97047e98b7bbee9e5307844712b