Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bugfix: ngx.re: split() might enter infinite loops #106

Closed

Conversation

thibaultcha
Copy link
Member

@thibaultcha thibaultcha commented Apr 27, 2017

This is a proposed fix for #104. The issue encountered is similar to #79, but #79 cannot address this issue, since the introduced branch only runs when regex == "".

This proposes to unify the behavior triggered by ambiguous regexes, and that behavior would be to return the subject untouched (because the regex is ambiguous). I believe this behavior might be preferable than to split all characters, but sadly, it is a breaking change, alas :'(

Thoughts on how to best handle this?

@agentzh
Copy link
Member

agentzh commented Apr 27, 2017

@thibaultcha I think we should follow perl's split's behavior here:

$ perl -e 'print join ",", split /|/, "abcd"'
a,b,c,d

@agentzh
Copy link
Member

agentzh commented Apr 27, 2017

@thibaultcha I think instead of checking if the regex is empty, we should check if the matched separator is an empty string.

@thibaultcha
Copy link
Member Author

@agentzh Updated the patch. We now mimic the Perl behavior with empty matches and with /^/ as well:

$ perl -e 'print join ":", split //, "abcd"'
a:b:c:d

$ perl -e 'print join ":", split /|/, "abcd"'
a:b:c:d

$ perl -e 'print join ":", split /()/, "abcd"'
a::b::c::d

$ perl -e 'print join ":", split /^/, "ab\ncd"'
ab
:cd

$ perl -e 'print join ":", split /^/m, "ab\ncd"'
ab
:cd

$ perl -e 'print join ":", split / ^/, "ab\ncd"'
ab
cd

$ perl -e 'print join ":", split / ^/x, "ab\ncd"'
ab
:cd

What are your thoughts on respecting this behavior?

lib/ngx/re.lua Outdated
res_idx = res_idx + 1
res[res_idx] = sub(subj, sub_idx, from - 1)


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra blank line?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, oversight...

lib/ngx/re.lua Outdated
@@ -89,6 +94,10 @@ function _M.split(subj, regex, opts, ctx, max, res)
-- needed because of further calls to string.sub in this function.
subj = tostring(subj)

if not opts then
opts = ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2-space indentations?

lib/ngx/re.lua Outdated
res_idx = res_idx + 1
pos = pos + 1
end
local start_regex = find(regex, "^", nil, true) ~= nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this is too hacky to scan the regex literal using string.find.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the regex engine should handle this automatically. If not, then there must be some other issues.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not satisfied with it either... Open to suggestions!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the regex engine should handle this automatically. If not, then there must be some other issues.

Sorry missed this. Hmm, I will investigate.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On subj = "ab\ncd" with /^/ If we don't add the m flag, we get:

{ "a", "b\ncd" }

So we do not match on newlines automatically, but only at position 0, because it is our first character, and the rest doesn't get matched. It seemed to me (after a bit of online research) that Perl's split was giving some special care of its own to /^/, but I could definitely be wrong.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you should try the "m" regex flag in the Lua split? By default, ^ only matches the beginning of the string IIRC.

Copy link
Member Author

@thibaultcha thibaultcha Apr 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you should try the "m" regex flag in the Lua split?

You mean require users to add it themselves in the opts argument? Because Perl's split behavior is - apparently - to add the m automatically when /^/ is given.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thibaultcha Well, here we should follow our own semantics instead of perl's:

$ resty -e 'print((ngx.re.match("a\nb", "^b", "")) and "true" or "false")'
false

$ resty -e 'print((ngx.re.match("a\nb", "^b", "m")) and "true" or "false")'
true

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes our lives easier, I'm not against it! 😅 Will update.

@thibaultcha thibaultcha force-pushed the fix/re-split-infinite-loop branch 2 times, most recently from 999aaca to 247e63c Compare April 29, 2017 05:56
@thibaultcha
Copy link
Member Author

thibaultcha commented Apr 29, 2017

I updated the patch to not incorporate Perl's split behavior with /^/. However, to preserve the behavior introduced with #79, we still check for empty regexes. I'm not sure if we should check for all empty matches regexes like () or |, which currently have a different behavior than "":

-- empty regex:
ngx_re.split("abcd", "")   -- { "a", "b", "c", "d" }
-- others:
ngx_re.split("abcd", "|")  -- { " ", "a", "b", "c", "d" }
ngx_re.split("abcd", "()") -- { " ", " ", "a", "b", "c", "d" }

Should we do something about it?

@agentzh
Copy link
Member

agentzh commented Apr 29, 2017

@thibaultcha I think instead of checking empty patterns, we should check empty captures instead.

@thibaultcha
Copy link
Member Author

I'm not sure what you mean, nor how.

@agentzh
Copy link
Member

agentzh commented Apr 29, 2017

@thibaultcha Just check if the separator pattern actually matches any non-zero length of input data (by checking the pos returned by PCRE). If yes, simply move the cursor to the next character of the input.

@thibaultcha thibaultcha force-pushed the fix/re-split-infinite-loop branch 3 times, most recently from 3080f9e to 4ab563f Compare May 9, 2017 02:55
@thibaultcha
Copy link
Member Author

thibaultcha commented May 9, 2017

@agentzh Sorry for the delay! I pushed a new version of the patch. Moving to the next character on empty matches (what the patch was already doing, but only for empty regexes) does not seem to be enough for cases like:

$ perl -e 'print join ":", split /^/m, "ab\ncd"'
ab
:cd

We now mimic this behavior in ngx_re.split() if the user specifies the m flag:

ngx_re.split("ab\ncd", "^", "m")
{ "ab\n", "cd" }

To do this we handle empty matches by running a second time the regex further ahead, it seems to be the only way to do satisfy both the empty regex and the /^/m one.

Let me know what you think of it!

PS: I don't really like this code duplication between the max and the non-max branches. It leads to an overly complicated function at first look, and needs twice as many tests for both branches :/

@thibaultcha thibaultcha force-pushed the fix/re-split-infinite-loop branch 2 times, most recently from 7e61a8f to 590e523 Compare May 9, 2017 03:06
@thibaultcha
Copy link
Member Author

@agentzh Have you had some time to review this? Considering it is a bugfix, I think it needs some attention. As pointed out in my previous comment, I think we need to look-ahead when such empty matches happen, to know how many characters must be skipped (not necessary the next character as previously mentioned, implemented). Thanks!

lib/ngx/re.lua Outdated
local old_pos = ctx.pos

local from2, _, _, err2 = re_split_helper(subj, compiled,
compile_once, flags, ctx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thibaultcha It looks strange to me to perform a second match in the same location. Why? You said it is "the only way", but I still don't understand it. Will you elaborate?

Copy link
Member Author

@thibaultcha thibaultcha May 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, this needs elaboration. So, considering the following string:

"abcd\nefgh"

And the following regex:

/^/m

We have a match at characters 0 and 5. If the only thing we do is moving to the next character after the first match, we have something like the following result:

{ "a", "bcd", "e", "fgh" }

Because our function is not smart enough to know that an empty match does not necessarily mean we are splitting char-by-char. What we need is keep the first match (0), and find where the next match is (we know it'll be empty too), which will be 5. We can then deduce our first sub-string is 1 to 4:

{ "abcd" }

Otherwise, a simple increment would have cut 0 to 1, and then 2 to 4:

{ "a", "bcd" }

Simply incrementing the string pos by one would work for regexps like /()/ were we know the next character is the next match, but not for subjects were we don't know where the next empty match will be at. At least, that's what seems to be the only way (insisting on the "seems") :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thibaultcha It appears clearer to me now. Now I'd just like to be this match's result not wasted.

lib/ngx/re.lua Outdated

from = from2
to = from2
ctx.pos = old_pos
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should not backtrack to the old position? This way we can make this extra match not get wasted. The same applies to the to index returned by this extra match.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm yeah I thought this was already the case (edited my explanatory comment), but just realized it actually is backtracking to the old position. Will try to get rid of it. Would want nothing less performant than that for lua-resty-core :)

@thibaultcha thibaultcha force-pushed the fix/re-split-infinite-loop branch 3 times, most recently from ae4006f to 1138688 Compare May 16, 2017 02:49
@thibaultcha
Copy link
Member Author

@agentzh I have updated the patch with a new, smaller implementation that I find myself quite found of :) None of the existing tests were modified, some were added to cover additional or missing ground.

@thibaultcha
Copy link
Member Author

Also this patch was rebased on master.

@agentzh
Copy link
Member

agentzh commented May 16, 2017

@thibaultcha This indeed looks much better. Thanks!

@doujiang24 @dndx Will you review this PR when you guys have a chance? Thanks!

@thibaultcha
Copy link
Member Author

Ping :)

@agentzh
Copy link
Member

agentzh commented Jun 6, 2017

@thibaultcha Sorry for the delay! I'm still waiting for @doujiang24's review :)

Copy link
Member

@doujiang24 doujiang24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thibaultcha @agentzh Sorry for the delay!
LGTM :)

@thibaultcha
Copy link
Member Author

Thanks @doujiang24 !

@agentzh
Copy link
Member

agentzh commented Jun 9, 2017

@thibaultcha Merged. Thanks!

@agentzh agentzh closed this Jun 9, 2017
@thibaultcha thibaultcha deleted the fix/re-split-infinite-loop branch January 12, 2018 02:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants