-
Notifications
You must be signed in to change notification settings - Fork 272
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bugfix: ngx.re: split() might enter infinite loops #106
bugfix: ngx.re: split() might enter infinite loops #106
Conversation
@thibaultcha I think we should follow perl's $ perl -e 'print join ",", split /|/, "abcd"'
a,b,c,d |
@thibaultcha I think instead of checking if the regex is empty, we should check if the matched separator is an empty string. |
96169cc
to
b3fecd6
Compare
@agentzh Updated the patch. We now mimic the Perl behavior with empty matches and with
What are your thoughts on respecting this behavior? |
lib/ngx/re.lua
Outdated
res_idx = res_idx + 1 | ||
res[res_idx] = sub(subj, sub_idx, from - 1) | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extra blank line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, oversight...
lib/ngx/re.lua
Outdated
@@ -89,6 +94,10 @@ function _M.split(subj, regex, opts, ctx, max, res) | |||
-- needed because of further calls to string.sub in this function. | |||
subj = tostring(subj) | |||
|
|||
if not opts then | |||
opts = "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2-space indentations?
b3fecd6
to
e89ec2a
Compare
lib/ngx/re.lua
Outdated
res_idx = res_idx + 1 | ||
pos = pos + 1 | ||
end | ||
local start_regex = find(regex, "^", nil, true) ~= nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this is too hacky to scan the regex literal using string.find
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the regex engine should handle this automatically. If not, then there must be some other issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not satisfied with it either... Open to suggestions!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the regex engine should handle this automatically. If not, then there must be some other issues.
Sorry missed this. Hmm, I will investigate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On subj = "ab\ncd"
with /^/
If we don't add the m
flag, we get:
{ "a", "b\ncd" }
So we do not match on newlines automatically, but only at position 0, because it is our first character, and the rest doesn't get matched. It seemed to me (after a bit of online research) that Perl's split
was giving some special care of its own to /^/
, but I could definitely be wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you should try the "m" regex flag in the Lua split? By default, ^ only matches the beginning of the string IIRC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you should try the "m" regex flag in the Lua split?
You mean require users to add it themselves in the opts argument? Because Perl's split
behavior is - apparently - to add the m
automatically when /^/
is given.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thibaultcha Well, here we should follow our own semantics instead of perl's:
$ resty -e 'print((ngx.re.match("a\nb", "^b", "")) and "true" or "false")'
false
$ resty -e 'print((ngx.re.match("a\nb", "^b", "m")) and "true" or "false")'
true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes our lives easier, I'm not against it! 😅 Will update.
999aaca
to
247e63c
Compare
I updated the patch to not incorporate Perl's split behavior with -- empty regex:
ngx_re.split("abcd", "") -- { "a", "b", "c", "d" }
-- others:
ngx_re.split("abcd", "|") -- { " ", "a", "b", "c", "d" }
ngx_re.split("abcd", "()") -- { " ", " ", "a", "b", "c", "d" } Should we do something about it? |
@thibaultcha I think instead of checking empty patterns, we should check empty captures instead. |
I'm not sure what you mean, nor how. |
@thibaultcha Just check if the separator pattern actually matches any non-zero length of input data (by checking the |
3080f9e
to
4ab563f
Compare
@agentzh Sorry for the delay! I pushed a new version of the patch. Moving to the next character on empty matches (what the patch was already doing, but only for empty regexes) does not seem to be enough for cases like: $ perl -e 'print join ":", split /^/m, "ab\ncd"'
ab
:cd We now mimic this behavior in ngx_re.split("ab\ncd", "^", "m")
{ "ab\n", "cd" } To do this we handle empty matches by running a second time the regex further ahead, it seems to be the only way to do satisfy both the empty regex and the Let me know what you think of it! PS: I don't really like this code duplication between the max and the non-max branches. It leads to an overly complicated function at first look, and needs twice as many tests for both branches :/ |
7e61a8f
to
590e523
Compare
@agentzh Have you had some time to review this? Considering it is a bugfix, I think it needs some attention. As pointed out in my previous comment, I think we need to look-ahead when such empty matches happen, to know how many characters must be skipped (not necessary the next character as previously mentioned, implemented). Thanks! |
lib/ngx/re.lua
Outdated
local old_pos = ctx.pos | ||
|
||
local from2, _, _, err2 = re_split_helper(subj, compiled, | ||
compile_once, flags, ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thibaultcha It looks strange to me to perform a second match in the same location. Why? You said it is "the only way", but I still don't understand it. Will you elaborate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, this needs elaboration. So, considering the following string:
"abcd\nefgh"
And the following regex:
/^/m
We have a match at characters 0
and 5
. If the only thing we do is moving to the next character after the first match, we have something like the following result:
{ "a", "bcd", "e", "fgh" }
Because our function is not smart enough to know that an empty match does not necessarily mean we are splitting char-by-char. What we need is keep the first match (0
), and find where the next match is (we know it'll be empty too), which will be 5
. We can then deduce our first sub-string is 1 to 4
:
{ "abcd" }
Otherwise, a simple increment would have cut 0 to 1
, and then 2 to 4
:
{ "a", "bcd" }
Simply incrementing the string pos by one would work for regexps like /()/
were we know the next character is the next match, but not for subjects were we don't know where the next empty match will be at. At least, that's what seems to be the only way (insisting on the "seems") :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thibaultcha It appears clearer to me now. Now I'd just like to be this match's result not wasted.
lib/ngx/re.lua
Outdated
|
||
from = from2 | ||
to = from2 | ||
ctx.pos = old_pos |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should not backtrack to the old position? This way we can make this extra match not get wasted. The same applies to the to
index returned by this extra match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm yeah I thought this was already the case (edited my explanatory comment), but just realized it actually is backtracking to the old position. Will try to get rid of it. Would want nothing less performant than that for lua-resty-core :)
ae4006f
to
1138688
Compare
yields empty matches
1138688
to
8d5e3b8
Compare
@agentzh I have updated the patch with a new, smaller implementation that I find myself quite found of :) None of the existing tests were modified, some were added to cover additional or missing ground. |
Also this patch was rebased on master. |
@thibaultcha This indeed looks much better. Thanks! @doujiang24 @dndx Will you review this PR when you guys have a chance? Thanks! |
Ping :) |
@thibaultcha Sorry for the delay! I'm still waiting for @doujiang24's review :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thibaultcha @agentzh Sorry for the delay!
LGTM :)
Thanks @doujiang24 ! |
@thibaultcha Merged. Thanks! |
This is a proposed fix for #104. The issue encountered is similar to #79, but #79 cannot address this issue, since the introduced branch only runs when
regex == ""
.This proposes to unify the behavior triggered by ambiguous regexes, and that behavior would be to return the subject untouched (because the regex is ambiguous). I believe this behavior might be preferable than to split all characters, but sadly, it is a breaking change, alas :'(
Thoughts on how to best handle this?