New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Irregal capture after recursive match #48

Closed
osamutake opened this Issue Jan 21, 2015 · 9 comments

Comments

Projects
None yet
2 participants
@osamutake

osamutake commented Jan 21, 2015

Thank you very much for maintaining the excellent library.

I found the latest version of Onigmo suffers from the following incorrect behavior when used with a recursive expression.

/\(((?:[^(]|\g<0>)*)\)/ matches "(abc)(abc)" => OK 馃挌
matches[0] == (0, 5) corresponding to "(abc)" => OK 馃挌
matches[1] == (6, 4) corresponding to a string with negative length => NG 馃挃

/\(((?:[^()]|\g<0>)*)\)/ matches "((abc)(abc))" => OK 馃挌
matches[0] == (0, 12) corresponding to "((abc)(abc))" => OK 馃挌
matches[1] == (7, 11) corresponding to "abc)" => NG 馃挃

It seems that matches[].rm_so refers to the last capture while matches[].rm_eo refers to to the top level capture. I believe the users will be happier if both of them refers to the top level capture. When tested in ruby, the former example returns an invalid string for $2 that causes ArgumentError when given to a function as described at ruby/姝h琛ㄧ従.

@k-takata k-takata added the bug label Nov 29, 2016

@k-takata

This comment has been minimized.

Show comment
Hide comment
@k-takata

k-takata Nov 29, 2016

Owner

Sorry for the very late response.
I think this is a bug of oniguruma/onigmo, but it seems difficult to fix.

I'm trying to investigate this, but I haven't found the cause yet.
Debug log: https://gist.github.com/k-takata/833f4ef7c4f94428e8de938103e97d5f

Owner

k-takata commented Nov 29, 2016

Sorry for the very late response.
I think this is a bug of oniguruma/onigmo, but it seems difficult to fix.

I'm trying to investigate this, but I haven't found the cause yet.
Debug log: https://gist.github.com/k-takata/833f4ef7c4f94428e8de938103e97d5f

k-takata added a commit that referenced this issue Nov 30, 2016

Fix wrong capture inside subexp calls (Issue #48)
Capturing inside subexp calls should use a stack.
This fixes only the first issue. The second one has not been fixed yet.
@k-takata

This comment has been minimized.

Show comment
Hide comment
@k-takata

k-takata Nov 30, 2016

Owner

The first one has been fixed in the devel-6.0 branch.
The second one is still under investigation.

Owner

k-takata commented Nov 30, 2016

The first one has been fixed in the devel-6.0 branch.
The second one is still under investigation.

@k-takata

This comment has been minimized.

Show comment
Hide comment
@k-takata

k-takata Dec 2, 2016

Owner

I think I understand what happens with the second example.

Capturing happens three times:

  1. (abc)(abc): position 1 to 11, nest level 1
  2. the first abc: position 2 to 5, nest level 2
  3. the second abc: position 7 to 10, nest level 2

Matching sequence is the following:

  1. Start of the 1st one captures position 1.
  2. Start of the 2nd one captures position 2.
  3. End of the 2nd one captures position 5.
  4. Start of the 3rd one captures position 7.
  5. End of the 3rd one captures position 10.
  6. End of the 1st one captures position 11.

Finally, the last captured position of start and end will be returned.
So the start of \1 becomes 7 (No.4) and the end of \1 becomes 11 (No.6).
The nest level is not considered here.

I believe the users will be happier if both of them refers to the top level capture.

I'm not sure this can be easily implemented. Also not sure this doesn't break existing behavior for non-nested capturing.
Need more investigation...

Owner

k-takata commented Dec 2, 2016

I think I understand what happens with the second example.

Capturing happens three times:

  1. (abc)(abc): position 1 to 11, nest level 1
  2. the first abc: position 2 to 5, nest level 2
  3. the second abc: position 7 to 10, nest level 2

Matching sequence is the following:

  1. Start of the 1st one captures position 1.
  2. Start of the 2nd one captures position 2.
  3. End of the 2nd one captures position 5.
  4. Start of the 3rd one captures position 7.
  5. End of the 3rd one captures position 10.
  6. End of the 1st one captures position 11.

Finally, the last captured position of start and end will be returned.
So the start of \1 becomes 7 (No.4) and the end of \1 becomes 11 (No.6).
The nest level is not considered here.

I believe the users will be happier if both of them refers to the top level capture.

I'm not sure this can be easily implemented. Also not sure this doesn't break existing behavior for non-nested capturing.
Need more investigation...

@osamutake

This comment has been minimized.

Show comment
Hide comment
@osamutake

osamutake Dec 2, 2016

Thank you very much for your detailed investigation.

I imagine that if we push the start and end positions on the nest-level stack and pop them from it at the nest-level switching, the problem will be solved. Note that it does not affect non-nested capturing.

I'm afraid the first one should have been fixed in the same manner.

I do not know how difficult to implement it though...

osamutake commented Dec 2, 2016

Thank you very much for your detailed investigation.

I imagine that if we push the start and end positions on the nest-level stack and pop them from it at the nest-level switching, the problem will be solved. Note that it does not affect non-nested capturing.

I'm afraid the first one should have been fixed in the same manner.

I do not know how difficult to implement it though...

@k-takata k-takata closed this in 7591b3c Dec 4, 2016

@k-takata

This comment has been minimized.

Show comment
Hide comment
@k-takata

k-takata Dec 4, 2016

Owner

I think this is fixed by the latest commit.

Owner

k-takata commented Dec 4, 2016

I think this is fixed by the latest commit.

@osamutake

This comment has been minimized.

Show comment
Hide comment
@osamutake

osamutake Dec 4, 2016

Thank you very much for the fix.

I wonder if the commit dd4638c was really needed independently from 7591b3c.
Doesn't 7591b3c fix the first example?

I can't follow the logic precisely but, I imagine the two failed tests originate from a same origin.

osamutake commented Dec 4, 2016

Thank you very much for the fix.

I wonder if the commit dd4638c was really needed independently from 7591b3c.
Doesn't 7591b3c fix the first example?

I can't follow the logic precisely but, I imagine the two failed tests originate from a same origin.

@k-takata

This comment has been minimized.

Show comment
Hide comment
@k-takata

k-takata Dec 4, 2016

Owner

dd4638c is needed to store the start/end position of each capturing to a stack.
Without this, the position will be overwritten by the latest one.

Owner

k-takata commented Dec 4, 2016

dd4638c is needed to store the start/end position of each capturing to a stack.
Without this, the position will be overwritten by the latest one.

@osamutake

This comment has been minimized.

Show comment
Hide comment
@osamutake

osamutake Dec 4, 2016

I thought even if it is overwritten, it is recovered at level switching but it seems my imagination.
Sorry for bothering and thank you again for your labor.

osamutake commented Dec 4, 2016

I thought even if it is overwritten, it is recovered at level switching but it seems my imagination.
Sorry for bothering and thank you again for your labor.

k-takata added a commit that referenced this issue Dec 7, 2016

k-takata added a commit that referenced this issue Dec 7, 2016

Better fix for Issue #48
This partly reverts 7591b3c.
Also related: dd4638c

k-takata added a commit that referenced this issue Dec 7, 2016

Better fix for Issue #48
Remove redundant condition.
@k-takata

This comment has been minimized.

Show comment
Hide comment
@k-takata

k-takata Dec 7, 2016

Owner

I found a better solution for this.
Even without my fixes, a pattern in sample/listcap.c captured as expected with recursive match. This meant that Onigmo had a mechanism to handle nested captures.

  1. \g<p>(?@<p>\(\g<s>\)){0}(?@<s>(?:\g<p>)*|){0}
  2. \(((?:[^(]|\g<0>)*)\)

Difference between 1 and 2 is whether captured groups are directly called or not.
In 1, both <p> and <s> are directly referenced by \g<p> and \g<s>.
In 2, the whole pattern is referenced by \g<0> but \1 is not referenced. It is indirectly called via \g<0>.

Now, with the commit 4bd79e2, captured groups which are called indirectly are also handled as if they are called directly.

BTW, I'm not sure both IN_CALL and IN_RECCALL in the commit are really needed.

Owner

k-takata commented Dec 7, 2016

I found a better solution for this.
Even without my fixes, a pattern in sample/listcap.c captured as expected with recursive match. This meant that Onigmo had a mechanism to handle nested captures.

  1. \g<p>(?@<p>\(\g<s>\)){0}(?@<s>(?:\g<p>)*|){0}
  2. \(((?:[^(]|\g<0>)*)\)

Difference between 1 and 2 is whether captured groups are directly called or not.
In 1, both <p> and <s> are directly referenced by \g<p> and \g<s>.
In 2, the whole pattern is referenced by \g<0> but \1 is not referenced. It is indirectly called via \g<0>.

Now, with the commit 4bd79e2, captured groups which are called indirectly are also handled as if they are called directly.

BTW, I'm not sure both IN_CALL and IN_RECCALL in the commit are really needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment