Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Whitespaces negation in group not working as expected #1065

Closed
hurricup opened this issue Feb 24, 2023 · 4 comments · Fixed by #1066
Closed

Whitespaces negation in group not working as expected #1065

hurricup opened this issue Feb 24, 2023 · 4 comments · Fixed by #1066
Assignees
Labels
bug Not working as intended
Milestone

Comments

@hurricup
Copy link

hurricup commented Feb 24, 2023

I'm currently migrating from the 1.7.0 to 1.9.0 and my test uncovered that in 1.7.0 [^\n\-\s%]+ did not matched spaces, but in 1.9.0 it does. And now I need to write it like [^\n\- \t%]+ And this feels wrong.

https://regex101.com/r/eOBDPX/1

@lsf37
Copy link
Member

lsf37 commented Feb 25, 2023

Hm, that does sound wrong, [^\s] should not match space. Thanks for reporting that, will investigate.

@lsf37 lsf37 self-assigned this Feb 25, 2023
@lsf37
Copy link
Member

lsf37 commented Feb 25, 2023

So, with just a single [^\s]+ everything appears to be working as expected, but with [^\n\-\s%]+ if there is also a . default rule present in the spec, I get an error at generation time, which is definitely a problem.

Not sure yet what is going on, but this is a minimal failing example:

%%
%%

[^\n\s]  {  }
.        {  }

The presence of all of \n, \s, . seems to be important, removing any of them leads to correct output.

Generator error looks like we're trying to access/emit a char class that doesn't exist:

Index -1 out of bounds for length 4
java.lang.IndexOutOfBoundsException: Index -1 out of bounds for length 4
	at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:100)
	at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:106)
	at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:302)
	at java.base/java.util.Objects.checkIndex(Objects.java:385)
	at java.base/java.util.ArrayList.get(ArrayList.java:427)
	at jflex.core.unicode.CharClasses.getIntervals(CharClasses.java:397)
	at jflex.core.unicode.CharClasses.computeTables(CharClasses.java:414)
	at jflex.core.unicode.CharClasses.getTables(CharClasses.java:466)
	at jflex.generator.Emitter.emitCharMapTables(Emitter.java:596)
	at jflex.generator.Emitter.emit(Emitter.java:1384)
	at jflex.generator.LexGenerator.generate(LexGenerator.java:111)
	at jflex.Main.generate(Main.java:341)
	at jflex.Main.main(Main.java:357)

@lsf37
Copy link
Member

lsf37 commented Feb 25, 2023

Char class generation seems to have a problem -- in class 2 10 occurs twice, which should be impossible:

CharClasses:
class 0:
{ [0-8][14-31]['!'-132][134-159][161-5759][5761-8191][8203-8231][8234-8238][8240-8286][8288-12287][12289-55295][57344-1114111] }
class 1:
{ [9][' '][160][5760][8192-8202][8239][8287][12288] }
class 2:
{ [10-13][10][133][8232-8233] }
class 3:
{ [14-9][11-13][55296-57343] }

@lsf37 lsf37 added the bug Not working as intended label Feb 26, 2023
@lsf37
Copy link
Member

lsf37 commented Feb 26, 2023

The bug is in the (new) code for normalisation of character classes. It uses a version of set subtraction (a - b) that is only safe when b is contained in a, which is not always the case.

In particular, the bug will trigger when we negate a character class that has overlapping contents, i.e. the \n is contained in \s in the failing test case, which means that the code first removes \n, and then is attempting to remove \s which is attempting to remove \n again. The operation as such succeeds, but leaves the set in an inconsistent state. That state only triggers a visible error in the next set operation that relies on the set invariants, which is why we need the second rule containing . to trigger anything visible (. having a non-empty intersection with [^\n\s]).

This also explains why replacing \s with just space fixes the problem, because the content elements of the negated character class don't intersect any more.

The fix should be fairly straightforward.

lsf37 added a commit that referenced this issue Feb 26, 2023
`IntCharSet.sub(s)` expects `s` to be fully contained in `this`. If the
contents of the inner char class expression overlap, this assumption is
violated and leads to an inconsistent IntCharSet state.

Fix this by computing the union of the inner expression first, and then
subtracting the union from the universal set.

Fixes #1065
lsf37 added a commit that referenced this issue Feb 26, 2023
`IntCharSet.sub(s)` expects `s` to be fully contained in `this`. If the
contents of the inner char class expression overlap, this assumption is
violated and leads to an inconsistent IntCharSet state.

Fix this by computing the union of the inner expression first, and then
returning the complement. Since the only difference to the CCLASS case
is the complement at the end, the two cases can be merged.

Fixes #1065
lsf37 added a commit that referenced this issue Feb 26, 2023
Regression test for #1065: test that a negated char class with
overlapping content is generated and matched correctly.
lsf37 added a commit that referenced this issue Feb 26, 2023
`IntCharSet.sub(s)` expects `s` to be fully contained in `this`. If the
contents of the inner char class expression overlap, this assumption is
violated and leads to an inconsistent IntCharSet state.

Fix this by computing the union of the inner expression first, and then
returning the complement. Since the only difference to the CCLASS case
is the complement at the end, the two cases can be merged.

Fixes #1065
lsf37 added a commit that referenced this issue Feb 26, 2023
Regression test for #1065: test that a negated char class with
overlapping content is generated and matched correctly.
@lsf37 lsf37 added this to the 1.9.1 milestone Feb 26, 2023
@lsf37 lsf37 closed this as completed in 29b6852 Feb 26, 2023
github-actions bot pushed a commit that referenced this issue Feb 26, 2023
commit 29b6852
Author:     Gerwin Klein <lsf@jflex.de>
AuthorDate: Sun Feb 26 12:21:25 2023 +1100
Commit:     Gerwin Klein <lsf@jflex.de>
CommitDate: Sun Feb 26 13:05:46 2023 +1100

    RegExp: establish IntCharSet.sub preconditions

    `IntCharSet.sub(s)` expects `s` to be fully contained in `this`. If the
    contents of the inner char class expression overlap, this assumption is
    violated and leads to an inconsistent IntCharSet state.

    Fix this by computing the union of the inner expression first, and then
    returning the complement. Since the only difference to the CCLASS case
    is the complement at the end, the two cases can be merged.

    Fixes #1065

Updated from
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Not working as intended
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants