Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"ss" in look-behind raises syntax error #92

Closed
nurse opened this issue Jul 14, 2017 · 14 comments · Fixed by #116
Closed

"ss" in look-behind raises syntax error #92

nurse opened this issue Jul 14, 2017 · 14 comments · Fixed by #116
Labels

Comments

@nurse
Copy link
Contributor

nurse commented Jul 14, 2017

"ss" in look-behind may raise syntax error.

irb(main):006:0> /(?<!ss)/iu
=> /(?<!ss)/i
irb(main):002:0> /(?<!ass)/iu
SyntaxError: (irb):2: invalid pattern in look-behind: /(?<!ass)/i

This seems because "ss" is expanded to "ß".
Though Onigmo avoids the issue if there's only "ss", but with another characters it cause error.

@aquach
Copy link

aquach commented Sep 15, 2017

I'm getting a similar error with a non-ascii string:

(/(?<!post)e/i).match?("\u{a0}")
RegexpError: invalid pattern in look-behind: /(?<!post)e/i

@aquach
Copy link

aquach commented Sep 15, 2017

Seems like a pretty big deal since it means that lookbehind is now incompatible with non-ascii characters, under certain conditions.

@kpolitowicz
Copy link

Just upgraded production to 2.4.2 (from 2.3.5) and encountered this problem with regexps like:

/(?<!<strong>)(A\u2014a)(?!<\/strong>)/i # SyntaxError
/(?<!<st>)(A\u2014a)(?!<\/st>)/i # SyntaxError
/(?<!<s>)(A\u2014a)(?!<\/s>)/i # OK
/(?<!<sr>)(A\u2014a)(?!<\/sr>)/i # OK
/(?<!<strong>)(Aa)(?!<\/strong>)/i # OK
/(?<!<strong>)(A\u2014a)(?!<\/strong>)/ # OK

It seems the problem goes as far back as 2.4.0 (and even 2.4.0-rc). 2.3.x is okay with all of the above.

@kpolitowicz
Copy link

I just tested Ruby 2.5 (and Ruby 2.4.3) and no improvement. Is there a chance someone will fix this soon or is it considered a non-breaking bug? Thanks!

@Mifrill
Copy link

Mifrill commented Jan 5, 2018

[/(?<=accessories|ubehör)(.*?\S)\s{3,}/i, 1]

works fine for ruby 2.3.3

For ruby 2.5.0-preview1

SyntaxError: /filie_name.rb:line: invalid pattern in look-behind: /(?<=accessories|ubehör):(.*?\S)\s{3,}/i
[/(?<=capacity|ubeh.r)(.*?\S)\s{3,}/i, 1]

Just replace ö

@BrianOn99
Copy link

BrianOn99 commented Mar 18, 2018

As @nurse mentioned, ss in some language is ß or ẞ, so the syntax tree is expanded as

PATTERN: /ss/ (UTF-8)
<alt:56006e730450>
   <list:56006e7304d0>
      <alt:56006e730510>
         <string:56006e730550>s
         <string:56006e730590>S
         <string:56006e730610> 0xc5 0xbf   # This is ſ
      <string:56006e730790>s
   <string:56006e730690> 0xc3 0x9f  # This is ß
   <string:56006e730710> 0xe1 0xba 0x9e  # This is ẞ

equivalent to (s|S|ſ)s | ß | ẞ, where the second s is ignore-case exact1-ic

Now for (?<!ss), regcomp.c setup_tree will expand the 3 alternation as (?<!s|S|ſ)|(?<!ß)(?<!ẞ). It is necessary because the anchor node type need a well defined char_len. Now the 3 expanded alternation each has a char_len of 1.

So what is the problem of /(?<!ass)/? /ass/ has this representation:

PATTERN: /ass/ (UTF-8)
<list:5603f54e0450>
   <string:5603f54e0490>a
   <alt:5603f54e0530>
      <list:5603f54e0570>
         <alt:5603f54e05b0>
            <string:5603f54e05f0>s
            <string:5603f54e0630>S
            <string:5603f54e06b0> 0xc5 0xbf
         <string:5603f54e0870>s
      <string:5603f54e0730> 0xc3 0x9f
      <string:5603f54e07b0> 0xe1 0xba 0x9e

equivalent to a((s|S|ſ) | ß | ẞ).

setup_look_behind will not expand the alternatiion to 3 look behind if the alternation is not top level. I think it is because it is hard to compute the char_len. In this case it is still possible to suck in the a and expand the look behind as something like (?<!a(s|S|ſ)s)|(?<!aß)(?<!aẞ), but what if the pattern is /(?<!assssssssssssss)/?

@k-takata
Copy link
Owner

Sorry for the very late response, but I haven't look into this yet.
Does this caused by eb666b8? Does oniguruma have also the same problem?

@k-takata
Copy link
Owner

It seems that oniguruma has fixed the problem in kkos/oniguruma@257082d.
Maybe we need a similar change.

@kpolitowicz
Copy link

@k-takata Late? Not even 2 years old ;) But seriously, big thanks for fixing this issue. :+9000:

Do you know how long will it take for Ruby build to pick the fix up? Will it be automatically included in the next, say 2.5 or 2.6 patch?

@k-takata
Copy link
Owner

@nurse Will you include this in Ruby 2.5 and 2.6?

@nurse
Copy link
Contributor Author

nurse commented Feb 12, 2019

This will be included into Ruby 2.7, but I don't have a plan to backport this into 2.6.

@mauromorales
Copy link

mauromorales commented Apr 7, 2020

FYI, I applied this changes on top of Ruby 2.6.6 and they effectively fix the issue see mauromorales/ruby@b2e52be

@nurse Is there a way I could help you backport it to Ruby 2.6.6 or a newer version in 2.6? I also wouldn't mind doing the work for 2.7 (or any other version) I just need some guidance since I've never contributed to the ruby project.

@mauromorales
Copy link

@nurse I just tried this in Ruby 2.7.2 and the problem persists

@eliotsykes
Copy link

I'm commenting on this closed issue to hopefully assist other searchers who land here looking for a workaround.

Until Ruby uses the fixed version of Onigmo, you may find replacing the ss with character classes [s][s] works, e.g.

-/(?<!ass)/i
+/(?<!a[s][s])/i

To test the patterns:

buggy_pattern = /(?<!ass)/i
fixed_pattern = /(?<!a[s][s])/i
unicode_str = "😀"

unicode_str.match? buggy_pattern
# => raises RegexpError (invalid pattern in look-behind: /(?<!ass)/i)

unicode_str.match? fixed_pattern
# => true

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants