Modifiers are dropped in \X regular expression matches #4832

Closed
olleolleolle opened this Issue Oct 30, 2017 · 5 comments

Contributor

olleolleolle commented Oct 30, 2017

The \X regular expression matches on "extended grapheme cluster".

This issue is about how JRuby gets that match wrong.

Environment

Versions:

  • JRuby version: jruby 9.1.13.0 (2.3.3) 2017-09-06 8e1c115 Java HotSpot(TM) 64-Bit Server VM 25.92-b14 on 1.8.0_92-b14 +jit [darwin-x86_64]
  • Operating system and platform: Darwin Olles-MacBook-Pro.local 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 15 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64 x86_64

Expected Behavior

$ /usr/bin/ruby -e "p 'åäöÅÄÖ'.unicode_normalize(:nfd).match /(\X)/"
#<MatchData "å" 1:"å">

The circle above the a is a "modifier" (a combining mark). Here, in MRI, it is part of the MatchData.

Actual Behavior

$ jruby -e "p 'åäöÅÄÖ'.unicode_normalize(:nfd).match /(\X)/"
#<MatchData "a" 1:"a">

Note the absence of the "modifier".
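Because a decomposed "å" and a bare "a" followed by a stray ring are hard to tell apart in a terminal, it can help to print the code points of the match instead. The following is a minimal sketch; `first_cluster_codepoints` is a hypothetical helper name, not something from JRuby or MRI.

```ruby
# Hypothetical helper: code points of the first \X match, as "U+XXXX" strings.
def first_cluster_codepoints(str)
  cluster = str[/\X/] || ''
  cluster.codepoints.map { |cp| format('U+%04X', cp) }
end

# "\u00E5" is the precomposed a-with-ring; NFD splits it into U+0061 + U+030A.
decomposed = "\u00E5".unicode_normalize(:nfd)
p first_cluster_codepoints(decomposed)
```

On MRI this prints both U+0061 and U+030A. Going by the output reported above, the affected JRuby version would presumably print only U+0061.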

Read more

In order to know what I'm talking about, here are links.

StackOverflow answer about "What is even \X?"

Keywords: extended grapheme cluster

@headius

headius Oct 30, 2017

Member

The unicode_normalize call appears to be working properly here: NFD decomposes each accented letter into a base letter followed by a combining character. The subsequent \X match fails to consume the combining characters and only produces the non-combining 'a', 'o', 'A', 'O' characters.

This is likely missing or incorrect logic in joni. I'm reading up on how parsers and regex engines are expected to handle Unicode normalized into combining characters.
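To separate the two steps described here (normalization versus the \X scan), one option is to count code points and \X clusters for each normalization form. This is only a sketch; `normalization_report` is a made-up name for illustration.

```ruby
# Hypothetical helper: how many code points and how many \X clusters a
# string has after normalizing it to the given form (:nfd, :nfc, ...).
def normalization_report(str, form)
  normalized = str.unicode_normalize(form)
  { codepoints: normalized.codepoints.size, clusters: normalized.scan(/\X/).size }
end

text = "\u00E5\u00E4\u00F6\u00C5\u00C4\u00D6" # the six letters from the report
p normalization_report(text, :nfd)
p normalization_report(text, :nfc)
```

On MRI, the NFD form has 12 code points but still 6 clusters, while the NFC form has 6 of each. A cluster count other than 6 for NFD would point at the regex engine rather than at unicode_normalize.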

@headius

headius Oct 30, 2017

Member

@olleolleolle So one obvious workaround would be to not normalize, or to normalize to one of the composed forms, NFC or NFKC, rather than the decomposed forms:

$ jruby -e "p 'åäöÅÄÖ'.unicode_normalize(:nfc).match(/(\X)/)[1]"
"å"

@headius headius added this to the JRuby 9.2.0.0 milestone Oct 30, 2017

@jensnockert

jensnockert Oct 30, 2017

This only solves the test case. If you use the string "أُحِبُّ ٱلْقِرَاءَةَ كَثِيرًا‎" instead, it fails even in NFC, since there are no precomposed variants for Arabic letters with vowel marks.
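A small sketch of why NFC does not help in that case: count the combining marks that survive NFC normalization. `marks_left_after_nfc` is a hypothetical name, and the two sample strings are illustrative picks, not taken from the report.

```ruby
# Hypothetical helper: number of combining marks (Unicode category M)
# still present after NFC normalization.
def marks_left_after_nfc(str)
  str.unicode_normalize(:nfc).scan(/\p{M}/).size
end

p marks_left_after_nfc("a\u030A")      # Latin a + combining ring above → 0
p marks_left_after_nfc("\u0628\u064E") # Arabic beh + fatha → 1
```

The Latin pair composes into a single code point, so nothing is left for \X to drop. The Arabic pair has no precomposed form, so the mark remains and the match still depends on \X handling combining characters correctly.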

@olleolleolle

olleolleolle Oct 31, 2017

Contributor

Oh, closed by mistake! Re-opened.

@olleolleolle olleolleolle reopened this Oct 31, 2017

@headius

headius Nov 28, 2017

Member

I'm going to close this in favor of #4568, which details the standard, the commits that added it to MRI, and so on.
