<:Foo> syntax in regexes ambiguous #118

Open
jnthn opened this issue Sep 28, 2016 · 5 comments

@jnthn
Contributor

jnthn commented Sep 28, 2016

S05 defines <:Foo> as:

Unicode properties are indicated by use of pair notation in place of a normal rule name:

<:Letter>   # a letter
<:!Letter>  # a non-letter

Properties with arguments are passed as the argument to the pair:

<:East_Asian_Width<Narrow>>
<:!Blk<ASCII>>

The second form is unambiguous. The first, less so. Here's a quote from the Unicode database (in PropertyValueAliases.txt):

NOTE: Property value names are NOT unique across properties. For example:

AL means Arabic Letter for the Bidi_Class property, and
AL means Above_Left for the Canonical_Combining_Class property, and
AL means Alphabetic for the Line_Break property.

In addition, some property names may be the same as some property value names.
For example:

sc means the Script property, and
Sc means the General_Category property value Currency_Symbol (Sc)

The combination of property value and property name is, however, unique.

Which raises the question of what <:AL> would mean, or <:Sc>. The one that actually tripped me up is <:space>, which can either be an alias for the WSpace property (per PropertyAliases.txt):

WSpace                   ; White_Space                 ; space

Or a property value name from the Line_Break property (per PropertyValueAliases.txt):

lb ; SP                               ; Space

The ambiguity is currently resolved by the order in which we make entries into the lookup hash, which is defined by the order in which we generate the C code in ucd2c.pl, which in turn is randomized due to Perl 5 hash order randomization. So you can get spectest failures, regenerate from the exact same Unicode database version and ucd2c.pl, and "get lucky" next time around. I came upon this by getting "unlucky" when doing the Unicode 9 database version bump, but it's been a problem all along.
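
The order dependence described above can be illustrated with a minimal Python sketch. This is not the ucd2c.pl or MoarVM code: the entry tuples, the `build_lookup` helper, the case-folded key `space`, and the last-writer-wins policy are all assumptions made for illustration.

```python
from itertools import permutations

# Hypothetical entries derived from the two UCD lines quoted above.
# Assumption: names are case-folded, so "space" and "Space" share one key.
ENTRIES = [
    ("space", "White_Space"),       # alias of the White_Space property
    ("space", "Line_Break=Space"),  # value of the Line_Break property
]


def build_lookup(entries):
    """Fill a flat lookup table; a later entry overwrites an earlier one."""
    lookup = {}
    for name, meaning in entries:
        lookup[name] = meaning
    return lookup


# Try every insertion order instead of relying on hash randomization.
results = [build_lookup(order)["space"] for order in permutations(ENTRIES)]
for meaning in results:
    print(meaning)
```

With a single flat table, the meaning of `space` is decided purely by which entry was inserted last, so the same input data yields a different answer for each insertion order.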

@patch
Member

patch commented Sep 28, 2016

I think S05 is lacking intended details. No regex engine allows for arbitrary property values for all properties without the associated names, due to the obvious conflicts. Most regex engines allow standalone General_Category values and some allow standalone Script and/or Block values (which do conflict). Perl 5 supports all three, with a preference for Script over Block when they conflict.

This discussion is happening simultaneously for an active ECMAScript proposal and the current plan is to only support standalone values for General_Category with the option to expand in the future if needed: https://github.com/mathiasbynens/es-regexp-unicode-property-escapes

Also, supporting Script instead of Script_Extensions would be a mistake since the latter is generally what people expect and should be encouraged over Script. I personally think that the General_Category-only route is by far the safest and most straightforward. If an additional property were to be supported, Script_Extensions is the next most useful and does not conflict with General_Category by design.

@patch
Member

patch commented Sep 28, 2016

A good description of Script (sc) vs. Script_Extensions (scx):
http://unicode.org/reports/tr18/#Script_Property

samcv added a commit to samcv/MoarVM that referenced this issue Dec 27, 2016
Even though the last change fixed the problem where the
property code for 'space' was different than the one for
'White_Space', it failed some of the tests, notably testing
' ' ~~ /<:space>/ which may be this problem here:
Raku/old-design-docs#118
Long discussion here:
https://irclog.perlgeek.de/moarvm/2016-12-27#i_13805707
@samcv
Contributor

samcv commented May 21, 2017

As part of my Unicode Grant I am having to address this.

From the perspective of implementing it on MoarVM, we are given a name, let's say "Latin", and look up what property is associated with that. In this case it would be the "Script" property.

Currently MoarVM throws all the property values in together and assumes that they are distinct with one property value to one specific property, which does not work in practice.

As I work on re-implementing this part of the code I need to decide which property values should be resolvable to property names (which is needed for regex without specifying the actual property you are trying to query).

I am going to put together a list of all of the conflicts so we can hopefully decide how we want to go about prioritizing them, or at the very least know where all the overlaps are, which ones we want to prioritize, and which are inconsequential.
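
As a rough sketch of how such a conflict list could be gathered, the Python below groups property value names by the properties that define them and keeps only the names claimed by more than one property. The `SAMPLE` pairs are hand-picked from this thread rather than parsed from PropertyValueAliases.txt, and `find_conflicts` is a hypothetical helper, not the script actually used.

```python
from collections import defaultdict

# (property, value name) pairs hand-picked from this thread for illustration.
# A real run would read them from PropertyValueAliases.txt instead.
SAMPLE = [
    ("Bidi_Class", "AL"),
    ("Canonical_Combining_Class", "AL"),
    ("Line_Break", "AL"),
    ("Line_Break", "SP"),
    ("Sentence_Break", "SP"),
    ("Script", "Latin"),
]


def find_conflicts(pairs):
    """Map each value name to its owning properties; keep names with several owners."""
    owners = defaultdict(list)
    for prop, value in pairs:
        if prop not in owners[value]:
            owners[value].append(prop)
    return {value: props for value, props in owners.items() if len(props) > 1}


conflicts = find_conflicts(SAMPLE)
for value, props in conflicts.items():
    print(value, "=>", props)
```

Names with a single owner, such as `Latin` in this sample, drop out of the result, which leaves only the cases that need a priority decision.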

@samcv
Contributor

samcv commented May 21, 2017

# All except <True False T F Yes No Y N> and Script/Block overlaps
L => ["Grapheme_Cluster_Break", "Hangul_Syllable_Type", "Bidi_Class", "Jamo_Short_Name", "Canonical_Combining_Class", "General_Category", "Joining_Type"],
Other => ["Indic_Syllabic_Category", "Grapheme_Cluster_Break", "Word_Break", "Sentence_Break", "General_Category"],
EX => ["Grapheme_Cluster_Break", "Word_Break", "Line_Break", "Sentence_Break"],
Numeric => ["Word_Break", "Line_Break", "Sentence_Break", "Numeric_Type"],
XX => ["Grapheme_Cluster_Break", "Word_Break", "Line_Break", "Sentence_Break"],
CR => ["Grapheme_Cluster_Break", "Word_Break", "Line_Break", "Sentence_Break"],
R => ["Bidi_Class", "Jamo_Short_Name", "Canonical_Combining_Class", "Joining_Type"],
M => ["NFKC_Quick_Check", "Jamo_Short_Name", "General_Category", "NFC_Quick_Check"],
LF => ["Grapheme_Cluster_Break", "Word_Break", "Line_Break", "Sentence_Break"],
Regional_Indicator => ["Grapheme_Cluster_Break", "Word_Break", "Line_Break"],
AL => ["Bidi_Class", "Canonical_Combining_Class", "Line_Break"],
EM => ["Grapheme_Cluster_Break", "Word_Break", "Line_Break"],
NU => ["Word_Break", "Line_Break", "Sentence_Break"],
A => ["East_Asian_Width", "Jamo_Short_Name", "Canonical_Combining_Class"],
E_Base => ["Grapheme_Cluster_Break", "Word_Break", "Line_Break"],
RI => ["Grapheme_Cluster_Break", "Word_Break", "Line_Break"],
B => ["Bidi_Class", "Jamo_Short_Name", "Canonical_Combining_Class"],
ZWJ => ["Grapheme_Cluster_Break", "Word_Break", "Line_Break"],
EB => ["Grapheme_Cluster_Break", "Word_Break", "Line_Break"],
Extend => ["Grapheme_Cluster_Break", "Word_Break", "Sentence_Break"],
None => ["Bidi_Paired_Bracket_Type", "Decomposition_Type", "Numeric_Type"],
S => ["Bidi_Class", "Jamo_Short_Name", "General_Category"],
E_Modifier => ["Grapheme_Cluster_Break", "Word_Break", "Line_Break"],
NA => ["Age", "Hangul_Syllable_Type", "Indic_Positional_Category"],
Format => ["Word_Break", "Sentence_Break", "General_Category"],
C => ["Jamo_Short_Name", "General_Category", "Joining_Type"],
Right => ["Canonical_Combining_Class", "Indic_Positional_Category"],
Unassigned => ["Age", "General_Category"],
Control => ["Grapheme_Cluster_Break", "General_Category"],
Nukta => ["Indic_Syllabic_Category", "Canonical_Combining_Class"],
E => ["Joining_Group", "Jamo_Short_Name"],
Surrogate => ["Line_Break", "General_Category"],
Punctuation => ["Block", "General_Category"],
V => ["Grapheme_Cluster_Break", "Hangul_Syllable_Type"],
Nonspacing_Mark => ["Bidi_Class", "General_Category"],
Number => ["Indic_Syllabic_Category", "General_Category"],
SP => ["Line_Break", "Sentence_Break"],
E_Base_GAZ => ["Grapheme_Cluster_Break", "Word_Break"],
Close_Punctuation => ["Line_Break", "General_Category"],
Unknown => ["Script", "Line_Break"],
GAZ => ["Grapheme_Cluster_Break", "Word_Break"],
LV => ["Grapheme_Cluster_Break", "Hangul_Syllable_Type"],
IS => ["Canonical_Combining_Class", "Line_Break"],
CL => ["Line_Break", "Sentence_Break"],
Open_Punctuation => ["Line_Break", "General_Category"],
Private_Use => ["Block", "General_Category"],
Paragraph_Separator => ["Bidi_Class", "General_Category"],
Pe => ["Joining_Group", "General_Category"],
D => ["Jamo_Short_Name", "Joining_Type"],
Narrow => ["East_Asian_Width", "Decomposition_Type"],
NL => ["Word_Break", "Line_Break"],
Wide => ["East_Asian_Width", "Decomposition_Type"],
Virama => ["Indic_Syllabic_Category", "Canonical_Combining_Class"],
Hebrew_Letter => ["Word_Break", "Line_Break"],
U => ["Jamo_Short_Name", "Joining_Type"],
LE => ["Word_Break", "Sentence_Break"],
Left => ["Canonical_Combining_Class", "Indic_Positional_Category"],
Glue_After_Zwj => ["Grapheme_Cluster_Break", "Word_Break"],
Close => ["Bidi_Paired_Bracket_Type", "Sentence_Break"],
BB => ["Jamo_Short_Name", "Line_Break"],
HL => ["Word_Break", "Line_Break"],
P => ["Jamo_Short_Name", "General_Category"],
Maybe => ["NFKC_Quick_Check", "NFC_Quick_Check"],
EBG => ["Grapheme_Cluster_Break", "Word_Break"],
Combining_Mark => ["Line_Break", "General_Category"],
LVT => ["Grapheme_Cluster_Break", "Hangul_Syllable_Type"],
FO => ["Word_Break", "Sentence_Break"],
H => ["East_Asian_Width", "Jamo_Short_Name"],
Ambiguous => ["East_Asian_Width", "Line_Break"],

Here are all the ones that are Block and Script overlaps:

# Only Script/Block overlap
Malayalam => ["Script", "Block"],
Sundanese => ["Script", "Block"],
Mahajani => ["Script", "Block"],
Pau_Cin_Hau => ["Script", "Block"],
Tibetan => ["Script", "Block"],
Sora_Sompeng => ["Script", "Block"],
Runic => ["Script", "Block"],
Thai => ["Script", "Block"],
Osage => ["Script", "Block"],
Rejang => ["Script", "Block"],
Bassa_Vah => ["Script", "Block"],
Gurmukhi => ["Script", "Block"],
Glagolitic => ["Script", "Block"],
Old_Hungarian => ["Script", "Block"],
Grantha => ["Script", "Block"],
Palmyrene => ["Script", "Block"],
Gothic => ["Script", "Block"],
Lao => ["Script", "Block"],
Nabataean => ["Script", "Block"],
Limbu => ["Script", "Block"],
Old_Persian => ["Script", "Block"],
Phoenician => ["Script", "Block"],
Tai_Le => ["Script", "Block"],
Ol_Chiki => ["Script", "Block"],
Khudawadi => ["Script", "Block"],
Old_Permic => ["Script", "Block"],
Elbasan => ["Script", "Block"],
Duployan => ["Script", "Block"],
Samaritan => ["Script", "Block"],
Syriac => ["Script", "Block"],
Devanagari => ["Script", "Block"],
Greek => ["Script", "Block"],
Lycian => ["Script", "Block"],
Ethiopic => ["Script", "Block"],
Thaana => ["Script", "Block"],
Hatran => ["Script", "Block"],
Siddham => ["Script", "Block"],
Psalter_Pahlavi => ["Script", "Block"],
Kharoshthi => ["Script", "Block"],
Mandaic => ["Script", "Block"],
Newa => ["Script", "Block"],
Kayah_Li => ["Script", "Block"],
Warang_Citi => ["Script", "Block"],
Multani => ["Script", "Block"],
Osmanya => ["Script", "Block"],
Georgian => ["Script", "Block"],
Armenian => ["Script", "Block"],
Sinhala => ["Script", "Block"],
Hiragana => ["Script", "Block"],
Shavian => ["Script", "Block"],
New_Tai_Lue => ["Script", "Block"],
Bamum => ["Script", "Block"],
Cyrillic => ["Script", "Block"],
Old_South_Arabian => ["Script", "Block"],
Myanmar => ["Script", "Block"],
Miao => ["Script", "Block"],
Meroitic_Cursive => ["Script", "Block"],
Tirhuta => ["Script", "Block"],
Coptic => ["Script", "Block"],
Caucasian_Albanian => ["Script", "Block"],
Hanunoo => ["Script", "Block"],
Tamil => ["Script", "Block"],
Avestan => ["Script", "Block"],
Cherokee => ["Script", "Block"],
Inscriptional_Pahlavi => ["Script", "Block"],
Kannada => ["Script", "Block"],
Tifinagh => ["Script", "Block"],
Javanese => ["Script", "Block"],
Inscriptional_Parthian => ["Script", "Block"],
Mro => ["Script", "Block"],
Cham => ["Script", "Block"],
Takri => ["Script", "Block"],
Hangul => ["Script", "Block"],
Old_Turkic => ["Script", "Block"],
Oriya => ["Script", "Block"],
Kaithi => ["Script", "Block"],
Ahom => ["Script", "Block"],
Linear_A => ["Script", "Block"],
Meetei_Mayek => ["Script", "Block"],
Egyptian_Hieroglyphs => ["Script", "Block"],
Ugaritic => ["Script", "Block"],
Buginese => ["Script", "Block"],
Tagalog => ["Script", "Block"],
Anatolian_Hieroglyphs => ["Script", "Block"],
Pahawh_Hmong => ["Script", "Block"],
Tangut => ["Script", "Block"],
Telugu => ["Script", "Block"],
Batak => ["Script", "Block"],
Phags_Pa => ["Script", "Block"],
Vai => ["Script", "Block"],
Mongolian => ["Script", "Block"],
Modi => ["Script", "Block"],
Bhaiksuki => ["Script", "Block"],
Lisu => ["Script", "Block"],
Lydian => ["Script", "Block"],
Brahmi => ["Script", "Block"],
Cuneiform => ["Script", "Block"],
Tai_Viet => ["Script", "Block"],
Syloti_Nagri => ["Script", "Block"],
Chakma => ["Script", "Block"],
Adlam => ["Script", "Block"],
Braille => ["Script", "Block"],
Marchen => ["Script", "Block"],
Deseret => ["Script", "Block"],
Imperial_Aramaic => ["Script", "Block"],
Arabic => ["Script", "Block"],
Khmer => ["Script", "Block"],
Balinese => ["Script", "Block"],
Bengali => ["Script", "Block"],
Bopomofo => ["Script", "Block"],
Tai_Tham => ["Script", "Block"],
Mende_Kikakui => ["Script", "Block"],
Hebrew => ["Script", "Block"],
Meroitic_Hieroglyphs => ["Script", "Block"],
Sharada => ["Script", "Block"],
Khojki => ["Script", "Block"],
Lepcha => ["Script", "Block"],
Saurashtra => ["Script", "Block"],
Tagbanwa => ["Script", "Block"],
Old_Italic => ["Script", "Block"],
Gujarati => ["Script", "Block"],
Carian => ["Script", "Block"],
Old_North_Arabian => ["Script", "Block"],
Ogham => ["Script", "Block"],
Buhid => ["Script", "Block"],
Manichaean => ["Script", "Block"],
Katakana => ["Script", "Block", "Word_Break"],

@samcv
Contributor

samcv commented May 22, 2017

All of the property names that conflict with values are Bool properties:

«« IDC Conflict with property name [blk]  is a boolean property
«« VS Conflict with property name [blk]  is a boolean property
«« White_Space Conflict with property name [bc]  is a boolean property
«« Alphabetic Conflict with property name [lb]  is a boolean property
«« Hyphen Conflict with property name [lb]  is a boolean property
«« Ideographic Conflict with property name [lb]  is a boolean property
«« Lower Conflict with property name [SB]  is a boolean property
«« STerm Conflict with property name [SB]  is a boolean property
«« Upper Conflict with property name [SB]  is a boolean property

I would like property names to be 0th in priority:
0. Property Name (e.g. <:White_Space>, <:Hyphen>)

If we set our preferred properties to be General_Category and Script, then we get 49 property values with overlaps. If we add a third preferred property Grapheme_Cluster_Break we only have 30 remaining.

From here we can resolve Canonical_Combining_Class, and also we should resolve Numeric_Type so that people can use <:Numeric> in their regex (I'm sure that there must already exist code where this is used so we need to make sure this is resolved as well).

Leaving us at a hierarchy of:
0. Property Name (e.g. <:White_Space>, <:Hyphen>)
1. General_Category
2. Script
3. Grapheme_Cluster_Break
4. Canonical_Combining_Class
5. Numeric_Type

I am open to adding whichever properties people think most important to the ordered priority list as well.

The ones with overlap remaining after this point:

NU => ["Word_Break", "Line_Break", "Sentence_Break"],
NA => ["Age", "Hangul_Syllable_Type", "Indic_Positional_Category"],
E => ["Joining_Group", "Jamo_Short_Name"],
SP => ["Line_Break", "Sentence_Break"],
CL => ["Line_Break", "Sentence_Break"],
D => ["Jamo_Short_Name", "Joining_Type"],
Narrow => ["East_Asian_Width", "Decomposition_Type"],
NL => ["Word_Break", "Line_Break"],
Wide => ["East_Asian_Width", "Decomposition_Type"],
Hebrew_Letter => ["Word_Break", "Line_Break"],
U => ["Jamo_Short_Name", "Joining_Type"],
LE => ["Word_Break", "Sentence_Break"],
Close => ["Bidi_Paired_Bracket_Type", "Sentence_Break"],
BB => ["Jamo_Short_Name", "Line_Break"],
HL => ["Word_Break", "Line_Break"],
Maybe => ["NFKC_Quick_Check", "NFC_Quick_Check"],
FO => ["Word_Break", "Sentence_Break"],
H => ["East_Asian_Width", "Jamo_Short_Name"],
Ambiguous => ["East_Asian_Width", "Line_Break"],

Any ideas about adding further to the hierarchy (even if they don't have any overlap presently [Unicode 9.0], it could be introduced later) will be appreciated.
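
The proposed hierarchy can be sketched in Python as a lookup that tries property names first and then walks the preferred properties in order. This is only an illustration under stated assumptions: `PROPERTY_NAMES` and `VALUE_OWNERS` hold a few entries copied from the lists in this thread, `resolve` is a hypothetical helper, and none of it is the MoarVM implementation.

```python
# Priority order proposed in this comment (level 0 is "property name").
PRIORITY = [
    "General_Category",
    "Script",
    "Grapheme_Cluster_Break",
    "Canonical_Combining_Class",
    "Numeric_Type",
]

# Illustrative subset only; the real tables would come from the UCD.
PROPERTY_NAMES = {"White_Space", "Hyphen"}
VALUE_OWNERS = {
    "Hyphen": ["Line_Break"],
    "Katakana": ["Script", "Block", "Word_Break"],
    "Numeric": ["Word_Break", "Line_Break", "Sentence_Break", "Numeric_Type"],
    "Control": ["Grapheme_Cluster_Break", "General_Category"],
    "SP": ["Line_Break", "Sentence_Break"],
}


def resolve(name):
    """Return ("property", name), ("value", owning property), or None if ambiguous."""
    if name in PROPERTY_NAMES:          # level 0: a property name always wins
        return ("property", name)
    owners = VALUE_OWNERS.get(name, [])
    for prop in PRIORITY:               # levels 1..5: first preferred owner wins
        if prop in owners:
            return ("value", prop)
    if len(owners) == 1:                # unique owner outside the priority list
        return ("value", owners[0])
    return None                         # unknown or still ambiguous


for name in ["Hyphen", "Katakana", "Numeric", "Control", "SP"]:
    print(name, "->", resolve(name))
```

Returning `None` for names such as `SP` keeps the remaining overlaps explicit, so a caller could raise an error or require the full property=value form instead of silently picking one.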
