proposal: unicode: add emoji properties #45264
Comments
cc @robpike |
CC @mpvl |
I suspect this would be better done in a separate package, probably not in the standard library, as I believe that the data set will grow substantial over time and most programs won't need it. |
#40724 |
That's not the right issue number for the duplicate, or else not the right issue for that duplicate number. |
I agree that this should be done outside the stdlib. The Unicode technical reports (UAX, UTS, UTR) make it clear that they are independent specifications, and that conformance to the Unicode standard does not imply conformance to the technical report. It's also unclear why we would add only emoji and not the other reports, such as identifiers, script properties, etc. I will say, however, that the current API for constructing range tables is not very ergonomic. I ended up forgoing efficiency and used functions just so I had access to |
This isn't exactly true. While implementing a UTS or UTR is indeed not required, a Unicode Standard Annex (UAX) may be required by the Standard. Chapter 3 of the Standard lists such annexes.
As pointed out, a UTS isn't part of the Unicode Standard, so I think it is quite reasonable not to put emoji stuff in Although I feel that at least the range tables (without emoji sequences and other trickery) should be somewhere close to
|
I agree, I've often wanted
// +build go1.14,!go1.16

package xid

const unicodeTestVersion = "12.1.0"

// +build go1.16

package xid

const unicodeTestVersion = "13.0.0" |
A natural place to put these properties would be This repo already has a runenames package, which could naturally hold emoji names as well (although I'm unsure how this fits with emoji sequences). The argument for putting it in core so it can be used in regexp is valid. However, this would mean including more tables in core that are currently missing. A more reasonable solution to support emoji in regexp would be to allow user-defined character classes, allowing users to add classes from x/text, for instance. It should also be stated what the goal of these tables is. Depending on the application, range tables may not be the best representation. Judging from UTS #51, for instance, a UTF-8 trie, which allows associating a set of related properties with a single rune, seems more appropriate. The x/text repo has all the infrastructure in place to generate such tries conveniently. |
@smasher164: the x/text repo uses a similar trick. The generators are multi-version aware and will automatically add or modify build tags for generated tables. It uses this, however, to ensure that the versions align with the latest Go. Your comment suggests that you would want it the other way around: have Go adopt a later version. This gives rise to the idea that core could use tables from This obviously would require a separate proposal. There are some serious implications for this. Also, there are packages with hardcoded range tables. But all this could be worked around. |
I don't want to be "that guy", but which policies govern decisions about what gets included where? Earlier it was said that "conformance to the Unicode standard does not imply conformance to the technical report". @mpvl I see that While Despite UTS being an "independent specification" by definition, the Standard itself clearly mentions that only a few UTSes are synchronized with its version: UTS#10 (collation), UTS#39 (security), UTS#46 (IDNA) and UTS#51 (emoji). For example, conformance to Unicode 13 means that it should contain "Khitan Small Script". And
Since other tables are probably covered by other Technical Reports, adding a single one doesn't imply that the others should be included too. These are still independent specs.
I agree that there should be a more flexible way to do so, although I don't think I would be able to write a proper proposal for that.
I can't speak for others, but I ended up in a situation where I need to be aware of emoji in text. First, to correctly remove such characters from text or replace them with non-graphical representations (e.g. using names from the UCD). Second, to count the number of characters when a single emoji or an emoji sequence represents a single "character". |
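To illustrate the first use case (removing such characters), here is a minimal sketch, assuming a range table for the characters of interest is available. The emoticons table and the stripTables helper are hypothetical names introduced for this example; the table covers only the Emoticons block (U+1F600..U+1F64F) and is not the real Emoji property data.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// emoticons is a hand-written stand-in table (assumption for this sketch).
// It covers only the Emoticons block, NOT the full UTS#51 Emoji property.
var emoticons = &unicode.RangeTable{
	R32: []unicode.Range32{{Lo: 0x1F600, Hi: 0x1F64F, Stride: 1}},
}

// stripTables removes every rune that belongs to any of the given tables.
func stripTables(s string, tables ...*unicode.RangeTable) string {
	return strings.Map(func(r rune) rune {
		if unicode.In(r, tables...) {
			return -1 // a negative value tells strings.Map to drop the rune
		}
		return r
	}, s)
}

func main() {
	// unicode.Variation_Selector already exists in the unicode package.
	fmt.Println(stripTables("ok \U0001F600 done", emoticons, unicode.Variation_Selector))
}
```

The second use case (counting an emoji sequence as one "character") needs sequence-aware segmentation, which a per-rune table lookup like this does not provide.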
The stdlib tables are generated from x/text. Core even depends on x/text and build tags in x/text ensure that the Unicode version of x/text is matched to that of core. So tip of x/text is ahead in Unicode version compared to core.
That seems like a bug in runenames' generate script if true. It should update automatically with a Unicode upgrade. @nigeltao. |
I could imagine an API like func RegisterClass(name string, table *unicode.RangeTable) in either func WithClass(name string, table *unicode.RangeTable, expr string) (*Regexp, error) Either way, this would be a separate proposal.
I could imagine the stdlib being behind the supported version in x/text. That way, for example, someone who wanted to use unicode 13 functionality on Go 1.15 could simply import Maybe the way forward here is to either define these properties in |
Core Unicode tables are generated from x/text and core even imports x/text for various use cases, like normalization. Also, the x/text tables use build tags to keep these tables in sync. Theoretically, core could refer to x/text for all its tables, which would allow getting rid of the build tag trick and would allow using newer Unicode versions independently from the Go version. That needs some serious thought and some adjustment to existing packages like strconv IIRC. |
Something like that. Passing a function with a signature |
I figured out what's wrong. It is not a bug in x/text and not a package issue per se. I suppose the build environment for pkgsite uses an older Go release and picks up an older table which has
I personally do not like the idea of pulling in v0 packages for use in somewhat stable releases of Go. Whether or not you plan on using range tables for these kinds of characters, I suppose it is now decided to put them in a separate package within |
I think that makes sense, yes. An API for pluggable character classes for regexp is fine with me and probably covers other use cases too. It would be wise to file a separate proposal and discuss the implementation details there. |
This proposal has been added to the active column of the proposals project |
Even if we added these to unicode.Properties, regexp only does Categories and Scripts. Do you need emoji things in regexp, or was that just brought up for completeness? |
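The limitation described above can be checked directly. This is a small sketch, assuming a Go release contemporary with this discussion: class names from unicode.Categories and unicode.Scripts are accepted in \p{...}, while names that exist only as properties (or not at all in the unicode package) are rejected. The compiles helper is a hypothetical name.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles reports whether expr is accepted by the regexp package.
func compiles(expr string) bool {
	_, err := regexp.Compile(expr)
	return err == nil
}

func main() {
	exprs := []string{
		`\p{Lu}`,          // general category
		`\p{Greek}`,       // script
		`\p{White_Space}`, // property in unicode.Properties, not a category or script
		`\p{Emoji}`,       // not present in the unicode package
	}
	for _, expr := range exprs {
		fmt.Printf("%-18s compiles: %v\n", expr, compiles(expr))
	}
}
```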
I think I do not need regexp support. At least for me, regexp isn't a top priority. But it's not a simple question, to be honest. In a short time span I had to solve multiple unrelated problems with emoji. I feel that some of them could be solved more easily using some sort of property handles in regexp.
Is there any reason for that? I found that when I looked through |
I don't remember why I left Property out. Possibly it just seemed like too much for too little benefit. |
As an anecdote, the python |
As long as regexp is not a requirement, then adding these to unicode.Properties probably makes sense. |
I took a look at UAX#44 and marked with
|
This is a copy of @gudvinr's answer above, with missing properties highlighted.

General:
Name
Name_Alias
Block
Age
General_Category
Script
Script_Extensions
+White_Space
Alphabetic
Hangul_Syllable_Type
+Noncharacter_Code_Point
Default_Ignorable_Code_Point
+Deprecated
+Logical_Order_Exception
+Variation_Selector

Case:
Uppercase
Lowercase
Lowercase_Mapping
Titlecase_Mapping
Uppercase_Mapping
Case_Folding
Simple_Lowercase_Mapping
Simple_Titlecase_Mapping
Simple_Uppercase_Mapping
Simple_Case_Folding
+Soft_Dotted
Cased
Case_Ignorable
Changes_When_Lowercased
Changes_When_Uppercased
Changes_When_Titlecased
Changes_When_Casefolded
Changes_When_Casemapped

Emoji:
Emoji
Emoji_Presentation
Emoji_Modifier
Emoji_Modifier_Base
Emoji_Component
Extended_Pictographic

Numeric:
Numeric_Value
Numeric_Type
+Hex_Digit
+ASCII_Hex_Digit

Normalization:
Canonical_Combining_Class
Decomposition_Mapping (not recommended)
Composition_Exclusion (not recommended)
Full_Composition_Exclusion (not recommended)
Decomposition_Type
FC_NFKC_Closure (deprecated)
NFC_Quick_Check
NFKC_Quick_Check
NFD_Quick_Check
NFKD_Quick_Check
Expands_On_NFC (deprecated)
Expands_On_NFD (deprecated)
Expands_On_NFKC (deprecated)
Expands_On_NFKD (deprecated)
NFKC_Casefold
Changes_When_NFKC_Casefolded

Shaping and Rendering:
+Join_Control
Joining_Group
Joining_Type
Vertical_Orientation
East_Asian_Width
+Prepended_Concatenation_Mark

Bidirectional:
Bidi_Class
+Bidi_Control
Bidi_Mirrored
Bidi_Mirroring_Glyph
Bidi_Paired_Bracket
Bidi_Paired_Bracket_Type

Identifiers:
ID_Continue
ID_Start
XID_Continue
XID_Start
+Pattern_Syntax
+Pattern_White_Space

Segmentation:
Line_Break
Grapheme_Cluster_Break
Sentence_Break
Word_Break

CJK:
+Ideographic
+Unified_Ideograph
+Radical
+IDS_Binary_Operator
+IDS_Trinary_Operator
Unicode_Radical_Stroke
Equivalent_Unified_Ideograph

Miscellaneous:
Math
+Quotation_Mark
+Dash
+Hyphen (deprecated, stabilized)
+Sentence_Terminal
+Terminal_Punctuation
+Diacritic
+Extender
Grapheme_Base
Grapheme_Extend
Grapheme_Link (deprecated)
Unicode_1_Name
ISO_Comment (deprecated, stabilized)
+Regional_Indicator
Indic_Positional_Category
Indic_Syllabic_Category

Contributory Properties (not recommended):
+Other_Alphabetic
+Other_Default_Ignorable_Code_Point
+Other_Grapheme_Extend
+Other_ID_Start
+Other_ID_Continue
+Other_Lowercase
+Other_Math
+Other_Uppercase
Jamo_Short_Name |
Just to chip in: the missing properties would be useful for Go GUI libraries, in particular for implementing bi-directional and complex script rendering. But x/text might be just as good a place to keep them as unicode/. |
To add my view: especially if there is not going to be regexp support for properties, it doesn't make sense to add these properties to the set of properties for package Many of the "unsupported" properties are already supported in The reason why x/text didn't use RangeTables for many of these properties is that such properties are often not useful in isolation. This holds true for Case-, Normalization-, Bidi-, Grapheme-, Identifier-, and I suspect also Emoji-related properties. Folding these properties into a single per-rune/per-topic trie data structure has proven to give significant performance benefits. The packages I could imagine that a selection of these properties would be useful for regexp, though. |
Note, btw, that the list of unsupported properties includes non-boolean properties (such as EastAsianWidth, included in |
Good point. Here's the list of only unsupported boolean properties:
|
Based on the discussion above, this proposal seems like a likely decline. |
So, properties can't be added to properties list. What is the recommended way to go then? |
Perhaps we could still include these properties in x/text? But I suppose that should be a new issue? |
@mpvl has some ideas about how to provide some info in x/text, but that would be a separate package. |
No change in consensus, so declined. |
What version of Go are you using (go version)?
What do you propose?
There are a number of range tables in the unicode package of the stdlib which define some of the character properties from the Unicode Character Database.
Unicode also has additional sets of properties besides the ones defined in the core standard. These properties are described in technical reports.
Notably, UTS#51 defines sets of properties to determine which Unicode characters are emoji:
Emoji property - These characters are recommended for use as emoji
Extended_Pictographic property - These characters are pictographic
Emoji_Component property - These characters are used in emoji sequences
Emoji_Presentation property - A character that, by default, should appear with an emoji presentation
Emoji_Modifier - A character that can be used to modify the appearance of a preceding emoji
Emoji_Modifier_Base - A character whose appearance can be modified by a subsequent emoji modifier
The character property Regional_Indicator is already present in the unicode package.
Data source
At the time of writing, go1.16 contains range tables from Unicode 13.0.0.
Thus, properties for emoji data also should be taken from UCD emoji data 13.0.0.
Package changes
New RangeTable variables (order follows emoji-data.txt):
Emoji = _Emoji
Emoji_Presentation = _Emoji_Presentation
Emoji_Modifier = _Emoji_Modifier
Emoji_Modifier_Base = _Emoji_Modifier_Base
Emoji_Component = _Emoji_Component
Extended_Pictographic = _Extended_Pictographic
Including functions for checking character properties, like IsEmoji, IsEmojiModifier, IsEmojiModifierBase, etc., doesn't make a lot of sense since there is already the unicode.In function.
However, some kind of function in the form of IsEmojiData that checks the range tables for all emoji-related properties might be useful, e.g. to filter out all emoji components from text.
To make these properties usable in the regexp package, their names (or corresponding abbreviations) should be included in Categories or Scripts.
Additional notes
Although UTS#51 defines emoji sequences, this issue does not cover that topic, since an emoji sequence consists of multiple characters and the unicode package doesn't have a concept of a "character sequence".
Examples in other languages