New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8303018: Unicode Emoji Properties #13006
Conversation
👋 Welcome back naoto! A progress list of the required criteria for merging this PR into |
Webrevs
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not at all familiar with all these templates but everything I do understand lgtm :-)
* {@code false} otherwise. | ||
* @since 21 | ||
*/ | ||
public static boolean isExtendedPictographic(int codePoint) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a "cut 'n paste" error here, it looks like the description has been copied from isEmojiComponent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad. You are right. Corrected.
|
||
/** | ||
* Determines if the specified character (Unicode code point) has the | ||
* Emoji Presentation by default. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"has the Emoji Presentation", should that be "has the Emoji Presentation property"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed this one as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed this one as well
Spec update looks good. I suppose I have to use an emoji to react to that 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Build changes look ok.
maskEmojiPresentation = 0x008000000000L, | ||
maskEmojiModifier = 0x010000000000L, | ||
maskEmojiModifierBase = 0x020000000000L, | ||
maskEmojiComponent = 0x040000000000L, | ||
maskExtendedPictographic = 0x080000000000L; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice the have the '=' aligned in the same column as the ones above, or at least have the emoji ones align with each other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Indentation aligned, and although in production -string
is used so comments are not emitted, added information for B
table in GenerateCharacter.propertiesComments
for consistency.
Would it make sense to add something like the following to if ((val & maskEmoji) == maskEmoji) {
result.append(", emoji ");
}
if ((val & maskEmojiPresentation) == maskEmojiPresentation) {
result.append(", emojiPresentation ");
}
if ((val & maskEmojiModifier) == maskEmojiModifier) {
result.append(", emojiModifier ");
}
if ((val & maskEmojiModifierBase) == maskEmojiModifierBase) {
result.append(", emojiModifierBase ");
}
if ((val & maskEmojiComponent) == maskEmojiComponent) {
result.append(", emojiComponent ");
}
if ((val & maskExtendedPictographic) == maskExtendedPictographic) {
result.append(", extendedPictographic ");
} This updates the comments of the B table to something like this sample:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
There are opportunities to modernize the code, but maybe separately.
case "Emoji_Modifier_Base" -> EMOJI_MODIFIER_BASE; | ||
case "Emoji_Component" -> EMOJI_COMPONENT; | ||
case "Extended_Pictographic" -> EXTENDED_PICTOGRAPHIC; | ||
default -> throw new InternalError(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be useful to include the "type" as the exception argument. It give some idea as to the corruption or missing case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added type
to the error message
maskOtherLowercase = 0x000100000000L, | ||
maskOtherUppercase = 0x000200000000L, | ||
maskOtherAlphabetic = 0x000400000000L, | ||
maskIdeographic = 0x000800000000L, | ||
maskIDStart = 0x001000000000L, | ||
maskIDContinue = 0x002000000000L, | ||
maskEmoji = 0x004000000000L, | ||
maskEmojiPresentation = 0x008000000000L, | ||
maskEmojiModifier = 0x010000000000L, | ||
maskEmojiModifierBase = 0x020000000000L, | ||
maskEmojiComponent = 0x040000000000L, | ||
maskExtendedPictographic = 0x080000000000L; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to leverage a common definition (perhaps a bit number) here and in EmojiData.java
and build the constants with <<< shifts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I managed to get rid of the constants in EmojiData
altogether, by using the constants in GenerateCharacter
. Used the bit numbers to construct constants.
if (x.equals("maskEmoji")) return "0x" + hex4(maskEmoji >> 32); | ||
if (x.equals("maskEmojiPresentation")) return "0x" + hex4(maskEmojiPresentation >> 32); | ||
if (x.equals("maskEmojiModifier")) return "0x" + hex4(maskEmojiModifier >> 32); | ||
if (x.equals("maskEmojiModifierBase")) return "0x" + hex4(maskEmojiModifierBase >> 32); | ||
if (x.equals("maskEmojiComponent")) return "0x" + hex4(maskEmojiComponent >> 32); | ||
if (x.equals("maskExtendedPictographic")) return "0x" + hex4(maskExtendedPictographic >> 32); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An upgrade would be to modify hex4(), hexNN() to use HexFormat.of().toUpperCase().toHexDigits((short)xxx)
The HexFormat is reusable and would avoid creating extra strings.
Perhaps also create a method that combines the repetitive shift and prefixing.
This if...then... sequence could be an expression switch (x) {...}.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good, but it would be for another day IMO.
Unrelated side note: Reviewing this PR inspired me to see if one could generate more efficient switch expressions for the CharacterDataLatin1 methods as replacement for the property lookup / masking done today. This seemed to give a small improvements on benchmarks and also collapsed a few methods to simply "return false". Could be something to explore at a later point. |
Yeah, agree. |
@naotoj This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 88 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
/integrate |
Going to push as commit f593a6b.
Your commit was automatically rebased without conflicts. |
Proposing accessor methods to Emoji properties defined in Unicode Technical Standard #51 in
java.lang.Character
class. This is per a request from the client group, as well as refining the currently existing ad-hoc emoji implementation in regex. A CSR has also been drafted, and I would appreciate reviews for it too.Progress
Issues
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/13006/head:pull/13006
$ git checkout pull/13006
Update a local copy of the PR:
$ git checkout pull/13006
$ git pull https://git.openjdk.org/jdk.git pull/13006/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 13006
View PR using the GUI difftool:
$ git pr show -t 13006
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/13006.diff