-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8311220: Optimization for StringLatin UpperLower #14751
8311220: Optimization for StringLatin UpperLower #14751
Conversation
👋 Welcome back wenshao! A progress list of the required criteria for merging this PR into |
Webrevs
|
@wenshao This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration! |
Is anyone working on this PR? |
@AlanBateman It has been more than 1 month, can you help me to review this PR? |
@AlanBateman can you help me to review this PR? |
@cl4es can you help me to review this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this overall looks reasonable, but I think a more thorough proof / test would help to build confidence that all these changes are semantically neutral.
The isLowerCaseEx
needs to explain why two lower-case codepoints are omitted (perhaps this should be hasUpperCase
?)
@@ -90,6 +90,10 @@ class CharacterDataLatin1 extends CharacterData { | |||
return (getPropertiesEx(ch) & $$maskOtherLowercase) != 0; | |||
} | |||
|
|||
boolean isLowerCaseEx(int ch) { | |||
return ch >= 'a' && (ch <= 'z' || ch == 181 || (ch >= 223 && ch != 247)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the contract for this? Specifically there are two special superscripte codepoints (170 and 186) which are lower-case (Character.isLowerCase(170) => true
) but doesn't have an upper-case (Character.toUpperCase(170) => 170
). It seems reasonable to exclude them if only used for operations like toUpper/toLower (since they won't change), but it should be spelled out to avoid surprises.
For consistency I think we should use hex literals in this file, e.g. 0xDF
instead of 223
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current implementation of the isLowerCaseEx method and the previous implementation "cp != CharacterDataLatin1.instance.toUpperCaseEx(cp)"
The result is exactly the same.
The code below compares all numbers in [-128, 128]
import sun.misc.Unsafe;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.reflect.Field;
import java.nio.charset.StandardCharsets;
public class CharacterDataLatin1Test {
public static void main(String[] args) throws Throwable {
for (int i = Byte.MIN_VALUE; i <= Byte.MAX_VALUE; ++i) {
byte b = (byte) i;
int cp = b & 0xff;
boolean r0 = cp != toUpperCaseEx(cp);
boolean r1 = isLowerCaseEx(cp);
if (r0) {
System.out.println(cp + "\t0x" + Integer.toHexString(cp)
+ "\t" + new String(new byte[] {b}, StandardCharsets.ISO_8859_1));
}
if (r0 != r1) {
System.out.println("error " + i);
}
}
}
static boolean isLowerCaseEx(int ch) {
return ch >= 'a' && (ch <= 'z' || ch == 0xb5 || (ch >= 0xdf && ch != 0xf7));
}
static int toUpperCaseEx(int cp) throws Throwable {
Field theUnsafeField = Unsafe.class.getDeclaredField("theUnsafe");
theUnsafeField.setAccessible(true);
Unsafe unsafe = (Unsafe) theUnsafeField.get(null);
Class<?> charbinClass = Class.forName("java.lang.CharacterDataLatin1");
Field field = charbinClass.getDeclaredField("instance");
long fieldOffset = unsafe.staticFieldOffset(field);
Object instance = unsafe.getObject(charbinClass, fieldOffset);
Class lookupClass = MethodHandles.Lookup.class;
Field implLookup = lookupClass.getDeclaredField("IMPL_LOOKUP");
MethodHandles.Lookup trustedLookup = (MethodHandles.Lookup) unsafe.getObject(lookupClass,
UNSAFE.staticFieldOffset(implLookup));
MethodHandles.lookup();
MethodHandle toLowerCase = trustedLookup
.findVirtual(charbinClass, "toUpperCaseEx", MethodType.methodType(int.class, int.class));
return (Integer) toLowerCase.invoke(instance, cp);
}
}
String str1 = new String(new byte[]{(byte) 0xb5}, StandardCharsets.ISO_8859_1);
String str2 = new String(new byte[]{(byte) 0xdf}, StandardCharsets.ISO_8859_1);
System.out.println(str1 + "\t" + str1.toUpperCase());
System.out.println(str2 + "\t" + str2.toUpperCase()); result :
0xb5 and 0xdf call toUpperCaseEx return and input are different, these two codepoints are to ensure that the behavior has not changed. |
The two odd codepoints I was curious about are
I suggested |
i have renamed isLowerCaseEx to hasNotUpperCaseEx, is this ok? |
Now the build has failed, and git rebase can solve the failure. Is there any other way? |
The preferred route is to merge in then push changes from master to your PR branch. |
|
Merge will cause a lot of file changes |
i have renamed hasNotUpperCaseEx to hasUpperCaseMapping. |
…_string_latin1_upper_lower
/integrate |
@wenshao This pull request has not yet been marked as ready for integration. |
@@ -90,6 +90,10 @@ class CharacterDataLatin1 extends CharacterData { | |||
return (getPropertiesEx(ch) & $$maskOtherLowercase) != 0; | |||
} | |||
|
|||
boolean hasUpperCaseMapping(int ch) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is incorrectly named; it actually tests that a char has no upper case mapping. I recommend fixing this by keeping the method name while removing the double !
at this return and at the if down below to simplify the logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now use a local variable notUpperCaseEx, I prefer this, without adding a method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd have preferred @liach's solution (remove the double negative and keeping the method with name unchanged) but this is OK too.
|
@wenshao This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 9 new commits pushed to the
Please see this link for an up-to-date comparison between the source branch of this pull request and the As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@cl4es) but any other Committer may sponsor as well. ➡️ To flag this PR as ready for integration with the above commit message, type |
@@ -422,8 +422,7 @@ public static String toLowerCase(String str, byte[] value, Locale locale) { | |||
final int len = value.length; | |||
// Now check if there are any characters that need to be changed, or are surrogate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pre-existing: , or are surrogate
the mention of surrogate is misleading; this is a latin1 string and cannot contain a surrogate.
Here and in the toLowerCaseEx method.
/integrate |
@@ -608,7 +602,8 @@ public static String toUpperCase(String str, byte[] value, Locale locale) { | |||
// Now check if there are any characters that need to be changed, or are surrogate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you fix this one too for consistency? I'll sponsor either way after some testing.
/sponsor |
/integrate |
/sponsor |
Going to push as commit f09b7af.
Your commit was automatically rebased without conflicts. |
Benchmark Result
1. aliyun_ecs_c8i.xlarge
2. aliyun_ecs_c8a.xlarge
3. aliyun_ecs_c8y.xlarge
4. MacBookPro M1 Pro
5. Orange Pi 5 Plus
CPU : Rockchip RK3588 (aarch64)
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/14751/head:pull/14751
$ git checkout pull/14751
Update a local copy of the PR:
$ git checkout pull/14751
$ git pull https://git.openjdk.org/jdk.git pull/14751/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 14751
View PR using the GUI difftool:
$ git pr show -t 14751
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/14751.diff
Webrev
Link to Webrev Comment