Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8301119: Support for GB18030-2022 #339

Closed
wants to merge 5 commits into from

Conversation

gnu-andrew
Copy link
Member

@gnu-andrew gnu-andrew commented Jul 12, 2023

This patch modifies GB18030 to handle both the 2000 and the 2022 variant. The 2000 variant is available by setting -Djdk.charset.GB18030=2000. This PR replaces openjdk/jdk8u#45.

With the preceding test changes in place (openjdk/jdk8u#43 and openjdk/jdk8u#44), the changes needed for this are fairly minimal. The biggest divergence from 11u is in the character set providers. The changes in the make directory are not needed as 8u never moved to using a template for GB18030 in the first place (the 11u changes revert it back to being source-based). The change in the SPI.java generator tool moves into ExtendedCharsets.java in the class library, as the file is not auto-generated in 8u. Following additional work by @jerboaa, the alias is now set to 2022 initially, and then replaced in the init() method if jdk.charset.GB18030 is 2000.

In 8u, the standard charsets are generated from a text file by a shell script, while the extended charsets are handled by a standard class. 11u moves GB18030 from extended to standard. I experimented with this in 8u, but it seemed more problematic than just keeping it in the extended set. The only reason I can see for moving it in 11u is it allowed IS_2000 to be package-private to sun.nio.cs. This is resolved by @jerboaa's changes which modify aliasMap appropriately on creation, and so allow the use of a package-private method isGB18030_2000 in ExtendedCharsets instead.

To use the 11u solution would mean major rewrites to the shell script or bringing over the whole change in how the standard charset provider is generated from 11u, which I think, along with moving the package the character set is in, is too risky and unnecessary for this change. The generation changes are necessary because the GB18030 character set needs to provide a different alias, depending on whether it is the 2000 or 2002 variant. The genCharsetProvider.sh would need the alterations we have added to ExtendedCharsets.java to handle this, but converted to awk. I did experiment with this, but saw test failures.

The only adjustment to the GB18030.java changes is copyright headers and the removal of IS_2000 as mentioned above.

With the tests, the adjustments are just due to differing bug IDs, the absence of @modules and the use of constructs (var) and library calls (Set.of) that don't exist in 8u. The List.of and Set.of calls are frequent issues in backports, so I used this as an opportunity to introduce a full set of equivalents into the test library. It should now be possible to just rewrite Set.of to Utils.setOf and List.of to Utils.listOf. The returned collections are expected to be unmodifiable, not contain null and (in the case of sets) not contain duplicates. Simple replacement with a newly constructed ArrayList or HashSet would not ensure this. While this test does not rely on this, others may, so it seemed worth providing a closer replacement for use in future backports.

All sun.nio.cs and java.nio.charset tests pass with this patch applied.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issues

  • JDK-8301119: Support for GB18030-2022 (Enhancement - P4)
  • JDK-8310947: gb18030-2000 not selectable with LANG=zh_CN.GB18030 after JDK-8301119 (Bug - P3)

Reviewers

Contributors

  • Severin Gehwolf <sgehwolf@openjdk.org>

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk8u-dev.git pull/339/head:pull/339
$ git checkout pull/339

Update a local copy of the PR:
$ git checkout pull/339
$ git pull https://git.openjdk.org/jdk8u-dev.git pull/339/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 339

View PR using the GUI difftool:
$ git pr show -t 339

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk8u-dev/pull/339.diff

Webrev

Link to Webrev Comment

@gnu-andrew
Copy link
Member Author

/contributor add @jerboaa
/issue add JDK-8310947

@bridgekeeper
Copy link

bridgekeeper bot commented Jul 12, 2023

👋 Welcome back andrew! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot changed the title Backport 5c4e744dabcf7785c35168db5d0458ccebfd41e6 8301119: Support for GB18030-2022 Jul 12, 2023
@openjdk
Copy link

openjdk bot commented Jul 12, 2023

This backport pull request has now been updated with issue from the original commit.

@openjdk openjdk bot added backport rfr Pull request is ready for review labels Jul 12, 2023
@openjdk
Copy link

openjdk bot commented Jul 12, 2023

@gnu-andrew
Contributor Severin Gehwolf <sgehwolf@openjdk.org> successfully added.

@openjdk
Copy link

openjdk bot commented Jul 12, 2023

@gnu-andrew
Adding additional issue to issue list: 8310947: gb18030-2000 not selectable with LANG=zh_CN.GB18030 after JDK-8301119.

@mlbridge
Copy link

mlbridge bot commented Jul 12, 2023

Webrevs

Copy link
Member

@tstuefe tstuefe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Andrew,

I looked through the patch, and compared it with 11u, too, looks fine. Minor remarks inline, feel free to ignore the style nits.

new String[] {
"gb18030-2000"
});
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add this replacement to the comment above listing the changes done in init?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I've updated the comment to document jdk.charset.GB18030.

@@ -1184,6 +1185,8 @@ protected void init() {

String map = AccessController.doPrivileged(
new GetPropertyAction("sun.nio.cs.map"));
boolean isGB18030_2000 = "2000".equals(AccessController.doPrivileged(
new GetPropertyAction("jdk.charset.GB18030")));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More user friendly would be to test for "2000" or "2023" and to throw IAE for other values; but since later JDK versions don't do this either this is probably fine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but yes this is something that should go into trunk and be backported down. I'm already a little uneasy about fixing 8310947 in 8u first.

@@ -12595,6 +12637,7 @@ private CoderResult encodeArrayLoop(CharBuffer src,
int condensedKey = 0; // expands to a four byte sequence
int hiByte = 0, loByte = 0;
currentState = GB18030_DOUBLE_BYTE;
boolean isGB18030_2000 = ExtendedCharsets.isGB18030_2000();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, and in other places: naming these vars IS_2000 would have simplified reviewing (mechanical comparison with 11)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think some renaming is intentional as it's a static boolean constant in 11u, but a method and a private instance variable in 8u. I do think the isGB18032_2000 is odd (presumably to avoid a name conflict with the method), so I will alter that to just is_2000.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went with making that isGB18030_2000 in the end and the method IS_2000. The latter certainly shortens some of the GB18030.java changes.

if (e3 == null) { throw new NullPointerException("e3"); }
if (e4 == null) { throw new NullPointerException("e4"); }
if (e5 == null) { throw new NullPointerException("e5"); }
if (e6 == null) { throw new NullPointerException("e6"); }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, and in other places: Objects.requireNonNull may be shorter.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completely forgot that method existed!

added = s.add(e5);
if (!added) { throw new IllegalArgumentException("duplicate 5"); }
added = s.add(e6);
if (!added) { throw new IllegalArgumentException("duplicate 6"); }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be shortened, but thats up to you (if one does not care for the string, a simple if (!(s.add(e1) && s.add(e2) && ... would be sufficient.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why I wrote this in such a long-winded way. Maybe there was more to it at one point and I didn't refine it after making further changes :/

Anyway, converted to a single line but kept the message, which can be useful in debugging (this is test output after all)

@gnu-andrew
Copy link
Member Author

Thanks for the quick review. I wasn't expecting you to be able to find time today.

I've made adjustments as discussed above. Tests pass as before.

@tstuefe
Copy link
Member

tstuefe commented Jul 13, 2023

Still good. Thanks!

Copy link
Contributor

@jerboaa jerboaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine to me.

@openjdk
Copy link

openjdk bot commented Jul 13, 2023

@gnu-andrew This change now passes all automated pre-integration checks.

After integration, the commit message for the final commit will be:

8301119: Support for GB18030-2022
8310947: gb18030-2000 not selectable with LANG=zh_CN.GB18030 after JDK-8301119

Co-authored-by: Severin Gehwolf <sgehwolf@openjdk.org>
Reviewed-by: stuefe, sgehwolf

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been no new commits pushed to the master branch. If another commit should be pushed before you perform the /integrate command, your PR will be automatically rebased. If you prefer to avoid any potential automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jul 13, 2023
@gnu-andrew
Copy link
Member Author

Thanks both. Flagged for approval with jdk8u-fix-request.

@gnu-andrew
Copy link
Member Author

I see jdk8u-fix-yes.
/integrate

@openjdk
Copy link

openjdk bot commented Jul 13, 2023

Going to push as commit 4dc1305.

@openjdk openjdk bot added the integrated Pull request has been integrated label Jul 13, 2023
@openjdk openjdk bot closed this Jul 13, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Jul 13, 2023
@openjdk
Copy link

openjdk bot commented Jul 13, 2023

@gnu-andrew Pushed as commit 4dc1305.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport integrated Pull request has been integrated
3 participants