8186801: Add regression test to test mapping based charsets (generated at build time) #43

Closed
wants to merge 2 commits into from

Conversation

gnu-andrew
Member

@gnu-andrew gnu-andrew commented Jun 16, 2023

This is a pre-requisite for JDK-8301119 ("Support for GB18030-2022") and so is being proposed for inclusion in 8u382 during rampdown, so that the changes are in place for when GB18030-2022 enforcement begins in August. It introduces GB18030.map, containing the character mappings for GB18030, and tests to verify the correct mappings happen at run-time.

A number of changes were necessary for 8u. One main reason was the inclusion of JDK-8186803 "Update Cp1140-Cp1149 EBCDIC euro charset to map \u000A to EBCDIC 0x15" as part of this fix in OpenJDK 10+. I have removed these changes in the 8u version so as to avoid making potentially incompatible library changes and focus on testing the current character mappings in 8u.

Another is that the character set data is spread across three files - dbcs, sbcs and extsbcs - in 8u, whereas it was amalgamated into a single file, charsets, during the introduction of modules. The coding test (TestCharsetMapping.java) has been adapted to use the 8u data format.

The detailed changes from the OpenJDK 10 patch are as follows:

  1. Drop the introduction of the IBM114x.nr files which implement JDK-8186803.
  2. Drop the change to charsets, which doesn't exist in 8u; any equivalent change may lead to compatibility issues
  3. EUC_TW.java has slightly different context in 8u (no pkg argument) but the filename change is the same
  4. Drop the alias changes in MS932_0213.java & x-MS932_0213 to avoid a compatibility risk.
  5. Changes to EuroConverter.java are dropped as they relate to JDK-8186803.
  6. Detect the IBM0114x character sets in TestCharsetMapping.java and expect an additional U+000A -> 0x25 mapping rather than counting this as an error
  7. Remove the parsing and checking of aliases in TestCharsetMapping.java as the 8u data files don't store them
  8. Handle the internal character sets in TestCharsetMapping.java as they are not marked as such in the 8u data files
  9. Change the data file parsing in TestCharsetMapping.java to handle the three 8u data files. dbcs has additional fields to the other two, but the first five fields that we actually use are mostly the same (dbcs has a type field, the other two assume a type of sbcs).
  10. TestEBCDICLineFeed.java is modified to handle IBM0114x as is, without the JDK-8186803 change
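
Item 6 can be pictured with a minimal sketch. The class and method names below are hypothetical and are not taken from the patch; the only assumptions carried over from the description are that the affected charsets are reported as IBM01140 to IBM01149 and that the tolerated mapping is U+000A to 0x25.

```java
import java.util.regex.Pattern;

/** Hypothetical helper; names are illustrative, not copied from TestCharsetMapping.java. */
final class EbcdicLineFeedTolerance {
    // Assumption: the affected charsets are named IBM01140 .. IBM01149.
    private static final Pattern IBM0114X = Pattern.compile("IBM0114[0-9]");

    static boolean isIbm0114x(String csName) {
        return IBM0114X.matcher(csName).matches();
    }

    /** True when an encoding mismatch should be tolerated instead of counted as an error. */
    static boolean isAllowedMismatch(String csName, char c, int actualByte) {
        return isIbm0114x(csName) && c == '\n' && actualByte == 0x25;
    }

    public static void main(String[] args) {
        System.out.println(isAllowedMismatch("IBM01140", '\n', 0x25));
        System.out.println(isAllowedMismatch("IBM037", '\n', 0x25));
        System.out.println(isAllowedMismatch("IBM01140", 'A', 0x25));
    }
}
```

Keeping the tolerance this narrow (one charset family, one character, one byte value) means any other mismatch in those charsets would still be reported.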

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8186801: Add regression test to test mapping based charsets (generated at build time) (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk8u.git pull/43/head:pull/43
$ git checkout pull/43

Update a local copy of the PR:
$ git checkout pull/43
$ git pull https://git.openjdk.org/jdk8u.git pull/43/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 43

View PR using the GUI difftool:
$ git pr show -t 43

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk8u/pull/43.diff

Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Jun 16, 2023

👋 Welcome back andrew! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot changed the title Backport cfe34ed89c4f6ef9a49dceef30da1e43b418b152 8186801: Add regression test to test mapping based charsets (generated at build time) Jun 16, 2023
@openjdk

openjdk bot commented Jun 16, 2023

This backport pull request has now been updated with issue from the original commit.

@openjdk openjdk bot added backport rfr Pull request is ready for review labels Jun 16, 2023
@mlbridge

mlbridge bot commented Jun 16, 2023

Webrevs

Member

@tstuefe tstuefe left a comment

Hi,

First review round. (Please note that formally, I am not a jdk8u reviewer.)

I looked at the delta between the two patches (yours and the JDK 10 one) and the delta on the file system between the frozen jdk10 repo and jdk8u-dev with your patch applied.

GB18030.map:

  • identical after patch

Okay.

EUC_TW.map:

  • identical after patch
  • renamed uppercase in 8
  • I looked for remnants of lower case "euc_tw" but all I found was the historical alias.

Okay.

Looked at standard charset definitions. So, 8 (dbcs sbcs extsbcs) => 10 (charsets), and they changed the file format too.

Looked at the EBCDIC linefeed test. Arguably, you would not have to do this, or could do this in a separate patch. It has no bearing on the upcoming GB18030. But it's good to have. JDK-8186803 was a brainteaser, though, and it's annoying they did not do an individual patch.

Left to review is the test case itself. Will do after lunch.

Cheers, Thomas

System.out.printf(" error: %s c2b u+000A -> %x%n",
cs, bb[0] & 0xff);
errs++;
}
Member

Took some thinking, but I think this is okay.

Are we sure the old (pre-8186803) translations are always "U+000A" -> 25? I am not sure, it looked to me like it could be either mapped to E15 or E25.

Member Author

The changes I've made to detect IBM0114x were necessary to get the test to pass. Prior to that, it was failing because those character sets were producing 0x25.
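
One way to check this on a given JDK is a small probe like the one below. It is only a sketch: the class name is made up, and it uses nothing beyond `Charset.isSupported`, `Charset.forName` and `Charset.encode`. The byte it prints depends on the JDK in use, so no particular output is claimed here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

/** Hypothetical probe; not part of the patch. */
final class LineFeedProbe {
    /** Returns the first encoded byte for U+000A, or -1 if the charset is unavailable. */
    static int encodeLineFeed(String csName) {
        if (!Charset.isSupported(csName)) {
            return -1;
        }
        ByteBuffer bb = Charset.forName(csName).encode("\n");
        return bb.get(0) & 0xff;
    }

    public static void main(String[] args) {
        // Assumption: the affected charsets are named IBM01140 .. IBM01149.
        for (int i = 0; i <= 9; i++) {
            String cs = "IBM0114" + i;
            int b = encodeLineFeed(cs);
            if (b < 0) {
                System.out.printf("%s: not available%n", cs);
            } else {
                System.out.printf("%s: U+000A -> %02x%n", cs, b);
            }
        }
    }
}
```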

Member Author

As to including this test, it seemed appropriate. The intention is for this backport to add tests to ensure the status quo, rather than change any behaviour (as Severin notes). So this version of TestEBCDICLineFeed.java is ensuring that the 8186803 bug is present in 8u.

I agree about it being a bit of a brainteaser. It was a pain to extract the 8186803 changes from the rest, and I was surprised that even removing the .nr file additions still led to a mismatch in TestCharsetMapping, fixed by the hack. As you've seen, the IBM114x maps map both 0x15 and 0x25 to U+000A, so going the other way can produce either. In TestCharsetMapping, both are produced, but 0x15 succeeds without the hack (i.e. eq is already true). As far as I can work out, the .nr files (no round trip) stop the 0x25 conversion happening, so only 0x15 occurs.
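
The effect described here can be illustrated with a toy model. This is not the build-time generator: the class, the table layout and the "later entry wins" rule are all assumptions made for the sake of the illustration. It only shows how a decode-only (no round trip) entry for 0x25 would leave 0x15 as the sole encoding of U+000A.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of byte-to-char entries with an optional no-round-trip set. */
final class NoRoundTripModel {
    /** Builds a char-to-byte table from {byte, char} pairs, skipping bytes listed in nr. */
    static Map<Character, Integer> buildC2B(List<int[]> b2c, Set<Integer> nr) {
        Map<Character, Integer> c2b = new HashMap<>();
        for (int[] e : b2c) {
            if (nr.contains(e[0])) {
                continue;                 // decode-only entry, never used for encoding
            }
            c2b.put((char) e[1], e[0]);   // assumption: a later entry overrides an earlier one
        }
        return c2b;
    }

    /** Encodes U+000A with a two-entry table in which 0x15 and 0x25 both decode to U+000A. */
    static int encodeLineFeed(boolean useNr) {
        List<int[]> b2c = Arrays.asList(new int[] {0x15, 0x0a}, new int[] {0x25, 0x0a});
        Set<Integer> nr = useNr
                ? new HashSet<>(Collections.singletonList(0x25))
                : Collections.<Integer>emptySet();
        return buildC2B(b2c, nr).get('\n');
    }

    public static void main(String[] args) {
        System.out.printf("without .nr: U+000A -> %02x%n", encodeLineFeed(false));
        System.out.printf("with .nr:    U+000A -> %02x%n", encodeLineFeed(true));
    }
}
```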

Member

@tstuefe tstuefe left a comment

Second review round. Looked at TestCharsetMapping exclusively.

Most of my remarks are centered on code readability.

My suggestion would be to test this test:

  • build JDK and let it generate its charset
  • then modify various *.map files and the charset definitions and check if the jtreg test picks up the changes as errors.

Unfortunately, I have no idea how to automate such tests, but as a sanity test that should be fine.

private boolean shiftHackDBCS = false;
// 8u does not have JDK-8186803 so leHackIBM is true for
// IBM1140-1149 charsets that map U+000A to 0x25.
private boolean leHackIBM = false;
Member

Can we give this a slightly better name? (I first thought "le" was a leetspeak pronoun :)

proposal: ebcdicLFHack or IBM114xLFHack or similar

Member Author

Yeah, not the best name in retrospect. It was starting to read to me as if it was some form of broken French :)

I've switched to ebcdicLFHack.

this.clzName = clzName;
if (csName.endsWith("_Solaris") ||
csName.endsWith("_MS5022X") ||
csName.endsWith("_MS932")) {
Member

I would probably move this to the charsets() parsing function, to have all the code that deals with different data formats between 8 and later versions in one place.

Also, why the suffix solution? Easier to understand would be just listing the affected 5 charsets by full name, that way one can directly compare with later versions of the charsets file.

Member Author

I think just laziness :) I noticed there were two common suffixes in use, so I could cut it down to three checks.

I agree that making it explicit is clearer, though, and it also avoids accidentally picking up any later additions with the same suffixes (though I think that's very unlikely to happen).
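
The difference between the two approaches can be sketched as follows. The charset names in the set are placeholders invented to match the three suffixes; they are not copied from the 8u data files, and the real list would have to come from the charset definitions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical comparison of suffix matching against an explicit name list. */
final class InternalCharsetCheck {
    // Placeholder names for illustration only.
    private static final Set<String> INTERNAL = new HashSet<>(Arrays.asList(
            "Example_A_Solaris", "Example_B_Solaris",
            "Example_A_MS5022X", "Example_B_MS5022X",
            "Example_A_MS932"));

    static boolean isInternalBySuffix(String csName) {
        return csName.endsWith("_Solaris")
                || csName.endsWith("_MS5022X")
                || csName.endsWith("_MS932");
    }

    static boolean isInternalByName(String csName) {
        return INTERNAL.contains(csName);
    }

    public static void main(String[] args) {
        // A later addition that happens to share a suffix is caught by one check only.
        String later = "Example_New_MS932";
        System.out.println(isInternalBySuffix(later));
        System.out.println(isInternalByName(later));
    }
}
```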

// sbcs files have fewer fields and a set type of sbcs
Set<CharsetInfo> charsets = charsets(dir.resolve("sbcs"), true);
charsets.addAll(charsets(dir.resolve("extsbcs"), true));
charsets.addAll(charsets(dir.resolve("dbcs"), false));
Member

Small nit: can you reorder and parse dbcs first, then the comment, then the other two? That way it's clear that dbcs is excluded from the comment.

Member Author

Sure. I originally expected they'd need a separate method altogether and I guess I didn't clean it up completely when I realised I could simplify it to just a boolean.

cs.type = "sbcs";
} else {
cs.type = tokens[3];
}
Member

@tstuefe tstuefe Jun 22, 2023

Mentally parsing:

Format 1, dbcs:
clzName csName hisName dbtype pkg ascii b1min b1max b2min b2max

hisname = 2
pkgName = 4
type = 3

Format 2, sbcs and extsbcs:
clzName csName hisName containASCII pkg

hisname = 2
pkgName = 4
type = sbcs

Okay, checks out.
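
The two layouts above can be handled by one parsing routine with a boolean switch, roughly as sketched below. The class, the `Info` holder and the sample lines in `main` are hypothetical; treating `#` as a comment prefix is also an assumption rather than something stated in this review.

```java
/** Hypothetical parser for one line of the dbcs or sbcs/extsbcs definition files. */
final class CharsetLineParser {
    static final class Info {
        final String clzName, csName, hisName, type, pkgName;

        Info(String clzName, String csName, String hisName, String type, String pkgName) {
            this.clzName = clzName;
            this.csName = csName;
            this.hisName = hisName;
            this.type = type;
            this.pkgName = pkgName;
        }
    }

    /** Returns null for blank lines, comment lines and lines with fewer than five fields. */
    static Info parse(String line, boolean sbcsFormat) {
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("#")) {   // '#' comments are an assumption
            return null;
        }
        String[] tokens = trimmed.split("\\s+");
        if (tokens.length < 5) {
            return null;
        }
        // dbcs:          clzName csName hisName dbtype       pkg ...
        // sbcs, extsbcs: clzName csName hisName containASCII pkg
        String type = sbcsFormat ? "sbcs" : tokens[3];
        return new Info(tokens[0], tokens[1], tokens[2], type, tokens[4]);
    }

    public static void main(String[] args) {
        // Made-up sample lines, not taken from the real data files.
        Info d = parse("Foo x-Foo Cp999 basic sun.nio.cs.ext true 0x81 0xfe 0x40 0xfe", false);
        Info s = parse("Bar x-Bar Cp998 true sun.nio.cs", true);
        System.out.println(d.csName + " " + d.type + " " + d.pkgName);
        System.out.println(s.csName + " " + s.type + " " + s.pkgName);
    }
}
```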

Member

Could you add a small comment here, listing these two formats for easier code reading:

// dbcs format
// clzName csName hisName dbtype pkg ascii b1min b1max b2min b2max
// sbcs format
// clzName csName hisName containASCII pkg

?

Member

Maybe also some sanity tests:

  • pkgname to start with "sun.nio"
  • type one of "basic ebcdic euc_sim dbcsonly sbcs"
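
As a rough illustration of the two checks suggested above (a sketch only, with a made-up class name, and not something that was added to the patch):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sanity check for parsed charset definition fields. */
final class CharsetSanity {
    // The five type names listed in the review comment.
    private static final Set<String> TYPES = new HashSet<>(Arrays.asList(
            "basic", "ebcdic", "euc_sim", "dbcsonly", "sbcs"));

    static boolean isSane(String pkgName, String type) {
        return pkgName.startsWith("sun.nio") && TYPES.contains(type);
    }

    public static void main(String[] args) {
        System.out.println(isSane("sun.nio.cs.ext", "ebcdic"));
        System.out.println(isSane("sun.nio.cs", "true"));   // a shifted column would be caught
    }
}
```

A check like this mainly guards against reading the wrong column, for example picking up the containASCII field as the type.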

Member Author

The comment is a good idea. I was looking at that line in the mapping file as I was writing the code, and it would be good to have it in there directly.

I think the sanity tests are something we should add in mainline first and backport, as both would apply there too (though mainline has a few more types).

if (tokens.length < 5) {
continue;
}
CharsetInfo cs = new CharsetInfo(tokens[1], tokens[0]);
Member

Small nit, could you add comments to the constructor args like this:

new CharsetInfo(/*csName*/ tokens[1], /*clzName*/ tokens[0]);

(We are different from the JDK 10 version anyway in this hunk)

Member Author

Done, though it should also now be clearer from having the format in there.

Contributor

@jerboaa jerboaa left a comment

My understanding is that this patch has no bearing on the product code (test only), since none of the added mappings are referenced in any of the dbcs, sbcs or extsbcs files. The main purpose of it would be to have the GB18030.map file available to move it to the cs test files used in JDK-8301119 by CoderTest.java.

On that ground and by virtue of not introducing too much divergence from 11 code, this seems OK to me.

@openjdk

openjdk bot commented Jun 22, 2023

@gnu-andrew This change now passes all automated pre-integration checks.

After integration, the commit message for the final commit will be:

8186801: Add regression test to test mapping based charsets (generated at build time)

Reviewed-by: stuefe, sgehwolf

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been no new commits pushed to the master branch. If another commit should be pushed before you perform the /integrate command, your PR will be automatically rebased. If you prefer to avoid any potential automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jun 22, 2023
@gnu-andrew gnu-andrew mentioned this pull request Jun 23, 2023
3 tasks
* Rename leHackIBM => ebcdicLFHack
* Move internal handling to charsets method
* Use full class name for internal check
* Add field headers in comments to ease code readability
* Parse dbcs first to aid code readability
* Drop redundant @modules from TestEBCDICLineFeed
@gnu-andrew
Member Author

My suggestion would be to test this test:

* build JDK and let it generate its charset

* then modify various *.map files and the charset definitions and check if the jtreg test picks up the changes as errors.

Unfortunately, I have no idea how to automate such tests, but as a sanity test that should be fine.

Confirmed by altering the IBM01140 map:

$ git diff
diff --git a/jdk/make/data/charsetmapping/IBM1140.map b/jdk/make/data/charsetmapping/IBM1140.map
index 6a79f6cba5a..0b2f6658bcd 100644
--- a/jdk/make/data/charsetmapping/IBM1140.map
+++ b/jdk/make/data/charsetmapping/IBM1140.map
@@ -240,7 +240,7 @@
 0xee   U+00d3
 0xef   U+00d5
 0xf0   U+0030
-0xf1   U+0031
+0xf1   U+0131
 0xf2   U+0032
 0xf3   U+0033
 0xf4   U+0034

testing: IBM01140
  string de/encoding
    new String()
      Error: new String() failed
    String.getBytes()
  1 byte/char
    decode
      Error: f1 --> U+0031, expected U+0131
    decode (direct)
      Error: f1 --> U+0031, expected U+0131
  1 byte/char
    encode
U+000a --> 25 allowed for IBM0114x
      Error: U+0131 --> 3f, expected f1
    encode (direct)
U+000a --> 25 allowed for IBM0114x
      Error: U+0131 --> 3f, expected f1

...

JavaTest Message: Test threw exception: java.lang.Exception: Errors detected in 1 charset
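
The kind of comparison that produces the decode errors above can be sketched in a few lines. This is not the code from TestCharsetMapping.java: the class and method names are invented, and the message format merely imitates the log. It parses one single-byte .map line and compares it with what the runtime charset decodes.

```java
import java.nio.charset.Charset;

/** Hypothetical check of one single-byte .map line against a runtime charset. */
final class MapLineCheck {
    /** Parses a line such as "0xf1   U+0031" into {byte value, code point}. */
    static int[] parse(String line) {
        String[] t = line.trim().split("\\s+");
        int b = Integer.parseInt(t[0].substring(2), 16);    // strip "0x"
        int cp = Integer.parseInt(t[1].substring(2), 16);   // strip "U+"
        return new int[] {b, cp};
    }

    /** Returns null when decoding agrees with the line, otherwise an error message. */
    static String check(Charset cs, String line) {
        int[] e = parse(line);
        String decoded = new String(new byte[] {(byte) e[0]}, cs);
        int actual = decoded.codePointAt(0);
        if (actual == e[1]) {
            return null;
        }
        return String.format("Error: %02x --> U+%04x, expected U+%04x", e[0], actual, e[1]);
    }

    public static void main(String[] args) {
        String name = "IBM01140";
        if (!Charset.isSupported(name)) {
            System.out.println(name + " is not available in this runtime");
            return;
        }
        Charset cs = Charset.forName(name);
        System.out.println(check(cs, "0xf1   U+0031"));   // the original map line
        System.out.println(check(cs, "0xf1   U+0131"));   // the deliberately altered line
    }
}
```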

@gnu-andrew
Member Author

I've added jdk8u-critical-request to the bug for this and the follow-on JDK-8241311 now they're reviewed.
#45 is open for the final change which implements GB18030-2022.

@mlbridge

mlbridge bot commented Jun 23, 2023

Mailing list message from Thorsten Glaser on jdk8u-dev:

On Fri, 23 Jun 2023, Andrew John Hughes wrote:

Are we sure the old (pre-8186803) translations are always "U+000A" ->
25? I am not sure, it looked to me like it could be either mapped to
E15 or E25.

In case this data point helps: for the porting of mksh to EBCDIC
systems, we have X'15' for U+000A whereas X'25' maps to U+0085,
so this is sort of expected for unix-y things.

bye,
//mirabilos

Member

@tstuefe tstuefe left a comment

Looks good to me!

@jerboaa
Contributor

jerboaa commented Jun 23, 2023

Looks good to me!

Thanks for helping out on the review!

@gnu-andrew
Member Author

Thanks both. I see jdk8u-critical-yes for this one, so integrating now.
/integrate

@openjdk

openjdk bot commented Jun 23, 2023

Going to push as commit 1818eaf.

@openjdk openjdk bot added the integrated Pull request has been integrated label Jun 23, 2023
@openjdk openjdk bot closed this Jun 23, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Jun 23, 2023
@openjdk

openjdk bot commented Jun 23, 2023

@gnu-andrew Pushed as commit 1818eaf.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

DingliZhang pushed a commit to DingliZhang/jdk8u that referenced this pull request Jun 25, 2023
Add somedefine about ConvertSleepToYield/PreInflateSpin
Labels
backport integrated Pull request has been integrated
3 participants