-
Notifications
You must be signed in to change notification settings - Fork 191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8186801: Add regression test to test mapping based charsets (generated at build time) #43
Conversation
👋 Welcome back andrew! A progress list of the required criteria for merging this PR into |
This backport pull request has now been updated with issue from the original commit. |
Webrevs
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi,
First review round. (Please note that formally, I am not jdk8u reviewer).
looked at the delta between two patches (yours and the JDK 10 one) and the delta on the file system between the frozen jdk10 repo and jdk8u-dev with your patch applied.
GB18030.map
idendical after patch
Okay.
EUC_TW.map:
- identical after patch
- renamed uppercase in 8
- I looked for remnants of lower case "euc_tw" but all I found was the historical alias.
Okay.
Looked at standard charset definitions. So, 8 (dbcs sbcs extsbcs) => 10 (charsets), and they changed the file format too.
Looked at the EBCDIC linefeed test. Arguably, you would not have to do this, or could do this in a separate patch. It has no bearing on the upcoming GB18030. But it's good to have. JDK-8186803 was a brainteaser, though, and it's annoying they did not do an individual patch.
Left to review is the Testcase itself. Will do after lunch.
Cheers, Thomas
System.out.printf(" error: %s c2b u+000A -> %x%n", | ||
cs, bb[0] & 0xff); | ||
errs++; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took some thinking, but I think this is okay.
Are we sure the old (pre-8186803) translations are always "U+000A" -> 25? I am not sure, it looked to me like it could be either mapped to E15 or E25.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes I've made to detect IBM0114x were necessary to get the test to pass. Prior to that, it was failing because those character sets were producing 0x25.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As to including this test, it seemed appropriate. The intention is for this backport to add tests to ensure the status quo, rather than change any behaviour (as Severin notes). So this version of TestEBCDICLineFeed.java
is ensuring that the 8186803 bug is present in 8u.
I agree about it being a bit of a brainteaser. It was a pain to extract the 8186803 changes from the rest and I was surprised that even removing the .nr
file additions still led to a mismatch in TestCharsetMappings
, fixed by the hack. As you've seen, the IBM114x maps map both 0x15 and 0x25 to U+000a, so going the other way can produce either. In TestCharsetMappings
, both are produced but 0x15 succeeds without the hack (i.e. eq
is already true). As far as I can work out. the .nr
files (no round trip) stop the 0x25 conversion happening, so only 0x15 occurs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Second review round. Looked at TestCharSetMappings exclusively.
Most of my remarks are centered on code readability.
My suggestion would be to test this test:
- build JDK and let it generate its charset
- then modify various *.map files and the charset definitions and check if the jtreg test picks up the changes as errors.
Unfortunately, I have no idea how to automate such tests, but as a sanity test that should be fine.
private boolean shiftHackDBCS = false; | ||
// 8u does not have JDK-8186803 so leHackIBM is true for | ||
// IBM1140-1149 charsets that map U+000A to 0x25. | ||
private boolean leHackIBM = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we give this a slightly better name? (I first thought le was leetspeak pronoun :)
proposal: ebcdicLFHack or IBM114xLFHack or similar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, not the best name in retrospect. It was starting to read to me as if it was some form of broken French :)
I've switched to ebcdicLFHack
.
this.clzName = clzName; | ||
if (csName.endsWith("_Solaris") || | ||
csName.endsWith("_MS5022X") || | ||
csName.endsWith("_MS932")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would probably move this to the charsets()
parsing function. To have all the code that deals with different data formats between 8 and later versions in one place.
Also, why the suffix solution? Easier to understand would be just listing the affected 5 charsets by full name, that way one can directly compare with later versions of the charsets file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think just laziness :) I noticed there were two common suffixes in use so could cut it down to three checks
I agree that making it explicit is clearer though and also avoids accidentally picking up any later additions with the same suffixes (though I think that's very unlikely to happen)
// sbcs files have fewer fields and a set type of sbcs | ||
Set<CharsetInfo> charsets = charsets(dir.resolve("sbcs"), true); | ||
charsets.addAll(charsets(dir.resolve("extsbcs"), true)); | ||
charsets.addAll(charsets(dir.resolve("dbcs"), false)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small nit, can you reorder and parse dbcs first, then comment, then the other two? So its clear that dbcs is excluded from comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I originally expected they'd need a separate method altogether and I guess I didn't clean it up completely when I realised I could simplify it to just a boolean.
cs.type = "sbcs"; | ||
} else { | ||
cs.type = tokens[3]; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mentally parsing:
Format 1, dbcs:
clzName csName hisName dbtype pkg ascii b1min b1max b2min b2max
hisname = 2
pkgName = 4
type = 3
Format 2, sbcs and extsbcs:
clzName csName hisName containASCII pkg
hisname = 2
pkgName = 4
type = sbcs
Okay, checks out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a small comment here, listing these two formats for easier code reading:
// dbcs format
// clzName csName hisName dbtype pkg ascii b1min b1max b2min b2max
// sbcs format
// clzName csName hisName containASCII pkg
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe also some sanity tests:
- pkgname to start with "sun.nio"
- type one of "basic ebcdic euc_sim dbcsonly sbcs"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment is a good idea. I was looking at that line in the mapping file as I was writing the code, and it would be good to have it in there directly.
I think the sanity tests are something we should add in mainline first and backport, as both would apply there too (though mainline has a few more types).
if (tokens.length < 5) { | ||
continue; | ||
} | ||
CharsetInfo cs = new CharsetInfo(tokens[1], tokens[0]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small nit, could you add comments to the constructor args like this:
new CharsetInfo(/*csName*/ tokens[1], /*clzName*/ tokens[0]);
(We are different from the JDK 10 version anyway in this hunk)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, though it should also now be clearer from having the format in there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that this patch has no bearing on the product code (test only) since none of the added mappings are referenced in either of those dbcs
, sdcs
or extsbcs
files. The main purpose of it would be to have the GB18030.map
file available to move it to the cs test files used in JDK-8301119
by CoderTest.java
.
On that ground and by virtue of not introducing too much divergence from 11 code, this seems OK to me.
@gnu-andrew This change now passes all automated pre-integration checks. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been no new commits pushed to the ➡️ To integrate this PR with the above commit message to the |
* Rename leHackIBM => ebcdicLFHack * Move internal handling to charsets method * Use full class name for internal check * Add field headers in comments to ease code readability * Parse dbcs first to aid code readability * Drop redundant @modules from TestEBCDLineFeed
Confirmed by altering the IBM01140 map:
|
I've added |
Mailing list message from Thorsten Glaser on jdk8u-dev: On Fri, 23 Jun 2023, Andrew John Hughes wrote:
In case this data point helps: for the porting of mksh to EBCDIC bye, **************************************************** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
Thanks for helping out on the review! |
Thanks both. I see |
Going to push as commit 1818eaf. |
@gnu-andrew Pushed as commit 1818eaf. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
Add somedefine about ConvertSleepToYield/PreInflateSpin
This is a pre-requisite for JDK-8301119 ("Support for GB18030-2022") and so is being proposed for inclusion in 8u382 during rampdown, so that the changes are in place for when GB18030-2022 enforcement begins in August. It introduces
GB18030.map
, containing the character mappings for GB18030, and tests to verify the correct mappings happen at run-time.A number of changes were necessary for 8u. One main reason was the inclusion of JDK-8186803 "Update Cp1140-Cp1149 EBEDIC euro charset to map \u000A to EBCDIC 0x15" as part of this fix in OpenJDK 10+. I have removed these changes in the 8u version so as to avoid making potentially incompatible library changes and focus on testing the current character mappings in 8u.
Another is that the character set data is spread across three files -
dbcs
,sbcs
andextsbcs
- in 8u, whereas it was amalgamated into a single file,charsets
, during the introduction of modules. The coding test (TestCharsetMapping.java
) has been adapted to use the 8u data format.The detailed changes from the OpenJDK 10 patch are as follows:
IBM114x.nr
files which implement JDK-8186803.charsets
which doesn't exist in 8u and any equivalent change may lead to compatibility issuesEUC_TW.java
has slightly different context in 8u (nopkg
argument) but the filename change is the sameMS932_0213.java
&x-MS932_0213
to avoid a compatibility risk.EuroConverter.java
are dropped as they relate to JDK-8186803.TestCharsetMapping.java
and expect an additional 0xA -> 25 mapping rather than counting this as an errorTestCharsetMapping.java
as the 8u data files don't store themTestCharsetMapping.java
as they are not marked as such in the 8u data filesTestCharsetMapping.java
to handle the three 8u data files.dbcs
has additional fields to the other two, but the first five fields that we actually use are mostly the same (dbcs
has a type field, the other two assume a type ofsbcs
).TestEBCDICLineFeed.java
is modified to handle IBM0114x as is, without the JDK-8186803 changeProgress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk8u.git pull/43/head:pull/43
$ git checkout pull/43
Update a local copy of the PR:
$ git checkout pull/43
$ git pull https://git.openjdk.org/jdk8u.git pull/43/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 43
View PR using the GUI difftool:
$ git pr show -t 43
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk8u/pull/43.diff
Webrev
Link to Webrev Comment