8260265: UTF-8 by Default #4733

Closed
wants to merge 16 commits into from

naotoj
Member

@naotoj naotoj commented Jul 8, 2021

This is an implementation for the JEP 400: UTF-8 by Default. The gist of the changes is Charset.defaultCharset() returning UTF-8 and file.encoding system property being added in the spec, but another notable modification is in java.io.PrintStream where it continues to use the Console encoding as the default charset instead of UTF-8. Other changes are mostly clarification of the term "default charset" and their links. Corresponding CSR has also been drafted.

JEP 400: https://bugs.openjdk.java.net/browse/JDK-8187041
CSR: https://bugs.openjdk.java.net/browse/JDK-8260266
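To see what the change means on a given runtime, a small probe like the following can help. This is only a sketch: the class name `DefaultCharsetReport` is hypothetical, and `native.encoding` may be reported as `null` on JDKs that predate that property.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of this PR.
final class DefaultCharsetReport {
    /** Returns a one-line summary of the charset-related settings of the running JVM. */
    static String describe() {
        return "defaultCharset=" + Charset.defaultCharset().name()
                + ", file.encoding=" + System.getProperty("file.encoding")
                + ", native.encoding=" + System.getProperty("native.encoding");
    }

    public static void main(String[] args) {
        // Run once normally and once with -Dfile.encoding=COMPAT to compare.
        System.out.println(describe());
    }
}
```

On a JDK that includes this change, the first value is expected to be UTF-8 unless `file.encoding` is overridden on the command line.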


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/4733/head:pull/4733
$ git checkout pull/4733

Update a local copy of the PR:
$ git checkout pull/4733
$ git pull https://git.openjdk.java.net/jdk pull/4733/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 4733

View PR using the GUI difftool:
$ git pr show -t 4733

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/4733.diff

@bridgekeeper

bridgekeeper bot commented Jul 8, 2021

👋 Welcome back naoto! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@naotoj
Member Author

naotoj commented Jul 8, 2021

/csr

@openjdk openjdk bot added the csr Pull request needs approved CSR before integration label Jul 8, 2021
@openjdk

openjdk bot commented Jul 8, 2021

@naotoj this pull request will not be integrated until the CSR request JDK-8260266 for issue JDK-8260265 has been approved.

@openjdk

openjdk bot commented Jul 8, 2021

@naotoj The following labels will be automatically applied to this pull request:

  • core-libs
  • hotspot-runtime
  • net

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added hotspot-runtime hotspot-runtime-dev@openjdk.org core-libs core-libs-dev@openjdk.org net net-dev@openjdk.org labels Jul 8, 2021
@naotoj naotoj marked this pull request as ready for review Jul 8, 2021
@openjdk openjdk bot added the rfr Pull request is ready for review label Jul 8, 2021
@mlbridge

mlbridge bot commented Jul 8, 2021

@gbaso

gbaso commented Jul 14, 2021

Consider an application that creates a java.io.FileWriter with its one-argument constructor and then uses it to write some text to a file. The resulting file will contain a sequence of bytes encoded using the default charset of the JDK running the application. A second application, run on a different machine or by a different user on the same machine, creates a java.io.FileReader with its one-argument constructor and uses it to read the bytes in that file. The resulting text contains a sequence of characters decoded using the default charset of the JDK running the second application. If the default charset differs between the JDK of the first application and the JDK of the second application, then the resulting text may be silently corrupted or incomplete, since these APIs replace erroneous input rather than fail.

It's even worse than that, because many OpenSSH installs are configured by default to forward and accept the user locale (see e.g. for RHEL 7).

So a single application, on a single remote machine, can be unknowingly started by a single user with different locales, and therefore different encodings, depending on how the user connected to the remote machine. For example, on Windows connecting via powershell results in LANG=en_US.UTF-8, while using WSL2 results in LANG=C.UTF-8. On Java 11 in a RHEL7 machine, file.encoding results in UTF-8 in the first case, but ANSI_X3.4-1968 in the second, leading to a default charset ASCII.

Worth mentioning is also that Charset.forName("default") is just an alias to ASCII, per sun.nio.cs.StandardCharsets$Aliases.
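The silent corruption described above can be reproduced without touching the file system. The sketch below uses explicit charsets to stand in for two JVMs with different defaults; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of this PR.
final class CharsetMismatchDemo {
    /**
     * Encodes text with the "writer" charset and decodes the bytes with the
     * "reader" charset, as a FileWriter/FileReader pair running under
     * different default charsets would.
     */
    static String roundTrip(String text, Charset writerCharset, Charset readerCharset) {
        byte[] onDisk = text.getBytes(writerCharset);
        // new String(...) replaces malformed input instead of throwing.
        return new String(onDisk, readerCharset);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "café", written with an escape to stay source-encoding neutral
        String sameCharset = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String asLatin1 = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        String asAscii = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.US_ASCII);
        System.out.println("same charset intact: " + sameCharset.equals(original));
        System.out.println("ISO-8859-1 intact:   " + asLatin1.equals(original));
        System.out.println("US-ASCII intact:     " + asAscii.equals(original));
    }
}
```

No exception is raised in either mismatched case; the ASCII reader simply yields replacement characters, which matches the ANSI_X3.4-1968 scenario mentioned above.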

- * or the {@code toString()} method, which uses the platform's default
- * character encoding.
+ * or the {@code toString()} method, which uses the default
+ * charset.
Contributor

@RogerRiggs RogerRiggs Jul 14, 2021

Fold to previous line.

if (csname != null) {
try {
cs = Charset.forName(csname);
} catch (Exception ignored) { }
Contributor

@RogerRiggs RogerRiggs Jul 14, 2021

A separate enhancement...
I've long thought that there should be a way to avoid the exception here.
For example, a Charset.forName(csname, default);
The caller might have a default in mind or supply null and then be able to test for null.
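Until such a method exists, the idea can be approximated with a small wrapper. This is a sketch only: `forNameOrDefault` is a hypothetical helper, not a JDK API, and it still relies on catching the exception internally.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating the suggestion; not part of this PR.
final class CharsetLookup {
    /**
     * Returns the named charset, or the fallback (which may be null) when the
     * name is null, syntactically illegal, or not supported by this JVM.
     */
    static Charset forNameOrDefault(String csname, Charset fallback) {
        if (csname == null) {
            return fallback;
        }
        try {
            return Charset.forName(csname);
        } catch (IllegalArgumentException e) {
            // Covers IllegalCharsetNameException and UnsupportedCharsetException.
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(forNameOrDefault("UTF-8", null));
        System.out.println(forNameOrDefault("no-such-charset", StandardCharsets.ISO_8859_1));
        System.out.println(forNameOrDefault("bad name!", null));
    }
}
```

A caller can pass a concrete fallback, or pass `null` and test the result, as proposed above.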

Member Author

@naotoj naotoj Jul 14, 2021

Agreed. Will file an RFE for this.

Member Author

@naotoj naotoj Jul 14, 2021

*
* <p>
* The {@code FileReader} is meant for reading streams of characters. For reading
* streams of raw bytes, consider using a {@code FileInputStream}.
*
* @see InputStreamReader
* @see FileInputStream
* @see java.nio.charset.Charset#defaultCharset()
Contributor

@RogerRiggs RogerRiggs Jul 14, 2021

The @ see duplicates the link above, the javadoc can do without the @ see.

Member Author

@naotoj naotoj Jul 14, 2021

If I remove that @see, I don't see the link in See Also section. Am I missing something?

Contributor

@RogerRiggs RogerRiggs Jul 14, 2021

In my view the @ linkplain is sufficient to allow the reader to navigate; but YMMV.

@@ -35,7 +35,7 @@
* An InputStreamReader is a bridge from byte streams to character streams: It
* reads bytes and decodes them into characters using a specified {@link
* java.nio.charset.Charset charset}. The charset that it uses
- * may be specified by name or may be given explicitly, or the platform's
+ * may be specified by name or may be given explicitly, or the
* {@link Charset#defaultCharset() default charset} may be accepted.
Contributor

@RogerRiggs RogerRiggs Jul 14, 2021

"may be accepted" seems like the API has some choice in the matter.
Perhaps "accepted" -> "used".
And in other classes below if there's a suitable replacement.
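For readers unsure what "specified explicitly" versus "default charset is used" means in practice, here is a minimal sketch of the explicit form. The class name is hypothetical, and `Reader.transferTo` assumes Java 10 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of this PR.
final class ExplicitReaderCharset {
    /** Decodes bytes through an InputStreamReader that is given its charset explicitly. */
    static String decode(byte[] bytes, Charset charset) {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            StringWriter out = new StringWriter();
            reader.transferTo(out); // Java 10+
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] eAcuteUtf8 = {(byte) 0xC3, (byte) 0xA9}; // U+00E9 encoded as UTF-8
        System.out.println(decode(eAcuteUtf8, StandardCharsets.UTF_8).length());
        System.out.println(decode(eAcuteUtf8, StandardCharsets.ISO_8859_1).length());
    }
}
```

The one-argument `InputStreamReader(InputStream)` constructor behaves like the two-argument form called with `Charset.defaultCharset()`, which is why the wording of that sentence matters.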

- * bytes using the given encoding or charset, or the platform's default
- * character encoding if not specified.
+ * bytes using the given encoding or charset, or the default
+ * console charset if not specified.
Contributor

@RogerRiggs RogerRiggs Jul 14, 2021

JEP 400 doesn't give a rationale for using the console charset for PrintStream.
PrintStreams are used for output to files and other media other than just a tty/console.
The charset of system.out/err should use the console charset.

@jglick jglick Jul 14, 2021

This was my thinking in #4733 (comment).

Member Author

@naotoj naotoj Jul 14, 2021

OK, I am now convinced. Modified not to default to Console.charset() for generic PrintStream w/o charset constructor.
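Code that writes to files or other non-console sinks can sidestep the question by passing a charset to `PrintStream` directly. A minimal sketch, with a hypothetical class name, assuming Java 10 or later for the `Charset`-taking constructor:

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of this PR.
final class PrintStreamCharsetDemo {
    /** Prints text through a PrintStream with an explicit charset and returns the bytes produced. */
    static byte[] printWith(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintStream ps = new PrintStream(sink, true, charset)) {
            ps.print(text); // print, not println, so no line separator is added
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9";
        System.out.println(printWith(eAcute, StandardCharsets.UTF_8).length);
        System.out.println(printWith(eAcute, StandardCharsets.ISO_8859_1).length);
    }
}
```

The byte counts differ between the two calls, which is the kind of difference a reader of the resulting file would trip over if the charset were left implicit.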

@@ -797,6 +797,15 @@ public static native void arraycopy(Object src, int srcPos,
* <td>The module name of the initial/main module</td></tr>
* <tr><th scope="row">{@systemProperty jdk.module.main.class}</th>
* <td>The main class name of the initial module</td></tr>
* <tr><th scope="row">{@systemProperty file.encoding}</th>
* <td>The name of the default charset. Users may specify
* {@code UTF-8} or {@code COMPAT} on the command line to the value.
Contributor

@RogerRiggs RogerRiggs Jul 14, 2021

The wording could imply that only those two values can be supplied.
It could be rephrased to say that if the property is supplied on the command line
it overrides the default UTF-8.

Member Author

@naotoj naotoj Jul 14, 2021

That was intentional. Only those two are supported, others continue to work as before (but not supported).

Contributor

@RogerRiggs RogerRiggs Jul 14, 2021

Still it leaves an uncomfortable feeling, perhaps remedied by an "other values have unspecified behavior"
or the "other values are implementation specific".

Member Author

@naotoj naotoj Jul 14, 2021

Added the clarifying sentence to the spec.

@@ -86,17 +88,17 @@
*/
private URLDecoder() {}

- // The platform default encoding
+ // The default charset
static String dfltEncName = URLEncoder.dfltEncName;
Contributor

@RogerRiggs RogerRiggs Jul 14, 2021

Perhaps add the value of file.encoding to the StaticProperties either as a string or as the Charset.
That would allow a few different lookups of the property to be simplified.

@@ -161,7 +163,7 @@ public static String encode(String s) {
try {
str = encode(s, dfltEncName);
} catch (UnsupportedEncodingException e) {
Contributor

@RogerRiggs RogerRiggs Jul 14, 2021

Perhaps a separate cleanup, the Charset should be cached, not just the name and use the encode(s, charset) method.
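The shape of that cleanup might look roughly like the sketch below. It is not the JDK's code: the class name is hypothetical, and the cached charset is fixed to UTF-8 here for determinism, whereas the real field would presumably hold the default charset. The `Charset`-taking overloads assume Java 10 or later.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of caching a Charset instead of a charset name.
final class CachedCharsetUrlCodec {
    // Assumption: UTF-8 stands in for whatever charset the caller wants cached.
    private static final Charset CHARSET = StandardCharsets.UTF_8;

    /** Encodes without a name lookup, so no UnsupportedEncodingException to handle. */
    static String encode(String s) {
        return URLEncoder.encode(s, CHARSET);
    }

    static String decode(String s) {
        return URLDecoder.decode(s, CHARSET);
    }

    public static void main(String[] args) {
        String encoded = encode("a b&\u00e9");
        System.out.println(encoded);
        System.out.println(decode(encoded).equals("a b&\u00e9"));
    }
}
```

Because the overloads that take a `Charset` declare no checked exception, the try/catch in the snippet under review would become unnecessary.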

* the system property {@code file.encoding} on the command line. If the
* value designates {@code COMPAT}, the default charset is derived from
* the {@code native.encoding} system property, which typically depends
* upon the locale and charset of the underlying operating system.
Contributor

@RogerRiggs RogerRiggs Jul 14, 2021

The description in java.lang.System of the file.encoding property does not indicate it is 'implementation specific'.
In that context, it appears to be part of the JavaSE spec.
Having the spec in a single place with references to it from others could avoid duplication.

Member Author

@naotoj naotoj Jul 14, 2021

file.encoding is listed under @implNote tag in System.getProperties(), so it is implementation-specific.

@naotoj
Member Author

naotoj commented Jul 14, 2021

Consider an application that creates a java.io.FileWriter with its one-argument constructor and then uses it to write some text to a file. The resulting file will contain a sequence of bytes encoded using the default charset of the JDK running the application. A second application, run on a different machine or by a different user on the same machine, creates a java.io.FileReader with its one-argument constructor and uses it to read the bytes in that file. The resulting text contains a sequence of characters decoded using the default charset of the JDK running the second application. If the default charset differs between the JDK of the first application and the JDK of the second application, then the resulting text may be silently corrupted or incomplete, since these APIs replace erroneous input rather than fail.

It's even worse than that, because many OpenSSH installs are configured by default to forward and accept the user locale (see e.g. for RHEL 7).

So a single application, on a single remote machine, can be unknowingly started by a single user with different locales, and therefore different encodings, depending on how the user connected to the remote machine. For example, on Windows connecting via powershell results in LANG=en_US.UTF-8, while using WSL2 results in LANG=C.UTF-8. On Java 11 in a RHEL7 machine, file.encoding results in UTF-8 in the first case, but ANSI_X3.4-1968 in the second, leading to a default charset ASCII.

Worth mentioning is also that Charset.forName("default") is just an alias to ASCII, per sun.nio.cs.StandardCharsets$Aliases.

Thanks. Updated the JEP.

Contributor

@AlanBateman AlanBateman left a comment

Good work!

One thing we missed when adding Console.charset() was the code example in OutputStreamWriter's class description. It uses System.out and should have been changed to "anOutputStream" to avoid people copying this usage. We can do it as part of this PR or another but would be good to get it consistent with InputStreamReader.
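For illustration, the pattern with a generic stream in place of System.out might look like this. It is a sketch with hypothetical names, not the javadoc text; the explicit charset argument is an addition made here so the output is predictable.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of this PR.
final class WriterWrapping {
    /** Writes text through a buffered OutputStreamWriter and returns the encoded bytes. */
    static byte[] writeText(String text, Charset charset) {
        // Stand-in for "anOutputStream": any byte sink, deliberately not System.out.
        ByteArrayOutputStream anOutputStream = new ByteArrayOutputStream();
        try (Writer out = new BufferedWriter(new OutputStreamWriter(anOutputStream, charset))) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return anOutputStream.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(writeText("\u00e9", StandardCharsets.UTF_8).length);
    }
}
```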

*
* @implNote An implementation may override the default charset with
* the system property {@code file.encoding} on the command line. If the
* value designates {@code COMPAT}, the default charset is derived from
Contributor

@AlanBateman AlanBateman Jul 20, 2021

It might be simpler to say "If the value is COMPAT" and avoid the word "designates" here.

@@ -47,40 +47,40 @@

/**
* Creates a new {@code FileReader}, given the name of the file to read,
- * using the platform's
- * {@linkplain java.nio.charset.Charset#defaultCharset() default charset}.
+ * using the {@linkplain java.nio.charset.Charset#defaultCharset() default charset}.
Contributor

@AlanBateman AlanBateman Jul 20, 2021

The java.io classes import java.nio.charset.Charset so don't need to use the fully qualified class name everywhere.

Member Author

@naotoj naotoj Jul 20, 2021

Corrected, as well as other locations.

@@ -163,7 +163,7 @@ static String generateSources() throws Exception {
return commandLineClassName;
}

- private static final String defaultEncoding = Charset.defaultCharset().name();
+ private static final String defaultEncoding = System.getProperty("sun.jnu.encoding");
Contributor

@AlanBateman AlanBateman Jul 20, 2021

It's a bit confusing to have "defaultEncoding" set to this property. Maybe the test should be changed to run with -Dfile.encoding=COMPAT or else change the two usages of defaultEncoding in the test.

Member Author

@naotoj naotoj Jul 20, 2021

Since the test is asserting UTF-8 path names, it is not affected by file.encoding being COMPAT or not. To make it less confusing, I renamed the field name to filePathEncoding.

@naotoj
Member Author

naotoj commented Jul 20, 2021

Good work!

One thing we missed when adding Console.charset() was the code example in OutputStreamWriter's class description. It uses System.out and should have been changed to "anOutputStream" to avoid people copying this usage. We can do it as part of this PR or another but would be good to get it consistent with InputStreamReader.

I included the leftover fix as well.

* one from {@code native.encoding} system property during runtime startup.
* Specifying it to {@code UTF-8}, or no value is set, defaults to use
* {@code UTF-8}. Other values have unspecified behavior.
* </td></tr>
Contributor

@AlanBateman AlanBateman Jul 21, 2021

Here's a suggested re-wording to consider:

The name of the default charset, defaults to "UTF-8". The property may be set on the command line to the value "UTF-8" or "COMPAT". If set on the command line to the value "COMPAT" then the value is replaced with the value of the native.encoding property during startup. Setting the property to a value other than "UTF-8" or "COMPAT" leads to unspecified behavior.
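The rule in that wording can be modeled as a small pure function. This is only an illustrative model with hypothetical names, not how the JDK resolves the property at startup; an empty result stands for the "unspecified behavior" case.

```java
import java.util.Optional;

// Hypothetical model of the proposed file.encoding wording; not JDK code.
final class FileEncodingModel {
    /**
     * @param fileEncoding   value given on the command line, or null if not set
     * @param nativeEncoding value of the native.encoding property
     * @return the resulting default charset name, or empty when the wording
     *         leaves the behavior unspecified
     */
    static Optional<String> resolve(String fileEncoding, String nativeEncoding) {
        if (fileEncoding == null || fileEncoding.equals("UTF-8")) {
            return Optional.of("UTF-8");
        }
        if (fileEncoding.equals("COMPAT")) {
            return Optional.ofNullable(nativeEncoding);
        }
        return Optional.empty(); // any other value: unspecified by the wording
    }

    public static void main(String[] args) {
        System.out.println(resolve(null, "windows-1252"));
        System.out.println(resolve("COMPAT", "windows-1252"));
        System.out.println(resolve("Shift_JIS", "windows-1252"));
    }
}
```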

Member Author

@naotoj naotoj Jul 21, 2021

Thanks, Alan. Updated the description as suggested.

Contributor

@AlanBateman AlanBateman left a comment

Thanks for incorporating the suggestion for the getProperties text. I think it looks good now.

@openjdk openjdk bot removed the csr Pull request needs approved CSR before integration label Jul 22, 2021
@openjdk

openjdk bot commented Jul 22, 2021

@naotoj This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8260265: UTF-8 by Default

Reviewed-by: alanb, rriggs

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 2 new commits pushed to the master branch:

  • 1fb798d: 8272915: (doc) package-info typo in extLink
  • 5116784: 8273091: Doc of [Strict]Math.floorDiv(long,int) erroneously documents int in @return tag

Please see this link for an up-to-date comparison between the source branch of this pull request and the master branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jul 22, 2021
@naotoj
Member Author

naotoj commented Aug 30, 2021

/integrate

@openjdk

openjdk bot commented Aug 30, 2021

Going to push as commit 7fc8540.
Since your change was applied there have been 17 commits pushed to the master branch:

  • 3204853: 8272343: Remove MetaspaceClosure::FLAG_MASK
  • fecefb8: 8271302: Regex Test Refresh
  • f18c0fa: 8271560: sun/security/ssl/DHKeyExchange/LegacyDHEKeyExchange.java still fails due to "An established connection was aborted by the software in your host machine"
  • 5aaa20f: 8272861: Add a micro benchmark for vector api
  • 7a01ba6: 8272093: Extract evacuation failure injection from G1CollectedHeap
  • 98b9d98: 8272797: Mutex with rank safepoint_check_never imply allow_vm_block
  • f11e099: 8272651: G1 heap region info print order changed by JDK-8269914
  • fbffa54: 8270438: "Cores to use" output in configure is misleading
  • 5185dbd: 8273098: Unnecessary Vector usage in java.naming
  • 276b07b: 8271490: [ppc] [s390]: Crash in JavaThread::pd_get_top_frame_for_profiling
  • ... and 7 more: https://git.openjdk.java.net/jdk/compare/e66c8afb59b57c4546656efa97f723f084964330...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot closed this Aug 30, 2021
@openjdk openjdk bot added integrated Pull request has been integrated and removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Aug 30, 2021
@openjdk

openjdk bot commented Aug 30, 2021

@naotoj Pushed as commit 7fc8540.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Labels
core-libs core-libs-dev@openjdk.org hotspot-runtime hotspot-runtime-dev@openjdk.org integrated Pull request has been integrated net net-dev@openjdk.org
5 participants