New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8260265: UTF-8 by Default #4733
Conversation
|
👋 Welcome back naoto! A progress list of the required criteria for merging this PR into |
|
/csr |
|
@naotoj this pull request will not be integrated until the CSR request JDK-8260266 for issue JDK-8260265 has been approved. |
|
@naotoj The following labels will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command. |
Webrevs
|
This comment has been minimized.
This comment has been minimized.
It's even worse than that, because many OpenSSH installs are configured by default to forward and accept the user locale (see e.g. for RHEL 7). So a single application, on a single remote machine, can be unknowingly started by a single user with different locales, and therefore different encodings, depending on how the user connected to the remote machine. For example, on Windows connecting via powershell results in Worth mentioning is also that |
| * or the {@code toString()} method, which uses the platform's default | ||
| * character encoding. | ||
| * or the {@code toString()} method, which uses the default | ||
| * charset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fold to previous line.
| if (csname != null) { | ||
| try { | ||
| cs = Charset.forName(csname); | ||
| } catch (Exception ignored) { } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A separate enhancement...
I've long thought that should be a way to avoid the exception here.
For example, a Charset.forName(csname, default);
The caller might have a default in mind or supply null and then be able to test for null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Will file an RFE for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * | ||
| * <p> | ||
| * The {@code FileReader} is meant for reading streams of characters. For reading | ||
| * streams of raw bytes, consider using a {@code FileInputStream}. | ||
| * | ||
| * @see InputStreamReader | ||
| * @see FileInputStream | ||
| * @see java.nio.charset.Charset#defaultCharset() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The @ see duplicates the link above, the javadoc can do without the @ see.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I remove that @see, I don't see the link in See Also section. Am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my view the @ linkplain is sufficient to allow the reader to navigate; but YMMV.
| @@ -35,7 +35,7 @@ | |||
| * An InputStreamReader is a bridge from byte streams to character streams: It | |||
| * reads bytes and decodes them into characters using a specified {@link | |||
| * java.nio.charset.Charset charset}. The charset that it uses | |||
| * may be specified by name or may be given explicitly, or the platform's | |||
| * may be specified by name or may be given explicitly, or the | |||
| * {@link Charset#defaultCharset() default charset} may be accepted. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"may be accepted" seems like the API has some choice in the matter.
Perhaps "accepted" -> "used".
And in other classes below if there's a suitable replacement.
| * bytes using the given encoding or charset, or the platform's default | ||
| * character encoding if not specified. | ||
| * bytes using the given encoding or charset, or the default | ||
| * console charset if not specified. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JEP 400 doesn't give a rationale for using the console charset for PrintStream.
PrintStreams are used for output to files and other media other than just a tty/console.
The charset of system.out/err should use the console charset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was my thinking in #4733 (comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I am now conviced. Modified not to default to Console.charset() for generic PrintStream w/o charset constructor.
| @@ -797,6 +797,15 @@ public static native void arraycopy(Object src, int srcPos, | |||
| * <td>The module name of the initial/main module</td></tr> | |||
| * <tr><th scope="row">{@systemProperty jdk.module.main.class}</th> | |||
| * <td>The main class name of the initial module</td></tr> | |||
| * <tr><th scope="row">{@systemProperty file.encoding}</th> | |||
| * <td>The name of the default charset. Users may specify | |||
| * {@code UTF-8} or {@code COMPAT} on the command line to the value. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The wording could imply that only those two values can be supplied.
It could be rephrased to say that if the property is supplied on the command line
it overrides the default UTF-8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was intentional. Only those two are supported, others continue to work as before (but not supported).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still it leaves an uncomfortable feeling, perhaps remedied by an "other values have unspecified behavior"
or the "other values are implementation specific".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the clarifying sentence to the spec.
| @@ -86,17 +88,17 @@ | |||
| */ | |||
| private URLDecoder() {} | |||
|
|
|||
| // The platform default encoding | |||
| // The default charset | |||
| static String dfltEncName = URLEncoder.dfltEncName; | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps add the value of file.encoding to the StaticProperties either as a string or as the Charset.
That would allow a few different lookups of the property to be simplified.
| @@ -161,7 +163,7 @@ public static String encode(String s) { | |||
| try { | |||
| str = encode(s, dfltEncName); | |||
| } catch (UnsupportedEncodingException e) { | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps a separate cleanup, the Charset should be cached, not just the name and use the encode(s, charset) method.
| * the system property {@code file.encoding} on the command line. If the | ||
| * value designates {@code COMPAT}, the default charset is derived from | ||
| * the {@code native.encoding} system property, which typically depends | ||
| * upon the locale and charset of the underlying operating system. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The description in java.lang.System of the file.encoding property does not indicate it is 'implementation specific'.
In that context, it appears to be part of the JavaSE spec.
Having the spec in a single place with references to it from others could avoid duplication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
file.encoding is listed under @implNote tag in System.getProperties(), so it is implementation-specific.
Thanks. Updated the JEP. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work!
One thing we missed when adding Console.charset() was the code example in OutputStreamWriter's class description. It uses System.out and should have been changed to "anOutputStream" to avoid people copying this usage. We can do it as part of this PR or another but would be good to get it consistent with InputStreamReader.
| * | ||
| * @implNote An implementation may override the default charset with | ||
| * the system property {@code file.encoding} on the command line. If the | ||
| * value designates {@code COMPAT}, the default charset is derived from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be simpler to say "If the value is COMPAT" and avoid the word "designates" here.
| @@ -47,40 +47,40 @@ | |||
|
|
|||
| /** | |||
| * Creates a new {@code FileReader}, given the name of the file to read, | |||
| * using the platform's | |||
| * {@linkplain java.nio.charset.Charset#defaultCharset() default charset}. | |||
| * using the {@linkplain java.nio.charset.Charset#defaultCharset() default charset}. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The java.io classes import java.nio.charset.Charset so don't need to use the fully qualified class name everywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Corrected, as well as other locations.
| @@ -163,7 +163,7 @@ static String generateSources() throws Exception { | |||
| return commandLineClassName; | |||
| } | |||
|
|
|||
| private static final String defaultEncoding = Charset.defaultCharset().name(); | |||
| private static final String defaultEncoding = System.getProperty("sun.jnu.encoding"); | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit confusing to have "defaultEncoding" set to the this property. Maybe the test should be changed to run with -Dfile.encoding=COMPAT or else change the two usages of defaultEncoding in the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the test is asserting UTF-8 path names, it is not affected by file.encoding being COMPAT or not. To make it less confusing, I renamed the field name to filePathEncoding.
I included the leftover fix as well. |
| * one from {@code native.encoding} system property during runtime startup. | ||
| * Specifying it to {@code UTF-8}, or no value is set, defaults to use | ||
| * {@code UTF-8}. Other values have unspecified behavior. | ||
| * </td></tr> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's a suggesting re-wording to consider:
The name of the default charset, defaults to "UTF-8". The property may be set on the command line to the value "UTF-8" or "COMPAT". If set on the command line to the value "COMPAT" then the value is replaced with the value of the native.encoding property during startup. Setting the property to a value other than "UFT-8" or "COMPAT" leads to unspecified behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Alan. Updated the description as suggested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for incorporating the suggestion for the getProperties text. I think it looks good now.
|
@naotoj This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 2 new commits pushed to the
Please see this link for an up-to-date comparison between the source branch of this pull request and the ➡️ To integrate this PR with the above commit message to the |
|
/integrate |
|
Going to push as commit 7fc8540.
Your commit was automatically rebased without conflicts. |
This is an implementation for the
JEP 400: UTF-8 by Default. The gist of the changes isCharset.defaultCharset()returningUTF-8andfile.encodingsystem property being added in the spec, but another notable modification is injava.io.PrintStreamwhere it continues to use theConsoleencoding as the default charset instead ofUTF-8. Other changes are mostly clarification of the term "default charset" and their links. Corresponding CSR has also been drafted.JEP 400: https://bugs.openjdk.java.net/browse/JDK-8187041
CSR: https://bugs.openjdk.java.net/browse/JDK-8260266
Progress
Issue
Reviewers
Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/4733/head:pull/4733$ git checkout pull/4733Update a local copy of the PR:
$ git checkout pull/4733$ git pull https://git.openjdk.java.net/jdk pull/4733/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 4733View PR using the GUI difftool:
$ git pr show -t 4733Using diff file
Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/4733.diff