Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8195129: System.load() fails to load from unicode paths #4169

Closed
wants to merge 5 commits into from

Conversation

mkartashev
Copy link
Member

@mkartashev mkartashev commented May 24, 2021

Character strings within JVM are produced and consumed in several formats. Strings come from/to Java in the UTF8 format and POSIX APIs (like fprintf() or dlopen()) consume strings also in UTF8. On Windows, however, the situation is far less simple: some new(er) APIs expect UTF16 (wide-character strings), some older APIs can only work with strings in a "platform" format, where not all UTF8 characters can be represented; which ones can depends on the current "code page".

This commit switches the Windows version of native library loading code to using the new UTF16 API LoadLibraryW() and attempts to streamline the use of various string formats in the surrounding code.

Namely, exception messages are made to consume strings explicitly in the UTF8 format, while logging functions (that end up using legacy Windows API) are made to consume "platform" strings in most cases. One exception is JVM_LoadLibrary() logging where the UTF8 name of the library is logged, which can, of course, be fixed, but was considered not worth the additional code (NB: this isn't a new bug).

The test runs in a separate JVM in order to make NIO happy about non-ASCII characters in the file name; tests are executed with LC_ALL=C and that doesn't let NIO work with non-ASCII file names even on Linux or MacOS.

Tested by running test/hotspot/jtreg:tier1 on Linux and jtreg:test/hotspot/jtreg/runtime on Windows 10. The new test ( jtreg:test/hotspot/jtreg/runtime/jni/loadLibraryUnicode) was explicitly ran on those platforms as well.

Results from Linux:

Test summary
==============================
   TEST                                              TOTAL  PASS  FAIL ERROR   
   jtreg:test/hotspot/jtreg:tier1                     1784  1784     0     0   
==============================
TEST SUCCESS
Building target 'run-test-only' in configuration 'linux-x86_64-server-release'
Test selection 'jtreg:test/hotspot/jtreg/runtime/jni/loadLibraryUnicode', will run:
* jtreg:test/hotspot/jtreg/runtime/jni/loadLibraryUnicode

Running test 'jtreg:test/hotspot/jtreg/runtime/jni/loadLibraryUnicode'
Passed: runtime/jni/loadLibraryUnicode/LoadLibraryUnicodeTest.java
Test results: passed: 1

Results from Windows 10:

Test summary
==============================
   TEST                                              TOTAL  PASS  FAIL ERROR
   jtreg:test/hotspot/jtreg/runtime                    746   746     0     0
==============================
TEST SUCCESS
Finished building target 'run-test-only' in configuration 'windows-x86_64-server-fastdebug'
Building target 'run-test-only' in configuration 'windows-x86_64-server-fastdebug'
Test selection 'test/hotspot/jtreg/runtime/jni/loadLibraryUnicode', will run:
* jtreg:test/hotspot/jtreg/runtime/jni/loadLibraryUnicode

Running test 'jtreg:test/hotspot/jtreg/runtime/jni/loadLibraryUnicode'
Passed: runtime/jni/loadLibraryUnicode/LoadLibraryUnicodeTest.java
Test results: passed: 1

Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8195129: System.load() fails to load from unicode paths

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/4169/head:pull/4169
$ git checkout pull/4169

Update a local copy of the PR:
$ git checkout pull/4169
$ git pull https://git.openjdk.java.net/jdk pull/4169/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 4169

View PR using the GUI difftool:
$ git pr show -t 4169

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/4169.diff

@bridgekeeper
Copy link

bridgekeeper bot commented May 24, 2021

👋 Welcome back mkartashev! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label May 24, 2021
@openjdk
Copy link

openjdk bot commented May 24, 2021

@mkartashev The following labels will be automatically applied to this pull request:

  • core-libs
  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added hotspot hotspot-dev@openjdk.org core-libs core-libs-dev@openjdk.org labels May 24, 2021
@mlbridge
Copy link

mlbridge bot commented May 24, 2021

Webrevs

*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test sources have Apache license, I thought we always used GPL for tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure you are right. Corrected copyright headers in all the test files.

@gerard-ziemski
Copy link

gerard-ziemski commented May 26, 2021

I tried verifying the test on macOS by running:

jtreg -nr -va -jdk:./build/macosx-x86_64-server-fastdebug/images/jdk test/hotspot/jtreg/runtime/jni/loadLibraryUnicode

and I get:

TEST RESULT: Error. Use -nativepath to specify the location of native code

I looked at the test and thought it was handling loading the native lib on its own, without the test needing to set nativepath explicitly. What am I doing wrong?

@mlbridge
Copy link

mlbridge bot commented May 27, 2021

Mailing list message from David Holmes on hotspot-dev:

On 27/05/2021 6:59 am, Gerard Ziemski wrote:

On Tue, 25 May 2021 09:45:34 GMT, Maxim Kartashev <github.com+28651297+mkartashev at openjdk.org> wrote:
I tried verifying the test on **macOS** by running:

`jtreg -nr -va -jdk:./build/macosx-x86_64-server-fastdebug/images/jdk test/hotspot/jtreg/runtime/jni/loadLibraryUnicode`

and I get:

`TEST RESULT: Error. Use -nativepath to specify the location of native code`

I looked at the test and thought it was handling loading the native lib on its own, without the test needing to set **nativepath** explicitly. What am I doing wrong?

You always have to set nativepath explicitly for jtreg. The test then
reads the property that jtreg defines using the value of -nativepath:

String testNativePath =
LoadLibraryUnicodeTest.getSystemProperty("test.nativepath");

Cheers,
David
-----

1 similar comment
@mlbridge
Copy link

mlbridge bot commented May 27, 2021

Mailing list message from David Holmes on hotspot-dev:

On 27/05/2021 6:59 am, Gerard Ziemski wrote:

On Tue, 25 May 2021 09:45:34 GMT, Maxim Kartashev <github.com+28651297+mkartashev at openjdk.org> wrote:
I tried verifying the test on **macOS** by running:

`jtreg -nr -va -jdk:./build/macosx-x86_64-server-fastdebug/images/jdk test/hotspot/jtreg/runtime/jni/loadLibraryUnicode`

and I get:

`TEST RESULT: Error. Use -nativepath to specify the location of native code`

I looked at the test and thought it was handling loading the native lib on its own, without the test needing to set **nativepath** explicitly. What am I doing wrong?

You always have to set nativepath explicitly for jtreg. The test then
reads the property that jtreg defines using the value of -nativepath:

String testNativePath =
LoadLibraryUnicodeTest.getSystemProperty("test.nativepath");

Cheers,
David
-----

@dholmes-ora
Copy link
Member

Hi Maxim,

Please don't force-push changes once you have opened a PR and review has commenced as it messes up the history and comments. Just push each commit directly. The final integration will squash everything into one clean commit.

Thanks,
David

Copy link
Member

@dholmes-ora dholmes-ora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Maxim,

Overall this seems okay. I've focused mainly on the hotspot parts, including the test.

A few minor changes requested. I do have some concerns about the impact on startup though and the efficiency of the conversion routines.

Thanks,
David

Comment on lines 1460 to 1462
const int flag_source_str_is_null_terminated = -1;
const int flag_estimate_chars_count = 0;
int utf16_chars_count_estimated = MultiByteToWideChar(source_encoding,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your local naming style is somewhat excessive. You could just comment the values of the flags when you pass them eg:

MultiByteToWideChar(source_encoding,
MB_ERR_INVALID_CHARS,
source_str,
-1, //source is null-terminated
NULL, // no output buffer
0); // calculate required buffer size

Or you could just add a comment before the call:

// Perform a dummy conversion so that we can get the required size of the buffer to
// allocate. The source is null-terminated.

Trying to document parameter semantics by variable naming doesn't work in my opinion - at some point if you want to know you have to RTFM for the API.

And utf16_len is perfectly adequate for the returned size.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. Corrected.

log_info(os)("attempting shared library load of %s", name);
void * os::dll_load(const char *utf8_name, char *ebuf, int ebuflen) {
LPWSTR utf16_name = NULL;
errno_t err = convert_UTF8_to_UTF16(utf8_name, &utf16_name);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have any figures on the cost of this additional conversion in relation to startup time?

I'm already concerned to see that we have to perform each conversion twice via MultiByteToWideChar/WideCharToMultiByte, once to get the size and then to actually get the characters! This seems potentially very inefficient.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I measured time spend converting the library path name relative to the overall time of a (successful) os::dll_load() call. It varies between a fraction of a percent to ~3% (see below for the actual data from a release build). I'll defer to your expertise wrt to these numbers.

What can be done here (without changing os::dll_load() API) is to have a "fast path" for ascii-only path names, in which case no conversion is necessary, and a "slow path" for all the rest. This will complicate things a bit, but not by much, I believe. I am certainly willing to give that a try. Let me know what do you think.

./build/windows-x86_64-server-release/images/jdk/bin/java -jar ./build/windows-x86_64-server-fastdebug/images/jdk/demo/jfc/SwingSet2/SwingSet2.jar
0.050% for C:\cygwin64\home\maxim\work\OpenJDK\jdk\build\windows-x86_64-server-release\images\jdk\bin\jimage.dll
2.273% for C:\cygwin64\home\maxim\work\OpenJDK\jdk\build\windows-x86_64-server-release\images\jdk\bin\java.dll
0.177% for C:\cygwin64\home\maxim\work\OpenJDK\jdk\build\windows-x86_64-server-release\images\jdk\bin\net.dll
0.541% for C:\cygwin64\home\maxim\work\OpenJDK\jdk\build\windows-x86_64-server-release\images\jdk\bin\nio.dll
0.359% for C:\cygwin64\home\maxim\work\OpenJDK\jdk\build\windows-x86_64-server-release\images\jdk\bin\zip.dll
3.186% for C:\cygwin64\home\maxim\work\OpenJDK\jdk\build\windows-x86_64-server-release\images\jdk\bin\jimage.dll
0.075% for C:\cygwin64\home\maxim\work\OpenJDK\jdk\build\windows-x86_64-server-release\images\jdk\bin\awt.dll
0.232% for C:\cygwin64\home\maxim\work\OpenJDK\jdk\build\windows-x86_64-server-release\images\jdk\bin\freetype.dll
0.136% for C:\cygwin64\home\maxim\work\OpenJDK\jdk\build\windows-x86_64-server-release\images\jdk\bin\fontmanager.dll
0.170% for C:\cygwin64\home\maxim\work\OpenJDK\jdk\build\windows-x86_64-server-release\images\jdk\bin\javajpeg.dll

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The core-libs folks have the experience/expertise with these character encoding issues so I will defer to them. But someone should run this through our startup benchmarks to see if it makes a difference.

Thanks,
David

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The core-libs folks have the experience/expertise with these character encoding issues so I will defer to them.

Naoto has agreed to look at this.

Comment on lines 40 to 48
if (Platform.isLinux()) {
osDependentLibraryFileName = "libLoadLibraryUnicode.so";
} else if (Platform.isOSX()) {
osDependentLibraryFileName = "libLoadLibraryUnicode.dylib";
} else if (Platform.isWindows()) {
osDependentLibraryFileName = "LoadLibraryUnicode.dll";
} else {
throw new Error("Unsupported OS");
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use the test library function Platform.sharedLibraryExt() to get the library file extension.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion. Rewrote this piece.
(Side note: since the libraries' prefix differs between platforms also, it'd be nice to have something like Platform.sharedLibraryName(name); this is the way all the code that uses Platform.sharedLibraryExt() is structured anyway. But I guess it's best not to conflate things).

Comment on lines +1489 to +1491
static errno_t convert_UTF8_to_UTF16(char const* utf8_str, LPWSTR* utf16_str) {
return convert_to_UTF16(utf8_str, CP_UTF8, utf16_str);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, utf8_str is the "modified" UTF-8 string in JNI. Using it as the standard UTF-8 (I believe Windows' encoding CP_UTF8 is the one) may end up in incorrect conversions in some corner cases, e.g., for supplementary characters.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right; I changed the code in NativeLibraries.c to pass down true UTF-8 instead of "modified UTF-8". Please, take a look.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure we can pass non modified UTF-8 through JVM_LoadLibrary(). Probably some VM folks can enlighten here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not an expert by my understanding is that the VM only deals with modified UTF-8, as does JNI. So the incoming string should be modified-UTF8 IMO and then converted to UTF16.

That said, this is shared code being modified on the JDK side so you can't just change the type of string being passed in without updating all the implementations of os::dll_load to support that!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to establish some common ground before proceeding further with this fix. It's a bit of a long read; please, bear with me.

The path name starts its life as a jstring in Java_jdk_internal_loader_NativeLibraries_load(), its encoding is irrelevant at this point.

Next, the name has to be passed down to JVM_LoadLibrary() that takes char*. So we need to convert form jstring to char* (point (a)). Following that, os::dll_load() that actually performs loading in a platform-specific manner also receives char*. All platform implementations of os::dll_load() pass the path name down to their respective platform's APIs unmodified, but I think that's just incidental and here we have another possible point of conversion (point (b)). Other consumers of the path name are exception(c) and logging(d) messages; they also take char*, but potentially of a different encoding.

Let me try to enumerate all conceivably valid conversions for JVM_LoadLibrary() consumption (point (a)):

  1. jstring -> platform-specific encoding (status quo meaning possibly lossy encoding on Windows and UTF-8 elsewhere AFAICT),
  2. jstring -> modified UTF-8,
  3. jstring -> UTF-8.

This bug 8195129 occurs because conversion (1) may loose information on Windows if the platform encoding happens to be NOT UTF-8 (which it often - or even always - is). So that's a no-go and we are left with either (2) or (3).

On MacOS and Linux, "platform" encoding already is UTF-8 and since all the platform APIs happily consume UTF-8, no further conversion is necessary (neither for actual library loading, nor for log or exception messages; the latter have to convert to UTF-16, but do that under the hood).

On Windows, we require at least these variants of the path name:

  1. UTF16 for library loading (Unicode Windows API),
  2. "platform" encoding for logging (yes, loosing information here, but that's tolerable),
  3. "platform" (lossy) or UTF8 (lossless) encoding for exception messages (prefer lossless).

This is what's behind my choice of UTF-8 for the path name encoding as it gets passed down to JVM_LoadLibrary(). We can go with modified UTF-8, of course, in which case all platforms - not just Windows - will have to do the conversion on their own, loosing the benefit of the knowledge about the original string encoding (the String.coder field of jstring).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I am hesitant to change the JVM interface from modified UTF-8 to standard UTF-8, as it would be the only location in JNI/JVM interface that uses the standard UTF-8. Instead, I would implement convert_UTF8_to_UTF16 or rather convert_mUTF8_to_UTF16 with a fairly simple arithmetic logic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mkartashev thank you for the detailed explanation.

It is not clear to me that the JDK's conformance to being a Unicode application has significantly changed since the evaluation of JDK-8017274 - @naotoj can you comment on that and related discussion from the CCC for JDK-4958170 ? In particular I'm not sure that using the platform encoding is wrong, nor how we can have a path that cannot be represented by the platform encoding?

Not being an expert in this area I cannot evaluate the affects of these shared code changes on other platforms, and so am reluctant to introduce any change that affects any non-Windows platforms. Also the JVM and JNI work with modified-UTF8 so I do not think we should diverge from that.
I would hate to see windows specific code introduced into the JDK or JVM's shared code for these APIs, but that may be the only choice to avoid potential disruption to other platforms. Though perhaps we could push the initial conversion down into the JVM?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I am hesitant to change the JVM interface from modified UTF-8 to standard UTF-8,

AFAICT all platforms except Windows already use standard UTF-8 on that path (from Java_jdk_internal_loader_NativeLibraries_load() to JVM_LoadLibrary()) because the "platform" encoding for those happens to be "UTF-8". So at the current stage this patch actually maintains status quo for all platforms except Windows, the only platform where the bug exists.

But I am not against changing the encoding to modified UTF-8 and updating os::dll_load() for all platforms. Just wanted to have some consensus before proceeding with that change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dholmes-ora Sorry, I don't think anything has changed as to the encoding as of JDK-8017274. For some reason, I had the impression that JVM_LoadLibrary() accepts UTF-8 (either modified or standard), but that was not correct. It is using the platform encoded string for the pathname.

@mkartashev As you mentioned in another comment, the only way to fix this issue is to pass UTF-8 down to JVM_LoadLibray, but I don't think it is feasible. One reason is the effort is too great, and the other is that all VM implementations would need to be modified.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@naotoj Then I guess this bug will have to wait until Windows evolves to the point when its platform encoding is UTF-8. In the mean time, I'm closing this PR.

Thank you all so much for your time!

public static void main(String args[]) throws Exception {
String nativePathSetting = "-Dtest.nativepath=" + getSystemProperty("test.nativepath");
ProcessBuilder pb = ProcessTools.createTestJvm(nativePathSetting, LoadLibraryUnicode.class.getName());
pb.environment().put("LC_ALL", "en_US.UTF-8");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some environments/user configs may not have UTF-8 codeset on the platform. May need to gracefully exit in such a case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added java.nio.charset.Charset.isSupported("UTF-8") check to the test. Hope that's enough for the environments without UTF-8.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Charset.isSupported() returns whether Java encoder/decoder supports it or not, not the platform has the codeset. I think we can simply limit the test platform only to Windows in @requires tag in the test. Also, I would see the test case using some supplementary characters.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can simply limit the test platform only to Windows in @requires tag in the test. Also, I would see the test case using some supplementary characters.

Done.

…helper function

to get the library name encoded as true UTF-8 (not in "modified UTF-8" form).
Also updated the test to automatically pass if "UTF-8" is not supported by NIO
on the platform.
Comment on lines +1489 to +1491
static errno_t convert_UTF8_to_UTF16(char const* utf8_str, LPWSTR* utf16_str) {
return convert_to_UTF16(utf8_str, CP_UTF8, utf16_str);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I am hesitant to change the JVM interface from modified UTF-8 to standard UTF-8, as it would be the only location in JNI/JVM interface that uses the standard UTF-8. Instead, I would implement convert_UTF8_to_UTF16 or rather convert_mUTF8_to_UTF16 with a fairly simple arithmetic logic.

Comment on lines +659 to +680
static jboolean utf8EncodingSupported(JNIEnv *env) {
if (utf8Encoding == NULL) { // Cache jstring "UTF-8"
jstring utf8Str = (*env)->NewStringUTF(env, "UTF-8");
if (utf8Str == NULL) {
return JNI_FALSE;
}
utf8Encoding = (jstring)(*env)->NewGlobalRef(env, utf8Str);
(*env)->DeleteLocalRef(env, utf8Str);
}
static jboolean isUTF8EncodingSupported = JNI_FALSE;
if (isUTF8EncodingSupported == JNI_TRUE) {
return JNI_TRUE;
}
jboolean sawException;
isUTF8EncodingSupported = (jboolean) JNU_CallStaticMethodByName (
env, &sawException,
"java/nio/charset/Charset",
"isSupported",
"(Ljava/lang/String;)Z",
utf8Encoding).z;
return isUTF8EncodingSupported;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be moot if we choose not to modify the JVM_ interface to standard UTF-8, but this function is not necessary. UTF-8 encoding is guaranteed in every Java implementation.

@mkartashev
Copy link
Member Author

I came to realize that changing os::dll_load() to accept UTF-8 (standard or modified) will break all the users of that function except JVM_LoadLibrary(). Consider os::native_java_library() that still operates with the platform encoding on Windows and works correctly if CWD contains Latin-1 characters (assuming 1252 code page). With this change, java will fail to start if its path name contains, say, Æ because os::dll_load() will expect it to be encoded as c3 86 (UTF-8), but will get c6 (Latin-1) instead.

One possible solution is to update all the call sites of os::dll_load() (quite laborous), another is to introduce os::dll_load_utf8() and change only JVM_LoadLibrary() at this point in time.

Advice is welcome.

@mkartashev mkartashev closed this Jun 8, 2021
@mkartashev mkartashev deleted the JDK-8195129 branch February 10, 2022 12:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core-libs core-libs-dev@openjdk.org hotspot hotspot-dev@openjdk.org rfr Pull request is ready for review
Development

Successfully merging this pull request may close these issues.

5 participants