8195129: System.load() fails to load from unicode paths #4169
Conversation
👋 Welcome back mkartashev! A progress list of the required criteria for merging this PR into
@mkartashev The following labels will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.
Webrevs
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
The test sources have the Apache license; I thought we always used GPL for tests.
I'm sure you are right. Corrected copyright headers in all the test files.
I tried verifying the test on macOS by running:
and I get:
I looked at the test and thought it was handling loading the native lib on its own, without the test needing to set nativepath explicitly. What am I doing wrong?
Mailing list message from David Holmes on hotspot-dev: On 27/05/2021 6:59 am, Gerard Ziemski wrote:
You always have to set nativepath explicitly for jtreg. The test then String testNativePath = Cheers,
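For reference, a minimal jtreg invocation along these lines; the paths below are placeholders for your own build output (the native test libraries normally land under the build's test image):

```shell
# Hypothetical invocation; substitute your actual build directory for <platform>.
jtreg -jdk:build/<platform>/images/jdk \
      -nativepath:build/<platform>/images/test/hotspot/jtreg/native \
      test/hotspot/jtreg/runtime/jni/loadLibraryUnicode
```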
Hi Maxim, Please don't force-push changes once you have opened a PR and review has commenced as it messes up the history and comments. Just push each commit directly. The final integration will squash everything into one clean commit. Thanks,
Hi Maxim,
Overall this seems okay. I've focused mainly on the hotspot parts, including the test.
A few minor changes requested. I do have some concerns about the impact on startup though and the efficiency of the conversion routines.
Thanks,
David
const int flag_source_str_is_null_terminated = -1;
const int flag_estimate_chars_count = 0;
int utf16_chars_count_estimated = MultiByteToWideChar(source_encoding,
Your local naming style is somewhat excessive. You could just comment the values of the flags when you pass them, e.g.:
MultiByteToWideChar(source_encoding,
MB_ERR_INVALID_CHARS,
source_str,
-1, //source is null-terminated
NULL, // no output buffer
0); // calculate required buffer size
Or you could just add a comment before the call:
// Perform a dummy conversion so that we can get the required size of the buffer to
// allocate. The source is null-terminated.
Trying to document parameter semantics by variable naming doesn't work in my opinion - at some point if you want to know you have to RTFM for the API.
And utf16_len is perfectly adequate for the returned size.
Fair enough. Corrected.
log_info(os)("attempting shared library load of %s", name);
void * os::dll_load(const char *utf8_name, char *ebuf, int ebuflen) {
  LPWSTR utf16_name = NULL;
  errno_t err = convert_UTF8_to_UTF16(utf8_name, &utf16_name);
Do you have any figures on the cost of this additional conversion in relation to startup time?
I'm already concerned to see that we have to perform each conversion twice via MultiByteToWideChar/WideCharToMultiByte, once to get the size and then to actually get the characters! This seems potentially very inefficient.
I measured the time spent converting the library path name relative to the overall time of a (successful) os::dll_load() call. It varies between a fraction of a percent and ~3% (see below for the actual data from a release build). I'll defer to your expertise with regard to these numbers.
What can be done here (without changing the os::dll_load() API) is to have a "fast path" for ASCII-only path names, in which case no conversion is necessary, and a "slow path" for all the rest. This will complicate things a bit, but not by much, I believe. I am certainly willing to give that a try. Let me know what you think.
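A minimal sketch of the proposed ASCII-only fast path, written in Java purely for illustration (the real code would be C++ in the Windows port, and all names here are hypothetical): if every byte of the UTF-8 path has the high bit clear, each byte maps 1:1 onto a UTF-16 code unit and the two-call MultiByteToWideChar dance can be skipped.

```java
import java.nio.charset.StandardCharsets;

public class AsciiFastPath {
    // Fast-path predicate: ASCII bytes never have the high bit set.
    static boolean isAsciiOnly(byte[] utf8) {
        for (byte b : utf8) {
            if ((b & 0x80) != 0) return false; // multi-byte UTF-8 sequence present
        }
        return true;
    }

    // Trivial widening: each ASCII byte becomes one UTF-16 code unit.
    static char[] widenAscii(byte[] utf8) {
        char[] utf16 = new char[utf8.length];
        for (int i = 0; i < utf8.length; i++) {
            utf16[i] = (char) utf8[i];
        }
        return utf16;
    }

    public static void main(String[] args) {
        byte[] plain = "C:\\jdk\\bin\\jimage.dll".getBytes(StandardCharsets.UTF_8);
        System.out.println(isAsciiOnly(plain));
        System.out.println(new String(widenAscii(plain)));
        byte[] unicode = "C:\\p\u00e4th\\lib.dll".getBytes(StandardCharsets.UTF_8);
        System.out.println(isAsciiOnly(unicode));
    }
}
```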
./build/windows-x86_64-server-release/images/jdk/bin/java -jar ./build/windows-x86_64-server-fastdebug/images/jdk/demo/jfc/SwingSet2/SwingSet2.jar
0.050% for C:\cygwin64\home\maxim\work\OpenJDK\jdk\build\windows-x86_64-server-release\images\jdk\bin\jimage.dll
2.273% for C:\cygwin64\home\maxim\work\OpenJDK\jdk\build\windows-x86_64-server-release\images\jdk\bin\java.dll
0.177% for C:\cygwin64\home\maxim\work\OpenJDK\jdk\build\windows-x86_64-server-release\images\jdk\bin\net.dll
0.541% for C:\cygwin64\home\maxim\work\OpenJDK\jdk\build\windows-x86_64-server-release\images\jdk\bin\nio.dll
0.359% for C:\cygwin64\home\maxim\work\OpenJDK\jdk\build\windows-x86_64-server-release\images\jdk\bin\zip.dll
3.186% for C:\cygwin64\home\maxim\work\OpenJDK\jdk\build\windows-x86_64-server-release\images\jdk\bin\jimage.dll
0.075% for C:\cygwin64\home\maxim\work\OpenJDK\jdk\build\windows-x86_64-server-release\images\jdk\bin\awt.dll
0.232% for C:\cygwin64\home\maxim\work\OpenJDK\jdk\build\windows-x86_64-server-release\images\jdk\bin\freetype.dll
0.136% for C:\cygwin64\home\maxim\work\OpenJDK\jdk\build\windows-x86_64-server-release\images\jdk\bin\fontmanager.dll
0.170% for C:\cygwin64\home\maxim\work\OpenJDK\jdk\build\windows-x86_64-server-release\images\jdk\bin\javajpeg.dll
The core-libs folks have the experience/expertise with these character encoding issues so I will defer to them. But someone should run this through our startup benchmarks to see if it makes a difference.
Thanks,
David
The core-libs folks have the experience/expertise with these character encoding issues so I will defer to them.
Naoto has agreed to look at this.
if (Platform.isLinux()) {
    osDependentLibraryFileName = "libLoadLibraryUnicode.so";
} else if (Platform.isOSX()) {
    osDependentLibraryFileName = "libLoadLibraryUnicode.dylib";
} else if (Platform.isWindows()) {
    osDependentLibraryFileName = "LoadLibraryUnicode.dll";
} else {
    throw new Error("Unsupported OS");
}
Please use the test library function Platform.sharedLibraryExt() to get the library file extension.
Thanks for the suggestion. Rewrote this piece.
(Side note: since the libraries' prefix also differs between platforms, it'd be nice to have something like Platform.sharedLibraryName(name); this is the way all the code that uses Platform.sharedLibraryExt() is structured anyway. But I guess it's best not to conflate things.)
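A sketch of what such a hypothetical Platform.sharedLibraryName(name) helper might look like. The mapping is an assumption based on common platform naming conventions, not the actual jdk.test.lib.Platform API, and the osName parameter exists only to keep the example self-contained and deterministic:

```java
public class SharedLibraryName {
    // Hypothetical helper: prepend/append the platform's library prefix/suffix.
    static String sharedLibraryName(String osName, String name) {
        if (osName.contains("Linux"))   return "lib" + name + ".so";
        if (osName.contains("Mac"))     return "lib" + name + ".dylib";
        if (osName.contains("Windows")) return name + ".dll";
        throw new Error("Unsupported OS: " + osName);
    }

    public static void main(String[] args) {
        System.out.println(sharedLibraryName("Linux", "LoadLibraryUnicode"));
        System.out.println(sharedLibraryName("Mac OS X", "LoadLibraryUnicode"));
        System.out.println(sharedLibraryName("Windows 10", "LoadLibraryUnicode"));
    }
}
```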
static errno_t convert_UTF8_to_UTF16(char const* utf8_str, LPWSTR* utf16_str) {
    return convert_to_UTF16(utf8_str, CP_UTF8, utf16_str);
}
IIUC, utf8_str is the "modified" UTF-8 string in JNI. Using it as standard UTF-8 (I believe Windows' CP_UTF8 encoding is the one) may end up in incorrect conversions in some corner cases, e.g., for supplementary characters.
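The corner case can be demonstrated directly. In Java, DataOutputStream.writeUTF emits the modified UTF-8 form that JNI uses (a supplementary character becomes two separately encoded 3-byte surrogates), while String.getBytes(UTF_8) emits standard UTF-8 (one 4-byte sequence), so the byte lengths differ:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.nio.charset.StandardCharsets;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws Exception {
        // One supplementary code point (U+1F600) = one code point, two UTF-16 chars.
        String s = new String(Character.toChars(0x1F600));

        byte[] standard = s.getBytes(StandardCharsets.UTF_8);
        System.out.println("standard UTF-8 bytes: " + standard.length);

        // writeUTF produces modified UTF-8, preceded by a 2-byte length prefix.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeUTF(s);
        int modifiedLen = bos.size() - 2; // strip the length prefix
        System.out.println("modified UTF-8 bytes: " + modifiedLen);
    }
}
```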
Right; I changed the code in NativeLibraries.c to pass down true UTF-8 instead of "modified UTF-8". Please take a look.
I am not sure we can pass non-modified UTF-8 through JVM_LoadLibrary(). Probably some VM folks can enlighten us here?
Not an expert, but my understanding is that the VM only deals with modified UTF-8, as does JNI. So the incoming string should be modified UTF-8, IMO, and then converted to UTF-16.
That said, this is shared code being modified on the JDK side, so you can't just change the type of string being passed in without updating all the implementations of os::dll_load to support that!
I think we need to establish some common ground before proceeding further with this fix. It's a bit of a long read; please bear with me.
The path name starts its life as a jstring in Java_jdk_internal_loader_NativeLibraries_load(); its encoding is irrelevant at this point.
Next, the name has to be passed down to JVM_LoadLibrary(), which takes char*, so we need to convert from jstring to char* (point (a)). Following that, os::dll_load(), which actually performs the loading in a platform-specific manner, also receives char*. All platform implementations of os::dll_load() pass the path name down to their respective platform's APIs unmodified, but I think that's just incidental, and here we have another possible point of conversion (point (b)). Other consumers of the path name are exception (c) and logging (d) messages; they also take char*, but potentially of a different encoding.
Let me try to enumerate all conceivably valid conversions for JVM_LoadLibrary() consumption (point (a)):
1. jstring -> platform-specific encoding (the status quo, meaning possibly lossy encoding on Windows and UTF-8 elsewhere, AFAICT),
2. jstring -> modified UTF-8,
3. jstring -> UTF-8.
Bug 8195129 occurs because conversion (1) may lose information on Windows if the platform encoding happens to be NOT UTF-8 (which it often, or even always, is). So that's a no-go, and we are left with either (2) or (3).
On macOS and Linux, the "platform" encoding already is UTF-8, and since all the platform APIs happily consume UTF-8, no further conversion is necessary (neither for actual library loading, nor for log or exception messages; the latter have to convert to UTF-16, but do that under the hood).
On Windows, we require at least these variants of the path name:
- UTF-16 for library loading (Unicode Windows API),
- "platform" encoding for logging (yes, losing information here, but that's tolerable),
- "platform" (lossy) or UTF-8 (lossless) encoding for exception messages (prefer lossless).
This is what's behind my choice of UTF-8 for the path name encoding as it gets passed down to JVM_LoadLibrary(). We can go with modified UTF-8, of course, in which case all platforms, not just Windows, will have to do the conversion on their own, losing the benefit of the knowledge of the original string encoding (the String.coder field of the jstring).
I think I am hesitant to change the JVM interface from modified UTF-8 to standard UTF-8, as it would be the only location in the JNI/JVM interface that uses standard UTF-8. Instead, I would implement convert_UTF8_to_UTF16, or rather convert_mUTF8_to_UTF16, with fairly simple arithmetic logic.
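A sketch of that arithmetic, written in Java for illustration only (the real helper would be C in the Windows port, and the method name here is hypothetical). Modified UTF-8 never uses 4-byte sequences, because supplementary characters arrive as two separately encoded surrogates, so three branches suffice; error handling for malformed input is omitted:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.util.Arrays;

public class MUtf8Decode {
    // Decode modified UTF-8 bytes directly into UTF-16 code units.
    static String decodeModifiedUtf8(byte[] b) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < b.length) {
            int c = b[i] & 0xFF;
            if (c < 0x80) {                  // 1 byte: 0xxxxxxx (U+0000 itself is 2 bytes in mUTF-8)
                out.append((char) c);
                i += 1;
            } else if ((c & 0xE0) == 0xC0) { // 2 bytes: 110xxxxx 10xxxxxx
                out.append((char) (((c & 0x1F) << 6) | (b[i + 1] & 0x3F)));
                i += 2;
            } else {                         // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
                out.append((char) (((c & 0x0F) << 12)
                        | ((b[i + 1] & 0x3F) << 6)
                        | (b[i + 2] & 0x3F)));
                i += 3;
            }
        }
        // Surrogate pairs come out as-is: already valid UTF-16.
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String original = "p\u00e4th-\uD83D\uDE00"; // 2-byte, 1-byte, and surrogate cases
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeUTF(original); // writeUTF emits modified UTF-8
        byte[] withLen = bos.toByteArray();
        byte[] mutf8 = Arrays.copyOfRange(withLen, 2, withLen.length); // drop length prefix
        System.out.println(decodeModifiedUtf8(mutf8).equals(original));
    }
}
```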
@mkartashev thank you for the detailed explanation.
It is not clear to me that the JDK's conformance to being a Unicode application has significantly changed since the evaluation of JDK-8017274 - @naotoj can you comment on that and the related discussion from the CCC for JDK-4958170? In particular, I'm not sure that using the platform encoding is wrong, nor how we can have a path that cannot be represented by the platform encoding.
Not being an expert in this area, I cannot evaluate the effects of these shared code changes on other platforms, and so am reluctant to introduce any change that affects any non-Windows platforms. Also, the JVM and JNI work with modified UTF-8, so I do not think we should diverge from that.
I would hate to see Windows-specific code introduced into the JDK's or JVM's shared code for these APIs, but that may be the only choice to avoid potential disruption to other platforms. Though perhaps we could push the initial conversion down into the JVM?
I think I am hesitant to change the JVM interface from modified UTF-8 to standard UTF-8,
AFAICT all platforms except Windows already use standard UTF-8 on that path (from Java_jdk_internal_loader_NativeLibraries_load() to JVM_LoadLibrary()) because the "platform" encoding for those happens to be UTF-8. So at the current stage this patch actually maintains the status quo for all platforms except Windows, the only platform where the bug exists.
But I am not against changing the encoding to modified UTF-8 and updating os::dll_load() for all platforms. Just wanted to have some consensus before proceeding with that change.
@dholmes-ora Sorry, I don't think anything has changed as to the encoding as of JDK-8017274. For some reason, I had the impression that JVM_LoadLibrary() accepts UTF-8 (either modified or standard), but that was not correct. It is using the platform-encoded string for the pathname.
@mkartashev As you mentioned in another comment, the only way to fix this issue is to pass UTF-8 down to JVM_LoadLibrary(), but I don't think it is feasible. One reason is that the effort is too great; the other is that all VM implementations would need to be modified.
@naotoj Then I guess this bug will have to wait until Windows evolves to the point where its platform encoding is UTF-8. In the meantime, I'm closing this PR.
Thank you all so much for your time!
public static void main(String args[]) throws Exception {
    String nativePathSetting = "-Dtest.nativepath=" + getSystemProperty("test.nativepath");
    ProcessBuilder pb = ProcessTools.createTestJvm(nativePathSetting, LoadLibraryUnicode.class.getName());
    pb.environment().put("LC_ALL", "en_US.UTF-8");
Some environments/user configs may not have a UTF-8 codeset on the platform. The test may need to gracefully exit in such a case.
I added a java.nio.charset.Charset.isSupported("UTF-8") check to the test. Hope that's enough for the environments without UTF-8.
Charset.isSupported() returns whether the Java encoder/decoder supports the charset, not whether the platform has the codeset. I think we can simply limit the test to Windows via the @requires tag in the test. Also, I would like to see the test case use some supplementary characters.
I think we can simply limit the test platform only to Windows in @requires tag in the test. Also, I would see the test case using some supplementary characters.
Done.
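A supplementary character for the test's path name can be produced like this; the specific code point (U+1F600) is just an example, chosen because it lies outside the BMP and therefore occupies two UTF-16 chars but only one code point:

```java
public class SupplementaryName {
    public static void main(String[] args) {
        // Append one supplementary code point to an ASCII prefix.
        String component = "sub-dir-" + new String(Character.toChars(0x1F600));
        // 8 ASCII code points + 1 supplementary code point:
        System.out.println(component.codePointCount(0, component.length()));
        // ...but 8 + 2 UTF-16 chars, since the supplementary char is a surrogate pair:
        System.out.println(component.length());
    }
}
```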
…helper function to get the library name encoded as true UTF-8 (not in "modified UTF-8" form). Also updated the test to automatically pass if "UTF-8" is not supported by NIO on the platform.
supplementary plane in the path name.
static errno_t convert_UTF8_to_UTF16(char const* utf8_str, LPWSTR* utf16_str) {
    return convert_to_UTF16(utf8_str, CP_UTF8, utf16_str);
}
I think I am hesitant to change the JVM interface from modified UTF-8 to standard UTF-8, as it would be the only location in the JNI/JVM interface that uses standard UTF-8. Instead, I would implement convert_UTF8_to_UTF16, or rather convert_mUTF8_to_UTF16, with fairly simple arithmetic logic.
static jboolean utf8EncodingSupported(JNIEnv *env) {
    if (utf8Encoding == NULL) { // Cache jstring "UTF-8"
        jstring utf8Str = (*env)->NewStringUTF(env, "UTF-8");
        if (utf8Str == NULL) {
            return JNI_FALSE;
        }
        utf8Encoding = (jstring)(*env)->NewGlobalRef(env, utf8Str);
        (*env)->DeleteLocalRef(env, utf8Str);
    }
    static jboolean isUTF8EncodingSupported = JNI_FALSE;
    if (isUTF8EncodingSupported == JNI_TRUE) {
        return JNI_TRUE;
    }
    jboolean sawException;
    isUTF8EncodingSupported = (jboolean) JNU_CallStaticMethodByName(
            env, &sawException,
            "java/nio/charset/Charset",
            "isSupported",
            "(Ljava/lang/String;)Z",
            utf8Encoding).z;
    return isUTF8EncodingSupported;
}
It would be moot if we choose not to modify the JVM_ interface to standard UTF-8, but this function is not necessary. UTF-8 encoding is guaranteed in every Java implementation.
I came to realize that changing One possible solution is to update all the call sites of Advice is welcome.
Character strings within the JVM are produced and consumed in several formats. Strings come from/to Java in the UTF-8 format, and POSIX APIs (like fprintf() or dlopen()) also consume strings in UTF-8. On Windows, however, the situation is far less simple: some newer APIs expect UTF-16 (wide-character strings), while some older APIs can only work with strings in a "platform" format, where not all UTF-8 characters can be represented; which ones can depends on the current "code page".
This commit switches the Windows version of the native library loading code to the new UTF-16 API LoadLibraryW() and attempts to streamline the use of the various string formats in the surrounding code. Namely, exception messages are made to consume strings explicitly in the UTF-8 format, while logging functions (which end up using legacy Windows APIs) are made to consume "platform" strings in most cases. One exception is JVM_LoadLibrary() logging, where the UTF-8 name of the library is logged; this can, of course, be fixed, but was considered not worth the additional code (NB: this isn't a new bug).
The test runs in a separate JVM in order to make NIO happy about non-ASCII characters in the file name; tests are executed with LC_ALL=C, and that doesn't let NIO work with non-ASCII file names even on Linux or macOS.
Tested by running test/hotspot/jtreg:tier1 on Linux and jtreg:test/hotspot/jtreg/runtime on Windows 10. The new test (jtreg:test/hotspot/jtreg/runtime/jni/loadLibraryUnicode) was explicitly run on those platforms as well.
Results from Linux:
Results from Windows 10:
Progress
Issue
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/4169/head:pull/4169
$ git checkout pull/4169
Update a local copy of the PR:
$ git checkout pull/4169
$ git pull https://git.openjdk.java.net/jdk pull/4169/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 4169
View PR using the GUI difftool:
$ git pr show -t 4169
Using diff file
Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/4169.diff