[native-image] Getting bad address when writing to socket (PosixJavaNIOSubstitutions) #400
Comments
Hi @sundbp, thank you for the report. This seems like an issue in our Posix NIO implementation. We'll take a look at it. Can you provide the app that you are trying to run so we can reproduce? |
I'll see if I can cut it down to a small repro case with instructions.
|
A cut down version that replicates it can be found here: https://gitlab.com/sundbp/native-image-grpc-test The README has instructions for how to reproduce. Please let me know if you have issues reproducing it! |
@sundbp thank you for providing a cut down version and for the detailed steps to reproduce the issue. We'll run it and let you know what we find. |
Did you manage to reproduce from that example repo? Thanks |
I was able to reproduce, thank you! Looking at the log output of ./create-exe.sh I see a lot of warnings of the form:
RecomputeFieldValue.FieldOffset automatic substitution failed. The automatic substitution registration was attempted because a call to sun.misc.Unsafe.objectFieldOffset(Field) was detected in the static initializer of com.google.protobuf.UnsafeUtil. Add a RecomputeFieldValue.FieldOffset manual substitution for com.google.protobuf.UnsafeUtil.
These are generated by SubstrateVM and are quite serious. (As a side note: we should increase the visibility of this kind of message, especially when they are intermingled with log messages from the static initializers of the library that we are trying to build, as happens with the Netty DEBUG log messages.) The reason for these messages is that com.google.protobuf.UnsafeUtil uses the Java unsafe API and this needs a special configuration to work on SubstrateVM. Initially I thought that this might be the reason for the failure: writing at a wrong offset results in memory corruption. (By the way, we will publish an article next week that includes details about the use of unsafe on SubstrateVM.) However, after fixing these I still get the java.io.IOException: Bad address, so I guess the problem is really in our PosixJavaNIOSubstitutions implementation. |
Great - thanks for the progress report.
Question: Could I have done anything myself to fix the UnsafeUtil errors/warnings? Short of understanding all of SVM :) I'd love to learn enough to sort these kinds of things out for myself going forward.
|
Definitely, we have a mechanism that allows you to do that, although the end goal is to automate the process of detecting and patching unsafe operations. We have a blog article scheduled for next week that goes into details. |
Excellent - Looking forward to reading the blog article! (and follow up on what goes wrong with PosixJavaNio!) |
I've now read the blog post - great! (referring to this post: https://medium.com/graalvm/instant-netty-startup-using-graalvm-native-image-generation-ed6f14ff7692) @cstancu would you mind making a PR with the unsafe substitutions you made for my example repo? I'd love to see if it matches up with how I'd do it after reading the blog post. |
Actually the substitutions that I used were quite simple. For protobuf I cheated and essentially disabled unsafe memory access by just returning
|
Ah! Nifty to disable it all! thanks |
@cstancu I'm struggling to get a build going with those substitutions included. If I use graal as distributed in the 1.0.0-rc1 I don't have the
If I try to use native-image from git (after installing the svm.jar from that build), and using the
Is that somehow related to mixing compiling using graal-sdk.jar from one version (1.0.0-rc1) but native image from another? I haven't worked out how to use the git built graal for both building my jar and doing native image I'm afraid :( I've pushed an update to the repo including these new changes (using 1.0.0-rc1): https://gitlab.com/sundbp/native-image-grpc-test How did you build it including the substitutions? |
Realized that error was because of a recent commit to svm a few days ago. Rolled back graal git to commit I've pushed updates to the reproduction repo at https://gitlab.com/sundbp/native-image-grpc-test Like you describe I still get the same error as before in |
Added a printout to the args passed to writev before it crashes: So that doesn't obviously look "bad". One thing that is a little suspicious to me is that fd always seems to be 27 (regardless if I open some other files, or some nc -lk etc on the same machine). I'm not sure if that is normal or not. Starting to get to the end of my ideas of things to look at. Happy to keep digging if anyone has any input on sensible next steps! |
Another experiment - check lsof of the process before I've created gRPC client, after I've created it but before I've sent any API request over it, and after it has crashed:
After creating client, but before API call:
After crash (before process exits):
Not making me that much wiser, but at least I see how 27 is the "next fd in line" (and obviously FDs are per process, not global as I temporarily was thinking). Given the EFAULT errno (indicated by "Bad address") it seems relatively conclusive to me that it's the pointers passed via the iovec to If I follow the path from netty to SocketChannelImpl to IOUtil to SocketDispatcher and finally to the substitutions - then it seems the most relevant bit is here: That's where ByteBuffers are translated to IOVecWrapper where the native memory lives. The SocketDispatcher after that is just passing it on to writev0: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/sun/nio/ch/SocketDispatcher.java/#51 So the final bit of code I can see is the WordFactory.pointer seems to use WordBoxFactory.box with the address that we passed in. This must be implemented in BoxFactoryImpl in It doesn't strike me that this should have any impact on pointer finally ending up in the So I'm at the end of my road here I think - input most welcome! @cstancu @Peter-B-Kessler |
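For readers following the call path described above: at the Java level, the operation that typically ends up in writev on a stock JDK is a gathering write, `GatheringByteChannel.write(ByteBuffer[])`. The following is only a minimal sketch of such a call, assuming a `Pipe` as a stand-in for the socket so that it runs without a server; the class and method names (`GatheringWriteDemo`, `roundTrip`) are hypothetical and not taken from netty or SubstrateVM.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.charset.StandardCharsets;

final class GatheringWriteDemo {
    /** Writes one heap buffer and one direct buffer in a single gathering call and reads the bytes back. */
    static String roundTrip(String first, String second) {
        try {
            Pipe pipe = Pipe.open(); // stand-in for the socket channel (assumption)
            try (Pipe.SinkChannel sink = pipe.sink(); Pipe.SourceChannel source = pipe.source()) {
                ByteBuffer heap = ByteBuffer.wrap(first.getBytes(StandardCharsets.US_ASCII));
                byte[] secondBytes = second.getBytes(StandardCharsets.US_ASCII);
                ByteBuffer direct = ByteBuffer.allocateDirect(secondBytes.length);
                direct.put(secondBytes);
                direct.flip();

                ByteBuffer[] srcs = { heap, direct };
                int total = heap.remaining() + direct.remaining();
                // Gathering write: the buffers are handed to the channel as one vector.
                while (heap.hasRemaining() || direct.hasRemaining()) {
                    sink.write(srcs);
                }

                ByteBuffer in = ByteBuffer.allocate(total);
                while (in.hasRemaining() && source.read(in) >= 0) {
                    // keep reading until everything written has arrived
                }
                in.flip();
                return StandardCharsets.US_ASCII.decode(in).toString();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello ", "world"));
    }
}
```

On a regular JVM this succeeds for both buffer kinds. An EFAULT from the underlying system call would suggest that one of the addresses handed over for the buffers is not valid in the running process.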
A quick check that the iovec address doesn't get altered:
gives:
|
Where are your Is the Can you follow the arguments through From there we go directly to What is the returned result from Thanks for your diagnosis of this problem. |
Yes, more exactly line 1355 in I'm not quite sure how to investigate the I'm not a Java developer and I'm not using any IDE/debugger - happy for brief instruction on useful tooling you'd use here! The relevant code for setting up the IOVecWrapper must be lines 110 to 148 here: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/sun/nio/ch/IOUtil.java/#110 I can't seem to find the source for SocketDispatcher anywhere on the graal side, so I'm not sure how I'd modify it to be able to observe the values passing through there. Uio.writev definitely returns n < 0 as we hit the case where we call Sounds like my current tooling may need an upgrade to dig a level deeper. |
Since you are comfortable with modifying the source and rebuilding, but not debugging, for "tooling" I would recommend adding tracing code. There is nothing for What do you get if you replace
(with the usual caveats about code written in a post, not in an IDE. :-) If Rummaging through the fields of |
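As an illustration of the kind of tracing suggested above (not the snippet from that comment, which is not preserved here), a small helper can dump the state of each buffer just before a gathering write. This is a sketch under assumptions: `BufferTrace` and `describe` are hypothetical names, and it only uses public `ByteBuffer` accessors, so it cannot show the native address of a direct buffer.

```java
import java.nio.ByteBuffer;

final class BufferTrace {
    /** Returns one line per buffer: kind (heap/direct), position, limit and remaining bytes. */
    static String describe(ByteBuffer[] srcs) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < srcs.length; i++) {
            ByteBuffer b = srcs[i];
            sb.append("srcs[").append(i).append("]: ")
              .append(b.isDirect() ? "direct" : "heap")
              .append(" pos=").append(b.position())
              .append(" lim=").append(b.limit())
              .append(" rem=").append(b.remaining())
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ByteBuffer[] srcs = { ByteBuffer.allocate(4), ByteBuffer.allocateDirect(8) };
        System.out.print(describe(srcs));
    }
}
```

Printing this right before the failing write would at least show whether direct buffers are involved and whether their positions and limits look sane.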
@Peter-B-Kessler I ended up first sick and then busy - sorry about delay. I ran through that trace:
The call to As I expected it shows nothing I didn't expect from other traces etc I've done. When I say I'm not using a debugger, that doesn't mean I wouldn't be happy to give it a whirl. It's just not part of my standard day-to-day setup when I deal with clojure. If you have suggestions for what to use and rough guidelines for inspecting the iovec struct I'd be happy to give it a go. It does strike me that that's what we'll need to get to the next step on why we get bad address, and then to be able to work out why/when/how it ends up "bad" (or if it starts "bad"). |
@Peter-B-Kessler happy to dig more - incl debuggers etc. If you have the time to briefly describe what you'd do next I'm happy to try to run with it. |
I seem to have run into the exact same issue with the latest master build of substratevm:
|
Does this failure happen on every call to Does it only happen when I can try to write some Does anyone who hits this run with |
In my case it's 100% reproducible: I start an http server, and every time I connect to it I get this error. I can try to make the test smaller, but otherwise I can also give you the jar if it helps? |
I haven't tried to exercise the function through other code paths than the
one through netty that's shown in my reproduction project mentioned
elsewhere in this issue. For that reproduction case it's 100% reproducible
as far as I can tell.
|
Awesome. Can you share it in a public git repository?
…On Fri, Aug 3, 2018 at 3:07 PM Stéphane Épardaud ***@***.***> wrote:
I've managed to reproduce this with a very small vertx use-case, where it
only happens if I use the Chunked HTTP encoding.
|
Cool - and setting `setChunked(false)` instead makes it pass, right?
…On Fri, Aug 3, 2018 at 3:58 PM Stéphane Épardaud ***@***.***> wrote:
Sure: https://github.com/FroMage/native-vertx-chunked-fail
|
When I build an image of
and, needless to say, that warning about missing an I am going to hand this over to our |
A nit on the otherwise exemplary README.txt file: The lines
made me (just a VM engineer) think that "Load http://localhost:9000/" was output from the application, so I stared at it not doing anything until I thought to take a browser and point it at that URL. But I can reliably reproduce the problem. I get (at least) two of them per refresh in the browser. |
I added some tracing to
That is, I have been handed an Looking at contents of the Just for fun (?) I used
Not much news there. Not only is I am hoping that the problem is in the |
I added some tracing to
The I still do not know whose fault that is. Should there be 5 |
I was hoping that #400 (comment) meant I could set
I am obviously out of my depth. |
Sorry about the lack of a working version, I just pushed it:
vertx.createHttpServer().requestHandler(req -> {
    // this doesn't work
    req.response().setChunked(true);
    // this alternative works
    // req.response().putHeader("Content-Length", "16");
    req.response().write(Buffer.buffer("Hi from buffer 2"));
    req.response().end();
}).listen(9000);
Basically if you're not chunking, you have to set the HTTP content length instead. |
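To make the difference between the two variants concrete: with HTTP/1.1 chunked transfer coding, each chunk is framed as its length in hexadecimal, CRLF, the data, CRLF, and the body ends with a zero-length chunk (`0` CRLF CRLF). With a Content-Length header the 16 bytes are sent unframed. The sketch below only illustrates that framing with plain strings; `ChunkedFraming` is a hypothetical name and this is not how vertx or netty implement it.

```java
import java.nio.charset.StandardCharsets;

final class ChunkedFraming {
    /** Frames the given chunks using HTTP/1.1 chunked transfer coding (no trailers, no extensions). */
    static String encode(String... chunks) {
        StringBuilder out = new StringBuilder();
        for (String chunk : chunks) {
            int len = chunk.getBytes(StandardCharsets.UTF_8).length;
            if (len == 0) {
                continue; // a zero-length chunk would be read as the end of the body
            }
            out.append(Integer.toHexString(len)).append("\r\n")
               .append(chunk).append("\r\n");
        }
        out.append("0\r\n\r\n"); // last chunk marker plus the final CRLF
        return out.toString();
    }

    public static void main(String[] args) {
        // Show the control characters visibly instead of emitting raw CR/LF.
        System.out.println(encode("Hi from buffer 2").replace("\r", "\\r").replace("\n", "\\n"));
    }
}
```

The extra framing pieces (the CRLF separators and the final `0` CRLF CRLF) only exist in the chunked variant, which is one plausible reason why only that variant exercises additional buffers on the write path.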
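I was able to get rid of the unsafe warning by following https://medium.com/graalvm/instant-netty-startup-using-graalvm-native-image-generation-ed6f14ff7692 and adding this substitution:
import com.oracle.svm.core.annotate.Alias;
import com.oracle.svm.core.annotate.RecomputeFieldValue;
import com.oracle.svm.core.annotate.RecomputeFieldValue.Kind;
import com.oracle.svm.core.annotate.TargetClass;
@TargetClass(className = "io.netty.util.internal.shaded.org.jctools.util.UnsafeRefArrayAccess")
final class Target_io_netty_util_internal_shaded_org_jctools_util_UnsafeRefArrayAccess {
    // Have the image builder recompute the array index shift for Object[] instead of keeping the build-time value.
    @Alias @RecomputeFieldValue(kind = Kind.ArrayIndexShift, declClass = Object[].class)
    public static int REF_ELEMENT_SHIFT;
}
However, that does not appear related to this particular bad address issue, because it does not fix it. |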
The underlying problem is that the netty code constructs I will add a detector so the problem source code will be identified and we will not construct images with |
Thanks. Do you have any pointer to the netty code that does this? |
I only saw places where netty would do that only if |
Also, I assume you guys must have a pretty damn good reason to run static init code during compilation and not run time, right? Why do you do that? Because it sounds very dangerous to me since those static init blocks are expected to be called at run-time in the Java semantics. There must be a ton of code that does IO in static blocks and they must surely break if that is not done at run-time. I'm sure you must get that question a lot, sorry if it's documented somewhere and I didn't see it, but perhaps you can point me to it? |
That we run static initializers during image building is documented as a limitation. There are lots of reasons to want to run static initializers during image building (initialization order, faster startup of the generated image, etc.). We are working on a mechanism to allow static initializers to be delayed until runtime, but it is not ready quite yet. |
Thanks for the explanation. I haven't run any kind of study on how most static init is written in Maven Central, but 20 years of writing Java code tells me that I expect a significant percentage (>20%) of init code that can only work at runtime, so I strongly suspect that the new mechanism to run static init at run-time will be extremely useful in getting most apps running on graal. |
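Until such a mechanism is available, one source-level way to keep process-specific state out of a static initializer is to create it lazily on first use. The following is a minimal sketch under assumptions: it assumes the accessor is never called while the image is being built, it uses a direct `ByteBuffer` as the example resource because this thread later traces the problem to static direct buffers, and `LazyDirectBuffer` is a hypothetical name.

```java
import java.nio.ByteBuffer;

final class LazyDirectBuffer {
    // Deliberately NOT assigned in a static initializer, so no native memory is allocated at class-init time.
    private static ByteBuffer crlf;

    /** Allocates the direct buffer on first use and hands out independent views of it. */
    static synchronized ByteBuffer crlf() {
        if (crlf == null) {
            ByteBuffer b = ByteBuffer.allocateDirect(2);
            b.put((byte) '\r').put((byte) '\n');
            b.flip();
            crlf = b;
        }
        return crlf.duplicate(); // shares content, but has its own position and limit
    }

    public static void main(String[] args) {
        ByteBuffer view = crlf();
        System.out.println("direct=" + view.isDirect() + " remaining=" + view.remaining());
    }
}
```

This obviously only helps for code you control; for third-party libraries the substitution approach discussed in this thread is the analogous tool.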
OK, I found it: it was in io.netty.handler.codec.http.HttpObjectEncoder that we had static direct buffers for the last chunks. I was able to fix them with this subst:
import static io.netty.buffer.Unpooled.buffer;
import static io.netty.buffer.Unpooled.directBuffer;
import static io.netty.buffer.Unpooled.unreleasableBuffer;
import static io.netty.handler.codec.http.HttpConstants.CR;
import static io.netty.handler.codec.http.HttpConstants.LF;
import com.oracle.svm.core.annotate.Alias;
import com.oracle.svm.core.annotate.RecomputeFieldValue;
import com.oracle.svm.core.annotate.RecomputeFieldValue.Kind;
import com.oracle.svm.core.annotate.TargetClass;
import io.netty.buffer.Unpooled;
import io.netty.buffer.ByteBuf;
import jdk.vm.ci.meta.MetaAccessProvider;
import jdk.vm.ci.meta.ResolvedJavaField;
@TargetClass(className = "io.netty.handler.codec.http.HttpObjectEncoder")
final class Target_io_netty_handler_codec_http_HttpObjectEncoder {
    // Recompute both static buffers for the image instead of keeping the direct buffers built by the static initializer.
    @Alias @RecomputeFieldValue(kind = Kind.Custom, declClass = Recomputer1.class)
    private static ByteBuf CRLF_BUF;
    @Alias @RecomputeFieldValue(kind = Kind.Custom, declClass = Recomputer2.class)
    private static ByteBuf ZERO_CRLF_CRLF_BUF;
}
final class Recomputer1 implements RecomputeFieldValue.CustomFieldValueComputer {
    @Override
    public Object compute(MetaAccessProvider metaAccess, ResolvedJavaField original, ResolvedJavaField annotated,
            Object receiver) {
        // Heap buffer holding CR LF.
        return Unpooled.unreleasableBuffer(buffer(2).writeByte(CR).writeByte(LF));
    }
}
final class Recomputer2 implements RecomputeFieldValue.CustomFieldValueComputer {
    private static final byte[] ZERO_CRLF_CRLF = { '0', CR, LF, CR, LF };
    @Override
    public Object compute(MetaAccessProvider metaAccess, ResolvedJavaField original, ResolvedJavaField annotated,
            Object receiver) {
        // Heap buffer holding the last-chunk marker "0" CR LF CR LF.
        return Unpooled.unreleasableBuffer(buffer(ZERO_CRLF_CRLF.length)
                .writeBytes(ZERO_CRLF_CRLF));
    }
}
|
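The substitution above swaps direct buffers for heap buffers. As a plain `java.nio` illustration of why that distinction might matter (a sketch only, using hypothetical names and no netty or SubstrateVM types): a heap buffer keeps its bytes in a Java `byte[]`, whereas a direct buffer refers to native memory allocated by whichever process created it, which, as I read this thread's diagnosis, is the image builder when the buffer is created in a static initializer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class LastChunkBuffers {
    private static final byte[] ZERO_CRLF_CRLF = "0\r\n\r\n".getBytes(StandardCharsets.US_ASCII);

    /** Heap-backed buffer: the content lives in an ordinary Java byte array. */
    static ByteBuffer heapCopy() {
        return ByteBuffer.wrap(ZERO_CRLF_CRLF.clone());
    }

    /** Direct buffer: the content lives in native memory owned by the allocating process. */
    static ByteBuffer directCopy() {
        ByteBuffer b = ByteBuffer.allocateDirect(ZERO_CRLF_CRLF.length);
        b.put(ZERO_CRLF_CRLF);
        b.flip();
        return b;
    }

    public static void main(String[] args) {
        System.out.println("heap:   hasArray=" + heapCopy().hasArray() + " isDirect=" + heapCopy().isDirect());
        System.out.println("direct: hasArray=" + directCopy().hasArray() + " isDirect=" + directCopy().isDirect());
    }
}
```

Both buffers carry the same five bytes; only where those bytes are stored differs.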
I think I see
Starting from the static initialization during image building of (A) Adding a little more tracing to
I think that I do not think we can allow |
Adding some experimental tracing to the image builder to find
The "object" trace is the class hierarchy. I don't know if that is useful. The "parsing" trace shows the method whose bytecodes we were examining when the |
I think it would help a lot more if you could instrument the constructor of |
@FroMage Your workaround that imports from I seem to get errors like this whatever I do using java9 or java10:
I've tried adding JVM flag |
I think you have to run a JVM that is the jdk8 from Graal (with compiler extensions) or jdk11. |
If I use the jdk8 from graal (rc5) then I see just:
```
Compiling 7 Java source files...
ERROR:
/Users/sundbp/dev/dotfiles/boot/cache/tmp/Users/sundbp/dev/native-image-grpc-test/jq5/-ru88jt/SVMSubstitutions.java,
line 22: package jdk.vm.ci.meta does not exist
```
Not sure if that package was around at all in jdk8?
Are you using jdk11?
|
I am indeed using jdk11, but I got that code from the graal API that runs on jdk8. Probably you need a system jar added to your build in java8? |
@FroMage Re #400 (comment) (Re)Initialization at runtime seems like the better answer. That will require some work on the part of the application programmer. Either, as you have found with |
@sundbp Re #400 (comment) jdk.vm.ci.meta.MetaAccessProvider and jdk.vm.ci.meta.ResolvedJavaField are imported from jvmci-api.jar which, in GraalVM rc5, is under $GRAALVM_HOME/jre/lib/jvmci/jvmci-api.jar. |
I worked it out - I think. Previously I created an uberjar that included my substitution classes, and to build that I needed these jars on the classpath. Now I create the uberjar EXCLUDING the substitution classes and add them to the classpath only when I run native-image; svm.jar, jvmci-api.jar etc. then get added automatically and all is well.
FWIW - for my gRPC example the workaround didn't do the trick. Not surprising, as gRPC uses the http2 codec, so patching the http codec doesn't seem to be the whole story. But I can be inspired by that workaround and grep around for similar use of DirectByteBuffer and static initializers relating to http2 and see if I can work out a similar patch for my case.
In terms of the static initializers issue I agree that in an ideal world you'd avoid them, switch to doing the work at runtime etc. But in the world we're in, it seems to me that a sufficiently large portion of the dependencies you find in many, many projects have such usage, and hence the limitation means a lot of projects have little hope of using native-image without extensive detective work over many dependencies to add a lot of RecomputeFieldValue substitutions (or similar). Compared to e.g. working around the use of reflection, the work involved here is quite major (it seems to me at least). Working on ideas to support this use case seems very worthwhile to me in terms of increasing the reach of native-image.
|
After RC6 was released with delayed static initialization I gave my repo another go. When doing so I realized I had forgotten about needing to generate the .class files for my substitutions on my last try (too used to clojure compile on demand). I fixed up my build, as well as moved to RC6 - and I got a working test case! The updated repo is at https://gitlab.com/sundbp/native-image-grpc-test |
I'm trying to make a CLI tool for accessing a grpc API run using native-image. It's using grpc java libs, and hence grpc-netty. I've managed to make an executable, and via a reflection config file I dealt with netty's ReflectiveChannelFactory. I can see that I can actually create the grpc client and see a few TCP packets between my client and server. So far so good.
When I actually try to make an API call I get an error relating to a failed write to the socket. I tried both on OSX and Linux, same exact outcome.
The stacktrace is:
I'm feeling close to getting it all to work, but stumped at this point now. I'm hoping that more experienced eyes may make something out of why the writev() fails here?
As I said, on connecting the grpc client to the server I can see a few packets of negotiations being sent so it seems to have the "ability to write to socket" working somewhere inside the image..
Very encouraged by the startup time after using native-image! Makes clojure cli tools nice to use.
Thanks!