Persistent communication request provokes segfault in Java bindings. #369
@osvegis I'm unable to get this test to fail for me. Should I be seeing the memory usage of the process increase as the process runs? I increased SIZE to 1048576, and after running 2 procs of this on a single 128GB server, the memory usage (as reported by "top") still only shows 5.1% memory usage for both Java processes (it's been 5.1% the whole time):
Forgot to mention that I tried both master and the v1.8 branch -- so I'm guessing I'm not triggering the GC, and therefore not triggering the problem.
I also tried on master and the test crashes. This problem may have existed since the very beginning of the Java bindings integration: ompi-1.7.5 also fails.
Hmm. Ok. How do I get this test to reproduce, then? Is there a way that I can know for sure that the GC has fired?
@jsquyres You could insert an explicit call to System.gc() into the for loop (maybe guarded with if (i % 1024 == 0) so that it is not called on each iteration). Alternatively, you could use the -Xmx Java parameter to set the maximum heap size (see http://docs.oracle.com/cd/E19900-01/819-4742/abeik/index.html).
Don't worry about whether the GC fires. It fires when necessary: on each iteration we allocate memory, so if the GC never fired we would get an out-of-memory error.
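A minimal sketch of that suggestion, assuming a loop shaped like the test's (the SIZE value and loop bound here are illustrative, not the actual CrashTest source):

```java
public class GcPressure {
    static final int SIZE = 1048576;  // 1 MiB per iteration, as in the enlarged test

    public static void main(String[] args) {
        for (int i = 0; i < 100_000; i++) {
            byte[] garbage = new byte[SIZE];  // per-iteration allocation keeps heap pressure up
            if (i % 1024 == 0) {
                System.gc();  // explicit collection request so the crash window is hit sooner
            }
        }
    }
}
```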
:-( I'm still totally unable to reproduce this bug -- even on my OS X laptop (with only 16GB RAM). I don't doubt that there is a real issue here, but I'm somewhat stymied until I can reproduce it reliably...
My machine is a 3570T, with only 8GB RAM.
My laptop is the only machine I have access to with only 16GB -- I don't have access to anything with less memory than that. :-( FWIW, I even put in the call to System.gc(). Any suggestions?
What do you think about using a virtual machine?
Interesting idea. Let me see what I can fire up around here...
@osvegis is there any additional debugging you could do on your end? Could you send along the console output and backtrace(s) on the SEGV? I think with Hotspot there is a way to get a more verbose error log: http://www.oracle.com/technetwork/java/javase/felog-138657.html You might also try running your test program with additional JVM diagnostic options. There are some other general troubleshooting suggestions here: http://www.oracle.com/technetwork/java/javase/crashes-137240.html Jeff, you might also have trouble reproducing if you are running a different version of the JVM or a different platform. It might be very difficult for you to directly reproduce this problem yourself.
On 09/02/15 at 23:34, Dave Goodell wrote:
The following execution generated an error file: hs_err_pid####.log
Sometimes the crash is different. A SIGSEGV error appears, but the test continues!
You might be able to find the problem with a malloc debugger (like DUMA, a fork of Electric Fence: http://duma.sourceforge.net/).
On 10/02/15 at 00:15, Dave Goodell wrote:
```
$ export LD_PRELOAD=libduma.so.0.0.0
$ mpirun -np 2 duma java -cp ~/ompi-install/lib/mpi.jar:. CrashTest
DUMA 2.5.15 (shared library, NO_LEAKDETECTION)
DUMA 2.5.15 (shared library, NO_LEAKDETECTION)
DUMA Aborting: malloc() is not bound to duma.
DUMA 2.5.15 (shared library, NO_LEAKDETECTION)
DUMA Aborting: malloc() is not bound to duma.
```
I think OMPI's malloc hooks for memory registration are getting in the way. Try setting the MCA parameter that disables them.
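If that guess is right, one way to try it is shown below (assuming the memory_linux_disable MCA parameter is the setting meant here; it has to be in the environment before launch, since the hooks are installed before command-line MCA parameters are processed):

```
$ export OMPI_MCA_memory_linux_disable=1
$ export LD_PRELOAD=libduma.so.0.0.0
$ mpirun -np 2 duma java -cp ~/ompi-install/lib/mpi.jar:. CrashTest
```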
I get the same error.
Hmm. How about trying the same thing with a pure Java program (i.e., non-MPI)? That would tell us if Java is interposing its own malloc.
The same error.
Is there a way to tell if this same error happens with multiple different versions of the JVM?
I have 1.7.0_75.
I remember that Alexander had 1.8.0_25.
On my OSX machine:
On my Linux machine:
Just curious: does the same problem happen if you only use the TCP BTL (i.e., not the shared memory BTL)?
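For example, something like this restricts the run to the TCP and self BTLs (standard mpirun MCA selection; the classpath is copied from the DUMA run above):

```
$ mpirun --mca btl tcp,self -np 2 java -cp ~/ompi-install/lib/mpi.jar:. CrashTest
```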
I thought it worked ok, but after 4 attempts:
I'm kinda running out of ideas here -- how do we track this down?
I'd still proceed by trying to get some memory debugger like DUMA, Electric Fence, NJAMD, etc. that will add red zones to all allocations and trap out-of-bounds accesses. You could also try running Valgrind on the JVM, but it will probably be tricky unless you have a bit of experience with Valgrind. There's an SO post with some suggestions here: http://stackoverflow.com/questions/9216815/valgrind-and-java
@hppritcha i think you understood correctly what i meant per the java doc, so unless i am misreading the English (which happens quite often ...) this is equivalent.
i made PR #815 and it solves the issue for me, at least in my runs so far. generally speaking, i am wondering whether this fix is enough. bottom line, i think we might have to keep a pointer to the buffer in the Request class instead of the Prequest class. any ideas? in the meantime, can you confirm the test runs just fine on hopper with this PR?
Thanks Gilles.
@ggouaillardet your explanation makes sense to me, but as @osvegis can attest, I am far from a Java expert. 😄
I think what we will probably do is rewrite the direct buffer allocator methods in the MPI class to actually use a native method to create the buffer object. That, plus using the NewGlobalRef method, should prevent the GC from moving/deleting any of the buffers allocated for MPI calls. Hopefully that will also solve some of the problems we have been seeing trying to reproduce @osvegis's results for the ompi Java paper.
Direct buffers are never moved. They reside outside the Java heap, but when there are no references left to them, they are destroyed by the GC. If we keep a reference in the Request class, the buffer won't be destroyed.
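A small illustration of that lifetime rule (plain Java, no MPI; the System.gc() call is only a hint to the collector):

```java
import java.nio.ByteBuffer;

public class DirectBufferLifetime {
    public static void main(String[] args) {
        // The backing native memory lives outside the Java heap, but its
        // lifetime is tied to the reachability of the ByteBuffer object.
        ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20);
        buf = null;   // no reference left: the collector may free the native memory,
        System.gc();  // even if native code still holds a raw pointer into it
    }
}
```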
I am going to work on a solution that solves the problem by allocating the buffers on the C side, as @hppritcha suggested, to see if this also fixes the poor performance we have been seeing. I'll create a PR with the changes when I complete them.
@hppritcha and @nrgraham23, |
Okay, I don't want to use the patch from Gilles as is. It is only a band-aid for one test. There are many more places, basically all the non-blocking pt2pt and collectives, where the same problem effectively exists. Also, for some of the functions, like iGatherv, there are two buffers associated with the request. I think we've just been getting lucky with the way the Java bindings have been used that we've not seen this problem elsewhere.
Howard and I discussed a more robust solution that uses the idea Gilles suggested. We plan to use an array list to store the buffers so we can hold a variable number of them, and we will move it to the Request class so the non-blocking pt2pt and collective operations can store their buffers as well. Additionally, instead of modifying or adding constructors, we will add a single method that will need to be called to add the buffers to the array list. I'll try to get this PR up tonight so it can be discussed.
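A rough sketch of that plan (class shape and method name are hypothetical, not the actual bindings source):

```java
import java.nio.Buffer;
import java.util.ArrayList;

public class Request {
    // Strong references stored here keep the direct buffers reachable, so the
    // GC cannot release their native memory while a non-blocking or persistent
    // operation is still in flight.
    private final ArrayList<Buffer> buffers = new ArrayList<Buffer>();

    // The single method mentioned above: called wherever a request is created
    // with associated buffers (e.g. both buffers of an iGatherv).
    public void addBuffers(Buffer... bufs) {
        for (Buffer b : bufs) {
            buffers.add(b);
        }
    }
}
```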
This pull request adds an ArrayList of type Buffer to the Request class. Whenever a request object is created that has associated buffers, the buffers should be added to this array list so the Java garbage collector does not dispose of the buffers prematurely. This is a more robust expansion on the idea first proposed by @ggouaillardet Fixes open-mpi#369 Signed-off-by: Nathaniel Graham <ngraham@lanl.gov>
How does this look?
@nrgraham23 i did not test this yet, but at first glance, that looks good to me.
At the moment there are not (as far as I am aware). We decided to use an array list in case there comes a time when there are more than two buffers. I think I agree with you on the suggested change, but I will wait for others to weigh in as well. I'll also see if I can find some more information on the performance hit associated with each option.
@nrgraham23 in order to make garbage collection more efficient, should the references to the buffers be removed when a non-persistent request completes?
@ggouaillardet We could do something like that, but there would be performance costs associated with it. We could potentially set the references to null in the waitFor method, though we would have to verify that the request is not persistent. There could also be increased costs in other methods, like testStatus, which can be called multiple times before the request actually completes. In cases like that we would have to check both that the reference is not null and that the request is not persistent. I do not think it would be worthwhile to remove the references.
@nrgraham23 the cost of setting a pointer to null is very small compared to allocating an object.
@ggouaillardet Obviously setting a pointer to null costs fairly little; I was referring more to the additional checks we would also have to do. Your suggestion for removing the buffers would also cause a useless method call in the case of Prequests. It is not a bad idea, but since the Request objects will most likely go out of scope shortly after the request completes anyway, I do not think the extra work is worthwhile.
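For completeness, a hedged sketch of what the clearing under discussion could look like (all names illustrative; this is the variant the thread decided against):

```java
import java.nio.Buffer;
import java.util.ArrayList;

public class SketchRequest {
    private final ArrayList<Buffer> buffers = new ArrayList<Buffer>();
    private final boolean persistent;

    public SketchRequest(boolean persistent) {
        this.persistent = persistent;
    }

    public void addBuffers(Buffer... bufs) {
        for (Buffer b : bufs) {
            buffers.add(b);
        }
    }

    // Would run once the request is known to be complete: non-persistent
    // requests can drop their references so the GC reclaims the buffers
    // sooner; persistent requests keep theirs, since they may be restarted.
    public void onComplete() {
        if (!persistent) {
            buffers.clear();
        }
    }
}
```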
This pull request adds two Buffer references to the Request class. Whenever a request object is created that has associated buffers, the buffers should be added to these references so the Java garbage collector does not dispose of the buffers prematurely. This is a more robust expansion on the idea first proposed by @ggouaillardet, and further discussed in open-mpi#369. Fixes open-mpi#369 Manual cherry-pick from d363b5d Signed-off-by: Nathaniel Graham <ngraham@lanl.gov>
MPI corrupts the memory space of Java.
The following example provokes a segfault in the Java bindings.
Please see the comments in the example.
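The example itself is not reproduced in this extract. As a hedged reconstruction of the failing pattern described in the thread (method names recalled from the Open MPI Java bindings, not verified against the real CrashTest):

```java
import java.nio.ByteBuffer;
import mpi.*;

public class CrashTest {
    static final int SIZE = 1024;

    // The buffer reference is local to this method, so once it returns,
    // nothing on the Java side keeps the direct buffer alive -- only the
    // native persistent request still points at its memory.
    static Prequest makeRequest(int rank) throws MPIException {
        ByteBuffer buf = MPI.newByteBuffer(SIZE);
        return rank == 0
            ? MPI.COMM_WORLD.sendInit(buf, SIZE, MPI.BYTE, 1, 0)
            : MPI.COMM_WORLD.recvInit(buf, SIZE, MPI.BYTE, 0, 0);
    }

    public static void main(String[] args) throws MPIException {
        MPI.Init(args);
        Prequest req = makeRequest(MPI.COMM_WORLD.getRank());
        for (int i = 0; i < 1_000_000; i++) {
            byte[] garbage = new byte[SIZE];  // heap pressure so the GC eventually runs
            req.start();                      // after a collection, the buffer may be gone
            req.waitFor();                    // -> segfault in native code
        }
        MPI.Finalize();
    }
}
```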