
SIGBUS with ~StubRoutines::jlong_disjoint_arraycopy #442

Closed
freakolowsky opened this Issue Feb 2, 2015 · 30 comments


freakolowsky commented Feb 2, 2015

Using 1.0.6.

After my file DB passes 6 GB I sometimes get the following JRE crash:

A fatal error has been detected by the Java Runtime Environment:

SIGBUS (0x7) at pc=0x00007fb3d904217a, pid=2935, tid=140392178644736

JRE version: Java(TM) SE Runtime Environment (7.0_60-b19) (build 1.7.0_60-b19)
Java VM: Java HotSpot(TM) 64-Bit Server VM (24.60-b09 mixed mode linux-amd64 compressed oops)
Problematic frame:
v ~StubRoutines::jlong_disjoint_arraycopy

This happens when calling commit or compact. After a restart the store is recovered normally (all I have to do is remove the failed compact files).

DB make code:

        DB dbState = DBMaker
                .newFileDB(new File(location, "state"))
                .asyncWriteEnable()
                .cacheLRUEnable()
                .cacheSize(100)
                .mmapFileEnableIfSupported()
                .closeOnJvmShutdown()
                .make();

I'm trying it now with mmap disabled, but compacting in this state is painfully slow. I will try with partial mmap (after the current compact run finishes :D )

freakolowsky commented Feb 2, 2015

Tried with partial mmap ... too slow to use in my case, so I'm stuck with this bug.

freakolowsky commented Feb 2, 2015

As a temporary measure I changed the main collection on this DB, removed valuesOutsideNodesEnable(), and added more aggressive maintenance, so the DB never goes too far past the 1 GB marker... hopefully this will work around the issue.

I'll just have to wait for BTree compacting.

... but I'd still like to know WHY I was getting that error.

Owner

jankotek commented Feb 2, 2015

Hi,

I sometimes get a similar error when using Unsafe storage in MapDB 2. I would bet it is something related to memory-mapped files.

This article says it could mean that a write into a memory-mapped file failed: http://www.transylvania-jug.org/archives/5794

I will start on 1.0.6, so I will investigate it there as well.

niels1voo commented Feb 6, 2015

I got this error when building master. VolumeTest triggered it.

Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.004 sec - in org.mapdb.UtilsTest
Running org.mapdb.VolumeTest
Run volume tests. Free space: 612564393984
org.mapdb.Volume$ByteArrayVol@559d2ed6
org.mapdb.Volume$SingleByteArrayVol@7d0a2dbe
org.mapdb.Volume$MemoryVol@41da05f2,direct=true
org.mapdb.Volume$MemoryVol@2c5a40f4,direct=false
org.mapdb.Volume$UnsafeVolume@457bf123

A fatal error has been detected by the Java Runtime Environment:

SIGSEGV (0xb) at pc=0x0000000107fa5062, pid=73621, tid=4867

JRE version: Java(TM) SE Runtime Environment (7.0_71-b14) (build 1.7.0_71-b14)

Java VM: Java HotSpot(TM) 64-Bit Server VM (24.71-b01 mixed mode bsd-amd64 compressed oops)

Problematic frame:

v ~StubRoutines::jlong_disjoint_arraycopy

Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again

An error report file with more information is saved as:

/Users//repo/third/MapDB/hs_err_pid73621.log

If you would like to submit a bug report, please visit:

http://bugreport.sun.com/bugreport/crash.jsp

/bin/sh: line 1: 73621 Abort trap: 6 /Library/Java/JavaVirtualMachines/jdk1.7.0_71.jdk/Contents/Home/jre/bin/java -jar /Users//repo/third/MapDB/target/surefire/surefirebooter4802539880473851706.jar /Users//repo/third/MapDB/target/surefire/surefire5886035989891522075tmp /Users//repo/third/MapDB/target/surefire/surefire_01025126588681928282tmp

jankotek commented Feb 6, 2015

@niels1voo That is a different kind of bug, coming from the new Unsafe storage. I just fixed it.

@jankotek jankotek added 1.0 bug labels Feb 11, 2015

jankotek commented Feb 11, 2015

I started investigating this issue in 1.0.6.

Most likely it's caused by the Async Writer. There is some shifting during compaction, and I think the Async Writer somehow writes to a file which has already been closed. That could cause this error in 1.0.6.

jankotek added a commit that referenced this issue Feb 11, 2015

jankotek added a commit that referenced this issue Feb 11, 2015

jankotek added a commit that referenced this issue Feb 11, 2015

jankotek commented Feb 11, 2015

I think I understand what's going on now. MapDB forces DirectByteBuffer unmapping during compaction, during commit, and when the store is closed. Async Write could in theory try to write into a closed file, which would cause a JVM crash. I have not found any race condition, but just in case I disabled the DBB unmapping hack when Async Write is enabled. This fix will be in 1.0.7.
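For context, the "DBB unmapping hack" discussed here is the well-known sun.misc.Cleaner reflection trick used on JDK 7/8 to release a mapping immediately instead of waiting for GC. A minimal sketch of that technique (not MapDB's actual code; on JDK 9+ this reflective access is blocked without extra JVM flags):

```java
import java.lang.reflect.Method;
import java.nio.MappedByteBuffer;

public class CleanerHack {
    /**
     * Force-unmap a MappedByteBuffer via the sun.misc.Cleaner reflection
     * hack (JDK 7/8 only). After this call the mapping is gone: any thread
     * that still holds a reference and writes into the buffer touches
     * unmapped memory and takes down the whole JVM with SIGBUS/SIGSEGV --
     * which is why no async writer may use the buffer after unmapping.
     */
    static void unmap(MappedByteBuffer buf) throws Exception {
        Method cleanerMethod = buf.getClass().getMethod("cleaner");
        cleanerMethod.setAccessible(true);          // DirectByteBuffer is package-private
        Object cleaner = cleanerMethod.invoke(buf); // sun.misc.Cleaner instance
        Method clean = cleaner.getClass().getMethod("clean");
        clean.setAccessible(true);
        clean.invoke(cleaner);                      // releases the mapping immediately
    }
}
```

The trade-off named later in this thread follows directly: skipping this hack leaks the file handle until GC, but removes the window in which a stale writer can crash the JVM.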

@freakolowsky I could not replicate the issue, but I think the current 1.0.7-SNAPSHOT might fix it. Would you please try to replicate it with the latest version from the snapshot repo? Add this to your POM:

<repositories>
    <repository>
        <id>sonatype-snapshots</id>
        <url>https://oss.sonatype.org/content/repositories/snapshots</url>
    </repository>
</repositories>

<dependencies>
    <dependency>
        <groupId>org.mapdb</groupId>
        <artifactId>mapdb</artifactId>
        <version>1.0.7-SNAPSHOT</version>
    </dependency>
</dependencies>

freakolowsky commented Feb 12, 2015

How stable is this snapshot? The only way to reproduce this bug in my case is running it on one of our production sites, and I'm currently still handling the lag from last month's crash, so I can't afford too many mistakes at the moment.

I'll try to reproduce it in a test environment, but I'm not promising anything.

jankotek commented Feb 12, 2015

It will become 1.0.7 in a few days. Perhaps wait for stable release.

mhuffman-r7 commented Apr 22, 2015

We have also just started experiencing this issue. We've only seen it once in the field, but are trying to reproduce locally on 1.0.6 so we can confirm 1.0.7 does indeed fix the issue.

@freakolowsky Were you able to verify 1.0.7 in your environment?

mhuffman-r7 commented Apr 22, 2015

I have been able to artificially reproduce this (including version 1.0.7) with concurrent access by multiple JVMs. I do not think this is supported because of the lack of file locks, but it does fail in nearly the same way with the SIGBUS. Not sure if that helps or not.

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0xa) at pc=0x000000010f0be1d6, pid=14270, tid=23299
#
# JRE version: Java(TM) SE Runtime Environment (8.0_40-b25) (build 1.8.0_40-b25)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.40-b25 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# v  ~StubRoutines::jbyte_disjoint_arraycopy
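Concurrent access from multiple JVMs is indeed unsupported here, since the store takes no OS-level lock. An application can at least fail fast with an advisory lock instead of corrupting the file; a hypothetical guard (StoreGuard is not part of MapDB, just an illustration of `FileChannel.tryLock`):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;

public class StoreGuard {
    /**
     * Try to take an exclusive advisory lock on a sibling ".lock" file.
     * Returns null if another process already holds the lock, in which
     * case the caller should refuse to open the store.
     */
    static FileLock tryLock(File store) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(
                new File(store.getPath() + ".lock"), "rw");
        FileLock lock = raf.getChannel().tryLock();
        if (lock == null) {
            raf.close();   // someone else owns the store
        }
        return lock;       // keep open for the store's lifetime
    }
}
```

Advisory locks only protect cooperating processes, so every JVM touching the store would need to use the same guard.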

jankotek commented Apr 23, 2015

I can reproduce this as well, when MapDB runs out of disk space. I think it is caused when an mmap write fails for some reason.

mhuffman-r7 commented Apr 29, 2015

We have tested out 1.0.7 and don't believe it addresses the SIGBUS problem. We have anecdotal evidence that suggests this happened much more frequently after the 1.0.5 release, though. This is starting to become pretty critical for our deployment of MapDB. What do you think the next steps are?

jankotek commented Apr 30, 2015

@mhuffman-r7 I don't have a good answer. I will write more tests and try to isolate this issue.

mhuffman-r7 commented Apr 30, 2015

Thanks @jankotek. We've still been trying to reproduce this with a smaller case without success, so it certainly must be an issue with scale/size/environment that is hard to pinpoint. If we have a breakthrough we'll post it here.

We have considered disabling the memory-mapped configuration, but we are finding in our usage that it degrades performance by about 30-40% overall, so it's less than ideal.

freakolowsky commented Apr 30, 2015

@mhuffman-r7 I'm on 1.0.7, but still running with my workaround (aggressive maintenance that keeps my DB small, i.e. under 2 GB). I haven't seen a SIGBUS since.

I also clone the DB and overwrite the actual one with the clone every day to deal with node fragmentation (@jankotek: until this is handled internally, it would be really useful to have a method for that purpose in org.mapdb.DB).

I've never hit this bug on setups where DB sizes are rather small, so I do my best to keep them that way, even though I process a lot of data: on one of my high-load setups over 5*10^9 records, each with up to 20 dependent child records on average, are processed daily, but I retain very little of it. On that setup fragmented nodes can bloat the state file 5x or more. An example from this morning: the state file was 1.2 GB before cloning and 0.3 GB after (and cloning just copies from one DB to another).

@jankotek I could run my code without the workaround on dummy data in my testing environment if that would help you identify the issue. Just tell me what to look for.

jankotek commented Apr 30, 2015

I will prepare some test cases; perhaps you could run them on your setup.

mhuffman-r7 commented Jun 4, 2015

Has any progress been made on this research?

jankotek commented Jun 18, 2015

I started investigating this issue again. So far I can only reproduce it if an mmap write fails, for example when free disk space is exhausted. I will report soon.

jankotek commented Jun 18, 2015

I have a new theory for why the JVM crashes. MapDB uses sparse files; ByteBuffers are mapped beyond the end of the file. When data are written into a mapped ByteBuffer beyond EOF, the file size expands. If the file cannot expand for some reason, that could crash the JVM.
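That theory can be sketched with plain NIO, independent of MapDB (demo class and method names are made up for illustration). Mapping past EOF grows the file's logical size, but on a sparse filesystem no blocks are allocated until a page is first touched; if that allocation fails (e.g. disk full), the process receives SIGBUS rather than an IOException:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class SparseMapDemo {
    /** Map 4 KB at the start of an empty file, then write one long into it.
     *  Returns {fileSizeAfterMap, fileSizeAfterWrite}. */
    static long[] mapBeyondEof(File f) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw");
             FileChannel ch = raf.getChannel()) {
            // The file is empty; map() grows its *logical* size to 4096,
            // but a sparse filesystem allocates the backing blocks lazily.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            long afterMap = ch.size();
            // The first write touches the page. If the filesystem cannot
            // allocate the block (disk full), the JVM gets SIGBUS right
            // here -- there is no exception to catch.
            buf.putLong(0, 42L);
            buf.force();
            return new long[]{afterMap, ch.size()};
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("sparse", ".bin");
        f.deleteOnExit();
        long[] sizes = mapBeyondEof(f);
        System.out.println("after map: " + sizes[0] + ", after write: " + sizes[1]);
    }
}
```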

Here is a patch; so far it solves the JVM crash I had with exhausted free space. I will polish it a bit and apply it to 1.0.8-SNAPSHOT. Feel free to try it on your JVM.

Index: src/main/java/org/mapdb/Volume.java
===================================================================
--- src/main/java/org/mapdb/Volume.java (revision 9f2b180dff6f28a8fba0942337375c952070ca59)
+++ src/main/java/org/mapdb/Volume.java (revision )
@@ -558,6 +558,20 @@
             try {
                 assert((offset&chunkSizeModMask)==0);
                 assert(offset>=0);
+                //write to end of file, to make sure space is allocated, see #442
+                if(!readOnly) {
+                    long len = fileChannel.size();
+                    long pos = offset+chunkSize*2+1;
+                    pos-=pos%chunkSize;
+                    if(len<pos) {
+                        ByteBuffer b = ByteBuffer.allocate((int) (pos - len));
+                        while (b.remaining()>0) {
+                            len+=fileChannel.write(b, len);
+                        }
+                    }
+
+                }
+
                 ByteBuffer ret = fileChannel.map(mapMode,offset,chunkSize);
                 return ret;
             } catch (IOException e) {
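The core idea of the patch, stripped of MapDB's internals and chunk-rounding math, is to force real block allocation with explicit `FileChannel.write` calls before mapping, so an out-of-space condition surfaces as an IOException instead of a SIGBUS. A standalone sketch under those assumptions (class and method names are illustrative):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class PreallocatedMap {
    /**
     * Map a chunk at {@code offset}, first zero-filling the file up to the
     * end of the chunk so the filesystem must allocate blocks now.
     * A failed allocation throws IOException here, instead of crashing the
     * JVM later on the first write into the mapping.
     */
    static MappedByteBuffer mapChunk(FileChannel ch, long offset, int chunkSize)
            throws IOException {
        long end = offset + chunkSize;
        long len = ch.size();
        if (len < end) {
            ByteBuffer zeros = ByteBuffer.allocate((int) (end - len));
            while (zeros.remaining() > 0) {
                len += ch.write(zeros, len);   // may throw IOException on ENOSPC
            }
        }
        return ch.map(FileChannel.MapMode.READ_WRITE, offset, chunkSize);
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("prealloc", ".bin");
        f.deleteOnExit();
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw");
             FileChannel ch = raf.getChannel()) {
            MappedByteBuffer buf = mapChunk(ch, 0, 4096);
            buf.putLong(0, 1L);               // the page is already backed by real blocks
            System.out.println("file size: " + ch.size());
        }
    }
}
```

The cost is that files stop being sparse for the mapped region, which the actual patch accepts as the price of crash safety.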

jankotek commented Jun 19, 2015

So far I have found these causes for the JVM crash:

  • an mmap write beyond EOF when there is no disk space to expand the file, which crashes the JVM
  • a write into an mmap file after it was unmapped with the Cleaner hack
  • unmapping a buffer while its write cache still contains some data

So there are going to be changes (both in 1.0.8 and 2.0-beta1):

  • The unmap Cleaner hack will be disabled by default; one will have to enable it with a new setting (mmapFileUnmapHackEnable()). It is better to have a file handle leak than a JVM crash.
  • On Windows a warning will be logged if mmap files are enabled without the unmap Cleaner hack.
  • Mapped files will be synced prior to closing. This will result in lower performance for the Write Ahead Log.
  • A lot of stress testing.

mhuffman-r7 commented Jun 19, 2015

This sounds promising. In our environments we only see this with about 2-3 customers in production and haven't been able to reproduce it locally. FWIW, we have only seen this issue on Linux. Customers experiencing it sometimes also have file-too-large issues and other edge cases, so this seems to support your analysis. Thanks.

jankotek commented Jul 4, 2015

Unmap on close is now disabled by default in the 2.0 branch. There is an option to enable it: DBMaker.fileMmapCleanerHackEnable().

For 1.0 I don't want to disable it; that is too big a change in default behavior, so I will add a new option to disable the unmap hack. That should prevent the JVM crash.

jankotek added a commit that referenced this issue Jul 5, 2015

Volume: preallocate mmap file buffers, write to ByteBuffer does not expand file size. It could crash JVM. See #442

jankotek added a commit that referenced this issue Jul 5, 2015

Volume: preallocate mmap file buffers, write to ByteBuffer does not expand file size. It could crash JVM. See #442

jankotek added a commit that referenced this issue Jul 5, 2015

jankotek commented Jul 5, 2015

The MapDB 1.0 snapshot now has the option DBMaker.mmapFileCleanerHackDisable() to disable the Cleaner hack. That should prevent the JVM crashes.

MapDB 2.0 now has the Cleaner hack disabled by default. The option DBMaker.fileMmapCleanerHackEnable() in 2.0 enables it.

I consider my work on this bug done, so the current version of both repos will make it into a final release.

jankotek added a commit that referenced this issue Jul 5, 2015

jankotek commented Jul 9, 2015

MapDB 2.0-beta1 and 1.0.8 with the fixes were just released. In 1.0.8 you need to disable the Cleaner hack with DBMaker.mmapFileCleanerHackDisable().

Please let me know your results. Right now I do not know for sure whether this issue is fixed. I will leave it open for a couple of weeks and eventually close it if nobody can replicate it any longer.
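For anyone on 1.0.8, applying this to the builder chain from the opening report would look roughly like the fragment below. This is an untested sketch requiring the mapdb artifact; all builder methods other than mmapFileCleanerHackDisable() are taken verbatim from the original comment.

```java
DB dbState = DBMaker
        .newFileDB(new File(location, "state"))
        .asyncWriteEnable()
        .cacheLRUEnable()
        .cacheSize(100)
        .mmapFileEnableIfSupported()
        .mmapFileCleanerHackDisable()   // new in 1.0.8: trade a file-handle leak for crash safety
        .closeOnJvmShutdown()
        .make();
```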

jankotek added a commit that referenced this issue Jul 14, 2015

jankotek commented Jul 14, 2015

Correction: 2.0-beta2 does not have all the fixes and could still crash. I just added the fix to the master branch.

jankotek added a commit that referenced this issue Jul 15, 2015

jankotek commented Jul 27, 2015

No complaints, so I'm closing this issue.

@jankotek jankotek closed this Jul 27, 2015

mhuffman-r7 commented Dec 9, 2015

We have yet to try out the new changes in production, and may wait a while to attempt to do so. I'll let you know if we see any issues with this attempted fix.

paulhoule commented Jun 1, 2016

I ran into this error today and here are my observations.

I have a system that I run all the time on an m3.large instance because the SSD is not big enough to hold all the data generated by a batch job on an m3.medium. This batch job is also doing mmap I/O to another database, which is a Jena TDB database.

Anyhow, I changed the scripts that spin up the AWS instance, accidentally put in 'm3.medium', and found that in that condition I get this error consistently. It has worked with 'm3.large', and it has worked on my Windows machine, where I never had this problem.

This makes me wonder if this message is a roundabout way of saying "disk full", or at least can be in some situations.

jankotek commented Jun 1, 2016
