Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exception in thread 'OrientDB WAL Flush Task' #9270

Closed
quanghaisoft opened this issue May 28, 2020 · 42 comments
Closed

Exception in thread 'OrientDB WAL Flush Task' #9270

quanghaisoft opened this issue May 28, 2020 · 42 comments
Assignees
Labels
Milestone

Comments

@quanghaisoft
Copy link

OrientDB Version: 3.0.31

Java Version: java version "1.8.0_211"

OS: ubuntu

Expected behavior

Upsert success, not raise any error

Actual behavior

Exception in thread 'OrientDB WAL Flush Task'
java.nio.BufferOverflowException
	at java.nio.Buffer.nextPutIndex(Buffer.java:527)
	at java.nio.DirectByteBuffer.putInt(DirectByteBuffer.java:706)
	at com.orientechnologies.orient.core.storage.impl.local.paginated.wal.cas.OCASDiskWriteAheadLog$RecordsWriter.run(OCASDiskWriteAheadLog.java:1920)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Can not update/insert data any more

Steps to reproduce

My OrientDB system has over 100 classes with over 50GB data, setup in cluster mode with 6 node, on 6 different IP.
When I upsert big data into 1 master node orientdb in distributed ,
Error appear in logs

Exception in thread 'OrientDB WAL Flush Task'
java.nio.BufferOverflowException
	at java.nio.Buffer.nextPutIndex(Buffer.java:527)
	at java.nio.DirectByteBuffer.putInt(DirectByteBuffer.java:706)
	at com.orientechnologies.orient.core.storage.impl.local.paginated.wal.cas.OCASDiskWriteAheadLog$RecordsWriter.run(OCASDiskWriteAheadLog.java:1920)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
@tglman tglman added the bug label Jun 1, 2020
@tglman tglman added this to the 3.0.x milestone Jun 1, 2020
@laa
Copy link
Member

laa commented Jun 3, 2020

Hi, @quanghaisoft what is the WAL page size in your environment? You can find size of the WAL page in the log. Do you set any non-default settings? Thank you for the bug report.

@quanghaisoft
Copy link
Author

thanks you for your answer , this is logs, recent day, this error not appear
config in server.sh
-Dstorage.wal.maxSegmentSize=128 -Dstorage.wal.maxSize=16384

2020-06-03 17:01:01:421 INFO Allocation of 972800 pages. [OEngineLocalPaginated]
2020-06-03 17:01:03:978 INFO WAL maximum segment size is set to 6,144 MB [OrientDBDistributed]
2020-06-03 17:01:03:978 INFO Databases directory: /data/orientdb/databases [OServer]
2020-06-03 17:01:04:039 INFO Direct IO for WAL located in /data/orientdb/databases/OSystem is allowed with block size 4096 bytes. [OCASDiskWriteAheadLog]
2020-06-03 17:01:04:040 INFO Page size for WAL located in /data/orientdb/databases/OSystem is set to 4096 bytes. [OCASDiskWriteAheadLog]
2020-06-03 17:01:04:104 WARNI Storage 'OSystem' was not closed properly. Will try to recover from write ahead log [OLocalPaginatedStorage]

@quanghaisoft
Copy link
Author

quanghaisoft commented Jun 4, 2020

We have written a lot of data simultaneously

@laa
Copy link
Member

laa commented Jun 4, 2020

Hi @quanghaisoft . I am not sure that I understood you. Do you mean that now you can not reproduce this error on your tests?

@quanghaisoft
Copy link
Author

quanghaisoft commented Jun 6, 2020

This error does not show again. We are running in production, so I cannot test it. If it appears I will open this issue

@quanghaisoft
Copy link
Author

this bug in my production so we copy data and rebuild cluster immediately.if it appears again,we will send it to you

@laa
Copy link
Member

laa commented Jun 9, 2020

I see thank you, I will try to reproduce with settings which you sent to me.

@laa laa removed the waiting reply label Jun 12, 2020
@laa
Copy link
Member

laa commented Jun 18, 2020

Hi @quanghaisoft what amount of RAM do you have on the server ?

@quanghaisoft
Copy link
Author

we using more than 150G ram , CPU 24 core on node , with 6 node in cluster

@quanghaisoft
Copy link
Author

quanghaisoft commented Jun 19, 2020

when we using multi write upsert to one vertex in 3 master node, it slow down , can you help me a recommend Config to active fast upsert with bigdata, avoid lock record when confict happen,avoid not response .etc
this is current config:

@quanghaisoft
Copy link
Author

quanghaisoft commented Jun 19, 2020

orient-server.log.1_master_node.txt
when we writting upsert command on 3 node master (multiwrite) it displays too many errror ,please see attactment
orient-server.log.txt

@laa
Copy link
Member

laa commented Jun 19, 2020

@quanghaisoft issue which you had with WAL was caused by integer overflow. It is already fixed. I can not help you with the issue with distributed storage because I am working on another set of components. Could you please create a separate issue for your problem?

@laa laa closed this as completed Jun 19, 2020
@quanghaisoft
Copy link
Author

I have just see this error again, i you give me a snapshot of 3.0.31 version

Jun 19 18:45:00 ograph06-76 server.sh[29054]: Exception in thread 'OrientDB WAL Flush Task'
Jun 19 18:45:00 ograph06-76 server.sh[29054]: java.nio.BufferOverflowException
Jun 19 18:45:00 ograph06-76 server.sh[29054]: at java.nio.Buffer.nextPutIndex(Buffer.java:527)
Jun 19 18:45:00 ograph06-76 server.sh[29054]: at java.nio.DirectByteBuffer.putInt(DirectByteBuffer.java:706)
Jun 19 18:45:00 ograph06-76 server.sh[29054]: at com.orientechnologies.orient.core.storage.impl.local.paginated.wal.cas.OCASDiskWriteAheadLog$RecordsWriter.run(OCASDiskWriteAheadLog.java:1920)
Jun 19 18:45:00 ograph06-76 server.sh[29054]: at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
Jun 19 18:45:00 ograph06-76 server.sh[29054]: at java.util.concurrent.FutureTask.run(FutureTask.java:266)
Jun 19 18:45:00 ograph06-76 server.sh[29054]: at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
Jun 19 18:45:00 ograph06-76 server.sh[29054]: at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
Jun 19 18:45:00 ograph06-76 server.sh[29054]: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
Jun 19 18:45:00 ograph06-76 server.sh[29054]: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Jun 19 18:45:00 ograph06-76 server.sh[29054]: at java.lang.Thread.run(Thread.java:748)

@laa laa reopened this Jun 19, 2020
@quanghaisoft
Copy link
Author

in others node have this error: Error during WAL writing
java.nio.file.NoSuchFileException: /data/orientdb/databases/OSystem
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileStore.readAttributes(UnixFileStore.java:111)
at sun.nio.fs.UnixFileStore.getUsableSpace(UnixFileStore.java:124)
at com.orientechnologies.orient.core.storage.impl.local.paginated.wal.cas.OCASDiskWriteAheadLog.checkFreeSpace(OCASDiskWriteAheadLog.java:1439)
at com.orientechnologies.orient.core.storage.impl.local.paginated.wal.cas.OCASDiskWriteAheadLog.access$4000(OCASDiskWriteAheadLog.java:52)
at com.orientechnologies.orient.core.storage.impl.local.paginated.wal.cas.OCASDiskWriteAheadLog$RecordsWriter.run(OCASDiskWriteAheadLog.java:2053)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

@laa
Copy link
Member

laa commented Jun 19, 2020

I see, interesting. I will fix another issue and then will be back to yours.

@laa
Copy link
Member

laa commented Jun 19, 2020

@quanghaisoft about directory : /data/orientdb/databases/OSystem does it really exist ?

@quanghaisoft
Copy link
Author

quanghaisoft commented Jun 19, 2020

I'm sorry, my server is running in production, so I copied the dir database in a master node to all nodes and restarted the distributed cluster, all I had was the log. dir

@quanghaisoft
Copy link
Author

one server full of log file with text:
Exception in thread 'OrientDB WAL Flush Task' java.nio.BufferOverflowException at java.nio.Buffer.nextPutIndex(Buffer.java:527)....

@laa
Copy link
Member

laa commented Jun 19, 2020

I did not get it. Did you copy file system data under live server and put it under the replicas or did you shutdown replicas? The correct way to do that is to remove broken databases from nodes and start "empty" replicas, the system will copy all the data correctly.

@quanghaisoft
Copy link
Author

quanghaisoft commented Jun 19, 2020

I have 5 servers 71,72,73,74,75,76, i writing data to a master node 71
,but it suddenly error , and all client writing error and stopped, so i login ssh and view log,
server 71,72,73,74,75 ok
server 76 has full of error: java.nio.BufferOverflowException

@quanghaisoft
Copy link
Author

quanghaisoft commented Jun 19, 2020

@quanghaisoft about directory : /data/orientdb/databases/OSystem does it really exist ?

I''m sorry this error is old,it is last 6 day ago

@quanghaisoft
Copy link
Author

quanghaisoft commented Jun 19, 2020

@laa Please help me with this error , can you give me a .jar at 3.0.31 thanks a lot

Jun 19 18:45:00 ograph06-76 server.sh[29054]: Exception in thread 'OrientDB WAL Flush Task'
Jun 19 18:45:00 ograph06-76 server.sh[29054]: java.nio.BufferOverflowException
Jun 19 18:45:00 ograph06-76 server.sh[29054]: at java.nio.Buffer.nextPutIndex(Buffer.java:527)
Jun 19 18:45:00 ograph06-76 server.sh[29054]: at

@laa
Copy link
Member

laa commented Jun 19, 2020

@quanghaisoft you can download https://orientdb.org/download 3.0.32 changes are minor only small fixes in few components.

@laa
Copy link
Member

laa commented Jun 19, 2020

@quanghaisoft could you send me log directory from the 76 node of the cluster ?

@quanghaisoft
Copy link
Author

log.zip
this is logs directory after i recopy data and restart server.

@quanghaisoft
Copy link
Author

@quanghaisoft you can download https://orientdb.org/download 3.0.32 changes are minor only small fixes in few components.

OK thank you

@quanghaisoft
Copy link
Author

I have upgraded to version 3.0.32 and am watching, if I run a problem again I will notify you, thanks a lot

@laa
Copy link
Member

laa commented Jun 24, 2020

@quanghaisoft what version of JDK do you use?

@laa
Copy link
Member

laa commented Jun 24, 2020

Hi, @quanghaisoft I have introduced a small fix, which should resolve your issue. Is it possible for you to apply new build which I will send to you?

@quanghaisoft
Copy link
Author

Please send me new build you have fixed, thanks you very much

we using java :

java version "1.8.0_211"
Java(TM) SE Runtime Environment (build 1.8.0_211-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.211-b12, mixed mode)

@laa
Copy link
Member

laa commented Jun 26, 2020

@quanghaisoft I am going to add more precautions, it will take couple of days, then I will provide you with new distribution.

@laa
Copy link
Member

laa commented Jul 17, 2020

@quanghaisoft
Copy link
Author

@quanghaisoft could you try this build https://drive.google.com/file/d/1wtQpyzbfeD3w07NpurG1wmGIxScggGXI/view?usp=sharing

Thanks you very much, please give me access to file with email bumha... or xuanhuong...

@quanghaisoft
Copy link
Author

I have installed and when start service it throw an exception below

2020-07-17 22:22:43:223 INFO Limit of open files for disk cache will be set to 64000. [OEngineLocalPaginated]
2020-07-17 22:22:43:242 INFO Loading configuration from: /app/orientdb/config/orientdb-server-config.xml... [OServerConfigurationLoaderXml]
2020-07-17 22:22:43:436 INFO OrientDB Server v3.0.33-SNAPSHOT - Veloce (build 2b80661, branch laa_3.0.x) is starting up... [OServer]
2020-07-17 22:22:43:456 INFO OrientDB config DISKCACHE=120,000MB [orientechnologies]
2020-07-17 22:22:43:528 INFO System is started under an effective user : orientdb [OEngineLocalPaginated]
2020-07-17 22:22:43:528 INFO Allocation of 1824000 pages. [OEngineLocalPaginated]
2020-07-17 22:22:43:579 INFO WAL maximum segment size is set to 6,144 MB [OrientDBEmbedded]
2020-07-17 22:22:43:580 INFO Databases directory: /app/orientdb/databases [OServer]
2020-07-17 22:22:43:636 INFO Direct IO for WAL located in /app/orientdb/databases/OSystem is allowed with block size 4096 bytes. [OCASDiskWriteAheadLog]
2020-07-17 22:22:43:636 INFO Page size for WAL located in /app/orientdb/databases/OSystem is set to 4096 bytes. [OCASDiskWriteAheadLog]Exception 68A09ED8 in storage plocal:/app/orientdb/databases/OSystem: 3.0.33-SNAPSHOT - Veloce (build 2b80661, branch laa_3.0.x)
java.lang.NoSuchMethodError: java.nio.ByteBuffer.position(I)Ljava/nio/ByteBuffer;
at com.orientechnologies.common.io.OIOUtils.readByteBuffer(OIOUtils.java:368)
at com.orientechnologies.orient.core.storage.impl.local.paginated.wal.cas.OCASDiskWriteAheadLog.readMasterRecord(OCASDiskWriteAheadLog.java:481)
at com.orientechnologies.orient.core.storage.impl.local.paginated.wal.cas.OCASDiskWriteAheadLog.readLastCheckpointInfo(OCASDiskWriteAheadLog.java:406)
at com.orientechnologies.orient.core.storage.impl.local.paginated.wal.cas.OCASDiskWriteAheadLog.(OCASDiskWriteAheadLog.java:286)
at com.orientechnologies.orient.core.storage.disk.OLocalPaginatedStorage.initWalAndDiskCache(OLocalPaginatedStorage.java:574)
at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.open(OAbstractPaginatedStorage.java:392)
at com.orientechnologies.orient.core.db.OrientDBEmbedded.openNoAuthorization(OrientDBEmbedded.java:259)
at com.orientechnologies.orient.core.db.OrientDBEmbedded.openNoAuthorization(OrientDBEmbedded.java:56)
at com.orientechnologies.orient.server.OSystemDatabase.openSystemDatabase(OSystemDatabase.java:97)
at com.orientechnologies.orient.server.OSystemDatabase.checkServerId(OSystemDatabase.java:174)
at com.orientechnologies.orient.server.OSystemDatabase.init(OSystemDatabase.java:162)
at com.orientechnologies.orient.server.OSystemDatabase.(OSystemDatabase.java:49)
at com.orientechnologies.orient.server.OServer.initSystemDatabase(OServer.java:1212)
at com.orientechnologies.orient.server.OServer.activate(OServer.java:421)
at com.orientechnologies.orient.server.OServerMain$1.run(OServerMain.java:48)
JVM error was thrown
java.lang.NoSuchMethodError: java.nio.ByteBuffer.position(I)Ljava/nio/ByteBuffer;
at com.orientechnologies.common.io.OIOUtils.readByteBuffer(OIOUtils.java:368)
at com.orientechnologies.orient.core.storage.impl.local.paginated.wal.cas.OCASDiskWriteAheadLog.readMasterRecord(OCASDiskWriteAheadLog.java:481)
at com.orientechnologies.orient.core.storage.impl.local.paginated.wal.cas.OCASDiskWriteAheadLog.readLastCheckpointInfo(OCASDiskWriteAheadLog.java:406)
at com.orientechnologies.orient.core.storage.impl.local.paginated.wal.cas.OCASDiskWriteAheadLog.(OCASDiskWriteAheadLog.java:286)
at com.orientechnologies.orient.core.storage.disk.OLocalPaginatedStorage.initWalAndDiskCache(OLocalPaginatedStorage.java:574)
at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.open(OAbstractPaginatedStorage.java:392)
at com.orientechnologies.orient.core.db.OrientDBEmbedded.openNoAuthorization(OrientDBEmbedded.java:259)
at com.orientechnologies.orient.core.db.OrientDBEmbedded.openNoAuthorization(OrientDBEmbedded.java:56)
at com.orientechnologies.orient.server.OSystemDatabase.openSystemDatabase(OSystemDatabase.java:97)
at com.orientechnologies.orient.server.OSystemDatabase.checkServerId(OSystemDatabase.java:174)
at com.orientechnologies.orient.server.OSystemDatabase.init(OSystemDatabase.java:162)
at com.orientechnologies.orient.server.OSystemDatabase.(OSystemDatabase.java:49)
at com.orientechnologies.orient.server.OServer.initSystemDatabase(OServer.java:1212)
at com.orientechnologies.orient.server.OServer.activate(OServer.java:421)
at com.orientechnologies.orient.server.OServerMain$1.run(OServerMain.java:48)

@laa
Copy link
Member

laa commented Jul 17, 2020

I see it was built under higher JDK, the same API but now returns the current instance of the buffer. I will rebuild it next week.

@quanghaisoft
Copy link
Author

I see it was built under higher JDK, the same API but now returns the current instance of the buffer. I will rebuild it next week.

OK thanks you very much .

@quanghaisoft
Copy link
Author

@quanghaisoft
Copy link
Author

Guys, could you try https://oss.sonatype.org/content/repositories/snapshots/com/orientechnologies/orientdb-community/3.0.33-SNAPSHOT/orientdb-community-3.0.33-20200717.164059-15.zip this build?

I have upgraded to this build and it works fine, thanks you very much .

@laa
Copy link
Member

laa commented Jul 21, 2020

Hi @quanghaisoft it could be one more reason why you experience this issue. It is inside the code of the distributed database. We are planning to port it soon to all the versions. I will let you know.

@quanghaisoft
Copy link
Author

Hi @quanghaisoft it could be one more reason why you experience this issue. It is inside the code of the distributed database. We are planning to port it soon to all the versions. I will let you know.

Yes, thanks a lot, we'll waiting for it.

@laa laa removed the in progress label Aug 5, 2020
@laa
Copy link
Member

laa commented Aug 5, 2020

Fixed. Should be delivered in next hotfix. Please reopen if the issue will still persist on your side.

@laa laa closed this as completed Aug 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

No branches or pull requests

3 participants