
Deadlock in BeginRo #5125

Closed
kaikash opened this issue Aug 20, 2022 · 27 comments
Assignees
Labels
gateway Issue assigned to gateway team status:in-progress Actively being worked on (picked up by developer)

Comments

@kaikash

kaikash commented Aug 20, 2022

System information

Erigon version: 2022.99.99-dev-1303023c

OS & Version: Ubuntu 20.04

Commit hash: 1303023

Expected behaviour

No deadlock

Actual behaviour

Deadlock :( It gets stuck in the semaphore.Acquire call.

Steps to reproduce the behaviour

Send about 5k eth_call requests simultaneously via 2 websocket connections.

Logs

pkill -SIGUSR1 erigon

goroutine 56608 [select, 10 minutes]:
golang.org/x/sync/semaphore.(*Weighted).Acquire(0xc0001500f0, {0x200d3b0, 0xc0cb6bbe90}, 0x1)
	golang.org/x/sync@v0.0.0-20220722155255-886fb9371eb4/semaphore/semaphore.go:60 +0x345
github.com/ledgerwatch/erigon-lib/kv/mdbx.(*MdbxKV).BeginRo(0xc0002791e0, {0x200d3b0?, 0xc0cb6bbe90?})
	github.com/ledgerwatch/erigon-lib@v0.0.0-20220819042719-4c744ae88a33/kv/mdbx/kv_mdbx.go:425 +0x85
github.com/ledgerwatch/erigon-lib/kv/remotedbserver.(*KvServer).Tx(0xc0a8768800, {0x2017408, 0xc0cb6a1390})
	github.com/ledgerwatch/erigon-lib@v0.0.0-20220819042719-4c744ae88a33/kv/remotedbserver/server.go:84 +0x77
github.com/ledgerwatch/erigon-lib/gointerfaces/remote._KV_Tx_Handler({0x181dda0?, 0xc0a8768800}, {0x20128b8?, 0xc0cb6c24e0})
	github.com/ledgerwatch/erigon-lib@v0.0.0-20220819042719-4c744ae88a33/gointerfaces/remote/kv_grpc.pb.go:195 +0x9f
github.com/grpc-ecosystem/go-grpc-middleware/recovery.StreamServerInterceptor.func1({0x181dda0?, 0xc0a8768800?}, {0x20128b8?, 0xc0cb6c24e0?}, 0x77ce0c149e98?, 0xc0cb69d4e8?)
	github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/recovery/interceptors.go:51 +0xb0
github.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1({0x181dda0?, 0xc0a8768800?}, {0x20128b8?, 0xc0cb6c24e0?})
	github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:49 +0x3a
github.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1({0x181dda0, 0xc0a8768800}, {0x20128b8, 0xc0cb6c24e0}, 0x1689740?, 0xc0cb6a1380?)
	github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:58 +0xbf
google.golang.org/grpc.(*Server).processStreamingRPC(0xc000000780, {0x2018460, 0xc0a89a1040}, 0xc0cb6beb40, 0xc0b2165f20, 0x2b4e0e0, 0x0)
	google.golang.org/grpc@v1.48.0/server.go:1565 +0xf26
google.golang.org/grpc.(*Server).handleStream(0xc000000780, {0x2018460, 0xc0a89a1040}, 0xc0cb6beb40, 0x0)
	google.golang.org/grpc@v1.48.0/server.go:1640 +0x9d6
google.golang.org/grpc.(*Server).serveStreams.func1.2()
	google.golang.org/grpc@v1.48.0/server.go:932 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
	google.golang.org/grpc@v1.48.0/server.go:930 +0x28a

pkill.log

@AskAlexSharov
Collaborator

Same if you increase --db.read.concurrency?

@kaikash
Author

kaikash commented Aug 21, 2022

Tried increasing it from 32 to 200; it didn't help. Can you reproduce it on your side?

@kaikash
Author

kaikash commented Aug 21, 2022

Tried running with --db.read.concurrency=5000; it seems to be working. Is it OK to run the node with this setting?

Still, it's not really correct behavior for the program to deadlock during a burst of load.

@kaikash
Author

kaikash commented Aug 21, 2022

Eventually it got stuck :( Even with this setting. The first time I send ~5k requests it responds successfully, then it gets stuck the second time I send them. So I think it doesn't release the semaphore the same number of times it acquires it, does it?
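
To illustrate the hypothesis, here is a minimal sketch (not Erigon's actual code; the type and field names are invented, only the Acquire/Release usage mirrors the stack trace above) of how a leaked permit produces exactly this hang:

package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/sync/semaphore"
)

// kv mimics the shape implied by the stack trace: BeginRo takes one permit from a
// weighted semaphore (sized by --db.read.concurrency) before opening a read txn.
type kv struct{ roTxsLimiter *semaphore.Weighted }

type roTx struct{ limiter *semaphore.Weighted }

func (db *kv) BeginRo(ctx context.Context) (*roTx, error) {
	// This is the Acquire the goroutine dump is stuck in (semaphore.go:60).
	if err := db.roTxsLimiter.Acquire(ctx, 1); err != nil {
		return nil, err
	}
	return &roTx{limiter: db.roTxsLimiter}, nil
}

// Rollback must hand the permit back; any path that skips it leaks one permit forever.
func (tx *roTx) Rollback() { tx.limiter.Release(1) }

func main() {
	db := &kv{roTxsLimiter: semaphore.NewWeighted(2)} // pretend --db.read.concurrency=2

	// Two read transactions are opened and never rolled back (the leak hypothesis).
	db.BeginRo(context.Background())
	db.BeginRo(context.Background())

	// Every later reader now waits in Acquire, exactly like the dumped goroutine.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	if _, err := db.BeginRo(ctx); err != nil {
		fmt.Println("BeginRo blocked:", err) // context deadline exceeded
	}
}

If permits leak like this, a larger --db.read.concurrency only delays the hang, which would match the behavior reported above.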

@AskAlexSharov
Collaborator

Do you run a separate RPCDaemon? If yes, is it started with --datadir? If not, try adding --datadir.

@AskAlexSharov
Collaborator

"Can you reproduce it on your side?" - I didn't try yet

@kaikash
Author

kaikash commented Aug 21, 2022

Yes, I'm running a separate RPCDaemon; they are on the same machine, but in different containers.

Tried running with --datadir and it seems to be working, even with db.read.concurrency=32.

So it must be a deadlock when a transaction is created over remotedb?

@AskAlexSharov
Collaborator

OK, then it's a somewhat known issue with the remotedb gRPC server architecture.
I advise you to work with --datadir for now.
I don't know when we will be able to fix it.

@kaikash
Author

kaikash commented Aug 22, 2022

I see, thanks. What is known about this bug at the moment? Maybe a little more context, so I can try to check this out myself?

@AskAlexSharov
Collaborator

What we know:

  1. Unclear whether it is really a deadlock, or whether the --db.read.concurrency limit is simply reached and every new reader is just waiting for a previous one.
  2. Go does not seem very friendly to embedded C databases: when IO happens inside C code, the Go runtime sees the whole thread as locked (it can't schedule any goroutine onto that thread) and excludes that thread from its pool. As a result, with many parallel read transactions that do some IO, you end up with tons of threads (which is in general bad for performance), and eventually you also hit Go's limit of 10K threads and the app crashes. This is the reason we have the --db.read.concurrency semaphore (see the sketch after this list).
  3. Network-based transactions (remotedb) are slower, so you reach the limit faster.
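
To make point 2 concrete, here is a small Unix-only demo (not Erigon code; blocking read(2) syscalls stand in for IO inside MDBX's C code). Every goroutine parked in a blocking syscall pins one OS thread, so the runtime keeps spawning new threads for the rest of the program, up to the default cap of 10000 (runtime/debug.SetMaxThreads), at which point the process aborts:

package main

import (
	"fmt"
	"runtime/pprof"
	"sync"
	"syscall"
	"time"
)

func main() {
	// A pipe that nobody writes to: read(2) on it blocks the calling OS thread,
	// which the Go runtime cannot reuse for other goroutines (similar to a goroutine
	// stuck inside a blocking C call).
	var fds [2]int
	if err := syscall.Pipe(fds[:]); err != nil {
		panic(err)
	}

	const readers = 200
	var wg sync.WaitGroup
	for i := 0; i < readers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			buf := make([]byte, 1)
			syscall.Read(fds[0], buf) // each of these pins one OS thread
		}()
	}

	time.Sleep(2 * time.Second)
	// Roughly one thread per blocked reader; without a limiter like
	// --db.read.concurrency this grows with the number of parallel read txs.
	fmt.Println("OS threads created:", pprof.Lookup("threadcreate").Count())

	syscall.Write(fds[1], make([]byte, readers)) // unblock everyone and exit
	wg.Wait()
}

Limiting how many read transactions are in flight at once bounds this thread growth, which is the job the --db.read.concurrency semaphore is doing.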

@PeaStew

PeaStew commented Aug 22, 2022

--datadir does not solve this for local erigon + local rpcdaemon; under high load we see what appears to be a lockup on the rpcdaemon every <20 mins.

@revitteth
Collaborator

I think I now have something here - will push some code this afternoon so we can try to prove it:

  • roTxsLimiter is a semaphore which defaults to runtime.GOMAXPROCS(-1), which for most machines will be c. 8 (correct me if wrong!)
  • the db.read.concurrency flag sets this roTxsLimiter, and the repros where issues are seen pass this value as c. 200-10k (as seen on GH)

Given point 2 in @AskAlexSharov's comment above, it seems plausible this value is just set too high?
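
For reference, a rough sketch of the sizing described above; the function name and the GOMAXPROCS fallback are taken from this comment, not copied from Erigon's code:

package main

import (
	"fmt"
	"runtime"

	"golang.org/x/sync/semaphore"
)

// newRoTxsLimiter sketches the sizing described above: use --db.read.concurrency
// when it is set, otherwise fall back to GOMAXPROCS. Names are illustrative only.
func newRoTxsLimiter(readConcurrency int) *semaphore.Weighted {
	limit := runtime.GOMAXPROCS(-1) // assumed default; a later comment reports max(10, GOMAXPROCS*2)
	if readConcurrency > 0 {
		limit = readConcurrency
	}
	return semaphore.NewWeighted(int64(limit))
}

func main() {
	fmt.Println("GOMAXPROCS on this machine:", runtime.GOMAXPROCS(-1))
	_ = newRoTxsLimiter(0) // no --db.read.concurrency flag: the default limit applies
}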

@revitteth
Collaborator

I guess in the meantime, if there's a chance to try again: remove the db.read.concurrency flag altogether, let it default to GOMAXPROCS(-1), and see if the repro produces the same exhaustion/locking?

@kaikash
Author

kaikash commented Aug 22, 2022

Could this be due to all Go runtime threads being excluded, so there are no free threads for goroutines to process the results from the database? I can try to test it with a lower db.read.concurrency.

But this hypothesis is contradicted by the fact that there are still some working goroutines which print GoodPeers?

@revitteth
Collaborator

Yeah, good point on the GoodPeers. I think I need to instrument some code so we can see more clearly what's going on here. I do think the first part of what you said 'could' be correct - for me to go and prove!

@kaikash
Author

kaikash commented Aug 22, 2022

Tried running with the default value (max(10, GOMAXPROCS*2)), with db.read.concurrency=2, and with db.read.concurrency=1 -- it still gets stuck.

@revitteth
Collaborator

@kaikash you seem to be able to replicate quickly - do you have an example of the calls you are making?

@kaikash
Author

kaikash commented Aug 22, 2022

@revitteth here is a test script https://github.com/kaikash/eth-call-test

@kaikash
Author

kaikash commented Aug 22, 2022

It behaves very unpredictably: sometimes it gets stuck, sometimes it doesn't. Try increasing the connection pool size (for me it gets stuck with 10 connections and 10k requests). Maybe this script isn't ideal because it makes the same requests every time and erigon caches them?..
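
For anyone else trying to reproduce: a rough Go equivalent of this kind of burst, using the gorilla/websocket client. The endpoint, call target, and request counts are placeholders, and the linked script may differ in detail:

package main

import (
	"fmt"
	"sync"

	"github.com/gorilla/websocket"
)

const (
	endpoint    = "ws://localhost:8545" // placeholder: local rpcdaemon websocket endpoint
	connections = 10
	requests    = 1000 // 10 connections * 1000 requests ~= the 10k burst mentioned above
)

type rpcReq struct {
	JSONRPC string        `json:"jsonrpc"`
	ID      int           `json:"id"`
	Method  string        `json:"method"`
	Params  []interface{} `json:"params"`
}

func main() {
	var wg sync.WaitGroup
	for c := 0; c < connections; c++ {
		wg.Add(1)
		go func(c int) {
			defer wg.Done()
			conn, _, err := websocket.DefaultDialer.Dial(endpoint, nil)
			if err != nil {
				fmt.Println("dial:", err)
				return
			}
			defer conn.Close()

			// Drain responses concurrently so pipelined writes don't back up.
			done := make(chan struct{})
			go func() {
				defer close(done)
				for i := 0; i < requests; i++ {
					var resp map[string]interface{}
					if err := conn.ReadJSON(&resp); err != nil {
						fmt.Println("read:", err)
						return
					}
				}
			}()

			// Pipeline every request so the server sees them roughly all at once.
			for i := 0; i < requests; i++ {
				req := rpcReq{
					JSONRPC: "2.0",
					ID:      c*requests + i,
					Method:  "eth_call",
					Params: []interface{}{
						// Placeholder call; the real script calls an actual contract.
						map[string]string{"to": "0x0000000000000000000000000000000000000000", "data": "0x"},
						"latest",
					},
				}
				if err := conn.WriteJSON(req); err != nil {
					fmt.Println("write:", err)
					return
				}
			}
			<-done
			fmt.Printf("conn %d done\n", c)
		}(c)
	}
	wg.Wait()
}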

@revitteth revitteth self-assigned this Aug 25, 2022
@revitteth revitteth added the gateway Issue assigned to gateway team label Aug 25, 2022
@revitteth
Collaborator

@kaikash sorry for delay, still working on this one!

@PeaStew

PeaStew commented Aug 26, 2022

So... we had a thought about this, having just run into a similar issue in some code we are building around the same concept. We discovered that if the application is receiving a huge amount of requests asynchronously, it effectively never has time for GC (as GC is generally a blocking activity, and blocking activities are deferred while a thread is in use), so what may be happening to Erigon too is simply that the thread lifetime is essentially 'forever' while it is in constant use, and no GC can occur. I can't say for sure from the code, but perhaps setting a max lifetime on the threads would help release memory and avoid running into this deadlock.

@revitteth
Collaborator

Thanks for the input! I'm balancing this one with a few other issues so the insight is super appreciated. Will be picking this up next week for sure.

@revitteth revitteth added the status:in-progress Actively being worked on (picked up by developer) label Sep 5, 2022
@zimbabao
Contributor

I removed the rpcDaemon and am running in single-process mode. It still gets stuck. FWIW, I am running a bsc node on erigon:alpha.

@revitteth
Collaborator

Thanks, useful info, currently trying to finish up on another bug and should get to this soon.

@zimbabao
Contributor

zimbabao commented Sep 10, 2022

Now it's not getting stuck, but I'm hitting an IOPS issue at 50K and getting a throughput of 1000 RPS.
I now suspect the issue is with rpcDaemon running as a separate process; once I moved to single-process mode, my main issue is IOPS.

I can reliably fix the problem by running in single-process mode on an ethereum node too (before this I was running rpcDaemon as well).

@hexoscott
Collaborator

@AskAlexSharov I've opened a PR over in erigon-lib which looks promising for helping here: ledgerwatch/erigon-lib#639

@revitteth
Collaborator

No longer able to repro after the above fix! Will close, and we can reopen if there's any further problem.
