
Deadlock in BeginRo #5125

Closed
kaikash opened this issue Aug 20, 2022 · 27 comments
Assignees
Labels
gateway Issue assigned to gateway team status:in-progress Actively being worked on (picked up by developer)

Comments

@kaikash

kaikash commented Aug 20, 2022

System information

Erigon version: 2022.99.99-dev-1303023c

OS & Version: Ubuntu 20.04

Commit hash: 1303023

Expected behaviour

No deadlock

Actual behaviour

Deadlock :( It gets stuck in the semaphore.Acquire call.

Steps to reproduce the behaviour

Send about 5k eth_call requests simultaneously via 2 websocket connections.

Logs

pkill -SIGUSR1 erigon

goroutine 56608 [select, 10 minutes]:
golang.org/x/sync/semaphore.(*Weighted).Acquire(0xc0001500f0, {0x200d3b0, 0xc0cb6bbe90}, 0x1)
	golang.org/x/sync@v0.0.0-20220722155255-886fb9371eb4/semaphore/semaphore.go:60 +0x345
github.com/ledgerwatch/erigon-lib/kv/mdbx.(*MdbxKV).BeginRo(0xc0002791e0, {0x200d3b0?, 0xc0cb6bbe90?})
	github.com/ledgerwatch/erigon-lib@v0.0.0-20220819042719-4c744ae88a33/kv/mdbx/kv_mdbx.go:425 +0x85
github.com/ledgerwatch/erigon-lib/kv/remotedbserver.(*KvServer).Tx(0xc0a8768800, {0x2017408, 0xc0cb6a1390})
	github.com/ledgerwatch/erigon-lib@v0.0.0-20220819042719-4c744ae88a33/kv/remotedbserver/server.go:84 +0x77
github.com/ledgerwatch/erigon-lib/gointerfaces/remote._KV_Tx_Handler({0x181dda0?, 0xc0a8768800}, {0x20128b8?, 0xc0cb6c24e0})
	github.com/ledgerwatch/erigon-lib@v0.0.0-20220819042719-4c744ae88a33/gointerfaces/remote/kv_grpc.pb.go:195 +0x9f
github.com/grpc-ecosystem/go-grpc-middleware/recovery.StreamServerInterceptor.func1({0x181dda0?, 0xc0a8768800?}, {0x20128b8?, 0xc0cb6c24e0?}, 0x77ce0c149e98?, 0xc0cb69d4e8?)
	github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/recovery/interceptors.go:51 +0xb0
github.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1({0x181dda0?, 0xc0a8768800?}, {0x20128b8?, 0xc0cb6c24e0?})
	github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:49 +0x3a
github.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1({0x181dda0, 0xc0a8768800}, {0x20128b8, 0xc0cb6c24e0}, 0x1689740?, 0xc0cb6a1380?)
	github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:58 +0xbf
google.golang.org/grpc.(*Server).processStreamingRPC(0xc000000780, {0x2018460, 0xc0a89a1040}, 0xc0cb6beb40, 0xc0b2165f20, 0x2b4e0e0, 0x0)
	google.golang.org/grpc@v1.48.0/server.go:1565 +0xf26
google.golang.org/grpc.(*Server).handleStream(0xc000000780, {0x2018460, 0xc0a89a1040}, 0xc0cb6beb40, 0x0)
	google.golang.org/grpc@v1.48.0/server.go:1640 +0x9d6
google.golang.org/grpc.(*Server).serveStreams.func1.2()
	google.golang.org/grpc@v1.48.0/server.go:932 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
	google.golang.org/grpc@v1.48.0/server.go:930 +0x28a

pkill.log

@AskAlexSharov
Collaborator

Same if you increase --db.read.concurrency?

@kaikash
Author

kaikash commented Aug 21, 2022

Tried increasing it from 32 to 200; it didn't help. Can you reproduce it on your side?

@kaikash
Author

kaikash commented Aug 21, 2022

Tried running with --db.read.concurrency=5000; it seems to be working. Is it OK to run the node with this setting?

Still, it's not really correct behavior for the program to deadlock during a burst of load.

@kaikash
Author

kaikash commented Aug 21, 2022

Eventually it got stuck :( Even with this setting. The first time I send ~5k requests it responds successfully, then it gets stuck the second time I send them. So I think it doesn't release the semaphore the same number of times it acquires it, does it?
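
To illustrate the hypothesis, here is a minimal sketch (not Erigon's actual code; the type and field names are invented, only the Acquire/Release usage mirrors the stack trace above) of how a leaked permit produces exactly this hang:

package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/sync/semaphore"
)

// kv mimics the shape implied by the stack trace: BeginRo takes one permit from a
// weighted semaphore (sized by --db.read.concurrency) before opening a read txn.
type kv struct{ roTxsLimiter *semaphore.Weighted }

type roTx struct{ limiter *semaphore.Weighted }

func (db *kv) BeginRo(ctx context.Context) (*roTx, error) {
	// This is the Acquire the goroutine dump is stuck in (semaphore.go:60).
	if err := db.roTxsLimiter.Acquire(ctx, 1); err != nil {
		return nil, err
	}
	return &roTx{limiter: db.roTxsLimiter}, nil
}

// Rollback must hand the permit back; any path that skips it leaks one permit forever.
func (tx *roTx) Rollback() { tx.limiter.Release(1) }

func main() {
	db := &kv{roTxsLimiter: semaphore.NewWeighted(2)} // pretend --db.read.concurrency=2

	// Two read transactions are opened and never rolled back (the leak hypothesis).
	db.BeginRo(context.Background())
	db.BeginRo(context.Background())

	// Every later reader now waits in Acquire, exactly like the dumped goroutine.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	if _, err := db.BeginRo(ctx); err != nil {
		fmt.Println("BeginRo blocked:", err) // context deadline exceeded
	}
}

If permits leak like this, a larger --db.read.concurrency only delays the hang, which would match the behavior reported above.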

@AskAlexSharov
Collaborator

Do you run a separate RPCDaemon? If yes, is it started with --datadir? If not, try adding --datadir.

@AskAlexSharov
Collaborator

"Can you reproduce it on your side?" - I didn't try yet

@kaikash
Author

kaikash commented Aug 21, 2022

Yes, I'm running a separate RPCDaemon; they are on the same machine, but in different containers.

Tried running with --datadir and it seems to be working, even with db.read.concurrency=32.

So it must be a deadlock when a transaction is created over remotedb?

@AskAlexSharov
Collaborator

OK, then it's a somewhat known issue with the remotedb gRPC server architecture.
I advise you to work with --datadir for now.
I don't know when we will be able to fix it.

@kaikash
Author

kaikash commented Aug 22, 2022

I see, thanks. What is known about this bug at the moment? Maybe a little more context, so I can try to check this out myself?

@AskAlexSharov
Collaborator

What we know:

  1. Unclear whether it is really a deadlock, or whether the --db.read.concurrency limit is simply reached and every new reader is just waiting for a previous one.
  2. Go does not seem very friendly to embedded C databases: when IO happens inside C code, the Go runtime sees the whole thread as locked (it can't schedule any goroutine onto that thread) and excludes that thread from its pool. As a result, with many parallel read transactions that do some IO, you end up with tons of threads (which is in general bad for performance), and eventually you also hit Go's limit of 10K threads and the app crashes. This is the reason we have the --db.read.concurrency semaphore (see the sketch after this list).
  3. Network-based transactions (remotedb) are slower, so you reach the limit faster.
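
To make point 2 concrete, here is a small Unix-only demo (not Erigon code; blocking read(2) syscalls stand in for IO inside MDBX's C code). Every goroutine parked in a blocking syscall pins one OS thread, so the runtime keeps spawning new threads for the rest of the program, up to the default cap of 10000 (runtime/debug.SetMaxThreads), at which point the process aborts:

package main

import (
	"fmt"
	"runtime/pprof"
	"sync"
	"syscall"
	"time"
)

func main() {
	// A pipe that nobody writes to: read(2) on it blocks the calling OS thread,
	// which the Go runtime cannot reuse for other goroutines (similar to a goroutine
	// stuck inside a blocking C call).
	var fds [2]int
	if err := syscall.Pipe(fds[:]); err != nil {
		panic(err)
	}

	const readers = 200
	var wg sync.WaitGroup
	for i := 0; i < readers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			buf := make([]byte, 1)
			syscall.Read(fds[0], buf) // each of these pins one OS thread
		}()
	}

	time.Sleep(2 * time.Second)
	// Roughly one thread per blocked reader; without a limiter like
	// --db.read.concurrency this grows with the number of parallel read txs.
	fmt.Println("OS threads created:", pprof.Lookup("threadcreate").Count())

	syscall.Write(fds[1], make([]byte, readers)) // unblock everyone and exit
	wg.Wait()
}

Limiting how many read transactions are in flight at once bounds this thread growth, which is the job the --db.read.concurrency semaphore is doing.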

@PeaStew

PeaStew commented Aug 22, 2022

--datadir does not solve this for local erigon + local rpcdaemon; under high load we see what appears to be a lockup on the rpcdaemon every <20 mins.

@revitteth
Collaborator

I think I now have something here - will push some code this afternoon so we can try to prove it:

  • roTxsLimiter is a semaphore which defaults to runtime.GOMAXPROCS(-1), which for most machines will be c. 8 (correct me if wrong!)
  • the db.read.concurrency flag sets this roTxsLimiter, and the repros where issues are seen pass this value as c. 200-10k (as seen on GH)

Given point 2 in @AskAlexSharov's comment above, it seems plausible this value is just set too high?
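
For reference, a rough sketch of the sizing described above; the function name and the GOMAXPROCS fallback are taken from this comment, not copied from Erigon's code:

package main

import (
	"fmt"
	"runtime"

	"golang.org/x/sync/semaphore"
)

// newRoTxsLimiter sketches the sizing described above: use --db.read.concurrency
// when it is set, otherwise fall back to GOMAXPROCS. Names are illustrative only.
func newRoTxsLimiter(readConcurrency int) *semaphore.Weighted {
	limit := runtime.GOMAXPROCS(-1) // assumed default; a later comment reports max(10, GOMAXPROCS*2)
	if readConcurrency > 0 {
		limit = readConcurrency
	}
	return semaphore.NewWeighted(int64(limit))
}

func main() {
	fmt.Println("GOMAXPROCS on this machine:", runtime.GOMAXPROCS(-1))
	_ = newRoTxsLimiter(0) // no --db.read.concurrency flag: the default limit applies
}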

@revitteth
Collaborator

I guess in the meantime, if there's a chance to try again: remove the db.read.concurrency flag altogether, let it default to GOMAXPROCS(-1), and see if the repro produces the same exhaustion/locking?

@kaikash
Author

kaikash commented Aug 22, 2022

Could this be due to all Go runtime threads being excluded, so there are no free threads for goroutines to process the results from the database? I can try to test it with a lower db.read.concurrency.

But this hypothesis is contradicted by the fact that there are still some working goroutines which print GoodPeers?

@revitteth
Collaborator

Yeah, good point on the GoodPeers. I think I need to instrument some code so we can see more clearly what's going on here. I do think the first part of what you said 'could' be correct - for me to go and prove!

@kaikash
Author

kaikash commented Aug 22, 2022

Tried running with the default value (max(10, GOMAXPROCS*2)), with db.read.concurrency=2, and with db.read.concurrency=1 -- it still gets stuck.

@revitteth
Collaborator

@kaikash you seem to be able to replicate quickly - do you have an example of the calls you are making?

@kaikash
Author

kaikash commented Aug 22, 2022

@revitteth here is a test script https://github.com/kaikash/eth-call-test

@kaikash
Author

kaikash commented Aug 22, 2022

It behaves very unpredictably: sometimes it gets stuck, sometimes it doesn't. Try increasing the connection pool size (for me it gets stuck with 10 connections and 10k requests). Maybe this script isn't ideal because it makes the same requests every time and erigon caches them?..
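
For anyone else trying to reproduce: a rough Go equivalent of this kind of burst, using the gorilla/websocket client. The endpoint, call target, and request counts are placeholders, and the linked script may differ in detail:

package main

import (
	"fmt"
	"sync"

	"github.com/gorilla/websocket"
)

const (
	endpoint    = "ws://localhost:8545" // placeholder: local rpcdaemon websocket endpoint
	connections = 10
	requests    = 1000 // 10 connections * 1000 requests ~= the 10k burst mentioned above
)

type rpcReq struct {
	JSONRPC string        `json:"jsonrpc"`
	ID      int           `json:"id"`
	Method  string        `json:"method"`
	Params  []interface{} `json:"params"`
}

func main() {
	var wg sync.WaitGroup
	for c := 0; c < connections; c++ {
		wg.Add(1)
		go func(c int) {
			defer wg.Done()
			conn, _, err := websocket.DefaultDialer.Dial(endpoint, nil)
			if err != nil {
				fmt.Println("dial:", err)
				return
			}
			defer conn.Close()

			// Drain responses concurrently so pipelined writes don't back up.
			done := make(chan struct{})
			go func() {
				defer close(done)
				for i := 0; i < requests; i++ {
					var resp map[string]interface{}
					if err := conn.ReadJSON(&resp); err != nil {
						fmt.Println("read:", err)
						return
					}
				}
			}()

			// Pipeline every request so the server sees them roughly all at once.
			for i := 0; i < requests; i++ {
				req := rpcReq{
					JSONRPC: "2.0",
					ID:      c*requests + i,
					Method:  "eth_call",
					Params: []interface{}{
						// Placeholder call; the real script calls an actual contract.
						map[string]string{"to": "0x0000000000000000000000000000000000000000", "data": "0x"},
						"latest",
					},
				}
				if err := conn.WriteJSON(req); err != nil {
					fmt.Println("write:", err)
					return
				}
			}
			<-done
			fmt.Printf("conn %d done\n", c)
		}(c)
	}
	wg.Wait()
}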

@revitteth revitteth self-assigned this Aug 25, 2022
@revitteth revitteth added the gateway Issue assigned to gateway team label Aug 25, 2022
@revitteth
Collaborator

@kaikash sorry for delay, still working on this one!

@PeaStew

PeaStew commented Aug 26, 2022

So... we had a thought about this, having just run into a similar issue in some code we are building around the same concept. We discovered that if the application is receiving a huge amount of requests asynchronously, it effectively never has time for GC (as GC is generally a blocking activity, and blocking activities are deferred while a thread is in use), so what may be happening to Erigon too is simply that the thread lifetime is essentially 'forever' while it is in constant use, and no GC can occur. I can't say for sure from the code, but perhaps setting a max lifetime on the threads would help release memory and avoid running into this deadlock.

@revitteth
Collaborator

Thanks for the input! I'm balancing this one with a few other issues so the insight is super appreciated. Will be picking this up next week for sure.

@revitteth revitteth added the status:in-progress Actively being worked on (picked up by developer) label Sep 5, 2022
@zimbabao
Contributor

I removed the rpcDaemon and am running in single-process mode. It still gets stuck. FWIW, I am running a bsc node on erigon:alpha.

@revitteth
Collaborator

Thanks, useful info, currently trying to finish up on another bug and should get to this soon.

@zimbabao
Contributor

zimbabao commented Sep 10, 2022

Now it's not getting stuck, but I'm hitting an IOPS issue at 50K and getting a throughput of 1000 RPS.
I now suspect the issue is with rpcDaemon running as a separate process; once I moved to single-process mode, my main issue is IOPS.

I can reliably fix the problem by running in single-process mode on an ethereum node too (before this I was running rpcDaemon as well).

@hexoscott
Collaborator

@AskAlexSharov I've opened a PR over in erigon-lib which looks promising for helping here: ledgerwatch/erigon-lib#639

@revitteth
Collaborator

No longer able to repro after the above fix! Will close, and we can reopen if there's any further problem.
