Deadlock in BeginRo #5125
Comments
Same if you increase --db.read.concurrency?
Tried to increase from 32 to 200, didn't help. Can you reproduce it on your side?
Tried to run with that. But still, this is not quite the correct behavior of the program if it deadlocks during a burst of load.
Eventually it got stuck :( Even with this setting. The first time I send ~5k requests it responds successfully, then it gets stuck the second time I send them. So, I think it doesn't release the semaphore the same number of times it acquires it, does it?
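For readers following the semaphore hypothesis: the limiter in question appears, from the semaphore.Acquire frame mentioned in the issue body below, to be a golang.org/x/sync/semaphore weighted semaphore. Here is a minimal sketch (not Erigon's actual code) of the acquire/release discipline being questioned: if any path returns after Acquire without a matching Release, capacity leaks and, once it is exhausted, every later Acquire blocks forever.

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/semaphore"
)

// readAndServe sketches the discipline in question: Release must run on
// every path that follows a successful Acquire, or capacity leaks.
func readAndServe(ctx context.Context, sem *semaphore.Weighted) error {
	if err := sem.Acquire(ctx, 1); err != nil {
		return err // nothing acquired, nothing to release
	}
	defer sem.Release(1) // released on every path, including panics

	// ... open a read-only tx and serve the request here ...
	return nil
}

func main() {
	sem := semaphore.NewWeighted(32) // e.g. --db.read.concurrency=32
	fmt.Println(readAndServe(context.Background(), sem))
}
```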
Do you run a separate RPCDaemon? If yes: with --datadir? If not, try adding --datadir.
"Can you reproduce it on your side?" - I didn't try yet.
Yes, I'm running a separate RPCDaemon; they are on the same machine, but in different containers. Tried to run with that as well. So, it has to be a deadlock when a transaction is created with remotedb?
OK, then it's a somewhat known issue with the remotedb gRPC server architecture.
I see, thanks. What is known about this bug at the moment? Maybe a little more context, so I can try to check this out myself?
What we know:

--datadir does not solve this for local erigon + local rpcdaemon; under high load we see what appears to be a lockup on the rpcd every <20 mins.
I think I now have something here - will push some code this afternoon so we can try to prove it:
Given point 2 in @AskAlexSharov's comment above, it seems plausible this value is just set too high?
I guess in the meantime, if there's a chance to try again - remove the db.read.concurrency flag altogether, let it default to GOMAXPROCS(-1), and see if the repro produces the same exhaustion/locking?
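For reference, runtime.GOMAXPROCS(-1) only queries the current setting without changing it, so the suggested default amounts to "one read slot per available CPU". A tiny illustrative sketch (not the actual flag-handling code):

```go
package main

import (
	"fmt"
	"runtime"

	"golang.org/x/sync/semaphore"
)

func main() {
	// GOMAXPROCS(-1) reports the current value without modifying it.
	readConcurrency := runtime.GOMAXPROCS(-1)
	if readConcurrency < 1 {
		readConcurrency = 1
	}
	fmt.Println("db.read.concurrency defaulting to", readConcurrency)
	_ = semaphore.NewWeighted(int64(readConcurrency))
}
```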
Could this be due to the fact that all Go runtime threads are occupied? So there are no free threads for goroutines to process results from the database. I can try to test it with a lower value. But this hypothesis is refuted by the fact that there are still some working goroutines which keep printing logs.
Yeah, good point on the goodPeers - I think I need to instrument some code so we can see more clearly what's going on here - I do think the first part of what you said looks like it 'could' be correct - for me to go prove!
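One straightforward way to do that instrumentation (a hypothetical wrapper, not code from the repo) is to count in-flight holders, so an acquire/release imbalance shows up directly in the logs:

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"

	"golang.org/x/sync/semaphore"
)

// countingSem is a hypothetical wrapper: tracking in-flight holders makes an
// acquire/release imbalance (the suspected leak) visible at a glance.
type countingSem struct {
	sem      *semaphore.Weighted
	inFlight int64
}

func (c *countingSem) Acquire(ctx context.Context) error {
	if err := c.sem.Acquire(ctx, 1); err != nil {
		return err
	}
	fmt.Println("read slots in flight:", atomic.AddInt64(&c.inFlight, 1))
	return nil
}

func (c *countingSem) Release() {
	fmt.Println("read slots in flight:", atomic.AddInt64(&c.inFlight, -1))
	c.sem.Release(1)
}

func main() {
	c := &countingSem{sem: semaphore.NewWeighted(32)}
	if err := c.Acquire(context.Background()); err == nil {
		c.Release()
	}
}
```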
Tried to run with the default value.
@kaikash you seem to be able to replicate quickly - do you have an example of the calls you are making?
@revitteth here is a test script https://github.com/kaikash/eth-call-test |
It behaves very unpredictably: sometimes it gets stuck, sometimes it doesn't. Try increasing the connection pool size (for me it gets stuck with 10 connections and 10k requests). Maybe this script is not the best because it makes the same request repeatedly and erigon caches it?..
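For anyone wanting to reproduce without the linked script, a rough Go sketch of the same kind of load test; the endpoint, connection count, and call target are placeholders:

```go
package main

import (
	"context"
	"log"
	"sync"

	"github.com/ethereum/go-ethereum/rpc"
)

func main() {
	const conns, requests = 10, 10000 // pool size and total calls, matching the repro above

	// Small pool of websocket connections to the RPC daemon (placeholder URL).
	clients := make([]*rpc.Client, conns)
	for i := range clients {
		c, err := rpc.Dial("ws://127.0.0.1:8545")
		if err != nil {
			log.Fatal(err)
		}
		clients[i] = c
	}

	// Placeholder eth_call target; a real test should hit a deployed contract.
	call := map[string]string{
		"to":   "0x0000000000000000000000000000000000000000",
		"data": "0x",
	}

	var wg sync.WaitGroup
	for i := 0; i < requests; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			var out string
			if err := clients[i%conns].CallContext(context.Background(), &out, "eth_call", call, "latest"); err != nil {
				log.Println("eth_call failed:", err)
			}
		}(i)
	}
	wg.Wait()
	log.Println("all requests completed")
}
```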
@kaikash sorry for delay, still working on this one!
So... we had a thought about this, having just run into a similar issue in some code we are building around the same concept. We discovered that if the application is receiving a huge amount of requests asynchronously, it never actually has time for GC (as GC is generally a blocking activity and blocking activities are deferred while a thread is in use), so what may be happening to Erigon too is simply that the thread lifetime is essentially 'forever' while it is in constant use, and no GC can occur. I can't say for sure in the code, but perhaps setting a max lifetime on the threads may help release memory and avoid this deadlock.
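If that GC-starvation hypothesis is right, it should be observable, for example by running the process with GODEBUG=gctrace=1 or by polling runtime.MemStats during the burst. A small generic sketch of the polling approach (not Erigon code):

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// Poll GC stats during a burst: if NumGC stops increasing while HeapAlloc
// keeps growing, the "no time for GC" hypothesis gains weight.
func main() {
	var m runtime.MemStats
	for i := 0; i < 10; i++ {
		runtime.ReadMemStats(&m)
		fmt.Printf("heap=%d MiB numGC=%d lastGC=%s\n",
			m.HeapAlloc>>20, m.NumGC,
			time.Unix(0, int64(m.LastGC)).Format(time.RFC3339))
		time.Sleep(2 * time.Second)
	}
}
```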
Thanks for the input! I'm balancing this one with a few other issues so the insight is super appreciated. Will be picking this up next week for sure.
I removed rpcDaemon and am running in single-process mode. Still it gets stuck. FWIW, I am running a bsc node on erigon:alpha.
Thanks, useful info; currently trying to finish up on another bug and should get to this soon.
Now it's not stuck but I'm hitting an IOPS issue at 50K, and getting a throughput of 1000 RPS. I can reliably fix the problem by running in single-process mode on the ethereum node too (before this I was running rpcDaemon there as well).
@AskAlexSharov I've opened a PR over in erigon-lib which looks promising for helping here: ledgerwatch/erigon-lib#639
No longer able to repro after the above fix! Will close and we can reopen if any further problem.
System information

Erigon version: ./erigon --version
2022.99.99-dev-1303023c

OS & Version: Windows/Linux/OSX
ubuntu 20.04

Commit hash:
1303023

Expected behaviour

No deadlock.

Actual behaviour

Deadlock :( It gets stuck in the semaphore.Acquire call.

Steps to reproduce the behaviour

Send about 5k eth_call requests simultaneously via 2 websocket connections.

Logs

pkill -SIGUSR1 erigon
pkill.log
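For context on the attached log: sending SIGUSR1 presumably triggers Erigon's goroutine stack dump, which is how the goroutines parked in semaphore.Acquire were spotted. A generic sketch of that signal-handler pattern in Go (not Erigon's actual handler):

```go
package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
)

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR1)
	go func() {
		for range sigs {
			// Dump every goroutine's stack; goroutines stuck in
			// semaphore.Acquire show up with that frame near the top.
			_ = pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		}
	}()
	select {} // stand-in for the daemon's main loop
}
```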