Deadlock of sync & RPC under load. #4543
Comments
Not enough logs. |
The node was upgraded about 3 days ago. No DB migration logs appear, so the migration should already be complete. |
Show more logs |
We are also facing the same issue (#4510). After I removed the services that access the RPC, it has been working without an issue for the past week. |
I will take a look, we suffer from that as well. |
@AskAlexSharov there is definitely a deadlock somewhere in the built-in RPC service. It breaks both Engine API and RPCs. I am debugging. |
I can't reproduce. Try |
I can't reproduce that easily, it is pretty random. I think it happened during some hive tests and it definitely did happen on our validator nodes. As soon as I have one stuck, I will let you know. |
@hritique @wetezos can you share your RPC load (the methods that are called)? It might be something with one of the methods, but I haven't managed to pinpoint it yet. If you can share which RPCs are being called, it would be much appreciated. Or, if you are ready to help, I can give you a branch with extra log output, so we can see the last RPC before the deadlock. |
@mandrigin I'm happy to deploy debugging code on our Erigon instance. I work with @hritique in the same team. |
@kucharskim are you comfortable building from sources or should I pre-build a Docker image for you? |
anyway, here is the branch with extra logs: https://github.com/ledgerwatch/erigon/tree/issue4543 |
Got it, let me give it a try. |
|
I just reproduced this problem; these are the last logs. |
Yeah, from source is fine. I just built it, asked @hritique to deploy it and see how it goes. |
Okay, it looks like at some point it always times out: every request takes 50s. @wetezos if you still have the node running, can you do pkill -SIGUSR1 erigon and show the output? Same to you @kucharskim when you reproduce the bug. Also, please run it with the --pprof flag and try to get me the goroutine trace like that
after you reproduce. I have more data to test now, but it will still be useful because you seem to be able to easily reproduce it. |
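For readers following along, here is a minimal sketch of one way to capture a full goroutine trace once the node is stuck. It is not the command from the original comment. It assumes that `--pprof` exposes Go's standard `net/http/pprof` handlers and that the listen address (for example `127.0.0.1:6060`) matches your setup; the helper names `goroutineDumpURL` and `saveGoroutineDump` are made up for illustration.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// goroutineDumpURL builds the standard net/http/pprof URL for a goroutine
// dump. debug=2 requests one full stack per goroutine, including how long
// each goroutine has been blocked.
func goroutineDumpURL(addr string) string {
	return "http://" + addr + "/debug/pprof/goroutine?debug=2"
}

// saveGoroutineDump downloads the dump from addr and writes it to path.
func saveGoroutineDump(addr, path string) error {
	client := &http.Client{Timeout: 30 * time.Second}
	resp, err := client.Get(goroutineDumpURL(addr))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	out, err := os.Create(path)
	if err != nil {
		return err
	}
	defer out.Close()
	_, err = io.Copy(out, resp.Body)
	return err
}

func main() {
	// The pprof address is an assumption about your node, so it is taken
	// from the command line rather than hard-coded.
	if len(os.Args) < 2 {
		fmt.Println("usage: gdump <pprof-addr>   e.g. 127.0.0.1:6060")
		return
	}
	if err := saveGoroutineDump(os.Args[1], "goroutines.txt"); err != nil {
		fmt.Fprintln(os.Stderr, "dump failed:", err)
		os.Exit(1)
	}
	fmt.Println("wrote goroutines.txt")
}
```

Writing straight to a file avoids the terminal truncating the top of a long dump.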
"if you still have the node running, can you do pkill -SIGUSR1 erigon and show the output?" Yes, our node is still running, but it has been restarted. The following is the log after running pkill: |
@wetezos ah, but it syncs/works, |
|
oh, sorry, then |
also, this is necessary only when you reproduce the deadlock/rpc issue |
Sure, this is the output file. |
If I see correctly, we seem to be leaking quite a few goroutines
|
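A goroutine leak like the one suspected above usually shows up as a large number of goroutines parked in the same function. The following is a rough sketch, not the analysis used in this thread, that groups the goroutines in a `debug=2` style dump by the function at the top of each stack. The sample dump and the names `example.waitForLock` and `countByTopFrame` are invented for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// countByTopFrame groups goroutines by the function at the top of their
// stack. It assumes the text layout of a Go goroutine dump: blocks separated
// by blank lines, a "goroutine N [state]:" header, then the top frame.
func countByTopFrame(dump string) map[string]int {
	counts := make(map[string]int)
	for _, block := range strings.Split(strings.TrimSpace(dump), "\n\n") {
		lines := strings.Split(block, "\n")
		if len(lines) < 2 || !strings.HasPrefix(lines[0], "goroutine ") {
			continue
		}
		frame := lines[1]
		// Drop the argument list so identical functions group together.
		if i := strings.LastIndex(frame, "("); i >= 0 {
			frame = frame[:i]
		}
		counts[frame]++
	}
	return counts
}

func main() {
	// Invented sample data; a real dump would be read from a file.
	sample := "goroutine 1 [running]:\nmain.main()\n\t/app/main.go:10 +0x1d\n\n" +
		"goroutine 7 [semacquire, 12 minutes]:\nexample.waitForLock(0xc000010000)\n\t/app/lock.go:42 +0x65\n\n" +
		"goroutine 8 [semacquire, 12 minutes]:\nexample.waitForLock(0xc000010008)\n\t/app/lock.go:42 +0x65\n"

	counts := countByTopFrame(sample)
	names := make([]string, 0, len(counts))
	for name := range counts {
		names = append(names, name)
	}
	// Most frequent first; ties broken by name for stable output.
	sort.Slice(names, func(i, j int) bool {
		if counts[names[i]] != counts[names[j]] {
			return counts[names[i]] > counts[names[j]]
		}
		return names[i] < names[j]
	})
	for _, name := range names {
		fmt.Printf("%6d  %s\n", counts[name], name)
	}
}
```

A function whose count keeps growing between two dumps taken a few minutes apart is a good candidate for the leak.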
@wetezos you are running the RPC daemon separately, right? what if you can do the same but with the built-in one? |
@wetezos @kucharskim I updated the branch with a possible fix, the commit is Try to upgrade and see if it resolves your issue |
@wetezos unfortunately, that is not the full log of goroutines, I see it is trimmed quite a bit (the top half is missing). Maybe you can try to |
@mandrigin Here is the full log. |
Okay, there is the proof that it is the same issue
Actually, there are at least 7 goroutines stuck at this part, and they have been stuck for a long time (see 1391 minutes and more). |
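The wait times mentioned above come from the goroutine headers in the dump, which look like `goroutine 12 [semacquire, 1391 minutes]:`. As a hedged sketch of how to pull out only the long-blocked goroutines, the snippet below filters headers by wait time. The type `waiter`, the function `longWaiters`, and the sample data are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// headerRe matches a goroutine header such as
// "goroutine 12 [semacquire, 1391 minutes]:". The wait time is optional
// because goroutines blocked for under a minute do not report one.
var headerRe = regexp.MustCompile(`^goroutine (\d+) \[([^,\]]+)(?:, (\d+) minutes)?`)

// waiter describes one goroutine that has been blocked for a while.
type waiter struct {
	ID      int
	State   string
	Minutes int
}

// longWaiters returns the goroutines blocked for at least minMinutes.
func longWaiters(dump string, minMinutes int) []waiter {
	var out []waiter
	for _, line := range strings.Split(dump, "\n") {
		m := headerRe.FindStringSubmatch(line)
		if m == nil || m[3] == "" {
			continue // not a header, or no wait time reported
		}
		mins, err := strconv.Atoi(m[3])
		if err != nil || mins < minMinutes {
			continue
		}
		id, _ := strconv.Atoi(m[1]) // the regex guarantees digits
		out = append(out, waiter{ID: id, State: m[2], Minutes: mins})
	}
	return out
}

func main() {
	// Invented sample data; a real dump would be read from a file.
	sample := "goroutine 1 [running]:\nmain.main()\n\n" +
		"goroutine 12 [semacquire, 1391 minutes]:\nexample.waitForLock()\n\n" +
		"goroutine 13 [select, 2 minutes]:\nexample.loop()\n"
	for _, w := range longWaiters(sample, 60) {
		fmt.Printf("goroutine %d blocked in %q for %d minutes\n", w.ID, w.State, w.Minutes)
	}
}
```

Several goroutines blocked in the same lock-related state for hours, as described above, is the typical signature of a deadlock rather than ordinary load.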
I will need to check out more error scenarios |
Sure, thanks for your hard work. Is my version right? erigon --version git branch
When I restarted, I got an error: |
@wetezos show your CLI command |
Hey @mandrigin @revittm We have been running it for 3 days, and it is behaving normally now |
@hritique how are we doing? I see 4 days of uptime of a process running on:
|
Thanks so much, seems like it may have done the trick! @mandrigin are you happy this is resolved? |
Yep, awesome! |
Hey! I'm facing the same issue :( I have a forked version of erigon@e04401491fa310ec7bc54d0cc530cbaded6d70e8 From
|
It happens when I try to do many |
I tried running a fresh Erigon version; the problem remains :( |
Created new issue #5125 |
System information
Erigon version:
./erigon --version
erigon version 2022.06.6-alpha-77240899
OS & Version: Windows/Linux/OSX
Linux version 4.15.0-163-generic (buildd@lcy01-amd64-021) (gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)) #171-Ubuntu SMP Fri Nov 5 11:55:11 UTC 2021
Commit hash :
(HEAD detached at v2022.06.06)
Issue
After I upgraded to the latest version, once the number of RPC requests per second reaches 5, the node stops synchronizing data and the RPC becomes inaccessible.
logs
Jun 26 03:11:05 Ubuntu-1804-bionic-64-minimal erigon[31704]: [INFO] [06-26|03:11:05.029] [p2p] GoodPeers eth66=99
Jun 26 03:13:05 Ubuntu-1804-bionic-64-minimal erigon[31704]: [INFO] [06-26|03:13:05.029] [p2p] GoodPeers eth66=99
Jun 26 03:15:05 Ubuntu-1804-bionic-64-minimal erigon[31704]: [INFO] [06-26|03:15:05.029] [p2p] GoodPeers eth66=99
Jun 26 03:17:05 Ubuntu-1804-bionic-64-minimal erigon[31704]: [INFO] [06-26|03:17:05.029] [p2p] GoodPeers eth66=99
Jun 26 03:19:05 Ubuntu-1804-bionic-64-minimal erigon[31704]: [INFO] [06-26|03:19:05.029] [p2p] GoodPeers eth66=100
Jun 26 03:21:05 Ubuntu-1804-bionic-64-minimal erigon[31704]: [INFO] [06-26|03:21:05.029] [p2p] GoodPeers eth66=99
Jun 26 03:23:05 Ubuntu-1804-bionic-64-minimal erigon[31704]: [INFO] [06-26|03:23:05.029] [p2p] GoodPeers eth66=99
Jun 26 03:25:05 Ubuntu-1804-bionic-64-minimal erigon[31704]: [INFO] [06-26|03:25:05.029] [p2p] GoodPeers eth66=99
Jun 26 03:27:05 Ubuntu-1804-bionic-64-minimal erigon[31704]: [INFO] [06-26|03:27:05.029] [p2p] GoodPeers eth66=99
Jun 26 03:29:05 Ubuntu-1804-bionic-64-minimal erigon[31704]: [INFO] [06-26|03:29:05.029] [p2p] GoodPeers eth66=100
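To make the reported load easier to reproduce, here is a rough sketch of a probe that sends about 5 JSON-RPC requests per second and reports slow or failed calls. It is an assumption-laden illustration, not the reporter's workload: the thread never says which methods were called, so the standard `eth_blockNumber` method is used as a stand-in, and the RPC URL (for example `http://127.0.0.1:8545`) must be adjusted to your node. The names `rpcRequest`, `blockNumberRequest`, and `probe` are made up.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
	"sync"
	"time"
)

// rpcRequest is a minimal JSON-RPC 2.0 request envelope.
type rpcRequest struct {
	JSONRPC string        `json:"jsonrpc"`
	Method  string        `json:"method"`
	Params  []interface{} `json:"params"`
	ID      int           `json:"id"`
}

// blockNumberRequest encodes an eth_blockNumber call with the given id.
func blockNumberRequest(id int) ([]byte, error) {
	return json.Marshal(rpcRequest{
		JSONRPC: "2.0",
		Method:  "eth_blockNumber",
		Params:  []interface{}{},
		ID:      id,
	})
}

// probe sends one request and returns how long the round trip took.
func probe(client *http.Client, url string, id int) (time.Duration, error) {
	body, err := blockNumberRequest(id)
	if err != nil {
		return 0, err
	}
	start := time.Now()
	resp, err := client.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return time.Since(start), err
	}
	defer resp.Body.Close()
	_, err = io.Copy(io.Discard, resp.Body) // drain so the connection is reused
	return time.Since(start), err
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: rpcload <rpc-url>   e.g. http://127.0.0.1:8545")
		return
	}
	url := os.Args[1]
	client := &http.Client{Timeout: 10 * time.Second}
	ticker := time.NewTicker(200 * time.Millisecond) // about 5 requests per second
	defer ticker.Stop()

	var wg sync.WaitGroup
	for id := 1; id <= 300; id++ { // roughly one minute of load
		<-ticker.C
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			took, err := probe(client, url, id)
			if err != nil {
				fmt.Printf("request %d failed after %s: %v\n", id, took, err)
			} else if took > 5*time.Second {
				fmt.Printf("request %d slow: %s\n", id, took)
			}
		}(id)
	}
	wg.Wait()
}
```

A run in which every request starts failing at the client timeout, while the node log shows only peer-count lines like those above, matches the symptom described in the issue.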