Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlock of sync & RPC under load. #4543

Closed
wetezos opened this issue Jun 26, 2022 · 61 comments · Fixed by #4667 or #4712
Closed

Deadlock of sync & RPC under load. #4543

wetezos opened this issue Jun 26, 2022 · 61 comments · Fixed by #4667 or #4712
Assignees

Comments

@wetezos
Copy link

wetezos commented Jun 26, 2022

System information

Erigon version: ./erigon --version
erigon version 2022.06.6-alpha-77240899
OS & Version: Windows/Linux/OSX
Linux version 4.15.0-163-generic (buildd@lcy01-amd64-021) (gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)) #171-Ubuntu SMP Fri Nov 5 11:55:11 UTC 2021
Commit hash :
(HEAD detached at v2022.06.06)

Issue

When I upgrade to the latest version, When the number of accesses per second reaches 5, the node does not synchronize the data,and the RPC is not accessible

logs

Jun 26 03:11:05 Ubuntu-1804-bionic-64-minimal erigon[31704]: [INFO] [06-26|03:11:05.029] [p2p] GoodPeers eth66=99 Jun 26 03:13:05 Ubuntu-1804-bionic-64-minimal erigon[31704]: [INFO] [06-26|03:13:05.029] [p2p] GoodPeers eth66=99 Jun 26 03:15:05 Ubuntu-1804-bionic-64-minimal erigon[31704]: [INFO] [06-26|03:15:05.029] [p2p] GoodPeers eth66=99 Jun 26 03:17:05 Ubuntu-1804-bionic-64-minimal erigon[31704]: [INFO] [06-26|03:17:05.029] [p2p] GoodPeers eth66=99 Jun 26 03:19:05 Ubuntu-1804-bionic-64-minimal erigon[31704]: [INFO] [06-26|03:19:05.029] [p2p] GoodPeers eth66=100 Jun 26 03:21:05 Ubuntu-1804-bionic-64-minimal erigon[31704]: [INFO] [06-26|03:21:05.029] [p2p] GoodPeers eth66=99 Jun 26 03:23:05 Ubuntu-1804-bionic-64-minimal erigon[31704]: [INFO] [06-26|03:23:05.029] [p2p] GoodPeers eth66=99 Jun 26 03:25:05 Ubuntu-1804-bionic-64-minimal erigon[31704]: [INFO] [06-26|03:25:05.029] [p2p] GoodPeers eth66=99 Jun 26 03:27:05 Ubuntu-1804-bionic-64-minimal erigon[31704]: [INFO] [06-26|03:27:05.029] [p2p] GoodPeers eth66=99 Jun 26 03:29:05 Ubuntu-1804-bionic-64-minimal erigon[31704]: [INFO] [06-26|03:29:05.029] [p2p] GoodPeers eth66=100

@AskAlexSharov
Copy link
Collaborator

Not enough logs.
maybe it does DB migration?

@wetezos
Copy link
Author

wetezos commented Jun 26, 2022

The node has been upgraded for about 3 days. No DB migration logs are seen. The migration should be completed

@AskAlexSharov
Copy link
Collaborator

Show more logs

@hritique
Copy link

hritique commented Jul 2, 2022

We are also facing the same issue. #4510
When there are clients connecting to the RPC, it stops syncing after a while and the RPC becomes unreachable
And the logs(pasted in the issue) keeps repeating forever.

After I removed the services that access the RPC, it has been working without an issue for the past week.

@wetezos
Copy link
Author

wetezos commented Jul 5, 2022

Show more logs

I had a test today, which reproduced this problem.
image

Jul 05 10:36:09 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:09.776] Commit cycle in=5.361719327s Jul 05 10:36:09 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:09.779] RPC Daemon notified of new headers from=15081437 to=15081438 header sending=12.143µs log sending=2.141746ms Jul 05 10:36:09 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:09.779] [1/16 Headers] Waiting for headers... from=15081438 Jul 05 10:36:18 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:18.606] [1/16 Headers] Processed highest inserted=15081439 age=14s Jul 05 10:36:18 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:18.687] [6/16 Execution] Completed on block=15081439 Jul 05 10:36:18 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:18.822] Timings (slower than 50ms) Headers=8.827s Execution=66ms IntermediateHashes=66ms Jul 05 10:36:18 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:18.822] Tables PlainState=68.8GB AccountChangeSet=155.2GB StorageChangeSet=226.9GB BlockTransaction=9.0GB TransactionLog=412.6GB FreeList=112.6MB ReclaimableSpace=112.6GB Jul 05 10:36:23 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:23.519] Commit cycle in=4.697331827s Jul 05 10:36:23 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:23.521] RPC Daemon notified of new headers from=15081438 to=15081439 header sending=10.67µs log sending=1.26173ms Jul 05 10:36:23 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:23.521] [1/16 Headers] Waiting for headers... from=15081439 Jul 05 10:36:25 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:25.882] [1/16 Headers] Processed highest inserted=15081440 age=7s Jul 05 10:36:25 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:25.927] [6/16 Execution] Completed on block=15081440 Jul 05 10:36:25 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:25.992] Timings (slower than 50ms) Headers=2.361s Jul 05 10:36:25 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:25.992] Tables PlainState=68.8GB AccountChangeSet=155.2GB StorageChangeSet=226.9GB BlockTransaction=9.0GB TransactionLog=412.6GB FreeList=112.6MB ReclaimableSpace=112.6GB Jul 05 10:36:29 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:29.429] Commit cycle in=3.436907386s Jul 05 10:36:29 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:29.430] RPC Daemon notified of new headers from=15081439 to=15081440 header sending=10.349µs log sending=729.224µs Jul 05 10:36:29 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:29.430] [1/16 Headers] Waiting for headers... from=15081440 Jul 05 10:36:29 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:29.434] [1/16 Headers] Processed highest inserted=15081441 age=3s Jul 05 10:36:29 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:29.441] [6/16 Execution] Completed on block=15081441 Jul 05 10:36:29 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:29.461] Commit cycle in=7.518964ms Jul 05 10:36:29 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:29.461] RPC Daemon notified of new headers from=15081440 to=15081441 header sending=4.679µs log sending=91.17µs Jul 05 10:36:29 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:29.461] [1/16 Headers] Waiting for headers... from=15081441 Jul 05 10:36:33 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:33.809] [1/16 Headers] Processed highest inserted=15081442 age=5s Jul 05 10:36:33 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:33.887] [6/16 Execution] Completed on block=15081442 Jul 05 10:36:34 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:34.026] Timings (slower than 50ms) Headers=4.348s Execution=58ms IntermediateHashes=67ms Jul 05 10:36:34 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:34.026] Tables PlainState=68.8GB AccountChangeSet=155.2GB StorageChangeSet=226.9GB BlockTransaction=9.0GB TransactionLog=412.6GB FreeList=112.6MB ReclaimableSpace=112.6GB Jul 05 10:36:38 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:38.506] Commit cycle in=4.480156732s Jul 05 10:36:38 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:38.508] RPC Daemon notified of new headers from=15081441 to=15081442 header sending=25.558µs log sending=2.164869ms Jul 05 10:36:38 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:38.508] [1/16 Headers] Waiting for headers... from=15081442 Jul 05 10:36:42 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:42.873] [1/16 Headers] Processed highest inserted=15081443 age=9s Jul 05 10:36:42 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:42.920] [6/16 Execution] Completed on block=15081443 Jul 05 10:36:43 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:43.000] Timings (slower than 50ms) Headers=4.364s Jul 05 10:36:43 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:43.000] Tables PlainState=68.8GB AccountChangeSet=155.2GB StorageChangeSet=226.9GB BlockTransaction=9.0GB TransactionLog=412.6GB FreeList=112.6MB ReclaimableSpace=112.6GB Jul 05 10:36:46 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:46.048] Commit cycle in=3.047941094s Jul 05 10:36:46 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:46.048] RPC Daemon notified of new headers from=15081442 to=15081443 header sending=7.634µs log sending=825.414µs Jul 05 10:36:46 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:46.048] [1/16 Headers] Waiting for headers... from=15081443 Jul 05 10:36:46 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:46.053] [1/16 Headers] Processed highest inserted=15081444 age=3s Jul 05 10:36:46 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:46.053] [6/16 Execution] Completed on block=15081444 Jul 05 10:36:46 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:46.054] Commit cycle in=354.903µs Jul 05 10:36:46 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:46.055] RPC Daemon notified of new headers from=15081443 to=15081444 header sending=3.316µs log sending=2.184µs Jul 05 10:36:46 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:46.055] [1/16 Headers] Waiting for headers... from=15081444 Jul 05 10:36:58 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:36:58.466] [p2p] GoodPeers eth66=43 Jul 05 10:37:00 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:37:00.210] [1/16 Headers] Processed highest inserted=15081445 age=16s Jul 05 10:37:00 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:37:00.383] [6/16 Execution] Completed on block=15081445 Jul 05 10:37:00 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:37:00.571] Timings (slower than 50ms) Headers=14.155s Execution=109ms IntermediateHashes=88ms Jul 05 10:37:00 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:37:00.571] Tables PlainState=68.8GB AccountChangeSet=155.2GB StorageChangeSet=226.9GB BlockTransaction=9.0GB TransactionLog=412.6GB FreeList=112.6MB ReclaimableSpace=112.6GB Jul 05 10:37:01 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:37:01.075] Commit cycle in=504.002941ms Jul 05 10:37:48 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:37:48.917] RPC Daemon notified of new headers from=15081444 to=15081445 header sending=8.707µs log sending=2.081634ms Jul 05 10:38:58 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:38:58.466] [p2p] GoodPeers eth66=42 Jul 05 10:40:58 Ubuntu-1804-bionic-64-minimal erigon[26301]: [INFO] [07-05|10:40:58.466] [p2p] GoodPeers eth66=42

@mandrigin mandrigin changed the title When the traffic is large, the node data synchronization is abnormal and the RPC is not accessible Deadlock of sync & RPC under load. Jul 5, 2022
@mandrigin mandrigin self-assigned this Jul 5, 2022
@mandrigin
Copy link
Collaborator

I will take a look, we suffer from that as well.

@mandrigin
Copy link
Collaborator

@AskAlexSharov there is definitely a deadlock somewhere in the built-in RPC service. It breaks both Engine API and RPCs. I am debugging.

@AskAlexSharov
Copy link
Collaborator

I can't reproduce. Try pkill -SIGUSR1 erigon

@mandrigin
Copy link
Collaborator

I can't reproduce that easily, it is pretty random. I think it happened during some hive tests and it definitely did happen on our validator nodes. As soon as I have one stuck, I will let you know.

@mandrigin
Copy link
Collaborator

@hritique @wetezos can you share your RPC load (methods thats are called)? Because it might be something with one of the methods, I didn't manage to pinpoint yet. If you can share which RPCs are being called, it would be much appreciated.

Or if you are ready to help, I can give you a branch with extra log output, so we can see the last PRC before the deadlock.

@kucharskim
Copy link

@mandrigin I'm happy to deploy debugging code on our Eirgon instance. I work with @hritique in the same team.

@mandrigin
Copy link
Collaborator

@kucharskim are you comfortable building from sources or should I pre-build a Docker image for you?

@mandrigin
Copy link
Collaborator

mandrigin commented Jul 5, 2022

anyway, here is the branch with extra logs: https://github.com/ledgerwatch/erigon/tree/issue4543

@wetezos
Copy link
Author

wetezos commented Jul 6, 2022

@hritique @wetezos can you share your RPC load (methods thats are called)? Because it might be something with one of the methods, I didn't manage to pinpoint yet. If you can share which RPCs are being called, it would be much appreciated.

Or if you are ready to help, I can give you a branch with extra log output, so we can see the last PRC before the deadlock.

anyway, here is the branch with extra logs: https://github.com/ledgerwatch/erigon/tree/issue4543

Got it , let me have a try.

@wetezos
Copy link
Author

wetezos commented Jul 6, 2022

I have a test just. This is the RPC log file. At present, the node is still running normally

test.log

@wetezos
Copy link
Author

wetezos commented Jul 6, 2022

I reproduced this problem just, this the last logs.

test.log

@kucharskim
Copy link

Yeah, from source is fine. I just built it, asked @hritique to deploy it and see how it goes.

@mandrigin
Copy link
Collaborator

okay, at some point it always times out it looks like, every requets takes 50s.

@wetezos if you still have the node running, can you do pkill -SIGUSR1 erigon and show the output?

same to you @kucharskim when you reproduce the bug

also also, please run it with --pprof flag and try to get me the goroutine trace like that

curl http://127.0.0.1:6060/debug/pprof/goroutines > goroutines.out

after you reproduce. I have more data to test now, but it will still be useful because you seem to be able to easily reproduce it.

@wetezos
Copy link
Author

wetezos commented Jul 6, 2022

if you still have the node running, can you do pkill -SIGUSR1 erigon and show the output?

Yes, our node is still running, but it has been restarted. The following is the log after pkill execution

pkill.log

@mandrigin
Copy link
Collaborator

@wetezos ah, but it syncs/works, pkill is useful when it stops syncing/responding to RPC, right now it doesn't show anything suspicious.

@wetezos
Copy link
Author

wetezos commented Jul 6, 2022

okay, at some point it always times out it looks like, every requets takes 50s.

@wetezos if you still have the node running, can you do pkill -SIGUSR1 erigon and show the output?

same to you @kucharskim when you reproduce the bug

also also, please run it with --pprof flag and try to get me the goroutine trace like that

curl http://127.0.0.1:6060/debug/pprof/goroutines > goroutines.out

after you reproduce. I have more data to test now, but it will still be useful because you seem to be able to easily reproduce it.

root@Ubuntu-1804-bionic-64-minimal:~# netstat -lntp | grep 60 tcp 0 0 127.0.0.1:6060 0.0.0.0:* LISTEN 24391/rpcdaemon root@Ubuntu-1804-bionic-64-minimal:~# curl http://127.0.0.1:6060/debug/pprof/goroutines Unknown profile

@mandrigin
Copy link
Collaborator

oh, sorry, then curl http://127.0.0.1:6060/debug/pprof/goroutine :) typo

@mandrigin
Copy link
Collaborator

also, this is necessary only when you reproduce the deadlock/rpc issue

@wetezos
Copy link
Author

wetezos commented Jul 6, 2022

also, this is necessary only when you reproduce the deadlock/rpc issue

Sure, this the out file.

goroutines.out.txt

@mandrigin
Copy link
Collaborator

If I see correctly, we seem to be leaking quite a bit of goroutines

-----------+-------------------------------------------------------
      2954   runtime.gopark
             runtime.selectgo
             google.golang.org/grpc.newClientStreamWithParams.func4
-----------+-------------------------------------------------------
      2897   runtime.gopark
             runtime.selectgo
             google.golang.org/grpc/internal/transport.(*Stream).waitOnHeader
             google.golang.org/grpc/internal/transport.(*Stream).RecvCompress (inline)
             google.golang.org/grpc.(*csAttempt).recvMsg
             google.golang.org/grpc.(*clientStream).RecvMsg.func1
             google.golang.org/grpc.(*clientStream).withRetry
             google.golang.org/grpc.(*clientStream).RecvMsg
             github.com/ledgerwatch/erigon-lib/gointerfaces/remote.(*kVTxClient).Recv
             github.com/ledgerwatch/erigon-lib/kv/remotedb.(*RemoteKV).BeginRo
             github.com/ledgerwatch/erigon/cmd/rpcdaemon/commands.(*APIImpl).Call
             reflect.Value.call
             reflect.Value.Call
             github.com/ledgerwatch/erigon/rpc.(*callback).call
             github.com/ledgerwatch/erigon/rpc.(*handler).runMethod
             github.com/ledgerwatch/erigon/rpc.(*handler).handleCall
             github.com/ledgerwatch/erigon/rpc.(*handler).handleCallMsg
             github.com/ledgerwatch/erigon/rpc.(*handler).handleMsg.func1
             github.com/ledgerwatch/erigon/rpc.(*handler).startCallProc.func1
-----------+-------------------------------------------------------

@mandrigin
Copy link
Collaborator

@wetezos you are running the RPC daemon separately, right? what if you can do the same but with the built-in one?

@mandrigin
Copy link
Collaborator

anyway, here is the branch with extra logs: https://github.com/ledgerwatch/erigon/tree/issue4543

@wetezos @kucharskim I updated the branch with a possible fix, the commit is 991451d

Try to upgrade and see if it resolves your issue

@mandrigin
Copy link
Collaborator

@wetezos unfortunately, that is not the full log of goroutines, I see it is trimmed quite a bit (the top half is missing). Maybe you can try to tee Erigon log to a file, so I can see the full log with all 2 days of load?

@wetezos
Copy link
Author

wetezos commented Jul 11, 2022

@mandrigin Here is the full log.
pkill.log

@mandrigin
Copy link
Collaborator

Okay, there is the proof that it is the same issue

(1)

Jul 11 03:19:04 Ubuntu-1804-bionic-64-minimal erigon[25204]: goroutine 86556807 [select, 1375 minutes]:
Jul 11 03:19:04 Ubuntu-1804-bionic-64-minimal erigon[25204]: github.com/ledgerwatch/erigon-lib/kv/mdbx.(*MdbxKV).BeginRo(0xc00110a0a0, {0x1f3b970, 0xc0ca3fe150})
Jul 11 03:19:04 Ubuntu-1804-bionic-64-minimal erigon[25204]:         github.com/ledgerwatch/erigon-lib@v0.0.0-20220707065230-95e361fa1ed7/kv/mdbx/kv_mdbx.go:395 +0xeb
Jul 11 03:19:04 Ubuntu-1804-bionic-64-minimal erigon[25204]: github.com/ledgerwatch/erigon-lib/kv/remotedbserver.(*KvServer).Tx(0xc0005c9c20, {0x1f44748, 0xc08f888a90})
Jul 11 03:19:04 Ubuntu-1804-bionic-64-minimal erigon[25204]:         github.com/ledgerwatch/erigon-lib@v0.0.0-20220707065230-95e361fa1ed7/kv/remotedbserver/server.go:77 +0x77

(2)
ul 11 03:19:04 Ubuntu-1804-bionic-64-minimal erigon[25204]: goroutine 86159493 [select, 1391 minutes]:
Jul 11 03:19:04 Ubuntu-1804-bionic-64-minimal erigon[25204]: github.com/ledgerwatch/erigon-lib/kv/mdbx.(*MdbxKV).BeginRo(0xc00110a0a0, {0x1f3b970, 0xc105f45b60})
Jul 11 03:19:04 Ubuntu-1804-bionic-64-minimal erigon[25204]:         github.com/ledgerwatch/erigon-lib@v0.0.0-20220707065230-95e361fa1ed7/kv/mdbx/kv_mdbx.go:395 +0xeb
Jul 11 03:19:04 Ubuntu-1804-bionic-64-minimal erigon[25204]: github.com/ledgerwatch/erigon-lib/kv/remotedbserver.(*KvServer).Tx(0xc0005c9c20, {0x1f44748, 0xc105f56660})
Jul 11 03:19:04 Ubuntu-1804-bionic-64-minimal erigon[25204]:         github.com/ledgerwatch/erigon-lib@v0.0.0-20220707065230-95e361fa1ed7/kv/remotedbserver/server.go:77 +0x77


(3)
Jul 11 03:19:04 Ubuntu-1804-bionic-64-minimal erigon[25204]: github.com/ledgerwatch/erigon-lib/kv/mdbx.(*MdbxKV).BeginRo(0xc00110a0a0, {0x1f3b970, 0xc10eef4e40})
Jul 11 03:19:04 Ubuntu-1804-bionic-64-minimal erigon[25204]:         github.com/ledgerwatch/erigon-lib@v0.0.0-20220707065230-95e361fa1ed7/kv/mdbx/kv_mdbx.go:395 +0xeb
Jul 11 03:19:04 Ubuntu-1804-bionic-64-minimal erigon[25204]: github.com/ledgerwatch/erigon-lib/kv/remotedbserver.(*KvServer).Tx(0xc0005c9c20, {0x1f44748, 0xc0f4857360})
Jul 11 03:19:04 Ubuntu-1804-bionic-64-minimal erigon[25204]:         github.com/ledgerwatch/erigon-lib@v0.0.0-20220707065230-95e361fa1ed7/kv/remotedbserver/server.go:77 +0x77


(4)
Jul 11 03:19:04 Ubuntu-1804-bionic-64-minimal erigon[25204]: goroutine 86323352 [select, 1385 minutes]:
Jul 11 03:19:04 Ubuntu-1804-bionic-64-minimal erigon[25204]: github.com/ledgerwatch/erigon-lib/kv/mdbx.(*MdbxKV).BeginRo(0xc00110a0a0, {0x1f3b970, 0xc1588b1620})
Jul 11 03:19:04 Ubuntu-1804-bionic-64-minimal erigon[25204]:         github.com/ledgerwatch/erigon-lib@v0.0.0-20220707065230-95e361fa1ed7/kv/mdbx/kv_mdbx.go:395 +0xeb
Jul 11 03:19:04 Ubuntu-1804-bionic-64-minimal erigon[25204]: github.com/ledgerwatch/erigon-lib/kv/remotedbserver.(*KvServer).Tx(0xc0005c9c20, {0x1f44748, 0xc158649770})
Jul 11 03:19:04 Ubuntu-1804-bionic-64-minimal erigon[25204]:         github.com/ledgerwatch/erigon-lib@v0.0.0-20220707065230-95e361fa1ed7/kv/remotedbserver/server.go:77 +0x77


(5)
Jul 11 03:19:04 Ubuntu-1804-bionic-64-minimal erigon[25204]: goroutine 86146784 [select, 1391 minutes]:
Jul 11 03:19:04 Ubuntu-1804-bionic-64-minimal erigon[25204]: github.com/ledgerwatch/erigon-lib/kv/mdbx.(*MdbxKV).BeginRo(0xc00110a0a0, {0x1f3b970, 0xc0f8adcf60})
Jul 11 03:19:04 Ubuntu-1804-bionic-64-minimal erigon[25204]:         github.com/ledgerwatch/erigon-lib@v0.0.0-20220707065230-95e361fa1ed7/kv/mdbx/kv_mdbx.go:395 +0xeb
Jul 11 03:19:04 Ubuntu-1804-bionic-64-minimal erigon[25204]: github.com/ledgerwatch/erigon-lib/kv/remotedbserver.(*KvServer).Tx(0xc0005c9c20, {0x1f44748, 0xc0a41450a0})
Jul 11 03:19:04 Ubuntu-1804-bionic-64-minimal erigon[25204]:         github.com/ledgerwatch/erigon-lib@v0.0.0-20220707065230-95e361fa1ed7/kv/remotedbserver/server.go:77 +0x77

Actually, there are at leat 7 goroutines stuck at this part, for a long time (see 1391 minutes and more).

@mandrigin
Copy link
Collaborator

I will need to check out more error scenarios

@revitteth
Copy link
Collaborator

@wetezos we've pushed changes which implement a semaphore on the transaction limit (#4712) please could you test? (Thank you for the repro btw, much appreciated!!!)

@wetezos
Copy link
Author

wetezos commented Jul 15, 2022

#4712

Sure, thanks for your hard work.

My version is right?

erigon --version
erigon version 2022.99.99-dev-e1dec529

git branch

  • devel

When i restart i got an error:
flag provided but not defined: -snapshots

@AskAlexSharov
Copy link
Collaborator

@wetezos show you cli command

@revitteth
Copy link
Collaborator

@wetezos forgive me if I'm missing something, but e1dec52 looks to be a commit from May, I assume the version should be at the tip of the devel branch, i.e. erigon version 2022.99.99-dev-92e2311

@wetezos
Copy link
Author

wetezos commented Jul 15, 2022

@wetezos forgive me if I'm missing something, but e1dec52 looks to be a commit from May, I assume the version should be at the tip of the devel branch, i.e. erigon version 2022.99.99-dev-92e2311

Great, i have upgraded to 2022.99.99-dev-92e2311e, let's have a try.

@wetezos
Copy link
Author

wetezos commented Jul 19, 2022

Hey @mandrigin @revittm We have been running for 3 days, and now it is normal

@kucharskim
Copy link

@hritique how are we doing? I see 4 days of uptime of a process running on:

# docker exec -ti 2839c5e592af cat /build-info
REPO_URL=https://github.com/ledgerwatch/erigon.git
REPO_BRANCH=devel
REPO_HASH=07e00b878c1ea38d39bf39f824c6bff433c94800

@revitteth
Copy link
Collaborator

Hey @mandrigin @revittm We have been running for 3 days, and now it is normal

Thanks so much, seems like it may have done the trick! @mandrigin are you happy this is resolved?

@mandrigin
Copy link
Collaborator

Yep, awesome!

@kaikash
Copy link

kaikash commented Aug 19, 2022

Hey! I'm facing the same issue :( I have forked version of erigon@e04401491fa310ec7bc54d0cc530cbaded6d70e8

From pkill -SIGUSR1 I can see many stucked goroutines

golang.org/x/sync/semaphore.(*Weighted).Acquire(0xc0000b3ef0, {0x1f60b60, 0xc0859249c0}, 0x1)
	golang.org/x/sync@v0.0.0-20220601150217-0de741cfad7f/semaphore/semaphore.go:60 +0x345
github.com/ledgerwatch/erigon-lib/erigon-lib/kv/mdbx.(*MdbxKV).BeginRo(0xc0000d7b80, {0x1f60b60, 0xc0859249c0})
	github.com/ledgerwatch/erigon-lib/erigon-lib@v0.0.0/kv/mdbx/kv_mdbx.go:399 +0x95
github.com/ledgerwatch/erigon-lib/erigon-lib/kv/remotedbserver.(*KvServer).Tx(0xc001180390, {0x1f69b18, 0xc086178760})
	github.com/ledgerwatch/erigon-lib/erigon-lib@v0.0.0/kv/remotedbserver/server.go:77 +0x77
github.com/ledgerwatch/erigon-lib/erigon-lib/gointerfaces/remote._KV_Tx_Handler({0x1770720?, 0xc001180390}, {0x1f65778?, 0xc0539ecdd0})
	github.com/ledgerwatch/erigon-lib/erigon-lib@v0.0.0/gointerfaces/remote/kv_grpc.pb.go:179 +0x9f
github.com/grpc-ecosystem/go-grpc-middleware/recovery.StreamServerInterceptor.func1({0x1770720?, 0xc001180390?}, {0x1f65778?, 0xc0539ecdd0?}, 0x77ca1c6c9428?, 0xc0539d4c00?)
	github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/recovery/interceptors.go:51 +0xb0
github.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1({0x1770720?, 0xc001180390?}, {0x1f65778?, 0xc0539ecdd0?})
	github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:49 +0x3a
github.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1({0x1770720, 0xc001180390}, {0x1f65778, 0xc0539ecdd0}, 0x16045c0?, 0xc086178750?)
	github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:58 +0xbf
google.golang.org/grpc.(*Server).processStreamingRPC(0xc05d06d6c0, {0x1f6a8a0, 0xc0a192a9c0}, 0xc0539d9200, 0xc0a12209f0, 0x2a75fa0, 0x0)
	google.golang.org/grpc@v1.46.2/server.go:1549 +0xebb
google.golang.org/grpc.(*Server).handleStream(0xc05d06d6c0, {0x1f6a8a0, 0xc0a192a9c0}, 0xc0539d9200, 0x0)
	google.golang.org/grpc@v1.46.2/server.go:1624 +0x9d6
google.golang.org/grpc.(*Server).serveStreams.func1.2()
	google.golang.org/grpc@v1.46.2/server.go:922 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
	google.golang.org/grpc@v1.46.2/server.go:920 +0x28a```

@kaikash
Copy link

kaikash commented Aug 19, 2022

It happens when I try to do many eth_call simultaneously via websocket rpc.

@kaikash
Copy link

kaikash commented Aug 20, 2022

tried to run fresh erigon version, this problem remains :(

@kaikash
Copy link

kaikash commented Aug 20, 2022

Created new issue #5125

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants