Probable goroutine leak running 1.0.2-beta #1002

Closed
maoueh opened this issue Sep 14, 2023 · 8 comments

Comments

maoueh (Contributor) commented Sep 14, 2023

System information

Bor client version: v1.0.2-beta
Heimdall client version:
OS & Version: Linux
Environment: Polygon Mumbai
Type of node: Full node
Additional Information: Running via the cmd/geth entrypoint (see the Details section below for more context)

Overview of the problem

At some point, the process starts to drift, unable to import new chain segments fast enough to keep up with the network. This problem happened yesterday at ~13:00 EST and again this morning at ~06:00 EST.

We have two full nodes syncing; both start lagging, though a bit differently (the timings are not exactly the same, roughly 1 hour apart).

Reproduction Steps

Sync with:

bor 
--networkid=80001 
--datadir=/data 
--ipcpath=/sf-data/reader/ipc 
--bor-mumbai 
--bor.heimdall=<heimdall_url>
--metrics 
--metrics.port=6061 
--metrics.addr=0.0.0.0 
--port=30303 
--syncmode=full 
--snapshot=false
--cache.snapshot=0 
--cache=8096 
--bootnodes=enode://320553cda00dfc003f499a3ce9598029f364fbb3ed1222fdc20a94d97dcc4d8ba0cd0bfa996579dcc6d17a534741fb0a5da303a90579431259150de66b597251@54.147.31.250:30303,enode://f0f48a8781629f95ff02606081e6e43e4aebd503f3d07fc931fad7dd5ca1ba52bd849a6f6c3be0e375cf13c9ae04d859c4a9ae3546dc8ed4f10aa5dbb47d4998@34.226.134.117:30303 
--http 
--http.port=8545 
--http.api=eth,net,web3 
--http.addr=0.0.0.0 
--http.vhosts=* 
--pprof 
--pprof.port=6062

Details

Note

As a disclaimer, we are still using the cmd/geth entrypoint for legacy reasons; I don't think the problem is related to that. In any case, I'm going to test with cmd/cli as the entrypoint today to see whether it makes a difference.

So, this morning when the problem happened, I took pprof profiles of different elements, namely the goroutine dump.
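For reference, a minimal sketch of how such a goroutine dump can be captured, assuming the --pprof --pprof.port=6062 flags from the command above; /debug/pprof/goroutine is the standard Go net/http/pprof endpoint, and the host/port and output file name here are assumptions, not anything prescribed by bor:

// Fetches the full goroutine dump from the node's pprof HTTP endpoint.
// Assumes --pprof --pprof.port=6062 as in the command above; debug=2 returns
// human-readable stacks, while dropping it yields the binary .pb.gz profile.
package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	resp, err := http.Get("http://127.0.0.1:6062/debug/pprof/goroutine?debug=2")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("goroutine_full.txt")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		panic(err)
	}
}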

It appears there are ~200K active goroutines when the node is grinding to a halt. 50% of them come from reportMetrics and 50% from github.com/JekaMas/workerpool.(*WorkerPool).dispatch.
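To put numbers on a split like that, here is a hypothetical helper (assuming the standard debug=2 text format; not necessarily how the figures above were obtained) that counts the goroutines in a full-text dump and groups them by the function that created them:

// Tallies goroutines in a debug=2 text dump by their "created by" frame.
package main

import (
	"bufio"
	"fmt"
	"os"
	"sort"
	"strings"
)

func main() {
	// Hypothetical usage: go run tally.go goroutine_full.txt
	f, err := os.Open(os.Args[1])
	if err != nil {
		panic(err)
	}
	defer f.Close()

	total := 0
	counts := map[string]int{}
	scanner := bufio.NewScanner(f)
	scanner.Buffer(make([]byte, 1024*1024), 1024*1024)
	for scanner.Scan() {
		line := scanner.Text()
		switch {
		case strings.HasPrefix(line, "goroutine "):
			// One header per goroutine, e.g. "goroutine 1234 [semacquire, 10 minutes]:".
			total++
		case strings.HasPrefix(line, "created by "):
			// Group by the function that spawned the goroutine.
			name := strings.TrimPrefix(line, "created by ")
			if i := strings.Index(name, " in goroutine "); i > 0 {
				name = name[:i] // newer Go versions append the parent goroutine ID
			}
			counts[name]++
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}

	type entry struct {
		creator string
		n       int
	}
	entries := make([]entry, 0, len(counts))
	for c, n := range counts {
		entries = append(entries, entry{c, n})
	}
	sort.Slice(entries, func(i, j int) bool { return entries[i].n > entries[j].n })

	fmt.Println("total goroutines:", total)
	for _, e := range entries {
		fmt.Printf("%8d  %s\n", e.n, e.creator)
	}
}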

(goroutine profile screenshots)

I imagine both are probably related. Stack for reportMetrics:

rpc.(*SafePool).reportMetrics
metrics.(*StandardHistogram).Update
metrics.(*ExpDecaySample).Update
metrics.(*ExpDecaySample).update
sync.(*Mutex).Lock
sync.(*Mutex).lockSlow
sync.runtime_SemacquireMutex
runtime.semacquire1
runtime.goparkunlock
runtime.gopark

Source:

github.com/ethereum/go-ethereum/rpc.(*SafePool).reportMetrics
github.com/ethereum/go-ethereum/rpc/execution_pool.go
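For illustration only, and not the actual execution_pool.go code: a minimal sketch of the pattern this stack suggests, where each metrics update runs in its own goroutine and everything serializes on one mutex-protected sample, so goroutines pile up whenever they arrive faster than the lock turns over:

// Illustrative sketch only (not the bor / go-ethereum code).
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// sample stands in for a metrics.ExpDecaySample-like structure: one shared
// mutex guards every Update call.
type sample struct {
	mu     sync.Mutex
	values []int64
}

func (s *sample) Update(v int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.values = append(s.values, v)
	time.Sleep(time.Millisecond) // simulate a slow critical section under contention
}

func main() {
	s := &sample{}

	// One goroutine per update, the way a per-call "go reportMetrics(...)" would behave.
	for i := 0; i < 10000; i++ {
		go func(v int64) { s.Update(v) }(int64(i))
	}

	// A second later, most goroutines are still parked in sync.(*Mutex).Lock,
	// mirroring the semacquire / runtime.gopark frames in the stack above.
	time.Sleep(time.Second)
	fmt.Println("goroutines still alive:", runtime.NumGoroutine())
}

With a steady stream of updates instead of a fixed loop, the count would keep growing rather than drain, which is consistent with the ~200K figure above.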

Logs / Traces / Output / Error Messages

Not attaching anything for now; let me know what you need. I have:

  • sf_problem_mumbai_0.log (node's log)
  • sf_problem_mumbai_0_goroutine.pb.gz
  • sf_problem_mumbai_0_goroutine_full.txt
  • sf_problem_mumbai_0_heap.pb.gz
  • sf_problem_mumbai_0_iostat.txt
  • sf_problem_mumbai_0_profile.pb.gz
maoueh (Contributor, Author) commented Sep 14, 2023

Also, this node is not exposed publicly to clients; furthermore, it receives little to no RPC traffic.

cffls (Contributor) commented Sep 15, 2023

Thanks @maoueh for reporting this issue. A potential fix is up in PR #1005. Could you test it out?

maoueh (Contributor, Author) commented Sep 15, 2023

Perfect @cffls. By the way, removing the --metrics flag removed the reportMetrics leak.

However, (*WorkerPool).dispatch is still visible. I'm going to test the PR in a few hours and report back, with all my flags put back (e.g. adding back --metrics).

github.com/JekaMas/workerpool.(*WorkerPool).dispatch
github.com/JekaMas/workerpool@v1.1.8/workerpool.go

maoueh (Contributor, Author) commented Sep 15, 2023

OK, fix deployed; I'll let you know in a few hours.

maoueh (Contributor, Author) commented Sep 15, 2023

I no longer see the worker pool and reportMetrics as outliers in the goroutine dump, so at first sight, after 30 minutes of running the fix, it seems good.

Will report back again later to see how it goes.

cffls (Contributor) commented Sep 15, 2023

That's great news. Thank you @maoueh for testing it out!

maoueh (Contributor, Author) commented Sep 18, 2023

It ran the full weekend without showing any signs of a goroutine leak, so the fix was the right one. Thanks guys!

cffls (Contributor) commented Sep 18, 2023

Thanks @maoueh for the update. Will close this issue.

cffls closed this as completed Sep 18, 2023