Probable goroutine leak running 1.0.2-beta #1002

Closed
maoueh opened this issue Sep 14, 2023 · 8 comments

Comments

maoueh (Contributor) commented Sep 14, 2023

System information

Bor client version: v1.0.2-beta
Heimdall client version:
OS & Version: Linux
Environment: Polygon Mumbai
Type of node: Full node
Additional Information: Running via the cmd/geth entrypoint (see the Details section below for more context)

Overview of the problem

At some point, the process starts to drift, unable to import new chain segments fast enough to keep up with the network. This problem happened yesterday at ~13:00 EST and again this morning at ~06:00 EST.

We have two full nodes syncing; both start lagging, though a bit differently (the timings are not exactly the same, roughly 1 hour apart).

Reproduction Steps

Sync with:

bor 
--networkid=80001 
--datadir=/data 
--ipcpath=/sf-data/reader/ipc 
--bor-mumbai 
--bor.heimdall=<heimdall_url>
--metrics 
--metrics.port=6061 
--metrics.addr=0.0.0.0 
--port=30303 
--syncmode=full 
--snapshot=false
--cache.snapshot=0 
--cache=8096 
--bootnodes=enode://320553cda00dfc003f499a3ce9598029f364fbb3ed1222fdc20a94d97dcc4d8ba0cd0bfa996579dcc6d17a534741fb0a5da303a90579431259150de66b597251@54.147.31.250:30303,enode://f0f48a8781629f95ff02606081e6e43e4aebd503f3d07fc931fad7dd5ca1ba52bd849a6f6c3be0e375cf13c9ae04d859c4a9ae3546dc8ed4f10aa5dbb47d4998@34.226.134.117:30303 
--http 
--http.port=8545 
--http.api=eth,net,web3 
--http.addr=0.0.0.0 
--http.vhosts=* 
--pprof 
--pprof.port=6062

Details

Note

As a disclaimer, we are still using the cmd/geth entrypoint for legacy reasons; I don't think the problem is related to that. In any case, I'm going to test with cmd/cli as the entrypoint today to see whether it makes a difference.

So, this morning when the problem happened, I took pprof profiles of different elements, namely the goroutine dump.
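For reference, a minimal sketch of how such a goroutine dump can be captured, assuming the --pprof --pprof.port=6062 flags from the command above; /debug/pprof/goroutine is the standard Go net/http/pprof endpoint, and the host/port and output file name here are assumptions, not anything prescribed by bor:

// Fetches the full goroutine dump from the node's pprof HTTP endpoint.
// Assumes --pprof --pprof.port=6062 as in the command above; debug=2 returns
// human-readable stacks, while dropping it yields the binary .pb.gz profile.
package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	resp, err := http.Get("http://127.0.0.1:6062/debug/pprof/goroutine?debug=2")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("goroutine_full.txt")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		panic(err)
	}
}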

It appears there are ~200K active goroutines when the node is grinding to a halt. 50% of them come from reportMetrics and 50% from github.com/JekaMas/workerpool.(*WorkerPool).dispatch.
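To put numbers on a split like that, here is a hypothetical helper (assuming the standard debug=2 text format; not necessarily how the figures above were obtained) that counts the goroutines in a full-text dump and groups them by the function that created them:

// Tallies goroutines in a debug=2 text dump by their "created by" frame.
package main

import (
	"bufio"
	"fmt"
	"os"
	"sort"
	"strings"
)

func main() {
	// Hypothetical usage: go run tally.go goroutine_full.txt
	f, err := os.Open(os.Args[1])
	if err != nil {
		panic(err)
	}
	defer f.Close()

	total := 0
	counts := map[string]int{}
	scanner := bufio.NewScanner(f)
	scanner.Buffer(make([]byte, 1024*1024), 1024*1024)
	for scanner.Scan() {
		line := scanner.Text()
		switch {
		case strings.HasPrefix(line, "goroutine "):
			// One header per goroutine, e.g. "goroutine 1234 [semacquire, 10 minutes]:".
			total++
		case strings.HasPrefix(line, "created by "):
			// Group by the function that spawned the goroutine.
			name := strings.TrimPrefix(line, "created by ")
			if i := strings.Index(name, " in goroutine "); i > 0 {
				name = name[:i] // newer Go versions append the parent goroutine ID
			}
			counts[name]++
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}

	type entry struct {
		creator string
		n       int
	}
	entries := make([]entry, 0, len(counts))
	for c, n := range counts {
		entries = append(entries, entry{c, n})
	}
	sort.Slice(entries, func(i, j int) bool { return entries[i].n > entries[j].n })

	fmt.Println("total goroutines:", total)
	for _, e := range entries {
		fmt.Printf("%8d  %s\n", e.n, e.creator)
	}
}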

(goroutine profile screenshots)

I imagine both are probably related. Stack for reportMetrics:

rpc.(*SafePool).reportMetrics
metrics.(*StandardHistogram).Update
metrics.(*ExpDecaySample).Update
metrics.(*ExpDecaySample).update
sync.(*Mutex).Lock
sync.(*Mutex).lockSlow
sync.runtime_SemacquireMutex
runtime.semacquire1
runtime.goparkunlock
runtime.gopark

Source:

github.com/ethereum/go-ethereum/rpc.(*SafePool).reportMetrics
github.com/ethereum/go-ethereum/rpc/execution_pool.go
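For illustration only, and not the actual execution_pool.go code: a minimal sketch of the pattern this stack suggests, where each metrics update runs in its own goroutine and everything serializes on one mutex-protected sample, so goroutines pile up whenever they arrive faster than the lock turns over:

// Illustrative sketch only (not the bor / go-ethereum code).
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// sample stands in for a metrics.ExpDecaySample-like structure: one shared
// mutex guards every Update call.
type sample struct {
	mu     sync.Mutex
	values []int64
}

func (s *sample) Update(v int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.values = append(s.values, v)
	time.Sleep(time.Millisecond) // simulate a slow critical section under contention
}

func main() {
	s := &sample{}

	// One goroutine per update, the way a per-call "go reportMetrics(...)" would behave.
	for i := 0; i < 10000; i++ {
		go func(v int64) { s.Update(v) }(int64(i))
	}

	// A second later, most goroutines are still parked in sync.(*Mutex).Lock,
	// mirroring the semacquire / runtime.gopark frames in the stack above.
	time.Sleep(time.Second)
	fmt.Println("goroutines still alive:", runtime.NumGoroutine())
}

With a steady stream of updates instead of a fixed loop, the count would keep growing rather than drain, which is consistent with the ~200K figure above.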

Logs / Traces / Output / Error Messages

Not attaching anything for now; let me know what you need. I have:

  • sf_problem_mumbai_0.log (node's log)
  • sf_problem_mumbai_0_goroutine.pb.gz
  • sf_problem_mumbai_0_goroutine_full.txt
  • sf_problem_mumbai_0_heap.pb.gz
  • sf_problem_mumbai_0_iostat.txt
  • sf_problem_mumbai_0_profile.pb.gz
maoueh (Contributor, Author) commented Sep 14, 2023

Also, this node is not exposed publicly to clients; furthermore, it receives little to no RPC traffic.

cffls (Contributor) commented Sep 15, 2023

Thanks @maoueh for reporting this issue. A potential fix is up in PR #1005. Could you test it out?

maoueh (Contributor, Author) commented Sep 15, 2023

Perfect @cffls. By the way, removing the --metrics flag removed the reportMetrics leak.

However, (*WorkerPool).dispatch is still visible. I'm going to test the PR in a few hours and report back, with all my flags put back (e.g. adding back --metrics).

github.com/JekaMas/workerpool.(*WorkerPool).dispatch
github.com/JekaMas/workerpool@v1.1.8/workerpool.go

maoueh (Contributor, Author) commented Sep 15, 2023

OK, fix deployed; I'll let you know in a few hours.

maoueh (Contributor, Author) commented Sep 15, 2023

I no longer see the worker pool and reportMetrics as outliers in the goroutine dump, so at first sight, after 30 minutes of running the fix, it seems good.

Will report back again later to see how it goes.

cffls (Contributor) commented Sep 15, 2023

That's great news. Thank you @maoueh for testing it out!

maoueh (Contributor, Author) commented Sep 18, 2023

It ran the full weekend without showing any signs of a goroutine leak, so the fix was the right one. Thanks guys!

cffls (Contributor) commented Sep 18, 2023

Thanks @maoueh for the update. Will close this issue.

cffls closed this as completed Sep 18, 2023