This repository has been archived by the owner on Nov 6, 2020. It is now read-only.

unreasonably high memory usage (without crash) and won't shut down #10821

Closed

iFA88 opened this issue Jun 30, 2019 · 48 comments
Labels
F7-footprint 🐾 An enhancement to provide a smaller (system load, memory, network or disk) footprint. M4-core ⛓ Core client code / Rust.

iFA88 commented Jun 30, 2019

Greetings. Sadly, my Parity-Ethereum/v2.4.6-stable-94164e1-20190514/x86_64-linux-gnu/rustc1.34.1 node uses an unreasonably large amount of memory.
Node log and process statistics in CSV: https://www.fusionsolutions.io/doc/memlog.tar.gz

Start parameters are:

--ipc-apis all --reserved-peers /own/config/archiveEthNode.txt --no-serve-light --no-periodic-snapshot --jsonrpc-allow-missing-blocks --no-persistent-txqueue --jsonrpc-server-threads 8 --ipc-path=/own/sockets/ethNode.ipc --min-gas-price=10000000 --tx-queue-mem-limit=4096 --tx-queue-size=256000 --reseal-on-txs=all --force-sealing --base-path "/mnt/node-1/eth" --rpcport 8548 --port 30306 --no-ws --no-secretstore --cache-size 4096 --log-file /own/log/nodes/eth/parity_eth_$DATE.log"

The memory usage will not go higher than 12 GB.

At 16:20:20 I killed the process with SIGKILL; this is the only way I can shut the process down.

I am glad to help with any trace parameters or statistics.

dvdplm (Collaborator) commented Jun 30, 2019

How did you collect the memory stats shown in the CSV?

iFA88 (Author) commented Jun 30, 2019

Like this, in Python:

import psutil

process = psutil.Process(PID)  # PID of the parity process
print(process.memory_info().rss)  # resident set size, in bytes

It gives the same value as htop.
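
For reference, a minimal sketch of how such a CSV of RSS samples could be produced with psutil; this is an assumption about the reporter's setup, and the PID, file name and interval are placeholders:

import csv, time
import psutil

PID = 12345  # placeholder: the parity process id

with open("memlog.csv", "a", newline="") as f:
    writer = csv.writer(f)
    proc = psutil.Process(PID)
    while True:
        # one row per sample: unix timestamp, resident set size in bytes
        writer.writerow([int(time.time()), proc.memory_info().rss])
        f.flush()
        time.sleep(60)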

dvdplm (Collaborator) commented Jun 30, 2019

So it's RSS, perfect! :)

The number of pending txs is pretty high; is that a normal amount in your setup?

iFA88 (Author) commented Jun 30, 2019

Yeah, I'm parsing pending transactions into my DB. Check my start parameters :)

dvdplm (Collaborator) commented Jun 30, 2019

Other than staying in sync, what is the node doing? I.e. what kind of RPC traffic is it used for?

iFA88 (Author) commented Jun 30, 2019

For every new block I'm using these RPCs: trace_block, eth_getBlockByNumber, eth_getUncleByBlockHashAndIndex, eth_blockNumber, eth_getTransactionReceipt.
And for pending transactions, every minute: parity_allTransactionHashes, eth_getTransactionByHash.
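
As an illustration only (not the reporter's actual code), a minimal sketch of this kind of polling over HTTP JSON-RPC, assuming the node's HTTP endpoint from the start parameters (--rpcport 8548) is reachable locally:

import requests

RPC_URL = "http://127.0.0.1:8548"

def rpc(method, params=None):
    # single JSON-RPC 2.0 call over HTTP
    payload = {"jsonrpc": "2.0", "id": 1, "method": method, "params": params or []}
    return requests.post(RPC_URL, json=payload, timeout=30).json()["result"]

# per new block
latest = rpc("eth_blockNumber")                      # hex string, e.g. "0x7b4a2e"
block = rpc("eth_getBlockByNumber", [latest, True])  # full transaction objects
traces = rpc("trace_block", [latest])                # needs tracing enabled on the node
receipts = [rpc("eth_getTransactionReceipt", [tx["hash"]]) for tx in block["transactions"]]

# once a minute, for pending transactions
pending_hashes = rpc("parity_allTransactionHashes")
pending = [rpc("eth_getTransactionByHash", [h]) for h in pending_hashes]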

dvdplm (Collaborator) commented Jun 30, 2019

I've been running a recent master build with your params now for ~6h and memory usage seems stable. While it's possible that this has been fixed in master, it is more probable that the leak is somewhere in the RPC layer. I need to set up some kind of load testing script to debug this further.

dvdplm (Collaborator) commented Jun 30, 2019

@iFA88 Do you have the possibility to confirm my findings by running a node without RPC traffic, just to check that it is indeed the RPC layer causing issues? Also, if you have a load testing script or something similar already written, that'd be helpful too ofc. Thanks!

dvdplm (Collaborator) commented Jul 1, 2019

In the log I see you must be running with --tracing on, but that's not present in the startup params from the original ticket. Are you using a config.toml file too? Can you post the full config please?

iFA88 (Author) commented Jul 1, 2019

I have run the node without any RPC calls, but the memory has increased continuously. Here is the log, but please ignore the peer and pending TX values:

https://www.fusionsolutions.io/doc/memlog2.tar.gz

Without any RPC calls, shutdown works very quickly.

iFA88 (Author) commented Jul 1, 2019

In the log I see you must be running with --tracing on, but that's not present in the startup params from the original ticket. Are you using a config.toml file too? Can you post the full config please?

You are right, tracing was ON while I synced from scratch, and after that it is automatically enabled.
No, I don't use any configuration file, only the parameters that I gave in the first ticket.

dvdplm (Collaborator) commented Jul 1, 2019

Ok, not a problem. It explains why I couldn't repeat it. I'd have to slow-sync the whole chain to reproduce it now, I think, so I'm going to try the Goerli testnet and see if the issue shows up there. If you have the means to do so, it would be great if you could try on Goerli as well using 2.4.x.

Thanks!

iFA88 (Author) commented Jul 1, 2019

@dvdplm Sadly not, but if you wish I can set some trace parameters.

dvdplm (Collaborator) commented Jul 1, 2019

Is your node synched?

iFA88 (Author) commented Jul 2, 2019

Is your node synched?

Ofc, and you have seen that in the logs.

dvdplm (Collaborator) commented Jul 2, 2019

Yeah, still synching Kovan here with traces. Goerli is synched and after 12+ hours shows no signs of memory leaks.

ordian (Collaborator) commented Jul 2, 2019

@dvdplm are you testing this on macOS? The problem could be related to heapsize, which uses jemallocator only on macOS.
@iFA88 could you test it with a recent master build? We've removed heapsize in #10432.

dvdplm (Collaborator) commented Jul 2, 2019

@ordian yes, and yes it is possible that this is a platform issue, but we'll see. For now I'm trying to rule out the obvious stuff. I'm not sure how long it takes to slow-sync mainnet with tracing on, but judging by how long it takes on Kovan I think it could take weeks, so I was hoping to find an easier way to reproduce this.

iFA88 (Author) commented Jul 2, 2019

@ordian I will upgrade my Parity to https://github.com/paritytech/parity-ethereum/releases/tag/v2.4.9; I see that this build has the commit.

I need to SIGKILL the process, because it doesn't shut down.
I have another node on the classic chain which is not affected by the issue (same Parity version, but it is an archive node with tracing).

ordian (Collaborator) commented Jul 2, 2019

@iFA88 I don't think so, #10432 wasn't backported to stable and beta.

iFA88 (Author) commented Jul 2, 2019

@ordian Isn't that the commit?:
v2.4.9...master
Sorry if I'm wrong.

ordian (Collaborator) commented Jul 2, 2019

@iFA88 you're comparing v2.4.9 with master, so it shows you the difference, i.e. the commits that are in master and not in 2.4.9.

iFA88 (Author) commented Jul 2, 2019

@ordian Yes, I was wrong! If you can build the current master branch for Linux, then I can use that; sadly I don't have any build tools right now.

dvdplm (Collaborator) commented Jul 3, 2019

@iFA88 I think you can download a recent nightly from here (click the "Download" button on the right). It would be great if you could repeat the problem using that.

An update on my end: Goerli is synched and does not leak any memory. Kovan is still synching (and has been really stable, but that is irrelevant here).

iFA88 (Author) commented Jul 3, 2019

@dvdplm Alright, I'm now running that binary. Idk why, but the classic chain works flawlessly.

I have a trace about the shutdown, please look at it:
https://www.fusionsolutions.io/doc/shutdownerror.tar.gz

dvdplm (Collaborator) commented Jul 3, 2019

@dvdplm Alright, I'm now running that binary. Idk why, but the classic chain works flawlessly.

You mean running with --chain classic using the master build does not leak memory? Or using stable?

I have a trace about the shutdown, please look at it: https://www.fusionsolutions.io/doc/shutdownerror.tar.gz

That is 2.4.6, so the latest fixes for shutdown problems are not included. It would be best to debug this further using the latest releases (or master builds). For shutdown issues it'd be good to enable shutdown=trace level logging. I don't think logging is going to provide enough info here, but best to keep it on.

iFA88 (Author) commented Jul 3, 2019

@dvdplm Yes, I have a classic node which runs in archive trace mode and its RES usage does not go above ~1.3 GB, not even with Parity-Ethereum/v2.4.6-stable-94164e1-20190514/x86_64-linux-gnu/rustc1.34.1 or Parity-Ethereum/v2.4.9-stable-691580c-20190701/x86_64-linux-gnu/rustc1.35.0.

I have left the shutdown trace parameter on and am now running Parity-Ethereum/v2.6.0-nightly-b4af8df-20190702/x86_64-linux-gnu/rustc1.35.0.

iFA88 (Author) commented Jul 3, 2019

Sadly the new Parity (Parity-Ethereum/v2.6.0-nightly-b4af8df-20190702/x86_64-linux-gnu/rustc1.35.0) didn't solve the memory issue:
https://www.fusionsolutions.io/doc/memlog3.tar.gz

dvdplm (Collaborator) commented Jul 3, 2019

Parity-Ethereum/v2.6.0-nightly

Ok, and just to be clear: you ran it on mainnet with tracing on just like before, same settings except for shutdown logging?

Did you also experience shutdown problems with Parity-Ethereum/v2.6.0-nightly?

iFA88 (Author) commented Jul 3, 2019

@dvdplm yes and yes :(

dvdplm (Collaborator) commented Jul 3, 2019

Ok, so @ordian, this tells us that this is not related to jemalloc, do you agree?

iFA88 (Author) commented Jul 4, 2019

Parity-Ethereum/v2.6.0-nightly-b4af8df-20190702/x86_64-linux-gnu/rustc1.35.0 crashed during the night. The process is still running and I can communicate with it through RPC, but the current block height is 8080446, so syncing has stopped. There was no incident in the kernel log or syslog. Free space was more than enough. I have switched back to Parity-Ethereum/v2.4.9-stable-691580c-20190701/x86_64-linux-gnu/rustc1.35.0.
Last log:

2019-07-04 00:42:30  Verifier #7 INFO import  Imported #8081174 0xc041…70bb (92 txs, 7.64 Mgas, 78 ms, 24.82 KiB)
2019-07-04 00:42:33  Verifier #8 INFO import  Imported #8081175 0x38ca…d29f (50 txs, 7.98 Mgas, 73 ms, 17.97 KiB)
2019-07-04 00:42:43  IO Worker #0 INFO import    35/50 peers    208 MiB chain  145 MiB db  0 bytes queue    7 MiB sync  RPC:  0 conn,    0 req/s,    0 µs
2019-07-04 00:42:43  Verifier #6 INFO import  Import
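
A small sketch of the kind of check described above (a hypothetical example, not from the ticket): since the node still answers RPC, sampling eth_blockNumber twice shows whether import has stalled. The endpoint is assumed from the start parameters (--rpcport 8548):

import time
import requests

RPC_URL = "http://127.0.0.1:8548"

def block_number():
    # returns the node's current block height as an int
    payload = {"jsonrpc": "2.0", "id": 1, "method": "eth_blockNumber", "params": []}
    return int(requests.post(RPC_URL, json=payload, timeout=10).json()["result"], 16)

first = block_number()
time.sleep(120)  # mainnet normally produces several blocks in two minutes
if block_number() == first:
    print("sync appears to be stalled at block", first)
else:
    print("node is still importing blocks")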

dvdplm (Collaborator) commented Jul 4, 2019

Ouch that doesn't sound good. When you say "crashed" do you mean that the process hung in some way or did it actually crash? I mean, you write that you could still query the node over RPC right?

I am still synching mainnet, am about half-way through but I anticipate it'll take a long while still.

I wonder if there's any way you could share your database with us to speed up the investigation?

iFA88 (Author) commented Jul 4, 2019

Ouch that doesn't sound good. When you say "crashed" do you mean that the process hung in some way or did it actually crash? I mean, you write that you could still query the node over RPC right?

I called it crashed because the logging and the syncing have stopped. Maybe the main thread has hung?! Yeah, I queried the block number to check whether the sync works or not.

I wonder if there's anyway you could share your database with us to speed up the investigation?

I would be glad to help, but I don't see how we can speed this up. If you wish I can set some parameters for Parity. If you have any ideas, share them.

iFA88 (Author) commented Jul 14, 2019

Is there anything I can do? The two Parity nodes which run on the main network eat all of my RAM after 1-2 days. A daily restart is not the best solution :(

@andrewheadricke commented:

I am facing similar issues with the latest Parity releases. I used to be able to sync easily and run other applications; however, now, after an hour or two, syncing consumes all my RAM and running other applications is not possible. Even Parity alone causes the computer to lock up.

Parity used to be faster to sync and lighter on RAM than Geth, but now I can control the RAM usage in Geth, so I am looking to switch back.

jam10o-new added the F7-footprint 🐾 and M4-core ⛓ labels Jul 15, 2019
iFA88 (Author) commented Aug 3, 2019

I suspect the shutdown problem occurs when I send a shutdown signal to the node but the node still accepts RPC calls, and that prevents the shutdown process.

iFA88 (Author) commented Aug 11, 2019

I have discovered that when I don't use the --cache-size parameter, the Parity RES usage doesn't go above 2 GB. When I use that parameter with ANY value, the memory usage goes up to 14 GB (probably more, but I don't have more free) within 24 hours.

iFA88 (Author) commented Sep 14, 2019

Hey @dvdplm ! Can you please check my last comment with the --cache-size issue? Thank you!

dvdplm (Collaborator) commented Sep 14, 2019

@iFA88 apologies for the late answer. I have not been able to reproduce the problem with RAM usage and --cache-size, and I have tried many different versions and chains. On my machine, running macOS with 32 GB, memory usage is very stable. I know this is kind of useless and it's much more interesting to see what happens on a machine with less RAM.
What happens on your end if you run with the other caching-related switches? This is what I am currently running: --cache-size-db=32096 --cache-size-blocks=2048 --cache-size-queue=32512 --cache-size-state=16096 (don't read too much into the specific numbers, I mostly picked them at random tbh). Do you still see RES ballooning after a while?

iFA88 (Author) commented Sep 14, 2019

@dvdplm Do we have any command to get cache statuses (usable/limit) or any debug level/trace?

dvdplm (Collaborator) commented Sep 14, 2019

No, not that I know of. It would be quite useful.

iFA88 (Author) commented Sep 14, 2019

I'm now running with the --cache-size-blocks=128 --cache-size-db=2048 parameters. I don't use --cache-size now.

iFA88 (Author) commented Sep 16, 2019

The node now uses 9150 MB RES after 2 days with the above parameters.

dvdplm (Collaborator) commented Sep 16, 2019

So I think I'm seeing something similar here: omitting the --cache* parameters seems to keep memory usage within limits. What I also see is that the sync speed slows down significantly as memory usage goes up (after a restart the sync speed goes back up). So until we fix the bug, I'd say the best work-around is to avoid using those params.

iFA88 (Author) commented Sep 16, 2019

I cannot measure the import speed because every block has very different EVM calls. I will now try using --cache-size again to check the issue.

iFA88 (Author) commented Sep 21, 2019

Ok, it seems the issue is somehow solved. When I'm using --cache-size (with --cache-size 2048) the process RES usage doesn't go above 7-9 GB. If I face this issue again I will reopen the thread. Thanks for the support!
