Requesting node metrics from the API at short intervals causes a high CPU load #7528
Labels:
effort/days: Estimated to take multiple days, but less than a week
exp/expert: Having worked on the specific codebase is important
kind/enhancement: A net-new feature or improvement to an existing feature
P3: Low, not priority right now
status/ready: Ready to be worked
topic/perf: Performance
Version information:
Description:
I'm running ipfs on a new server with SSD storage. I'm writing a lot of individual files with
ipfs add --chunker 'buzhash' --cid-version 1 --hash 'blake2b-256'
to the node, copying them to the right location in the MFS, and unpinning them again (since ipfs files write doesn't support setting a non-standard chunker). Afterwards, the MFS folder CID is pinned on ipfs-cluster, which runs on the same node.
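The per-file workflow looks roughly like this (a sketch; file.bin and the MFS folder /myfolder are placeholder names):

# add with the non-default chunker/CID settings; -Q prints only the final CID
cid=$(ipfs add -Q --chunker 'buzhash' --cid-version 1 --hash 'blake2b-256' file.bin)
# copy the file into place in the MFS
ipfs files cp "/ipfs/$cid" /myfolder/file.bin
# drop the pin created by ipfs add; the MFS reference keeps the blocks alive
ipfs pin rm "$cid"
# the folder CID that afterwards gets pinned on the cluster
ipfs files stat --hash /myfolder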
ipfs-cluster shows that all pins in the cluster pinset are pinned locally.
Another remote server also has all pins of the cluster pinset pinned; two other servers are still catching up, so they are receiving blocks from the local node.
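For reference, this per-peer pin state can be checked with ipfs-cluster-ctl (output format varies by version):

# lists every cluster pin and its state on each peer (PINNED, PINNING, ...)
ipfs-cluster-ctl status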
The low bandwidth use while the node should be sending a somewhat large folder to two other nodes brought a possible issue to my attention: the outgoing network speed was shown as around 4 Mbit/s, which is extremely slow for a server doing basically nothing else.
The CPU usage (around 200%) is extremely high for that amount of network traffic, so I thought it might still be publishing CIDs, and went to sleep.
System specs:
4 dedicated cores for the VM from an AMD EPYC 7702P 64-Core Processor; 16 GB of memory.
There are no background tasks running, just ipfs and ipfs-cluster; ipfs-cluster uses almost no CPU resources at all.
I tried changing the DHT type to dhtclient, but this resulted in no change. Restarting the service also made no difference; the CPU usage just jumps back up to around 200%. Attached are the debug data (I forgot to collect the last ones) and the binary, since it's built from master. If I read the CPU profile correctly, most of the CPU time is spent in go-ds-badger and go-ipfs-blockstore and in functions called by them (flame graph). The debug data was collected a few minutes after a restart of the IPFS daemon, while the ipfs-cluster-service was turned off.
debug.tar.gz
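For reference, this kind of debug data can be collected from the daemon's pprof endpoints on the API port (a sketch, assuming the default API address 127.0.0.1:5001):

# 30-second CPU profile (what the flame graph above is rendered from)
curl -s 'http://127.0.0.1:5001/debug/pprof/profile?seconds=30' -o ipfs.cpuprof
# heap profile and goroutine stacks
curl -s 'http://127.0.0.1:5001/debug/pprof/heap' -o ipfs.heap
curl -s 'http://127.0.0.1:5001/debug/pprof/goroutine?debug=2' -o ipfs.stacks
# inspect the CPU profile locally
go tool pprof -http=:8080 ipfs.cpuprof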
Here are some performance numbers collected on the system, which show essentially no difference in load even though there is only very low network traffic.
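Given the issue title, the load appears tied to metrics requests hitting the HTTP API at short intervals; a minimal way to reproduce that polling pattern (a sketch, assuming the default API address, with repo/stat standing in for whichever metrics endpoint the monitoring actually calls):

# poll a stats endpoint once per second and watch the daemon's CPU usage
while true; do
  curl -s -X POST 'http://127.0.0.1:5001/api/v0/repo/stat' > /dev/null
  sleep 1
done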
Config
DisableBandwidthMetrics and DisableNatPortMap are true; EnableAutoRelay and EnableRelayHop are false. I use the server profile, and Routing.Type is dhtclient. I use badgerds; StorageGCWatermark is 90 and StorageMax is 280GB. I use the systemd-hardening.service file from the repo, but changed the ExecStart to
/usr/bin/ipfs daemon --enable-gc --enable-pubsub-experiment --enable-namesys-pubsub
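For reproduction, the equivalent settings can be applied to a fresh node with ipfs config (a sketch; the key paths assume the go-ipfs layout that keeps the relay and NAT options under Swarm):

ipfs init --profile server,badgerds
ipfs config --json Swarm.DisableBandwidthMetrics true
ipfs config --json Swarm.DisableNatPortMap true
ipfs config --json Swarm.EnableAutoRelay false
ipfs config --json Swarm.EnableRelayHop false
ipfs config Routing.Type dhtclient
ipfs config --json Datastore.StorageGCWatermark 90
ipfs config Datastore.StorageMax 280GB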