Improve performance syncing daemon with another daemon on the same local machine #7913

woodser · 2021-08-31T15:56:40Z

Syncing a daemon with latest blocks from the network is currently one of the slowest parts of using Monero.

When syncing the daemon, we should expect most time to be spent waiting on network transmission or other external factors.

However, we find that syncing one daemon from another daemon on the same local machine exhibits the same slowness with very low resource utilization of the host.

This suggests daemon sync speed can be improved dramatically by better utilizing resources of the host machine.

This issue requests investigating and improving performance of syncing one daemon from another local daemon in order to improve daemon sync speed.

Specifically requested is a list breaking down top time consumers when syncing locally, including references to related code in order to inform where optimizations are needed and focus developer effort.

One can sync locally in order to analyze performance by starting a local blockchain with 2 daemons syncing with each other:

1. Start daemon 1: ./monerod --stagenet --no-igd --hide-my-port --data-dir node1 --p2p-bind-ip 127.0.0.1 --p2p-bind-port 48080 --rpc-bind-port 48081 --zmq-rpc-bind-port 48082 --add-exclusive-node 127.0.0.1:38080 --rpc-login superuser:abctesting123 --rpc-access-control-origins http://localhost:8080 --fixed-difficulty 10
2. Start daemon 2: ./monerod --stagenet --no-igd --hide-my-port --data-dir node2 --p2p-bind-ip 127.0.0.1 --rpc-bind-ip 0.0.0.0 --confirm-external-bind --add-exclusive-node 127.0.0.1:48080 --rpc-login superuser:abctesting123 --rpc-access-control-origins http://localhost:8080 --fixed-difficulty 10
3. In either daemon, mine at least several hundred blocks, e.g.: start_mining 52aPELZwrwvVBNK4pvRZPNj4U5EEkZBsNTR2jozCLYyrhQySvYbWebTQEdt7RS9nFnRY9r88eFpt6UcsHKnVpCQDAFKu1Az 1
4. Stop daemon 1.
5. Pop some blocks from daemon 1's blockchain, e.g.: ./monero-blockchain-import --stagenet --data-dir ./node1 --pop-blocks 100
6. Restart daemon 1: ./monerod --stagenet --no-igd --hide-my-port --data-dir node1 --p2p-bind-ip 127.0.0.1 --p2p-bind-port 48080 --rpc-bind-port 48081 --zmq-rpc-bind-port 48082 --add-exclusive-node 127.0.0.1:38080 --rpc-login superuser:abctesting123 --rpc-access-control-origins http://localhost:8080 --fixed-difficulty 10
7. Observe that daemon 1 syncs relatively slowly and with very low resource utilization despite all operations being local.
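
To put a number on "relatively slowly", one option is to sample daemon 1's height twice over RPC and divide by the elapsed time. The following is a minimal sketch, not part of monerod: it assumes curl and jq are installed, reuses the RPC port and login from the commands above, and the function names (get_height, blocks_per_sec, sample_sync_rate) are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: estimate daemon 1's sync rate by sampling its height over RPC.
# Assumptions: curl and jq are available; port and login match the steps above.

RPC_URL="${RPC_URL:-http://127.0.0.1:48081/json_rpc}"
RPC_LOGIN="${RPC_LOGIN:-superuser:abctesting123}"

# Print the daemon's current height using the get_info JSON-RPC method.
# --rpc-login enables HTTP digest authentication, hence --digest.
get_height() {
    curl -s --digest -u "$RPC_LOGIN" "$RPC_URL" \
        -H 'Content-Type: application/json' \
        -d '{"jsonrpc":"2.0","id":"0","method":"get_info"}' |
        jq -r '.result.height'
}

# blocks_per_sec START_HEIGHT END_HEIGHT SECONDS
# Pure arithmetic helper: prints the rate with two decimal places.
blocks_per_sec() {
    awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.2f\n", (b - a) / s }'
}

# sample_sync_rate [SECONDS]: read the height twice, SECONDS apart (default 10).
sample_sync_rate() {
    local interval="${1:-10}" h0 h1
    h0=$(get_height)
    sleep "$interval"
    h1=$(get_height)
    echo "height $h0 -> $h1: $(blocks_per_sec "$h0" "$h1" "$interval") blocks/s"
}
```

After sourcing the file, running sample_sync_rate 30 while daemon 1 is catching up prints one line with the two heights and the rate over that window.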

selsta · 2021-08-31T19:26:04Z

> When syncing the daemon, we should expect most time to be spent waiting on network transmission or other external factors.

As far as I know daemon sync bottleneck is usually block verification and not networking related.

woodser · 2021-08-31T19:42:48Z

> As far as I know daemon sync bottleneck is usually block verification and not networking related.

Since block verification is a local operation, it should ideally consume 100% cpu if it's the bottleneck. All cpu cores are currently underutilized so there should be room for optimization as a local operation.

selsta · 2021-08-31T19:55:34Z

> Since block verification is a local operation, it should ideally consume 100% cpu if it's the bottleneck.

That implies that your disk storage has instant IO speeds. Verification requires looking up older transactions / blocks so a lot of non sequential reading. That's why SSD/NVMe speed up sync noticeably.
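
A rough way to tell the two cases apart on Linux is to sample how much CPU time the daemon actually used over an interval and compare it with what all cores could have delivered. This is only a sketch under stated assumptions: it reads /proc, so it is Linux-only, and the helper names (cpu_ticks, cpu_percent, sample_cpu) are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch (Linux only): how busy is one process, as a percentage of all cores?

# cpu_ticks PID: utime + stime (fields 14 and 15 of /proc/PID/stat), in clock ticks.
cpu_ticks() {
    # Strip "pid (comm) " up to the last ')' so a process name containing
    # spaces cannot shift the fields; utime and stime are then $12 and $13.
    sed 's/^.*) //' "/proc/$1/stat" | awk '{ print $12 + $13 }'
}

# cpu_percent TICKS_DELTA SECONDS TICKS_PER_SEC NCPU
# 100.0 means every core was fully busy with this process for the interval.
cpu_percent() {
    awk -v d="$1" -v s="$2" -v hz="$3" -v n="$4" \
        'BEGIN { printf "%.1f\n", 100 * d / (s * hz * n) }'
}

# sample_cpu PID [SECONDS]: measure the process over SECONDS (default 5).
sample_cpu() {
    local pid="$1" interval="${2:-5}" t0 t1
    t0=$(cpu_ticks "$pid")
    sleep "$interval"
    t1=$(cpu_ticks "$pid")
    cpu_percent "$((t1 - t0))" "$interval" "$(getconf CLK_TCK)" "$(nproc)"
}
```

A low figure from sample_cpu together with a high "wa" column in vmstat 1 would point at storage; a low figure with little I/O wait would point at something else, such as work that is not spread across cores.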

selsta · 2021-08-31T19:57:19Z

I'm sure things can be optimized, e.g. by using @vtnerd's ASM ECC lib (https://github.com/monero-project/supercop/tree/monero), which is currently only used for wallet scanning and not daemon sync.

cirocosta · 2021-08-31T22:55:25Z

I just recorded a run of two monerod instances (compiled from current master, 8fde011) with pretty much the setup @woodser suggested, but synced with stagenet and popping 7k blocks,

using a mix of perf for capturing the samples and doing the first reporting, then flamegraph and speedscope to visualize.

Focusing on the [unknown] (symbol not found) frames, it appears that the lib might help, but not substantially, as functions like fe_mul don't account for all that much once we're past the checkpoints zone.

Is that right? I'm not familiar with this part of the codebase.

selsta · 2021-08-31T22:57:07Z

Do you have stats for the checkpoints zone? Speeding that part up would also be interesting, as it accounts for the majority in a fresh sync.

cirocosta · 2021-08-31T23:13:20Z

@selsta, just gave it a run here - starting the sync from scratch this time, we can see that most of the samples are simply cn_slow_hash (90% of total):

to generate these:

# record the samples ($pid is the monerod process to profile,
# $RECORD_DURATION is in seconds)
#
        sudo perf record -F 500 -g -p "$pid" -- sleep "$RECORD_DURATION"


# do some terminal-based exploration
#
        sudo perf report

# output in a way that flamegraph can consume and then generate svg
# see https://github.com/brendangregg/FlameGraph
#
        sudo perf script > perf.script
        stackcollapse-perf.pl perf.script > ./stacks-collapsed
        flamegraph.pl ./stacks-collapsed > flamegraph.svg     # (or open `stacks-collapsed` on https://www.speedscope.app/)
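
For a quick number without opening the SVG, the collapsed stacks can also be summed directly. The sketch below assumes the usual stackcollapse-perf.pl output format, one line per unique stack of the form "frame;frame;...;frame COUNT"; frame_share is a made-up helper name, not part of FlameGraph.

```shell
#!/usr/bin/env bash
# frame_share FILE SYMBOL: percentage of samples whose stack contains SYMBOL.
# FILE is expected to hold collapsed stacks: "frame;frame;...;frame COUNT".
frame_share() {
    awk -v sym="$2" '
        {
            count = $NF                 # last field is the sample count
            total += count
            stack = $0
            sub(/ [0-9]+$/, "", stack)  # strip the trailing count
            n = split(stack, frames, ";")
            for (i = 1; i <= n; i++) {
                # count each stack once, even if SYMBOL recurses
                if (frames[i] == sym) { hits += count; break }
            }
        }
        END { if (total > 0) printf "%.1f\n", 100 * hits / total }
    ' "$1"
}
```

For example, frame_share ./stacks-collapsed cn_slow_hash prints the share of samples that have cn_slow_hash anywhere on the stack. The match is on the exact frame name, so symbols that perf reports with a suffix or offset need to be passed as they appear in the file.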

vtnerd · 2021-08-31T23:17:52Z

I've already been looking at this for a few months - I've done some recent PRs to help with data copying that occurs on both the sending and receiving side of the p2p protocols. They help a bit, but as @cirocosta points out they do not constitute the majority of the time spent syncing. I think there are some improvements to be done in the "block_span" code with copying as well, but there's still that wall of cryptography.

@selsta the ECC library was not used with synchronization to reduce the possibility of a chain fork with different implementations. Perhaps it could be used only during synchronization but not after?

selsta · 2021-08-31T23:18:50Z

> @selsta the ECC library was not used with synchronization to reduce the possibility of a chain fork with different implementations. Perhaps it could be used only during synchronization but not after?

Yes, that was my intention. Only use it for historical sync.

vtnerd · 2021-08-31T23:22:07Z

#7803 is the outstanding work on data copying, but if you look at the numbers they aren't the dominating time for a ryzen3 desktop chip. My expectation is that CPUs with smaller caches will benefit more with that patch (less cache thrashing).

@selsta that's an intriguing thought that I will look into. The major issue is that the code is focused on wallet scanning, and will need some more functions for bulletproofs, etc.

woodser · 2021-09-02T16:15:28Z

> @selsta that's an intriguing thought that I will look into. The major issue is that the code is focused on wallet scanning, and will need some more functions for bulletproofs, etc.

Throwing in my support. Switching the ECC library for historical sync could improve sync time and UX substantially.

boogerlad · 2021-09-22T16:45:26Z

> > Since block verification is a local operation, it should ideally consume 100% cpu if it's the bottleneck.
>
> That implies that your disk storage has instant IO speeds. Verification requires looking up older transactions / blocks so a lot of non sequential reading. That's why SSD/NVMe speed up sync noticeably.

would 3d xpoint or a ramdrive be even faster then?

selsta · 2021-09-22T16:46:57Z

The whole blockchain in RAM would speed sync up, yes. Not familiar with "3d xpoint".

boogerlad · 2021-09-22T17:12:51Z

instead of NAND flash used in most SSDs, it uses a kind of phase change memory. See here: https://ark.intel.com/content/www/us/en/ark/products/123623/intel-optane-ssd-900p-series-280gb-2-5in-pcie-x4-20nm-3d-xpoint.html

I have one of those fancy SSDs and wouldn't mind running benchmarks if there are some bash scripts to do so.

How much faster would the whole blockchain in RAM be? Would the bottleneck still be i/o or would it now be the CPU?

woodser mentioned this issue Sep 2, 2021

Provide reports on Haveno's performance[ɱ2] haveno-dex/haveno#20

Closed
