
Improving sync parachain performance for workers #2888

Merged 15 commits into dev on Jul 23, 2024

Conversation

@silva-fj (Contributor) commented Jul 11, 2024

This PR improves parachain sync performance for the worker a bit. I've tested it with Litentry Rococo.

Here are the results:

Current:

[2024-07-11T15:03:34.444Z INFO  litentry_worker::parentchain_handler] [Litentry] Syncing blocks from 0 to 24043
[2024-07-11T15:03:34.444Z DEBUG litentry_worker::parentchain_handler] >>>>>>>>>>>>>>>>>>>> Starting batch sync <<<<<<<<<<<<<<<<<<<<
[2024-07-11T15:03:34.909Z INFO  litentry_worker::parentchain_handler] [Litentry] syncing parentchain to 24043. already synced until block 0. immediate import=true. est. remaining: 3h 6m 3s 526ms 747us 150ns
[2024-07-11T15:03:34.909Z DEBUG litentry_worker::parentchain_handler] [Litentry] Fetched 1001 blocks in 464ms 264us 889ns
[2024-07-11T15:03:38.867Z DEBUG litentry_worker::parentchain_handler] >>>>> [Litentry] Synced parentchain batch in 3s 958ms 550us 433ns
[2024-07-11T15:03:38.867Z INFO  litentry_worker::parentchain_handler] [Litentry] Synced 1001 out of 24043 finalized parentchain blocks
[2024-07-11T15:03:39.316Z DEBUG litentry_worker::parentchain_handler] [Litentry] Fetched 1001 blocks in 447ms 820us 558ns
[2024-07-11T15:03:46.684Z DEBUG litentry_worker::parentchain_handler] >>>>> [Litentry] Synced parentchain batch in 7s 368ms 798us 150ns
[2024-07-11T15:03:46.684Z INFO  litentry_worker::parentchain_handler] [Litentry] Synced 2002 out of 24043 finalized parentchain blocks
[2024-07-11T15:03:47.124Z DEBUG litentry_worker::parentchain_handler] [Litentry] Fetched 1001 blocks in 439ms 177us 262ns
[2024-07-11T15:03:54.511Z DEBUG litentry_worker::parentchain_handler] >>>>> [Litentry] Synced parentchain batch in 7s 386ms 474us 448ns
[2024-07-11T15:03:54.511Z INFO  litentry_worker::parentchain_handler] [Litentry] Synced 3003 out of 24043 finalized parentchain blocks
[2024-07-11T15:03:54.959Z DEBUG litentry_worker::parentchain_handler] [Litentry] Fetched 1001 blocks in 447ms 534us 184ns
[2024-07-11T15:04:02.491Z DEBUG litentry_worker::parentchain_handler] >>>>> [Litentry] Synced parentchain batch in 7s 532ms 744us 764ns
[2024-07-11T15:04:02.492Z DEBUG litentry_worker::parentchain_handler] [Litentry] Synced 4004 out of 24043 finalized parentchain blocks

New:

[2024-07-11T14:41:33.548Z INFO  litentry_worker::parentchain_handler] [Litentry] Syncing blocks from 0 to 23933
[2024-07-11T14:41:33.548Z DEBUG litentry_worker::parentchain_handler] >>>>>>>>>>>>>>>>>>>> Starting batch sync <<<<<<<<<<<<<<<<<<<<
[2024-07-11T14:41:33.635Z INFO  litentry_worker::parentchain_handler] [Litentry] syncing parentchain to 23933. already synced until block 0. immediate import=true. est. remaining: 34m 23s 468ms 389us 619ns
[2024-07-11T14:41:33.635Z DEBUG litentry_worker::parentchain_handler] [Litentry] Fetched 1000 blocks in 86ms 182us 830ns
[2024-07-11T14:41:33.635Z DEBUG litentry_worker::parentchain_handler] [Litentry] Found 1000 block(s) to sync in this chunk. immediate import=true
[2024-07-11T14:41:37.682Z DEBUG litentry_worker::parentchain_handler] >>>> [Litentry] Synced parentchain batch in 4s 47ms 236us 657ns
[2024-07-11T14:41:37.682Z INFO  litentry_worker::parentchain_handler] [Litentry] Synced 1000 out of 23933 finalized parentchain blocks
[2024-07-11T14:41:37.765Z DEBUG litentry_worker::parentchain_handler] [Litentry] Fetched 1000 blocks in 82ms 435us 816ns
[2024-07-11T14:41:37.765Z DEBUG litentry_worker::parentchain_handler] [Litentry] Found 1000 block(s) to sync in this chunk. immediate import=true
[2024-07-11T14:41:45.298Z DEBUG litentry_worker::parentchain_handler] >>>> [Litentry] Synced parentchain batch in 7s 533ms 92us 133ns
[2024-07-11T14:41:45.298Z INFO  litentry_worker::parentchain_handler] [Litentry] Synced 2000 out of 23933 finalized parentchain blocks
[2024-07-11T14:41:45.384Z DEBUG litentry_worker::parentchain_handler] [Litentry] Fetched 1000 blocks in 85ms 803us 899ns
[2024-07-11T14:41:45.384Z DEBUG litentry_worker::parentchain_handler] [Litentry] Found 1000 block(s) to sync in this chunk. immediate import=true
[2024-07-11T14:41:52.704Z DEBUG litentry_worker::parentchain_handler] >>>> [Litentry] Synced parentchain batch in 7s 319ms 210us 969ns
[2024-07-11T14:41:52.704Z INFO  litentry_worker::parentchain_handler] [Litentry] Synced 3000 out of 23933 finalized parentchain blocks
[2024-07-11T14:41:52.789Z DEBUG litentry_worker::parentchain_handler] [Litentry] Fetched 1000 blocks in 84ms 480us 849ns
[2024-07-11T14:41:52.789Z DEBUG litentry_worker::parentchain_handler] [Litentry] Found 1000 block(s) to sync in this chunk. immediate import=true
[2024-07-11T14:42:00.305Z DEBUG litentry_worker::parentchain_handler] >>>> [Litentry] Synced parentchain batch in 7s 516ms 377us 787ns
[2024-07-11T14:42:00.305Z INFO  litentry_worker::parentchain_handler] [Litentry] Synced 4000 out of 23933 finalized parentchain blocks
[2024-07-11T14:42:00.391Z DEBUG litentry_worker::parentchain_handler] [Litentry] Fetched 1000 blocks in 85ms 436us 539ns
[2024-07-11T14:42:00.391Z DEBUG litentry_worker::parentchain_handler] [Litentry] Found 1000 block(s) to sync in this chunk. immediate import=true
[2024-07-11T14:42:07.858Z DEBUG litentry_worker::parentchain_handler] >>>> [Litentry] Synced parentchain batch in 7s 467ms 196us 715ns
[2024-07-11T14:42:07.858Z INFO  litentry_worker::parentchain_handler] [Litentry] Synced 5000 out of 23933 finalized parentchain blocks

Notice how the time to fetch the blocks is significantly reduced by calling the API in parallel. There may be more room for improvement, but we need to measure first to find the bottlenecks, which is no easy task in development when we don't have the same data as in prod.

In the future (once we migrate to the new SGX SDK), we could use the async version of the substrate-api-client. More context is in the issue.
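The core of the change can be sketched as follows. This is a minimal illustration only: the PR itself uses rayon's `into_par_iter`, and `fetch_block` here is a hypothetical stand-in for the blocking substrate-api-client call; scoped std threads show the same "fetch the whole chunk in parallel" idea without the extra crate.

```rust
use std::thread;

// Hypothetical stand-in for the blocking RPC call that fetches one signed
// block from the parentchain node.
fn fetch_block(number: u32) -> u32 {
    number // placeholder payload
}

// Fetch a chunk of block numbers in parallel instead of one by one.
fn fetch_chunk_parallel(numbers: &[u32]) -> Vec<u32> {
    thread::scope(|s| {
        let handles: Vec<_> = numbers
            .iter()
            .map(|&n| s.spawn(move || fetch_block(n)))
            .collect();
        // Joining in spawn order keeps the blocks in ascending order.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let fetched = fetch_chunk_parallel(&[0, 1, 2, 3, 4]);
    assert_eq!(fetched, vec![0, 1, 2, 3, 4]);
    println!("fetched {} blocks", fetched.len());
}
```

With a network round-trip per block, the wall-clock cost of a chunk drops from the sum of all request latencies to roughly the slowest single request, which matches the ~450ms → ~85ms per-chunk fetch times in the logs above.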

@silva-fj silva-fj requested a review from a team July 11, 2024 16:53

linear bot commented Jul 11, 2024

@silva-fj (Contributor, Author)

Running lit_test_failed_parentchain_extrinsic.sh works for me locally; I'm not sure what the problem is with CI:

Using client binary ./../bin/litentry-cli
Using node uri ws://127.0.0.1:10044
Using trusted-worker uri ws://127.0.0.1:2000

CLIENT is: ./../bin/litentry-cli -p 10044 -P 2000 -u ws://127.0.0.1 -U ws://127.0.0.1
New Account created: 0xd266a81c0d51c9b0ece2f85daa3d659e4bde596d2fd42d555109730e70255f0f
Linking identity to Bob
Finished Linking identity to Bob
Capturing IDGraph Hash of Bob
Initial ID Graph Hash of Bob: 0x8cffcda6d69f27ba44bf82adf4b9a244b0c1560c2c057280f84c15ea381c2818
New Account created: 0x8c4f7ca222e5e30323979918435d7e77899130ee0923a2d30f76510580b13155
Linking new identity to Bob with Eve as delegate signer
Finished Linking identity to Bob
Capturing IDGraph Hash of Bob
Final ID Graph Hash of Bob: 0x8cffcda6d69f27ba44bf82adf4b9a244b0c1560c2c057280f84c15ea381c2818
Failed Parentchain extrinsic not executed

@BillyWooo (Collaborator) left a comment

LGTM. Good to know about into_par_iter.

@Kailai-Wang (Collaborator)

Running lit_test_failed_parentchain_extrinsic.sh works for me locally, I'm not sure what the problem is with CI

[…]

Hmm, if it only fails in this PR while working elsewhere, we need to look into it.

@silva-fj (Contributor, Author)

Hmm if it only fails in this PR while working elsewhere, we need to look into it

I can't reproduce it on azure-dev-2; I'll add some logs to see if I can find something on CI.

@silva-fj silva-fj force-pushed the p-859-sync-live-parachain-is-too-slow-for-workers branch 3 times, most recently from 17d76f2 to b049a3c on July 15, 2024 15:22
@felixfaisal (Member) left a comment

Thanks @silva-fj. I'm not convinced by the use of rayon and parallelism to speed up the syncing, and I don't think there is enough testing evidence to suggest otherwise. A few points:

  1. There is no control over the amount or level of parallelism: par_iter probably takes chunks of 1000 blocks at a time. Does that mean we spawn 1000 threads each time?
  2. If we don't have millisecond latency in response times from the node, then we potentially have 1000 threads idling and eating up resources (even if there is a 60s timeout).
  3. In CI, the indirect call execution takes more time than anticipated. I increased the sleep time and it passed - https://github.com/litentry/litentry-parachain/actions/runs/9943514750 - which would indicate the OS is having a tough time scheduling threads.
  4. We need more testing to show that other parts of the system, such as vc-task and link-identity (via STF), are not queuing up tasks due to the non-availability of threads.

Can we control the number of threads used by rayon and limit it?

@silva-fj (Contributor, Author)

Thanks @silva-fj So I'm not convinced with the use of rayon and parallelism to speed up the syncing […]

Can we control the number of threads used by rayon and limit it?

By default, rayon uses as many threads as there are CPUs available. On systems with hyperthreading enabled, this equals the number of logical cores, not the physical ones.

Yes, we can control the number of threads used, but please note that this runs during initialisation, so I think it's fine if it takes some resources. If this theory is correct, it probably just means that the resources available on CI are a bit limited, which is understandable. The test runs fine on the dev servers, so a lack of resources on CI could explain it; if that's the case, I think we should wait until the initialisation is done (maybe by calling the /is_initialized endpoint) before calling the CLI commands in the test.
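For reference, rayon itself exposes a cap on its global pool via `rayon::ThreadPoolBuilder::new().num_threads(n).build_global()`. The same bounding idea can be sketched with only the standard library; `fetch_block` is a hypothetical stand-in for the blocking RPC call, and processing the input in chunks of `max_threads` keeps at most that many worker threads alive at once:

```rust
use std::thread;

// Hypothetical stand-in for the blocking RPC call.
fn fetch_block(n: u32) -> u32 {
    n // placeholder payload
}

// Fetch all `numbers`, never running more than `max_threads` threads at once.
fn fetch_bounded(numbers: &[u32], max_threads: usize) -> Vec<u32> {
    let mut out = Vec::with_capacity(numbers.len());
    for chunk in numbers.chunks(max_threads) {
        let results: Vec<u32> = thread::scope(|s| {
            let handles: Vec<_> = chunk
                .iter()
                .map(|&n| s.spawn(move || fetch_block(n)))
                .collect();
            handles.into_iter().map(|h| h.join().unwrap()).collect()
        });
        out.extend(results);
    }
    out
}

fn main() {
    let numbers: Vec<u32> = (0..10).collect();
    assert_eq!(fetch_bounded(&numbers, 4), numbers);
    println!("fetched {} blocks with a bounded pool", numbers.len());
}
```

A fixed bound trades some throughput for predictable resource usage, which is the concern raised above about constrained environments like CI or an enclave.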

I'll try to test it a bit, if your theory is correct it would explain a lot, thank you for looking into it 👍🏼

So I'm not convinced with the use of rayon and parallelism to speed up the syncing and I don't think there is enough testing evidence to suggest otherwise

Can you elaborate on this? Ideally we should be using the async API to make these requests, but unfortunately we can't use it yet. Nevertheless, calling these blocking functions in parallel does help improve the sync a bit.

@felixfaisal (Member)

Can you elaborate on this? Ideally we should be using the async api to make these requests but unfortunately we can't use it yet. Nevertheless, calling these blocking functions in parallel does help to improve the sync a bit

Yeah, I would prefer async over multi-threading since these are mostly I/O-bound tasks, but I guess multi-threading is the only option we currently have. A separate infrastructure question is whether the node can support this many requests in prod without terminating the WS connection, but that's out of scope for this PR.

please note that this is used during initialisation, so I think it should be fine if it takes some resources

I missed that. In that case, I guess it's fine to allocate as much as possible.

I think we should probably wait until the initialisation is done (maybe calling the /is_initialized endpoint) before calling the cli commands in the test.

Sounds like a good idea

@silva-fj silva-fj force-pushed the p-859-sync-live-parachain-is-too-slow-for-workers branch from 754d506 to 5e690c0 on July 16, 2024 12:21
@silva-fj (Contributor, Author)

I've spent quite some time debugging on CI, trying to make the script work by calling the /is_initialized endpoint and running the tests after initialization, but I could not make it work (on CI); maybe I was not using the right port. For now, I've increased the sleep time before requesting the ID graphs to make the test pass.
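The wait-for-initialisation idea could be reduced to a generic poll-until helper along these lines. Everything here is illustrative: in the real test script the check would be an HTTP GET against the worker's /is_initialized endpoint, and the port and timings are assumptions.

```rust
use std::thread;
use std::time::{Duration, Instant};

// Poll `check` until it returns true or `timeout` elapses,
// sleeping `interval` between attempts.
fn wait_until<F: FnMut() -> bool>(mut check: F, timeout: Duration, interval: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if check() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        thread::sleep(interval);
    }
}

fn main() {
    // Simulated /is_initialized check that turns ready on the third poll.
    let mut calls = 0;
    let initialized = wait_until(
        || {
            calls += 1;
            calls >= 3
        },
        Duration::from_secs(1),
        Duration::from_millis(5),
    );
    assert!(initialized);
    println!("initialized after {calls} polls");
}
```

Polling with a deadline like this is more robust than a fixed sleep, because it adapts to however long initialisation actually takes on a given machine.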

@silva-fj silva-fj requested a review from a team July 18, 2024 07:12
@felixfaisal (Member) left a comment

It looks good. I'm still a bit skeptical about how this would work in prod or with a live chain. It would be nice to have someone else take a look.

@Kailai-Wang (Collaborator)

I'm ok-ish with it, although I'm not sure how many CPUs will be available within an enclave.

Can you test it a bit with the sdk-v2.0 branch before merging it? I ask because we have a lower TCSNum there.

@silva-fj (Contributor, Author)

I'm ok-ish with it, although I'm not sure how many CPUs will be available within an enclave.

Can you test it a bit with sdk-v2.0 branch before merging it? I ask this because we have a lower number of TCSNum there

There is no change in how the enclave API is called; it's still one call at a time on each loop. What's different now is how we collect the data to pass to the enclave API on each loop.
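A minimal sketch of the loop shape described above, with hypothetical stand-ins for the RPC and enclave interfaces: the fetch side is what the PR parallelises, while the enclave still receives exactly one call per chunk, as before.

```rust
// Hypothetical stand-ins for the worker's RPC and enclave interfaces.
struct SignedBlock(u32);

fn fetch_blocks(range: std::ops::Range<u32>) -> Vec<SignedBlock> {
    // In the PR, this fetch is what got parallelised.
    range.map(SignedBlock).collect()
}

fn sync_parentchain(blocks: &[SignedBlock]) -> usize {
    // Placeholder for the single enclave call per chunk; unchanged by the PR.
    blocks.len()
}

fn main() {
    let total = 5_000u32;
    let mut synced = 0;
    for start in (0..total).step_by(1_000) {
        let chunk = fetch_blocks(start..start + 1_000); // fetching changed
        synced += sync_parentchain(&chunk); // still one enclave call per loop
    }
    assert_eq!(synced, 5_000);
    println!("synced {synced} blocks");
}
```

This is why the enclave's thread budget (TCSNum) should be unaffected: the parallelism lives entirely on the untrusted side, before the data crosses into the enclave.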

@silva-fj (Contributor, Author)

Still a bit skeptical about how this would work in prod or with a live chain

If our production nodes can't handle this load, we have bigger problems than this 😆

@Kailai-Wang (Collaborator)

Cool, let's get it merged

@Kailai-Wang Kailai-Wang enabled auto-merge (squash) July 23, 2024 14:12
@Kailai-Wang Kailai-Wang merged commit f789aad into dev Jul 23, 2024
32 checks passed
@Kailai-Wang Kailai-Wang deleted the p-859-sync-live-parachain-is-too-slow-for-workers branch July 23, 2024 14:43