
[ADP-634] Run SMASH metadata fetching in batches of 15 concurrently #2432

Merged 3 commits into master from jospald/ADP-634/improve-smash-sync-time on Feb 16, 2021

Conversation

@hasufell (Contributor) commented Jan 8, 2021

Issue Number

ADP-634

Overview

All requests already share the same `Manager`. I'm not sure whether that implicitly leads to proper HTTP pipelining when used with `forConcurrently`, but it seems to work: the sync time on testnet goes from 3 minutes down to ~30 seconds. (A rough sketch of the approach follows the open questions below.)

Comments

Open questions

  1. Do we also want to add a way for Daedalus to check the sync status? If so, how do we define it? Syncing is continuous.
  2. Is this safe? We don't want to DoS SMASH.
  3. Do we want to make the batch size a setting? We already have a database table for settings...
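A minimal sketch of the batching idea described in the overview, assuming a shared `Manager` from http-client-tls and `forConcurrently_` from the async package; `fetchOne` is a hypothetical per-item fetch action, not the wallet's actual code:

```hs
import Control.Concurrent.Async (forConcurrently_)
import Control.Monad (forM_)
import Data.List.Split (chunksOf)
import Network.HTTP.Client (Manager)
import Network.HTTP.Client.TLS (newTlsManager)

-- Sketch only: one shared Manager, requests sent in fixed-size concurrent batches.
fetchInBatches :: Int -> (Manager -> ref -> IO ()) -> [ref] -> IO ()
fetchInBatches batchSize fetchOne refs = do
    manager <- newTlsManager                      -- shared by every request
    forM_ (chunksOf batchSize refs) $ \batch ->
        forConcurrently_ batch (fetchOne manager) -- at most batchSize in flight
```

Called as, say, `fetchInBatches 15 fetchOne refs` to match the batches of 15 in the title.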

@hasufell requested a review from KtorZ on January 8, 2021 06:14
@hasufell (Author) commented Jan 8, 2021

@piotr-iohk maybe you want to do some mainnet testing?

@piotr-iohk (Contributor)

I have done some testing against mainnet on Linux and Windows (on a synced network).

| System | Fetching time (not pipelined) | Fetching time (pipelined) |
| --- | --- | --- |
| Linux (Ubuntu, i7 6-core, 32 GB RAM) | ~8 minutes | ~1 minute |
| Windows (VM, 4 virtual CPUs, 4 GB RAM) | ~9 minutes | ~2 minutes |

In both cases (and on both systems) the cardano-wallet process memory consumption was steady at ~30-60 MB. Observable processor usage was up to ~5% on the Windows VM (via Task Manager).

I was wondering if it'd be worth adding such a benchmark to the benchmark suite for future reference?
I guess the problem would be determining when the fetching is over?

@hasufell (Author)

So it seems no one is really sure what http-client does under the hood when sharing a Manager. I opened a ticket: snoyberg/http-client#452

@rvl (Contributor) commented Jan 11, 2021

Use Wireshark to check what's happening (before and after). You don't need to decrypt the HTTPS; just look at how many TCP connections are being opened. Running HTTP requests in parallel is faster than in sequence because the connection setups happen concurrently.

@rvl (Contributor) commented Jan 11, 2021

From looking at the source, I think the connection idle timeout for the http-client Manager connection pool is 30 seconds. I could be mistaken however.

@hasufell (Author)

Following the http-client ticket:

  1. http-client doesn't do pipelining in any way, and it isn't supported
  2. the alternative, HTTP/2 multiplexing, isn't possible, since http-client doesn't support HTTP/2
  3. there's only one library that seems to do HTTP/2, and it has zero tests; how would I trust its correctness without any testing? (haskell-grpc-native/http2-client#16)

Given the improvement with forConcurrently, I propose we go this way, make it a setting that can be adjusted, and call it a day.

@rvl (Contributor) commented Jan 11, 2021

I just did some testing with Wireshark and GHCi: if you set the header `Connection: Keep-Alive` and the server supports it (the SMASH server ought to), then the TCP connection will not be closed between requests. I can see the TCP FIN packet 30 seconds after the last request. It's not pipelining, but this: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Keep-Alive
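A GHCi-style sketch of that kind of test, assuming http-client and http-client-tls; the URL is a placeholder, not the SMASH endpoint:

```hs
{-# LANGUAGE OverloadedStrings #-}

import Network.HTTP.Client
    ( httpLbs, parseRequest, requestHeaders, responseStatus )
import Network.HTTP.Client.TLS (newTlsManager)

main :: IO ()
main = do
    manager <- newTlsManager
    req0 <- parseRequest "https://example.com/"   -- placeholder URL
    let req = req0 { requestHeaders = ("Connection", "keep-alive") : requestHeaders req0 }
    resp <- httpLbs req manager
    print (responseStatus resp)
    -- Reusing 'manager' for subsequent requests keeps the TCP connection open
    -- until the pool's idle timeout closes it (the FIN seen ~30s later).
```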

@hasufell (Author)

I think we now have a pretty clear vision of how to move this forward:

  1. keep using the async package to spawn threads and let http-client Manager sharing kick in
  2. use a proper queue instead of batches, to ensure we're always waiting for 15 responses
  3. add information to https://input-output-hk.github.io/cardano-wallet/api/edge/#operation/listStakePools that indicates, per pool, whether its metadata is fetched, still fetching, or failed/retrying
  4. maybe: add another endpoint to query a single pool's metadata sync status, e.g. /stake-pools/{poolId}/metadata/sync-status

  I noticed this while syncing an empty wallet to look at stake pools. On a node which is also syncing, the wallet was spamming the reward account endpoint with an empty reward list. Interestingly, each request would take between 300ms and 500ms (which is quite a lot for an empty request...) and, I believe, would also require quite some time / a context switch on the node. I think we ought to have a sort of debouncer here (though arguably, once synced, this only gets triggered every 20s or so...); yet, a low-hanging fruit is simply to short-circuit everything when there's no account.
@KtorZ force-pushed the jospald/ADP-634/improve-smash-sync-time branch from 910740f to 6fc3208 on February 15, 2021 12:10
@KtorZ requested a review from rvl on February 15, 2021 12:10
@KtorZ (Member) commented Feb 15, 2021

@rvl the implementation has been revised to use a bounded queue. Testing on mainnet, it took a bit less than a minute to fetch all metadata on my machine. I think that's "okay-ish" / sufficient at this stage. As follow-up work, I imagine it could be useful to provide some progress stats to Daedalus (e.g. number of pending attempts, number of known registrations...).

@KtorZ (Member) commented Feb 15, 2021

There's one seemingly important yet subtle change in this PR with regard to the fetching strategy.

From:

```hs
when (null refs || null successes) $ do
    traceWith tr $ MsgFetchTakeBreak blockFrequency
    threadDelay blockFrequency
```

To:

```hs
when (null refs) $ do
    traceWith tr $ MsgFetchTakeBreak blockFrequency
    threadDelay blockFrequency
```

So, as of now, the wallet pauses for a whole 20s when all requests to fetch metadata have failed. I think the rationale was originally to prevent the wallet from just eating up CPU by trying to fetch only invalid pools... What this creates in practice is simply more unnecessary delays. There are about 8.5K pool registrations currently on mainnet, and after a full sync, there are still ~2254 fetch attempts left to process... so that's a lot of occasions to get a `null successes` on a particular loop.

There's already a backoff that comes with the fetch attempts, which prevents the wallet from doing needless work anyway. With this little change, we should massively improve the time needed to update metadata upon new registrations, such as the problem described in input-output-hk/daedalus#2359.
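For context, a rough illustration of the kind of exponential backoff being referred to; the doubling schedule and the names are assumptions, not the wallet's actual code:

```hs
import Data.Time.Clock (NominalDiffTime, UTCTime, diffUTCTime)

-- Assumed schedule: 1 min, 2 min, 4 min, ... doubling with each failed attempt.
retryDelay :: Int -> NominalDiffTime
retryDelay retryCount = fromIntegral ((2 :: Integer) ^ retryCount * 60)

-- A metadata ref is due for another attempt only once its backoff has elapsed.
isDue :: UTCTime -> Int -> UTCTime -> Bool
isDue lastAttempt retryCount now =
    now `diffUTCTime` lastAttempt >= retryDelay retryCount
```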

@rvl (Contributor) left a comment

👍 Nice.
The time taken to fetch all metadata is highly dependent on your ping to the SMASH server, so a 1-minute fetch in Europe could take longer from Australia.
How long did it take to fetch all metadata on your connection before?

```hs
enqueueMetadata
    :: TBQueue (PoolId, StakePoolMetadataUrl, StakePoolMetadataHash)
    -> IO ()
enqueueMetadata queue = forever $ do
    refs <- atomically (unfetchedPoolMetadataRefs 100)
```
Contributor

Perhaps set this number of refs to be smaller than queueSize - maybe the same as numInFlight.

Member

Whoops, I should've removed the magic number here and used `queueSize` instead. The thing is, I don't want the queue to be empty if there's something to fetch, so ideally this should be > `numInFlight`. The bounded queue will block on writes anyway if the queue is full.

```hs
h <- readTBQueue queue
q <- catMaybes <$> replicateM (numInFlight - 1) (tryReadTBQueue queue)
pure (h:q)
forConcurrently_ refs $ \(pid, url, hash) -> do
```
Contributor

This will still have uneven bursts of requests, won't it?
But if the fetch time is down to one minute, then it's not really a problem.

Member

I see what you mean; we could make the consumer send queries one by one (asynchronously) and actually use the queue's size to limit the number of requests in flight 🤔 ...

Member

Done 👍

Member

Whoops. Actually no, it's not that simple, because now `unfetchedPoolMetadataRefs` gets called multiple times in a row before it's correctly updated.

Member

Okay, now done properly. After a full run, there are no attempts with a `retry_count` greater than 3, which sounds right.

  This works by using a bounded queue as a 'seat allocator'. Requests
  are only emitted in an asynchronous thread when there's a seat at
  the table.

  On my machine, the total time is cut from ~320s to ~50s. It also
  guarantees that there are never more than N requests in flight
  and does a best effort at keeping the number of requests in
  flight at any time close to N (this works because the time needed
  to fetch the requests to send from the database is much smaller than
  the time needed to actually send and process the requests).
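A minimal sketch of the 'seat allocator' idea under these assumptions (not the PR's actual code; `withSeats` and `fetchOne` are hypothetical names): a bounded `TBQueue` of unit values is the table, writing `()` takes a seat (blocking when the queue is full) and reading `()` gives it back.

```hs
import Control.Concurrent.Async (async)
import Control.Concurrent.STM (atomically)
import Control.Concurrent.STM.TBQueue (newTBQueueIO, readTBQueue, writeTBQueue)
import Control.Exception (finally)
import Control.Monad (forM_, void)

withSeats :: Int -> (ref -> IO ()) -> [ref] -> IO ()
withSeats maxInFlight fetchOne refs = do
    seats <- newTBQueueIO (fromIntegral maxInFlight)
    forM_ refs $ \ref -> do
        atomically (writeTBQueue seats ())  -- blocks until a seat is free
        -- Each fetch runs in its own short-lived thread; a real implementation
        -- would also link or wait on these Asyncs to surface exceptions.
        void . async $
            fetchOne ref `finally` atomically (readTBQueue seats)
```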

@KtorZ force-pushed the jospald/ADP-634/improve-smash-sync-time branch from 6fc3208 to a6c05d1 on February 15, 2021 14:05
@KtorZ (Member) commented Feb 15, 2021

> @rvl: How long did it take to fetch all metadata on your connection before?

About 320s.
~50s now, after the last change.

@piotr-iohk (Contributor)

For me on mainnet:

  • it was: 8m 5s
  • with jospald/ADP-634/improve-smash-sync-time (a6c05d1): 50s

```hs
inFlights <- STM.atomically $ newTBQueue maxInFlight
forever $ do
    refs <- atomically (unfetchedPoolMetadataRefs $ fromIntegral maxInFlight)
    forM_ refs $ \(pid, url, hash) -> withAvailableSeat inFlights $ do
```
Author

This is a pretty neat pattern. Unrelatedly, I'm wondering if this is achievable with streamly or conduit.

Member

I actually thought of using a conduit here, to stream things directly from the database and then process them at the right pace. Yet, that's much more work on the db side, and the persistent API is quite unsatisfactory here (we use a custom raw SQL query behind the scenes, so most of the persistent API is just unusable). Plus, we would need a "refreshable" stream, because after requests are processed, the set of metadata refs to fetch actually changes.

```hs
    threadDelay blockFrequency
  where
    maxInFlight :: Natural
    maxInFlight = 20
```
Author

Probably doesn't matter for merging the PR, but did someone check other values? I'm not questioning that 20 is a reasonable default, but maybe there's more room to optimize? At which point will performance degrade or SMASH give up on us?

Member

I've tried 10, 20 and 50. There seems to be almost no difference; they all oscillate between 40s and 60s for an entire refresh 🤷‍♂️ .. I went for 10 in the end.

  I noticed that after running the new implementation, some metadata would have a retry_count way too high (internally, we use an exponential backoff, so it'd already take several days to reach ~6 or 7.. I found some with 20+ retries). It turns out this was because the refs to fetch are only updated _after_ a successful or unsuccessful fetch, and since fetching is asynchronous, it could happen that a single metadata ref was simply enqueued over and over until a response finally came back from the server to change its state. Yet, all the other queued requests would also count towards the retry_count... and would simply eat up all the retries in one go.
@rvl (Contributor) commented Feb 16, 2021

Thanks 👋

bors r+

@iohk-bors bot commented Feb 16, 2021

Build succeeded:

@iohk-bors bot merged commit 5ff6c51 into master on Feb 16, 2021
@iohk-bors bot deleted the jospald/ADP-634/improve-smash-sync-time branch on February 16, 2021 09:01
KtorZ added a commit that referenced this pull request Mar 17, 2021
  Since #2432 (concurrent pool metadata fetching), the fetching of
  metadata is now done in new short-lived threads forked from the
  'fetchMetadata' worker thread. Killing the parent 'fetchMetadata'
  thread does not actually kill child threads, which may still
  complete afterwards.

  As a consequence, when changing settings from smash (or direct) to
  none, in-flight requests spawned in short-lived threads may still
  resolve even if the 'fetchMetadata' loop has been terminated.

  This commit makes sure to write the result of a request only if the
  settings haven't changed since the parent thread was launched. This
  should prevent short-lived threads from updating the database after
  the parent is killed when changing settings.
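A hedged sketch of that guard, with hypothetical names standing in for the wallet's actual database layer:

```hs
import Control.Monad (when)

-- Re-read the settings right before persisting, and drop the fetch result if
-- they changed since the parent fetch loop captured them.
persistIfUnchanged
    :: Eq settings
    => IO settings          -- ^ read the current settings (e.g. from the DB)
    -> (result -> IO ())    -- ^ persist a fetch result
    -> settings             -- ^ settings captured when the fetch loop started
    -> result
    -> IO ()
persistIfUnchanged readSettings putResult settingsAtLaunch result = do
    settingsNow <- readSettings
    when (settingsNow == settingsAtLaunch) $
        putResult result
```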
iohk-bors bot added a commit that referenced this pull request Mar 18, 2021
2567: Guard pool metadata fetch result persistence with settings status r=rvl a=KtorZ

# Issue Number


ADP-634

# Overview



- Guard pool metadata fetch result persistence with settings status


Co-authored-by: KtorZ <matthias.benkort@gmail.com>
Co-authored-by: Piotr Stachyra <piotr.stachyra@iohk.io>