watchtower: reduce AckedUpdate storage footprint #7055
Conversation
Force-pushed e2ca32a to cf1d99c
Force-pushed cf1d99c to 01a530c
Force-pushed 1683be8 to 94005e4
Very cool PR! ✨ Did just a very light pass to understand the intuition behind the changes. Concept ACK! 🌮
Will provide a more detailed review after @Roasbeef has taken another look.
watchtower/wtdb/client_db.go (Outdated)

    // Get the range index for the session-channel pair.
    chanID := byteOrder.Uint64(dbChanID)
    index, err := c.getRangeIndex(
Here we could also use the in-memory cache exclusively if we know that all range indexes for all channels are already cached. Maybe not as critical if NumAckedUpdates isn't called too frequently.
yeah, the reason I didn't use the in-memory one here is that this iterates through all sessions (even ones we will never use again) and would result in them then being stored in the cache even if they are never needed again
Initial pass through the diff; need to revisit some of the existing client code to build a better mental model to review these changes.
At a glance, a few things pop out at me:
- will there ever actually be a gap in the commitment height that a tower has ACK'd?
- the new range index design seems to assume that a tower either deletes some random states, or that a client can somehow skip uploading a set of states (causing a gap in the set of ACK'd heights)
- for the mapping of the 32-byte cid, why can't we just use the existing 8-byte scid instead?
- We'll need to properly test these persistent changes to ensure they don't lead to a performance regression in practice. In particular I'm curious to see how this fares in a remote db setting, given there seems to be a few additional layers of bucket nesting (which can mean additional round trips w/ the current kv mapping)
watchtower/wtdb/range_index.go (Outdated)

    "sync"
    )

    // RangeIndex can be used to keep track of which numbers have been added to a
I think I'm missing the motivation a bit here, in which case would there be a series of disparate gaps in what we've uploaded to the tower?
In other words, afaict, even if we're down for some time, we won't have gaps in prior heights uploaded unless the tower actually deleted information.
In the case where we're switching to a tower and want to know the extent to which we've re-uploaded our prior states, we'd still only need to track a single height (the final height that we've uploaded).
in which case would there be a series of disparate gaps in what we've uploaded to the tower?
If we have more than one tower, we might switch to a session with a different tower for a bit if the first one goes down. Also, on startup we just pick a random session to start sending updates on. In the ideal case, yes, there would not be gaps, but we need to account for there being gaps. In future we may even want a more "round robin" type thing where we cycle through sessions more often & don't just use one till exhaustion before moving on to the next one.
In other words, afaict, even if we're down for some time, we won't have gaps in prior heights uploaded unless the tower actually deleted information.
There is nothing that ensures that we send the commitment update heights in a monotonically increasing way on a single session. On startup we just pick a random non-exhausted session to continue on.
In the case where we switching to a tower and want to know the extent that we've re-uploaded our prior states, we'd still only need to track a single height (the final height that we've uploaded).
Don't think I agree - if we have some of the commitments for a channel backed up to one tower and some to another, & then we want to remove the one tower, we just want to know the heights we uploaded to that session so that we know which ones we need to re-upload to the new tower
There is nothing that ensures that we send the commitment update heights in a monotonically increasing way on a single session.
Can you point to the code that makes this not monotonically increasing? I thought the queues involved in sending to the tower were FIFO?
Don't think I agree - if we have some of the commitments for a channel backed up to one tower and some to another, & then we want to remove the one tower, we just want to know the heights we uploaded to that session so that we know which ones we need to re-upload to the new tower
I couldn't find where this logic is - can you point me to it? Or is this something for the future?
I thought the queues involved in sending to the tower were FIFO?
they are FIFO. But what can happen is that: we send backups for heights 1, 2, 3 on session A. Then we restart but this time choose session B as our active session (cause maybe the tower for session A is temporarily down) so we send heights 4, 5, 6 on session B. Now we restart again and choose session A again etc.
See here where we load all our candidate sessions (map!) that we will iterate through to find one that works
I couldn't find where this logic is - can you point me to it? Or is this something for the future?
I think my comment above answers this. Let me know if it does not
So you're saying that there can be gaps on session A? This makes sense. It's been a while since I took real analysis, but what tripped me up is that you said the heights cannot be monotonically increasing. I thought, but can't find a good internet answer, that a monotonically increasing sequence could have gaps - the only constraint here being that the sequence increases. I am curious to know the answer but it doesn't matter for this patch.
oh, I see what you mean. I was mixing up the definition of "monotonically".
So:
- session seq number will be monotonically increasing without gaps.
- commitment height of a channel within a session is monotonically increasing and could possibly have gaps.
I think it's quite easy to work around those since we can cache a range index, and then there's no need for nesting down anymore after that if done properly, see: #7055 (comment)
I had a look at the migrations and the general approach, and I think this is a great idea. Even though some ranges might end up being tiny (in weird multi-WT scenarios), this shouldn't be an issue when compared to the status quo. Instead of iterating through the ranges a binary search might be better, but I don't think this is relevant for a one-off migration with just a couple of ranges (i.e. not millions).
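The binary-search idea mentioned above could look roughly like this in Go (a sketch with hypothetical types, not the PR's actual code, assuming the ranges are kept as a sorted slice):

```go
package main

import (
	"fmt"
	"sort"
)

// rng is an inclusive [Start, End] range. A range index keeps its ranges
// disjoint and sorted, which is what makes binary search possible.
// (Hypothetical representation, for illustration only.)
type rng struct{ Start, End uint64 }

// contains reports whether height h falls inside one of the sorted,
// disjoint ranges, using binary search instead of a linear scan.
func contains(ranges []rng, h uint64) bool {
	// Find the first range whose end is >= h; if h is covered at all,
	// it can only be by that range.
	i := sort.Search(len(ranges), func(i int) bool {
		return ranges[i].End >= h
	})

	return i < len(ranges) && ranges[i].Start <= h
}

func main() {
	ranges := []rng{{1, 3}, {7, 9}}
	fmt.Println(contains(ranges, 2)) // covered by [1, 3]
	fmt.Println(contains(ranges, 5)) // falls in the gap between ranges
}
```

As noted, for a one-off migration over a handful of ranges the linear scan is perfectly fine; the O(log n) lookup only matters if an index ever holds very many ranges.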
Force-pushed d192c7e to 51ff437
Aside from the RAM spike, the migration worked out just fine and my node did a few forwards. I restarted again to compact the DB:
Before the migration the DB had a size of 4650852352 bytes, so it's more like 4.17x. Nice, thanks a lot! PS: With gzip I was able to bring this down to 65 MByte, so maybe there's more potential.
ok cool - so with gzip, the gain is about 72x? awesome! Thanks for giving this a spin @C-Otto! Should get even more gains once the …

Also - my original rough estimate for the db gains a node would get if they only had 1 channel was 1792x. So am I correct in saying that the node you migrated has, over its lifetime, roughly always had about 24 active channels at any given time (very rough estimate)?
For each session I checked, I see a bunch of channels - more than 30 each. But I'd say 24 active channels per session sounds very realistic to me!
Force-pushed 1faac9f to fd37c72
Report

I wrote a script here that allows you to quickly create a large … Using this script, I created a …

End result:

After migrating this DB and then compacting it, I ended up with a db of size 185MB (gain of 16x).

NOTE: I found that first migrating the DB and then compacting it was significantly faster than first compacting, then migrating and compacting again. The only downside here is that if you migrate first, your db size will grow slightly (in my example, it grew to 3.2GB).

Migrating using multiple Txs vs 1 big Tx

The migration has been split up into multiple transactions so that the migration is less RAM intensive. Note that when @C-Otto tried the (all-in-one tx) migration with his ~5GB db, the average RAM usage was around 3GB, but it is worth noting that the encoding of certain items has changed since then, so it should be less now. @C-Otto, if you still have a copy of your previous DB around, it would be awesome to hear how this new multi-tx approach performs on it if you are willing to try it on a copy of that copy :)
Thanks, @ellemouton! I don't think I have the old DB anymore, but we'll get plenty of reports once this is released :) |
@bhandras: review reminder |
Looking really good! 🥇 Great work on the range index code too 🚀 My remaining question is whether to keep the new DB channel ID vs just using channel IDs?
Force-pushed ce108a6 to 2aefe41
LGTM 🚀 Excellent work @ellemouton! ✨
    }

    // Start the iteration and exit on condition.
    for k := k; k != nil; k, _ = c.Next() {
Just realized the kvdb cursor should have a Current() member so it's easier to write these types of loops.
Force-pushed 1aeeb50 to 7f40398
    // For each of these, create a new sub-bucket with this key in
    // cChanDetailsBkt. In this sub-bucket, add the cChannelSummary key with
    // the encoded ClientChanSummary as the value.
    err = chanSummaryBkt.ForEach(func(chanID, summary []byte) error {
This is not a RAM-intensive migration since the summaries are just small pkscripts, right? If it is intensive, could be worth splitting this ForEach into multiple transactions.
did a test that does this migration with 500 channels & found that the path does not even show up in the heap profile graph. Same for migration 3.
    continue
    }

    err = kvdb.Update(db.bdb(), func(tx kvdb.RwTx) error {
👍 the migrations should've been in separate tx's from the beginning
Small refactor just to make the upcoming commit easier to parse. In this commit, we make better use of the getChanSummary helper function.
In this commit, a migration of the tower client db is done. The migration creates a new top-level cChanDetailsBkt bucket and, for each channel found in the old cChanSummaryBkt bucket, creates a new sub-bucket. In the sub-bucket, the ClientChanSummary is then stored under the cChannelSummary key. The reason for this migration is that it will be useful in future when we want to store more easily accessible data under a specific client ID.
In this commit, a new channel-ID index is added to the tower client db with the help of a migration. This index holds a mapping from a db-assigned-ID (a uint64 encoded using BigSize encoding) to real channel ID (32 bytes). This mapping will help us save space in future when persisting references to channels.
In this commit, a new concept called a RangeIndex is introduced. It provides an efficient way to keep track of numbers added to a set by keeping track of various ranges instead of individual numbers. Notably, it also provides a way to map the contents & diffs applied to the in-memory RangeIndex (which uses a sorted array structure) to a persisted KV structure.
Refactor the putClientSessionBody to take in a session sub-bucket rather than the top-level session bucket. This is mainly to make an upcoming commit diff easier to parse.
In this commit, a RangeIndex set is added to the ClientDB along with getter methods that can be used to populate this in-memory set from the DB.
In preparation for an upcoming commit where some helper functions will need access to the ClientDB's ackedRangeIndex member, this commit converts those helper functions into methods on the ClientDB struct.
Add a shouldFail boolean parameter to the migtest ApplyMigrationWithDB in order to make it easier to test migration failures.
In this commit, we add the ability to add a wtdb version migration that does not get given a transaction but rather a whole db object. This will be useful for migrations that are best done in multiple transactions in order to use less RAM.
In this commit, the code for migration 4 is added. This migration takes all the existing session acked updates and migrates them to be stored in the RangeIndex form instead. Note that this migration is not activated in this commit. This is done in a follow up commit in order to keep this one smaller.
In this commit, a migration is done that takes all the AckedUpdates of all sessions, stores them in the RangeIndex pattern instead, and deletes each session's old AckedUpdates bucket. All the logic in the code is also updated in order to write to and read from this new structure.
Force-pushed 7f40398 to e52461b
LGTM 🐒
This PR contains a few migrations of the watchtower client DB.

Main Achievements:
- Reduce the storage footprint of AckedUpdates through the use of a new RangeIndex model.

Change Log:
1. Migrate ChanDetails Bucket

A ChannelDetails bucket is added, and each registered channel gets its own sub-bucket in this bucket. The existing channel summaries are moved here under the cChannelSummary key, and the old top-level cChanSummaryBkt is deleted. This is done so that we can store more details about a channel without needing to add new info to the encoded ChannelSummary.

2. Add a new Channel-ID index
A new channel-ID index is added. The index holds a mapping from db-assigned-ID (8 bytes) to real channel-ID (32 bytes). This mapping will allow us to preserve disk space in future when persisting references to channels.
3. Migrate AckedUpdates to the RangeIndex model.
This is the main meat of the PR.
Currently, for each session, we store its acked updates as follows:
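A rough sketch of the current per-session layout implied by the sizes quoted below (the bucket and key names here are illustrative, not the exact wtdb constants):

```
sessionAckedUpdatesBkt =>
    seqNum (2 bytes) => channelID (32 bytes) || commitHeight (8 bytes)

# 1024 updates * (2 + 32 + 8) bytes ≈ 43 KB per session;
# 60,000 sessions ≈ 2.5 GB.
```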
The default number of backups per session is 1024 and so this means that regardless of the number of channels the session is actually covering, it takes 1024 * (2 + 32 + 8) bytes per session. Depending on how busy your node is, you could have many thousands of sessions (this user has 60k which would mean 2.5GB!).
The only reason that we want to keep track of AckedUpdates, imo, is if we need to determine which updates of a channel have/have not been backed up, or if we want to replay backups to a different tower. So I argue that knowing the sequence number associated with each backup is not necessary.
Given this assumption, we can store the AckedUpdates using a RangeIndex (per channel). We will also make use of the new channel index described above (in point 2 of the change log) so that we only need 8 bytes per channel.

The RangeIndex is just a map from the start height to the end height of a range. This means that in the best case, where all commit heights are present, we only need to store the start and end height. And if there are holes, we store separate ranges, which is still more efficient since each range only requires 16 bytes (an 8-byte key for the start value and 8 bytes for the end value of the range). Our session can thus store its AckedUpdates as a set of such ranges per channel.

Here is a quick example showing the change of an index that starts empty and then gets the numbers 1, 2, 4, 3: starting from {}, adding 1 gives {1: 1}; adding 2 extends that range to {1: 2}; adding 4 opens a new range, {1: 2, 4: 4}; and adding 3 joins the two ranges into {1: 4}.
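The merge logic for that example can be sketched in Go as follows (a minimal illustration of the idea, not lnd's actual RangeIndex implementation; all names are hypothetical):

```go
package main

import "fmt"

// RangeIndex tracks a set of uint64s as disjoint inclusive ranges,
// stored as a map from a range's start height to its end height.
type RangeIndex struct {
	ranges map[uint64]uint64
}

func NewRangeIndex() *RangeIndex {
	return &RangeIndex{ranges: make(map[uint64]uint64)}
}

// Add inserts n into the set, merging it with adjacent ranges where
// possible so that the ranges stay disjoint.
func (r *RangeIndex) Add(n uint64) {
	// Already covered by an existing range? Then nothing to do.
	for start, end := range r.ranges {
		if n >= start && n <= end {
			return
		}
	}

	// Does n extend a range ending at n-1? If so, also check whether
	// the extended range now touches a range starting at n+1.
	for start, end := range r.ranges {
		if n > 0 && end == n-1 {
			if nextEnd, ok := r.ranges[n+1]; ok {
				// Join the two ranges into one.
				r.ranges[start] = nextEnd
				delete(r.ranges, n+1)
			} else {
				r.ranges[start] = n
			}
			return
		}
	}

	// Does n extend a range starting at n+1 downwards?
	if end, ok := r.ranges[n+1]; ok {
		delete(r.ranges, n+1)
		r.ranges[n] = end
		return
	}

	// Otherwise n starts a new single-element range.
	r.ranges[n] = n
}

func main() {
	idx := NewRangeIndex()
	for _, n := range []uint64{1, 2, 4, 3} {
		idx.Add(n)
	}
	fmt.Println(idx.ranges) // the ranges collapse to a single 1 => 4 entry
}
```

Note how adding 3 triggers the join case: the range ending at 2 is extended, found to touch the range starting at 4, and the two are merged with one value update and one key deletion, matching the "at most one value update & one key deletion" property described below.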
This means that every session only refers to a channel once and then just stores the index of ranges for that channel that is covered by the session.
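Sketched as a KV shape (again with illustrative names), the per-session structure becomes: each channel appears once via its 8-byte db-assigned ID, mapping to its range index:

```
sessionAckedRangeIndexBkt =>
    dbChanID (8 bytes) =>
        rangeStart (8 bytes) => rangeEnd (8 bytes)
```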
Any additions to the index will at most require one value update & one key deletion.
The PR contains the necessary logic to make sure that the calculation of how to update an index given a new commit height is as efficient as possible.
Note:
I originally had the changes in #7059 here too, since the bug in wtclientrpc makes it difficult to use the rpc to check the correctness of this PR. Separated it into its own PR now just to make the diff of this PR a bit smaller. But I suggest we get that PR in first even though the two are not dependent on each other.
Fixes #6886