Batched graph updates #3367
Conversation
Force-pushed 88adc5e to d34b61a
channeldb/graph.go
Outdated
	db:          db,
	rejectCache: newRejectCache(rejectCacheSize),
	chanCache:   newChannelCache(chanCacheSize),
}
g.chanScheduler = batch.NewTimeScheduler(db.DB, &g.cacheMu, 10*time.Millisecond)
g.nodeScheduler = batch.NewTimeScheduler(db.DB, nil, 10*time.Millisecond)
I was able to set this slightly higher and still achieve the same sync time of about 2 minutes (granted, the peers I synced with on different runs were very fast). I used as high as 8 seconds, which means fewer db txns. What do you think? I think perhaps the two schedulers could be staggered at different intervals so as to not cause db contention, since we receive their messages at around the same time anyways.
With 10ms, I see writes taking a minimum of 20ms with very small batches (1-6 requests, with the occasional large batch). Since there can only be one db writer at a time:
- If there is another batch during this time, it will block until the first one completes. This can cause several batches to stall (not really a problem in terms of time in this scenario, but not ideal).
- More db txns means more heap usage.
So far this doesn't affect the overall sync time in any way, but may cause more write contention with the database.
I do see some memory usage with dereference calls when growing the mmap, since these are large writes (~10MB) if an 8 second batch time is used, so a case could be made for InitialMmapSize as discussed in #3278, but this is nothing compared to the GetBlock calls (6 GB+) made in the router.
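For reference, a minimal sketch of setting InitialMmapSize when opening a bbolt database; the import path, file name, and the 1 GB value are assumptions for illustration, and in lnd the actual open call lives inside channeldb rather than application code:

package main

import (
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Pre-sizing the mmap avoids repeated remaps (and the associated
	// dereference copies) as the database file grows.
	opts := &bolt.Options{
		InitialMmapSize: 1 << 30, // 1 GB, illustrative only
	}

	db, err := bolt.Open("channel.db", 0600, opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}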
I also see a reduction from the 10ms scheduler at 845MB:
ROUTINE ======================== github.com/lightningnetwork/lnd/batch.(*batch).trigger in /Users/nsa/go/src/github.com/lightningnetwork/lnd/batch/batch.go
0 845.71MB (flat, cum) 7.09% of Total
. . 27:}
. . 28:
. . 29:// trigger is the entry point for the batch and ensures that run is started at
. . 30:// most once.
. . 31:func (b *batch) trigger() {
. 845.71MB 32: b.start.Do(b.run)
. . 33:}
. . 34:
. . 35:// run executes the current batch of requests. If any individual requests fail
. . 36:// alongside others they will be retried by the caller.
. . 37:func (b *batch) run() {
to 269MB with the 8s scheduler:
ROUTINE ======================== github.com/lightningnetwork/lnd/batch.(*batch).trigger in /Users/nsa/go/src/github.com/lightningnetwork/lnd/batch/batch.go
0 269.59MB (flat, cum) 1.80% of Total
. . 27:}
. . 28:
. . 29:// trigger is the entry point for the batch and ensures that run is started at
. . 30:// most once.
. . 31:func (b *batch) trigger() {
. 269.59MB 32: b.start.Do(b.run)
. . 33:}
. . 34:
. . 35:// run executes the current batch of requests. If any individual requests fail
. . 36:// alongside others they will be retried by the caller.
. . 37:func (b *batch) run() {
@Crypt-iQ yes, the 10ms time is just a temporary commit so that I could drop the batched version into the regular API and verify it has the same semantics w/o having to modify any timeouts.
In practice we should be using a timeout on the order of seconds. I did my testing with a duration of 1 second and it performed quite well. We could probably do 8 seconds as well, though that also delays the time till first write pretty significantly. Most of the writes were taking 10-20ms, so only about 2-4% of one core would be dedicated to the db txn itself if we go with 1 second, and even lower with 8.
The main thing left to do is benchmark this on mobile and get a sense for how these parameters behave on different architectures, just haven't gotten around to doing so :)
It will indeed be interesting to see how fast the IGD is on low powered devices. I saw my rpi3 take 5 hours for sync on a 100 mbit fiber connection.
My main question related to this PR is whether this is the right optimization path to take.
How would this PR compare to a different, quite extreme, approach: The graph only exists in memory. It is updated in memory and queried from memory (which we want at some point anyway to speed up path finding). Existing caches can be removed / merged with the in-memory graph.
Then there is a background process that periodically (maybe once a minute) serializes all of it into a byte array and writes it in a single bbolt key (or file even). On startup, that value is deserialized into memory again. Because the graph is non critical data, it is not a problem if we don't have all the very latest data on disk.
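To make the comparison concrete, here is a rough sketch of what that periodic-flush alternative could look like. This is not what the PR implements; the types, bucket/key names, and gob encoding below are illustrative stand-ins for lnd's real graph structures.

package graphcache

import (
	"bytes"
	"encoding/gob"
	"sync"
	"time"

	bolt "go.etcd.io/bbolt"
)

// memGraph is a toy stand-in for a fully in-memory channel graph.
type memGraph struct {
	mu    sync.RWMutex
	nodes map[string][]byte
	edges map[uint64][]byte
}

var (
	graphBucket = []byte("flat-graph")
	graphKey    = []byte("snapshot")
)

// flushLoop periodically serializes the whole graph and writes it under a
// single key. Losing the last interval's updates is acceptable since the
// graph is non-critical data and can be re-synced from peers.
func (g *memGraph) flushLoop(db *bolt.DB, interval time.Duration, quit <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			g.mu.RLock()
			var buf bytes.Buffer
			err := gob.NewEncoder(&buf).Encode(struct {
				Nodes map[string][]byte
				Edges map[uint64][]byte
			}{g.nodes, g.edges})
			g.mu.RUnlock()
			if err != nil {
				continue
			}

			// Best-effort flush; a real implementation would log
			// any error returned here.
			_ = db.Update(func(tx *bolt.Tx) error {
				b, err := tx.CreateBucketIfNotExists(graphBucket)
				if err != nil {
					return err
				}
				return b.Put(graphKey, buf.Bytes())
			})

		case <-quit:
			return
		}
	}
}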
channeldb/graph.go
Outdated
	db:          db,
	rejectCache: newRejectCache(rejectCacheSize),
	chanCache:   newChannelCache(chanCacheSize),
}
g.chanScheduler = batch.NewTimeScheduler(db.DB, &g.cacheMu, time.Second)
g.nodeScheduler = batch.NewTimeScheduler(db.DB, nil, time.Second)
If there is no node cache, couldn't this just use the normal bbolt batching?
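For reference, bbolt's native batching that this comment refers to coalesces concurrent Update calls into a single transaction, but it offers no hook for keeping an external cache consistent. A sketch (bucket and key names are illustrative, and the import path may differ from the vendored copy lnd uses):

import bolt "go.etcd.io/bbolt"

// addNodeViaBoltBatch shows bbolt's built-in db.Batch: concurrent callers
// are grouped into one transaction without any scheduler of our own.
func addNodeViaBoltBatch(db *bolt.DB, pubKey, node []byte) error {
	return db.Batch(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("graph-nodes"))
		if err != nil {
			return err
		}
		return b.Put(pubKey, node)
	})
}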
channeldb/graph.go
Outdated
	edgeNotFound bool
)
return c.chanScheduler.Execute(&batch.Request{
	Update: func(tx *bbolt.Tx) error {
Is it an option to add two error return parameters, to prevent this pattern with the local variable?
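For readers following along, the captured-variable pattern being questioned looks roughly like this, simplified from the diff above; the helper name is illustrative, only ErrEdgeNotFound and the Request fields appear in the actual diff:

func (c *ChannelGraph) updateEdgeSketch(edge *ChannelEdgePolicy) error {
	var edgeNotFound bool

	err := c.chanScheduler.Execute(&batch.Request{
		Update: func(tx *bbolt.Tx) error {
			err := updateEdgePolicy(tx, edge)

			// Record the domain-level outcome in a captured local
			// so it can be reported after the batch commits,
			// without failing the whole batch for other callers.
			if err == ErrEdgeNotFound {
				edgeNotFound = true
				return nil
			}
			return err
		},
	})
	if err != nil {
		return err
	}

	if edgeNotFound {
		return ErrEdgeNotFound
	}
	return nil
}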
Great feature, so excited to get this in for mobile users!! 🙌
Mostly for feedback, I just have some questions. Looking forward to reading what the testing looks like, but a time-based batching scheduler seems like the perfect starting point for batching db txs like these 👍
batch/batch.go
Outdated
	// return the error directly.
	for _, req := range b.reqs {
		if req.OnSuccess != nil {
			req.errChan <- req.OnSuccess(err)
This is a little subtle, but this value can still be non-nil. It's true that at this point, none of the requests fail inside the db txn. However, it's still possible for the txn itself to fail. That error is passed into OnSuccess so that any errors in persisting the changes can be returned at the call sites.
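In other words, the commit outcome flows through OnSuccess roughly like this. A minimal sketch using only the Request fields visible in the diff; the scheduler parameter and stageWrite helper are illustrative:

func executeSketch(scheduler *batch.TimeScheduler, stageWrite func(*bbolt.Tx) error) error {
	return scheduler.Execute(&batch.Request{
		Update: func(tx *bbolt.Tx) error {
			// Stage this request's writes inside the shared batch txn.
			return stageWrite(tx)
		},
		OnSuccess: func(commitErr error) error {
			// commitErr is the outcome of committing the whole
			// batch. Update may have returned nil, yet the txn
			// itself can still fail; surfacing it here lets the
			// call site see the persistence failure.
			if commitErr != nil {
				return commitErr
			}
			// Safe to apply post-commit side effects (e.g. cache
			// updates) once we know the batch persisted.
			return nil
		},
	})
}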
// information. Note that this method is expected to only be called to update an
// already present node from a node announcement, or to insert a node found in a
// channel update.
//
// TODO(roasbeef): also need sig of announcement
func (c *ChannelGraph) AddLightningNode(node *LightningNode) error {
Curious if the plan is to get rid of this function since it just purely wraps BatchedAddLightningNode now? And same w/ AddChannelEdge.
At some point, yes, possibly. Currently they are still useful, esp. in unit tests, since we wouldn't have to set custom timeouts and stuff for tests. It would be a pretty big investment to do so though, because most of the test infra is already written with the one-off methods.
ended up making the timeout configurable from the config file, so everything passes through the original methods!
Could you get some pprof output comparisons? There is also a memory gain to be had and that would be especially beneficial to neutrino users.
Force-pushed 050ffe2 to 60e6071
Solid PR! Also a relatively compact diff. Completed my first "new" pass since this PR was resurrected. Will do at least one additional pass. Still need to wrap my head around the way the batch itself interacts with the scheduler in a concurrent safe manner...
func (b *batch) run() {
	// Clear the batch from its scheduler, ensuring that no new requests are
	// added to this batch.
	b.clear(b)
Still getting through this, but couldn't a new batch instance possibly be created that already removes the scheduled items from the scheduler? Seems like an odd "up call" in this context.
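A rough sketch of that alternative, assuming the scheduler holds the pending batch behind its own mutex; the field names here are hypothetical, not the PR's actual implementation:

// dispatch detaches the pending batch under the scheduler's lock so that
// new requests start a fresh batch, without run() having to call back
// "up" into the scheduler.
func (s *TimeScheduler) dispatch() {
	s.mu.Lock()
	b := s.b
	s.b = nil
	s.mu.Unlock()

	if b != nil {
		b.run()
	}
}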
Force-pushed 60e6071 to 7984199
LGTM 🥮
(pending resolution of the last commit)
I think we can either use a build tag to force the lower interval during the tests, or make it a config option and have all the tests specify their own value. No strong feelings in either direction.
Looks pretty good, still think we should make the interval configurable, should also help setting it low during tests!
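A sketch of what the configurable interval could look like when wiring up the graph; the Options type, field name, and default value are assumptions for illustration, while NewTimeScheduler and cacheMu follow the diffs above:

// Options carries tunables for the channel graph.
type Options struct {
	// BatchCommitInterval is the maximum time the schedulers wait
	// before committing a pending batch of graph updates.
	BatchCommitInterval time.Duration
}

func newChannelGraph(db *DB, opts Options) *ChannelGraph {
	interval := opts.BatchCommitInterval
	if interval == 0 {
		interval = 500 * time.Millisecond // illustrative default
	}

	g := &ChannelGraph{db: db}
	g.chanScheduler = batch.NewTimeScheduler(db.DB, &g.cacheMu, interval)
	g.nodeScheduler = batch.NewTimeScheduler(db.DB, nil, interval)
	return g
}

Tests could then pass something very small (e.g. 10*time.Millisecond) explicitly instead of relying on a build tag.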
channeldb/graph.go
Outdated
 	c.chanCache.remove(edge.ChannelID)
 
-	return nil
+	return c.BatchedAddChannelEdge(edge)
Is this change to be reverted or no? It makes sense to only have AddChannelEdge and let that be the batched version?
		return err
	}

	if s.locker != nil {
add a comment here
func (b *batch) run() {
	// Clear the batch from its scheduler, ensuring that no new requests are
	// added to this batch.
	b.clear(b)
Not sure if we should copy bbolt if it is not to our standards 😛 Just looks like this could be made much simpler, but also not a blocker for me.
Ping on this @cfromknecht, just needs the config flag now AFAIU.
Force-pushed 7984199 to be6e53b
This allows for 1000 different validation operations to proceed concurrently. Now that we are batching operations at the db level, the average number of outstanding requests will be higher since the commit latency has increased. To compensate, we allow for more outstanding requests to keep the gossiper busy while batches are constructed.
This allows for 1000 different persistent operations to proceed concurrently. Now that we are batching operations at the db level, the average number of outstanding requests will be higher since the commit latency has increased. To compensate, we allow for more outstanding requests to keep the router busy while batches are constructed.
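For context, the concurrency limit these commit messages describe is essentially a counting semaphore. A toy sketch of the idea follows; the handler, sizes, and structure are illustrative, not lnd's actual validation-barrier code:

package main

import "sync"

// handle is an illustrative stand-in for validating/persisting one
// gossip message.
func handle(i int) {}

func main() {
	// A buffered channel used as a counting semaphore: at most 1000
	// operations are in flight at once, matching the limit described
	// in the commit messages above.
	sem := make(chan struct{}, 1000)

	var wg sync.WaitGroup
	for i := 0; i < 5000; i++ {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot
		go func(i int) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			handle(i)
		}(i)
	}
	wg.Wait()
}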
Force-pushed be6e53b to a51f05a
Config check is failing:
)

// DB holds database configuration for LND.
type DB struct {
	Backend string `long:"backend" description:"The selected database backend."`

	BatchCommitInterval time.Duration `long:"batch-commit-interval" description:"The maximum duration the channel graph batch schedulers will wait before attempting to commit a batch of pending updates. This can be used to trade off database contention for commit latency."`
Alternatively we could add this behind a build flag to avoid exposing it to "normal" users (it would also not require updating the sample config).
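A sketch of that build-flag alternative using the dev-tag pattern; the file names, tag, and constant below are hypothetical:

// File 1: graph_interval_prod.go (hypothetical)
// +build !dev

package channeldb

import "time"

// defaultBatchCommitInterval is the batching horizon used in normal builds.
const defaultBatchCommitInterval = time.Second

// File 2: graph_interval_dev.go (hypothetical)
// +build dev

package channeldb

import "time"

// A much shorter horizon so tests don't stall waiting for batches to fill.
const defaultBatchCommitInterval = 10 * time.Millisecond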
This will permit a greater degree of tuning or customization depending on various hardware/environmental factors.
Force-pushed a51f05a to 82a2383
This PR introduces batched operations for channel graph ingestion in order to speed up IGD. Currently a single db transaction is used to process each channel_announcement, channel_update, and node_announcement, which with the current graph size results in around 160k-200k transactions. With this PR we bring this down to linear in the duration it takes to sync the graph, as we'll now make at most 2 db transactions/sec. OMM I was able to get a 4 minute graph sync using this, implying only about 500 db transactions were made in total.

So far I'm seeing a 2x speedup in IGD. The CPU and memory usage should be much lower, but I'm still in the process of tuning and gathering exact performance metrics.
batch package

To accomplish this, we introduce a new batch package, which exposes a simple API for constructing batched operations on a bbolt database. The primary reason we require this, and can't use bbolt's native Batch operation, is that we maintain two caches external to bbolt: the channel cache and the reject cache.

Since both are write-through caches, it is critical that we maintain external consistency whenever a write succeeds. Updating the cache should atomically add the whole batch of updates, or none of them. To achieve this, we need a mechanism for doing locking at a level above bbolt, since we don't have the luxury of reentrant locks.

Most of the batching logic is borrowed from bbolt itself, but we also add some slight modifications to handle any updates to the external cache under a single mutex.
There are two primary components: a Request, which describes a single operation (its db update plus any post-commit action) to be included in a batch, and a Scheduler, which decides when a batch of concurrent requests gets executed.

Different Schedulers may implement different policies for when a batch gets executed. The one currently implemented, TimeScheduler, uses a fixed horizon after which it executes any concurrent requests that were added. Future schedulers may take into account metrics such as the number of pending items in the batch or overall request throughput before deciding when to trigger, but that is left as an area for future research.

Using this package it becomes simple to express our batched operations; the sketch below shows roughly what BatchedAddChannelEdge looks like, and its logic largely coincides with what we'd expect given the unbatched variant.
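The original code blocks did not survive the export, so here is a reconstruction of roughly what the batched variant looks like, pieced together from the diffs in this thread; helper and error names may differ slightly from the actual channeldb code:

func (c *ChannelGraph) BatchedAddChannelEdge(edge *ChannelEdgeInfo) error {
	var alreadyExists bool

	return c.chanScheduler.Execute(&batch.Request{
		Update: func(tx *bbolt.Tx) error {
			err := c.addChannelEdge(tx, edge)

			// Silence ErrEdgeAlreadyExist so the shared batch txn
			// can still commit for the other requests, and report
			// it through the captured local instead.
			if err == ErrEdgeAlreadyExist {
				alreadyExists = true
				return nil
			}
			return err
		},
		OnCommit: func(err error) error {
			switch {
			case err != nil:
				return err
			case alreadyExists:
				return ErrEdgeAlreadyExist
			default:
				// The scheduler holds c.cacheMu here, so the
				// write-through caches can be updated safely.
				c.rejectCache.remove(edge.ChannelID)
				c.chanCache.remove(edge.ChannelID)
				return nil
			}
		},
	})
}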
Note that in the batched variant, c.cacheMu is passed into the creation of the chanScheduler and is automatically acquired by the scheduler, so it is safe to modify the cache directly within the OnCommit closure.

There are a few other areas in the code base where we may wish to adopt this, e.g. the CircuitMap, which should be a straightforward change using this primitive. I suspect this may be of interest to other projects using bbolt; we may want to consider making the batch package its own versioned go module.

testing
Before proceeding in writing some more tests, I'm putting this up on travis with a temporary commit that causes the unbatched APIs to use the batched ones under the hood. This should give us a good idea that the semantics are unchanged before proceeding to write more extensive tests to handle edge cases in the batched API.
Closes #3288