
Asynchronous Datastores #137

Closed
aschmahmann opened this issue Oct 30, 2019 · 8 comments
@aschmahmann
Contributor

aschmahmann commented Oct 30, 2019

Proposal

  1. Add a Sync(prefix Key) function to the Datastore interface.
    • This function will be a no-op when the datastore is in synchronous mode (the default).
    • Otherwise, Sync(prefix) guarantees that any Put(prefix + ..., value) calls that returned before Sync(prefix) was called will be observed after Sync(prefix) returns, even if the program crashes.
  2. Insert calls to Sync where appropriate (in go-ipfs and go-libp2p).
  3. When ready, turn off sync writes in go-ipfs's datastore (by default). (we'll have an experimental transition with heavy testing)
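
The proposed interface change can be sketched as follows. This is illustrative only: the names here are simplified stand-ins, and the real go-datastore API differs (it uses a structured datastore.Key type, among other things).

```go
package main

import "fmt"

// Key is a toy stand-in for the real go-datastore key type.
type Key = string

// Datastore sketches the proposal: Put plus a new Sync method.
type Datastore interface {
	Put(key Key, value []byte) error
	// Sync guarantees that any Put(prefix+..., value) calls that
	// returned before Sync(prefix) was called will be observed after
	// Sync(prefix) returns, even if the program crashes.
	Sync(prefix Key) error
}

// mapDatastore is a toy in-memory, synchronous datastore.
type mapDatastore struct {
	values map[Key][]byte
}

func newMapDatastore() *mapDatastore {
	return &mapDatastore{values: make(map[Key][]byte)}
}

func (d *mapDatastore) Put(key Key, value []byte) error {
	d.values[key] = value
	return nil
}

// Sync is a no-op because this datastore is always synchronous,
// matching the proposed default behavior.
func (d *mapDatastore) Sync(prefix Key) error {
	return nil
}

func main() {
	var ds Datastore = newMapDatastore()
	ds.Put("/blocks/abc", []byte("data"))
	// Callers sync once after a batch of writes rather than per write.
	if err := ds.Sync("/blocks"); err != nil {
		fmt.Println("sync failed:", err)
	}
}
```

The point of the prefix argument is that a caller only pays for flushing the keys it cares about, not the whole store.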

Notes:

  1. We're not changing the default behavior. Datastores will still write synchronously unless configured not to do so.
  2. Put will either completely put a value or not put a value. Even when sync writes are turned off, the datastore will never be left in a corrupt state.

Motivation

Writing to disk synchronously has poor performance and is rarely necessary.

Poor performance: ipfs add performance doubles (on Linux/ext4) when Badger is used and synchronous writes are turned off.

Rarely necessary:

  • The DHT expects some number of nodes to be faulty so losing a few records is usually fine.
  • IPFS only guarantees that blocks are persisted when pinned. There's no reason to sync after every write.
    • Note: For now, we'll likely want to explicitly sync after a full ipfs add as most users have GC turned off and expect the data to be persisted anyways. However, doing this once is cheaper than doing it for every write.
  • The peerstore definitely doesn't need synchronous writes.

Alternatives

  • Create a buffered/batching/async wrapper. This is what go-ipfs currently does, but we could do better.
  • Use the "autobatching" datastore.

However:

  1. Buffering/caching isn't easy.
  2. Unlike buffering inside the OS, such wrappers can't (easily) respond to memory pressure.
  3. They also force us to eagerly sync/flush periodically instead of as needed; the OS knows when we have enough memory to keep buffering writes.

@Stebalien @whyrusleeping @raulk Seem like a reasonable plan?

@aschmahmann
Contributor Author

It looks like Badger on Windows only supports asynchronous writes, due to issues with Golang not respecting Windows permissions.

This means that until there's a fix in Golang or Badger, supporting Badger requires supporting asynchronous datastores.

@whyrusleeping
Member

I don't know that I would add Sync to the base datastore interface, but having it be an optional specialization that we soft require everywhere seems fine to me.

Other than that, this LGTM. Seems like something I've wanted for quite a long time.

@raulk
Member

raulk commented Nov 5, 2019

My main comment is that in practice, a datastore gets used by components with different reliability requirements, e.g. the peerstore, the IPFS repo, etc. If some are async and others are sync, the sync ones would end up paying the cost of flushing the writes from the async ones. That's unfair. To make this model effective and fair, we'd need to use it in conjunction with segmentation/compartmentalisation like the Namespace abstraction we already have for go-datastore.

@Stebalien
Member

This means that until there's a fix in Golang/Badger then if we want to support Badger we should support asynchronous datastores.

Note: fixing badger is much simpler.

@momack2

momack2 commented Nov 6, 2019

This sounds good! What are the next steps? Should there be a spec and design review first or move forward with a checklist and quick PoC to validate the approach? Is this simple enough to push forward and land a usable MVP this quarter (to demonstrate package manager performance improvements), or is this fix a significant chunk-o-work?

@Stebalien
Member

Stebalien commented Nov 6, 2019

I don't know that I would add Sync to the base datastore interface, but having it be an optional specialization that we soft require everywhere seems fine to me.

I'd rather add it to the interface for two reasons:

  1. We have more wrappers than datastores, so optional functions really don't buy us much. We have to implement them everywhere anyways.
  2. Either option is a breaking change, but making it optional breaks silently:
    1. If we require it, all datastores must upgrade to compile.
    2. If we make it optional, failing to update a datastore wrapper will cause us to silently drop all Sync calls (we'll just assume the wrapper is always synchronous, even when it can contain async datastores).
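
The silent-failure mode in point 2 can be sketched in Go. This is a hypothetical illustration, not go-datastore code: if Sync lives on an optional interface, callers must type-assert for it, and a wrapper that forgets to forward Sync makes that assertion fail without any compile error.

```go
package main

import "fmt"

// Syncer is a hypothetical optional specialization for Sync.
type Syncer interface {
	Sync(prefix string) error
}

// maybeSync calls Sync if the store supports it; otherwise it
// silently assumes the store is synchronous.
func maybeSync(ds interface{}, prefix string) error {
	if s, ok := ds.(Syncer); ok {
		return s.Sync(prefix)
	}
	return nil // silently assumed synchronous
}

// asyncStore supports Sync...
type asyncStore struct{ synced bool }

func (a *asyncStore) Sync(prefix string) error {
	a.synced = true
	return nil
}

// ...but a wrapper that forgets to forward Sync hides it: the type
// assertion fails and Sync calls are silently dropped.
type wrapper struct{ inner *asyncStore }

func main() {
	a := &asyncStore{}
	maybeSync(a, "/x")
	fmt.Println(a.synced) // true: Sync reached the store

	b := &asyncStore{}
	maybeSync(&wrapper{inner: b}, "/x")
	fmt.Println(b.synced) // false: the wrapper dropped Sync
}
```

Putting Sync on the base interface instead turns this bug into a compile error: the unmodified wrapper no longer satisfies Datastore.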

@Stebalien
Member

@momack2 this should be pretty simple.

@Stebalien
Member

@aschmahmann could you track any progress on this in ipfs/kubo#6523?
