
Asynchronous Datastores #137

Closed
aschmahmann opened this issue Oct 30, 2019 · 8 comments
@aschmahmann
Contributor

aschmahmann commented Oct 30, 2019

Proposal

  1. Add a Sync(prefix Key) function to the Datastore interface.
    • This function will be a no-op when the datastore is in synchronous mode (the default).
    • Otherwise, Sync(prefix) guarantees that any Put(prefix + ..., value) calls that returned before Sync(prefix) was called will be observed after Sync(prefix) returns, even if the program crashes.
  2. Insert calls to Sync where appropriate (in go-ipfs and go-libp2p).
  3. When ready, turn off sync writes in go-ipfs's datastore (by default). (we'll have an experimental transition with heavy testing)
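
The proposed interface change can be sketched as follows. This is illustrative only: the names here are simplified stand-ins, and the real go-datastore API differs (it uses a structured datastore.Key type, among other things).

```go
package main

import "fmt"

// Key is a toy stand-in for the real go-datastore key type.
type Key = string

// Datastore sketches the proposal: Put plus a new Sync method.
type Datastore interface {
	Put(key Key, value []byte) error
	// Sync guarantees that any Put(prefix+..., value) calls that
	// returned before Sync(prefix) was called will be observed after
	// Sync(prefix) returns, even if the program crashes.
	Sync(prefix Key) error
}

// mapDatastore is a toy in-memory, synchronous datastore.
type mapDatastore struct {
	values map[Key][]byte
}

func newMapDatastore() *mapDatastore {
	return &mapDatastore{values: make(map[Key][]byte)}
}

func (d *mapDatastore) Put(key Key, value []byte) error {
	d.values[key] = value
	return nil
}

// Sync is a no-op because this datastore is always synchronous,
// matching the proposed default behavior.
func (d *mapDatastore) Sync(prefix Key) error {
	return nil
}

func main() {
	var ds Datastore = newMapDatastore()
	ds.Put("/blocks/abc", []byte("data"))
	// Callers sync once after a batch of writes rather than per write.
	if err := ds.Sync("/blocks"); err != nil {
		fmt.Println("sync failed:", err)
	}
}
```

The point of the prefix argument is that a caller only pays for flushing the keys it cares about, not the whole store.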

Notes:

  1. We're not changing the default behavior. Datastores will still write synchronously unless configured not to do so.
  2. Put will either completely put a value or not put a value. Even when sync writes are turned off, the datastore will never be left in a corrupt state.

Motivation

Writing to disk synchronously has poor performance and is rarely necessary.

Poor performance: ipfs add performance doubles (on Linux/ext4) when Badger is used and synchronous writes are turned off.

Rarely necessary:

  • The DHT expects some number of nodes to be faulty so losing a few records is usually fine.
  • IPFS only guarantees that blocks are persisted when pinned. There's no reason to sync after every write.
    • Note: For now, we'll likely want to explicitly sync after a full ipfs add as most users have GC turned off and expect the data to be persisted anyways. However, doing this once is cheaper than doing it for every write.
  • The peerstore definitely doesn't need synchronous writes.

Alternatives

  • Create a buffered/batching/async wrapper. This is what go-ipfs currently does, but we could do better.
  • Use the "autobatching" datastore.

However:

  1. Buffering/caching isn't easy.
  2. Unlike buffering inside the OS, such wrappers can't (easily) respond to memory pressure.
  3. They also force us to eagerly sync/flush periodically instead of as needed; the OS knows when we have enough memory to keep buffering writes.

@Stebalien @whyrusleeping @raulk Seem like a reasonable plan?

@aschmahmann
Contributor Author

It looks like Badger on Windows only supports asynchronous writes, due to issues with Golang not respecting Windows permissions.

This means that until there's a fix in Golang or Badger, supporting Badger requires supporting asynchronous datastores.

@whyrusleeping
Member

I don't know that I would add Sync to the base datastore interface, but having it be an optional specialization that we soft require everywhere seems fine to me.

Other than that, this LGTM. Seems like something I've wanted for quite a long time.

@raulk
Member

raulk commented Nov 5, 2019

My main comment is that in practice, a datastore gets used by components with different reliability requirements, e.g. the peerstore, the IPFS repo, etc. If some are async and others are sync, the sync ones would end up paying the cost of flushing the writes from the async ones. That's unfair. To make this model effective and fair, we'd need to use it in conjunction with segmentation/compartmentalisation like the Namespace abstraction we already have for go-datastore.

@Stebalien
Member

This means that until there's a fix in Golang/Badger then if we want to support Badger we should support asynchronous datastores.

Note: fixing badger is much simpler.

@momack2

momack2 commented Nov 6, 2019

This sounds good! What are the next steps? Should there be a spec and design review first or move forward with a checklist and quick PoC to validate the approach? Is this simple enough to push forward and land a usable MVP this quarter (to demonstrate package manager performance improvements), or is this fix a significant chunk-o-work?

@Stebalien
Member

Stebalien commented Nov 6, 2019

I don't know that I would add Sync to the base datastore interface, but having it be an optional specialization that we soft require everywhere seems fine to me.

I'd rather add it to the interface for two reasons:

  1. We have more wrappers than datastores, so optional functions really don't buy us much. We have to implement them everywhere anyways.
  2. Either option is a breaking change, but making it optional breaks silently:
    1. If we require it, all datastores must upgrade to compile.
    2. If we make it optional, failing to update a datastore wrapper will cause us to silently drop all Sync calls (we'll just assume the wrapper is always synchronous, even when it can contain async datastores).
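
The silent-failure mode in point 2 can be sketched in Go. This is a hypothetical illustration, not go-datastore code: if Sync lives on an optional interface, callers must type-assert for it, and a wrapper that forgets to forward Sync makes that assertion fail without any compile error.

```go
package main

import "fmt"

// Syncer is a hypothetical optional specialization for Sync.
type Syncer interface {
	Sync(prefix string) error
}

// maybeSync calls Sync if the store supports it; otherwise it
// silently assumes the store is synchronous.
func maybeSync(ds interface{}, prefix string) error {
	if s, ok := ds.(Syncer); ok {
		return s.Sync(prefix)
	}
	return nil // silently assumed synchronous
}

// asyncStore supports Sync...
type asyncStore struct{ synced bool }

func (a *asyncStore) Sync(prefix string) error {
	a.synced = true
	return nil
}

// ...but a wrapper that forgets to forward Sync hides it: the type
// assertion fails and Sync calls are silently dropped.
type wrapper struct{ inner *asyncStore }

func main() {
	a := &asyncStore{}
	maybeSync(a, "/x")
	fmt.Println(a.synced) // true: Sync reached the store

	b := &asyncStore{}
	maybeSync(&wrapper{inner: b}, "/x")
	fmt.Println(b.synced) // false: the wrapper dropped Sync
}
```

Putting Sync on the base interface instead turns this bug into a compile error: the unmodified wrapper no longer satisfies Datastore.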

@Stebalien
Member

@momack2 this should be pretty simple.

@Stebalien
Member

@aschmahmann could you track any progress on this in ipfs/kubo#6523?
