
Look into replacing sled #294

Closed
ShadowJonathan opened this issue Jul 14, 2021 · 34 comments
Labels
enhancement New feature or request

Comments

@ShadowJonathan
Contributor

heed is a high-level wrapper around LMDB. I think it'd be interesting to figure out whether it has better performance, compatibility, and caching possibilities compared to sled (at the moment), and/or to allow these database backends to be swapped interchangeably.

@poljar
Contributor

poljar commented Jul 27, 2021

Such an abstraction exists: pub trait StateStore: AsyncTraitDeps. You can implement a custom store and use it.

As of now, there aren't any showstoppers that would make me move away from sled, and I don't have the bandwidth to support multiple store implementations.
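To illustrate the "pluggable backend" idea, here is a minimal sketch. Note that the real StateStore trait has many async methods; SimpleStateStore and MemoryStore below are hypothetical stand-ins, only meant to show how a custom store can be swapped in behind a trait.

```rust
use std::collections::HashMap;

// Hypothetical, heavily simplified stand-in for the real StateStore trait.
trait SimpleStateStore {
    fn save(&mut self, key: &str, value: Vec<u8>);
    fn load(&self, key: &str) -> Option<&[u8]>;
}

// An in-memory backend; a sled- or LMDB-backed store would implement the
// same trait and could be swapped in without touching calling code.
struct MemoryStore {
    map: HashMap<String, Vec<u8>>,
}

impl SimpleStateStore for MemoryStore {
    fn save(&mut self, key: &str, value: Vec<u8>) {
        self.map.insert(key.to_owned(), value);
    }
    fn load(&self, key: &str) -> Option<&[u8]> {
        self.map.get(key).map(|v| v.as_slice())
    }
}

fn main() {
    let mut store = MemoryStore { map: HashMap::new() };
    store.save("account_data", b"{}".to_vec());
    assert_eq!(store.load("account_data"), Some(&b"{}"[..]));
}
```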

@poljar poljar changed the title from "Looking into replacing sled with heed" to "Looking into replacing sled" on Mar 10, 2022
@jplatte
Collaborator

jplatte commented Mar 10, 2022

Other options mentioned on Matrix today:

Both are written in Rust, so they presumably don't need a C toolchain, which could give us easy cross-compilation once olm is also replaced by vodozemac (which should happen pretty soon!).

@poljar
Contributor

poljar commented Mar 10, 2022

Another one:

It has multi-process support, which might be relevant for the iOS use case, where two sync loops are currently going on: one for the notifications and one for the main process.

@gnunicorn
Contributor

note about BonsaiDB, from their docs:

However, the underlying storage layer that BonsaiDb is built upon, sled, does not support multiple processes writing its data simultaneously.

@jplatte jplatte added the enhancement (New feature or request) label on Apr 8, 2022
@gnunicorn
Contributor

@gnunicorn
Contributor

don't forget about https://github.com/paritytech/parity-db

@bnjbvr
Member

bnjbvr commented Oct 1, 2022

A new contender recently appeared: https://github.com/cberner/redb/

Pure Rust; the README.md claims it's almost as fast as LMDB.

@jplatte
Collaborator

jplatte commented Oct 1, 2022

Unfortunately it is mmap-based, so fundamentally incompatible with accessing the DB from multiple processes at the same time, which seems like it is going to be a hard requirement for whatever we replace sled with.

@jplatte jplatte changed the title from "Looking into replacing sled" to "Look into replacing sled" on Oct 4, 2022
@gnunicorn
Contributor

I stand corrected on BonsaiDb: since 0.1.0 it's been using nebari, not sled:

The underlying dependency on sled has been changed for an in-house storage implementation nebari.

It's not clear what that means for multi-process support though (I doubt it has that).

@P-E-Meunier

P-E-Meunier commented Dec 10, 2022

I'm the author of Sanakirja and reached this thread while reading about news in Matrix. If you need any help let me know. See our benchmarks: https://pijul.org/posts/2021-02-06-rethinking-sanakirja/

@jplatte
Collaborator

jplatte commented Dec 12, 2022

@P-E-Meunier Thanks for the offer! I have started work on a crypto-store based on Sanakirja, but it's not clear to me how I would use arbitrarily-sized (byte)strings in there, both for keys and values. I got the basic setup working and thought [u8] key / value would work, but then hit the 510 byte limit in impl UnsizedStorable for [u8]. I started looking into how Pijul uses sanakirja, but didn't get very far yet. The current state of my work can be found here.

@P-E-Meunier

This isn't a hard limit. Sanakirja is a framework rather than a database; its main duty is to manage memory (including optional reference counting, if you need to fork your data structures) in an mmapped file, in a fully transactional way.

You can write many different data structures on top of that; B trees are just one particular data structure, and I've worked on others (ropes, for example). It mostly depends on what operations you want, but one workaround for the 510-byte limit is to create a new type of pages for storing large chunks of contiguous data, and reference these pages in the tree. The only tricky part is that the allocator now needs to return contiguous pages, but that isn't hard.

If you don't need your values to be contiguous in memory (or can afford to copy them rather than work with pointers), things will be easier. One significant advantage you'll gain if you can use the current allocator is that it can be implemented on top of more exotic backends: Pijul uses the same code when working on-disk and in zstd-compressed files. You could also potentially use other backends (such as replication/HA tools).
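The workaround described above can be sketched conceptually. The types below are not Sanakirja's actual API, just an illustration of the idea: values at or below an inline limit live directly in the tree node, while larger ones go to separately allocated contiguous "pages" that the node merely references.

```rust
// Conceptual sketch (not Sanakirja's actual types) of inline vs.
// out-of-line value storage.
const INLINE_LIMIT: usize = 510;

enum Value {
    Inline(Vec<u8>),     // small: stored on the B-tree page itself
    Overflow(usize),     // large: index of a contiguously allocated page
}

struct Store {
    pages: Vec<Vec<u8>>, // stand-in for the contiguous-page allocator
}

impl Store {
    fn write(&mut self, bytes: &[u8]) -> Value {
        if bytes.len() <= INLINE_LIMIT {
            Value::Inline(bytes.to_vec())
        } else {
            self.pages.push(bytes.to_vec());
            Value::Overflow(self.pages.len() - 1)
        }
    }

    fn read<'a>(&'a self, v: &'a Value) -> &'a [u8] {
        match v {
            Value::Inline(b) => b,
            Value::Overflow(i) => &self.pages[*i],
        }
    }
}

fn main() {
    let mut store = Store { pages: Vec::new() };
    let small = store.write(b"short");
    let big = store.write(&vec![0u8; 5000]);
    assert!(matches!(&small, Value::Inline(_)));
    assert!(matches!(&big, Value::Overflow(_)));
    assert_eq!(store.read(&big).len(), 5000);
}
```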

Just tell me what you need, I may be able to implement it if it isn't too complicated.

I started looking into how Pijul uses sanakirja, but didn't get very far yet.

This is at the same time a good and a terrible idea: a good one since it is possibly the most "real-world" use of Sanakirja currently deployed, and a terrible one because at the time of starting Pijul, GATs hadn't even started in unstable Rust. Therefore, I decided to do the compiler's job of monomorphising manually, using macros. But then that gave me too much freedom to express type constraints that made "repos-in-zstd" possible, and these constraints can't be expressed in Rust's type system without the manual monomorphisation.

So, Pijul's way is also possibly the most convoluted way to use Sanakirja.

@P-E-Meunier

Looking at your code: Sanakirja can handle concurrent readers and writers, so you don't need to drop your reader before starting a writer. The exact constraint is much more flexible than this, but a first approximation is that you need to drop the readers before committing a writer.

@jplatte
Collaborator

jplatte commented Dec 12, 2022

The biggest concern for us right now is to get something that works. I'm not at all experienced with the low-level details of databases, nor do I want to optimize things a lot right now, except that we need to be able to open the DB without immediately running into the memory restrictions for background processes on iOS. We have sled as an existing backend which runs into those limits, which is one reason we are now spending more time on trying out replacements.

If you don't need your values to be contiguous in memory (or can afford to copy them rather than work with pointers), things will be easier.

I think we can totally afford to copy rather than working with pointers. I guess the alternative would be an io::Read implementing type, but due to serde-rs/json#160, that will be much slower, and I assume it won't be easier to implement either, right?

you don't need to drop your reader before starting a writer

But it reduces contention to drop the reader earlier, right?

@P-E-Meunier

we need to be able to open the DB without immediately running out of the memory restrictions for background processes on iOS

One of my goals was actually to ensure that the database was always at least half full before growing the file, so this could work, depending on how iOS implements mmap (Sanakirja uses multiple maps to avoid this issue).

I guess the alternative would be an io::Read implementing type, but due to serde-rs/json#160, that will be much slower

That could actually be quite a bit easier, since it doesn't need a new version of the AllocPage trait. But I've wanted to implement contiguous allocations for a long time, so maybe the time has finally come!

But it reduces contention to drop the reader earlier, right?

The only hard contention is on the writer's mutex. The readers count is an atomic variable, so depending on the architecture it might result in some contention, but probably nothing significant.

Avoiding concurrency inside a thread would make your code easier to reason about, though.
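The synchronization scheme described above can be sketched as follows. This is an assumed illustration of the pattern, not Sanakirja's actual code: readers only bump an atomic counter (cheap, mostly uncontended), while writers serialize on a mutex.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

// Sketch of "atomic reader count + writer mutex" synchronization.
struct EnvSync {
    readers: AtomicUsize, // active reader count
    writer: Mutex<()>,    // the only hard contention point
}

fn main() {
    let env = Arc::new(EnvSync {
        readers: AtomicUsize::new(0),
        writer: Mutex::new(()),
    });

    let mut handles = Vec::new();
    for _ in 0..4 {
        let env = Arc::clone(&env);
        handles.push(thread::spawn(move || {
            // A read transaction registers and deregisters itself.
            env.readers.fetch_add(1, Ordering::SeqCst);
            // ... read ...
            env.readers.fetch_sub(1, Ordering::SeqCst);
        }));
    }

    {
        // A write transaction holds the writer mutex for its duration;
        // it never blocks readers from registering.
        let _guard = env.writer.lock().unwrap();
        // ... write ...
    }

    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(env.readers.load(Ordering::SeqCst), 0);
}
```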

@jplatte
Collaborator

jplatte commented Dec 12, 2022

I guess the alternative would be an io::Read implementing type, but due to serde-rs/json#160, that will be much slower

That could be quite easier, actually, since it doesn't need a new version of the AllocPage trait.

In that case, could you give me some hints as to how to get started with that?

@P-E-Meunier

Well, let me try something first ;-) I believe this can be solved during the end of my lunch break.

@P-E-Meunier

Ok, I just published the allocator (Sanakirja 1.3), although I don't think it will be very useful for now; it seems I'll have to think a bit more about the best way to use it. I tried various things without success.

The main issue is that if you're storing bytestrings of widely different sizes, you could end up wasting disk space. I tried to mitigate that, but didn't find a convincing way. I'll come back to it in the next few days.

@jplatte
Collaborator

jplatte commented Dec 12, 2022

Sorry, I don't understand – what changed exactly?

@P-E-Meunier

The AllocPage trait has a new method, alloc_contiguous, which MutTxn implements. I tested it on small examples, it seems to work ok. Now the only thing missing is the type of large values using that allocator.

@jplatte
Collaborator

jplatte commented Dec 13, 2022

Now the only thing missing is the type of large values using that allocator.

So is that something I can do myself easily enough without understanding how sanakirja works in depth?

@P-E-Meunier

No, but I looked at it again last night and I believe I found a way. I'll keep you posted.

@P-E-Meunier

P-E-Meunier commented Dec 13, 2022

Alright, using Sanakirja 1.3.1 you can do the following:

use sanakirja::*;

fn main() {
    // Create an anonymous (non-persistent) environment with one root.
    let env = Env::new_anon(409600000, 1).unwrap();
    let mut txn = Env::mut_txn_begin(&env).unwrap();

    // A B-tree with u64 keys and unsized (Slice) values.
    let mut db: btree::UDb<u64, Slice> = btree::create_db_(&mut txn).unwrap();
    btree::put(&mut txn, &mut db, &0, &(&b"blabal bilbli"[..]).into()).unwrap();

    // A value well above the 510-byte inline limit.
    let v = vec![b'a'; 5000];
    btree::put(&mut txn, &mut db, &1, &(&v[..]).into()).unwrap();

    // Store the Db pointer in root 0 so it isn't leaked, then commit.
    txn.set_root(0, db.db);
    txn.commit().unwrap();

    // Re-open the Db from root 0 in a new transaction and read both values back.
    let mut txn = Env::mut_txn_begin(&env).unwrap();
    let mut db: btree::UDb<u64, Slice> = btree::UDb::from_page(txn.root(0).unwrap());
    println!("{:?}", btree::get(&txn, &mut db, &0, None).unwrap().unwrap().1.as_bytes(&txn));
    let bb = btree::get(&txn, &mut db, &1, None).unwrap().unwrap().1.as_bytes(&txn).unwrap();
    println!("{:?} {:?}", bb.len(), bb.iter().all(|c| *c == b'a'));

    txn.set_root(0, db.db);
    txn.commit().unwrap();
}

Some words of caution:

  • Sanakirja is relatively low-level, as you can see. The main advantage is that you can store complex Rust types as raw bytes, and write well-typed wrappers that don't copy any byte.
  • In particular, you should have at least some understanding of what the pointers are. A Db, for example, is a pointer to an allocated page. If you don't retain it, you'll leak memory. If you store it twice (without forking it), you'll have a use-after-free.
  • The roots may store any kind of value, but usually they store Db pointers, as with set_root in the example above.
  • As for the new Slice type: it points to another page if the value is large enough, and stores the value inline on the page otherwise. In principle, btree::del takes care of deallocating the values.

If you're unsure, you can ask me. The Rust typesystem isn't sufficiently expressive to provide a safe interface to Sanakirja, which means that the rules aren't that simple.

@jplatte
Collaborator

jplatte commented Dec 13, 2022

  • In particular, you should have at least some understanding of what the pointers are. A Db, for example, is a pointer to an allocated page. If you don't retain it, you'll leak memory. If you store it twice (without forking it), you'll have a use-after-free.

So you mean in the example above,

  • If the call txn.set_root(0, db.db); is removed, memory is leaked
  • If the call is kept but I also add txn.set_root(1, db.db); afterwards, that is UB

?

@P-E-Meunier

Yes. About set_root, note that Sanakirja allocates memory atomically: all the allocations from a MutTxn are done atomically at commit time.

So, while using a Db from one MutTxn in another transaction might "work", I still recommend that you store it in the transaction (possibly in a root), and read it back in the other transaction, to benefit from these guarantees.

For example, if you follow this advice, and some failure happens after commit (my usual hypothesis is power failure), no memory will be leaked and no reference will be left hanging.

@jplatte
Collaborator

jplatte commented Dec 13, 2022

Well, UB using functions that are not marked unsafe is not something I expected. Is it just set_root or are there other parts of the API that aren't actually safe?

@P-E-Meunier

Actually, since Sanakirja manipulates pointers to memory not managed by Rust, nothing is really safe, and the borrow checker is unfortunately not sufficient to check everything. Other libraries suffer from this as well: for example in LMDB, mutably borrowing two different tables is probably safe, but tables should borrow the transaction mutably to ensure they aren't reused after a commit.

But if you were using a simpler library before (i.e. one not generic in the types), writing a safe wrapper should be really easy.
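The "safe wrapper" idea might look like the sketch below. This is a pattern illustration, not the real Sanakirja or matrix-sdk API: a plain BTreeMap stands in for the unsafe low-level store, and a small, non-generic, typed table (SessionTable here, a hypothetical name) owns the key encoding so callers never touch raw bytes, roots, or pointers.

```rust
use std::collections::BTreeMap;

// Untyped byte-oriented store; a stand-in for the unsafe low-level API.
struct RawStore {
    inner: BTreeMap<Vec<u8>, Vec<u8>>,
}

// Typed, non-generic wrapper: the only way callers touch the raw store.
struct SessionTable<'a> {
    raw: &'a mut RawStore,
}

impl<'a> SessionTable<'a> {
    fn insert(&mut self, user_id: &str, session: &[u8]) {
        // The wrapper owns the key encoding; callers never see raw keys.
        let key = format!("session/{user_id}").into_bytes();
        self.raw.inner.insert(key, session.to_vec());
    }

    fn get(&self, user_id: &str) -> Option<&[u8]> {
        let key = format!("session/{user_id}").into_bytes();
        self.raw.inner.get(&key).map(|v| v.as_slice())
    }
}

fn main() {
    let mut raw = RawStore { inner: BTreeMap::new() };
    let mut table = SessionTable { raw: &mut raw };
    table.insert("@alice:example.org", b"pickled-session");
    assert_eq!(table.get("@alice:example.org"), Some(&b"pickled-session"[..]));
}
```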

@jplatte
Collaborator

jplatte commented Dec 13, 2022

In that case I don't have the confidence needed to continue with the sanakirja crypto store prototype. Long-term, we might be interested in building a higher-level safe API on top of a version of sanakirja with anything that can cause UB marked unsafe + documented safety preconditions, but right now we're probably going to try getting sqlite to work.

@P-E-Meunier

No problem. If you know what you want, I can give you a safe wrapper. How many tables? Do you need nested datastructures (a Db where the keys and/or values are also Db)?

@gnunicorn
Contributor

gnunicorn commented Mar 14, 2023

@jplatte
Collaborator

jplatte commented May 15, 2023

The default store backend is sqlite now, and we will likely kill sled entirely soon enough.

@Hywan
Member

Hywan commented Sep 20, 2023

And sled has been killed, RIP. I'm closing this issue :-).

@Hywan Hywan closed this as completed Sep 20, 2023
@tshepang

hm, there was a(n alpha) release just last week

@poljar
Contributor

poljar commented Sep 20, 2023

Indeed there was. Though what @Hywan meant is that the sled-based store implementation has been removed from the repo.

Nowadays we have slightly different requirements, so bringing the sled store back is very unlikely.
