Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop using (large) PropSubjectMap items in Index (if possible), speed up update_index for Commits #282

Closed
joepio opened this issue Jan 11, 2022 · 2 comments
Labels
performance Speed improvements server atomic-server

Comments

@joepio
Copy link
Member

joepio commented Jan 11, 2022

Since I use sled, a KV store, I have to manually create my own index solutions. One of these is the ValueIndex, which stores Values as Keys, so I get a reference index. This is like an inverted store. With this, I can quickly find incoming references in resources. Very useful for constructing Collections! So lets say resource A has a reference to resource B, I might want to be able to find incoming references for resource B, without having to iterate over all resources.

So the Key (K) here is actually the Value from a normal atom. The V is a PropSubjectMap, a HashMap which is a map of Properties and Subjects:

pub type PropSubjectMap = HashMap<String, HashSet<String>>;

This works OK, but does not scale well. Let's consider what happens when we update a Document, and append some new Element to it (a user presses Enter in a document editor). The Commit is parsed by the server, and the server will update the modified resource. After that, an Actor receives a message about this commit, and will update the indexes. We've only updated one Atom (the list of Elements), but since that is a ResourceArray, we'll iterate over every Subject inside that array. For each one, we'll get_prop_subject_map, remove the item accordingly, and then set_prop_subject_map. Setting and getting are expensive operations, since the entire object has to be fetched, (de)serialized and stored.

In total, we're talking about 20ms on my laptop. I think this will get far worse as we have more resources, as getting and setting will become slower the bigger the PropSubjectMap items are. Note that some subjects will have a lot of incoming links, such as commonly used Classes, and probably also users.

@joepio joepio added performance Speed improvements server atomic-server labels Jan 11, 2022
@joepio
Copy link
Member Author

joepio commented Jan 12, 2022

Two possible approaches:

Create a Tree for each incoming reference

Instead of storing Vec<String> in a Tree, we could make one tree for every Value. I'm not sure whether there are downsides to this. Is opening a tree slow? Probably. A user I chatted with:

I ran small test with 1000 trees and it was quite ok.
I get them all on startup which takes a lot of time, but afterwards it’s fast for writing

Use range

Think so - range gives you a double ended iterator
Or maybe get_lt works too for your needs 
The whole stuff hardly depends on the keys you have.
Easiest way to handle is with numbers (Big endian encoded as mentioned in readme)
 IVec implements SSO, so an empty IVec (or any value shorter than a pointer) is equivalent to an usize in memory. I'd

data structure in ValPropSubject index:

Key: (Value, Property, Number)
Value: Subject

I think I'll use .scan_prefix, and use (Value, none, none) prefix for TPF queries without a property, and (Value, property, none) for TPF queries with both a value and a property.

The add_atom_to_index function can now simply create an item without loading into memory a potentially big Hashmap.

But I'm not entirely sure what to do with Number. We can't use Value, Property as the sole key, since these can be the same for multiple atoms, so they will collide.

  • Make it something random. We fixed the collision issue!
  • Use a timestamp. Also fixes collision, but now we can also maybe sort by latest changes. Is that useful?
  • n+1. Check how many items for Value, property exist, and add one. I guess this would make it easier to sort? But I think it will make it more costly to create the atom, since we now have to check what the last item is with something like tree.range_prefix().iter_last()

The remove_atom_from_index function will need to iterate over all items that match some Value, Property prefix, and check if the Subject is the same.

Or...

Should we put the subject in the Key? Does this make sense?

Key: (Value, Property, Subject)
Value: ? 

If we can use a range_prefix to find Value, Property combinations, we can iterate over all matching items to get all subjects.

But the value would be empty, or at least, I don't know if it should contain anything. It feels like a ridiculous way of using a KV store, but maybe this is a good thing?

Lexicographic ordering and byte serialization

Sled only stores an array of bytes. If I'm to rely on the range_prefix function, I'll need to make sure that my binary serialization of my keys maintains lexicographic ordering. UTF-8 is already lexicographically ordered, so I can just drop bincode for the keys and use subject.as_bytes(). If I serialize Value, Property, Subject as one string, it becomes a bit hard to parse. I'd rather Have tuple or struct, but that will break ordering.

Some libraries exist that fix this, though. ordcode is interesting, and so is bytekey.

But perhaps splitting a string up using some unconventional, URL-illegal character Also does the trick just fine.

@joepio joepio changed the title Stop using (large) Vec items in Index (if possible) Stop using (large) PropSubjectMap items in Index (if possible) Jan 15, 2022
@joepio joepio changed the title Stop using (large) PropSubjectMap items in Index (if possible) Stop using (large) PropSubjectMap items in Index (if possible), speed up update_index for Commits Jan 15, 2022
joepio added a commit that referenced this issue Jan 15, 2022
joepio added a commit that referenced this issue Jan 15, 2022
@joepio joepio mentioned this issue Jan 15, 2022
4 tasks
@joepio
Copy link
Member Author

joepio commented Jan 15, 2022

I went for using format!("{}\n{}\n{}", val, prop, subject) as Keys, and use empty values. Performance is amazing now, mostly <1ms handling of commits, from about 50ms. Good stuff!

@joepio joepio closed this as completed Jan 15, 2022
joepio added a commit that referenced this issue Jan 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Speed improvements server atomic-server
Projects
None yet
Development

No branches or pull requests

1 participant