Stop using (large) `PropSubjectMap` items in Index (if possible), speed up `update_index` for Commits #282

joepio · 2022-01-11T16:08:58Z

Since I use sled, a KV store, I have to manually create my own index solutions. One of these is the ValueIndex, which stores Values as Keys, so I get a reference index. This is like an inverted store. With this, I can quickly find incoming references in resources. Very useful for constructing Collections! So lets say resource A has a reference to resource B, I might want to be able to find incoming references for resource B, without having to iterate over all resources.

So the Key (K) here is actually the Value from a normal atom. The V is a PropSubjectMap, a HashMap which is a map of Properties and Subjects:

pub type PropSubjectMap = HashMap<String, HashSet<String>>;

This works OK, but does not scale well. Let's consider what happens when we update a Document, and append some new Element to it (a user presses Enter in a document editor). The Commit is parsed by the server, and the server will update the modified resource. After that, an Actor receives a message about this commit, and will update the indexes. We've only updated one Atom (the list of Elements), but since that is a ResourceArray, we'll iterate over every Subject inside that array. For each one, we'll get_prop_subject_map, remove the item accordingly, and then set_prop_subject_map. Setting and getting are expensive operations, since the entire object has to be fetched, (de)serialized and stored.

In total, we're talking about 20ms on my laptop. I think this will get far worse as we have more resources, as getting and setting will become slower the bigger the PropSubjectMap items are. Note that some subjects will have a lot of incoming links, such as commonly used Classes, and probably also users.

The text was updated successfully, but these errors were encountered:

joepio · 2022-01-12T09:45:27Z

Two possible approaches:

Create a `Tree` for each incoming reference

Instead of storing Vec<String> in a Tree, we could make one tree for every Value. I'm not sure whether there are downsides to this. Is opening a tree slow? Probably. A user I chatted with:

I ran small test with 1000 trees and it was quite ok.
I get them all on startup which takes a lot of time, but afterwards it’s fast for writing

Use `range`

Think so - range gives you a double ended iterator
Or maybe get_lt works too for your needs 
The whole stuff hardly depends on the keys you have.
Easiest way to handle is with numbers (Big endian encoded as mentioned in readme)

 IVec implements SSO, so an empty IVec (or any value shorter than a pointer) is equivalent to an usize in memory. I'd

data structure in ValPropSubject index:

Key: (Value, Property, Number)
Value: Subject

I think I'll use .scan_prefix, and use (Value, none, none) prefix for TPF queries without a property, and (Value, property, none) for TPF queries with both a value and a property.

The add_atom_to_index function can now simply create an item without loading into memory a potentially big Hashmap.

But I'm not entirely sure what to do with Number. We can't use Value, Property as the sole key, since these can be the same for multiple atoms, so they will collide.

Make it something random. We fixed the collision issue!
Use a timestamp. Also fixes collision, but now we can also maybe sort by latest changes. Is that useful?
n+1. Check how many items for Value, property exist, and add one. I guess this would make it easier to sort? But I think it will make it more costly to create the atom, since we now have to check what the last item is with something like tree.range_prefix().iter_last()

The remove_atom_from_index function will need to iterate over all items that match some Value, Property prefix, and check if the Subject is the same.

Or...

Should we put the subject in the Key? Does this make sense?

Key: (Value, Property, Subject)
Value: ?

If we can use a range_prefix to find Value, Property combinations, we can iterate over all matching items to get all subjects.

But the value would be empty, or at least, I don't know if it should contain anything. It feels like a ridiculous way of using a KV store, but maybe this is a good thing?

Lexicographic ordering and byte serialization

Sled only stores an array of bytes. If I'm to rely on the range_prefix function, I'll need to make sure that my binary serialization of my keys maintains lexicographic ordering. UTF-8 is already lexicographically ordered, so I can just drop bincode for the keys and use subject.as_bytes(). If I serialize Value, Property, Subject as one string, it becomes a bit hard to parse. I'd rather Have tuple or struct, but that will break ordering.

Some libraries exist that fix this, though. ordcode is interesting, and so is bytekey.

But perhaps splitting a string up using some unconventional, URL-illegal character Also does the trick just fine.

joepio · 2022-01-15T15:57:42Z

I went for using format!("{}\n{}\n{}", val, prop, subject) as Keys, and use empty values. Performance is amazing now, mostly <1ms handling of commits, from about 50ms. Good stuff!

joepio added performance Speed improvements server atomic-server labels Jan 11, 2022

joepio changed the title ~~Stop using (large) Vec items in Index (if possible)~~ Stop using (large) PropSubjectMap items in Index (if possible) Jan 15, 2022

joepio changed the title ~~Stop using (large) PropSubjectMap items in Index (if possible)~~ Stop using (large) PropSubjectMap items in Index (if possible), speed up update_index for Commits Jan 15, 2022

joepio added a commit that referenced this issue Jan 15, 2022

#282 WIP refactor value index

e2c082d

joepio added a commit that referenced this issue Jan 15, 2022

#282 refactor value index

2a40b7d

joepio mentioned this issue Jan 15, 2022

#282 refactor value index #284

Merged

4 tasks

joepio closed this as completed Jan 15, 2022

joepio added a commit that referenced this issue Jan 15, 2022

#282 refactor value index (#284)

5555b70

joepio added this to the v0.31 querying, caching, performance milestone Jan 26, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stop using (large) `PropSubjectMap` items in Index (if possible), speed up `update_index` for Commits #282

Stop using (large) `PropSubjectMap` items in Index (if possible), speed up `update_index` for Commits #282

joepio commented Jan 11, 2022 •

edited

joepio commented Jan 12, 2022 •

edited

joepio commented Jan 15, 2022

Stop using (large) PropSubjectMap items in Index (if possible), speed up update_index for Commits #282

Stop using (large) PropSubjectMap items in Index (if possible), speed up update_index for Commits #282

Comments

joepio commented Jan 11, 2022 • edited

joepio commented Jan 12, 2022 • edited

Create a Tree for each incoming reference

Use range

Lexicographic ordering and byte serialization

joepio commented Jan 15, 2022

Stop using (large) `PropSubjectMap` items in Index (if possible), speed up `update_index` for Commits #282

Stop using (large) `PropSubjectMap` items in Index (if possible), speed up `update_index` for Commits #282

joepio commented Jan 11, 2022 •

edited

joepio commented Jan 12, 2022 •

edited

Create a `Tree` for each incoming reference

Use `range`