This repository has been archived by the owner on Sep 4, 2020. It is now read-only.

Count + limit & offset + iterators #23

Closed
wants to merge 8 commits into from

Conversation

SalomonBrys
Contributor

Hi,
this pull request adds three features that I need in order to use SnappyDB in my upcoming app, which will manage chat objects, potentially tens of thousands of them.

Count

Counting currently means loading data and counting the number of entries. The problem I have with findKeys("prefix").length is that all the keys are loaded into the JVM as managed Strings that I will never use. For small collections this is not an issue, but for very big collections (which is how I plan to use SnappyDB), counting in C++, without the JVM having to manage unused key Strings, is, I think, important.

Hence, this request adds countKeys and countKeysBetween.
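
For illustration, a minimal usage sketch (the open/close calls follow the usual SnappyDB pattern; the two counting methods are the ones this PR adds, so the exact signatures may differ slightly):

DB db = DBFactory.open(context, "chats");
int total = db.countKeys("chat:");                          // count every key starting with "chat:"
int range = db.countKeysBetween("chat:0001", "chat:0100");  // count keys within a key range
db.close();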

Offset & limit

@tieubao asked for it in #21. It allows paging (although iterators are better suited for that) and gives direct access to a limited view of the key collection.
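
A minimal sketch of the paging overload (illustrative only; the parameter order of the offset/limit arguments is an assumption based on this description):

// skip the first 20 matching keys, return at most the next 20
String[] secondPage = db.findKeys("chat:", 20, 20);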

Iterators

This is the most important feature. It allows accessing data one by one or in batches, while keeping a "pointer" to the position inside the collection so that browsing can continue later (a minimal usage sketch follows the list below). It is very useful for a huge collection because:

  • It allows paging without having to scan from the beginning again (as offset does).
  • It loads values on demand (better than loading tens of thousands of Strings into the JVM at once).
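
For illustration, a minimal sketch (it assumes the hasNext(), next(int) and close() methods discussed later in this thread):

KeyIterator it = db.findKeysIterator("chat:");
if (it.hasNext()) {
    String[] firstTen = it.next(10);  // only ten keys are materialised in the JVM
    // the iterator keeps its position, so browsing can resume later
}
it.close();  // release the native iterator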

I have documented and tested these three features.
I have also tried to respect your code style, both in C++ and in Java.

Let me know what you think ;)

Salomon

@nhachicha
Owner

Hi @SalomonBrys thx for the PR, I'll take a look & get back to you soonish!

@nhachicha
Owner

Hi @SalomonBrys
The PR looks great. A few tiny things though.

  • I made KeyIterator implement Iterable (so we can use the for-each loop) fc6429d
  • when reaching the end of the iterator, we now close it automatically
  • updated Gradle & docs ad6629e 3c61300

I merged into the develop branch (since I'm using git-flow). If you're happy with the above changes, I can push to master & roll out a new version ASAP.

Cheers

@SalomonBrys
Contributor Author

Hi @nhachicha

I do agree with closing the iterator when reaching the end of the collection, but what is the purpose of line 53, since that is exactly what close() does?

I've thought a bit about Iterable, and the fact is I'm not entirely sure KeyIterator should even be an Iterator at all.
Overall, I think the only methods that should be used on KeyIterator are hasNext(), next(int) and close().

I think that, 90% of the time, for (String key : db.findKeys("prefix")) will be better than for (String key : db.findKeysIterator("prefix")), since the former allocates and fills an array of key Strings all at once while the latter allocates each key String one after the other, effectively giving the GC more work.

One could argue that a for-each with findKeysIterator can avoid an OutOfMemoryError, or simply avoid allocating all the key Strings of a very large collection. But even then, I would recommend using this form:

KeyIterator it = db.findKeysIterator("prefix");
while (it.hasNext()) {
    String[] keys = it.next(BATCH_SIZE); // one JNI call fetches up to BATCH_SIZE keys
    for (String key : keys) {
        /*...*/
    }
}
it.close(); // release the native iterator

For now, I haven't found a use of next() or of for-each on findKeysIterator that isn't less efficient than using findKeys() or next(int).

The loop-on-batch algorithm written above could even be integrated into the library with an Iterable, something like for (String[] keys : findKeysIterator("prefix").byBatch(BATCH_SIZE)) or for (String[] keys : findKeysByBatch("prefix", BATCH_SIZE)). I could work on that if you want.
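
A hypothetical sketch of what that byBatch wrapper could look like (not part of the PR as submitted; it only assumes the hasNext(), next(int) and close() methods discussed above):

import java.util.Iterator;

public class BatchIterable implements Iterable<String[]> {
    private final KeyIterator it;
    private final int batchSize;

    public BatchIterable(KeyIterator it, int batchSize) {
        this.it = it;
        this.batchSize = batchSize;
    }

    @Override
    public Iterator<String[]> iterator() {
        return new Iterator<String[]>() {
            @Override
            public boolean hasNext() {
                boolean more = it.hasNext();
                if (!more) {
                    it.close(); // release the native iterator once exhausted
                }
                return more;
            }

            @Override
            public String[] next() {
                return it.next(batchSize); // one JNI call per batch instead of per key
            }

            @Override
            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }
}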

What do you think?

@nhachicha
Owner

Hey @SalomonBrys

line 53 is useless indeed, good catch :)

You raised a good point in your analysis.
We have two situations:
Let M be the number of accesses (iterations) and N the number of keys.

  • One of the advantages of the iterator approach is that it avoids OutOfMemory issues, as you mentioned (for a big key set);
    on the other hand, the real performance drop IMO is the cost of the JNI calls (the number of boundary crossings and the marshalling/unmarshalling), unlike findKeys (which requires a single JNI call).

Average time complexity is ~ O(log N): O(log N) to find the first key + M calls to next, which stays ~ O(log N) if M is not huge (otherwise the worst case is O(N) when M ≈ N).
Average space complexity is ~ O(1) (the worst case is O(N) when M ≈ N).

Bottom line:

  • Using findKeys to traverse the keys is faster for a small key set.
  • Iterating with the Iterator (using a for-each or a while loop doesn't matter, for-each is only syntactic sugar) is slower (still an empirical guess) given the number of JNI calls,
    but its space complexity is ~ constant, O(1), which could avoid an OOM for a big key set.

The size of the key set IMO drives the decision to use one approach over the other. Hence, we should give this choice to the user by highlighting the pros/cons of each approach via the Javadoc.
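
To put rough numbers on it (illustrative figures only, not measurements): with N = 50,000 keys and a batch size of 500, findKeys makes a single JNI call but materialises all 50,000 key Strings at once, while next(500) makes 100 JNI calls but never holds more than 500 key Strings at a time.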

WDYT

@SalomonBrys
Contributor Author

I agree with everything you said, except that, even on a large key set, we should not allow iterating over all keys one by one. Such a case would effectively mean that M == N, and therefore we would be encouraging the worst-case scenario by allowing the very simple for (String key : db.findKeysIterator("prefix")).

I think we should "force" the user to reduce M by making him handle keys by batch and iterate over key batches rather than keys: for (String[] keys : db.findKeysIterator("prefix").byBatch(200)).

@nhachicha
Owner

OK, I like this pattern where only byBatch would return the Iterator, so the user is forced to specify a batch value (we could emphasise the fact that using a smaller batch value, e.g. 1, will degrade performance).

@SalomonBrys
Contributor Author

OK, I'll re-fork the project and work on this feature in devel ;)

@SalomonBrys SalomonBrys mentioned this pull request Oct 14, 2014