This repository has been archived by the owner on Sep 4, 2020. It is now read-only.

Count + limit & offset + iterators #23

Closed
wants to merge 8 commits into from

Conversation

SalomonBrys
Contributor

Hi,
this pull request adds three features that I need in order to use SnappyDB in my upcoming app, which will manage chat objects, potentially tens of thousands of them.

Count

Counting currently means loading data and counting the number of entries. The problem I have with findKeys("prefix").length is that all the keys are loaded into the JVM as managed Strings that I will never use. For small collections this is not an issue, but for very big collections (which is how I plan to use SnappyDB), counting in C++, without the JVM having to manage unused key Strings, is, I think, important.

Hence, this request adds countKeys and countKeysBetween.
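
For illustration, a minimal usage sketch (the open/close calls follow the usual SnappyDB pattern; the two counting methods are the ones this PR adds, so the exact signatures may differ slightly):

DB db = DBFactory.open(context, "chats");
int total = db.countKeys("chat:");                          // count every key starting with "chat:"
int range = db.countKeysBetween("chat:0001", "chat:0100");  // count keys within a key range
db.close();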

Offset & limit

@tieubao asked for it in #21. It allows paging (although iterators are better suited for that) and gives direct access to a limited view of the key collection.
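
A minimal sketch of the paging overload (illustrative only; the parameter order of the offset/limit arguments is an assumption based on this description):

// skip the first 20 matching keys, return at most the next 20
String[] secondPage = db.findKeys("chat:", 20, 20);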

Iterators

This is the most important feature. It allows accessing data one by one or in batches, while keeping a "pointer" to the position inside the collection so that browsing can continue later (a minimal usage sketch follows the list below). It is very useful for a huge collection because:

  • It allows paging without having to scan from the beginning again (as offset does).
  • It loads values on demand (better than loading tens of thousands of Strings into the JVM at once).
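
For illustration, a minimal sketch (it assumes the hasNext(), next(int) and close() methods discussed later in this thread):

KeyIterator it = db.findKeysIterator("chat:");
if (it.hasNext()) {
    String[] firstTen = it.next(10);  // only ten keys are materialised in the JVM
    // the iterator keeps its position, so browsing can resume later
}
it.close();  // release the native iterator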

I have documented and tested these three features.
I have also tried to respect your code style, both in C++ and in Java.

Let me know what you think ;)

Salomon

@nhachicha
Owner

Hi @SalomonBrys thx for the PR, I'll take a look & get back to you soonish!

@nhachicha
Owner

Hi @SalomonBrys
The PR looks great. A few tiny things though.

  • I made KeyIterator implement Iterable (so we can use the for-each loop) fc6429d
  • when reaching the end of the iterator, we now close it automatically
  • updated Gradle & docs ad6629e 3c61300

I merged into the develop branch (since I'm using git-flow). If you're happy with the above changes, I can push to master & roll out a new version ASAP.

Cheers

@SalomonBrys
Contributor Author

Hi @nhachicha

I do agree with closing the iterator when reaching the end of the collection, but what is the purpose of line 53, since that is exactly what close() does?

I've thought a bit about Iterable, and the fact is I'm not entirely sure KeyIterator should even be an Iterator at all.
Overall, I think the only methods that should be used on KeyIterator are hasNext(), next(int) and close().

I think that, 90% of the time, for (String key : db.findKeys("prefix")) will be better than for (String key : db.findKeysIterator("prefix")), since the former allocates and fills an array of key Strings all at once while the latter allocates each key String one after the other, effectively giving the GC more work.

One could argue that a for-each with findKeysIterator can avoid an OutOfMemoryError, or simply avoid allocating all the key Strings of a very large collection. But even then, I would recommend using this form:

KeyIterator it = db.findKeysIterator("prefix");
while (it.hasNext()) {
    String[] keys = it.next(BATCH_SIZE); // one JNI call fetches up to BATCH_SIZE keys
    for (String key : keys) {
        /*...*/
    }
}
it.close(); // release the native iterator

For now, I haven't found a use of next() or of for-each on findKeysIterator that isn't less efficient than using findKeys() or next(int).

The loop-on-batch algorithm written above could even be integrated into the library with an Iterable, something like for (String[] keys : findKeysIterator("prefix").byBatch(BATCH_SIZE)) or for (String[] keys : findKeysByBatch("prefix", BATCH_SIZE)). I could work on that if you want.
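
A hypothetical sketch of what that byBatch wrapper could look like (not part of the PR as submitted; it only assumes the hasNext(), next(int) and close() methods discussed above):

import java.util.Iterator;

public class BatchIterable implements Iterable<String[]> {
    private final KeyIterator it;
    private final int batchSize;

    public BatchIterable(KeyIterator it, int batchSize) {
        this.it = it;
        this.batchSize = batchSize;
    }

    @Override
    public Iterator<String[]> iterator() {
        return new Iterator<String[]>() {
            @Override
            public boolean hasNext() {
                boolean more = it.hasNext();
                if (!more) {
                    it.close(); // release the native iterator once exhausted
                }
                return more;
            }

            @Override
            public String[] next() {
                return it.next(batchSize); // one JNI call per batch instead of per key
            }

            @Override
            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }
}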

What do you think?

@nhachicha
Owner

Hey @SalomonBrys

line 53 is useless indeed, good catch :)

You raised a good point in your analysis.
We have two situations:
Let M be the number of accesses (iterations) and N the number of keys.

  • One of the advantages of the iterator approach is that it avoids OutOfMemory issues, as you mentioned (for a big key set);
    on the other hand, the real performance drop IMO is the cost of the JNI calls (the number of boundary crossings and the marshalling/unmarshalling), unlike findKeys (which requires a single JNI call).

Average time complexity is ~ O(log N): O(log N) to find the first key + M calls to next, which stays ~ O(log N) if M is not huge (otherwise the worst case is O(N) when M ≈ N).
Average space complexity is ~ O(1) (the worst case is O(N) when M ≈ N).

Bottom line:

  • Using findKeys to traverse the keys is faster for a small key set.
  • Iterating with the Iterator (using a for-each or a while loop doesn't matter, for-each is only syntactic sugar) is slower (still an empirical guess) given the number of JNI calls,
    but its space complexity is ~ constant, O(1), which could avoid an OOM for a big key set.

The size of the key set IMO drives the decision to use one approach over the other. Hence, we should give this choice to the user by highlighting the pros/cons of each approach via the Javadoc.
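
To put rough numbers on it (illustrative figures only, not measurements): with N = 50,000 keys and a batch size of 500, findKeys makes a single JNI call but materialises all 50,000 key Strings at once, while next(500) makes 100 JNI calls but never holds more than 500 key Strings at a time.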

WDYT

@SalomonBrys
Contributor Author

I agree with everything you said, except that, even on a large key set, we should not allow iterating over all keys one by one. Such a case would effectively mean that M == N, and therefore we would be encouraging the worst-case scenario by allowing the very simple for (String key : db.findKeysIterator("prefix")).

I think we should "force" the user to reduce M by making him handle keys by batch and iterate over key batches rather than keys: for (String[] keys : db.findKeysIterator("prefix").byBatch(200)).

@nhachicha
Owner

OK, I like this pattern where only byBatch would return the Iterator, so the user is forced to specify a batch value (we could emphasise the fact that using a smaller batch value, e.g. 1, will degrade performance).

@SalomonBrys
Contributor Author

OK, I'll re-fork the project and work on this feature in devel ;)

@SalomonBrys SalomonBrys mentioned this pull request Oct 14, 2014