Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve CursorIterator #7

Closed
phraktle opened this issue Jul 10, 2016 · 14 comments
Closed

improve CursorIterator #7

phraktle opened this issue Jul 10, 2016 · 14 comments

Comments

@phraktle
Copy link
Contributor

Consider making CursorIterator more extensible. It would be reasonable to be able to subclass to provide a range iterator (i.e. a forward iterator that checks an upper bound key, or a reverse iterator checking a lower bound). Since the class is final and tryToComputeNext is private, this is not currently feasible.

Another minor point is that the state machine should probably include a CLOSED state (in which hasNext returns false, and repeated calls to close are idempotent).

@krisskross
Copy link
Contributor

The specific case of an "end" key is that you need to know what buffer you're dealing in order to implement a compare method.

Even if we make tryToComputeNext public, the iterator is instantiated by Dbi so you cannot force a certain implementation at the moment. We could open up the API a little, make the class public and let users implement their own?

@benalexau
Copy link
Member

I recently added a CursorIterator.KeyValPopulator without being aware of this ticket and this offers a simple mechanism for client code to modify the key or value being returned. My use case was iterating an index table where the keys were the index keys and values were the primary key of a different table (so a custom KeyValPopulator would return the value from the data table, not the value from the index table).

One idea is to rename KeyValPopulator and allow users to override whether the iteration is complete (eg boolean isComplete(T key, T val). We could leave a default method on KeyValPopulator which just returns true. Would this suit the use case @phraktle had in mind or are there other suggestions?

@krisskross
Copy link
Contributor

I think we should be careful about introducing domain specific concepts into the API and somehow I feel that CursorIterator.KeyValPopulator is a bit too specific. Wouldn't it be cleaner to implement it with a wrapper class outside of lmdbjava?

For the range iterator, apart from knowing when iteration is complete (stop key), we also need to know where to begin (start key). Both of keys should be optional and can be either inclusive (closed) or exclusive (open). Ideally, we should also let users decide if iteration order is reversed or not.

One suggestion is to introduce a separate method Dbi.iterate(Range) similar to how ranges work in Guava. A Comparator would need to be provided by the user, or using a default implementation per BufferProxy (though implementing a performant Comparator is non-trivial).

@krisskross
Copy link
Contributor

krisskross commented Jul 7, 2017

I implemented something similar in RxLMDB a while back which might serve as inspiration.

https://github.com/deephacks/RxLMDB/blob/master/rxlmdb/src/main/java/org/deephacks/rxlmdb/KeyRange.java

@benalexau
Copy link
Member

benalexau commented Jul 8, 2017

I think we should be careful about introducing domain specific concepts into the API and somehow I feel that CursorIterator.KeyValPopulator is a bit too specific. Wouldn't it be cleaner to implement it with a wrapper class outside of lmdbjava?

I did consider this, but felt a wrapper class would be too verbose. I've pasted the two approaches below.

  1. What can be done with LmdbJava 0.5.0 with this use case:
package au.com.acegi.core.lmdb;

import org.agrona.DirectBuffer;
import static org.apache.commons.lang3.Validate.notNull;
import org.lmdbjava.Cursor;
import org.lmdbjava.CursorIterator.KeyValPopulator;
import org.lmdbjava.Dbi;
import org.lmdbjava.Txn;

/**
 * LmdbJava <code>KeyValPopulator</code> used with index tables.
 *
 * <p>
 * Index tables have their values as the primary key of the data table.
 * When the index table cursor is presented, its value is therefore read to
 * obtain the primary key and the iterator value is fetched from the data
 * table.
 */
public final class IndexKeyValPop implements KeyValPopulator<DirectBuffer> {

  private final Dbi<DirectBuffer> dbi;
  private final Txn<DirectBuffer> txn;

  public IndexKeyValPop(final Txn<DirectBuffer> txn,
                        final Dbi<DirectBuffer> dbi) {
    notNull(txn, "Txn required");
    notNull(dbi, "Dbi required");
    this.txn = txn;
    this.dbi = dbi;
  }

  @Override
  public DirectBuffer getKey(final Cursor<DirectBuffer> cursor) {
    return cursor.key();
  }

  @Override
  public DirectBuffer getVal(final Cursor<DirectBuffer> cursor) {
    final DirectBuffer pk = cursor.val();
    return dbi.get(txn, pk);
  }

}
  1. What would be required to do this without KeyValPopulator:
package au.com.acegi.core.lmdb;

import java.util.Iterator;
import java.util.NoSuchElementException;
import static org.apache.commons.lang3.Validate.notNull;
import org.lmdbjava.CursorIterator;
import org.lmdbjava.Dbi;
import org.lmdbjava.Txn;

/**
 * {@link CursorIterator} that supports transparent use with index tables.
 *
 * <p>
 * Index tables have their values as the primary key of the data table. When an
 * index table cursor is iterated, its value is therefore read to obtain the
 * primary key. The iterator value is then fetched from the primary key table.
 *
 * <p>
 * This class also supports primary key tables for iterator return type
 * consistency within DAOs. In this case the cursor value is simply returned
 * without a lookup.
 *
 * @param <T> buffer type
 */
public final class DaoIterator<T> implements
    Iterator<DaoIterator.KeyVal<T>>, AutoCloseable {

  private final Dbi<T> dbi;
  private final CursorIterator<T> delegate;
  private final KeyVal<T> entry = new KeyVal<>();
  private final Txn<T> txn;

  /**
   * Create a new instance (should be called from generated code only).
   *
   * <p>
   * The generated iterator will delegate all calls to the passed delegate.
   *
   * <p>
   * If this instance will wrap a cursor for an index table, the passed
   * {@link Dbi} must be non-null and represent the primary key table. When
   * {@link #next()} is called, the index table value will be replaced with the
   * corresponding primary key table value.
   *
   * <p>
   * If this instance will wrap a cursor for the primary key table, the passed
   * {@link Dbi} must be null. This avoids the aforementioned replacement of
   * values as would occur with an index table.
   *
   *
   * @param txn      transaction used within the delegate (required)
   * @param delegate to pass all calls to (required)
   * @param dbi      primary key table (must be null if this is the PK table)
   */
  public DaoIterator(final Txn<T> txn, final CursorIterator<T> delegate,
                     final Dbi<T> dbi) {
    notNull(txn, "Transaction required");
    notNull(delegate, "Delegate required");
    this.txn = txn;
    this.delegate = delegate;
    this.dbi = dbi;
  }

  @Override
  public void close() {
    delegate.close();
  }

  @Override
  public boolean hasNext() {
    return delegate.hasNext();
  }

  /**
   * Obtain an iterator.
   *
   * @return an iterator (never null)
   */
  public Iterable<KeyVal<T>> iterable() {
    return () -> DaoIterator.this;
  }

  @Override
  public KeyVal<T> next() throws NoSuchElementException {
    final CursorIterator.KeyVal<T> fetched = delegate.next();
    entry.setKey(fetched.key());
    if (dbi == null) {
      entry.setVal(fetched.val());
    } else {
      final T pk = fetched.val();
      entry.setVal(dbi.get(txn, pk));
    }
    return entry;
  }

  @Override
  public void remove() {
    delegate.remove();
  }

  /**
   * Holder for a key and value pair.
   *
   * <p>
   * The same holder instance will always be returned for a given iterator.
   * The returned keys and values may change or point to different memory
   * locations following changes in the iterator, cursor or transaction.
   *
   * @param <T> buffer type
   */
  public static final class KeyVal<T> {

    private T k;
    private T v;

    /**
     * The key.
     *
     * @return key
     */
    public T key() {
      return k;
    }

    /**
     * The value.
     *
     * @return value
     */
    public T val() {
      return v;
    }

    void setKey(final T key) {
      this.k = key;
    }

    void setVal(final T val) {
      this.v = val;
    }

  }

}

As such the KeyValPopulator is less than one third the lines of code and does not introduce largely duplicate types (DaoIterator = CursorIterator, DaoIterator.KeyVal = CursorIterator.KeyVal).

So while I totally agree we shouldn't introduce domain concepts, I've encountered many occasions where I needed to fetch data using more than one key. I assumed it was a common enough requirement that others would be doing so as well. I'm happy to revert KeyValPopulator if you like though, as DaoIterator solves the problem adequately enough and I might simply be wrong on this being a common requirement.

I implemented something similar in RxLMDB a while back which might serve as inspiration.

Looks good. I'm assuming we could make it buffer agnostic? It would seem we'll need to make BufferProxy<T> implements Comparator<T> to make this work though? I'd only add there would be value in making the KeyRangeType and KeyRange constructor public so that generated DAOs can express the desired behaviour in terms of their domain objects (not the buffer type, which is handled behind-the-scenes in the DAO implementation). For example:

  CursorIterator<DirectBuffer> findPeopleByName(Txn<DirectBuffer> txn, KeyRangeType type, String firstName, String lastName);

In the above the firstName and lastName are converted to a fixed-length (or other delimited) key, with the passed type governing which of the arguments (if any) are actually used. This would suit code generators fine and allow them to abstract away how the key is actually constructed. Certainly the CursorIterator<DirectBuffer> is still returned, but the value is likely the user's domain object (eg a SBE-encoded payload, or other direct input to a deserializer).

@benalexau
Copy link
Member

A Comparator would need to be provided by the user, or using a default implementation per BufferProxy (though implementing a performant Comparator is non-trivial).

I've just pushed a commit that contains tested comparators for all current BufferProxys as prerequisite work for range support. Can @krisskross please take a look? Also would anyone else like to write the range code are you guys happy for me to get started with it?

benalexau added a commit that referenced this issue Jul 12, 2017
Netty already provides Comparable<ByteBuf> so we delegate there.

The work shared by @krisskross on DirectBuffer comparable support at
https://github.com/deephacks/RxLMDB/blob/master/rxlmdb/src/main/java/org/deephacks/rxlmdb/DirectBufferComparator.java
was updated to Java 8. This in turn serves as the basis of the other
BufferProxy comparable implementations. Public static methods are
provided for end user convenience in case they wish to compare
buffers.

ComparatorTest was developed using String.compareTo(String) as the
reference of what is expected for varying-length
strings. ComparatorTest therefore ensures all implementations return
the same comparable integer result for the same inputs. This gives
some confidence the implementations are correct, as they align across
Java String, Netty and the hand-written comparable methods.

No attempt has been made to benchmark or optimise the hand-written
comparable methods at this stage.
benalexau added a commit that referenced this issue Jul 13, 2017
This commit captures the available key ranges and unit tests that
confirm the JavaDoc specification is correct. The KeyRange class
provides package protected methods that encapsulate the iteration and
control logic such that the eventual Iterator class will not have any
awareness of the various KeyRangeTypes and it should be possible to
add new KeyRangeTypes without modifying the actual Iterator.
benalexau added a commit that referenced this issue Jul 14, 2017
benalexau added a commit that referenced this issue Jul 14, 2017
This is possible because KeyRangeType.BACKWARD and
KeyRangeType.FORWARD do not use start/stop buffers and thus can be
static singletons to reduce allocation and GC pressure.
benalexau added a commit that referenced this issue Jul 14, 2017
Existing code should continue to work without modification, as
confirmed by the unmodified CursorIteratorTest.
@benalexau
Copy link
Member

All done. Please take a look and let me know if this meets the CursorIterator flexibility goals.

I reverted KeyValPopulator, as the flexibility now offered by KeyRange requires extended scope for start and stop keys with certain KeyRangeTypes (eg FORWARD_RANGE). Previously CursorIterator only required the passed key for the initial get, so one could trigger this via hasNext() and then deallocate the key buffer before returning control flow. With extended start and stop key scopes now making this infeasible, an iterator wrapper such as the aforementioned DaoIterator is necessary to deallocate those buffers.

@krisskross
Copy link
Contributor

Nice work! I think the changes look good.

The only thing that feels a bit unclear is whether keys are inclusive or exclusive. Ideally we should let users choose for themselves.

I suppose that we intend to let users provide their own comparator by overriding the BufferProxy.compare method, right? I think this is good since there are quite a few comparators out there that users might want to use for performance reasons. Maybe we could add a test using (for example) the lexicographical comparators in UnsignedBytes and SignedBytes?

@benalexau
Copy link
Member

The only thing that feels a bit unclear is whether keys are inclusive or exclusive. Ideally we should let users choose for themselves.

I too found the names unclear, which is why I put examples in the KeyRangeType JavaDocs. I intend to add some more KeyRangeType values to cover the exclusive cases.

I suppose that we intend to let users provide their own comparator by overriding the BufferProxy.compare method, right? I think this is good since there are quite a few comparators out there that users might want to use for performance reasons.

Users can provide their own Comparator by invoking CursorIterator<T> Dbi.iterate(Txn<T>, KeyRange<T>, Comparator<T>). Is that sufficient?

Maybe we could add a test using (for example) the lexicographical comparators in UnsignedBytes and SignedBytes?

Good idea. I'll tackle that tomorrow.

@krisskross
Copy link
Contributor

Sounds all good!

benalexau added a commit that referenced this issue Jul 15, 2017
Deferring to a widely-known external source simplifies adoption and
allows us to more easily identify unsupported range types.
benalexau added a commit that referenced this issue Jul 15, 2017
This simplifies KeyRange, which is the user-facing class. It also
colocates the control logic with the detailed descriptions and
examples of what each KeyRangeType is supposed to do, making it easier
to understand and maintain.
benalexau added a commit that referenced this issue Jul 15, 2017
This is an example of how users could pass their own Comparator,
subject of course to the comparator reflecting the key order returned
by LMDB by Cursor previous / next invocations (if a comparator would
result in a different key order than LMDB will return from its B+ tree
index, we cannot efficiently use LMDB to iterate the key space).
@benalexau
Copy link
Member

I just pushed the final commits. LmdbJava now supports every range type offered by Guava (in both forward and backward cursor directions) and method names now reflect its terminology. There is also an example of using a Guava comparator, plus I added Guava to the ComparatorTest to further validate the project's comparators against third-party references. Are we happy to close this ticket?

@krisskross
Copy link
Contributor

Awesome work again! Yes, the API looks great now.

I noticed the switch for the iterator. Not sure, but the code path might be more performant with separate implementation/function for each KeyRange type? Anyway, it doesn't really matter since we can always refactor it later without affecting the API.

A release might be appropriate?

@benalexau
Copy link
Member

Not sure, but the code path might be more performant with separate implementation/function for each KeyRange type? Anyway, it doesn't really matter since we can always refactor it later without affecting the API.

I found the KeyRangeType switch approach provided the easiest way to see the difference between each range type, which is generally what to do with a just-fetched key. Having all of those those decisions beside one another in the same method was helpful when writing them (and I suspect reasoning about them should we receive related bug reports down the track). As the logic is fairly simple and mostly deals with enum values or final fields, I am hoping it will be optimised away or immaterial versus the broader buffer, JNI and LMDB native overhead. We have to benchmark it but I'd like to see if we receive bug reports or usability suggestions first.

A release might be appropriate?

Absolutely. I will explore #56 and generate a release.

@benalexau
Copy link
Member

Commit 8ec9550 just pushed for ticket #56 may also be of interest, as it shows a user-defined comparator being passed to LMDB itself (via mdb_set_compare) and this user-defined comparator automatically being used with the resulting CursorIterator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants