New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Map reduce support #231

Open
klehmann opened this Issue Oct 29, 2013 · 6 comments

Comments

Projects
None yet
3 participants
@klehmann

klehmann commented Oct 29, 2013

I am currently thinking about implementing map reduce with MapDB maps in my application. CouchDB has an approach where intermediate reduce results get stored in b-tree nodes for faster computations after b-tree changes (only nodes with changed children need to be recomputed). That might also be possible with MapDB, when DirNodes could have additional persistent properties.
Would be interesting to have such a map/reduce functionality built right into the MapDB core.

@jankotek

This comment has been minimized.

Owner

jankotek commented Oct 29, 2013

CouchDB has an approach where intermediate reduce results get stored in b-tree nodes for faster computations after b-tree changes

Good idea. Sounds doable as BTreeMap extension. Intermediate results should be stored in secondary collections.

Simplest case is probably counted btree
http://www.chiark.greenend.org.uk/~sgtatham/algorithms/cbtree.html

@dzlab

This comment has been minimized.

dzlab commented Jun 6, 2014

@klehmann can you share how you had eventually implemented map-reduce using MapDB?

@klehmann

This comment has been minimized.

klehmann commented Jun 6, 2014

I haven't done any implementation yet.

@jankotek

This comment has been minimized.

Owner

jankotek commented Jun 6, 2014

I think better name would be "per node agregations".

1.0 storage format supports per-node metadata, so it is just question to add tooling.

This has been added to roadmap and is scheduled for 1.2 in about 5 months
http://www.kotek.net/blog/MapDB_Roadmap_and_near_future

@jankotek jankotek added 1.2 and removed after 1.0 labels Jun 23, 2014

@jankotek jankotek added 2.0 and removed 1.2 labels Aug 17, 2014

@jankotek

This comment has been minimized.

Owner

jankotek commented Nov 11, 2014

I started some actual work on this in separate branch for 2.0. Lets discus some design:

For start I have two use cases: counted btree and sum for BTreeMap<String,Long>. In both cases submap (interval) should produce count and sum without traversing entire submap.

Aggregation value must be calculated without traversing entire set as well. So on updates new aggregation value should be produced on incremental fashion from old aggregation value, old node and new node. For this aggregation value must be produced using associative operation (such as + or * or else).

And finally the aggregation value should be flexible, any typ should do. Only condition is that it can be produced using incremental (associative) operations.

So for now I am thinking that user would attach this call bacl interface to BTreeMap:

class Agr implements Agregator<A>{

  /** serializer used to serialize aggregation value
  Serializer<A> serializer();

  /*calculate leaf node aggregation value, after some key was removed*/
  A leafDelete(LeafNode oldNode, LeafNode newNode, A oldAgregate, Object removedKey, Object removedVal);

  /*calculate leaf node aggregation value, after some key was added*/
  A leafAdd(LeafNode oldNode, LeafNode newNode, A oldAgregate, Object newKey, Object newVal);

  /*calculate leaf node aggregation value, after some key was updated*/
  A leafUpdate(LeafNode oldNode, LeafNode newNode, A oldAgregate, Object key, Object oldVal, Object newVal);


  /** 
   * Calculate directory node agregate after child leaf node has new key added 
   * Please note that we only have access to old aggregate and modified data
   */
  A dirAdd(A oldAggregate, Object newKey, Object newValue);


  /** 
   * Calculate directory node agregate after child leaf node has key removed
   * Please note that we only have access to old aggregate and modified data
   */
  A dirDelete(A oldAggregate, Object deletedKey, Object deletedValue);

  /** 
   * Calculate directory node agregate after child leaf node has key updated
   * Please note that we only have access to old aggregate and modified data
   */
  A dirUpdated(A oldAggregate, Object key, Object oldVal, Object newVal);

}

Now BTreeMap will use this class to calculate aggregate value on each update. It will also keep aggregate values synchronized with updates.

I am not sure yet howto use it exactly, but there will be two phases:

Traverse
In here user code will traverse directory nodes in which its interested. It will start from root and progress to lower layers. Callback interface will get directory nodes and will be able to do 3 operations:

  1. reject, do not traverse children of this dir node
  2. continue: traverse children of this dir node. In case this directory node is crossing boundary of interval limits
  3. aggregate: do not traverse children of this node, but use aggregate value of this node in final aggregation

Once traversal descends to leaf nodes, it will have similar operations. But instead of operation '2) continue', it will calculate new partial aggregate which matches interval limits.

Aggregate
Traversal code will return set of aggregation values as an iterator. In this phase user will collect all values and produce final value (count or final sum... etc).

@jankotek jankotek added 3.0 and removed 2.0 labels Feb 15, 2016

@jankotek jankotek added 3.2 and removed 3.0 labels Jun 12, 2016

@jankotek jankotek added 4.0 document and removed 3.2 labels Oct 11, 2017

@jankotek

This comment has been minimized.

Owner

jankotek commented Oct 11, 2017

there is new chapter about data pump, but should include important bits from thjis

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment