Map reduce support #231
Comments
Good idea. Sounds doable as a BTreeMap extension. Intermediate results should be stored in secondary collections. The simplest case is probably a counted BTree.
@klehmann can you share how you eventually implemented map-reduce using MapDB?
I haven't done any implementation yet. |
I think a better name would be "per-node aggregations". The 1.0 storage format supports per-node metadata, so it is just a question of adding tooling. This has been added to the roadmap and is scheduled for 1.2, in about 5 months.
I started some actual work on this in a separate branch for 2.0. Let's discuss some design. For a start I have two use cases: a counted BTree, and sum for a BTreeMap<String,Long>. In both cases a submap (interval) should produce the count or sum without traversing the entire submap. The aggregation value must be calculated without traversing the entire set as well. So on updates the new aggregation value should be produced in an incremental fashion from the old aggregation value, the old node and the new node. For this the aggregation value must be produced using an associative operation (such as + or *). And finally the aggregation value should be flexible; any type should do. The only condition is that it can be produced using incremental (associative) operations. So for now I am thinking that the user would attach this callback interface to BTreeMap:

```java
class Agr implements Agregator<A> {

    /** serializer used to serialize the aggregation value */
    Serializer<A> serializer();

    /** calculate leaf node aggregation value after some key was removed */
    A leafDelete(LeafNode oldNode, LeafNode newNode, A oldAgregate, Object removedKey, Object removedVal);

    /** calculate leaf node aggregation value after some key was added */
    A leafAdd(LeafNode oldNode, LeafNode newNode, A oldAgregate, Object newKey, Object newVal);

    /** calculate leaf node aggregation value after some key was updated */
    A leafUpdate(LeafNode oldNode, LeafNode newNode, A oldAgregate, Object key, Object oldVal, Object newVal);

    /**
     * Calculate directory node aggregate after a child leaf node has a new key added.
     * Please note that we only have access to the old aggregate and the modified data.
     */
    A dirAdd(A oldAggregate, Object newKey, Object newValue);

    /**
     * Calculate directory node aggregate after a child leaf node has a key removed.
     * Please note that we only have access to the old aggregate and the modified data.
     */
    A dirDelete(A oldAggregate, Object deletedKey, Object deletedValue);

    /**
     * Calculate directory node aggregate after a child leaf node has a key updated.
     * Please note that we only have access to the old aggregate and the modified data.
     */
    A dirUpdated(A oldAggregate, Object key, Object oldVal, Object newVal);
}
```

Now BTreeMap will use this class to calculate the aggregate value on each update. It will also keep aggregate values synchronized with updates. I am not sure yet how to use it exactly, but there will be two phases:

Traverse
Once traversal descends to leaf nodes, it will have similar operations. But instead of operation '2) continue', it will calculate a new partial aggregate which matches the interval limits.

Aggregate
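As an illustration of the sum-for-BTreeMap<String,Long> use case from the comment above, the callbacks can be implemented purely incrementally. The following is a minimal, self-contained sketch, not MapDB code: the interface is simplified (the LeafNode parameters are dropped and the serializer is omitted so it compiles standalone), and all class names here are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

public class SumAggregatorSketch {

    /** Simplified placeholder for the callback interface proposed above. */
    interface Agregator<A> {
        A leafAdd(A oldAgregate, Object newKey, Object newVal);
        A leafDelete(A oldAgregate, Object removedKey, Object removedVal);
        A leafUpdate(A oldAgregate, Object key, Object oldVal, Object newVal);
        A dirAdd(A oldAggregate, Object newKey, Object newValue);
        A dirDelete(A oldAggregate, Object deletedKey, Object deletedValue);
        A dirUpdated(A oldAggregate, Object key, Object oldVal, Object newVal);
    }

    /** Sum of Long values: every callback derives the new aggregate in O(1) using only + and -. */
    static class SumAgregator implements Agregator<Long> {
        public Long leafAdd(Long old, Object k, Object v)              { return nz(old) + (Long) v; }
        public Long leafDelete(Long old, Object k, Object v)           { return nz(old) - (Long) v; }
        public Long leafUpdate(Long old, Object k, Object o, Object n) { return nz(old) - (Long) o + (Long) n; }
        public Long dirAdd(Long old, Object k, Object v)               { return leafAdd(old, k, v); }
        public Long dirDelete(Long old, Object k, Object v)            { return leafDelete(old, k, v); }
        public Long dirUpdated(Long old, Object k, Object o, Object n) { return leafUpdate(old, k, o, n); }
        private static long nz(Long v) { return v == null ? 0L : v; }
    }

    public static void main(String[] args) {
        SumAgregator agr = new SumAgregator();
        Map<String, Long> leaf = new TreeMap<>();
        Long aggregate = null;

        // Simulate updates to one leaf node; the aggregate is never recomputed from scratch.
        aggregate = agr.leafAdd(aggregate, "a", 10L);          leaf.put("a", 10L);
        aggregate = agr.leafAdd(aggregate, "b", 32L);          leaf.put("b", 32L);
        aggregate = agr.leafUpdate(aggregate, "a", 10L, 15L);  leaf.put("a", 15L);

        System.out.println("incremental aggregate = " + aggregate); // 47
        long full = 0;
        for (long v : leaf.values()) full += v;
        System.out.println("full recompute        = " + full);      // also 47
    }
}
```

Because + is associative, the same pattern gives a counted BTree (aggregate = number of entries) by returning old + 1 and old - 1 instead of adding and subtracting values.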
There is a new chapter about the data pump, but it should include the important bits from this.
I am currently thinking about implementing map-reduce with MapDB maps in my application. CouchDB has an approach where intermediate reduce results get stored in B-tree nodes for faster computation after B-tree changes (only nodes with changed children need to be recomputed). That might also be possible with MapDB, if DirNodes could have additional persistent properties.
It would be interesting to have such map/reduce functionality built right into the MapDB core.
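To make the CouchDB analogy concrete, here is a toy sketch (plain Java, not MapDB, and not a real B-tree: just one directory node with two range-partitioned leaves, all names illustrative). Each directory node caches the reduce value of its children, so after a change only the caches on the path from the modified leaf to the root are refreshed.

```java
import java.util.TreeMap;

/** Toy two-level tree where each node caches a reduce value (here: a sum of Long values). */
public class CachedReduceSketch {

    static class Leaf {
        final TreeMap<String, Long> entries = new TreeMap<>();
        long reduce;                                    // cached sum of this leaf's values

        void put(String key, long value) {
            Long old = entries.put(key, value);
            reduce += value - (old == null ? 0 : old);  // refresh only this leaf's cache
        }
    }

    static class Dir {
        final Leaf lower = new Leaf();                  // keys < "m"
        final Leaf upper = new Leaf();                  // keys >= "m"
        long reduce;                                    // cached sum of the children's caches

        void put(String key, long value) {
            Leaf child = key.compareTo("m") < 0 ? lower : upper;
            long before = child.reduce;
            child.put(key, value);
            reduce += child.reduce - before;            // only the changed child's delta reaches the root
        }
    }

    public static void main(String[] args) {
        Dir root = new Dir();
        root.put("apples", 3);
        root.put("oranges", 5);
        root.put("apples", 7);                          // overwrite: only one leaf and the root are touched
        System.out.println("root reduce = " + root.reduce); // 12
    }
}
```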