Implementation Progress

Matheus C. Santos edited this page Jan 17, 2015 · 3 revisions

Fast Distributed Dataset (FDD) types:

Simple ( char, int, long int, float and double ).
Pointer ( char *, int *, long int *, float * and double * ) (WILL BE DISCONTINUED IN THE FUTURE).
Containers ( std::vector, std::string ).
Indexed FDDs - pair of Key (simple or string) and Data (simple, pointer or container).
Grouped ( a group of two or three datasets ).

Data Functions

Map - transforms each data item into an item of any other type ( 1 to 1 ).
Reduce - reduces all elements into one by repeatedly combining two items ( 2 to 1 ).
FlatMap - generates a new set of data from each item ( 1 to n ).
Bulk Map and FlatMap - performance-efficient functions that enable sub-iteration implementations.
MapByKey - transforms all items of an indexed dataset that share the same key ( n to 1 ).
FlatMapByKey - exports a new set of data from entries grouped by key.
UpdateByKey - a function that modifies the content of a dataset.
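
To make the cardinalities above concrete, here is a minimal single-process sketch on plain `std::vector`. It does not use the framework's API: the names `mapItems`, `reduceItems`, `flatMapItems` and `mapByKeyItems` are hypothetical and only model what each kind of function consumes and produces.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <type_traits>
#include <utility>
#include <vector>

// Map (1 to 1): each input item produces exactly one output item,
// possibly of a different type.
template <typename In, typename F>
auto mapItems(const std::vector<In>& in, F f) {
    using Out = std::decay_t<decltype(f(in.front()))>;
    std::vector<Out> out;
    out.reserve(in.size());
    for (const In& x : in) out.push_back(f(x));
    return out;
}

// Reduce (2 to 1): f merges two items into one until a single item is left.
template <typename T, typename F>
T reduceItems(const std::vector<T>& in, F f) {
    assert(!in.empty());  // this sketch requires at least one element
    T acc = in.front();
    for (std::size_t i = 1; i < in.size(); ++i) acc = f(acc, in[i]);
    return acc;
}

// FlatMap (1 to n): f returns a vector per item; the vectors are concatenated.
template <typename In, typename F>
auto flatMapItems(const std::vector<In>& in, F f) {
    std::decay_t<decltype(f(in.front()))> out;
    for (const In& x : in) {
        auto part = f(x);
        out.insert(out.end(), part.begin(), part.end());
    }
    return out;
}

// MapByKey (n to 1): all values sharing a key are handed to f together,
// and f returns a single result for that key.
template <typename K, typename V, typename F>
auto mapByKeyItems(const std::vector<std::pair<K, V>>& in, F f) {
    std::map<K, std::vector<V>> groups;
    for (const auto& kv : in) groups[kv.first].push_back(kv.second);
    using Out = std::decay_t<decltype(
        f(std::declval<const K&>(), std::declval<const std::vector<V>&>()))>;
    std::vector<std::pair<K, Out>> out;
    for (const auto& g : groups) out.emplace_back(g.first, f(g.first, g.second));
    return out;
}
```

For instance, `reduceItems(v, [](int a, int b) { return a + b; })` sums a vector of integers, and `flatMapItems` with a function that returns `x` copies of `x` turns three items into six.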

Other Functions

FDD creation from local memory ( through constructor ).
Distributed read from file ( through constructor ) - each process reads from a global file offset.
collect - get a local copy of the dataset ( send the distributed data to the driver process ).
countByKey - just like a histogram ( counts the occurrences of every key and sends the result to the driver process ).
groupByKey - groups a dataset's data by key; data with the same key migrates to a single machine.
printInfo - prints runtime information of all tasks.
printHeader - prints the header of the runtime information.
updateInfo - prints runtime information for all tasks called since the last updateInfo ( useful for program status updates ).
Global variables - Global variables that can be modified by the driver process transparently.
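
The sketch below models three of the items above in a single process: splitting a file into per-process byte ranges, counting keys like a histogram, and assigning every key to exactly one owner. It is not the framework's code. The names `readRange`, `ownerOf` and the `Entry` alias are hypothetical, and the even byte split and the hash-modulo ownership rule are assumptions, since the page does not say how offsets or key owners are actually chosen.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, int>;  // hypothetical (key, data) item

// Byte range [begin, end) that process `rank` would read from a shared file.
// Assumes an even split; a real reader would also align to record boundaries.
std::pair<std::size_t, std::size_t> readRange(std::size_t fileSize,
                                              std::size_t rank,
                                              std::size_t numProcs) {
    assert(numProcs > 0 && rank < numProcs);
    return {fileSize * rank / numProcs, fileSize * (rank + 1) / numProcs};
}

// countByKey: number of occurrences of every key, as a histogram.
std::map<std::string, std::size_t> countByKey(const std::vector<Entry>& data) {
    std::map<std::string, std::size_t> counts;
    for (const Entry& e : data) ++counts[e.first];
    return counts;
}

// Assumed ownership rule: a key always hashes to the same process.
std::size_t ownerOf(const std::string& key, std::size_t numProcs) {
    assert(numProcs > 0);
    return std::hash<std::string>{}(key) % numProcs;
}

// groupByKey: parts[p] holds the entries process p would own after the
// shuffle, so all entries with the same key end up in the same part.
std::vector<std::vector<Entry>> groupByKey(const std::vector<Entry>& data,
                                           std::size_t numProcs) {
    std::vector<std::vector<Entry>> parts(numProcs);
    for (const Entry& e : data) parts[ownerOf(e.first, numProcs)].push_back(e);
    return parts;
}
```

With a 10-byte file and three processes, `readRange` yields the contiguous ranges [0, 3), [3, 6) and [6, 10), so every byte is read exactly once.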

Release Optimizations:

Memory leak plug.

Examples:

PageRank - ( w/ and w/o bulk ) http://en.wikipedia.org/wiki/PageRank
Latency test - tests framework latency with O(1) functions.
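
As a rough illustration of what the PageRank example computes, here is a minimal single-process sketch. It is not the example shipped with the framework and uses no FDD calls. The function name `pageRank`, the damping factor of 0.85 and the even redistribution of rank from pages without outgoing links are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// links[i] lists the pages that page i points to.
// Returns one rank per page; the ranks sum to 1.
std::vector<double> pageRank(const std::vector<std::vector<std::size_t>>& links,
                             int iterations,
                             double damping = 0.85) {
    const std::size_t n = links.size();
    assert(n > 0);
    std::vector<double> rank(n, 1.0 / n);  // start from a uniform distribution

    for (int it = 0; it < iterations; ++it) {
        // Every page receives the "random jump" share first.
        std::vector<double> next(n, (1.0 - damping) / n);
        for (std::size_t i = 0; i < n; ++i) {
            if (links[i].empty()) {
                // Page without outgoing links: spread its rank over all pages.
                for (std::size_t j = 0; j < n; ++j)
                    next[j] += damping * rank[i] / n;
            } else {
                // Split the page's rank evenly among the pages it links to.
                for (std::size_t j : links[i])
                    next[j] += damping * rank[i] / links[i].size();
            }
        }
        rank.swap(next);
    }
    return rank;
}
```

In terms of the data functions listed earlier, the inner loop plays the role of a FlatMap (each page emits one contribution per outgoing link) and the accumulation into `next` plays the role of a per-key reduction.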

Implementation planned for next releases:

(in order of priority)

Cogroup Optimization.
Load Redistribution/Tune
Additional function arguments - arguments passed along with the custom function pointer ( ex.: myFdd->map(&mapFunc, arg1, arg2); ).
Distributed directory read - Each process reads a local file from a directory (simulate a DFS)
HDFS support
Fault Tolerance
  - Dataset data replication
  - Process restart/replacement
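
The planned "additional function arguments" item in the list above could be approached with a variadic template that forwards the extra arguments to the user function on every call. The sketch below shows the idea on a plain `std::vector`; `mapWithArgs` and `scaleAndShift` are hypothetical names, and nothing here reflects how the framework will actually implement the feature.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a map() that accepts extra arguments:
// applies func to every item and passes args... unchanged on each call.
template <typename T, typename R, typename... Args>
std::vector<R> mapWithArgs(const std::vector<T>& data,
                           R (*func)(const T&, Args...),
                           Args... args) {
    assert(func != nullptr);
    std::vector<R> out;
    out.reserve(data.size());
    for (const T& item : data) out.push_back(func(item, args...));
    return out;
}

// Example user function taking two extra arguments besides the item.
double scaleAndShift(const int& x, double scale, double shift) {
    return x * scale + shift;
}
```

A call such as `mapWithArgs(values, &scaleAndShift, 2.0, 1.0)` mirrors the shape of `myFdd->map(&mapFunc, arg1, arg2)`. In this sketch the extra argument types must match the function's parameter types exactly, because the same parameter pack is deduced from both places.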