
Implementation Progress

Matheus C. Santos edited this page Jan 17, 2015 · 3 revisions

Fast Distributed Dataset (FDD) types:

  • Simple ( char, int, long int, float and double ).
  • Pointer ( char *, int *, long int *, float * and double * ) (WILL BE DISCONTINUED IN THE FUTURE).
  • Containers ( std::vector, std::string ).
  • Indexed FDDs - pair of Key (simple or string) and Data (simple, pointer or container).
  • Grouped (a group of two or three datasets).

Data Functions

  • Map - transforms each data item into an item of any other type ( 1 to 1 ).
  • Reduce - combines elements pairwise until a single element remains ( 2 to 1 ).
  • FlatMap - generates a new set of data items from each input item ( 1 to n ).
  • Bulk Map and FlatMap - performance-efficient variants that let the user function implement its own sub-iteration.
  • MapByKey - transforms all items of an indexed dataset that share the same key ( n to 1 ).
  • FlatMapByKey - produces a new set of data from entries grouped by key.
  • UpdateByKey - modifies the content of a dataset.
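
The cardinalities listed above ( 1 to 1, 2 to 1, 1 to n ) can be illustrated with a minimal single-process sketch over std::vector. This is not the framework's API: the helper names mapItems, reduceItems and flatMapItems are hypothetical and exist only for this illustration, and no data distribution is involved.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Map ( 1 to 1 ): one output item per input item.
// Hypothetical helper, not part of the framework.
template <typename Out, typename In, typename F>
std::vector<Out> mapItems(const std::vector<In>& in, F f) {
    std::vector<Out> out;
    out.reserve(in.size());
    for (const In& x : in) out.push_back(f(x));
    return out;
}

// Reduce ( 2 to 1 ): repeatedly combine two items into one
// until a single item remains. The input must not be empty.
template <typename T, typename F>
T reduceItems(const std::vector<T>& in, F f) {
    assert(!in.empty());
    T acc = in.front();
    for (std::size_t i = 1; i < in.size(); ++i) acc = f(acc, in[i]);
    return acc;
}

// FlatMap ( 1 to n ): each input item yields zero or more output
// items, which are concatenated into one result.
template <typename Out, typename In, typename F>
std::vector<Out> flatMapItems(const std::vector<In>& in, F f) {
    std::vector<Out> out;
    for (const In& x : in) {
        std::vector<Out> piece = f(x);
        out.insert(out.end(), piece.begin(), piece.end());
    }
    return out;
}
```

For example, mapItems<int>(v, square) keeps the item count unchanged, reduceItems(v, add) returns a single value, and flatMapItems<int>(v, repeat) may return more or fewer items than it received.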

Other Functions

  • FDD creation from local memory ( through constructor ).
  • Distributed read from file ( through constructor ) - each process reads from its own offset in a global file.
  • collect - get a local copy of the dataset ( send the distributed data to the driver process ).
  • countByKey - works like a histogram ( counts the occurrences of every key and sends the result to the driver process ).
  • groupByKey - groups dataset items by key; items with the same key migrate to a single machine.
  • printInfo - prints runtime information of all tasks.
  • printHeader - prints the header of the runtime information.
  • updateInfo - prints runtime information for all tasks called since the last updateInfo ( useful for program status updates ).
  • Global variables - Global variables that can be modified by the driver process transparently.
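
The histogram-like counting and the grouping by key described above can be sketched locally with std::map. This is an assumption-laden illustration only: the names countByKeyLocal and groupByKeyLocal are hypothetical, the data lives in one process, and the migration of items between machines that the framework performs is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Count how many items carry each key (a histogram over keys).
// Hypothetical local helper, not the framework's implementation.
template <typename K, typename V>
std::map<K, std::size_t> countByKeyLocal(
        const std::vector<std::pair<K, V>>& items) {
    std::map<K, std::size_t> counts;
    for (const auto& kv : items) ++counts[kv.first];
    return counts;
}

// Collect the data of all items that share a key into one bucket.
// In a distributed run each bucket would end up on a single machine;
// here every bucket simply lives in the same map.
template <typename K, typename V>
std::map<K, std::vector<V>> groupByKeyLocal(
        const std::vector<std::pair<K, V>>& items) {
    std::map<K, std::vector<V>> groups;
    for (const auto& kv : items) groups[kv.first].push_back(kv.second);
    return groups;
}
```

Both helpers preserve the input order of items within a key, which std::map's ordered keys make easy to inspect when debugging.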

Release Optimizations:

  • Memory leak plug.

Examples:

  • PageRank ( with and without bulk functions ) - http://en.wikipedia.org/wiki/PageRank
  • Latency test - tests framework latency with O(1) functions.

Implementation planned for next releases:

(in order of priority)

  • Cogroup Optimization.
  • Load Redistribution/Tune
  • Additional function arguments - arguments passed along with the custom function pointer ( ex.: myFdd->map(&mapFunc, arg1, arg2); ).
  • Distributed directory read - Each process reads a local file from a directory (simulate a DFS)
  • HDFS support
  • Fault Tolerance
    • Dataset data replication
    • Process restart/replacement
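
One way the planned function-arguments feature listed above could look is a map that forwards extra arguments to the user function. The sketch below is a guess at the shape of such a call, not the planned implementation: mapWithArgs and scaleAndShift are hypothetical names, the data is a plain std::vector, and shipping the arguments to remote worker processes is not addressed.

```cpp
#include <cassert>
#include <vector>

// Apply func(item, args...) to every item. The extra arguments are
// passed unchanged to each call. Hypothetical helper for illustration.
template <typename Out, typename In, typename F, typename... Args>
std::vector<Out> mapWithArgs(const std::vector<In>& in, F func,
                             const Args&... args) {
    std::vector<Out> out;
    out.reserve(in.size());
    for (const In& x : in) out.push_back(func(x, args...));
    return out;
}

// Example user function taking two additional arguments.
int scaleAndShift(const int& x, int scale, int shift) {
    return x * scale + shift;
}
```

A call such as mapWithArgs<int>(data, &scaleAndShift, 2, 1) mirrors the myFdd->map(&mapFunc, arg1, arg2) form: the function pointer comes first and the additional arguments follow.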
