Lua-MapReduce framework implemented in Lua using luamongo driver and MongoDB as storage. It follows Iterative MapReduce for training of Machine Learning statistical models.
Lua Shell Emacs Lisp
Latest commit 767321e Dec 22, 2015 @pakozm Merge pull request #21 from pakozm/devel
Version 0.4.0
Failed to load latest commit information.
external Added luamongo as dependency Dec 22, 2015
mapreduce Version 0.4.0 Dec 22, 2015
misc Updated cmd in make_sharded script Oct 14, 2014
.dir-locals.el Update .dir-locals.el Jan 12, 2015
.gitignore First commit with mapreduce lua scripts Apr 9, 2014
.gitmodules Added luamongo as dependency Dec 22, 2015
.travis.yml Forcing mongodb installation Dec 22, 2015
LICENSE fix typo in rohitjoshi/lua-mapreduce Oct 4, 2015 Updated tree directory structure May 18, 2014 Updated tree directory structure May 18, 2014 Updated for travis May 20, 2014
execute_server.lua Added sshfs, shared and gridfs test May 16, 2014
execute_worker.lua Improving the execution time, and adding statistics output at the server May 6, 2014 Updating to work with luamongo 0.5.0 Dec 22, 2015


Build Status Travis CI (master branch)

Build Status Travis CI (devel branch)


Lua MapReduce implementation based in MongoDB. It differs from rohitjoshi/lua-mapreduce in the basis of the communication between the processes. In order to allow fault tolerancy, and to reduce the communication protocol complexity, this implementation relies on mongoDB. So, all the data is stored at auxiliary mongoDB collections.

This software depends in:


Copy the mapreduce directory to a place visible from your LUA_PATH environment variable. It is possible to add the active directory by writing in the terminal:

$ export LUA_PATH='?.lua;?/init.lua'


Available at wiki pages.

Performance notes

Word-count example using Europarl v7 English data, with 1,965,734 lines and 49,158,635 running words. The data has been splitted in 197 files with a maximum of 10,000 lines per file. The task is executed in one machine with four cores. The machine runs a MongoDB server, a lua-mapreduce server and four lua-mapreduce workers. Note that this task is not fair because the data could be stored in the local filesystem.

The output of lua-mapreduce was:

$ ./ > output
# Iteration 1
#    Preparing Map
#    Map execution, size= 197
      100.0 % 
#    Preparing Reduce
#    Reduce execution, num_files= 1970  size= 10
      100.0 % 
#   Map sum(cpu_time)     80.297174
#   Reduce sum(cpu_time)  56.829328
# Sum(cpu_time)           137.126502
#   Map sum(real_time)    84.476371
#   Reduce sum(real_time) 63.458693
# Sum(real_time)          147.935064
# Sum(sys_time)           10.808562
#   Map cluster time      26.661836
#   Reduce cluster time   20.710385
# Cluster time            47.372221
# Failed maps     0
# Failed reduces  0
# Server time 49.229152
#    Final execution

Note 1: using only one worker takes: 146 seconds

Note 2: using 30 mappers and 15 reducers (30 workers) takes: 32 seconds

A naive word-count version implemented with pipes and shellscripts takes:

$ time cat /home/experimentos/CORPORA/EUROPARL/en-splits/* | \
  tr ' ' '\n'  | sort | uniq -c > output-pipes
real    2m21.272s
user    2m23.339s
sys     0m2.951s

A naive word-count version implemented in Lua takes:

$ time cat /home/experimentos/CORPORA/EUROPARL/en-splits/* | \
  lua misc/naive.lua > output-naivetime
real    0m26.125s
user    0m17.458s
sys     0m0.324s

Looking to these numbers, it is clear that the better is to work in main memory and in local storage filesystem, as in the naive Lua implementation, which needs only 26 seconds (real time), but uses local disk files. The map-reduce approach takes 49 seconds (real time) with four workers and 146 seconds (real time) with only one worker. These last two numbers are comparable with the naive shellscript implementation using pipes, which takes 146 seconds (real time). Concluding, the preliminar lua-mapreduce implementation, using four workers and MongoDB for communication and GridFS for auxiliary storage, is up to 3 times faster than a shellscript implementation using pipes. Both implementations sort the data in order to aggregate the results. In the future, a larger data task will be choosen to compare this implementation with raw map-reduce in MongoDB and/or Hadoop.

Last notes

This software is in development. More documentation will be added to the wiki pages, while we have time to do that. Collaboration is open, and all your contributions will be welcome.