
Shavadoop

My own implementation of Hadoop MapReduce in Java (school project, 2014).

This includes:

  • a parallel and distributed computing system using multiple computers on a local network
  • a simple Wordcount algorithm to apply to large files
  • a JobTracker to relaunch tasks that fail during the MapReduce process (socket communication)
  • file transfer between computers via scp, so no Distributed File System is required (a sketch of the scp call follows this list)
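
A minimal sketch of how such an scp transfer can be done from Java, assuming passwordless SSH keys are already set up between the machines (the method and the sample values are illustrative, not taken from this repository):

import java.io.IOException;

public class ScpTransferSketch {

    // Push a local file to a remote host's working directory via scp.
    // Assumes SSH key authentication, so no password prompt appears.
    static void scpTo(String localPath, String user, String host, String remoteDir)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder("scp", localPath, user + "@" + host + ":" + remoteDir)
                .inheritIO()
                .start();
        if (p.waitFor() != 0) {
            throw new IOException("scp failed for " + localPath + " -> " + host);
        }
    }

    public static void main(String[] args) throws Exception {
        scpTo("S0", "worker", "192.168.1.42", "/tmp/shavadoop/");  // hypothetical values
    }
}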

Benchmark

1 worker = 1 CPU core of a computer

(benchmark chart: Hadoop VS Shavadoop)

Hadoop

(diagram: Hadoop MapReduce)

Shavadoop

(diagram: Shavadoop MapReduce)

Initialization step: the main working directory is created or cleaned on each host, and the IP addresses of the machines on the local network can be collected and stored automatically if they are not provided in a file.
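
A minimal sketch of how the reachable IP addresses of a local network could be collected, assuming a /24 network and a plain reachability test (the subnet prefix and the timeout are assumptions, not taken from this repository):

import java.io.IOException;
import java.net.InetAddress;
import java.util.ArrayList;
import java.util.List;

public class HostDiscoverySketch {

    // Probe every address of a /24 subnet and keep the ones that answer.
    public static List<String> collectHosts(String prefix) throws IOException {
        List<String> hosts = new ArrayList<>();
        for (int i = 1; i < 255; i++) {
            String ip = prefix + "." + i;                        // e.g. prefix = "192.168.1"
            if (InetAddress.getByName(ip).isReachable(200)) {    // 200 ms timeout (assumed)
                hosts.add(ip);
            }
        }
        return hosts;
    }
}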

Input splitting step: the master splits the input file line by line (if its size is less than 64 MB), otherwise into blocks. Each block is sent to a mapper worker for the next step.
File name syntax: Sx, where x is a counter.
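
A minimal sketch of such a splitter; the 64 MB threshold comes from the description above, while the number of lines per block is an assumption:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class InputSplitterSketch {

    // Write the input either one line per split or in blocks of lines,
    // producing files named S0, S1, S2, ...
    public static void main(String[] args) throws IOException {
        Path input = Paths.get(args[0]);
        boolean lineByLine = Files.size(input) < 64L * 1024 * 1024;   // < 64 MB
        int linesPerBlock = lineByLine ? 1 : 10_000;                  // block size is assumed

        int splitId = 0;
        List<String> buffer = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(input)) {
            String line;
            while ((line = reader.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() == linesPerBlock) {
                    Files.write(Paths.get("S" + splitId++), buffer);
                    buffer.clear();
                }
            }
        }
        if (!buffer.isEmpty()) {
            Files.write(Paths.get("S" + splitId), buffer);
        }
    }
}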

Split mapping step: each mapper worker counts the words in its split and writes one file per reducer worker, spreading the words almost evenly. Thanks to this trick, every word ends up in the right file for the right reducer worker:

// hashing the key modulo the number of reducer workers always sends the same word to the same reducer
int reducerWorkerId = Math.abs((int) (Util.simpleHash(key) % nbWorker));

Note also that the combiner is integrated as standard.
File name syntax: UM_x_W_y_fullhostname, where x is the mapper worker's id and y is the target reducer worker's id.
At the end, thanks to socket communication, a dictionary is created for the next step, the shuffling.
File name syntax: DSM_y, where y is the target reducer worker's id.
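
A minimal sketch of this mapping step with the combiner folded in; simpleHash is a hypothetical stand-in for Util.simpleHash, the socket exchange that announces the DSM dictionary is omitted, and the "word count" file format is an assumption:

import java.io.PrintWriter;
import java.net.InetAddress;
import java.nio.file.*;
import java.util.*;

public class SplitMapperSketch {

    // Hypothetical stand-in for Util.simpleHash: any deterministic string hash works.
    static long simpleHash(String key) {
        long h = 7;
        for (int i = 0; i < key.length(); i++) {
            h = 31 * h + key.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) throws Exception {
        String splitFile = args[0];                    // e.g. "S0"
        int mapperId = Integer.parseInt(args[1]);      // x in UM_x_W_y_fullhostname
        int nbWorker = Integer.parseInt(args[2]);      // number of reducer workers
        String host = InetAddress.getLocalHost().getHostName();

        // Combiner behaviour: counts are aggregated locally before being written out.
        List<Map<String, Integer>> perReducer = new ArrayList<>();
        for (int i = 0; i < nbWorker; i++) perReducer.add(new HashMap<>());

        for (String line : Files.readAllLines(Paths.get(splitFile))) {
            for (String word : line.split("\\s+")) {
                if (word.isEmpty()) continue;
                int reducerId = Math.abs((int) (simpleHash(word) % nbWorker));
                perReducer.get(reducerId).merge(word, 1, Integer::sum);
            }
        }

        // One UM_x_W_y_fullhostname file per reducer worker, with "word count" lines.
        for (int y = 0; y < nbWorker; y++) {
            try (PrintWriter out = new PrintWriter("UM_" + mapperId + "_W_" + y + "_" + host)) {
                for (Map.Entry<String, Integer> e : perReducer.get(y).entrySet()) {
                    out.println(e.getKey() + " " + e.getValue());
                }
            }
        }
    }
}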

Shuffling maps step: from the DSM dictionary file, each reducer worker gets the files it needs to parse and stores the counters for each word in memory.
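
A minimal sketch of the shuffle, assuming a DSM_y file simply lists the UM files the reducer has to fetch, one per line (the file format is an assumption):

import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class ShuffleSketch {

    // Collect, for every word, the list of partial counts found in the UM files
    // listed in the reducer's DSM_y dictionary file.
    public static Map<String, List<Integer>> shuffle(int reducerId) throws IOException {
        Map<String, List<Integer>> counters = new HashMap<>();
        for (String umFile : Files.readAllLines(Paths.get("DSM_" + reducerId))) {
            for (String line : Files.readAllLines(Paths.get(umFile.trim()))) {
                String[] parts = line.split("\\s+");   // "word count"
                counters.computeIfAbsent(parts[0], k -> new ArrayList<>())
                        .add(Integer.parseInt(parts[1]));
            }
        }
        return counters;
    }
}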

Mapping sorted maps step: each reducer worker sums all the counters from the previous step and sends the master a file containing this information.
File name syntax: RM_y_fullhostname, where y is the reducer worker's id.
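
A minimal sketch of that summing step, reusing the map produced by the shuffle sketch above (the "word count" output format is an assumption):

import java.io.IOException;
import java.io.PrintWriter;
import java.net.InetAddress;
import java.util.*;

public class ReduceSketch {

    // Sum the partial counts of every word and write them to RM_y_fullhostname.
    public static void reduce(int reducerId, Map<String, List<Integer>> counters) throws IOException {
        String host = InetAddress.getLocalHost().getHostName();
        try (PrintWriter out = new PrintWriter("RM_" + reducerId + "_" + host)) {
            for (Map.Entry<String, List<Integer>> e : counters.entrySet()) {
                int total = e.getValue().stream().mapToInt(Integer::intValue).sum();
                out.println(e.getKey() + " " + total);
            }
        }
    }
}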

Assembling final maps step: in this last step, the master merges all the RM files into a single one.
File name: output; this is the result.
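
A minimal sketch of that merge, assuming the RM files have been fetched into the master's working directory and use the same "word count" line format:

import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.*;
import java.util.*;

public class AssembleSketch {

    public static void main(String[] args) throws IOException {
        // Alphabetical order is only for readability; the reducers own disjoint word sets.
        Map<String, Integer> total = new TreeMap<>();
        try (DirectoryStream<Path> rmFiles = Files.newDirectoryStream(Paths.get("."), "RM_*")) {
            for (Path rm : rmFiles) {
                for (String line : Files.readAllLines(rm)) {
                    String[] parts = line.split("\\s+");
                    total.merge(parts[0], Integer.parseInt(parts[1]), Integer::sum);
                }
            }
        }
        try (PrintWriter out = new PrintWriter("output")) {
            total.forEach((word, count) -> out.println(word + " " + count));
        }
    }
}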

The job tracker: frequently checks, via socket communication, whether each worker (mapper or reducer) is alive and still working. If not, the job tracker kills the job and relaunches the task on another worker that is available and ready to work hard.
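
A minimal sketch of such a liveness check over sockets; the port, the timeouts and the relaunch hook are assumptions, not taken from this repository:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.List;

public class JobTrackerSketch {

    static final int WORKER_PORT = 9999;   // hypothetical port the workers listen on

    // A worker is considered alive if it accepts a connection within the timeout.
    static boolean isAlive(String host, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, WORKER_PORT), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    // Poll the workers periodically; a dead worker's task goes to a spare host.
    static void watch(List<String> workers, List<String> spareWorkers) throws InterruptedException {
        while (true) {
            for (String worker : workers) {
                if (!isAlive(worker, 2_000) && !spareWorkers.isEmpty()) {
                    String replacement = spareWorkers.remove(0);
                    System.out.println("relaunching task of " + worker + " on " + replacement);
                    // relaunchTask(worker, replacement);   // hypothetical relaunch hook
                }
            }
            Thread.sleep(5_000);
        }
    }
}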

Parallel distributed computing

(diagram: Shavadoop MapReduce threads)

How to use Shavadoop

  1. clone the project with git
  2. edit the config.properties file (included in the src root directory)
  3. generate the jar file with your favorite IDE (or use the terminal like a boss)
  4. from the terminal, execute the jar
  5. have a coffee (or not)
  6. enjoy the result; this command shows the thirty most frequently used words:
cat output | sort -k 2 -nr | head -30

To the future

  • change the Wordcount implementation so that any algorithm can be executed
  • export the properties file outside the jar (more convenient)
  • optimize the block size in the input splitting step with a dynamic size (according to the number of mapper workers)
  • when the input file is split into several blocks, a word could be cut in half (serialization with Avro to avoid that?)
  • replace the file transfer, currently done with scp, with another mechanism, because scp allows only a limited number of connections (sockets only?)
  • add JUnit tests for regression testing
