mapreduce_python

Example that calculates the column-wise mean and sample variance of a matrix using MapReduce (Hadoop Streaming) with Python.
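For reference, the two target quantities can be computed locally with the Python standard library, which is handy for checking the MapReduce output on a small matrix. This is only a sketch: the matrix below is made-up toy data, not the repo's input, and the names `matrix`, `col_means` and `col_vars` are illustrative.

```python
import statistics

# Toy 3-column matrix (hypothetical values, not the repo's input data).
matrix = [
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 200.0],
    [3.0, 30.0, 300.0],
]

# Transpose the rows so that each tuple holds one column.
columns = list(zip(*matrix))

# statistics.variance is the sample variance (divides by n - 1).
col_means = [statistics.mean(col) for col in columns]
col_vars = [statistics.variance(col) for col in columns]

print(col_means)
print(col_vars)
```

The MapReduce jobs described below split the same computation into two passes: one for the means, and one for the variances given those means.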

1. Deploy

1.1. Install Hadoop

Download Hadoop, for example http://mirrors.koehn.com/apache/hadoop/common/hadoop-2.2.0/hadoop-2.2.0.tar.gz, and decompress it. Without further configuration it runs in standalone mode.

1.2. Set up the MapReduce code and example data

Clone this repo to your computer. In the rest of this file, ${PATH_TO_REPO} stands for the local path of the clone.

2. Execution

2.1. Calculate Means with Hadoop

Execute the following from the directory where Hadoop was decompressed. The "3" passed as an argument to the mapper is the number of columns of the input matrix.

bin/hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar -file ${PATH_TO_REPO}/pearson_mean_mapper.py -mapper "./pearson_mean_mapper.py 3" -file ${PATH_TO_REPO}/pearson_mean_reducer.py -reducer "./pearson_mean_reducer.py"  -input ${PATH_TO_REPO}/input -output ${PATH_TO_REPO}/out-mean-01

To view the output:

cat ${PATH_TO_REPO}/out-mean-01/part-*

The code can also be tested without Hadoop by chaining the scripts with pipes from the repo directory, which is useful for debugging:

cat input/*.text | ./pearson_mean_mapper.py 3 | sort -k1,1 | ./pearson_mean_reducer.py
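The pipeline above mirrors the three MapReduce phases: map, sort by key, reduce. The following is a minimal sketch of how a mean mapper and reducer of this kind could work. It is not the code of pearson_mean_mapper.py or pearson_mean_reducer.py, and it assumes whitespace-separated rows and tab-separated "column, value" records; the function names and the toy rows are hypothetical.

```python
from itertools import groupby


def mean_mapper(lines, n_cols):
    """Emit one 'column<TAB>value' record per matrix cell."""
    for line in lines:
        fields = line.split()
        if len(fields) != n_cols:
            continue  # skip blank or malformed rows
        for col, value in enumerate(fields):
            yield f"{col}\t{value}"


def mean_reducer(sorted_records):
    """Average the values of each key; records must arrive sorted by key."""
    parsed = (record.split("\t") for record in sorted_records)
    for col, group in groupby(parsed, key=lambda kv: kv[0]):
        values = [float(value) for _, value in group]
        yield int(col), sum(values) / len(values)


# Toy 3-column matrix (assumed format, not the repo's input data).
rows = ["1 10 100", "2 20 200", "3 30 300"]

# sorted() stands in for `sort -k1,1` (or Hadoop's shuffle phase).
shuffled = sorted(mean_mapper(rows, 3))
means = dict(mean_reducer(shuffled))
print(means)
```

Grouping with `groupby` only works because the records are sorted by key first, which is the same guarantee Hadoop gives the reducer.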

2.2. Calculate Sample Variances with Hadoop

Execute the following from the directory where Hadoop was decompressed. The "3" passed as an argument to the mapper is the number of columns of the input matrix; the values "173,76.33,33.83" are the means of the corresponding columns, as obtained in 2.1.

bin/hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar -file ${PATH_TO_REPO}/pearson_variance_mapper.py -mapper "./pearson_variance_mapper.py 3 173,76.33,33.83" -file ${PATH_TO_REPO}/pearson_variance_reducer.py -reducer "./pearson_variance_reducer.py" -input ${PATH_TO_REPO}/input -output ${PATH_TO_REPO}/out-variance-01

To view the output:

cat ${PATH_TO_REPO}/out-variance-01/part-*

The code can also be tested without Hadoop by chaining the scripts with pipes from the repo directory, which is useful for debugging:

cat input/*.text | ./pearson_variance_mapper.py 3 173,76.33,33.83 | sort -k1,1 | ./pearson_variance_reducer.py
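As a rough illustration of the second pass, the sketch below has the mapper emit the squared deviation of each cell from its column mean, and the reducer sum those per column and divide by n - 1. It is an assumption about how such a job can be structured, not the contents of pearson_variance_mapper.py or pearson_variance_reducer.py; the record format, function names and toy data are hypothetical.

```python
from itertools import groupby


def variance_mapper(lines, n_cols, means):
    """Emit 'column<TAB>(value - mean)^2' for every matrix cell."""
    for line in lines:
        fields = line.split()
        if len(fields) != n_cols:
            continue  # skip blank or malformed rows
        for col, value in enumerate(fields):
            deviation = float(value) - means[col]
            yield f"{col}\t{deviation ** 2}"


def variance_reducer(sorted_records):
    """Sum squared deviations per key and divide by n - 1 (needs n >= 2)."""
    parsed = (record.split("\t") for record in sorted_records)
    for col, group in groupby(parsed, key=lambda kv: kv[0]):
        squares = [float(value) for _, value in group]
        yield int(col), sum(squares) / (len(squares) - 1)


# Toy 3-column matrix and its column means (made-up, not the repo's data).
rows = ["1 10 100", "2 20 200", "3 30 300"]
means_arg = "2,20,200"  # same comma-separated shape as the mapper argument
means = [float(m) for m in means_arg.split(",")]

# sorted() stands in for `sort -k1,1` (or Hadoop's shuffle phase).
shuffled = sorted(variance_mapper(rows, 3, means))
variances = dict(variance_reducer(shuffled))
print(variances)
```

Passing the means in as an argument keeps the mapper stateless, at the cost of running the mean job first.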
