Example of running hadoop streaming mapper/reducer inside docker container
Hadoop streaming python inside docker

Run the Python Hadoop streaming example with the mapper and reducer executing inside a Docker container.

Python code taken from

This setup will work under the following conditions:

  • The Docker container is started by the yarn user, so make sure that user is a member of the docker group. The socket /var/run/docker.sock must also be group-writable.
  • The stdin and stdout of the Docker container must be connected to those of the Hadoop mapper; docker run -i does this, see the script.
  • The container must operate line by line: read a line from stdin and write the response to stdout, until end-of-file is encountered or the pipe is closed. You cannot read all of stdin first, process it, and only then write to stdout.
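The line-by-line requirement can be illustrated with a minimal word-count mapper. This is only a sketch, not the code the example uses: the names map_line and main are made up for illustration, and the tab-separated "word, 1" output is an assumed record format.

```python
#!/usr/bin/env python3
"""Sketch of a line-by-line streaming mapper (illustrative, not the original code)."""
import sys


def map_line(line):
    """Return one 'word<TAB>1' record for each whitespace-separated word."""
    return ["%s\t1" % word for word in line.split()]


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Iterate over stdin lazily; calling stdin.read() would wait for all
    # input before producing any output.
    for line in stdin:
        for record in map_line(line):
            stdout.write(record + "\n")
        # Flush after every input line so records are not held in a buffer.
        stdout.flush()


if __name__ == "__main__":
    main()
```

Flushing after each line keeps output moving through the pipe while input is still arriving, at some cost in throughput.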

runuser hdfs -s /bin/bash /bin/bash -c "hadoop fs -chown jiska /usr/jiska"
runuser hdfs -s /bin/bash /bin/bash -c "hadoop fs -chown jiska /user/jiska"


docker build -t mapred-python .

Run example

# upload test dataset
hdfs dfs -put data/gutenberg

Without docker

yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
-file -mapper \
-file -reducer \
-input /user/vagrant/gutenberg/* \
-output /user/vagrant/gutenberg-pyout

With docker

yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
-mapper -file \
-reducer -file \
-input /user/vagrant/gutenberg/* \
-output /user/vagrant/gutenberg-output
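In the Docker variant, the file passed as mapper or reducer has to start the container and hand its own stdin and stdout over to it. The wrapper used by this example is not reproduced here; the following is a hypothetical Python sketch of the idea. It assumes the image tag mapred-python from the build step, and it takes the command to run inside the container from its own arguments. The function name docker_command is illustrative.

```python
#!/usr/bin/env python3
"""Hypothetical wrapper: run a command in the container, wired to our stdin/stdout."""
import os
import sys

IMAGE = "mapred-python"  # assumed: the tag given to `docker build -t`


def docker_command(image, container_args):
    """Build the argv for `docker run`.

    -i keeps the container's stdin open so the streaming input reaches it;
    --rm removes the container once it exits. No -t is passed, because a
    TTY would alter the byte stream.
    """
    return ["docker", "run", "-i", "--rm", image] + list(container_args)


def main(argv):
    cmd = docker_command(IMAGE, argv[1:])
    # Replace this process with docker so stdin/stdout are inherited as-is.
    os.execvp(cmd[0], cmd)


# Only start a container when a command to run inside it was given.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

Using os.execvp rather than a child process means there is no intermediate Python process copying data between Hadoop and the container.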

