Example of running hadoop streaming mapper/reducer inside docker container

Hadoop streaming python inside docker

Run the Python Hadoop streaming example with the mapper and reducer running inside a Docker container.

Python code taken from

This setup will work under the following conditions:

  • The docker container will be started by the yarn user, so make sure that user is a member of the docker group. Also, /var/run/docker.sock should be group-writable.
  • stdin and stdout of the docker container are connected to those of the hadoop mapper; this can be done by using docker run -i, see the script.
  • The container should operate on a line-by-line basis: read a line from stdin and write a response to stdout, until an end-of-file is encountered or the pipe is closed. You cannot read the whole of stdin first, do the processing, and only then write to stdout.
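The line-by-line condition can be illustrated with a minimal word-count mapper. This is only a sketch, not the code used in this repository: the function name `run_mapper` and the tab-separated `word<TAB>1` output format are assumptions chosen for illustration.

```python
import sys


def run_mapper(instream=sys.stdin, outstream=sys.stdout):
    """Hypothetical word-count mapper that streams line by line."""
    # Iterating over the stream handles each line as soon as it arrives,
    # instead of calling instream.read() and waiting for end-of-file.
    for line in instream:
        for word in line.split():
            outstream.write(f"{word}\t1\n")
        # Flush after every input line so output is not held in a buffer.
        outstream.flush()
```

A mapper script built this way would simply call `run_mapper()` as its last statement. The loop ends on its own when end-of-file is reached or the pipe is closed.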

runuser hdfs -s /bin/bash /bin/bash -c "hadoop fs -chown jiska /usr/jiska"
runuser hdfs -s /bin/bash /bin/bash -c "hadoop fs -chown jiska /user/jiska"


docker build -t mapred-python .

Run example

# upload test dataset
hdfs dfs -put data/gutenberg

Without docker

yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
-file -mapper \
-file -reducer \
-input /user/vagrant/gutenberg/* \
-output /user/vagrant/gutenberg-pyout

With docker

yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
-mapper -file \
-reducer -file \
-input /user/vagrant/gutenberg/* -output /user/vagrant/gutenberg-output
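For the docker variant, the mapper and reducer handed to hadoop streaming have to pass their stdin and stdout on to the container. The repository does this with its own script; the Python sketch below only illustrates the idea. The helper names `docker_command` and `run_in_container` are hypothetical, and so is whatever command is run inside the image; only the `mapred-python` tag comes from the build step above.

```python
import subprocess

IMAGE = "mapred-python"  # tag used in the docker build step


def docker_command(image, container_cmd):
    """Build the argument list for a streaming-friendly docker run."""
    # -i keeps the container's stdin open so it receives the task's input.
    # --rm removes the container once it exits.
    # No -t: allocating a terminal would interfere with the piped data.
    return ["docker", "run", "-i", "--rm", image] + list(container_cmd)


def run_in_container(container_cmd, image=IMAGE):
    """Run the container and return its exit code."""
    # stdin and stdout are inherited from this process by default,
    # so records flow straight between hadoop and the container.
    return subprocess.call(docker_command(image, container_cmd))
```

A wrapper script would end with something like `sys.exit(run_in_container(["python", "mapper.py"]))`, where `mapper.py` stands in for whatever entry point the image actually provides.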
