# Hadoop Streaming

MapReduce jobs are usually written in Java. However, Hadoop provides a feature with the somewhat misleading name [Hadoop Streaming](https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html), which lets you write mappers and reducers in Python or any other scripting language, such as `shell`.

# Writing Some Code

First, we need to write our mapper. With Hadoop Streaming, a mapper is a script that reads text from standard input until EOF and writes text to standard output line by line. For example, it can look like the following file:

In [1]:
!cat mapper.py

#!/usr/bin/python3

counter = 0
while True:
    try:
        counter += 1
        input()
    except EOFError:
        break
print(counter)


Mind the first line (the so-called [shebang](https://en.wikipedia.org/wiki/Shebang_(Unix))): it's very important to keep it in place, since Hadoop doesn't know where your favourite Python executable is located. It could just as well be `#!/usr/bin/perl` or `#!/usr/bin/bash`.

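This script does nothing interesting: it reads standard input line by line until EOF, counts the lines, and then prints the total. Note that `counter` is incremented before `input()` is called, so the final attempt that raises `EOFError` is counted too, and the printed value is one greater than the number of lines actually read.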

Of course, you can code anything more complicated: import additional packages, define functions, classes, and so forth.
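The mapper above also counts the failed read at EOF. A minimal sketch of a variant that counts only the lines actually read, by iterating over `sys.stdin` instead of calling `input()` in a loop; the function name `count_lines` is hypothetical and not part of the original scripts:

```python
#!/usr/bin/python3
import sys


def count_lines(stream):
    """Return the number of lines in a text stream."""
    # Iterating over a file object yields one line at a time and
    # stops cleanly at EOF, so no exception handling is needed.
    return sum(1 for _ in stream)


if __name__ == "__main__":
    # Hadoop Streaming feeds the input split to the script on stdin.
    print(count_lines(sys.stdin))
```

Keeping the logic in a function makes it possible to try the mapper on an in-memory stream before uploading it.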

The reducer looks similar, since it generally does the same trick: it goes through the lines of standard input and prints something to standard output. The main difference is that it takes the output of the mapper as its input.

In [2]:
!cat reducer.py

#!/usr/bin/python3

counter = 0
while True:
    try:
        line = input()
    except EOFError:
        break
    counter += int(line)
print(counter)


This reducer sums integer values (which are the line counts produced by the mapper).
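The reducer logic can be checked locally before it is uploaded. The following is a minimal sketch, assuming the same summing logic wrapped in a hypothetical function `reduce_counts`; the sample values are made up. Hadoop Streaming treats a line without a tab as a key with an empty value, so lines may reach the reducer with a trailing tab; the sketch relies on `int()` ignoring surrounding whitespace.

```python
def reduce_counts(lines):
    """Sum one integer per line, as reducer.py does."""
    total = 0
    for line in lines:
        # int() ignores surrounding whitespace, including a trailing tab
        total += int(line)
    return total


# Made-up mapper outputs in the form they may reach the reducer
mapper_output = ["3\t", "5\t", "2\t"]
print(reduce_counts(mapper_output))  # 10
```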

# Pushing Your Code to the Cluster

Hadoop lives on a cluster, and your MapReduce jobs will run on the cluster too. Hadoop __can't__ execute any code from a local machine directly. That means we need to put our code on HDFS somehow.

In [5]:
!hdfs dfs -put mapper.py code
!hdfs dfs -put reducer.py code

Remember to do that whenever you want to update your MapReduce jobs! Since `-put` refuses to overwrite an existing file, add the `-f` flag when re-uploading.

# Running MapReduce

We will use the `mapred streaming` command for running our Hadoop Streaming job. The description of parameters follows:
* `-files` - a comma-separated list of our source code files __on HDFS__. For Python these are plain Python scripts, but for Java they would be `jar` files
* `-input` - a file on HDFS to feed to the mapper
* `-output` - a location on HDFS for the results (the output of the reducer). It must not exist yet, otherwise the job fails
* `-mapper` - the name of the mapper script
* `-reducer` - the name of the reducer script
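If the job is launched from a Python script rather than a notebook cell, the same parameters can be assembled programmatically. This is a minimal sketch; the helper `streaming_command` is hypothetical, and it only builds the argument list from the five options described above.

```python
def streaming_command(files, input_path, output_path, mapper, reducer=None):
    """Build the argument list for a `mapred streaming` invocation."""
    cmd = [
        "mapred", "streaming",
        "-files", ",".join(files),  # comma-separated, no spaces
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
    ]
    if reducer is not None:
        cmd += ["-reducer", reducer]
    return cmd


cmd = streaming_command(
    ["code/mapper.py", "code/reducer.py"],
    "data/yelp_academic_dataset_tip.json",
    "data/result",
    "mapper.py",
    "reducer.py",
)
print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a machine where the `mapred` executable is available.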

Let the magic happen!

In [7]:
!mapred streaming \
    -files code/mapper.py,code/reducer.py \
    -input data/yelp_academic_dataset_tip.json \
    -output data/result \
    -mapper mapper.py \
    -reducer reducer.py

packageJobJar: [] [/home/boris/opt/hadoop-3.2.1/share/hadoop/tools/lib/hadoop-streaming-3.2.1.jar] /tmp/streamjob3635668951347712675.jar tmpDir=null
2020-10-19 17:56:44,237 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2020-10-19 17:56:44,503 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2020-10-19 17:56:44,747 INFO mapred.FileInputFormat: Total input files to process : 1
2020-10-19 17:56:44,763 INFO mapreduce.JobSubmitter: number of splits:8
2020-10-19 17:56:44,879 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1603104554930_0019
2020-10-19 17:56:44,880 INFO mapreduce.JobSubmitter: Executing with tokens: []
2020-10-19 17:56:45,110 INFO conf.Configuration: resource-types.xml not found
2020-10-19 17:56:45,111 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2020-10-19 17:56:45,181 INFO impl.YarnClientImpl: Submitted application application_1603104554930_0019
2020-10-19 17:56:45,227 INFO mapreduce.Job: The url to t

Let's see the results:

In [9]:
!hdfs dfs -ls data/result

Found 2 items
-rw-r--r--   1 boris atg          0 2020-10-19 17:53 data/result/_SUCCESS
-rw-r--r--   1 boris atg          9 2020-10-19 17:53 data/result/part-00000


Mind the file `_SUCCESS`. It appears only when a MapReduce job has finished successfully. The output of the reducer is in the file `part-00000`.

In [10]:
!hdfs dfs -cat data/result/part-00000

1320769	


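This is the sum of the counts printed by the mappers. Because `mapper.py` increments its counter before every read attempt, including the final one that hits EOF, each of the 8 mapper tasks reports one more line than it read, so the input file itself should contain 1320769 - 8 = 1320761 lines.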

# Multiple Parts

Let's run the same MapReduce job but without a reducer. Hadoop Streaming then uses an identity reducer, which prints exactly what it reads from its input:

In [13]:
!mapred streaming \
    -files code/mapper.py,code/reducer.py \
    -input data/yelp_academic_dataset_tip.json \
    -output data/multi_result \
    -mapper mapper.py

packageJobJar: [] [/home/boris/opt/hadoop-3.2.1/share/hadoop/tools/lib/hadoop-streaming-3.2.1.jar] /tmp/streamjob7977458957710831605.jar tmpDir=null
2020-10-19 18:02:29,774 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2020-10-19 18:02:30,047 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2020-10-19 18:02:30,478 INFO mapred.FileInputFormat: Total input files to process : 1
2020-10-19 18:02:30,495 INFO mapreduce.JobSubmitter: number of splits:8
2020-10-19 18:02:30,615 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1603104554930_0020
2020-10-19 18:02:30,616 INFO mapreduce.JobSubmitter: Executing with tokens: []
2020-10-19 18:02:30,843 INFO conf.Configuration: resource-types.xml not found
2020-10-19 18:02:30,844 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2020-10-19 18:02:30,913 INFO impl.YarnClientImpl: Submitted application application_1603104554930_0020
2020-10-19 18:02:30,948 INFO mapreduce.Job: The url to t

In [15]:
!hdfs dfs -ls data/multi_result

Found 2 items
-rw-r--r--   1 boris atg          0 2020-10-19 17:59 data/multi_result/_SUCCESS
-rw-r--r--   1 boris atg         64 2020-10-19 17:59 data/multi_result/part-00000


Although there is only one resulting file, it has several lines:

In [18]:
!hdfs dfs -cat data/multi_result/part-00000

144176	
165905	
166856	
167410	
167537	
169365	
169534	
169986	


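That happened because our mapper was run in parallel: the input was divided into 8 splits, and each mapper task printed its own count. Without our reducer, nothing sums these counts up. You may wonder why the job was slow then. One likely reason is that Python is an interpreted language, so mappers and reducers written in C++ or Java would probably run faster.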
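The effect of running one mapper per split can be imitated locally. The sketch below rests on simplifying assumptions: the made-up input is divided into 8 chunks by line count (Hadoop actually splits by bytes), and `mapper_count` and `split_evenly` are hypothetical helpers. `mapper_count` replicates the counting logic of `mapper.py`, including the extra increment on the final read attempt.

```python
def mapper_count(lines):
    """Replicate mapper.py: the counter is bumped before each read attempt."""
    it = iter(lines)
    counter = 0
    while True:
        try:
            counter += 1
            next(it)
        except StopIteration:  # stands in for EOFError on stdin
            break
    return counter


def split_evenly(lines, n):
    """Divide a list into n contiguous chunks of nearly equal length."""
    size, extra = divmod(len(lines), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        chunks.append(lines[start:end])
        start = end
    return chunks


lines = [f"record {i}" for i in range(100)]  # made-up input
per_split = [mapper_count(chunk) for chunk in split_evenly(lines, 8)]
print(per_split)       # one number per mapper task
print(sum(per_split))  # 108: 100 lines plus one extra per split
```

Each entry in `per_split` corresponds to one line of the `part-00000` file above, and their sum is what the reducer printed in the earlier job.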

# Do It Yourself
* code a mapper for counting words or characters, not lines
* code a reducer to count lines, words, and symbols in one job
* code a mapper to sum over the `compliment_count` field value