# Hadoop Streaming

MapReduce jobs are usually written in Java. However, Hadoop provides a feature with the somewhat misleading name [Hadoop Streaming](https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html), which lets you write mappers and reducers in Python or any other scripting language, such as shell.

# Writing Some Code

First, we need to code our mapper. With Hadoop Streaming, a mapper is a script that reads text from standard input until EOF and writes text line by line to standard output. For example, it can look like the following file:

In [1]:
!cat mapper.py

#!/usr/bin/python3

counter = 0
while True:
    try:
        counter += 1
        input()
    except EOFError:
        break
print(counter)


Mind the first line (the so-called [shebang](https://en.wikipedia.org/wiki/Shebang_(Unix))). It is very important to keep it in place, since Hadoop doesn't know where your favourite Python executable is located. It could just as well be `#!/usr/bin/perl` or `#!/usr/bin/bash`.

This script does nothing interesting: it simply goes through its input line by line until EOF, counting lines, and then prints the total.

Of course, you can code anything more complicated: import additional packages, define functions, classes, and so forth.
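As a small illustration of structuring a mapper with functions, here is a minimal sketch of a line counter that iterates over a stream and increments its counter only after a line has actually been read, so an empty input gives 0. The names `count_lines` and `main` are illustrative choices, not anything Hadoop Streaming requires, and the demonstration runs on an in-memory stream rather than real standard input.

```python
import io
import sys


def count_lines(stream):
    """Return the number of lines in an iterable of text lines."""
    counter = 0
    for _ in stream:
        counter += 1
    return counter


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Mapper entry point: read everything, then print a single total."""
    print(count_lines(stdin), file=stdout)


# Local demonstration on an in-memory stream instead of real standard input.
demo_out = io.StringIO()
main(stdin=io.StringIO("first\nsecond\nthird\n"), stdout=demo_out)
print(demo_out.getvalue().strip())  # 3
```

In an actual mapper script you would keep the shebang line at the top and simply call `main()` with its defaults.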

The reducer looks similar, since it generally does the same trick: it goes through the lines of standard input and prints something to standard output. The main difference is that it receives the output of the mapper as its input.

In [2]:
!cat reducer.py

#!/usr/bin/python3

counter = 0
while True:
    try:
        line = input()
    except EOFError:
        break
    counter += int(line)
print(counter)


This reducer sums integer values (which are the line counts produced by the mapper).
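Before submitting anything to Hadoop, it can help to check the map and reduce logic locally. The following is a rough sketch under simplifying assumptions: the helper names (`map_count`, `reduce_sum`, `simulate_job`) are hypothetical, the input is cut into equally sized chunks (real Hadoop splits files by byte ranges), and the shuffle is imitated by a plain text sort.

```python
def map_count(lines):
    """Mapper stand-in: emit one line holding the number of input lines."""
    return [str(sum(1 for _ in lines))]


def reduce_sum(lines):
    """Reducer stand-in, same logic as reducer.py: sum one integer per line."""
    return [str(sum(int(line) for line in lines))]


def simulate_job(records, n_splits):
    """Run map_count on up to n_splits chunks, sort the outputs, then reduce."""
    chunk = max(1, -(-len(records) // n_splits))  # ceiling division
    splits = [records[i:i + chunk] for i in range(0, len(records), chunk)]
    mapped = [out for split in splits for out in map_count(split)]
    # Hadoop sorts mapper output (as text) before the reducer sees it.
    return reduce_sum(sorted(mapped))


records = [f"record {i}" for i in range(10)]
print(simulate_job(records, n_splits=3))  # ['10']
```

Because addition does not depend on order, the result is the same however the records are split, which is what makes this computation a good fit for MapReduce.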

# Pushing Your Code to the Cluster

Hadoop lives on a cluster, and your MapReduce jobs will run on the cluster too. Hadoop __can't__ execute any code from a local machine directly. That means we need to put our code on HDFS somehow.

In [3]:
!hdfs dfs -put mapper.py code
!hdfs dfs -put reducer.py code

Remember to do that whenever you want to update your MapReduce jobs!
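One caveat when updating: `hdfs dfs -put` refuses to overwrite a file that already exists at the destination unless the `-f` option is given. The sketch below only builds the corresponding command lines. The helper name `put_commands` is hypothetical, and the file and directory names are the ones used above.

```python
import shlex


def put_commands(local_files, hdfs_dir, overwrite=True):
    """Build one `hdfs dfs -put` argument list per local file."""
    commands = []
    for path in local_files:
        args = ["hdfs", "dfs", "-put"]
        if overwrite:
            args.append("-f")  # replace the copy already stored on HDFS
        args += [path, hdfs_dir]
        commands.append(args)
    return commands


for args in put_commands(["mapper.py", "reducer.py"], "code"):
    print(shlex.join(args))
```

Each argument list could be handed to `subprocess.run(args, check=True)` on a machine where the `hdfs` command is available.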

# Running MapReduce

We will use the `mapred streaming` command for running our Hadoop Streaming job. The description of parameters follows:
* `-files` - a comma-separated list of our source code files __on HDFS__. For Python these are plain Python scripts; for Java they would be `jar` files
* `-input` - a file on HDFS to feed to the mapper
* `-output` - a location on HDFS where the results (the output of the reducer) will be written
* `-mapper` - the name of the mapper script
* `-reducer` - the name of the reducer script
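To make the role of each option concrete, here is a small sketch that assembles the argument list for `mapred streaming` from the five parameters described above. The function name `streaming_command` is hypothetical, and only the options listed above are used. `-files` is placed first because it is a generic Hadoop option, and those must precede the streaming-specific ones.

```python
import shlex


def streaming_command(files, input_path, output_path, mapper, reducer=None):
    """Build the argument list for a Hadoop Streaming job."""
    args = [
        "mapred", "streaming",
        "-files", ",".join(files),  # comma-separated, no spaces
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
    ]
    if reducer is not None:
        args += ["-reducer", reducer]
    return args


command = streaming_command(
    files=["code/mapper.py", "code/reducer.py"],
    input_path="data/yelp_academic_dataset_tip.json",
    output_path="data/result",
    mapper="mapper.py",
    reducer="reducer.py",
)
print(shlex.join(command))
```

Building the command as a list rather than a single string avoids quoting problems if a path ever contains spaces.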

Let the magic happen!

In [4]:
!mapred streaming \
    -files code/mapper.py,code/reducer.py \
    -input data/yelp_academic_dataset_tip.json \
    -output data/result \
    -mapper mapper.py \
    -reducer reducer.py

2020-10-23 19:20:19,322 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2020-10-23 19:20:19,400 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2020-10-23 19:20:19,401 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2020-10-23 19:20:19,411 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2020-10-23 19:20:19,839 INFO mapred.FileInputFormat: Total input files to process : 1
2020-10-23 19:20:19,860 INFO mapreduce.JobSubmitter: number of splits:8
2020-10-23 19:20:20,064 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1720303681_0001
2020-10-23 19:20:20,064 INFO mapreduce.JobSubmitter: Executing with tokens: []
2020-10-23 19:20:20,385 INFO mapred.LocalDistributedCacheManager: Localized file:/workdir/boris/projects/big-data-exercises/hadoop-examples/code/mapper.py as file:/tmp/hadoop-boris/mapred/local/job_local1720303681_0001_b33a27fb-19ea-4c3b-bb88-1b7917cced61/mapper.py
202

2020-10-23 19:20:21,496 INFO streaming.PipeMapRed: PipeMapRed exec [/workdir/boris/projects/big-data-exercises/hadoop-examples/./mapper.py]
2020-10-23 19:20:21,496 INFO mapreduce.Job: Job job_local1720303681_0001 running in uber mode : false
2020-10-23 19:20:21,526 INFO mapreduce.Job:  map 100% reduce 0%
2020-10-23 19:20:21,548 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
2020-10-23 19:20:21,549 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
2020-10-23 19:20:21,550 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
2020-10-23 19:20:21,588 INFO streaming.PipeMapRed: R/W/S=1000/0/0 in:NA [rec/s] out:NA [rec/s]
2020-10-23 19:20:21,606 INFO streaming.PipeMapRed: R/W/S=10000/0/0 in:NA [rec/s] out:NA [rec/s]
2020-10-23 19:20:21,746 INFO streaming.PipeMapRed: R/W/S=100000/0/0 in:NA [rec/s] out:NA [rec/s]
2020-10-23 19:20:21,858 INFO streaming.PipeMapRed: Records R/W=169364/1
2020-10-23 19:20:21,859 INFO streaming.PipeMapRed: MREr

2020-10-23 19:20:22,746 INFO streaming.PipeMapRed: Records R/W=165904/1
2020-10-23 19:20:22,748 INFO streaming.PipeMapRed: MRErrorThread done
2020-10-23 19:20:22,749 INFO streaming.PipeMapRed: mapRedFinished
2020-10-23 19:20:22,750 INFO mapred.LocalJobRunner: 
2020-10-23 19:20:22,750 INFO mapred.MapTask: Starting flush of map output
2020-10-23 19:20:22,750 INFO mapred.MapTask: Spilling map output
2020-10-23 19:20:22,750 INFO mapred.MapTask: bufstart = 0; bufend = 8; bufvoid = 104857600
2020-10-23 19:20:22,750 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600
2020-10-23 19:20:22,753 INFO mapred.MapTask: Finished spill 0
2020-10-23 19:20:22,754 INFO mapred.Task: Task:attempt_local1720303681_0001_m_000003_0 is done. And is in the process of committing
2020-10-23 19:20:22,756 INFO mapred.LocalJobRunner: Records R/W=165904/1
2020-10-23 19:20:22,756 INFO mapred.Task: Task 'attempt_local1720303681_0001_m_000003_0' done.
2020-10-23 19:20:22,757

2020-10-23 19:20:23,744 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
2020-10-23 19:20:23,744 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
2020-10-23 19:20:23,744 INFO mapred.MapTask: soft limit at 83886080
2020-10-23 19:20:23,744 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
2020-10-23 19:20:23,744 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
2020-10-23 19:20:23,746 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2020-10-23 19:20:23,780 INFO streaming.PipeMapRed: PipeMapRed exec [/workdir/boris/projects/big-data-exercises/hadoop-examples/./mapper.py]
2020-10-23 19:20:23,796 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
2020-10-23 19:20:23,796 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
2020-10-23 19:20:23,796 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
2020-10-23 19:20:23,823 INFO streaming.PipeMapRed: R/W/S=1000/0/0 in:NA 

2020-10-23 19:20:24,496 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 12, inMemoryMapOutputs.size() -> 8, commitMemory -> 84, usedMemory ->96
2020-10-23 19:20:24,497 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
2020-10-23 19:20:24,498 INFO mapred.LocalJobRunner: 8 / 8 copied.
2020-10-23 19:20:24,498 INFO reduce.MergeManagerImpl: finalMerge called with 8 in-memory map-outputs and 0 on-disk map-outputs
2020-10-23 19:20:24,505 INFO mapred.Merger: Merging 8 sorted segments
2020-10-23 19:20:24,505 INFO mapred.Merger: Down to the last merge-pass, with 8 segments left of total size: 24 bytes
2020-10-23 19:20:24,506 INFO reduce.MergeManagerImpl: Merged 8 segments, 96 bytes to disk to satisfy reduce memory limit
2020-10-23 19:20:24,507 INFO reduce.MergeManagerImpl: Merging 1 files, 86 bytes from disk
2020-10-23 19:20:24,507 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
2020-10-23 19:20:24,507 INFO mapred.Merger: M

Let's see the results:

In [5]:
!hdfs dfs -ls data/result

Found 2 items
-rw-r--r--   1 boris atg          0 2020-10-23 19:20 data/result/_SUCCESS
-rw-r--r--   1 boris atg          9 2020-10-23 19:20 data/result/part-00000


Mind the file `_SUCCESS`. It appears only when a MapReduce job has finished successfully. The output of the reducer is stored in the file `part-00000`.
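If you want to check for the `_SUCCESS` marker from a script, one possible approach is to parse the listing text. The sketch below assumes the column layout shown in the listing above and paths without spaces. The helper names `listing_names` and `job_succeeded` are hypothetical.

```python
def listing_names(ls_output):
    """Extract the path (last column) of every entry in `hdfs dfs -ls` output."""
    names = []
    for line in ls_output.splitlines():
        parts = line.split()
        # Entry lines start with a permission string such as "-rw-r--r--".
        if len(parts) >= 8 and parts[0][0] in "-d":
            names.append(parts[-1])
    return names


def job_succeeded(ls_output):
    """Return True if the listing contains a file named _SUCCESS."""
    return any(name.rsplit("/", 1)[-1] == "_SUCCESS"
               for name in listing_names(ls_output))


listing = """Found 2 items
-rw-r--r--   1 boris atg          0 2020-10-23 19:20 data/result/_SUCCESS
-rw-r--r--   1 boris atg          9 2020-10-23 19:20 data/result/part-00000
"""
print(listing_names(listing))
print(job_succeeded(listing))  # True
```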

In [6]:
!hdfs dfs -cat data/result/part-00000

1320769	


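This is roughly the number of lines in our input file. It is slightly too high: `mapper.py` increments `counter` before each attempt to read a line, including the final attempt that hits EOF, so each of the 8 mapper instances (one per split) reports one line too many. The exact figure is 1320769 - 8 = 1320761, which agrees with the `Map input records` counter that Hadoop prints for this input in the job log further below.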

# Multiple Parts

Let's run the same MapReduce job but without specifying a reducer (the reducer then acts as the identity and prints exactly what it reads from its input):

In [7]:
!mapred streaming \
    -files code/mapper.py,code/reducer.py \
    -input data/yelp_academic_dataset_tip.json \
    -output data/multi_result \
    -mapper mapper.py

2020-10-23 19:20:32,026 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2020-10-23 19:20:32,106 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2020-10-23 19:20:32,106 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2020-10-23 19:20:32,117 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2020-10-23 19:20:32,355 INFO mapred.FileInputFormat: Total input files to process : 1
2020-10-23 19:20:32,368 INFO mapreduce.JobSubmitter: number of splits:8
2020-10-23 19:20:32,500 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local175791011_0001
2020-10-23 19:20:32,500 INFO mapreduce.JobSubmitter: Executing with tokens: []
2020-10-23 19:20:32,728 INFO mapred.LocalDistributedCacheManager: Localized file:/workdir/boris/projects/big-data-exercises/hadoop-examples/code/mapper.py as file:/tmp/hadoop-boris/mapred/local/job_local175791011_0001_cbd5e7ce-bf55-4b0e-aa56-e97dd6c0600a/mapper.py
2020-

2020-10-23 19:20:33,326 INFO streaming.PipeMapRed: PipeMapRed exec [/workdir/boris/projects/big-data-exercises/hadoop-examples/./mapper.py]
2020-10-23 19:20:33,336 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
2020-10-23 19:20:33,336 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
2020-10-23 19:20:33,336 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
2020-10-23 19:20:33,356 INFO streaming.PipeMapRed: R/W/S=1000/0/0 in:NA [rec/s] out:NA [rec/s]
2020-10-23 19:20:33,369 INFO streaming.PipeMapRed: R/W/S=10000/0/0 in:NA [rec/s] out:NA [rec/s]
2020-10-23 19:20:33,489 INFO streaming.PipeMapRed: R/W/S=100000/0/0 in:NA [rec/s] out:NA [rec/s]
2020-10-23 19:20:33,616 INFO streaming.PipeMapRed: Records R/W=169364/1
2020-10-23 19:20:33,620 INFO streaming.PipeMapRed: MRErrorThread done
2020-10-23 19:20:33,620 INFO streaming.PipeMapRed: mapRedFinished
2020-10-23 19:20:33,621 INFO mapred.LocalJobRunner: 
2020-10-23 19:20:33,621 INFO map

2020-10-23 19:20:34,222 INFO streaming.PipeMapRed: Records R/W=165904/1
2020-10-23 19:20:34,225 INFO streaming.PipeMapRed: MRErrorThread done
2020-10-23 19:20:34,225 INFO streaming.PipeMapRed: mapRedFinished
2020-10-23 19:20:34,226 INFO mapred.LocalJobRunner: 
2020-10-23 19:20:34,226 INFO mapred.MapTask: Starting flush of map output
2020-10-23 19:20:34,226 INFO mapred.MapTask: Spilling map output
2020-10-23 19:20:34,226 INFO mapred.MapTask: bufstart = 0; bufend = 8; bufvoid = 104857600
2020-10-23 19:20:34,226 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600
2020-10-23 19:20:34,227 INFO mapred.MapTask: Finished spill 0
2020-10-23 19:20:34,229 INFO mapred.Task: Task:attempt_local175791011_0001_m_000003_0 is done. And is in the process of committing
2020-10-23 19:20:34,230 INFO mapred.LocalJobRunner: Records R/W=165904/1
2020-10-23 19:20:34,230 INFO mapred.Task: Task 'attempt_local175791011_0001_m_000003_0' done.
2020-10-23 19:20:34,231 I

2020-10-23 19:20:34,887 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2020-10-23 19:20:34,895 INFO streaming.PipeMapRed: PipeMapRed exec [/workdir/boris/projects/big-data-exercises/hadoop-examples/./mapper.py]
2020-10-23 19:20:34,903 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
2020-10-23 19:20:34,904 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
2020-10-23 19:20:34,904 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
2020-10-23 19:20:34,924 INFO streaming.PipeMapRed: R/W/S=1000/0/0 in:NA [rec/s] out:NA [rec/s]
2020-10-23 19:20:34,938 INFO streaming.PipeMapRed: R/W/S=10000/0/0 in:NA [rec/s] out:NA [rec/s]
2020-10-23 19:20:35,060 INFO streaming.PipeMapRed: R/W/S=100000/0/0 in:NA [rec/s] out:NA [rec/s]
2020-10-23 19:20:35,155 INFO streaming.PipeMapRed: Records R/W=169533/1
2020-10-23 19:20:35,157 INFO streaming.PipeMapRed: MRErrorThread done
2020-10-23 19:20:35,158 INFO

2020-10-23 19:20:35,532 INFO mapred.LocalJobRunner: reduce task executor complete.
2020-10-23 19:20:35,834 INFO mapreduce.Job:  map 100% reduce 100%
2020-10-23 19:20:35,834 INFO mapreduce.Job: Job job_local175791011_0001 completed successfully
2020-10-23 19:20:35,849 INFO mapreduce.Job: Counters: 30
	File System Counters
		FILE: Number of bytes read=1479772890
		FILE: Number of bytes written=6816471
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
	Map-Reduce Framework
		Map input records=1320761
		Map output records=8
		Map output bytes=64
		Map output materialized bytes=128
		Input split bytes=1208
		Combine input records=0
		Combine output records=0
		Reduce input groups=8
		Reduce shuffle bytes=128
		Reduce input records=8
		Reduce output records=8
		Spilled Records=16
		Shuffled Maps =8
		Failed Shuffles=0
		Merged Map outputs=8
		GC time elapsed (ms)=37
		Total committed heap usage (bytes)=19327352832
	Shuffle Erro

In [8]:
!hdfs dfs -ls data/multi_result

Found 2 items
-rw-r--r--   1 boris atg          0 2020-10-23 19:20 data/multi_result/_SUCCESS
-rw-r--r--   1 boris atg         64 2020-10-23 19:20 data/multi_result/part-00000


Although there is only one resulting file, it has several lines:

In [9]:
!hdfs dfs -cat data/multi_result/part-00000

144176	
165905	
166856	
167410	
167537	
169365	
169534	
169986	


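That happened because our mapper was run as several independent instances, one per input split (the log reports `number of splits:8`), and each instance printed its own count. With no reducer to add them up, all eight values end up in the output. You may wonder why the job was slow then. Part of the answer is quite simple: Python is an interpreted language, so mappers and reducers written in C++ or Java would likely run faster.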
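As a quick sanity check, the eight partial counts can be added up by hand. The sketch below copies the content of `part-00000` shown above into a string and sums it. It then subtracts one per mapper, on the assumption that each instance over-reports by exactly one because `mapper.py` increments its counter before the read that raises `EOFError`.

```python
# Content of data/multi_result/part-00000, copied from the output above.
part_file = """144176\t
165905\t
166856\t
167410\t
167537\t
169365\t
169534\t
169986\t
"""

# int() ignores surrounding whitespace, so the trailing tab is harmless.
partial_counts = [int(line) for line in part_file.splitlines() if line.strip()]
total = sum(partial_counts)

print(len(partial_counts), total)   # 8 1320769
print(total - len(partial_counts))  # 1320761
```

The first total equals the value produced by `reducer.py` in the previous job, and the corrected value matches the `Map input records=1320761` counter in the log.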

# Do It Yourself
* code a mapper for counting words or characters, not lines
* code a reducer to count lines, words, and characters in one job
* code a mapper to sum over the `compliment_count` field value