Where are we?
-----

[Let's look at the map](http://insightdataengineering.com/blog/pipeline_map.html)

----

<br>
<br>
<br>

MapReduce
====

![](images/ENtJs.png)

Why MapReduce?
-------------

We have 100 TB of sales data that looks like this:

ID    |Date          |Store  |State |Product   |Amount
--    |----          |-----  |----- |-------   |------
101   |11/13/2014    |100    |WA    |331       |300.00
104   |11/18/2014    |700    |OR    |329       |450.00

What If
-------

What are some of the questions we could answer if we could process this huge data set?

- How much revenue did we make by store, state?

- How much revenue did we make by product?

- How much revenue did we make by week, month, year?


Engineering Problem
-------------------

To answer these questions we have to solve two problems:

- Store 100 TB of data

- Process 100 TB of data


MapReduce 101
-----
MapReduce is a framework, originally developed at Google, for large-scale distributed computing.

Apache Hadoop is the open-source implementation and the de facto standard for Big Data processing.

It scales well to many thousands of nodes and petabytes of data. 

----
HDFS and MapReduce are the peanut butter & jelly of Big Data
------------------

![](http://hadoopilluminated.com/hadoop_illuminated/images/hadoop_coin.png)

- HDFS solves the storage problem.

- MapReduce works with HDFS to break down queries.
    - In the *map* phase the data is processed locally.

    - In the *reduce* phase the results of the map phase are consolidated.


What problem does MapReduce solve?
-----

It's a computing paradigm similar to divide and conquer.

You break a problem up into smaller subproblems, solve them in parallel, and then reduce the partial results down to a final answer.

----
By the end of this session, we will be able to:
----
- Understand the MapReduce algorithm.

- Create MapReduce jobs using MRJob to process large data sets. 

- Optimize MapReduce jobs so that most processing is done locally via Combiners

----
The MapReduce algorithm
---------
![](images/MapReduce_overview.png)

How does MapReduce work?
---------------

- The developer (you) provides mapper and reducer code.

- The mapper function transforms individual records and attaches a key to each record.

- All the records with the same key end up on the same reducer.

- For each key the reduce function combines the records with that key.
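
The steps above can be sketched on a single machine in plain Python. This is only an illustration of the data flow, not how Hadoop is implemented; the names `mapper`, `shuffle`, `reducer`, and `run_job` are made up for this sketch.

```python
from collections import defaultdict

def mapper(line):
    # Attach a key to each record: here every word gets the value 1.
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Combine all the records that share a key.
    return key, sum(values)

def run_job(lines):
    mapped = (pair for line in lines for pair in mapper(line))
    grouped = shuffle(mapped)
    return dict(reducer(k, v) for k, v in sorted(grouped.items()))

counts = run_job(["hello world", "hello again"])
print(counts)  # {'again': 1, 'hello': 2, 'world': 1}
```

In a real cluster the mapper calls run on many machines at once, and the shuffle moves data across the network so that each key lands on one reducer.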

---
Word Count is the "Hello, World" of Big Data
---

![](images/word_count.jpg)

MapReduce in Hadoop
----

<img src="images/map-reduce-key-partition.png">

![](images/MapReduce-Data-Flow-of-Word-Count.png)

Check for understanding
--------

<details><summary>
Q: How many mappers does each job get?
</summary>
1. One mapper per block of data.
<br>
2. Large files get more mappers, small files get fewer.
</details>

<details><summary>
Q: How many reducers does each job get?
</summary>
1. This is configured by the programmer.
<br>
2. By default each job gets one reducer.
</details>

<details><summary>
Q: Suppose I want to find out how many sales transactions are in a
data set for each state. What key should the mapper output?
</summary>
1. The mapper should output *state* as the key, and *1* as the value.
<br>
2. This will ensure that all the records for a specific state end up
   on the same reducer.
<br>
3. The reducer can then add up the *1*s to get the total number of
   transactions.
</details>

---
Functional Programming: Put $ into the jar
----

Map and reduce both come from functional programming.

Map takes an input and returns some output. Maps are then composed and piped into a reduce step for further processing.

Map is a higher-order operation: it takes a function and applies it to each element of the given input.

An easy example of this would be a list:

In [1]:
input = [1, 2, 3, 4, 5]

In [2]:
# Map

def map_add_1(input):
     return input + 1

In [3]:
map(map_add_1, input) #=> [2, 3, 4, 5, 6]

[2, 3, 4, 5, 6]

In [4]:
def add_two_numbers(x, y):
    return x + y

In [5]:
reduce(add_two_numbers, input)

15
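
The cells above rely on Python 2, where `map` returns a list and `reduce` is a builtin. Here is a minimal sketch of the same steps for Python 3. The names `numbers` and `add_one` are renamed for illustration, partly to avoid shadowing the builtin `input`.

```python
from functools import reduce  # reduce is not a builtin in Python 3

numbers = [1, 2, 3, 4, 5]

def add_one(x):
    return x + 1

def add_two_numbers(x, y):
    return x + y

# map() is lazy in Python 3, so wrap it in list() to see the values.
mapped = list(map(add_one, numbers))
total = reduce(add_two_numbers, numbers)
# Pipe the map step straight into the reduce step.
piped = reduce(add_two_numbers, map(add_one, numbers))

print(mapped)  # [2, 3, 4, 5, 6]
print(total)   # 15
print(piped)   # 20
```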

---
Student Activity
----

<details><summary>
Write a map function that counts the length of each word in the following quote: <br>
<br>
words = "The concept of global warming was created by and for the Chinese in order to make U.S. manufacturing non-competitive.".split()

</summary>
__Solution__: <br>
<br>
`map(len, words)` <br>
<br>
`map(lambda word: len(word), words)`

<br>
</details>

<details><summary>
Write a reduce function that turns [1, 2, 3, 4, 5, 6, 7, 8] into 12345678.<br>
</summary>
__Solution__: <br>
<br>
reduce(lambda a,d: 10*a+d, [1,2,3,4,5,6,7,8]) <br>
or <br>
int(reduce(lambda x,y: str(x)+str(y), [1,2,3,4,5,6,7,8]))
</details>

__Optional__

<details><summary>
Write a reduce function to flatten a list. <br>
Turn: [[1, 2, 3], [4, 5], [6, 7, 8]] into [1, 2, 3, 4, 5, 6, 7, 8]
<br>

</summary>
reduce(lambda x,y: x+y, [[1, 2, 3], [4, 5], [6, 7, 8]]) <br>
or <br>
reduce(list.__add__, [[1, 2, 3], [4, 5], [6, 7, 8]])
</details>

----
MapReduce Using MRJob
---

![](http://cdn.meme.am/instances/60004133.jpg)

Sales Data
----------

Here is the sales data we are going to analyze.

In [6]:
%%writefile sales.txt
#ID    Date           Store   State  Product    Amount
101    11/13/2014     100     WA     331        300.00
104    11/18/2014     700     OR     329        450.00
102    11/15/2014     203     CA     321        200.00
106    11/19/2014     202     CA     331        330.00
103    11/17/2014     101     WA     373        750.00
105    11/19/2014     202     CA     321        200.00

Overwriting sales.txt


Transactions By State
---------------------

Q: How many transactions were there for each state?

- Create the `SaleCount.py` file.

In [7]:
%%writefile SaleCount.py
from mrjob.job import MRJob

class SaleCount(MRJob):
    
    def mapper(self, _, line):
        if line.startswith('#'):
            return
        fields = line.split()
        state = fields[3]
        yield (state, 1)
    
    def reducer(self, state, counts): 
        yield state, sum(counts)

if __name__ == '__main__': 
    SaleCount.run()

Overwriting SaleCount.py


- Run it locally.

In [8]:
!python SaleCount.py sales.txt > output.txt

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/SaleCount.brian.20160202.232924.236071

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

writing to /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/SaleCount.brian.20160202.232924.236071/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/SaleCount.brian.20160202.232924.236071/step-0-mapper-sorted
> sort /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/SaleCount.brian.20160202.232924.236071/step-0-mapper_part-00000
writing to /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/SaleCount.brian.20160202.232924.236071/step-0-reducer_part-00000
Coun

----
¿¿¿¿¿¿¿¿¿¿¿Did that even work???????????????

Let's check the output.

In [9]:
!cat output.txt

"CA"	3
"OR"	1
"WA"	2


This is what a successful MapReduce job looks like.

![](http://bltz.info/wp-content/uploads/2015/11/fml.jpg)

Check for understanding
--------

<details><summary>
Q: Suppose instead of counting transactions by state we want to count
transactions by store. What should we change in the code above?
</summary>
1. Replace `state = fields[3]` with `store = fields[2]`
<br>
2. Replace `yield (state, 1)` with `yield (store, 1)`
</details>

<details><summary>
Q: Suppose instead of counting transactions we want to find total
revenue by state. What should we change in the code above?
</summary>
1. Add `amount = float(fields[5])` 
<br>
2. Replace `yield (state, 1)` with `yield (state, amount)`
</details>

Using MapReduce For Statistics
------------------------------

- Using MapReduce we can calculate statistics grouped by any factor.

- Our factor or condition becomes the key.

- The parameter that we want to calculate the statistic on becomes
  the value.

- The reducer contains the logic to apply the statistic.

- The statistic can be sum, count, average, stdev, etc.
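
As a rough single-machine sketch of this pattern, the reducer below receives all the amounts for one state and applies several statistics at once. The function name `reducer_stats` and the in-memory grouping are assumptions for illustration; in a real job the framework does the grouping.

```python
import statistics
from collections import defaultdict

# (state, amount) pairs, as a mapper might emit them from the sales data.
pairs = [("WA", 300.0), ("OR", 450.0), ("CA", 200.0),
         ("CA", 330.0), ("WA", 750.0), ("CA", 200.0)]

def reducer_stats(state, amounts):
    amounts = list(amounts)  # values may arrive as an iterator; keep them to reuse
    return state, {
        "count": len(amounts),
        "sum": sum(amounts),
        "mean": statistics.mean(amounts),
        "stdev": statistics.pstdev(amounts),  # population stdev, defined for one value too
    }

# Stand-in for shuffle and sort: group the values by key.
groups = defaultdict(list)
for state, amount in pairs:
    groups[state].append(amount)

stats = dict(reducer_stats(s, a) for s, a in groups.items())
print(stats["WA"])
```

Note that this reducer holds every value for a key in memory, which is fine for a sketch but can be a problem for very large groups.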



Check for understanding
--------

<details><summary>
Q: What common statistic of central tendency would suck to calculate? Why?
</summary>
Median <br>
<br>
It requires that a single reducer be passed the entire range of numbers to determine which is the 'middle' value. <br>
<br>
[Stackoverflow](http://stackoverflow.com/questions/10109514/computing-median-in-map-reduce)
</details>

Using MRJob for Word Count
--------------------------

Q: Count the frequency of words using MRJob.

- Create an input file.

In [10]:
%%writefile input.txt
hello world
this is the second line
this is the third line
hello again

Overwriting input.txt


- Create the `WordCount.py` file.

In [11]:
%%writefile WordCount.py

from mrjob.job import MRJob

import re

WORD_RE = re.compile(r"[\w']+")

class WordCount(MRJob):

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1
    
    def reducer(self, word, counts): 
        yield word, sum(counts)

if __name__ == '__main__': 
    WordCount.run()

Overwriting WordCount.py


- Run it locally.

In [12]:
!python WordCount.py input.txt > output.txt

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/WordCount.brian.20160202.232924.773338

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

writing to /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/WordCount.brian.20160202.232924.773338/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/WordCount.brian.20160202.232924.773338/step-0-mapper-sorted
> sort /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/WordCount.brian.20160202.232924.773338/step-0-mapper_part-00000
writing to /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/WordCount.brian.20160202.232924.773338/step-0-reducer_part-00000
Coun

In [13]:
# Check the output:
!cat output.txt

"again"	1
"hello"	2
"is"	2
"line"	2
"second"	1
"the"	2
"third"	1
"this"	2
"world"	1


Word Count Notes
----------------

- WordCount is used as a standard distributed application

- For a large number of words it is not solvable on a single machine

- A large corpus can require more storage than the disk on a single
  machine

- A large vocabulary can require more memory than on a single machine.

- WordCount generalizes to other counting applications: such as 
  counting clicks by category.

Customizing MapReduce
---------------------

Q: What are the places in the MapReduce pipeline that can be modified
using Java classes?

<img src="images/map-reduce-key-partition.png">

<img src="images/map-reduce-phases.png">

Class               |Runs On          |Decides 
-----               |-------          |------- 
InputFormat         |Client           |Splits HDFS file to InputSplits
RecordReader        |Mapper           |Reads an InputSplit as `(key1,value1)` records (created by the InputFormat)
Mapper              |Mapper           |Maps `(key1,value1)` to `(key2,value2)`
Partitioner         |Mapper           |Decides which `(key2,value2)` goes to which reducer
SortComparator      |Mapper + Reducer |Determines sort order between all the `key2`
GroupingComparator  |Reducer          |Groups `(key2,value2)` for a single `reduce` call
OutputFormat        |Reducer          |Writes `(key2,value2)` to HDFS file
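
To make the Partitioner row concrete, here is a minimal Python sketch of hash partitioning. The function name `partition` and the use of `zlib.crc32` are assumptions made for illustration; Hadoop's own partitioners are Java classes, and this is not their code.

```python
import zlib

def partition(key, num_reducers):
    # Hash the key and take it modulo the number of reducers.
    # crc32 is used only because it is stable between runs; it is an
    # assumption of this sketch, not what Hadoop uses internally.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

pairs = [("CA", 1), ("WA", 1), ("CA", 1), ("OR", 1), ("WA", 1)]
num_reducers = 2

assignments = {}
for key, value in pairs:
    assignments.setdefault(partition(key, num_reducers), []).append((key, value))

# Every record with the same key lands in the same reducer's list.
for reducer_id, records in sorted(assignments.items()):
    print(reducer_id, records)
```

The important property is that the choice depends only on the key, so all the values for one key meet at one reducer.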

----
Hadoop Streaming
----------------

<img src="images/hadoop-streaming.png">

Q: How does MRJob work under the hood?

- Hadoop's MapReduce framework is written in Java.

- The most direct way to write MapReduce jobs is in Java.

- Hadoop also supports a Streaming API.

- The Hadoop Streaming API lets you use any language to write Mappers
  and Reducers.

- It passes data to and from the Streaming Mappers and Reducers over
  *standard input* and *standard output*.
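
A minimal sketch of what streaming-style word count code might look like in Python. The function names are hypothetical. In a real streaming job each function would live in its own script that loops over `sys.stdin` and prints each result; here they take and return lists of lines so the flow can be tried locally.

```python
from itertools import groupby

def stream_mapper(lines):
    # Read raw text lines and emit "key<TAB>value" lines.
    for line in lines:
        for word in line.split():
            yield "%s\t%d" % (word.lower(), 1)

def stream_reducer(lines):
    # The input is assumed to arrive sorted by key, so equal keys are adjacent.
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))

# Local stand-in for the framework: map, sort by key, then reduce.
mapped = sorted(stream_mapper(["hello world", "hello again"]))
reduced = list(stream_reducer(mapped))
print(reduced)  # ['again\t1', 'hello\t2', 'world\t1']
```

Everything crosses the process boundary as text, which is why keys and values end up as strings.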


Streaming Pros and Cons
-----------------------

<details><summary>
Q: What are the pros and cons of Hadoop Streaming?
</summary>
Pros:
<br>1. You can program in any language. E.g. you can use Perl, Python, or Ruby.
<br>2. You can leverage libraries that you already have.
<br>3. You can shorten development time.
<br>Cons:
<br>1. Hadoop spins up a separate interpreter process on every mapper
and reducer, which uses up CPUs and memory.
<br>2. The performance is not as good as using Java directly.
<br>3. Binary types are not directly supported.
<br>4. You can only write custom mappers and reducers---you cannot
customize partitioners, input formats, and other parts of the
MapReduce pipeline in other languages. For these you have to use
Java or a JVM language.
</details>



----
Advanced MapReduce Applications
----

---
What is a Combiner?
----

A ‘Combiner’ is a mini reducer that performs the local reduce task.

It receives the input from the mapper on a particular node and sends the output to the reducer.

Combiners make MapReduce more efficient by reducing the amount of data that has to be sent to the reducers.

![](http://4.bp.blogspot.com/-JK9zFB5WSl0/UpczHCmCQyI/AAAAAAAADw0/WwVgqWbRvgc/s1600/combiner.png)
In this example the goal is to find the minimum value for each key.

Combiner Tips and Tricks
-----

- The *combiner* (if specified) is the reducer that the mapper uses to reduce the data locally.

- What is the advantage of a combiner? 
> It reduces the disk footprint for the map output. Also it saves network bandwidth.

- A reducer can only be used as a combiner if it is commutative and associative.

![](https://californiaonlinehighschool.files.wordpress.com/2012/09/commutative_associative_properties.png)
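
Here is a small single-machine sketch of why a combiner helps. The two "nodes" and their map output are made up for illustration. The point is that fewer records reach the shuffle while the final totals stay the same.

```python
from collections import Counter

# Map output on two hypothetical nodes, before any combining.
node_outputs = [
    [("CA", 1), ("CA", 1), ("WA", 1)],
    [("CA", 1), ("OR", 1), ("WA", 1)],
]

def combine(pairs):
    # Local reduce: sum the counts per key on a single node.
    local = Counter()
    for key, count in pairs:
        local[key] += count
    return sorted(local.items())

def reduce_all(pairs):
    # Final reduce over everything that was shuffled.
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

without_combiner = [pair for node in node_outputs for pair in node]
with_combiner = [pair for node in node_outputs for pair in combine(node)]

print(len(without_combiner), len(with_combiner))  # 6 5
print(reduce_all(with_combiner) == reduce_all(without_combiner))  # True
```

Summation works here because it is commutative and associative, so partial sums can be merged in any order.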

Transactions By State Using Combiner
------------------------------------

Q: How many transactions were there for each state?

- Create the `SaleCountFast.py` file.

In [14]:
%%writefile SaleCountFast.py

from mrjob.job import MRJob

class SaleCountFast(MRJob):

    def mapper(self, _, line):
        if line.startswith('#'):
            return
        fields = line.split()
        state = fields[3]
        yield (state, 1)
    
    def combiner(self, state, counts): 
        yield state, sum(counts)
    
    def reducer(self, state, counts): 
        yield state, sum(counts)

if __name__ == '__main__': 
    SaleCountFast.run()

Overwriting SaleCountFast.py


- Run it locally.

In [15]:
!python SaleCountFast.py sales.txt > output.txt

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/SaleCountFast.brian.20160202.232925.413893

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

writing to /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/SaleCountFast.brian.20160202.232925.413893/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/SaleCountFast.brian.20160202.232925.413893/step-0-mapper-sorted
> sort /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/SaleCountFast.brian.20160202.232925.413893/step-0-mapper_part-00000
writing to /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/SaleCountFast.brian.20160202.232925.413893/step-0-red

In [16]:
# Check the output.
!cat output.txt

"CA"	3
"OR"	1
"WA"	2


Check for understanding
--------

<details><summary>
Q: Can we use the reduce function as a combiner if we are calculating
the total number of sales transactions per state?
</summary>
Yes.
</details>

<details><summary>
Q: Can we use the reduce function as a combiner if we are calculating
the average transaction revenue per state? </summary>
1. No we cannot.
<br>
2. This is because averaging is not associative: an average of partial averages is generally not the overall average.
</details>
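
One common workaround, sketched here under the assumption that the job may change what it emits, is to pass `(sum, count)` pairs instead of averages. Merging those pairs is associative, so the same function can act as a combiner, and the division happens once at the end. The name `combine_avg` and the split of the data across two nodes are hypothetical.

```python
def combine_avg(pairs):
    # pairs is an iterable of (partial_sum, partial_count) tuples.
    total, count = 0.0, 0
    for s, c in pairs:
        total += s
        count += c
    return total, count

# CA amounts split across two hypothetical mappers.
node_a = [200.0, 330.0]
node_b = [200.0]

partial_a = combine_avg((amount, 1) for amount in node_a)  # (530.0, 2)
partial_b = combine_avg((amount, 1) for amount in node_b)  # (200.0, 1)

# The reducer merges the partials and divides once.
total, count = combine_avg([partial_a, partial_b])
correct = total / count

# Averaging the per-node averages gives a different, wrong answer.
naive = (sum(node_a) / len(node_a) + sum(node_b) / len(node_b)) / 2

print(round(correct, 2), round(naive, 2))  # 243.33 232.5
```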


---
Using Map-Only Job To Clean Data
--------------------------------

Q: Write an ETL application that extracts all the `CA` sales records.

- This only requires transforming records, without consolidating them.

- Any time we don't have to consolidate records we can use a *Map Only* job.

- Create the `SaleExtract.py` file.

In [17]:
%%writefile SaleExtract.py

from mrjob.job  import MRJob
from mrjob.step import MRStep

class SaleExtract(MRJob):

    def mapper_extract(self, _, line):
        if line.startswith('#'): return
        fields = line.split()
        state = fields[3]
        if state != 'CA': return
        yield (state, line)
    
    def steps(self):
        return [
            MRStep(mapper=self.mapper_extract)
        ]

if __name__ == '__main__': 
    SaleExtract.run()

Overwriting SaleExtract.py


- Run it locally.

In [18]:
!python SaleExtract.py sales.txt > output.txt

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/SaleExtract.brian.20160202.232926.067798

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

writing to /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/SaleExtract.brian.20160202.232926.067798/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
Moving /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/SaleExtract.brian.20160202.232926.067798/step-0-mapper_part-00000 -> /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/SaleExtract.brian.20160202.232926.067798/output/part-00000
Streaming final output from /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/SaleExtract.brian.20160202.232926.067798/output
removin

In [19]:
# Check the output:
!cat output.txt

"CA"	"102    11/15/2014     203     CA     321        200.00"
"CA"	"106    11/19/2014     202     CA     331        330.00"
"CA"	"105    11/19/2014     202     CA     321        200.00"


Map-Only Applications
---------------------

Here are some other applications of map-only jobs.

- Web-crawler that finds out how many jobs are on Craigslist for a
  particular keyword.

- Application that maps property addresses to property back-taxes by
  scraping county databases.

Check for understanding
--------

<details><summary>
Q: Do map-only applications shuffle and sort the data?
</summary>
1. No they do not shuffle and sort the data.
<br>
2. Map-only jobs immediately output the data after it is transformed
   by map.
</details>



-----
Counters
--------

Counters are a lightweight way to count events, and they pair well with map-only jobs.

For example, count how many transactions there were in California and Washington.

- One way to solve this problem is to use a MapReduce application we
  did before.

- However, if we have a fixed number of categories we want to count we
  can use counters.

- If we use counters we no longer need a reduce phase, and can use a
  map-only job.
  
- By default, Hadoop MapReduce limits a job to 120 counters.

- So this cannot be used to count frequencies for an unknown number of
  categories.

- Create the `SaleCount1.py` file.

In [20]:
%%writefile SaleCount1.py

from mrjob.job  import MRJob
from mrjob.step import MRStep

class SaleCount1(MRJob):

    def mapper_count(self, _, line):
        if line.startswith('#'): return
        fields = line.split()
        state = fields[3]
        if state == 'CA':
            self.increment_counter('State', 'CA', 1)
        if state == 'WA':
            self.increment_counter('State', 'WA', 1)
    
    def steps(self):
        return [
            MRStep(mapper=self.mapper_count)
        ]

if __name__ == '__main__': 
    SaleCount1.run()

Overwriting SaleCount1.py


- Run it locally.

In [21]:
!python SaleCount1.py sales.txt > output.txt

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/SaleCount1.brian.20160202.232926.664738

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

writing to /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/SaleCount1.brian.20160202.232926.664738/step-0-mapper_part-00000
Counters from step 1:
  State:
    CA: 3
    WA: 2
Moving /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/SaleCount1.brian.20160202.232926.664738/step-0-mapper_part-00000 -> /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/SaleCount1.brian.20160202.232926.664738/output/part-00000
Streaming final output from /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/SaleCount1.brian.20160202.232926.664738/

In [22]:
# There should not be any output. The counter values were printed when the job was executed.
!cat output.txt

Counter Notes
-------------

- Counters can be incremented in both the map and the reduce phase.

- Counter values from all the machines participating in a MapReduce
  job are aggregated to compute job-wide value.

- Counter values are printed out when the job completes and are also
  accessible on the Hadoop Web UI that stores job history.

- Counters have a group name and a counter name.

- Group names help organize counters.

- Here is how we increment a counter:
  `self.increment_counter(group_name, counter_name, 1)`
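
The aggregation described above can be mimicked on one machine with `collections.Counter`. This is only a sketch of the idea: the function `mapper_count` and the two input splits are made up, and a real job would call `self.increment_counter` instead of updating a dictionary.

```python
from collections import Counter

def mapper_count(line, counters):
    # Mirror the map-only job: bump a (group, name) counter and emit nothing.
    if line.startswith('#'):
        return
    state = line.split()[3]
    if state in ('CA', 'WA'):
        counters[('State', state)] += 1

# Two hypothetical mappers, each with its own slice of the input.
split_1 = ["101 11/13/2014 100 WA 331 300.00", "102 11/15/2014 203 CA 321 200.00"]
split_2 = ["106 11/19/2014 202 CA 331 330.00", "104 11/18/2014 700 OR 329 450.00"]

per_mapper = []
for split in (split_1, split_2):
    counters = Counter()
    for line in split:
        mapper_count(line, counters)
    per_mapper.append(counters)

# The framework adds the per-task counters into one job-wide total.
job_counters = sum(per_mapper, Counter())
print(dict(job_counters))
```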

Check for understanding
--------

<details><summary>
Q: SalesStrategy Inc employs 100,000 part-time sales partners to sell
their products. The salespeople get monthly bonuses based on the
number of transactions they ring up. Should SalesStrategy use counters
to calculate these bonuses? Why or why not?
</summary>
1. Instead of counters they should use a regular MapReduce counting
   application.
<br>
2. Counters are only appropriate if the number of categories is fixed
   and is about 100.
<br>
3. While the Hadoop admin can configure the system to support more
   counters than 120, this increases intra-cluster network traffic,
   and is not recommended.
</details>

Map-Only Job Observations
-------------------------

- Map-only jobs are the multi-machine equivalent of the
  multi-threading and multi-processing exercises we did earlier.

- Like our multi-threading and multi-processing applications, map-only
  jobs break up a larger problem into smaller chunks and then work on
  a particular chunk.

- Any time we have a problem where we don't need to reconcile or
  consolidate records we should use map-only jobs.

- Map-only jobs are much faster than regular MapReduce jobs.

Check for understanding
--------

<details><summary>
Q: Why are map-only jobs faster than regular MapReduce jobs?
</summary>
1. The map phase is perfectly parallelizable.
<br>
2. Map-only jobs don't have a shuffle-and-sort or reduce phase, which
   tend to be the bottleneck for regular MapReduce jobs.
</details>

Chaining Jobs Together
----------------------

Q: Find word frequencies and sort the result by frequency. 

- This requires running two MapReduce jobs.

- The first job will calculate word frequencies.

- The second job will sort them.

- This can be accomplished in MRJob by chaining multiple jobs together
  as steps.

- Create `MostUsedWords.py`.

In [23]:
%%writefile MostUsedWords.py
from mrjob.job  import MRJob
from mrjob.step import MRStep
import re

WORD_RE = re.compile(r"[\w']+")

class MostUsedWords(MRJob):

    def mapper_get_words(self, _, line):
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def reducer_count_words(self, word, counts):
        count_sum = '%03d'%sum(counts) 
        yield (count_sum, word)

    def reducer_sort(self, count, words):
        for word in words:
            yield (word, count)

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words,
                   reducer=self.reducer_count_words),
            MRStep(reducer=self.reducer_sort)
        ]

if __name__ == '__main__':
    MostUsedWords.run()

Overwriting MostUsedWords.py


- Run it locally.

In [24]:
!python MostUsedWords.py input.txt > output.txt

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/MostUsedWords.brian.20160202.232927.186574

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

writing to /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/MostUsedWords.brian.20160202.232927.186574/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/MostUsedWords.brian.20160202.232927.186574/step-0-mapper-sorted
> sort /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/MostUsedWords.brian.20160202.232927.186574/step-0-mapper_part-00000
writing to /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/MostUsedWords.brian.20160202.232927.186574/step-0-red

In [25]:
# Check the output:
!cat output.txt

"again"	"001"
"second"	"001"
"third"	"001"
"world"	"001"
"hello"	"002"
"is"	"002"
"line"	"002"
"the"	"002"
"this"	"002"


---
Curse of the Last Reducer
---

These toy examples have had approximately uniform distributions.

__Danger__: The real world is not uniform. Power-law distributions are very common.

A few unlucky Reducers will be overloaded, and the whole job waits on them to finish.

This is the “curse of the last reducer”.

It is very common in NLP and graph processing (for example, popular people on Facebook or Twitter).
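
To make the skew concrete, here is a small sketch with a made-up, heavily skewed workload. The word counts, the number of reducers, and the `crc32`-based partitioning are all assumptions for illustration. Because every record for a key must go to one reducer, the reducer that receives the most popular key does far more than its share.

```python
import zlib
from collections import Counter

# A made-up, skewed workload: one "popular" key dominates.
records = ["the"] * 1000 + ["of"] * 500 + ["and"] * 250 + ["zebra"] * 5 + ["quark"] * 3

num_reducers = 4
load = Counter()
for key in records:
    # Same key, same reducer: skew in the keys becomes skew in the load.
    reducer_id = zlib.crc32(key.encode("utf-8")) % num_reducers
    load[reducer_id] += 1

busiest = max(load.values())
average = len(records) / num_reducers

print(dict(load))
print(busiest / average)  # how many times the average load the busiest reducer carries
```

Adding more reducers does not help the busiest one, since the popular key cannot be split by the partitioner alone.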

---
Check for understanding
---
<details><summary>
If you are running word count on the internet, what is the frequency distribution of words?
</summary>
Zipf's law
<br>
![](https://blogemis.files.wordpress.com/2015/09/graph-zipf.png)
</details>

[Read more here](http://www.slideserve.com/ura/counting-triangles-and-the-curse-of-the-last-reducer)

----
MapReduce Streaming API
-----------------------

- Why are we left-padding the amount with zeros? 

- MRJob is a wrapper around the MapReduce Streaming API.

- The MapReduce Streaming API converts all intermediate types to strings for comparison.

- So `123` will be smaller than `59` because it starts with `1` which
  is less than `5`.
  
- To get around this in MRJob if we want our data to sort numerically
  we have to left-pad the numbers with zeros.
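
A quick sketch of the effect in plain Python, using made-up amounts. Sorting the numbers as strings puts `123` before `59`, while zero-padded strings sort in numeric order.

```python
amounts = [123, 59, 7]

# Compared as strings, character by character.
as_strings = sorted(str(a) for a in amounts)

# Left-padded to a fixed width, so string order matches numeric order.
padded = sorted('%03d' % a for a in amounts)

print(as_strings)  # ['123', '59', '7']
print(padded)      # ['007', '059', '123']
```

The pad width has to be large enough for the biggest value you expect, otherwise the ordering breaks again.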

Check for understanding
--------

<details><summary>
Q: How can we find out which state had the highest total sales revenue?
</summary>
1. We can chain together two jobs.
<br>
2. The first one calculates revenue per state.
<br>
3. The second sorts the result of the first step by revenue.
</details>

Sorting Sales Data
------------------

Q: Find the total sales per state and then sort by sales to find the
state with the highest sales total.

- We can use a multi-step MRJob to do this.

- Sort sales data using two steps.

In [26]:
%%writefile SaleCount.py
from mrjob.job  import MRJob
from mrjob.step import MRStep

class SaleCount(MRJob):
   
    def mapper1(self, _, line):
        if line.startswith('#'):
            return
        fields = line.split()
        amount = float(fields[5])
        state = fields[3]
        yield (state, amount)

    def reducer1(self, state, amounts):
        amount = '%07.2f'%sum(amounts) 
        yield (state, amount)
    
    def mapper2(self, state, amount):
        yield (amount, state)

    def reducer2(self, amount, states):
        for state in states: 
            yield (state, amount)
    
    def steps(self):
        return [
            MRStep(mapper=self.mapper1, reducer=self.reducer1),
            MRStep(mapper=self.mapper2, reducer=self.reducer2)
        ]
    
if __name__ == '__main__': 
    SaleCount.run()

Overwriting SaleCount.py


- Run it locally.

In [27]:
!python SaleCount.py sales.txt > output.txt

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/SaleCount.brian.20160202.232928.301006

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

writing to /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/SaleCount.brian.20160202.232928.301006/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/SaleCount.brian.20160202.232928.301006/step-0-mapper-sorted
> sort /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/SaleCount.brian.20160202.232928.301006/step-0-mapper_part-00000
writing to /var/folders/ld/yffmln1s7z1cr9qmqgt9hkcw0000gn/T/SaleCount.brian.20160202.232928.301006/step-0-reducer_part-00000
Coun

In [28]:
# Check the output:
!cat output.txt

"OR"	"0450.00"
"CA"	"0730.00"
"WA"	"1050.00"


Data Locality
-------------

Q: What is *data locality*?

- Data locality is the secret sauce in HDFS and MapReduce. 

- Data locality means MapReduce runs mappers on the machines that hold the HDFS blocks they read.

- When these machines are busy, mappers may be started on other machines.


Data Locality
-------------

![](https://cvw.cac.cornell.edu/MapReduce/images/hdfs_mr.png)


Which machines run mappers and which run reducers?
-----

- The JobTracker tries to run the mappers on the machines where the
  blocks of input data are located.

- This is called data locality--ideally, the mapper does not need to
  pull data across the network.

- The reducers are assigned randomly to machines which have memory and
  CPUs currently available.
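
The locality preference can be sketched as a simple rule: pick a free machine that already holds the block, and only fall back to another machine if none is free. The function `choose_node`, the node names, and the cluster state below are all hypothetical. A real scheduler weighs more factors, such as rack locality.

```python
def choose_node(block_replicas, free_nodes):
    # Prefer a free node that already stores a replica of the block.
    for node in block_replicas:
        if node in free_nodes:
            return node, True
    # Otherwise fall back to any free node; the block must cross the network.
    if free_nodes:
        return sorted(free_nodes)[0], False
    return None, False

# Hypothetical cluster state: where each block lives and which nodes are idle.
replicas = {"block-1": ["node-a", "node-b"], "block-2": ["node-c"]}
free = {"node-b", "node-d"}

for block, nodes in sorted(replicas.items()):
    node, is_local = choose_node(nodes, free)
    print(block, node, "local" if is_local else "remote")
```

This sketch does not mark a node as busy after assigning it, so it only shows the choice for one mapper at a time.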

----
Building on top of MapReduce: Extensions on a faulty abstraction
----

![](http://cdn.meme.am/instances/61329997.jpg)

Hive
----

- Hive was developed at Facebook.

- It translates SQL to generate MapReduce code.

- Its dialect of SQL is called HiveQL.

- Data scientists can use SQL instead of MapReduce to process data.

Pig
---

- Pig was developed at Yahoo.

- It solves the same problem as Hive.

- Pig uses a custom scripting language called Pig Latin instead of SQL.

- Pig Latin resembles scripting languages like Python and Perl.

- Pig is frequently used for processing unstructured or badly formed
  data.

Machine Learning
---

__DO NOT EVEN TRY!!!__

![](http://www.roflcat.com/images/cats/All_Is_Lost.jpg)

---
Summary
----

- MapReduce is a distributed batch processing framework
- High throughput 😀 and __not__ real-time / interactive / low latency 😦
- All you get is functions on k-v pairs
- 3 stages:
    1. Map: Takes input and returns key-value pairs.
    2. Shuffle: Groups the key-value pairs by key.
    3. Reduce: Aggregates the values for each key into new key-value pairs.
- Plan your jobs at a (relatively) low-level to optimize compute resources
- Avoid graph processing and machine learning in MapReduce (actually avoid MR altogether)

<br>
<br>
<br>

---
Extra
----

Pipes
-----

Q: What options do I have if I want to write my MapReduce code in
C++?

- For C++ there is a special interface that Hadoop provides called
  *Pipes*.

- Pipes is similar to streaming but instead of standard input and
  output it uses sockets for communication.

- Pipes is primarily used to leverage legacy code written in C++ or in
  situations where a computation needs to use C++'s smaller memory
  footprint for speed or scalability.