# <img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">


# Intro to Big Data, Hadoop and MapReduce

---



![image.png](attachment:image.png)

### Learning Objectives

- Recognize big data problems
- Explain how the map reduce algorithm works
- Understand the difference between high performance computing and cloud computing
- Describe the divide and conquer strategy
- Perform a map-reduce on a single node using Python

<h1>Lesson Guide<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Learning-Objectives" data-toc-modified-id="Learning-Objectives-0.1">Learning Objectives</a></span></li></ul></li><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1">Introduction</a></span></li><li><span><a href="#What-is-&quot;big-data&quot;?" data-toc-modified-id="What-is-&quot;big-data&quot;?-2">What is "big data"?</a></span></li><li><span><a href="#One-solution:-High-performance-computing-(HPC)" data-toc-modified-id="One-solution:-High-performance-computing-(HPC)-3">One solution: High performance computing (HPC)</a></span></li><li><span><a href="#Can-you-think-of-advantages-and-disadvantages-of-HPC-configurations?" data-toc-modified-id="Can-you-think-of-advantages-and-disadvantages-of-HPC-configurations?-4">Can you think of advantages and disadvantages of HPC configurations?</a></span></li><li><span><a href="#Computer-Clusters-and-Cloud-Computing" data-toc-modified-id="Computer-Clusters-and-Cloud-Computing-5">Computer Clusters and Cloud Computing</a></span><ul class="toc-item"><li><span><a href="#Have-you-heard-of-..." data-toc-modified-id="Have-you-heard-of-...-5.1">Have you heard of ...</a></span></li></ul></li><li><span><a href="#Parallel-computing-through-divide-and-conquer" data-toc-modified-id="Parallel-computing-through-divide-and-conquer-6">Parallel computing through divide and conquer</a></span></li><li><span><a href="#Back-to-2004,-with-Google-and-its-quest-to-index-(control?)-the-web" data-toc-modified-id="Back-to-2004,-with-Google-and-its-quest-to-index-(control?)-the-web-7">Back to 2004, with Google and its quest to index (control?) the web</a></span></li><li><span><a href="#MapReduce-in-more-detail" data-toc-modified-id="MapReduce-in-more-detail-8">MapReduce in more detail</a></span></li><li><span><a href="#What-are-map-and-reduce?" 
data-toc-modified-id="What-are-map-and-reduce?-9">What are map and reduce?</a></span></li><li><span><a href="#Word-count-example" data-toc-modified-id="Word-count-example-10">Word count example</a></span></li><li><span><a href="#Independent-practice" data-toc-modified-id="Independent-practice-11">Independent practice</a></span></li><li><span><a href="#Conclusions" data-toc-modified-id="Conclusions-12">Conclusions</a></span><ul class="toc-item"><li><span><a href="#Word-Count-Notes" data-toc-modified-id="Word-Count-Notes-12.1">Word Count Notes</a></span></li><li><span><a href="#Map-Reduce-Notes" data-toc-modified-id="Map-Reduce-Notes-12.2">Map Reduce Notes</a></span></li><li><span><a href="#Pros-and-cons-of-distributed-computing" data-toc-modified-id="Pros-and-cons-of-distributed-computing-12.3">Pros and cons of distributed computing</a></span></li></ul></li><li><span><a href="#Additional-resources" data-toc-modified-id="Additional-resources-13">Additional resources</a></span></li></ul></div>

## Introduction
---

This lesson identifies some major trends in the field of "big data" and data infrastructure, including common tools and problems that you may encounter working as a data scientist. 

It is time to take the tools you've learned to a new level by scaling up the size of datasets you can tackle!


<img src="https://snag.gy/mDzP4d.jpg" style="height: 300px">



![image.png](attachment:image.png)

> **Big data is a hot topic. It refers to the techniques and tools that allow us to store, process, and analyze large-scale (multi-terabyte) datasets.**

**Can you think of any datasets that would be "big data"?**

For example:

- Facebook social graph
- Netflix movie preferences
- Large recommender systems
- Activity of visitors to a website
- Customer activity in a retail store 

**What about Structured vs. Unstructured datasets?**

![image.png](attachment:image.png)

**What challenges exist with such large amounts of data? (Give examples)**

- Processing time
- Cost
- Architecture maintenance and setup
- Hard to visualize

## What is "big data"?
---

Big data is a term used for data that exceeds the processing capacity of typical databases. A big data analytics team is needed when the data is enormous and growing quickly, yet we still need to uncover hidden patterns, find unknown correlations, and build models.

**There are three main features in big data (the 3 "V"s):**
- **Volume**: Large amounts of data
- **Variety**: Different types of structured, unstructured, and multi-structured data
- **Velocity**: Needs to be analyzed quickly

**4th V (unofficial big data tenet):**
- **Value**: It is important to tie predictions to business value. Understanding cost versus benefit is even more essential in the context of big data. It is easy to misapply the 3 V's without looking at the bigger picture and connecting them to the value of the business cases involved.

![3v](./assets/images/3vbigdata.png)

## One solution: High performance computing (HPC)
---

Supercomputers are very expensive, very powerful calculators used by researchers to solve complicated math problems.

![supercomputer](./assets/images/supercomputer.png)


## Can you think of advantages and disadvantages of HPC configurations?

**PROS:**
- can perform very complex calculations
- centrally controlled
- useful for research and complicated math problems

**CONS:**
- expensive
- difficult to maintain (self-managed or managed hosting both incur operations overhead)
- scalability is bounded 

## Computer Clusters and Cloud Computing
---

Instead of using one huge machine, what if we bought a bunch of commodity machines?

> *Note: Commodity hardware is a term used in operations to describe mixed-server hardware, but it can also refer to the basic machines you would use in an office.*

![Commodity hardware](https://snag.gy/fNYgt0.jpg)<center>*Actual AWS Datacenter*</center>

**Advantages:**
- Relatively cheaper
- Easier to maintain (as a user of the cloud system)
- Scalability is unbounded (just add more nodes to the cluster)
- A variety of turnkey solutions are available through cloud providers

**Disadvantages:**
- Complex infrastructure 
- Subject matter expertise required to leverage lower-level resources within the infrastructure
- Mainly tailored for parallelizable problems
- Relatively small CPU power at the lowest level
- More input/output between machines

The term big data refers to the cloud computing case in which commodity hardware with unlimited scalability is used to solve highly parallelizable problems.

Having more computers available gives more computing power and lowers costs through the use of standard hardware, but we

- will have to manage data transfer between different computers
- will have to handle hardware failures

### Have you heard of ...

![image.png](attachment:image.png)

- In the early 2000s, Google was on a quest to index the web (and still is)
- The company knew it HAD to disrupt the industry, as the amount of data it had to deal with was growing exponentially
- In 2003, Google published a paper on GFS, the Google File System, which was designed to run on standard hardware

> The problem: if you take the average failure rate of a standard hard drive and multiply it by millions of drives, you end up with a **high number of failures**.
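To make the arithmetic behind this problem concrete, here is a minimal sketch. The failure rate and fleet size are hypothetical figures picked only for illustration, not numbers reported by Google.

```python
# Hypothetical figures, chosen only to illustrate the arithmetic.
ANNUAL_FAILURE_RATE = 0.02   # assumed: 2% of drives fail in a given year
N_DRIVES = 1_000_000         # assumed fleet size


def expected_failures_per_day(n_drives, annual_failure_rate):
    """Expected drive failures per day, assuming failures are spread evenly over a year."""
    return n_drives * annual_failure_rate / 365


per_day = expected_failures_per_day(N_DRIVES, ANNUAL_FAILURE_RATE)
print(f"Expected drive failures per day: {per_day:.1f}")  # prints 54.8
```

Even with a small per-drive failure rate, a large enough fleet sees failures every day, so the storage layer has to treat them as routine.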

> The solution: 
- The data is replicated 3 times across the network to ensure availability and redundancy
- Data is broken into chunks and spread through many nodes

> This strategy implies
- No central storage to overwhelm 
- The ability to scale indefinitely (increase of nodes => increase in computing power => increase of storage capacity)
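The chunk-and-replicate idea can be sketched in a few lines of Python. This is a toy model under stated assumptions: the function names, the node names, the tiny chunk size, and the round-robin placement are all made up for illustration and are not how GFS actually places replicas.

```python
def split_into_chunks(data, chunk_size):
    """Break a byte string into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_replicas(n_chunks, nodes, replication=3):
    """Assign each chunk to `replication` distinct nodes, round-robin (an assumed policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        chunk_id: [nodes[(chunk_id + r) % len(nodes)] for r in range(replication)]
        for chunk_id in range(n_chunks)
    }


def all_chunks_available(placement, failed_nodes):
    """True if every chunk still has at least one replica on a healthy node."""
    return all(
        any(node not in failed_nodes for node in replicas)
        for replicas in placement.values()
    )


data = b"the quick brown fox jumps over the lazy dog"
chunks = split_into_chunks(data, chunk_size=8)
nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]  # hypothetical cluster
placement = place_replicas(len(chunks), nodes)

# With 3 replicas per chunk, any two simultaneous node failures are survivable.
print(all_chunks_available(placement, {"node-a", "node-b"}))  # True
```

Losing all three nodes that hold the same chunk would still lose data, which is why replica placement matters in a real system.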


**OK the storage problem is solved.**

## Parallel computing through divide and conquer
---

<img src="https://snag.gy/xh2mJA.jpg">

The "Divide and Conquer" strategy is a fundamental algorithmic technique for solving a task. The steps are:

- Split the task into subtasks
- Solve these subtasks independently
- Recombine the subtask results into a final result

For a problem to be suitable for the divide and conquer approach it **must be able to be broken into smaller independent subtasks**. 

Many processes are suitable for this strategy, but there are plenty that do not meet this criterion.
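As a small illustration of the three steps, the sketch below computes a sum of squares by splitting the input, solving each part independently, and recombining. The helper names are hypothetical, and the thread pool only illustrates the structure: in standard CPython, threads do not speed up CPU-bound work like this.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, n_parts):
    """Step 1: split the task into roughly equal subtasks."""
    size = max(1, -(-len(items) // n_parts))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def solve(subtask):
    """Step 2: solve one subtask without looking at any other subtask."""
    return sum(x * x for x in subtask)


def combine(partial_results):
    """Step 3: recombine the subtask results into the final result."""
    return sum(partial_results)


numbers = list(range(1, 101))
subtasks = split(numbers, n_parts=4)

with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(solve, subtasks))

total = combine(partial_results)
print(total)  # 338350
```

This works because each partial sum is independent of the others. A running total where each step needs the previous step's output, for example, cannot be split this way without extra work.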

## Back to 2004, with Google and its quest to index (control?) the web

To process the data efficiently, Google's engineers used the MapReduce technique (apparently the first use of it in a production environment).

It led to a paper they published in 2004.

Those two papers (GFS + MR) motivated two engineers, Doug Cutting and Mike Cafarella, to build Hadoop, which Yahoo later adopted and developed heavily.

## MapReduce in more detail

---

![image.png](attachment:image.png)

The term **Map Reduce** indicates a two-phase divide and conquer algorithm initially invented and publicized by Google in 2004. It involves splitting a problem into subtasks and processing these subtasks in parallel. It consists of two phases:

1. The **mapper** phase
2. The **reducer** phase

In the **mapper phase**, data is split into chunks and the same computation is performed on each chunk, while in the **reducer phase**, data is aggregated back to produce a final result.

Map-reduce uses a functional programming paradigm.  The data processing primitives are mappers and reducers.

- **Mappers** – filter and transform data
- **Reducers** – aggregate results

The functional paradigm is good for describing how to solve a problem, but not very good at describing data manipulations (e.g., relational joins).
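To illustrate the two roles, here is a minimal single-machine sketch in which a mapper filters and transforms raw records chunk by chunk and a reducer aggregates the results. The record format (`city,temperature`) and the sample data are made up for this example.

```python
from functools import reduce


def mapper(chunk):
    """Filter out malformed records and transform the rest into (city, temperature) pairs."""
    pairs = []
    for record in chunk:
        parts = record.split(",")
        if len(parts) != 2:
            continue  # filter: wrong number of fields
        city, raw_temp = parts
        try:
            pairs.append((city.strip(), float(raw_temp)))
        except ValueError:
            continue  # filter: temperature is not a number
    return pairs


def reducer(acc, pair):
    """Aggregate: keep the highest temperature seen so far for each city."""
    city, temp = pair
    return {**acc, city: max(temp, acc.get(city, temp))}


# Hypothetical input, already split into two chunks.
chunks = [
    ["London,12.5", "Paris,15.0", "bad record"],
    ["London,17.0", "Paris,n/a", "Paris,14.0"],
]

mapped = [pair for chunk in chunks for pair in mapper(chunk)]
hottest = reduce(reducer, mapped, {})
print(hottest)  # {'London': 17.0, 'Paris': 15.0}
```

The mapper never needs to see another chunk, so on a cluster each chunk could be mapped on a different machine.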

## What are map and reduce?

In [2]:
a = list(map(lambda x: 2*x, [1, 2, 3]))
a

[2, 4, 6]

In [3]:
from functools import reduce

reduce(lambda x, y: x+y, a)

12

In [4]:
b = list(map(lambda x: (x, 1), ['a', 'b', 'a']))
b

[('a', 1), ('b', 1), ('a', 1)]

## Word count example

![](assets/images/word_count.png)

Reduce by key implementation from https://gist.github.com/Juanlu001/562d1ec55be970403442

In [5]:
from itertools import groupby
[(x[0], list(x[1]))
 for x in groupby(sorted(b, key=lambda y: y[0]), lambda y: y[0])]

[('a', [('a', 1), ('a', 1)]), ('b', [('b', 1)])]
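The `sorted(...)` call in the cell above matters: `itertools.groupby` only groups consecutive items with the same key. A short sketch of the difference, using the same pairs as `b`:

```python
from itertools import groupby

pairs = [('a', 1), ('b', 1), ('a', 1)]


def first(pair):
    return pair[0]


# Without sorting, the two 'a' pairs are not adjacent and end up in separate groups.
unsorted_groups = [(key, len(list(group))) for key, group in groupby(pairs, first)]
print(unsorted_groups)  # [('a', 1), ('b', 1), ('a', 1)]

# After sorting by key, each key forms exactly one group.
sorted_groups = [(key, len(list(group)))
                 for key, group in groupby(sorted(pairs, key=first), first)]
print(sorted_groups)  # [('a', 2), ('b', 1)]
```

The `sort -k 1` step between the mapper and the reducer in the shell pipeline later in this lesson plays the same role.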

In [6]:
def reduceByKey(func, iterable):
    """Reduce by key.
    Equivalent to the Spark counterpart
    Inspired by http://stackoverflow.com/q/33648581/554319
    1. Sort by key
    2. Group by key yielding (key, grouper)
    3. For each pair yield (key, reduce(func, last element of each grouper))
    """
    def get_first(x): return x[0]
    def get_second(x): return x[1]

    return map(
        lambda x: (x[0], reduce(func, map(get_second, x[1]))),
        groupby(sorted(iterable, key=get_first), get_first)
    )


list(reduceByKey(lambda x, y: x+y, b))

[('a', 2), ('b', 1)]
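For comparison, a reduce-by-key can also be written with a plain dictionary instead of sort plus `groupby`. This is an alternative technique, not the implementation from the gist above, and the function name is made up for this sketch. It avoids the sort but keeps one entry per key in memory.

```python
def reduce_by_key_dict(func, pairs):
    """Reduce (key, value) pairs by key, using a dict as the accumulator."""
    results = {}
    for key, value in pairs:
        if key in results:
            results[key] = func(results[key], value)
        else:
            results[key] = value  # first value seen for this key
    return list(results.items())


pairs = [('a', 1), ('b', 1), ('a', 1)]
counts = reduce_by_key_dict(lambda x, y: x + y, pairs)
print(counts)  # [('a', 2), ('b', 1)]
```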

Let's see how we could do the word count with the scripts provided.

First let's create some text:

In [7]:
%%writefile input_.txt
hello world
this is the second line
this is the third line
hello again

Writing input_.txt


In [8]:
!cat input_.txt | python code/mapper.py | sort -k 1 | python code/reducer.py | sort -rnk 2 -k 1b > result.txt

In [9]:
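# sort -rnk 2: sort by the second field (the count), numerically (-n) and in reverse (-r)
# -k 1b: break ties by the first field (the word), ignoring leading blanks (b)

# Run `man sort` in a terminal (for example one opened from Jupyter) to look up what the `sort` options do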

In [10]:
!cat result.txt | head -n10

hello	2
is	2
line	2
the	2
this	2
again	1
second	1
third	1
world	1
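The contents of `code/mapper.py` and `code/reducer.py` are not reproduced in this lesson. The sketch below shows what such scripts could look like, written as two functions so it runs in a single cell. It assumes the mapper splits on whitespace and emits tab-separated `word<TAB>1` lines, which is consistent with the output above but is an assumption about the actual scripts.

```python
def mapper(lines):
    """Emit one 'word<TAB>1' line for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share a word (input must be sorted by word)."""
    current_word, current_count = None, 0
    for line in sorted_lines:
        word, count = line.rstrip("\n").rsplit("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                yield f"{current_word}\t{current_count}"
            current_word, current_count = word, int(count)
    if current_word is not None:
        yield f"{current_word}\t{current_count}"  # flush the last word


text = [
    "hello world",
    "this is the second line",
    "this is the third line",
    "hello again",
]

mapped = sorted(mapper(text))   # plays the role of `sort -k 1`
reduced = list(reducer(mapped))

# Order by descending count, then by word, like the final `sort` in the pipeline.
ranked = sorted(reduced, key=lambda l: (-int(l.split("\t")[1]), l.split("\t")[0]))
print("\n".join(ranked))
```

As standalone scripts, each function would read from `sys.stdin` and print its results, so that the shell pipe can connect them.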


## Independent practice
---

Now that you have a basic word counter set up in Python, try some of the following:

1. Process a much larger text file
    - Start with one of the files in the `resource-datasets/project_gutenberg` folder
    - Try the same for all the files in that folder together
    - Find a text of your choice on the internet, e.g. a page from Wikipedia or a blog article
2. Try to see how the execution time scales with file size.
3. Read [this article](http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html) for some very powerful shell tricks. Learning to use the shell will save you tons of time munging data on your file system.
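One possible way to explore how execution time scales with input size is sketched below. It is a simplification: it times an in-memory `collections.Counter` word count on synthetic text rather than the shell pipeline, and the input sizes and the 1000-word vocabulary are arbitrary choices.

```python
import time
from collections import Counter


def word_count(text):
    """Count whitespace-separated words in memory."""
    return Counter(text.split())


def time_word_count(n_words):
    """Build a synthetic text of n_words words and time a word count on it."""
    text = " ".join(f"word{i % 1000}" for i in range(n_words))
    start = time.perf_counter()
    counts = word_count(text)
    elapsed = time.perf_counter() - start
    return elapsed, counts


for n_words in (10_000, 100_000, 1_000_000):
    elapsed, counts = time_word_count(n_words)
    print(f"{n_words:>9} words: {elapsed:.4f} s, {len(counts)} distinct words")
```

To time the actual pipeline on the Project Gutenberg files, the same idea applies: run it on inputs of increasing size and record the wall-clock time for each.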

In [None]:
!cat input_.txt | python code/mapper.py | sort -k 1 | python code/reducer.py | sort -rnk 2 -k 1b > result.txt

In [14]:
!cat pg844.txt | python code/mapper.py | sort -k 1 | python code/reducer.py | sort -rnk 2 -k 1b > result.txt

In [16]:
!cat result.txt | head -n20

I	786
the	692
to	621
of	484
a	437
you	414
is	408
in	324
and	312
that	263
Jack.	224
Algernon.	211
have	198
be	167
Cecily.	166
at	158
for	154
are	151
it	150
not	145


In [17]:
!cat 1184-0.txt | python code/mapper.py | sort -k 1 | python code/reducer.py | sort -rnk 2 -k 1b > result.txt
#

In [18]:
!cat result.txt | head -n20

the	26233
of	12735
to	12632
and	11175
a	8948
I	6523
in	6155
his	5748
you	5739
he	5434
that	4785
was	4587
with	3854
is	3743
had	3498
not	3460
for	3110
said	3099
have	3096
as	2923
cat: stdout: Broken pipe


In [19]:
!cat pg1342.txt | python code/mapper.py | sort -k 1 | python code/reducer.py | sort -rnk 2 -k 1b > result.txt

In [20]:
!cat result.txt | head -n20

the	4205
to	4121
of	3662
and	3309
a	1945
her	1858
in	1813
was	1795
I	1740
that	1419
not	1356
she	1306
be	1209
his	1167
had	1126
as	1119
with	1040
he	1038
for	1003
you	987
cat: stdout: Broken pipe


In [23]:
pwd

'/Users/paxton615/GA/DSI9-lessons/week11/day2_big_data_and_spark_intro/big-data-intro-lesson'

In [26]:
!sort /Users/paxton615/GA/DSI9-lessons/week11/day2_big_data_and_spark_intro/big-data-intro-lesson/pg1342.txt

Mrs. Bennet still continued to wonder and repine at his returning no
Mrs. Bennet treasured up the hint, and trusted that she might soon have
Mrs. Bennet was beyond the reach of reason, and she continued to rail
Mrs. Bennet was in fact too much overpowered to say a great deal while
Mrs. Bennet was perfectly satisfied, and quitted the house under the
Mrs. Bennet was prevented replying by the entrance of the footman with
Mrs. Bennet was profuse in her acknowledgments.
Mrs. Bennet was really in a most pitiable state. The very mention of
Mrs. Bennet's eyes sparkled. "A gentleman and a stranger! It is Mr.
Mrs. Bennet's schemes for this day were ineffectual. Bingley was every
Mrs. Bennet, all amazement, though flattered by having a guest of such
Mrs. Bennet, in short, was in very great spirits; she had seen enough of
Mrs. Bennet, to whose apartment they all repaired, after a few minutes'
Mrs. Bennet, with great civility, begged her ladyship to take some
Mrs. Collin

name had never been voluntarily mentioned before them by her niece; and
name of your admirer. This letter is from Mr. Collins."
name some other period for the commencement of actual felicity--to have
name to her mother on her ladyship's entrance, though no request of
name was scarcely ever mentioned between them.
names to be mentioned in your hearing.' That is his notion of Christian
natural as abhorrence against relationship with Wickham. Brother-in-law
natural consequence of the prejudices I had been encouraging. There
natural modesty, with a stronger dependence on my judgement than on his
natural; and all surprise was shortly lost in other feelings. She was
naturally lively enough. And we all know that Wickham has every charm of
naturally looks for happiness in the marriage state. If therefore she
naturally returned to all his former indolence. His letter was soon
nature inoffensive, friendly, and obliging, his presentation at St.
nature is particularly p

In [34]:
%%writefile mix.txt
abc
apple


## Conclusions

### Word Count Notes

- Word count is a standard example of a distributed application

- For a large enough corpus, it may not be solvable on a single machine

- A large corpus can require more storage than the disk on a single machine

- A large vocabulary can require more memory than is available on a single machine

- Word count generalizes to other counting applications, such as counting clicks by category
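As a quick illustration of that last point, the same map-then-reduce-by-key pattern counts clicks per category. The click records and field names below are made up for this sketch.

```python
from collections import Counter

# Hypothetical click log: one record per click.
clicks = [
    {"user": "u1", "category": "books"},
    {"user": "u2", "category": "music"},
    {"user": "u1", "category": "books"},
    {"user": "u3", "category": "games"},
    {"user": "u2", "category": "books"},
]


def map_click(click):
    """Map step: emit (category, 1), just like (word, 1) in word count."""
    return (click["category"], 1)


# Reduce step: sum the ones per category.
clicks_per_category = Counter()
for category, one in map(map_click, clicks):
    clicks_per_category[category] += one

print(clicks_per_category.most_common())  # [('books', 3), ('music', 1), ('games', 1)]
```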

### Map Reduce Notes

The map-reduce paradigm is at the heart of Hadoop. The framework takes care of the following for you:

- Parallelization and distribution (input splits)
- Partitioning (shuffle and sort)
- Fault tolerance
- Scheduling and resource management
- Status and monitoring
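The partitioning (shuffle and sort) step can be sketched on a single machine as follows. The idea is that every pair with the same key is routed to the same reducer, typically by hashing the key. The function names are hypothetical, and `zlib.crc32` is used here only because it gives the same value on every run, unlike Python's built-in `hash()` for strings.

```python
import zlib
from collections import defaultdict


def partition(key, n_reducers):
    """Pick a reducer for a key with a stable hash (crc32 is an assumption of this sketch)."""
    return zlib.crc32(key.encode("utf-8")) % n_reducers


def shuffle(pairs, n_reducers):
    """Route each (key, value) pair to its reducer's bucket, then sort every bucket by key."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[partition(key, n_reducers)].append((key, value))
    return {reducer_id: sorted(bucket) for reducer_id, bucket in buckets.items()}


text = "hello world this is the second line this is the third line hello again"
pairs = [(word, 1) for word in text.split()]
buckets = shuffle(pairs, n_reducers=3)

# Each reducer can now sum its own keys without talking to the other reducers.
totals = {}
for bucket in buckets.values():
    for key, value in bucket:
        totals[key] = totals.get(key, 0) + value

print(totals["hello"], totals["world"])  # 2 1
```

Because a key never appears in two buckets, the per-reducer results can simply be concatenated to form the final output.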

### Pros and cons of distributed computing

**PROS:**
- Relatively cheaper
- Easier to maintain (as a user of the cloud system)
- Scalability is unbounded (just add more nodes to the cluster)
- Variety of turn-key solutions available through cloud providers

**CONS:**
- Complex infrastructure 
- Subject matter expertise required to leverage lower-level resources within infrastructure
- Mainly tailored for parallelizable problems
- Relatively small CPU power at the lowest level
- More I/O between machines

The term Big Data refers to the cloud computing case, where commodity hardware with unlimited scalability is used to solve highly parallelizable problems.

## Additional resources

---

- [Google File System paper](https://research.google.com/archive/gfs.html)
- [Google Map Reduce paper](https://research.google.com/archive/mapreduce.html)