# <center> Introduction to Hadoop MapReduce </center>

## Reality of working with Big Data

- Hundreds or thousands of machines to support big data
    - Distribute data for storage (HDFS)
    - Parallelize data computation (Hadoop MapReduce)
    - Handle failure (HDFS and Hadoop MapReduce)

## MapReduce

**What is “map”?**
A function/procedure that is applied to every individual element of a collection/list/array/…
```
int square(int x) { return x * x; }
map square [1,2,3,4] -> [1,4,9,16]
```
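
Since the rest of this notebook runs in Python, the same idea can be tried directly in a cell. This is a minimal sketch using Python's built-in `map`; the names `square` and `numbers` are only illustrative.

```python
def square(x):
    """Return x multiplied by itself."""
    return x * x

numbers = [1, 2, 3, 4]

# map applies square to each element independently of the others,
# which is what makes this kind of work easy to spread across machines.
squared = list(map(square, numbers))
print(squared)  # [1, 4, 9, 16]
```
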
**What is “reduce”?**
A function/procedure that performs an operation on a list. This operation “folds/reduces” the list into a single value (or a smaller subset).
```
reduce ([1,2,3,4]) using sum -> 10
reduce ([1,2,3,4]) using multiply -> 24
```
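
The two reductions above can be reproduced in Python with `functools.reduce`. This is only a sketch of the concept, not of how Hadoop runs a reducer; the variable names are illustrative.

```python
from functools import reduce
import operator

numbers = [1, 2, 3, 4]

# reduce folds the list from left to right with a two-argument function.
total = reduce(operator.add, numbers)    # ((1 + 2) + 3) + 4
product = reduce(operator.mul, numbers)  # ((1 * 2) * 3) * 4
print(total, product)  # 10 24
```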

## Implementation of MapReduce Programming Paradigm in Hadoop MapReduce

**Programmers implement:**

- Map function: takes in the input data and returns key/value pairs
- Reduce function: receives each key together with the values the mappers emitted for it, and produces the final output by applying a reduction operation to those values
- Optional functions:
    - Partition function: determines how the mappers’ key/value pairs are distributed to the reducers
    - Combine function: performs an initial reduction on the mappers to reduce network traffic
    
**The MapReduce Framework handles everything else**
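
To make the four roles concrete, here is a small single-process sketch in plain Python. It is not Hadoop's API: `map_fn`, `combine_fn`, `partition_fn`, `reduce_fn`, and the toy driver `run_job` are hypothetical names, and the `host bytes` records are made-up sample data. The driver stands in for the work the framework would normally do.

```python
from collections import defaultdict
import zlib

def map_fn(record):
    """Map: parse one 'host bytes' record into a (key, value) pair."""
    host, size = record.split()
    yield host, int(size)

def combine_fn(key, values):
    """Combine: pre-aggregate one mapper's values for a key."""
    return key, sum(values)

def partition_fn(key, num_reducers):
    """Partition: pick the reducer for a key (deterministic hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def reduce_fn(key, values):
    """Reduce: fold all values for a key into the final result."""
    return key, sum(values)

def run_job(splits, num_reducers=2):
    """Toy driver standing in for the MapReduce framework."""
    reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:                       # one "mapper" per split
        local = defaultdict(list)
        for record in split:
            for key, value in map_fn(record):
                local[key].append(value)
        for key, values in local.items():      # combine, then route
            key, partial = combine_fn(key, values)
            target = partition_fn(key, num_reducers)
            reducer_inputs[target][key].append(partial)
    output = {}
    for grouped in reducer_inputs:             # one "reducer" per partition
        for key, values in grouped.items():
            out_key, out_value = reduce_fn(key, values)
            output[out_key] = out_value
    return output

splits = [["a.org 10", "b.org 5", "a.org 7"], ["b.org 1", "c.org 4"]]
result = run_job(splits)
print(sorted(result.items()))  # [('a.org', 17), ('b.org', 6), ('c.org', 4)]
```

Because the combiner already sums `a.org` on the first mapper, only one partial value per key leaves each mapper, which is the traffic saving the combine step is meant to provide.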


## WordCount: The *Hello, World* of Big Data

- Count how many times each unique word appears in a file or a set of files
- Standard parallel programming approach:
    - Count number of files
    - Set number of processes
    - Possibly set up dynamic workload assignment
    - A lot of data transfer
    - Significant coding effort


## MapReduce WordCount Example

<img src="pictures/11/wordcount01.png" width="700"/>

## MapReduce WordCount Example

<img src="pictures/11/wordcount02.png" width="700"/>
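
The WordCount flow can also be sketched in Python as a mapper and a reducer that exchange tab-separated lines, in the style of a streaming job. This is an illustration only: the function names `mapper` and `reducer` are assumptions, the sample sentences are made up rather than taken from the figures, and `sorted` stands in for the framework's shuffle and sort.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line for every word in the input."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same word."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)

text = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

# Sorting groups identical words together, as the shuffle phase would.
shuffled = sorted(mapper(text))
counts = dict(reducer(shuffled))
print(counts)  # {'bear': 2, 'car': 3, 'deer': 2, 'river': 2}
```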

## MapReduce PageRank Example 1

<img src="pictures/11/pagerank01.png" width="700"/>


## MapReduce PageRank Example 2

<img src="pictures/11/pagerank02.png" width="700"/>
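
One way to express a PageRank iteration as a map step and a reduce step is sketched below. It is a simplified illustration, not necessarily the formulation shown in the figures: the function names, the damping factor of 0.85, and the three-page toy graph are all assumptions, and pages without outgoing links are not handled.

```python
from collections import defaultdict

def pagerank_map(page, rank, out_links):
    """Pass the link structure along and share rank across out-links."""
    yield page, ("links", out_links)
    for target in out_links:
        yield target, ("rank", rank / len(out_links))

def pagerank_reduce(page, values, damping, num_pages):
    """Sum the incoming rank contributions for one page."""
    incoming = sum(payload for kind, payload in values if kind == "rank")
    return page, (1 - damping) / num_pages + damping * incoming

def pagerank_iteration(graph, ranks, damping=0.85):
    """Run one map, shuffle, and reduce round over the whole graph."""
    grouped = defaultdict(list)                # stands in for the shuffle
    for page, out_links in graph.items():
        for key, value in pagerank_map(page, ranks[page], out_links):
            grouped[key].append(value)
    return dict(
        pagerank_reduce(page, values, damping, len(graph))
        for page, values in grouped.items()
    )

# Toy graph (assumed): every page has at least one outgoing link.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = {page: 1 / len(graph) for page in graph}
for _ in range(20):
    ranks = pagerank_iteration(graph, ranks)
print({page: round(rank, 3) for page, rank in sorted(ranks.items())})
```

Each iteration is a complete MapReduce job, so an iterative algorithm like this one pays the job start-up and shuffle cost once per round.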


## What is "everything else"?

- Scheduling
- Data distribution
- Synchronization
- Error and Fault Handling

## The cost of "everything else"?

- All algorithms must be expressed as a combination of mapping, reducing, combining, and partitioning functions 
- No control over execution placement of mappers and reducers
- No control over life cycle of individual mappers and reducers
- Very limited information about which mapper handles which data block
- Very limited information about which reducer handles which intermediate key

## Additional challenge

**Large-scale debugging of big data programs is difficult**

- Functional errors are difficult to follow at large scale
- Data-dependent errors are even more difficult to catch and fix

## Applications of MapReduce

- Text tokenization, indexing, and search
    - Web access log stats
    - Inverted index construction
    - Term-vector per host
    - Distributed grep/sort
- Graph creation
    - Web link-graph reversal (Google’s PageRank)
- Data mining and machine learning
    - Document clustering
    - Machine learning
    - Statistical machine translation
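
As an illustration of one application from the list, inverted index construction maps each document to (word, document id) pairs and reduces each word to the list of documents containing it. The sketch below runs in a single Python process; the function names and the three sample documents are made up for the example.

```python
from collections import defaultdict

def index_map(doc_id, text):
    """Emit (word, doc_id) once for each distinct word in a document."""
    for word in set(text.lower().split()):
        yield word, doc_id

def index_reduce(word, doc_ids):
    """Collect the sorted list of documents that contain the word."""
    return word, sorted(doc_ids)

documents = {
    "d1": "hadoop stores data",
    "d2": "hadoop processes data",
    "d3": "mappers process blocks",
}

grouped = defaultdict(list)                    # stands in for the shuffle
for doc_id, text in documents.items():
    for word, doc in index_map(doc_id, text):
        grouped[word].append(doc)

inverted_index = dict(index_reduce(w, ids) for w, ids in grouped.items())
print(inverted_index["hadoop"])   # ['d1', 'd2']
print(inverted_index["blocks"])   # ['d3']
```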

# <center> Working with Hadoop MapReduce on Cypress </center>

In [None]:
1 8 cores 6gb RAM , 1hour

Jupyter notebooks support running Linux commands inside notebook cells. This is done by adding **!** to the beginning of the command line. Note that each command that begins with **!** starts a new bash shell, which is closed once execution is done:
- Full paths are required
- Temporary results and environment variables will be lost
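
The effect of starting a new shell per command can be reproduced with Python's `subprocess` module. This sketch is an analogy rather than a description of Jupyter's internals: `run_in_new_shell` and `GR_DEMO_VAR` are hypothetical names, and `shell=True` uses the system's default `sh` rather than bash.

```python
import subprocess

def run_in_new_shell(command):
    """Run a command in a fresh shell process and return its stdout."""
    completed = subprocess.run(
        command, shell=True, capture_output=True, text=True, check=True
    )
    return completed.stdout.strip()

# The variable is set in one shell, which exits when the command ends.
run_in_new_shell("export GR_DEMO_VAR=hello")

# A second shell starts from scratch, so the variable is gone.
separate = run_in_new_shell('echo "${GR_DEMO_VAR:-unset}"')
print(separate)  # unset

# Chaining the commands keeps them in the same shell.
chained = run_in_new_shell('export GR_DEMO_VAR=hello && echo "$GR_DEMO_VAR"')
print(chained)  # hello
```

The same reasoning explains why a `cd` in one **!** command does not affect the next one, so later commands need full paths.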

In [2]:
!module list

Currently Loaded Modulefiles:
  1) anaconda3/4.2.0   3) zeromq/4.1.5
  2) matlab/2015a      4) hdp/0.1


In [3]:
!hdfs

Usage: hdfs [--config confdir] [--loglevel loglevel] COMMAND
       where COMMAND is one of:
  dfs                  run a filesystem command on the file systems supported in Hadoop.
  classpath            prints the classpath
  namenode -format     format the DFS filesystem
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode
  journalnode          run the DFS journalnode
  zkfc                 run the ZK Failover Controller daemon
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  envvars              display computed Hadoop environment variables
  haadmin              run a DFS HA admin client
  fsck                 run a DFS filesystem checking utility
  balancer             run a cluster balancing utility
  jmxget               get JMX exported values from NameNode or DataNode.
  mover                run a utility to move block replicas across
                       storage types
 

In [4]:
!hdfs dfs

Usage: hadoop fs [generic options]
	[-appendToFile <localsrc> ... <dst>]
	[-cat [-ignoreCrc] <src> ...]
	[-checksum <src> ...]
	[-chgrp [-R] GROUP PATH...]
	[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
	[-chown [-R] [OWNER][:[GROUP]] PATH...]
	[-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]
	[-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-count [-q] [-h] [-v] [-t [<storage type>]] [-u] <path> ...]
	[-cp [-f] [-p | -p[topax]] <src> ... <dst>]
	[-createSnapshot <snapshotDir> [<snapshotName>]]
	[-deleteSnapshot <snapshotDir> <snapshotName>]
	[-df [-h] [<path> ...]]
	[-du [-s] [-h] <path> ...]
	[-expunge]
	[-find <path> ... <expression> ...]
	[-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-getfacl [-R] <path>]
	[-getfattr [-R] {-n name | -d} [-e en] <path>]
	[-getmerge [-nl] <src> <localdst>]
	[-help [cmd ...]]
	[-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [<path> ...]]
	[-mkdir [-p] <path> ...]
	[-moveFromLocal <localsrc> ... <dst>]
	[-moveToLocal <

### Challenge

Create a directory named **intro-to-hadoop** inside your user directory on HDFS

In [6]:
!cypress-kinit

In [None]:
!klist

In [None]:
!hdfs dfs -ls .

In [None]:
!hdfs dfs -mkdir intro-to-hadoop

### Monitoring Hadoop

- Hadoop has a web-based interface
- Cypress supports a comprehensive management framework called Ambari, open-sourced by Hortonworks
- dscim003.palmetto.clemson.edu:8080
    - user/user
- Windows: MobaXterm has embedded X11 support
- Mac: install XQuartz first
    - `ssh -X <username>@palmetto.clemson.edu`
    - `ssh -X <computeNode>`
- Linux:
    - `ssh -X <username>@palmetto.clemson.edu`
    - `ssh -X <computeNode>`