Map Reduce

* Objectives:
    * What is "Big Data" and why is it important?
    * How do distributed filesystems and processing work?
    * Describe distributed systems architecture
    * What is Hadoop HDFS?
    * How does MapReduce work?

1) Criteria For "Big Data" - Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization
* **High Volume** - Data so large that it can't be worked with on a single computer
* **High Velocity** - Data input/output too quick for it to be processed by a single computer
* **High Variety** - Data in many different formats, with inconsistent or no structure (e.g. text, log files, video)

2) What is considered "High" Volume or "Big"?

|  Class |        Size       |                  Tools                  |              Storage              |          Examples          |
|:------:|:-----------------:|:---------------------------------------:|:---------------------------------:|:--------------------------:|
| Small  | less than 10GB    | R / Python                              | Fits in a single machine's memory | Thousands of sales figures |
| Medium | 10GB - 1TB        | Python w/ indexed files, large database | Fits on a single machine's disk   | Millions of web pages      |
| Large  | greater than 1 TB | Hadoop, Spark, distributed databases    | Stored across multiple machines   | Billions of web clicks     |

![world_of_data](https://assets.goodstatic.com/s3/magazine/assets/536765/original/open-uri20140703-4674-ghcvv3)

* As more data is generated, and inevitably collected, the size of data you'll likely work with is going to increase
* Tools to store and process these larger data sets are going to be increasingly required

3) Distributed Systems
* Local vs Distributed
    * **Local** - uses resources of one machine. Doesn't communicate with any others
        * (+) Simple, fast when computations are small enough
        * (-) Physical limits to memory, disk space, and CPU power
    * **Distributed** - uses the resources (processing and memory) of multiple machines, which need to be able to communicate with one another
        * (+) Easily linearly scalable, can be designed to be fault-tolerant
        * (-) Slow to communicate over network, need to solve problems with parallelizable techniques
* Scaling
    ![scaling](scaling.png)
    * **Scale Up** - Make the computer bigger: disk, RAM, CPU cores
    * **Scale Out** - Add more computers: take advantage of parallelizable algorithms
* Do I need to use distributed computing?
    * Not necessarily. Machines in a network incur overhead when they communicate with each other, and only a restricted set of problems can readily be solved in a parallelizable way. Things "taking a while" on your local machine is not, on its own, a good enough reason to choose a distributed solution.

4) Distributed Systems Architecture / HDFS
* High Level Architecture
![master_workers](master_workers.png)
* **Hadoop Distributed File System (HDFS)** - distributed storage system
![hdfs](hdfs.png)
    * **Data nodes** - stores the data
    * **Name node** - keeps track of where the data is stored
* Fault-tolerance through Replication - Each file is broken up into blocks, and each block is replicated on 3 (by default) of the data nodes in the cluster
    ![replication](replication.png)
    * The default block size is 64 MB or 128 MB, depending on the Hadoop version
    * How many nodes can be lost before the original file isn't recoverable?
        ![name_node](name_node.png)
        * Since the **name node knows where all the copies of each block are**, it can automatically make new ones if some are lost
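
The block and replication ideas above can be sketched as a toy simulation. This is a minimal illustration, not how HDFS is implemented: the node names, the 1000 MB file, and the purely random replica placement are assumptions made for the example (real HDFS placement also takes racks into account).

```python
import math
import random

BLOCK_SIZE_MB = 128   # one of the two default block sizes mentioned above
REPLICATION = 3       # default replication factor


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the number of blocks needed to store a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)


def place_replicas(num_blocks, data_nodes, replication=REPLICATION, seed=0):
    """Toy name-node table: block id -> set of data nodes holding a replica."""
    rng = random.Random(seed)
    return {block: set(rng.sample(data_nodes, replication))
            for block in range(num_blocks)}


def is_recoverable(block_map, failed_nodes):
    """A file survives if every block still has a replica on a live node."""
    failed = set(failed_nodes)
    return all(replicas - failed for replicas in block_map.values())


# Hypothetical cluster of five data nodes storing one 1000 MB file
data_nodes = ["node1", "node2", "node3", "node4", "node5"]
num_blocks = split_into_blocks(1000)
block_map = place_replicas(num_blocks, data_nodes)

print(num_blocks)                                     # 8
print(is_recoverable(block_map, ["node1", "node2"]))  # True
```

With 3 replicas per block, losing any 2 nodes at the same moment always leaves at least one copy of every block. Losing 3 nodes at once is only fatal if those 3 happen to hold all the replicas of some block.
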

5) Hadoop MapReduce - processes all the distributed data
* Processing in Hadoop:
    * Move the computation to the machines that store each part of the data in HDFS, rather than moving the data to the computation
    * This requires a processing framework that can natively work on data that isn't all stored in the same location
* Advantages of MapReduce:
    * Splits up data
    * Moves data between nodes
    * Manages resources, computational and memory
    * Receives status and monitors health
    * Fault-tolerance
* MapReduce Framework Strategy - divide and conquer
    1. Split a task into smaller subtasks
    2. Solve these independently of one another (in parallel)
    3. Recombine the output of each subtask into a final result
* MapReduce Concepts - Mapping and reducing are concepts that belong to the functional programming paradigm
    * **Map** - applies a function to each of the elements of a data structure
    * **Reduce** - applies a function that aggregates the elements of a data structure
    * This strategy is particularly useful in distributed computing because the elements referenced in the map step don't need to be on the same machine. They can be the partitions of a file that live on different data nodes
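
As a rough illustration of these two concepts, the sketch below uses Python's built-in `map` and `functools.reduce` on a single machine. The `partitions` list is a made-up stand-in for pieces of a file living on different data nodes, and `square` and `add` are arbitrary example functions.

```python
from functools import reduce

# Hypothetical partitions of one file, as if stored on three data nodes
partitions = [
    [1, 2, 3],
    [4, 5],
    [6, 7, 8, 9],
]


def square(x):
    return x * x


def add(a, b):
    return a + b


# Map: apply the function to every element; each partition is handled independently
mapped = [list(map(square, part)) for part in partitions]

# Reduce: aggregate within each partition, then combine the partial results
partials = [reduce(add, part) for part in mapped]
total = reduce(add, partials)

print(mapped)    # [[1, 4, 9], [16, 25], [36, 49, 64, 81]]
print(partials)  # [14, 41, 230]
print(total)     # 285
```

Because no partition needs to see any other partition during the map step, each inner list could be processed on a different machine. Only the small partial results have to be brought together at the end.
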
* MapReduce Diagram
![mr_diagram](mr_diagram.png)
    * **Map Step** - mapping is simply the action of taking in some form of data and filtering/transforming it into another form. A mapping step should operate on a single element of our data and output 0 or more, possibly transformed, versions of that element
        * e.g. Taking lines from a click-through log file and outputting a tuple of user id and page id for each click made on a weekday
        * By default in Hadoop, the "elements" passed one at a time from your input partitions to your mapping function are **lines from a file**
    * **Reduce Step** - reducing is the act of taking a bunch of grouped data and combining it in some way. This grouped data is passed to the reducer as key-value pairs: Hadoop automatically bundles values by key, so each key arrives with an iterable of all the values associated with it
        * e.g. Reducer takes in user ids as keys and an iterable of page ids they've gone to and outputs the page id that a user visits most frequently
        * Frequently, the input to our reducers will be coming from a mapper, though this isn't strictly necessary
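
The click-through example above can be sketched in plain Python. This is a local stand-in for the framework, not Hadoop code: the log format (`date,user_id,page_id`), the sample lines, and the `run_local` helper are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical log format: "YYYY-MM-DD,user_id,page_id"
log_lines = [
    "2024-01-01,alice,home",     # Monday
    "2024-01-02,alice,pricing",  # Tuesday
    "2024-01-03,alice,pricing",  # Wednesday
    "2024-01-06,alice,home",     # Saturday
    "2024-01-07,alice,home",     # Sunday
    "2024-01-04,bob,docs",       # Thursday
]


def mapper(line):
    """Emit (user_id, page_id) for weekday clicks; emit nothing for weekends."""
    day, user_id, page_id = line.split(",")
    if date.fromisoformat(day).weekday() < 5:  # Monday=0 ... Friday=4
        yield user_id, page_id


def reducer(user_id, page_ids):
    """Emit the page this user clicked most often."""
    most_common_page, _ = Counter(page_ids).most_common(1)[0]
    yield user_id, most_common_page


def run_local(lines):
    """Stand-in for the framework: map, group values by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results


print(run_local(log_lines))  # {'alice': 'pricing', 'bob': 'docs'}
```

Note that the mapper emits zero outputs for the weekend lines, which is why `alice` ends up with `pricing` even though `home` appears more often in the raw log.
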
* Efficiency of MapReduce
    * At the end of both the mapping and reducing steps, the output is **written to disk** (mapper output to the local disk of the worker, reducer output to HDFS). This can hurt efficiency if we are **performing many mapping and reducing operations** in sequence, since writing to disk is **time consuming**
    * This means that we'll want to either condense our mapping and reducing operations into as few steps as possible, which may make our algorithms harder to understand, or use a different framework (e.g. Spark)
* MapReduce Word Count Example:
![word_count](word_count.png)
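    ```python
    # wordcounts.py
    from string import punctuation

    from mrjob.job import MRJob


    class MRWordCount(MRJob):
        def mapper(self, _, line):
            # The input key is ignored; each value is one line of the input file
            for token in line.split():
                word = token.strip(punctuation).lower()
                if word:  # skip tokens that were only punctuation
                    yield (word, 1)

        def reducer(self, word, counts):
            # counts is an iterable of every 1 emitted for this word
            yield (word, sum(counts))


    if __name__ == '__main__':
        MRWordCount.run()

    # Run locally (input.txt is a placeholder for your own file):
    # python wordcounts.py input.txt > counts.txt
    ```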

6) Advanced MapReduce
* MapReduce automatically handles most of the details of distributed computing that we would rather not think about:
    * Parallelization and distribution of data and computation
    * Fault-tolerance
    * Resource management
    * Status monitoring
* **Shuffle and Sort** - how data is re-distributed during a MapReduce job to continually take advantage of all the worker nodes
    * **Shuffle Step** - process of deciding how and where data from mappers will go to the reducers
    * **Sort Step** - sorts the data it gets from the mappers by key
    * Tracking these steps:
    ![tracking](tracking.png)
    * **Partitioning** - each mapper takes care of any local sorting it can do to the data it's outputting and reducers take care of merging all the sorted partitions that get sent to them during the shuffle
    ![partitioning](partitioning.png)
    * Scarce Resources:
        * In a distributed computing framework like Hadoop, what is one of the most scarce resources?
            * **Bandwidth** - the time cost of transmitting data around the cluster
        * What if we could decrease the amount of data we have to move in our shuffle and sort, or eliminate it altogether?
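
The shuffle, sort, and partitioning steps described above can be approximated on one machine. This is a simplified sketch, not Hadoop's implementation: the two-reducer setup, the sample mapper outputs, and the use of `zlib.crc32` as the hash function are assumptions for the example. Each "mapper" decides which reducer a key belongs to and sorts locally; each "reducer" merges the sorted runs it receives.

```python
import heapq
import zlib
from itertools import groupby
from operator import itemgetter

NUM_REDUCERS = 2


def partition(key, num_reducers=NUM_REDUCERS):
    """Pick a reducer for a key; crc32 keeps the choice stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def mapper_side(pairs):
    """Split one mapper's output by target reducer and sort each piece locally."""
    buckets = [[] for _ in range(NUM_REDUCERS)]
    for key, value in pairs:
        buckets[partition(key)].append((key, value))
    return [sorted(bucket, key=itemgetter(0)) for bucket in buckets]


def reducer_side(sorted_runs):
    """Merge the pre-sorted runs from every mapper, then group values by key."""
    merged = heapq.merge(*sorted_runs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(merged, key=itemgetter(0))]


# Hypothetical output of two mappers in a word count job
mapper_outputs = [
    [("the", 1), ("cat", 1), ("the", 1)],
    [("dog", 1), ("the", 1), ("cat", 1)],
]
spilled = [mapper_side(pairs) for pairs in mapper_outputs]

for r in range(NUM_REDUCERS):
    runs = [mapper_spill[r] for mapper_spill in spilled]
    print(r, reducer_side(runs))
```

Every `(key, value)` pair has to travel from the mapper that produced it to the reducer that owns its key, which is why the amount of mapper output directly drives the bandwidth cost of the shuffle.
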
* **Counters** - used to keep track of events that occur over a given map or reduce step
    * Mostly used for debugging purposes (e.g. verify that all of the folders in your file system are getting accessed correctly)
    ```python
    from mrjob.job import MRJob


    class MRCountingJob(MRJob):
        def mapper(self, _, value):
            # Bump the counter once per input record, then pass the record through
            self.increment_counter('group', 'counter_name', 1)
            yield _, value
    ```
    * Counters can also be used to avoid using a reducer in some cases. If all you're trying to do is count by some key then a counter can help you avoid the time consuming process of the shuffle and sort when you use a reducer.
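    ```python
    import os

    from mrjob.job import MRJob


    class MRCountingJob(MRJob):
        def mapper(self, _, line):
            # Hadoop Streaming exposes the current input file as an environment
            # variable (newer Hadoop versions may name it 'mapreduce_map_input_file')
            filepath = os.environ['map_input_file']
            filename = filepath.split('/')[-1]

            # One counter per file; nothing is yielded, so no reducer is needed
            for word in line.split():
                self.increment_counter(filename, 'word_counts', 1)
    ```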
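
To see why counting with counters avoids the shuffle and sort, here is a rough single-machine imitation. It does not use mrjob or Hadoop: the `files` dictionary and the `map_task` helper are hypothetical, and `collections.Counter` stands in for the framework's counter bookkeeping.

```python
from collections import Counter

# Hypothetical input: file name -> lines, with one map task per file
files = {
    "a.txt": ["the cat sat", "on the mat"],
    "b.txt": ["hello world"],
}


def map_task(filename, lines):
    """Run one map task; return only its counters, since it emits no key-value pairs."""
    counters = Counter()
    for line in lines:
        for _ in line.split():
            counters[(filename, "word_counts")] += 1
    return counters


# The framework adds up the counters reported by each task; no shuffle is involved
job_counters = Counter()
for filename, lines in files.items():
    job_counters.update(map_task(filename, lines))

print(dict(job_counters))
# {('a.txt', 'word_counts'): 6, ('b.txt', 'word_counts'): 2}
```

Each task reports a handful of numbers instead of one key-value pair per word, so there is nothing to partition, sort, or send to a reducer.
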
* **Combiners** - help decrease the amount of data that needs to be passed from mappers to reducers
    * Combiners frequently look like reducers that aggregate like data on the mapper so that there is "less" of it to copy to the reducers
    * The same information needs to get passed to the reducers no matter what, but combiners generally find a way to achieve this while using fewer bits to represent that information
    * Example: without the combiner, every `(word, 1)` pair would be sent to the reducers; the combiner pre-sums them on each mapper
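    ```python
    from string import punctuation

    from mrjob.job import MRJob


    class MRWordCount(MRJob):
        def mapper(self, _, line):
            for token in line.split():
                word = token.strip(punctuation).lower()
                if word:  # skip tokens that were only punctuation
                    yield (word, 1)

        def combiner(self, word, counts):
            # Runs on the mapper's node: pre-sum counts before the shuffle
            yield (word, sum(counts))

        def reducer(self, word, counts):
            # Sums the partial counts received from every combiner
            yield (word, sum(counts))

    # While the combiner and reducer implement the same function in this example,
    # this is not generally the case when using a combiner
    ```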
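
To make the saving concrete, the following sketch counts how many key-value pairs would cross the network with and without a combiner. It is a plain-Python approximation rather than mrjob code, and the input lines and helper names (`map_words`, `combine`, `reduce_counts`) are invented for the illustration.

```python
from collections import Counter

# Hypothetical lines handled by two separate mappers
mapper_inputs = [
    ["the cat and the dog", "the end"],
    ["the dog and the cat"],
]


def map_words(lines):
    """Mapper: emit (word, 1) for every word."""
    return [(word, 1) for line in lines for word in line.split()]


def combine(pairs):
    """Combiner: sum the counts per word on the mapper before anything is sent."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())


def reduce_counts(pairs):
    """Reducer: sum whatever counts arrive for each word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


without_combiner = [pair for lines in mapper_inputs
                    for pair in map_words(lines)]
with_combiner = [pair for lines in mapper_inputs
                 for pair in combine(map_words(lines))]

print(len(without_combiner), len(with_combiner))                       # 12 9
print(reduce_counts(without_combiner) == reduce_counts(with_combiner))  # True
```

The final counts are identical, but fewer pairs are shuffled. The gap grows with the amount of repetition in each mapper's input.
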

7) Hadoop Ecosystem
* **Apache HBase** - a mutable big data storage system on top of HDFS. HDFS itself is immutable (write-once) and batch oriented
* **Apache Hive** - a high level language for writing MapReduce jobs in a SQL style
* **Apache Pig** - another high level language for writing MapReduce jobs. Kind of like Hive, but with less structure and a less SQL-like syntax