Apache Spark Framework

* Objectives:
    * Describe the pros/cons of Spark compared to Hadoop MapReduce
    * Define what an RDD is, by its properties
    * Explain the difference between transformations and actions on an RDD
    * Implement the different transformations through use cases
    * Explain what persisting/caching an RDD means, and situations where this is useful
    * Know how many partitions we should have for a Spark RDD
    * Know how to join two Spark RDDs
    * Define what accumulators and broadcast variables are, and use cases for each
    * Describe what type of input format MLlib machine learning algorithms typically expect

1) Why use Spark?
* (+) Distributed parallel computing (Open-source cluster computing framework)
    * Processes massive data sets
    * Highly efficient distributed operations
    * More use cases than just MapReduce ("Successor" to Hadoop MapReduce)
    * Scala/Java/Python/R/SQL supported natively
        * Python API is very thorough, but has several drawbacks
            * (-) Python is dynamically typed, so data types have to be worked out at runtime as objects are serialized and passed to the JVM
            * (-) RDD code written in PySpark generally runs slower than the equivalent Spark code written in Java/Scala (the gap largely disappears if you switch to the DataFrame API)
* (+) Apache Hadoop integration
    * Relatively easy integration into existing ecosystem (HDFS)
    * Scalability, reliability, and resilience
* (+) Machine learning libraries natively available
* Spark vs. MapReduce
    * Storage Compatibility - Spark can read from many storage systems (local files, HDFS, S3, HBase, etc.), whereas Hadoop MapReduce is tied to the Hadoop ecosystem and is most commonly run on HDFS
    * Speed - Spark is advertised as running up to 100x faster than MapReduce in memory and up to 10x faster on disk. MapReduce writes data to disk after each map step and after each reduce step; this I/O is very costly in terms of performance, especially for iterative algorithms
        * Spark uses a lot of memory and tries to keep everything in memory when possible

2) Spark Ecosystem
![spark_ecosystem](spark_ecosystem.png)
![spark_functionality](spark_functionality.png)

3) Resilient Distributed Datasets (RDD)
![rdd](rdd.png)
* **Many data sources** - create from HDFS, S3, HBase, JSON, text, local
    * Create an RDD (through the SparkContext, `sc`) in two ways:
        * Parallelize an existing collection of objects (e.g. local)
        ```python
        rdd = sc.parallelize([1,3,4,5,6])
        ```
        * Read in an external data set (e.g. text, HDFS, S3, JSON)
        ```python
        rdd = sc.textFile('path/to/file')
        ```
* **Parallel operation (Partitioned)** - distributed across the cluster as **partitions** (atomic chunks of data)
* **Fault-tolerant** - can recover from errors (node failures, slow processes)
    * the lineage of each partition is tracked in a directed acyclic graph (DAG), so a lost partition can be recomputed by re-running its transformations
* **Immutable** - cannot modify an RDD in place
* **Lazily Evaluated** - doesn't execute any tasks until an action is called
* **Cacheable (or Persistable)** - can keep selected data in memory and/or on disk to allow for repeated and faster execution
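
The immutability and lazy-evaluation properties above can be illustrated without a cluster. The following is a minimal pure-Python sketch, not Spark code: `LazyDataset` is a hypothetical class that only mimics how `map` and `filter` record steps while `collect` triggers the work.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical class, not part of Spark)."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)    # source data is never modified
        self._lineage = tuple(lineage)  # recorded transformations, in order

    def map(self, fn):
        # Transformation: record the step and return a NEW dataset.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        # Transformation: nothing is computed here either.
        return LazyDataset(self._source, self._lineage + (("filter", pred),))

    def collect(self):
        # Action: replay the recorded lineage over the source data.
        items = list(self._source)
        for kind, fn in self._lineage:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items


calls = []  # records every element that square() actually processes


def square(x):
    calls.append(x)
    return x * x


base = LazyDataset([1, 3, 4, 5, 6])
big_squares = base.map(square).filter(lambda x: x > 10)

calls_before_action = len(calls)  # only the lineage has been built so far
result = big_squares.collect()    # the action triggers the computation
calls_after_action = len(calls)   # one call per source element
```

Because the lineage is kept alongside the untouched source data, calling `collect()` again simply recomputes the result, which is the same idea Spark relies on to rebuild a lost partition.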

4) Functional Programming Paradigm
![functional_prog](https://image.slidesharecdn.com/apachesparkmaster-150830121025-lva1-app6892/95/apache-spark-core-20-638.jpg?cb=1498977510)
* Since RDDs are immutable, the **only** way to change data is to apply a **transformation** that produces a new RDD from an existing one
* Spark provides many **transformation functions**
* Chaining multiple RDD transformations results in a **directed acyclic graph (DAG)**
* Execution of DAG tasks is **passed from the client (driver program) to the master (cluster manager/master node)**, which then distributes them to the workers (**worker nodes**), which apply them across their partitions of the RDD
    * **SparkContext** - lives in the driver program and acts as a gateway between the client and the Spark master
        * sends code/data to the master, which then sends it to the workers
![cluster_manager](cluster_manager.png)

5) **Directed Acyclic Graph (DAG)** - the sequence of transformations kept by the driver for execution
![dag](dag.png)
* Construct sequence of transformations
* Spark functional programming interface builds up a DAG
* The driver sends this DAG to the cluster manager for execution

6) RDD Operations
* Types of RDD Operations:
![rdd_operations](https://summerofhpc.prace-ri.eu/wp-content/uploads/2015/12/project_1605_01.png)
    * **Transformations** - lazy functions that return a new RDD; they do not execute immediately but are queued up in the DAG
        * e.g. filter, map, flatMap, join, reduceByKey, groupByKey, sortBy
    * **Actions** - functions that trigger execution of the DAG (all transformations prior to that action will be run) and return a result to the driver or write it to storage
        * e.g. first, take, collect, count, reduce, countByKey, saveAsTextFile
* **Pair RDDs** - RDDs whose elements are (key, value) tuples
    * offer better control over partitioning
    * expose key-based functionality (e.g. reduceByKey, groupByKey, join)
* **Persisting/Caching** - explicitly keeps an RDD in memory and/or on disk after it is first computed
    * Use if you have an RDD that is or will be reused many times across different operations
    * `persist` lets you choose the storage level (memory/disk preferences); `cache` uses the default level
        * **MEMORY_ONLY** - Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
        * **MEMORY_AND_DISK** - Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
        * **MEMORY_ONLY_SER** - Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
        * **MEMORY_AND_DISK_SER** - Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
        * **DISK_ONLY** - Store the RDD partitions only on disk.
        * **MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.** - Same as the levels above, but replicate each partition on two cluster nodes.
        * **OFF_HEAP (experimental)** - As described in the Spark 1.x documentation: store RDD in serialized format in Tachyon (since renamed Alluxio). Compared to MEMORY_ONLY_SER, OFF_HEAP reduces garbage collection overhead and allows executors to be smaller and to share a pool of memory, making it attractive in environments with large heaps or multiple concurrent applications. Furthermore, as the RDDs reside in Tachyon, the crash of an executor does not lead to losing the in-memory cache. In this mode, the memory in Tachyon is discardable, so Tachyon does not attempt to reconstruct a block that it evicts from memory. In Spark 2.0 and later, OFF_HEAP instead stores the serialized data in off-heap memory, which has to be enabled explicitly.
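
To see why persisting helps when an RDD is reused, here is a small pure-Python sketch. It is only a model of the behaviour: `ToyRDD` and its `evaluations` counter are hypothetical names, and the real storage levels listed above are not simulated.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD with one expensive transformation."""

    def __init__(self, data, fn):
        self._data = list(data)
        self._fn = fn
        self._persist_requested = False
        self._cached = None
        self.evaluations = 0  # how many times fn has been applied

    def persist(self):
        # Only marks the dataset for caching; nothing is computed yet.
        self._persist_requested = True
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached  # reuse the stored result
        out = []
        for x in self._data:
            self.evaluations += 1
            out.append(self._fn(x))
        if self._persist_requested:
            self._cached = out  # keep the result for later actions
        return out

    def count(self):
        return len(self._compute())

    def sum(self):
        return sum(self._compute())


# Two actions without persisting: the transformation runs twice.
uncached = ToyRDD(range(1000), lambda x: x * x)
uncached.count()
uncached.sum()

# Two actions with persisting: the second action reuses the cached result.
cached = ToyRDD(range(1000), lambda x: x * x).persist()
cached.count()
cached.sum()
```

In this model the uncached dataset applies the function once per element per action, while the persisted one does so only during the first action.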
* **Transformation Complexity**
![transformation_complexity](transformation_complexity.png)
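    * Narrow transformations - do not cause shuffling (not computationally expensive), since each output partition is computed from a single input partition
        * e.g. filter, map
    * Wide transformations - cause shuffling, which can be computationally expensive, since records with the same key must be brought together across partitions
        * e.g. join, groupByKey, reduceByKey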
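
The difference between narrow and wide transformations can be sketched in plain Python by treating each partition as a list. This is an illustrative model only: the helper names are hypothetical, and "records moved" is a simple count rather than a measurement of a real shuffle. It also shows why pre-aggregating within each partition, as `reduceByKey` does, moves fewer records than shuffling every record, as `groupByKey` does.

```python
from collections import defaultdict

# Two partitions of (word, 1) pairs, as if they lived on two workers.
partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("c", 1)],
]


def map_partitions(parts, fn):
    # Narrow: each output partition is built from one input partition only.
    return [[fn(record) for record in part] for part in parts]


def combine_locally(part):
    # Sum values per key inside a single partition (no data movement).
    totals = defaultdict(int)
    for key, value in part:
        totals[key] += value
    return list(totals.items())


def shuffle_and_sum(parts):
    # Wide: records with the same key must be brought together.
    # Returns the per-key totals and the number of records moved.
    totals = defaultdict(int)
    moved = 0
    for part in parts:
        for key, value in part:
            totals[key] += value
            moved += 1
    return dict(totals), moved


doubled = map_partitions(partitions, lambda kv: (kv[0], kv[1] * 2))

# groupByKey-style: every record crosses the shuffle.
grouped_totals, moved_group = shuffle_and_sum(partitions)

# reduceByKey-style: pre-aggregate per partition, then shuffle partial sums.
reduced_totals, moved_reduce = shuffle_and_sum(
    [combine_locally(part) for part in partitions]
)
```

Both strategies give the same totals; only the amount of data crossing the shuffle differs, and the gap grows with the number of repeated keys per partition.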

7) Advanced Spark Programming
* **Partitioning**
    * Deciding number of partitions:
        * by default, Spark chooses the number of partitions based on the size of your cluster
        * you have the option of parallelizing your RDD over more partitions
        * more partitions mean more parallel tasks, but also more overhead (scheduling, I/O). Choose $k$ partitions such that:
            * Gain in parallelization > Loss in overhead
        * Spark documentation recommends 2-4 partitions per CPU (core) in the cluster
        * can be set when or after initializing the RDD:
            * when initializing RDD:
            ```python
            rdd = sc.parallelize([1,3,4,5,6], 16)
            ```
            * after initializing RDD:
            ```python
            # repartition returns a new RDD; the original RDD is unchanged
            rdd = rdd.repartition(16)
            ```
    * Partitioning By Key (when RDDs are Pair RDDs):
        * in a distributed program, communication between different machines is often very expensive
        * `partitionBy` lets us control the way Spark partitions RDDs that are composed of key/value pairs
        * it ensures that all key/value pairs for a given key end up on the same machine (preventing unnecessary shuffling). This can reduce the communication needed between machines and greatly speed up the program
        ```python
        # partitionBy returns a new RDD; the original RDD is unchanged
        rdd = rdd.partitionBy(100)
        ```
            * This will partition the data into 100 partitions by the current key
            * Only useful when a dataset is reused multiple times in key-oriented operations (groupByKey, reduceByKey, join, etc.)
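
A rough pure-Python sketch of the two partitioning ideas above: the 2-4 partitions per core rule of thumb, and key-based partitioning that keeps every pair for a given key together. The function names are hypothetical, and Python's built-in `hash` is only a stand-in for the hash function Spark's partitioner actually uses.

```python
def suggested_partition_range(total_cores):
    # Rule of thumb quoted above: 2-4 partitions per core.
    return 2 * total_cores, 4 * total_cores


def partition_by(pairs, num_partitions):
    # Stand-in hash partitioner: the partition index depends on the key only.
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[hash(key) % num_partitions].append((key, value))
    return parts


pairs = [(100156, 1), (100157, 1), (100156, 2), (100158, 3)]
parts = partition_by(pairs, 4)

# For each key, the set of partitions it appears in.
locations = {}
for index, part in enumerate(parts):
    for key, _ in part:
        locations.setdefault(key, set()).add(index)
```

Every key ends up in exactly one partition, so a later key-oriented operation on this data would not need to move records for that key between machines.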
* **Joins**
    * Spark RDDs offer all of our standard SQL joins:
        * e.g. inner (default), left outer, right outer, full outer
    * Each RDD in the join must consist of (key, value) pairs, and the keys of both RDDs must represent the same variable
    * Example: Transactions By Customers and Store Lookup Table
    ![joins](joins.png)
    ```python
    transactions_rdd = sc.parallelize([(100156, 1),(100156, 2),(100157, 1)])
    store_lookup_rdd = sc.parallelize([(1,"REI"),(2,"Sports"),(3,"Target"),(4,"Hippy")])
    # swap (customer_id, store_id) -> (store_id, customer_id) so both RDDs are keyed by store_id
    joined_rdd = transactions_rdd.map(lambda kv: (kv[1], kv[0])).join(store_lookup_rdd)

    # joined_rdd.collect() returns (order may vary):
    # [(1, (100156, 'REI')), (1, (100157, 'REI')), (2, (100156, 'Sports'))]
    ```
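
To make the join types concrete, here is a pure-Python sketch of an inner join and a left outer join over lists of (key, value) pairs. It mirrors the output shape of `join` and `leftOuterJoin` on pair RDDs but runs locally; the helper functions and the unmatched store id 9 are hypothetical additions for illustration.

```python
from collections import defaultdict


def _index(pairs):
    # Group the right-hand values by key for fast lookup.
    index = defaultdict(list)
    for key, value in pairs:
        index[key].append(value)
    return index


def inner_join(left, right):
    # Keep only keys present on both sides.
    index = _index(right)
    return [(k, (lv, rv)) for k, lv in left for rv in index.get(k, [])]


def left_outer_join(left, right):
    # Keep every left record; unmatched keys get None on the right.
    index = _index(right)
    return [(k, (lv, rv)) for k, lv in left for rv in index.get(k, [None])]


# (store_id, customer_id); store 9 is a made-up id with no lookup entry.
transactions = [(1, 100156), (2, 100156), (1, 100157), (9, 100158)]
store_lookup = [(1, "REI"), (2, "Sports"), (3, "Target"), (4, "Hippy")]

inner = inner_join(transactions, store_lookup)
left = left_outer_join(transactions, store_lookup)
```

The inner join silently drops the transaction for store 9, while the left outer join keeps it with `None` in place of the store name.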
* **Accumulators**
    * A type of shared variable across all worker machines
    * Provide a simple syntax for **aggregating** values from worker nodes back to the driver program (Spark's counterpart to MapReduce counters)
    * The most common use is counting events that occur during a job, for debugging purposes
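    ```python
    input_file = 'path/to/file'  # placeholder path
    myfile = sc.textFile(input_file)
    blank_lines = sc.accumulator(0)  # 0 is the initial value

    def extract_call_signs(line):
        global blank_lines  # needed because += rebinds the name
        if line == "":
            blank_lines += 1
        return line.split(" ")

    call_signs = myfile.flatMap(extract_call_signs)
    call_signs.count()  # flatMap is lazy: an action must run before the accumulator is updated
    print(blank_lines.value)  # read the total on the driver
    ```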
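
Conceptually, each task keeps its own partial count and the driver merges those partial counts into one total. The sketch below models that flow in plain Python; `ToyAccumulator`, `run_task`, and the sample lines are hypothetical, and Spark's real accumulators handle the merging for you.

```python
class ToyAccumulator:
    """Hypothetical driver-side accumulator: workers add, the driver reads."""

    def __init__(self, initial):
        self.value = initial

    def merge(self, delta):
        self.value += delta


def run_task(lines):
    # Runs on one "worker": returns its output and a local blank-line count.
    local_blank = 0
    tokens = []
    for line in lines:
        if line == "":
            local_blank += 1
        tokens.extend(line.split(" "))
    return tokens, local_blank


# Two partitions of a made-up input file.
partitions = [
    ["W8PAL KK6JKQ", "", "W6BB"],
    ["", "", "VE3UOW"],
]

blank_lines = ToyAccumulator(0)
all_tokens = []
for part in partitions:
    tokens, local_blank = run_task(part)
    all_tokens.extend(tokens)
    blank_lines.merge(local_blank)  # driver merges each task's partial count
```

Only the driver reads `blank_lines.value`; the tasks just contribute increments, which is why accumulators suit counters and sums.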
* **Broadcast Variables**
    * Another type of shared variable across all worker machines
    * Allow the program to efficiently send a large, read-only value (or values) to all the worker nodes
    * By default, Spark automatically sends all variables referenced in our functions to the worker nodes for each task, which **can be highly inefficient**. We might end up sending **multiple copies** of the same variables to the same workers
    * Broadcast variables are a solution to this problem
    * Can be particularly useful to **broadcast** a small lookup table across our worker nodes
        * Example: Transactions By Customers and Store Lookup Table (above)
        ```python
        transactions_rdd = sc.parallelize([(100156, 1),(100156, 2),(100157, 1)])
        store_lookup_broadcasted = sc.broadcast({1:"REI",2:"Sports",3:"Target",4:"Hippy"})

        def process_transactions(transaction, store_lookup_broadcasted):
            # transaction is (store_id, user_id) after the swap below
            store_id = transaction[0]
            store_name = store_lookup_broadcasted.value.get(store_id)
            user_id = transaction[1]
            return (store_id, (user_id, store_name))

        # swap (user_id, store_id) -> (store_id, user_id)
        transactions_rdd = transactions_rdd.map(lambda kv: (kv[1], kv[0]))
        lookedup_rdd = transactions_rdd.map(lambda transaction: process_transactions(transaction, store_lookup_broadcasted))

        # lookedup_rdd.collect() returns the same pairs as the join example:
        # [(1, (100156, 'REI')), (2, (100156, 'Sports')), (1, (100157, 'REI'))]
        ```
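
A back-of-the-envelope sketch of why broadcasting helps: if a lookup table travels with every task, its cost scales with the number of tasks, whereas a broadcast value is shipped once per worker. The cluster size and task count below are made-up assumptions, and `pickle` is only a rough proxy for the bytes Spark would actually send.

```python
import pickle

store_lookup = {1: "REI", 2: "Sports", 3: "Target", 4: "Hippy"}

num_workers = 3       # hypothetical cluster size
tasks_per_worker = 8  # hypothetical number of tasks each worker runs

payload_bytes = len(pickle.dumps(store_lookup))

# Closure capture: the table travels with every task.
bytes_without_broadcast = payload_bytes * num_workers * tasks_per_worker

# Broadcast: the table is shipped once per worker and reused by its tasks.
bytes_with_broadcast = payload_bytes * num_workers

savings_factor = bytes_without_broadcast // bytes_with_broadcast
```

In this simplified model the saving equals the number of tasks per worker, so the benefit grows with larger lookup tables and finer-grained tasks.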

8) Spark MLlib
* Conventions For Supervised Learning:
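```python
from pyspark.mllib.regression import LabeledPoint

# LabeledPoint(label, features) - think of y and X
#   label:    numeric target (the y associated with the row)
#   features: numeric vector (one row of X)
point = LabeledPoint(1.0, [0.5, 2.0, 3.1])  # illustrative values
```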
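
Getting raw rows into this (label, features) shape is usually the first step before training. The following pure-Python sketch assumes comma-separated lines with the target in the first column; `LabeledRow` is a hypothetical stand-in for `LabeledPoint` so the example runs without Spark, and the sample rows are made up.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.mllib.regression.LabeledPoint.
LabeledRow = namedtuple("LabeledRow", ["label", "features"])


def parse_line(line):
    # First column is the target y; the remaining columns are a row of X.
    values = [float(v) for v in line.split(",")]
    return LabeledRow(label=values[0], features=values[1:])


raw_lines = ["1.0,0.5,2.0,3.1", "0.0,1.5,0.2,4.0"]  # made-up CSV rows
rows = [parse_line(line) for line in raw_lines]
```

With Spark, the same parsing function would typically be passed to `map` on an RDD of text lines, constructing a `LabeledPoint` instead of the stand-in.
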
* **RDD-Based** Machine Learning Library (`spark.mllib`; in maintenance mode since Spark 2.0, when the DataFrame-based `spark.ml` API became the primary API):
    * Basic Statistics:
        * Summary statistics
        * Correlations
        * Stratified Sampling
        * Hypothesis Testing
        * Streaming Significance Testing
        * Random Data Generation
    * Classification/Regression:
        * Generalized Linear Regression (GLM)
        * Logistic Regression
        * Naive Bayes
        * Support Vector Machines (SVM)
        * Decision Trees/Random Forests
        * Gradient Boosted Trees
        * Isotonic Regression
        * Multilayer Perceptron (i.e. Neural Network)
    * Clustering
        * K-means Clustering
        * Gaussian Mixture
        * Power Iteration Clustering (PIC)
        * Latent Dirichlet Allocation (LDA)
        * Bisecting K-means
        * Streaming K-means
    * Decomposition
        * Singular Value Decomposition (SVD)
        * Principal Component Analysis (PCA)
        * Non-negative Matrix Factorization (NMF)
    * Recommenders/Collaborative Filtering
        * Alternating Least Squares (ALS)
    * Optimization:
        * Stochastic Gradient Descent (SGD)
        * Limited-memory BFGS (L-BFGS)