## Scaling Up: Local Machine

* Scaling up occurs within the same system that hosts the data and runs the computation
  * Simple to carry out from a physical standpoint
  * From a programmatic standpoint, it's managed by the operating system and programming libraries; it does not require additional frameworks
* Scalability is typically limited by the OS and the physical resources on the system.
  * The RAM upper bound is determined by the OS or the hardware (e.g., motherboard, number of RAM slots available, etc.)
* The cost of a single machine at the highest configuration may be prohibitive
* May not meet the demands of the workload at hand


## Distributed Systems

* The hardware specs may not be the same across machines, which adds another layer of complexity when they differ
<center>
<img src="https://bc247.wordpress.com/wp-content/uploads/2014/11/network.jpg" width="500" height="500">
</center>

### Key Requirements for Distributed Systems

* A distributed system needs to meet several crucial criteria, including:

  * Node Communication: Enabling seamless interaction among nodes in the network.
  * Resilience to Faults: Ensuring that both data and operations remain unaffected by system failures.
  * Scalability: The ability to scale computational resources to handle growing workloads.


### Apache Spark

* Apache Spark is an open-source, distributed processing system used for big data workloads.
  * Runs on a commodity cluster
      *  Interconnected, off-the-shelf hardware (commodity hardware)

* It's an enhancement to Hadoop's MapReduce
  * Processes and retains data in memory for subsequent steps
    * For smaller workloads, Spark’s data processing speeds are up to 100x faster than Hadoop's MapReduce

* Written in Scala and runs on the JVM



### Apache Spark -- Cont'd
* Designed for fast processing of large-scale data through distributed computing
* Utilizes in-memory caching and optimized query execution for improved performance
* Provides rich functionality:
  * Over 80 high-level operators beyond Map and Reduce
  * Tools for data pipeline construction and evaluation
* Offers advantages over Hadoop MapReduce:
  * More diverse set of operations and transformations
  * Faster processing due to in-memory computation
* Includes libraries that provide a wide range of functions for SQL-like operations, machine learning algorithms, and graph data processing:
  * SQL queries (Spark SQL)
  * Machine learning (MLlib)
  * Graph data analysis (GraphX)
  * Streaming data processing (Spark Streaming)



### What is Apache Spark

* [See Video](https://www.databricks.com/spark/about)

### Spark and Functional Programming

* To manipulate data, Spark uses functional programming
  * The functional programming paradigm is used in many popular languages including Common Lisp, Scheme, Clojure, OCaml, and Haskell
  
* Functional programming is a data-oriented paradigm
  * Decomposes a problem into a set of functions.
  * Logic is implemented by applying and composing functions.

* The idea is that functions should be able to manipulate data without maintaining any external state.
  * No external or global variables
 
* In functional programming, we need to always return new data instead of manipulating the data in-place.
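
The difference between manipulating data in place and returning new data can be sketched in plain Python. This is only an illustration of the idea, not Spark code; the function names `scale_in_place` and `scale_pure` are made up for this example.

```python
def scale_in_place(values, factor):
    """Impure: overwrites the list that the caller passed in."""
    for i in range(len(values)):
        values[i] = values[i] * factor
    return values


def scale_pure(values, factor):
    """Pure: builds and returns a new list, leaving the input untouched."""
    return [v * factor for v in values]


original = [1, 2, 3]
scaled = scale_pure(original, 10)
print(original)  # the input list is unchanged
print(scaled)    # a separate list holds the results

mutated_input = [1, 2, 3]
scale_in_place(mutated_input, 10)
print(mutated_input)  # the caller's list was overwritten
```

Because `scale_pure` depends only on its arguments and touches no outside state, it can be applied to different chunks of data independently, which is the property Spark relies on.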

### Spark Operations and RDDs: An Overview

What are RDDs? 
 * A Resilient Distributed Dataset (RDD) is the core data representation in Spark.
   * An RDD is conceptually divided into one or more partitions.
     Each partition is a self-contained unit of data that can be operated on independently and in parallel. 
     These partitions are distributed across the nodes in a Spark cluster for parallel computation.

  * RDDs are read-only collections, partitioned across multiple computing nodes for optimized performance.


### Spark Operations and RDDs: An Overview - Cont'd
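  * Partitions are not replicated by default; a lost partition is recomputed from its lineage (the recorded sequence of transformations that produced it).
    * Replication is optional and is requested when persisting an RDD, e.g., with a storage level such as `MEMORY_ONLY_2`.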
  * Partitioning enhances fault tolerance and boosts the efficiency of data operations.
  * This allows RDDs to be accessed through parallel operations
    * Data operations can be executed on all partitions at the same time, speeding up data tasks.
    
* In-Memory Caching
  * RDDs are stored in memory -- if the RDD can fit into a node's memory -- facilitating quick iterations over the same dataset.
    * Disk Spilling: If an RDD is too large, Spark can spill what doesn't fit in RAM to disk
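
As a rough single-machine analogy, the idea of partitions that can be processed independently can be sketched in plain Python. The helper `split_into_partitions` is hypothetical, and the way Spark actually slices a collection may differ from this simple contiguous split.

```python
def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions receive one additional element.
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


data = list(range(1, 11))
partitions = split_into_partitions(data, 4)

# Each partition is processed on its own, without looking at the others.
partial_sums = [sum(p) for p in partitions]

# A final step combines the per-partition results.
total = sum(partial_sums)
print(partitions)
print(partial_sums, total)
```

In a cluster, each of these chunks would live on a different node and the per-partition work would run at the same time.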



### Overview of Apache Spark's Core Components

* Spark Core: The foundation of the entire Spark ecosystem.
  * Defines the basic data structure (RDDs).
  * Provides a set of operations (transformations and actions) to process RDDs.
  * Enables distributed data processing, fault tolerance, and in-memory computations.

* Spark SQL: Spark's SQL engine for structured data.
  * Supports ANSI SQL standards for query language.
  * Transforms SQL queries into Spark operations.
  * Allows for SQL-like querying on large datasets, bridging the gap between traditional databases and big data processing.

### Overview of Apache Spark's Core Components - Cont'd

* Spark Streaming: Real-time data processing module in Spark.
  * Processes live streaming data.
  * Offers seamless integration with other Spark components.
  * Enables real-time analytics and data processing, vital for applications like fraud detection, monitoring, and recommendation systems.

* MLlib: Spark's machine learning library.
  * Implements machine learning algorithms on RDDs.
  * Provides algorithms for classification, regression, clustering, and more.
  * Allows for scalable machine learning tasks, leveraging Spark's distributed computing power.

* GraphX: Spark's library for graph processing.
  * Manages and manipulates graph structures.
  * Performs parallel graph operations and computations.
  * Enables graph analytics at scale, useful for social network analysis, recommendation systems, and more.
  

### Core Components of Spark
<img src="https://www.dropbox.com/s/azebxe8nv5nsqne/spark_architecture.png?dl=1" width="900" height="600">


### Spark High-level Components

* Cluster manager: Manages the resources across the Spark cluster.
  * Responsible for allocating resources like CPU and memory to Spark applications.
  
* Application driver: The central orchestrator of the Spark program, housing the main application logic.
  * Contains an instance of the `SparkContext`
    * Requests resources from the Cluster Manager to launch executors.
    * Coordinates the overall data processing workflow.
  
* Executors (workers): These are the processes on the worker nodes to which tasks are delegated.
  * They execute the code sent by the driver, specifically focusing on their designated partitions of the dataset.
  * Executors communicate with the Cluster Manager to report status and failures.


![](https://spark.apache.org/docs/latest/img/cluster-overview.png)


### Spark Program Flow

* A typical Spark program adheres to the following structure:
  * Application Driver: Central point of the Spark program, where the main application logic resides. 
    * Responsible for coordinating the entire data processing workflow.
      *  In client mode, the driver runs on the machine where the Spark job is submitted (master node).
        * The professor is the driver  
      *  In cluster mode, the driver runs on one of the worker nodes in the cluster.
        * One of your peers is the driver, in addition to doing, or not doing, some of the work.  
  * Executors (Workers): The worker nodes that tasks are delegated to.
    * They execute the code sent by the driver, focusing on their designated partitions of the dataset.
      * The driver sends smaller, more specific operations that the executors can carry out.
        * Referred to as a task plan.
    * Result Aggregation: results are sent by the executors to the application driver for aggregation
      * Often a final layer of computation to produce the output.


### Spark Application Lifecycle

1. Python Program Initialization
    * The `SparkContext` object is created; it's a client to interact with the Spark cluster.
  
2. Resource Request to Cluster Manager:
    * `SparkContext` contacts the Cluster Manager to request resources (CPU, memory) for your application.
  
3. Cluster Manager Allocates Resources:
    * The Cluster Manager allocates the necessary resources for the application and decides where to place the executors across the cluster's nodes.
  
4. Application Driver Initialization
    * The main logic of your Spark application is in the Application Driver.
    * It becomes the master node for your application, coordinating tasks between the cluster and your Python program.
5. Executor Launch
    * Based on the resources allocated by the Cluster Manager, executors for the Spark application are launched on the worker nodes.
6. Task Division and Execution Plan
    * The Application Driver divides the job into tasks and builds an execution plan.
7. Sending Task Plans to Executors
    * Instead of sending raw code, the Application Driver sends the execution plan (or task plans) to the Executors.
8. Task Execution on Executors
    * Executors run the tasks on their designated partition of the dataset.
9. Result Aggregation
    * After task execution, the Executors send the results back to the Application Driver.
10. Final Computation and Output Retrieval
    * The Application Driver may perform some final computations.
    * Results are returned to your Python program through the `SparkContext` object.
11. Resource Release
    * Once the application completes, the resources are released back to the Cluster Manager for use by other applications.
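
The lifecycle above can be imitated on a single machine as a rough sketch, with a thread pool standing in for the executors. Everything here is an assumption made for illustration: `run_task` and `driver` are made-up names, and real Spark executors are separate processes managed by a cluster manager rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(partition):
    """Task run by an 'executor': count the even numbers in one partition."""
    return sum(1 for x in partition if x % 2 == 0)


def driver(data, num_executors=3):
    """'Driver': split the data, hand out tasks, and aggregate the results."""
    # Steps 6-7: divide the job into one task per partition.
    partitions = [data[i::num_executors] for i in range(num_executors)]
    # Steps 5 and 8: start the workers and run the tasks in parallel.
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partial_results = list(pool.map(run_task, partitions))
    # Steps 9-10: combine the partial results into the final answer.
    # Step 11: leaving the `with` block releases the worker threads.
    return sum(partial_results)


even_count = driver(list(range(1, 101)))
print(even_count)
```

The driver never touches individual elements itself; it only decides how the work is divided and combines what comes back.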


### Setting up a Docker Cluster
 
* Installing Spark and all its components manually can be challenging and time-consuming.
  * Configuring and optimizing a Spark cluster from scratch requires substantial effort.
* Easier deployment options are available on cloud services, such as:
  * [Amazon's EMR](https://aws.amazon.com/emr/features/spark/) for Spark support
  * [Databricks' Community Edition](https://www.databricks.com/product/faq/community-edition) and paid offerings
  * Other providers like Google's Dataproc, Microsoft's HDInsight, etc.

* We'll utilize [Databricks Community Edition](https://community.cloud.databricks.com/)

### Installing Via Docker For ICS438 (Optional)

* Note that the following is optional. All the work we will do using PySpark will be done on a Databricks cluster in Community Edition. Also, this solution assumes that you have Docker running on your machine.

* It is easy to use Docker to install Spark locally. We will use the following Docker image
  
```  
jupyter/all-spark-notebook
```
* There are other Docker images, including `jupyter/pyspark-notebook`, which does not include the jobs dashboard at `http://localhost:4040`

* We will run the infrastructure as follows:

```
docker run --rm -p 4040:4040 -p 8888:8888 -v "$(pwd)":/home/jovyan/work jupyter/all-spark-notebook
```

* This configuration creates a master and compute nodes locally in a Docker instance
 
* While you're probably not going to need to, you can log into the running container using: `docker exec -it <CONTAINER_ID> bash`

 * where `<CONTAINER_ID>` is the ID of the container currently running the `jupyter/all-spark-notebook` image


* The Docker instance has all the libraries installed and ready to go.

* Make sure you run a Jupyter notebook on the Docker instance
  * If the code below fails, this means you're not running in the Docker instance.

In [None]:

# pip install pyspark

from pyspark import SparkContext
# sc = SparkContext()
sc


In [None]:
# help(SparkContext)

In [None]:
print(f"Spark version is {sc.version}")

print(f"Python version is {sc.pythonVer}")

print(f"The name of the master is {sc.master}")


In [None]:
sc.getConf().getAll()

### Creating a Text RDD in Spark

* Resilient Distributed Datasets (RDDs) can be created in various ways in Spark. Here are two commonly used methods:

    * `parallelize()`: This function allows you to transform Python collections, such as lists or arrays, into an RDD.
        * It distributes the elements of the passed collection across multiple nodes, making the RDD fault-tolerant.

    * `textFile()`: This function reads in a text file and creates an RDD where each object corresponds to a line in the file.

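* Example using `parallelize()` and `textFile()`:
  ```python
  from pyspark import SparkContext
  # Reuse the running context if one exists (e.g., in a notebook); otherwise create one.
  sc = SparkContext.getOrCreate()
  # Build an RDD from a Python collection
  my_list = [1, 2, 3, 4, 5]
  my_rdd = sc.parallelize(my_list)
  # Or build an RDD from a text file, one element per line
  my_text_rdd = sc.textFile("path/to/text/file.txt")
  ```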

### Fundamental Operations on RDDs

* Several fundamental operations are available to manipulate and transform Spark RDDs.
  * Operations that return a new RDD are called 'transformations.'
    * `map`: Applies a given function to each element of the RDD and returns a new RDD consisting of the results.
    * `filter`: Returns a new RDD containing only the elements that satisfy a given predicate.
  * Operations that return a value to the driver are called 'actions.'
    * `reduce`: Aggregates the elements of the RDD using a given function and returns the result.
      * The function should be commutative and associative so that it can be computed in parallel.

* Each transformation on an RDD produces a new RDD without modifying the original one; RDDs are immutable.

* `flatMap`: Another commonly used transformation, which first applies a function to all elements of the RDD and then flattens the results. It is conceptually equivalent to Python's `itertools.chain.from_iterable()` applied to the output of `map()`.
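
As a plain-Python analogy (not Spark itself), the same four operations can be tried on an ordinary list using built-ins, `functools.reduce`, and `itertools.chain`. The variable names are illustrative only.

```python
from functools import reduce
from itertools import chain

numbers = [1, 2, 3, 4, 5]

doubled = list(map(lambda x: x * 2, numbers))        # analogous to rdd.map
evens = list(filter(lambda x: x % 2 == 0, numbers))  # analogous to rdd.filter
total = reduce(lambda a, b: a + b, numbers)          # analogous to rdd.reduce

# map keeps one output element per input element (here, one tuple each) ...
pairs = list(map(lambda x: (x, x * 3), numbers))
# ... while flatMap flattens each returned iterable into a single sequence.
flat = list(chain.from_iterable(map(lambda x: (x, x * 3), numbers)))

print(doubled, evens, total)
print(pairs)
print(flat)
```

Note that `numbers` is never modified: every step produces a new collection, mirroring the immutability of RDDs.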



In [None]:
my_rdd = sc.parallelize([1,2,3,4,5,6,7,8,9,10])
my_rdd

In [None]:
my_rdd.collect()

In [None]:
my_rdd.getNumPartitions()

In [None]:
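# glom() groups the elements of each partition into a list, showing how the data is split.
# Spark's default partitioning is usually sufficient and often near-optimal.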
partitions_data = my_rdd.glom().collect()
partitions_data

In [None]:
doubled_rdd = my_rdd.map(lambda x: x * 2)
doubled_rdd

In [None]:
doubled_rdd.collect()

In [None]:
even_rdd = my_rdd.filter(lambda x: x % 2 == 0)
even_rdd

In [None]:
even_rdd.collect()

In [None]:
sum_of_elements = my_rdd.reduce(lambda a, b: a + b)

In [None]:
sum_of_elements

In [None]:
mapped_rdd = my_rdd.map(lambda x: (x, x * 3))

In [None]:
mapped_rdd.collect()

In [None]:
flat_rdd = my_rdd.flatMap(lambda x: (x, x * 3))

In [None]:
flat_rdd.collect()

### Transformations and Actions in Spark

* Transformations: These operations transform your data and produce a new RDD. They come in two types:
    * Narrow-dependency transformations:
        * Transformations where each partition of the parent RDD is used to build only one partition of the new RDD, e.g., `map` and `filter`.
        * This means the operation can be performed independently on each partition, which allows for better parallelism and less data movement.
    * Wide-dependency transformations:
        * Transformations where a single partition of the parent RDD may be used to build multiple partitions of the child RDD, e.g., `groupBy` and `reduceByKey`.
        * These operations are generally expensive in terms of performance because they typically require shuffling, or redistributing, data across partitions.

* Actions: These operations trigger the computation and execute the job, producing a result.
    * Each action initiates a Spark job, which may consist of multiple stages depending on the transformations involved.
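
Transformations in Spark are lazy: nothing is computed until an action asks for a result. Python generators behave in a loosely similar way, so the idea can be sketched without Spark. The `traced_double` helper and the `log` list are made up for this illustration.

```python
log = []


def traced_double(x):
    """Record each call so we can see when the work actually happens."""
    log.append(x)
    return x * 2


numbers = [1, 2, 3, 4]

# "Transformations": building the pipeline does not process any element yet.
pipeline = (traced_double(x) for x in numbers if x % 2 == 0)
calls_before_action = len(log)

# "Action": materializing the result triggers the computation.
result = list(pipeline)
calls_after_action = len(log)

print(calls_before_action, calls_after_action, result)
```

The call counter stays at zero until `list()` consumes the pipeline, much as a Spark job only starts when an action such as `collect()` is called.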

#### Why Distinguish Between Transformations and Actions?

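  * One reason is query optimization. Transformations are lazy: Spark only records them until an action is called, so the full chain of operations is known before anything runs. For instance, performing a `filter` operation before a `groupBy` is usually more efficient than doing it afterward.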


* Before Join

* Partition 1 of RDD1: 
```
(1, "apple")
(2, "banana")
```
* Partition 2 of RDD1: 
```
(3, "cherry")
(4, "date")
```
  
* Partition 1 of RDD2: 
```
(3, "red")
```
* Partition 2 of RDD2: 
```
(4, "brown")
(2, "yellow")
```

Compute a join between RDD1 and RDD2 based on the keys:
1. Shuffle the data to group all occurrences of the same key together.
* Data after shuffle (`(1, "apple")` is omitted because key 1 has no match in RDD2)

  * Shuffle output targeting Partition 1 (from RDD1 and RDD2):
    ```
    (2, "banana")
    (2, "yellow")
    ```
    
  * Shuffle output targeting Partition 2 (from RDD1 and RDD2):
    ```
    (3, "cherry")
    (3, "red")
    (4, "date")
    (4, "brown")
    ```

* Data Partitions After Join

* Partition 1: 
```
(2, ("banana", "yellow"))
```
* Partition 2: 
```
(3, ("cherry", "red")), (4, ("date", "brown"))
```

* What if we want to only join where key is 2?
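
One way to explore this question is a small plain-Python simulation of the shuffle and join, using the same records as above. This is a sketch under stated assumptions, not how Spark is implemented: `partition_for` is a hypothetical partitioner chosen so that the layout matches the illustration, and `shuffle`, `join`, and `keep_key` are made-up helpers.

```python
from collections import defaultdict

rdd1 = [[(1, "apple"), (2, "banana")], [(3, "cherry"), (4, "date")]]
rdd2 = [[(3, "red")], [(4, "brown"), (2, "yellow")]]


def partition_for(key):
    """Hypothetical partitioner: keys 1-2 go to partition 0, the rest to 1."""
    return 0 if key <= 2 else 1


def shuffle(partitions, num_partitions=2):
    """Route every record to the partition that owns its key."""
    out = [[] for _ in range(num_partitions)]
    for part in partitions:
        for key, value in part:
            out[partition_for(key)].append((key, value))
    return out


def join(left, right):
    """Inner join: shuffle both sides, then match keys within each partition."""
    result = []
    for lpart, rpart in zip(shuffle(left), shuffle(right)):
        right_by_key = defaultdict(list)
        for key, value in rpart:
            right_by_key[key].append(value)
        result.append(
            [(k, (v, w)) for k, v in lpart for w in right_by_key.get(k, [])]
        )
    return result


def keep_key(partitions, wanted):
    """Narrow filter applied to each partition, before any shuffle."""
    return [[(k, v) for k, v in part if k == wanted] for part in partitions]


full_join = join(rdd1, rdd2)
filtered_join = join(keep_key(rdd1, 2), keep_key(rdd2, 2))

# Number of records that have to pass through the shuffle in each case.
shuffled_full = sum(len(p) for p in rdd1 + rdd2)
shuffled_filtered = sum(len(p) for p in keep_key(rdd1, 2) + keep_key(rdd2, 2))
print(full_join, filtered_join, shuffled_full, shuffled_filtered)
```

Filtering on key 2 before the join sends far fewer records through the shuffle than joining everything and filtering afterward, which is why the order of operations matters.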

### Understanding the Concept of a Stage in Spark

* A "stage" is a sequence of transformations that can be executed in parallel.
  * I.e., narrow-dependency transformations.
* Stages are separated by operations that require data to be rearranged
  * I.e., wide dependencies.
* Managing the complexity of operations in Spark is easier when tasks are grouped into stages.
* Each stage either reads data, performs computations, or writes data.

* Stages are executed in sequence, one after the other.
  * Within each stage, tasks are executed in parallel.
* Computation moves to where the data resides.


### Decomposition into Stages

```python
flights_df = (
    spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv("flights_info.csv")
)
# repartition() takes the target number of partitions as a positional argument
flights_data_partitioned_df = flights_df.repartition(4)

# Parentheses allow the method chain to span multiple lines
counts_df = (
    flights_data_partitioned_df.where("duration > 120")
    .select("dep", "dest", "carrier", "duration")
    .groupBy("carrier")
    .count()
)
counts_df.collect()
```


### Stages
1. Reading Data
    * Reads the partitioned data into memory.
      * A task for each partition.
    * These tasks have no dependency.
2. Filter, Select, and GroupBy
    * Applies `.where()`, `.select()`, and `.groupBy()`.
    * Each task applies all these transformations on a partition.
    * GroupBy is a "wide" dependency (it needs to shuffle data between partitions for grouping).
3. Count
    * Applies `.count()` to each group.
    * Each task calculates the count for groups in its partition.
      * After the shuffle, each group is guaranteed to be in a single partition.
    * These tasks depend on the shuffle output of the previous stage.


### PySpark: Job, Stages and Tasks

<img src="https://www.dropbox.com/s/5qa1fb7p867i787/Page5.jpg?dl=1" width="900" height="600">

    

In [None]:
pip install randomuser

In [None]:
# Install using `pip install randomuser` if not already installed

from randomuser import RandomUser

# Generate a single user
user = RandomUser({"nat": "us"})
print(f"user object  is {user}")
def get_user_info(u):

    user_dict = {
        "user_id": u.get_id()["number"], 
        "first_name": u.get_first_name(), 
        "last_name": u.get_last_name(), 
        "state": u.get_state(),
        "zip": u.get_zipcode(),
        "lat_long": u.get_coordinates()
    }
    return user_dict

user_json = get_user_info(user)
print(f"user json representation  is\n {user_json}")



In [None]:
my_users = RandomUser.generate_users(500, {"nat": "us"})
print(len(my_users))
my_users[0:3]

In [None]:
# Convert each generated user into a dictionary

user_dicts = list(map(get_user_info, my_users))

user_dicts[0:3]

In [None]:
users_rdd = sc.parallelize(user_dicts)
users_rdd_size  = users_rdd.count()
print(f"The number of objects in my RDD is: {users_rdd_size}")


In [None]:
users_rdd.takeSample(False, 3)

In [None]:
select_users_rdd = users_rdd.filter(lambda x: x['state'] in ["Hawaii", "Idaho"])
select_users_rdd

In [None]:
# collect() gathers the results from all worker nodes back to the driver
select_users_rdd.collect()[:5]

In [None]:
# Building an RDD from a text file.
text = sc.textFile('dbfs:/FileStore/pride_and_prejudice.txt', minPartitions=4)
# Number of partitions in the RDD
text.getNumPartitions()

In [None]:
text_rdd_size = text.count()
print(f"number of objects in the RDD is {text_rdd_size}")


In [None]:
dbutils.fs.ls("/FileStore/pride_and_prejudice.txt")[0]

In [None]:
subset_x = text.take(10)
print(f"len of subset_x is: {len(subset_x)}\n")
print(f"type of subset_x is: {type(subset_x)}\n")
print(f"subset_x is:\n{subset_x}")

In [None]:
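import re

def clean_split_line(line):
    # Remove digits, replace runs of non-word characters with a space,
    # then upper-case the line and split it into words
    a = re.sub(r'\d+', '', line)
    b = re.sub(r'\W+', ' ', a)
    return b.upper().split()

# map() keeps one list of words per line
words = text.map(clean_split_line)
words.take(60)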

In [None]:
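import re

def clean_split_line(line):
    # Remove digits, replace runs of non-word characters with a space,
    # then upper-case the line and split it into words
    a = re.sub(r'\d+', '', line)
    b = re.sub(r'\W+', ' ', a)
    return b.upper().split()

# flatMap() flattens the per-line lists into a single RDD of words
words = text.flatMap(clean_split_line)
words.take(60)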

In [None]:
words.count()

In [None]:
# Pair each word with an initial count of 1, e.g. ("PRIDE", 1)

words_mapped = words.map(lambda x: (x,1))
words_mapped.take(10)

In [None]:
sorted_map = words_mapped.sortByKey()
sorted_map

In [None]:
sample = sorted_map.sample(withReplacement=False, fraction= 0.001)
sample.collect()

In [None]:
counts = words_mapped.reduceByKey(lambda x,y: x+y)
counts.collect()[:50]

In [None]:
%%time
# Because each transformation returns a new RDD instead of modifying data in place,
# the steps above can be chained into a single expression:
counts_test_2 = text.flatMap(clean_split_line).map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y)
counts_test_2.take(100)
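
To check the logic of the chained word count without a cluster, the same three steps can be mimicked in plain Python on a couple of lines of text. This is a single-machine sketch: `reduce_by_key` is a made-up stand-in for `reduceByKey`, and the two sample lines are illustrative input rather than the contents of the file used above.

```python
import re
from itertools import chain


def clean_split_line(line):
    """Drop digits, replace non-word characters with spaces, upper-case, split."""
    no_digits = re.sub(r"\d+", "", line)
    words_only = re.sub(r"\W+", " ", no_digits)
    return words_only.upper().split()


def reduce_by_key(pairs, func):
    """Combine values that share a key, like reduceByKey on a single machine."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


lines = ["It is a truth universally acknowledged,", "that a single man in 1813..."]

words = chain.from_iterable(map(clean_split_line, lines))  # flatMap
pairs = map(lambda w: (w, 1), words)                       # map
counts = reduce_by_key(pairs, lambda x, y: x + y)          # reduceByKey

print(counts)
```

Each step consumes the output of the previous one without modifying it, which is what makes the Spark version safe to chain.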
