### Just Enough Spark! Core Concepts Revisited !!

**Apache Spark™** is a unified analytics engine for large-scale data processing. It is a fast engine for big data and machine learning and one of the largest open source projects in data processing. It integrates with most open source big data technologies.


### Spark Basic Architecture

A cluster, or group of machines, pools the resources of many machines together, allowing us to use all the cumulative resources as if they were one. A group of machines on its own is not powerful, however: you need a framework to coordinate work across them. Spark is an engine built for exactly this: managing and coordinating the execution of tasks on data across a cluster of computers.

The cluster of machines that Spark uses to execute tasks is managed by a cluster manager such as Spark’s Standalone cluster manager, YARN (Yet Another Resource Negotiator) or Kubernetes. We submit Spark Applications to the cluster manager, which grants resources to our application so that we can complete our work.
```
                                                            +-----------------------------------------+
                                                            |  Worker Node                            |
                                                            |  +----------------+  +----------------+ |
                                                            |  |    Executor    |  |    Executor    | |
                             +----------------------------> |  | +----+  +----+ |  | +----+  +----+ | |
                             |                              |  | |Task|  |Task| |  | |Task|  |Task| | |
                             |                          +-> |  | +----+  +----+ |  | +----+  +----+ | |
                             |                          |   |  +----------------+  +----------------+ |
       +----------------+    |                          |   +-----------------------------------------+
       | Driver Program | <--+    +-----------------+   |
       |                |         |                 | <-+
       | +------------+ | <-----> | Cluster Manager |
       | |SparkContext| |         |                 | <-+
       | +------------+ | <--+    +-----------------+   |   +-----------------------------------------+
       +----------------+    |                          |   |  Worker Node                            |
                             |                          |   |  +----------------+ +----------------+  |
                             |                          +-> |  |    Executor    | |    Executor    |  |
                             |                              |  | +----+  +----+ | | +----+  +----+ |  |
                             +----------------------------> |  | |Task|  |Task| | | |Task|  |Task| |  |
                                                            |  | +----+  +----+ | | +----+  +----+ |  |
                                                            |  +----------------+ +----------------+  |
                                                            +-----------------------------------------+
```

Spark Applications consist of a driver process and a set of executor processes. In the illustration above, the driver is on the left and the executors, two on each worker node, are on the right.

### Deployment modes

1. **Cluster mode:**
    * The Spark Driver is launched on a worker node.
    * The cluster manager is responsible for all Spark processes, the driver included.
```
               +-----------------+
               | Cluster Manager |
               +--------+--------+
                        |
       +----------------+----------------+
      _|_              _|_              _|_
      \_/              \_/              \_/
+-------------+  +-------------+  +-------------+
| Worker node |  | Worker node |  | Worker node |
| +---+ +---+ |  | +---+ +---+ |  | +---+ +---+ |
| | D | | E | |  | | E | | E | |  | | E | | E | |
| +---+ +---+ |  | +---+ +---+ |  | +---+ +---+ |
+-------------+  +-------------+  +-------------+
```
2.  **Client mode:**
    * The Spark Driver runs on the client machine, which is not part of the cluster. That client is responsible for maintaining the Spark driver.
    * The cluster manager is responsible for the executor processes.
```
               +-----------------+              +----------------+
               | Cluster Manager |<-------------| Client machine |
               +--------+--------+              | +--------+     |
                        |                       | | Driver |     |
       +----------------+----------------+      | +--------+     |
      _|_              _|_              _|_     +----------------+
      \_/              \_/              \_/
+-------------+  +-------------+  +-------------+
| Worker node |  | Worker node |  | Worker node |
| +---+ +---+ |  | +---+ +---+ |  | +---+ +---+ |
| | E | | E | |  | | E | | E | |  | | E | | E | |
| +---+ +---+ |  | +---+ +---+ |  | +---+ +---+ |
+-------------+  +-------------+  +-------------+
```
3.  **Local mode:**
    * The whole Spark application runs on a single machine: the driver and the executor run as threads inside one JVM rather than as separate processes.

In [None]:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
    .appName("Spark app")
    .master("local[*]")
    .getOrCreate()

In [None]:
spark.sparkContext.uiWebUrl

### Cluster Manager
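1. One node manages the state of the cluster.
2. The other nodes do the work.
3. They communicate through the cluster manager's own driver (master) and worker processes, which are tied to physical machines and are distinct from the Spark driver and executors.

Examples: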
* Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster.
* Apache Mesos – a general cluster manager that can also run Hadoop MapReduce and service applications. (Deprecated)
* Hadoop YARN – the resource manager in Hadoop 2 and 3.
* Kubernetes – an open-source system for automating deployment, scaling, and management of containerized applications. 

### What is a Worker node?
Any node in the cluster that can run application code.

### What is a Driver?

The driver process runs your main() function, sits on a node in the cluster and is responsible for 3 main things:

1. Maintaining information about the Spark application.
2. Responding to the user’s program or input.
3. Analyzing, distributing and scheduling work across the executors.

### What is an Executor?

The executors are responsible for carrying out the work that the driver assigns them. 

Executors are launched in JVM containers with their own memory/CPU resources. Each executor has two responsibilities:

1. Executing the code assigned to it by the driver.
2. Reporting the state of the computation on that executor back to the driver.

### What is a JVM?

The JVM manages system memory and provides a portable execution environment for Java-based applications.

**Technical definition:** The JVM is the specification for a software program that executes code and provides the runtime environment for that code. 

**Everyday definition:** The JVM is how we run our Java programs. We configure the JVM's settings and then rely on it to manage program resources during execution. 

The Java Virtual Machine (JVM) is a program whose purpose is to execute other programs. 

The JVM has two primary functions: 

1. To allow Java programs to run on any device or operating system (known as the "Write once, run anywhere" principle)
2. To manage and optimize program memory 
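
As a rough illustration of the resources a JVM manages, the sketch below queries the standard `java.lang.Runtime` API from Scala. Run in the notebook, it would report on the JVM hosting the driver. The value names (`availableCores`, `maxHeapMb`) are only illustrative.

```scala
// Inspect the JVM that is running this code (in a notebook, the driver's JVM).
val runtime = Runtime.getRuntime

// Number of processor cores visible to this JVM.
val availableCores: Int = runtime.availableProcessors()

// Upper bound on the heap memory the JVM will attempt to use, in megabytes.
// This limit is one of the JVM settings that can be configured (e.g. -Xmx).
val maxHeapMb: Long = runtime.maxMemory() / (1024L * 1024L)

println(s"cores=$availableCores, maxHeapMb=$maxHeapMb")
```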

### What are Cores/Slots/Threads?

Spark parallelizes at two levels. One is splitting the work among executors. The other is the slot. Each executor has a number of slots, and each slot can be assigned a Task.

For example: the diagram below shows four executors, each with two slots (cores):
```
                                         +--------+
                                         | Driver |
                                         +--------+JVM
                                              |
         +------------------------+-----------+------------+------------------------+
         |                        |                        |                        |
         |                        |                        |                        |
+----------------+       +----------------+       +----------------+       +----------------+
|    Executor    |       |    Executor    |       |    Executor    |       |    Executor    |
| +----+  +----+ |       | +----+  +----+ |       | +----+  +----+ |       | +----+  +----+ |
| |Task|  |Task| |       | |Task|  |Slot| |       | |Slot|  |Slot| |       | |Task|  |Task| |
| +----+  +----+ |       | +----+  +----+ |       | +----+  +----+ |       | +----+  +----+ |
+----------------+JVM    +----------------+JVM    +----------------+JVM    +----------------+JVM
```

* The JVM is naturally multithreaded, but a single JVM, such as our Driver, has a finite upper limit.
* By creating Tasks, the Driver can assign units of work to Slots on each Executor for parallel execution.
* Additionally, the Driver must also decide how to partition the data so that it can be distributed for parallel processing (see below).
* Consequently, the Driver is assigning a Partition of data to each task - in this way each Task knows which piece of data it is to process.
* Once started, each Task will fetch from the original data source (e.g. an Azure Storage Account) the Partition of data assigned to it.
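
One way to see how many slots are available is to ask the SparkContext for its default parallelism. The sketch below assumes local mode, where `local[*]` uses one worker thread per logical core; on a real cluster the value depends on the cluster manager and on configuration. The app name and value names are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one already exists.
val spark = SparkSession.builder
    .appName("Slots sketch")
    .master("local[*]")
    .getOrCreate()

// In local mode this is the number of worker threads, i.e. task slots,
// unless spark.default.parallelism has been set explicitly.
val slots: Int = spark.sparkContext.defaultParallelism

// Logical cores visible to the driver JVM, for comparison.
val cores: Int = Runtime.getRuntime.availableProcessors()

println(s"slots=$slots, cores=$cores")
```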

### What is a Partition?

In order to allow every executor to perform work in parallel, Spark breaks up the data into chunks, called partitions.
```
+-------------------+           +-------------------+
|                   |   --->    |    PARTITION 1    |
|                   |           +-------------------+
|                   |   --->    |    PARTITION 2    |
|                   |           +-------------------+
|       BIG         |   --->    |    PARTITION 3    |
|      TABLE        |           +-------------------+
|                   |   --->    |    PARTITION 4    |
|                   |           +-------------------+
|                   |   --->    |    PARTITION 5    |
|                   |           +-------------------+
|                   |   --->    |        ...        |
|                   |           +-------------------+
|                   |   --->    |    PARTITION N    |
+-------------------+           +-------------------+
```
A partition is a collection of rows that sit on one physical machine in our cluster. A DataFrame’s partitions represent how the data is physically distributed across your cluster of machines during execution:

* If you have one partition, Spark will only have a parallelism of one, even if you have thousands of executors. 
* If you have many partitions, but only one executor with a single slot, Spark will still only have a parallelism of one because there is only one computation resource.

An important thing to note is that with DataFrames, we do not (for the most part) manipulate partitions manually (on an individual basis). We simply specify high level transformations of data in the physical partitions and Spark determines how this work will actually execute on the cluster.
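
To see partitioning without depending on any particular dataset, the following sketch builds a small DataFrame with `spark.range` and inspects its partition count before and after `repartition`. The sizes used (1000 rows, 4 and 8 partitions) are arbitrary choices for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one already exists.
val spark = SparkSession.builder
    .appName("Partitions sketch")
    .master("local[*]")
    .getOrCreate()

// spark.range(start, end, step, numPartitions): request 4 partitions explicitly.
val numbers = spark.range(0, 1000, 1, 4)
val before: Int = numbers.rdd.getNumPartitions

// repartition(n) redistributes the rows into n partitions (this needs a shuffle).
val after: Int = numbers.repartition(8).rdd.getNumPartitions

println(s"before=$before, after=$after")
```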

In [None]:
val clients_df = spark.read.parquet("../../resources/data/parquet/big_clients")
val contracts_df = spark.read.parquet("../../resources/data/parquet/big_contracts")

In [None]:
import org.apache.spark.sql.{functions => f}
clients_df.groupBy(f.spark_partition_id()).count().show()
contracts_df.groupBy(f.spark_partition_id()).count().show()

### What is a DAG?
A Directed Acyclic Graph (DAG) in Apache Spark is a set of vertices and edges, where the vertices represent the RDDs and the edges represent the operations to be applied to those RDDs.

DAGScheduler is the scheduling layer of Apache Spark that implements stage-oriented scheduling. It transforms a logical execution plan to a physical execution plan (using stages).

After an action has been called, SparkContext hands over a logical plan to DAGScheduler that it in turn translates to a set of stages that are submitted as a set of tasks for execution.

The fundamental concepts of DAGScheduler are jobs and stages that it tracks through internal registries and counters.
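
The lineage that the DAGScheduler works from can be printed with `toDebugString` on an RDD. This is a minimal sketch on a tiny, made-up word list; the value names are illustrative, and the exact text of the lineage varies between Spark versions.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one already exists.
val spark = SparkSession.builder
    .appName("DAG sketch")
    .master("local[*]")
    .getOrCreate()
val sc = spark.sparkContext

// Build a small lineage: parallelize -> map -> reduceByKey.
// Nothing runs yet; these transformations only extend the graph.
val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), 2)
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

// toDebugString describes the RDD lineage. A change in indentation marks
// a shuffle boundary, which is where stages are split.
val lineage: String = counts.toDebugString
println(lineage)

// collect() is the action that submits the graph for execution.
val result: Map[String, Int] = counts.collect().toMap
println(result)
```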

In [None]:
contracts_df.join(clients_df, Seq("id"), "outer").count()

### Transformations Vs Actions

#### Transformations

In Spark, the core data structures are immutable meaning they cannot be changed once created. In order to "change" a DataFrame you will have to instruct Spark how you would like to modify the DataFrame you have into the one that you want. These instructions are called transformations. Examples – **Select, Filter, GroupBy, Join, Union, Repartition, etc**

#### Actions

Transformations allow us to build up our logical transformation plan. To trigger the computation, we run an action. An action instructs Spark to compute a result from a series of transformations. Examples – **count, write, show and collect**.
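
The difference can be sketched on a small generated dataset. The transformations below only describe the result; the work happens when `count()` is called. The column name `value` and the value names are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one already exists.
val spark = SparkSession.builder
    .appName("Lazy evaluation sketch")
    .master("local[*]")
    .getOrCreate()

// Rows with id = 0 .. 99.
val numbers = spark.range(0, 100)

// Transformations: each returns a new, immutable Dataset and computes nothing yet.
val evens = numbers.filter("id % 2 = 0")
val scaled = evens.selectExpr("id * 10 as value")

// Action: triggers the computation of the whole chain above.
val rowCount: Long = scaled.count()

// The original Dataset is unchanged by the transformations.
val originalCount: Long = numbers.count()

println(s"rowCount=$rowCount, originalCount=$originalCount")
```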

### Narrow Transformations Vs Wide Transformations

There are two types of transformations: Narrow and Wide.

For **narrow transformations**, the data required to compute the records in a single partition reside in at most one partition of the parent dataset. 

Examples include:

* filter(..)
* drop(..)
* coalesce()

For **wide transformations**, the data required to compute the records in a single partition may reside in many partitions of the parent dataset. 

Examples include:

* distinct()
* groupBy(..).agg()
* repartition(n)
* join()

Remember, Spark partitions are collections of rows that sit on physical machines in the cluster. Narrow transformations mean that work can be computed and reported back to the executor without changing the way data is partitioned over the system. Wide transformations require that data be redistributed over the system. This is called a shuffle.

Shuffles are triggered when data needs to move between executors. 
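
One way to tell the two kinds apart is to look at the physical plan: a shuffle appears there as an `Exchange` operator. The sketch below compares a `filter` with a `groupBy` on generated data. It inspects the plan text through `queryExecution.executedPlan`, whose exact format is an internal detail and may differ between Spark versions.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one already exists.
val spark = SparkSession.builder
    .appName("Narrow vs wide sketch")
    .master("local[*]")
    .getOrCreate()

val numbers = spark.range(0, 1000)

// Narrow: each output partition is computed from a single input partition.
val narrow = numbers.filter("id % 2 = 0")

// Wide: rows that share a key must be brought together first.
val wide = numbers.groupBy((numbers("id") % 3).as("key")).count()

// Physical plans rendered as text; a shuffle shows up as an Exchange operator.
val narrowPlan: String = narrow.queryExecution.executedPlan.toString
val widePlan: String = wide.queryExecution.executedPlan.toString

println(s"narrow plan has Exchange: ${narrowPlan.contains("Exchange")}")
println(s"wide plan has Exchange:   ${widePlan.contains("Exchange")}")
```

Calling `narrow.explain()` and `wide.explain()` prints the same plans in full.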



In [None]:
contracts_df.join(clients_df, Seq("id"), "outer").count()

### What is Shuffling?

A Shuffle refers to an operation where data is re-partitioned across a Cluster - i.e. when data needs to move between executors.

Joins and operations whose name ends with ByKey typically trigger a Shuffle. It is a costly operation because a lot of data can be sent via the network.

For example, to group by letters, it will serve us best if:

* All the A's are in one partition
* All the B's are in a second partition
* All the C's are in a third

From there we can easily sum/count/average all of the A's, B's, and C's.

```
+---------------------------+                 +---------------------------+
|          STAGE 1          |                 |          STAGE 2          |
| +-------+       +-------+ |                 | +-------+                 |
| |       | ----> |AAAAAAA| |                 | |AAAAAAA|       +-------+ |
| |       | ----> |AAAAAAA| |                 | |AAAAAAA| ----> |   4   | |
| |       | ----> |BBBBBBB| |                 | |AAAAAAA|       +-------+ |
| |       | ----> |CCCCCCC| |                 | |AAAAAAA|                 |
| +-------+       +-------+ |                 | +-------+                 |
| +-------+       +-------+ |                 | +-------+                 |
| |       | ----> |BBBBBBB| |         >       | |BBBBBBB|       +-------+ |
| |       | ----> |CCCCCCC| |   ------>>>     | |BBBBBBB| ----> |   4   | |
| |       | ----> |BBBBBBB| |   ------>>>>>   | |BBBBBBB|       +-------+ |
| |       | ----> |AAAAAAA| |   ------>>>     | |BBBBBBB|                 |
| +-------+       +-------+ |         >       | +-------+                 |
| +-------+       +-------+ |                 | +-------+                 |
| |       | ----> |BBBBBBB| |                 | |CCCCCCC|       +-------+ |
| |       | ----> |AAAAAAA| |                 | |CCCCCCC| ----> |   4   | |
| |       | ----> |CCCCCCC| |                 | |CCCCCCC|       +-------+ |
| |       | ----> |CCCCCCC| |                 | |CCCCCCC|                 |
| +-------+       +-------+ |                 | +-------+                 |
+---------------------------+                 +---------------------------+
             MAP                  SHUFFLE                REDUCE
```
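
The letter example in the diagram can be reproduced on a small scale. This is a sketch under a few assumptions: it uses an isolated session from `spark.newSession()` so the settings do not affect other cells, sets `spark.sql.shuffle.partitions` to 3, and switches Adaptive Query Execution off so that Spark does not merge the small shuffle partitions. The letter data is made up to match the diagram.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one already exists.
val spark = SparkSession.builder
    .appName("Shuffle sketch")
    .master("local[*]")
    .getOrCreate()

// Isolated SQL configuration, so the notebook's own session is left untouched.
val session = spark.newSession()
session.conf.set("spark.sql.shuffle.partitions", "3")   // partitions after a shuffle
session.conf.set("spark.sql.adaptive.enabled", "false") // keep the 3 partitions as they are

import session.implicits._

// Letters scattered over 3 input partitions, like the MAP side of the diagram.
val letters = Seq("A", "A", "B", "C", "B", "C", "B", "A", "B", "A", "C", "C")
    .toDF("letter")
    .repartition(3)

// groupBy needs all rows of one letter in the same place, so Spark shuffles by letter.
val counts = letters.groupBy("letter").count()

val shufflePartitions: Int = counts.rdd.getNumPartitions
val result: Map[String, Long] =
    counts.collect().map(row => row.getString(0) -> row.getLong(1)).toMap

println(s"partitions after shuffle: $shufflePartitions, counts: $result")
```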

### What is a Job?

A **Job** is a sequence of stages, triggered by an action such as count(), collect(), show() or write().

* Each parallelized action is referred to as a Job.
* The results of each Job (parallelized/distributed action) are returned to the Driver from the Executors.
* Depending on the work required, multiple Jobs may be required.

### What is a Stage?
Each job is divided into smaller sets of tasks called **stages**.

A **Stage** is a sequence of Tasks that can all be run together - i.e. in parallel - without a shuffle. For example: using "**.read**" to read a file from disk, then running "**.filter**" can be done without a shuffle, so it can fit in a single stage. The number of Tasks in a Stage also depends upon the number of Partitions your datasets have.
```
                                           +-------+
                                           | Stage |---+
                                           +-------+ge |---+
+--------+                  +-----+            +-------+ge |---+
| Action | ---> Submit ---> | Job | ------------>  +-------+ge |---+
+--------+                  +-----+                    +-------+ge |
                                                           +-------+
```

### What is a Task?

A task is the smallest unit of work that is sent to the executor. Each stage has some tasks, one task per partition. The same task is done over different partitions of the RDD.

```
+-------------------+                                               +-------------------+
|        RDD        |                                               |       Stage       |-+
| +---------------+ |                                               | +---------------+ | |-+
| |  PARTITION 1  | |                                               | |     Task      | | | |
| +---------------+ |                                               | +---------------+ | | |
| +---------------+ |                                               | +---------------+ | | |
| |  PARTITION 2  | |                                               | |     Task      | | | |
| +---------------+ |                                               | +---------------+ | | |
| +---------------+ |      +--------+                  +-----+      | +---------------+ | | |
| |  PARTITION 3  | | ---> | Action | ---> Submit ---> | Job | ---> | |     Task      | | | |
| +---------------+ |      +--------+                  +-----+      | +---------------+ | | |
| +---------------+ |                                               | +---------------+ | | |
| |  PARTITION 4  | |                                               | |     Task      | | | |
| +---------------+ |                                               | +---------------+ | | |
|        ...        |                                               |        ...        | | |
| +---------------+ |                                               | +---------------+ | | |
| | PARTITION 15  | |                                               | |     Task      | | | |
| +---------------+ |                                               | +---------------+ | | |
+-------------------+                                               +-------------------+ | |
                                                                      +-------------------+ |
                                                                        +-------------------+
```

In the diagram above, each **Partition** of the RDD is processed by one **Task** of the **Stage**.



### Spark app execution

* An action triggers a Job

* A job is split into stages
    * each stage is dependent on the stage before it
    * a stage must fully complete before the next stage can start
    * for performance (usually) minimize the number of stages

* A stage has tasks
    * task = smallest unit of work
    * tasks are run by executors

* An RDD/DF/DS has partitions
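
The hierarchy above can be observed at runtime through the SparkContext's status tracker. The sketch below tags one action with a job group and then walks from jobs to stages to task counts. The group name `anatomy-demo` and the sample computation are made up for illustration, and because the tracker is updated asynchronously the summary may occasionally be incomplete right after the action returns.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one already exists.
val spark = SparkSession.builder
    .appName("Job anatomy sketch")
    .master("local[*]")
    .getOrCreate()
val sc = spark.sparkContext

// Tag the jobs started from this thread so they can be looked up afterwards.
sc.setJobGroup("anatomy-demo", "job/stage/task sketch")

// One action (collect) over a lineage that contains one shuffle (reduceByKey).
val totals: Map[Int, Int] = sc.parallelize(1 to 1000, 4)
    .map(n => (n % 3, n))
    .reduceByKey(_ + _)
    .collect()
    .toMap

sc.clearJobGroup()

// Walk job -> stages -> number of tasks using the status tracker.
val tracker = sc.statusTracker
val summary: Seq[(Int, Int, Int)] = for {
    jobId   <- tracker.getJobIdsForGroup("anatomy-demo").toSeq
    job     <- tracker.getJobInfo(jobId).toSeq
    stageId <- job.stageIds().toSeq
    stage   <- tracker.getStageInfo(stageId).toSeq
} yield (jobId, stageId, stage.numTasks())

summary.foreach { case (jobId, stageId, tasks) =>
    println(s"job $jobId -> stage $stageId -> $tasks tasks")
}
```

The same information is shown graphically in the Spark UI, whose address is given by `spark.sparkContext.uiWebUrl`.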

### Concepts Relationship
App decomposition
* 1 job = 1 or more stages
* 1 stage = 1 or more tasks

Tasks & executors
* 1 task is run by 1 executor
* each executor can run 0 or more tasks

Partitions and tasks
* processing 1 partition = one task

Partitions & executors
* 1 partition stays on 1 executor
* each executor can load 0 or more partitions in memory or disk

Executors & nodes
* 1 executor = 1 JVM on 1 physical node
* each physical node can have 0 or more executors

### Catalyst Optimizer

1. The Catalyst Optimizer and the Tungsten Execution Engine were introduced in Spark 1.x
2. The Cost-Based Optimizer was introduced in Spark 2.x
3. Adaptive Query Execution was introduced in Spark 3.x

To disable Adaptive Query Execution -> spark.conf.set("spark.sql.adaptive.enabled", false)

The Catalyst Optimizer only works with DataFrames and Datasets, not with RDDs.

![title](../../resources/img/Catalist_Optimizer.PNG)

* Parser: The query is parsed to create an abstract syntax tree (AST), representing the logical structure of the query.
* Analyzer: It performs semantic analysis on the AST. It resolves table and column names, checks for syntax errors, and applies type checking to ensure the query is valid.
* Optimizer: It applies a set of logical optimizations, such as simplifying boolean expressions and pushing filters down to data sources.
* Planner: It generates various alternative physical plans based on the logical plan. It explores different execution strategies. It leverages statistics and cost models to estimate the cost of each physical plan. It considers factors like data distribution, network latency, disk I/O, CPU usage, and memory consumption. Using these cost estimates, it selects the physical plan with the lowest estimated cost.
* Query execution: Once the physical plan is selected, the Catalyst Optimizer generates efficient Java bytecode or optimized Spark SQL code for executing the query.

In [None]:
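// Two ranges: 1 to 99999999, and the odd numbers in the same interval.
val ds1 = spark.range(1, 100000000)
val ds2 = spark.range(1, 100000000, 2)
// repartition(n) forces a shuffle into n partitions.
val ds3 = ds1.repartition(7)
val ds4 = ds2.repartition(9)
val ds5 = ds3.selectExpr("id * 3 as id")
// Inner join on id, keep the odd ids, then aggregate.
val joined = ds5.join(ds4, "id")
val filtered = joined.filter("id%2 == 1")
val sum = filtered.selectExpr("sum(id)")
// explain(true) prints the parsed, analyzed and optimized logical plans plus the physical plan.
sum.explain(true)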
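
The plans produced by the phases listed above can also be inspected one at a time through `queryExecution`. The sketch below uses a small, made-up query with two consecutive filters. On the Spark versions this was written against, the optimizer is expected to merge them into one, but the plan text is an internal detail and may vary between versions. The helper `countFilters` is hypothetical and only counts occurrences of the word "Filter" in a plan's text.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one already exists.
val spark = SparkSession.builder
    .appName("Catalyst sketch")
    .master("local[*]")
    .getOrCreate()

// A query with two consecutive filters followed by a projection.
val query = spark.range(0, 1000)
    .filter("id > 10")
    .filter("id % 2 = 0")
    .selectExpr("id * 3 as tripled")

val qe = query.queryExecution
println(qe.logical)        // logical plan as built from the query
println(qe.analyzed)       // after the Analyzer has resolved names and types
println(qe.optimizedPlan)  // after the logical optimizations
println(qe.executedPlan)   // physical plan selected for execution

// Hypothetical helper: count the Filter operators mentioned in a plan's text.
def countFilters(plan: String): Int = "Filter".r.findAllIn(plan).length

val filtersBefore: Int = countFilters(qe.analyzed.toString)
val filtersAfter: Int = countFilters(qe.optimizedPlan.toString)
println(s"filters before optimization: $filtersBefore, after: $filtersAfter")
```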