- #### Architecture
- #### Execution flow
- #### Transformations vs Actions
- #### Catalyst Optimizer
- #### DAG

## Spark architecture

![Spark_architecture](../../resources/img/SPARK.png)

#### Driver program
The process running the main() function of the application and creating the SparkContext. **Starts the application and sends work to the executors.**

#### Cluster manager
An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN, Kubernetes)

   - **Standalone** – a simple cluster manager included with Spark that makes it easy to set up a cluster.
   - **Apache Mesos** – a general cluster manager that can also run Hadoop MapReduce and service applications. (Deprecated since Spark 3.2)
   - **Hadoop YARN** – the resource manager in Hadoop 2 and 3.
   - **Kubernetes** – an open-source system for automating deployment, scaling, and management of containerized applications.

#### Worker node	
Any node that can run application code in the cluster

#### Executor
A process launched for an application on a worker node that runs tasks and keeps data in memory or on disk across them. Each application has its own executors. **Each executor is a JVM process with its own memory and CPU resources.**

#### Deploy mode	
Distinguishes where the driver process runs. 

- **Cluster mode** - the Spark driver is launched on a worker node
- **Client mode** - the Spark driver is on the client machine
- **Local mode** - the entire application runs on the same machine.
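
The master and deploy mode of a running application can be read back from the `SparkContext`. The following is a minimal sketch, assuming a local session; the application name `deploy-mode-sketch` is a hypothetical placeholder.

```scala
import org.apache.spark.sql.SparkSession

// Local mode: the driver and the executor threads share this single JVM.
// "deploy-mode-sketch" is a hypothetical application name.
val spark = SparkSession.builder()
        .appName("deploy-mode-sketch")
        .master("local[*]")
        .getOrCreate()
val sc = spark.sparkContext

val master: String = sc.master          // master URL the context was created with
val deployMode: String = sc.deployMode  // value of spark.submit.deployMode ("client" unless overridden)
val isLocal: Boolean = sc.isLocal       // true when the master is a local[...] URL

println(s"master=$master, deployMode=$deployMode, isLocal=$isLocal")
```

On a real cluster the deploy mode is usually chosen at submission time (for example with the `--deploy-mode` option of `spark-submit`) rather than in application code.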

## Spark app execution

- An action triggers a **Job**
  
- A job is split into **stages**
    - a stage depends on the stage(s) that produce its input
    - a stage must fully complete before a stage that depends on it can start
    - for performance, (usually) minimize the number of stages, since each stage boundary implies a shuffle

- A stage has **tasks**
    - task = smallest unit of work
    - tasks are run by executors

- **An RDD/DF/DS has partitions**

In [1]:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
        .appName("sesion_2")
        .master("local[*]")
        .getOrCreate()

val sc = spark.sparkContext

sc = org.apache.spark.SparkContext@d36df2f


spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@52207dc


org.apache.spark.SparkContext@d36df2f
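
To make the point about partitions concrete, here is a small sketch that inspects how many partitions an RDD and a Dataset have. The sizes (100 rows, 4 partitions) are arbitrary choices for illustration, and the application name is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// getOrCreate() reuses an already running session if there is one
val spark = SparkSession.builder()
        .appName("partitions-sketch") // hypothetical name
        .master("local[*]")
        .getOrCreate()
val sc = spark.sparkContext

// RDD created with an explicit number of partitions
val numbers = sc.parallelize(1 to 100, 4)
val rddPartitions: Int = numbers.getNumPartitions

// glom() turns each partition into an array, exposing how rows are distributed
val partitionSizes: Array[Int] = numbers.glom().map(_.length).collect()

// Dataset created with an explicit number of partitions; .rdd exposes the count
val dsPartitions: Int = spark.range(0, 100, 1, 4).rdd.getNumPartitions

println(s"RDD partitions: $rddPartitions, sizes: ${partitionSizes.mkString(",")}")
println(s"Dataset partitions: $dsPartitions")
```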

In [2]:
import org.apache.spark.rdd.RDD

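// Stage 1: only narrow transformations (read, split, project), so no shuffle is needed
val employees: RDD[String] = sc.textFile("../../resources/data/csv/employees.csv")
val empTokens: RDD[Array[String]] = employees.map(line => line.split(","))
// Keep column 4 as the grouping key and column 7 as the salary
val empDetails: RDD[(String, Float)] = empTokens.map(tokens => (tokens(4), tokens(7).toFloat))
// Stage 2: groupByKey shuffles the records by key into 2 partitions, which starts a new stage
val empGroups: RDD[(String, Iterable[Float])] = empDetails.groupByKey(2)
val avgSalaries: RDD[(String, Float)] = empGroups.mapValues(salaries => salaries.sum / salaries.size)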

avgSalaries = MapPartitionsRDD[5] at mapValues at <console>:33


employees: org.apache.spark.rdd.RDD[String] = ../../resources/data/csv/employees.csv MapPartitionsRDD[1] at textFile at <console>:28
empTokens: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at <console>:29
empDetails: org.apache.spark.rdd.RDD[(String, Float)] = MapPartitionsRDD[3] at map at <console>:30
empGroups: org.apache.spark.rdd.RDD[(String, Iterable[Float])] = ShuffledRDD[4] at groupByKey at <console>:32


MapPartitionsRDD[5] at mapValues at <console>:33

In [3]:
avgSalaries

MapPartitionsRDD[5] at mapValues at <console>:33

#### **TRANSFORMATIONS ARE LAZY: THEY ARE ONLY EXECUTED WHEN AN ACTION IS TRIGGERED**

In [4]:
avgSalaries.collect().foreach(print)

(105,67491.2)(103,75527.37)

![Image](../../resources/img/Stages_tasks.PNG)
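
The stage boundary can also be inspected in text form with `toDebugString`, which prints the lineage of an RDD without running a job. This is a sketch that uses a hypothetical in-memory stand-in for the (key, salary) pairs instead of the CSV file.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
        .appName("lineage-sketch") // hypothetical name
        .master("local[*]")
        .getOrCreate()
val sc = spark.sparkContext

// Hypothetical sample data standing in for the (key, salary) pairs
val pairs: RDD[(String, Float)] =
  sc.parallelize(Seq(("105", 60000f), ("105", 70000f), ("103", 75000f)), 2)
val averages: RDD[(String, Float)] =
  pairs.groupByKey(2).mapValues(salaries => salaries.sum / salaries.size)

// The lineage lists every RDD in the chain; the indentation step marks the shuffle boundary
val lineage: String = averages.toDebugString
println(lineage)
```

The `ShuffledRDD` entry in the printed lineage is where the first stage ends and the second one begins.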

#### Shuffle

- Exchange of data between executors
- happens in between stages
- must complete before next stage starts

- Expensive because of
    - transferring data
    - serialization/deserialization
    - writing shuffle files to disk and reading them back
- Shuffles are performance bottlenecks because
    - exchanging data takes time
    - they need to be fully completed before next computations start
- Shuffles limit parallelization
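
One common way to lower the cost of a shuffle, which the notes above do not cover, is to pre-aggregate inside each partition before data is exchanged. The sketch below compares `groupByKey` with `reduceByKey` on hypothetical sample data; both produce the same averages, but `reduceByKey` combines values on the map side first.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
        .appName("shuffle-sketch") // hypothetical name
        .master("local[*]")
        .getOrCreate()
val sc = spark.sparkContext

// Hypothetical (key, salary) sample data
val salaries: RDD[(String, Double)] =
  sc.parallelize(Seq(("105", 60000.0), ("105", 70000.0), ("103", 75000.0)), 2)

// groupByKey sends every individual value through the shuffle before averaging
val viaGroup: Map[String, Double] =
  salaries.groupByKey().mapValues(v => v.sum / v.size).collect().toMap

// reduceByKey merges (sum, count) pairs inside each partition first,
// so at most one pair per key and partition goes through the shuffle
val viaReduce: Map[String, Double] = salaries
        .mapValues(salary => (salary, 1))
        .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
        .mapValues { case (total, count) => total / count }
        .collect().toMap

println(viaGroup)
println(viaReduce)
```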

  

#### Concepts Relationship
App decomposition
- 1 job = 1 or more stages
- 1 stage = 1 or more tasks

Tasks & executors
- 1 task is run by 1 executor
- each executor can run 0 or more tasks

Partitions and tasks
- processing 1 partition = one task

Partitions & executors
- 1 partition stays on 1 executor
- each executor can load 0 or more partitions in memory or disk

Executors & nodes
- 1 executor = 1 JVM on 1 physical node
- each physical node can have 0 or more executors
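
The "1 partition = 1 task" relationship can be observed from inside a task. This is a sketch assuming a local session and an arbitrary RDD of 100 numbers in 4 partitions; `TaskContext.getPartitionId()` reports which partition the current task is processing.

```scala
import org.apache.spark.TaskContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
        .appName("tasks-sketch") // hypothetical name
        .master("local[*]")
        .getOrCreate()
val sc = spark.sparkContext

val data = sc.parallelize(1 to 100, 4)

// The function passed to mapPartitions runs once per partition, i.e. once per task
val perTask: Array[(Int, Int)] = data.mapPartitions { rows =>
  Iterator((TaskContext.getPartitionId(), rows.size))
}.collect()

perTask.foreach { case (partitionId, rowCount) =>
  println(s"partition $partitionId -> $rowCount rows")
}
```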

## Transformations vs Actions

- Transformations describe how new RDDs/DFs are derived from existing ones; they are lazy
- Actions trigger the actual execution of Spark code
- Transformations return RDDs/DFs
- Actions return something else, e.g. Unit, a number, or a local collection
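
Laziness can be made visible with an accumulator that counts how many rows a transformation has actually processed. This is a sketch; the accumulator name is hypothetical, and in general accumulators updated inside transformations may be incremented more than once if a task is retried.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
        .appName("lazy-sketch") // hypothetical name
        .master("local[*]")
        .getOrCreate()
val sc = spark.sparkContext

val evaluated = sc.longAccumulator("evaluated rows") // hypothetical accumulator name

// Transformation: only recorded in the lineage, nothing runs yet
val doubled = sc.parallelize(1 to 10, 2).map { x =>
  evaluated.add(1)
  x * 2
}
val beforeAction: Long = evaluated.sum

// Action: triggers a job that runs the map function on every row
val total: Int = doubled.reduce(_ + _)
val afterAction: Long = evaluated.sum

println(s"rows evaluated before action: $beforeAction, after action: $afterAction, total: $total")
```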

#### Narrow Transformations

- each parent partition is used by at most one child partition
- fast to compute
- examples: union, map, flatMap, select, filter

#### Wide transformations

- a parent partition can be used by more than one child partition
- involve a shuffle = data transfer between Spark executors
- are costly to compute
- examples: joining, grouping, sorting
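
At the RDD level the difference between narrow and wide transformations shows up in the dependency type of the resulting RDD. The sketch below, on hypothetical sample data, inspects `dependencies` for a `mapValues` (narrow) and a `groupByKey` (wide).

```scala
import org.apache.spark.{NarrowDependency, ShuffleDependency}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
        .appName("dependencies-sketch") // hypothetical name
        .master("local[*]")
        .getOrCreate()
val sc = spark.sparkContext

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)

val mapped = pairs.mapValues(_ + 1) // narrow: each child partition reads one parent partition
val grouped = pairs.groupByKey()    // wide: each child partition may read from every parent partition

val narrowDep = mapped.dependencies.head
val wideDep = grouped.dependencies.head

println(s"mapValues  -> ${narrowDep.getClass.getSimpleName}")
println(s"groupByKey -> ${wideDep.getClass.getSimpleName}")
```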

## Catalyst Optimizer

![Image](../../resources/img/Catalist_Optimizer.png)

1. The Catalyst Optimizer and the Tungsten execution engine were introduced in Spark 1.x
2. The Cost-Based Optimizer was introduced in Spark 2.x
3. Adaptive Query Execution was introduced in Spark 3.x

    - To disable Adaptive Query Execution -> `spark.conf.set("spark.sql.adaptive.enabled", false)`

Catalyst only optimizes DataFrame (DF) and Dataset (DS) queries; plain RDD code is not optimized
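
As a small illustration of what the optimizer does, the sketch below compares the analyzed and the optimized logical plan of a query with two consecutive filters. It relies on `queryExecution` and on Catalyst's internal `Filter` plan node, which are developer-facing APIs and may change between Spark versions; the helper `countFilters` is hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}

val spark = SparkSession.builder()
        .appName("catalyst-sketch") // hypothetical name
        .master("local[*]")
        .getOrCreate()

// Two separate filters written by the user
val filtered = spark.range(0, 100).filter("id > 10").filter("id < 20")

// Hypothetical helper: counts the Filter nodes in a logical plan
def countFilters(plan: LogicalPlan): Int = plan.collect { case f: Filter => f }.size

val analyzedFilters: Int = countFilters(filtered.queryExecution.analyzed)
val optimizedFilters: Int = countFilters(filtered.queryExecution.optimizedPlan)
val matchingRows: Long = filtered.count()

println(s"Filter nodes before optimization: $analyzedFilters, after: $optimizedFilters")
// explain(true) prints the parsed, analyzed, optimized and physical plans
filtered.explain(true)
```

The optimized plan is expected to contain fewer `Filter` nodes than the analyzed one, because Catalyst merges adjacent filters into a single predicate.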

In [5]:
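import org.apache.spark.sql.{DataFrame, Dataset}
import java.lang

// Two ranges of ids: every integer and every odd integer below 100,000,000
val ds1: Dataset[lang.Long] = spark.range(1, 100000000)
val ds2: Dataset[lang.Long] = spark.range(1, 100000000, 2)
// repartition() forces a round-robin shuffle into the requested number of partitions
val ds3: Dataset[lang.Long] = ds1.repartition(7)
val ds4: Dataset[lang.Long] = ds2.repartition(9)
// Narrow transformation: multiply every id by 3
val ds5: DataFrame = ds3.selectExpr("id * 3 as id")
// The join shuffles both sides by id (hash partitioning) before merging them
val joined: DataFrame = ds5.join(ds4, "id")
// Global aggregation: partial sums per partition, then a final sum in a single partition
val sum: DataFrame = joined.selectExpr("sum(id)")
// Print the physical plan without running the query
sum.explain()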

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[sum(id#8L)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=44]
      +- HashAggregate(keys=[], functions=[partial_sum(id#8L)])
         +- Project [id#8L]
            +- SortMergeJoin [id#8L], [id#2L], Inner
               :- Sort [id#8L ASC NULLS FIRST], false, 0
               :  +- Exchange hashpartitioning(id#8L, 200), ENSURE_REQUIREMENTS, [plan_id=36]
               :     +- Project [(id#0L * 3) AS id#8L]
               :        +- Exchange RoundRobinPartitioning(7), REPARTITION_BY_NUM, [plan_id=26]
               :           +- Range (1, 100000000, step=1, splits=8)
               +- Sort [id#2L ASC NULLS FIRST], false, 0
                  +- Exchange hashpartitioning(id#2L, 200), ENSURE_REQUIREMENTS, [plan_id=37]
                     +- Exchange RoundRobinPartitioning(9), REPARTITION_BY_NUM, [plan_id=29]
                        +- Range (1, 100000000, step=2, splits=8)




sum = [sum(id): bigint]


ds1: org.apache.spark.sql.Dataset[Long] = [id: bigint]
ds2: org.apache.spark.sql.Dataset[Long] = [id: bigint]
ds3: org.apache.spark.sql.Dataset[Long] = [id: bigint]
ds4: org.apache.spark.sql.Dataset[Long] = [id: bigint]
ds5: org.apache.spark.sql.DataFrame = [id: bigint]
joined: org.apache.spark.sql.DataFrame = [id: bigint]


[sum(id): bigint]
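
The `AdaptiveSparkPlan` node at the top of the plan above comes from Adaptive Query Execution. The following sketch toggles `spark.sql.adaptive.enabled` and compares the physical plans of a small query that contains a shuffle; the query itself and the helper `physicalPlan` are hypothetical, and the previous setting is restored at the end.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
        .appName("aqe-sketch") // hypothetical name
        .master("local[*]")
        .getOrCreate()

// Remember the current setting so it can be restored afterwards
val previous: String = spark.conf.get("spark.sql.adaptive.enabled", "true")

// Hypothetical helper: plans a small query with a shuffle and returns its physical plan as text
def physicalPlan(): String =
  spark.range(0, 1000).repartition(4).groupBy("id").count()
        .queryExecution.executedPlan.toString

spark.conf.set("spark.sql.adaptive.enabled", true)
val planWithAqe: String = physicalPlan()

spark.conf.set("spark.sql.adaptive.enabled", false)
val planWithoutAqe: String = physicalPlan()

spark.conf.set("spark.sql.adaptive.enabled", previous)

println(planWithAqe)
println(planWithoutAqe)
```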

## DAG

DAG = Directed Acyclic Graph: the graph of operations and stages that Spark builds for each job. The Spark UI shows a visual representation of it.


In [6]:
sc.uiWebUrl

Some(http://23LAP5CD20860NP.indra.es:4041)

In [None]:
sum.count() // Call an action to trigger transformations

![Image](../../resources/img/DAG.png)
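
When several jobs run in the same application, it can help to label a job before triggering it so that it is easier to locate in the Spark UI. This is a sketch with a hypothetical label and a small stand-in query.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
        .appName("dag-sketch") // hypothetical name
        .master("local[*]")
        .getOrCreate()
val sc = spark.sparkContext

// Hypothetical label attached to the jobs started from this thread
sc.setJobDescription("sum of ids 1 to 999")
val jobLabel: String = sc.getLocalProperty("spark.job.description")

// Action: triggers the job whose DAG can then be opened in the UI
val total: Long = spark.range(1, 1000).selectExpr("sum(id)").first().getLong(0)

// Clear the label so that later jobs are not tagged with it
sc.setJobDescription(null)

println(s"Spark UI: ${sc.uiWebUrl.getOrElse("UI disabled")}, total = $total")
```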