## 1. Getting Started with Spark Internals

In this series, we investigate the internals of Spark. We will analyze physical plans, explore basic concepts such as ``partition``, and try to improve the performance of our queries. At Invent we have been working with Spark for several years, yet we still seem to lack a deeper understanding of it. Why? I can think of two reasons:

- Spark internals are not well documented.

- Spark is written in Scala. Most of us have never written a single line of Scala.

So we often rely on third-party tutorials, which become outdated quickly or do not cover the material thoroughly. This project is an attempt to fill that gap. You are encouraged to run the code in the notebooks.

**Pre-requisites:**

- ``pyspark`` installed
- A working knowledge of writing queries in Spark

In [48]:
from pyspark.sql import SparkSession
from pyspark.sql import Window, functions as F

spark = SparkSession.builder.config("spark.sql.shuffle.partitions", 16).getOrCreate()
spark.sparkContext.getConf().get("spark.sql.shuffle.partitions")

'16'

### Physical plans

When we write a query in Spark, the result is a computation graph. Computation is performed only when we call an action method. This lazy evaluation model is well explained in many other places, so we won't go into details here.


We use ``DataFrame.explain`` to see the physical plan that will be used to compute the result of a query. This is basically a computation graph.

In [53]:
df = spark.createDataFrame(
    [
        (1, 2),
        (1, 2),
        (2, 4),
        (2, 4),
        (3, 6),
    ],
    ["col1", "col2"]
)

**Basics of physical plans:**

In [54]:
df.agg(F.sum("col1")).explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[sum(col1#238L)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#490]
      +- HashAggregate(keys=[], functions=[partial_sum(col1#238L)])
         +- Project [col1#238L]
            +- Scan ExistingRDD[col1#238L,col2#239L]





- We read physical plans from bottom to top.
- The step at the bottom shows how the dataframe is created. In this case, we are creating a dataframe from a collection that resides in the driver memory.

See:

       +- Scan ExistingRDD[col1#238L,col2#239L]

The step just above the scan is ``Project``. This step ``select``s only the columns our query needs, which saves memory. In this case we only need ``col1``.

See:

    +- Project [col1#238L]
   
You can ignore other steps in the physical plan for now. We will investigate them further in the following notebooks.

Also note that this is not the final plan. Spark can change this plan at runtime once it has more information about the data. See:

    AdaptiveSparkPlan isFinalPlan=false
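
To make the bottom-to-top reading order concrete, here is a minimal sketch that takes the plan text shown above and lists the operators in the order they would run. The helper name ``execution_order`` is made up for this illustration, and the approach only works for linear plans like this one (no joins or unions, which produce a branching tree).

```python
import re

# Plan text copied from the df.agg(F.sum("col1")).explain() output above.
PLAN = """\
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[sum(col1#238L)])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#490]
      +- HashAggregate(keys=[], functions=[partial_sum(col1#238L)])
         +- Project [col1#238L]
            +- Scan ExistingRDD[col1#238L,col2#239L]
"""


def execution_order(plan_text):
    """Return the first word of each operator, bottom of the plan first.

    Hypothetical helper: assumes a linear plan with one operator per line.
    """
    steps = []
    for line in plan_text.splitlines():
        # Strip the tree-drawing prefix ("+- ", ":- ", indentation) and
        # whole-stage codegen markers such as "*(1) ".
        node = re.sub(r"^[\s:+\-|]*(\*\(\d+\)\s*)?", "", line)
        name = re.match(r"[A-Za-z]+", node)
        if name:  # skips blank lines and the "== Physical Plan ==" header
            steps.append(name.group(0))
    return list(reversed(steps))


for position, step in enumerate(execution_order(PLAN), start=1):
    print(position, step)
```

The first entry is the ``Scan`` and the second is the ``Project``, which matches the order in which we walked through the plan above.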


We often read parquet datasets. In that case we would see a ``FileScan`` step at the bottom of the plan instead of ``Scan ExistingRDD``.

In [55]:
df.repartition("col1").write.partitionBy("col1").parquet("tmp-demo", mode="overwrite")
df = spark.read.parquet("tmp-demo")
df.explain()

== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet [col2#251L,col1#252] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/workspaces/rocks/Untitled Folder/tmp-demo], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col2:bigint>




**PartitionFilters and PushedFilters**

One optimization when reading parquet is ``PartitionFilters``. In the physical plan above, the list is empty. In the following case, we filter the dataset by ``col1``. Since the parquet is partitioned by ``col1``, ``FileScan`` now reads only the required partitions.

See:

         PartitionFilters: [isnotnull(col1#96), (col1#96 = 1)]

Example:

In [37]:
spark.read.parquet("tmp-demo").filter(F.col("col1") == 1).explain()

== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet [col2#95L,col1#96] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/workspaces/rocks/Untitled Folder/tmp-demo], PartitionFilters: [isnotnull(col1#96), (col1#96 = 1)], PushedFilters: [], ReadSchema: struct<col2:bigint>




If we instead filter by ``col2``, we cannot get this optimization because the data is not partitioned by ``col2``. But we get something similar: ``PushedFilters``.

See:

    PushedFilters: [IsNotNull(col2), EqualTo(col2,1)], ReadSchema: struct<col2:bigint>
    
Example:

In [56]:
spark.read.parquet("tmp-demo").filter(F.col("col2") == 1).explain()

== Physical Plan ==
*(1) Filter (isnotnull(col2#255L) AND (col2#255L = 1))
+- *(1) ColumnarToRow
   +- FileScan parquet [col2#255L,col1#256] Batched: true, DataFilters: [isnotnull(col2#255L), (col2#255L = 1)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/workspaces/rocks/Untitled Folder/tmp-demo], PartitionFilters: [], PushedFilters: [IsNotNull(col2), EqualTo(col2,1)], ReadSchema: struct<col2:bigint>




The difference is that ``PartitionFilters`` let us skip unwanted partitions completely. With ``PushedFilters``, the filter is handed to the data source itself: every partition is still scanned, but the source can avoid loading data that does not satisfy the filter conditions into the Spark executors. Note that the plan above still has a ``Filter`` step on top of the scan, which applies the condition to whatever rows the scan returns. In both cases we save a lot of memory, but ``PartitionFilters`` is faster.
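
When plans get long, it can help to pull these two lists out of the ``FileScan`` line instead of hunting for them by eye. The following is a small sketch that does this with a regular expression on the two scan lines shown above. The helper name ``scan_filters`` is made up for this illustration, and it assumes the filter lists contain no nested square brackets, which holds for the plans in this notebook.

```python
import re

# FileScan lines copied from the two plans above.
SCAN_BY_COL1 = (
    "+- FileScan parquet [col2#95L,col1#96] Batched: true, DataFilters: [], "
    "Format: Parquet, Location: InMemoryFileIndex(1 paths)"
    "[file:/workspaces/rocks/Untitled Folder/tmp-demo], "
    "PartitionFilters: [isnotnull(col1#96), (col1#96 = 1)], "
    "PushedFilters: [], ReadSchema: struct<col2:bigint>"
)
SCAN_BY_COL2 = (
    "+- FileScan parquet [col2#255L,col1#256] Batched: true, "
    "DataFilters: [isnotnull(col2#255L), (col2#255L = 1)], "
    "Format: Parquet, Location: InMemoryFileIndex(1 paths)"
    "[file:/workspaces/rocks/Untitled Folder/tmp-demo], "
    "PartitionFilters: [], "
    "PushedFilters: [IsNotNull(col2), EqualTo(col2,1)], ReadSchema: struct<col2:bigint>"
)


def scan_filters(scan_line):
    """Return the raw contents of PartitionFilters and PushedFilters.

    Hypothetical helper: an empty string means the list was empty,
    None means the key did not appear in the line at all.
    """
    found = {}
    for key in ("PartitionFilters", "PushedFilters"):
        match = re.search(key + r": \[(.*?)\]", scan_line)
        found[key] = match.group(1) if match else None
    return found


print(scan_filters(SCAN_BY_COL1))
print(scan_filters(SCAN_BY_COL2))
```

For the ``col1`` filter only ``PartitionFilters`` is populated, and for the ``col2`` filter only ``PushedFilters`` is, which is the distinction described above.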

**Query Optimization**

Even if we specify the filter later in the query, Spark understands that ``PartitionFilters`` or ``PushedFilters`` can be employed. This is due to the ``Catalyst Optimizer`` in Spark. Spark takes our query and optimizes it as much as it can. Spark can also optimize the query further at runtime with ``Adaptive Query Execution``.

Example:

In [60]:
# Complicated query with a partition filter at the end.
df = spark.read.parquet("tmp-demo")
df = (
    df
    .withColumn("foo", F.col("col1") * 1000)
    .withColumn("bar", F.col("col2") * 1000)
    .filter(F.col("col1") == 1)
)
df.explain()

== Physical Plan ==
*(1) Project [col2#286L, col1#287, (col1#287 * 1000) AS foo#290, (col2#286L * 1000) AS bar#294L]
+- *(1) ColumnarToRow
   +- FileScan parquet [col2#286L,col1#287] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/workspaces/rocks/Untitled Folder/tmp-demo], PartitionFilters: [isnotnull(col1#287), (col1#287 = 1)], PushedFilters: [], ReadSchema: struct<col2:bigint>




There are two stages of query optimization:

**1. Catalyst Optimizer**: When we call ``explain``, we see the query optimized by Catalyst.

**2. Adaptive Query Execution**: Further optimizes the query at runtime. Covered well by Databricks. See: https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html


**Key Takeaways:**

- Read the physical plan from bottom to top.

- ``PartitionFilters`` and ``PushedFilters`` save us a lot of memory.

- Spark optimizes our query to save memory and computation.