# Spark session optimization



In [1]:
from pyspark.sql import SparkSession
from pyspark import SparkConf

## 1. Understand Spark memory management

As explained earlier, Spark can run in several modes:
- local mode
- standalone
- yarn
- k8s

Memory management differs considerably from one mode to another.

### 1.1 Local mode

In local mode, all Spark components (driver and executors) run inside **a single JVM process** on your machine, unlike YARN or Kubernetes, where resources are containerized and limited.
A Spark driver or executor uses two kinds of memory:
- JVM heap memory: stores Java objects and small buffers, and is managed by the garbage collector
- Off-heap memory: stores JVM metaspace, thread stacks, off-heap buffers (if enabled), native libraries, and Python/R processes in PySpark/SparkR

```text
+------------------+        +-------------------+
| JVM Heap Memory  |        | Off-Heap Memory   |
|  - Objects       |        |  - Shuffle pages  |
|  - Small buffers |        |  - Column batches |
| GC managed       |        | Manual free()     |
+------------------+        +-------------------+
        ↑                            ↑
     Garbage                       Tungsten
     Collector                     MemoryMgr
```

#### 1.1.1 OffHeap memory size

By default, Spark sets:
- `spark.driver.memoryOverhead` to `max(384 MB, 0.10 * spark.driver.memory)`
- `spark.executor.memoryOverhead` to `max(384 MB, 0.10 * spark.executor.memory)`

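For example, with `spark.driver.memory = 16g`, the default overhead is `max(384 MB, 1.6 GB)`, which is about 1.6 GB.

Note that cluster managers such as YARN and Kubernetes use these settings to size containers. In local mode nothing enforces them, so they act only as a budget for non-heap usage.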
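
The default overhead rule can be checked with a few lines of Python. This is a minimal sketch of the formula quoted above, not Spark's own code; the function name `default_memory_overhead_mb` is made up for illustration, and sizes are expressed in MiB.

```python
def default_memory_overhead_mb(heap_mb: int, factor: float = 0.10, floor_mb: int = 384) -> int:
    """Default overhead in MiB: the larger of the 384 MiB floor and factor * heap."""
    return max(floor_mb, int(heap_mb * factor))


print(default_memory_overhead_mb(16 * 1024))  # 16g heap -> 1638 (about 1.6 GB)
print(default_memory_overhead_mb(2 * 1024))   # 2g heap  -> 384 (the floor applies)
```

For small heaps the 384 MB floor dominates, so the 10% rule only matters once the heap exceeds roughly 3.75 GB.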

#### 1.1.2 Use OffHeap memory explicitly

By default, Spark stores shuffle pages and cached data in heap memory. Working with large datasets therefore requires a large JVM heap, but large heaps (e.g., 20+ GB) cause **longer garbage collection pauses**. In this situation, Spark lets us store shuffle pages and cached data in off-heap memory instead.


**Off-heap memory can help in the following cases:**
- Large data buffers: storing big buffers (especially serialized blocks) off-heap avoids long garbage collection pauses.
- Faster shuffle and caching: Spark’s Tungsten engine uses binary memory layouts that are well suited to off-heap storage.
- Smaller footprint: off-heap storage avoids per-object overhead (~16 bytes per object on the heap).
- Columnar processing: Off-heap buffers align better with SIMD operations, making columnar execution (Arrow, Parquet) faster.


Below is an example of how to enable off-heap memory:
```shell
# to enable spark to use offheap memory
spark.memory.offHeap.enabled = true
# when offheap enabled, the size must be set
spark.memory.offHeap.size = 4g

# we can also override the default value of the memory overhead
spark.driver.memoryOverhead = 2g
```

> The configuration above has a problem: `spark.memory.offHeap.size` is 4g, but `spark.driver.memoryOverhead` is only 2g. Because off-heap memory counts as part of the memory overhead, a large join or `groupBy` may trigger an OOM error.
> As a best practice, keep `memoryOverhead >= offHeap.size`.
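
The rule `memoryOverhead >= offHeap.size` is easy to check before starting a session. The sketch below is illustrative only: `size_to_mb` and `overhead_covers_offheap` are hypothetical helpers, not part of PySpark, and the parser handles only simple size strings such as `512m` or `4g`. When the overhead is not set, the check falls back to the 384 MB floor, which is a simplification.

```python
import re

# Multipliers to convert a size suffix into MiB.
_UNIT_TO_MB = {"k": 1 / 1024, "m": 1, "g": 1024, "t": 1024 * 1024}


def size_to_mb(size: str) -> float:
    """Parse a size string such as '512m' or '4g' into MiB."""
    match = re.fullmatch(r"(\d+)\s*([kmgt])b?", size.strip().lower())
    if match is None:
        raise ValueError(f"unsupported size string: {size!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_TO_MB[unit]


def overhead_covers_offheap(conf: dict) -> bool:
    """Return True if the driver memory overhead is at least the off-heap size."""
    if conf.get("spark.memory.offHeap.enabled", "false").lower() != "true":
        return True  # off-heap is disabled, nothing to cover
    offheap_mb = size_to_mb(conf.get("spark.memory.offHeap.size", "0m"))
    # Simplification: assume the 384 MB floor when the overhead is not set.
    overhead_mb = size_to_mb(conf.get("spark.driver.memoryOverhead", "384m"))
    return overhead_mb >= offheap_mb


risky_conf = {
    "spark.memory.offHeap.enabled": "true",
    "spark.memory.offHeap.size": "4g",
    "spark.driver.memoryOverhead": "2g",
}
print(overhead_covers_offheap(risky_conf))  # False: 2g overhead < 4g off-heap
```

Raising `spark.driver.memoryOverhead` to `4g` or more makes the check pass.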

#### 1.1.3 Spark driver memory architecture

```text
+--------------------  JVM Heap (spark.driver.memory / spark.executor.memory)
|  Reserved memory (300 MB, used by Spark internals)
|
|  Spark Unified Memory
|    +----------------- Execution Memory
|    |                 Storage Memory
|
+---------------------------------------------------------------------------------
   ↑ JVM-managed only

+--------------------  Memory Overhead (spark.driver/executor.memoryOverhead)
| Used for:
|    - JVM metaspace, thread stacks,
|    - native libs,
|    - Python/R processes in PySpark/SparkR
+--------------------  Off-Heap Memory (spark.memory.offHeap.size)
|  Managed by Spark Tungsten engine
|  Used for:
|    - Shuffle buffers
|    - Serialized cached blocks
|    - Columnar / Arrow / Parquet I/O
+---------------------------------------------------------------------------------
   ↑ OS-level RAM allocation (outside JVM heap)
```
#### 1.1.4 A memory config example

Suppose we have a server with 32 GB of RAM running Spark in local mode. A reasonable memory sizing that balances JVM heap and off-heap memory is:


- `spark.driver.memory=20g`: JVM heap for the Spark driver, large enough for the dataset and tasks
- `spark.memory.offHeap.enabled=true`: enables off-heap memory to reduce JVM GC pressure
- `spark.memory.offHeap.size=6g`: off-heap buffers for shuffle, caching, and serialization
- `spark.driver.memoryOverhead=6g`: soft overhead reserve that covers off-heap memory and native libraries

Why these values?
 - JVM heap (20 GB) + off-heap (6 GB) + OS and other processes (~6 GB) ≈ 32 GB of physical RAM.
 - The 6 GB off-heap size lets Spark move shuffle and caching buffers out of the JVM heap, reducing GC pauses.
 - The remaining ~6 GB covers the OS, background applications, and potential spikes.
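
The sizing logic can be written down as a small helper. This is a sketch under the assumptions used in this section (heap = total RAM minus the OS reserve minus off-heap, and overhead set equal to the off-heap size). The function `local_mode_memory_conf` and its parameters are hypothetical names, not a Spark API.

```python
def local_mode_memory_conf(total_ram_gb: int, os_reserve_gb: int, offheap_gb: int) -> dict:
    """Derive memory settings from a RAM budget, following the reasoning above."""
    heap_gb = total_ram_gb - os_reserve_gb - offheap_gb
    if heap_gb <= 0:
        raise ValueError("the RAM budget leaves no room for the JVM heap")
    return {
        "spark.driver.memory": f"{heap_gb}g",
        "spark.memory.offHeap.enabled": "true",
        "spark.memory.offHeap.size": f"{offheap_gb}g",
        # Keep the overhead at least as large as the off-heap size.
        "spark.driver.memoryOverhead": f"{offheap_gb}g",
    }


conf = local_mode_memory_conf(total_ram_gb=32, os_reserve_gb=6, offheap_gb=6)
for key, value in conf.items():
    print(f"{key}={value}")
```

With a 32 GB budget this reproduces the values listed above (20g heap, 6g off-heap, 6g overhead).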

In [2]:
spark = (
    SparkSession.builder
    .appName("LocalMode_memo_config")
    .master("local[*]")
    # JVM memory allocation
    .config("spark.driver.memory", "16g")  # Half of RAM for driver
    .config("spark.driver.maxResultSize", "4g")  # Avoid OOM on collect()
    # Shuffle & partition tuning
    .config("spark.sql.shuffle.partitions", "12")  # Lower than default 200
    .config("spark.sql.files.maxPartitionBytes", "128m")  # Avoid large partitions in memory
    .config("spark.reducer.maxSizeInFlight", "48m")  # Limit shuffle buffer
    # Unified memory management
    .config("spark.memory.fraction", "0.7")  # Share of (heap - 300 MB) used for execution + storage (default 0.6)
    .config("spark.memory.storageFraction", "0.3")  # Smaller protected cache area (default 0.5)
    .config("spark.memory.offHeap.enabled", "true")
    .config("spark.memory.offHeap.size", "1g")
    # Compress shuffle output and spilled data
    # (spark.shuffle.spill is ignored since Spark 1.6: spilling is always enabled)
    .config("spark.shuffle.spill.compress", "true")
    .config("spark.shuffle.compress", "true")
    # optimize jvm GC
    .config("spark.driver.extraJavaOptions",
            "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:+HeapDumpOnOutOfMemoryError")
    # Use Kryo serializer
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Optional: buffer size for serialization
    .config("spark.kryoserializer.buffer", "64m")
    .config("spark.kryoserializer.buffer.max", "512m")
    .getOrCreate()
)
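
After creating the session, it is worth checking which memory settings were actually applied. The helper below is a small sketch: `memory_settings` is a hypothetical name, and it relies only on `spark.sparkContext.getConf()` and `SparkConf.get(key, defaultValue)`. The list of keys is an arbitrary selection and can be extended.

```python
MEMORY_KEYS = [
    "spark.driver.memory",
    "spark.memory.fraction",
    "spark.memory.storageFraction",
    "spark.memory.offHeap.enabled",
    "spark.memory.offHeap.size",
]


def memory_settings(spark) -> dict:
    """Collect the memory-related settings of a running SparkSession."""
    conf = spark.sparkContext.getConf()
    # Keys that were never set explicitly are reported as "<not set>".
    return {key: conf.get(key, "<not set>") for key in MEMORY_KEYS}
```

Calling `memory_settings(spark)` on the session created above returns a dictionary that can be compared with the intended values.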