#### Spark Memory Management


![spark-memory](/images/spark-execution-memory.png)


Understanding memory management is essential to tuning Apache Spark performance. Since Spark 1.6, Spark has used a unified memory management model that governs how executor memory is divided among its components. The key memory types are:


### 1. **Reserved Memory**

* A small portion of the JVM heap that is set aside and not available to Spark's execution or storage.
* **Default size:** 300MB. The value is hardcoded; `spark.testing.reservedMemory` overrides it, but that setting exists for tests and should not be changed in production.
* Reserved for Spark's internal objects and as a safety margin against OOM errors.



### 2. **On-Heap Memory**

* Memory within the JVM heap. Spark always uses it; when `spark.memory.offHeap.enabled = false` (the default), it is the only pool for execution and storage.
* **Total size:** Determined by `spark.executor.memory`.
* **Used for:**

  * Execution (shuffle, joins, aggregations, sorts)
  * Storage (caching/persisted RDDs or DataFrames)
  * User memory (custom objects, UDFs, broadcast vars)


### 3. **Off-Heap Memory**

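* Memory outside the JVM heap that Spark allocates and frees itself through the JVM's unsafe APIs, so it is not scanned by the garbage collector.
* **Enabled via:** `spark.memory.offHeap.enabled = true`
* **Size set by:** `spark.memory.offHeap.size` (must be positive when off-heap is enabled)
* **Use cases and benefits:**

  * Tungsten’s binary data storage
  * Execution buffers for shuffles, sorts, and aggregations
  * Less GC pressure, because the data lives outside the heap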


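### 4. **Unified Memory (Spark 1.6+)**

Spark takes the heap that remains after reserved memory and assigns a fraction of it (`spark.memory.fraction`, default 0.6) to a unified region shared by execution and storage:

```
usableMemory  = spark.executor.memory - reservedMemory
unifiedMemory = usableMemory * spark.memory.fraction
  └──→ unifiedMemory = execution + storage
```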
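The split can be worked through numerically. The sketch below is only an approximation under stated assumptions: it uses the documented defaults `spark.memory.fraction = 0.6` and `spark.memory.storageFraction = 0.5`, treats the heap as exactly `spark.executor.memory` (the JVM usually reports slightly less), and the function name `memory_regions` is made up for illustration.

```python
RESERVED_MB = 300  # fixed reserved memory described in section 1


def memory_regions(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Approximate on-heap regions (in MB) for one executor.

    memory_fraction  -> spark.memory.fraction
    storage_fraction -> spark.memory.storageFraction
    """
    usable = heap_mb - RESERVED_MB
    unified = usable * memory_fraction
    storage_protected = unified * storage_fraction
    return {
        "reserved": RESERVED_MB,
        "unified": unified,
        # share of the unified region that execution cannot evict
        "storage_protected": storage_protected,
        # what execution starts with; it can grow by borrowing
        "execution_initial": unified - storage_protected,
        # everything left outside the unified region
        "user": usable - unified,
    }


# Example: a 4 GB executor heap
for name, mb in memory_regions(4096).items():
    print(f"{name:>18}: {mb:8.1f} MB")
```

For a 4096MB heap this gives roughly 2278MB of unified memory and 1518MB of user memory; the boundary between execution and storage inside the unified region is soft, as the next two subsections describe.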

#### a. **Execution Memory**

* For tasks like joins, aggregations, sorts, and shuffles.
* **Dynamic:** Can borrow from storage when storage has free space.
* **Evicts:** Can force cached blocks out to reclaim borrowed space, but only until storage shrinks to the protected share set by `spark.memory.storageFraction` (default 0.5).

#### b. **Storage Memory**

* To store cached or persisted RDD/DataFrame blocks and broadcast variables.
* **Eviction policy:** Least recently used (LRU).
* **Dynamic:** Can borrow from execution memory, but only the part execution is not using. Storage never evicts execution memory.

Storage and execution share one pool; borrowing between them keeps memory from sitting idle.
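The LRU policy for cached blocks can be illustrated with a toy model. This is not Spark's implementation: `ToyStorageMemory` and the block ids are hypothetical, sizes are plain numbers in MB, and two real behaviors are left out (Spark avoids evicting blocks of the RDD it is currently caching, and blocks persisted with a disk-backed level are written to disk rather than dropped).

```python
from collections import OrderedDict


class ToyStorageMemory:
    """Toy model of LRU eviction for cached blocks (sizes in MB)."""

    def __init__(self, capacity_mb):
        self.capacity_mb = capacity_mb
        # block_id -> size; least recently used entry comes first
        self.blocks = OrderedDict()

    def used_mb(self):
        return sum(self.blocks.values())

    def get(self, block_id):
        """Reading a block marks it as most recently used."""
        if block_id not in self.blocks:
            return None
        self.blocks.move_to_end(block_id)
        return self.blocks[block_id]

    def put(self, block_id, size_mb):
        """Cache a block, evicting LRU blocks until it fits.

        Returns (evicted_block_ids, was_cached).
        """
        evicted = []
        if size_mb > self.capacity_mb:
            return evicted, False  # can never fit, so do not evict anything
        self.blocks.pop(block_id, None)  # re-caching replaces the old entry
        while self.used_mb() + size_mb > self.capacity_mb:
            old_id, _ = self.blocks.popitem(last=False)  # drop the LRU block
            evicted.append(old_id)
        self.blocks[block_id] = size_mb
        return evicted, True


store = ToyStorageMemory(capacity_mb=100)
store.put("rdd_1_0", 40)
store.put("rdd_1_1", 40)
store.get("rdd_1_0")  # touch block 0, so block 1 becomes the LRU entry
evicted, cached = store.put("rdd_2_0", 40)
print(evicted, cached)  # ['rdd_1_1'] True
```

The block that was read most recently survives, which is why a cached DataFrame that is used often tends to stay in memory while rarely used ones are pushed out first.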


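### 5. **User Memory**

* The heap left over after reserved and unified memory: `(spark.executor.memory - 300MB) * (1 - spark.memory.fraction)`. That is about 40% of usable heap with the default fraction of 0.6 (it was about 25% under the Spark 1.6 default of 0.75). It is not governed by the unified memory manager.
* **Used for:**

  * Custom data structures
  * UDF intermediate states
  * Broadcast variables (partial)
  * Spark internal bookkeeping
* **Not tunable directly**, but its size follows `spark.memory.fraction` and `spark.executor.memory`; reducing UDF state also lowers the demand on it.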


### 6. **Overhead Memory**

* Memory for non-JVM needs such as YARN/Mesos container overhead, native libraries, and Python/R processes (if using PySpark or SparkR).
* **Configurable via:**

  * `spark.executor.memoryOverhead` (Spark 2.3 and later)
  * `spark.yarn.executor.memoryOverhead` (older name, deprecated since Spark 2.3)
* **Default:** max(384MB, 0.10 \* spark.executor.memory)
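The default formula can be checked with a few lines of arithmetic. This is a rough sketch: the helper names are hypothetical, values are in MB, and adding `spark.memory.offHeap.size` to the container request is an assumption that matches Spark 3.x on YARN rather than every cluster manager or version.

```python
MIN_OVERHEAD_MB = 384   # floor from the default formula
OVERHEAD_FACTOR = 0.10  # share of spark.executor.memory


def default_overhead_mb(executor_memory_mb):
    """max(384MB, 0.10 * spark.executor.memory), rounded down to whole MB."""
    return max(MIN_OVERHEAD_MB, int(executor_memory_mb * OVERHEAD_FACTOR))


def container_request_mb(executor_memory_mb, overhead_mb=None, off_heap_mb=0):
    """Approximate memory requested from the cluster manager per executor.

    Assumption: off-heap size is added on top of heap and overhead.
    """
    if overhead_mb is None:
        overhead_mb = default_overhead_mb(executor_memory_mb)
    return executor_memory_mb + overhead_mb + off_heap_mb


for heap in (2048, 4096, 8192):
    print(heap, default_overhead_mb(heap), container_request_mb(heap))
```

Small executors hit the 384MB floor (a 2048MB heap requests 2432MB in total), while larger ones scale with the 10% factor. PySpark jobs with heavy Python-side processing often need the overhead raised explicitly.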


### Summary Table

| Memory Type | Purpose                                  | Where?               | Configurable?                            |
| ----------- | ---------------------------------------- | -------------------- | ---------------------------------------- |
| Reserved    | Spark internals, OOM safety margin       | On-heap              | No (hardcoded default)                   |
| Execution   | Shuffles, joins, aggregations, sorts     | Unified memory       | Yes (`spark.memory.fraction`)            |
| Storage     | Cached/persisted RDDs, broadcast vars    | Unified memory       | Yes (`spark.memory.storageFraction`)     |
| User        | UDFs, custom objects                     | On-heap              | Indirectly (`spark.memory.fraction`)     |
| Off-heap    | Tungsten binary data, execution buffers  | Off-heap             | Yes (`spark.memory.offHeap.size`)        |
| Overhead    | Native code, containers, Python procs    | Outside the JVM heap | Yes (`spark.executor.memoryOverhead`)    |



#### Caching data

- Caching is the process of storing intermediate results (DataFrames/RDDs) in memory to avoid recomputation in future actions.
- Spark evaluates lazily, so without caching, each action triggers full recomputation of the DAG.

##### Where Is Data Stored When Cached?

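- Primary location: Storage Memory (part of Unified Memory).
- Fallback: If memory runs short, the outcome depends on the storage level. `MEMORY_AND_DISK` levels write the blocks to disk, while `MEMORY_ONLY` levels skip them and recompute them on the next use.
- Optional: `persist()` can also store data off-heap, serialized, or disk-only.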



**`.cache()`**

```py
df.cache()
```

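- Shortcut for `.persist()` with the default storage level. For DataFrames that is `MEMORY_AND_DISK` (`MEMORY_AND_DISK_DESER` in PySpark 3.x); for RDDs it is `MEMORY_ONLY`.
- Caches data in memory, spills to disk if memory is full.
- Lazy: nothing is stored until an action runs on the DataFrame.
- Common and safe default for general use.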

**`.persist(storageLevel)`**

```py
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_ONLY)
```

- Gives control over how and where data is stored.

##### StorageLevel Options

(As of Spark 3.4)

- DISK_ONLY: memory efficient but slow to access; data on disk is always serialized
- DISK_ONLY_2: disk only, replicated 2x
- DISK_ONLY_3: disk only, replicated 3x
- MEMORY_AND_DISK: keeps blocks in memory and spills to disk when memory runs out
- MEMORY_AND_DISK_2: memory and disk, replicated 2x
- MEMORY_AND_DISK_DESER (default for `DataFrame.cache()`): same as MEMORY_AND_DISK, but the in-memory copy is kept deserialized for fast access
- MEMORY_ONLY: CPU efficient, memory intensive; partitions that do not fit are recomputed when needed
- MEMORY_ONLY_2: memory only, replicated 2x for resilience if one executor fails

Serialized vs. deserialized:

- Serialized (SER) storage is compact and saves memory but costs CPU on every read. Deserialized (DESER) storage keeps JVM objects, which is CPU efficient but memory intensive.
- Data on disk is always serialized, so it takes less space than the same data held deserialized in memory.
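Each storage level name is shorthand for a set of flags. The sketch below mirrors the constructor arguments of `pyspark.StorageLevel` (`useDisk`, `useMemory`, `useOffHeap`, `deserialized`, `replication`) in a plain dataclass so that it runs without a Spark installation; the `Level` class and `describe` helper are hypothetical, and the flag values are the ones PySpark 3.4 is assumed to define for these names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    """Stand-in for pyspark.StorageLevel's flags (illustration only)."""
    use_disk: bool
    use_memory: bool
    use_off_heap: bool
    deserialized: bool
    replication: int = 1


LEVELS = {
    "DISK_ONLY": Level(True, False, False, False),
    "DISK_ONLY_2": Level(True, False, False, False, 2),
    "DISK_ONLY_3": Level(True, False, False, False, 3),
    "MEMORY_AND_DISK": Level(True, True, False, False),
    "MEMORY_AND_DISK_2": Level(True, True, False, False, 2),
    "MEMORY_AND_DISK_DESER": Level(True, True, False, True),
    "MEMORY_ONLY": Level(False, True, False, False),
    "MEMORY_ONLY_2": Level(False, True, False, False, 2),
}


def describe(name):
    """Spell out what a storage level name means."""
    level = LEVELS[name]
    parts = []
    if level.use_disk:
        parts.append("disk")
    if level.use_memory:
        parts.append("memory")
    if level.use_off_heap:
        parts.append("off-heap")
    parts.append("deserialized" if level.deserialized else "serialized")
    parts.append(f"{level.replication}x replicated")
    return ", ".join(parts)


for name in LEVELS:
    print(f"{name:<22} {describe(name)}")
```

With PySpark installed, printing a real level such as `StorageLevel.MEMORY_AND_DISK_2` gives a similar summary, and `df.storageLevel` reports the level a DataFrame is currently persisted with.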

**When to use what?**

The `_SER` levels in this table belong to the Scala/Java API and are not exposed in PySpark.

| Storage Level          | Space Used | CPU Time | In Memory | On Disk | Serialized |
| ---------------------- | ---------- | -------- | --------- | ------- | ---------- |
| MEMORY\_ONLY           | High       | Low      | Yes       | No      | No         |
| MEMORY\_ONLY\_SER      | Low        | High     | Yes       | No      | Yes        |
| MEMORY\_AND\_DISK      | High       | Medium   | Some      | Some    | Some       |
| MEMORY\_AND\_DISK\_SER | Low        | High     | Some      | Some    | Yes        |
| DISK\_ONLY             | Low        | High     | No        | Yes     | Yes        |


```py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CustomersCache").getOrCreate()
```


```py
from pyspark.sql.functions import col, when

customers_df = spark.read.parquet("file:///workspace/TRNG-2224-data-engineering/week2/final_customers.parquet")

customers_df.show()
```

```
+--------------------+--------------+--------------------+----+-------+--------------------+-----------+-------------------+---------+-----------+---------+---------------+------------------+-------------+-----------------+
|         customer_id|          name|               email| age| gender|             country|signup_date|         last_login|is_active|total_spent|age_group|pref_newsletter|pref_notifications|pref_language|       name_parts|
+--------------------+--------------+--------------------+----+-------+--------------------+-----------+-------------------+---------+-----------+---------+---------------+------------------+-------------+-----------------+
|0e99a07c-c7a5-43d...|   Thomas Lamb|robinjackson@wrig...|50.0| Female|              France| 2023-03-01|2025-05-29 22:36:25|     true|     1438.4|    Adult|           true|              push|           en|   {Thomas, Lamb}|
|3a69ac3e-6726-431...|Kimberly Blake|susan51@johnson-g...|20.0|   Male|       Guinea-Bissau| 2020-12-14|
```

```py
from pyspark import StorageLevel

# Flatten the name_parts struct into two columns and drop the originals
base_customers_df = (
    customers_df
    .withColumn("first_name", col("name_parts.first_name"))
    .withColumn("last_name", col("name_parts.last_name"))
    .drop("name", "name_parts")
)

# Lazy: blocks are stored (with 2 replicas) once an action runs
base_customers_df.persist(StorageLevel.MEMORY_AND_DISK_2)
```

```
DataFrame[customer_id: string, email: string, age: double, gender: string, country: string, signup_date: date, last_login: timestamp, is_active: boolean, total_spent: double, age_group: string, pref_newsletter: boolean, pref_notifications: string, pref_language: string, first_name: string, last_name: string]
```

```py
spark  # in a notebook, this displays the session summary with a link to the Spark UI
```

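```py
# Bucket customers by total spend
final_df = base_customers_df.withColumn(
    "spend_category",
    when(col("total_spent") > 3000, "High")
    .when((col("total_spent") > 1000) & (col("total_spent") <= 3000), "Medium")
    .otherwise("Low"),
)

final_df.explain(True)  # print the logical and physical plans
final_df.show()         # an action: caches the partitions it reads
```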

```
== Parsed Logical Plan ==
'Project [unresolvedstarwithcolumns(spend_category, CASE WHEN '`>`('total_spent, 3000) THEN High WHEN 'and('`>`('total_spent, 1000), '`<=`('total_spent, 3000)) THEN Medium ELSE Low END, None)]
+- Project [customer_id#0, email#2, age#3, gender#4, country#5, signup_date#6, last_login#7, is_active#8, total_spent#9, age_group#10, pref_newsletter#11, pref_notifications#12, pref_language#13, first_name#770, last_name#772]
   +- Project [customer_id#0, name#1, email#2, age#3, gender#4, country#5, signup_date#6, last_login#7, is_active#8, total_spent#9, age_group#10, pref_newsletter#11, pref_notifications#12, pref_language#13, name_parts#14, first_name#770, name_parts#14.last_name AS last_name#772]
      +- Project [customer_id#0, name#1, email#2, age#3, gender#4, country#5, signup_date#6, last_login#7, is_active#8, total_spent#9, age_group#10, pref_newsletter#11, pref_notifications#12, pref_language#13, name_parts#14, name_parts#14.first_name AS first_name#770]
```

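The warnings below appear because `MEMORY_AND_DISK_2` asks for a second replica, but this session has no peer executor to hold it, so each block is stored only once:

```
25/07/03 18:34:41 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
25/07/03 18:34:41 WARN BlockManager: Block rdd_24_0 replicated to only 0 peer(s) instead of 1 peers
```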


```py
# Release the cached blocks once they are no longer needed
base_customers_df.unpersist()
```

```
DataFrame[customer_id: string, email: string, age: double, gender: string, country: string, signup_date: date, last_login: timestamp, is_active: boolean, total_spent: double, age_group: string, pref_newsletter: boolean, pref_notifications: string, pref_language: string, first_name: string, last_name: string]
```