# Performance Optimizations in PySpark

## Caching and Persisting DataFrames

- PySpark transformations are lazy: they don't execute until an action is triggered. You can cache or persist a DataFrame so that it is not recomputed every time an action uses it.
- Typical scenario: the same DataFrame feeds several actions or downstream transformations.

### Key Storage Levels in PySpark:
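MEMORY_ONLY: Stores data in memory only; partitions that do not fit are recomputed when needed. This is the default for RDD.cache(), whereas DataFrame.cache() defaults to MEMORY_AND_DISK.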

MEMORY_AND_DISK: Stores data in memory; if memory is insufficient, it spills over to disk.

DISK_ONLY: Stores data only on disk.

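MEMORY_ONLY_SER: Stores data in memory as serialized objects (more compact, at the cost of extra CPU to deserialize). This is a Scala/Java level: in Python, stored objects are always serialized, so recent PySpark versions do not list it in StorageLevel.

MEMORY_AND_DISK_SER: Similar to MEMORY_ONLY_SER, but spills to disk when necessary. Like MEMORY_ONLY_SER, it is a Scala/Java level.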


### When to Cache or Persist:
Cache: When you access the same DataFrame multiple times and the default storage level is acceptable. cache() is shorthand for persist() with that default level.

Persist: When you need control over how the data is stored, for example DISK_ONLY for a DataFrame that is too large to keep in memory.

### Important Considerations:
Memory Management: Make sure the cluster has enough memory before caching large datasets. With MEMORY_AND_DISK, partitions that do not fit in memory spill to disk; with MEMORY_ONLY, they are not stored and are recomputed when needed.

Garbage Collection: Spark evicts cached partitions in least-recently-used (LRU) order when it needs space, but you can release a DataFrame explicitly with df.unpersist() once it is no longer needed.







In [6]:
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, col
from pyspark.storagelevel import StorageLevel

spark = SparkSession.builder.appName('PerformanceTest').getOrCreate()

df = spark.read.csv('./resources/7_employee.csv', header=True, inferSchema=True)
# Cache the DataFrame so later actions reuse it instead of re-reading the CSV.
# For DataFrames, cache() is shorthand for persist() with the default storage level.
df.cache()

# Alternatively, persist with an explicit storage level INSTEAD of calling cache().
# Calling both on the same DataFrame only produces a
# "CacheManager: Asked to cache already cached data" warning.
# Levels available in PySpark include:
# MEMORY_ONLY - stores data only in memory
# MEMORY_AND_DISK - stores data in memory and spills to disk if memory is insufficient
# DISK_ONLY - stores data only on disk
# df.persist(StorageLevel.MEMORY_AND_DISK)

df.show()

window_specification = Window.partitionBy('department').orderBy(col('salary'))

df_ranked = df.withColumn('rank', rank().over(window_specification))
df_ranked.cache()
df_ranked.show()
df_ranked.count()

25/03/29 17:51:20 WARN CacheManager: Asked to cache already cached data.
25/03/29 17:51:20 WARN CacheManager: Asked to cache already cached data.
25/03/29 17:51:20 WARN CacheManager: Asked to cache already cached data.


+-----------+---------+----------+------+
|employee_id|     name|department|salary|
+-----------+---------+----------+------+
|          1|     John|        HR| 55000|
|          2|     Jane|   Finance| 80000|
|          3|    James|        HR| 60000|
|          4|     Anna|   Finance| 90000|
|          5|      Bob| Marketing| 75000|
|          6|    Emily| Marketing| 82000|
|          7|    David|        HR| 65000|
|          8|   George|   Finance| 95000|
|          9|   Olivia| Marketing| 68000|
|         10|     Liam|        HR| 54000|
|         11|   Sophia|   Finance| 85000|
|         12|    Lucas| Marketing| 78000|
|         13| Isabella|   Finance| 92000|
|         14|    Mason|        HR| 63000|
|         15|   Amelia| Marketing| 79000|
|         16|    Ethan|        HR| 67000|
|         17|  Abigail|   Finance| 87000|
|         18|    Aiden|        HR| 56000|
|         19|Charlotte| Marketing| 81000|
|         20|     Jack|        HR| 69000|
+-----------+---------+----------+------+

20