In [None]:
1. Optimize Memory Configuration
a. Increase Executor Memory
   --executor-memory 8G
b. Increase Driver Memory
   --driver-memory 4G
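c. Tune Memory Fractions (defaults: 0.6 and 0.5)
   These are read when the application starts, so pass them at submit time; calling spark.conf.set() on a running session does not apply them.
   --conf spark.memory.fraction=0.7
   --conf spark.memory.storageFraction=0.4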
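To see what these fractions mean in practice, here is a rough sketch of the arithmetic behind Spark's unified memory region, using the 8G heap and the 0.7 / 0.4 values from this section. The function name is made up for illustration, and the numbers are approximate because the JVM reports slightly less usable heap than the configured size.

```python
RESERVED_MB = 300  # memory Spark sets aside before applying spark.memory.fraction


def unified_memory_mb(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Return (unified, storage, execution) sizes in MB for one executor heap."""
    unified = (heap_mb - RESERVED_MB) * memory_fraction
    storage = unified * storage_fraction   # portion protected from eviction by execution
    execution = unified - storage          # execution can still borrow unused storage memory
    return unified, storage, execution


# 8G executor with spark.memory.fraction=0.7 and spark.memory.storageFraction=0.4
unified, storage, execution = unified_memory_mb(8 * 1024, 0.7, 0.4)
print(f"unified={unified:.0f} MB, storage={storage:.0f} MB, execution={execution:.0f} MB")
```

Whatever is left over is used for user data structures and Spark's internal metadata, so raising spark.memory.fraction too far can move the out-of-memory error rather than remove it.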



In [None]:
2. Reduce Data Volume per Executor
Ensure that no single partition is too large.
If your job has very few partitions, each task has to process a huge partition and can exhaust executor memory.
Rule of thumb: 128–256 MB per partition is ideal.
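The rule of thumb can be turned into a partition count. This is a minimal sketch: the helper names are hypothetical, and the input size has to come from the caller (for example from the file sizes on storage), because Spark does not expose it directly on a DataFrame.

```python
import math


def target_partition_count(total_bytes, target_partition_mb=128):
    """Number of partitions needed so each holds roughly target_partition_mb."""
    target_bytes = target_partition_mb * 1024 * 1024
    return max(1, math.ceil(total_bytes / target_bytes))


def repartition_to_target(df, total_bytes, target_partition_mb=128):
    """Repartition df so partitions land near the target size (triggers a shuffle)."""
    return df.repartition(target_partition_count(total_bytes, target_partition_mb))


# 50 GB of input at 128 MB per partition
print(target_partition_count(50 * 1024**3))  # prints 400
```

Since repartition() itself shuffles the data, it is worth doing once, early in the job, rather than before every wide operation.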

In [None]:
3. Avoid Unnecessary Caching
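Caching keeps a DataFrame's partitions in executor memory until it is unpersisted, so cache only data that is reused and release it when you are done.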
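One way to make sure cached data is always released is to tie it to a with-block. This is a sketch, not a PySpark feature: `cached` is a hypothetical helper that only relies on the DataFrame methods cache() and unpersist().

```python
from contextlib import contextmanager


@contextmanager
def cached(df):
    """Cache df for the duration of a with-block, then release it."""
    df.cache()
    try:
        yield df
    finally:
        # Runs even if the block raises, so the memory is always freed.
        df.unpersist()
```

Inside the block, run every action that reuses the DataFrame; once the block exits, the cached partitions are dropped and the memory becomes available to later stages.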

In [None]:
4. Use Disk Storage (instead of Memory)
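from pyspark import StorageLevel
# MEMORY_AND_DISK writes partitions that do not fit in memory to disk; DISK_ONLY keeps nothing in memory.
df.persist(StorageLevel.MEMORY_AND_DISK)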

In [None]:
5. Control Data Skew
Skew means one partition gets a lot more data than others (typical in joins or groupBy on skewed keys).
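A possible starting point is to measure the skew and then let adaptive query execution split oversized join partitions. The sketch below assumes Spark 3.0 or later; the function names are made up, while the two configuration keys are standard Spark SQL settings that can be changed on a running session. Skew-join handling applies to joins only, so for a skewed groupBy, salting the key remains the usual workaround.

```python
def heaviest_keys(df, key, n=10):
    """Return a DataFrame with the n most frequent values of key and their row counts."""
    return df.groupBy(key).count().orderBy("count", ascending=False).limit(n)


def enable_adaptive_skew_handling(spark):
    """Turn on adaptive query execution and its skew-join splitting."""
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```

If a handful of keys in the heaviest_keys() output hold most of the rows, those are the partitions most likely to run out of memory.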

In [None]:
6. Use Broadcast Joins for Small Tables
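Broadcasting the small table to every executor avoids shuffling the large table and the memory that shuffle would need. The small table must fit comfortably in driver and executor memory.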
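A broadcast join can be requested explicitly with a join hint. The helper below is a hypothetical wrapper around the DataFrame methods hint() and join(); wrapping the small table in pyspark.sql.functions.broadcast() has the same effect.

```python
def join_with_broadcast(large_df, small_df, on):
    """Join large_df to small_df, asking Spark to broadcast small_df to every executor."""
    return large_df.join(small_df.hint("broadcast"), on=on, how="inner")
```

Without a hint, Spark broadcasts automatically only when it estimates the table to be below spark.sql.autoBroadcastJoinThreshold (10 MB by default), so the hint mainly helps when that estimate is missing or too high.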

In [None]:
7. Reduce Shuffle Pressure
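Shuffles write large intermediate data to disk and buffer it in memory, so shuffle less data: filter and select early, and size spark.sql.shuffle.partitions (default 200) to the data.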
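As an illustration, the sketch below filters and projects before a groupBy so that only the needed rows and columns cross the shuffle, and sets the shuffle partition count first. The function name and the value 400 are assumptions for the example, not recommendations; the optimizer often prunes columns on its own, but being explicit keeps the intent visible.

```python
def count_by_key(spark, df, key, condition, shuffle_partitions=400):
    """Count rows per key, dropping unneeded rows and columns before the shuffle."""
    # Runtime-settable SQL option: number of partitions produced by shuffles.
    spark.conf.set("spark.sql.shuffle.partitions", str(shuffle_partitions))
    return df.filter(condition).select(key).groupBy(key).count()
```

The condition can be a SQL expression string or a Column, as accepted by DataFrame.filter().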

In [None]:
8. Avoid Collecting to Driver
df.collect() and df.toPandas() load all data into driver memory.
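When some data really is needed on the driver, bounding it keeps memory use predictable. The two helpers below are hypothetical names; they rely only on the DataFrame methods limit(), collect() and toLocalIterator().

```python
def preview_rows(df, n=20):
    """Bring at most n rows to the driver instead of the whole DataFrame."""
    return df.limit(n).collect()


def iterate_rows(df):
    """Yield rows on the driver while holding only one partition at a time."""
    for row in df.toLocalIterator():
        yield row
```

For full result sets, writing from the executors with df.write avoids passing the data through the driver at all.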

In [None]:
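9. Enable Off-Heap Memory (Optional Advanced)
Off-heap memory lets Spark allocate part of its execution and storage memory outside the JVM heap, which reduces garbage-collection pressure. It is not a spill mechanism. Enable it at launch with spark.memory.offHeap.enabled=true and a positive spark.memory.offHeap.size.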
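The off-heap settings are spark.memory.offHeap.enabled and spark.memory.offHeap.size, and they have to be supplied when the application is launched. The sketch below renders them as spark-submit arguments; `to_submit_flags` is a made-up helper and the 2g size is purely illustrative.

```python
def to_submit_flags(settings):
    """Render a dict of Spark properties as spark-submit --conf arguments."""
    flags = []
    for key, value in settings.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


off_heap = {
    "spark.memory.offHeap.enabled": "true",
    "spark.memory.offHeap.size": "2g",  # illustrative size, not a recommendation
}
print(" ".join(to_submit_flags(off_heap)))
# prints: --conf spark.memory.offHeap.enabled=true --conf spark.memory.offHeap.size=2g
```

Off-heap memory comes on top of the executor heap, so the total memory requested per executor grows by that amount.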

In [None]:
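10. Use Compression for Shuffle Files
Compressing shuffle data reduces disk usage and network transfer. Both settings default to true, so the main thing is to check that they have not been turned off. Set them at launch, not on a running session:
--conf spark.shuffle.compress=true
--conf spark.shuffle.spill.compress=true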

In [None]:
11. Monitor Spark UI

Use Spark UI (port 4040):
Storage tab → cached data size.
Executors tab → memory used per executor.
SQL tab → large shuffle read/write sizes → potential OOM areas.
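The same numbers are available from the Spark UI's REST API under /api/v1, which is handy for scripted checks. The sketch below assumes a local application on port 4040 and a caller-supplied application ID; the function names and the sample values are made up, while memoryUsed and maxMemory are fields of the executor summary that the API returns.

```python
import json
from urllib.request import urlopen


def fetch_executors(app_id, base_url="http://localhost:4040"):
    """Fetch executor summaries from the REST API of a running application."""
    with urlopen(f"{base_url}/api/v1/applications/{app_id}/executors") as resp:
        return json.load(resp)


def executors_over(executors, threshold=0.8):
    """Return (id, usage ratio) for executors whose storage memory use exceeds threshold."""
    hot = []
    for ex in executors:
        if ex["maxMemory"] > 0:
            ratio = ex["memoryUsed"] / ex["maxMemory"]
            if ratio > threshold:
                hot.append((ex["id"], round(ratio, 2)))
    return hot


sample = [  # made-up values in the shape of the API response
    {"id": "1", "memoryUsed": 900, "maxMemory": 1000},
    {"id": "2", "memoryUsed": 100, "maxMemory": 1000},
]
print(executors_over(sample))  # prints [('1', 0.9)]
```

On a live application, executors_over(fetch_executors(app_id)) gives the same view as the Executors tab, restricted to the executors closest to their storage memory limit.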

In [None]:
12. Tune Garbage Collection
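Large Java heaps (8+ GB) can cause long GC pauses. G1GC usually keeps pauses shorter on large heaps, and GC logging shows whether collection is the bottleneck. On Java 9 and later, -Xlog:gc* replaces the deprecated -XX:+PrintGCDetails.
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:+PrintGCDetails"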