In [None]:
What is Garbage Collection (GC)?

Spark runs on the Java Virtual Machine (JVM) — even when you use PySpark.
All Spark executors (and the driver) are JVM processes, so they manage memory using the JVM heap.

Garbage Collection (GC) is the process by which the JVM frees heap memory by reclaiming Java objects that are no longer referenced.

In [None]:
How GC Works (Simplified)

Spark keeps a lot of short-lived objects in memory — rows, partitions, shuffle buffers, serialized data, etc.

The JVM constantly allocates and frees memory as tasks create new objects.

When a region of the heap fills up, the GC finds objects that are no longer referenced and reclaims their memory.

There are two main types of garbage collections:

Minor GC – clears the young generation (short-lived objects). Fast, frequent.

Major (Full) GC – collects the old generation (long-lived objects); a full GC covers the entire heap. Slow, expensive.

In [None]:
If GC happens too often or takes too long, it steals CPU cycles from Spark tasks.

Key effects:

Executors spend more time in GC instead of doing real computation.

Task latencies increase (slow stages).

Executors appear idle but are busy cleaning up memory.

Stop-the-world GC pauses block all application threads on the executor, so every running task stalls until the pause ends.

If GC can’t free enough memory → OutOfMemoryError or executor loss.

In [None]:
Symptoms of GC Problems in PySpark

You can identify GC pressure in multiple ways:

1️⃣ In the Spark UI (Executors tab)

The “GC Time” column shows the total time each executor has spent in garbage collection.

If GC time exceeds 10–20% of total task time, it’s a red flag.

Example:

Executor ID | Total Task Time | GC Time | GC % | Peak Memory Used
-----------------------------------------------------------------
1           | 1200s           | 500s    | 42%  | 7.8GB


Here, about 42% of the executor's task time (500s of 1200s) is spent on GC instead of computation.
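The check above can be sketched in a few lines of plain Python. This is only an illustration: the numbers are copied by hand from the example row, the 10% threshold is the lower end of the rule of thumb, and the names `gc_fraction` and `flag_gc_pressure` are made up for this sketch (they are not Spark APIs).

```python
def gc_fraction(task_time_s: float, gc_time_s: float) -> float:
    """Return GC time as a fraction of total task time."""
    if task_time_s <= 0:
        raise ValueError("task_time_s must be positive")
    return gc_time_s / task_time_s


# Assumption: lower bound of the 10-20% rule of thumb.
GC_WARN_THRESHOLD = 0.10

# Values copied by hand from the example row in the Executors tab.
executors = [
    {"id": "1", "task_time_s": 1200.0, "gc_time_s": 500.0},
]


def flag_gc_pressure(rows, threshold=GC_WARN_THRESHOLD):
    """Return (executor id, GC percent) for executors above the threshold."""
    flagged = []
    for row in rows:
        frac = gc_fraction(row["task_time_s"], row["gc_time_s"])
        if frac > threshold:
            flagged.append((row["id"], round(frac * 100)))
    return flagged


print(flag_gc_pressure(executors))  # [('1', 42)]
```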

In [None]:
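2️⃣ In the executor GC logs

Enable GC logging on the executors. These flags apply to Java 8; on Java 9 and later, the unified-logging option -Xlog:gc* replaces them.

--conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps"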
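As a rough sketch, the setting above can be assembled in Python before being handed to spark-submit. The helper names (`gc_logging_flags`, `executor_gc_conf`, `as_submit_args`) are hypothetical, and the Java-version switch is an assumption about your cluster: the two `Print*` flags are Java 8 options, while `-Xlog:gc*` is the unified-logging equivalent on Java 9 and later.

```python
def gc_logging_flags(java_major: int) -> str:
    """Pick GC logging flags for the executor JVM (assumed version split)."""
    if java_major <= 8:
        # Java 8 style flags, as used in the setting above.
        return "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
    # Java 9+ unified logging.
    return "-Xlog:gc*"


def executor_gc_conf(java_major: int) -> dict:
    """Build the Spark conf entry that carries the JVM flags."""
    return {"spark.executor.extraJavaOptions": gc_logging_flags(java_major)}


def as_submit_args(conf: dict) -> list:
    """Turn a conf dict into spark-submit style arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(as_submit_args(executor_gc_conf(8)))
```

The same key and value could instead be passed to `SparkSession.builder.config(key, value)` in PySpark, provided this happens before the session is created, since executor JVM options cannot be changed once executors are running.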

In [None]:
java -Xmx4g -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 org.apache.spark.executor.CoarseGrainedExecutorBackend ...

In [None]:
-XX:+UseG1GC
This tells the JVM to use the G1 (Garbage-First) Garbage Collector, which is a modern, low-pause collector designed for large heaps (multi-GB memory).
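To make the other two values in the executor command concrete, here is a small sketch that parses the option string and works out the arithmetic. It assumes `-XX:InitiatingHeapOccupancyPercent` is read as the heap occupancy at which G1 starts a concurrent marking cycle; newer JVMs may adjust this value adaptively at runtime, so treat the result as a starting point only. The function names are illustrative.

```python
import re

# Multipliers for the size suffixes accepted by -Xmx.
_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_xmx(options: str) -> int:
    """Return the maximum heap size in bytes from a -Xmx option."""
    match = re.search(r"-Xmx(\d+)([kKmMgG]?)", options)
    if match is None:
        raise ValueError("no -Xmx option found")
    number, unit = int(match.group(1)), match.group(2).lower()
    return number * _UNITS.get(unit, 1)


def parse_ihop(options: str) -> int:
    """Return the InitiatingHeapOccupancyPercent value as an integer."""
    match = re.search(r"-XX:InitiatingHeapOccupancyPercent=(\d+)", options)
    if match is None:
        raise ValueError("no InitiatingHeapOccupancyPercent option found")
    return int(match.group(1))


opts = "-Xmx4g -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35"
heap_bytes = parse_xmx(opts)
threshold_bytes = heap_bytes * parse_ihop(opts) // 100

# Heap size and marking threshold in MiB.
print(heap_bytes // 1024 ** 2, threshold_bytes // 1024 ** 2)  # 4096 1433
```

With a 4 GiB heap and a value of 35, concurrent marking would begin at roughly 1.4 GiB of occupancy, well before the heap is full.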

In [None]:
Why use G1GC for Spark

Spark creates many short-lived objects (rows, shuffle buffers, etc.)

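ParallelGC, the default collector on Java 8 (G1 is the default from Java 9 onward), is tuned for throughput and pauses all application threads for the whole collection, which can mean long pauses on large heaps. That is bad for distributed jobs.

G1GC divides the heap into small regions, does most of its marking concurrently (while tasks still run), and collects a subset of regions at a time, which reduces long GC pauses.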