# PySpark: Zero to Hero
## Module 24: Dynamic Allocation and Resource Management

In a shared cluster, every executor one application holds is an executor other applications cannot use, so managing resources efficiently is critical.

*   **Static Allocation (Default):** You request a fixed number of executors (e.g., `--num-executors 10`) at the start. These resources are reserved for your app until it finishes, even while the app is idle, so capacity that other applications could use sits unused.
*   **Dynamic Allocation:** Spark adds executors when there are pending tasks (Scale Up) and removes executors when they have been idle for a specific time (Scale Down).
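
To make the difference concrete, here is a rough back-of-the-envelope sketch in plain Python. All numbers are hypothetical, and the model is deliberately simplified: it assumes a single burst of work, `minExecutors` of 0, and ignores ramp-up time. The function names are illustrative and not part of any Spark API.

```python
def static_executor_minutes(num_executors, app_minutes):
    """Static allocation: every executor is held for the whole app lifetime."""
    return num_executors * app_minutes


def dynamic_executor_minutes(num_executors, busy_minutes, idle_timeout_minutes):
    """Dynamic allocation (simplified): executors are held while busy,
    plus the idle timeout before they are released."""
    return num_executors * (busy_minutes + idle_timeout_minutes)


# Hypothetical scenario: a 60-minute application that only computes for 10 minutes.
static_cost = static_executor_minutes(num_executors=10, app_minutes=60)
dynamic_cost = dynamic_executor_minutes(
    num_executors=10, busy_minutes=10, idle_timeout_minutes=1
)

print(f"Static:  {static_cost} executor-minutes reserved")
print(f"Dynamic: {dynamic_cost} executor-minutes reserved")
```

Under these assumptions the static application reserves 600 executor-minutes and the dynamic one about 110. Real savings depend on how bursty the workload is.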

### Agenda:
1.  **Static vs. Dynamic:** Understanding the difference.
2.  **Configuration:** Enabling Dynamic Allocation properties.
3.  **Shuffle Tracking:** Handling shuffle data when executors are removed.
4.  **Hands-on:** Configuring a SparkSession with Dynamic Allocation.

In [None]:
from pyspark.sql import SparkSession
import time

# Note: Dynamic Allocation only takes effect under a cluster manager (Standalone, YARN, K8s).
# In 'local' mode everything runs in a single JVM, so there are no separate executors
# to add or remove. The settings below show how you would configure it for production.

spark = SparkSession.builder \
    .appName("Dynamic_Allocation_Demo") \
    .master("local[*]") \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true") \
    .config("spark.dynamicAllocation.minExecutors", "0") \
    .config("spark.dynamicAllocation.maxExecutors", "5") \
    .config("spark.dynamicAllocation.initialExecutors", "1") \
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s") \
    .config("spark.dynamicAllocation.schedulerBacklogTimeout", "1s") \
    .getOrCreate()

print("Spark Session Active with Dynamic Allocation Configured")
print(f"Spark UI: {spark.sparkContext.uiWebUrl}")

## 2. Key Configurations Explained

1.  **`spark.dynamicAllocation.enabled`**: Master switch to turn the feature on.
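2.  **`spark.dynamicAllocation.shuffleTracking.enabled`**:
    *   When an executor is removed, the shuffle files it wrote to its local disk are lost with it, and any stage that still needs them has to be recomputed.
    *   Before Spark 3.0, the only way to avoid this was an "External Shuffle Service" running on each worker node, which serves shuffle files independently of the executors.
    *   From Spark 3.0 onward (and on K8s, where an External Shuffle Service is typically not available), **Shuffle Tracking** lets the driver track which executors hold shuffle data and avoids removing them while that data is still needed.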
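3.  **`...minExecutors` / `...maxExecutors`**: The lower and upper bounds on the number of executors. `...initialExecutors` (set to 1 in the session above) is the number requested at startup.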
4.  **`...executorIdleTimeout`**: How long an executor must be idle (no running tasks) before it is removed.
5.  **`...schedulerBacklogTimeout`**: How long pending tasks must wait before Spark requests *new* executors.
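
As a minimal sketch, the properties above can be collected into a plain dictionary before they are handed to the session builder. The helper name `dynamic_allocation_conf` and its range check are assumptions made for illustration; the check is this sketch's own sanity rule, not Spark's exact validation logic.

```python
def dynamic_allocation_conf(min_executors=0, max_executors=5, initial_executors=1,
                            idle_timeout="60s", backlog_timeout="1s",
                            shuffle_tracking=True):
    """Return dynamic-allocation settings as a dict of Spark property strings.

    Hypothetical helper: the property names are Spark's, the function is not.
    """
    # Sanity check chosen for this sketch: keep the initial size inside the bounds.
    if not 0 <= min_executors <= initial_executors <= max_executors:
        raise ValueError(
            "expected 0 <= minExecutors <= initialExecutors <= maxExecutors"
        )
    prefix = "spark.dynamicAllocation."
    return {
        prefix + "enabled": "true",
        prefix + "shuffleTracking.enabled": str(shuffle_tracking).lower(),
        prefix + "minExecutors": str(min_executors),
        prefix + "maxExecutors": str(max_executors),
        prefix + "initialExecutors": str(initial_executors),
        prefix + "executorIdleTimeout": idle_timeout,
        prefix + "schedulerBacklogTimeout": backlog_timeout,
    }


conf = dynamic_allocation_conf()
for key, value in conf.items():
    print(f"{key} = {value}")
```

Each pair could then be passed to `SparkSession.builder.config(key, value)` in a loop, which keeps the tuning values in one place instead of scattered across a long builder chain.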

In [None]:
# Let's run a job that requires computation to trigger executor allocation.
# In a real cluster, you would see the number of executors increase in the 'Executors' tab of Spark UI.

print("Starting Job...")

# Creating a large range and performing a map transformation
# This generates tasks. If we were on a cluster, Spark would request more executors 
# to handle these tasks in parallel if the backlog timeout is exceeded.
rdd = spark.sparkContext.range(0, 10000000)
count = rdd.map(lambda x: x * x).count()

print(f"Job Finished. Count: {count}")

## 3. How to Observe Scaling (Concept)

If you were running this on a **Spark Standalone Cluster** or **YARN**:

1.  **Initial State:** Application starts with `initialExecutors` (e.g., 1).
2.  **Job Submission:** When you run the code above, tasks pile up in the scheduler.
3.  **Scale Up:** Once tasks have been pending for `schedulerBacklogTimeout` (e.g., 1s), Spark requests more executors from the cluster manager, in rounds, until the backlog clears or `maxExecutors` is reached.
4.  **Job Completion:** Once the job is done, the executors become idle.
5.  **Scale Down:** Once an executor has been idle for `executorIdleTimeout` (e.g., 60s), Spark releases it (never going below `minExecutors`) to free up cluster resources for other users.
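
The scale-up step can be sketched as a toy simulation in plain Python. Spark's documentation describes executors being requested in rounds, with the number requested per round growing exponentially (1, then 2, 4, 8, and so on). The function below is a simplified model of that policy under stated assumptions: it ignores timing, assumes every request is granted immediately, and takes the number of executors the backlog could use (`executors_needed`) as a given input. The function and parameter names are hypothetical.

```python
def simulate_scale_up(initial_executors, max_executors, executors_needed):
    """Return the executor count after each request round (simplified model)."""
    # Never grow beyond what the backlog needs or what maxExecutors allows.
    target = min(executors_needed, max_executors)
    current = initial_executors
    to_add = 1  # size of the first request round
    history = [current]
    while current < target:
        current = min(current + to_add, target)
        history.append(current)
        to_add *= 2  # each round asks for twice as many as the previous one
    return history


# Same bounds as the session configured earlier: start at 1, cap at 5.
print(simulate_scale_up(initial_executors=1, max_executors=5, executors_needed=20))
# [1, 2, 4, 5]
```

With the bounds used in this module, the cap of 5 executors is reached in three rounds. The slow start keeps short-lived backlogs from grabbing many executors, while the doubling lets a genuinely large job ramp up quickly.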

In [None]:
# On a cluster, uncomment the sleep below to wait longer than executorIdleTimeout (60s)
# and watch the idle executors disappear.
print("Waiting to simulate idle time...")
# time.sleep(70)
print("Check Spark UI 'Executors' tab. In a real cluster, idle executors would be removed now.")
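
Instead of watching the 'Executors' tab by hand, the same information can be read from Spark's monitoring REST API, which exposes the active executors at `/api/v1/applications/<app-id>/executors` under the UI address. The sketch below is a minimal, hedged example: the function names are hypothetical, and the sample payload is made up and trimmed to the two fields the code reads (`id` and `isActive`).

```python
import json
from urllib.request import urlopen


def count_active_executors(executor_summaries):
    """Count active executors in a REST API payload, excluding the driver entry."""
    return sum(
        1
        for entry in executor_summaries
        if entry.get("id") != "driver" and entry.get("isActive", True)
    )


def fetch_executor_summaries(ui_url, app_id):
    """Fetch the executor list for one application from the monitoring REST API."""
    url = f"{ui_url}/api/v1/applications/{app_id}/executors"
    with urlopen(url, timeout=10) as response:
        return json.load(response)


# Hypothetical payload, standing in for a real response from a cluster.
sample = [
    {"id": "driver", "isActive": True},
    {"id": "1", "isActive": True},
    {"id": "2", "isActive": True},
]
print(count_active_executors(sample))
```

On a live session, `fetch_executor_summaries(spark.sparkContext.uiWebUrl, spark.sparkContext.applicationId)` would supply the real payload. Calling it before and after the idle timeout should show the count dropping on a cluster. In local mode only the driver entry is expected, so the count stays at 0.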

## Summary

1.  **Resource Efficiency:** Dynamic Allocation allows clusters to be utilized more efficiently by releasing unused resources.
2.  **Shuffle Data:** The biggest challenge is preserving shuffle data when executors are removed. Use an **External Shuffle Service** or **Shuffle Tracking**.
3.  **Databricks vs. Spark:** 
    *   **Spark Dynamic Allocation:** Scales **Executors** (processes) inside existing nodes.
    *   **Databricks Autoscaling:** Scales the actual **VMs/Nodes** (infrastructure) up and down.

**Next Steps:**
In the next module, we will look at **Spark Speculation**, a mechanism to handle slow-running tasks (stragglers).