# PySpark: Zero to Hero
## Module 22: Distributed Shared Variables

Spark runs your code across many executors. Any driver variable referenced inside a function you pass to Spark is serialized and shipped along with the tasks, so each task works on its own copy. Updates made to those copies are never sent back to the driver.

To handle shared data efficiently, Spark provides two types of shared variables:
1.  **Broadcast Variables:** Read-only variables cached on each executor (efficient for lookups).
2.  **Accumulators:** Add-only variables used for counters and sums (efficient for metrics). Executors can only add to them; only the driver can read the result.

### Agenda:
1.  **The Problem:** Why standard Python variables don't work across clusters.
2.  **Broadcast Variables:** Broadcasting a lookup table for joins.
3.  **Accumulators:** Counting errors or specific events across the cluster.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder \
    .appName("Shared_Variables") \
    .master("local[*]") \
    .getOrCreate()

print("Spark Session Active")

In [None]:
# Transaction Data (Employee ID, Sales Amount)
data = [
    ("101", 1000),
    ("102", 1500),
    ("103", 800),
    ("101", 500),
    ("104", 1200)
]

df = spark.createDataFrame(data, ["emp_id", "sales"])

# Lookup Dictionary (Small dataset: Employee ID -> Name)
# In a real scenario, this might come from a database or another small file.
emp_lookup = {"101": "John", "102": "Jane", "103": "Bob", "104": "Alice"}

print("Data Prepared")
df.show()

## 1. Broadcast Variables

**Problem:** If we use `emp_lookup` directly in a UDF or map function, Spark serializes the dictionary as part of the function's closure and ships a copy *with every task*. If the dictionary is large (e.g., 100 MB) and the job has 1,000 tasks, the same data crosses the network over and over.

**Solution:** A **Broadcast Variable** sends the data to each **executor** only once. All tasks on that executor read the same cached, read-only copy. This is highly efficient for map-side joins and lookups.
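
To put rough numbers on the difference, here is a back-of-the-envelope sketch in plain Python. The 100 MB size and 1,000 tasks come from the example above; the executor count of 10 is an assumption made only for illustration. It ignores compression and the way Spark actually distributes broadcast blocks, so treat the result as an upper-bound estimate, not a measurement.

```python
# Rough upper-bound estimate of data shipped for a lookup table.
lookup_size_mb = 100   # size of the lookup dictionary (from the example above)
num_tasks = 1000       # number of tasks in the job (from the example above)
num_executors = 10     # ASSUMPTION: hypothetical cluster size

# Without broadcast: one copy travels with every task.
per_task_mb = lookup_size_mb * num_tasks

# With broadcast: one copy per executor, shared by all of its tasks.
per_executor_mb = lookup_size_mb * num_executors

print(f"Shipped with every task:     {per_task_mb:,} MB")
print(f"Broadcast once per executor: {per_executor_mb:,} MB")
print(f"Reduction factor:            {per_task_mb // per_executor_mb}x")
```

The saving grows with the ratio of tasks to executors, which is why broadcasting pays off most for jobs with many small tasks.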

In [None]:
# Step 1: Create the Broadcast Variable
broadcast_emp = spark.sparkContext.broadcast(emp_lookup)

# Step 2: Use it in a UDF
def get_emp_name(emp_id):
    # Access the broadcasted value using .value
    return broadcast_emp.value.get(emp_id, "Unknown")

# Wrap the Python function as a Spark UDF that returns a string
name_udf = udf(get_emp_name, StringType())

# Apply
df_with_names = df.withColumn("emp_name", name_udf(col("emp_id")))

print("--- DataFrame using Broadcast Lookup ---")
df_with_names.show()

## 2. Accumulators

**Problem:** If you define `counter = 0` in your driver and try to increment it inside a function passed to Spark (for example in `map` or `foreach`), the update happens on the copy of the variable in the executor, not in the driver. The driver's counter remains 0.
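
The effect is easy to reproduce without a cluster. The sketch below is a plain-Python simulation, not Spark code: it uses `pickle` to stand in for the serialization that happens when driver state is shipped to a task. The helper `run_task_on_copy` and the two-partition split of the sales values are hypothetical, chosen only for illustration.

```python
import pickle

def run_task_on_copy(state_bytes, sales_values):
    """Simulate one task: deserialize a private copy of the driver state and mutate it."""
    state = pickle.loads(state_bytes)   # the task gets its own copy
    for sales in sales_values:
        if sales > 1000:
            state["counter"] += 1       # updates the copy only
    return state["counter"]

# "Driver" side: the state we would like the tasks to update.
driver_state = {"counter": 0}
shipped = pickle.dumps(driver_state)    # stands in for closure serialization

# Sales values from the sample data, split into two hypothetical partitions.
partitions = [[1000, 1500, 800], [500, 1200]]
task_counts = [run_task_on_copy(shipped, part) for part in partitions]

print("Counts seen inside each task:", task_counts)
print("Counter on the driver:", driver_state["counter"])
```

Each simulated task counts its own high-value sales, but the driver's dictionary is never touched because nothing carries the copies back.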

**Solution:** **Accumulators** are variables that are only "added" to through an associative and commutative operation. They are perfect for global counters (e.g., counting bad records).

In [None]:
# Step 1: Initialize Accumulator
high_sales_counter = spark.sparkContext.accumulator(0)

# Define a function to update the accumulator
def count_high_sales(row):
    if row.sales > 1000:
        high_sales_counter.add(1)

# Step 2: Use foreach (Action) to iterate and update
# Note: Accumulators do not change the DataFrame, they just update the variable.
df.foreach(count_high_sales)

# Step 3: Read the value back on the Driver
print(f"Number of High Value Sales (> 1000): {high_sales_counter.value}")

## Summary

1.  **Broadcast Variables:**
    *   Use for read-only lookup data (dictionaries, sets).
    *   Reduces network I/O by sending data once per executor, not once per task.
    *   Access via `.value`.

2.  **Accumulators:**
    *   Use for global counters across the cluster (e.g., error counts, processed rows).
    *   Only the driver can read the value (`.value`). Executors can only write to it (`.add()`).
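    *   **Warning:** Be careful when updating accumulators inside transformations (like `map`). If a task or stage is re-executed, its updates can be applied more than once and the accumulator over-counts. For updates made inside actions (like `foreach`), Spark applies each task's update only once, so actions are the safer place.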
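
The over-counting risk can be illustrated with a toy simulation in plain Python. This is not Spark code: `ToyAccumulator` and `map_partition` are hypothetical stand-ins that mimic a transformation being re-executed, for example after a task failure or when an uncached result is recomputed for a second action.

```python
class ToyAccumulator:
    """Minimal stand-in for an accumulator: supports add() and exposes value."""
    def __init__(self):
        self.value = 0

    def add(self, amount):
        self.value += amount

def map_partition(sales_values, acc):
    """Simulated transformation with a side effect on the accumulator."""
    result = []
    for sales in sales_values:
        if sales > 1000:
            acc.add(1)          # side effect repeated on every execution
        result.append(sales)
    return result

acc = ToyAccumulator()
partition = [1000, 1500, 800, 500, 1200]   # sales values from the sample data

map_partition(partition, acc)   # first execution
map_partition(partition, acc)   # re-execution of the same work

true_count = sum(1 for s in partition if s > 1000)
print("True number of high-value sales:", true_count)
print("Accumulator after re-execution:", acc.value)
```

The data contains two high-value sales, yet the accumulator reports four because the same side effect ran twice.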

**Next Steps:**
In the next module, we will look at **Spark Memory Management** and how to tune executor memory to avoid OOM errors.