# PySpark: Zero to Hero
## Module 17: Spark Architecture and Cluster Deployment

Up until now, we have been running Spark in "Local Mode" (inside our Jupyter environment). In a real production environment, Spark runs on a **Cluster** of multiple machines.

### Agenda:
1.  **Spark Architecture:** Driver, Executors, and Cluster Manager.
2.  **Deployment Modes:** Client Mode vs. Cluster Mode.
3.  **Spark Submit:** How to submit jobs to a cluster.
4.  **Resource Allocation:** Configuring Cores and Memory.

In [None]:
# Even though we are simulating this locally, understanding the config is key.
from pyspark.sql import SparkSession

# We explicitly set master to local[*] here, but in a real cluster, 
# you would point this to the Cluster Manager URL (e.g., spark://master-node:7077)

spark = SparkSession.builder \
    .appName("Cluster_Architecture_Demo") \
    .master("local[*]") \
    .getOrCreate()

print(f"Spark UI URL: {spark.sparkContext.uiWebUrl}")

## 1. Spark Architecture Components

*   **Driver Program:** The "brain" of the operation. It runs your `main()` function, creates the `SparkContext`, and breaks your code into jobs, stages, and tasks that it schedules on the executors.
*   **Cluster Manager:** Allocates resources (CPU/RAM). Examples: Standalone, YARN, Kubernetes.
*   **Executors:** The "workers". They run the actual tasks on the worker nodes and store data in memory/disk.
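
The cluster manager in use is determined by the master URL you give Spark. The sketch below is a plain-Python illustration (no Spark required) that maps common master URL forms to the cluster manager they imply. The function name `cluster_manager_for` and the host names are made up for this example.

```python
import re


def cluster_manager_for(master_url: str) -> str:
    """Return the cluster manager implied by a Spark master URL (illustrative helper)."""
    # local, local[4], local[*], local[4,2]: everything runs on one machine.
    if re.fullmatch(r"local(\[(\*|\d+)(,\s*\d+)?\])?", master_url):
        return "local (no cluster manager)"
    if master_url.startswith("spark://"):
        return "Standalone"
    if master_url == "yarn":
        return "YARN"
    if master_url.startswith("k8s://"):
        return "Kubernetes"
    raise ValueError(f"Unrecognized master URL: {master_url!r}")


# Hypothetical host names, used only to show the URL shapes.
for url in ["local[*]", "spark://master-node:7077", "yarn", "k8s://https://k8s-api:6443"]:
    print(f"{url:30} -> {cluster_manager_for(url)}")
```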

**Flow:**
1.  Driver asks Cluster Manager for resources.
2.  Cluster Manager launches Executors on Worker Nodes.
3.  Driver sends tasks (code) to Executors.
4.  Executors run tasks and return results to Driver.
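
The flow above can be mimicked with a toy model in plain Python. This is only an analogy, not how Spark is implemented: a thread pool stands in for the executors, and the hypothetical `run_job` function plays the driver by splitting data into partitions, handing one task per partition to the pool, and combining the results.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(data, num_executors=2, num_partitions=4):
    """Toy 'driver': partition the data, farm out tasks, combine the results."""
    # Driver side: split the input into partitions (one task per partition).
    partitions = [data[i::num_partitions] for i in range(num_partitions)]

    # The work each "executor" performs on a single partition.
    def task(partition):
        return sum(x * x for x in partition)

    # Steps 1-2: the pool stands in for executors launched by the cluster manager.
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        # Step 3: the driver sends tasks out. Step 4: partial results come back.
        partial_results = list(executors.map(task, partitions))

    # Driver side: combine the partial results into the final answer.
    return sum(partial_results)


print(run_job(list(range(10))))  # sum of squares of 0..9, prints 285
```

Note that there are more partitions (4) than executors (2), so each executor processes several tasks in turn. Real Spark jobs behave similarly: tasks queue up until an executor core is free.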

## 2. Client Mode vs. Cluster Mode

This setting determines **where the Driver Program runs**.

1.  **Client Mode (the `spark-submit` default, and the only option for notebooks and shells):**
    *   **Driver:** Runs on the machine where you submitted the job (e.g., your laptop or edge node).
    *   **Pros:** Interactive, you see logs immediately.
    *   **Cons:** If you close your laptop, the job dies. High network latency if the client is far from the cluster.

2.  **Cluster Mode (Production):**
    *   **Driver:** Runs on one of the Worker Nodes inside the cluster.
    *   **Pros:** Fire and forget. You can submit the job and disconnect. The cluster manages the driver.
    *   **Cons:** Harder to debug (logs are stored on the cluster).
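
Inside a running job, the deploy mode is visible through the `spark.submit.deployMode` configuration property. The sketch below works on a plain dictionary so it runs without a cluster. The helper name `describe_driver_location` and the example values are assumptions for illustration. In a real job you could build the dictionary with `dict(spark.sparkContext.getConf().getAll())`.

```python
def describe_driver_location(conf: dict) -> str:
    """Explain where the driver runs, based on a dict of Spark config values."""
    # Treat a missing property as client mode, the spark-submit default.
    mode = conf.get("spark.submit.deployMode", "client")
    if mode == "cluster":
        return "cluster mode: the driver runs inside the cluster"
    return "client mode: the driver runs on the machine that submitted the job"


# Hypothetical configuration values, not read from a real cluster.
example_conf = {"spark.master": "yarn", "spark.submit.deployMode": "cluster"}
print(describe_driver_location(example_conf))
print(describe_driver_location({"spark.master": "local[*]"}))
```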

## 3. The `spark-submit` Command

In production, we don't usually run jobs from Jupyter notebooks. Instead, we package our Python code into a `.py` file and submit it with the command-line tool `spark-submit`.

**Basic Syntax:**
```bash
spark-submit \
  --master <master-url> \
  --deploy-mode <client|cluster> \
  --conf <key>=<value> \
  your_script.py
```

In [None]:
# Creating a Script for Submission

script_content = """
from pyspark.sql import SparkSession
import time

# Create Spark Session
spark = SparkSession.builder.appName("SparkSubmitTest").getOrCreate()

# Create a small DataFrame
data = [("Alice", 1), ("Bob", 2), ("Charlie", 3)]
df = spark.createDataFrame(data, ["Name", "Value"])

# Show Data (This goes to stdout logs)
df.show()

# Sleep to keep the UI alive for a bit so we can inspect it
time.sleep(10)

spark.stop()
"""

# Write the script to a file
with open("my_spark_job.py", "w") as f:
    f.write(script_content)

print("Created 'my_spark_job.py' successfully.")

In [None]:
# We use the '!' shell escape to run shell commands in Jupyter.
# In a real terminal, you would type the command without the '!'.

print("Submitting the job...")

# This submits the job to our local "cluster" (local[*]).
# The master URL is quoted so the shell does not treat [*] as a glob pattern.
!spark-submit \
  --master "local[*]" \
  --name "My Submitted Job" \
  my_spark_job.py

print("Job Finished.")
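
Outside a notebook, the same submission can be assembled in plain Python. The sketch below builds the argument list for `spark-submit`. The helper `build_spark_submit` is a made-up name for this example and covers only the flags from the basic syntax above plus `--name`. Passing the arguments as a list also sidesteps shell quoting issues with values such as `local[*]`.

```python
import shlex


def build_spark_submit(script, master, deploy_mode=None, name=None, conf=None):
    """Assemble a spark-submit argument list (illustrative helper, not a Spark API)."""
    cmd = ["spark-submit", "--master", master]
    if deploy_mode:
        cmd += ["--deploy-mode", deploy_mode]
    if name:
        cmd += ["--name", name]
    # Each config entry becomes its own --conf key=value pair.
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(script)
    return cmd


cmd = build_spark_submit("my_spark_job.py", master="local[*]", name="My Submitted Job")

# Print a copy-pasteable, shell-quoted version of the command.
print(shlex.join(cmd))

# To actually launch it (requires spark-submit on your PATH), you could run:
# import subprocess
# subprocess.run(cmd, check=True)
```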

## 4. Resource Allocation

When submitting a job, you control how many CPU cores and how much memory it receives.

**Key Configurations:**
*   `--num-executors`: Total number of executors to launch (YARN/Kubernetes only).
*   `--executor-cores`: Number of CPU cores per executor.
*   `--executor-memory`: RAM per executor (e.g., `4g`).
*   `--driver-memory`: RAM for the driver program.
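
To get a feel for what these flags add up to, the sketch below totals the resources for a hypothetical request of 4 executors with 2 cores and `4g` each. It assumes the default off-heap overhead that YARN and Kubernetes deployments add per executor (10% of executor memory, with a 384 MiB minimum). The function names are made up for this example, and the driver's own resources are left out.

```python
def parse_memory_mib(size: str) -> int:
    """Convert a size such as '4g' or '512m' into MiB (only m and g are handled)."""
    units = {"m": 1, "g": 1024}
    return int(size[:-1]) * units[size[-1].lower()]


def total_request(num_executors, executor_cores, executor_memory,
                  overhead_factor=0.10, min_overhead_mib=384):
    """Estimate what the executors of one job ask the cluster manager for."""
    heap = parse_memory_mib(executor_memory)
    # Assumed default overhead rule: max(384 MiB, 10% of executor memory).
    overhead = max(min_overhead_mib, int(heap * overhead_factor))
    per_executor = heap + overhead
    return {
        "total_cores": num_executors * executor_cores,
        "memory_per_executor_mib": per_executor,
        "total_memory_mib": num_executors * per_executor,
    }


# 4 executors x 2 cores, 4g each (hypothetical values).
print(total_request(4, 2, "4g"))
```

With these numbers the job can run up to 8 tasks at once, and each executor container needs noticeably more than the 4 GiB named in `--executor-memory`.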

**Example Command:**
```bash
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  my_spark_job.py
```

## Summary

1.  **Architecture:** The Driver plans and schedules the work, the Cluster Manager allocates resources, and Executors run the tasks.
2.  **Deployment:** Use **Client Mode** for development/interactive work. Use **Cluster Mode** for production jobs.
3.  **`spark-submit`:** The standard tool for launching Spark applications.
4.  **Configuration:** Tuning memory and cores is essential for performance and stability.

**Next Steps:**
In the next module, we will explore **Spark Configurations** in more detail, including environment variables and performance tuning properties.