# Module 0: Why PySpark? Compute, Platforms & Scaling
**Scenario:** You work for a global consultancy (e.g., Accenture, Deloitte, Capgemini).

**Client Situation:** A retail client (e.g., Tesco) has 10TB of clickstream logs. The team is trying to analyze them using Excel/Python on a laptop, and the analysis keeps crashing.

**Key Questions:**
1.  **Why PySpark?** Why can't we just use Pandas?
2.  **How much Compute do we need?** (Sizing the Cluster).
3.  **Which Platform?** (AWS EMR vs Azure Synapse vs Databricks).

This notebook explores the *Foundational Theory* before you write code.

---
## 1. The Scale Problem: "It works on my machine"
*   **Small Data (Excel):** < 1 Million Rows (a worksheet is capped at 1,048,576 rows).
*   **Medium Data (Pandas):** < 10 Million Rows, provided the whole dataset fits in RAM (e.g., 16GB).
*   **Big Data (Spark):** Billions of Rows (TB/PB). Needs **Distributed Computing**.

### Scenario: The Crash
Your manager asks you to process a 50GB CSV file.
*   **Laptop:** 16GB RAM.
*   **Process:** Load 50GB into RAM.
*   **Result:** `MemoryError` (Crash).

### The Solution: Distributed Computing (Spark)
Instead of 1 super-computer, use 10 cheap computers (Nodes).
*   Split the 50GB file into 10 chunks of 5GB.
*   Each computer processes 5GB.
*   Combine results.
*   **Result:** Success.
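
The split / process / combine steps above can be sketched in plain Python. This is only an illustration of the idea, not how Spark is implemented: the `events` list is a hypothetical stand-in for the 50GB file, worker threads stand in for the 10 nodes, and the helper names `split_into_chunks` and `count_clicks` are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, num_chunks):
    """Split a list into num_chunks nearly equal contiguous slices."""
    chunk_size, remainder = divmod(len(records), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        # The first `remainder` chunks take one extra record each.
        end = start + chunk_size + (1 if i < remainder else 0)
        chunks.append(records[start:end])
        start = end
    return chunks


def count_clicks(chunk):
    """Work done by one 'node': count click events in its own chunk only."""
    return sum(1 for event in chunk if event == "click")


# Hypothetical stand-in for the big file: 1,000 tiny log events.
events = ["click", "view", "click", "scroll"] * 250

# Step 1: split the data into 10 chunks.
chunks = split_into_chunks(events, num_chunks=10)

# Step 2: each worker thread plays the role of one node and processes one chunk.
with ThreadPoolExecutor(max_workers=10) as pool:
    partial_counts = list(pool.map(count_clicks, chunks))

# Step 3: combine the partial results into the final answer.
total_clicks = sum(partial_counts)
print(f"Partial counts per node: {partial_counts}")
print(f"Total clicks: {total_clicks}")
```

No single worker ever needs to hold the full dataset, which is why the same job that crashes one laptop can succeed across several machines.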

---
## 2. Platforms used in Service Companies
In a real job you will rarely run Spark on your own laptop. You will typically use a managed **Cloud Platform** (or, at some clients, an on-premise cluster).

| Platform | Cloud Provider | Common Clients |
| :--- | :--- | :--- |
| **Databricks** | Azure / AWS / GCP | Widely used across industries (often the default choice) |
| **AWS EMR** (Elastic MapReduce) | AWS | Startups, Tech-focused corps |
| **Azure Synapse Analytics** | Azure | Banks, Enterprise Corps (Microsoft shops) |
| **Google Dataproc** | GCP | Retailers using Google Analytics |
| **Cloudera** | On-Premise | Govt, Old Banks (Security focus) |

---
## 3. Interactive Compute Sizing Calculator
How do you decide how many machines ("Nodes") to ask for? 
*   **Rule of Thumb:** 1 Core can process ~2-4GB of data efficiently in a task.
*   **Shuffle Partition:** Aim for 128MB - 200MB per partition.
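
As a worked example of the partition guideline above, the partition count is roughly the data size divided by the target partition size. The sketch below assumes 1 GB = 1024 MB and uses a hypothetical helper name, `estimate_partitions`; treat the output as a starting point rather than a tuned value.

```python
import math


def estimate_partitions(data_size_gb, target_partition_mb=128):
    """Rough partition count: total data size / target size per partition.

    Assumes 1 GB = 1024 MB. Rounds up so no partition exceeds the target.
    """
    data_size_mb = data_size_gb * 1024
    return math.ceil(data_size_mb / target_partition_mb)


# 50 GB of data at both ends of the 128MB - 200MB guideline.
for target_mb in (128, 200):
    partitions = estimate_partitions(50, target_mb)
    print(f"50 GB at {target_mb} MB per partition -> {partitions} partitions")
```

In Spark, a number like this is commonly applied through the `spark.sql.shuffle.partitions` setting, whose default of 200 is often too low for large shuffles and too high for tiny ones.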

In [None]:
# --- Interactive: Compute Scaling Calculator ---
# Scenario:
# Client Data: 50 GB
# Deadline: "Need this processed in 15 minutes."

import math

def calculate_cluster_requirements(data_size_gb, processing_time_minutes):
    """
    Very rough estimation based on common cluster performance.
    Assumption: 1 Core processes ~10GB/Hour (~0.17 GB/Min) in complex ETL.
    """
    processing_speed_per_core_per_min = 10 / 60 # GB processed per minute per CPU core

    total_processing_capacity_needed = data_size_gb / processing_time_minutes
    raw_cores = total_processing_capacity_needed / processing_speed_per_core_per_min
    # Round up (a fraction of a core still needs a whole core) and never report 0 cores.
    # round() guards against float noise such as 20.000000000000004 becoming 21.
    cores_needed = max(1, math.ceil(round(raw_cores, 6)))

    print(f"--- Sizing Report for {data_size_gb} GB Data ---")
    print(f"Goal: Finish in {processing_time_minutes} minutes.")
    print(f"Throughput Needed: {total_processing_capacity_needed:.2f} GB/min")
    print(f"Estimated CPU Cores Needed: {cores_needed}")

    # Typical machines have 4 or 8 cores available for Spark.
    # Ceiling division avoids requesting an extra worker when cores divide evenly by 8.
    nodes_8_core = math.ceil(cores_needed / 8)
    print(f"Recommendation: Request a cluster with {nodes_8_core} Workers (8-core machines).")

# Simulate different scenarios
calculate_cluster_requirements(50, 15)   # 50GB in 15 Minutes (the client deadline above)
print("\n")
calculate_cluster_requirements(50, 60)   # 50GB in 1 Hour
print("\n")
calculate_cluster_requirements(1000, 60) # 1TB in 1 Hour
print("\n")
calculate_cluster_requirements(5, 120)   # 5GB in 2 Hours (Micro-batch)

---
## 4. When NOT to use Spark (The Anti-Pattern)
Many junior engineers use Spark for EVERYTHING.
*   **Problem:** 2MB CSV file -> Spin up a 5-node Cluster ($50 cost).
*   **Correct Way:** 2MB CSV file -> Use Pandas on a single cheap VM ($0.10 cost).
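
The decision above can be sketched as a simple size check. This is a rough heuristic, not a rule: the helper name `choose_engine` is hypothetical, and the `memory_overhead_factor` of 3 is an assumption reflecting that a file loaded into Pandas usually occupies noticeably more memory than it does on disk.

```python
def choose_engine(data_size_gb, ram_gb=16, memory_overhead_factor=3):
    """Pick a tool based on whether the data comfortably fits in one machine's RAM.

    memory_overhead_factor is an assumed multiplier for in-memory size
    versus on-disk size; adjust it for your own data.
    """
    estimated_memory_gb = data_size_gb * memory_overhead_factor
    if estimated_memory_gb <= ram_gb:
        return "pandas"  # Single machine is enough: cheaper and simpler.
    return "spark"       # Too big for one machine: distribute the work.


# The 2MB CSV from the anti-pattern, and the 50GB file from Section 1.
for size_gb in (0.002, 50):
    print(f"{size_gb} GB -> {choose_engine(size_gb)}")
```

Volume is only one input. Velocity and variety, as the interview tip below notes, can also shift the choice.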

**Interview Tip:** Always say "I evaluate the Volume, Velocity, and Variety of data before choosing a tool."

---
## 5. Next Steps
Now that you understand the **WHY**, proceed to **Module 1** to start coding.