# PySpark: Zero to Hero
## Module 5: Environment Setup (Docker Cluster)

In this module, we will set up a professional PySpark environment on your local machine using **Docker**.

### Why Docker?
*   **Consistency:** It works exactly the same on Windows, Mac, and Linux.
*   **Cluster Simulation:** We can simulate a real production environment (Master Node + Worker Nodes) on a single laptop.
*   **No Mess:** It doesn't install Java, Scala, or Python globally on your laptop, keeping your OS clean.

### Prerequisites
1.  **Docker Desktop:** Download and install from [docker.com](https://www.docker.com/products/docker-desktop/).
2.  **Git:** (Optional but recommended) to clone the repository.

## Option A: Standalone Mode (Quick Start)
Use this if you just want to run code quickly without simulating a full cluster.

**Steps:**
1.  Open your Terminal (Mac/Linux) or Command Prompt (Windows).
2.  Run the following command to download the image and start the container:
    ```bash
    docker run -p 8888:8888 -p 4040:4040 self/pyspark-jupyter-lab-old:latest
    ```
3.  **Access:**
    *   Look in the terminal logs for a URL like `http://127.0.0.1:8888/lab?token=...`
    *   Open that URL in your browser. Alternatively, go to `localhost:8888` and paste only the token (the part after `token=`) into the login page.
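
If you prefer not to pick the token out of the log line by hand, it can be extracted with the standard library. This is a minimal sketch: the helper name `extract_token` and the sample URL are hypothetical, and your real token will differ.

```python
from urllib.parse import urlparse, parse_qs


def extract_token(jupyter_url: str) -> str:
    """Return the value of the `token` query parameter from a Jupyter URL."""
    # strip() removes stray whitespace picked up when copying from the terminal.
    query = urlparse(jupyter_url.strip()).query
    tokens = parse_qs(query).get("token", [])
    if not tokens:
        raise ValueError("no token parameter found in URL")
    return tokens[0]


# Hypothetical log line, including the kind of stray whitespace a copy can add.
example = " http://127.0.0.1:8888/lab?token=abc123def456 \n"
print(extract_token(example))  # prints: abc123def456
```

The returned string is what the Jupyter login page expects, with no `token=` prefix and no surrounding spaces.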

## Option B: Cluster Mode (Master + 2 Workers)
**This is the recommended setup for this course.** It creates a Master node and 2 Worker nodes to simulate distributed computing.

**Steps:**
1.  **Clone the Repository:**
    Open your terminal and run:
    ```bash
    git clone https://github.com/subhamkharwal/docker-images.git
    ```
    *(Alternatively, download the ZIP from the GitHub link if you don't have Git).*

2.  **Navigate to the Folder:**
    ```bash
    cd docker-images/pyspark-cluster-with-jupyter
    ```

3.  **Start the Cluster:**
    Run the Docker Compose command:
    ```bash
    docker-compose up
    ```
    *Note: The first run will take time as it downloads the images. If the `docker-compose` command is not found, try the Compose V2 form `docker compose up`.*

4.  **Access Points:**
    *   **Jupyter Lab:** [http://localhost:8888](http://localhost:8888)
    *   **Spark Master UI:** [http://localhost:8080](http://localhost:8080) (To see your workers)
    *   **Spark Application UI:** [http://localhost:4040](http://localhost:4040) (Available only while a Spark application, i.e. an active Spark session, is running)
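
To confirm that the services are up without opening each link, you can probe the ports with a plain TCP connection. The sketch below is meant to be run on the host machine (not inside a container) and assumes the port numbers listed above; the helper name `is_port_open` is hypothetical.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Ports taken from the access points listed above.
services = {"Jupyter Lab": 8888, "Spark Master UI": 8080, "Spark Application UI": 4040}

for name, port in services.items():
    status = "reachable" if is_port_open("localhost", port) else "not reachable"
    print(f"{name} (port {port}): {status}")
```

Port 4040 is expected to show as not reachable until a Spark session has been created in a notebook.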

## Important: Where to put your data files?

When running in Cluster Mode, Spark runs inside containers and can only see folders that are mapped into them, so it cannot read files directly from your Desktop or Documents folder. You must place your CSV/JSON files in the specific mapped folder.

1.  Inside the folder you cloned (`pyspark-cluster-with-jupyter`), there is a folder named **`data`**.
2.  **Action:** Any file you want to read in PySpark **must be pasted into this `data` folder**.
3.  **In Code:** When reading files, use the path `/data/filename.csv`.

*Example:*
If you put `users.csv` in the local `data` folder, PySpark reads it as:
`spark.read.csv("/data/users.csv")`
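
The mapping from a file in the local `data` folder to the path Spark sees can be written down as a small helper. This is a sketch under the assumption stated above, that the local `data` folder appears as `/data` inside the containers; `container_path` and `CONTAINER_DATA_DIR` are hypothetical names, and files in subfolders of `data` are not handled.

```python
from pathlib import PurePosixPath

# Assumption: the repository's local `data` folder is mounted at /data in the containers.
CONTAINER_DATA_DIR = PurePosixPath("/data")


def container_path(filename: str) -> str:
    """Map a file in the local `data` folder to its path inside the containers."""
    # Normalize Windows separators, then keep only the file name so that a full
    # host path still maps to /data/<name>.
    name = PurePosixPath(filename.replace("\\", "/")).name
    return str(CONTAINER_DATA_DIR / name)


print(container_path("users.csv"))       # prints: /data/users.csv
print(container_path("data\\users.csv"))  # prints: /data/users.csv
```

In a notebook this could then be used as `spark.read.csv(container_path("users.csv"), header=True, inferSchema=True)`, where `header` and `inferSchema` are standard reader options chosen here on the assumption that the file has a header row.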

```python
from pyspark.sql import SparkSession

# 1. Initialize Spark Session
# We don't need to specify Master here because the Docker environment sets it automatically.
spark = SparkSession.builder \
    .appName("Environment_Test") \
    .getOrCreate()

print("Spark Session Created Successfully!")

# 2. Print System Info
print(f"Spark Version: {spark.version}")
print(f"Master: {spark.sparkContext.master}")

# 3. Test Computation
# We create a simple range of numbers to test if the workers are functioning.
df = spark.range(100)
print(f"Count Test: {df.count()}")

# 4. Check UI
print("Go to http://localhost:4040 to see this job in the Spark UI.")
```
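
The `Master` value printed by the test cell tells you which mode you are actually in: a URL starting with `spark://` points at a standalone cluster master, while `local`, `local[N]`, or `local[*]` means everything runs in a single process. A minimal sketch of that check follows; the function name `describe_master` and the host name `spark-master` are hypothetical, and the value in your environment may differ.

```python
def describe_master(master_url: str) -> str:
    """Classify a Spark master URL as standalone cluster or local mode."""
    if master_url.startswith("spark://"):
        return "standalone cluster"
    if master_url == "local" or master_url.startswith("local["):
        return "local mode (no separate workers)"
    return "other cluster manager or unrecognized"


# Hypothetical example values; pass spark.sparkContext.master in a notebook.
for url in ["spark://spark-master:7077", "local[*]"]:
    print(f"{url} -> {describe_master(url)}")
```

If Option B reports local mode, the session was not attached to the cluster, and the Spark Master UI will not list the application.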

## Troubleshooting

*   **Port Conflicts:** If `docker-compose up` fails with an error such as "port is already allocated", make sure no other Docker container or service is using port 8888, 8080, or 4040.
*   **File Not Found:** If PySpark reports "Path does not exist", verify that you placed the file in the `data` folder inside the cloned repository (not, for example, your Downloads folder) and that the path in your code starts with `/data/`.
*   **Hidden Characters:** When copying the token from the terminal, copy only the token itself, without leading or trailing spaces and without the `token=` prefix.