# Local Environment Setup: PySpark with Docker

## 1. Objective
In this notebook, we will set up a robust Data Engineering environment on our local machine using **Docker**. This setup includes:
*   **Jupyter Lab:** For writing and executing code.
*   **Apache Spark:** For data processing.
*   **Cluster Mode:** Setting up a Master node and Worker nodes to simulate a real distributed environment.

---

## 2. Prerequisites: Docker Desktop

Before proceeding, ensure you have Docker Desktop installed.

1.  Go to [Docker Hub](https://hub.docker.com/).
2.  Search for **Docker Desktop**.
3.  Download the version appropriate for your OS (Windows, Mac Intel, or Mac Apple Silicon).
4.  Install and launch Docker Desktop.

---

## 3. Option A: Standalone Setup (Quick Start)

If you want a simple, single-container setup without a cluster, follow these steps in your **Terminal/Command Prompt**:

**1. Pull the Docker Image:**
We will use the pre-configured image `self/pyspark-jupyter-lab-old`, which contains Spark 3.3.0 and Python 3.7.

```bash
docker pull self/pyspark-jupyter-lab-old:latest
```

**2. Run the Container:**
Map the container ports to the same ports on your machine:

*   **4040:** Spark UI
*   **8888:** Jupyter Lab

```bash
docker run -p 4040:4040 -p 8888:8888 self/pyspark-jupyter-lab-old:latest
```

**3. Access Jupyter Lab:**

Look at the terminal logs for a URL containing a token (e.g., `http://127.0.0.1:8888/lab?token=...`), then open that URL in your browser.

---

## 4. Option B: Cluster Setup (Recommended)

To simulate a distributed environment with a Master node and Worker nodes, we will use `docker-compose`.

**1. Clone the Repository:**
We will use the configuration provided in the ease-with-data GitHub repository.

```bash
# Clone the repo
git clone https://github.com/subhamkharwal/docker-images.git

# Navigate to the specific project folder
cd docker-images/pyspark-cluster-with-jupyter
```

**2. Start the Cluster:**
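Run the following command to download the images and start the containers (Master, Workers, History Server, Jupyter). On newer Docker installations, where Compose ships as a plugin, the equivalent command is `docker compose up`.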

```bash
docker-compose up
```

**3. Access Points:**

*   **Jupyter Lab:** `localhost:8888`
*   **Spark Master UI:** `localhost:8080`
*   **Spark Worker UI:** `localhost:8081`
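Before opening the browser, it can help to confirm that the published ports actually accept connections. The following is a minimal sketch using only the Python standard library, run from the host machine; the helper names `port_is_open` and `check_endpoints` are hypothetical, and the port numbers are the ones listed above.

```python
import socket

# Ports listed in the access points above.
ENDPOINTS = {
    "Jupyter Lab": 8888,
    "Spark Master UI": 8080,
    "Spark Worker UI": 8081,
}


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts.
        return False


def check_endpoints(host="localhost", endpoints=ENDPOINTS):
    """Map each service name to whether its port is reachable on the host."""
    return {name: port_is_open(host, port) for name, port in endpoints.items()}


# Example: print one status line per service.
# for name, ok in check_endpoints().items():
#     print(f"{name}: {'reachable' if ok else 'not reachable'}")
```

An open port only shows that something is listening; it does not confirm that the service behind it has finished starting.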

**Note on Data Persistence:**
To read local files in Cluster Mode, place your data in the mounted volume folder, because the containers can only see paths that exist inside them. By default, the configuration maps your local directory to `/data` inside the container.
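As a rough sketch of what reading from the mounted folder might look like, the helper below builds a path under `/data` and loads a CSV file through an existing `spark` session. The file name `sales.csv`, the helper names, and the CSV options are illustrative assumptions, not part of the repository's configuration.

```python
import posixpath

# Mount point inside the containers, as described in the note above.
DATA_DIR = "/data"


def mounted_path(filename, data_dir=DATA_DIR):
    """Build the container-side path for a file placed in the mounted folder."""
    return posixpath.join(data_dir, filename)


def read_mounted_csv(spark, filename, data_dir=DATA_DIR):
    """Read a CSV file from the mounted folder using an active SparkSession."""
    # header=True treats the first row as column names;
    # inferSchema=True asks Spark to guess column types (this costs an extra pass).
    return spark.read.csv(
        mounted_path(filename, data_dir), header=True, inferSchema=True
    )


# Example (hypothetical file name), assuming `spark` is an active session:
# df = read_mounted_csv(spark, "sales.csv")
# df.printSchema()
```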

---

## 5. Environment Verification

Once your Docker container is running and you have opened this notebook inside Jupyter Lab, run the following code cells to verify the installation.

```python
# 1. Import SparkSession
from pyspark.sql import SparkSession

# 2. Create the Spark session
# With the provided docker-compose cluster, the master URL may already be set
# by the environment. Passing 'local[*]' here overrides it and runs Spark inside
# the Jupyter container on all available cores, which is enough for verification.
# To run on the cluster instead, pass the master's spark://<host>:<port> URL.
spark = (
    SparkSession.builder
    .appName("Setup_Verification")
    .master("local[*]")
    .getOrCreate()
)

print("Spark Session Created Successfully!")
```
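If you want the same cell to work in both setups, one possible approach is to read the master URL from an environment variable and fall back to `local[*]`. The variable name `SPARK_MASTER_URL` and both helper names are assumptions made for illustration; check the repository's `docker-compose` file for the actual master host and port before relying on this.

```python
import os


def resolve_master_url(default="local[*]"):
    """Return the master URL from the environment, or a local fallback."""
    # SPARK_MASTER_URL is a hypothetical variable name, not one the
    # repository is known to define. Standalone master URLs have the
    # form spark://<host>:<port>.
    return os.environ.get("SPARK_MASTER_URL", default)


def build_session(app_name="Setup_Verification"):
    """Create (or reuse) a SparkSession against the resolved master URL."""
    # Imported here so that resolve_master_url() works without PySpark.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .master(resolve_master_url())
        .getOrCreate()
    )


# Example:
# spark = build_session()
# print(spark.sparkContext.master)
```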

```python
# 3. Validation: Run a simple Spark Job
# We will create a small DataFrame and display it to ensure the engine is working.

try:
    df = spark.range(10)
    print("DataFrame created with 10 rows.")
    df.show()
    print("Test Passed: PySpark is functioning correctly.")
except Exception as e:
    print(f"Test Failed: {e}")
```
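Displaying rows confirms that the engine starts, but a check with a known answer also confirms that it computes correctly. The sketch below sums the `id` column of `spark.range(n)` and compares it with the closed-form value `n * (n - 1) / 2`; the function name `sum_check` is hypothetical.

```python
def sum_check(spark, n=10):
    """Return True if Spark's sum of 0..n-1 matches the closed-form result.

    Expects n >= 1; for an empty range Spark's sum is null.
    """
    expected = n * (n - 1) // 2
    # spark.range(n) produces a single column named "id".
    row = spark.range(n).selectExpr("sum(id) AS total").first()
    return row["total"] == expected


# Example, assuming `spark` is the session created earlier:
# print("Sum check passed" if sum_check(spark) else "Sum check failed")
```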

## 6. Check Spark UI

1.  While the session above is active, open a new browser tab.
2.  Go to [http://localhost:4040](http://localhost:4040).
3.  You should see the Spark UI showing the job we just ran (`showString`).
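The address of the UI can also be read from the session itself rather than assumed, which is useful because Spark moves to the next free port (4041, 4042, and so on) when 4040 is already taken. The following sketch collects a few properties from an active session; the helper name `session_summary` is hypothetical, while the attributes it reads are standard PySpark attributes. Inside Docker, `uiWebUrl` typically reports the container's hostname, so in the browser you would still use `localhost` with the same port, provided that port is published.

```python
def session_summary(spark):
    """Collect basic facts about an active SparkSession into a dict."""
    sc = spark.sparkContext
    return {
        "spark_version": spark.version,
        "master": sc.master,
        "app_name": sc.appName,
        "ui_url": sc.uiWebUrl,
    }


# Example, assuming `spark` is the session created earlier:
# for key, value in session_summary(spark).items():
#     print(f"{key}: {value}")
```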

## 7. Clean Up

Always stop your Spark session when done to free up resources.

```python
spark.stop()
print("Spark Session Stopped.")
```
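When the work lives in a script rather than in separate notebook cells, a `try`/`finally` block is one way to make sure the session is stopped even if a job raises an error. This is a minimal sketch; `run_with_cleanup` and `job` are hypothetical names.

```python
def run_with_cleanup(spark, job):
    """Run job(spark) and always stop the session afterwards."""
    try:
        return job(spark)
    finally:
        # Runs whether job() returned normally or raised an exception.
        spark.stop()


# Example, assuming `spark` is an active session:
# row_count = run_with_cleanup(spark, lambda s: s.range(10).count())
```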