# Spark Streaming with PySpark
## Module 3: Environment Setup (Docker, Spark & Kafka)

In this module, we set up the complete infrastructure required for this course. Unlike standard batch processing, which reads data at rest, a streaming pipeline usually consumes events from a **message broker**. We will use **Apache Kafka**.

To keep the setup simple and consistent across Windows, macOS, and Linux, we will run everything in **Docker**.

### The Architecture
We are building a multi-container environment:
1.  **Jupyter Lab (PySpark):** Where we write our code.
2.  **Apache Kafka:** The streaming message broker.
3.  **ZooKeeper:** Coordinates the Kafka brokers and stores cluster metadata.

### Prerequisites
1.  **Docker Desktop:** Download and install from [docker.com](https://www.docker.com/products/docker-desktop/).
2.  **Git:** (Optional) to clone the repository, or you can download the ZIP.

## Step 1: Get the Docker Compose File
We need the configuration files to tell Docker how to build our cluster.

1.  Go to the GitHub Repository: [https://github.com/subhamkharwal/docker-images](https://github.com/subhamkharwal/docker-images)
2.  **Clone** the repository, or use **Download ZIP** and extract the archive.
3.  Navigate to the folder:
    > `docker-images/pyspark-jupyter-kafka`

## Step 2: Start the Cluster
1.  Open your **Terminal** or **Command Prompt**.
2.  Change directory (`cd`) into the folder you just downloaded (`pyspark-jupyter-kafka`).
3.  Run the build command:
    ```bash
    docker-compose up
    ```
4.  Wait until the log output settles. You should see messages indicating that Kafka and Jupyter are running. Keep this terminal open, because `docker-compose up` runs in the foreground.
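
Instead of watching the logs, readiness can also be checked from a script. The sketch below is not part of the course repository. It polls a TCP port until something accepts connections. Port `8888` is the Jupyter Lab port used in this module; the function name `wait_for_port` and the timeout values are illustrative assumptions.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening yet
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            # Pause briefly before the next attempt
            time.sleep(interval)
```

For example, `wait_for_port("localhost", 8888, timeout=120)` should return `True` once Jupyter Lab is listening. An open port only shows that the service is accepting connections, not that it is fully initialized.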

## Step 3: Access Jupyter Lab
1.  Open your browser to [http://localhost:8888](http://localhost:8888).
2.  **Token:** Check the terminal logs for a URL like `http://127.0.0.1:8888/lab?token=...`. Copy the token string.
3.  **Login:** Paste the token and set a password (e.g., `1234`) so that later logins are easier.
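
If the logged URL is long, the token can be pulled out programmatically rather than copied by hand. This is a minimal sketch using only the Python standard library; the helper name `extract_token` and the token value `abc123` are hypothetical and serve only as an illustration.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_token(url: str) -> Optional[str]:
    """Return the value of the `token` query parameter, or None if absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("token")
    return values[0] if values else None


# Hypothetical token value, for illustration only
print(extract_token("http://127.0.0.1:8888/lab?token=abc123"))
```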

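## Verification 1: Spark Session

Run the following in a new notebook cell to confirm that Spark starts correctly.

```python
from pyspark.sql import SparkSession

# 1. Initialize the Spark session in local mode, using all available cores
spark = SparkSession.builder \
    .appName("Environment_Check") \
    .master("local[*]") \
    .getOrCreate()

print("Spark Session Created Successfully!")
print(f"Spark Version: {spark.version}")

# 2. Check the UI (available only while this session is running)
print("You can view the Spark UI at: http://localhost:4040")
```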
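
Creating the session does not run a job yet, so a small computation is a useful extra check. The sketch below takes the `spark` session created above as a parameter and runs one tiny job. The function name `smoke_test` and the default row count are illustrative assumptions.

```python
def smoke_test(spark, n: int = 1000) -> bool:
    """Run a tiny job and confirm the row count matches what was requested."""
    # spark.range(n) builds a single-column DataFrame with ids 0..n-1
    df = spark.range(n)
    # count() is an action, so it forces Spark to schedule and execute a job
    return df.count() == n
```

In the notebook, `smoke_test(spark)` should return `True`, and the job should then be listed in the Spark UI.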

## Verification 2: Kafka Connection

To ensure Kafka is working, we will perform a manual check using the terminal inside the Kafka container.

**Steps:**
1.  Open **Docker Desktop**.
2.  Find the container named **`ed-kafka`**.
3.  Click on the **"Exec"** or **"Terminal"** tab (or the CLI icon).
4.  Run the following commands inside that terminal:

**A. Create a Test Topic:**
```bash
kafka-topics --create --topic test-topic --bootstrap-server ed-kafka:9092
```
Expected output: `Created topic test-topic.`

**B. List the Topics:**
```bash
kafka-topics --list --bootstrap-server ed-kafka:9092
```
Expected output: `test-topic` appears in the list.

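The listing check can also be scripted from the host instead of the Docker Desktop terminal. The following is a sketch under some assumptions: the `docker` CLI is on the host's `PATH`, the container is named `ed-kafka` as above, and `kafka-topics` is available inside it, as the manual steps imply. The helper names are illustrative.

```python
import subprocess
from typing import List


def list_topics_command(container: str = "ed-kafka",
                        bootstrap: str = "ed-kafka:9092") -> List[str]:
    """Build the `docker exec` command that lists topics inside the container."""
    return ["docker", "exec", container,
            "kafka-topics", "--list", "--bootstrap-server", bootstrap]


def parse_topics(output: str) -> List[str]:
    """Turn the CLI output (one topic per line) into a list of topic names."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def topic_exists(topic: str) -> bool:
    """Run the command on the host and report whether `topic` is listed."""
    result = subprocess.run(list_topics_command(), capture_output=True,
                            text=True, check=True)
    return topic in parse_topics(result.stdout)
```

With the cluster running and the test topic created, `topic_exists("test-topic")` should return `True`.
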
## Cheat Sheet: Docker Commands

Save these commands for reference throughout the course.

| Action | Command (Run in local terminal) |
| :--- | :--- |
| **Start Cluster** | `docker-compose up` |
| **Stop Cluster** | Press `Ctrl + C` to stop the containers, or run `docker-compose down` to stop and remove them |
| **Check Containers** | `docker ps` |

### Web UI Ports
*   **Jupyter Lab:** [localhost:8888](http://localhost:8888)
*   **Spark UI:** [localhost:4040](http://localhost:4040) (Active only when a SparkSession is running)