# Spark Streaming with PySpark
## Module 9: Apache Kafka Basics & CLI

Before we integrate Kafka with Spark, we need to understand what Kafka is and how to interact with it.

### What is Kafka?
Apache Kafka is a distributed event streaming platform. It acts as a high-throughput, low-latency **message bus** that decouples data producers from data consumers.

### Core Concepts:
1.  **Producer:** The application sending data (e.g., IoT sensor, Web Server).
2.  **Consumer:** The application reading data (e.g., Spark Streaming).
3.  **Broker:** A single Kafka server. A group of brokers forms a **Cluster**.
4.  **Topic:** A category or feed name to which records are published. Think of it as a "Folder" for messages.
5.  **Partition:** Topics are split into partitions to allow parallel processing.
6.  **Offset:** A sequential integer ID assigned to each message within a partition, starting at 0. It is unique only within that partition.
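
To make partitions and offsets concrete, here is a minimal in-memory sketch. It is a toy model, not Kafka's API: the `ToyTopic` class and its `append` method are hypothetical names invented for illustration.

```python
class ToyTopic:
    """Toy stand-in for a Kafka topic: a named set of append-only partitions."""

    def __init__(self, name, num_partitions):
        self.name = name
        # Each partition is an independent, append-only list of messages.
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, message):
        """Append a message to one partition and return its offset."""
        log = self.partitions[partition]
        log.append(message)
        # The offset is simply the message's position within its partition.
        return len(log) - 1


topic = ToyTopic("sensor-data", num_partitions=2)
offsets = [
    topic.append(0, "temp=21"),
    topic.append(0, "temp=22"),
    topic.append(1, "humidity=40"),
]
print(offsets)  # [0, 1, 0]
```

Offsets restart at 0 in each partition, so a message is identified by the combination of topic, partition, and offset rather than by the offset alone.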

## Kafka Architecture - Pub/Sub Model

*   **Publish/Subscribe:** Producers publish messages to a Topic. Consumers subscribe to that Topic.
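*   **Retention:** Unlike a traditional message queue such as RabbitMQ, which typically removes a message once it has been consumed and acknowledged, Kafka stores messages for a configurable period (7 days by default), whether or not they have been read. This lets consumers replay old data.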

```mermaid
graph LR
    P[Producer] -->|Writes| T(Topic: 'sensor-data')
    subgraph Kafka Cluster
        T --> Part0[Partition 0]
        T --> Part1[Partition 1]
    end
    Part0 -->|Reads| C1[Consumer A]
    Part1 -->|Reads| C2[Consumer B]
```
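
The retention and replay behaviour described above can be sketched with a toy log. This is an illustration under simplifying assumptions, not Kafka's implementation: `ToyLog` and its methods are hypothetical, timestamps are plain integers in seconds, and real Kafka expires whole log segments rather than individual records.

```python
class ToyLog:
    """Toy append-only log with time-based retention (not Kafka's API)."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.records = []  # list of (offset, timestamp, value)
        self.next_offset = 0

    def publish(self, value, now):
        """Append a value with the given timestamp."""
        self.records.append((self.next_offset, now, value))
        self.next_offset += 1

    def expire(self, now):
        """Drop records older than the retention window."""
        cutoff = now - self.retention_seconds
        self.records = [r for r in self.records if r[1] >= cutoff]

    def read_from(self, offset):
        """Return values at or after `offset`; reading never removes them."""
        return [value for (off, _, value) in self.records if off >= offset]


seven_days = 7 * 24 * 3600
log = ToyLog(retention_seconds=seven_days)
log.publish("a", now=0)
log.publish("b", now=100)

# Two independent consumers read the same data; neither read deletes anything.
consumer_a = log.read_from(0)
consumer_b = log.read_from(0)

# A consumer can also replay from any retained offset.
replay = log.read_from(1)

# Once the retention window has passed for the first record, it is dropped.
log.expire(now=seven_days + 50)
remaining = log.read_from(0)
print(consumer_a, consumer_b, replay, remaining)
```

The point of the sketch is that consumption and deletion are decoupled: data disappears because it ages out, not because someone read it.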

### **Setup - Accessing Kafka CLI**


We will use the terminal inside our Docker container to run Kafka commands.

**Step 1:** Open a Terminal.

**Step 2:** Connect to the Kafka container:
```bash
docker exec -it ed-kafka /bin/bash
```

#### List Topics (Run via Python wrapper)

Note: These commands are meant to be run in the terminal. The Python cell below does not execute anything against Kafka; it only prints the command so that it is recorded here for documentation purposes.

```python
# Command to list topics.
# We assume the broker in 'ed-kafka' is reachable at localhost:9092 (from inside
# the container, or from the host through a mapped port).
# NOTE: The CLI script must be on your PATH or called from its install location.
# Some distributions name it kafka-topics.sh instead of kafka-topics.
# It is RECOMMENDED to run these in the Docker Terminal as shown in the video.

print("Run this in your Docker Terminal:")
print("kafka-topics --list --bootstrap-server localhost:9092")
```
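
If you do want to run the command from Python rather than print it, one possible sketch uses `subprocess` to call `docker exec` from the host. It assumes the `docker` CLI is installed on the host, the container is named `ed-kafka`, and `kafka-topics` is on the PATH inside it. The helper names `kafka_topics_command` and `run_kafka_topics` are hypothetical.

```python
import shutil
import subprocess

CONTAINER = "ed-kafka"        # container name used in this module
BOOTSTRAP = "localhost:9092"  # broker address as seen inside the container


def kafka_topics_command(*args):
    """Build the `docker exec` command that runs kafka-topics in the container."""
    return [
        "docker", "exec", CONTAINER,
        "kafka-topics", "--bootstrap-server", BOOTSTRAP, *args,
    ]


def run_kafka_topics(*args):
    """Run kafka-topics in the container and return its stdout.

    Returns None when the docker CLI is not installed on this machine.
    """
    if shutil.which("docker") is None:
        return None
    result = subprocess.run(
        kafka_topics_command(*args),
        capture_output=True, text=True, check=False,
    )
    return result.stdout


# Build (but do not execute) the command, so the cell is safe to run anywhere.
command = kafka_topics_command("--list")
print(" ".join(command))
```

With the container running, `run_kafka_topics("--list")` would return the topic names as text. Passing the command as a list avoids shell quoting issues.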

## Cheat Sheet: Kafka CLI Commands

Run these inside the Kafka container.

### 1. Create a Topic
Create a topic named `test-topic` with 1 partition and replication factor of 1.
```bash
kafka-topics --create --topic test-topic --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
```

### 2. List Topics
See all available topics.
```bash
kafka-topics --list --bootstrap-server localhost:9092
```

### 3. Describe Topic
Get details about partitions, leaders, and replicas.
```bash
kafka-topics --describe --topic test-topic --bootstrap-server localhost:9092
```

### 4. Start a Producer (Write Data)
Opens an interactive prompt; each line you type is sent as one message. Press Ctrl+C to exit.
```bash
kafka-console-producer --topic test-topic --bootstrap-server localhost:9092
> Hello Kafka
> This is message 2
```

### 5. Start a Consumer (Read Data)
Reads all messages in the topic from the beginning, then keeps waiting for new ones. Press Ctrl+C to stop.
```bash
kafka-console-consumer --topic test-topic --from-beginning --bootstrap-server localhost:9092
```

## Partitions & Offsets

When you created the topic, you saw the flag `--partitions`.

*   **Scaling:** If you set partitions = 3, Kafka spreads the topic's messages across 3 partitions, which can live on different brokers.
*   **Parallelism:** Spark can run up to 3 tasks in parallel, one per partition, to read from these 3 partitions simultaneously.
*   **Ordering:** Kafka guarantees order **only within a partition**, not across the whole topic.
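
The ordering guarantee is easiest to see with keyed messages. The sketch below is a toy, not Kafka code: Kafka's default partitioner uses a murmur2 hash of the key bytes, while this example substitutes CRC32 as a stand-in, and `choose_partition` is a hypothetical helper.

```python
import zlib


def choose_partition(key, num_partitions):
    """Map a key to a partition with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


num_partitions = 3
partitions = {p: [] for p in range(num_partitions)}

events = [
    ("sensor-1", "t=1"), ("sensor-2", "t=1"),
    ("sensor-1", "t=2"), ("sensor-2", "t=2"),
    ("sensor-1", "t=3"),
]
for key, value in events:
    partitions[choose_partition(key, num_partitions)].append((key, value))

# All events for one key land in one partition, in the order they were sent.
p = choose_partition("sensor-1", num_partitions)
sensor_1_events = [value for key, value in partitions[p] if key == "sensor-1"]
print(sensor_1_events)  # ['t=1', 't=2', 't=3']
```

Because the same key always maps to the same partition, per-key order is preserved even though there is no global order across partitions.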

**Offset:**
*   The first message in Partition 0 has Offset 0.
*   The second message in Partition 0 has Offset 1.
*   Spark tracks these offsets in the **Checkpoint Directory** to know exactly what it has read, so that after a restart it can resume from where it left off.
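
The idea behind offset tracking can be sketched without Spark or Kafka. The example below is a simplified illustration only: Spark's real checkpoint layout is different, and the `offsets.json` file name, the `partition_logs` data, and both helper functions are hypothetical.

```python
import json
import os
import tempfile

# Toy partition logs: the list index of each message is its offset.
partition_logs = {
    0: ["m0", "m1", "m2", "m3"],
    1: ["n0", "n1"],
}


def load_offsets(path):
    """Return the next offset to read per partition (0 if no checkpoint yet)."""
    if not os.path.exists(path):
        return {p: 0 for p in partition_logs}
    with open(path) as f:
        # JSON object keys are strings, so convert them back to ints.
        return {int(p): off for p, off in json.load(f).items()}


def read_batch(path, max_per_partition):
    """Read up to `max_per_partition` new messages per partition, then checkpoint."""
    offsets = load_offsets(path)
    batch = []
    for p, log in partition_logs.items():
        start = offsets[p]
        end = min(start + max_per_partition, len(log))
        batch.extend(log[start:end])
        offsets[p] = end
    with open(path, "w") as f:
        json.dump(offsets, f)
    return batch


checkpoint_dir = tempfile.mkdtemp()
checkpoint_file = os.path.join(checkpoint_dir, "offsets.json")

first = read_batch(checkpoint_file, max_per_partition=2)
# A "restart" re-reads the checkpoint and continues where it left off.
second = read_batch(checkpoint_file, max_per_partition=2)
print(first)   # ['m0', 'm1', 'n0', 'n1']
print(second)  # ['m2', 'm3']
```

Because the next offset per partition is persisted after each batch, the second call neither re-reads nor skips messages.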