

## Message Brokers
Message brokers are middleware that enable communication between different applications by translating messages from the formal messaging protocol of the sender to the formal messaging protocol of the receiver. They are primarily used for decoupling applications and enabling asynchronous communication.

  1. Use Case: Suitable for scenarios where each message needs to be processed by a single consumer. Commonly used in distributed systems, microservices, and event-driven architectures.
  2. Examples: RabbitMQ, ActiveMQ, Azure Service Bus, Amazon SQS, Mosquitto.
  3. Protocols: Brokers typically support standard messaging protocols such as AMQP, MQTT, and STOMP.
  4. Push-based model: Messages are typically pushed to consumers for processing, although some brokers (Amazon SQS, for example) are polled by consumers instead.


Taking Azure Service Bus as an example, it provides two messaging patterns: queues and topics. Queues are used for point-to-point communication, where each message is processed by a single consumer. Topics are used for publish-subscribe communication, where each message is delivered to multiple subscriptions. The choice between queues and topics depends on the use case and requirements of the application. Either way, a message broker does not replace a database, because it is not designed for long-term storage of data.

| **Feature**      | **Queues**                       | **Topics**                               |
|------------------|----------------------------------|------------------------------------------|
| Purpose          | Point-to-point communication     | Publish-subscribe communication          |
| Message Handling | Processed by a single consumer   | Delivered to multiple subscriptions      |
| Use Case         | Task scheduling, load balancing  | Event distribution, notification systems |
| Persistence      | Stored until consumed or expired | Stored until consumed or expired         |
| Ordering         | FIFO message delivery            | Order maintained within a subscription   |
| Delivery         | Single consumer                  | Multiple consumers                       |
| State            | Stateless                        | Stateless                                |
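
The difference between the two patterns can be sketched with a small in-memory model. This is an illustration only, not the Azure Service Bus SDK: the `Queue` and `Topic` classes and their method names are hypothetical.

```python
from collections import deque


class Queue:
    """Point-to-point: each message is handed to exactly one receiver."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Removing the message means no other consumer can see it.
        return self._messages.popleft() if self._messages else None


class Topic:
    """Publish-subscribe: every subscription gets its own copy."""

    def __init__(self):
        self._subscriptions = {}

    def subscribe(self, name):
        self._subscriptions[name] = Queue()
        return self._subscriptions[name]

    def publish(self, message):
        for subscription in self._subscriptions.values():
            subscription.send(message)


orders = Queue()
orders.send("order-1")
first = orders.receive()     # the single consumer gets the message
second = orders.receive()    # nothing left for anyone else

events = Topic()
billing = events.subscribe("billing")
shipping = events.subscribe("shipping")
events.publish("order-created")
```

In this model a topic is simply a fan-out over per-subscription queues, which is why both columns of the table share the same persistence behavior.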



## Streaming Platforms
Streaming platforms are systems that allow the continuous ingestion, processing, and delivery of data streams in real-time. They are essential for applications that require real-time analytics, monitoring, and event-driven architectures.

1. Use Case: Ideal for scenarios where data needs to be processed and analyzed in real-time. Commonly used in IoT, log aggregation, data pipelines, financial services, and social media analytics.
2. Examples: Apache Kafka, Amazon Kinesis, Google Cloud Pub/Sub, Azure Event Hubs.
3. Protocols: Typically a platform-specific protocol, such as the Kafka binary protocol over TCP, or HTTP-based APIs for the managed cloud services.
4. Pull-based model: Consumers pull data from the stream for processing.

Taking Apache Kafka as an example, it provides a distributed streaming platform that can handle high throughput at low latency. Kafka is used for building real-time data pipelines and streaming applications. It allows for both publish-subscribe and point-to-point communication patterns. Kafka persists messages to disk and replicates them across brokers, which provides durability and fault tolerance.

| **Feature**      | **Kafka**                                   |
|------------------|---------------------------------------------|
| Purpose          | Distributed streaming platform              |
| Message Handling | Processed by multiple consumers             |
| Use Case         | Real-time analytics, data pipelines         |
| Persistence      | Durable storage with configurable retention |
| Ordering         | Guaranteed ordering within partitions       |
| Delivery         | Multiple consumers                          |
| State            | Stateful                                    |
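
The pull-based model and the "processed by multiple consumers" row can be illustrated with a toy append-only log. This is a sketch under simplifying assumptions (one partition, in memory); `PartitionLog` is a hypothetical name and not part of any Kafka client.

```python
class PartitionLog:
    """Toy append-only log; not Kafka's storage format."""

    def __init__(self):
        self._records = []

    def append(self, value):
        self._records.append(value)
        return len(self._records) - 1  # offset of the new record

    def read(self, offset, max_records=10):
        # Reading does not remove anything, so many consumers can read the same data.
        return self._records[offset:offset + max_records]


log = PartitionLog()
for reading in ["t=20.1", "t=20.4", "t=20.9"]:
    log.append(reading)

# Two independent consumers pull at their own pace.
analytics_position = 0
alerting_position = 2
analytics_batch = log.read(analytics_position)
alerting_batch = log.read(alerting_position)
```

Because the log keeps the data after it is read, each consumer only needs to remember its own position.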


---


# Apache Kafka

Apache Kafka is a distributed streaming platform that is designed to handle high throughput and low latency. It is widely used for building real-time data pipelines and streaming applications. Kafka allows for both publish-subscribe and point-to-point communication patterns, making it a versatile choice for various use cases.

### Key Features of Apache Kafka:
1. **Distributed Architecture**: Kafka's distributed nature allows it to scale horizontally, handling large volumes of data across multiple servers.
2. **High Throughput**: Kafka can process millions of messages per second, making it suitable for applications with high data ingestion rates.
3. **Low Latency**: Kafka ensures low latency in message delivery, which is crucial for real-time applications.
4. **Durable Storage**: Kafka provides durable storage with configurable retention policies, ensuring data persistence and fault tolerance.
5. **Fault Tolerance**: Kafka replicates data across multiple nodes, ensuring data availability even in the event of hardware failures.
6. **Scalability**: Kafka can easily scale by adding more brokers to the cluster, accommodating increasing data loads.
7. **Flexibility**: Kafka supports both publish-subscribe and point-to-point communication patterns, offering flexibility in data consumption and processing.
8. **State Management**: Kafka can maintain state information, which is essential for complex event processing and stateful applications.

### Common Use Cases:
- **Real-time Analytics**: Kafka is used to process and analyze data in real-time, providing immediate insights and enabling quick decision-making.
- **Data Pipelines**: Kafka serves as a backbone for data pipelines, ensuring reliable data flow between different systems and applications.
- **Log Aggregation**: Kafka collects and aggregates logs from various sources, making it easier to monitor and analyze system behavior.
- **Event Sourcing**: Kafka captures and stores events, allowing applications to reconstruct state and maintain consistency.

Overall, Apache Kafka is a powerful and flexible streaming platform that is well-suited for modern data-driven applications.

---

## Architecture


![Kafka Architecture](../images/kafka-architecture.png)


### Components

1. **Producers**: Publish messages to Kafka topics.
2. **Kafka Cluster**: A group of brokers working together.
3. **Brokers**: Servers in the Kafka cluster responsible for storing partitions and serving client requests.
4. **KRaft (Kafka Raft Metadata Quorum)**: Manages metadata using the Raft consensus protocol, replacing ZooKeeper.
5. **Topics and Partitions**: Topics are logical channels for categorizing messages, and partitions are subdivisions of a topic for scalability.
6. **Consumer Groups**: Sets of consumers that jointly subscribe to topics and share the work of processing messages.

### Kafka Workflow
1. Producers push messages to topics in the Kafka cluster.
2. Brokers within the cluster store and replicate data across partitions.
3. Consumers in a consumer group consume messages from partitions for processing.
4. KRaft handles metadata management and leader election, keeping the cluster available when brokers fail.




---

## Kafka Cluster


![Kafka Cluster](../images/kafka-cluster.png)


### Kafka Cluster
- A **Kafka Cluster** is a distributed system composed of multiple brokers that work together to store and process real-time data streams.
- It provides:
  - **Scalability**: Handles increased workloads seamlessly by adding more brokers to the cluster.
  - **Fault Tolerance**: Replicates partitions across brokers so that another broker can take over if one fails.
  - **High Availability**: Ensures seamless failover and leader election to maintain continuous service.
  - **Partition Management**: Partitions are distributed across brokers for load balancing and parallel processing. For each partition, one broker serves as the leader and the others hold follower replicas.
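
How partitions and their replicas might be spread across brokers can be sketched as follows. This is a simplified round-robin placement for illustration; Kafka's actual assignment logic also takes factors such as rack awareness into account, and `assign_replicas` is a hypothetical helper.

```python
def assign_replicas(num_partitions, brokers, replication_factor):
    """Round-robin placement sketch; the first broker in each list is the leader."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    assignment = {}
    for partition in range(num_partitions):
        assignment[partition] = [
            brokers[(partition + i) % len(brokers)]
            for i in range(replication_factor)
        ]
    return assignment


# Three partitions on brokers 1, 2 and 3, each stored twice.
layout = assign_replicas(num_partitions=3, brokers=[1, 2, 3], replication_factor=2)
```

With this layout every broker leads one partition and follows another, so both load and failure impact are spread evenly.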

---

## Kafka Broker

![Kafka Broker](../images/kafka-broker.png)

- A **Kafka Broker** is a server within a Kafka Cluster that is responsible for storing and managing data streams (messages) sent by producers and consumed by consumers.
- Each broker serves as a node in the distributed Kafka architecture and contributes to scalability, fault tolerance, and data distribution.

#### Key Responsibilities:
- **Partition Storage**:
  - Each broker stores one or more partitions of topics, ensuring data durability and fault tolerance through replication.
  - Brokers are assigned leader and follower roles for different partitions to manage read/write operations effectively.

- **Processing Requests**:
  - **From Producers**: Brokers handle incoming messages, appending them to the correct partition within a topic.
  - **To Consumers**: Brokers serve messages to consumers from the specified partitions.

- **Leader and Follower Roles**:
  - Each partition has a leader broker that handles all writes and, by default, all reads for that partition (since Kafka 2.4, consumers can optionally fetch from follower replicas).
  - Follower brokers replicate the leader's partition data for fault tolerance and act as backups in case the leader fails.

- **Metadata Management**:
  - Brokers maintain metadata about topics, partitions, and replicas, coordinating with the metadata manager (like KRaft or ZooKeeper) for updates.

- **Scalability and Load Balancing**:
  - By distributing partitions across multiple brokers, Kafka achieves high throughput and parallel processing.
  - Brokers balance the workload by managing their assigned partitions.
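
The leader and follower roles described above imply a failover rule: when a leader dies, a follower that is up to date takes over. The sketch below assumes Kafka's notion of in-sync replicas (followers that have caught up with the leader); the function name and data shapes are illustrative only.

```python
def elect_leader(replicas, in_sync, live_brokers):
    """Return the first replica that is both in sync and alive, else None."""
    for broker in replicas:
        if broker in in_sync and broker in live_brokers:
            return broker
    return None


replicas = [1, 2, 3]   # broker 1 is the preferred leader
in_sync = {1, 2}       # broker 3 has fallen behind

leader_before = elect_leader(replicas, in_sync, live_brokers={1, 2, 3})
leader_after = elect_leader(replicas, in_sync, live_brokers={2, 3})   # broker 1 failed
no_leader = elect_leader(replicas, in_sync, live_brokers={3})         # only a stale replica is left
```

Returning `None` in the last case models a partition that stays offline rather than promoting a replica that is missing data.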

---

## ZooKeeper and KRaft in Kafka

Kafka uses **ZooKeeper** (in older versions) or **KRaft (Kafka Raft)** (in newer versions) to manage metadata, leader elections, and cluster coordination.

![Zookeeper/KRaft](../images/zookeeper-kraft.png)


#### ZooKeeper

- **Overview**:
  - A centralized service used for distributed system coordination.
  - Manages metadata such as brokers, topics, partitions, and their leaders.

- **Key Responsibilities**:
  - **Leader Election**: Coordinates the election of the controller broker, which in turn assigns a leader to each partition.
  - **Metadata Management**: Stores metadata about brokers, topics, and partitions.
  - **Cluster Coordination**: Ensures consistency and fault tolerance across the Kafka cluster.

- **Challenges with ZooKeeper**:
  - **Dependency**: Kafka clusters depend on an external ZooKeeper service.
  - **Scalability**: ZooKeeper struggles with large-scale Kafka deployments.
  - **Complexity**: Requires additional setup and maintenance.



#### KRaft (Kafka Raft)

- **Overview**:
  - Introduced as a replacement for ZooKeeper.
  - Built into Kafka, eliminating the need for an external dependency.

- **Key Responsibilities**:
  - **Leader Election**: Uses the Raft consensus algorithm to elect the active controller, which then assigns partition leaders.
  - **Metadata Management**: Handles metadata natively within Kafka.
  - **Fault Tolerance**: Ensures high availability and resilience through the Raft protocol.

- **Advantages of KRaft**:
  - **No External Dependency**: Simplifies the Kafka architecture.
  - **Improved Scalability**: Designed for large-scale Kafka clusters.
  - **Reduced Latency**: Optimized metadata propagation across brokers.


#### Transition from ZooKeeper to KRaft

| **Feature**               | **ZooKeeper**                       | **KRaft**                          |
|---------------------------|-------------------------------------|-------------------------------------|
| Metadata Management       | External (ZooKeeper service)       | Native to Kafka                     |
| Leader Election Algorithm | ZooKeeper-based                    | Raft Consensus Algorithm            |
| Setup Complexity          | Requires separate ZooKeeper setup  | Simplifies architecture             |
| Scalability               | Limited by ZooKeeper performance   | Optimized for large-scale clusters  |
| Dependency                | Requires ZooKeeper                 | No external dependency              |
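
Raft relies on majority agreement among the controller nodes that form the metadata quorum, which determines how many controller failures a KRaft cluster can survive. A minimal sketch of that arithmetic (the function names are illustrative, not a Kafka API):

```python
def majority(voters):
    """Smallest number of voters that forms a majority."""
    return voters // 2 + 1


def tolerated_failures(voters):
    """How many voters can fail while a majority is still reachable."""
    return voters - majority(voters)


quorum_sizes = {n: tolerated_failures(n) for n in (1, 3, 4, 5)}
```

Going from three to four controllers does not add any failure tolerance, which is why odd quorum sizes are the usual choice.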




#### Visual Workflow


```plaintext
ZooKeeper:
Kafka Cluster → ZooKeeper (Manages metadata and leader election)

KRaft:
Kafka Cluster (Integrated metadata and leader election)
```


---

## Topics and Partitions

A **Kafka Topic** is a category or channel where messages are published by producers and consumed by consumers. Each topic is divided into multiple **partitions**, enabling scalability, parallel processing, and fault tolerance.

![Topics and Partitions](../images/topics-partitions.png)

#### Key Concepts

1. **Topic**:
   - A Kafka Topic is a logical channel to which producers send messages.
   - Topics are used to categorize messages, making it easier to consume relevant data.

2. **Partition**:
   - Each topic is divided into one or more partitions.
   - A partition is an ordered, immutable sequence of messages.
   - Messages within a partition are assigned unique offsets (0, 1, 2, etc.) to preserve order.

3. **Parallelism**:
   - Partitions enable parallelism as multiple consumers in a consumer group can process data from different partitions simultaneously.

4. **Fault Tolerance**:
   - Partitions are replicated across multiple brokers to ensure high availability.



#### Example from the Image

1. **Partition 0**:
   - Contains messages with offsets: 0, 1, 2, 3, 4.
   - Represents one slice of the topic's data.

2. **Partition 1**:
   - Contains messages with offsets: 0, 1, 2, 3.
   - A separate slice of data that can be processed independently.

3. **Partition 2**:
   - Contains messages with offsets: 0, 1, 2, 3.
   - Another independent slice of the topic's data.



#### How It Works

1. **Producers**:
   - Send messages to a topic.
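   - Messages with a key go to the partition chosen by hashing the key; messages without a key are spread across partitions (round-robin in older clients, sticky batching since Kafka 2.4).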

2. **Consumers**:
   - Consume messages from partitions.
   - Each consumer in a consumer group is assigned one or more partitions.

3. **Order Guarantee**:
   - Kafka guarantees the order of messages within a partition but not across partitions.

4. **Scalability**:
   - Adding more partitions increases parallelism, allowing more consumers to process data.
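
Key-based distribution can be sketched as a hash of the key modulo the partition count. Note the assumption: `zlib.crc32` stands in for the hash here, whereas Kafka's default partitioner uses murmur2, so the partition numbers will not match a real cluster.

```python
import zlib


def choose_partition(key, num_partitions):
    """Map a key to a partition; equal keys always land on the same partition."""
    # Assumption: crc32 is a stand-in hash, not the one Kafka uses.
    return zlib.crc32(key) % num_partitions


first = choose_partition(b"customer-42", 3)
second = choose_partition(b"customer-42", 3)
spread = {choose_partition(f"customer-{i}".encode(), 3) for i in range(100)}
```

Because the result depends on `num_partitions`, adding partitions to an existing topic can move a key to a different partition, which breaks per-key ordering for messages written before and after the change.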



#### Benefits of Partitioning

- **Parallel Processing**:
  - Multiple consumers can process data simultaneously, increasing throughput.

- **Fault Tolerance**:
  - Replication of partitions ensures data availability even if a broker fails.

- **Scalability**:
  - Adding partitions allows Kafka to handle increased producer and consumer workloads.



---

## Kafka Topic Messages and Offsets

In Kafka, each partition of a **topic** is an append-only log of events or messages, and each message is assigned an **offset** that is unique within its partition.


![Kafka Topic Messages and Offsets](../images/offsets.png)


#### Key Components

1. **Topic**:
   - A topic is a logical channel for organizing and storing messages.
   - Messages in each partition of a topic are stored as an immutable, ordered sequence.

2. **Messages (M1, M2, ..., Mn)**:
   - Messages are the data sent by producers.
   - Each message in a topic represents an event, transaction, or log entry.

3. **Offsets**:
   - Each message within a partition of a topic has a unique sequential ID called an **offset**.
   - Offsets act as a pointer to messages, allowing consumers to track where they left off.
   - Example: The first message is assigned offset `0`, the second `1`, and so on.


#### Example from the Image

- **Topic**:
  - Contains messages `M1, M2, ..., Mn` stored sequentially.

- **Offsets**:
  - Messages are indexed using offsets:
    - Message `M1` → Offset `0`
    - Message `M2` → Offset `1`
    - Message `Mn` → Offset `n - 1`


#### How Offsets Work

1. **Producers**:
   - Append messages to the end of a partition; each new message receives the next sequential offset.

2. **Consumers**:
   - Read messages from a partition, using offsets to keep track of the read position.
   - Example: If a consumer processes message `M2` (offset `1`), it knows to fetch offset `2` next.

3. **Log Retention**:
   - Kafka retains messages for a configurable period or until storage limits are reached, whether or not they have been consumed.
   - Consumers must track offsets to avoid re-reading messages, and must keep up with retention to avoid missing messages that are deleted before they are read.
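
The interaction between offsets and retention can be sketched with a toy log that keeps only its newest records. `RetainedLog` is a hypothetical class, and count-based retention is a simplification of Kafka's time- and size-based policies. Resuming at the oldest retained record mimics what a consumer configured with `auto.offset.reset=earliest` would do.

```python
class RetainedLog:
    """Toy log that keeps only the newest `max_records` records."""

    def __init__(self, max_records):
        self.max_records = max_records
        self.start_offset = 0   # offset of the oldest retained record
        self._records = []

    def append(self, value):
        self._records.append(value)
        if len(self._records) > self.max_records:
            # Retention drops the oldest record; offsets are never reused.
            self._records.pop(0)
            self.start_offset += 1
        return self.start_offset + len(self._records) - 1

    def fetch(self, offset):
        # A consumer that fell behind retention resumes at the oldest record.
        offset = max(offset, self.start_offset)
        return offset, self._records[offset - self.start_offset:]


log = RetainedLog(max_records=3)
offsets = [log.append(f"M{n}") for n in range(1, 6)]   # M1..M5

committed = 1   # the consumer stopped after M1, so it wants offset 1 next
first_offset, batch = log.fetch(committed)
next_offset = first_offset + len(batch)
```

Here the consumer never sees `M2`: it was deleted by retention before the consumer came back for it.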


#### Benefits of Offsets

- **Message Ordering**:
  - Kafka guarantees that messages within a partition are read in the order of their offsets.

- **Tracking Progress**:
  - Consumers use offsets to track their progress in processing messages.

- **Reprocessing**:
  - Consumers can reprocess messages by resetting the offset to an earlier value.
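
Reprocessing amounts to moving the read position backwards. A minimal sketch, assuming a single in-memory partition; `LogReader` and its methods are illustrative names rather than a real client API:

```python
class LogReader:
    """Toy consumer that remembers its position in a list of records."""

    def __init__(self, records):
        self._records = records
        self.position = 0

    def poll(self):
        # Return everything from the current position and advance past it.
        batch = self._records[self.position:]
        self.position = len(self._records)
        return batch

    def seek(self, offset):
        # Rewinding is only a position change; the log itself is untouched.
        self.position = offset


reader = LogReader(["M1", "M2", "M3"])
first_pass = reader.poll()
caught_up = reader.poll()
reader.seek(1)            # go back to offset 1 (message M2)
replay = reader.poll()
```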


---

## Kafka Message/Record Structure

A **Kafka Message** is the fundamental unit of data in Kafka, created by the producer and sent to a topic. Each message contains multiple components, providing metadata and flexibility for reliable processing.


![Kafka Message](../images/record-message.png)



#### Components of a Kafka Message

1. **Key (Binary)**:
   - Represents the key associated with the message.
   - Used to determine the partition for the message.
   - Can be `null` if partitioning is not required.

2. **Value (Binary)**:
   - The actual data (payload) of the message.
   - Can be `null`; in log-compacted topics, a `null` value acts as a tombstone that marks the key for deletion.

3. **Compression Type**:
   - Defines the type of compression used for the message.
   - Supported types: `none`, `gzip`, `snappy`, `lz4`, `zstd`.

4. **Headers (Optional)**:
   - Additional metadata in the form of key-value pairs.
   - Useful for adding context without modifying the message payload.

5. **Partition + Offset**:
   - Specifies the partition the message belongs to and its unique offset within the partition.
   - Guarantees the order of messages within the partition.

6. **Timestamp**:
   - Represents the time the message was created.
   - Can be set by the producer (user-defined) or assigned by the Kafka broker (system-defined).
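
The components listed above can be summarized as a plain data structure. This is a sketch for orientation only: the `Record` class and its field names are hypothetical and do not mirror the wire format or any client library's record type.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Record:
    value: Optional[bytes]                        # payload; may be None
    key: Optional[bytes] = None                   # used to pick the partition
    headers: Tuple[Tuple[str, bytes], ...] = ()   # optional key-value metadata
    compression_type: str = "none"                # none, gzip, snappy, lz4 or zstd
    timestamp_ms: Optional[int] = None            # producer- or broker-assigned
    # Known only after the broker has appended the record:
    partition: Optional[int] = None
    offset: Optional[int] = None


record = Record(
    key=b"customer-42",
    value=b'{"total": 19.99}',
    headers=(("trace-id", b"abc123"),),
    timestamp_ms=1700000000000,
)
```

The partition and offset stay empty on the producer side because they are assigned when the message is written to the log.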


---

## Consumer Groups

In Kafka, **consumer groups** are a mechanism to enable horizontal scalability and fault tolerance when consuming messages from topics.


![Consumer Groups](../images/consumer-groups.png)


#### Key Concepts

1. **Topic**:
   - A logical channel that organizes data.
   - Each topic is divided into multiple **partitions**, allowing data to be distributed and processed in parallel.

2. **Partitions**:
   - A single partition is a sequence of ordered messages.
   - Partitions are processed independently and assigned to consumers within a group.

3. **Consumer Groups**:
   - A consumer group is a collection of consumers working together to consume data from a topic.
   - Each consumer in the group processes messages from a unique set of partitions.


#### Example from the Image

1. **Partitions**:
   - **Partition 0**: Contains messages with offsets `0, 1, 2, ..., 9`.
   - **Partition 1**: Contains messages with offsets `0, 1, 2, ..., 6`.
   - **Partition 2**: Contains messages with offsets `0, 1, 2, ..., 10`.
   - **Partition 3**: Contains messages with offsets `0, 1, 2, ..., 5`.

2. **Consumer Group**:
   - **Consumer C1**:
     - Assigned **Partition 0** and **Partition 3**.
     - Processes all messages from these partitions.
   - **Consumer C2**:
     - Assigned **Partition 1**.
     - Processes all messages from this partition.
   - **Consumer C3**:
     - Assigned **Partition 2**.
     - Processes all messages from this partition.


#### Workflow of Consumer Groups

1. **Message Assignment**:
   - Kafka assigns partitions to consumers in the group.
   - Each partition is consumed by only one consumer in the group at a time.

2. **Parallel Processing**:
   - Messages from different partitions are processed in parallel by different consumers.

3. **Rebalancing**:
   - If a consumer joins or leaves the group, Kafka reassigns partitions to maintain balanced consumption.
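
Partition assignment and rebalancing can be sketched with a simple round-robin rule. Kafka ships several assignment strategies; this simplified version is chosen because it reproduces the layout in the example above. `assign_round_robin` is a hypothetical helper.

```python
def assign_round_robin(partitions, consumers):
    """Deal partitions out to consumers in turn, one owner per partition."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        consumer = consumers[index % len(consumers)]
        assignment[consumer].append(partition)
    return assignment


# Initial assignment: four partitions, three consumers.
before = assign_round_robin([0, 1, 2, 3], ["C1", "C2", "C3"])

# C2 leaves the group, so its partition is handed to the remaining consumers.
after = assign_round_robin([0, 1, 2, 3], ["C1", "C3"])

# More consumers than partitions: the extra consumer receives nothing.
oversized = assign_round_robin([0, 1], ["C1", "C2", "C3"])
```

Recomputing the whole assignment is the simplest form of rebalancing; it can move partitions between consumers that did not fail, which is a cost that stickier strategies try to avoid.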


#### Benefits of Consumer Groups

1. **Horizontal Scalability**:
   - Adding consumers to a group spreads partitions across more workers, up to one consumer per partition; consumers beyond that number stay idle.

2. **Fault Tolerance**:
   - If a consumer fails, its assigned partitions are reassigned to other consumers in the group.

3. **Order Guarantee**:
   - Kafka ensures messages within a partition are consumed in order by the assigned consumer.


#### Summary Workflow

```plaintext
Topic → Partitions → Consumer Group (C1, C2, C3)
Partition 0 → C1
Partition 1 → C2
Partition 2 → C3
Partition 3 → C1
```

---