# Kafka
- [Doc](https://kafka.apache.org/documentation/)
- [Code](https://github.com/apache/kafka)
  - D:\workspace\rtfsc\kafka
- [The Internals of Apache Kafka](https://books.japila.pl/kafka-internals/)

> Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.

Books and papers:
- Kafka: a Distributed Messaging System for Log Processing, 2011.
- I Heart Logs: Event Data, Stream Processing, and Data Integration, 2014.
- Kafka: The Definitive Guide, 2nd Edition, 2021.
  - 1st edition book: 0.9.0.1, 0.10.0; version in practice: 3.4.0.
  - 2nd edition book: kafka_2.13-2.7.0. (P.23)
- Kafka Streams in Action, 2018.
- Kafka Connect, 2023.

## References
1 Getting Started:

2 APIs:
- Admin API
- Producer API
- Consumer API
- Kafka Streams API
- Kafka Connect API

3 Configuration:
- Broker Configs
- Topic Configs
- Producer Configs
- Consumer Configs
- Kafka Connect Configs
  - Source Connector Configs
  - Sink Connector Configs
- Kafka Streams Configs
- AdminClient Configs
- MirrorMaker Configs
- System Properties
- Tiered Storage Configs

4 Design:
- 4.3 Efficiency
  - [`sendfile(2)`](https://man7.org/linux/man-pages/man2/sendfile.2.html), [Efficient data transfer through zero copy](https://developer.ibm.com/articles/j-zerocopy/)
  - Java: `java.nio.channels.FileChannel#transferTo`
- 4.6 Message Delivery Semantics: https://kafka.apache.org/documentation/#semantics
  - Since 0.11.0.0, the Kafka producer also supports **an idempotent delivery option** which guarantees that resending will not result in duplicate entries in the log.
  - Also beginning with 0.11.0.0, the producer supports **the ability to send messages to multiple topic partitions using transaction-like semantics**: i.e. either all messages are successfully written or none of them are.
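
A minimal sketch of the `FileChannel#transferTo` call noted under 4.3. It copies a temporary file into another file channel; a broker would target a socket channel instead, so the file target, the class name `ZeroCopyDemo`, and the helper `transfer` are assumptions made only to keep the example self-contained.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class ZeroCopyDemo {
    // Hypothetical helper: moves every byte of `source` into `target`
    // with FileChannel#transferTo, without an intermediate user-space buffer
    // where the platform supports it.
    static long transfer(Path source, Path target) throws IOException {
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(target,
                     StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE,
                     StandardOpenOption.TRUNCATE_EXISTING)) {
            long position = 0;
            long size = in.size();
            // transferTo may move fewer bytes than requested, so loop.
            while (position < size) {
                position += in.transferTo(position, size - position, out);
            }
            return position;
        }
    }

    public static void main(String[] args) throws IOException {
        Path source = Files.createTempFile("segment", ".log");
        Path target = Files.createTempFile("copy", ".log");
        Files.write(source, "hello kafka".getBytes(StandardCharsets.UTF_8));
        System.out.println("transferred " + transfer(source, target) + " bytes");
    }
}
```

Whether the JVM actually uses `sendfile(2)` underneath depends on the operating system and the channel types involved.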

5 Implementation:

6 Operations:

7 Security:

8 Kafka Connect:

9 Kafka Streams:

# Model
- topic
- partition
- replica
  - leader
  - follower
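
A rough sketch of how a keyed record maps to one partition of a topic: hash the key bytes, then take the result modulo the partition count. `Arrays.hashCode` is a stand-in chosen for this example, not the hash function the real client uses, and `PartitionSketch` and `partitionFor` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class PartitionSketch {
    // Maps a record key to a partition index in [0, numPartitions).
    // Assumption: Arrays.hashCode stands in for the client's real hash.
    static int partitionFor(String key, int numPartitions) {
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        // Clear the sign bit so the modulo result is never negative.
        return (hash & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"user-1", "user-2", "user-1"}) {
            System.out.println(key + " -> partition " + partitionFor(key, 6));
        }
    }
}
```

The property that matters is determinism: the same key always lands on the same partition, which is what gives per-key ordering. It also shows why changing the partition count remaps keys.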

## Physical Storage
- tiered storage
- partition allocation
- file management
- file format
- indexes
- compaction `procedure`
- deleted events
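
The compaction and deleted-events items can be sketched as follows: keep only the latest event per key, and treat a `null` value as a tombstone that removes the key. This is a simplification. A real broker compacts segment by segment and retains tombstones for a configurable period before dropping them. `CompactionSketch`, `Event`, and `compact` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CompactionSketch {
    static final class Event {
        final long offset;
        final String key;
        final String value; // null marks a tombstone (delete)

        Event(long offset, String key, String value) {
            this.offset = offset;
            this.key = key;
            this.value = value;
        }
    }

    // Returns the surviving events in offset order: the latest event for
    // each key, minus keys whose latest event is a tombstone.
    static List<Event> compact(List<Event> log) {
        Map<String, Long> latestOffset = new HashMap<>();
        for (Event e : log) {
            latestOffset.put(e.key, e.offset);
        }
        List<Event> compacted = new ArrayList<>();
        for (Event e : log) {
            boolean isLatest = latestOffset.get(e.key) == e.offset;
            if (isLatest && e.value != null) {
                compacted.add(e);
            }
        }
        return compacted;
    }

    public static void main(String[] args) {
        List<Event> log = new ArrayList<>();
        log.add(new Event(0, "a", "1"));
        log.add(new Event(1, "b", "1"));
        log.add(new Event(2, "a", "2"));
        log.add(new Event(3, "b", null)); // tombstone for "b"
        for (Event e : compact(log)) {
            System.out.println(e.offset + " " + e.key + "=" + e.value);
        }
    }
}
```

Surviving events keep their original offsets, so compaction leaves gaps in the offset sequence rather than renumbering.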

# Components
- broker, controller
  - ZooKeeper, KRaft
- client
  - producer
  - consumer
  - consumer group
  - admin client

# Procedures

## Replication

## Request Processing
- [Kafka Protocol Guide](https://kafka.apache.org/protocol.html)
```
- Protocol Primitive Types
- grammars
- Common Request and Response Structure
- Structure
- Record Batch
- Constants
	- Error Codes
	- Api Keys
- The Messages
```
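
One of the protocol primitive types is the variable-length integer, which uses zig-zag encoding in the style of Protocol Buffers. Below is a minimal sketch of encoding and decoding a 32-bit value that way. `VarintSketch`, `writeVarint`, and `readVarint` are hypothetical names, not the client library's own utilities.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

class VarintSketch {
    // Encodes a signed 32-bit int as a zig-zag varint (1 to 5 bytes).
    static byte[] writeVarint(int value) {
        // Zig-zag: small magnitudes, positive or negative, become small
        // unsigned numbers (0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...).
        int v = (value << 1) ^ (value >> 31);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((v & 0xFFFFFF80) != 0) {
            out.write((v & 0x7F) | 0x80); // low 7 bits, continuation bit set
            v >>>= 7;
        }
        out.write(v); // final byte, continuation bit clear
        return out.toByteArray();
    }

    // Decodes one zig-zag varint from the buffer's current position.
    static int readVarint(ByteBuffer buf) {
        int raw = 0;
        int shift = 0;
        byte b;
        do {
            b = buf.get();
            raw |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (raw >>> 1) ^ -(raw & 1); // undo zig-zag
    }

    public static void main(String[] args) {
        StringBuilder hex = new StringBuilder();
        for (byte b : writeVarint(300)) {
            hex.append(String.format("%02x ", b));
        }
        System.out.println(hex.toString().trim()); // prints: d8 04
    }
}
```

The sketch does no bounds or overflow checking on malformed input, which a production decoder would need.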

requests:
- produce request
- fetch request
- admin request

# Dev
- Kafka Producers: writing message to Kafka
- Kafka Consumers: reading data from Kafka
- reliable data delivery
- exactly-once semantics
  - idempotent producer
  - transactions
- building data pipelines
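
The idempotent producer item rests on a simple idea: each producer has an ID and numbers what it sends, so the log can recognize a retry it has already written. The sketch below simulates that on one partition. It is a deliberately simplified model, not broker code, and all names (`IdempotentLogSketch`, `append`, `entries`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class IdempotentLogSketch {
    private final List<String> log = new ArrayList<>();
    // Last sequence number accepted from each producer ID.
    private final Map<Long, Integer> lastSequence = new HashMap<>();

    // Returns true if the value was appended, false if it was recognized
    // as a duplicate of an already written message (for example, a retry).
    boolean append(long producerId, int sequence, String value) {
        int expected = lastSequence.getOrDefault(producerId, -1) + 1;
        if (sequence < expected) {
            return false; // duplicate: acknowledge, but do not write again
        }
        if (sequence > expected) {
            throw new IllegalStateException(
                    "gap in sequence: expected " + expected + ", got " + sequence);
        }
        log.add(value);
        lastSequence.put(producerId, sequence);
        return true;
    }

    List<String> entries() {
        return new ArrayList<>(log);
    }

    public static void main(String[] args) {
        IdempotentLogSketch partition = new IdempotentLogSketch();
        partition.append(1L, 0, "a");
        partition.append(1L, 0, "a"); // retry after a lost acknowledgement
        partition.append(1L, 1, "b");
        System.out.println(partition.entries()); // prints: [a, b]
    }
}
```

Transactions build on the same producer identity to make writes to several partitions atomic; that part is not modeled here.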

# Ops
- installation
- managing Kafka programmatically
- cross-cluster data mirroring
- securing Kafka
- administering Kafka
- monitoring Kafka

## Installation

### Strimzi
https://strimzi.io/
> Strimzi provides a way to run an Apache Kafka cluster on Kubernetes in various deployment configurations.

Usage example: 
- [[book.Vert.x in Action#5.7 Designing a reactive application]]
- [[book.Reactive Systems in Java#5.4.2 The Event Bus The Backbone]]

### Confluent
- [A Complete Comparison of Apache Kafka vs Confluent](https://www.confluent.io/apache-kafka-vs-confluent/): Used by over 70% of the Fortune 500, Apache Kafka has become the foundational platform for streaming data, but self-supporting the open source project puts you in the business of managing low-level data infrastructure. With Kafka at its core, Confluent offers *complete, fully managed, cloud-native data streaming* that's available everywhere your data and applications reside.
- [Quick Start for Confluent Platform](https://docs.confluent.io/platform/current/platform-quickstart.html#ce-docker-quickstart): In this quick start, you create *Apache Kafka*® topics, use *Kafka Connect* to generate mock data to those topics, and create *ksqlDB* streaming queries on those topics. You then go to *Confluent Control Center* to monitor and analyze the event streaming queries. When you finish, you’ll have a real-time app that consumes and processes data streams by using familiar SQL statements.
	- As of Confluent Platform 7.5, ZooKeeper is deprecated for new deployments
- [Confluent Platform All-In-One](https://github.com/confluentinc/cp-all-in-one): docker-compose.yml files for cp-all-in-one, cp-all-in-one-community, cp-all-in-one-cloud, Apache Kafka Confluent Platform.

## Configuration

## Monitoring

Prometheus
- [kafka_exporter](https://github.com/danielqsj/kafka_exporter)
- [JMX Exporter](https://github.com/prometheus/jmx_exporter)

- Burrow: lag monitoring
- Xinfra Monitor (formerly Kafka Monitor): end-to-end monitoring
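
Consumer lag, the quantity lag monitoring tracks, is the distance between a partition's log end offset and the consumer group's committed offset. A minimal sketch of that arithmetic follows. Treating a partition with no committed offset as offset 0 is an assumption of this example, and `LagSketch` and `lagPerPartition` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class LagSketch {
    // Computes lag per partition as logEndOffset - committedOffset.
    // Assumption: a partition without a committed offset counts from 0.
    static Map<Integer, Long> lagPerPartition(Map<Integer, Long> logEndOffsets,
                                              Map<Integer, Long> committedOffsets) {
        Map<Integer, Long> lag = new TreeMap<>();
        for (Map.Entry<Integer, Long> e : logEndOffsets.entrySet()) {
            long committed = committedOffsets.getOrDefault(e.getKey(), 0L);
            lag.put(e.getKey(), Math.max(0L, e.getValue() - committed));
        }
        return lag;
    }

    public static void main(String[] args) {
        Map<Integer, Long> end = new HashMap<>();
        end.put(0, 120L);
        end.put(1, 80L);
        Map<Integer, Long> committed = new HashMap<>();
        committed.put(0, 100L);
        committed.put(1, 80L);
        System.out.println(lagPerPartition(end, committed)); // prints: {0=20, 1=0}
    }
}
```

A single lag number is a snapshot; whether it is growing over time is usually the more useful signal.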

# Kafka Streams
https://kafka.apache.org/documentation/streams/

> Kafka Streams is **a client library for building applications and microservices, where the input and output data are stored in Kafka clusters**. It combines the simplicity of writing and deploying standard Java and Scala applications on the client side with the benefits of Kafka's server-side cluster technology.

1. Concepts

- Stream Processing Topology
	- Stream
	- Processor: source, sink
- Time: event time, processing time, ingestion time
- Duality of Streams and Tables
- Aggregation
- Windowing
- State: state store
- Processing Guarantees
	- exactly-once: transactional and idempotent producer (since 0.11.0.0), KIP-129: Streams Exactly-Once Semantics
	- exactly-once v2: KIP-447: Producer scalability for exactly once semantics
	- out-of-order handling: versioned state store, offset-based semantics/timestamp-based semantics
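
To make the windowing and event-time items concrete, here is a small sketch of tumbling-window assignment: each event goes to the window whose start is its timestamp rounded down to a multiple of the window size. It uses plain collections rather than the Kafka Streams API, assumes non-negative millisecond timestamps, and `TumblingWindowSketch`, `windowStart`, and `countPerWindow` are hypothetical names.

```java
import java.util.Map;
import java.util.TreeMap;

class TumblingWindowSketch {
    // Start of the window [start, start + sizeMs) that contains the timestamp.
    static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - (timestampMs % sizeMs);
    }

    // Counts events per window, keyed by window start.
    static Map<Long, Integer> countPerWindow(long[] eventTimesMs, long sizeMs) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : eventTimesMs) {
            counts.merge(windowStart(ts, sizeMs), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // 4000 arrives after 5000: out of order, but assigned by event time.
        long[] eventTimes = {1000, 5000, 4000, 9999, 10000};
        System.out.println(countPerWindow(eventTimes, 5000));
        // prints: {0=2, 5000=2, 10000=1}
    }
}
```

Because assignment depends only on the event's own timestamp, a late record still lands in the correct window. How long a window stays open to accept such records is a separate policy question.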

2. Architecture

> Kafka Streams simplifies application development by **building on the Kafka producer and consumer libraries** and **leveraging the native capabilities of Kafka to offer data parallelism, distributed coordination, fault tolerance, and operational simplicity**.

- Stream Partitions and Tasks
- Thread Model
- Local State Stores
- Fault Tolerance

3. Developer Guide

- Streams DSL
	- The Kafka Streams DSL (Domain Specific Language) is built on top of the Streams Processor API. It is recommended for most users, especially beginners. Most data processing operations can be expressed in just a few lines of DSL code.
- Processor API
	- The Processor API allows developers to define and connect custom processors and to interact with state stores. With the Processor API, you can define arbitrary stream processors that process one received record at a time, and connect these processors with their associated state stores to compose the processor topology that represents a customized processing logic.


# Kafka Connect

# KIP
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals

- KIP-98: Exactly Once Delivery and Transactional Messaging
- KIP-975: Docker Image for Apache Kafka

KRaft:
- KIP-500: Replace ZooKeeper with a Self-Managed Metadata Quorum
- KIP-595: A Raft Protocol for the Metadata Quorum
- KIP-631: The Quorum-based Kafka Controller

Read from Follower:
- KIP-392: Allow consumers to fetch from closest replica

Physical storage:
- KIP-405: Kafka Tiered Storage

Idempotent/Transactional Producer:
- KIP-360: Improve reliability of idempotent/transactional producer
- KIP-447: Producer scalability for exactly once semantics
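
On the client side, these features are switched on through producer configuration. The sketch below only builds the `Properties` object, so it runs without the Kafka client library on the classpath. The broker address and the transactional ID value are placeholders, and `ProducerConfigSketch` and `transactionalProducerProps` are hypothetical names.

```java
import java.util.Properties;

class ProducerConfigSketch {
    // Builds producer settings for idempotent, transactional sends.
    static Properties transactionalProducerProps(String bootstrapServers,
                                                 String transactionalId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        // Idempotence requires acknowledgement from all in-sync replicas.
        props.put("enable.idempotence", "true");
        props.put("acks", "all");
        // A stable ID lets the cluster fence an older instance of the same producer.
        props.put("transactional.id", transactionalId);
        return props;
    }

    public static void main(String[] args) {
        // Placeholder values, chosen only for illustration.
        Properties props = transactionalProducerProps("localhost:9092", "orders-tx-1");
        props.forEach((k, v) -> System.out.println(k + "=" + v));
        // With the client library available, these properties (plus serializers)
        // would be passed to a KafkaProducer, followed by initTransactions(),
        // beginTransaction(), send(...), and commitTransaction().
    }
}
```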

MirrorMaker2:
- KIP-656: MirrorMaker2 Exactly-once Semantics