# Set Up Kafka on Kubernetes


Kafka is an open-source distributed event streaming platform used to build real-time data pipelines and streaming applications. It was originally developed at LinkedIn and is now maintained by the Apache Software Foundation. Kafka is designed to handle high-volume, high-throughput, low-latency data streams, which makes it a popular choice for scalable and reliable streaming solutions.

Some of the benefits of Kafka include:

* Kafka is designed for large-scale data streams and can process millions of events per second. It scales horizontally: adding brokers to the cluster accommodates increasing data volumes.
* Kafka provides fault tolerance by replicating data across multiple brokers in a cluster. Data stays available and can be recovered after failures, which makes Kafka a reliable choice for critical streaming applications.
* Kafka supports a variety of data sources and sinks, so it can be used for a wide range of applications, such as real-time data processing, data ingestion, and event-driven architectures.
* Kafka stores published messages for a configurable retention period, allowing consumers to read data at their own pace. This suits use cases where data must be retained for historical analysis or replayed for recovery.

Deploying Kafka on Kubernetes, a widely used container orchestration platform, offers several additional advantages. Kubernetes makes it easier to scale Kafka clusters with demand: brokers can be added or removed as data volumes change, which helps avoid both wasted resources and performance degradation. Running Kafka as containers also simplifies deployment, management, and monitoring, and makes clusters portable across cloud providers, data centers, and development environments. Further, Kubernetes includes built-in features for handling failures and keeping Kafka highly available. For example, it automatically reschedules failed Kafka broker containers and supports rolling updates, so Kafka can remain available to streaming applications during failures and upgrades. Overall, running Kafka on Kubernetes provides scalability, flexibility, and high availability, making it a powerful combination for building and managing robust data streaming applications.

Combined with object storage such as MinIO, Kafka offers further advantages for building data streaming solutions. MinIO is a high-performance, distributed object storage system that provides scalable and durable storage for unstructured data. Used as a data sink for Kafka, MinIO provides reliable and scalable storage for data streams, allowing organizations to store and process large volumes of data in real time.

Some benefits of combining Kafka with MinIO include:

* **Scalable Storage**: MinIO can handle large amounts of data and scale horizontally across multiple nodes, making it a good fit for storing data streams generated by Kafka. This suits big data and high-velocity streaming use cases.
* **Durability**: MinIO provides durable storage, allowing organizations to retain data for long periods of time. This is useful where data must be kept for historical analysis, compliance requirements, or data recovery.
* **Fault Tolerance**: MinIO supports data replication across multiple nodes, providing fault tolerance and ensuring data durability. This complements Kafka's own fault tolerance, making the overall solution reliable and resilient.
* **Easy Integration**: MinIO can be integrated with Kafka using Kafka Connect, Kafka's built-in framework for connecting to external systems. This makes it straightforward to stream data from Kafka to MinIO for storage, and from MinIO back to Kafka for retrieval.


In this notebook, we will walk through how to set up Kafka on Kubernetes using Strimzi, an open-source project that provides operators to run Apache Kafka and Apache ZooKeeper clusters on Kubernetes and OpenShift.

## Prerequisites

Before we start, ensure that you have the following prerequisites:

* A running Kubernetes cluster
* kubectl command-line tool
* MinIO cluster up and running
* mc command line tool for MinIO
* Helm package manager
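
A quick way to confirm the command-line prerequisites is to check that each tool is on the `PATH`. The following is a minimal sketch; the helper name `missing_tools` is hypothetical, and it only checks that the binaries exist, not that the Kubernetes or MinIO clusters are reachable.

```shell
# Print the name of every listed tool that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Tool names mirror the prerequisites listed above.
missing=$(missing_tools kubectl mc helm)
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
else
  echo "All required tools found"
fi
```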

### Install Strimzi Operator
The first step is to install the Strimzi operator on your Kubernetes cluster. The Strimzi operator manages the lifecycle of Kafka and ZooKeeper clusters on Kubernetes.

Add the Strimzi Helm chart repository:

In [37]:
!helm repo add strimzi https://strimzi.io/charts/

"strimzi" already exists with the same configuration, skipping


Install chart with release name `my-release`:

In [38]:
!helm install my-release strimzi/strimzi-kafka-operator --namespace=kafka --create-namespace

NAME: my-release
LAST DEPLOYED: Mon Apr 10 20:03:12 2023
NAMESPACE: kafka
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Thank you for installing strimzi-kafka-operator-0.34.0

To create a Kafka cluster refer to the following documentation.

https://strimzi.io/docs/operators/latest/deploying.html#deploying-cluster-operator-helm-chart-str


The command above installs the latest version of the operator (0.34.0 at the time of writing) in the `kafka` namespace, creating the namespace if it does not exist. For additional configuration options, refer to the [chart configuration](https://github.com/strimzi/strimzi-kafka-operator/tree/main/helm-charts/helm3/strimzi-kafka-operator#configuration) page.
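
Before creating a cluster, it can help to confirm that the operator itself has finished starting. The sketch below wraps `kubectl rollout status` in a hypothetical helper, `wait_for_operator`. It assumes the chart names the operator Deployment `strimzi-cluster-operator`; check with `kubectl -n kafka get deployments` if your release differs.

```shell
# Block until the operator Deployment has rolled out, or fail after 5 minutes.
# Assumption: the Deployment is named "strimzi-cluster-operator".
wait_for_operator() {
  namespace="${1:-kafka}"
  kubectl -n "$namespace" rollout status \
    deployment/strimzi-cluster-operator --timeout=300s
}
```

For the installation above, this would be called as `wait_for_operator kafka`.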

### Create Kafka Cluster

Now that we have installed the Strimzi operator, we can create a Kafka cluster. In this example, we will create a Kafka cluster with three Kafka brokers and three ZooKeeper nodes.

Let's create a YAML file based on [this example](https://github.com/strimzi/strimzi-kafka-operator/blob/main/examples/kafka/kafka-persistent.yaml):
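
The `%%writefile` cell magic does not create missing parent directories, so the `deployment` directory needs to exist before the manifest is written. A minimal sketch, assuming the notebook runs from the directory where the manifests should live:

```shell
# Create the directory that holds the manifests (no-op if it already exists).
mkdir -p deployment
```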

In [39]:
%%writefile deployment/kafka-cluster.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-kafka-cluster
  namespace: kafka
spec:
  kafka:
    version: 3.4.0
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
      - name: tls
        port: 9093
        type: internal
        tls: true
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2
      inter.broker.protocol.version: "3.4"
    storage:
      type: jbod
      volumes:
      - id: 0
        type: persistent-claim
        size: 100Gi
        deleteClaim: false
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 100Gi
      deleteClaim: false
  entityOperator:
    topicOperator: {}
    userOperator: {}

Overwriting deployment/kafka-cluster.yaml
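
The replication settings in this manifest determine how many brokers can be unavailable before writes start to fail. As a rough sketch, for producers that use `acks=all` (an assumption, since the notebook does not configure producers), the number of tolerated broker failures is the replication factor minus `min.insync.replicas`. The helper name `tolerated_broker_failures` is hypothetical.

```shell
# Brokers that can be down while acks=all producers can still write:
# replication factor minus min.insync.replicas.
tolerated_broker_failures() {
  replication_factor="$1"
  min_insync_replicas="$2"
  echo $((replication_factor - min_insync_replicas))
}

# Values from the manifest: default.replication.factor=3, min.insync.replicas=2
tolerated_broker_failures 3 2   # prints 1
```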


Let's create the cluster by applying the YAML file. It will take some time for the cluster to be up and running.

In [40]:
!kubectl apply -f deployment/kafka-cluster.yaml

kafka.kafka.strimzi.io/my-kafka-cluster created
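
Rather than polling by hand while the cluster starts, `kubectl wait` can block until the `Kafka` resource reports the `Ready` condition. The sketch below wraps it in a hypothetical helper, `wait_for_kafka`; the 300-second timeout is an arbitrary choice and may need to be longer on slow storage.

```shell
# Block until the Kafka custom resource is Ready, or fail after 5 minutes.
# $1 = cluster name, $2 = namespace (defaults to "kafka")
wait_for_kafka() {
  cluster="$1"
  namespace="${2:-kafka}"
  kubectl -n "$namespace" wait "kafka/$cluster" \
    --for=condition=Ready --timeout=300s
}
```

For the cluster created above, this would be called as `wait_for_kafka my-kafka-cluster kafka`.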


We can check the status of the cluster by running the following command:

In [42]:
!kubectl -n kafka get kafka my-kafka-cluster

my-kafka-cluster   3                        3                     True    
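
The status of the `Kafka` resource also records the bootstrap address of each listener, which is the address the producer and consumer connect to later. The following sketch reads it with a JSONPath query. It assumes the listener status entries carry a `name` field matching the listener name in the manifest (`plain`); the helper name `plain_bootstrap` is hypothetical.

```shell
# Print the bootstrap address of the "plain" listener.
# $1 = cluster name, $2 = namespace (defaults to "kafka")
# Assumption: status.listeners entries expose a "name" field.
plain_bootstrap() {
  kubectl -n "${2:-kafka}" get kafka "$1" \
    -o jsonpath='{.status.listeners[?(@.name=="plain")].bootstrapServers}'
}
```

For example, `plain_bootstrap my-kafka-cluster kafka` queries the cluster created above.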


Now that the cluster is up and running, let's produce and consume some sample events. First, create a Kafka topic named `my-topic`.

### Create Kafka Topic

Create a YAML file for the Kafka topic `my-topic` as shown below and apply it.

In [76]:
%%writefile deployment/kafka-my-topic.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: my-topic
  namespace: kafka
  labels:
    strimzi.io/cluster: my-kafka-cluster
spec:
  partitions: 3
  replicas: 3

Overwriting deployment/kafka-my-topic.yaml


In [77]:
!kubectl apply -f deployment/kafka-my-topic.yaml

kafkatopic.kafka.strimzi.io/my-topic created


Check the status of the topic with the following command:

In [45]:
!kubectl -n kafka get kafkatopic my-topic

NAME       CLUSTER            PARTITIONS   REPLICATION FACTOR   READY
my-topic   my-kafka-cluster   3            3                    True
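
The `KafkaTopic` resource shows what the operator was asked to create. To see the topic as the brokers report it, one option is to run `kafka-topics.sh` inside a broker pod. This is a sketch under assumptions: the pod name `my-kafka-cluster-kafka-0` follows the usual `<cluster>-kafka-<index>` pattern and may differ in your deployment, and the helper name `describe_topic` is hypothetical.

```shell
# Describe a topic from inside a broker pod.
# $1 = topic name
# Assumption: the first broker pod is named "my-kafka-cluster-kafka-0".
describe_topic() {
  kubectl -n kafka exec my-kafka-cluster-kafka-0 -- \
    bin/kafka-topics.sh \
    --bootstrap-server my-kafka-cluster-kafka-bootstrap:9092 \
    --describe --topic "$1"
}
```

Calling `describe_topic my-topic` should list the partitions of the topic along with their leaders and replicas.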


### Produce and Consume Messages

With the Kafka cluster and topic set up, we can now produce and consume messages.

Create a Kafka producer pod to produce messages to the `my-topic` topic. Run the commands below in a terminal rather than in the notebook, because they are interactive.

```shell
kubectl -n kafka run kafka-producer -ti --image=quay.io/strimzi/kafka:0.34.0-kafka-3.4.0 --rm=true --restart=Never -- bin/kafka-console-producer.sh --bootstrap-server my-kafka-cluster-kafka-bootstrap:9092 --topic my-topic
```

This gives us a prompt for typing messages that the producer sends to the topic. In a second terminal, start a consumer to read those messages:

```shell
kubectl -n kafka run kafka-consumer -ti --image=quay.io/strimzi/kafka:0.34.0-kafka-3.4.0 --rm=true --restart=Never -- bin/kafka-console-consumer.sh --bootstrap-server my-kafka-cluster-kafka-bootstrap:9092 --topic my-topic --from-beginning
```

The consumer replays all the messages sent through the producer earlier. Any new messages typed at the producer prompt also show up on the consumer side.
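
The same round trip can be run without an interactive prompt, which is handy for scripted smoke tests. The sketch below pipes a message into the console producer and stops the consumer after a fixed number of messages using `--max-messages`. The helper names `produce_message` and `consume_messages` and the pod names `kafka-producer-once` and `kafka-consumer-once` are hypothetical; the image and bootstrap address are the ones used in the commands above.

```shell
# Image and bootstrap address as used in the interactive commands.
IMAGE="quay.io/strimzi/kafka:0.34.0-kafka-3.4.0"
BOOTSTRAP="my-kafka-cluster-kafka-bootstrap:9092"

# Send one message ($1) to a topic ($2), reading it from stdin instead of a prompt.
produce_message() {
  printf '%s\n' "$1" | kubectl -n kafka run kafka-producer-once -i \
    --image="$IMAGE" --rm=true --restart=Never -- \
    bin/kafka-console-producer.sh --bootstrap-server "$BOOTSTRAP" --topic "$2"
}

# Read a fixed number of messages ($2) from a topic ($1), then exit.
consume_messages() {
  kubectl -n kafka run kafka-consumer-once -i \
    --image="$IMAGE" --rm=true --restart=Never -- \
    bin/kafka-console-consumer.sh --bootstrap-server "$BOOTSTRAP" \
    --topic "$1" --from-beginning --max-messages "$2"
}
```

A smoke test would call `produce_message "hello" my-topic` followed by `consume_messages my-topic 1`.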


You can delete `my-topic` with the following command:

In [47]:
!kubectl -n kafka delete kafkatopic my-topic

kafkatopic.kafka.strimzi.io "my-topic" deleted



Now that the Kafka cluster is up and running and has been tested with a sample topic, producer, and consumer, the next notebook shows how to consume topics directly into MinIO using Kafka connectors.