# Python Kafka

The `kafka-python` package lets you communicate with a running Kafka cluster from Python. It can also be used to administer the cluster:

In [1]:
!pip install kafka-python



Confluent advertises its own client, [`confluent-kafka`](https://pypi.org/project/confluent-kafka/).

But `kafka-python` works with any Kafka installation, not only Confluent.

In [2]:
# take these values from the Confluent web console
params = {
    # take this from "Cluster Overview->Cluster Settings->General->Identification"
    "bootstrap_servers": "pkc-4r297.europe-west1.gcp.confluent.cloud:9092",
    "security_protocol": "SASL_SSL",
    "sasl_mechanism": "PLAIN",
    # Data Integration->API Keys->Add key
    # username = key
    "sasl_plain_username": "4KLAOHXKUM6GFMLJ",
    # password = secret
    "sasl_plain_password": "tdpFCiSB+pMISzXXsLnkBeYZAKc9+lb4rIRWcuv7simAlUWUbqRi8KxUt21w5+XY"
}
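
Hard-coding an API key and secret in a notebook makes them easy to leak. As a minimal sketch, the same `params` dictionary could be built from environment variables instead. The variable names `KAFKA_BOOTSTRAP_SERVERS`, `KAFKA_API_KEY`, and `KAFKA_API_SECRET` and the helper `load_kafka_params` are assumptions made for this illustration; neither `kafka-python` nor Confluent defines them.

```python
import os


def load_kafka_params(env=os.environ):
    """Build connection parameters from environment variables.

    The variable names are hypothetical; any names would work.
    """
    required = ("KAFKA_BOOTSTRAP_SERVERS", "KAFKA_API_KEY", "KAFKA_API_SECRET")
    missing = [name for name in required if name not in env]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "bootstrap_servers": env["KAFKA_BOOTSTRAP_SERVERS"],
        "security_protocol": "SASL_SSL",
        "sasl_mechanism": "PLAIN",
        # username = key, password = secret
        "sasl_plain_username": env["KAFKA_API_KEY"],
        "sasl_plain_password": env["KAFKA_API_SECRET"],
    }
```

With the variables set in the shell that starts the notebook, `params = load_kafka_params()` would replace the literal dictionary.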

In [3]:
# you can use another name for testing

MAIN_TOPIC = "main_topic"

In [4]:
from kafka import KafkaAdminClient

admin_client = KafkaAdminClient(**params)
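# the admin client can list, create, and delete topics, among other things
print(admin_client.list_topics())
# WARNING: this deletes every topic in the cluster; run it only on a scratch cluster
admin_client.delete_topics(admin_client.list_topics())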
print(admin_client.list_topics())

[]
[]


In [5]:
from kafka.admin import NewTopic

print(admin_client.list_topics())
admin_client.create_topics([
    NewTopic(
        name=MAIN_TOPIC,
        # we won't use topic partitioning in this example,
        # so the number of partitions stays at 1
        num_partitions=1,
        # the replication factor must be one the cluster's configuration allows
        replication_factor=3,
        # old data is discarded once a partition grows beyond this size (1 GiB)
        topic_configs={"retention.bytes": 2 ** 30}
    )
])
print(admin_client.list_topics())

[]
['main_topic']


# Writing Data to Kafka

Note that Kafka messages are raw `bytes`: unless a `value_serializer` is configured, `producer.send` accepts only `bytes`, not strings!

In [6]:
from kafka import KafkaProducer

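producer = KafkaProducer(**params)
for i in range(10):
    # values must be bytes, so encode the string explicitly
    producer.send(MAIN_TOPIC, str(i).encode("utf-8"))
# send() is asynchronous; flush() blocks until all buffered messages are sent
producer.flush()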
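
Since only `bytes` travel through Kafka, every other type needs an explicit encoding on the way in and a matching decoding on the way out. The helpers below are a minimal sketch of three common choices (UTF-8 text, a fixed-width binary float, and JSON); the function names are hypothetical and use only the standard library.

```python
import json
import struct


def encode_text(text):
    """str -> UTF-8 bytes."""
    return text.encode("utf-8")


def decode_text(payload):
    """UTF-8 bytes -> str."""
    return payload.decode("utf-8")


def encode_float(number):
    """float -> 8 bytes (big-endian IEEE 754 double)."""
    return struct.pack(">d", number)


def decode_float(payload):
    """8 bytes -> float."""
    return struct.unpack(">d", payload)[0]


def encode_json(obj):
    """JSON-serialisable object -> UTF-8 encoded JSON bytes."""
    return json.dumps(obj).encode("utf-8")


def decode_json(payload):
    """UTF-8 encoded JSON bytes -> Python object."""
    return json.loads(payload.decode("utf-8"))
```

An encoder like these can be passed to `KafkaProducer` as `value_serializer`, and the matching decoder to `KafkaConsumer` as `value_deserializer`, so that the conversion happens automatically.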

# Reading Data from Kafka

Reading is done with a consumer:

In [7]:
from kafka import KafkaConsumer

consumer = KafkaConsumer(**params)

# Topics, Partitions, and Offsets

Messages in Kafka are organised into topics:

In [8]:
consumer.topics()

{'main_topic'}

A topic can have several partitions, each with its own starting and ending offsets:

In [9]:
from kafka import TopicPartition

partitions = [
    TopicPartition(MAIN_TOPIC, partition)
    for partition in consumer.partitions_for_topic(MAIN_TOPIC)
]
print(consumer.beginning_offsets(partitions))
print(consumer.end_offsets(partitions))

{TopicPartition(topic='main_topic', partition=0): 0}
{TopicPartition(topic='main_topic', partition=0): 10}
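
The two dictionaries can be combined to estimate how much data each partition currently holds. This is a sketch under assumptions: `messages_per_partition` is a hypothetical helper, and the local `TopicPartition` namedtuple is only a stand-in so that the example runs without a cluster. The end offset is the offset of the next message to be written, so the difference is an upper bound; on compacted topics the real count can be lower.

```python
from collections import namedtuple

# Stand-in for kafka.TopicPartition, so the sketch runs without a cluster.
TopicPartition = namedtuple("TopicPartition", ["topic", "partition"])


def messages_per_partition(beginning_offsets, end_offsets):
    """Return {partition: end offset - beginning offset}."""
    return {
        partition: end_offsets[partition] - beginning_offsets[partition]
        for partition in end_offsets
    }


tp = TopicPartition("main_topic", 0)
counts = messages_per_partition({tp: 0}, {tp: 10})
print(counts[tp])  # 10
```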


# Reading from a given offset of a partition

Before reading anything from Kafka, the consumer must be assigned to a topic and partition:

In [10]:
# before consuming data, one needs to assign partitions
consumer.assign(partitions)

One can read from any offset of the partition:

In [11]:
# check the offset from which the partition will be read next
print(consumer.position(partitions[0]))
# and move it to any offset you wish
consumer.seek(partitions[0], 0)
print(consumer.position(partitions[0]))

10
0


Reading data can be done in batches of any desired size:

In [13]:
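for _ in range(10):
    # poll() returns a dict mapping each TopicPartition to a list of records;
    # with max_records=1 the list holds a single record
    data = consumer.poll(
        timeout_ms=200,
        max_records=1
    )[partitions[0]][0].value
    print(data.decode("utf-8"))
# after reading all 10 records, the position is back at the end offset
print(consumer.position(partitions[0]))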

0
1
2
3
4
5
6
7
8
9
10
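
The loop above indexes the result of `poll` directly, which raises `KeyError` when no record arrives within the timeout, because `poll` then returns an empty dictionary. A more forgiving way to unpack a batch is sketched below; `values_from_poll` is a hypothetical helper, and `FakeRecord` is a stand-in for the consumer's record objects so that the example runs without a cluster.

```python
from collections import namedtuple


def values_from_poll(batch):
    """Flatten a poll() result into a list of decoded strings.

    `batch` maps each partition to a list of records that carry
    a bytes payload in their `.value` attribute.
    """
    return [
        record.value.decode("utf-8")
        for records in batch.values()
        for record in records
    ]


# Stand-in objects that mimic the shape of a poll() result.
FakeRecord = namedtuple("FakeRecord", ["offset", "value"])
fake_batch = {("main_topic", 0): [FakeRecord(0, b"0"), FakeRecord(1, b"1")]}

print(values_from_poll(fake_batch))  # ['0', '1']
print(values_from_poll({}))          # []
```

With a real consumer, this would be called as `values_from_poll(consumer.poll(timeout_ms=200, max_records=100))`.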


# Do it Yourself

* use two notebooks
* in the first one, write an endless stream of random float numbers to Kafka
* in the second one, read the stream in batches of 1024 values and compute the average of each batch
* write the stream of averages to another topic
* monitor the status of the topics in Confluent
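
As a starting point, the batching and averaging logic can be prototyped without Kafka at all. The sketch below is one possible approach, not a full solution: `random_float_messages` and `batch_averages` are hypothetical names, floats are assumed to be encoded as 8-byte big-endian doubles, and an incomplete final batch is silently dropped.

```python
import random
import struct


def random_float_messages(count, seed=None):
    """Yield `count` random floats in [0, 1), each packed into 8 bytes."""
    rng = random.Random(seed)
    for _ in range(count):
        yield struct.pack(">d", rng.random())


def batch_averages(messages, batch_size=1024):
    """Yield the average of every full batch of decoded floats."""
    batch = []
    for payload in messages:
        batch.append(struct.unpack(">d", payload)[0])
        if len(batch) == batch_size:
            yield sum(batch) / batch_size
            batch = []


averages = list(batch_averages(random_float_messages(2048 + 10, seed=0)))
print(len(averages))  # 2
```

In the notebooks, the generator of messages would be replaced by values read from the consumer, and each average would be encoded and sent to the second topic.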