# Python Kafka

The `kafka-python` package lets Python programs communicate with a running Kafka cluster. It can also administer the cluster. Install it with pip:

In [0]:
!pip install kafka-python

Collecting kafka-python
  Downloading kafka_python-2.0.2-py2.py3-none-any.whl (246 kB)

# Kafka cluster

To use `kafka-python`, you need a Kafka cluster running somewhere.

You can create a Kafka cluster using [Confluent on Google Cloud](https://console.cloud.google.com/marketplace/product/confluent-prod/apache-kafka-on-confluent-cloud) and pay for it with your Google Cloud education credits.

In class, the instructor will provide you with the credentials to connect to an existing cluster.

In [0]:
# cluster connection parameters
params = {
    "bootstrap_servers": "put the bootstrap server address here",
    "security_protocol": "SASL_SSL",
    "sasl_mechanism": "PLAIN",
    "sasl_plain_username": "use API key as username",
    "sasl_plain_password": "use API secret as password"
}
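
To keep the API key and secret out of the notebook, the same dictionary can be built from environment variables. The following is a minimal sketch; the function `build_params` and the variable names `KAFKA_BOOTSTRAP_SERVERS`, `KAFKA_API_KEY`, and `KAFKA_API_SECRET` are hypothetical and not part of `kafka-python`.

```python
import os


def build_params(env=os.environ):
    """Build kafka-python connection parameters from environment variables.

    The variable names below are hypothetical; pick any names you like.
    A missing variable raises KeyError, so misconfiguration fails early.
    """
    return {
        "bootstrap_servers": env["KAFKA_BOOTSTRAP_SERVERS"],
        "security_protocol": "SASL_SSL",
        "sasl_mechanism": "PLAIN",
        "sasl_plain_username": env["KAFKA_API_KEY"],
        "sasl_plain_password": env["KAFKA_API_SECRET"],
    }
```

Once the three variables are set, `params = build_params()` can stand in for the hard-coded dictionary.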

In [0]:
# use another name to avoid conflicts with other students when in class

MAIN_TOPIC = "main_topic"

In [0]:
from kafka import KafkaAdminClient

admin_client = KafkaAdminClient(**params)
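# the admin client manages the cluster, e.g. it lists, creates, and deletes topics
print(admin_client.list_topics())
# WARNING: this deletes every topic in the cluster, not only your own
admin_client.delete_topics(admin_client.list_topics())
print(admin_client.list_topics())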

[]
[]


In [0]:
from kafka.admin import NewTopic

print(admin_client.list_topics())
admin_client.create_topics([
    NewTopic(
        name=MAIN_TOPIC,
        # we won't use topic partitioning for this example
        # thus leaving the number of partitions to 1
        num_partitions=1,
        # the replication factor is dictated by the cluster's configuration
        replication_factor=3,
        # once a partition grows beyond 2 ** 30 bytes (1 GiB),
        # Kafka starts discarding its oldest data
        topic_configs={"retention.bytes": 2 ** 30}
    )
])
print(admin_client.list_topics())

[]
['main_topic']


# Writing Data to Kafka

Note that Kafka messages are raw `bytes`, so strings must be encoded before they are sent.

In [0]:
from kafka import KafkaProducer

producer = KafkaProducer(**params)
for i in range(10):
    producer.send(MAIN_TOPIC, str(i).encode("utf-8"))
# send() is asynchronous; flush() blocks until the buffered messages are sent
producer.flush()
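
Because only `bytes` can be written, numeric values need an explicit encoding as well. The sketch below shows one possible choice, a fixed-width binary encoding of floats with the standard `struct` module. The helper names `encode_float` and `decode_float` are hypothetical, and the text encoding used in the cell above works just as well.

```python
import struct


def encode_float(value: float) -> bytes:
    # ">d" means a big-endian 8-byte IEEE 754 double
    return struct.pack(">d", value)


def decode_float(payload: bytes) -> float:
    # unpack() returns a tuple, so take its single element
    return struct.unpack(">d", payload)[0]


payload = encode_float(0.25)
print(len(payload))           # 8
print(decode_float(payload))  # 0.25
```

`KafkaProducer` accepts a `value_serializer` callable and `KafkaConsumer` a `value_deserializer` callable, so helpers like these can be passed in once instead of being called around every `send`.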

# Reading Data from Kafka

Reading is done with a consumer:

In [0]:
from kafka import KafkaConsumer

consumer = KafkaConsumer(**params)

# Topics, Partitions, and Offsets

Messages in Kafka are organised into topics:

In [0]:
consumer.topics()

Out[8]: {'main_topic'}

A topic can have several partitions, each with its own beginning and end offsets:

In [0]:
from kafka import TopicPartition

partitions = [
    TopicPartition(MAIN_TOPIC, partition)
    for partition in consumer.partitions_for_topic(MAIN_TOPIC)
]
print(consumer.beginning_offsets(partitions))
print(consumer.end_offsets(partitions))

{TopicPartition(topic='main_topic', partition=0): 0}
{TopicPartition(topic='main_topic', partition=0): 10}
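
The difference between the two offsets shows how many offsets a partition currently spans. For a plain topic such as this one, that equals the number of stored messages; on compacted or transactional topics it is only an upper bound. A minimal sketch follows, using a hypothetical helper `offset_span` and a `namedtuple` stand-in for `kafka.TopicPartition` so that it runs without a cluster:

```python
from collections import namedtuple

# Stand-in for kafka.TopicPartition so the sketch runs without a broker.
TopicPartition = namedtuple("TopicPartition", ["topic", "partition"])


def offset_span(beginning, end):
    """Return {partition: end offset - beginning offset}."""
    return {tp: end[tp] - beginning[tp] for tp in end}


tp = TopicPartition("main_topic", 0)
print(offset_span({tp: 0}, {tp: 10}))
# {TopicPartition(topic='main_topic', partition=0): 10}
```

With a real consumer, the two arguments would be `consumer.beginning_offsets(partitions)` and `consumer.end_offsets(partitions)`.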


# Reading from a given offset of a partition

Before reading anything from Kafka, assign the topic partitions to the consumer:

In [0]:
# before consuming data, one needs to assign partitions
consumer.assign(partitions)

One can read from any offset of the partition:

In [0]:
# position() returns the offset of the next record to be read from the partition
print(consumer.position(partitions[0]))
# seek() moves that position to any offset you choose
consumer.seek(partitions[0], 0)
print(consumer.position(partitions[0]))

10
0


Data can be read in batches of any desired size:

In [0]:
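for _ in range(10):
    # poll() returns {TopicPartition: [records]}; the dict is empty on timeout
    batch = consumer.poll(timeout_ms=200, max_records=1)
    for record in batch.get(partitions[0], []):
        print(record.value.decode("utf-8"))
print(consumer.position(partitions[0]))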

0
1
2
3
4
5
6
7
8
9
10
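
`poll` returns a dictionary that maps each partition to a list of records, and the dictionary is empty when nothing arrives within the timeout. A single call may therefore return fewer records than requested. The following is a minimal sketch of a helper that keeps polling until a batch is full; the name `read_batch` and the `max_empty_polls` cut-off are assumptions made for illustration, not part of `kafka-python`.

```python
def read_batch(consumer, size, timeout_ms=200, max_empty_polls=5):
    """Collect up to `size` record values by polling repeatedly.

    `consumer` only needs a poll(timeout_ms=..., max_records=...) method
    that returns {partition: [records]}, as KafkaConsumer.poll does.
    Gives up after `max_empty_polls` consecutive empty polls.
    """
    values = []
    empty_polls = 0
    while len(values) < size and empty_polls < max_empty_polls:
        batches = consumer.poll(
            timeout_ms=timeout_ms,
            max_records=size - len(values),  # never ask for more than needed
        )
        if not batches:
            empty_polls += 1
            continue
        empty_polls = 0
        for records in batches.values():
            values.extend(record.value for record in records)
    return values
```

With the consumer from this notebook, `read_batch(consumer, 10)` would return the ten encoded values as a list of `bytes`, or fewer if the topic runs dry.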


# Do it Yourself

Use three notebooks:

1. write an endless stream of random float numbers to Kafka
1. work with the stream:
    1. read from the input topic in batches of 1024 values
    1. compute their averages
    1. write the stream of averages to another topic
1. read from the output topic to verify the results