# Kafka Consumer

In [1]:
# set this variable with one of the following values
# -> 'local'
# -> 'docker_cluster'
CLUSTER_TYPE = 'docker_cluster'

In [2]:
import os

KAFKA_BOOTSTRAP_SERVERS = ''

if CLUSTER_TYPE == 'local':

    KAFKA_HOME = '<PATH_TO_YOUR_kafka_2.13-2.7.0_FOLDER>'
    KAFKA_BOOTSTRAP_SERVERS = ['localhost:9092']
    
    # Start Zookeeper in the background (-daemon); without it os.system would block this cell
    os.system('{0}/bin/zookeeper-server-start.sh -daemon {0}/config/zookeeper.properties'.format(KAFKA_HOME))

    # Start one Kafka Broker, also in the background
    os.system('{0}/bin/kafka-server-start.sh -daemon {0}/config/server.properties'.format(KAFKA_HOME))
    
elif CLUSTER_TYPE == 'docker_cluster':

    KAFKA_BOOTSTRAP_SERVERS = ['kafka-broker:9092']

In [3]:
! pip install kafka-python



In [4]:
from kafka import KafkaConsumer

Kafka consumers are instantiated via the `KafkaConsumer` class:

```python
#--- A TYPICAL CONSUMER
import msgpack                               #<<<--- third-party package, only needed for the value_deserializer below

consumer = KafkaConsumer(
    bootstrap_servers=['62.30.10.23:9092'],  #<<<--- list of brokers
    security_protocol="SSL",                 #<<<--- security protocol (if any)
    ssl_cafile="./ca.pem",                   #<<<--- certificate details (if any)
    ssl_certfile="./service.cert",           #           ...
    ssl_keyfile="./service.key",             #           ...
    value_deserializer=msgpack.unpackb,      #<<<--- function applied to the raw bytes of each message value (e.g. unpack the message from a specific format)
    auto_offset_reset='earliest',            #<<<--- start from the earliest message when the group has no committed offset
    group_id="group_A",                      #<<<--- identify this consumer as part of group_A
)
```
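
The `value_deserializer` can be any callable that receives the raw bytes of a message value and returns a Python object. A minimal sketch for JSON-encoded values follows; the name `json_deserializer` and the sample payload are illustrative assumptions, not part of this notebook's setup.

```python
import json

def json_deserializer(raw_value):
    """Turn the raw bytes of a message value into a Python object.

    Hypothetical helper: assumes the producers send UTF-8 encoded JSON.
    """
    if raw_value is None:          # e.g. a record published without a value
        return None
    return json.loads(raw_value.decode("utf-8"))

# Made-up payload, standing in for the bytes received from a broker
decoded = json_deserializer(b'{"user": "alice", "amount": 12.5}')
print(decoded["user"], decoded["amount"])
```

Such a function would be passed as `value_deserializer=json_deserializer` when creating the consumer.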


Once more, we'll use a simple consumer here: the only option set is `consumer_timeout_ms`, which stops the iteration after 10 seconds without new messages.

In [5]:
consumer = KafkaConsumer(bootstrap_servers=KAFKA_BOOTSTRAP_SERVERS,
                         consumer_timeout_ms=10000)

Inspect the brokers for the available topics:

In [73]:
consumer.topics()

Then subscribe to the topics of choice.
Subscribing does not mean that any message is received or consumed: it only means that, from now on, the consumer is able to poll the partitions of the chosen topics hosted on the brokers.

In [74]:
consumer.subscribe(['my_awesome_topic'])  # subscribe() expects a list of topic names
consumer.subscription()

We can inspect how many partitions the specific topic is made of:

In [75]:
consumer.partitions_for_topic('my_awesome_topic')

A single `poll()` call fetches the records currently available and returns them grouped by partition; its arguments control how that fetch behaves.

In [76]:
consumer.poll(timeout_ms=0,         #<<--- do not wait for new data: return immediately with whatever is already available
              max_records=None,     #<<--- no explicit cap for this call: fall back to the consumer's max_poll_records setting
              update_offsets=True   #<<--- advance the reading position past the returned records
             )
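
In kafka-python, `poll()` returns a dictionary mapping each `TopicPartition` to the list of records fetched from it. The sketch below flattens such a dictionary into one list ordered by partition and offset. To run without a broker it uses namedtuples (`FakeTopicPartition`, `FakeRecord`) as hypothetical stand-ins for the library's own types, and `flatten_poll_result` is an illustrative name.

```python
from collections import namedtuple

# Hypothetical stand-ins for kafka-python's TopicPartition and ConsumerRecord
FakeTopicPartition = namedtuple("FakeTopicPartition", ["topic", "partition"])
FakeRecord = namedtuple("FakeRecord", ["partition", "offset", "value"])

def flatten_poll_result(batch):
    """Merge the per-partition lists of a poll() result into a single list."""
    records = []
    for partition_records in batch.values():
        records.extend(partition_records)
    # Order by partition first, then by offset within the partition
    return sorted(records, key=lambda r: (r.partition, r.offset))

# Made-up batch shaped like the return value of poll()
fake_batch = {
    FakeTopicPartition("my_awesome_topic", 0): [
        FakeRecord(0, 0, b"first"),
        FakeRecord(0, 1, b"second"),
    ],
}
for record in flatten_poll_result(fake_batch):
    print(record.partition, record.offset, record.value)
```

With a real consumer the same function could be applied to the value returned by `consumer.poll(...)`, since the records expose `partition` and `offset` attributes.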

Iterating over the consumer reads the messages from the topic:

In [77]:
# this consumer keeps polling for messages until stopped (or until no message arrives for consumer_timeout_ms)
for message in consumer:
    print(message)

The reading offset can also be brought back to the beginning of the topic, to re-read the entire topic:

In [78]:
consumer.seek_to_beginning()

for message in consumer:
    print(message)

In [79]:
from datetime import datetime

consumer.seek_to_beginning()

# break down the message into its main components
for message in consumer:
    print("%d:%d [%s] k=%s v=%s" % (message.partition,
                                    message.offset,
                                    message.timestamp,  # epoch milliseconds; datetime.fromtimestamp(message.timestamp / 1000).time() gives a readable time
                                    message.key,
                                    message.value))
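
The `timestamp` field of a record is expressed in milliseconds since the Unix epoch, so it has to be divided by 1000 before being handed to `datetime`. A small sketch of that conversion follows; `format_record` is a hypothetical helper and the sample values are made up.

```python
from datetime import datetime, timezone

def format_record(partition, offset, timestamp_ms, key, value):
    """Render the main components of a record, with a readable UTC timestamp."""
    # Kafka timestamps are in milliseconds, datetime expects seconds
    when = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return "%d:%d [%s] k=%s v=%s" % (partition, offset, when.isoformat(), key, value)

# Made-up values standing in for the attributes of a consumed message
print(format_record(0, 42, 0, b"user_1", b"hello"))
```

Inside the reading loop it could be called as `format_record(message.partition, message.offset, message.timestamp, message.key, message.value)`.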

Let's now subscribe the consumer to a different, partitioned topic:

In [80]:
consumer.subscribe(['a_partitioned_topic'])
consumer.subscription()

Inspecting the partitions of this topic now shows 2 of them: partition #0 and partition #1.

In [None]:
consumer.partitions_for_topic('a_partitioned_topic')

Reading from a partitioned topic, it's easy to see that the messages are spread over the two partitions in a seemingly arbitrary way:

In [None]:
import json

consumer.seek_to_beginning()

for message in consumer:
    print ("%d:%d:\t v=%s" % (message.partition,
                          message.offset,
                          json.loads(message.value)))
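
To check how the messages are spread, one can count the records seen on each partition. The sketch below does so with a `Counter`; `count_by_partition` is a hypothetical helper, and the `FakeRecord` list stands in for records read from the consumer.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a consumed record (only the fields used here)
FakeRecord = namedtuple("FakeRecord", ["partition", "offset"])

def count_by_partition(records):
    """Count how many records were read from each partition."""
    return Counter(record.partition for record in records)

# Made-up sample: three records on partition 0, one on partition 1
sample = [FakeRecord(0, 0), FakeRecord(1, 0), FakeRecord(0, 1), FakeRecord(0, 2)]
print(dict(count_by_partition(sample)))
```

Any iterable of objects with a `partition` attribute works as input, so the same function could be fed the messages yielded by the consumer.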

### Creating a consumer accessing only one partition

Publishing records to a partitioned topic is typically transparent for the user: the producer publishes to the topic, each record is routed to the leader of one partition, and it is later replicated to the followers.

The same goes for a generic consumer. As we have just seen, data is consumed from all the partitions of the topic.

In some cases, however, it can be more suitable to instantiate multiple consumers, each reading from a specific partition of a topic.

Let's create a consumer dedicated to partition #0 of the previous partitioned topic.

In [82]:
from kafka import TopicPartition

consumer_part_0 = KafkaConsumer(bootstrap_servers=KAFKA_BOOTSTRAP_SERVERS,
                                client_id='consumer_n_0',
                                consumer_timeout_ms=10000)

consumer_part_0.assign([TopicPartition('a_partitioned_topic', 0)]) # <<--- name of the topic, partition id

In [None]:
consumer_part_0.seek_to_beginning()

for message in consumer_part_0:
    print ("%d:%d:\t v=%s" % (message.partition,
                          message.offset,
                          json.loads(message.value)))

### Creating a consumer group

Multiple consumers can read from the same topic.

In Kafka, consumers are organized in consumer groups.
A consumer group is a set (1 or more) of cooperating consumers gathering data from the same topic: the partitions of the topic are balanced across the members of the group and reassigned dynamically when members join or leave.

If a consumer inside a consumer group fails, its partitions are reassigned to the others in the same group, so that the group as a whole keeps reading all the data from the topic to which it is subscribed.
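
The effect of a failing consumer can be pictured with a toy model. The sketch below hands out partitions to group members in round-robin order and recomputes the assignment once a member disappears. It is a simplified illustration only: real assignments are negotiated through the group coordinator with a configurable assignor, and `assign_round_robin` and `consumer_two` are hypothetical names.

```python
def assign_round_robin(partitions, members):
    """Toy model of a group assignment: deal partitions out to members in turn."""
    if not members:
        raise ValueError("a consumer group needs at least one member")
    members = sorted(members)
    assignment = {member: [] for member in members}
    for index, partition in enumerate(sorted(partitions)):
        assignment[members[index % len(members)]].append(partition)
    return assignment

partitions = [0, 1, 2, 3]

# Two healthy members share the partitions
before = assign_round_robin(partitions, ["consumer_one", "consumer_two"])
# consumer_two fails: the remaining member takes over every partition
after = assign_round_robin(partitions, ["consumer_one"])

print(before)
print(after)
```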

In [68]:
consumer_one = KafkaConsumer(bootstrap_servers=KAFKA_BOOTSTRAP_SERVERS,
                             client_id='consumer_one',
                             group_id='my_group',
                             consumer_timeout_ms=10000)

In [69]:
consumer_one.subscribe(['a_partitioned_topic'])

Each consumer within a group is an independent process (to be run in parallel with the others) and provides access to a fraction of the incoming data.

In [None]:
# Use multiple consumers in parallel --> typically you would run each on a different thread / process / executor
for message in consumer_one:
    print ("%d:%d: k=%s v=%s" % (message.partition,
                          message.offset,
                          message.key,
                          json.loads(message.value)))
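
A minimal sketch of the parallel pattern, using one thread per consumer. To stay runnable without a broker, plain lists play the role of the consumers; `consume_all` and `run_in_parallel` are hypothetical helpers. Each thread builds its own consumer through a factory function, because a kafka-python `KafkaConsumer` is not thread safe and should not be shared across threads.

```python
import threading

def consume_all(name, make_consumer, sink, lock):
    """Build a consumer inside the thread and store everything it yields."""
    consumer = make_consumer()
    for message in consumer:
        with lock:                      # protect the shared result list
            sink.append((name, message))

def run_in_parallel(consumer_factories):
    """Run one thread per factory and return the collected (name, message) pairs."""
    sink, lock = [], threading.Lock()
    threads = [
        threading.Thread(target=consume_all, args=(name, factory, sink, lock))
        for name, factory in consumer_factories.items()
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sink

# Lists stand in for consumers; a real factory would return a KafkaConsumer
results = run_in_parallel({
    "consumer_one": lambda: ["msg_a", "msg_b"],
    "consumer_two": lambda: ["msg_c"],
})
print(sorted(results))
```

With real consumers sharing the same `group_id`, each factory would create a `KafkaConsumer` subscribed to the topic, and each thread would then receive the messages of the partitions assigned to it.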

## Reading from the Kafka+Spark results topic

Let's subscribe to the `results` topic and monitor the detected frauds:

In [None]:
consumer.subscribe(['results'])
for message in consumer:
    print ("%d:%d: k=%s v=%s" % (message.partition,
                          message.offset,
                          message.key,
                          message.value))
    print('--> sending alert message to user %s' % message.key)
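
Without a `key_deserializer`, kafka-python delivers message keys as raw bytes (or `None`), so the alert line prints the key in its `b'...'` form. A sketch of a helper that builds a cleaner alert text follows; `build_alert` is a hypothetical name, and the UTF-8 encoding of the key is an assumption about how the producer wrote it.

```python
def build_alert(key):
    """Build the alert text for the user identified by a message key."""
    if key is None:
        user = "<unknown>"              # record published without a key
    elif isinstance(key, bytes):
        user = key.decode("utf-8")      # assumption: keys are UTF-8 strings
    else:
        user = str(key)
    return "--> sending alert message to user %s" % user

# Made-up key, standing in for message.key
print(build_alert(b"user_42"))
```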