# Kafka Producer

In [1]:
# set this variable with one of the following values
# -> 'local'
# -> 'docker_cluster'
CLUSTER_TYPE = 'docker_cluster'

In [2]:
import os

KAFKA_BOOTSTRAP_SERVERS = ''

if CLUSTER_TYPE == 'local':

    KAFKA_HOME = '<PATH_TO_YOUR_kafka_2.13-2.7.0_FOLDER>'
    KAFKA_BOOTSTRAP_SERVERS = ['localhost:9092']
    
    # Start Zookeeper in the background (-daemon); without it os.system would block this cell
    os.system('{0}/bin/zookeeper-server-start.sh -daemon {0}/config/zookeeper.properties'.format(KAFKA_HOME))
    
    # Start one Kafka Broker, also in the background
    os.system('{0}/bin/kafka-server-start.sh -daemon {0}/config/server.properties'.format(KAFKA_HOME))
    
elif CLUSTER_TYPE == 'docker_cluster':

    KAFKA_BOOTSTRAP_SERVERS = ['kafka-broker:9092']
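
Before going further it can be handy to check that the configured broker address actually accepts connections. The snippet below is a minimal sketch: `broker_reachable` is a hypothetical helper (not part of Kafka or kafka-python) that only tests whether the TCP port is open, which does not guarantee that the broker itself is healthy.

```python
import socket


def broker_reachable(address, timeout=2.0):
    """Return True if a TCP connection to 'host:port' can be opened."""
    host, _, port = address.rpartition(':')
    try:
        # create_connection raises OSError if the port is closed or unreachable
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        return False


# Example: check every configured broker before creating a producer.
# for address in KAFKA_BOOTSTRAP_SERVERS:
#     print(address, broker_reachable(address))
```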

## Interacting with Kafka Producer from shell

Apache Kafka ships with a set of shell scripts for basic cluster operations and testing, such as:
- topic creation, configuration and inspection
- shell-based message producer 
- shell-based message consumer
- shell-based performance testing
- ...

Let's first operate on the Kafka cluster from the shell by connecting to the broker and issuing shell commands:

#### For docker_cluster users
```console
$ docker ps

!!! replace your_kafka-broker_procid with the container ID of your kafka-broker (listed by docker ps)
!!!                          ||| 
!!!                          vvv  
$ docker exec -it <your_kafka-broker_procid> /bin/sh 

# ./bin/bash
# cd /usr/bin/kafka_2.13-2.7.0/bin
# ls
# ./kafka-topics.sh --create --topic my_awesome_topic --bootstrap-server kafka-broker:9092                
# ./kafka-topics.sh --list --bootstrap-server kafka-broker:9092
# ./kafka-topics.sh --describe --topic my_awesome_topic --bootstrap-server kafka-broker:9092
# ./kafka-console-producer.sh --topic my_awesome_topic --bootstrap-server kafka-broker:9092
```

#### For local/VBox users
```console

!!! replace KAFKA_HOME with your own path to kafka_2.13-2.7.0 folder
!!!     ||| 
!!!     vvv  
$ cd KAFKA_HOME/bin 

$ ls 
$ ./kafka-topics.sh --create --topic my_awesome_topic --bootstrap-server localhost:9092                
$ ./kafka-topics.sh --list --bootstrap-server localhost:9092
$ ./kafka-topics.sh --describe --topic my_awesome_topic --bootstrap-server localhost:9092
$ ./kafka-console-producer.sh --topic my_awesome_topic --bootstrap-server localhost:9092
```

At this point you should be able to send messages to the topic you just created via the kafka-console-producer.

So far, no consumer is available to process or even display those messages.
Yet the messages are successfully sent to the topic and appended to the log of its partition(s), of which there may be more than one.

Let's create a console consumer and subscribe to the topic:

#### For docker_cluster users
```console
# ./kafka-console-consumer.sh --topic my_awesome_topic --bootstrap-server kafka-broker:9092 [--from-beginning]
```

#### For local/VBox users
```console
$ ./kafka-console-consumer.sh --topic my_awesome_topic --bootstrap-server localhost:9092 [--from-beginning]
```

## kafka-python Producer

Various python modules are available to interact with kafka, including:
- kafka-python
- confluent-kafka-python
- pyKafka

We'll use kafka-python to handle topics and producers.

In [3]:
! pip install kafka-python

Collecting kafka-python
  Downloading kafka_python-2.0.2-py2.py3-none-any.whl (246 kB)
Installing collected packages: kafka-python
Successfully installed kafka-python-2.0.2


Kafka producers can be instantiated via the `KafkaProducer` class:

```python
#--- A TYPICAL PRODUCER
import msgpack                               # third-party package, needed only for this serializer
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['62.30.10.23:9092'],  #<<<--- list of brokers
    security_protocol="SSL",                 #<<<--- security protocol (if any) 
    ssl_cafile="./ca.pem",                   #<<<--- certificate details (if any)
    ssl_certfile="./service.cert",           #           ...
    ssl_keyfile="./service.key",             #           ...
    value_serializer=msgpack.dumps           #<<<--- message value serialization function (e.g. encode the message in a specific format)
)
```
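
As an illustration of what a `value_serializer` does, here is a minimal sketch of a JSON-based one. The names `json_serializer` and `json_deserializer` are hypothetical helpers introduced for this example; the only requirement on the serializer is that it takes the message value and returns bytes.

```python
import json


def json_serializer(value):
    """Turn a JSON-compatible Python object into UTF-8 bytes."""
    return json.dumps(value).encode('utf-8')


def json_deserializer(raw):
    """Inverse of json_serializer, e.g. for use on the consumer side."""
    return json.loads(raw.decode('utf-8'))


payload = json_serializer({'name': 'Alice', 'amount': 12.5})
print(payload)

# With a reachable broker, the function would be passed like this (not executed here):
# producer = KafkaProducer(bootstrap_servers=KAFKA_BOOTSTRAP_SERVERS,
#                          value_serializer=json_serializer)
# producer.send('my_awesome_topic', {'name': 'Alice', 'amount': 12.5})
```

With a serializer configured, `send()` can be given plain Python objects instead of pre-encoded bytes.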


We'll play with the vanilla version of the producer.
No certificates or specific serialization are used in this example.

A simple producer is instantiated by pointing it to the Kafka brokers:

In [None]:
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers=KAFKA_BOOTSTRAP_SERVERS)

Let's try to publish a message to the topic we previously created without specifying any given key.

In [None]:
producer.send('my_awesome_topic', b'message 1')

The returned object `<kafka.producer.future...>` tells us explicitly that the record has been created and will be sent.
However, it has not necessarily been sent yet...

`KafkaProducer.send()` is in fact an asynchronous publish method.

This means that the producer appends the message to an internal buffer, whose content is sent to the broker later (after a tunable maximum buffering time, or once enough messages have accumulated). If no partition leader is available, the producer waits some more time for it to respond.

This behaviour is perfectly OK. Even more, it's the expected behaviour of Kafka given the default settings.

Just be aware that the messages won't necessarily be sent right away.
If a burst of messages is sent and `exit()` is issued right after it, it might be that no message is actually sent (because neither the maximum buffering time nor the required number of messages has been reached).

Have a look at the API for all the tunable parameters: https://kafka-python.readthedocs.io/en/master/apidoc/KafkaProducer.html
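
One way to observe when a buffered record is actually delivered is to attach callbacks to the future returned by `send()`. The sketch below assumes the `add_callback` / `add_errback` methods that kafka-python documents for that future; `send_with_callbacks`, `on_success` and `on_error` are hypothetical names introduced for this example.

```python
delivered = []   # filled by the success callback
failed = []      # filled by the error callback


def on_success(record_metadata):
    # record_metadata carries the topic, partition and offset of the stored record
    delivered.append((record_metadata.topic,
                      record_metadata.partition,
                      record_metadata.offset))


def on_error(exc):
    # exc is the exception describing why the send failed
    failed.append(exc)


def send_with_callbacks(producer, topic, value):
    """Send asynchronously and register delivery callbacks on the returned future."""
    future = producer.send(topic, value)
    future.add_callback(on_success)
    future.add_errback(on_error)
    return future


# Usage with the producer defined in this notebook (needs a running broker):
# send_with_callbacks(producer, 'my_awesome_topic', b'message with callbacks')
# producer.flush()
# print(delivered, failed)
```

The call to `send()` still returns immediately; the callbacks fire later, once the broker has acknowledged or rejected the record.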


To send a message "synchronously", call `flush()` on the producer: it blocks until all buffered messages have been sent.

In [None]:
producer.send('my_awesome_topic', b'a new message')
producer.flush()
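
While `flush()` waits for everything in the buffer, it is also possible to wait for a single record by blocking on the future returned by `send()`. This is a minimal sketch, assuming the `get(timeout=...)` method that kafka-python documents for that future; `send_and_wait` is a hypothetical helper name.

```python
def send_and_wait(producer, topic, value, timeout=10):
    """Send one record and block until the broker acknowledges it.

    Returns the record metadata (topic, partition, offset); raises an
    exception if the send fails or the timeout (in seconds) expires.
    """
    future = producer.send(topic, value)
    return future.get(timeout=timeout)


# Usage with the producer defined in this notebook (needs a running broker):
# metadata = send_and_wait(producer, 'my_awesome_topic', b'a blocking message')
# print(metadata.topic, metadata.partition, metadata.offset)
```

Blocking on every record gives up the batching done by the producer, so this pattern is better suited to tests and low-rate messages than to high-throughput pipelines.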

It's important to realize that producers and consumers are completely decoupled. 
Even if a producer dies, the consumer won't be affected, as it will still be able to access the topic on the brokers.

In [None]:
producer.close()

In [None]:
producer = KafkaProducer(bootstrap_servers=KAFKA_BOOTSTRAP_SERVERS)
producer.send('my_awesome_topic', b'a message from the revived producer')

Messages are `<key, value>` pairs.

So far we have produced messages with only a `value`, but a `key` can be added as well.
(Message keys can also be used to route messages to specific partitions.)

In [None]:
producer.send(topic='my_awesome_topic', key=b'some_key', value=b'a message with key')
producer.flush()
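
The idea behind key-based routing is that the key is hashed and the hash is reduced modulo the number of partitions, so equal keys always land in the same partition. The toy function below illustrates this with CRC32 as a stand-in hash. This is an assumption made for simplicity: Kafka's default partitioner uses a murmur2 hash, so the indices printed here will not match those chosen by a real producer.

```python
import zlib


def toy_partition_for(key, num_partitions):
    """Map a message key (bytes) to a partition index.

    Stand-in hash: CRC32, NOT the murmur2 hash used by Kafka's default
    partitioner. Only the principle (hash modulo partitions) is the same.
    """
    return zlib.crc32(key) % num_partitions


# The same key is always mapped to the same partition.
for key in (b'alice', b'bob', b'alice'):
    print(key, '->', toy_partition_for(key, num_partitions=2))
```

If a specific partition is required regardless of the key, `send()` in kafka-python also accepts an explicit `partition` argument.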

### Create a topic from kafka-python

kafka-python also allows administering the Kafka cluster, for example by defining new topics and assigning them specific configuration parameters, such as the replication factor.

In [None]:
from kafka.admin import KafkaAdminClient, NewTopic

kafka_admin = KafkaAdminClient(
        bootstrap_servers=KAFKA_BOOTSTRAP_SERVERS,
    )

Let's check the list of topics present on the cluster.

This is the equivalent of issuing `./kafka-topics.sh --list --bootstrap-server kafka-broker:9092`

In [None]:
kafka_admin.list_topics()

Topics are partitioned entities.
Within each partition events are added to the end of the log, resulting in an ordered list of records.

Publishing a new message to a partitioned topic appends the message to the end of the log held by the broker that leads the chosen partition. If replication is enabled, the message is then copied to the follower replicas of that partition.
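
The append-only behaviour described above can be pictured with a small in-memory model. `ToyTopic` is a hypothetical class written only for illustration; it ignores brokers, replication and persistence, and just shows that each partition keeps its own ordered log with its own offsets.

```python
from collections import defaultdict


class ToyTopic:
    """In-memory model of a partitioned, append-only topic (illustration only)."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.partitions = defaultdict(list)   # partition index -> list of records

    def append(self, partition, value):
        """Append a record to one partition and return its offset."""
        if not 0 <= partition < self.num_partitions:
            raise ValueError('unknown partition %d' % partition)
        log = self.partitions[partition]
        log.append(value)
        return len(log) - 1                   # offsets start at 0 in each partition


topic = ToyTopic(num_partitions=2)
print(topic.append(0, b'first'))    # offset of the record in partition 0
print(topic.append(0, b'second'))   # next offset in partition 0
print(topic.append(1, b'other'))    # partition 1 counts its offsets independently
```

Ordering is therefore guaranteed within a partition, but not across the partitions of a topic.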

In [None]:
# creating a new topic explicitly
#    w/   2 partitions
#    w/o  replication 
a_new_topic = NewTopic(name='a_partitioned_topic', 
                       num_partitions=2, 
                       replication_factor=1)

kafka_admin.create_topics(new_topics=[a_new_topic])

In [None]:
kafka_admin.list_topics()
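
Re-running the topic creation cell when the topic already exists typically fails with a "topic already exists" error. A simple guard is to check the topic list first. The sketch below uses only the `list_topics()` and `create_topics()` calls shown above; `create_topic_if_missing` is a hypothetical helper, and the check-then-create sequence is not atomic, so it is meant for single-user notebooks rather than concurrent setups.

```python
def create_topic_if_missing(admin, new_topic):
    """Create new_topic unless a topic with the same name is already listed.

    Returns True if the topic was created, False if it already existed.
    """
    if new_topic.name in admin.list_topics():
        return False
    admin.create_topics(new_topics=[new_topic])
    return True


# Usage with the objects defined in this notebook (needs a running broker):
# create_topic_if_missing(kafka_admin, a_new_topic)
```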

### Publish messages for the Spark Structured Streaming example

Kafka can be used as a source for incoming messages in Spark Streaming and Structured Streaming.

In Spark 3.1.1 the Kafka integration is unfortunately not available for pySpark Streaming (while it is still available for Scala and Java).

We'll use the pySpark Structured Streaming API for implementing the example previously seen in the Spark hands-on sessions.

In [None]:
import socket
import json
import time
import random

first_names=('John','Andy','Joe','Alice')
last_names=('Johnson','Smith','Jones', 'Millers')

# while 1:
for i in range(20):
    msg = {
        'name': random.choice(first_names),
        'surname': random.choice(last_names),
        'amount': '{:.2f}'.format(random.random()*1000),
        'delta_t': '{:.2f}'.format(random.random()*10),
        'flag': random.choices([0,1], weights=[0.8, 0.2])[0]
    }
    producer.send('a_partitioned_topic', json.dumps(msg).encode('utf-8'))
    producer.flush()
    time.sleep(0.25)

Let's create a new topic to store the results of the Kafka+Spark processing...

In [None]:
a_new_topic = NewTopic(name='results', 
                       num_partitions=2, 
                       replication_factor=1)

kafka_admin.create_topics(new_topics=[a_new_topic])

kafka_admin.list_topics()