# Stream-processing services

**Topics:**  

- Messages and event streaming
- Producers and consumers
- Apache Kafka
- Basic configuration
- Integration with Python


## What is Stream processing?

Stream processing is the practice of taking action on a series of data at the time the data is created. Historically, data practitioners used “real-time processing” to talk generally about data that was processed as frequently as necessary for a particular use case. But with the advent and adoption of stream processing technologies and frameworks, coupled with decreasing prices for RAM, “stream processing” is used in a more specific manner.  

Stream processing often entails multiple tasks on the incoming series of data (the “data stream”), which can be performed serially, in parallel, or both. This workflow is referred to as a stream processing pipeline, which includes the generation of the data, the processing of the data, and the delivery of the data to a final location.  

Actions that stream processing takes on data include aggregations (e.g., calculations such as sum, mean, standard deviation), analytics (e.g., predicting a future event based on patterns in the data), transformations (e.g., changing a number into a date format), enrichment (e.g., combining the data point with other data sources to create more context and meaning), and ingestion (e.g., inserting the data into a database).
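
As a rough illustration of these actions, the following sketch chains a transformation, an enrichment, and an aggregation as Python generators, so that each event passes through every stage as it arrives. The event fields (`sensor`, `ts`, `value`) and the `locations` lookup table are made up for this example and do not come from any particular framework.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, Iterator


def transform(events: Iterable[Dict]) -> Iterator[Dict]:
    """Transformation: turn a Unix timestamp (a number) into an ISO date string."""
    for event in events:
        ts = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
        yield {**event, "ts": ts.isoformat()}


def enrich(events: Iterable[Dict], sensor_locations: Dict[str, str]) -> Iterator[Dict]:
    """Enrichment: add context from another data source (here, a lookup table)."""
    for event in events:
        yield {**event, "location": sensor_locations.get(event["sensor"], "unknown")}


def running_mean(events: Iterable[Dict]) -> Iterator[Dict]:
    """Aggregation: attach the mean of all values seen so far to each event."""
    total, count = 0.0, 0
    for event in events:
        total += event["value"]
        count += 1
        yield {**event, "mean_so_far": total / count}


# Hypothetical input data for the sketch.
raw_events = [
    {"sensor": "s1", "ts": 0, "value": 10.0},
    {"sensor": "s2", "ts": 60, "value": 20.0},
    {"sensor": "s1", "ts": 120, "value": 30.0},
]
locations = {"s1": "hall"}

# Stages are chained serially; each event flows through all of them in turn.
results = list(running_mean(enrich(transform(raw_events), locations)))
```

Because generators are lazy, no stage waits for the whole input before starting, which is the property that distinguishes this style from collecting everything first and processing it afterwards.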

![image.png](attachment:image.png)

Stream processing allows applications to respond to new data events at the moment they occur. In this simplified example, input data is processed by the stream processing engine in real-time. The output data is delivered to a streaming analytics application and added to the output stream.

### Why Stream Processing?
To understand why stream processing came into existence, let’s look into how data processing was done before. With the previous approach, called batch processing, all data was stored in a database or a distributed filesystem, and different applications would perform computation using this data. Since batch processing tools were built to process datasets of finite size, to continuously process new data, an application would periodically crunch data from the last period like one hour or one day.


![image.png](attachment:image.png)

While this architecture worked for many years and still has many applications, it has fundamental drawbacks. Since new data is not processed as soon as it arrives, this causes several issues:  

- **High latency**: new results are computed only after a significant delay, which is undesirable because the value of data decreases with time  

- **Session data**: since a batch processing system splits data into time intervals, it is hard to analyze events that started during one time interval but ended during another  

- **Non-uniform load**: a batch processing system must wait until enough data has accumulated before it can process the next block of data


Stream processing turns this model of data processing on its head: it is all about processing a flow of events. A typical stream application consists of a number of producers that generate new events and a set of consumers that process these events. Events in the system can be any number of things, such as financial transactions, user activity on a website, or application metrics. Consumers can aggregate incoming data, send automatic alerts in real-time, or produce new streams of data that can be processed by other consumers.

![image.png](attachment:image.png)

This architecture has a number of advantages:
- **Low latency**: a system can process new events and react to them in real-time
- **A natural fit for many applications**: a stream processing system is a natural fit for applications that work with a never-ending stream of events
- **Uniform processing**: instead of waiting for data to accumulate before processing the next batch, a stream processing system performs computation as soon as new data arrives


Unsurprisingly, stream processing was first adopted by financial companies that need to process new information, like trades or prices, in real-time, but is now used in many areas like fraud detection, online recommendations, monitoring, and many others.

This architecture, however, poses a question: how should producers and consumers be connected? Should a producer open a TCP session to every consumer and send events directly? While this may be an option, it presents a significant issue if a producer is writing data faster than a consumer can process it. Also, if we have a significant number of consumers and producers, the web of connections can turn into an unruly mess.  
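
To make the "web of connections" concrete, here is a small back-of-the-envelope sketch. It assumes the worst case, in which every producer must reach every consumer, and compares that with routing everything through one intermediary; the function names are invented for this illustration.

```python
def direct_connections(producers: int, consumers: int) -> int:
    """Point-to-point: each producer opens a session to each consumer."""
    return producers * consumers


def brokered_connections(producers: int, consumers: int) -> int:
    """With an intermediary: each party holds a single connection to it."""
    return producers + consumers


direct = direct_connections(10, 20)
brokered = brokered_connections(10, 20)
```

With 10 producers and 20 consumers the direct approach needs 200 sessions, while the brokered one needs 30, and the gap widens as either side grows.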

This is exactly the problem that LinkedIn faced in 2008, when they ended up with a large number of point-to-point pipelines among multiple systems. To organize them, they started working on an internal project that eventually became Apache Kafka. In a nutshell, Kafka is a buffer that allows producers to store new streaming events and consumers to read them, in real-time, at their own pace.


![image.png](attachment:image.png)

## Messages and event streaming

Enterprise messaging technologies, such as IBM MQ, RabbitMQ and ActiveMQ, have provided asynchronous communication within and across applications for many years. Recently, event streaming technologies (such as Apache Kafka) have grown in popularity, and they also provide asynchronous communication.

### Choosing between enterprise messaging and event streaming technologies

Enterprise messaging and event streaming technologies each excel at different capabilities but also have capabilities in common. Selecting the right technology for your solution is key to ensuring that it is not a forced fit.
To facilitate this evaluation, consider these key selection criteria:
- Event history
- Fine-grained subscriptions
- Scalable consumption
- Transactional behavior


**Event history**  


Does the solution need to be able to retrieve historical events, during normal operation, in failure situations, or both? Within a messaging pub/sub model, the event is published to a topic. Once it is received by the subscriber, it is the subscriber's responsibility to store this information for the future. There are certain situations where the pub/sub model can retain the last publication, but it is unusual to have the messaging technology store historical events. For Apache Kafka, storing event history is fundamental to the architecture; the only questions are how much and for how long. In many use cases it is critical to store this history, while in others it may be undesirable from a security and system resources standpoint.

**Fine-grained subscriptions**  

When a topic is created in Apache Kafka this creates one or more partitions within the solution. This is a fundamental architectural concept within Apache Kafka, and provides the capability to scale the solution to handle a massive amount of events. Each partition uses up resources and it is normally advisable to limit the number of topics to hundreds or maybe thousands within a single cluster.

![image.png](attachment:image.png)

Messaging pub/sub technologies have a more flexible mechanism, where the topic can be a hierarchical structure such as /travel/flights/airline/flight-number/seat, allowing more subscription points. This allows subscribing applications to select the events at a finer granularity. In addition, messaging pub/sub selectors can be used to further refine the events of interest.   
Applications subscribing to messaging pub/sub systems are far less likely to receive events that are irrelevant to them, while applications subscribing to Apache Kafka that want only a small proportion of events will likely need a discarding filter to be applied early in the processing.
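
The difference can be sketched in a few lines of Python. The `matches` function below models a hierarchical subscription as simple prefix matching on path segments, which is an assumption made for this example (real messaging products define their own wildcard and selector syntax). The `discard_filter` function shows the kind of early filter a consumer of a coarse-grained topic might apply instead; the event fields are hypothetical.

```python
from typing import Dict, Iterable, Iterator


def matches(subscription: str, topic: str) -> bool:
    """True when `topic` equals `subscription` or lies below it in the hierarchy."""
    sub_parts = subscription.strip("/").split("/")
    topic_parts = topic.strip("/").split("/")
    return topic_parts[: len(sub_parts)] == sub_parts


def discard_filter(events: Iterable[Dict], airline: str) -> Iterator[Dict]:
    """Consumer-side filter: drop every event that is not for `airline`."""
    for event in events:
        if event["airline"] == airline:
            yield event


# Hypothetical events, as they might arrive on one coarse-grained topic.
events = [
    {"airline": "ba", "flight": "117", "seat": "12A"},
    {"airline": "lh", "flight": "400", "seat": "3C"},
]
kept = list(discard_filter(events, "ba"))
```

In the first model the irrelevant event is never delivered; in the second it is delivered and then thrown away, which costs network and CPU time on the consumer.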

**Scalable Consumption**  
If 100 consumers subscribe to all events on a topic, a messaging technology may create 100 messages for each published event. Each of these may be stored and, if required, persisted to disk using system resources. In the case of Apache Kafka, the event is written once and each consumer has an index corresponding to where they are in the event history. Messaging providers such as IBM MQ are highly scalable, so depending on the number of events emitted by the publisher, and the number of subscribers, this may or may not be a factor in deciding the most appropriate technology.
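
The storage difference described above can be sketched with two toy in-memory brokers. Both classes are simplified models written for this illustration and are not how IBM MQ or Kafka are actually implemented.

```python
from typing import Dict, List


class FanOutBroker:
    """Messaging-style fan-out: one stored copy of each event per subscriber."""

    def __init__(self, subscribers: List[str]) -> None:
        self.queues: Dict[str, List[str]] = {name: [] for name in subscribers}

    def publish(self, event: str) -> None:
        for queue in self.queues.values():
            queue.append(event)

    def stored_copies(self) -> int:
        return sum(len(queue) for queue in self.queues.values())


class LogBroker:
    """Log-style: each event is stored once; every consumer keeps its own index."""

    def __init__(self, subscribers: List[str]) -> None:
        self.log: List[str] = []
        self.offsets: Dict[str, int] = {name: 0 for name in subscribers}

    def publish(self, event: str) -> None:
        self.log.append(event)

    def read(self, name: str) -> List[str]:
        """Return the events this consumer has not seen yet and advance its index."""
        start = self.offsets[name]
        self.offsets[name] = len(self.log)
        return self.log[start:]

    def stored_copies(self) -> int:
        return len(self.log)


names = [f"consumer-{i}" for i in range(100)]
fan_out, log = FanOutBroker(names), LogBroker(names)
for event in ["e1", "e2", "e3"]:
    fan_out.publish(event)
    log.publish(event)
```

After three events the fan-out model holds 300 stored messages while the log holds 3, yet every consumer of the log can still read all three at its own pace.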


**Transactional behavior**  

Both enterprise messaging technologies, such as IBM MQ, and event streaming technologies, such as Apache Kafka, provide transactional APIs to process events. However, the two implementations do work differently and therefore aren’t automatically interchangeable. IBM MQ provides the ACID properties of Atomicity, Consistency, Isolation, and Durability but these are not guaranteed in the same way in Apache Kafka. Often in a pub/sub solution, the specific transactional behavior of IBM MQ is not as critical as in a request for processing use case so being aware of the difference is important.

### Why use Event Streaming?  
Event streaming is really powerful when you have events that you want to process and analyze in real-time, allowing your systems to take action immediately.  

**Use Case**
- Processing
    - Payments
    - Stocks
- Detection
    - Fraud
    - Anomaly
- Maintenance
    - Predictive
- Analytics
    - IoT


**Event Streaming Use Case Table**   

- Event streaming is used for real-time processing and analysis of changes in state in the data through events
- It can persist events supporting the rebuilding of the state through replaying events in the order they were received
- It allows multiple consumers to receive each event

![image.png](attachment:image.png)
Event Streaming pattern
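
Rebuilding state by replaying events can be sketched as follows. The event shape `(account, kind, amount)` and the account-balance example are assumptions chosen for illustration; the point is only that applying the same events in the same order always yields the same state.

```python
from typing import Dict, Iterable, Tuple

Event = Tuple[str, str, int]  # (account, kind, amount), a hypothetical event shape


def apply_event(state: Dict[str, int], event: Event) -> None:
    """Apply one event to the state in place."""
    account, kind, amount = event
    delta = amount if kind == "deposit" else -amount
    state[account] = state.get(account, 0) + delta


def replay(events: Iterable[Event]) -> Dict[str, int]:
    """Rebuild the state from scratch by applying events in the order received."""
    state: Dict[str, int] = {}
    for event in events:
        apply_event(state, event)
    return state


event_log = [
    ("alice", "deposit", 100),
    ("bob", "deposit", 50),
    ("alice", "withdraw", 30),
]
balances = replay(event_log)
```

Replaying only a prefix of the log, for example `replay(event_log[:2])`, gives the state as it was at that earlier point in time.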

### Why use Messaging?
Messaging is powerful when it comes to decoupling your systems and providing a highly available and durable solution. Apart from a few use cases, messaging is more of a way to manage your system interactions and control the ingestion of your data.


Pub / Sub **Use Case**  
- Applications
    - Stateful
- Workflows
    - Asynchronous
- Workloads
    - Balancing
- Systems
    - Stocks
- Notifications
    - Email
    - SMS
- Delivery
    - Multiple consumers
    - Ordered


**Messaging Pub / Sub Use Case Table**  
- Publish-subscribe is commonly referred to as pub-sub
- Pub-sub moves data from producers to consumers
- It allows multiple consumers to receive each message in a topic
- Publish-subscribe ensures that each consumer receives messages in a topic in the exact order in which they were received by the messaging system


![image.png](attachment:image.png)

Queueing **Use Case**  
- Workflows
    - Asynchronous
- Workloads
    - Balancing
- Systems
    - Pay Check
- Task Lists
    - Work Queues
- Delivery
    - One consumer
    - Ordering not important

**Message Queueing Use Case Table**  

- Message queueing ensures that each message is delivered to and processed by exactly one consumer
- It does not ensure that messages are delivered or processed in order
- Each message is removed from the queue once it has been delivered, but this requires consumer acknowledgement

![image.png](attachment:image.png)
Message Queue (push / pull) pattern
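
The two delivery models can be contrasted in a short sketch. The round-robin hand-off in `deliver_queue` is an assumption made to keep the example deterministic; a real queue gives each message to whichever consumer asks next.

```python
from collections import deque
from itertools import cycle
from typing import Dict, List


def deliver_pubsub(messages: List[int], subscribers: List[str]) -> Dict[str, List[int]]:
    """Pub/sub: every subscriber receives every message, in publication order."""
    return {name: list(messages) for name in subscribers}


def deliver_queue(messages: List[int], workers: List[str]) -> Dict[str, List[int]]:
    """Queue: each message goes to exactly one worker and is then removed."""
    queue = deque(messages)
    received: Dict[str, List[int]] = {name: [] for name in workers}
    turn = cycle(workers)
    while queue:
        received[next(turn)].append(queue.popleft())
    return received


pubsub_result = deliver_pubsub([1, 2, 3, 4], ["a", "b"])
queue_result = deliver_queue([1, 2, 3, 4], ["a", "b"])
```

In `pubsub_result` both consumers hold all four messages, while in `queue_result` the four messages are split between the workers and none is processed twice.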

## What is Apache Kafka?

Kafka is a **distributed publish-subscribe messaging system that maintains feeds of messages in partitioned and replicated topics**. In the simplest terms, there are three players in the Kafka ecosystem: producers, topics (run by brokers) and consumers.  

**Producers produce messages** to a topic of their choice. It is possible to attach a key to each message, in which case the producer guarantees that all messages with the same key will arrive at the same partition.  
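
The idea behind key-based routing can be sketched as hashing the key and taking the result modulo the number of partitions. The CRC32 hash used here is only a stand-in chosen because it ships with Python; Kafka clients use their own hash function, so the partition numbers below will not match a real cluster.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition; the same key always yields the same partition."""
    return zlib.crc32(key) % num_partitions


def route(messages: List[Tuple[bytes, str]], num_partitions: int) -> Dict[int, List[str]]:
    """Group message values by the partition their key maps to, preserving order."""
    partitions: Dict[int, List[str]] = defaultdict(list)
    for key, value in messages:
        partitions[choose_partition(key, num_partitions)].append(value)
    return dict(partitions)


# Hypothetical keyed messages: (key, value).
messages = [
    (b"user-1", "login"),
    (b"user-2", "login"),
    (b"user-1", "click"),
    (b"user-1", "logout"),
]
routed = route(messages, num_partitions=3)
```

Since all `user-1` messages land in one partition, a consumer of that partition sees them in the order they were produced.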

**Topics are logs** that receive data from the producers and store them across their partitions. Producers always write new messages at the end of the log. In our example we can ignore partitions, since we’re working locally with a single partition.  

**Consumers read the messages** of a set of partitions of a topic of their choice at their own pace. If the consumer is part of a consumer group, i.e. a group of consumers subscribed to the same topic, it can commit its offset. This can be important if you want to consume a topic in parallel with different consumers.


![image.png](attachment:image.png)

The offset is the position in the log where the consumer last consumed or read a message. The consumer can then commit this offset to make the reading ‘official’. Offset committing can be done automatically in the background or explicitly. In our example we will commit automatically in the background.
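
A minimal in-memory model of offsets and commits might look like the following. The `PartitionLog` class is invented for this sketch; it stores, per consumer group, the position of the next record to read, and a group without a committed position starts from the beginning (in this sketch only).

```python
from typing import Dict, List, Tuple


class PartitionLog:
    """An append-only list of records plus one committed position per consumer group."""

    def __init__(self, records: List[str]) -> None:
        self.records = list(records)
        self.committed: Dict[str, int] = {}

    def fetch(self, group: str, max_records: int) -> List[Tuple[int, str]]:
        """Return (offset, record) pairs starting at the group's committed position."""
        start = self.committed.get(group, 0)
        end = min(start + max_records, len(self.records))
        return [(offset, self.records[offset]) for offset in range(start, end)]

    def commit(self, group: str, next_offset: int) -> None:
        """Record the position of the next record the group should read."""
        self.committed[group] = next_offset


log = PartitionLog(["a", "b", "c", "d"])

# First consumer run: read two records, then commit the position after them.
first_batch = log.fetch("my-group", max_records=2)
log.commit("my-group", first_batch[-1][0] + 1)

# A restarted consumer in the same group resumes where the commit left off.
second_batch = log.fetch("my-group", max_records=2)

# A different group has no committed position and, in this sketch, starts at 0.
other_group_batch = log.fetch("other-group", max_records=2)
```

Note that reading alone does not move the group's position; only the commit does, which is why an uncommitted read is repeated after a restart.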

![image.png](attachment:image.png)

### Kafka Concepts


![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)

### Topics
Every message that is fed into the system must be part of some topic. A topic is nothing but a stream of records. The messages are stored in key-value format. Each message is assigned a sequence number, called the offset. The output produced from one message can serve as the input to another step for further processing.

### Producers  
- Producers are the apps responsible for publishing data into the Kafka system. They publish data to the topic of their choice (e.g. clickstream, logs, IoT).

**Advantages**  
The Kafka Producer API is extremely simple to use: you send data, the call is asynchronous, and you get a callback. This is perfectly suited for applications directly emitting streams of data such as logs, clickstreams, IoT.
It is very common to use this kind of API in combination with a proxy.

**Limitations**  
The Kafka Producer API can be extended and built upon to do a lot more things, but this will require engineers to write a lot of added logic. The biggest mistake I see is people trying to perform ETL between a database and Kafka using the Producer API. Here are a few things that are not easy to do:  

- How to track the source offsets? (i.e. how to properly resume your producer if it was stopped)
- How to distribute the load for your ETL across many producers?


For this, we’re much better off using the Kafka Connect Source API.

### Consumers  
The messages published to topics are then used by consumer apps. A consumer subscribes to the topic of its choice and consumes data: it reads a stream and performs real-time actions on it (e.g. sending an email).

**Advantages**  
The Kafka Consumer API is dead-simple and works with Consumer Groups so that your topics can be consumed in parallel. Although you need to be careful about a few things, such as offset management and commits, as well as rebalances and idempotence constraints, consumers are really easy to write. For any stateless kind of workload, they will be perfect. Think notifications!

**Limitations**  
When you perform some kind of ETL, Kafka Connect Sinks are better suited, as they save you from writing complicated logic against an external data source.


### Broker  
Every instance of Kafka that is responsible for message exchange is called a Broker. Kafka can run as a single stand-alone broker or as part of a cluster.  

**Example**: consider the warehouse (godown) of a restaurant where all the raw material, such as rice and vegetables, is stored. The restaurant serves different kinds of dishes: Chinese, Desi, Italian etc. The chefs of each cuisine can go to the warehouse, pick the things they need and cook with them. Something made from the raw material may later be used by the chefs of all departments, for instance a secret sauce that is used in ALL kinds of dishes. Here, the warehouse is a broker, vendors of goods are the producers, the goods and the secret sauce made by chefs are topics, and the chefs are consumers.


![image.png](attachment:image.png)

### Workflow of Pub-Sub Messaging

Following is the step wise workflow of the Pub-Sub Messaging −

- Producers send messages to a topic at regular intervals.

- The Kafka broker stores all messages in the partitions configured for that particular topic. For messages without a key, it spreads the messages evenly across partitions. If the producer sends two messages and there are two partitions, Kafka will store one message in the first partition and the second message in the second partition.

- A consumer subscribes to a specific topic.

- Once the consumer subscribes to a topic, Kafka provides the current offset of the topic to the consumer and also saves the offset. (Older Kafka versions kept consumer offsets in the ZooKeeper ensemble; current versions store them in an internal Kafka topic.)

- The consumer requests new messages from Kafka at a regular interval (like 100 ms).

- Once Kafka receives the messages from producers, it returns them to the consumers in response to these requests.

- The consumer receives the message and processes it.

- Once the messages are processed, the consumer sends an acknowledgement to the Kafka broker.

- Once Kafka receives an acknowledgement, it changes the offset to the new value and stores it. Because committed offsets are stored durably, the consumer can read the next message correctly even after server outages.

- The above flow repeats until the consumer stops sending requests.

- Consumer has the option to rewind/skip to the desired offset of a topic at any time and read all the subsequent messages.

### Benefits
Following are a few benefits of Kafka −

- **Reliability** − Kafka is distributed, partitioned, replicated and fault tolerant.

- **Scalability** − The Kafka messaging system scales easily without downtime.

- **Durability** − Kafka uses a distributed commit log, which means messages are persisted on disk as fast as possible; hence it is durable.

- **Performance** − Kafka has high throughput for both publishing and subscribing to messages. It maintains stable performance even when many TB of messages are stored.

Kafka is very fast and, with replication configured, is designed to keep running and avoid data loss when individual brokers fail.

### Kafka Use Cases 

Kafka has many uses. Here are a few use cases that could help you figure out where it fits.

- **Activity Monitoring:-** Kafka can be used for activity monitoring. The activity could belong to a website or to physical sensors and devices. Producers can publish raw data from data sources that can later be used to find trends and patterns.
- **Messaging:-** Kafka can be used as a message broker among services. If you are implementing a microservice architecture, you can have one microservice as a producer and another as a consumer. For instance, you may have one microservice that is responsible for creating new accounts and another for sending email to users about account creation.
- **Log Aggregation:-** You can use Kafka to collect logs from different systems and store them in a centralized system for further processing.
- **ETL:-** Kafka offers near-real-time streaming, so you can build an ETL pipeline on it based on your needs.
- **Database:-** Based on the things I mentioned above, you may say that Kafka also acts as a database. It is not a typical database with a feature for querying the data as needed; what I mean is that you can keep data in Kafka as long as you want without consuming it.

## Basic configuration and integration with Python

**Setting up the environment**  

In [1]:
pip install kafka-python

Collecting kafka-python
  Using cached kafka_python-2.0.2-py2.py3-none-any.whl (246 kB)
Installing collected packages: kafka-python
Successfully installed kafka-python-2.0.2
Note: you may need to restart the kernel to use updated packages.


1. First of all, you have to install Kafka and Zookeeper on your machine. For an installation guide [click here](https://medium.com/@shaaslam/installing-apache-kafka-on-windows-495f6f2fd3c8)   


2. Next install Kafka-Python. You can do this using pip or conda, if you’re using an Anaconda distribution.

`pip install kafka-python`

OR 

`conda install -c conda-forge kafka-python`



3. Don’t forget to start your **Zookeeper server and Kafka broker** before executing the example code below. In this example we assume that Zookeeper is running on its default address localhost:2181 and Kafka on localhost:9092.


In [None]:
#Starting Zookeeper
#Change dir to the zookeeper bin dir : C:\Apache\apache-zookeeper-3.6.2-bin\bin
zkserver

In [None]:
#Starting Kafka Server
#Change dir to the kafka windows dir: C:\Apache\kafka_2.13-2.7.0\bin\windows
kafka-server-start.bat C:\Apache\kafka_2.13-2.7.0\config\server.properties

### Example 1:

### Create Topics

We are using a topic called **numtest** in this example. You can create a new topic by opening a new command prompt, navigating to `…/kafka/bin/windows` and executing:

**`kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic numtest`**

### Code

In our example we’ll create a producer that emits numbers from 1 to 1000 and sends them to our Kafka broker. Then a consumer will read the data from the broker and store them in a MongoDB collection.  

The advantage of using Kafka is that, if our consumer breaks down, the new or fixed consumer will pick up reading where the previous one stopped. This is a great way to make sure **all the data is fed into the database without missing data.** Keep in mind that with automatic offset commits a message can occasionally be read twice after a restart, so duplicates are still possible.


Create a new Python script named producer.py and start by importing dumps from json, sleep from time, and KafkaProducer from the kafka-python library.

In [4]:
from time import sleep
from json import dumps
from kafka import KafkaProducer

Then initialize a new Kafka producer. Note the following arguments:

- **bootstrap_servers=['localhost:9092']:** sets the host and port the producer should contact to bootstrap initial cluster metadata. It is not necessary to set this here, since the default is localhost:9092.

- **value_serializer=lambda x: dumps(x).encode('utf-8'):** a function that defines how the data should be serialized before it is sent to the broker. Here, we convert the data to a JSON string and encode it as UTF-8.

In [5]:
producer = KafkaProducer(bootstrap_servers=['localhost:9092'],
                         value_serializer=lambda x: 
                         dumps(x).encode('utf-8'))

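If no broker is reachable at this address, the constructor raises `NoBrokersAvailable`, so make sure the Kafka broker is running before executing this cell.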

Now, we want to generate numbers from 1 to 1000. This can be done with a for-loop where we feed each number as the value into a dictionary with one key: number. This is not the topic key, but just a key of our data. Within the same loop we will also send our data to a broker.  

This can be done by calling the send method on the producer and specifying the topic and the data. Note that our value serializer will automatically convert and encode the data. To conclude our iteration, we take a 5 second break. If you want to make sure the message is received by the broker, it’s advised to include a callback.

In [None]:
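for e in range(1, 1001):  # numbers 1 to 1000 inclusive
    data = {'number' : e}
    producer.send('numtest', value=data)  # asynchronous; value_serializer encodes the dict
    sleep(5)
producer.flush()  # make sure any buffered messages are sent before the script exits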
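
One way to attach such a callback is sketched below. In kafka-python, `send` returns a future that accepts `add_callback` and `add_errback`; the helper name `send_with_callbacks` and the `delivered` / `failed` lists are made up for this example. The producer is passed in as an argument, so the helper itself does not need a running broker to be defined.

```python
from typing import Any, Dict, List


def send_with_callbacks(producer: Any, topic: str, payload: Dict,
                        delivered: List, failed: List) -> None:
    """Send one message and record the outcome once the broker responds."""

    def on_success(record_metadata: Any) -> None:
        # On success the callback receives metadata with topic, partition and offset.
        delivered.append(
            (record_metadata.topic, record_metadata.partition, record_metadata.offset)
        )

    def on_error(exc: BaseException) -> None:
        # On failure the errback receives the exception; keep the payload for a retry.
        failed.append((payload, exc))

    future = producer.send(topic, value=payload)
    future.add_callback(on_success)
    future.add_errback(on_error)
```

With the producer created above, a call could look like `send_with_callbacks(producer, 'numtest', {'number': 1}, delivered, failed)`, followed by `producer.flush()` so that pending sends complete and the callbacks fire before the lists are inspected.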

#### Consuming the data
Before we start coding our consumer, create a new file consumer.py and import json.loads, the KafkaConsumer class and MongoClient from pymongo. 

In [2]:
from kafka import KafkaConsumer
from pymongo import MongoClient
from json import loads

Let’s create our KafkaConsumer and take a closer look at the arguments.

- The first argument is the topic, **numtest** in our case.
- **bootstrap_servers=['localhost:9092']:** same as our producer
- **auto_offset_reset='earliest':** one of the most important arguments. It decides where the consumer starts reading when its group has no committed offset yet (or the committed offset is no longer valid), and can be set to either earliest or latest. When set to latest, the consumer starts reading at the end of the log. When set to earliest, it starts at the beginning of the log. If a committed offset exists, the consumer resumes from it regardless of this setting, so after a restart it picks up where it left off. And that’s exactly what we want here.
- **enable_auto_commit=True:** makes sure the consumer commits its read offset every interval.
- **auto_commit_interval_ms=1000:** sets the interval between two commits, in milliseconds. Since messages are coming in every five seconds, committing every second seems fair.
- **group_id='my-group':** this is the consumer group to which the consumer belongs. Remember from the introduction that a consumer needs to be part of a consumer group to make the auto commit work.  
- The value deserializer deserializes the data from JSON back into a Python object, the inverse of what our value serializer was doing.


In [3]:
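consumer = KafkaConsumer(
    'numtest',
    bootstrap_servers=['localhost:9092'],
    auto_offset_reset='earliest',      # where to start when the group has no committed offset
    enable_auto_commit=True,           # commit offsets periodically in the background
    auto_commit_interval_ms=1000,      # commit every second, as described above
    group_id='my-group',
    value_deserializer=lambda x: loads(x.decode('utf-8')))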

As with the producer, this call raises `NoBrokersAvailable` if no broker is reachable at the given address.
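
The serializer and deserializer lambdas can be tried out without a broker. The sketch below gives them names (`serialize` and `deserialize`, chosen for this example) and checks that one undoes the other.

```python
from json import dumps, loads
from typing import Any


def serialize(value: Any) -> bytes:
    """Same logic as the producer's value_serializer: object -> JSON text -> bytes."""
    return dumps(value).encode("utf-8")


def deserialize(raw: bytes) -> Any:
    """Same logic as the consumer's value_deserializer: bytes -> JSON text -> object."""
    return loads(raw.decode("utf-8"))


wire_bytes = serialize({"number": 7})
round_trip = deserialize(wire_bytes)
```

Here `wire_bytes` is what travels through Kafka, and `round_trip` equals the original dictionary, which is what the consumer sees in `message.value`.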

The code below connects to the numtest collection (a collection is similar to a table in a relational database) of our MongoDB database.

In [None]:
client = MongoClient('localhost:27017')
collection = client.numtest.numtest

We can extract the data from our consumer by looping through it (the consumer is an iterable). The consumer will keep listening until the broker doesn’t respond anymore. A value of a message can be accessed with the value attribute. Here, we overwrite the message with the message value.

The next line inserts the data into our database collection. The last line prints a confirmation that the message was added to our collection. Note that it is possible to add callbacks to all the actions in this loop.

In [None]:
for message in consumer:
    message = message.value
    collection.insert_one(message)
    print('{} added to {}'.format(message, collection))

### Testing

Let’s test our two scripts. Open a command prompt and go to the directory where you saved producer.py and consumer.py. Execute producer.py and open a new command prompt. Launch consumer.py and watch how it reads all the messages, including the new ones.

Now interrupt the consumer, remember at which number it was (or check it in the database) and restart the consumer. Notice that the consumer picks up all the missed messages and then continues listening for new ones.

Note that if you turn off the consumer within 1 second after reading a message, that message may be retrieved again upon restart. Why? Because the auto-commit interval is set to 1 second, the offset of the most recent message may not have been committed yet, and a restarted consumer resumes from the last committed offset.
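
This effect can be reproduced with a small timing simulation. The model is deliberately simplified and all names are invented for it: each record takes a fixed time to process, commits happen exactly at multiples of the commit interval, and a crash stops the consumer at a given time. A real client commits as part of its polling loop, so actual timings differ.

```python
from typing import List, Tuple


def run_until_crash(records: List[int], start: int, ms_per_record: int,
                    commit_interval_ms: int, crash_at_ms: int) -> Tuple[List[int], int]:
    """Return (records processed before the crash, last committed offset)."""
    # Records fully processed before the crash.
    finished = min(crash_at_ms // ms_per_record, len(records) - start)
    processed = records[start:start + finished]
    # The last commit fired at the latest multiple of the interval before the crash
    # and captured how many records had been finished by then.
    last_commit_time = (crash_at_ms // commit_interval_ms) * commit_interval_ms
    committed = start + min(last_commit_time // ms_per_record, finished)
    return processed, committed


records = [1, 2, 3, 4, 5]

# First run: 400 ms per record, commit every 1000 ms, crash at 1500 ms.
first_run, committed_offset = run_until_crash(records, 0, 400, 1000, 1500)

# Restart from the committed offset and run long enough to finish.
second_run, _ = run_until_crash(records, committed_offset, 400, 1000, 10_000)

duplicates = [r for r in second_run if r in first_run]
```

In this run the third record was processed at 1200 ms, after the commit at 1000 ms, so it is processed again after the restart. No record is lost, but one is seen twice.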

### Example 2:

#### Create Topics

**`kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test`**

You can also list all available topics by running the following command.

In [None]:
#Open a command prompt and change dir to the kafka windows dir: C:\Apache\kafka_2.13-2.7.0\bin\windows
kafka-topics.bat --list --zookeeper localhost:2181

#### Sending Messages
Next, we have to send messages; producers are used for that purpose. Let’s start a producer.

In [None]:
#Open a command prompt and change dir to the kafka windows dir: C:\Apache\kafka_2.13-2.7.0\bin\windows
kafka-console-producer.bat --broker-list localhost:9092 --topic test
>Hello
>World

This starts the console-based producer, which connects to the broker on port 9092 (the default). --topic sets the topic to which the messages will be published. In our case the topic is **test**.

It shows you a `>` prompt and you can input whatever you want.  

Messages are stored locally on your disk. You can find the storage path by checking the value of log.dirs in the config/server.properties file.  

#### Consuming Messages
Messages that are stored should be consumed too. Let’s start a console-based consumer.

In [None]:
#Open a command prompt and change dir to the kafka windows dir: C:\Apache\kafka_2.13-2.7.0\bin\windows
kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic test

In [None]:
#Open a command prompt and change dir to the kafka windows dir: C:\Apache\kafka_2.13-2.7.0\bin\windows
kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic test --from-beginning

If you run it with --from-beginning, it will dump all the messages from the beginning till now. If you are only interested in messages that arrive after the consumer starts, just omit the --from-beginning switch. In that case the old messages are not shown because, without a committed offset, the console consumer starts reading at the end of the log. You can see the workflow below.

![image.png](attachment:image.png)

### Accessing Kafka in Python
There are multiple Python libraries available:

- **Kafka-Python**: an open-source, community-based library.  
- **PyKafka**: this library is maintained by Parsly and is claimed to offer a Pythonic API. Unlike Kafka-Python, you can’t create dynamic topics.  
- **Confluent Python Kafka**: offered by Confluent as a thin wrapper around librdkafka, hence its performance is better than that of the other two.  
