<h1> Integrating Kafka with Apache Spark </h1>

> One of the most popular use cases in industry is to connect Kafka with Apache Spark, whereby Kafka feeds data into Spark which implements data processing.

In large organizations, Kafka is rarely used in isolation; it's usually one part of a larger data system that involves multiple steps and phases.

As a quick reminder, Apache Spark is a fast and efficient unified data analytics engine for Big Data and machine learning.  If you haven't already done so, check out the AiCore Spark content for more details.

Below is a high-level overview of the Apache Spark ecosystem:

<p align="center">
  <img src="./images/spark-ecosystem.png" width=600>
  </p>

For the purpose of this lesson, we'll be focusing on __Spark Streaming__.

If you are unfamiliar with Spark, please make sure to review the Spark module.


For more detailed information about Spark, you can also read the link below:
  - __[What is Apache Spark](https://www.ibm.com/cloud/learn/apache-spark)__

### Kafka and Apache Spark

It is common in global companies to use both Kafka and Spark as part of a larger data processing ecosystem.

Below is an example of what such a system typically looks like:

<p align="center">
  <img src="./images/kafka-spark.png" width=600>
  </p>


You'll notice that, in such a system, we have multiple phases or steps in the data flow.  Typically, the flow would be as follows:

__1. Data Streaming:__
- This is where the raw data is normally produced. 
- It could be either real-time data (such as machine/IoT generated data) or batch data (such as data entered manually into a system).

<p></p>

__2. Data Collection:__  
- The data then goes through a Kafka system, which is responsible for connecting to the data producer on one side and the data consumer (in this case Apache Spark) on the other.
- Kafka also buffers the data until it's consumed by Spark.
<p></p>

__3. Data Processing:__
- In this topology, Spark (or, more specifically, the Spark Streaming module) is the data consumer, as it sits on the receiving end of Kafka.
- Data arriving in Spark can be analyzed and processed on the fly in real time (batch processing is also supported).
- Processed data can then be sent to a persistent storage layer, where it is kept for downstream processing.
<p></p>

__4. Data Storage:__
- The data storage layer is where the processed information is stored.
- There are several options to choose from, including Hadoop HDFS, Amazon S3, NoSQL data stores such as HBase or MongoDB, or commodity hard drives, among others.  It's also possible to produce the data to a separate Kafka topic, which can be connected to other downstream systems.
- The cleaned, formatted and processed data is then accessed by downstream systems that perform various analytics and data science work.
<p></p>

__5. Data Analysis/Visualization:__
- In this layer, the prepared data is fed to front-end systems used for business intelligence, analytics and data science.
- Data is connected to dashboards and visualization tools, which can be used to create reports for business executives.
- Data science teams use the prepared information to train and test their models.
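
As a rough illustration of the flow above, the sketch below uses in-memory stand-ins: a `deque` plays the role of the Kafka topic, a plain function plays the role of Spark, and a list plays the role of the storage layer. All names here are hypothetical, and none of this uses the real Kafka or Spark APIs; it only shows how messages wait in a buffer until a consumer processes them in batches.

```python
from collections import Counter, deque

# Hypothetical in-memory stand-ins for the stages described above.
topic_buffer = deque()   # plays the role of a Kafka topic
storage = []             # plays the role of the storage layer


def produce(message):
    """Data streaming and collection: append a raw message to the buffer."""
    topic_buffer.append(message)


def consume_batch(max_messages):
    """Data processing: drain up to max_messages and count the words in them."""
    batch = []
    while topic_buffer and len(batch) < max_messages:
        batch.append(topic_buffer.popleft())
    counts = Counter(word for message in batch for word in message.split())
    return dict(counts)


for raw in ["sensor ok", "sensor fault", "sensor ok"]:
    produce(raw)

# Messages stay in the buffer until the consumer asks for them.
storage.append(consume_batch(max_messages=2))
storage.append(consume_batch(max_messages=2))
print(storage)
```

The producer never calls the consumer directly; the buffer in the middle is what decouples the two, which is the role Kafka plays in the real system.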

See how Shopify uses Kafka, Python and Spark Streaming for real-time risk management in this [video](https://databricks.com/session/realtime-risk-management-using-kafka-python-and-spark-streaming).

### How to implement Kafka-Spark integration

Now that we have a solid understanding of Kafka, Spark and how the two are used together in the real world, it's time to roll up our sleeves and start coding.

> In this tutorial, Kafka will act as the streaming platform that delivers events to Spark, while Spark will be used as the data processing engine to count the words in each message batch received from Kafka.




At a high level, the steps involved in this exercise are:

1. Setting up the Kafka environment
2. Creating a new Kafka topic
3. Setting up the Spark environment
4. Creating the Wordcount Spark Streaming Python code
5. Running the Wordcount application
6. Opening a Kafka producer
7. Printing the output as the data arrives

### 1. Setting up the Kafka environment

> Ensure that you've followed the detailed instructions in the Kafka module to properly set up your Kafka environment.

As usual, we first need to start ZooKeeper and then the Kafka server, since the Kafka broker connects to ZooKeeper on startup. Run each of the following commands in its own terminal (ensure you're in the Kafka folder):

In [None]:
# Start the ZooKeeper server first
bash bin/zookeeper-server-start.sh config/zookeeper.properties

In [None]:
# Then start the Kafka server
bash bin/kafka-server-start.sh config/server.properties

Assuming this ran successfully, you should see output similar to:

![](images/kafka-zookeeper.png)
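
If you'd rather confirm from Python that both services are reachable, here is a minimal sketch using only the standard library. It assumes the services run on `localhost` with the ports used elsewhere in this lesson (2181 for ZooKeeper, 9092 for the Kafka broker); `port_is_open` and `check_services` are hypothetical helper names. An open port only shows that something is listening, not that the service is healthy.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host="localhost"):
    """Report whether the assumed ZooKeeper and Kafka ports accept connections."""
    services = {"ZooKeeper": 2181, "Kafka broker": 9092}
    return {name: port_is_open(host, port) for name, port in services.items()}


if __name__ == "__main__":
    for name, is_up in check_services().items():
        print(f"{name}: {'reachable' if is_up else 'not reachable'}")
```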

### 2. Creating a new Kafka topic

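To use Kafka, we need a topic to hold the data.  To create a new topic, run the following command, replacing `zookeeper_server` with the host of your ZooKeeper server. Note that the `--zookeeper` option was removed in Kafka 3.0; on newer versions, pass `--bootstrap-server kafka_broker:9092` instead.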

In [None]:
# Create a new Wordcount topic
kafka-topics --create --zookeeper zookeeper_server:2181 --topic wordcounttopic --partitions 1 --replication-factor 1

### 3. Setting up the Spark environment

> Make sure you have followed the steps in the Spark lessons to set up Spark locally on your machine.  If you haven't yet done so, please go to the __Spark Basics__ and __Spark Streaming__ notebooks and follow the steps there before proceeding.

We'll also need to have PySpark installed to be able to run this application.  To check if it's already installed, run the following command in the terminal:

In [None]:
pyspark

If PySpark is properly set up, you should see output similar to the following:

![](images/spark-download-7.png)

If PySpark is not set up, you can install it using the following `pip` command (please also refer to the Spark Basics and Spark Streaming notebooks for full instructions):

In [None]:
pip install pyspark
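
As an alternative to launching the `pyspark` shell, the installed version can be checked from Python. This is a small sketch using the standard library's `importlib.metadata` (Python 3.8+); `installed_version` is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("pyspark")
if version is None:
    print("PySpark is not installed; run: pip install pyspark")
else:
    print(f"PySpark {version} is installed")
```

Knowing the exact version is useful later on, because the Kafka integration available to you depends on which Spark release you have.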

### 4. Wordcount Spark Streaming Python code

We need to create a Wordcount application that divides the input data stream into batches of 10 seconds, and then counts the occurrences of each word in each batch.

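You can use the following code for this step (save the code below into a file called `kafka_wordcount.py`). Note that the `pyspark.streaming.kafka` module it imports is only available in Spark 2.x; it was removed in Spark 3.0.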

In [None]:
from __future__ import print_function

import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: kafka_wordcount.py <zk> <topic>", file=sys.stderr)
        sys.exit(-1)

    sc = SparkContext(appName="PythonStreamingKafkaWordCount")
    ssc = StreamingContext(sc, 10)

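    # Command-line arguments: the ZooKeeper quorum (host:port) and the topic name
    zkQuorum, topic = sys.argv[1:]
    # Receiver-based stream: consumer group "spark-streaming-consumer", one thread for the topic
    kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})
    # Each record is a (key, value) pair; keep only the message value
    lines = kvs.map(lambda x: x[1])
    # Split each line into words, pair each word with 1, then sum the 1s per word
    counts = lines.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
    # Print the first ten elements of each batch's counts to the console
    counts.pprint()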

    ssc.start()
    ssc.awaitTermination()
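
On Spark 3.x, where `pyspark.streaming.kafka` is no longer available, a similar word count can be written with Structured Streaming's Kafka source. The following is a minimal sketch, not a drop-in replacement: it assumes a broker at `kafka_broker:9092` (the address used for the producer later in this lesson), the function names are hypothetical, and it keeps a running total per word rather than a separate count for each 10-second batch. This source connects to the Kafka brokers directly, so it takes a broker address instead of a ZooKeeper address.

```python
import sys


def kafka_source_options(bootstrap_servers, topic):
    """Options for Spark's Kafka source, which talks to brokers rather than ZooKeeper."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def run_wordcount(bootstrap_servers, topic):
    # Imported here so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode, split

    spark = SparkSession.builder.appName("StructuredKafkaWordCount").getOrCreate()

    # Kafka delivers keys and values as binary, so cast the value to a string.
    lines = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
        .selectExpr("CAST(value AS STRING) AS line")
    )

    # One row per word, then a running count per word.
    words = lines.select(explode(split(col("line"), " ")).alias("word"))
    counts = words.groupBy("word").count()

    # Print the full table of counts to the console every 10 seconds.
    query = (
        counts.writeStream.outputMode("complete")
        .format("console")
        .trigger(processingTime="10 seconds")
        .start()
    )
    query.awaitTermination()


# Only start the job when both arguments are given: <brokers> <topic>
if __name__ == "__main__" and len(sys.argv) == 3:
    run_wordcount(sys.argv[1], sys.argv[2])
```

Saved as, say, `structured_wordcount.py` (a hypothetical filename), it could be submitted with the Kafka connector pulled in via `--packages`, for example `spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 structured_wordcount.py kafka_broker:9092 wordcounttopic`. The connector's Scala and Spark versions need to match your own Spark installation.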

### 5. Run the Wordcount Application

Now that we have the above code saved in a `kafka_wordcount.py` file, the next step is to use `spark-submit` to run it on Spark.

Submit the application using `spark-submit` with dynamic allocation disabled and specifying your ZooKeeper server and topic. To run locally, you must specify at least two worker threads: one to receive and one to process data.

Run the following command (make sure to substitute the correct path for `SPARK_HOME` and the name of your own `spark-examples` jar file):

In [None]:
# Spark submit command
spark-submit --master "local[2]" --deploy-mode client --conf "spark.dynamicAllocation.enabled=false" --jars SPARK_HOME/jars/spark-examples_2.12-3.2.0.jar kafka_wordcount.py zookeeper_server:2181 wordcounttopic

__Important arguments:__
- `master`:
    -   Specifies the cluster manager to use.  In this example, we run locally, but Spark also supports Mesos, Kubernetes, stand-alone and YARN.
- `deploy-mode`:
    -   Specifies whether to deploy the driver program on the worker nodes (cluster mode) or on the local machine (client mode).
- `conf`:
    -   Specifies a configuration property as a `key=value` pair.  These properties are numerous and cover application settings, shuffle parameters and other runtime configuration.
- `jars`:
    -    Comma-separated list of jar files to add to the driver and executor classpaths, used here to supply the dependencies the application needs.
- `kafka_wordcount.py`:
    -    The Python file to run.
- `zookeeper_server:2181`:
    -    The ZooKeeper server host and port number, passed to the script as its first argument.
- `wordcounttopic`:
    -    The Kafka topic to read from, passed to the script as its second argument.


For a detailed description of all `spark-submit` arguments, run the following command in an open terminal:

In [None]:
# Spark-submit detailed options description
spark-submit --help

Assuming all goes well, you should see the application running.

### 6. Open a Kafka producer

Next, we need a Kafka producer to provide data to Spark.  To create one, run the following command:

In [None]:
# Kafka producer for the topic Wordcount 
kafka-console-producer --broker-list kafka_broker:9092 --topic wordcounttopic

Now, you can enter data in the terminal, which will then be picked up by Spark, and the words will be counted in real-time.

To test this, type the following data in the Kafka producer window:


- `Customer ID`
- `Product ID`
- `Timestamp`
- `Address`
- `Price`
- `Phone Number`
- `Email`
- `Gender`
- `First Name`
- `Last Name`


Depending on how fast you type, in the Spark Streaming application window, you'll see output similar to:


`Time: 2016-01-06 14:18:00`

- `(u'Customer', 1)`
- `(u'Product', 1)`
- `(u'ID', 2)`
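
The sample output above is consistent with a batch that held just the first two input lines, `Customer ID` and `Product ID`. As a quick check that needs no Spark cluster, here is a plain-Python sketch of the same split, pair and sum steps used in `kafka_wordcount.py`; `count_words` is a hypothetical helper, not part of PySpark.

```python
from collections import Counter


def count_words(lines):
    """Mimic flatMap(split) -> map((word, 1)) -> reduceByKey(add) for one batch."""
    words = [word for line in lines for word in line.split(" ")]  # flatMap
    pairs = [(word, 1) for word in words]                          # map
    counts = Counter()
    for word, one in pairs:                                        # reduceByKey
        counts[word] += one
    return dict(counts)


batch = ["Customer ID", "Product ID"]
print(count_words(batch))  # {'Customer': 1, 'ID': 2, 'Product': 1}
```

Lines that arrive in a later 10-second window form a new batch and are counted separately, which is why the output depends on how fast you type.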


This completes the hands-on tutorial and the lesson.

# Key Takeaways

- Kafka and Spark are designed to integrate smoothly together without major configuration or coding effort.
- Kafka can connect to flat files, Twitter feeds and other types of data sources.
- Kafka can be used to produce a constant stream of data that Spark Streaming can ingest and manipulate.
- Kafka and Spark Streaming together can handle both batch and real-time data.
- It's possible to apply different transformations to the data in real time, such as Count and Sort.
- To run a more complex application, we use `spark-submit` from the terminal, passing it the file that contains our PySpark code.

