## Spark Streaming Notebook

This notebook consumes messages from one Kafka topic and writes them to another Kafka topic.

#### Create a new Spark session and add the Kafka connector package

In [1]:
from pyspark.sql import SparkSession

# Spark session and context
spark = (SparkSession
        .builder
        .master('local')
        .appName('kafka-streaming')
        # Add kafka package
        .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5")
        .getOrCreate())

sc = spark.sparkContext
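
The `spark.jars.packages` value above is a Maven coordinate. Its Scala suffix (`2.11`) and version (`2.4.5`) generally need to match the Spark build the notebook runs on. As a minimal sketch, a hypothetical helper `kafka_package` (not part of PySpark) can assemble the coordinate from those two values:

```python
def kafka_package(scala_version: str, spark_version: str) -> str:
    """Build the Maven coordinate of the Spark SQL Kafka connector.

    Hypothetical helper for illustration; both arguments are assumed to
    describe the Spark installation that will load the package.
    """
    return f"org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}"


# Reproduces the coordinate used in the session builder above
print(kafka_package("2.11", "2.4.5"))
```

The Spark version can usually be read from `pyspark.__version__`; the Scala version depends on how the Spark distribution was built.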

#### Read the messages from the Kafka stream

Start a streaming read with `spark.readStream`, set the Kafka bootstrap server and port used to connect to Kafka internally, and subscribe to the ingestion topic. Use `selectExpr` to cast the binary key and value we get from Kafka to strings.

In [2]:
df = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "ingestiontopic") \
    .option("failOnDataLoss", "false") \
    .load()

# selectExpr returns a new DataFrame, so assign the result to keep the cast
df = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
df

DataFrame[key: string, value: string]
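
Kafka delivers the key and value of each record as raw bytes, which is why the cast to string is needed before the messages are readable. As a rough plain-Python analogy (outside Spark, with a made-up record and a hypothetical helper `decode_record`), the cast behaves like a UTF-8 decode that leaves missing keys as null, assuming the bytes are valid UTF-8:

```python
from typing import Optional, Tuple


def decode_record(
    key: Optional[bytes], value: Optional[bytes]
) -> Tuple[Optional[str], Optional[str]]:
    """Decode a raw key/value pair as UTF-8 text, keeping None as None."""

    def to_text(raw: Optional[bytes]) -> Optional[str]:
        return None if raw is None else raw.decode("utf-8")

    return to_text(key), to_text(value)


# A made-up record with no key and a small JSON payload
print(decode_record(None, b'{"id": 1}'))
```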

#### Create a small temporary view for Spark SQL

Create a temporary view from the streaming DataFrame returned by `readStream`, then query it with Spark SQL.

Then write the messages to the console of the environment in append mode.

In [3]:
df.createOrReplaceTempView("message")

res = spark.sql("SELECT * FROM message")
res.writeStream.format("console") \
        .outputMode("append") \
        .start()

<pyspark.sql.streaming.StreamingQuery at 0x7fe8bc3107d0>
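
The console query started above keeps running in the background until it is stopped. A minimal sketch of cleaning up, assuming the `spark` session from the first cell: `spark.streams.active` lists the running queries and each one has a `stop()` method. The helper name `stop_active_queries` is hypothetical.

```python
def stop_active_queries(spark):
    """Stop every active streaming query on the session; return their ids."""
    stopped = []
    # Copy the list first, since stopping a query changes what is active
    for query in list(spark.streams.active):
        query.stop()
        stopped.append(str(query.id))
    return stopped


# In the notebook, call it with the existing session:
# stop_active_queries(spark)
```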

#### Write the messages back to Kafka, into another topic that you will listen to with a local consumer

Use `/tmp` as the checkpoint location within the Docker container.

In [None]:
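# Keep the StreamingQuery handle; chaining awaitTermination() would store None
ds = df \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("topic", "sparkoutput") \
    .option("checkpointLocation", "/tmp") \
    .option("failOnDataLoss", "false") \
    .start()

# Block this cell until the query is stopped or fails
ds.awaitTermination()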
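
Each streaming query needs its own checkpoint location. If more queries are added later, pointing all of them at `/tmp` itself would make them collide. As a sketch, a hypothetical helper `checkpoint_dir` could derive one subdirectory per query; the base path `/tmp/checkpoints` is an assumption, not something this notebook sets up.

```python
from pathlib import PurePosixPath


def checkpoint_dir(query_name: str, base: str = "/tmp/checkpoints") -> str:
    """Return a per-query checkpoint path inside the container.

    Hypothetical helper; the default base directory is an assumed layout.
    """
    return str(PurePosixPath(base) / query_name)


# One directory per output topic or query
print(checkpoint_dir("sparkoutput"))
```

The returned string could then be passed as the `checkpointLocation` option in place of `/tmp`.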