# Spark Structured Streaming
![alt text](pics/demo2.png )

This project aims to make it possible to process data in **real time**
 from various sources such as OpenMRS, EID, etc.

https://spark.apache.org/docs/2.3.3/structured-streaming-kafka-integration.html#deploying

https://mtpatter.github.io/bilao/notebooks/html/01-spark-struct-stream-kafka.html

http://www.adaltas.com/en/2019/04/18/spark-streaming-data-pipelines-with-structured-streaming/

## Set up Spark Session

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
            .appName("Spark Structured Streaming from Kafka") \
            .config("spark.jars.packages","org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.3") \
            .config('spark.executor.memory', '10G')\
            .config('spark.driver.memory', '10G')\
            .config('spark.driver.maxResultSize', '10G')\
            .getOrCreate()
 
spark
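The `spark.jars.packages` value above has to match the Spark and Scala versions of the installation that runs the notebook. As a minimal sketch, the coordinate could be assembled from those two versions; the helper name `kafka_connector_coordinate` is an illustrative assumption, not part of PySpark.

```python
def kafka_connector_coordinate(spark_version: str, scala_version: str) -> str:
    """Build the Maven coordinate of the Kafka connector for Structured Streaming."""
    return f"org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}"


# Spark 2.3.3 built against Scala 2.11, as used in this notebook
coordinate = kafka_connector_coordinate("2.3.3", "2.11")
print(coordinate)
```

The resulting string can be passed to `.config("spark.jars.packages", coordinate)` when building the session.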

## Connection to Kafka
A Kafka topic can be viewed as an infinite stream where data is retained for a configurable amount of time. Because the stream is infinite, a new query first has to decide which data to read and where in time to begin. At a high level, there are three choices:

- earliest: start reading at the beginning of the stream. This excludes data that has already been deleted from Kafka because it was older than the retention period (“aged out” data).
- latest: start now, processing only new data that arrives after the query has started.
- per-partition assignment: specify the exact offset to start from for every partition, which gives fine-grained control over where processing begins.
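Besides the `earliest` and `latest` keywords, the `startingOffsets` option also accepts a JSON string that maps each topic to per-partition offsets, where `-2` stands for earliest and `-1` for latest. The sketch below builds such a string; the helper `starting_offsets` and the topic name `dbserver1.openmrs.patient` are hypothetical and only serve as an illustration.

```python
import json

EARLIEST = -2  # special offset value that Spark reads as "earliest"
LATEST = -1    # special offset value that Spark reads as "latest"


def starting_offsets(offsets_by_topic):
    """Serialize {topic: {partition: offset}} into the JSON string used by startingOffsets."""
    return json.dumps({
        topic: {str(partition): int(offset) for partition, offset in partitions.items()}
        for topic, partitions in offsets_by_topic.items()
    })


# Hypothetical topic: partition 0 starts at offset 23, partition 1 at the earliest offset
offsets_json = starting_offsets({"dbserver1.openmrs.patient": {0: 23, 1: EARLIEST}})
print(offsets_json)
```

The string would then be passed as `.option("startingOffsets", offsets_json)`. When explicit offsets are given, Spark generally expects an entry for every partition of the subscribed topics, so this form fits best with a known, fixed set of topics.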

<img src="https://databricks.com/wp-content/uploads/2017/04/kafka-topic.png" width="300">

In [2]:
kafkaStreamDF = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribePattern", "dbserver1.openmrs.*") \
    .option("startingOffsets", "earliest") \
    .load() 

# For a single topic, use .option("subscribe", "topic") instead of "subscribePattern"

## Create output for Spark Structured Streaming

In [4]:
# start() returns a StreamingQuery handle; the rows land in the in-memory table "kafkaraw"
kafkaRaw = kafkaStreamDF \
        .writeStream \
        .queryName("kafkaraw")\
        .format("memory")\
        .start()

# The in-memory table is filled asynchronously, so it may still be empty right after start()
raw = spark.sql("select * from kafkaraw")
raw.printSchema()
raw.show()

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)

+---+-----+-----+---------+------+---------+-------------+
|key|value|topic|partition|offset|timestamp|timestampType|
+---+-----+-----+---------+------+---------+-------------+
+---+-----+-----+---------+------+---------+-------------+


## Cast Value as String

In [5]:
# Kafka delivers key and value as binary, so cast the value to a readable string.
# To keep the key as well, add "CAST(key AS STRING)" to selectExpr.
ds = kafkaStreamDF.selectExpr("CAST(value AS STRING)")
print(type(ds))
print(ds)

<class 'pyspark.sql.dataframe.DataFrame'>
DataFrame[value: string]


In [8]:
# Despite its name, castedDF holds the StreamingQuery handle; the data is in the table "castedDFS"
castedDF = kafkaStreamDF.selectExpr("CAST(value AS STRING)") \
        .writeStream \
        .queryName("castedDFS")\
        .format("memory")\
        .start()

In [14]:
castedDFS = spark.sql("select * from castedDFS")
castedDFS.printSchema()
castedDFS.show()

root
 |-- value: string (nullable = true)

+--------------------+
|               value|
+--------------------+
|{"schema":{"type"...|
|{"schema":{"type"...|
|{"schema":{"type"...|
|{"schema":{"type"...|
|{"schema":{"type"...|
|{"schema":{"type"...|
|{"schema":{"type"...|
|{"schema":{"type"...|
|{"schema":{"type"...|
|{"schema":{"type"...|
|{"schema":{"type"...|
|{"schema":{"type"...|
|{"schema":{"type"...|
|{"schema":{"type"...|
|{"schema":{"type"...|
|{"schema":{"type"...|
|{"schema":{"type"...|
|{"schema":{"type"...|
|{"schema":{"type"...|
|{"schema":{"type"...|
+--------------------+
only showing top 20 rows
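Each value shown above starts with a `schema` field, which is the shape produced by the Kafka Connect JSON converter: a `schema` part followed by a `payload` part holding the actual record. Assuming the messages follow that envelope (the `dbserver1.openmrs.*` topic names suggest a Debezium connector, but that is an assumption), the payload could be pulled out as sketched below. The sample message and the helper `extract_payload` are hypothetical and heavily shortened.

```python
import json

# Hypothetical, shortened sample value; real messages carry the full table schema
sample_value = json.dumps({
    "schema": {"type": "struct", "name": "dbserver1.openmrs.patient.Envelope"},
    "payload": {"before": None, "after": {"patient_id": 1}, "op": "c"},
})


def extract_payload(value):
    """Return the payload part of a Kafka Connect JSON envelope, or None if it is absent."""
    return json.loads(value).get("payload")


payload = extract_payload(sample_value)
print(payload["op"], payload["after"])
```

Inside Spark, a comparable extraction can be done on the string column with `pyspark.sql.functions.get_json_object`, for example with the path `$.payload`.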


## Convert to Parquet

In [24]:

# The file sink needs a checkpoint location; Spark uses it to track progress across restarts
kafkaStreamDF.selectExpr("CAST(value AS STRING)") \
        .writeStream \
        .format("parquet")\
        .option("path","streamingdata")\
        .option("checkpointLocation", "streamcheckpoint")\
        .start()

<pyspark.sql.streaming.StreamingQuery at 0x7f4fa4184588>

## Push to Kafka as a New Topic

In [33]:
# Write key-value data to the Kafka topic named in the "topic" option
query = kafkaStreamDF.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")\
  .writeStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "localhost:9092")\
  .option("topic","spark_to_kafka") \
  .option("checkpointLocation", "sinkcheckpoint") \
  .start()
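The queries started in this notebook keep running in the background until they are stopped or the session ends. A minimal sketch for shutting them down, assuming the `spark` session created at the top of the notebook; the helper name `stop_all_queries` is hypothetical.

```python
def stop_all_queries(spark):
    """Stop every active streaming query of the session and return the query names."""
    stopped = []
    for query in spark.streams.active:  # list of running StreamingQuery objects
        stopped.append(query.name)      # name set via queryName(), or None if unnamed
        query.stop()
    return stopped


# Usage inside the notebook (needs the live session):
# stop_all_queries(spark)
```

Queries that were started without `queryName`, such as the Parquet and Kafka sinks above, show up with the name `None`.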