# Spark Structured Streaming
![Demo](pics/demo2.png)

The motivation for this project is to enable **real-time** processing of data
from various sources such as OpenMRS, EID, etc.

https://spark.apache.org/docs/2.3.3/structured-streaming-kafka-integration.html#deploying

https://mtpatter.github.io/bilao/notebooks/html/01-spark-struct-stream-kafka.html

http://www.adaltas.com/en/2019/04/18/spark-streaming-data-pipelines-with-structured-streaming/

## Set up Spark Session

In [2]:
from pyspark.sql import SparkSession

# The spark-sql-kafka package must match the Spark (2.3.3) and Scala (2.11) versions in use
spark = SparkSession.builder \
    .appName("Spark Structured Streaming from Kafka") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.3") \
    .config("spark.executor.memory", "10G") \
    .config("spark.driver.memory", "10G") \
    .config("spark.driver.maxResultSize", "10G") \
    .getOrCreate()
 
spark

## Connection to Kafka
A Kafka topic can be viewed as an infinite stream in which data is retained for a configurable amount of time. Because the stream is unbounded, a new query must first decide what data to read and where in time to begin. At a high level, there are three choices:

- earliest: start reading at the beginning of the stream. This excludes data that has already been deleted from Kafka because it was older than the retention period ("aged out" data).
- latest: start now, processing only new data that arrives after the query has started.
- per-partition assignment: specify the exact offset to start from for each partition, which gives fine-grained control over where processing begins.
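
In Spark's Kafka source, the starting position is set through the `startingOffsets` option: the strings `"earliest"` and `"latest"`, or a JSON string giving an offset per topic partition, where `-2` stands for earliest and `-1` for latest. The sketch below builds such a JSON string in plain Python. The helper name `build_starting_offsets` and the partition offsets are hypothetical, chosen only for illustration.

```python
import json

def build_starting_offsets(topic, partition_offsets):
    """Return a JSON string usable as the Kafka source's startingOffsets option.

    partition_offsets maps partition number -> offset (-2 = earliest, -1 = latest).
    """
    # JSON object keys must be strings, so partition numbers are converted
    return json.dumps({topic: {str(p): o for p, o in partition_offsets.items()}})

# Hypothetical example: partition 0 from offset 23, partition 1 from the earliest offset
starting_offsets = build_starting_offsets("dbserver1.openmrs.patient", {0: 23, 1: -2})
print(starting_offsets)

# The string would then take the place of "earliest" in the reader configuration:
#   .option("startingOffsets", starting_offsets)
```

When specific offsets are given this way, Spark expects an entry for every partition of the subscribed topic, so the mapping should be built from the topic's actual partition list.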

<img src="https://databricks.com/wp-content/uploads/2017/04/kafka-topic.png" width="300">

In [3]:
from pyspark.sql.types import (
    StructType, StructField, LongType, BooleanType,
    IntegerType, TimestampType, StringType
)
import pyspark.sql.functions as f
obs_schema = StructType([
    StructField('obs_id', LongType(), True),
    StructField('voided', BooleanType(), True),
    StructField('concept_id', IntegerType(), True),
    StructField('obs_datetime', TimestampType(), True),
    StructField('value', StringType(), True),
    StructField('value_type', StringType(), True),
    StructField('obs_group_id', IntegerType(), True),
    StructField('parent_concept_id', IntegerType(), True)
])


In [16]:
# Columns of interest from the patient record
patient = StructType([
    StructField('patient_id', LongType(), True),
    StructField('date_created', LongType(), True),
    StructField('creator', LongType(), True)
])

# Envelope of each Kafka message: the record after the change sits under payload.after
schema = StructType([
    StructField('schema', StringType()),
    StructField('payload', StructType([
        StructField('before', StringType()),
        StructField('after', patient)
    ]))
])

# SSS is the milliseconds field; lowercase sss would be read as seconds again
jsonOptions = {"timestampFormat": "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"}

kafkaStreamDF = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "dbserver1.openmrs.patient") \
    .option("startingOffsets", "earliest") \
    .load()\
    .select(f.from_json(f.col("value").cast("string"), schema, jsonOptions).alias("parsed_value"))
    
kafkaStreamDF.createOrReplaceTempView("patients")

# printSchema() prints the schema itself and returns None, so it is not wrapped in print()
kafkaStreamDF.printSchema()



root
 |-- parsed_value: struct (nullable = true)
 |    |-- schema: string (nullable = true)
 |    |-- payload: struct (nullable = true)
 |    |    |-- before: string (nullable = true)
 |    |    |-- after: struct (nullable = true)
 |    |    |    |-- patient_id: long (nullable = true)
 |    |    |    |-- date_created: long (nullable = true)
 |    |    |    |-- creator: long (nullable = true)
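
To make the nesting in this schema concrete, the following plain-Python sketch parses one message with the same `schema`/`payload`/`before`/`after` layout and pulls out the `payload.after` record. The sample message and its values are made up for illustration, and `extract_after` is a hypothetical helper; real messages on the topic will typically carry more fields.

```python
import json

# Hypothetical message value with the same layout as the schema (all values are made up)
sample_value = json.dumps({
    "schema": None,
    "payload": {
        "before": None,
        "after": {"patient_id": 42, "date_created": 1546302300000, "creator": 1},
    },
})

def extract_after(raw_value):
    """Return the payload.after record of a message, or None if it is absent."""
    parsed = json.loads(raw_value)
    payload = parsed.get("payload") or {}
    return payload.get("after")

after = extract_after(sample_value)
print(after["patient_id"], after["date_created"], after["creator"])
```

This mirrors what `from_json` followed by `select("parsed_value.payload.after.*")` does on the stream, where a missing `after` record comes out as null.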


In [21]:
sql = """
    SELECT
      WINDOW(FROM_UNIXTIME(parsed_value.payload.after.date_created/1000), "1 hour", "10 minutes") AS eventWindow,
      parsed_value.payload.after.creator AS creator,
      AVG(parsed_value.payload.after.patient_id) AS avgPatientId,
      MIN(parsed_value.payload.after.patient_id) AS minPatientId,
      MAX(parsed_value.payload.after.patient_id) AS maxPatientId
    FROM
      patients
    GROUP BY
      eventWindow,
      creator
    ORDER BY
      eventWindow,
      creator
"""
    
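query = spark.sql(sql)

# Write the complete, sorted aggregate to the console on every trigger
result = query.writeStream \
    .format("console") \
    .outputMode("complete") \
    .option("truncate", "false") \
    .start()

# Uncomment to block until the stream is stopped
# result.awaitTermination()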
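
`WINDOW(..., "1 hour", "10 minutes")` assigns each event to every one-hour window that starts on a ten-minute boundary and contains its timestamp, so a single event contributes to six overlapping windows. The sketch below reproduces that assignment in plain Python for one `date_created` value in epoch milliseconds. It assumes windows aligned to the Unix epoch, which is Spark's behaviour when no start offset is given; the function name `sliding_windows` and the sample timestamp are illustrative.

```python
WINDOW_SECONDS = 60 * 60   # "1 hour"
SLIDE_SECONDS = 10 * 60    # "10 minutes"

def sliding_windows(epoch_millis, window=WINDOW_SECONDS, slide=SLIDE_SECONDS):
    """Return (start, end) pairs, in epoch seconds, of every window containing the event."""
    ts = epoch_millis // 1000          # milliseconds to seconds, as date_created/1000 does
    start = ts - (ts % slide)          # latest window start at or before the event
    windows = []
    while start + window > ts:         # step back while the window still covers the event
        windows.append((start, start + window))
        start -= slide
    return sorted(windows)

# Hypothetical event at 2019-01-01T00:25:00Z
event_millis = 1546302300000
for window_start, window_end in sliding_windows(event_millis):
    print(window_start, window_end)
```

Because every event lands in several windows, the aggregate for a given window keeps changing until its hour has passed, which is why the query writes its results in `complete` output mode.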

In [None]:
query = kafkaStreamDF \
    .select("parsed_value.payload.after.*") \
    .writeStream \
    .format("console") \
    .start()

# query.awaitTermination()
# Please see your terminal/console for the output

![Console output](pics/console.png)