# Structured Streaming Demo

Introduction to Structured Streaming
* Overview of Structured Streaming
    * Streaming code that looks a lot like the equivalent non-streaming code
    * Structured data allows Spark to represent the data more efficiently
    * SQL-style queries open up query optimization opportunities, and with them better performance
    * Interoperability with other Spark components based on Datasets
        * MLlib is also moving towards Datasets as its primary API
    * Datasets in general seem to be the direction Spark is moving
* Basic concepts of Structured Streaming
    * https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts
* DEMO: A quick Structured Streaming example
    * https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#quick-example


### Demo

In [1]:
import findspark
# TODO: your path will likely not have 'matthew' in it. Change it to reflect your path.
findspark.init('/home/matthew/spark-2.2.1-bin-hadoop2.7')

In [2]:
import sys
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

In [None]:
spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()

In [None]:
# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()
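
The socket source connects as a client, so something has to be listening on localhost:9999 before the query starts. The Spark quick example uses Netcat (`nc -lk 9999`) for this. As an alternative, here is a minimal sketch of a standard-library stand-in; `start_line_server` is a hypothetical helper written for this demo, not part of Spark, and it serves only the first client before closing.

```python
import socket
import threading

def start_line_server(lines, host="localhost", port=9999):
    """Listen on (host, port) and send each line to the first client that connects."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)

    def serve():
        # Wait for one client (here, the Spark socket source) and stream the lines to it.
        conn, _ = server.accept()
        with conn:
            for line in lines:
                conn.sendall((line + "\n").encode("utf-8"))
        server.close()

    thread = threading.Thread(target=serve, daemon=True)
    thread.start()
    # Also return the bound port, which matters when port=0 asks the OS for a free one.
    return thread, server.getsockname()[1]
```

Calling `start_line_server(["apache spark", "apache hadoop"])` in a cell before the query is started should give the stream a couple of lines to count. Because the helper closes the connection after the last line, Netcat is the better fit for typing input interactively.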

In [None]:
# Split the lines into words
words = lines.select(explode(split(lines.value, " ")).alias("word"))

In [None]:
# Generate running word count
wordCounts = words.groupBy("word").count()

In [None]:
# Start running the query that prints the running counts to the console
query = wordCounts.writeStream.outputMode("complete").format("console").start()

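# Blocks until the query terminates; in a notebook, skip this call if you want to run the cells below while the query is active
query.awaitTermination()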
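
To make the "complete" output mode concrete, here is a plain-Python sketch (no Spark involved) of what the query computes: the word counts are kept across micro-batches, and the whole result table is emitted after each one. The function name `run_complete_mode` and the sample batches are made up for illustration.

```python
from collections import Counter

def run_complete_mode(batches):
    """Yield the full result table after each micro-batch of input lines."""
    counts = Counter()  # the running result table, kept across batches
    for batch in batches:
        for line in batch:
            # Same shape as explode(split(value, " ")): one word per output row.
            counts.update(line.split(" "))
        # "complete" mode emits every row of the result table, not just the changes.
        yield dict(counts)

# Two made-up micro-batches of input lines.
batches = [["apache spark", "apache hadoop"], ["spark streaming"]]
for batch_id, table in enumerate(run_complete_mode(batches)):
    print(batch_id, sorted(table.items()))
```

The second table still contains `apache` and `hadoop` even though neither word appears in the second batch, which is the behaviour the console sink shows in this demo.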

In [None]:
query.name          # the query's user-specified name; None if no name was set (a property in PySpark, not a method)

In [None]:
query.id            # the unique identifier of the running query; persists across restarts from checkpoint data

In [None]:
query.runId         # the unique id of this run of the query; regenerated at every start/restart

In [None]:
query.recentProgress    # a list of the most recent progress updates for this query

In [None]:
query.lastProgress      # the most recent progress update of this streaming query
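
In PySpark a progress update comes back as a dictionary (or `None` before the first update), so individual fields can be picked out. A minimal sketch, assuming the update carries `batchId`, `numInputRows` and `processedRowsPerSecond` keys; `summarize_progress` is a hypothetical helper and the sample dictionary is made up rather than captured from a real run.

```python
def summarize_progress(progress):
    """Return a one-line summary of a progress dict, or a notice if there is none yet."""
    if progress is None:
        return "no progress reported yet"
    return "batch {}: {} rows in, {:.1f} rows/s processed".format(
        progress.get("batchId"),
        progress.get("numInputRows"),
        progress.get("processedRowsPerSecond", 0.0),
    )

# Made-up sample shaped like a progress update; real updates carry more fields.
sample = {"batchId": 3, "numInputRows": 12, "processedRowsPerSecond": 48.0}
print(summarize_progress(sample))
```

With a live query, the same helper could be called on the most recent progress update of `query`.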

In [None]:
spark.streams.active    # get the list of currently active streaming queries

In [None]:
query.stop()      # stop the query
