# Structured Streaming Demo

### Demo

In [1]:
import findspark
# TODO: your path will likely not have 'matthew' in it. Change it to reflect your path.
findspark.init('/home/matthew/spark-2.3.0-bin-hadoop2.7')

In [2]:
import sys
import time
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

In [3]:
spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()

In a terminal, run `nc -lk 9999` to start a netcat server listening on port 9999, then type any text. Each line you enter is sent to the stream.

In [4]:
# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()

In [5]:
# Split the lines into words
words = lines.select(explode(split(lines.value, " ")).alias("word"))
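
To see what `explode(split(...))` produces without starting a stream, the plain-Python sketch below mimics it on a list of strings. The helper name `explode_words` and the sample lines are made up for illustration. This is not Spark code, and it only approximates Spark's behaviour (Spark's `split` treats its second argument as a regular expression pattern, for instance).

```python
def explode_words(lines):
    """Return one word per output row, roughly like explode(split(value, " "))."""
    words = []
    for line in lines:
        # str.split(" ") keeps empty strings between repeated spaces,
        # so "a  b" yields three entries rather than two.
        words.extend(line.split(" "))
    return words


sample = ["hello world", "hello spark"]
print(explode_words(sample))
```

Each input line becomes several output rows, which is why the `words` DataFrame has a single `word` column with one row per token.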

In [6]:
# Generate running word count
wordCounts = words.groupBy("word").count()
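
A running count means the totals carry over from one batch of input to the next. The sketch below imitates that in plain Python, assuming each batch is a list of lines. `running_counts` and the sample batches are hypothetical names and data, not part of Spark.

```python
from collections import Counter


def running_counts(batches):
    """Yield the full word-count table after each batch of lines."""
    totals = Counter()
    for batch in batches:
        for line in batch:
            totals.update(line.split(" "))
        # Emit every word seen so far, not only the ones that changed,
        # which is what the "complete" output mode does.
        yield dict(totals)


batches = [["apache spark"], ["apache hadoop", "spark"]]
for counts in running_counts(batches):
    print(counts)
```

The second printed table still contains the counts from the first batch. That matches the `complete` output mode used when the query is started.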

Some of the operations available on a started streaming query (`query` is created in the next cell):

| Operator             | Purpose                                                                                            |
|----------------------|----------------------------------------------------------------------------------------------------|
| query.name           | get the auto-generated or user-specified name of the query                                         |
| query.id             | get the unique identifier of the running query that persists across restarts from checkpoint data |
| query.runId          | get the unique id of this run of the query, which will be generated at every start/restart        |
| query.recentProgress | an array of the most recent progress updates for this query                                       |
| query.lastProgress   | the most recent progress update of this streaming query                                           |
| spark.streams.active | get the list of currently active streaming queries                                                |
| query.stop()         | stop the query                                                                                    |
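
Once a query has been started, several of these values can be gathered in one place. The sketch below is only a convenience wrapper. `describe_query` is a hypothetical helper name, and it assumes a PySpark `StreamingQuery`, where `id`, `runId`, `name`, `isActive` and `lastProgress` are read as attributes rather than called as methods.

```python
def describe_query(query):
    """Collect a few status fields from a started StreamingQuery."""
    return {
        "id": str(query.id),        # stable across restarts from a checkpoint
        "runId": str(query.runId),  # new value on every start/restart
        "name": query.name,         # None unless a name was set with queryName()
        "isActive": query.isActive,
        # lastProgress is None until the first progress update is available
        "lastProgress": query.lastProgress,
    }
```

Calling `describe_query(query)` only works while the cell is free to run other code, so it has to be called before any blocking `awaitTermination()` call.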

In [7]:
# Start running the query that prints the running counts to the console
query = wordCounts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()

Interrupting the kernel ends the blocking `awaitTermination()` call with a `KeyboardInterrupt`.
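
A bounded wait avoids having to interrupt the kernel by hand. The sketch below assumes the `query` object from the cell above. `run_for` is a hypothetical helper, and it relies on `awaitTermination` accepting a timeout in seconds and returning whether the query terminated within that time.

```python
def run_for(query, seconds):
    """Let the query run for up to `seconds`, then stop it if still running."""
    # With a timeout, awaitTermination returns True if the query
    # terminated within that time and False otherwise.
    terminated = query.awaitTermination(seconds)
    if not terminated:
        query.stop()
    return terminated
```

For example, `run_for(query, 30)` would let the word counts print to the console for about 30 seconds and then stop the query, returning control to the notebook.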
