# Introduction to Apache Spark Structured Streaming

Structured Streaming is a scalable, fault-tolerant stream processing engine built on top of the Spark SQL engine.
It uses the existing Dataset/DataFrame APIs of Spark SQL, unifying the development experience between *classic* batch and stream-based data processing.

## Set up Spark and import modules

In [44]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

In [45]:
spark = SparkSession.builder.appName("StructuredStreamingBasics").getOrCreate()
spark.sparkContext.setLogLevel('ERROR')

## Practice

First, let's create a source for the data stream. We can use Netcat to send data over a network connection.
1. Install netcat and netstat with `apt install netcat net-tools`.
2. Check that no other application is listening on port 9999 by running `netstat -tulpn | grep 9999`. No output means the port is free.
3. In a terminal, run `nc -lk 9999` to start the netcat server on port 9999, then type any text you like. Each line you enter is sent to the connected client.


*For the curious, an explanation of the netcat flags:*
>   * The `-l` flag tells nc to listen for an incoming connection rather than initiate a connection to a remote host.
>   * The `-k` flag forces nc to keep listening for another connection after its current connection is completed.
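
If netcat is not available, a small Python script can play a similar role. The sketch below is only an illustration, not part of the original setup: the helper names `open_line_server` and `send_lines` are hypothetical, and unlike `nc -lk` it serves a single client and then returns.

```python
import socket


def open_line_server(host="localhost", port=9999):
    """Return a TCP socket listening on host:port (roughly what `nc -l` does)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Allow quick restarts on the same port while experimenting.
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    return server


def send_lines(server, lines):
    """Wait for one client (e.g. Spark's socket source) and send it each line."""
    conn, _ = server.accept()
    with conn:
        for line in lines:
            # Lines are newline-terminated, as they would be when typed into nc.
            conn.sendall((line + "\n").encode("utf-8"))
```

To use it, you would call `send_lines(open_line_server(), ["hello spark", "hello streaming"])` in a separate process or thread before starting the streaming query, since `accept()` blocks until a client connects.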

2. Create a DataFrame representing the stream of input lines from the connection to localhost:9999. The socket source exposes each received line as a string in a single column named `value`.

In [None]:
lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()

3. Create another DataFrame that splits each line into words and counts the occurrences of each word

In [None]:
wordCount = lines.select(explode(split(lines.value, " ")).alias("word")).groupBy("word").count()
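
To make the transformation above more concrete, here is a rough plain-Python analogue that runs on an ordinary list of lines instead of a stream. It is a sketch for intuition only: `count_words` is a hypothetical helper, and it is not claimed to match Spark's `split` in every edge case.

```python
from collections import Counter


def count_words(lines):
    """Split each line on single spaces, flatten the words, and count them."""
    # split(" ") produces one list of words per line; the nested comprehension
    # flattens those lists, which is the role explode() plays in the query.
    words = [word for line in lines for word in line.split(" ")]
    # Counter groups identical words and counts them, like groupBy("word").count().
    return Counter(words)
```

For example, `count_words(["hello spark", "hello streaming"])` counts `hello` twice and the other two words once each.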

4. Start the query that prints the running counts to the console

In [None]:
query = wordCount.writeStream.outputMode("complete").format("console").start()
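
With `outputMode("complete")`, the whole updated result table is written to the sink after every trigger, not just the rows that changed. The following plain-Python sketch imitates that behaviour on a list of micro-batches. It is an assumption-laden illustration: `complete_mode_outputs` is a hypothetical name, and real batch boundaries depend on when data arrives.

```python
from collections import Counter


def complete_mode_outputs(batches):
    """Yield the full running word counts after each micro-batch of lines."""
    totals = Counter()
    for batch in batches:
        for line in batch:
            totals.update(line.split(" "))
        # Complete mode: emit a snapshot of the entire table, including
        # words whose counts did not change in this batch.
        yield dict(totals)
```

Feeding it `[["hello spark"], ["hello streaming"]]` yields two snapshots, and the second one still contains `spark` even though that word did not appear in the second batch.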

To stop the stream, interrupt the running cell if necessary, then run the cell below

In [42]:
query.stop()
spark.stop()
