# Introduction to Apache Spark Structured Streaming

Structured Streaming is a scalable, fault-tolerant stream processing engine built on the Spark SQL engine.
It uses the existing Dataset/DataFrame APIs of Spark SQL, so *classic* batch and stream-based data processing share one development experience.

## Set up Spark and import modules

In [7]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

In [8]:
spark = SparkSession.builder.appName("StructuredStreamingBasics").getOrCreate()

24/04/26 12:55:33 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


## Practice

First, let's create a source for the data stream. We can use Netcat to send data over a network connection.
1. Install netcat and netstat with `apt install netcat net-tools`.
2. Check that no other application is running on port 9999 by running `netstat -tulpn | grep 9999`. No output means nothing is listening on that port.
3. In a terminal, run `nc -lk 9999` to start the netcat server on port 9999, then type in whatever text you choose.


*For the curious, an explanation of the netcat flags:*
>   * The `-l` flag tells `nc` to listen for an incoming connection rather than initiate a connection to a remote host.
>   * The `-k` flag forces `nc` to keep listening for another connection after its current connection is completed.
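
The port check from step 2 can also be done from Python, which is handy inside a notebook. The following is a minimal sketch, not part of the original notebook: the helper name `port_in_use` and its `timeout` parameter are assumptions made for illustration. It only detects TCP listeners, whereas `netstat -tulpn` also lists UDP sockets.

```python
import socket


def port_in_use(port: int, host: str = "localhost", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper for illustration; it is not part of Spark or netcat.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0


# Example: port_in_use(9999) should be False before `nc -lk 9999` is started
# and True once netcat is listening.
```

If the function returns `True` before netcat is started, another process already holds the port and a different port should be chosen for both `nc` and the Spark socket source.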

2. Create a DataFrame representing the stream of input lines from the connection to localhost:9999

In [9]:
lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()

24/04/26 12:57:38 WARN TextSocketSourceProvider: The socket source should not be used for production applications! It does not support recovery.


3. Create another DataFrame that counts words by splitting each line

In [10]:
wordCount = lines.select(explode(split(lines.value, " ")).alias("word")).groupBy("word").count()
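
To make the transformation above easier to follow, here is a plain-Python analogue that runs without Spark. It is only a sketch of the logic: `split` turns each line into a list of words, `explode` produces one row per word, and `groupBy("word").count()` tallies them. The function name `word_counts` is an assumption made for illustration.

```python
from collections import Counter


def word_counts(lines):
    """Plain-Python analogue of explode(split(value, " ")) + groupBy("word").count()."""
    counts = Counter()
    for line in lines:
        # Split on a single space, like split(lines.value, " ") in the cell above;
        # updating the counter with the list plays the role of explode + count.
        counts.update(line.split(" "))
    return dict(counts)


print(word_counts(["hello hello", "hello"]))  # {'hello': 3}
```

As in the Spark version, splitting on a single space means that two consecutive spaces produce an empty-string "word".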

4. Start the query that prints the running counts to the console

In [11]:
query = wordCount.writeStream.outputMode("complete").queryName("Streaming Introduction").format("console").start()

24/04/26 12:58:37 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-8eb81266-4cd7-48b3-8df0-ba46261c15e6. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
24/04/26 12:58:37 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.


                                                                                

-------------------------------------------
Batch: 0
-------------------------------------------
+----+-----+
|word|count|
+----+-----+
+----+-----+



                                                                                

-------------------------------------------
Batch: 1
-------------------------------------------
+-----+-----+
| word|count|
+-----+-----+
|hello|    3|
+-----+-----+



                                                                                

-------------------------------------------
Batch: 2
-------------------------------------------
+-----+-----+
| word|count|
+-----+-----+
|dobry|    1|
|hello|    3|
|dzien|    1|
+-----+-----+



24/04/26 13:05:38 WARN TextSocketMicroBatchStream: Stream closed by localhost:9999
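
The batches above show what the `complete` output mode does: after every micro-batch the whole result table is printed, including words that arrived in earlier batches. The sketch below imitates that behaviour in plain Python. The function name `run_complete_mode` and the input lines are assumptions chosen to reproduce counts like those shown, not the text that was actually typed into netcat.

```python
from collections import Counter


def run_complete_mode(batches):
    """Yield (batch_id, full running count table) after every micro-batch."""
    state = Counter()  # stands in for the aggregation state Spark keeps between batches
    for batch_id, lines in enumerate(batches):
        for line in lines:
            state.update(line.split(" "))
        # "complete" mode: emit the entire table, not only the rows that changed
        yield batch_id, dict(state)


# Hypothetical input: an empty first batch, then two lines typed into netcat
batches = [[], ["hello hello hello"], ["dobry dzien"]]
for batch_id, table in run_complete_mode(batches):
    print(f"Batch: {batch_id}", table)
```

Because the state persists across batches, `hello` still appears with a count of 3 in the last batch even though no new `hello` arrived.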


To stop the stream, run the cell below.

In [12]:
query.stop()
spark.stop()

24/04/26 13:06:38 WARN StateStore: Error running maintenance thread
java.lang.IllegalStateException: SparkEnv not active, cannot do maintenance on StateStores
	at org.apache.spark.sql.execution.streaming.state.StateStore$.doMaintenance(StateStore.scala:632)
	at org.apache.spark.sql.execution.streaming.state.StateStore$.$anonfun$startMaintenanceIfNeeded$1(StateStore.scala:610)
	at org.apache.spark.sql.execution.streaming.state.StateStore$MaintenanceTask$$anon$1.run(StateStore.scala:453)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
	at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.
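
When several streaming queries are running, the same shutdown order (queries first, then the session) can be wrapped in a small helper. This is a minimal sketch: the name `stop_all_streams` is an assumption, while `spark.streams.active`, `StreamingQuery.stop()` and `SparkSession.stop()` are existing PySpark APIs.

```python
def stop_all_streams(spark):
    """Stop every active streaming query, then the Spark session itself.

    Hypothetical helper; `spark` is expected to be a SparkSession.
    """
    # spark.streams.active lists the streaming queries that are still running
    for running_query in spark.streams.active:
        running_query.stop()
    # Stop the session only after all queries have been stopped
    spark.stop()
```

Calling `stop_all_streams(spark)` in place of the two lines in the last cell would stop `query` along with any other query started from the same session.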