# Spark Streaming

Structured Streaming was introduced in Spark 2 and is the current API for stream processing. It is built on DataFrames and Datasets, unlike the earlier DStream API, which is built on RDDs.
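The streaming cell in this notebook turns raw Apache access log lines into columns with `regexp_extract`. As a quick sanity check outside Spark, the sketch below applies the same patterns to a single line using plain Scala regexes. The sample line and the `extract` helper are assumptions made up for illustration; they are not part of the notebook or its data. The helper only approximates `regexp_extract` by returning the requested group of the first match, or an empty string when nothing matches.

```scala
import scala.util.matching.Regex

// Same patterns as in the streaming cell, compiled as plain Scala regexes
val contentSizeExp: Regex = "\\s(\\d+)$".r
val statusExp: Regex = "\\s(\\d{3})\\s".r
val generalExp: Regex = "\"(\\S+)\\s(\\S+)\\s*(\\S*)\"".r
val timeExp: Regex = "\\[(\\d{2}/\\w{3}/\\d{4}:\\d{2}:\\d{2}:\\d{2} -\\d{4})]".r
val hostExp: Regex = "(^\\S+\\.[\\S+\\.]+\\S+)\\s".r

// Hypothetical helper approximating regexp_extract:
// the given group of the first match, or "" if there is no match
def extract(line: String, pattern: Regex, group: Int): String =
  pattern.findFirstMatchIn(line)
    .flatMap(m => Option(m.group(group)))
    .getOrElse("")

// Hypothetical log line in Apache access log format (not taken from data/logs)
val sampleLine =
  "66.249.75.159 - - [29/Nov/1995:03:50:05 -0400] \"GET /robots.txt HTTP/1.1\" 200 55"

println("host         = " + extract(sampleLine, hostExp, 1))
println("timestamp    = " + extract(sampleLine, timeExp, 1))
println("method       = " + extract(sampleLine, generalExp, 1))
println("endpoint     = " + extract(sampleLine, generalExp, 2))
println("protocol     = " + extract(sampleLine, generalExp, 3))
println("status       = " + extract(sampleLine, statusExp, 1))
println("content_size = " + extract(sampleLine, contentSizeExp, 1))
```

Note that `timeExp` only accepts a timezone offset written with a minus sign, so a line stamped `+0000` would yield an empty timestamp. That does not affect the status counts, which depend only on `statusExp`.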

In [None]:
// Column functions used below
import org.apache.spark.sql.functions.{col, regexp_extract}

// Streaming source that monitors the data/logs directory for new text files
val accessLines = spark.readStream.text("data/logs")

// Regular expressions to extract pieces of Apache access log lines
val contentSizeExp = "\\s(\\d+)$"
val statusExp = "\\s(\\d{3})\\s"
val generalExp = "\"(\\S+)\\s(\\S+)\\s*(\\S*)\""
val timeExp = "\\[(\\d{2}/\\w{3}/\\d{4}:\\d{2}:\\d{2}:\\d{2} -\\d{4})]"
val hostExp = "(^\\S+\\.[\\S+\\.]+\\S+)\\s"

// Apply these regular expressions to create structure from the unstructured text
val logsDF = accessLines.select(
  regexp_extract(col("value"), hostExp, 1).alias("host"),
  regexp_extract(col("value"), timeExp, 1).alias("timestamp"),
  regexp_extract(col("value"), generalExp, 1).alias("method"),
  regexp_extract(col("value"), generalExp, 2).alias("endpoint"),
  regexp_extract(col("value"), generalExp, 3).alias("protocol"),
  regexp_extract(col("value"), statusExp, 1).cast("Integer").alias("status"),
  regexp_extract(col("value"), contentSizeExp, 1).cast("Integer").alias("content_size"))

// Keep a running count of status codes
val statusCountsDF = logsDF.groupBy("status").count()

// Display the stream to the console
val query = statusCountsDF.writeStream.outputMode("complete").format("console").queryName("counts").start()

// Block until the query is stopped or fails
query.awaitTermination()


Intitializing Scala interpreter ...

Spark Web UI available at http://ec67bae28344:4040
SparkContext available as 'sc' (version = 3.1.1, master = local[*], app id = local-1626909425420)
SparkSession available as 'spark'


-------------------------------------------
Batch: 0
-------------------------------------------
+------+-----+
|status|count|
+------+-----+
|   500|10714|
|   301|  271|
|   400|    2|
|   404|   26|
|   200|64971|
|   304|   92|
|   302|    2|
|   405|    1|
+------+-----+

-------------------------------------------
Batch: 1
-------------------------------------------
+------+------+
|status| count|
+------+------+
|   500| 21428|
|   301|   542|
|   400|     4|
|   404|    52|
|   200|129942|
|   304|   184|
|   302|     4|
|   405|     2|
+------+------+

-------------------------------------------
Batch: 2
-------------------------------------------
+------+------+
|status| count|
+------+------+
|   500| 32142|
|   301|   813|
|   400|     6|
|   404|    78|
|   200|194913|
|   304|   276|
|   302|     6|
|   405|     3|
+------+------+

-------------------------------------------
Batch: 3
-------------------------------------------
+------+------+
|status| count|
+------+-----