# Structured Streaming

Here is the schema of the bookstore dataset used in this notebook:

![bookstore dataset schema](../Includes/images/image1.png)

In [0]:
%run ../Includes/Copy-Datasets

## Reading a Stream

`spark.readStream` allows a Delta table to be queried as a streaming source.

In [0]:
(spark.readStream
  .table("books")
  .createOrReplaceTempView("books_streaming_tmp_vw")
)

The temporary view created this way is a streaming temporary view, which allows most transformations to be applied in SQL the same way as with static data.
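
One way to confirm that the view really is a streaming source is to load it back and inspect the DataFrame's `isStreaming` property. The helper below is a minimal sketch: the name `is_streaming_view` is made up for illustration, and `spark` is assumed to be the notebook's active SparkSession.

```python
def is_streaming_view(spark, view_name):
    """Return True if the named table or view loads as a streaming DataFrame.

    `spark` is assumed to be the active SparkSession of the notebook.
    """
    # spark.table() preserves the streaming nature of a streaming temporary
    # view, and DataFrame.isStreaming reports it as a boolean.
    return spark.table(view_name).isStreaming


# Usage inside the notebook (requires the view created above):
# is_streaming_view(spark, "books_streaming_tmp_vw")
```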

## Displaying Streaming Data

In [0]:
%sql
SELECT * FROM books_streaming_tmp_vw

Once the query is executed, the streaming result is displayed, and the query keeps running while it waits for new data. Generally speaking, a streaming result is not displayed unless someone is actively monitoring the output.

## Applying Transformations

Adding aggregation to the streaming temporary view:

In [0]:
%sql
SELECT author, count(book_id) AS total_books
FROM books_streaming_tmp_vw
GROUP BY author

Because a streaming temporary view is being queried, this becomes a streaming query that runs indefinitely rather than completing after retrieving a single set of results. In this case, the output is an aggregation of the input as seen by the stream so far. None of these records are persisted anywhere at this point.
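
For comparison, the same aggregation can be expressed with the DataFrame API instead of SQL. This is a sketch, not the notebook's own code: the function name `count_books_per_author` is hypothetical, and the dict form of `agg()` is used only to keep the snippet free of extra imports.

```python
def count_books_per_author(books_df):
    """Count non-null book_id values per author, mirroring the SQL query."""
    return (books_df
            .groupBy("author")
            # The dict form of agg() names the result column "count(book_id)".
            .agg({"book_id": "count"})
            .withColumnRenamed("count(book_id)", "total_books"))


# Usage inside the notebook:
# count_books_per_author(spark.readStream.table("books"))
```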

## Unsupported Operations

When working with streaming data, some operations are not supported:
* Sorting
* Deduplication

In [0]:
%sql
SELECT *
FROM books_streaming_tmp_vw
ORDER BY author

However, some advanced methods can be used to achieve such operations:
* Windowing
* Watermarking
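
As a rough illustration of the watermarking approach, the sketch below deduplicates a streaming DataFrame within a bounded time window. It rests on assumptions that go beyond this notebook: the `books` table has no timestamp, so the `event_time` column, the 10-minute delay, and the function name `deduplicate_stream` are all hypothetical.

```python
def deduplicate_stream(stream_df, key_cols, event_time_col="event_time",
                       delay="10 minutes"):
    """Drop duplicate records from a streaming DataFrame using a watermark.

    `event_time_col` is a hypothetical timestamp column; the bookstore
    dataset in this notebook does not provide one.
    """
    return (stream_df
            # The watermark bounds how long Spark waits for late records.
            .withWatermark(event_time_col, delay)
            # Including the event-time column in the subset lets Spark
            # discard old deduplication state once the watermark passes it.
            .dropDuplicates(list(key_cols) + [event_time_col]))


# Usage with a hypothetical streaming DataFrame `events_df`:
# deduplicate_stream(events_df, ["book_id"])
```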

To persist incremental results, the logic needs to be passed back to the PySpark DataFrame API.

## Persisting Streaming Data

Creating a new streaming temporary view from the result of a query against a streaming temporary view:

In [0]:
%sql
CREATE OR REPLACE TEMP VIEW author_counts_tmp_vw AS (
  SELECT author, count(book_id) AS total_books
  FROM books_streaming_tmp_vw
  GROUP BY author
)

In the PySpark DataFrame API, `spark.table()` can be used to load data from a streaming temporary view back into a DataFrame.

In [0]:
# Use the DataFrame writeStream method to persist the result of a streaming query to durable storage
(spark.table("author_counts_tmp_vw")
    .writeStream
    .trigger(processingTime="4 seconds")
    .outputMode("complete")
    .option("checkpointLocation", "dbfs:/mnt/demo/author_counts_checkpoint")
    .toTable("author_counts")
)

Note: Spark always loads **streaming views** as **streaming DataFrames**, and **static views** as **static DataFrames**. Therefore, incremental processing must be defined from the very beginning, in the read logic, to support incremental writing later.

For streaming aggregations such as this one, the **complete** output mode has to be used, so that the table is overwritten with the newly calculated result on every trigger.
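
To make the effect of complete mode concrete, the following plain-Python sketch mimics it without Spark. It is only a mental model under simplifying assumptions, not how Spark implements the mode: the function name `run_complete_mode` and the two sample micro-batches are made up for illustration.

```python
from collections import Counter


def run_complete_mode(micro_batches):
    """Simulate complete output mode for a per-author book count."""
    state = Counter()    # running aggregation state kept across micro-batches
    target_table = {}    # stands in for the author_counts table
    for batch in micro_batches:
        state.update(batch)          # fold the new rows into the running counts
        target_table = dict(state)   # complete mode: rewrite the whole result
    return target_table


# Two illustrative micro-batches, each a list of author names.
batches = [
    ["Mark W. Spong", "Chris Bernhardt"],
    ["Mark W. Spong"],
]
result = run_complete_mode(batches)
print(result)
```

After every micro-batch the target holds the full, recalculated aggregate rather than only the rows that changed, which is why the table is overwritten each time.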

While the streaming query is running, its dashboard shows the data being processed. Once the data has been processed, the target table can be queried.
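
Besides the dashboard, progress can also be inspected programmatically. The sketch below assumes the `StreamingQuery` handle returned by `.toTable(...)` has been kept in a variable; the name `summarize_progress` and the shape of the returned dictionary are illustrative choices.

```python
def summarize_progress(query):
    """Return the batch id and input row count of the latest micro-batch.

    `query` is assumed to be the StreamingQuery returned by writeStream.
    """
    progress = query.lastProgress  # None until the first micro-batch completes
    if progress is None:
        return None
    return {
        "batch_id": progress["batchId"],
        "input_rows": progress["numInputRows"],
    }


# Usage, assuming `query = (... .toTable("author_counts"))` was captured:
# summarize_progress(query)
```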

In [0]:
%sql
SELECT * FROM author_counts

Data has been written to the target table (`author_counts`), and each author currently has only one book. This is not a streaming query, because the table is queried directly.

## Adding New Data

Let's add some data to the `books` table while the streaming query above is still active, to see whether the new data arrives in the stream.

In [0]:
%sql
INSERT INTO books
values("B19", "Introduction to Modeling and Simulation", "Mark W. Spong", "Computer Science", 25),
        ("B20", "Robot Modeling and Control", "Mark W. Spong", "Computer Science", 30),
        ("B21", "Turing's Vision: The Birth of Computer Science", "Chris Bernhardt", "Computer Science", 35)

The new data has successfully arrived, as shown on the dashboard of the streaming query:

![](Screenshot 2025-05-28 134006.png)

It can be double-checked by querying the table again:

In [0]:
%sql
SELECT * FROM author_counts

Now some authors have more than 1 book. 

Note: remember to interrupt active streams in the notebook; otherwise the stream keeps running and prevents the cluster from auto-terminating.
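
Active streams can also be stopped from code rather than through the notebook UI. A minimal sketch, assuming `spark` is the notebook's SparkSession; the helper name `stop_all_streams` is made up for illustration.

```python
def stop_all_streams(spark):
    """Stop every active streaming query and return the ids of those stopped."""
    stopped = []
    for query in spark.streams.active:
        query.stop()              # request the query to stop
        query.awaitTermination()  # wait until it has actually shut down
        stopped.append(query.id)
    return stopped


# Usage inside the notebook:
# stop_all_streams(spark)
```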

## Streaming in Batch Mode

Adding new data:

In [0]:
%sql
INSERT INTO books
values("B16", "Hands-On Deep Learning Algorithms with Python", "Sudharsan Ravichandiran", "Computer Science", 25),
      ("B17", "Neural Network Methods in Natural Language Processing", "Yoav Goldberg", "Computer Science", 30),
      ("B18", "Understanding digital signal processing", "Richard Lyons", "Computer Science", 35)

In the following scenario, the trigger method is changed to turn the query from an always-on query triggered every 4 seconds into a triggered incremental batch.

In [0]:
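(spark.table("author_counts_tmp_vw")
      .writeStream
      .trigger(availableNow=True)  # process all available data, then stop
      .outputMode("complete")
      .option("checkpointLocation", "dbfs:/mnt/demo/author_counts_checkpoint")
      .toTable("author_counts")    # same sink method as in the earlier cell
      .awaitTermination()          # block until the batch has finished
)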

With the `availableNow` trigger option, the query processes all newly available data and stops on its own after execution. The `awaitTermination` method blocks the execution of any other cell in this notebook until the incremental batch write has finished.
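
The two write patterns used in this notebook differ only in their trigger, so they can be folded into one helper. The sketch below is one possible arrangement, not the notebook's own code: the function name `write_author_counts` and the `run_once` flag are hypothetical.

```python
def write_author_counts(streaming_df, checkpoint_path, run_once=False):
    """Write a streaming aggregation to the author_counts table.

    With run_once=True the query drains the available data and stops;
    otherwise it keeps running and fires every 4 seconds.
    """
    writer = (streaming_df.writeStream
              .outputMode("complete")
              .option("checkpointLocation", checkpoint_path))
    if run_once:
        writer = writer.trigger(availableNow=True)
    else:
        writer = writer.trigger(processingTime="4 seconds")
    query = writer.toTable("author_counts")
    if run_once:
        query.awaitTermination()  # block until the incremental batch is done
    return query


# Usage inside the notebook:
# write_author_counts(spark.table("author_counts_tmp_vw"),
#                     "dbfs:/mnt/demo/author_counts_checkpoint",
#                     run_once=True)
```

Reusing the same checkpoint location in both modes is what lets the batch run pick up only the data that arrived since the previous run.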

In [0]:
%sql
SELECT * FROM author_counts

The batch run has worked: the table now contains 3 more authors.