When we work with event time, it’s just another column in our dataset, and that’s really all we need to concern ourselves with!

In [0]:
spark.conf.set("spark.sql.shuffle.partitions", 5)
static = spark.read.json("/FileStore/tables/activity-data")
streaming = spark\
  .readStream\
  .schema(static.schema)\
  .option("maxFilesPerTrigger", 10)\
  .json("/FileStore/tables/activity-data")


In [0]:
streaming.printSchema()

The Creation_Time column currently holds Unix time in nanoseconds (represented as a long), so we first convert it into a proper timestamp. We do this just as we would in batch operations: there’s no special API or DSL. We simply use columns and an aggregation, just like we might in batch, and we’re working with event time.

In [0]:
withEventTime = streaming.selectExpr(
  "*",
  "cast(cast(Creation_Time as double)/1000000000 as timestamp) as event_time")

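The conversion above divides the nanosecond value by 1,000,000,000 to get seconds since the Unix epoch and then treats that number as a timestamp. As a minimal sketch of the same arithmetic outside Spark, the function below uses only the Python standard library; the function name and the sample value are hypothetical and not taken from the dataset.

```python
from datetime import datetime, timedelta, timezone

NANOS_PER_SECOND = 1_000_000_000


def creation_time_to_event_time(creation_time_ns: int) -> datetime:
    """Convert Unix time in nanoseconds to a timezone-aware UTC datetime."""
    # Split into whole seconds and leftover nanoseconds to avoid float rounding.
    seconds, nanos = divmod(creation_time_ns, NANOS_PER_SECOND)
    # datetime has microsecond resolution, so sub-microsecond digits are dropped.
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(
        microseconds=nanos // 1000
    )


# Hypothetical sample value, chosen only for illustration.
sample_ns = 1_424_696_400_000_000_000
print(creation_time_to_event_time(sample_ns).isoformat())
```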

We’re performing an aggregation of keys over a window of time.
- We update the result table (depending on the output mode) every time a trigger runs, operating on the data received since the last trigger.
- In the case of our actual dataset (and Figure 22-2), we’ll do so in 10-minute windows without any overlap between them (each event falls into exactly one window).
- This updates in real time as well: if new events were being added upstream to our system, Structured Streaming would update those counts accordingly.
- We use the complete output mode, so Spark will output the entire result table regardless of whether we’ve seen the entire dataset:

In [0]:
from pyspark.sql.functions import window, col
withEventTime.groupBy(window(col("event_time"), "10 minutes")).count()\
  .writeStream\
  .queryName("pyevents_per_window")\
  .format("memory")\
  .outputMode("complete")\
  .start()


In [0]:
%sql
SELECT * FROM pyevents_per_window

window,count
"List(2015-02-24T11:50:00.000+0000, 2015-02-24T12:00:00.000+0000)",150773
"List(2015-02-24T13:00:00.000+0000, 2015-02-24T13:10:00.000+0000)",133323
"List(2015-02-23T12:30:00.000+0000, 2015-02-23T12:40:00.000+0000)",100853
"List(2015-02-23T10:20:00.000+0000, 2015-02-23T10:30:00.000+0000)",99178
"List(2015-02-24T12:30:00.000+0000, 2015-02-24T12:40:00.000+0000)",125679
"List(2015-02-24T13:10:00.000+0000, 2015-02-24T13:20:00.000+0000)",105494
"List(2015-02-23T10:30:00.000+0000, 2015-02-23T10:40:00.000+0000)",100443
"List(2015-02-23T10:40:00.000+0000, 2015-02-23T10:50:00.000+0000)",88681
"List(2015-02-23T13:20:00.000+0000, 2015-02-23T13:30:00.000+0000)",106075
"List(2015-02-22T00:40:00.000+0000, 2015-02-22T00:50:00.000+0000)",35

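To make the "each event falls into exactly one window" behavior concrete, here is a minimal sketch of how a timestamp can be mapped to its 10-minute window. It assumes windows are aligned to the Unix epoch with the start inclusive and the end exclusive; the function name is hypothetical, and this is plain Python for illustration, not Spark’s implementation.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def tumbling_window(event_time: datetime, length: timedelta):
    """Return the (start, end) of the single fixed window containing event_time."""
    # Time elapsed since the containing window opened (assumes epoch alignment).
    offset = (event_time - EPOCH) % length
    start = event_time - offset
    return start, start + length


# Hypothetical event time, chosen only for illustration.
event = datetime(2015, 2, 24, 11, 57, 13, tzinfo=timezone.utc)
start, end = tumbling_window(event, timedelta(minutes=10))
print(start.isoformat(), end.isoformat())
```

Because the windows do not overlap, the function returns exactly one window for any timestamp.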

Notice how window is actually a struct (a complex type).
- Using this, we can query the struct for the start and end times of a particular window.
- Importantly, we can also perform an aggregation on multiple columns, including the event time column.
- Just as we saw in the previous chapter, we can even perform these aggregations using methods like cube.
- The multi-key aggregation below is not repeated for the later examples, but it applies to any window-style aggregation (or stateful computation) we would like:

In [0]:
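from pyspark.sql.functions import window, col
# This reuses the query name from the previous example. Stop that query first,
# because Spark will not start a second active query with the same name.
withEventTime.groupBy(window(col("event_time"), "10 minutes"), "User").count()\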
  .writeStream\
  .queryName("pyevents_per_window")\
  .format("memory")\
  .outputMode("complete")\
  .start()


In [0]:
%sql
SELECT * FROM pyevents_per_window

window,User,count
"List(2015-02-24T12:20:00.000+0000, 2015-02-24T12:30:00.000+0000)",f,50148
"List(2015-02-24T13:00:00.000+0000, 2015-02-24T13:10:00.000+0000)",f,12515
"List(2015-02-24T14:50:00.000+0000, 2015-02-24T15:00:00.000+0000)",e,47304
"List(2015-02-23T14:30:00.000+0000, 2015-02-23T14:40:00.000+0000)",h,35493
"List(2015-02-24T14:10:00.000+0000, 2015-02-24T14:20:00.000+0000)",e,25210
"List(2015-02-24T13:00:00.000+0000, 2015-02-24T13:10:00.000+0000)",d,37593
"List(2015-02-24T14:20:00.000+0000, 2015-02-24T14:30:00.000+0000)",b,38560
"List(2015-02-23T12:30:00.000+0000, 2015-02-23T12:40:00.000+0000)",c,37851
"List(2015-02-23T10:20:00.000+0000, 2015-02-23T10:30:00.000+0000)",g,37193
"List(2015-02-24T13:30:00.000+0000, 2015-02-24T13:40:00.000+0000)",b,30438

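The grouping by window and user shown above can be sketched in plain Python as a counter keyed by the pair (window start, user). This assumes 10-minute windows aligned to the Unix epoch; the function names and the sample events are hypothetical and exist only for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
TEN_MINUTES = timedelta(minutes=10)


def window_start(event_time, length=TEN_MINUTES):
    """Start of the fixed window containing event_time (assumes epoch alignment)."""
    return event_time - (event_time - EPOCH) % length


def count_per_window_and_user(events):
    """events: iterable of (event_time, user) pairs.

    Returns a Counter keyed by (window_start, user), mirroring a group-by
    on two keys followed by a count.
    """
    return Counter((window_start(ts), user) for ts, user in events)


# Made-up events, not taken from the dataset.
utc = timezone.utc
events = [
    (datetime(2015, 2, 24, 11, 51, tzinfo=utc), "a"),
    (datetime(2015, 2, 24, 11, 58, tzinfo=utc), "a"),
    (datetime(2015, 2, 24, 11, 59, tzinfo=utc), "b"),
    (datetime(2015, 2, 24, 12, 3, tzinfo=utc), "a"),
]
counts = count_per_window_and_user(events)
for (start, user), n in sorted(counts.items()):
    print(start.isoformat(), user, n)
```

In Spark itself, the fields of the window struct can be selected with dotted names such as `window.start` and `window.end`.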

With a **sliding window**, we decouple the length of the window from how often a new window starts. For example, we might look at an hour increment but want to get the state every 10 minutes. This means that we will update the values over time and will include the last hour of data. In this example, we have 10-minute windows, starting every five minutes. Therefore, each event will fall into two different windows. You can tweak this further according to your needs:

In [0]:
from pyspark.sql.functions import window, col
withEventTime.groupBy(window(col("event_time"), "10 minutes", "5 minutes"))\
  .count()\
  .writeStream\
  .queryName("pyevents_per_window")\
  .format("memory")\
  .outputMode("complete")\
  .start()


In [0]:
%sql
SELECT * FROM pyevents_per_window

window,count
"List(2015-02-23T14:15:00.000+0000, 2015-02-23T14:25:00.000+0000)",40572
"List(2015-02-24T11:50:00.000+0000, 2015-02-24T12:00:00.000+0000)",56462
"List(2015-02-24T13:00:00.000+0000, 2015-02-24T13:10:00.000+0000)",50108
"List(2015-02-22T00:35:00.000+0000, 2015-02-22T00:45:00.000+0000)",12
"List(2015-02-23T12:30:00.000+0000, 2015-02-23T12:40:00.000+0000)",37851
"List(2015-02-23T10:20:00.000+0000, 2015-02-23T10:30:00.000+0000)",37193
"List(2015-02-23T13:25:00.000+0000, 2015-02-23T13:35:00.000+0000)",34273
"List(2015-02-24T14:25:00.000+0000, 2015-02-24T14:35:00.000+0000)",76614
"List(2015-02-23T12:55:00.000+0000, 2015-02-23T13:05:00.000+0000)",42655
"List(2015-02-22T00:40:00.000+0000, 2015-02-22T00:50:00.000+0000)",12

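To see why each event lands in two windows here, the sketch below lists every sliding window that contains a given timestamp. It assumes window starts are aligned to the Unix epoch in multiples of the slide, with the start inclusive and the end exclusive; the function name and the sample timestamp are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def sliding_windows(event_time: datetime, length: timedelta, slide: timedelta):
    """Return all (start, end) windows of the given length and slide containing event_time."""
    # Latest window start at or before the event (assumes epoch alignment).
    start = event_time - (event_time - EPOCH) % slide
    windows = []
    # Walk backwards one slide at a time while the window still covers the event.
    while start + length > event_time:
        windows.append((start, start + length))
        start -= slide
    return sorted(windows)


# Hypothetical event time, chosen only for illustration.
event = datetime(2015, 2, 23, 14, 22, 30, tzinfo=timezone.utc)
for start, end in sliding_windows(event, timedelta(minutes=10), timedelta(minutes=5)):
    print(start.isoformat(), end.isoformat())
```

With a 10-minute length and a 5-minute slide, the function returns two windows per event; with a one-hour length and a 10-minute slide, it returns six.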

Suppose we know that we typically see data arrive downstream within minutes of being produced, but we have seen events delayed by up to five hours after they occur (perhaps the user lost cell phone connectivity). We handle this by specifying a **watermark**, which states how late we expect data to arrive. The example here uses a 30-minute watermark, so Structured Streaming will wait until 30 minutes after the final timestamp of each 10-minute sliding window before it finalizes the result of that window. Tolerating the full five-hour delay would mean raising the threshold to "5 hours".

In [0]:
from pyspark.sql.functions import window, col
withEventTime\
  .withWatermark("event_time", "30 minutes")\
  .groupBy(window(col("event_time"), "10 minutes", "5 minutes"))\
  .count()\
  .writeStream\
  .queryName("pyevents_per_window")\
  .format("memory")\
  .outputMode("complete")\
  .start()


We can query our table and see the intermediate results because we’re using **complete** mode, so they’ll be updated over time. In **append** mode, this information won’t be output until the window closes.

In [0]:
%sql
SELECT * FROM pyevents_per_window

window,count
"List(2015-02-23T14:15:00.000+0000, 2015-02-23T14:25:00.000+0000)",40572
"List(2015-02-24T11:50:00.000+0000, 2015-02-24T12:00:00.000+0000)",56462
"List(2015-02-24T13:00:00.000+0000, 2015-02-24T13:10:00.000+0000)",50108
"List(2015-02-22T00:35:00.000+0000, 2015-02-22T00:45:00.000+0000)",12
"List(2015-02-23T12:30:00.000+0000, 2015-02-23T12:40:00.000+0000)",37851
"List(2015-02-23T10:20:00.000+0000, 2015-02-23T10:30:00.000+0000)",37193
"List(2015-02-23T13:25:00.000+0000, 2015-02-23T13:35:00.000+0000)",34273
"List(2015-02-24T14:25:00.000+0000, 2015-02-24T14:35:00.000+0000)",76614
"List(2015-02-23T12:55:00.000+0000, 2015-02-23T13:05:00.000+0000)",42655
"List(2015-02-22T00:40:00.000+0000, 2015-02-22T00:50:00.000+0000)",12

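The watermark rule can be sketched as a small model: track the largest event time seen so far, subtract the allowed delay, and treat a window as final once its end is at or before that value. This is a simplified illustration under stated assumptions, not Spark’s implementation. The class and method names are hypothetical, the watermark is updated per event here rather than per trigger, and the exact boundary condition is an assumption.

```python
from datetime import datetime, timedelta, timezone


class WatermarkTracker:
    """Simplified model of an event-time watermark (hypothetical helper)."""

    def __init__(self, delay: timedelta):
        self.delay = delay
        self.max_event_time = None

    def observe(self, event_time: datetime) -> None:
        # Late events never move the maximum backwards.
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time

    @property
    def watermark(self):
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.delay

    def is_window_final(self, window_end: datetime) -> bool:
        # Assumption: a window is final once the watermark reaches its end.
        wm = self.watermark
        return wm is not None and window_end <= wm


utc = timezone.utc
tracker = WatermarkTracker(delay=timedelta(minutes=30))
window_end = datetime(2015, 2, 24, 12, 0, tzinfo=utc)

tracker.observe(datetime(2015, 2, 24, 12, 5, tzinfo=utc))
print(tracker.is_window_final(window_end))  # watermark is 11:35, window still open

tracker.observe(datetime(2015, 2, 24, 12, 31, tzinfo=utc))
print(tracker.is_window_final(window_end))  # watermark is 12:01, window is final
```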

In [0]:
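# Drop duplicate events per user, then count what remains for each user.
# event_time is part of the deduplication key, alongside the watermark on that column.
withEventTime\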
  .withWatermark("event_time", "5 seconds")\
  .dropDuplicates(["User", "event_time"])\
  .groupBy("User")\
  .count()\
  .writeStream\
  .queryName("pydeduplicated")\
  .format("memory")\
  .outputMode("complete")\
  .start()


In [0]:
%sql
SELECT * FROM pydeduplicated

User,count
a,80850
b,91230
c,77150
g,91679
h,77330
e,96897
f,92060
d,81240
i,92550
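
The deduplicate-then-count step above can be sketched in plain Python: an event is kept only the first time its (User, event_time) pair is seen, and the survivors are counted per user. This is a minimal illustration with a hypothetical function name and made-up events. Unlike the streaming query, it remembers every key forever instead of using a watermark to bound how long keys are retained.

```python
from collections import Counter


def count_deduplicated(events):
    """events: iterable of (user, event_time) pairs.

    Returns a Counter of events per user after dropping repeats of the
    same (user, event_time) pair.
    """
    seen = set()
    counts = Counter()
    for user, event_time in events:
        key = (user, event_time)
        if key in seen:
            continue  # duplicate delivery of the same event, ignore it
        seen.add(key)
        counts[user] += 1
    return counts


# Made-up events; the second ("a", 100) simulates a redelivered message.
events = [("a", 100), ("a", 100), ("a", 101), ("b", 100)]
print(count_deduplicated(events))
```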
