
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Streaming Query

##### Objectives
1. Build streaming DataFrames
1. Display streaming query results
1. Write streaming query results
1. Monitor streaming query

##### Classes
- <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.streaming.DataStreamReader.html#pyspark.sql.streaming.DataStreamReader" target="_blank">DataStreamReader</a>
- <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.streaming.DataStreamWriter.html#pyspark.sql.streaming.DataStreamWriter" target="_blank">DataStreamWriter</a>
- <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.streaming.StreamingQuery.html#pyspark.sql.streaming.StreamingQuery" target="_blank">StreamingQuery</a>

In [0]:
%run ./Includes/Classroom-Setup

### Build streaming DataFrames

Obtain an initial streaming DataFrame from a Parquet-format file source.

In [0]:
schema = "device STRING, ecommerce STRUCT<purchase_revenue_in_usd: DOUBLE, total_item_quantity: BIGINT, unique_items: BIGINT>, event_name STRING, event_previous_timestamp BIGINT, event_timestamp BIGINT, geo STRUCT<city: STRING, state: STRING>, items ARRAY<STRUCT<coupon: STRING, item_id: STRING, item_name: STRING, item_revenue_in_usd: DOUBLE, price_in_usd: DOUBLE, quantity: BIGINT>>, traffic_source STRING, user_first_touch_timestamp BIGINT, user_id STRING"

df = (spark
      .readStream
      .schema(schema)
      .option("maxFilesPerTrigger", 1)
      .parquet(eventsPath)
     )
df.isStreaming

In [0]:
display(df)

device,ecommerce,event_name,event_previous_timestamp,event_timestamp,geo,items,traffic_source,user_first_touch_timestamp,user_id
Windows,"List(null, null, null)",add_item,1593604013418531.0,1593604166967411,"List(Aurora, CO)","List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",google,1593603932991227,UA000000106484072
macOS,"List(null, null, null)",guest,1593792667777632.0,1593793102376026,"List(Atlanta, GA)","List(List(NEWBED10, M_STAN_Q, Standard Queen Mattress, 940.5, 1045.0, 1), List(NEWBED10, P_FOAM_S, Standard Foam Pillow, 53.1, 59.0, 1), List(NEWBED10, M_PREM_Q, Premium Queen Mattress, 1615.5, 1795.0, 1))",email,1593606662775893,UA000000106492759
Linux,"List(null, null, null)",main,,1593610121403996,"List(Uvalde, TX)",List(),google,1593610121403996,UA000000106506379
Android,"List(null, null, null)",add_item,1593595457982777.0,1593595673594309,"List(Houston, TX)","List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",email,1593595457982777,UA000000106466992
Android,"List(null, null, null)",mattresses,,1593577304912675,"List(Reno, NV)",List(),instagram,1593577304912675,UA000000106458058
Windows,"List(null, null, null)",main,,1593609793654886,"List(Phoenix, AZ)",List(),google,1593609793654886,UA000000106504949
Android,"List(null, null, null)",pillows,,1593619480937685,"List(Chicago, IL)",List(),google,1593619480937685,UA000000106555879
macOS,"List(null, null, null)",add_item,1593618267281779.0,1593618623952958,"List(San Diego, CA)","List(List(null, M_STAN_F, Standard Full Mattress, 945.0, 945.0, 1))",google,1593618267281779,UA000000106548706
Windows,"List(null, null, null)",main,,1593601683208285,"List(Muskegon, MI)",List(),google,1593601683208285,UA000000106478125
Android,"List(null, null, null)",add_item,1593594736126253.0,1593596897183596,"List(Lancaster, OH)","List(List(null, M_PREM_T, Premium Twin Mattress, 2190.0, 1095.0, 2))",youtube,1593594736126253,UA000000106466114


Apply some transformations, producing new streaming DataFrames.

In [0]:
from pyspark.sql.functions import col

emailTrafficDF = (df
                  .filter(col("traffic_source") == "email")
                  .withColumn("mobile", col("device").isin(["iOS", "Android"]))
                  .select("user_id", "event_timestamp", "mobile")
                 )
emailTrafficDF.isStreaming

In [0]:
display(emailTrafficDF)

user_id,event_timestamp,mobile
UA000000106492759,1593793102376026,False
UA000000106466992,1593595673594309,True
UA000000106466758,1593596243792384,True
UA000000106507716,1593613199618903,True
UA000000106460658,1593587139594631,True
UA000000106527631,1593614494006388,True
UA000000106459980,1593654663757403,True
UA000000106532534,1593615408404063,False
UA000000106492591,1593606616384640,True
UA000000106499964,1593608599877382,False


### Write streaming query results

Take the final streaming DataFrame (our result table) and write it to a file sink in "append" mode.

In [0]:
checkpointPath = userhome + "/email_traffic/checkpoint"
outputPath = userhome + "/email_traffic/output"

devicesQuery = (emailTrafficDF
                .writeStream
                .outputMode("append")
                .format("parquet")
                .queryName("email_traffic")
                .trigger(processingTime="1 second")
                .option("checkpointLocation", checkpointPath)
                .start(outputPath)
               )

### Monitor streaming query

Use the streaming query "handle" to monitor and control it.

In [0]:
devicesQuery.id

In [0]:
devicesQuery.status

In [0]:
devicesQuery.lastProgress

In [0]:
devicesQuery.awaitTermination(5)

In [0]:
devicesQuery.stop()
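
The progress report returned by **`lastProgress`** can be condensed into a single line for quick monitoring. The sketch below is plain Python and assumes the report behaves like a dict with the keys `name`, `batchId`, `numInputRows`, and `processedRowsPerSecond`. The helper name `summarize_progress` and the sample values are hypothetical and not part of the course materials.

```python
from typing import Optional


def summarize_progress(progress: Optional[dict]) -> str:
    """Return a one-line summary of a streaming progress report."""
    # lastProgress is None until the first micro-batch has completed.
    if progress is None:
        return "no progress reported yet"
    name = progress.get("name") or "<unnamed>"
    batch_id = progress.get("batchId")
    rows = progress.get("numInputRows", 0)
    # Rate fields may be missing from a report, so default to 0.0.
    rate = progress.get("processedRowsPerSecond", 0.0)
    return f"{name}: batch {batch_id}, {rows} input rows, {rate:.1f} rows/s processed"


# Hand-written sample shaped like a progress report (illustrative values only).
sample = {
    "name": "email_traffic",
    "batchId": 3,
    "numInputRows": 120,
    "processedRowsPerSecond": 85.5,
}
print(summarize_progress(sample))
print(summarize_progress(None))
```

In a notebook, this could be called as `summarize_progress(devicesQuery.lastProgress)` while the query is still running.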

# Coupon Sales Lab
Process and append streaming data on transactions using coupons.
1. Read data stream
2. Filter for transactions with coupon codes
3. Write streaming query results to Parquet
4. Monitor streaming query
5. Stop streaming query
6. Verify the records were written in Parquet format

##### Classes
- <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.streaming.DataStreamReader.html#pyspark.sql.streaming.DataStreamReader" target="_blank">DataStreamReader</a>
- <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.streaming.DataStreamWriter.html#pyspark.sql.streaming.DataStreamWriter" target="_blank">DataStreamWriter</a>
- <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.streaming.StreamingQuery.html#pyspark.sql.streaming.StreamingQuery" target="_blank">StreamingQuery</a>

### 1. Read data stream
- Use the schema stored in **`schema`**
- Configure the stream to process 1 file per trigger
- Read from Parquet files in the source directory specified by **`salesPath`**

Assign the resulting DataFrame to **`df`**.

In [0]:
schema = "order_id BIGINT, email STRING, transaction_timestamp BIGINT, total_item_quantity BIGINT, purchase_revenue_in_usd DOUBLE, unique_items BIGINT, items ARRAY<STRUCT<coupon: STRING, item_id: STRING, item_name: STRING, item_revenue_in_usd: DOUBLE, price_in_usd: DOUBLE, quantity: BIGINT>>"

In [0]:
# ANSWER
df = (spark
      .readStream
      .schema(schema)
      .option("maxFilesPerTrigger", 1)
      .parquet(salesPath)
     )
df.isStreaming

**CHECK YOUR WORK**

In [0]:
assert df.isStreaming
assert df.columns == ["order_id", "email", "transaction_timestamp", "total_item_quantity", "purchase_revenue_in_usd", "unique_items", "items"]

### 2. Filter for transactions with coupon codes
- Explode the **`items`** field in **`df`** with the results replacing the existing **`items`** field
- Filter for records where **`items.coupon`** is not null

Assign the resulting DataFrame to **`couponSalesDF`**.

In [0]:
from pyspark.sql.functions import col, explode

couponSalesDF = (df
                 .withColumn("items", explode(col("items")))
                 .filter(col("items.coupon").isNotNull())
                )
display(couponSalesDF)

order_id,email,transaction_timestamp,total_item_quantity,purchase_revenue_in_usd,unique_items,items
277029,ariasamanda12@gmail.com,1592439049943080,2,993.6,2,"List(NEWBED10, M_STAN_Q, Standard Queen Mattress, 940.5, 1045.0, 1)"
277029,ariasamanda12@gmail.com,1592439049943080,2,993.6,2,"List(NEWBED10, P_FOAM_S, Standard Foam Pillow, 53.1, 59.0, 1)"
271875,jenniferpark@lynn.com,1592402210938700,1,1525.5,1,"List(NEWBED10, M_PREM_F, Premium Full Mattress, 1525.5, 1695.0, 1)"
287222,schneidergary@thomas.info,1592541143972493,1,985.5,1,"List(NEWBED10, M_PREM_T, Premium Twin Mattress, 985.5, 1095.0, 1)"
267814,teresa43@yahoo.com,1592334310039252,2,1071.0,1,"List(NEWBED10, M_STAN_T, Standard Twin Mattress, 1071.0, 595.0, 2)"
297157,matthew08@yahoo.com,1592611197489603,1,1525.5,1,"List(NEWBED10, M_PREM_F, Premium Full Mattress, 1525.5, 1695.0, 1)"
273698,stephenjohnson@paul.com,1592415078909439,1,940.5,1,"List(NEWBED10, M_STAN_Q, Standard Queen Mattress, 940.5, 1045.0, 1)"
272527,zferguson@daniels-taylor.org,1592407612907648,1,535.5,1,"List(NEWBED10, M_STAN_T, Standard Twin Mattress, 535.5, 595.0, 1)"
273886,jennifer8546@yahoo.com,1592416045087762,1,940.5,1,"List(NEWBED10, M_STAN_Q, Standard Queen Mattress, 940.5, 1045.0, 1)"
271162,jacob21@warren.info,1592394049326897,1,850.5,1,"List(NEWBED10, M_STAN_F, Standard Full Mattress, 850.5, 945.0, 1)"
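
To make the explode-then-filter semantics concrete, here is a plain-Python sketch of the same logic on ordinary dicts. It is not how Spark executes the query, only an illustration of the row-level effect. The helper names `explode_items` and `has_coupon` are hypothetical, and the second order is made up to show a row being filtered out.

```python
def explode_items(rows):
    """Yield one output row per element of each row's `items` list."""
    for row in rows:
        # Like Spark's explode, a null or empty list produces no output rows.
        for item in row.get("items") or []:
            yield {**row, "items": item}


def has_coupon(row):
    """Keep rows whose exploded item carries a non-null coupon."""
    return row["items"].get("coupon") is not None


# Illustrative orders; only the fields needed for the example are included.
orders = [
    {"order_id": 277029, "items": [
        {"coupon": "NEWBED10", "item_id": "M_STAN_Q"},
        {"coupon": "NEWBED10", "item_id": "P_FOAM_S"},
    ]},
    {"order_id": 1, "items": [{"coupon": None, "item_id": "M_STAN_T"}]},
]

coupon_rows = [r for r in explode_items(orders) if has_coupon(r)]
for r in coupon_rows:
    print(r["order_id"], r["items"]["item_id"])
```

Each order with several coupon items yields several rows that repeat the order-level fields, which matches the repeated `order_id` values in the output above.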


**CHECK YOUR WORK**

In [0]:
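from pyspark.sql.types import StructType

# Inspect the column type directly; the string form of a schema differs across Spark versions
itemsType = couponSalesDF.schema["items"].dataType
assert isinstance(itemsType, StructType) and "coupon" in itemsType.fieldNames(), "items column was not exploded"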

### 3. Write streaming query results to Parquet
- Configure the streaming query to write Parquet format files in "append" mode
- Set the query name to "coupon_sales"
- Set a trigger interval of 1 second
- Set the checkpoint location to **`couponsCheckpointPath`**
- Set the output path to **`couponsOutputPath`**

Start the streaming query and assign the resulting handle to **`couponSalesQuery`**.

In [0]:
# ANSWER
couponsCheckpointPath = workingDir + "/coupon-sales/checkpoint"
couponsOutputPath = workingDir + "/coupon-sales/output"

couponSalesQuery = (couponSalesDF
                    .writeStream
                    .outputMode('append')
                    .format('parquet')
                    .queryName('coupon_sales')
                    .trigger(processingTime='1 second')
                    .option("checkpointLocation", couponsCheckpointPath)
                    .start(couponsOutputPath)
)

**CHECK YOUR WORK**

In [0]:
untilStreamIsReady("coupon_sales")
assert couponSalesQuery.isActive
assert len(dbutils.fs.ls(couponsOutputPath)) > 0
assert len(dbutils.fs.ls(couponsCheckpointPath)) > 0
assert "coupon_sales" in couponSalesQuery.lastProgress["name"]
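
The **`untilStreamIsReady`** helper used above comes from the classroom setup and its implementation is not shown in this notebook. As a rough idea of how such a wait could work, the sketch below polls a query handle until it reports at least one progress update. It assumes only that the handle exposes `isActive` and `recentProgress`, as `StreamingQuery` does. The name `wait_for_first_batch` and its parameters are hypothetical, and this is not the course's implementation.

```python
import time


def wait_for_first_batch(query, timeout_s=60.0, poll_s=0.5):
    """Block until `query` reports at least one progress update.

    Returns True once progress is available, and False if the query
    stops or the timeout expires first.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        # A stopped or failed query will never report new progress.
        if not query.isActive:
            return False
        # recentProgress holds the most recent progress reports, oldest first.
        if len(query.recentProgress) > 0:
            return True
        time.sleep(poll_s)
    return False
```

With a running query, `wait_for_first_batch(couponSalesQuery)` would return once the first micro-batch has been reported, at which point the output and checkpoint directories are likely to exist.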

### 4. Monitor streaming query
- Get the ID of the streaming query and store it in **`queryID`**
- Get the status of the streaming query and store it in **`queryStatus`**

In [0]:
# ANSWER
queryID = couponSalesQuery.id

In [0]:
# ANSWER
queryStatus = couponSalesQuery.status

**CHECK YOUR WORK**

In [0]:
assert isinstance(queryID, str)
assert list(queryStatus.keys()) == ["message", "isDataAvailable", "isTriggerActive"]

### 5. Stop streaming query
- Stop the streaming query

In [0]:
# ANSWER
couponSalesQuery.stop()

**CHECK YOUR WORK**

In [0]:
assert not couponSalesQuery.isActive

### 6. Verify the records were written in Parquet format

In [0]:
# ANSWER
display(spark.read.parquet(couponsOutputPath))

order_id,email,transaction_timestamp,total_item_quantity,purchase_revenue_in_usd,unique_items,items
282611,bmurillo@hotmail.com,1592504237604072,1,940.5,1,"List(NEWBED10, M_STAN_Q, Standard Queen Mattress, 940.5, 1045.0, 1)"
283949,whardin@hotmail.com,1592510720760323,1,535.5,1,"List(NEWBED10, M_STAN_T, Standard Twin Mattress, 535.5, 595.0, 1)"
264191,maxwelltara@edwards.com,1592306255847870,2,993.6,2,"List(NEWBED10, M_STAN_Q, Standard Queen Mattress, 940.5, 1045.0, 1)"
264191,maxwelltara@edwards.com,1592306255847870,2,993.6,2,"List(NEWBED10, P_FOAM_S, Standard Foam Pillow, 53.1, 59.0, 1)"
286727,rojasjorge@yahoo.com,1592533048926949,1,535.5,1,"List(NEWBED10, M_STAN_T, Standard Twin Mattress, 535.5, 595.0, 1)"
287573,marmstrong46@hotmail.com,1592547727220317,1,53.1,1,"List(NEWBED10, P_FOAM_S, Standard Foam Pillow, 53.1, 59.0, 1)"
271597,johnsonderrick@yahoo.com,1592399808941985,5,2846.7,5,"List(NEWBED10, M_STAN_K, Standard King Mattress, 1075.5, 1195.0, 1)"
271597,johnsonderrick@yahoo.com,1592399808941985,5,2846.7,5,"List(NEWBED10, P_DOWN_K, King Down Pillow, 143.1, 159.0, 1)"
271597,johnsonderrick@yahoo.com,1592399808941985,5,2846.7,5,"List(NEWBED10, M_STAN_T, Standard Twin Mattress, 535.5, 595.0, 1)"
271597,johnsonderrick@yahoo.com,1592399808941985,5,2846.7,5,"List(NEWBED10, M_PREM_T, Premium Twin Mattress, 985.5, 1095.0, 1)"


### Classroom Cleanup
Run the cell below to clean up resources.

In [0]:
%run ./Includes/Classroom-Cleanup

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>