# Streaming Query

##### Objectives
1. Build streaming DataFrames
1. Display streaming query results
1. Write streaming query results
1. Monitor streaming query

##### Classes
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.DataStreamReader.html" target="_blank">DataStreamReader</a>
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.DataStreamWriter.html" target="_blank">DataStreamWriter</a>
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.StreamingQuery.html" target="_blank">StreamingQuery</a>

In [0]:
%run ../Includes/Classroom-Setup

### Build streaming DataFrames

Obtain an initial streaming DataFrame from a Delta-format file source.

In [0]:
df = (spark
      .readStream
      .option("maxFilesPerTrigger", 1)
      .format("delta")
      .load(DA.paths.events)
     )

df.isStreaming
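
To make the effect of **`maxFilesPerTrigger`** concrete, here is a minimal plain-Python sketch (no Spark required) of how a backlog of files is split into micro-batches when the option caps the number of new files read per trigger. The helper **`micro_batches`** and the file names are hypothetical and only illustrate the idea; this is not how Spark implements it internally.

```python
from typing import Iterator, List


def micro_batches(files: List[str], max_files_per_trigger: int) -> Iterator[List[str]]:
    """Yield successive groups of at most `max_files_per_trigger` files."""
    if max_files_per_trigger < 1:
        raise ValueError("max_files_per_trigger must be at least 1")
    for start in range(0, len(files), max_files_per_trigger):
        yield files[start:start + max_files_per_trigger]


# Hypothetical file names, for illustration only.
pending_files = ["part-0.parquet", "part-1.parquet", "part-2.parquet"]

batches = list(micro_batches(pending_files, max_files_per_trigger=1))
for batch_id, batch in enumerate(batches):
    print(batch_id, batch)
```

With a limit of 1, each trigger picks up a single file, which keeps every micro-batch small and makes the stream easy to watch in a classroom setting.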

Apply some transformations, producing new streaming DataFrames.

In [0]:
from pyspark.sql.functions import col, approx_count_distinct, count

email_traffic_df = (df
                    .filter(col("traffic_source") == "email")
                    .withColumn("mobile", col("device").isin(["iOS", "Android"]))
                    .select("user_id", "event_timestamp", "mobile")
                   )

email_traffic_df.isStreaming

### Write streaming query results

Take the final streaming DataFrame (our result table) and write it to a file sink in "append" mode.

In [0]:
checkpoint_path = f"{DA.paths.checkpoints}/email_traffic"
output_path = f"{DA.paths.working_dir}/email_traffic/output"

devices_query = (email_traffic_df
                 .writeStream
                 .outputMode("append")
                 .format("delta")
                 .queryName("email_traffic")
                 .trigger(processingTime="1 second")
                 .option("checkpointLocation", checkpoint_path)
                 .start(output_path)
                )

### Monitor streaming query

Use the streaming query "handle" to monitor and control it.

In [0]:
devices_query.id

In [0]:
devices_query.status

In [0]:
devices_query.lastProgress
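
In PySpark, **`lastProgress`** is a dictionary describing the most recent micro-batch, or **`None`** if no batch has completed yet. The sketch below shows one way to turn that dictionary into a short log line. The helper **`summarize_progress`** and the sample values are hypothetical; only the keys **`name`**, **`batchId`**, and **`numInputRows`** are taken from the progress report itself.

```python
from typing import Any, Dict, Optional


def summarize_progress(progress: Optional[Dict[str, Any]]) -> str:
    """Build a one-line summary from a streaming progress dictionary."""
    if progress is None:
        # lastProgress is None until the first micro-batch has been reported.
        return "no micro-batch has completed yet"
    return (
        f"query={progress.get('name')} "
        f"batch={progress.get('batchId')} "
        f"rows={progress.get('numInputRows')}"
    )


# Hypothetical values shaped like a progress report, for illustration only.
sample_progress = {"name": "email_traffic", "batchId": 3, "numInputRows": 1200}

print(summarize_progress(sample_progress))
print(summarize_progress(None))
```

In a notebook, you could pass **`devices_query.lastProgress`** to the same helper instead of the sample dictionary.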

In [0]:
import time
# Run for 10 more seconds
time.sleep(10) 

devices_query.stop()

In [0]:
devices_query.awaitTermination()

# Coupon Sales Lab
Process and append streaming data on transactions using coupons.
1. Read data stream
2. Filter for transactions with coupon codes
3. Write streaming query results to Delta
4. Monitor streaming query
5. Stop streaming query

##### Classes
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.DataStreamReader.html" target="_blank">DataStreamReader</a>
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.DataStreamWriter.html" target="_blank">DataStreamWriter</a>
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.StreamingQuery.html" target="_blank">StreamingQuery</a>

In [0]:
%run ../Includes/Classroom-Setup

### 1. Read data stream
- Set to process 1 file per trigger
- Read from Delta files in the source directory specified by **`DA.paths.sales`**

Assign the resulting DataFrame to **`df`**.

In [0]:
# ANSWER
df = (spark
      .readStream
      .option("maxFilesPerTrigger", 1)
      .format("delta")
      .load(DA.paths.sales)
     )

**1.1: CHECK YOUR WORK**

In [0]:
assert df.isStreaming
assert df.columns == ["order_id", "email", "transaction_timestamp", "total_item_quantity", "purchase_revenue_in_usd", "unique_items", "items"]
print("All test pass")

### 2. Filter for transactions with coupon codes
- Explode the **`items`** field in **`df`** with the results replacing the existing **`items`** field
- Filter for records where **`items.coupon`** is not null

Assign the resulting DataFrame to **`coupon_sales_df`**.

In [0]:
# ANSWER
from pyspark.sql.functions import col, explode

coupon_sales_df = (df
                   .withColumn("items", explode(col("items")))
                   .filter(col("items.coupon").isNotNull())
                  )
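
As a rough mental model of the two steps above, the following plain-Python sketch mimics what exploding an array column and then filtering on a nested field does to the rows. The order records, coupon code, and item IDs are made up for illustration, and **`explode_items`** is a hypothetical helper, not a Spark function.

```python
from typing import Any, Dict, List


def explode_items(order: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return one row per element of the order's `items` array."""
    # An empty array produces no rows, matching the behavior of explode().
    return [{**order, "items": item} for item in order["items"]]


# Made-up orders, for illustration only.
orders = [
    {"order_id": 1, "items": [{"coupon": "NEWBED10", "item_id": "A"},
                              {"coupon": None, "item_id": "B"}]},
    {"order_id": 2, "items": [{"coupon": None, "item_id": "C"}]},
]

# Step 1: explode, replacing the array with a single struct per row.
exploded = [row for order in orders for row in explode_items(order)]

# Step 2: keep only rows whose item carries a coupon code.
coupon_sales = [row for row in exploded if row["items"]["coupon"] is not None]

print(len(exploded), "rows after explode")
print(coupon_sales)
```

Note that after the explode, **`items`** holds a single struct rather than an array, which is why **`items.coupon`** can be filtered directly.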

**2.1: CHECK YOUR WORK**

In [0]:
# Compare type strings rather than str(schema), whose format differs across Spark versions
items_type = dict(coupon_sales_df.dtypes)["items"]
assert items_type.startswith("struct<coupon:"), "items column was not exploded"
print("All test pass")

### 3. Write streaming query results to Delta
- Configure the streaming query to write Delta format files in "append" mode
- Set the query name to "coupon_sales"
- Set a trigger interval of 1 second
- Set the checkpoint location to **`coupons_checkpoint_path`**
- Set the output path to **`coupons_output_path`**

Start the streaming query and assign the resulting handle to **`coupon_sales_query`**.

In [0]:
# ANSWER

coupons_checkpoint_path = f"{DA.paths.checkpoints}/coupon-sales"
coupons_output_path = f"{DA.paths.working_dir}/coupon-sales/output"

coupon_sales_query = (coupon_sales_df
                      .writeStream
                      .outputMode("append")
                      .format("delta")
                      .queryName("coupon_sales")
                      .trigger(processingTime="1 second")
                      .option("checkpointLocation", coupons_checkpoint_path)
                      .start(coupons_output_path)
                     )

**3.1: CHECK YOUR WORK**

In [0]:
DA.block_until_stream_is_ready("coupon_sales")
assert coupon_sales_query.isActive
assert len(dbutils.fs.ls(coupons_output_path)) > 0
assert len(dbutils.fs.ls(coupons_checkpoint_path)) > 0
assert "coupon_sales" in coupon_sales_query.lastProgress["name"]
print("All test pass")

### 4. Monitor streaming query
- Get the ID of the streaming query and store it in **`query_id`**
- Get the status of the streaming query and store it in **`query_status`**

In [0]:
# ANSWER
query_id = coupon_sales_query.id
print(query_id)

In [0]:
# ANSWER
query_status = coupon_sales_query.status
print(query_status)

**4.1: CHECK YOUR WORK**

In [0]:
assert isinstance(query_id, str)
assert list(query_status.keys()) == ["message", "isDataAvailable", "isTriggerActive"]
print("All test pass")

### 5. Stop streaming query
- Stop the streaming query

In [0]:
# ANSWER
coupon_sales_query.stop()
coupon_sales_query.awaitTermination()

**5.1: CHECK YOUR WORK**

In [0]:
assert not coupon_sales_query.isActive
print("All test pass")

### 6. Verify the records were written in Delta format

In [0]:
# ANSWER
display(spark.read.format("delta").load(coupons_output_path))

# Hourly Activity by Traffic Lab
Process streaming data to display the total active users by traffic source with a 1 hour window.
1. Cast to timestamp and add watermark for 2 hours
2. Aggregate active users by traffic source for 1 hour windows
3. Execute query with **`display`** and plot results
4. Use query name to stop streaming query

### Setup
Run the cells below to generate hourly JSON files of event data for July 3, 2020.

In [0]:
%run ../Includes/Classroom-Setup

In [0]:
schema = "device STRING, ecommerce STRUCT<purchase_revenue_in_usd: DOUBLE, total_item_quantity: BIGINT, unique_items: BIGINT>, event_name STRING, event_previous_timestamp BIGINT, event_timestamp BIGINT, geo STRUCT<city: STRING, state: STRING>, items ARRAY<STRUCT<coupon: STRING, item_id: STRING, item_name: STRING, item_revenue_in_usd: DOUBLE, price_in_usd: DOUBLE, quantity: BIGINT>>, traffic_source STRING, user_first_touch_timestamp BIGINT, user_id STRING"

# Directory of hourly events logged from the BedBricks website on July 3, 2020
hourly_events_path = f"{DA.paths.datasets}/ecommerce/events/events-2020-07-03.json"

df = (spark
      .readStream
      .schema(schema)
      .option("maxFilesPerTrigger", 1)
      .json(hourly_events_path)
     )

### 1. Cast to timestamp and add watermark for 2 hours
- Add a **`createdAt`** column by dividing **`event_timestamp`** by 1,000,000 (**`1e6`**), which converts microseconds to seconds, and casting the result to timestamp
- Set a watermark of 2 hours on the **`createdAt`** column

Assign the resulting DataFrame to **`events_df`**.

In [0]:
# ANSWER
from pyspark.sql.functions import col

events_df = (df
             .withColumn("createdAt", (col("event_timestamp") / 1e6).cast("timestamp"))
             .withWatermark("createdAt", "2 hours")
            )
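
The following plain-Python sketch mirrors the arithmetic behind the two steps above: converting a microsecond epoch value to a timestamp, and deriving a 2-hour watermark threshold from the latest event time seen so far. The helper names are hypothetical, and UTC is assumed for readability (Spark renders timestamps in the session time zone). It is only a model of the idea, not of Spark's internal watermark tracking.

```python
from datetime import datetime, timedelta, timezone


def micros_to_timestamp(event_timestamp_us: int) -> datetime:
    """Convert microseconds since the Unix epoch to a UTC timestamp."""
    return datetime.fromtimestamp(event_timestamp_us / 1e6, tz=timezone.utc)


def watermark(max_event_time: datetime, delay: timedelta = timedelta(hours=2)) -> datetime:
    """Events older than this threshold are considered too late."""
    return max_event_time - delay


# Midnight UTC on July 3, 2020, expressed in microseconds since the epoch.
created_at = micros_to_timestamp(1_593_734_400_000_000)
late_threshold = watermark(created_at)

print(created_at)       # 2020-07-03 00:00:00+00:00
print(late_threshold)   # 2020-07-02 22:00:00+00:00
```

With a 2-hour watermark, an aggregation only needs to keep state for windows that could still receive events newer than the threshold.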

**1.1: CHECK YOUR WORK**

In [0]:
# Check the column type via dtypes, which is stable across Spark versions
assert dict(events_df.dtypes).get("createdAt") == "timestamp"
print("All test pass")

### 2. Aggregate active users by traffic source for 1 hour windows

- Set the default shuffle partitions to the number of cores on your cluster
- Group by **`traffic_source`** with 1-hour tumbling windows based on the **`createdAt`** column
- Aggregate the approximate count of distinct **`user_id`** values and alias the resulting column to **`active_users`**
- Select **`traffic_source`**, **`active_users`**, and the **`hour`** extracted from **`window.start`** with an alias of **`hour`**
- Sort by **`hour`** in ascending order

Assign the resulting DataFrame to **`traffic_df`**.

In [0]:
# ANSWER
from pyspark.sql.functions import approx_count_distinct, hour, window

spark.conf.set("spark.sql.shuffle.partitions", spark.sparkContext.defaultParallelism)

traffic_df = (events_df
              .groupBy("traffic_source", window(col("createdAt"), "1 hour"))
              .agg(approx_count_distinct("user_id").alias("active_users"))
              .select(col("traffic_source"), col("active_users"), hour(col("window.start")).alias("hour"))
              .sort("hour")
             )
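
To see what a 1-hour tumbling window does to individual events, here is a small plain-Python model of the same grouping. It uses an exact distinct count with Python sets rather than **`approx_count_distinct`**, and the events and user IDs are invented for illustration; **`window_start`** is a hypothetical helper.

```python
from collections import defaultdict
from datetime import datetime, timezone


def window_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its 1-hour tumbling window."""
    return ts.replace(minute=0, second=0, microsecond=0)


# Invented events: (traffic_source, user_id, createdAt).
events = [
    ("email", "u1", datetime(2020, 7, 3, 0, 5, tzinfo=timezone.utc)),
    ("email", "u1", datetime(2020, 7, 3, 0, 40, tzinfo=timezone.utc)),
    ("email", "u2", datetime(2020, 7, 3, 0, 59, tzinfo=timezone.utc)),
    ("google", "u3", datetime(2020, 7, 3, 1, 0, tzinfo=timezone.utc)),
]

# Group by (traffic_source, window) and collect the distinct users in each group.
users_by_group = defaultdict(set)
for source, user_id, created_at in events:
    users_by_group[(source, window_start(created_at).hour)].add(user_id)

# Rows of (hour, traffic_source, active_users), sorted by hour.
active_users = sorted(
    (hour, source, len(users)) for (source, hour), users in users_by_group.items()
)
print(active_users)
```

Each event falls into exactly one window because tumbling windows do not overlap, so an event at 01:00 belongs to the 01:00 window rather than the 00:00 window.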

**2.1: CHECK YOUR WORK**

In [0]:
# Compare dtypes rather than str(schema), whose format differs across Spark versions
assert traffic_df.dtypes == [("traffic_source", "string"), ("active_users", "bigint"), ("hour", "int")]
print("All test pass")

### 3. Execute query with display() and plot results
- Use **`display`** to start **`traffic_df`** as a streaming query and display the resulting memory sink
  - Assign "hourly_traffic" as the name of the query by setting the **`streamName`** parameter of **`display`**
- Plot the streaming query results as a bar graph
- Configure the following plot options:
  - Keys: **`hour`**
  - Series groupings: **`traffic_source`**
  - Values: **`active_users`**

In [0]:
# ANSWER
display(traffic_df, streamName="hourly_traffic")

**3.1: CHECK YOUR WORK**

- The bar chart should plot **`hour`** on the x-axis and **`active_users`** on the y-axis
- Six bars, one per traffic source, should appear at every hour
- The chart should stop at hour 23

### 4. Manage streaming query
- Iterate over SparkSession's list of active streams to find one with name "hourly_traffic"
- Stop the streaming query

In [0]:
# ANSWER
DA.block_until_stream_is_ready("hourly_traffic")

for s in spark.streams.active:
    if s.name == "hourly_traffic":
        s.stop()
        s.awaitTermination()

**4.1: CHECK YOUR WORK**
Print all active streams to check that "hourly_traffic" is no longer there

In [0]:
for s in spark.streams.active:
    print(s.name)

# Activity by Traffic Lab
Process streaming data to display total active users by traffic source.

##### Objectives
1. Read data stream
2. Get active users by traffic source
3. Execute query with display() and plot results
4. Execute the same streaming query with DataStreamWriter
5. View results being updated in the query table
6. List and stop all active streams

##### Classes
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.DataStreamReader.html" target="_blank">DataStreamReader</a>
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.DataStreamWriter.html" target="_blank">DataStreamWriter</a>
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.StreamingQuery.html" target="_blank">StreamingQuery</a>

### Setup
Run the cells below to generate data and create the **`schema`** string needed for this lab.

In [0]:
%run ../Includes/Classroom-Setup

### 1. Read data stream
- Set to process 1 file per trigger
- Read from Delta with filepath stored in **`DA.paths.events`**

Assign the resulting DataFrame to **`df`**.

In [0]:
# ANSWER
df = (spark
      .readStream
      .option("maxFilesPerTrigger", 1)
      .format("delta")
      .load(DA.paths.events)
     )

**1.1: CHECK YOUR WORK**

In [0]:
assert df.isStreaming
assert df.columns == ["device", "ecommerce", "event_name", "event_previous_timestamp", "event_timestamp", "geo", "items", "traffic_source", "user_first_touch_timestamp", "user_id"]
print("All test pass")

### 2. Get active users by traffic source
- Set default shuffle partitions to number of cores on your cluster (not required, but runs faster)
- Group by **`traffic_source`**
  - Aggregate the approximate count of distinct users and alias with "active_users"
- Sort by **`traffic_source`**

In [0]:
# ANSWER
from pyspark.sql.functions import col, approx_count_distinct, count

spark.conf.set("spark.sql.shuffle.partitions", spark.sparkContext.defaultParallelism)

traffic_df = (df
              .groupBy("traffic_source")
              .agg(approx_count_distinct("user_id").alias("active_users"))
              .sort("traffic_source")
             )

**2.1: CHECK YOUR WORK**

In [0]:
# Compare dtypes rather than str(schema), whose format differs across Spark versions
assert traffic_df.dtypes == [("traffic_source", "string"), ("active_users", "bigint")]
print("All test pass")

### 3. Execute query with display() and plot results
- Execute results for **`traffic_df`** using display()
- Plot the streaming query results as a bar graph

In [0]:
# ANSWER
display(traffic_df)

**3.1: CHECK YOUR WORK**
- Your bar chart should plot **`traffic_source`** on the x-axis and **`active_users`** on the y-axis
- The top three traffic sources in descending order should be **`google`**, **`facebook`**, and **`instagram`**.

### 4. Execute the same streaming query with DataStreamWriter
- Name the query "active_users_by_traffic"
- Set to "memory" format and "complete" output mode
- Set a trigger interval of 1 second

In [0]:
# ANSWER
traffic_query = (traffic_df
                 .writeStream
                 .queryName("active_users_by_traffic")
                 .format("memory")
                 .outputMode("complete")
                 .trigger(processingTime="1 second")
                 .start()
                )
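
In "complete" output mode, the entire result table is written to the sink after every trigger, not only the rows that changed. The plain-Python sketch below models that behavior for a distinct-user count per traffic source. It uses exact set-based counts rather than **`approx_count_distinct`**, and both the helper **`complete_mode_outputs`** and the sample events are hypothetical.

```python
from typing import Dict, Iterator, List, Set, Tuple


def complete_mode_outputs(
    batches: List[List[Tuple[str, str]]]
) -> Iterator[Dict[str, int]]:
    """Yield the full result table after each micro-batch."""
    users_by_source: Dict[str, Set[str]] = {}
    for batch in batches:
        for source, user_id in batch:
            users_by_source.setdefault(source, set()).add(user_id)
        # Complete mode: emit every group, including ones this batch did not touch.
        yield {source: len(users) for source, users in users_by_source.items()}


# Invented (traffic_source, user_id) events split across two micro-batches.
sample_batches = [
    [("google", "u1"), ("email", "u2")],
    [("google", "u3"), ("google", "u1")],
]

outputs = list(complete_mode_outputs(sample_batches))
for batch_id, table in enumerate(outputs):
    print(batch_id, table)
```

The second output still contains the **`email`** row even though the second batch had no email events, which is why the in-memory table always shows every traffic source.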

**4.1: CHECK YOUR WORK**

In [0]:
DA.block_until_stream_is_ready("active_users_by_traffic")
assert traffic_query.isActive
assert "active_users_by_traffic" in traffic_query.name
assert traffic_query.lastProgress["sink"]["description"] == "MemorySink"
print("All test pass")

### 5. View results being updated in the query table
Run a query in a SQL cell to display the results from the **`active_users_by_traffic`** table

In [0]:
%sql
-- ANSWER
SELECT * FROM active_users_by_traffic

**5.1: CHECK YOUR WORK**
Your query should eventually result in the following values.

|traffic_source|active_users|
|---|---|
|direct|438886|
|email|281525|
|facebook|956769|
|google|1781961|
|instagram|530050|
|youtube|253321|

### 6. List and stop all active streams
- Use SparkSession to get list of all active streams
- Iterate over the list and stop each query

In [0]:
# ANSWER
for s in spark.streams.active:
    print(s.name)
    s.stop()

**6.1: CHECK YOUR WORK**

In [0]:
assert not traffic_query.isActive
print("All test pass")

# Delta Lake

##### Objectives
1. Create a Delta Table
1. Understand the transaction log
1. Read data from your Delta Table
1. Update data in your Delta Table
1. Access previous versions of table using time travel
1. Vacuum

##### Documentation
- <a href="https://docs.delta.io/latest/quick-start.html#create-a-table" target="_blank">Delta Table</a> 
- <a href="https://databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html" target="_blank">Transaction Log</a> 
- <a href="https://databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html" target="_blank">Time Travel</a> 

In [0]:
%run ../Includes/Classroom-Setup

### Create a Delta Table
Let's first read the Parquet-format BedBricks events dataset.

In [0]:
events_df = spark.read.format("parquet").load(f"{DA.paths.datasets}/ecommerce/events/events.parquet")
display(events_df)

Write the data in Delta format to the directory given by **`delta_path`**.

In [0]:
delta_path = f"{DA.paths.working_dir}/delta-events"
events_df.write.format("delta").mode("overwrite").save(delta_path)

Write the data in Delta format as a managed table in the metastore.

In [0]:
events_df.write.format("delta").mode("overwrite").saveAsTable("delta_events")

As with other file formats, Delta supports partitioning your data in storage using the unique values in a specified column (often referred to as "Hive partitioning").

Let's **overwrite** the Delta dataset in the **`delta_path`** directory to partition by state. This can accelerate queries that filter by state.

In [0]:
from pyspark.sql.functions import col

state_events_df = events_df.withColumn("state", col("geo.state"))

state_events_df.write.format("delta").mode("overwrite").partitionBy("state").option("overwriteSchema", "true").save(delta_path)
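
Partitioned data is laid out on storage as one directory per distinct value, named **`column=value`**. The short plain-Python sketch below builds and parses such directory names. The helpers **`partition_dir`** and **`parse_partition`** and the base path are hypothetical, and the sketch ignores the escaping that is applied to special characters in real partition values.

```python
from typing import Tuple


def partition_dir(base_path: str, column: str, value: str) -> str:
    """Build the directory path that holds one partition value."""
    return f"{base_path.rstrip('/')}/{column}={value}"


def parse_partition(dir_name: str) -> Tuple[str, str]:
    """Split a 'column=value' directory name into its two parts."""
    column, _, value = dir_name.strip("/").partition("=")
    return column, value


# Hypothetical base path, for illustration only.
print(partition_dir("/tmp/delta-events", "state", "CA"))
print(parse_partition("state=CA/"))
```

A query that filters on **`state`** can skip every directory whose name does not match the filter, which is where the speedup comes from.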

### Understand the Transaction Log
We can see how Delta stores the different state partitions in separate directories.

We can also see a directory called **`_delta_log`**, which is the transaction log.

When a Delta Lake dataset is created, its transaction log is automatically created in the **`_delta_log`** subdirectory.

In [0]:
display(dbutils.fs.ls(delta_path))

When changes are made to that table, these changes are recorded as ordered, atomic commits in the transaction log.

Each commit is written out as a JSON file, starting with 00000000000000000000.json.

Additional changes to the table generate subsequent JSON files in ascending numerical order.

<div style="img align: center; line-height: 0; padding-top: 9px;">
  <img src="https://user-images.githubusercontent.com/20408077/87174138-609fe600-c29c-11ea-90cc-84df0c1357f1.png" width="500"/>
</div>
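
The commit file names follow a simple pattern: the version number, zero-padded to 20 digits, plus a **`.json`** extension. A minimal plain-Python sketch of that mapping is shown here; the helper names **`commit_file_name`** and **`commit_version`** are hypothetical.

```python
def commit_file_name(version: int) -> str:
    """Return the transaction log file name for a table version."""
    if version < 0:
        raise ValueError("version must be non-negative")
    return f"{version:020d}.json"


def commit_version(file_name: str) -> int:
    """Recover the table version from a transaction log file name."""
    return int(file_name.split(".")[0])


for version in range(3):
    print(commit_file_name(version))
```

Because the names are zero-padded, sorting them as plain strings also sorts them by version.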

In [0]:
display(dbutils.fs.ls(f"{delta_path}/_delta_log/"))

Next, let's take a look at a transaction log file.

The <a href="https://docs.databricks.com/delta/delta-utility.html" target="_blank">four columns</a> each represent a different part of the very first commit to the Delta Table, the commit that created the table.
- The **`add`** column lists the data files added by the commit, with statistics about each file's contents as a whole and about individual columns.
- The **`commitInfo`** column has useful information about what the operation was (for example, WRITE) and who executed the operation.
- The **`metaData`** column contains information about the column schema.
- The **`protocol`** column contains the minimum reader and writer protocol versions necessary to read from or write to this Delta Table.

In [0]:
display(spark.read.json(f"{delta_path}/_delta_log/00000000000000000000.json"))
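
Each line of a commit file is a JSON object with a single top-level key naming the action, which is why reading the file as JSON yields one column per action type. The plain-Python sketch below parses a tiny, made-up commit to show that structure. The field values are hypothetical and far smaller than a real commit, and **`action_counts`** is an illustrative helper.

```python
import json
from collections import Counter

# A made-up, heavily abbreviated commit: one JSON action per line.
sample_commit = "\n".join([
    json.dumps({"commitInfo": {"operation": "WRITE"}}),
    json.dumps({"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}}),
    json.dumps({"metaData": {"partitionColumns": []}}),
    json.dumps({"add": {"path": "part-00000.snappy.parquet", "size": 1024}}),
])


def action_counts(commit_text: str) -> Counter:
    """Count how many actions of each type a commit contains."""
    counts = Counter()
    for line in commit_text.splitlines():
        if line.strip():
            action = json.loads(line)
            counts.update(action.keys())
    return counts


print(action_counts(sample_commit))
```

A commit that writes many data files contains many **`add`** lines, so the row count of a commit file grows with the number of files written.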

One key difference between these two transaction log files is their size: the second file, displayed below, has 206 rows compared to 7 in the first.

To understand why, let's take a look at the **`commitInfo`** column. We can see that in the **`operationParameters`** section, **`partitionBy`** is now set to the **`state`** column. Furthermore, if we look at the **`add`** section on row 3, we can see that a new section called **`partitionValues`** has appeared. As we saw above, Delta stores partitions in separate directories in storage; however, it records information about all of these partitions in the same transaction log file.

In [0]:
display(spark.read.json(f"{delta_path}/_delta_log/00000000000000000001.json"))

Finally, let's take a look at the files inside one of the state partitions. The files inside correspond to the partition commit (file 01) in the **`_delta_log`** directory.

In [0]:
display(dbutils.fs.ls(f"{delta_path}/state=CA/"))

### Read from your Delta table

In [0]:
df = spark.read.format("delta").load(delta_path)
display(df)

### Update your Delta Table

Let's filter for rows where the event takes place on a mobile device.

In [0]:
df_update = state_events_df.filter(col("device").isin(["Android", "iOS"]))
display(df_update)

In [0]:
df_update.write.format("delta").mode("overwrite").save(delta_path)

In [0]:
df = spark.read.format("delta").load(delta_path)
display(df)

Let's look at the files in the California partition post-update. Remember, the different files in this directory are snapshots of your DataFrame corresponding to different commits.

In [0]:
display(dbutils.fs.ls(f"{delta_path}/state=CA/"))

### Access previous versions of table using Time Travel

Oops, it turns out we actually need the entire dataset! You can access a previous version of your Delta Table using Time Travel. Use the following two cells to access your version history. Delta Lake will keep a 30-day version history by default, but if necessary, Delta can store a version history for longer.

In [0]:
spark.sql("DROP TABLE IF EXISTS train_delta")
spark.sql(f"CREATE TABLE train_delta USING DELTA LOCATION '{delta_path}'")

In [0]:
%sql
DESCRIBE HISTORY train_delta

Using the **`versionAsOf`** option allows you to easily access previous versions of your Delta Table.

In [0]:
df = spark.read.format("delta").option("versionAsOf", 0).load(delta_path)
display(df)

You can also access older versions using a timestamp.

Replace the timestamp string with the information from your version history. 

<img src="https://files.training.databricks.com/images/icon_note_32.png"> Note: You can use a date without the time information if necessary.

In [0]:
# ANSWER

temp_df = spark.sql("DESCRIBE HISTORY train_delta").select("timestamp").orderBy(col("timestamp").asc())
time_stamp = temp_df.first()["timestamp"]

as_of_df = spark.read.format("delta").option("timestampAsOf", time_stamp).load(delta_path)
display(as_of_df)
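
One way to think about reading by timestamp is that the timestamp is resolved to the latest version committed at or before it. The plain-Python sketch below models that lookup over a made-up version history. It is an assumption-laden illustration rather than Delta Lake's implementation; for example, real Delta Lake may also reject timestamps later than the newest commit, which this sketch does not model. The helper **`version_as_of`** is hypothetical.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List, Tuple


def version_as_of(history: List[Tuple[int, datetime]], as_of: datetime) -> int:
    """Return the latest version committed at or before `as_of`.

    `history` holds (version, commit_timestamp) pairs sorted by version.
    """
    timestamps = [ts for _, ts in history]
    index = bisect_right(timestamps, as_of)
    if index == 0:
        raise ValueError("timestamp is before the earliest available version")
    return history[index - 1][0]


# Made-up history, for illustration only.
history = [
    (0, datetime(2020, 7, 3, 9, 0)),
    (1, datetime(2020, 7, 3, 9, 5)),
    (2, datetime(2020, 7, 3, 9, 10)),
]

print(version_as_of(history, datetime(2020, 7, 3, 9, 7)))  # 1
print(version_as_of(history, datetime(2020, 7, 3, 9, 0)))  # 0
```

This is why passing the earliest timestamp from the version history, as the cell above does, returns the same data as **`versionAsOf`** 0.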

### Vacuum 

Now that we're happy with our Delta Table, we can clean up our directory using **`VACUUM`**. Vacuum accepts a retention period in hours as an input.

If you uncomment and run the cell below, it fails. By default, to prevent accidentally vacuuming recent commits, Delta Lake will not let users vacuum with a retention period under 7 days (168 hours). Once a table is vacuumed, you cannot return to a prior commit through time travel; only your most recent Delta Table version is kept.

In [0]:
# from delta.tables import *

# delta_table = DeltaTable.forPath(spark, delta_path)
# delta_table.vacuum(0)

We can work around this by setting a Spark configuration that bypasses the default retention period check.

In [0]:
from delta.tables import DeltaTable

spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
delta_table = DeltaTable.forPath(spark, delta_path)
delta_table.vacuum(0)
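
The retention safety check can be pictured as a simple guard on the requested retention period. The plain-Python sketch below models that guard and the effect of disabling it. The function **`vacuum_allowed`** and the constant name are hypothetical, and this is only a model of the rule described above, not Delta Lake's code.

```python
DEFAULT_RETENTION_HOURS = 168  # 7 days


def vacuum_allowed(retention_hours: float, check_enabled: bool = True) -> bool:
    """Return True if a vacuum with this retention period would be accepted."""
    if retention_hours < 0:
        raise ValueError("retention_hours must be non-negative")
    if not check_enabled:
        # Mirrors turning off the retention duration check.
        return True
    return retention_hours >= DEFAULT_RETENTION_HOURS


print(vacuum_allowed(0))                       # False
print(vacuum_allowed(0, check_enabled=False))  # True
print(vacuum_allowed(168))                     # True
```

Disabling the check is convenient in a classroom, but on a shared table a retention of 0 hours removes the files that time travel and concurrent readers may still need.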

Let's take a look at our Delta Table files now. After vacuuming, the directory only holds the partition of our most recent Delta Table commit.

In [0]:
display(dbutils.fs.ls(delta_path + "/state=CA/"))

Since vacuuming deletes the data files that older versions of the Delta Table rely on, we can no longer access past versions.

The code below should throw an error.

Uncomment it and give it a try.

In [0]:
# df = spark.read.format("delta").option("versionAsOf", 0).load(delta_path)
# display(df)

# Delta Lake Lab
##### Tasks
1. Write sales data to Delta
1. Modify sales data to show item count instead of item array
1. Rewrite sales data to same Delta path
1. Create table and view version history
1. Time travel to read previous version

In [0]:
%run ../Includes/Classroom-Setup

In [0]:
sales_df = spark.read.parquet(f"{DA.paths.datasets}/ecommerce/sales/sales.parquet")
delta_sales_path = f"{DA.paths.working_dir}/delta-sales"

### 1. Write sales data to Delta
Write **`sales_df`** to **`delta_sales_path`**

In [0]:
# ANSWER
sales_df.write.format("delta").mode("overwrite").save(delta_sales_path)

**1.1: CHECK YOUR WORK**

In [0]:
assert len(dbutils.fs.ls(delta_sales_path)) > 0

### 2. Modify sales data to show item count instead of item array
Replace values in the **`items`** column with an integer value of the items array size.
Assign the resulting DataFrame to **`updated_sales_df`**.

In [0]:
# ANSWER
from pyspark.sql.functions import size, col

updated_sales_df = sales_df.withColumn("items", size(col("items")))
display(updated_sales_df)

**2.1: CHECK YOUR WORK**

In [0]:
from pyspark.sql.types import IntegerType

assert updated_sales_df.schema[6].dataType == IntegerType()
print("All test pass")

### 3. Rewrite sales data to same Delta path
Write **`updated_sales_df`** to the same Delta location **`delta_sales_path`**.

<img src="https://files.training.databricks.com/images/icon_hint_32.png" alt="Hint"> This will fail without an option to overwrite the schema.

In [0]:
# ANSWER
updated_sales_df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save(delta_sales_path)

**3.1: CHECK YOUR WORK**

In [0]:
assert spark.read.format("delta").load(delta_sales_path).schema[6].dataType == IntegerType()
print("All test pass")

### 4. Create table and view version history
Run SQL queries by writing SQL inside of **`spark.sql()`** to perform the following steps.
- Drop table **`sales_delta`** if it exists
- Create **`sales_delta`** table using the **`delta_sales_path`** location
- List version history for the **`sales_delta`** table

An example of a SQL query inside of **`spark.sql()`** would be something like **`spark.sql("SELECT * FROM sales_data")`**

In [0]:
# ANSWER
spark.sql("DROP TABLE IF EXISTS sales_delta")
spark.sql("CREATE TABLE sales_delta USING DELTA LOCATION '{}'".format(delta_sales_path))

In [0]:
# ANSWER
display(spark.sql("DESCRIBE HISTORY sales_delta"))

**4.1: CHECK YOUR WORK**

In [0]:
sales_delta_df = spark.sql("SELECT * FROM sales_delta")
assert sales_delta_df.count() == 210370
assert sales_delta_df.schema[6].dataType == IntegerType()
print("All test pass")

### 5. Time travel to read previous version
Read the Delta table at **`delta_sales_path`** at version 0.
Assign the resulting DataFrame to **`old_sales_df`**.

In [0]:
# ANSWER
old_sales_df = spark.read.format("delta").option("versionAsOf", 0).load(delta_sales_path)
display(old_sales_df)

**5.1: CHECK YOUR WORK**

In [0]:
assert old_sales_df.select(size(col("items"))).first()[0] == 1
print("All test pass")