<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Advanced Delta Lake Features

Now that you feel comfortable performing basic data tasks with Delta Lake, we can discuss a few features unique to Delta Lake.

Note that while some of the keywords used here aren't part of standard ANSI SQL, all Delta Lake operations can be run on Databricks using SQL.

## Learning Objectives
By the end of this lesson, you should be able to:
* Use **`OPTIMIZE`** to compact small files
* Use **`ZORDER`** to index tables
* Describe the directory structure of Delta Lake files
* Review a history of table transactions
* Query and roll back to previous table versions
* Clean up stale data files with **`VACUUM`**

**Resources**
* <a href="https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-optimize.html" target="_blank">Delta Optimize - Databricks Docs</a>
* <a href="https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-vacuum.html" target="_blank">Delta Vacuum - Databricks Docs</a>

## Run Setup
The first thing we're going to do is run a setup script. It will define a username, userhome, and database, each scoped to the current user.

In [0]:
%run ../Includes/Classroom-Setup-2.3

## Creating a Delta Table with History

The cell below condenses all the transactions from the previous lesson into a single cell. (Except for the **`DROP TABLE`**!)

As you're waiting for this query to run, see if you can identify the total number of transactions being executed.

In [0]:
%sql
CREATE TABLE students
  (id INT, name STRING, value DOUBLE);
  
INSERT INTO students VALUES (1, "Yve", 1.0);
INSERT INTO students VALUES (2, "Omar", 2.5);
INSERT INTO students VALUES (3, "Elia", 3.3);

INSERT INTO students
VALUES 
  (4, "Ted", 4.7),
  (5, "Tiffany", 5.5),
  (6, "Vini", 6.3);
  
UPDATE students 
SET value = value + 1
WHERE name LIKE "T%";

DELETE FROM students 
WHERE value > 6;

CREATE OR REPLACE TEMP VIEW updates(id, name, value, type) AS VALUES
  (2, "Omar", 15.2, "update"),
  (3, "", null, "delete"),
  (7, "Blue", 7.7, "insert"),
  (11, "Diya", 8.8, "update");
  
MERGE INTO students b
USING updates u
ON b.id=u.id
WHEN MATCHED AND u.type = "update"
  THEN UPDATE SET *
WHEN MATCHED AND u.type = "delete"
  THEN DELETE
WHEN NOT MATCHED AND u.type = "insert"
  THEN INSERT *;

num_affected_rows,num_updated_rows,num_deleted_rows,num_inserted_rows
3,1,1,1


## Examine Table Details

Databricks uses a Hive metastore by default to register databases, tables, and views.

#### Using **`DESCRIBE EXTENDED`** allows us to see important metadata about our table.

In [0]:
%sql
DESCRIBE EXTENDED students

col_name,data_type,comment
id,int,
name,string,
value,double,
,,
# Partitioning,,
Not partitioned,,
,,
# Detailed Table Information,,
Catalog,spark_catalog,
Database,dbacademy_manujkumar_joshi_celebaltech_com_dewd_2_3,


#### **`DESCRIBE DETAIL`** is another command that allows us to explore table metadata.

In [0]:
%sql
DESCRIBE DETAIL students

format,id,name,description,location,createdAt,lastModified,partitionColumns,numFiles,sizeInBytes,properties,minReaderVersion,minWriterVersion
delta,414fbbe9-7312-45b3-9510-c4781713d06b,dbacademy_manujkumar_joshi_celebaltech_com_dewd_2_3.students,,dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students,2022-07-28T06:30:48.407+0000,2022-07-28T06:31:03.000+0000,List(),4,4236,Map(),1,2


Note the **`Location`** field.

While we've so far been thinking about our table as just a relational entity within a database, a Delta Lake table is actually backed by a collection of files stored in cloud object storage.

## Explore Delta Lake Files

We can see the files **backing our Delta Lake table** by using a Databricks Utilities function.

**NOTE**: It's not important right now to know everything about these files to work with Delta Lake, but it will help you gain a greater appreciation for how the technology is implemented.

In [0]:
%python
display(dbutils.fs.ls(f"{DA.paths.user_db}/students"))

path,name,size,modificationTime
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students/_delta_log/,_delta_log/,0,1658989864000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students/part-00000-21d5bbc7-b8bf-48cd-aecb-d138c839db67-c000.snappy.parquet,part-00000-21d5bbc7-b8bf-48cd-aecb-d138c839db67-c000.snappy.parquet,1055,1658989851000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students/part-00000-3987f4e9-b6db-4109-b03a-28c33d338105-c000.snappy.parquet,part-00000-3987f4e9-b6db-4109-b03a-28c33d338105-c000.snappy.parquet,1055,1658989858000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students/part-00000-663497bc-8864-4ba1-a1ce-0b90ff33c0cd-c000.snappy.parquet,part-00000-663497bc-8864-4ba1-a1ce-0b90ff33c0cd-c000.snappy.parquet,1063,1658989852000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students/part-00000-6bf3ead7-0848-482e-b847-afbda564759a-c000.snappy.parquet,part-00000-6bf3ead7-0848-482e-b847-afbda564759a-c000.snappy.parquet,1056,1658989855000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students/part-00000-7f8a4fda-519f-4de1-b2ff-99d82eceff6d-c000.snappy.parquet,part-00000-7f8a4fda-519f-4de1-b2ff-99d82eceff6d-c000.snappy.parquet,1063,1658989862000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students/part-00000-c300102f-3db9-4802-a1f5-1f9532cfcb04-c000.snappy.parquet,part-00000-c300102f-3db9-4802-a1f5-1f9532cfcb04-c000.snappy.parquet,1063,1658989854000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students/part-00001-0d8b2671-7522-47c3-bd54-3c3a24180bb2-c000.snappy.parquet,part-00001-0d8b2671-7522-47c3-bd54-3c3a24180bb2-c000.snappy.parquet,1084,1658989857000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students/part-00001-5f7b092f-ce48-45fe-95e4-89fd6f7bc73b-c000.snappy.parquet,part-00001-5f7b092f-ce48-45fe-95e4-89fd6f7bc73b-c000.snappy.parquet,1083,1658989855000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students/part-00002-1e38a2b2-5de2-4872-94e9-f8e47082087d-c000.snappy.parquet,part-00002-1e38a2b2-5de2-4872-94e9-f8e47082087d-c000.snappy.parquet,1063,1658989855000


Note that our directory contains a number of Parquet data files and a directory named **`_delta_log`**.

**Records in Delta Lake tables are stored as data in Parquet files.**

#### Transactions to Delta Lake tables are recorded in the **`_delta_log`**.

We can peek inside the **`_delta_log`** to see more.

In [0]:
%python
display(dbutils.fs.ls(f"{DA.paths.user_db}/students/_delta_log"))

path,name,size,modificationTime
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students/_delta_log/00000000000000000000.crc,00000000000000000000.crc,1945,1658989849000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students/_delta_log/00000000000000000000.json,00000000000000000000.json,1005,1658989848000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students/_delta_log/00000000000000000001.crc,00000000000000000001.crc,1951,1658989852000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students/_delta_log/00000000000000000001.json,00000000000000000001.json,968,1658989851000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students/_delta_log/00000000000000000002.crc,00000000000000000002.crc,1951,1658989854000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students/_delta_log/00000000000000000002.json,00000000000000000002.json,970,1658989853000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students/_delta_log/00000000000000000003.crc,00000000000000000003.crc,1951,1658989855000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students/_delta_log/00000000000000000003.json,00000000000000000003.json,970,1658989854000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students/_delta_log/00000000000000000004.crc,00000000000000000004.crc,1951,1658989856000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students/_delta_log/00000000000000000004.json,00000000000000000004.json,1858,1658989856000


#### Each transaction results in a new JSON file being written to the Delta Lake transaction log. Here, there are 8 total transactions against this table, numbered 0 through 7 (Delta Lake versions are 0-indexed).
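
The file names in the listing above follow a fixed pattern: the table version, zero-padded to 20 digits, followed by **`.json`**. As a small illustration, the sketch below converts between version numbers and commit file names. The helper names **`commit_file_name`** and **`version_from_file_name`** are hypothetical and are not part of any Delta Lake API.

```python
def commit_file_name(version: int) -> str:
    """Return the JSON commit file name for a table version (hypothetical helper)."""
    if version < 0:
        raise ValueError("table versions start at 0")
    # Commit files are named with the version zero-padded to 20 digits.
    return f"{version:020d}.json"


def version_from_file_name(name: str) -> int:
    """Recover the table version from a commit file name (hypothetical helper)."""
    stem, _, ext = name.partition(".")
    if ext != "json" or len(stem) != 20 or not stem.isdigit():
        raise ValueError(f"not a commit file name: {name!r}")
    return int(stem)


# Eight transactions are numbered 0 through 7.
names = [commit_file_name(v) for v in range(8)]
print(names[0])   # 00000000000000000000.json
print(names[-1])  # 00000000000000000007.json
```

The **`.crc`** files in the same directory are rejected by **`version_from_file_name`**, since only the JSON files describe transactions in this sketch.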

## Reasoning about Data Files

We just saw a lot of data files for what is obviously a very small table.

**`DESCRIBE DETAIL`** allows us to see some other details about our Delta table, including the number of files.

In [0]:
%sql
DESCRIBE DETAIL students

format,id,name,description,location,createdAt,lastModified,partitionColumns,numFiles,sizeInBytes,properties,minReaderVersion,minWriterVersion
delta,414fbbe9-7312-45b3-9510-c4781713d06b,dbacademy_manujkumar_joshi_celebaltech_com_dewd_2_3.students,,dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students,2022-07-28T06:30:48.407+0000,2022-07-28T06:31:03.000+0000,List(),4,4236,Map(),1,2


Here we see that our table currently contains 4 data files in its present version. So what are all those other Parquet files doing in our table directory? 

Rather than overwriting or immediately deleting files containing changed data, Delta Lake uses the transaction log to indicate whether or not files are valid in a current version of the table.

Here, we'll look at the transaction log corresponding to the **`MERGE`** statement above, where records were inserted, updated, and deleted.

In [0]:
%python
display(spark.sql(f"SELECT * FROM json.`{DA.paths.user_db}/students/_delta_log/00000000000000000007.json`"))

add,commitInfo,remove
,,"List(true, 1658989863134, true, part-00000-c300102f-3db9-4802-a1f5-1f9532cfcb04-c000.snappy.parquet, 1063, List(1658989854000000, 268435456))"
,,"List(true, 1658989863134, true, part-00000-663497bc-8864-4ba1-a1ce-0b90ff33c0cd-c000.snappy.parquet, 1063, List(1658989852000000, 268435456))"
"List(true, 1658989862000, part-00000-7f8a4fda-519f-4de1-b2ff-99d82eceff6d-c000.snappy.parquet, 1063, {""numRecords"":1,""minValues"":{""id"":2,""name"":""Omar"",""value"":15.2},""maxValues"":{""id"":2,""name"":""Omar"",""value"":15.2},""nullCount"":{""id"":0,""name"":0,""value"":0}}, List(1658989862000000, 268435456))",,
"List(true, 1658989862000, part-00002-d583035e-da18-4079-8303-ed6c7b32d644-c000.snappy.parquet, 1063, {""numRecords"":1,""minValues"":{""id"":7,""name"":""Blue"",""value"":7.7},""maxValues"":{""id"":7,""name"":""Blue"",""value"":7.7},""nullCount"":{""id"":0,""name"":0,""value"":0}}, List(1658989862000001, 268435456))",,
,"List(0725-045645-b5m629fz, Databricks-Runtime/10.4.x-scala2.12, false, WriteSerializable, List(2331746562402691), MERGE, List(1857, 2, 4, 0, 2, 2, 0, 1, 1, 1, 760, 987), List([{""predicate"":""(u.type = 'update')"",""actionType"":""update""},{""predicate"":""(u.type = 'delete')"",""actionType"":""delete""}], [{""predicate"":""(u.type = 'insert')"",""actionType"":""insert""}], (b.id = u.id)), 6, 1658989863199, bfeef74d-c96f-41aa-843d-37756e35cc8b, 6997591375752473, manujkumar.joshi@celebaltech.com)",


The **`add`** column contains a list of all the new files written to our table; the **`remove`** column indicates those files that no longer should be included in our table.

When we query a Delta Lake table, the query engine uses the transaction logs to resolve all the files that are valid in the current version, and ignores all other data files.
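
To make that resolution step concrete, here is a minimal sketch in plain Python of how a reader might apply one commit to the set of valid files. It assumes each commit is newline-delimited JSON in which **`add`** and **`remove`** actions carry a **`path`** field. The function name **`apply_commit`** and the short file names are hypothetical, chosen only to mirror the shape of the **`MERGE`** commit above (two files removed, two files added).

```python
import json


def apply_commit(live_files: set, commit_text: str) -> set:
    """Apply one commit (newline-delimited JSON actions) to a set of live file paths."""
    result = set(live_files)
    for line in commit_text.splitlines():
        if not line.strip():
            continue
        action = json.loads(line)
        if "add" in action:
            result.add(action["add"]["path"])
        elif "remove" in action:
            result.discard(action["remove"]["path"])
        # Other actions (such as commitInfo) do not change the set of data files.
    return result


# Hypothetical file names standing in for the long Parquet names in the output.
before = {"f1.parquet", "f2.parquet", "f3.parquet", "f4.parquet"}
commit = "\n".join([
    json.dumps({"remove": {"path": "f1.parquet"}}),
    json.dumps({"remove": {"path": "f2.parquet"}}),
    json.dumps({"add": {"path": "f5.parquet"}}),
    json.dumps({"add": {"path": "f6.parquet"}}),
    json.dumps({"commitInfo": {"operation": "MERGE"}}),
])
after = apply_commit(before, commit)
print(sorted(after))  # ['f3.parquet', 'f4.parquet', 'f5.parquet', 'f6.parquet']
```

Note that the removed files are only dropped from the set; nothing in this step deletes them from storage.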

## Compacting Small Files and Indexing

Small files can occur for a variety of reasons; in our case, we performed a number of operations where only one or several records were inserted.

Files will be combined toward an optimal size (scaled based on the size of the table) by using the **`OPTIMIZE`** command.

**`OPTIMIZE`** will replace existing data files by combining records and rewriting the results.
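
The idea of combining small files can be sketched as a simple grouping problem. The code below is an illustration only and is not how Databricks selects files or target sizes: the function **`plan_compaction`**, the file names, and the byte sizes are all assumptions made for the example, and the sketch does not model the size of the rewritten files.

```python
def plan_compaction(file_sizes: dict, target_bytes: int) -> list:
    """Group files into bins whose combined size stays at or under target_bytes."""
    bins = []
    current, current_size = [], 0
    # Visit files from smallest to largest and fill one bin at a time.
    for name, size in sorted(file_sizes.items(), key=lambda kv: kv[1]):
        if current and current_size + size > target_bytes:
            bins.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        bins.append(current)
    # A bin holding a single file gains nothing from being rewritten.
    return [b for b in bins if len(b) > 1]


# Illustrative sizes in bytes for four small files.
sizes = {"a": 1055, "b": 1063, "c": 1055, "d": 1063}
print(plan_compaction(sizes, 1_000_000))  # [['a', 'c', 'b', 'd']]
print(plan_compaction(sizes, 2_200))      # [['a', 'c'], ['b', 'd']]
```

With a generous target, all four files land in one group and would be rewritten as a single file.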

When executing **`OPTIMIZE`**, users can optionally specify one or several fields for **`ZORDER`** indexing. While the specific math of Z-order is unimportant, it speeds up data retrieval when filtering on provided fields by colocating data with similar values within data files.

In [0]:
%sql
OPTIMIZE students
ZORDER BY id

path,metrics
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students,"List(1, 4, List(1102, 1102, 1102.0, 1, 1102), List(1055, 1063, 1059.0, 4, 4236), 0, List(minCubeSize(107374182400), List(0, 0), List(4, 4236), 0, List(4, 4236), 1, null), 1, 4, 0, false)"


Given how small our data is, **`ZORDER`** does not provide any benefit, but we can see all of the metrics that result from this operation.
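
For the curious, the colocation idea can be illustrated with the textbook Z-order (Morton) curve, which interleaves the bits of two values into one sort key. This is a sketch of the general technique under the assumption of two small non-negative integer columns; it is not Databricks's implementation, and the names **`z_value`**, **`points`**, and **`files`** are hypothetical.

```python
def z_value(x: int, y: int, bits: int = 8) -> int:
    """Interleave the low `bits` bits of x and y into a single Z-order value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # bits of y go to odd positions
    return z


# Sixteen (x, y) rows, sorted by Z-order value and split into four "files".
points = [(x, y) for x in range(4) for y in range(4)]
ordered = sorted(points, key=lambda p: z_value(*p))
files = [ordered[i:i + 4] for i in range(0, len(ordered), 4)]

print(files[0])  # [(0, 0), (1, 0), (0, 1), (1, 1)]
print(files[1])  # [(2, 0), (3, 0), (2, 1), (3, 1)]
```

Each group of four rows covers a narrow range of both columns, so a filter on either column can rule out whole files. With a single column, as in our **`ZORDER BY id`** example, this ordering reduces to an ordinary sort.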

## Reviewing Delta Lake Transactions

Because all changes to the Delta Lake table are stored in the transaction log, we can easily review the <a href="https://docs.databricks.com/spark/2.x/spark-sql/language-manual/describe-history.html" target="_blank">table history</a>.

In [0]:
%sql
DESCRIBE HISTORY students

version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata,engineInfo
8,2022-07-28T06:52:38.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,OPTIMIZE,"Map(predicate -> [], zOrderBy -> [""id""], batchId -> 0, auto -> false)",,List(2331746562402691),0725-045645-b5m629fz,7.0,SnapshotIsolation,False,"Map(numRemovedFiles -> 4, numRemovedBytes -> 4236, p25FileSize -> 1102, minFileSize -> 1102, numAddedFiles -> 1, maxFileSize -> 1102, p75FileSize -> 1102, p50FileSize -> 1102, numAddedBytes -> 1102)",,Databricks-Runtime/10.4.x-scala2.12
7,2022-07-28T06:31:03.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,MERGE,"Map(predicate -> (b.id = u.id), matchedPredicates -> [{""predicate"":""(u.type = 'update')"",""actionType"":""update""},{""predicate"":""(u.type = 'delete')"",""actionType"":""delete""}], notMatchedPredicates -> [{""predicate"":""(u.type = 'insert')"",""actionType"":""insert""}])",,List(2331746562402691),0725-045645-b5m629fz,6.0,WriteSerializable,False,"Map(numTargetRowsCopied -> 0, numTargetRowsDeleted -> 1, numTargetFilesAdded -> 2, executionTimeMs -> 1857, numTargetRowsInserted -> 1, scanTimeMs -> 987, numTargetRowsUpdated -> 1, numOutputRows -> 2, numTargetChangeFilesAdded -> 0, numSourceRows -> 4, numTargetFilesRemoved -> 2, rewriteTimeMs -> 760)",,Databricks-Runtime/10.4.x-scala2.12
6,2022-07-28T06:31:00.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,DELETE,"Map(predicate -> [""(spark_catalog.dbacademy_manujkumar_joshi_celebaltech_com_dewd_2_3.students.value > 6.0D)""])",,List(2331746562402691),0725-045645-b5m629fz,5.0,WriteSerializable,False,"Map(numRemovedFiles -> 2, numCopiedRows -> 0, numAddedChangeFiles -> 0, executionTimeMs -> 567, numDeletedRows -> 2, scanTimeMs -> 357, numAddedFiles -> 0, rewriteTimeMs -> 210)",,Databricks-Runtime/10.4.x-scala2.12
5,2022-07-28T06:30:58.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,UPDATE,"Map(predicate -> StartsWith(name#13160, T))",,List(2331746562402691),0725-045645-b5m629fz,4.0,WriteSerializable,False,"Map(numRemovedFiles -> 2, numCopiedRows -> 0, numAddedChangeFiles -> 0, executionTimeMs -> 1081, scanTimeMs -> 181, numAddedFiles -> 2, numUpdatedRows -> 2, rewriteTimeMs -> 899)",,Databricks-Runtime/10.4.x-scala2.12
4,2022-07-28T06:30:56.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,WRITE,"Map(mode -> Append, partitionBy -> [])",,List(2331746562402691),0725-045645-b5m629fz,3.0,WriteSerializable,True,"Map(numFiles -> 3, numOutputRows -> 3, numOutputBytes -> 3202)",,Databricks-Runtime/10.4.x-scala2.12
3,2022-07-28T06:30:54.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,WRITE,"Map(mode -> Append, partitionBy -> [])",,List(2331746562402691),0725-045645-b5m629fz,2.0,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 1, numOutputBytes -> 1063)",,Databricks-Runtime/10.4.x-scala2.12
2,2022-07-28T06:30:53.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,WRITE,"Map(mode -> Append, partitionBy -> [])",,List(2331746562402691),0725-045645-b5m629fz,1.0,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 1, numOutputBytes -> 1063)",,Databricks-Runtime/10.4.x-scala2.12
1,2022-07-28T06:30:51.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,WRITE,"Map(mode -> Append, partitionBy -> [])",,List(2331746562402691),0725-045645-b5m629fz,0.0,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 1, numOutputBytes -> 1055)",,Databricks-Runtime/10.4.x-scala2.12
0,2022-07-28T06:30:48.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,CREATE TABLE,"Map(isManaged -> true, description -> null, partitionBy -> [], properties -> {})",,List(2331746562402691),0725-045645-b5m629fz,,WriteSerializable,True,Map(),,Databricks-Runtime/10.4.x-scala2.12


As expected, **`OPTIMIZE`** created another version of our table, meaning that version 8 is our most current version.

Remember all of those extra data files that had been marked as removed in our transaction log? These provide us with the ability to query previous versions of our table.

These time travel queries can be performed by specifying either the integer version or a timestamp.

**NOTE**: In most cases, you'll use a timestamp to recreate data at a time of interest. For our demo we'll use version, as this is deterministic (whereas you may be running this demo at any time in the future).
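
A timestamp query still has to resolve to a version: the latest one committed at or before the requested time. The sketch below shows that lookup in plain Python, using the first few commit times from the history output above. The function **`version_as_of`** and the helper **`utc`** are hypothetical, and edge cases (such as a timestamp later than the newest commit) are not modeled.

```python
from bisect import bisect_right
from datetime import datetime, timezone


def version_as_of(commits: list, ts: datetime) -> int:
    """Return the latest version committed at or before ts.

    commits is a list of (version, commit_time) pairs sorted by commit_time.
    """
    times = [t for _, t in commits]
    idx = bisect_right(times, ts)
    if idx == 0:
        raise ValueError("timestamp is before the first commit")
    return commits[idx - 1][0]


def utc(hour: int, minute: int, second: int) -> datetime:
    """Build a UTC time on 2022-07-28, the date shown in the history output."""
    return datetime(2022, 7, 28, hour, minute, second, tzinfo=timezone.utc)


# Versions 0 through 4 with the commit times listed by DESCRIBE HISTORY.
commits = [
    (0, utc(6, 30, 48)),
    (1, utc(6, 30, 51)),
    (2, utc(6, 30, 53)),
    (3, utc(6, 30, 54)),
    (4, utc(6, 30, 56)),
]
print(version_as_of(commits, utc(6, 30, 55)))  # 3
```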

In [0]:
%sql
SELECT * 
FROM students VERSION AS OF 3

id,name,value
2,Omar,2.5
3,Elia,3.3
1,Yve,1.0


In [0]:
%sql
SELECT * 
FROM students VERSION AS OF 8

id,name,value
2,Omar,15.2
7,Blue,7.7
1,Yve,1.0
4,Ted,5.7


In [0]:
%sql
SELECT * 
FROM students VERSION AS OF 0

id,name,value


What's important to note about time travel is that we're not recreating a previous state of the table by undoing transactions against our current version; rather, we're just querying all those data files that were indicated as valid as of the specified version.
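
A minimal sketch of that idea, assuming the log is available as a mapping from version number to the files each commit added and removed. The function **`files_as_of`** and the file names are hypothetical; the log below only imitates the shape of our first few transactions (an empty table, three single-row inserts, a three-row insert, and an update that rewrites two files).

```python
def files_as_of(log: dict, version: int) -> set:
    """Replay commits 0..version and return the data files valid at that version."""
    if version not in log:
        raise KeyError(f"version {version} is not in the log")
    live = set()
    for v in sorted(log):
        if v > version:
            break
        live -= set(log[v].get("remove", []))
        live |= set(log[v].get("add", []))
    return live


# Hypothetical log: file names are placeholders for the real Parquet files.
log = {
    0: {},                                       # CREATE TABLE
    1: {"add": ["f1"]},                          # INSERT one row
    2: {"add": ["f2"]},                          # INSERT one row
    3: {"add": ["f3"]},                          # INSERT one row
    4: {"add": ["f4", "f5", "f6"]},              # INSERT three rows
    5: {"remove": ["f4", "f5"], "add": ["f7", "f8"]},  # UPDATE rewrites two files
}
print(sorted(files_as_of(log, 0)))  # []
print(sorted(files_as_of(log, 3)))  # ['f1', 'f2', 'f3']
print(sorted(files_as_of(log, 5)))  # ['f1', 'f2', 'f3', 'f6', 'f7', 'f8']
```

Querying version 4 still resolves to **`f4`** and **`f5`**, which is why those files must remain in storage for time travel to work.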

## Rollback Versions

Suppose you're typing up a query to manually delete some records from a table and you accidentally execute it in the following state.

In [0]:
%sql
DELETE FROM students

num_affected_rows
-1


In [0]:
%sql
DESCRIBE HISTORY students

version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata,engineInfo
9,2022-07-28T06:55:55.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,DELETE,Map(predicate -> []),,List(2331746562402691),0725-045645-b5m629fz,8.0,WriteSerializable,False,"Map(numRemovedFiles -> 1, numAddedChangeFiles -> 0, executionTimeMs -> 33, scanTimeMs -> 32, rewriteTimeMs -> 0)",,Databricks-Runtime/10.4.x-scala2.12
8,2022-07-28T06:52:38.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,OPTIMIZE,"Map(predicate -> [], zOrderBy -> [""id""], batchId -> 0, auto -> false)",,List(2331746562402691),0725-045645-b5m629fz,7.0,SnapshotIsolation,False,"Map(numRemovedFiles -> 4, numRemovedBytes -> 4236, p25FileSize -> 1102, minFileSize -> 1102, numAddedFiles -> 1, maxFileSize -> 1102, p75FileSize -> 1102, p50FileSize -> 1102, numAddedBytes -> 1102)",,Databricks-Runtime/10.4.x-scala2.12
7,2022-07-28T06:31:03.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,MERGE,"Map(predicate -> (b.id = u.id), matchedPredicates -> [{""predicate"":""(u.type = 'update')"",""actionType"":""update""},{""predicate"":""(u.type = 'delete')"",""actionType"":""delete""}], notMatchedPredicates -> [{""predicate"":""(u.type = 'insert')"",""actionType"":""insert""}])",,List(2331746562402691),0725-045645-b5m629fz,6.0,WriteSerializable,False,"Map(numTargetRowsCopied -> 0, numTargetRowsDeleted -> 1, numTargetFilesAdded -> 2, executionTimeMs -> 1857, numTargetRowsInserted -> 1, scanTimeMs -> 987, numTargetRowsUpdated -> 1, numOutputRows -> 2, numTargetChangeFilesAdded -> 0, numSourceRows -> 4, numTargetFilesRemoved -> 2, rewriteTimeMs -> 760)",,Databricks-Runtime/10.4.x-scala2.12
6,2022-07-28T06:31:00.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,DELETE,"Map(predicate -> [""(spark_catalog.dbacademy_manujkumar_joshi_celebaltech_com_dewd_2_3.students.value > 6.0D)""])",,List(2331746562402691),0725-045645-b5m629fz,5.0,WriteSerializable,False,"Map(numRemovedFiles -> 2, numCopiedRows -> 0, numAddedChangeFiles -> 0, executionTimeMs -> 567, numDeletedRows -> 2, scanTimeMs -> 357, numAddedFiles -> 0, rewriteTimeMs -> 210)",,Databricks-Runtime/10.4.x-scala2.12
5,2022-07-28T06:30:58.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,UPDATE,"Map(predicate -> StartsWith(name#13160, T))",,List(2331746562402691),0725-045645-b5m629fz,4.0,WriteSerializable,False,"Map(numRemovedFiles -> 2, numCopiedRows -> 0, numAddedChangeFiles -> 0, executionTimeMs -> 1081, scanTimeMs -> 181, numAddedFiles -> 2, numUpdatedRows -> 2, rewriteTimeMs -> 899)",,Databricks-Runtime/10.4.x-scala2.12
4,2022-07-28T06:30:56.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,WRITE,"Map(mode -> Append, partitionBy -> [])",,List(2331746562402691),0725-045645-b5m629fz,3.0,WriteSerializable,True,"Map(numFiles -> 3, numOutputRows -> 3, numOutputBytes -> 3202)",,Databricks-Runtime/10.4.x-scala2.12
3,2022-07-28T06:30:54.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,WRITE,"Map(mode -> Append, partitionBy -> [])",,List(2331746562402691),0725-045645-b5m629fz,2.0,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 1, numOutputBytes -> 1063)",,Databricks-Runtime/10.4.x-scala2.12
2,2022-07-28T06:30:53.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,WRITE,"Map(mode -> Append, partitionBy -> [])",,List(2331746562402691),0725-045645-b5m629fz,1.0,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 1, numOutputBytes -> 1063)",,Databricks-Runtime/10.4.x-scala2.12
1,2022-07-28T06:30:51.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,WRITE,"Map(mode -> Append, partitionBy -> [])",,List(2331746562402691),0725-045645-b5m629fz,0.0,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 1, numOutputBytes -> 1055)",,Databricks-Runtime/10.4.x-scala2.12
0,2022-07-28T06:30:48.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,CREATE TABLE,"Map(isManaged -> true, description -> null, partitionBy -> [], properties -> {})",,List(2331746562402691),0725-045645-b5m629fz,,WriteSerializable,True,Map(),,Databricks-Runtime/10.4.x-scala2.12


Note that a **`-1`** for the number of rows affected by a delete means that an entire directory of data has been removed.

Let's confirm this below.

In [0]:
%sql
SELECT * FROM students

id,name,value


Deleting all the records in your table is probably not a desired outcome. Luckily, we can simply roll back this commit.

In [0]:
%sql
RESTORE TABLE students TO VERSION AS OF 8 

table_size_after_restore,num_of_files_after_restore,num_removed_files,num_restored_files,removed_files_size,restored_files_size
1102,1,0,1,0,1102


In [0]:
%sql
DESCRIBE HISTORY students

version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata,engineInfo
10,2022-07-28T06:57:18.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,RESTORE,"Map(version -> 8, timestamp -> null)",,List(2331746562402691),0725-045645-b5m629fz,9.0,Serializable,False,"Map(numRestoredFiles -> 1, removedFilesSize -> 0, numRemovedFiles -> 0, restoredFilesSize -> 1102, numOfFilesAfterRestore -> 1, tableSizeAfterRestore -> 1102)",,Databricks-Runtime/10.4.x-scala2.12
9,2022-07-28T06:55:55.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,DELETE,Map(predicate -> []),,List(2331746562402691),0725-045645-b5m629fz,8.0,WriteSerializable,False,"Map(numRemovedFiles -> 1, numAddedChangeFiles -> 0, executionTimeMs -> 33, scanTimeMs -> 32, rewriteTimeMs -> 0)",,Databricks-Runtime/10.4.x-scala2.12
8,2022-07-28T06:52:38.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,OPTIMIZE,"Map(predicate -> [], zOrderBy -> [""id""], batchId -> 0, auto -> false)",,List(2331746562402691),0725-045645-b5m629fz,7.0,SnapshotIsolation,False,"Map(numRemovedFiles -> 4, numRemovedBytes -> 4236, p25FileSize -> 1102, minFileSize -> 1102, numAddedFiles -> 1, maxFileSize -> 1102, p75FileSize -> 1102, p50FileSize -> 1102, numAddedBytes -> 1102)",,Databricks-Runtime/10.4.x-scala2.12
7,2022-07-28T06:31:03.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,MERGE,"Map(predicate -> (b.id = u.id), matchedPredicates -> [{""predicate"":""(u.type = 'update')"",""actionType"":""update""},{""predicate"":""(u.type = 'delete')"",""actionType"":""delete""}], notMatchedPredicates -> [{""predicate"":""(u.type = 'insert')"",""actionType"":""insert""}])",,List(2331746562402691),0725-045645-b5m629fz,6.0,WriteSerializable,False,"Map(numTargetRowsCopied -> 0, numTargetRowsDeleted -> 1, numTargetFilesAdded -> 2, executionTimeMs -> 1857, numTargetRowsInserted -> 1, scanTimeMs -> 987, numTargetRowsUpdated -> 1, numOutputRows -> 2, numTargetChangeFilesAdded -> 0, numSourceRows -> 4, numTargetFilesRemoved -> 2, rewriteTimeMs -> 760)",,Databricks-Runtime/10.4.x-scala2.12
6,2022-07-28T06:31:00.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,DELETE,"Map(predicate -> [""(spark_catalog.dbacademy_manujkumar_joshi_celebaltech_com_dewd_2_3.students.value > 6.0D)""])",,List(2331746562402691),0725-045645-b5m629fz,5.0,WriteSerializable,False,"Map(numRemovedFiles -> 2, numCopiedRows -> 0, numAddedChangeFiles -> 0, executionTimeMs -> 567, numDeletedRows -> 2, scanTimeMs -> 357, numAddedFiles -> 0, rewriteTimeMs -> 210)",,Databricks-Runtime/10.4.x-scala2.12
5,2022-07-28T06:30:58.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,UPDATE,"Map(predicate -> StartsWith(name#13160, T))",,List(2331746562402691),0725-045645-b5m629fz,4.0,WriteSerializable,False,"Map(numRemovedFiles -> 2, numCopiedRows -> 0, numAddedChangeFiles -> 0, executionTimeMs -> 1081, scanTimeMs -> 181, numAddedFiles -> 2, numUpdatedRows -> 2, rewriteTimeMs -> 899)",,Databricks-Runtime/10.4.x-scala2.12
4,2022-07-28T06:30:56.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,WRITE,"Map(mode -> Append, partitionBy -> [])",,List(2331746562402691),0725-045645-b5m629fz,3.0,WriteSerializable,True,"Map(numFiles -> 3, numOutputRows -> 3, numOutputBytes -> 3202)",,Databricks-Runtime/10.4.x-scala2.12
3,2022-07-28T06:30:54.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,WRITE,"Map(mode -> Append, partitionBy -> [])",,List(2331746562402691),0725-045645-b5m629fz,2.0,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 1, numOutputBytes -> 1063)",,Databricks-Runtime/10.4.x-scala2.12
2,2022-07-28T06:30:53.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,WRITE,"Map(mode -> Append, partitionBy -> [])",,List(2331746562402691),0725-045645-b5m629fz,1.0,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 1, numOutputBytes -> 1063)",,Databricks-Runtime/10.4.x-scala2.12
1,2022-07-28T06:30:51.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,WRITE,"Map(mode -> Append, partitionBy -> [])",,List(2331746562402691),0725-045645-b5m629fz,0.0,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 1, numOutputBytes -> 1055)",,Databricks-Runtime/10.4.x-scala2.12


Note that a **`RESTORE`** <a href="https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-restore.html" target="_blank">command</a> is recorded as a transaction; you won't be able to completely hide the fact that you accidentally deleted all the records in the table, but you will be able to undo the operation and bring your table back to a desired state.
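
One way to picture why a restore is a new transaction rather than an erasure: it can be expressed as the set of files to add back and the set to drop, relative to the current version. The sketch below is an illustration under that assumption, with a hypothetical function **`plan_restore`** and a placeholder file name; it mirrors the metrics above, where one file was restored and none were removed.

```python
def plan_restore(current_files: set, target_files: set) -> dict:
    """Compute the actions a restore commit would need to reach the target snapshot."""
    return {
        "add": sorted(target_files - current_files),     # files to bring back
        "remove": sorted(current_files - target_files),  # files to drop
    }


current = set()                  # version 9: every record was deleted
target = {"compacted.parquet"}   # version 8: one file written by OPTIMIZE
plan = plan_restore(current, target)
print(plan)  # {'add': ['compacted.parquet'], 'remove': []}
```

Because the result is written as another commit, versions 8 and 9 both stay in the history.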

## Cleaning Up Stale Files

#### Databricks will automatically clean up stale log files (older than 30 days by default) in Delta Lake tables; stale data files remain until **`VACUUM`** removes them.

While Delta Lake versioning and time travel are great for querying recent versions and rolling back queries, keeping the data files for all versions of large production tables around indefinitely is very expensive (and can lead to compliance issues if PII is present).

#### If you wish to manually purge old data files, this can be performed with the **`VACUUM`** operation.

Uncomment the following cell and execute it with a retention of **`0 HOURS`** to keep only the current version:

In [0]:
%sql
--VACUUM students RETAIN 0 HOURS

#### By default, **`VACUUM`** will prevent you from deleting files less than 7 days old, just to ensure that no long-running operations are still referencing any of the files to be deleted.

If you run **`VACUUM`** on a Delta table, you lose the ability to time travel back to a version older than the specified data retention period.

In our demos, you may see Databricks executing code that specifies a retention of **`0 HOURS`**. This is simply to demonstrate the feature and is not typically done in production.
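
The retention rule can be sketched as a filter over the files in the table directory: a file is a candidate for deletion only if the current version no longer references it and it is older than the retention cutoff. This is a simplified illustration, not the actual **`VACUUM`** implementation; the function **`vacuum_candidates`**, the file names, and the use of a single modification time per file are assumptions of the sketch.

```python
from datetime import datetime, timedelta, timezone


def vacuum_candidates(all_files: dict, live_files: set,
                      now: datetime, retain: timedelta) -> list:
    """List files that are not in the current snapshot and are older than the cutoff."""
    cutoff = now - retain
    return sorted(
        path for path, modified in all_files.items()
        if path not in live_files and modified < cutoff
    )


now = datetime(2022, 7, 28, 7, 0, 0, tzinfo=timezone.utc)
# Hypothetical directory contents: path -> last modification time.
all_files = {
    "live.parquet": now - timedelta(days=10),
    "old_stale.parquet": now - timedelta(days=10),
    "recent_stale.parquet": now - timedelta(hours=1),
}
live = {"live.parquet"}

print(vacuum_candidates(all_files, live, now, timedelta(days=7)))
# ['old_stale.parquet']
print(vacuum_candidates(all_files, live, now, timedelta(hours=0)))
# ['old_stale.parquet', 'recent_stale.parquet']
```

With a retention of zero, every file outside the current version becomes a candidate, which is why previous versions stop being queryable afterward.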

### In the following cell, we:
1. Turn off a check to prevent premature deletion of data files
1. Make sure that logging of **`VACUUM`** commands is enabled
1. Use the **`DRY RUN`** version of vacuum to print out all files to be deleted

In [0]:
%sql
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
SET spark.databricks.delta.vacuum.logging.enabled = true;

VACUUM students RETAIN 0 HOURS DRY RUN

path
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students/part-00000-21d5bbc7-b8bf-48cd-aecb-d138c839db67-c000.snappy.parquet
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students/part-00000-3987f4e9-b6db-4109-b03a-28c33d338105-c000.snappy.parquet
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students/part-00000-663497bc-8864-4ba1-a1ce-0b90ff33c0cd-c000.snappy.parquet
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students/part-00000-6bf3ead7-0848-482e-b847-afbda564759a-c000.snappy.parquet
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students/part-00000-7f8a4fda-519f-4de1-b2ff-99d82eceff6d-c000.snappy.parquet
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students/part-00000-c300102f-3db9-4802-a1f5-1f9532cfcb04-c000.snappy.parquet
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students/part-00001-0d8b2671-7522-47c3-bd54-3c3a24180bb2-c000.snappy.parquet
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students/part-00001-5f7b092f-ce48-45fe-95e4-89fd6f7bc73b-c000.snappy.parquet
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students/part-00002-1e38a2b2-5de2-4872-94e9-f8e47082087d-c000.snappy.parquet
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students/part-00002-d583035e-da18-4079-8303-ed6c7b32d644-c000.snappy.parquet


By running **`VACUUM`** and deleting the 10 files above, we will permanently remove access to versions of the table that require these files to materialize.

In [0]:
%sql
VACUUM students RETAIN 0 HOURS

path
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students


In [0]:
%sql
DESCRIBE HISTORY students

version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata,engineInfo
12,2022-07-28T07:03:08.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,VACUUM END,Map(status -> COMPLETED),,List(2331746562402691),0725-045645-b5m629fz,11.0,SnapshotIsolation,True,"Map(numDeletedFiles -> 10, numVacuumedDirectories -> 1)",,Databricks-Runtime/10.4.x-scala2.12
11,2022-07-28T07:03:06.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,VACUUM START,"Map(retentionCheckEnabled -> false, specifiedRetentionMillis -> 0, defaultRetentionMillis -> 604800000)",,List(2331746562402691),0725-045645-b5m629fz,10.0,SnapshotIsolation,True,Map(numFilesToDelete -> 10),,Databricks-Runtime/10.4.x-scala2.12
10,2022-07-28T06:57:18.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,RESTORE,"Map(version -> 8, timestamp -> null)",,List(2331746562402691),0725-045645-b5m629fz,9.0,Serializable,False,"Map(numRestoredFiles -> 1, removedFilesSize -> 0, numRemovedFiles -> 0, restoredFilesSize -> 1102, numOfFilesAfterRestore -> 1, tableSizeAfterRestore -> 1102)",,Databricks-Runtime/10.4.x-scala2.12
9,2022-07-28T06:55:55.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,DELETE,Map(predicate -> []),,List(2331746562402691),0725-045645-b5m629fz,8.0,WriteSerializable,False,"Map(numRemovedFiles -> 1, numAddedChangeFiles -> 0, executionTimeMs -> 33, scanTimeMs -> 32, rewriteTimeMs -> 0)",,Databricks-Runtime/10.4.x-scala2.12
8,2022-07-28T06:52:38.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,OPTIMIZE,"Map(predicate -> [], zOrderBy -> [""id""], batchId -> 0, auto -> false)",,List(2331746562402691),0725-045645-b5m629fz,7.0,SnapshotIsolation,False,"Map(numRemovedFiles -> 4, numRemovedBytes -> 4236, p25FileSize -> 1102, minFileSize -> 1102, numAddedFiles -> 1, maxFileSize -> 1102, p75FileSize -> 1102, p50FileSize -> 1102, numAddedBytes -> 1102)",,Databricks-Runtime/10.4.x-scala2.12
7,2022-07-28T06:31:03.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,MERGE,"Map(predicate -> (b.id = u.id), matchedPredicates -> [{""predicate"":""(u.type = 'update')"",""actionType"":""update""},{""predicate"":""(u.type = 'delete')"",""actionType"":""delete""}], notMatchedPredicates -> [{""predicate"":""(u.type = 'insert')"",""actionType"":""insert""}])",,List(2331746562402691),0725-045645-b5m629fz,6.0,WriteSerializable,False,"Map(numTargetRowsCopied -> 0, numTargetRowsDeleted -> 1, numTargetFilesAdded -> 2, executionTimeMs -> 1857, numTargetRowsInserted -> 1, scanTimeMs -> 987, numTargetRowsUpdated -> 1, numOutputRows -> 2, numTargetChangeFilesAdded -> 0, numSourceRows -> 4, numTargetFilesRemoved -> 2, rewriteTimeMs -> 760)",,Databricks-Runtime/10.4.x-scala2.12
6,2022-07-28T06:31:00.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,DELETE,"Map(predicate -> [""(spark_catalog.dbacademy_manujkumar_joshi_celebaltech_com_dewd_2_3.students.value > 6.0D)""])",,List(2331746562402691),0725-045645-b5m629fz,5.0,WriteSerializable,False,"Map(numRemovedFiles -> 2, numCopiedRows -> 0, numAddedChangeFiles -> 0, executionTimeMs -> 567, numDeletedRows -> 2, scanTimeMs -> 357, numAddedFiles -> 0, rewriteTimeMs -> 210)",,Databricks-Runtime/10.4.x-scala2.12
5,2022-07-28T06:30:58.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,UPDATE,"Map(predicate -> StartsWith(name#13160, T))",,List(2331746562402691),0725-045645-b5m629fz,4.0,WriteSerializable,False,"Map(numRemovedFiles -> 2, numCopiedRows -> 0, numAddedChangeFiles -> 0, executionTimeMs -> 1081, scanTimeMs -> 181, numAddedFiles -> 2, numUpdatedRows -> 2, rewriteTimeMs -> 899)",,Databricks-Runtime/10.4.x-scala2.12
4,2022-07-28T06:30:56.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,WRITE,"Map(mode -> Append, partitionBy -> [])",,List(2331746562402691),0725-045645-b5m629fz,3.0,WriteSerializable,True,"Map(numFiles -> 3, numOutputRows -> 3, numOutputBytes -> 3202)",,Databricks-Runtime/10.4.x-scala2.12
3,2022-07-28T06:30:54.000+0000,6997591375752473,manujkumar.joshi@celebaltech.com,WRITE,"Map(mode -> Append, partitionBy -> [])",,List(2331746562402691),0725-045645-b5m629fz,2.0,WriteSerializable,True,"Map(numFiles -> 1, numOutputRows -> 1, numOutputBytes -> 1063)",,Databricks-Runtime/10.4.x-scala2.12
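
The two newest rows record the vacuum itself. As a small aid to reading them, the sketch below converts the retention values from the **`VACUUM START`** row into hours and compares the file counts reported by **`VACUUM START`** and **`VACUUM END`**. The numbers are copied by hand from the output above; the variable names are only illustrative, and the code does not query the table.

```python
MS_PER_HOUR = 60 * 60 * 1000

# Values copied from the VACUUM START row (version 11)
specified_retention_ms = 0
default_retention_ms = 604_800_000
num_files_to_delete = 10

# Value copied from the VACUUM END row (version 12)
num_deleted_files = 10

default_retention_hours = default_retention_ms / MS_PER_HOUR
specified_retention_hours = specified_retention_ms / MS_PER_HOUR

print(default_retention_hours)       # 168.0 (hours)
print(default_retention_hours / 24)  # 7.0 (days)
print(specified_retention_hours)     # 0.0 (hours), from RETAIN 0 HOURS

# Every file planned for deletion was actually deleted
print(num_files_to_delete == num_deleted_files)  # True
```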


Check the table directory to confirm that the files have been deleted.

In [0]:
%python
display(dbutils.fs.ls(f"{DA.paths.user_db}/students"))

path,name,size,modificationTime
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students/_delta_log/,_delta_log/,0,1658991789000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/2.3/2_3.db/students/part-00000-e66af923-bcb2-447e-82b7-9b3e2edad974-c000.snappy.parquet,part-00000-e66af923-bcb2-447e-82b7-9b3e2edad974-c000.snappy.parquet,1102,1658991158000
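
One way to sanity-check this listing is to compare it with the **`RESTORE`** row (version 10) in the table history, which reported **`numOfFilesAfterRestore -> 1`** and **`tableSizeAfterRestore -> 1102`**. The sketch below does that comparison in plain Python using values copied by hand from the outputs in this notebook; the variable names are illustrative, and nothing here reads from DBFS.

```python
# (name, size in bytes) pairs copied by hand from the directory listing above
listing = [
    ("_delta_log/", 0),
    ("part-00000-e66af923-bcb2-447e-82b7-9b3e2edad974-c000.snappy.parquet", 1102),
]

# Metrics copied from the RESTORE row (version 10) of DESCRIBE HISTORY
num_files_after_restore = 1
table_size_after_restore = 1102

# Keep only data files; the transaction log directory is not a data file
data_files = [(name, size) for name, size in listing if name.endswith(".parquet")]
total_bytes = sum(size for _, size in data_files)

print(len(data_files) == num_files_after_restore)  # True
print(total_bytes == table_size_after_restore)     # True
```

If both checks hold, the directory contains exactly the data needed for the current version of the table and nothing else.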


Run the following cell to delete the tables and files associated with this lesson.

In [0]:
%python
DA.cleanup()

-sandbox
&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>