# Advanced Delta Lake Features

## Time Travel

In [0]:
USE CATALOG hive_metastore

In [0]:
DESCRIBE HISTORY employees

In Delta Lake, previous versions of a table can easily be queried. This **time travel** feature is possible because the extra data files that were marked as removed in the transaction log are still kept in the table directory.

To retrieve the data as it was before the update operation, query the table by version number:

In [0]:
SELECT * FROM employees VERSION AS OF 4

The same query can be written with the `@v` shorthand:

In [0]:
SELECT * FROM employees@v4
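
Besides version numbers, Delta Lake also accepts a timestamp for time travel. The following is a minimal sketch; the timestamp is a placeholder rather than a value taken from this table, and it would need to fall within the range shown in the `timestamp` column of `DESCRIBE HISTORY`:

```sql
-- Placeholder timestamp: replace it with a value from the table history
SELECT * FROM employees TIMESTAMP AS OF '2024-01-01 00:00:00';
```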

Now, let's imagine the data has been deleted and needs to be recovered.

In [0]:
DELETE FROM employees;

SELECT * FROM employees

The `RESTORE TABLE` command rolls the table back to a specific version:

In [0]:
RESTORE TABLE employees TO VERSION AS OF 5

The data has been successfully restored.

In [0]:
SELECT * FROM employees

But, what really happened on the table?

In [0]:
DESCRIBE HISTORY employees

The `RESTORE` command has been recorded as a new transaction in the table history; earlier versions, including the `DELETE`, are still listed. This is what makes **time travel** such a powerful feature.
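
When only the latest entry is of interest, the history output can be limited. A small sketch:

```sql
-- Show only the most recent transaction on the table
DESCRIBE HISTORY employees LIMIT 1;
```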

## Compaction & ZORDER Indexing

In [0]:
DESCRIBE DETAIL employees

The Delta table has 4 small data files. Having many small data files negatively affects the performance of a Delta table. To solve this issue, the `OPTIMIZE` command can be used:

In [0]:
OPTIMIZE employees
ZORDER BY (id)

The 4 data files have been marked as removed and a new file that combines them has been added. Additionally, `ZORDER` indexing speeds up data retrieval when filtering on the specified columns by co-locating rows with similar values in the same data files.

In this example, `ZORDER` was applied on the `id` column. However, on such a small dataset, it doesn't provide any benefit.
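
The kind of query that Z-ordering targets is a selective filter on the Z-ordered column. A sketch, with an arbitrary example value for `id`:

```sql
-- Filter on the Z-ordered column; the value 3 is an arbitrary example
SELECT * FROM employees WHERE id = 3;
```

On a large table, a filter like this lets Delta Lake skip data files whose statistics for `id` rule out the requested value.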

The outcome of the `OPTIMIZE` operation can be confirmed by running the following command:

In [0]:
DESCRIBE DETAIL employees

Now the number of files in the current version is just 1. How has the `OPTIMIZE` operation been recorded in the table history?

In [0]:
DESCRIBE HISTORY employees

As expected, a new version of the table has been created. The data files in the table directory can also be checked:

In [0]:
%fs ls 'dbfs:/user/hive/warehouse/employees'

There are 7 data files, but the current table version references only one file after the `OPTIMIZE` operation. That means the other data files are unused and can be cleaned up.

## Vacuum

In [0]:
VACUUM employees

In [0]:
%fs ls 'dbfs:/user/hive/warehouse/employees'

After the `VACUUM` operation, the data files are still there because no retention period was specified, so the default applies. This retention period is **7 days by default**. That means `VACUUM` will not delete files less than 7 days old, to ensure that no long-running operations are still referencing any of the files to be deleted.

Running `VACUUM` with a retention period of 0 hours (to keep only the current version) fails, because the requested period is shorter than the default 7-day threshold.
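
Before deleting anything, it can be useful to preview what `VACUUM` would remove. A short sketch using `DRY RUN`, followed by the default retention written out explicitly (168 hours = 7 days):

```sql
-- List the files that would be deleted, without deleting them
VACUUM employees DRY RUN;

-- Equivalent to the default behavior, with the 7-day retention spelled out
VACUUM employees RETAIN 168 HOURS;
```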

In [0]:
VACUUM employees RETAIN 0 HOURS

In this demo, a workaround is used for demonstration purposes only: turning off the retention duration check.

**NOTE: do not do this in production**.

In [0]:
SET spark.databricks.delta.retentionDurationCheck.enabled=false;

In [0]:
VACUUM employees RETAIN 0 HOURS

Now the table directory should show two entries:
* Transaction log directory (`_delta_log`)
* Data file

In [0]:
%fs ls 'dbfs:/user/hive/warehouse/employees'
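
Since the retention check was disabled only for this demo, it seems sensible to switch it back on once the zero-hour `VACUUM` has run. A minimal sketch, restoring the setting to its default value:

```sql
-- Re-enable the safety check that rejects retention periods shorter than 7 days
SET spark.databricks.delta.retentionDurationCheck.enabled=true;
```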

Older versions can no longer be accessed. This can be confirmed by querying an old table version, which now fails because its data files no longer exist:

In [0]:
SELECT * FROM employees@v1

Finally, the table and its data can be permanently deleted from the Lakehouse:

In [0]:
DROP TABLE employees

In [0]:
SELECT * FROM employees

In [0]:
%fs ls 'dbfs:/user/hive/warehouse/employees'

The table no longer exists, so it has been successfully deleted.
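
The data directory is removed here presumably because `employees` is a managed table (its path sits under the default warehouse location); for an external table, `DROP TABLE` removes only the metadata. If the notebook is meant to be rerun, a drop that tolerates a missing table may be more convenient. A small sketch:

```sql
-- Does not fail if the table has already been dropped
DROP TABLE IF EXISTS employees;
```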