# Using Clone with Delta Lake

In DBR 7.2+, Delta Lake provides native support for copying existing tables with `CLONE`. This notebook will explore both deep and shallow clones. The docs for this feature are [here](https://docs.databricks.com/delta/delta-utility.html#clone-a-delta-table); full syntax docs are available [here](https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-clone.html).

## Learning Objectives
By the end of this lesson, you should be able to:
* Use deep clones to create full incremental backups of tables
* Use shallow clones to create development datasets
* Describe expected behavior after performing common database operations on source and clone tables

## Configure the environment
The following cell will create a database and source table that we'll use in this lesson, alongside some variables we'll use to control file locations.

In [0]:
%run ./Includes/setup $mode="reset"


username: mariapastora.alvarez@bosonit.com
userhome: dbfs:/user/mariapastora.alvarez@bosonit.com/clones
database: clones_mariapastora_alvarez_bosonit_com_db


## Look at the production table details
The production table we'll be using as our source is named `sensors_prod`.

Use the following cell to explore the table history. Note that 4 total transactions have been run to create and load data into this table.

In [0]:
%sql
DESCRIBE HISTORY sensors_prod

version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata,engineInfo
3,2022-06-06T08:13:43.000+0000,3279574748515926,mariapastora.alvarez@bosonit.com,WRITE,"Map(mode -> Append, partitionBy -> [])",,List(4138986985048162),0606-060456-wpn3v13q,2.0,WriteSerializable,True,"Map(numFiles -> 8, numOutputRows -> 1000, numOutputBytes -> 28042)",,Databricks-Runtime/10.4.x-scala2.12
2,2022-06-06T08:13:40.000+0000,3279574748515926,mariapastora.alvarez@bosonit.com,WRITE,"Map(mode -> Append, partitionBy -> [])",,List(4138986985048162),0606-060456-wpn3v13q,1.0,WriteSerializable,True,"Map(numFiles -> 8, numOutputRows -> 1000, numOutputBytes -> 28005)",,Databricks-Runtime/10.4.x-scala2.12
1,2022-06-06T08:13:37.000+0000,3279574748515926,mariapastora.alvarez@bosonit.com,WRITE,"Map(mode -> Append, partitionBy -> [])",,List(4138986985048162),0606-060456-wpn3v13q,0.0,WriteSerializable,True,"Map(numFiles -> 8, numOutputRows -> 1000, numOutputBytes -> 27979)",,Databricks-Runtime/10.4.x-scala2.12
0,2022-06-06T08:13:30.000+0000,3279574748515926,mariapastora.alvarez@bosonit.com,CREATE TABLE,"Map(isManaged -> false, description -> null, partitionBy -> [], properties -> {})",,List(4138986985048162),0606-060456-wpn3v13q,,WriteSerializable,True,Map(),,Databricks-Runtime/10.4.x-scala2.12


Explore the table description to discover the schema and additional details. Note that comments have been added to describe each data field.

In [0]:
%sql
DESCRIBE FORMATTED sensors_prod

col_name,data_type,comment
time,bigint,event timestamp in ms since epoch
device_id,bigint,"device IDs, integer only"
sensor_type,string,sensor type identifier; single upper case letter
signal_strength,double,decimal value between 0 and 1
,,
# Partitioning,,
Not partitioned,,
,,
# Detailed Table Information,,
Catalog,spark_catalog,


The helper function `check_files` was defined to accept a table name and return the count of underlying data files (as well as list the content of the table directory).

Recall that all Delta tables comprise:
1. Data files stored in Parquet format
1. Transaction logs stored in the `_delta_log` directory

The table name we're interacting with in the metastore is just a pointer to these underlying assets.
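
As a rough illustration of the split between data files and the transaction log, the sketch below counts the Parquet files in a directory listing. The function `count_data_files` and the sample file names are hypothetical; this is not the implementation of `check_files`, which is defined in the setup notebook and is not shown here.

```python
def count_data_files(names):
    """Count entries that look like Parquet data files.

    `names` is a plain list of file or directory names, such as the
    `name` fields of a directory listing.
    """
    return sum(1 for name in names if name.endswith(".parquet"))


# Hypothetical listing of a Delta table directory.
listing = [
    "_delta_log/",
    "part-00000-aaaa-c000.snappy.parquet",
    "part-00001-bbbb-c000.snappy.parquet",
]

print(count_data_files(listing))
```

The `_delta_log/` entry is skipped because it holds transaction logs rather than data.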

In [0]:
check_files("sensors_prod")

Count of all data files in sensors_prod: 24

Out[6]: [FileInfo(path='dbfs:/user/mariapastora.alvarez@bosonit.com/clones/prod/sensors/_delta_log/', name='_delta_log/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/user/mariapastora.alvarez@bosonit.com/clones/prod/sensors/part-00000-57d76ac4-f0fe-45b8-a588-fc6bbab87f9d-c000.snappy.parquet', name='part-00000-57d76ac4-f0fe-45b8-a588-fc6bbab87f9d-c000.snappy.parquet', size=3473, modificationTime=1654503216000),
 FileInfo(path='dbfs:/user/mariapastora.alvarez@bosonit.com/clones/prod/sensors/part-00000-d1d4c9da-573b-44ea-9550-2575a07fbd79-c000.snappy.parquet', name='part-00000-d1d4c9da-573b-44ea-9550-2575a07fbd79-c000.snappy.parquet', size=3482, modificationTime=1654503223000),
 FileInfo(path='dbfs:/user/mariapastora.alvarez@bosonit.com/clones/prod/sensors/part-00000-d7aeab01-6c6d-4d97-a22e-c505f7d083d5-c000.snappy.parquet', name='part-00000-d7aeab01-6c6d-4d97-a22e-c505f7d083d5-c000.snappy.parquet', size=3509, modificationTime=1654503219

## Create a backup of your dataset with deep clone

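Deep clone will copy all data and metadata files from your source table to the location specified in the `LOCATION` clause and register the result under the target table name.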

In [0]:
%sql
CREATE OR REPLACE TABLE sensors_backup 
DEEP CLONE sensors_prod
LOCATION '${c.userhome}/backup/sensors'

source_table_size,source_num_of_files,num_removed_files,num_copied_files,removed_files_size,copied_files_size
84026,24,0,24,0,84026


You'll recall that our `sensors_prod` table had 4 versions associated with it. The clone operation created version 0 of the cloned table. 

The `operationParameters` field indicates the `sourceVersion` that was cloned.

The `operationMetrics` field will provide information about the files copied during this transaction.

In [0]:
%sql
DESCRIBE HISTORY sensors_backup

version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata,engineInfo
0,2022-06-06T08:16:36.000+0000,3279574748515926,mariapastora.alvarez@bosonit.com,CLONE,"Map(source -> clones_mariapastora_alvarez_bosonit_com_db.sensors_prod, sourceVersion -> 3, isShallow -> false)",,List(4138986985048162),0606-060456-wpn3v13q,-1,Serializable,False,"Map(removedFilesSize -> 0, numRemovedFiles -> 0, sourceTableSize -> 84026, numCopiedFiles -> 24, copiedFilesSize -> 84026, sourceNumOfFiles -> 24)",,Databricks-Runtime/10.4.x-scala2.12
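
To make the two fields easier to inspect, here is a minimal sketch that parses the rendered `Map(...)` text from the output above into Python dictionaries. It assumes the exported text form shown in this notebook; in a live session these columns are typically returned as maps rather than strings, so no parsing would be needed. The helper name `parse_map` is hypothetical.

```python
def parse_map(text):
    """Parse a string like 'Map(a -> 1, b -> 2)' into a dict of strings."""
    inner = text.strip()
    if inner.startswith("Map(") and inner.endswith(")"):
        inner = inner[4:-1]
    result = {}
    for pair in inner.split(", "):
        if not pair:
            continue  # handles the empty 'Map()' case
        key, _, value = pair.partition(" -> ")
        result[key] = value
    return result


# Strings copied from the DESCRIBE HISTORY output for version 0 of the backup.
metrics = parse_map(
    "Map(removedFilesSize -> 0, numRemovedFiles -> 0, sourceTableSize -> 84026, "
    "numCopiedFiles -> 24, copiedFilesSize -> 84026, sourceNumOfFiles -> 24)"
)
params = parse_map(
    "Map(source -> clones_mariapastora_alvarez_bosonit_com_db.sensors_prod, "
    "sourceVersion -> 3, isShallow -> false)"
)

print(params["sourceVersion"], metrics["numCopiedFiles"], metrics["sourceNumOfFiles"])
```

For this first deep clone, the number of copied files equals the number of source files, and the source version is 3.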


Metadata like comments will also be cloned.

In [0]:
%sql
DESCRIBE FORMATTED sensors_backup

col_name,data_type,comment
time,bigint,event timestamp in ms since epoch
device_id,bigint,"device IDs, integer only"
sensor_type,string,sensor type identifier; single upper case letter
signal_strength,double,decimal value between 0 and 1
,,
# Partitioning,,
Not partitioned,,
,,
# Detailed Table Information,,
Catalog,spark_catalog,


## Incremental Cloning

If you examine the files in your backup table, you'll see that you have the same number of files as your source table. Upon closer examination, you'll note that file names and sizes have also been preserved by the clone. 

This allows Delta Lake to incrementally apply changes to the backup table.

In [0]:
check_files("sensors_backup")

Count of all data files in sensors_backup: 24

Out[7]: [FileInfo(path='dbfs:/user/mariapastora.alvarez@bosonit.com/clones/backup/sensors/_delta_log/', name='_delta_log/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/user/mariapastora.alvarez@bosonit.com/clones/backup/sensors/part-00000-57d76ac4-f0fe-45b8-a588-fc6bbab87f9d-c000.snappy.parquet', name='part-00000-57d76ac4-f0fe-45b8-a588-fc6bbab87f9d-c000.snappy.parquet', size=3473, modificationTime=1654503395000),
 FileInfo(path='dbfs:/user/mariapastora.alvarez@bosonit.com/clones/backup/sensors/part-00000-d1d4c9da-573b-44ea-9550-2575a07fbd79-c000.snappy.parquet', name='part-00000-d1d4c9da-573b-44ea-9550-2575a07fbd79-c000.snappy.parquet', size=3482, modificationTime=1654503395000),
 FileInfo(path='dbfs:/user/mariapastora.alvarez@bosonit.com/clones/backup/sensors/part-00000-d7aeab01-6c6d-4d97-a22e-c505f7d083d5-c000.snappy.parquet', name='part-00000-d7aeab01-6c6d-4d97-a22e-c505f7d083d5-c000.snappy.parquet', size=3509, modificationTime=

To see incremental clone in action, begin by committing a transaction to the `sensors_prod` table. Here, we'll delete all those records where `sensor_type` is `C`.

Remember that Delta Lake manages changes at the file level, so any file containing a matching record will be rewritten.

In [0]:
%sql
DELETE FROM sensors_prod WHERE sensor_type = 'C'

num_affected_rows
750
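
The file-level behavior described above can be pictured with a toy model. This is only a sketch under simplifying assumptions: files are plain dictionaries of records, the function `delete_where` and all file names are hypothetical, and each rewritten file maps one-to-one to an original file, which a real engine does not guarantee.

```python
def delete_where(files, predicate):
    """Simulate a file-level delete.

    `files` maps a file name to its list of records. Files with no matching
    record are kept untouched; files with at least one match are replaced by
    a rewritten file that holds only the surviving records.
    """
    result = {}
    rewritten = 0
    for name, records in files.items():
        if any(predicate(r) for r in records):
            survivors = [r for r in records if not predicate(r)]
            rewritten += 1
            if survivors:
                result[f"rewritten-{name}"] = survivors
        else:
            result[name] = records
    return result, rewritten


files = {
    "part-0": [{"sensor_type": "A"}, {"sensor_type": "C"}],
    "part-1": [{"sensor_type": "B"}, {"sensor_type": "D"}],
}
after, rewritten = delete_where(files, lambda r: r["sensor_type"] == "C")
print(sorted(after), rewritten)
```

Only `part-0` contains a matching record, so it is the only file replaced; `part-1` is carried over unchanged.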


When we re-execute our deep clone command, we only copy those files that were written during our most recent transaction.

In [0]:
%sql
CREATE OR REPLACE TABLE sensors_backup 
DEEP CLONE sensors_prod
LOCATION '${c.userhome}/backup/sensors'

source_table_size,source_num_of_files,num_removed_files,num_copied_files,removed_files_size,copied_files_size
46740,8,24,8,84026,46740
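
One way to reason about these metrics is as a comparison of file names between source and backup. The sketch below is a simplified model based on the observation that clone preserves file names; it is not Delta Lake's internal algorithm, and the function `plan_incremental_clone` and the file names are hypothetical.

```python
def plan_incremental_clone(source_files, backup_files):
    """Return (to_copy, to_remove) as sets of file names."""
    to_copy = source_files - backup_files      # new in the source
    to_remove = backup_files - source_files    # no longer in the source
    return to_copy, to_remove


# Hypothetical names: the backup holds the 24 original files, while the
# source now consists of 8 files written by the delete.
backup = {f"old-{i}" for i in range(24)}
source = {f"new-{i}" for i in range(8)}

to_copy, to_remove = plan_incremental_clone(source, backup)
print(len(to_copy), len(to_remove))
```

The counts match `num_copied_files` (8) and `num_removed_files` (24) in the output above. Note that "removed" here means removed from the current version of the backup table; earlier versions of the backup remain in its history.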


We can review our history to confirm this.

In [0]:
%sql
DESCRIBE HISTORY sensors_backup

version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata,engineInfo
1,2022-06-06T08:41:00.000+0000,3279574748515926,mariapastora.alvarez@bosonit.com,CLONE,"Map(source -> clones_mariapastora_alvarez_bosonit_com_db.sensors_prod, sourceVersion -> 4, isShallow -> false)",,List(4138986985048162),0606-060456-wpn3v13q,0,Serializable,False,"Map(removedFilesSize -> 84026, numRemovedFiles -> 24, sourceTableSize -> 46740, numCopiedFiles -> 8, copiedFilesSize -> 46740, sourceNumOfFiles -> 8)",,Databricks-Runtime/10.4.x-scala2.12
0,2022-06-06T08:16:36.000+0000,3279574748515926,mariapastora.alvarez@bosonit.com,CLONE,"Map(source -> clones_mariapastora_alvarez_bosonit_com_db.sensors_prod, sourceVersion -> 3, isShallow -> false)",,List(4138986985048162),0606-060456-wpn3v13q,-1,Serializable,False,"Map(removedFilesSize -> 0, numRemovedFiles -> 0, sourceTableSize -> 84026, numCopiedFiles -> 24, copiedFilesSize -> 84026, sourceNumOfFiles -> 24)",,Databricks-Runtime/10.4.x-scala2.12


## Creating Development Datasets with Shallow Clone

Whereas deep clone copies both data and metadata, shallow clone just copies the metadata and creates a pointer to the existing data files.

Note that the cloned table will have read-only permissions on the source data files. This makes it easy to create development datasets using a production dataset without fear of table corruption.

Here, we'll also specify using version 2 of our source production table.

In [0]:
%sql
CREATE OR REPLACE TABLE sensors_dev
SHALLOW CLONE sensors_prod@v2
LOCATION '${c.userhome}/dev/sensors'

source_table_size,source_num_of_files,num_removed_files,num_copied_files,removed_files_size,copied_files_size
55984,16,0,0,0,0
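
When clone statements need to be generated for several tables or versions, they can be assembled as strings. The helper below is a minimal sketch using only the syntax shown in this notebook; `clone_statement` and the `/tmp/dev/sensors` location are hypothetical, and the inputs are interpolated without escaping, so it should only be used with trusted names.

```python
def clone_statement(target, source, location, shallow=False, version=None):
    """Build a CREATE OR REPLACE ... CLONE statement as a string."""
    kind = "SHALLOW" if shallow else "DEEP"
    # Append the @v<N> suffix only when a specific source version is requested.
    source_ref = source if version is None else f"{source}@v{version}"
    return (
        f"CREATE OR REPLACE TABLE {target}\n"
        f"{kind} CLONE {source_ref}\n"
        f"LOCATION '{location}'"
    )


print(clone_statement("sensors_dev", "sensors_prod", "/tmp/dev/sensors",
                      shallow=True, version=2))
```

In a notebook, the resulting string could be passed to `spark.sql`.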


When we look at the target directory, we'll note that no data files exist. The metadata for this table just points to those data files in the source table's data directory.

In [0]:
check_files("sensors_dev")

Count of all data files in sensors_dev: 0

Out[8]: [FileInfo(path='dbfs:/user/mariapastora.alvarez@bosonit.com/clones/dev/sensors/_delta_log/', name='_delta_log/', size=0, modificationTime=0)]

## Apply Changes to Dev Data
But what happens if you want to test modifications to your dev table?

The code below inserts only those records from version 3 of our production table that don't have the value "C" as a `sensor_type`.

In [0]:
%sql
MERGE INTO sensors_dev dev
USING (SELECT * FROM sensors_prod@v3 WHERE sensor_type != "C") prod
ON dev.device_id = prod.device_id AND dev.time = prod.time
WHEN NOT MATCHED THEN INSERT *

num_affected_rows,num_updated_rows,num_deleted_rows,num_inserted_rows
742,0,0,742
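
The insert-only merge above behaves like an anti-join on the key columns: rows whose `(device_id, time)` pair already exists in the target are skipped, and the rest are appended. The sketch below models that logic on plain lists of dictionaries; the function `merge_insert_not_matched` and the sample rows are hypothetical.

```python
def merge_insert_not_matched(target, source, keys):
    """Append source rows whose key tuple is absent from the target."""
    existing = {tuple(row[k] for k in keys) for row in target}
    inserted = [
        row for row in source
        if tuple(row[k] for k in keys) not in existing
    ]
    return target + inserted, len(inserted)


dev = [{"device_id": 1, "time": 100, "sensor_type": "A"}]
prod = [
    {"device_id": 1, "time": 100, "sensor_type": "A"},  # already in dev
    {"device_id": 2, "time": 200, "sensor_type": "B"},  # new, kept
    {"device_id": 3, "time": 300, "sensor_type": "C"},  # new, filtered out
]

# Mirror the subquery: drop "C" rows before merging.
candidates = [row for row in prod if row["sensor_type"] != "C"]
merged, num_inserted = merge_insert_not_matched(dev, candidates, ["device_id", "time"])
print(num_inserted, len(merged))
```

Only the row for device 2 is inserted: device 1 matches an existing row, and device 3 never reaches the merge.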


The operation is successful and new rows are inserted. If we check the contents of our table location, we'll see that data files now exist.

In [0]:
check_files("sensors_dev")

Count of all data files in sensors_dev: 7

Out[10]: [FileInfo(path='dbfs:/user/mariapastora.alvarez@bosonit.com/clones/dev/sensors/_delta_log/', name='_delta_log/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/user/mariapastora.alvarez@bosonit.com/clones/dev/sensors/part-00000-5a17d9d9-8d33-46fa-8fb0-c6203d500156-c000.snappy.parquet', name='part-00000-5a17d9d9-8d33-46fa-8fb0-c6203d500156-c000.snappy.parquet', size=4440, modificationTime=1654505195000),
 FileInfo(path='dbfs:/user/mariapastora.alvarez@bosonit.com/clones/dev/sensors/part-00001-0ffb2fcf-5594-4caf-98cc-7535acbb6358-c000.snappy.parquet', name='part-00001-0ffb2fcf-5594-4caf-98cc-7535acbb6358-c000.snappy.parquet', size=3136, modificationTime=1654505195000),
 FileInfo(path='dbfs:/user/mariapastora.alvarez@bosonit.com/clones/dev/sensors/part-00003-31cadc3e-2a86-491e-a178-58081c40c9c4-c000.snappy.parquet', name='part-00003-31cadc3e-2a86-491e-a178-58081c40c9c4-c000.snappy.parquet', size=3143, modificationTime=1654505195000),

Any changes made to a shallow cloned table will write new data files to the specified target directory, meaning that you can safely test writes, updates, and deletes without risking corruption of your original table. The Delta logs will automatically reference the correct files (from the source table and this clone directory) to materialize the current view of your dev table.
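
The following toy model sketches how a shallow clone's list of active files can span two directories. It is an illustration only: the class `ShallowCloneModel`, its methods, and the paths are hypothetical and do not reflect the actual transaction log format.

```python
class ShallowCloneModel:
    """Toy model of a shallow clone's active file list."""

    def __init__(self, source_dir, source_files, clone_dir):
        self.clone_dir = clone_dir
        # Cloning copies only references: full paths into the source directory.
        self.active = [f"{source_dir}/{name}" for name in source_files]

    def write(self, name):
        # New data always lands in the clone's own directory.
        self.active.append(f"{self.clone_dir}/{name}")

    def files_in_clone_dir(self):
        prefix = self.clone_dir + "/"
        return [path for path in self.active if path.startswith(prefix)]


dev = ShallowCloneModel("prod/sensors", ["a.parquet", "b.parquet"], "dev/sensors")
before = len(dev.files_in_clone_dir())

dev.write("c.parquet")
print(before, len(dev.files_in_clone_dir()), len(dev.active))
```

Before the write, the clone directory holds no data files; afterwards it holds one, while the table as a whole is materialized from three files across both directories.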

## File Retention and Cloned Tables

It's important to understand how cloned tables behave with file retention actions.

Run the cell below to `VACUUM` your source production table (removing all files not referenced in the most recent version).

In [0]:
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", False)
spark.sql("VACUUM sensors_prod RETAIN 0 HOURS")
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", True)
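
To make the effect of `RETAIN 0 HOURS` concrete, the sketch below models vacuum as a filter over the files on disk: a file survives if the latest table version references it or if it is younger than the retention threshold. This is a simplified model with a hypothetical `vacuum` function, hypothetical file names, and made-up file ages; it is not how the real command is implemented.

```python
def vacuum(file_ages_hours, referenced, retain_hours):
    """Return the names of files that remain after a vacuum.

    A file is deleted only if it is not referenced by the latest table
    version and its age has reached the retention threshold.
    """
    return {
        name
        for name, age in file_ages_hours.items()
        if name in referenced or age < retain_hours
    }


# Hypothetical state: 24 files from the original appends plus 8 files
# written by the delete, all less than an hour old.
on_disk = {f"old-{i}": 0.5 for i in range(24)}
on_disk.update({f"new-{i}": 0.4 for i in range(8)})
latest = {f"new-{i}" for i in range(8)}

print(len(vacuum(on_disk, latest, retain_hours=0)))
print(len(vacuum(on_disk, latest, retain_hours=168)))
```

With zero retention only the 8 referenced files remain, whereas a 7-day (168-hour) threshold would keep all 32. This is why the retention duration check has to be disabled before running the command with such a short window.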

We see that there are now fewer total data files associated with this table.

In [0]:
check_files("sensors_prod")

Count of all data files in sensors_prod: 8

Out[13]: [FileInfo(path='dbfs:/user/mariapastora.alvarez@bosonit.com/clones/prod/sensors/_delta_log/', name='_delta_log/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/user/mariapastora.alvarez@bosonit.com/clones/prod/sensors/part-00000-99778ec4-01e9-4190-9859-a55cd56292bb-c000.snappy.parquet', name='part-00000-99778ec4-01e9-4190-9859-a55cd56292bb-c000.snappy.parquet', size=5772, modificationTime=1654504815000),
 FileInfo(path='dbfs:/user/mariapastora.alvarez@bosonit.com/clones/prod/sensors/part-00001-5d696765-238d-4e43-9541-7ccc2c759b9a-c000.snappy.parquet', name='part-00001-5d696765-238d-4e43-9541-7ccc2c759b9a-c000.snappy.parquet', size=6040, modificationTime=1654504815000),
 FileInfo(path='dbfs:/user/mariapastora.alvarez@bosonit.com/clones/prod/sensors/part-00002-26a525cb-af53-4473-b344-86ed2af44524-c000.snappy.parquet', name='part-00002-26a525cb-af53-4473-b344-86ed2af44524-c000.snappy.parquet', size=5867, modificationTime=1654504815

You'll recall that our `sensors_dev` table was initialized against version 2 of our production table. As such, it still references data files associated with that table version.

Because these data files have been removed by our vacuum operation, we should expect the following query against our shallow cloned table to fail.

In [0]:
%sql
SELECT * FROM sensors_dev
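
The failure can be understood as a dangling reference: the shallow clone still lists files that are no longer on disk. The sketch below models only that logic; the function `read_table` and the paths are hypothetical, and the error Spark actually raises will look different.

```python
def read_table(active_paths, existing_paths):
    """Return the paths to scan, or raise if any referenced file is gone."""
    missing = sorted(set(active_paths) - set(existing_paths))
    if missing:
        raise FileNotFoundError(
            f"{len(missing)} referenced file(s) missing: {missing}"
        )
    return list(active_paths)


# Hypothetical state after the vacuum: the version 2 file is gone.
on_disk = {
    "prod/sensors/v4-a.parquet",
    "dev/sensors/new.parquet",
    "backup/sensors/v3-a.parquet",
}
dev_refs = ["prod/sensors/v2-a.parquet", "dev/sensors/new.parquet"]
backup_refs = ["backup/sensors/v3-a.parquet"]

try:
    read_table(dev_refs, on_disk)
except FileNotFoundError as err:
    print("shallow clone failed:", err)

print("deep clone ok:", read_table(backup_refs, on_disk))
```

The deep clone is unaffected because every file it references lives in its own directory.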

Because deep clone created a full copy of our files and associated metadata, we still have access to our `sensors_backup` table. Here, we'll query the original version of this backup (which corresponds to version 3 of our source table).

In [0]:
%sql
SELECT * FROM sensors_backup@v0

time,device_id,sensor_type,signal_strength
1654503339026,10,C,0.7544386889605265
1654503350485,82,C,0.9540555710382808
1654503346359,83,C,0.2949914058403428
1654503339552,55,D,0.4764445174304901
1654503332155,99,D,0.5141792255704167
1654503348915,14,C,0.0366247440607943
1654503341668,37,B,0.0427806829753976
1654503334509,34,D,0.1738441640306324
1654503340243,35,C,0.9166222082609332
1654503344027,51,D,0.1375679765999884


One of the useful features of deep cloning is the ability to set different table properties for file and log retention. This allows production tables to have optimized performance while maintaining files for auditing and regulatory compliance. 

The cell below sets the log and deleted file retention periods to 10 years.

In [0]:
%sql
ALTER TABLE sensors_backup
SET TBLPROPERTIES (
  delta.logRetentionDuration = '3650 days',
  delta.deletedFileRetentionDuration = '3650 days'
)
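
If the retention period needs to be parameterized, the statement can be generated from a number of years. The helper below is a minimal sketch: `retention_statement` is a hypothetical name, a year is approximated as 365 days (matching the `3650 days` used above for 10 years), and the table name is interpolated without escaping.

```python
def retention_statement(table, years):
    """Build an ALTER TABLE statement that sets both retention properties."""
    days = years * 365  # approximation: leap days are ignored
    return (
        f"ALTER TABLE {table}\n"
        "SET TBLPROPERTIES (\n"
        f"  delta.logRetentionDuration = '{days} days',\n"
        f"  delta.deletedFileRetentionDuration = '{days} days'\n"
        ")"
    )


print(retention_statement("sensors_backup", 10))
```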

## Wrapping Up

In this notebook, we explored the basic syntax and behavior of deep and shallow clones. We saw how changes to source and clone tables affect each other, including the ability to incrementally clone changes to keep a backup table in sync with its source. We saw that shallow clone can be used to create temporary development tables based on production data, but noted that removing source data files leads to errors when querying the shallow clone.

Run the following cell to delete the tables and files associated with this demo.

In [0]:
%run ./Includes/setup $mode="cleanup"


username: mariapastora.alvarez@bosonit.com
userhome: dbfs:/user/mariapastora.alvarez@bosonit.com/clones
database: clones_mariapastora_alvarez_bosonit_com_db
