
## Introduction

In this Notebook we will see how to work with Delta Tables when using Fabric Notebooks. 

Some of the things we will look at are:
* Creating a new Delta Table
* Using Delta Log and Time Traveling 
* Tracking data changes using Change Data Feed
* Cloning tables
* Masking data by using Dynamic Views

In addition to Delta Tables we will also see some tips and tricks for working in the Fabric environment.


### Environment Setup

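We will use a notebook workflow to set up the environment for this exercise. The pattern comes from the [Databricks Notebooks workflow](https://docs.databricks.com/notebooks/notebook-workflows.html); in Fabric the equivalent utilities are exposed through `mssparkutils`.

The `mssparkutils.notebook.run()` command runs another notebook and returns its output so that it can be used here.

Like Databricks `dbutils`, `mssparkutils` has some other interesting uses, such as interacting with the file system (`mssparkutils.fs`, used later in this notebook) or reading secrets (compare [Databricks Secrets](https://docs.databricks.com/dev-tools/databricks-utils.html#dbutils-secrets)).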


## Set medallion paths

In [3]:
# Reference a notebook to get and set Path variables 
setup_responses = mssparkutils.notebook.run("Get-Metadata").split()

# Set medallion paths
bronzePath = setup_responses[0]
bronzeLakehouse = setup_responses[1]
silverLakehouse = setup_responses[2]
goldLakehouse = setup_responses[3]

print(f"bronze data path is {bronzePath}")      
print("bronze lakehouse is {}".format(bronzeLakehouse))
print("silver lakehouse is {}".format(silverLakehouse))
print("gold lakehouse is {}".format(goldLakehouse))

bronze data path is abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_bronze.Lakehouse/Files
bronze lakehouse is liad_bronze
silver lakehouse is liad_silver
gold lakehouse is liad_gold
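The setup cell above relies on the referenced notebook returning exactly four whitespace-separated values in a fixed order. As a minimal sketch, assuming that contract holds, the response could be unpacked into named fields with a length check so that a malformed response fails early. The names `MedallionPaths` and `parse_setup_response` are hypothetical helpers, not part of `mssparkutils`.

```python
from typing import NamedTuple


class MedallionPaths(NamedTuple):
    """Named fields for the four values returned by the setup notebook."""
    bronze_path: str
    bronze_lakehouse: str
    silver_lakehouse: str
    gold_lakehouse: str


def parse_setup_response(raw: str) -> MedallionPaths:
    """Split the whitespace-separated response and check the field count."""
    parts = raw.split()
    expected = len(MedallionPaths._fields)
    if len(parts) != expected:
        raise ValueError(f"expected {expected} values, got {len(parts)}: {parts!r}")
    return MedallionPaths(*parts)


# Sample response assembled from the output printed above.
sample = (
    "abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_bronze.Lakehouse/Files "
    "liad_bronze liad_silver liad_gold"
)
paths = parse_setup_response(sample)
print(paths.bronze_path)
print(paths.silver_lakehouse)
```

In the notebook itself, `raw` would be the string returned by `mssparkutils.notebook.run("Get-Metadata")`.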



## Delta Tables

Let's load employee data into a Delta Table. In our case we don't want to track any history, so we opt to overwrite the data every time the process runs.


### Create Delta Table

***Load employees data to a spark data frame then write that dataframe to a Delta Table***

Let's start by simply reading a CSV file into a DataFrame.

In [6]:
dataPath = f"{bronzePath}/source/retail-org/company_employees/company_employees.csv"

df = spark.read\
  .option("header", "true")\
  .option("delimiter", ",")\
  .option("quote", "\"") \
  .option("inferSchema", "true")\
  .csv(dataPath)\
  .limit(9)

display(df)


The data is now in a DataFrame, but not yet in a Delta Table. Still, we can already use SQL to query the data or copy it into a Delta table.

In [8]:
# Creating a Temporary View will allow us to use SQL to interact with data

df.createOrReplaceTempView("employees_csv_file")

In [9]:
%%sql

SELECT * from employees_csv_file

<Spark SQL result set with 9 rows and 8 fields>


SQL DDL can be used to create a table from the view we have just created.

In [11]:
%%sql

DROP TABLE IF EXISTS employees;

CREATE TABLE employees
USING DELTA
AS
SELECT * FROM employees_csv_file;

SELECT * from employees;

<Spark SQL result set with 0 rows and 0 fields>

<Spark SQL result set with 0 rows and 0 fields>

<Spark SQL result set with 9 rows and 8 fields>


This SQL query has created a simple Delta Table (as specified by `USING DELTA`). Delta is the default format, so a Delta table would be created even if we skipped the `USING DELTA` part.

For more complex tables you can also specify table PARTITION or add COMMENTS.
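As a rough sketch of what such a statement might look like, the helper below assembles a `CREATE TABLE ... AS SELECT` string with optional `PARTITIONED BY` and `COMMENT` clauses. The helper `build_ctas` and the table name `employees_by_region` are hypothetical; `region` is one of the columns of the employees data. The string is only printed here, and in a notebook session it could be passed to `spark.sql(...)`.

```python
def build_ctas(table, source_view, partition_cols=(), comment=None):
    """Assemble a Delta CREATE TABLE AS SELECT statement as a string."""
    parts = [f"CREATE TABLE {table}", "USING DELTA"]
    if partition_cols:
        parts.append(f"PARTITIONED BY ({', '.join(partition_cols)})")
    if comment is not None:
        # Escape single quotes so the comment stays a valid SQL string literal.
        escaped = comment.replace("'", "\\'")
        parts.append(f"COMMENT '{escaped}'")
    parts.append(f"AS SELECT * FROM {source_view}")
    return "\n".join(parts)


ddl = build_ctas(
    "employees_by_region",      # hypothetical table name
    "employees_csv_file",       # temporary view created earlier
    partition_cols=["region"],
    comment="Employees partitioned by region",
)
print(ddl)
# In a Spark session the statement could then be run with: spark.sql(ddl)
```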


### Describe Delta Table

Now that we have created our first Delta Table, let's see what it looks like in our database and where the data files are stored.

A quick way to get information on your table is to run the `DESCRIBE EXTENDED` command in a SQL cell. `DESCRIBE HISTORY` lists the commits recorded for the table.

In [7]:
%%sql
DESCRIBE HISTORY employees;

<Spark SQL result set with 1 rows and 15 fields>

In [12]:
%%sql

DESCRIBE EXTENDED employees;

<Spark SQL result set with 18 rows and 3 fields>


There is yet another way to see these files: the `%fs ls file_path` magic command.
You can try it out by filling in the cell below with your table location path.

In [13]:
%fs ls abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_bronze.Lakehouse/Tables/employees

FileInfo(_delta_log, abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_bronze.Lakehouse/Tables/employees/_delta_log, 0)
FileInfo(part-00000-c856f6d6-eead-42e1-b3dc-84956c7dbc96-c000.snappy.parquet, abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_bronze.Lakehouse/Tables/employees/part-00000-c856f6d6-eead-42e1-b3dc-84956c7dbc96-c000.snappy.parquet, 4047)



In [14]:
# Notice the _delta_log/ folder under employees. It contains the JSON log files that track versions.
display(mssparkutils.fs.ls( f"abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/{bronzeLakehouse}.Lakehouse/Tables/employees" ))


[FileInfo(path=abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_bronze.Lakehouse/Tables/employees/_delta_log, name=_delta_log, size=0),
 FileInfo(path=abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_bronze.Lakehouse/Tables/employees/part-00000-c856f6d6-eead-42e1-b3dc-84956c7dbc96-c000.snappy.parquet, name=part-00000-c856f6d6-eead-42e1-b3dc-84956c7dbc96-c000.snappy.parquet, size=4047)]


### Explore Delta Log

We can see that next to the data stored in the Parquet file we have a *_delta_log/* folder. This is where the log files can be found.

In [16]:

log_files_location = f"abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/{bronzeLakehouse}.Lakehouse/Tables/employees/_delta_log/"
print(log_files_location)

display(mssparkutils.fs.ls(log_files_location))

abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_bronze.Lakehouse/Tables/employees/_delta_log/


[FileInfo(path=abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_bronze.Lakehouse/Tables/employees/_delta_log/00000000000000000000.json, name=00000000000000000000.json, size=2353),
 FileInfo(path=abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_bronze.Lakehouse/Tables/employees/_delta_log/_temporary, name=_temporary, size=0)]


`00000000000000000000.json` holds the very first commit logged for our table. Each change to the table creates a new _json_ file.

In [17]:
first_log_file_location = f"{log_files_location}00000000000000000000.json"
mssparkutils.fs.head(first_log_file_location)

'{"commitInfo":{"timestamp":1693592480575,"operation":"CREATE TABLE AS SELECT","operationParameters":{"isManaged":"true","description":null,"partitionBy":"[]","properties":"{}"},"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{"numFiles":"1","numOutputRows":"9","numOutputBytes":"4047"},"tags":{"VORDER":"true"},"engineInfo":"Apache-Spark/3.3.1.5.2-100223822 Delta-Lake/2.2.0.6","txnId":"f905f000-e950-4325-b5fb-d969ecd87ae5"}}\n{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}\n{"metaData":{"id":"a37b806f-2d6c-4dad-96dc-15e08cd5ce95","format":{"provider":"parquet","options":{}},"schemaString":"{\\"type\\":\\"struct\\",\\"fields\\":[{\\"name\\":\\"employee_id\\",\\"type\\":\\"integer\\",\\"nullable\\":true,\\"metadata\\":{}},{\\"name\\":\\"employee_name\\",\\"type\\":\\"string\\",\\"nullable\\":true,\\"metadata\\":{}},{\\"name\\":\\"department\\",\\"type\\":\\"string\\",\\"nullable\\":true,\\"metadata\\":{}},{\\"name\\":\\"region\\",\\"type\\":\\"string\\",\\
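The commit file shown above is newline-delimited JSON: each line is one action (`commitInfo`, `protocol`, `metaData`, and so on). The following is a minimal sketch of reading such a file with the standard `json` module. The sample text is an abbreviated reconstruction of the output above (fields that were cut off in the output are left out), and `parse_commit` is a hypothetical helper.

```python
import json
from datetime import datetime, timezone

# Abbreviated sample built from the commit file printed above.
sample_commit = "\n".join([
    json.dumps({"commitInfo": {
        "timestamp": 1693592480575,
        "operation": "CREATE TABLE AS SELECT",
        "operationMetrics": {"numFiles": "1", "numOutputRows": "9", "numOutputBytes": "4047"},
    }}),
    json.dumps({"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}}),
    json.dumps({"metaData": {"id": "a37b806f-2d6c-4dad-96dc-15e08cd5ce95",
                             "format": {"provider": "parquet", "options": {}}}}),
])


def parse_commit(text):
    """Group the actions of one commit file by action type."""
    actions = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)  # one JSON object per line
        for action_type, payload in record.items():
            actions.setdefault(action_type, []).append(payload)
    return actions


actions = parse_commit(sample_commit)
commit = actions["commitInfo"][0]
# The commit timestamp is stored as milliseconds since the Unix epoch.
committed_at = datetime.fromtimestamp(commit["timestamp"] / 1000, tz=timezone.utc)
print(commit["operation"], committed_at.isoformat())
print(sorted(actions))
```

In the notebook, the text could come from `mssparkutils.fs.head(first_log_file_location)` instead of the inline sample.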


Another way to see what is stored in our log files is to use the `DESCRIBE HISTORY` command.

In [18]:
%%sql

DESCRIBE HISTORY employees;

<Spark SQL result set with 1 rows and 15 fields>


### Update Delta Table

Provided dataset has no employee titles - let's add them!

In [19]:
%%sql

alter table employees
add column employee_title string;

<Spark SQL result set with 0 rows and 0 fields>

In [20]:
%%sql

update employees
set employee_title = case when employee_id in (0,2) then 'MGR' else 'unknown' end

<Spark SQL result set with 1 rows and 1 fields>

In [21]:
%%sql
select employee_id,employee_title from employees;

<Spark SQL result set with 9 rows and 2 fields>

In [22]:
%%sql

update employees
set employee_title = 'CEO'
where employee_id = 1

<Spark SQL result set with 1 rows and 1 fields>

In [24]:
%%sql
select employee_title, count(employee_id) as employee_count_by_title from employees group by employee_title

<Spark SQL result set with 3 rows and 2 fields>


### Track Data History


Delta Tables keep all changes made in the delta log we've seen before. There are multiple ways to see that - e.g. by running `DESCRIBE HISTORY` for a table

In [25]:
%%sql

DESCRIBE HISTORY employees

<Spark SQL result set with 4 rows and 15 fields>


We can also check which files the log location contains now.

In [26]:
display(mssparkutils.fs.ls(log_files_location))

[FileInfo(path=abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_bronze.Lakehouse/Tables/employees/_delta_log/00000000000000000000.json, name=00000000000000000000.json, size=2353),
 FileInfo(path=abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_bronze.Lakehouse/Tables/employees/_delta_log/00000000000000000001.json, name=00000000000000000001.json, size=1388),
 FileInfo(path=abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_bronze.Lakehouse/Tables/employees/_delta_log/00000000000000000002.json, name=00000000000000000002.json, size=1750),
 FileInfo(path=abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_bronze.Lakehouse/Tables/employees/_delta_log/00000000000000000003.json, name=00000000000000000003.json, size=1787),
 FileInfo(path=abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_bronze.Lakehouse/Tables/employees/_delta_log/_temporary, name=_temporary, size=0)]
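The listing above shows that each commit file is named after its table version, zero-padded to 20 digits. A small sketch of mapping between file names and version numbers follows; the helper names are hypothetical, and entries that are not commit files (such as `_temporary`) are skipped.

```python
import re

# A commit file name is the version number followed by ".json".
COMMIT_FILE = re.compile(r"^(\d+)\.json$")


def commit_file_name(version):
    """Return the log file name for a table version (20 digits, zero-padded)."""
    return f"{version:020d}.json"


def versions_in_listing(names):
    """Extract the sorted table versions from a _delta_log directory listing."""
    versions = []
    for name in names:
        match = COMMIT_FILE.match(name)
        if match:
            versions.append(int(match.group(1)))
    return sorted(versions)


# File names taken from the listing above.
listing = [
    "00000000000000000000.json",
    "00000000000000000001.json",
    "00000000000000000002.json",
    "00000000000000000003.json",
    "_temporary",
]
print(versions_in_listing(listing))
print(commit_file_name(3))
```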

### Time Travel

Having all this information and the old data files means that we can **Time Travel**! You can query your table at any given `VERSION AS OF` or `TIMESTAMP AS OF`.

Let's check again what the table looked like before we ran the last update.

In [25]:
%%sql

select * from employees VERSION AS OF 2 where employee_title = 'MGR';

<Spark SQL result set with 2 rows and 9 fields>

In [26]:
%%sql

select * from employees VERSION AS OF 3
where employee_title = 'MGR';

<Spark SQL result set with 2 rows and 9 fields>

In [27]:
%%sql

DESCRIBE HISTORY employees

<Spark SQL result set with 4 rows and 15 fields>
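To build intuition for why time travel works, here is a toy model in plain Python, not the actual Delta implementation. It assumes a heavily simplified log in which each commit is a list of `add` and `remove` actions on data files, and it ignores checkpoints and all other action details. Replaying the log up to a given version yields the set of data files that make up the table at that version. The file names are hypothetical.

```python
def files_at_version(commits, version):
    """Replay add/remove actions from version 0 up to and including `version`."""
    if not 0 <= version < len(commits):
        raise ValueError(f"version {version} is not in the log")
    live = set()
    for actions in commits[: version + 1]:
        for action in actions:
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
            # Other action types (metaData, protocol, ...) do not change the file set.
    return live


# Hypothetical, simplified log mirroring the four versions created in this notebook.
commits = [
    [{"add": {"path": "part-a.parquet"}}],                    # v0: CREATE TABLE AS SELECT
    [{"metaData": {"note": "employee_title column added"}}],  # v1: ADD COLUMN, metadata only
    [{"remove": {"path": "part-a.parquet"}},
     {"add": {"path": "part-b.parquet"}}],                    # v2: first UPDATE rewrites the file
    [{"remove": {"path": "part-b.parquet"}},
     {"add": {"path": "part-c.parquet"}}],                    # v3: second UPDATE rewrites it again
]

for v in range(len(commits)):
    print(v, sorted(files_at_version(commits, v)))
```

Because removed files stay on storage until they are cleaned up, an older version can still be read by resolving its file set in this way.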


### Change Data Feed


The Delta change data feed represents row-level changes between versions of a Delta table. When enabled on a Delta table, the runtime records “change events” for all the data written into the table. This includes the row data along with metadata indicating whether the specified row was inserted, deleted, or updated.

It is not enabled by default, but we can enable it using `TBLPROPERTIES`.

In [28]:
%%sql
ALTER TABLE employees SET TBLPROPERTIES (delta.enableChangeDataFeed = true)

<Spark SQL result set with 0 rows and 0 fields>


Changes to table properties also generate a new version

In [29]:
%%sql

describe history employees

<Spark SQL result set with 5 rows and 15 fields>


Change data feed can be seen by using `table_changes` function. You will need to specify a range of changes to be returned - it can be done by providing either version or timestamp for the start and end. The start and end versions and timestamps are inclusive in the queries. 

To read the changes from a particular start version to the latest version of the table, specify only the starting version or timestamp.

In [None]:
%%sql

-- simulate a title change for all managers and the removal of employee 0

update employees
set employee_title = 'Supervisor'
where employee_title = 'MGR';

delete from employees
where employee_id = 0;

--SELECT * FROM table_changes('employees', 5,6) -- Note that the UPDATE and DELETE statements above each create a new version


Delta CDC returns 4 CDC types in the `_change_type` column:

| CDC Type             | Description                                                               |
|----------------------|---------------------------------------------------------------------------|
| **update_preimage**  | Content of the row before an update                                       |
| **update_postimage** | Content of the row after the update (what you want to capture downstream) |
| **delete**           | Content of a row that has been deleted                                    |
| **insert**           | Content of a new row that has been inserted                               |

Therefore, 1 update results in 2 rows in the cdc stream (one row with the previous values, one with the new values)
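As an illustration of how a downstream consumer might use these change types, the sketch below applies a list of change rows to an in-memory copy of the table keyed by `employee_id`. It assumes each change row carries the `_change_type` and `_commit_version` metadata columns. The sample rows and the `apply_changes` helper are hypothetical and only mimic the UPDATE and DELETE statements above.

```python
def apply_changes(target, changes, key="employee_id"):
    """Apply change rows to `target` (a dict keyed by `key`) in commit order."""
    for row in sorted(changes, key=lambda r: r["_commit_version"]):
        change_type = row["_change_type"]
        # Strip the change feed metadata columns before storing the row.
        data = {k: v for k, v in row.items() if not k.startswith("_")}
        if change_type in ("insert", "update_postimage"):
            target[data[key]] = data
        elif change_type == "delete":
            target.pop(data[key], None)
        # update_preimage rows describe the old values and are skipped here.
    return target


# Hypothetical downstream state before the changes.
downstream = {
    0: {"employee_id": 0, "employee_title": "MGR"},
    1: {"employee_id": 1, "employee_title": "CEO"},
    2: {"employee_id": 2, "employee_title": "MGR"},
}

# Hypothetical change rows: version 5 is the UPDATE, version 6 is the DELETE.
changes = [
    {"employee_id": 0, "employee_title": "MGR", "_change_type": "update_preimage", "_commit_version": 5},
    {"employee_id": 0, "employee_title": "Supervisor", "_change_type": "update_postimage", "_commit_version": 5},
    {"employee_id": 2, "employee_title": "MGR", "_change_type": "update_preimage", "_commit_version": 5},
    {"employee_id": 2, "employee_title": "Supervisor", "_change_type": "update_postimage", "_commit_version": 5},
    {"employee_id": 0, "employee_title": "Supervisor", "_change_type": "delete", "_commit_version": 6},
]

apply_changes(downstream, changes)
print(downstream)
```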


### CLONE


What if our use case calls for monthly snapshots of the data rather than a detailed change log? An easy way to get that is to create a CLONE of the table.

You can create a copy of an existing Delta table at a specific version using the clone command. Clones can be either deep or shallow.

 

* A **deep clone** is a clone that copies the source table data to the clone target in addition to the metadata of the existing table.
* A **shallow clone** is a clone that does not copy the data files to the clone target. The table metadata is equivalent to the source. These clones are cheaper to create, but they will break if the original data files are no longer available.
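The difference between the two clone types can be illustrated with ordinary files. This is only an analogy in plain Python, not how Delta implements cloning: the "deep clone" copies the data file into its own directory, while the "shallow clone" merely keeps a reference to the source file. All names and paths are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def deep_clone(source_files, target_dir):
    """Copy every data file into the clone's own directory."""
    target_dir.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.copy2(f, target_dir)) for f in source_files]


def shallow_clone(source_files):
    """Only record references to the source data files; nothing is copied."""
    return list(source_files)


def readable(files):
    """A clone is readable only while all files it points to still exist."""
    return all(f.exists() for f in files)


with tempfile.TemporaryDirectory() as tmp:
    source_dir = Path(tmp) / "employees"
    source_dir.mkdir()
    data_file = source_dir / "part-00000.parquet"
    data_file.write_text("placeholder data")

    deep = deep_clone([data_file], Path(tmp) / "employees_clone")
    shallow = shallow_clone([data_file])

    data_file.unlink()  # simulate the original data file being removed

    deep_ok = readable(deep)
    shallow_ok = readable(shallow)

print("deep clone readable:", deep_ok)
print("shallow clone readable:", shallow_ok)
```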

In [33]:
%%sql

drop table if exists employees_clone;

create table employees_clone DEEP CLONE employees VERSION AS OF 3 -- you can specify timestamp here instead of a version

<Spark SQL result set with 0 rows and 0 fields>

Error: 
Syntax error at or near 'CLONE'(line 3, pos 29)

== SQL ==


create table employees_clone CLONE employees VERSION AS OF 3 -- you can specify timestamp here instead of a version
-----------------------------^^^


In [None]:
%%sql

describe history employees_clone;

In [None]:
%%sql

drop table if exists employees_clone_shallow;

-- Note that no files are copied

create table employees_clone_shallow SHALLOW CLONE employees;


### Dynamic Views


Our employees table has some PII data (such as employee names). We can use dynamic views to limit the visibility of columns and rows depending on the groups a user belongs to.

In [34]:
%%sql

select * from employees

<Spark SQL result set with 8 rows and 9 fields>

In [35]:
%%sql

DROP VIEW IF EXISTS v_employee_name_redacted;

CREATE VIEW v_employee_name_redacted AS
SELECT
  employee_id,
  CASE WHEN
    --NOT is_member('admins') THEN employee_name
    is_member('admins') THEN employee_name
    ELSE 'REDACTED'
  END AS employee_name,
  department,
  region,
  employee_title
FROM employees;

<Spark SQL result set with 0 rows and 0 fields>

Error: Undefined function: is_member. This function is neither a built-in/temporary function, nor a persistent function that is qualified as spark_catalog.liad_bronze.is_member.; line 8 pos 4
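Since `is_member` is not defined on this runtime (see the error above), the masking rule the view is meant to express can at least be sketched in plain Python. This is an illustration of the logic only, not a replacement for view-level security. The function `redact_employee`, the sample row, and the group names are hypothetical.

```python
def redact_employee(row, user_groups, privileged_group="admins"):
    """Return a copy of the row with employee_name masked for non-privileged users."""
    masked = dict(row)
    if privileged_group not in user_groups:
        masked["employee_name"] = "REDACTED"
    return masked


# Hypothetical sample row with the columns selected by the view.
row = {
    "employee_id": 2,
    "employee_name": "Sample Name",
    "department": "Sales",
    "region": "West",
    "employee_title": "MGR",
}

print(redact_employee(row, user_groups={"analysts"}))
print(redact_employee(row, user_groups={"admins"}))
```

The original row is left unchanged, so the same source data can be shown differently to different users.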

In [None]:
%%sql

select * from v_employee_name_redacted;

In [39]:
%%sql

DROP VIEW IF EXISTS v_employees_limited;

CREATE VIEW v_employees_limited AS
SELECT *
FROM employees
WHERE 
  (employee_title = 'unknown');

<Spark SQL result set with 0 rows and 0 fields>

<Spark SQL result set with 0 rows and 0 fields>

In [40]:
%%sql

select * from v_employees_limited;

<Spark SQL result set with 6 rows and 9 fields>