# Introduction to delta-rs

This notebook introduces you to the key features of Delta Lake via the delta-rs library.

delta-rs allows you to work with Delta Lake without a Spark runtime.

You can install the software environment for this notebook by running `conda env create -f envs/mr-powers-rs` and then activate it with `conda activate mr-powers-rs`.

Once you work through this notebook, you'll have a better understanding of the features that make Delta Lake powerful.  It's a relatively quick guide and should be eye-opening!  Let's dive in!

We'll start by importing pandas, PyArrow's dataset module, and deltalake, and by resolving the path of the current working directory.

In [7]:
import pathlib

from deltalake import write_deltalake, DeltaTable
import pandas as pd
import pyarrow.dataset as ds

In [8]:
cwd = pathlib.Path().resolve()

## Create a Delta Lake

Let's create a pandas DataFrame and then write out the data to a Delta Lake.

In [9]:
df = pd.DataFrame({"num": [1, 2, 3], "letter": ["a", "b", "c"]})

In [10]:
print(df)

   num letter
0    1      a
1    2      b
2    3      c


In [11]:
write_deltalake(f"{cwd}/tmp/delta-table", df)

You can inspect the contents of the `tmp/delta-table` folder to begin understanding how Delta Lake works.  Here's what the folder will contain (the UUID in the Parquet file name differs on every run):

```
tmp/
  delta-table/
    _delta_log/
      00000000000000000000.json
    0-3f43d8ae-40a5-4417-8a00-ae55392a662f-0.parquet
```

`tmp/delta-table` contains a `_delta_log` folder, which is often referred to as the "transaction log".  The transaction log tracks the files that have been added to and removed from the Delta Lake, along with other metadata.

The Parquet file contains the actual data that was written to the Delta Lake.

You don't need to have a detailed understanding of how the transaction log works.  A high level conceptual grasp is all you need to understand how Delta Lake provides you with useful data management features.
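To make the idea concrete, here's a minimal sketch of reading one commit file from the transaction log.  It assumes each line of a commit file is a single JSON action, which matches the `add` entries shown in this notebook.  The `summarize_commit` helper and the sample file name are hypothetical and only for illustration; this isn't how delta-rs reads the log internally.

```python
import json


def summarize_commit(text):
    """Split one commit file's text into added and removed file paths."""
    added, removed = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        action = json.loads(line)  # assumption: one JSON action per line
        if "add" in action:
            added.append(action["add"]["path"])
        elif "remove" in action:
            removed.append(action["remove"]["path"])
    return added, removed


# Hypothetical commit text, trimmed down to the fields this sketch uses.
sample = "\n".join(
    [
        json.dumps({"protocol": {"minReaderVersion": 1, "minWriterVersion": 1}}),
        json.dumps({"add": {"path": "0-example.parquet", "dataChange": True}}),
    ]
)

added, removed = summarize_commit(sample)
print("added:", added)
print("removed:", removed)
```

Actions other than `add` and `remove` (like `protocol` and `metaData`) are ignored here because they don't change which data files belong to the table.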

In [12]:
!tree tmp/delta-table

tmp/delta-table
├── 0-acc1b1db-f6a6-4486-acae-3f2b314ad48a-0.parquet
└── _delta_log
    └── 00000000000000000000.json

1 directory, 2 files


In [13]:
!jq . tmp/delta-table/_delta_log/00000000000000000000.json

{
  "protocol": {
    "minReaderVersion": 1,
    "minWriterVersion": 1
  }
}
{
  "metaData": {
    "id": "c55d7d7d-65e2-4d61-bdab-e01dd84f94a2",
    "name": null,
    "description": null,
    "format": {
      "provider": "parquet",
      "options": {}
    },
    "schemaString": "{\"type\":\"struct\",\"fields\":[{\"name\":\"num\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"letter\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}",
    "partiti

## Read a Delta Lake

Let's read the Delta Lake you created into a pandas DataFrame and print out the contents.

In [14]:
dt = DeltaTable("./tmp/delta-table")

In [15]:
dt.to_pandas()

   num letter
0    1      a
1    2      b
2    3      c


In [16]:
dt.version()

0

After the first data insert, the Delta Lake is at "version 0".  Let's add some more data to the Delta Lake and see how the version gets updated after another write transaction is performed.

## Insert more data into Delta Lake

Create another pandas DataFrame with the same schema and insert it to the Delta Lake.

In [17]:
df = pd.DataFrame({"num": [77, 88, 99], "letter": ["x", "y", "z"]})

The Delta Lake already exists, so we need to set the write `mode="append"` to add additional data.

In [18]:
write_deltalake(f"{cwd}/tmp/delta-table", df, mode="append")

Let's read the Delta Lake into a pandas DataFrame and confirm it contains the data from both the first and second write transactions.

In [19]:
dt = DeltaTable("./tmp/delta-table")

In [20]:
dt.to_pandas()

   num letter
0    1      a
1    2      b
2    3      c
3   77      x
4   88      y
5   99      z


After the first write transaction, the Delta Lake was at "version 0".  Now, after the second write transaction, the Delta Lake is at "version 1".

In [15]:
dt.version()

1
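Each version number maps to one commit file in `_delta_log`.  Here's a small sketch of that naming scheme, based on the file names visible in this notebook (the version number zero-padded to 20 digits, plus `.json`).  The `commit_filename` helper is a hypothetical name used only for illustration.

```python
def commit_filename(version: int) -> str:
    """Return the commit file name for a table version, zero-padded to 20 digits."""
    return f"{version:020d}.json"


print(commit_filename(0))
print(commit_filename(1))
```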

In [21]:
!tree tmp/delta-table

tmp/delta-table
├── 0-acc1b1db-f6a6-4486-acae-3f2b314ad48a-0.parquet
├── 1-582344a2-df7e-4688-9f3c-cac96939bdae-0.parquet
└── _delta_log
    ├── 00000000000000000000.json
    └── 00000000000000000001.json

1 directory, 4 files


In [22]:
!jq . tmp/delta-table/_delta_log/00000000000000000001.json

{
  "add": {
    "path": "1-582344a2-df7e-4688-9f3c-cac96939bdae-0.parquet",
    "size": 2208,
    "partitionValues": {},
    "modificationTime": 1701198708275,
    "dataChange": true,
    "stats": "{\"numRecords\": 3, \"minValues\": {\"num\": 77, \"letter\": \"x\"}, \"maxValues\": {\"num\": 99, \"letter\": \"z\"}, \"nullCount\": {\"num\": 0, \"letter\": 0}}"
  }
}
{
  "commitInfo": {
    "timestamp": 1701198708275,
    "operation": "WRITE",
    "operationParameters": {
      "partitionBy"

## Overwrite Delta table

## Time travel to previous version of data

Let's travel back in time and inspect the content of the Delta Lake at "version 0".  

In [23]:
dt = DeltaTable("./tmp/delta-table", version=0)

In [24]:
dt.to_pandas()

   num letter
0    1      a
1    2      b
2    3      c


In [25]:
!tree tmp/delta-table

tmp/delta-table
├── 0-acc1b1db-f6a6-4486-acae-3f2b314ad48a-0.parquet
├── 1-582344a2-df7e-4688-9f3c-cac96939bdae-0.parquet
└── _delta_log
    ├── 00000000000000000000.json
    └── 00000000000000000001.json

1 directory, 4 files


Wow!  That's cool!

We performed two write transactions and were able to travel back in time and view the contents of the Delta Lake before the second write transaction was performed.  This is an incredibly powerful and useful feature.

Delta Lake gives you time travel for free!
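If you're curious why time travel is cheap, here's a rough conceptual model: a reader replays the commits up to the requested version and ends up with the set of files that make up that snapshot.  This is a simplified sketch, not the delta-rs implementation.  The `snapshot_files` helper, the commit tuples, and the file names are all hypothetical.

```python
def snapshot_files(commits, version):
    """Replay commits 0..version and return the data files in that snapshot.

    `commits` is a list of (added_paths, removed_paths) tuples, one per version.
    """
    files = set()
    for added, removed in commits[: version + 1]:
        files -= set(removed)  # files dropped by this commit
        files |= set(added)    # files introduced by this commit
    return files


# Hypothetical history: two appends, mirroring the two writes above.
commits = [
    (["file-0.parquet"], []),  # version 0: first write
    (["file-1.parquet"], []),  # version 1: append
]

print(sorted(snapshot_files(commits, version=0)))
print(sorted(snapshot_files(commits, version=1)))
```

Nothing gets rewritten on disk to serve an older version.  The reader simply stops replaying the log earlier.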

## Schema enforcement

Schema enforcement is enabled by default.  If you try to append data to a Delta Lake that doesn't have the same schema, it'll error out with a descriptive message detailing the schema differences.
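Conceptually, schema enforcement boils down to comparing the schema of the incoming data with the schema stored in the table metadata before anything gets written.  Here's a minimal sketch of that comparison using plain dictionaries.  The `schema_diff` helper and the type names are hypothetical; deltalake performs its own check on Arrow schemas.

```python
def schema_diff(table_schema, data_schema):
    """Compare two {column: type} mappings and list the problems (empty if they match)."""
    problems = []
    for name, dtype in table_schema.items():
        if name not in data_schema:
            problems.append(f"missing column: {name}")
        elif data_schema[name] != dtype:
            problems.append(
                f"type mismatch for {name}: expected {dtype}, got {data_schema[name]}"
            )
    for name in data_schema:
        if name not in table_schema:
            problems.append(f"unexpected column: {name}")
    return problems


table_schema = {"num": "int64", "letter": "string"}
good = {"num": "int64", "letter": "string"}
bad = {"name": "string", "age": "int64"}

print(schema_diff(table_schema, good))
print(schema_diff(table_schema, bad))
```

A writer following this model would refuse the append whenever the returned list isn't empty.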

In [19]:
df = pd.DataFrame({"name": ["bob", "denise"], "age": [64, 43]})

In [20]:
write_deltalake(f"{cwd}/tmp/delta-table", df, mode="append")

ValueError: Schema of data does not match table schema
Table schema:
name: string
age: int64
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 470
Data Schema:
num: int64
letter: string

In [21]:
dt = DeltaTable("./tmp/delta-table")

In [22]:
dt.to_pandas()

   num letter
0    1      a
1    2      b
2    3      c
3   77      x
4   88      y
5   99      z


## Delete rows

This section demonstrates how you can delete rows of data from the Delta Lake.

In [23]:
dt = DeltaTable("./tmp/delta-table")

Convert the DeltaTable to a PyArrow dataset, so we can perform a filtering operation.

In [24]:
dataset = dt.to_pyarrow_dataset()

Keep only the rows where `num` is greater than 1 and less than 99.  This drops the rows where `num` is 1 or 99.

In [25]:
condition = (ds.field("num") > 1.0) & (ds.field("num") < 99.0)

In [26]:
filtered = dataset.to_table(filter=condition).to_pandas()

In [27]:
filtered

   num letter
0    2      b
1    3      c
2   77      x
3   88      y


Set the save mode to overwrite to update the Delta Lake to only include the filtered data.

In [28]:
write_deltalake(f"{cwd}/tmp/delta-table", filtered, mode="overwrite")

Read in the latest version of the Delta Lake to a pandas DataFrame to confirm that it only includes the filtered data.

In [29]:
dt = DeltaTable("./tmp/delta-table")

In [30]:
dt.to_pandas()

   num letter
0    2      b
1    3      c
2   77      x
3   88      y
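In transaction log terms, an overwrite can be thought of as a single commit that removes every file in the current snapshot and adds the newly written file.  Here's a simplified sketch of the actions such a commit would carry.  The `overwrite_actions` helper and the file names are hypothetical, and real commits include more fields than `path`.

```python
def overwrite_actions(active_files, new_file):
    """Build the actions for an overwrite: remove all current files, add the new one."""
    actions = [{"remove": {"path": path}} for path in sorted(active_files)]
    actions.append({"add": {"path": new_file}})
    return actions


# Hypothetical file names standing in for the real Parquet files.
actions = overwrite_actions({"file-0.parquet", "file-1.parquet"}, "file-2.parquet")
for action in actions:
    print(action)
```

The removed files are only marked as removed in the log.  They stay on disk, which is what keeps older versions readable.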


## Vacuum old data files

Delta Lake doesn't delete stale files from disk by default.  We just performed an overwrite transaction, which means that all the data for the latest version of the Delta Lake is in a new file.  When we read in the latest version of the Delta Lake, it'll just read the new file.  Let's take a look.

In [31]:
dt = DeltaTable("./tmp/delta-table")

In [32]:
dt.files()

['2-5f1b893c-7e42-4968-b4cf-0a76c3061d6e-0.parquet']

In [33]:
dt.to_pandas()

   num letter
0    2      b
1    3      c
2   77      x
3   88      y


We have several Parquet files on disk, but only one is being read for the current version of the Delta Lake.  Let's take a look at all the Parquet files currently in the Delta Lake.

In [34]:
! ls tmp/delta-table/*.parquet

tmp/delta-table/0-e859573b-51d9-4193-aaee-55f52b07392a-0.parquet
tmp/delta-table/1-5db2221e-eb29-47eb-b59d-ea99281c351c-0.parquet
tmp/delta-table/2-5f1b893c-7e42-4968-b4cf-0a76c3061d6e-0.parquet


The "stale" Parquet files are what allow for time travel.  Let's time travel back to "version 1" of the Delta Lake.

In [35]:
dt = DeltaTable("./tmp/delta-table", version=1)

In [36]:
dt.files()

['0-e859573b-51d9-4193-aaee-55f52b07392a-0.parquet',
 '1-5db2221e-eb29-47eb-b59d-ea99281c351c-0.parquet']

In [37]:
dt.to_pandas()

   num letter
0    1      a
1    2      b
2    3      c
3   77      x
4   88      y
5   99      z


When we time travel back to version 1, we're reading entirely different files than when we read the latest version of the Delta Lake.

The legacy files are what allow you to time travel.

If you don't want to time travel, you can delete the legacy files with the `vacuum()` command.

In [38]:
dt = DeltaTable(f"{cwd}/tmp/delta-table")

Vacuum is run in "dry run" mode by default.

In [39]:
dt.vacuum(retention_hours=0, enforce_retention_duration=False)

['/Users/powers/Documents/code/my_apps/delta-examples/notebooks/delta-rs/tmp/delta-table/1-5db2221e-eb29-47eb-b59d-ea99281c351c-0.parquet',
 '/Users/powers/Documents/code/my_apps/delta-examples/notebooks/delta-rs/tmp/delta-table/0-e859573b-51d9-4193-aaee-55f52b07392a-0.parquet']

The files aren't actually deleted when the code is executed in dry run mode.

In [40]:
! ls tmp/delta-table/*.parquet

tmp/delta-table/0-e859573b-51d9-4193-aaee-55f52b07392a-0.parquet
tmp/delta-table/1-5db2221e-eb29-47eb-b59d-ea99281c351c-0.parquet
tmp/delta-table/2-5f1b893c-7e42-4968-b4cf-0a76c3061d6e-0.parquet


Explicitly set `dry_run` to `False` to actually delete the files.

In [41]:
dt.vacuum(retention_hours=0, enforce_retention_duration=False, dry_run=False)

['/Users/powers/Documents/code/my_apps/delta-examples/notebooks/delta-rs/tmp/delta-table/1-5db2221e-eb29-47eb-b59d-ea99281c351c-0.parquet',
 '/Users/powers/Documents/code/my_apps/delta-examples/notebooks/delta-rs/tmp/delta-table/0-e859573b-51d9-4193-aaee-55f52b07392a-0.parquet']

In [42]:
! ls tmp/delta-table/*.parquet

tmp/delta-table/2-5f1b893c-7e42-4968-b4cf-0a76c3061d6e-0.parquet
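The vacuum behavior above can be modeled as "list the Parquet files on disk that the current version no longer references, and delete them only when `dry_run` is off".  Here's a minimal sketch of that idea on a temporary folder.  The `vacuum_sketch` helper and the file names are hypothetical, and the sketch ignores the retention period that the real `vacuum()` honors.

```python
import os
import tempfile


def vacuum_sketch(table_dir, active, dry_run=True):
    """Return stale Parquet files in table_dir; delete them unless dry_run is True."""
    stale = sorted(
        name
        for name in os.listdir(table_dir)
        if name.endswith(".parquet") and name not in active
    )
    if not dry_run:
        for name in stale:
            os.remove(os.path.join(table_dir, name))
    return stale


with tempfile.TemporaryDirectory() as table_dir:
    # Hypothetical files: two stale ones and one used by the current version.
    for name in ["0-old.parquet", "1-old.parquet", "2-current.parquet"]:
        open(os.path.join(table_dir, name), "w").close()
    active = {"2-current.parquet"}

    would_delete = vacuum_sketch(table_dir, active)  # dry run: nothing is removed
    after_dry_run = sorted(os.listdir(table_dir))

    deleted = vacuum_sketch(table_dir, active, dry_run=False)  # actually deletes
    after_vacuum = sorted(os.listdir(table_dir))

print(would_delete)
print(after_dry_run)
print(after_vacuum)
```

Once the stale files are gone, the versions that referenced them can no longer be read, so only vacuum when you're sure you don't need to time travel that far back.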


## Cleanup

Let's delete the Delta Lake now that we're done with this demo.

In [1]:
! rm -rf ./tmp/delta-table/