## Delta Lake Example

This notebook illustrates how Delta Lake works under the hood. Watch how the files in the table directory (`DATA_DIR`, here `my_table`) change as you run each cell.

I use [delta-rs](https://github.com/delta-io/delta-rs) in this example to read and write the Delta Lake table. On Databricks, one would typically use Spark (PySpark) instead.

In [21]:
from deltalake import DeltaTable, write_deltalake
import pandas as pd
import shutil
from pathlib import Path
from helpers import spread_out_log_timestamps

DATA_DIR = Path("my_table")

# Remove any table left over from a previous run so the example starts clean
shutil.rmtree(DATA_DIR, ignore_errors=True)

In [22]:
df = pd.DataFrame({"id": [1, 2], "value": ["foo", "bar"]})
df

|   | id | value |
|---|----|-------|
| 0 | 1  | foo   |
| 1 | 2  | bar   |


In [23]:
write_deltalake(DATA_DIR, df, mode="append")
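
To follow the changes on disk without leaving the notebook, a small helper along these lines can be re-run after each cell. This is only a sketch using the standard library; `list_table_files` is a name I introduce here for illustration and is not part of delta-rs.

```python
from pathlib import Path


def list_table_files(table_dir):
    """Return all files below table_dir as sorted, POSIX-style relative paths."""
    root = Path(table_dir)
    if not root.exists():
        return []
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


# Print the current contents of the table directory (nothing if it does not exist).
for name in list_table_files("my_table"):
    print(name)
```

After the first append, I would expect the listing to show a `_delta_log` directory containing a JSON commit file next to a Parquet data file.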

In [24]:
table = DeltaTable(DATA_DIR)
# Delete rows matching a SQL predicate; affected data files are rewritten, not edited in place
table.delete("id = 1")

{'num_added_files': 1,
 'num_removed_files': 1,
 'num_deleted_rows': 1,
 'num_copied_rows': 1,
 'execution_time_ms': 5,
 'scan_time_ms': 2,
 'rewrite_time_ms': 3}
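
The file counts reported above should be mirrored by `add` and `remove` actions in the newest commit file under `_delta_log`. The sketch below tallies the action types per commit. It assumes the commit files are newline-delimited JSON with one action per line, as the Delta transaction log protocol describes; `summarize_commits` is a hypothetical helper name, not a delta-rs function.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_commits(table_dir):
    """Map each commit version to a Counter of the action types it contains."""
    summary = {}
    log_dir = Path(table_dir) / "_delta_log"
    if not log_dir.is_dir():
        return summary
    for commit in sorted(log_dir.glob("*.json")):
        # Commit files are named by zero-padded version; skip anything else.
        if not commit.stem.isdigit():
            continue
        actions = Counter()
        for line in commit.read_text().splitlines():
            if line.strip():
                # Each line is a JSON object whose top-level key names the
                # action, e.g. "add", "remove", "commitInfo".
                actions.update(json.loads(line).keys())
        summary[int(commit.stem)] = actions
    return summary


for version, actions in summarize_commits("my_table").items():
    print(version, dict(actions))
```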

In [25]:
# Append the same two rows 100 times; each call adds a new commit and at least one small data file
for _ in range(100):
    write_deltalake(DATA_DIR, df, mode="append")

In [None]:
table = DeltaTable(DATA_DIR)
# Merge the many small Parquet files into fewer, larger ones
table.optimize.compact()

In [None]:
pd.DataFrame(table.history())

In [48]:
spread_out_log_timestamps(DATA_DIR)

In [None]:
table = DeltaTable(DATA_DIR)
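# Physically delete unreferenced files older than 48 hours; enforce_retention_duration=False
# permits a retention period shorter than the table's configured minimum
table.vacuum(retention_hours=48, dry_run=False, enforce_retention_duration=False)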
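
To see which data files the log still references and which ones are merely left on disk, the JSON commits can be replayed by hand. This is a simplified sketch: it assumes every JSON commit from version 0 is still present, it ignores checkpoints, and it ignores file age, which vacuum takes into account. `live_and_unreferenced` is a hypothetical helper name, not part of delta-rs.

```python
import json
from pathlib import Path
from urllib.parse import unquote


def live_and_unreferenced(table_dir):
    """Replay the JSON commits and split Parquet files into live vs. unreferenced."""
    root = Path(table_dir)
    log_dir = root / "_delta_log"
    live = set()
    if log_dir.is_dir():
        # Zero-padded names mean lexical order equals version order.
        commits = sorted(p for p in log_dir.glob("*.json") if p.stem.isdigit())
        for commit in commits:
            for line in commit.read_text().splitlines():
                if not line.strip():
                    continue
                action = json.loads(line)
                if "add" in action:
                    live.add(unquote(action["add"]["path"]))
                elif "remove" in action:
                    live.discard(unquote(action["remove"]["path"]))
    on_disk = set()
    if root.is_dir():
        on_disk = {
            p.relative_to(root).as_posix()
            for p in root.rglob("*.parquet")
            if "_delta_log" not in p.relative_to(root).parts
        }
    return sorted(live), sorted(on_disk - live)


live, stale = live_and_unreferenced("my_table")
print(len(live), "live files;", len(stale), "unreferenced files on disk")
```

Comparing the two numbers before and after the vacuum cell gives a rough picture of what was cleaned up.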

In [52]:
pd.DataFrame(table.history())