# Time Travelling

Another advantage of Iceberg's metadata structure is that it gives us Time Travel for free. Since all we're doing is storing snapshots and moving pointers, time travelling is essentially just asking to see the data at a previous pointer.
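To make the "snapshots and pointers" idea concrete, here is a toy model in plain Python. It is only an illustration of the concept, not how Iceberg stores metadata: `ToyTable`, `append` and `scan` are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyTable:
    """Toy stand-in for a table: immutable snapshots plus a 'current' pointer."""

    snapshots: dict = field(default_factory=dict)
    current_id: Optional[int] = None

    def append(self, rows: list) -> int:
        # Each commit creates a new snapshot; older snapshots are never modified.
        previous = self.snapshots.get(self.current_id, ())
        new_id = len(self.snapshots) + 1
        self.snapshots[new_id] = previous + tuple(rows)
        self.current_id = new_id
        return new_id

    def scan(self, snapshot_id: Optional[int] = None) -> tuple:
        # Time travel is just reading through an older pointer.
        chosen = self.current_id if snapshot_id is None else snapshot_id
        return self.snapshots[chosen]


table = ToyTable()
first = table.append(["2024-a", "2024-b"])
second = table.append(["2023-a"])
print(len(table.scan()), len(table.scan(snapshot_id=first)))
```

Reading the current pointer sees all three rows, while reading through the first snapshot sees only the two rows that existed at that commit.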

In [None]:
import sqlalchemy as sa
import polars as pl
from pyiceberg.catalog.rest import RestCatalog
from IPython.display import display
pl.Config.set_thousands_separator(',')

In [None]:
engine = sa.create_engine("trino://trino:@trino:8080/lakekeeper")
catalog = RestCatalog(
    "lakekeeper", uri="http://lakekeeper:8181/catalog", warehouse="lakehouse"
)
house_prices_t = catalog.load_table("housing.staging_prices")

## Python API vs SQL
PyIceberg offers APIs for inspecting table metadata. They return PyArrow tables, so we can use Polars to pretty-print them.

In [None]:
with pl.Config(thousands_separator=None):
    display(pl.from_arrow(house_prices_t.inspect.history()))

The SQL equivalent depends on the query engine. Trino exposes metadata tables by appending `$` and the metadata table name to the table name.

In [None]:
history = pl.read_database(
    'SELECT * FROM housing."staging_prices$history" order by made_current_at', engine
)
with pl.Config(thousands_separator=None):
    display(history)

Now that we have a list of snapshots, we can demonstrate time travelling. We loaded 2024, 2023 and 2022 data into our table, so we should see a different row count in each snapshot.

In [None]:
pl.read_database("SELECT count(transaction_id) as num_rows FROM housing.staging_prices", engine)

The time travel syntax also varies by query engine, but Trino uses the `FOR VERSION AS OF` syntax

In [None]:
pl.read_database(
    "SELECT count(transaction_id) as num_rows from housing.staging_prices for version as of 4406190551159350418",
    engine,
)
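Hard-coding a snapshot ID is brittle, since the IDs differ on every run. A minimal sketch of picking the ID from the history listing and building the query from it follows. The helper names are hypothetical, and the example rows are placeholders shaped like the history output (only the first snapshot ID is taken from this notebook; the timestamps and the other ID are made up).

```python
from datetime import datetime


def oldest_snapshot_id(history_rows: list) -> int:
    """Return the snapshot_id of the earliest entry in a history listing."""
    earliest = min(history_rows, key=lambda row: row["made_current_at"])
    return int(earliest["snapshot_id"])


def count_as_of_version(table_name: str, snapshot_id: int) -> str:
    """Build a Trino-style row count query pinned to one snapshot."""
    # int() guards against pasting anything other than a numeric ID into the SQL text.
    return (
        f"SELECT count(transaction_id) AS num_rows "
        f"FROM {table_name} FOR VERSION AS OF {int(snapshot_id)}"
    )


# Placeholder rows: the timestamps and the second ID are invented for illustration.
example_history = [
    {"made_current_at": datetime(2025, 6, 4, 20, 45), "snapshot_id": 1111111111111111111},
    {"made_current_at": datetime(2025, 6, 4, 20, 0), "snapshot_id": 4406190551159350418},
]
first_id = oldest_snapshot_id(example_history)
query = count_as_of_version("housing.staging_prices", first_id)
print(query)
```

With the `history` dataframe loaded earlier, `history.to_dicts()` should produce rows in this shape.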

Pyiceberg exposes a similar API, where we can specify the `snapshot_id` we want to read

In [None]:
house_prices_t.scan(
    snapshot_id=4406190551159350418, selected_fields=["transaction_id"]
).to_arrow().num_rows

Since most libraries build on PyIceberg, you'll see similar APIs there.

In [None]:
pl.scan_iceberg(house_prices_t, snapshot_id=4406190551159350418).select(
    pl.count("transaction_id")
).collect()

SQL offers some niceties here: we can also time travel by timestamp, and Trino will look up the snapshot that was current at that time, i.e. the most recent one committed at or before the timestamp.

In [None]:
pl.read_database(
    "SELECT count(transaction_id) as num_rows from housing.staging_prices for timestamp as of timestamp '2025-06-04 20:30:00'",
    engine,
)
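The lookup behind a timestamp query can be sketched in a few lines. This is a toy version that assumes the engine resolves a timestamp to the most recent snapshot committed at or before it; `snapshot_as_of` is a hypothetical helper, and the history entries are made-up placeholders.

```python
from bisect import bisect_right
from datetime import datetime


def snapshot_as_of(history: list, as_of: datetime) -> int:
    """Pick the snapshot that was current at `as_of`.

    `history` is a list of (made_current_at, snapshot_id) pairs sorted by time.
    """
    times = [made_current_at for made_current_at, _ in history]
    index = bisect_right(times, as_of)
    if index == 0:
        raise ValueError(f"No snapshot exists at or before {as_of}")
    return history[index - 1][1]


# Placeholder history with small IDs for readability.
example = [
    (datetime(2025, 6, 4, 20, 0), 101),
    (datetime(2025, 6, 4, 20, 45), 102),
    (datetime(2025, 6, 4, 21, 0), 103),
]
chosen = snapshot_as_of(example, datetime(2025, 6, 4, 20, 30))
print(chosen)
```

Note that 20:30 resolves to snapshot 101 even though snapshot 102 is nearer on the clock, because 102 had not been committed yet at that moment.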

Remembering these snapshot ids or pinpointing the exact time we're interested in is tricky for our human brains, so Iceberg supports tagging so that we can provide human-readable references to a given snapshot.

In [None]:
house_prices_t.manage_snapshots().create_tag(
    4406190551159350418, "initial commit"
).commit()

In [None]:
with pl.Config(thousands_separator=None):
    display(pl.from_arrow(house_prices_t.inspect.refs()))

Now that we have this tag, we can reference it directly in our SQL statement

In [None]:
pl.read_database(
    "SELECT count(transaction_id) as num_rows from housing.staging_prices for version as of 'initial commit'",
    engine,
)

PyIceberg is a bit clunkier: since we need to pass a snapshot ID, we first have to look up the `snapshot_id` for our tag.

In [None]:
pl.scan_iceberg(
    house_prices_t,
    snapshot_id=house_prices_t.snapshot_by_name("initial commit").snapshot_id,
).select(pl.count("transaction_id")).collect()
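One caveat with chaining `.snapshot_id` directly is that `snapshot_by_name` returns `None` for an unknown name, which surfaces as an unhelpful `AttributeError`. A small wrapper, sketched below, gives a clearer failure. `snapshot_id_for_ref` is a hypothetical helper, and `fake_table` is a stand-in object so the sketch runs without a catalog.

```python
from types import SimpleNamespace


def snapshot_id_for_ref(table, ref_name: str) -> int:
    """Resolve a tag or branch name to its snapshot ID, failing loudly if it is missing."""
    snapshot = table.snapshot_by_name(ref_name)
    if snapshot is None:
        raise KeyError(f"No tag or branch named {ref_name!r}")
    return snapshot.snapshot_id


# Hypothetical stand-in for a loaded table: only mimics snapshot_by_name.
_refs = {"initial commit": SimpleNamespace(snapshot_id=4406190551159350418)}
fake_table = SimpleNamespace(snapshot_by_name=_refs.get)

resolved = snapshot_id_for_ref(fake_table, "initial commit")
print(resolved)
```

In the notebook, the same helper could be called with `house_prices_t` in place of the stand-in.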

We can also permanently roll back a change, though this is not available through PyIceberg.

In [None]:
with engine.connect() as conn:
    conn.execute(
        sa.text(
            "ALTER TABLE housing.staging_prices EXECUTE rollback_to_snapshot(4406190551159350418)"
        )
    ).fetchone()

```{warning}
The current schema of the table remains unchanged even if we roll back. It still includes the `_loaded_at` column we added earlier.
```

In [None]:
pl.read_database("SELECT count(transaction_id) as num_rows from housing.staging_prices", engine)

When making metadata changes in a different query engine, it's important to refresh our PyIceberg table afterwards, since its metadata is cached.

In [None]:
house_prices_t.refresh();
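Why the refresh matters can be shown with a toy model of a client-side metadata cache. Everything here is hypothetical (`FakeTable` and its methods are invented for this sketch and are not PyIceberg classes); it only mimics the behaviour of a cached pointer going stale.

```python
class FakeTable:
    """Hypothetical stand-in that mimics cached metadata going stale."""

    def __init__(self) -> None:
        self._catalog_snapshot_id = 2  # what the catalog currently points at
        self._cached_snapshot_id = 2  # what this client last loaded

    def rollback_elsewhere(self, snapshot_id: int) -> None:
        # Simulates another engine moving the table pointer.
        self._catalog_snapshot_id = snapshot_id

    def refresh(self) -> "FakeTable":
        # Re-read the pointer from the catalog.
        self._cached_snapshot_id = self._catalog_snapshot_id
        return self

    def current_snapshot_id(self) -> int:
        return self._cached_snapshot_id


table = FakeTable()
table.rollback_elsewhere(1)
stale = table.current_snapshot_id()
fresh = table.refresh().current_snapshot_id()
print(stale, fresh)
```

Until `refresh()` is called, the client keeps reading through the pointer it loaded before the rollback.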

In [None]:
pl.scan_iceberg(house_prices_t).select(pl.col("transaction_id").len().alias('num_rows')).collect()

In [None]:
with pl.Config(thousands_separator=None):
    display(pl.from_arrow(house_prices_t.inspect.history()))

## Cleaning up

Iceberg provides various routines to clean up files and metadata as orphan files and unused data pile up. Depending on your catalogue, this may be an automated process, but we can manually trigger them via Trino

In [None]:
with engine.connect() as conn:
    # Remove snapshots and corresponding metadata
    conn.execute(
        sa.text(
            "ALTER TABLE housing.staging_prices EXECUTE expire_snapshots(retention_threshold => '0d')"
        )
    ).fetchone()
    # Remove orphaned files not referenced by metadata
    conn.execute(
        sa.text(
            "ALTER TABLE housing.staging_prices EXECUTE remove_orphan_files(retention_threshold => '0d')"
        )
    ).fetchone()
    # Co-locate manifests based on partitioning
    conn.execute(
        sa.text("ALTER TABLE housing.staging_prices EXECUTE optimize_manifests")
    ).fetchone()
    # Compact small files into larger ones
    conn.execute(
        sa.text("ALTER TABLE housing.staging_prices EXECUTE optimize")
    ).fetchone()
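The four statements follow the same pattern, so they can be generated from one place. The sketch below uses a hypothetical helper, `maintenance_statements`, with an assumed default retention of `7d`. That default is a guess at a safer value: Trino enforces a catalog-level minimum retention for `expire_snapshots` and `remove_orphan_files`, so a threshold as short as `0d` is likely to be rejected unless that minimum has been lowered in the catalog configuration.

```python
def maintenance_statements(table_name: str, retention: str = "7d") -> list:
    """Build the Trino maintenance statements for one table, in execution order."""
    return [
        # Remove snapshots and corresponding metadata
        f"ALTER TABLE {table_name} EXECUTE expire_snapshots(retention_threshold => '{retention}')",
        # Remove orphaned files not referenced by metadata
        f"ALTER TABLE {table_name} EXECUTE remove_orphan_files(retention_threshold => '{retention}')",
        # Co-locate manifests based on partitioning
        f"ALTER TABLE {table_name} EXECUTE optimize_manifests",
        # Compact small files into larger ones
        f"ALTER TABLE {table_name} EXECUTE optimize",
    ]


statements = maintenance_statements("housing.staging_prices", retention="0d")
for statement in statements:
    print(statement)
```

Each generated string can then be passed to `conn.execute(sa.text(...))` inside the same `engine.connect()` block used above.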

In [None]:
with pl.Config(thousands_separator=None):
    display(pl.read_database(
        'SELECT * FROM housing."staging_prices$history" order by made_current_at', engine
    ))