# Time Travelling

Another advantage of Iceberg's metadata structure is that it gives us time travel almost for free. Every commit stores a new snapshot and moves a pointer, so time travelling is essentially just reading the table through an earlier pointer.
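
To make the "snapshots and pointers" idea concrete, here is a minimal sketch in plain Python. `ToyTable` and its methods are hypothetical names used only for illustration and are not part of Pyiceberg. A real Iceberg snapshot references data files rather than copying rows, so treat this as a mental model only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ToyTable:
    """Hypothetical stand-in for an Iceberg table: snapshots plus a pointer."""

    snapshots: Dict[int, Tuple[str, ...]] = field(default_factory=dict)
    current_snapshot_id: Optional[int] = None

    def append(self, snapshot_id: int, rows: List[str]) -> None:
        # A commit never edits an old snapshot. It records a new one
        # and then moves the "current" pointer to it.
        previous = self.snapshots.get(self.current_snapshot_id, ())
        self.snapshots[snapshot_id] = previous + tuple(rows)
        self.current_snapshot_id = snapshot_id

    def scan(self, snapshot_id: Optional[int] = None) -> Tuple[str, ...]:
        # Time travel is just a read through an older pointer.
        target = self.current_snapshot_id if snapshot_id is None else snapshot_id
        return self.snapshots[target]


table = ToyTable()
table.append(1, ["2024-a", "2024-b"])
table.append(2, ["2023-a"])

current_rows = table.scan()
first_commit_rows = table.scan(snapshot_id=1)
print(len(current_rows), len(first_commit_rows))
```

Reading with `snapshot_id=1` returns the table as it looked after the first commit, even though the pointer has since moved on.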

In [13]:
import sqlalchemy as sa
import polars as pl
from pyiceberg.catalog.rest import RestCatalog
from IPython.display import display
pl.Config.set_thousands_separator(',')

polars.config.Config

In [2]:
engine = sa.create_engine("trino://trino:@trino:8080/lakekeeper")
catalog = RestCatalog(
    "lakekeeper", uri="http://lakekeeper:8181/catalog", warehouse="lakehouse"
)
house_prices_t = catalog.load_table("housing.staging_prices")

## Python API vs SQL
Pyiceberg offers some APIs that let us inspect the table metadata. They return Pyarrow tables, so we can use polars to pretty-print them.

In [3]:
pl.from_arrow(house_prices_t.inspect.history())

made_current_at,snapshot_id,parent_id,is_current_ancestor
datetime[ms],i64,i64,bool
2025-05-25 10:39:28.248,561444497425091042,,True
2025-05-25 10:47:03.208,6216576064814123496,561444497425091042,True
2025-05-25 11:55:21.465,7207777689682313573,6216576064814123496,True
2025-05-25 11:56:50.468,6710783448956002675,7207777689682313573,True


The SQL equivalent depends on the query engine. Trino exposes metadata tables by appending a `$` suffix to the quoted table name, e.g. `"staging_prices$history"`

In [17]:
history = pl.read_database(
    'SELECT * FROM housing."staging_prices$history" order by made_current_at', engine
)
with pl.Config(thousands_separator=None):
    display(history)

made_current_at,snapshot_id,parent_id,is_current_ancestor
"datetime[μs, UTC]",i64,i64,bool
2025-05-25 10:39:28.248 UTC,561444497425091042,,True
2025-05-25 10:47:03.208 UTC,6216576064814123496,561444497425091042,True
2025-05-25 11:55:21.465 UTC,7207777689682313573,6216576064814123496,True
2025-05-25 11:56:50.468 UTC,6710783448956002675,7207777689682313573,True
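
The `parent_id` column links each snapshot to the one it was built on, so the history forms a chain. The sketch below walks that chain in plain Python. The parent ids are assumed to be the exact integer ids of the preceding snapshots, and `ancestors` is a hypothetical helper, not a Pyiceberg function.

```python
from typing import List, Optional, Tuple

# (snapshot_id, parent_id) pairs taken from the history listing above.
# Assumption: each parent_id is the exact id of the preceding snapshot.
HISTORY: List[Tuple[int, Optional[int]]] = [
    (561444497425091042, None),
    (6216576064814123496, 561444497425091042),
    (7207777689682313573, 6216576064814123496),
    (6710783448956002675, 7207777689682313573),
]


def ancestors(history: List[Tuple[int, Optional[int]]], snapshot_id: int) -> List[int]:
    """Return snapshot_id followed by its parents, newest first."""
    parents = dict(history)
    chain: List[int] = []
    current: Optional[int] = snapshot_id
    while current is not None:
        chain.append(current)
        current = parents.get(current)
    return chain


lineage = ancestors(HISTORY, 6710783448956002675)
print(lineage)
```

A snapshot counts as a current ancestor when it appears in the lineage of the snapshot the table currently points at.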


Now that we have a list of snapshots, we can demonstrate time travelling. We loaded 2024, 2023 and 2022 data into our table, so we should see different counts in each snapshot.

In [14]:
pl.read_database("SELECT count(transaction_id) as num_rows FROM housing.staging_prices", engine)

num_rows
i64
2613415


The time travel syntax also varies by query engine. Trino uses the `FOR VERSION AS OF` syntax

In [20]:
pl.read_database(
    "SELECT count(transaction_id) as num_rows from housing.staging_prices for version as of 561444497425091042",
    engine,
)

num_rows
i64
704344


Pyiceberg exposes a similar API, where we can specify the `snapshot_id` we want to read

In [21]:
house_prices_t.scan(
    snapshot_id=561444497425091042, selected_fields=["transaction_id"]
).to_arrow().num_rows

704344

Since most libraries build on Pyiceberg, you'll see similar APIs there

In [22]:
pl.scan_iceberg(house_prices_t, snapshot_id=561444497425091042).select(
    pl.count("transaction_id")
).collect()

transaction_id
u32
704344


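SQL offers us some niceties here in that we can time travel via timestamps as well, and Trino will do the work of looking up the snapshot that was current at that point in time (the latest one committed at or before the timestamp, not simply the closest one)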

In [23]:
pl.read_database(
    "SELECT count(transaction_id) as num_rows from housing.staging_prices for timestamp as of timestamp '2025-05-25 10:40:00'",
    engine,
)

num_rows
i64
704344
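
The timestamp lookup can be sketched in plain Python: pick the latest snapshot whose `made_current_at` is at or before the requested time. This is a sketch of the semantics, not Trino's implementation. `snapshot_as_of` is a hypothetical helper, and the timestamps are assumed to be UTC.

```python
from bisect import bisect_right
from datetime import datetime, timezone
from typing import List, Tuple

UTC = timezone.utc

# (made_current_at, snapshot_id) pairs from the history listing, oldest first.
HISTORY: List[Tuple[datetime, int]] = [
    (datetime(2025, 5, 25, 10, 39, 28, 248000, tzinfo=UTC), 561444497425091042),
    (datetime(2025, 5, 25, 10, 47, 3, 208000, tzinfo=UTC), 6216576064814123496),
    (datetime(2025, 5, 25, 11, 55, 21, 465000, tzinfo=UTC), 7207777689682313573),
    (datetime(2025, 5, 25, 11, 56, 50, 468000, tzinfo=UTC), 6710783448956002675),
]


def snapshot_as_of(history: List[Tuple[datetime, int]], ts: datetime) -> int:
    """Return the id of the snapshot that was current at time ts."""
    times = [made_current_at for made_current_at, _ in history]
    idx = bisect_right(times, ts)  # number of snapshots committed at or before ts
    if idx == 0:
        raise ValueError(f"no snapshot exists at or before {ts.isoformat()}")
    return history[idx - 1][1]


early = snapshot_as_of(HISTORY, datetime(2025, 5, 25, 10, 40, 0, tzinfo=UTC))
late = snapshot_as_of(HISTORY, datetime(2025, 5, 25, 11, 56, 30, tzinfo=UTC))
print(early, late)
```

The second lookup shows why "closest in time" would be the wrong rule: 11:56:30 is nearer to the 11:56:50 commit, but that commit had not happened yet, so the 11:55:21 snapshot is returned.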


Remembering these snapshot ids or pinpointing the exact time we're interested in is tricky for our human brains, so Iceberg supports tags: human-readable references to a given snapshot.

In [24]:
house_prices_t.manage_snapshots().create_tag(
    561444497425091042, "initial commit"
).commit()

In [25]:
with pl.Config(thousands_separator=None):
    display(pl.from_arrow(house_prices_t.inspect.refs()))

name,type,snapshot_id,max_reference_age_in_ms,min_snapshots_to_keep,max_snapshot_age_in_ms
str,cat,i64,i64,i32,i64
"""main""","""BRANCH""",6710783448956002675,,,
"""initial commit""","""TAG""",561444497425091042,,,


Now that we have this tag, we can reference it directly in our SQL statement

In [26]:
pl.read_database(
    "SELECT count(transaction_id) as num_rows from housing.staging_prices for version as of 'initial commit'",
    engine,
)

num_rows
i64
704344
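
We have now seen three forms of the Trino time travel clause: by snapshot id, by tag and by timestamp. As a rough sketch, a small helper can build the right clause for us. `time_travel_clause` is a hypothetical function of our own, and it only formats strings, so it never talks to Trino.

```python
from datetime import datetime
from typing import Optional


def time_travel_clause(
    *,
    snapshot_id: Optional[int] = None,
    ref: Optional[str] = None,
    timestamp: Optional[datetime] = None,
) -> str:
    """Build the Trino time travel clause for exactly one kind of target."""
    chosen = [v for v in (snapshot_id, ref, timestamp) if v is not None]
    if len(chosen) != 1:
        raise ValueError("pass exactly one of snapshot_id, ref or timestamp")
    if snapshot_id is not None:
        return f"FOR VERSION AS OF {int(snapshot_id)}"
    if ref is not None:
        # Tags are passed as string literals, so escape embedded quotes.
        escaped = ref.replace("'", "''")
        return f"FOR VERSION AS OF '{escaped}'"
    return f"FOR TIMESTAMP AS OF TIMESTAMP '{timestamp:%Y-%m-%d %H:%M:%S}'"


by_id = time_travel_clause(snapshot_id=561444497425091042)
by_tag = time_travel_clause(ref="initial commit")
by_time = time_travel_clause(timestamp=datetime(2025, 5, 25, 10, 40, 0))

query = f"SELECT count(transaction_id) AS num_rows FROM housing.staging_prices {by_tag}"
print(query)
```

The resulting `query` string can be passed to `pl.read_database` in the same way as the hand-written statements above.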


Pyiceberg is a bit more clunky: since `scan` expects a snapshot ID, we first need to use Pyiceberg to look up the `snapshot_id` for our tag

In [27]:
pl.scan_iceberg(
    house_prices_t,
    snapshot_id=house_prices_t.snapshot_by_name("initial commit").snapshot_id,
).select(pl.count("transaction_id")).collect()

transaction_id
u32
704344
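
To smooth over that clunkiness, we could wrap the lookup in a small helper that accepts either a snapshot id or a tag name. This is a sketch under two assumptions: `resolve_snapshot_id` is our own hypothetical helper, and `snapshot_by_name` is assumed to return `None` for an unknown name. `FakeTable` is a stand-in so the example runs without a catalog.

```python
from types import SimpleNamespace
from typing import Dict, Union


def resolve_snapshot_id(table, ref: Union[int, str]) -> int:
    """Accept a snapshot id or a branch/tag name and return a snapshot id."""
    if isinstance(ref, int):
        return ref
    snapshot = table.snapshot_by_name(ref)
    if snapshot is None:
        # Assumption: an unknown name yields None rather than raising.
        raise KeyError(f"no branch or tag named {ref!r}")
    return snapshot.snapshot_id


class FakeTable:
    """Hypothetical stand-in that mimics only snapshot_by_name."""

    def __init__(self, refs: Dict[str, int]) -> None:
        self._refs = refs

    def snapshot_by_name(self, name: str):
        snapshot_id = self._refs.get(name)
        if snapshot_id is None:
            return None
        return SimpleNamespace(snapshot_id=snapshot_id)


table = FakeTable({"initial commit": 561444497425091042})
from_tag = resolve_snapshot_id(table, "initial commit")
from_id = resolve_snapshot_id(table, 6710783448956002675)
print(from_tag, from_id)
```

With the real table we would call `resolve_snapshot_id(house_prices_t, "initial commit")` and pass the result as `snapshot_id` to `pl.scan_iceberg`.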


We can permanently roll back a change, though this is not available through Pyiceberg

In [28]:
with engine.connect() as conn:
    conn.execute(
        sa.text(
            "ALTER TABLE housing.staging_prices EXECUTE rollback_to_snapshot(561444497425091042)"
        )
    ).fetchone()

```{warning}
The current schema of the table remains unchanged even if we roll back. The current schema still includes the `_dwh_loaded_at` column we added earlier.
```
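
The reason is that Iceberg table metadata tracks the current schema and the current snapshot as two separate pointers, and a rollback only moves the second one. The toy model below illustrates this. `ToyMetadata`, the schema ids and the shortened column lists are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToyMetadata:
    """Hypothetical, heavily simplified view of Iceberg table metadata."""

    schemas: Dict[int, List[str]]  # schema id -> column names
    snapshot_ids: List[int]
    current_schema_id: int
    current_snapshot_id: int

    def rollback_to_snapshot(self, snapshot_id: int) -> None:
        if snapshot_id not in self.snapshot_ids:
            raise KeyError(f"unknown snapshot {snapshot_id}")
        # Only the snapshot pointer moves. current_schema_id is untouched.
        self.current_snapshot_id = snapshot_id

    def current_columns(self) -> List[str]:
        return self.schemas[self.current_schema_id]


meta = ToyMetadata(
    schemas={
        0: ["transaction_id", "price"],  # assumed original schema
        1: ["transaction_id", "price", "_dwh_loaded_at"],  # after adding the column
    },
    snapshot_ids=[561444497425091042, 6710783448956002675],
    current_schema_id=1,
    current_snapshot_id=6710783448956002675,
)

meta.rollback_to_snapshot(561444497425091042)
columns_after_rollback = meta.current_columns()
print(columns_after_rollback)
```

After the rollback the data comes from the first snapshot, but reads still use the latest schema, which is why the added column shows up filled with nulls.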

When making metadata changes in a different query engine, it's important to refresh our Pyiceberg table object, since it caches the metadata it loaded

In [30]:
house_prices_t.refresh();

In [31]:
# TODO: verify with Trino
pl.read_database("SELECT * FROM housing.staging_prices", engine)

date_of_transfer,transaction_id,price,postcode,property_type,new_property,duration,paon,saon,street,locality,town,district,county,ppd_category_type,record_status,_dwh_loaded_at
date,str,i64,str,str,str,str,str,str,str,str,str,str,str,str,str,null
2024-10-02,"""{25E9DA80-AD30-555E-E063-4704A…",225000,"""DE6 1TW""","""S""","""N""","""F""","""48""","""""","""ACORN DRIVE""","""""","""ASHBOURNE""","""DERBYSHIRE DALES""","""DERBYSHIRE""","""A""","""A""",
2024-10-04,"""{25E9DA80-AD31-555E-E063-4704A…",120000,"""SK22 4AH""","""F""","""N""","""L""","""8""","""""","""MEAL STREET""","""NEW MILLS""","""HIGH PEAK""","""HIGH PEAK""","""DERBYSHIRE""","""A""","""A""",
2024-08-19,"""{25E9DA80-AD32-555E-E063-4704A…",197500,"""S42 5FN""","""T""","""N""","""F""","""24""","""""","""FARMHOUSE WAY""","""GRASSMOOR""","""CHESTERFIELD""","""NORTH EAST DERBYSHIRE""","""DERBYSHIRE""","""A""","""A""",
2024-07-17,"""{25E9DA80-AD33-555E-E063-4704A…",275000,"""S40 3HF""","""D""","""N""","""F""","""22""","""""","""GREENWAYS""","""""","""CHESTERFIELD""","""CHESTERFIELD""","""DERBYSHIRE""","""A""","""A""",
2024-02-09,"""{25E9DA80-AD34-555E-E063-4704A…",216000,"""DE24 3GP""","""S""","""N""","""F""","""7""","""""","""LOWICK CLOSE""","""""","""DERBY""","""CITY OF DERBY""","""CITY OF DERBY""","""A""","""A""",
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
2024-02-02,"""{1EAE3DF6-DFF6-9EB1-E063-4704A…",300000,"""OX7 5EB""","""T""","""N""","""F""","""11A""","""""","""BURFORD ROAD""","""""","""CHIPPING NORTON""","""WEST OXFORDSHIRE""","""OXFORDSHIRE""","""A""","""A""",
2024-07-11,"""{1EAE3DF6-DFF7-9EB1-E063-4704A…",640000,"""OX4 4XQ""","""T""","""N""","""F""","""73""","""""","""THE CRESCENT""","""LITTLEMORE""","""OXFORD""","""OXFORD""","""OXFORDSHIRE""","""A""","""A""",
2024-07-19,"""{1EAE3DF6-DFF8-9EB1-E063-4704A…",600000,"""OX11 9NX""","""D""","""N""","""F""","""HIGHLANDS""","""""","""LONDON ROAD""","""BLEWBURY""","""DIDCOT""","""VALE OF WHITE HORSE""","""OXFORDSHIRE""","""A""","""A""",
2024-06-19,"""{1EAE3DF6-DFF9-9EB1-E063-4704A…",495000,"""OX25 4NE""","""S""","""N""","""F""","""86""","""""","""SOUTH STREET""","""CAULCOTT""","""BICESTER""","""CHERWELL""","""OXFORDSHIRE""","""A""","""A""",


In [32]:
pl.scan_iceberg(house_prices_t).limit(10).collect()

date_of_transfer,transaction_id,price,postcode,property_type,new_property,duration,paon,saon,street,locality,town,district,county,ppd_category_type,record_status,_dwh_loaded_at
date,str,i32,str,str,str,str,str,str,str,str,str,str,str,str,str,"datetime[μs, UTC]"
2024-10-02,"""{25E9DA80-AD30-555E-E063-4704A…",225000,"""DE6 1TW""","""S""","""N""","""F""","""48""","""""","""ACORN DRIVE""","""""","""ASHBOURNE""","""DERBYSHIRE DALES""","""DERBYSHIRE""","""A""","""A""",
2024-10-04,"""{25E9DA80-AD31-555E-E063-4704A…",120000,"""SK22 4AH""","""F""","""N""","""L""","""8""","""""","""MEAL STREET""","""NEW MILLS""","""HIGH PEAK""","""HIGH PEAK""","""DERBYSHIRE""","""A""","""A""",
2024-08-19,"""{25E9DA80-AD32-555E-E063-4704A…",197500,"""S42 5FN""","""T""","""N""","""F""","""24""","""""","""FARMHOUSE WAY""","""GRASSMOOR""","""CHESTERFIELD""","""NORTH EAST DERBYSHIRE""","""DERBYSHIRE""","""A""","""A""",
2024-07-17,"""{25E9DA80-AD33-555E-E063-4704A…",275000,"""S40 3HF""","""D""","""N""","""F""","""22""","""""","""GREENWAYS""","""""","""CHESTERFIELD""","""CHESTERFIELD""","""DERBYSHIRE""","""A""","""A""",
2024-02-09,"""{25E9DA80-AD34-555E-E063-4704A…",216000,"""DE24 3GP""","""S""","""N""","""F""","""7""","""""","""LOWICK CLOSE""","""""","""DERBY""","""CITY OF DERBY""","""CITY OF DERBY""","""A""","""A""",
2024-09-25,"""{25E9DA80-AD35-555E-E063-4704A…",210000,"""DE6 5PH""","""S""","""N""","""F""","""8""","""""","""GARDNER COURT""","""DOVERIDGE""","""ASHBOURNE""","""DERBYSHIRE DALES""","""DERBYSHIRE""","""A""","""A""",
2024-08-30,"""{25E9DA80-AD36-555E-E063-4704A…",220000,"""S43 4ZD""","""S""","""N""","""F""","""2""","""""","""HAWTHORNE ROAD""","""BARLBOROUGH""","""CHESTERFIELD""","""BOLSOVER""","""DERBYSHIRE""","""A""","""A""",
2024-08-30,"""{25E9DA80-AD37-555E-E063-4704A…",230000,"""SK17 7PR""","""S""","""N""","""F""","""66""","""""","""VICTORIA PARK ROAD""","""""","""BUXTON""","""HIGH PEAK""","""DERBYSHIRE""","""A""","""A""",
2024-05-29,"""{25E9DA80-AD38-555E-E063-4704A…",140000,"""DE65 6AH""","""D""","""N""","""F""","""4""","""""","""LONGLANDS LANE""","""FINDERN""","""DERBY""","""SOUTH DERBYSHIRE""","""DERBYSHIRE""","""A""","""A""",
2024-09-26,"""{25E9DA80-AD39-555E-E063-4704A…",205000,"""DE6 5PH""","""S""","""N""","""F""","""5""","""""","""GARDNER COURT""","""DOVERIDGE""","""ASHBOURNE""","""DERBYSHIRE DALES""","""DERBYSHIRE""","""A""","""A""",


## Cleaning up

Iceberg provides various routines to clean up files and metadata as orphan files and unused data pile up. Depending on your catalogue, these may run automatically, but we can also trigger them manually via Trino

In [34]:
with engine.connect() as conn:
    # Remove snapshots and corresponding metadata
    conn.execute(
        sa.text(
            "ALTER TABLE housing.staging_prices EXECUTE expire_snapshots(retention_threshold => '0d')"
        )
    ).fetchone()
    # Remove orphaned files not referenced by metadata
    conn.execute(
        sa.text(
            "ALTER TABLE housing.staging_prices EXECUTE remove_orphan_files(retention_threshold => '0d')"
        )
    ).fetchone()
    # Co-locate manifests based on partitioning
    conn.execute(
        sa.text("ALTER TABLE housing.staging_prices EXECUTE optimize_manifests")
    ).fetchone()
    # Compact small files into larger
    conn.execute(
        sa.text("ALTER TABLE housing.staging_prices EXECUTE optimize")
    ).fetchone()

In [36]:
with pl.Config(thousands_separator=None):
    display(pl.read_database(
        'SELECT * FROM housing."staging_prices$history" order by made_current_at', engine
    ))

made_current_at,snapshot_id,parent_id,is_current_ancestor
"datetime[μs, UTC]",i64,i64,bool
2025-05-25 10:39:28.248 UTC,561444497425091042,,True
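
Only one snapshot survives because we rolled back to it and then expired everything with a zero-day retention. The sketch below captures a simplified version of that rule: keep the snapshot the table currently points at, plus anything newer than the cutoff. `expire_snapshots` here is a hypothetical Python function, the "now" value is made up for the example, and real expiry also accounts for other branches, tags and per-reference retention settings.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple

UTC = timezone.utc

# (made_current_at, snapshot_id) pairs from the history before the clean-up.
HISTORY: List[Tuple[datetime, int]] = [
    (datetime(2025, 5, 25, 10, 39, 28, 248000, tzinfo=UTC), 561444497425091042),
    (datetime(2025, 5, 25, 10, 47, 3, 208000, tzinfo=UTC), 6216576064814123496),
    (datetime(2025, 5, 25, 11, 55, 21, 465000, tzinfo=UTC), 7207777689682313573),
    (datetime(2025, 5, 25, 11, 56, 50, 468000, tzinfo=UTC), 6710783448956002675),
]


def expire_snapshots(
    history: List[Tuple[datetime, int]],
    current_snapshot_id: int,
    retention: timedelta,
    now: datetime,
) -> List[int]:
    """Return the snapshot ids that survive expiry (simplified rule)."""
    cutoff = now - retention
    return [
        snapshot_id
        for made_current_at, snapshot_id in history
        if snapshot_id == current_snapshot_id or made_current_at >= cutoff
    ]


now = datetime(2025, 5, 25, 12, 0, 0, tzinfo=UTC)  # assumed time of the clean-up
after_rollback = 561444497425091042

kept_zero_days = expire_snapshots(HISTORY, after_rollback, timedelta(days=0), now)
kept_seven_days = expire_snapshots(HISTORY, after_rollback, timedelta(days=7), now)
print(kept_zero_days, len(kept_seven_days))
```

With a seven-day retention all four snapshots would have been kept. As far as we're aware, Trino also enforces a configured minimum retention (seven days by default), so a `'0d'` threshold only works where that minimum has been lowered.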
