# Time Travelling

Another advantage of Iceberg's metadata structure is that it gives us time travel for free. Since every write stores a new snapshot and moves a pointer to it, time travelling is essentially just asking to see the data at a previous pointer.
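
To make the "snapshots and pointers" idea concrete, here is a minimal toy model in plain Python. It is only a sketch: `ToySnapshot` and `ToyTable` are hypothetical names, and real Iceberg snapshots reference manifest files rather than holding rows directly.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ToySnapshot:
    snapshot_id: int
    parent_id: Optional[int]
    rows: Tuple[str, ...]  # simplification: real snapshots point at manifest files


@dataclass
class ToyTable:
    snapshots: Dict[int, ToySnapshot] = field(default_factory=dict)
    current_id: Optional[int] = None  # the "pointer" to the current snapshot

    def append(self, snapshot_id: int, new_rows: List[str]) -> None:
        """Write a new immutable snapshot, then move the pointer to it."""
        parent = self.snapshots.get(self.current_id)
        inherited = parent.rows if parent else ()
        self.snapshots[snapshot_id] = ToySnapshot(
            snapshot_id, self.current_id, inherited + tuple(new_rows)
        )
        self.current_id = snapshot_id

    def scan(self, snapshot_id: Optional[int] = None) -> List[str]:
        """Read the current snapshot, or an older one if an ID is given."""
        target = self.current_id if snapshot_id is None else snapshot_id
        return list(self.snapshots[target].rows)


table = ToyTable()
table.append(1, ["2024-a", "2024-b"])
table.append(2, ["2023-a"])

print(table.scan())               # reads via the current pointer
print(table.scan(snapshot_id=1))  # "time travel" to the first snapshot
```

Older snapshots are never modified, so reading one is no more expensive than reading the current state.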

In [31]:
import sqlalchemy as sa
import polars as pl
from pyiceberg.catalog.rest import RestCatalog
import pyarrow.csv as pc


In [6]:
engine = sa.create_engine("trino://trino:@trino:8080/lakekeeper")
catalog = RestCatalog("lakekeeper", uri="http://lakekeeper:8181/catalog", warehouse="lakehouse")
house_prices_t = catalog.load_table("housing.staging_prices")

## Python API vs SQL
PyIceberg offers APIs that let us inspect the table metadata. It's all PyArrow under the hood, so we can use Polars to pretty-print the results.

In [33]:
pl.from_arrow(house_prices_t.inspect.history())

made_current_at,snapshot_id,parent_id,is_current_ancestor
datetime[ms],i64,i64,bool
2025-05-19 20:22:34.277,8580268702316458514,,True
2025-05-19 20:43:24.155,5967660870361116733,8.580268702316458e+18,True
2025-05-19 21:10:37.844,3231086783972264595,5.967660870361117e+18,True
2025-05-19 21:11:53.418,1606895426754472649,3.2310867839722644e+18,True


The SQL equivalent depends on the query engine. Trino exposes metadata tables through a `$` suffix on the table name, such as `"staging_prices$history"`.

In [34]:
pl.read_database('SELECT * FROM housing."staging_prices$history" order by made_current_at', engine)

made_current_at,snapshot_id,parent_id,is_current_ancestor
"datetime[μs, UTC]",i64,i64,bool
2025-05-19 20:22:34.277 UTC,8580268702316458514,,True
2025-05-19 20:43:24.155 UTC,5967660870361116733,8.580268702316458e+18,True
2025-05-19 21:10:37.844 UTC,3231086783972264595,5.967660870361117e+18,True
2025-05-19 21:11:53.418 UTC,1606895426754472649,3.2310867839722644e+18,True


Now that we have a list of snapshots, we can demonstrate time travelling. We loaded 2024, 2023 and 2022 data into our table, so we should see a different row count in each snapshot.

In [16]:
pl.Config.set_thousands_separator(',')
pl.read_database('SELECT count(transaction_id) as num_rows FROM housing.staging_prices', engine)

num_rows
i64
2387888


The time travel syntax also varies by query engine; Trino uses `FOR VERSION AS OF`.

In [18]:
pl.read_database('SELECT count(transaction_id) as num_rows from housing.staging_prices for version as of 5967660870361116733', engine)

num_rows
i64
1546116


PyIceberg exposes a similar API, where we specify the `snapshot_id` we want to read.

In [19]:
house_prices_t.scan(snapshot_id=5967660870361116733, selected_fields=['transaction_id']).to_arrow().num_rows



1546116



Since most libraries build on PyIceberg, you'll see similar APIs there.

In [22]:
pl.scan_iceberg(house_prices_t, snapshot_id=5967660870361116733).select(pl.count("transaction_id")).collect()

transaction_id
u32
1546116


SQL offers us some niceties here: we can also time travel via timestamps, and Trino will do the work of looking up the snapshot that was current at that point in time (the most recent one committed at or before the timestamp).

In [26]:
pl.read_database("SELECT count(transaction_id) as num_rows from housing.staging_prices for timestamp as of timestamp '2025-05-19 21:00:00'", engine)

num_rows
i64
1546116
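
The lookup Trino performs for a timestamp can be sketched in a few lines. This is an illustration rather than Trino's actual implementation: `snapshot_as_of` is a hypothetical helper, the history entries are copied from the `$history` output earlier, and the timestamps are assumed to be UTC.

```python
from bisect import bisect_right
from datetime import datetime

# (made_current_at, snapshot_id) pairs copied from the history output
HISTORY = [
    (datetime(2025, 5, 19, 20, 22, 34, 277000), 8580268702316458514),
    (datetime(2025, 5, 19, 20, 43, 24, 155000), 5967660870361116733),
    (datetime(2025, 5, 19, 21, 10, 37, 844000), 3231086783972264595),
    (datetime(2025, 5, 19, 21, 11, 53, 418000), 1606895426754472649),
]


def snapshot_as_of(history, ts):
    """Return the ID of the latest snapshot committed at or before `ts`."""
    times = [made_current_at for made_current_at, _ in history]
    idx = bisect_right(times, ts)  # number of snapshots committed up to `ts`
    if idx == 0:
        raise ValueError(f"no snapshot exists at or before {ts}")
    return history[idx - 1][1]


print(snapshot_as_of(HISTORY, datetime(2025, 5, 19, 21, 0, 0)))
# → 5967660870361116733
```

This is why the query for `21:00:00` returns the same count as snapshot `5967660870361116733`, even though the `21:10:37` snapshot is nearer in wall-clock time.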


Remembering these snapshot IDs or pinpointing the exact time we're interested in is tricky for our human brains, so Iceberg supports tagging, which lets us attach a human-readable name to a given snapshot.

In [27]:
house_prices_t.manage_snapshots().create_tag(5967660870361116733, "initial commit").commit()

In [35]:
pl.Config.set_thousands_separator(None)
pl.from_arrow(house_prices_t.inspect.refs())

name,type,snapshot_id,max_reference_age_in_ms,min_snapshots_to_keep,max_snapshot_age_in_ms
str,cat,i64,i64,i32,i64
"""main""","""BRANCH""",1606895426754472649,,,
"""initial commit""","""TAG""",5967660870361116733,,,


Now that we have this tag, we can reference it directly in our SQL statement

In [37]:
pl.read_database("SELECT count(transaction_id) as num_rows from housing.staging_prices for version as of 'initial commit'", engine)

num_rows
i64
1546116


PyIceberg is a bit clunkier: since the scan needs a snapshot ID, we first have to use PyIceberg to look up the `snapshot_id` for our tag.

In [38]:
pl.scan_iceberg(house_prices_t, snapshot_id=house_prices_t.snapshot_by_name('initial commit').snapshot_id).select(pl.count('transaction_id')).collect()

transaction_id
u32
1546116
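
If this lookup comes up often, it can be wrapped in a small helper. The sketch below assumes only that the table object has a `snapshot_by_name` method returning either a snapshot or `None`, as used in the cell above. `resolve_snapshot_id` and `_StubTable` are hypothetical names; the stub exists purely so the example runs without a catalog.

```python
from types import SimpleNamespace
from typing import Union


def resolve_snapshot_id(table, ref: Union[int, str]) -> int:
    """Accept either a raw snapshot ID or a tag/branch name."""
    if isinstance(ref, int):
        return ref
    snapshot = table.snapshot_by_name(ref)
    if snapshot is None:
        raise KeyError(f"no branch or tag named {ref!r}")
    return snapshot.snapshot_id


class _StubTable:
    """Hypothetical stand-in for a PyIceberg table, for demonstration only."""

    def __init__(self, refs):
        self._refs = refs

    def snapshot_by_name(self, name):
        snapshot_id = self._refs.get(name)
        if snapshot_id is None:
            return None
        return SimpleNamespace(snapshot_id=snapshot_id)


stub = _StubTable({"initial commit": 5967660870361116733})
print(resolve_snapshot_id(stub, "initial commit"))
print(resolve_snapshot_id(stub, 8580268702316458514))
```

With the real table, the result could be passed straight through, for example `pl.scan_iceberg(house_prices_t, snapshot_id=resolve_snapshot_id(house_prices_t, "initial commit"))`.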


We can permanently roll back a change, though this is not available through PyIceberg.

In [41]:
with engine.connect() as conn:
    conn.execute(sa.text("ALTER TABLE housing.staging_prices EXECUTE rollback_to_snapshot(8580268702316458514)")).fetchone()

```{warning}
The current schema of the table remains unchanged even if we roll back. The current schema still includes the `_dwh_loaded_at` column we added earlier.
```

When making metadata changes in a different query engine, it's important to refresh our PyIceberg table, since PyIceberg caches the metadata it loaded.

In [45]:
house_prices_t.refresh()

staging_prices(
  2: price: required int (Sale price stated on the transfer deed.),
  3: date_transfer: required date (Date when the sale was completed, as stated on the transfer deed.),
  1: transaction_id: required string (A reference number which is generated automatically recording each published sale. The number is unique and will change each time a sale is recorded.),
  4: postcode: required string (This is the postcode used at the time of the original transaction. Note that postcodes can be reallocated and these changes are not reflected in the Price Paid Dataset.),
  5: property_type: required string (D = Detached, S = Semi-Detached, T = Terraced, F = Flats/Maisonettes, O = Other),
  6: new_property: required string (Indicates the age of the property and applies to all price paid transactions, residential and non-residential. Y = a newly built property, N = an established residential building),
  7: duration: required string (Relates to the tenure: F = Freehold, L= Leasehold et

In [None]:
# TODO: verify with Trino
pl.read_database("SELECT * FROM housing.staging_prices", engine)

In [46]:
pl.scan_iceberg(house_prices_t).limit(10).collect()

price,date_transfer,transaction_id,postcode,property_type,new_property,duration,paon,saon,street,locality,town,district,county,ppd_category_type,record_status,_dwh_loaded_at
i32,date,str,str,str,str,str,str,str,str,str,str,str,str,str,str,"datetime[μs, UTC]"
225000,2024-10-02,"""{25E9DA80-AD30-555E-E063-4704A…","""DE6 1TW""","""S""","""N""","""F""","""48""","""""","""ACORN DRIVE""","""""","""ASHBOURNE""","""DERBYSHIRE DALES""","""DERBYSHIRE""","""A""","""A""",
120000,2024-10-04,"""{25E9DA80-AD31-555E-E063-4704A…","""SK22 4AH""","""F""","""N""","""L""","""8""","""""","""MEAL STREET""","""NEW MILLS""","""HIGH PEAK""","""HIGH PEAK""","""DERBYSHIRE""","""A""","""A""",
197500,2024-08-19,"""{25E9DA80-AD32-555E-E063-4704A…","""S42 5FN""","""T""","""N""","""F""","""24""","""""","""FARMHOUSE WAY""","""GRASSMOOR""","""CHESTERFIELD""","""NORTH EAST DERBYSHIRE""","""DERBYSHIRE""","""A""","""A""",
275000,2024-07-17,"""{25E9DA80-AD33-555E-E063-4704A…","""S40 3HF""","""D""","""N""","""F""","""22""","""""","""GREENWAYS""","""""","""CHESTERFIELD""","""CHESTERFIELD""","""DERBYSHIRE""","""A""","""A""",
216000,2024-02-09,"""{25E9DA80-AD34-555E-E063-4704A…","""DE24 3GP""","""S""","""N""","""F""","""7""","""""","""LOWICK CLOSE""","""""","""DERBY""","""CITY OF DERBY""","""CITY OF DERBY""","""A""","""A""",
210000,2024-09-25,"""{25E9DA80-AD35-555E-E063-4704A…","""DE6 5PH""","""S""","""N""","""F""","""8""","""""","""GARDNER COURT""","""DOVERIDGE""","""ASHBOURNE""","""DERBYSHIRE DALES""","""DERBYSHIRE""","""A""","""A""",
220000,2024-08-30,"""{25E9DA80-AD36-555E-E063-4704A…","""S43 4ZD""","""S""","""N""","""F""","""2""","""""","""HAWTHORNE ROAD""","""BARLBOROUGH""","""CHESTERFIELD""","""BOLSOVER""","""DERBYSHIRE""","""A""","""A""",
230000,2024-08-30,"""{25E9DA80-AD37-555E-E063-4704A…","""SK17 7PR""","""S""","""N""","""F""","""66""","""""","""VICTORIA PARK ROAD""","""""","""BUXTON""","""HIGH PEAK""","""DERBYSHIRE""","""A""","""A""",
140000,2024-05-29,"""{25E9DA80-AD38-555E-E063-4704A…","""DE65 6AH""","""D""","""N""","""F""","""4""","""""","""LONGLANDS LANE""","""FINDERN""","""DERBY""","""SOUTH DERBYSHIRE""","""DERBYSHIRE""","""A""","""A""",
205000,2024-09-26,"""{25E9DA80-AD39-555E-E063-4704A…","""DE6 5PH""","""S""","""N""","""F""","""5""","""""","""GARDNER COURT""","""DOVERIDGE""","""ASHBOURNE""","""DERBYSHIRE DALES""","""DERBYSHIRE""","""A""","""A""",


In [43]:
house_prices_t.refresh().current_snapshot().snapshot_id

8580268702316458514

In [48]:
pl.read_database('SELECT * FROM housing."staging_prices$history" order by made_current_at', engine)

made_current_at,snapshot_id,parent_id,is_current_ancestor
"datetime[μs, UTC]",i64,i64,bool
2025-05-19 20:22:34.277 UTC,8580268702316458514,,True
2025-05-19 20:43:24.155 UTC,5967660870361116733,8.580268702316458e+18,False
2025-05-19 21:10:37.844 UTC,3231086783972264595,5.967660870361117e+18,False
2025-05-19 21:11:53.418 UTC,1606895426754472649,3.2310867839722644e+18,False
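
The `is_current_ancestor` column can be reproduced by walking the parent chain back from the current snapshot. The sketch below is illustrative only: `current_ancestors` is a hypothetical helper, and the parent IDs are written out as the exact IDs of the preceding snapshots, which is what the rounded `parent_id` values in the output appear to correspond to.

```python
# (snapshot_id, parent_id) pairs based on the history output
HISTORY = [
    (8580268702316458514, None),
    (5967660870361116733, 8580268702316458514),
    (3231086783972264595, 5967660870361116733),
    (1606895426754472649, 3231086783972264595),
]


def current_ancestors(history, current_id):
    """Collect the current snapshot and every snapshot reachable via parent_id."""
    parents = dict(history)
    ancestors = set()
    cursor = current_id
    while cursor is not None and cursor not in ancestors:
        ancestors.add(cursor)
        cursor = parents.get(cursor)
    return ancestors


before = current_ancestors(HISTORY, 1606895426754472649)  # pointer before rollback
after = current_ancestors(HISTORY, 8580268702316458514)   # pointer after rollback

for snapshot_id, _ in HISTORY:
    print(snapshot_id, snapshot_id in before, snapshot_id in after)
```

After the rollback only the first snapshot is on the current lineage, which matches the `False` values in the history table. The later snapshots still exist until they are expired.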


## Cleaning up

Iceberg provides various routines to clean up files and metadata as orphan files and unused data pile up. Depending on your catalogue, this may be automated, but we can also trigger the routines manually via Trino.

In [50]:
with engine.connect() as conn:
    # Remove snapshots and corresponding metadata
    conn.execute(sa.text("ALTER TABLE housing.staging_prices EXECUTE expire_snapshots(retention_threshold => '0d')")).fetchone()
    # Remove orphaned files not referenced by metadata
    conn.execute(sa.text("ALTER TABLE housing.staging_prices EXECUTE remove_orphan_files(retention_threshold => '0d')")).fetchone()
    # Co-locate manifests based on partitioning
    conn.execute(sa.text("ALTER TABLE housing.staging_prices EXECUTE optimize_manifests")).fetchone()
    # Compact small files into larger
    conn.execute(sa.text("ALTER TABLE housing.staging_prices EXECUTE optimize")).fetchone()
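
A `'0d'` threshold is convenient for a demo but aggressive for a real table, and depending on its configuration Trino may reject thresholds below its minimum retention. One option is to build the statements from a single retention parameter. The following is a sketch only: `maintenance_statements` is a hypothetical helper and the `7d` default is an arbitrary choice for illustration.

```python
import re

# Simple guards so that only plain identifiers and durations reach the SQL text
_TABLE_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*$")
_DURATION_RE = re.compile(r"^\d+(\.\d+)?(ns|us|ms|s|m|h|d)$")


def maintenance_statements(table: str, retention: str = "7d") -> list:
    """Return the four maintenance statements from the cell above, in order."""
    if not _TABLE_RE.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    if not _DURATION_RE.match(retention):
        raise ValueError(f"unexpected retention threshold: {retention!r}")
    return [
        # Remove snapshots and corresponding metadata
        f"ALTER TABLE {table} EXECUTE expire_snapshots(retention_threshold => '{retention}')",
        # Remove orphaned files not referenced by metadata
        f"ALTER TABLE {table} EXECUTE remove_orphan_files(retention_threshold => '{retention}')",
        # Rewrite manifests
        f"ALTER TABLE {table} EXECUTE optimize_manifests",
        # Compact small files into larger ones
        f"ALTER TABLE {table} EXECUTE optimize",
    ]


for statement in maintenance_statements("housing.staging_prices"):
    print(statement)
```

Each string could then be passed to `conn.execute(sa.text(statement))` in the same way as the cell above.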

In [56]:
pl.read_database('SELECT * FROM housing."staging_prices$history" order by made_current_at', engine)

made_current_at,snapshot_id,parent_id,is_current_ancestor
"datetime[μs, UTC]",i64,i64,bool
2025-05-19 20:22:34.277 UTC,8580268702316458514,,True
2025-05-19 20:43:24.155 UTC,5967660870361116733,8.580268702316458e+18,False

Two entries remain: the current snapshot and `5967660870361116733`, which appears to have been retained because the `initial commit` tag still references it.


In [54]:
house_prices_t.refresh().current_snapshot()

Snapshot(snapshot_id=8580268702316458514, parent_snapshot_id=None, sequence_number=1, timestamp_ms=1747686154277, manifest_list='s3://warehouse/housing/staging/metadata/snap-8580268702316458514-0-f9ba0a23-01a5-4b0e-94b4-690ee85e28fd.avro', summary=Summary(Operation.APPEND, **{'total-equality-deletes': '0', 'total-position-deletes': '0', 'total-data-files': '1', 'total-files-size': '15839390', 'total-delete-files': '0', 'total-records': '704344', 'added-data-files': '1', 'added-records': '704344', 'added-files-size': '15839390'}), schema_id=0)

In [61]:
pl.scan_iceberg(house_prices_t, snapshot_id=5967660870361116733).select(pl.col("transaction_id").len()).collect()

transaction_id
u32
1546116
