# Partitioning

Iceberg, like other lakehouse table formats, doesn't provide the indexes you may be used to from a traditional data warehouse, but it does provide partitioning, which serves a similar purpose.

Partitioning means organizing how the data files are written to disk so that rows with similar values (for example, dates in the same month) end up in the same files. This makes it more likely that the query engine can get all the requested data by reading only a few files instead of all of them.
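To make the idea concrete, here is a minimal sketch in plain Python with no Iceberg involved. It assumes a toy dataset where each "file" is just a list of rows, and it compares how many files a date filter touches when rows are scattered versus grouped by month. The helper names `write_unpartitioned`, `write_by_month`, and `files_to_read` are made up for this illustration.

```python
from collections import defaultdict
from datetime import date

# Toy data: three (date, price) rows per month of 2024.
rows = [
    (date(2024, month, day), 100_000 + month * 1_000 + day)
    for month in range(1, 13)
    for day in (1, 10, 20)
]


def write_unpartitioned(rows, num_files=12):
    """Spread rows round-robin over files, so each file mixes several months."""
    files = defaultdict(list)
    for i, row in enumerate(rows):
        files[f"file-{i % num_files:02d}"].append(row)
    return dict(files)


def write_by_month(rows):
    """Co-locate rows: every month gets its own file."""
    files = defaultdict(list)
    for row in rows:
        d = row[0]
        files[f"month={d.year}-{d.month:02d}"].append(row)
    return dict(files)


def files_to_read(files, start, end):
    """Names of the files holding at least one row with start <= date <= end."""
    return sorted(
        name
        for name, contents in files.items()
        if any(start <= d <= end for d, _ in contents)
    )


start, end = date(2024, 1, 1), date(2024, 2, 28)
scattered = write_unpartitioned(rows)
partitioned = write_by_month(rows)

print(len(files_to_read(scattered, start, end)), "of", len(scattered), "files")      # 6 of 12 files
print(len(files_to_read(partitioned, start, end)), "of", len(partitioned), "files")  # 2 of 12 files
```

Both layouts hold the same rows in the same number of files; only the grouping differs. A real engine decides which files to skip from metadata rather than by opening them, but the effect on I/O is the same.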

If you haven't noticed the theme yet, it's all about eliminating as much disk I/O as possible. The fewer files we have to scan, the faster our query is!

Iceberg implements what it calls *Hidden Partitioning*. To understand what that means, let's digress a little into the past.

Hive implemented *explicit partitioning*, where the user needs to be aware of the partitioning scheme and use it explicitly when reading and writing.

```{figure} images/hive_partitioning.png
:alt: Hive-style partitioning
:align: center
:figwidth: image

Hive-style partitioning
```

The main issue with Hive-style partitioning is that it is explicit.
Given this partitioning scheme, if I wanted to query the range 2024-01-01 to 2024-02-28, I might want to write this query:

```sql
SELECT * FROM reviews WHERE review_date BETWEEN '2024-01-01' AND '2024-02-28'
```

This query would not benefit from the partitioning, because Hive expects explicit filters on the year, month, and day partition columns:

```sql
SELECT * FROM reviews
WHERE year = 2024
  AND ((month = 1 AND day BETWEEN 1 AND 31)
    OR (month = 2 AND day BETWEEN 1 AND 28))
```
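How much bookkeeping this pushes onto the user becomes clearer when you try to generate such a filter. The following is a rough sketch, assuming partition columns named `year`, `month`, and `day` as in the query above; the function `hive_partition_predicate` is hypothetical and not part of Hive or any library.

```python
import calendar
from datetime import date


def hive_partition_predicate(start: date, end: date) -> str:
    """Build a WHERE clause on year/month/day columns covering [start, end]."""
    clauses = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        # Only the first and last month of the range are partial months.
        is_first = (year, month) == (start.year, start.month)
        is_last = (year, month) == (end.year, end.month)
        first_day = start.day if is_first else 1
        last_day = end.day if is_last else calendar.monthrange(year, month)[1]
        clauses.append(
            f"(year = {year} AND month = {month} "
            f"AND day BETWEEN {first_day} AND {last_day})"
        )
        # Step to the next calendar month, rolling over the year if needed.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return " OR ".join(clauses)


print(hive_partition_predicate(date(2024, 1, 1), date(2024, 2, 28)))
```

Every query author has to get this translation right by hand, including month lengths and ranges that cross a year boundary.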

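Iceberg hides this complexity from the user, hence **Hidden Partitioning**. You filter on the original date column as usual, and Iceberg works out which partitions need to be read.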
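As a rough mental model (not PyIceberg's actual implementation), the sketch below mimics what happens behind the scenes with a month partition. The Iceberg spec defines the month transform as the number of months since 1970-01; everything else here, including `data_files`, `partitions_for_range`, and `plan_scan`, is a simplified stand-in invented for illustration.

```python
from datetime import date


def month_transform(d: date) -> int:
    """Months since 1970-01, following the Iceberg spec's month transform."""
    return (d.year - 1970) * 12 + (d.month - 1)


def partitions_for_range(start: date, end: date) -> range:
    """Project a filter on the date column onto month partition values."""
    return range(month_transform(start), month_transform(end) + 1)


# Toy table metadata: partition value -> data files written for that month.
data_files = {
    month_transform(date(2024, m, 1)): [f"2024-{m:02d}.parquet"]
    for m in range(1, 13)
}


def plan_scan(start: date, end: date) -> list[str]:
    """Return only the files whose partition can contain matching rows."""
    wanted = partitions_for_range(start, end)
    return [
        path
        for partition, paths in sorted(data_files.items())
        if partition in wanted
        for path in paths
    ]


# The user only ever mentions the original date range.
print(plan_scan(date(2024, 1, 1), date(2024, 2, 28)))
```

The query never refers to a partition column. The date range is translated into partition values, and the original filter is still applied to the rows inside the selected files.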

We could have defined our partitioning when we created the table, but, as so often in data engineering, we only realized later that we needed it.

In [1]:
from schema import house_prices_schema
from utils import read_house_prices, query, catalog, engine

In [2]:
catalog.drop_table("house_prices.raw", purge_requested=True)

In [3]:
house_prices_t = catalog.create_table_if_not_exists("house_prices.raw", schema=house_prices_schema, location="s3://warehouse/raw")

In [4]:
from pyiceberg.transforms import MonthTransform, YearTransform
with house_prices_t.update_spec() as spec:
    spec.add_field("date_of_transfer", MonthTransform(), "month_date_of_transfer")

# Show adding new field

In [6]:
import pathlib
files_to_load = sorted(pathlib.Path("data/house_prices/").glob("*.csv"))
files_to_load

[PosixPath('data/house_prices/pp-1995.csv'),
 PosixPath('data/house_prices/pp-1996.csv'),
 PosixPath('data/house_prices/pp-1997.csv'),
 PosixPath('data/house_prices/pp-1998.csv'),
 PosixPath('data/house_prices/pp-1999.csv'),
 PosixPath('data/house_prices/pp-2000.csv'),
 PosixPath('data/house_prices/pp-2001.csv'),
 PosixPath('data/house_prices/pp-2002.csv'),
 PosixPath('data/house_prices/pp-2003.csv'),
 PosixPath('data/house_prices/pp-2004.csv'),
 PosixPath('data/house_prices/pp-2005.csv'),
 PosixPath('data/house_prices/pp-2006.csv'),
 PosixPath('data/house_prices/pp-2007.csv'),
 PosixPath('data/house_prices/pp-2008.csv'),
 PosixPath('data/house_prices/pp-2009.csv'),
 PosixPath('data/house_prices/pp-2010.csv'),
 PosixPath('data/house_prices/pp-2011.csv'),
 PosixPath('data/house_prices/pp-2012.csv'),
 PosixPath('data/house_prices/pp-2013.csv'),
 PosixPath('data/house_prices/pp-2014.csv'),
 PosixPath('data/house_prices/pp-2015.csv'),
 PosixPath('data/house_prices/pp-2016.csv'),
 PosixPath

In [7]:
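for filename in files_to_load:
    # File names look like "pp-1995.csv", so characters 3 to 6 hold the year
    year = filename.name[3:7]
    # Load the file and cast it to the Arrow schema that matches the Iceberg table
    df = read_house_prices(filename).to_arrow().cast(house_prices_schema.as_arrow())
    print(f"Appending {filename.name} - {len(df):,} rows")
    house_prices_t.append(df)
    # Tag the snapshot created by this append so we can time travel to it later
    current_snapshot = house_prices_t.current_snapshot().snapshot_id
    house_prices_t.manage_snapshots().create_tag(current_snapshot, f'{year}_load').commit()
    print(f"Tagged: {year}_load")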

Appending pp-1995.csv - 797,040 rows
Tagged: 1995_load
Appending pp-1996.csv - 965,283 rows
Tagged: 1996_load
Appending pp-1997.csv - 1,094,498 rows
Tagged: 1997_load
Appending pp-1998.csv - 1,050,541 rows
Tagged: 1998_load
Appending pp-1999.csv - 1,194,921 rows
Tagged: 1999_load
Appending pp-2000.csv - 1,129,409 rows
Tagged: 2000_load
Appending pp-2001.csv - 1,245,876 rows
Tagged: 2001_load
Appending pp-2002.csv - 1,351,810 rows
Tagged: 2002_load
Appending pp-2003.csv - 1,235,493 rows
Tagged: 2003_load
Appending pp-2004.csv - 1,231,987 rows
Tagged: 2004_load
Appending pp-2005.csv - 1,061,437 rows
Tagged: 2005_load
Appending pp-2006.csv - 1,326,106 rows
Tagged: 2006_load
Appending pp-2007.csv - 1,272,356 rows
Tagged: 2007_load
Appending pp-2008.csv - 649,649 rows
Tagged: 2008_load
Appending pp-2009.csv - 625,307 rows
Tagged: 2009_load
Appending pp-2010.csv - 663,286 rows
Tagged: 2010_load
Appending pp-2011.csv - 661,203 rows
Tagged: 2011_load
Appending pp-2012.csv - 668,918 rows
Tagged

In [8]:
house_prices_t.refs()

{'2016_load': SnapshotRef(snapshot_id=2885797485783118428, snapshot_ref_type=SnapshotRefType.TAG, min_snapshots_to_keep=None, max_snapshot_age_ms=None, max_ref_age_ms=None),
 'main': SnapshotRef(snapshot_id=7191227797506811923, snapshot_ref_type=SnapshotRefType.BRANCH, min_snapshots_to_keep=None, max_snapshot_age_ms=None, max_ref_age_ms=None),
 '2011_load': SnapshotRef(snapshot_id=1276191452546819040, snapshot_ref_type=SnapshotRefType.TAG, min_snapshots_to_keep=None, max_snapshot_age_ms=None, max_ref_age_ms=None),
 '2018_load': SnapshotRef(snapshot_id=8321004510563170038, snapshot_ref_type=SnapshotRefType.TAG, min_snapshots_to_keep=None, max_snapshot_age_ms=None, max_ref_age_ms=None),
 '2014_load': SnapshotRef(snapshot_id=5557644292704981373, snapshot_ref_type=SnapshotRefType.TAG, min_snapshots_to_keep=None, max_snapshot_age_ms=None, max_ref_age_ms=None),
 '2000_load': SnapshotRef(snapshot_id=5142863807827331626, snapshot_ref_type=SnapshotRefType.TAG, min_snapshots_to_keep=None, max_sn

In [9]:
query("select count(*) as num_rows from house_prices.raw for version as of '2024_load'")

num_rows
i64
29978492


In [28]:
import polars as pl

In [None]:
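# Lazily scan the table and keep only the distinct address combinations;
# nothing is read until .collect() is called on the result
pl.scan_iceberg(house_prices_t).select(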
    pl.col("paon"),
    pl.col("saon"),
    pl.col("street"),
    pl.col("locality"),
    pl.col("town"),
    pl.col("district"),
    pl.col("county"),
    pl.col("postcode"),
).unique()