# Partitioning

Iceberg and other lakehouse table formats don't provide the indexes you may be used to from a more traditional data warehouse, but they do provide partitioning, which serves a similar purpose.

Partitioning refers to structuring the way the files are saved to disk in order to co-locate ranges of values. This makes it more likely that the query engine only has to read a few files to get all the requested data instead of all of them.

If you haven't noticed the theme yet, it's all about eliminating as much disk I/O as possible. The fewer files we have to scan, the more performant our query is!
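As a rough illustration of that idea, here is a minimal sketch in plain Python (not Iceberg's actual query planner). The file names and the `files_to_scan` helper are hypothetical; the point is only that knowing which month each file holds lets us skip every file outside the requested range.

```python
from datetime import date

# Hypothetical data files, each labelled with the (year, month) partition it holds.
files = {
    "data/2023-12.parquet": (2023, 12),
    "data/2024-01.parquet": (2024, 1),
    "data/2024-02.parquet": (2024, 2),
    "data/2024-03.parquet": (2024, 3),
}


def files_to_scan(files, start, end):
    """Return only the files whose month overlaps the requested date range."""
    lo = (start.year, start.month)
    hi = (end.year, end.month)
    return sorted(path for path, part in files.items() if lo <= part <= hi)


selected = files_to_scan(files, date(2024, 1, 1), date(2024, 2, 28))
print(selected)  # ['data/2024-01.parquet', 'data/2024-02.parquet']
```

Only two of the four files are read here; without the partition information, all four would have to be opened.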

Iceberg implements what it calls *Hidden Partitioning*. To understand what that means, let's digress briefly into the past.

Hive implemented *Explicit Partitioning*, where the user needs to be aware of the partitioning scheme and explicitly use it when reading and writing.

```{figure} images/hive_partitioning.png
:alt: Hive-style partitioning
:align: center
:figwidth: image

Hive-style partitioning
```
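To make the layout concrete, the sketch below builds the kind of directory path that Hive-style partitioning produces, with one `key=value` folder per partition column. The `hive_partition_path` helper and the `reviews` table root are illustrative assumptions; real tables may use different column names or zero-padded values.

```python
from datetime import date


def hive_partition_path(table_root: str, d: date) -> str:
    """Build a Hive-style directory path: one key=value folder per partition column."""
    return f"{table_root}/year={d.year}/month={d.month}/day={d.day}"


path = hive_partition_path("reviews", date(2024, 1, 15))
print(path)  # reviews/year=2024/month=1/day=15
```

Because the partition columns (`year`, `month`, `day`) are separate from the original date column, both writers and readers have to know about them.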

The main issue with Hive-style partitioning is that it is explicit.
Given this partitioning scheme, if I wanted to query the range 2024-01-01 to 2024-02-28, I might want to write this query:

```sql
SELECT * FROM reviews WHERE review_date between '2024-01-01' AND '2024-02-28'
```

This query would not benefit from the partitioning, as Hive expects explicit filters on the year, month and day partition columns. Instead, we would have to write something like:

```sql
SELECT * FROM reviews WHERE year = 2024 AND (month = 1 OR month = 2) AND day BETWEEN 1 AND 31
```

Iceberg hides this complexity away from the user, hence the name **Hidden Partitioning**.
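The following is a simplified sketch of the idea, not Iceberg's actual implementation. The table stores a transform of the source column (Iceberg's month transform counts months since 1970-01), so a planner can apply the same transform to a filter on the original column and work out which partitions to read. The helper names are hypothetical.

```python
from datetime import date


def month_transform(d: date) -> int:
    """Months elapsed since 1970-01, mirroring how Iceberg defines its month transform."""
    return (d.year - 1970) * 12 + (d.month - 1)


def partitions_for_range(start: date, end: date) -> range:
    """Partition values that need to be read for: start <= column <= end."""
    return range(month_transform(start), month_transform(end) + 1)


# The user filters on the original date column; the partition values are derived for them.
needed = partitions_for_range(date(2024, 1, 1), date(2024, 2, 28))
print(list(needed))  # [648, 649]
```

The user never mentions a month column in the query; the mapping from the date filter to the two month partitions happens behind the scenes.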

We could have defined our partitioning when we created the table, but as so often happens in data engineering, we only realized later that we needed it.

In [22]:
from schema import house_prices_schema
from utils import read_house_prices, query, catalog, engine

In [4]:
catalog.drop_table("house_prices.raw", purge_requested=True)

In [7]:
house_prices_t = catalog.create_table_if_not_exists("house_prices.raw", schema=house_prices_schema, location="s3://warehouse/raw")

In [21]:
from pyiceberg.transforms import MonthTransform

# Partition by the month of date_of_transfer; the spec change is committed when the block exits
with house_prices_t.update_spec() as spec:
    spec.add_field("date_of_transfer", MonthTransform(), "month_date_of_transfer")

Running this cell again once the field has been added raises an error, because the partition field already exists:

ValueError: Duplicate partition field for $date_of_transfer=$Reference(name='date_of_transfer'), $1000: month_date_of_transfer: month(3) already exists

In [9]:
import pathlib
files_to_load = sorted(pathlib.Path("data").glob("*.csv"))
files_to_load

[PosixPath('data/pp-2015.csv'),
 PosixPath('data/pp-2016.csv'),
 PosixPath('data/pp-2017.csv'),
 PosixPath('data/pp-2018.csv'),
 PosixPath('data/pp-2019.csv'),
 PosixPath('data/pp-2020.csv'),
 PosixPath('data/pp-2021.csv'),
 PosixPath('data/pp-2022.csv'),
 PosixPath('data/pp-2023.csv'),
 PosixPath('data/pp-2024.csv')]

In [10]:
for filename in files_to_load:
    # File names look like pp-2015.csv, so characters 3 to 6 hold the year
    year = filename.name[3:7]
    df = read_house_prices(filename).to_arrow().cast(house_prices_schema.as_arrow())
    print(f"Appending {filename.name} - {len(df):,} rows")
    house_prices_t.append(df)
    # Tag the snapshot created by this append so each yearly load can be queried later
    current_snapshot = house_prices_t.current_snapshot().snapshot_id
    house_prices_t.manage_snapshots().create_tag(current_snapshot, f'{year}_load').commit()
    print(f"Tagged: {year}_load")

Appending pp-2015.csv - 1,010,755 rows
Tagged: 2015_load
Appending pp-2016.csv - 1,046,018 rows
Tagged: 2016_load
Appending pp-2017.csv - 1,067,118 rows
Tagged: 2017_load
Appending pp-2018.csv - 1,037,085 rows
Tagged: 2018_load
Appending pp-2019.csv - 1,011,237 rows
Tagged: 2019_load
Appending pp-2020.csv - 895,168 rows
Tagged: 2020_load
Appending pp-2021.csv - 1,276,537 rows
Tagged: 2021_load
Appending pp-2022.csv - 841,772 rows
Tagged: 2022_load
Appending pp-2023.csv - 841,772 rows
Tagged: 2023_load
Appending pp-2024.csv - 704,344 rows
Tagged: 2024_load


In [11]:
house_prices_t.refs()

{'2020_load': SnapshotRef(snapshot_id=5245330294268174244, snapshot_ref_type=SnapshotRefType.TAG, min_snapshots_to_keep=None, max_snapshot_age_ms=None, max_ref_age_ms=None),
 '2016_load': SnapshotRef(snapshot_id=1738779038844655692, snapshot_ref_type=SnapshotRefType.TAG, min_snapshots_to_keep=None, max_snapshot_age_ms=None, max_ref_age_ms=None),
 '2019_load': SnapshotRef(snapshot_id=5890171336967504561, snapshot_ref_type=SnapshotRefType.TAG, min_snapshots_to_keep=None, max_snapshot_age_ms=None, max_ref_age_ms=None),
 '2023_load': SnapshotRef(snapshot_id=2420041366645628972, snapshot_ref_type=SnapshotRefType.TAG, min_snapshots_to_keep=None, max_snapshot_age_ms=None, max_ref_age_ms=None),
 '2015_load': SnapshotRef(snapshot_id=913501148215003094, snapshot_ref_type=SnapshotRefType.TAG, min_snapshots_to_keep=None, max_snapshot_age_ms=None, max_ref_age_ms=None),
 '2017_load': SnapshotRef(snapshot_id=8685362870842111241, snapshot_ref_type=SnapshotRefType.TAG, min_snapshots_to_keep=None, max_s

In [20]:
query("select count(*) as num_rows from house_prices.raw for version as of '2024_load'")

num_rows
i64
9731806


In [28]:
import polars as pl

In [None]:
pl.scan_iceberg(house_prices_t).select(
    pl.col("paon"),
    pl.col("saon"),
    pl.col("street"),
    pl.col("locality"),
    pl.col("town"),
    pl.col("district"),
    pl.col("county"),
    pl.col("postcode"),
).unique()