# Partitioning

Iceberg (and other lakehouses) don't provide the indexes you may be used to from a more traditional data warehouse, but they do provide partitioning, which serves a similar purpose.

Partitioning refers to structuring the way the files are saved to disk in order to co-locate ranges of values. This makes it more likely that the query engine only has to read a few files to get all the requested data instead of all of them.

If you haven't noticed the theme yet, it's all about eliminating as much disk I/O as possible. The fewer files we have to scan, the more performant our query is!

Iceberg implements what it calls *Hidden Partitioning*. To understand what that means, let's digress briefly into the past.

Hive implemented *explicit partitioning*, where the user needs to be aware of the partitioning scheme and explicitly use it when reading and writing.

```{figure} images/hive_partitioning.png
:alt: Hive-style partitioning
:align: center
:figwidth: image

Hive-style partitioning
```

The main issue with Hive-style partitioning is that it is explicit.
Given this partitioning scheme, if I wanted to query the range 2024-01-01 to 2024-02-28, I might want to write this query:

```sql
SELECT * FROM reviews WHERE review_date BETWEEN '2024-01-01' AND '2024-02-28'
```

This query would not benefit from partition pruning, because Hive explicitly expects filters on the year, month and day partition columns:

```sql
SELECT * FROM reviews WHERE year = 2024 AND (month = 1 OR month = 2) AND day BETWEEN 1 AND 31
```

Iceberg hides this complexity from the user, hence **Hidden Partitioning**.
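
To make the idea concrete, here is a minimal conceptual sketch of what "hidden" means, assuming a table partitioned by month. The helper `month_partitions` is hypothetical and is not Iceberg's actual planner. It only illustrates that a plain date-range filter carries enough information to work out which monthly partitions are relevant, so the user never has to spell them out.

```python
from datetime import date


def month_partitions(start: date, end: date) -> list[str]:
    """Return the 'YYYY-MM' partition labels touched by an inclusive date range."""
    if end < start:
        return []
    labels = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        labels.append(f"{year:04d}-{month:02d}")
        # Step to the next calendar month, rolling over the year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return labels


# The date range from the first query only touches two monthly partitions.
print(month_partitions(date(2024, 1, 1), date(2024, 2, 28)))
# ['2024-01', '2024-02']
```

With explicit partitioning the user performs this translation by hand in every query. With hidden partitioning the table format does it during query planning.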

We could have defined our partitioning when we created the table, but like much of data engineering, we often realize later that we needed it. Predicting query patterns up-front is a big ask. 

In [14]:
from schema import house_prices_schema
from utils import read_house_prices, query, catalog, engine, get_iceberg_metadata, fs
from IPython.display import JSON
import polars as pl

Let's reset everything to start from a clean slate

In [42]:
catalog.drop_table("housing.staging_prices", purge_requested=True)

In [43]:
house_prices_t = catalog.create_table_if_not_exists("housing.staging_prices", schema=house_prices_schema, location="s3://warehouse/staging")

Iceberg defines a number of supported `transforms` - functions that Iceberg uses to derive a partition value from a column value, and to map query filters onto partitions. Dates are pretty common in warehouses, so the Year, Month and Day transforms enable intelligent date-based partitioning. For keys and identifiers, Bucket and Truncate are used to ensure a distributed write pattern.
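
As a rough illustration of what these transforms compute, the sketch below re-implements a few of them in plain Python. The function names are hypothetical and this is not the PyIceberg implementation. It follows the definitions in the Iceberg spec, where date transforms produce an ordinal counted from 1970 and Truncate rounds a value down to a fixed width. Bucket is left out because it depends on a 32-bit Murmur3 hash.

```python
from datetime import date

EPOCH = date(1970, 1, 1)


def year_transform(d: date) -> int:
    """Whole years since 1970."""
    return d.year - EPOCH.year


def month_transform(d: date) -> int:
    """Whole months since 1970-01."""
    return (d.year - EPOCH.year) * 12 + (d.month - 1)


def day_transform(d: date) -> int:
    """Whole days since 1970-01-01."""
    return (d - EPOCH).days


def truncate_int(value: int, width: int) -> int:
    """Round an integer down to the nearest multiple of `width`."""
    return value - (value % width)


def truncate_str(value: str, width: int) -> str:
    """Keep only the first `width` characters."""
    return value[:width]


d = date(2015, 3, 14)
print(year_transform(d), month_transform(d))  # 45 542
print(truncate_int(1234, 100), truncate_str("SW1A 1AA", 4))  # 1200 SW1A
```

Every row that maps to the same transformed value lands in the same partition, which is why a month transform groups all transfers from March 2015 together regardless of the day.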

In this case, we know we're interested in date-based queries, and since we don't have a lot of daily activity, partitioning by month sounds like a good starting point.

In [44]:
from pyiceberg.transforms import MonthTransform
with house_prices_t.update_spec() as spec:
    spec.add_field("date_of_transfer", MonthTransform(), "month_date_of_transfer")

Let's have a look at the metadata file after the update

In [45]:
JSON(get_iceberg_metadata(fs, house_prices_t))


Now that we've set up some partitioning, let's load our data to see what that looks like.

In [46]:
import pathlib
files_to_load = sorted(pathlib.Path("data/house_prices/").glob("*.csv"))
files_to_load

[PosixPath('data/house_prices/pp-2015.csv'),
 PosixPath('data/house_prices/pp-2016.csv'),
 PosixPath('data/house_prices/pp-2017.csv'),
 PosixPath('data/house_prices/pp-2018.csv'),
 PosixPath('data/house_prices/pp-2019.csv'),
 PosixPath('data/house_prices/pp-2020.csv'),
 PosixPath('data/house_prices/pp-2021.csv'),
 PosixPath('data/house_prices/pp-2022.csv'),
 PosixPath('data/house_prices/pp-2023.csv'),
 PosixPath('data/house_prices/pp-2024.csv')]

We could imagine that, for each load, we would want to create a tag so that we can easily roll back to that load, so let's do that for fun :).

Let's start by reading in the first file

In [47]:
# Load the data into Iceberg
df = read_house_prices(files_to_load[0]).to_arrow().cast(house_prices_schema.as_arrow())
house_prices_t.append(df)

# Tag the new snapshot - retain it for a month
current_snapshot = house_prices_t.current_snapshot().snapshot_id
house_prices_t.manage_snapshots().create_tag(current_snapshot, '2015_load', max_ref_age_ms=2629746000).commit()
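
The retention value `2629746000` is easier to read once it is derived rather than hard-coded. It appears to correspond to one average Gregorian month (365.2425 days divided by 12) expressed in milliseconds. The helper names below are hypothetical and only show the arithmetic.

```python
# Seconds in an average Gregorian year: 365.2425 days * 86,400 s per day.
SECONDS_PER_GREGORIAN_YEAR = 31_556_952


def average_month_ms() -> int:
    """One twelfth of an average Gregorian year, in milliseconds."""
    return SECONDS_PER_GREGORIAN_YEAR // 12 * 1000


def days_to_ms(days: int) -> int:
    """A fixed number of days, in milliseconds."""
    return days * 24 * 60 * 60 * 1000


print(average_month_ms())  # 2629746000
print(days_to_ms(30))      # 2592000000
```

Either value could be passed as `max_ref_age_ms`. A named constant or helper mainly documents the intent of the retention period.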

Let's have a look at what is happening in the physical storage

In [49]:
fs.ls(f"{house_prices_t.location()}/data")

['warehouse/staging/data/month_date_of_transfer=2015-01',
 'warehouse/staging/data/month_date_of_transfer=2015-02',
 'warehouse/staging/data/month_date_of_transfer=2015-03',
 'warehouse/staging/data/month_date_of_transfer=2015-04',
 'warehouse/staging/data/month_date_of_transfer=2015-05',
 'warehouse/staging/data/month_date_of_transfer=2015-06',
 'warehouse/staging/data/month_date_of_transfer=2015-07',
 'warehouse/staging/data/month_date_of_transfer=2015-08',
 'warehouse/staging/data/month_date_of_transfer=2015-09',
 'warehouse/staging/data/month_date_of_transfer=2015-10',
 'warehouse/staging/data/month_date_of_transfer=2015-11',
 'warehouse/staging/data/month_date_of_transfer=2015-12']

The data is now physically partitioned by year-month, and we can query it without having to know anything about the partitioning. Looking at the Trino query plan, we can see that it's using the partitioning:

In [60]:
print(pl.read_database("EXPLAIN SELECT max(price) as max_price from housing.staging_prices where date_of_transfer between DATE '2015-01-01' AND DATE '2015-06-30'", engine).item(0, 0))

Trino version: 475
Fragment 0 [SINGLE]
    Output layout: [max]
    Output partitioning: SINGLE []
    Output[columnNames = [max_price]]
    │   Layout: [max:integer]
    │   Estimates: {rows: 1 (5B), cpu: 0, memory: 0B, network: 0B}
    │   max_price := max
    └─ Aggregate[type = FINAL]
       │   Layout: [max:integer]
       │   Estimates: {rows: 1 (5B), cpu: 2.12M, memory: 5B, network: 0B}
       │   max := max(max_0)
       └─ LocalExchange[partitioning = SINGLE]
          │   Layout: [max_0:integer]
          │   Estimates: {rows: 444557 (2.12MB), cpu: 0, memory: 0B, network: 0B}
          └─ RemoteSource[sourceFragmentIds = [1]]
                 Layout: [max_0:integer]

Fragment 1 [SOURCE]
    Output layout: [max_0]
    Output partitioning: SINGLE []
    Aggregate[type = PARTIAL]
    │   Layout: [max_0:integer]
    │   Estimates: {rows: 444557 (2.12MB), cpu: ?, memory: ?, network: ?}
    │   max_0 := max(price)
    └─ TableScan[table = lakekeeper:housing.staging_prices$data@7592

But how big were the files we're scanning?

In [61]:
fs.ls("warehouse/staging/data/month_date_of_transfer=2015-01", detail=True)

[{'Key': 'warehouse/staging/data/month_date_of_transfer=2015-01/00000-10-dd04e1f5-abb6-469e-900c-719bc7f95cc1.parquet',
  'LastModified': datetime.datetime(2025, 5, 20, 18, 54, 35, 278000, tzinfo=tzlocal()),
  'ETag': '"52b9c16fb2d785e00ee0432206d2fea5-1"',
  'Size': 2698586,
  'StorageClass': 'STANDARD',
  'type': 'file',
  'size': 2698586,
  'name': 'warehouse/staging/data/month_date_of_transfer=2015-01/00000-10-dd04e1f5-abb6-469e-900c-719bc7f95cc1.parquet'}]
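
Raw byte counts are hard to judge at a glance, so a small helper can turn a listing like the one above into something more readable. This is a sketch under assumptions: `summarise_sizes` is a hypothetical helper, it expects dictionaries with the `name`, `size` and `type` keys shown in the listing, and the 128 MiB threshold is just a default that the caller can change.

```python
def summarise_sizes(entries, min_bytes=128 * 1024**2):
    """Return (file name, size in MiB, is_small) for every file entry."""
    summary = []
    for entry in entries:
        if entry.get("type") != "file":
            continue  # skip directories and other non-file entries
        size = entry["size"]
        short_name = entry["name"].rsplit("/", 1)[-1]
        summary.append((short_name, round(size / 1024**2, 2), size < min_bytes))
    return summary


# The single entry from the listing above, reduced to the keys the helper uses.
listing = [
    {
        "name": "warehouse/staging/data/month_date_of_transfer=2015-01/"
        "00000-10-dd04e1f5-abb6-469e-900c-719bc7f95cc1.parquet",
        "size": 2698586,
        "type": "file",
    }
]

for name, size_mib, is_small in summarise_sizes(listing):
    print(f"{name}: {size_mib} MiB, small={is_small}")
# 00000-10-dd04e1f5-abb6-469e-900c-719bc7f95cc1.parquet: 2.57 MiB, small=True
```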

That's not very big at all. While there are no strict guidelines, the consensus is that Parquet files should be somewhere between 128 MB and 1 GB uncompressed, depending on the use case, as the overhead of reading many small files adds up quickly. Luckily, we can change our partitioning quickly, without having to rewrite our existing files.

In [63]:
from pyiceberg.transforms import YearTransform
with house_prices_t.update_spec() as spec:
    spec.remove_field("month_date_of_transfer")
    spec.add_field("date_of_transfer", YearTransform(), "year_date_of_transfer")

Changing the partitioning doesn't alter existing files; it only affects files written afterwards. To demonstrate, let's load the next file and see the effect.

In [64]:
# Load the data into Iceberg
df = read_house_prices(files_to_load[1]).to_arrow().cast(house_prices_schema.as_arrow())
house_prices_t.append(df)

# Tag the new snapshot - retain it for a month
current_snapshot = house_prices_t.current_snapshot().snapshot_id
house_prices_t.manage_snapshots().create_tag(current_snapshot, '2016_load', max_ref_age_ms=2629746000).commit()

Let's look at the file structure now

In [66]:
fs.ls(f"{house_prices_t.location()}/data", refresh=True)

['warehouse/staging/data/month_date_of_transfer=2015-01',
 'warehouse/staging/data/month_date_of_transfer=2015-02',
 'warehouse/staging/data/month_date_of_transfer=2015-03',
 'warehouse/staging/data/month_date_of_transfer=2015-04',
 'warehouse/staging/data/month_date_of_transfer=2015-05',
 'warehouse/staging/data/month_date_of_transfer=2015-06',
 'warehouse/staging/data/month_date_of_transfer=2015-07',
 'warehouse/staging/data/month_date_of_transfer=2015-08',
 'warehouse/staging/data/month_date_of_transfer=2015-09',
 'warehouse/staging/data/month_date_of_transfer=2015-10',
 'warehouse/staging/data/month_date_of_transfer=2015-11',
 'warehouse/staging/data/month_date_of_transfer=2015-12',
 'warehouse/staging/data/year_date_of_transfer=2016']

In [67]:
fs.ls("warehouse/staging/data/year_date_of_transfer=2016", detail=True)

[{'Key': 'warehouse/staging/data/year_date_of_transfer=2016/00000-0-7b6edc16-a2b9-4e05-bb3a-8418736b5e09.parquet',
  'LastModified': datetime.datetime(2025, 5, 20, 20, 16, 52, 67000, tzinfo=tzlocal()),
  'ETag': '"c665170e9c44745f388a8039c1548d78-3"',
  'Size': 23222373,
  'StorageClass': 'STANDARD',
  'type': 'file',
  'size': 23222373,
  'name': 'warehouse/staging/data/year_date_of_transfer=2016/00000-0-7b6edc16-a2b9-4e05-bb3a-8418736b5e09.parquet'}]

Better - Parquet compresses well, so this is closer to the optimal size. We'll keep this and load the rest.

In [71]:
for filename in files_to_load[2:]:
    # Grab the year from the filename
    year = filename.name[3:7]
    # Read in the CSV
    df = read_house_prices(filename).to_arrow().cast(house_prices_schema.as_arrow())
    print(f"Appending {filename.name} - {len(df):,} rows")
    # Write to Iceberg
    house_prices_t.append(df)
    # Get the new snapshot id
    current_snapshot = house_prices_t.current_snapshot().snapshot_id
    # Tag the new snapshot - retain it for a month
    house_prices_t.manage_snapshots().create_tag(current_snapshot, f'{year}_load', max_ref_age_ms=2629746000).commit()
    print(f"Tagged: {year}_load")

Appending pp-2017.csv - 1,067,118 rows
Tagged: 2017_load
Appending pp-2018.csv - 1,037,085 rows
Tagged: 2018_load
Appending pp-2019.csv - 1,011,237 rows
Tagged: 2019_load
Appending pp-2020.csv - 895,168 rows
Tagged: 2020_load
Appending pp-2021.csv - 1,276,537 rows
Tagged: 2021_load
Appending pp-2022.csv - 841,772 rows
Tagged: 2022_load
Appending pp-2023.csv - 841,772 rows
Tagged: 2023_load
Appending pp-2024.csv - 704,344 rows
Tagged: 2024_load
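
The loop above takes the year from the file name by position (`filename.name[3:7]`), which works as long as every file follows the `pp-YYYY.csv` pattern. A slightly more defensive variant, sketched here with a hypothetical helper `year_from_filename`, fails loudly if an unexpected file ends up in the directory.

```python
import re
from pathlib import Path


def year_from_filename(path: Path) -> str:
    """Extract the four-digit year from a 'pp-YYYY.csv' file name."""
    match = re.fullmatch(r"pp-(\d{4})\.csv", path.name)
    if match is None:
        raise ValueError(f"Unexpected file name: {path.name}")
    return match.group(1)


print(year_from_filename(Path("data/house_prices/pp-2017.csv")))  # 2017
```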


In [74]:
fs.ls(f"{house_prices_t.location()}/data", refresh=True)

['warehouse/staging/data/month_date_of_transfer=2015-01',
 'warehouse/staging/data/month_date_of_transfer=2015-02',
 'warehouse/staging/data/month_date_of_transfer=2015-03',
 'warehouse/staging/data/month_date_of_transfer=2015-04',
 'warehouse/staging/data/month_date_of_transfer=2015-05',
 'warehouse/staging/data/month_date_of_transfer=2015-06',
 'warehouse/staging/data/month_date_of_transfer=2015-07',
 'warehouse/staging/data/month_date_of_transfer=2015-08',
 'warehouse/staging/data/month_date_of_transfer=2015-09',
 'warehouse/staging/data/month_date_of_transfer=2015-10',
 'warehouse/staging/data/month_date_of_transfer=2015-11',
 'warehouse/staging/data/month_date_of_transfer=2015-12',
 'warehouse/staging/data/year_date_of_transfer=2016',
 'warehouse/staging/data/year_date_of_transfer=2017',
 'warehouse/staging/data/year_date_of_transfer=2018',
 'warehouse/staging/data/year_date_of_transfer=2019',
 'warehouse/staging/data/year_date_of_transfer=2020',
 'warehouse/staging/data/year_date

Now Iceberg has two different partition specs to keep track of, so it will split the query planning across the two specs.

![Partition Spec Evolution](images/partition_spec_evolution.png)
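
The split planning can be pictured with a small toy model. This is a conceptual sketch, not how Iceberg or Trino is implemented: the `DataFile` class, the `spec` labels and the file paths are all hypothetical. The point is that each file remembers which partition spec wrote it, and the same date filter is evaluated against each file using that file's own spec.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DataFile:
    path: str
    spec: str       # which partition spec wrote the file: "month" or "year"
    partition: str  # partition value, e.g. "2015-03" or "2016"


def matches(file: DataFile, start: date, end: date) -> bool:
    """Check whether a file's partition can contain rows in [start, end]."""
    if file.spec == "month":
        year, month = map(int, file.partition.split("-"))
        return (start.year, start.month) <= (year, month) <= (end.year, end.month)
    if file.spec == "year":
        return start.year <= int(file.partition) <= end.year
    raise ValueError(f"Unknown spec: {file.spec}")


def plan(files: list[DataFile], start: date, end: date) -> list[str]:
    """Return the paths of the files a scan would have to read."""
    return [f.path for f in files if matches(f, start, end)]


# Twelve monthly files written under the old spec, yearly files under the new one.
files = [DataFile(f"month=2015-{m:02d}/a.parquet", "month", f"2015-{m:02d}") for m in range(1, 13)]
files += [DataFile(f"year={y}/a.parquet", "year", str(y)) for y in range(2016, 2025)]

print(plan(files, date(2015, 11, 1), date(2016, 2, 29)))
# ['month=2015-11/a.parquet', 'month=2015-12/a.parquet', 'year=2016/a.parquet']
```

A query spanning the change of spec still prunes on both sides: it reads two monthly files from 2015 and the single yearly file for 2016.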