# Partitioning

Iceberg (and other lakehouses) don't provide the indexes that you may be used to from a more traditional data warehouse, but they do provide partitioning, which serves a similar purpose.

Partitioning refers to structuring the way the files are saved to disk in order to co-locate ranges of values. This makes it more likely that the query engine only has to read a few files to get all the requested data instead of all of them.
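
As a rough illustration of the idea, here is a toy in-memory model (not how Iceberg actually stores data). It groups rows into one "file" per month and counts how many files a date filter has to open. The sample `rows` and the `partition_key` helper are made up for this example.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample rows: (date_of_transfer, price)
rows = [
    (date(2015, 1, 14), 250_000),
    (date(2015, 1, 30), 310_000),
    (date(2015, 2, 3), 180_000),
    (date(2015, 3, 21), 420_000),
    (date(2015, 3, 22), 95_000),
]


def partition_key(d: date) -> str:
    """Toy partition function: one partition per calendar month."""
    return f"{d.year:04d}-{d.month:02d}"


# "Write" the rows: each partition key stands in for one file on disk
files = defaultdict(list)
for transfer_date, price in rows:
    files[partition_key(transfer_date)].append((transfer_date, price))


def files_to_read(start: date, end: date) -> list[str]:
    """Return only the partitions that can contain dates in [start, end]."""
    lo, hi = partition_key(start), partition_key(end)
    # Zero-padded YYYY-MM keys sort the same way as the dates they represent
    return sorted(key for key in files if lo <= key <= hi)


needed = files_to_read(date(2015, 1, 1), date(2015, 1, 31))
print(f"Reading {len(needed)} of {len(files)} files: {needed}")
```

Without the grouping, every row would have to be inspected to answer the same question.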

If you haven't noticed the theme yet, it's all about eliminating as much disk I/O as possible. The fewer files we have to scan, the more performant our query is!

Iceberg implements what it calls *Hidden Partitioning*. Let's digress a little into the past to understand what that means.

Hive implemented *explicit partitioning*, where the user needs to be aware of the partitioning scheme and use it explicitly when reading and writing.

```{figure} images/hive_partitioning.png
:alt: Hive-style partitioning
:align: center
:figwidth: image

Hive-style partitioning
```

The main issue with Hive-style partitioning is that it is explicit.
Given this partitioning scheme, if I wanted to query the range 2024-01-01 to 2024-02-28, I might want to write this query:

```sql
SELECT * FROM reviews WHERE review_date between '2024-01-01' AND '2024-02-28'
```

This query would not take advantage of the partitioning, as Hive explicitly expects filters on the year, month and day partition columns:

```sql
SELECT * FROM reviews WHERE year = 2024 AND (month = 1 OR month = 2) AND day BETWEEN 1 AND 31
```

Iceberg hides this complexity from the user, hence **Hidden Partitioning**.
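
To get a feel for the bookkeeping that explicit partitioning pushes onto the user, here is a minimal sketch of a helper that turns a date range into the year/month predicate you would otherwise write by hand. The function name `hive_partition_filter` is hypothetical, and the day column is ignored for brevity.

```python
from datetime import date


def hive_partition_filter(start: date, end: date) -> str:
    """Build the year/month predicate a user would have to add by hand."""
    clauses = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        clauses.append(f"(year = {year} AND month = {month})")
        # Step to the next calendar month, rolling over at December
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return " OR ".join(clauses)


print(hive_partition_filter(date(2024, 1, 1), date(2024, 2, 28)))
# (year = 2024 AND month = 1) OR (year = 2024 AND month = 2)
```

With hidden partitioning, this translation happens inside the table format, so the query only needs the plain `review_date` filter.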

We could have defined our partitioning when we created the table, but like much of data engineering, we often realize later that we needed it. Predicting query patterns up-front is a big ask. 

In [1]:
from schema import house_prices_schema
from utils import read_house_prices, catalog, engine, get_iceberg_metadata, fs
from IPython.display import JSON
import polars as pl

Let's reset everything to start from a clean slate

In [2]:
catalog.drop_table("housing.staging_prices", purge_requested=True)

In [3]:
house_prices_t = catalog.create_table_if_not_exists(
    "housing.staging_prices",
    schema=house_prices_schema,
    location="s3://warehouse/staging",
)

## Hidden Partitioning

Iceberg defines a number of supported `transforms` - functions that derive a partition value from a column value, which Iceberg uses both when writing data and when mapping a query filter onto partitions. Dates are pretty common in warehouses, so the Year, Month and Day transforms enable intelligent date-based partitioning. For keys and identifiers, Bucket and Truncate are used to ensure a distributed write pattern.
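
To make the transforms less abstract, here is a plain-Python sketch of what the date and truncate transforms compute, based on the definitions in the Iceberg table spec. These helper functions are illustrative only and are not part of PyIceberg. Bucket is left out because it depends on a Murmur3 hash of the value.

```python
from datetime import date

EPOCH = date(1970, 1, 1)


def year_transform(d: date) -> int:
    """Years since 1970."""
    return d.year - EPOCH.year


def month_transform(d: date) -> int:
    """Months since 1970-01."""
    return (d.year - EPOCH.year) * 12 + (d.month - 1)


def day_transform(d: date) -> int:
    """Days since 1970-01-01."""
    return (d - EPOCH).days


def truncate_int(value: int, width: int) -> int:
    """Round an integer down to the nearest multiple of width."""
    return value - (value % width)


def truncate_str(value: str, width: int) -> str:
    """Keep only the first `width` characters."""
    return value[:width]


d = date(2015, 3, 21)
print(year_transform(d))                  # 45
print(month_transform(d))                 # 542
print(truncate_int(1234, 100))            # 1200
print(truncate_str("WORCESTERSHIRE", 3))  # WOR
```

The partition values are stored as these integers; the human-readable form such as `2015-03` is how they are rendered in file paths.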

In this case, we know we're interested in date-based queries, and since we don't have a lot of daily activity, partitioning by month sounds like a good starting point.

In [4]:
from pyiceberg.transforms import MonthTransform, YearTransform

with house_prices_t.update_spec() as spec:
    spec.add_field("date_of_transfer", MonthTransform(), "month_date_of_transfer")

Let's have a look at the metadata file after the update

In [5]:
JSON(get_iceberg_metadata(fs, house_prices_t))

<IPython.core.display.JSON object>

Now that we've set up some partitioning, let's load in our data to see what that looks like.

In [6]:
import pathlib

files_to_load = sorted(pathlib.Path("data/house_prices/").glob("*.csv"))
files_to_load

[PosixPath('data/house_prices/pp-2015.csv'),
 PosixPath('data/house_prices/pp-2016.csv'),
 PosixPath('data/house_prices/pp-2017.csv'),
 PosixPath('data/house_prices/pp-2018.csv'),
 PosixPath('data/house_prices/pp-2019.csv'),
 PosixPath('data/house_prices/pp-2020.csv'),
 PosixPath('data/house_prices/pp-2021.csv'),
 PosixPath('data/house_prices/pp-2022.csv'),
 PosixPath('data/house_prices/pp-2023.csv'),
 PosixPath('data/house_prices/pp-2024.csv')]

We could imagine that for each load (one file per year here), we would want to generate a tag so we can easily roll back to a given load, so let's do that for fun :).

Let's start by reading in the first file and loading it to our Iceberg table

In [7]:
# Load the data into Iceberg
df = read_house_prices(files_to_load[0]).to_arrow().cast(house_prices_schema.as_arrow())
house_prices_t.append(df)

year = files_to_load[0].name[3:7]
# Tag the new snapshot - retain it for a month
current_snapshot = house_prices_t.current_snapshot().snapshot_id
house_prices_t.manage_snapshots().create_tag(
    current_snapshot, f"{year}_load", max_ref_age_ms=2629746000
).commit()
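
The value passed as `max_ref_age_ms` above appears to be one twelfth of an average Gregorian year, expressed in milliseconds. A quick check of that arithmetic (the constant name `AVG_YEAR_MS` is just for this example):

```python
from datetime import timedelta

# Average Gregorian year: 365.2425 days = 31,556,952 seconds
AVG_YEAR_MS = 31_556_952 * 1000

# One twelfth of that is the "month" used for max_ref_age_ms
month_ms = AVG_YEAR_MS // 12
print(month_ms)  # 2629746000

# The same retention period expressed as a timedelta
print(timedelta(milliseconds=month_ms))  # 30 days, 10:29:06
```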

Let's have a look at what is happening in the physical storage

In [8]:
fs.ls(f"{house_prices_t.location()}/data")

['warehouse/staging/data/month_date_of_transfer=2015-01',
 'warehouse/staging/data/month_date_of_transfer=2015-02',
 'warehouse/staging/data/month_date_of_transfer=2015-03',
 'warehouse/staging/data/month_date_of_transfer=2015-04',
 'warehouse/staging/data/month_date_of_transfer=2015-05',
 'warehouse/staging/data/month_date_of_transfer=2015-06',
 'warehouse/staging/data/month_date_of_transfer=2015-07',
 'warehouse/staging/data/month_date_of_transfer=2015-08',
 'warehouse/staging/data/month_date_of_transfer=2015-09',
 'warehouse/staging/data/month_date_of_transfer=2015-10',
 'warehouse/staging/data/month_date_of_transfer=2015-11',
 'warehouse/staging/data/month_date_of_transfer=2015-12']

The data is now physically partitioned by year-month, and we can use it without having to know anything about the partitioning. To show how query engines can take advantage of this, let's compare two SQL statements in Trino.

Looking at the Trino query plans, we can see that the final stage of the first query receives twice as many input rows as the second (12 versus 6), which is consistent with the first query reading all twelve monthly partitions and the second only six.

In [9]:
# No partition on 'county'
print(
    pl.read_database(
        "EXPLAIN ANALYZE SELECT max(price) as max_price from housing.staging_prices where county = 'WORCESTERSHIRE'",
        engine,
    ).item(0, 0)
)

Trino version: 475
Queued: 795.06us, Analysis: 35.44ms, Planning: 57.33ms, Execution: 299.54ms
Fragment 1 [SINGLE]
    CPU: 4.23ms, Scheduled: 10.28ms, Blocked 427.47ms (Input: 349.75ms, Output: 0.00ns), Input: 12 rows (60B); per task: avg.: 12.00 std.dev.: 0.00, Output: 1 row (5B)
    Peak Memory: 304B, Tasks count: 1; per task: max: 304B
    Output layout: [max]
    Output partitioning: SINGLE []
    Aggregate[type = FINAL]
    │   Layout: [max:integer]
    │   Estimates: {rows: 1 (5B), cpu: 2.41M, memory: 5B, network: 0B}
    │   CPU: 2.00ms (0.74%), Scheduled: 3.00ms (0.39%), Blocked: 0.00ns (0.00%), Output: 1 row (5B)
    │   Input avg.: 12.00 rows, Input std.dev.: 0.00%
    │   max := max(max_0)
    └─ LocalExchange[partitioning = SINGLE]
       │   Layout: [max_0:integer]
       │   Estimates: {rows: 505404 (2.41MB), cpu: 0, memory: 0B, network: 0B}
       │   CPU: 0.00ns (0.00%), Scheduled: 2.00ms (0.26%), Blocked: 81.00ms (18.79%), Output: 12 rows (60B)
       │   Input avg.: 

In [10]:
# Partition on 'date_of_transfer'
print(
    pl.read_database(
        "EXPLAIN ANALYZE SELECT max(price) as max_price from housing.staging_prices where date_of_transfer between DATE '2015-01-01' AND DATE '2015-06-30'",
        engine,
    ).item(0, 0)
)

Trino version: 475
Queued: 363.33us, Analysis: 56.12ms, Planning: 47.09ms, Execution: 265.00ms
Fragment 1 [SINGLE]
    CPU: 2.10ms, Scheduled: 2.42ms, Blocked 472.89ms (Input: 373.65ms, Output: 0.00ns), Input: 6 rows (30B); per task: avg.: 6.00 std.dev.: 0.00, Output: 1 row (5B)
    Peak Memory: 304B, Tasks count: 1; per task: max: 304B
    Output layout: [max]
    Output partitioning: SINGLE []
    Aggregate[type = FINAL]
    │   Layout: [max:integer]
    │   Estimates: {rows: 1 (5B), cpu: 2.12M, memory: 5B, network: 0B}
    │   CPU: 0.00ns (0.00%), Scheduled: 0.00ns (0.00%), Blocked: 0.00ns (0.00%), Output: 1 row (5B)
    │   Input avg.: 6.00 rows, Input std.dev.: 0.00%
    │   max := max(max_0)
    └─ LocalExchange[partitioning = SINGLE]
       │   Layout: [max_0:integer]
       │   Estimates: {rows: 444598 (2.12MB), cpu: 0, memory: 0B, network: 0B}
       │   CPU: 0.00ns (0.00%), Scheduled: 0.00ns (0.00%), Blocked: 91.00ms (19.57%), Output: 6 rows (30B)
       │   Input avg.: 1.50 

But how big were the files we're scanning?

In [11]:
fs.ls("warehouse/staging/data/month_date_of_transfer=2015-01", detail=True)

[{'Key': 'warehouse/staging/data/month_date_of_transfer=2015-01/00000-4-49b901b0-d684-4657-9205-7953c9306936.parquet',
  'LastModified': datetime.datetime(2025, 6, 6, 8, 52, 14, 325000, tzinfo=tzlocal()),
  'ETag': '"e8ef58c3cf2e96520f4cd18e1674890d-1"',
  'Size': 2696479,
  'StorageClass': 'STANDARD',
  'type': 'file',
  'size': 2696479,
  'name': 'warehouse/staging/data/month_date_of_transfer=2015-01/00000-4-49b901b0-d684-4657-9205-7953c9306936.parquet'}]

Around 2.7 MB - that's not very big at all. While there are no strict guidelines, the consensus is that Parquet files should be somewhere between 128 MB and 1 GB **uncompressed**, depending on the use case, as the overhead of reading many small files adds up quickly.
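
A small helper can flag undersized files in a listing like the one above. This is only a rough sketch: it compares the on-disk (compressed) size against the lower bound of the guideline, which is stated for uncompressed data. The `small_files` function and the `TARGET_MIN_BYTES` threshold are assumptions for this example.

```python
MB = 1_000_000
TARGET_MIN_BYTES = 128 * MB  # lower bound of the range quoted above

# One entry shaped like the output of fs.ls(..., detail=True); sizes are in bytes
listing = [
    {
        "name": "warehouse/staging/data/month_date_of_transfer=2015-01/00000-4-49b901b0-d684-4657-9205-7953c9306936.parquet",
        "type": "file",
        "size": 2_696_479,
    },
]


def small_files(entries: list[dict], threshold: int = TARGET_MIN_BYTES) -> list[tuple[str, float]]:
    """Return (name, size in MB) for every file below the threshold."""
    return [
        (entry["name"], round(entry["size"] / MB, 1))
        for entry in entries
        if entry["type"] == "file" and entry["size"] < threshold
    ]


for name, size_mb in small_files(listing):
    print(f"{size_mb} MB  {name}")
```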

Luckily, we can quickly change our partitioning, without having to rewrite our existing files

In [12]:
with house_prices_t.update_spec() as spec:
    spec.remove_field("month_date_of_transfer")
    spec.add_field("date_of_transfer", YearTransform(), "year_date_of_transfer")

Changing the partitioning doesn't alter existing files; it only affects files written afterwards. To demonstrate, let's load the next file and see the effect.

In [13]:
# Load the data into Iceberg
df = read_house_prices(files_to_load[1]).to_arrow().cast(house_prices_schema.as_arrow())
house_prices_t.append(df)

year = files_to_load[1].name[3:7]
# Tag the new snapshot - retain it for a month
current_snapshot = house_prices_t.current_snapshot().snapshot_id
house_prices_t.manage_snapshots().create_tag(
    current_snapshot, f"{year}_load", max_ref_age_ms=2629746000
).commit()

Let's look at the file structure now

In [14]:
fs.ls(f"{house_prices_t.location()}/data", refresh=True)

['warehouse/staging/data/month_date_of_transfer=2015-01',
 'warehouse/staging/data/month_date_of_transfer=2015-02',
 'warehouse/staging/data/month_date_of_transfer=2015-03',
 'warehouse/staging/data/month_date_of_transfer=2015-04',
 'warehouse/staging/data/month_date_of_transfer=2015-05',
 'warehouse/staging/data/month_date_of_transfer=2015-06',
 'warehouse/staging/data/month_date_of_transfer=2015-07',
 'warehouse/staging/data/month_date_of_transfer=2015-08',
 'warehouse/staging/data/month_date_of_transfer=2015-09',
 'warehouse/staging/data/month_date_of_transfer=2015-10',
 'warehouse/staging/data/month_date_of_transfer=2015-11',
 'warehouse/staging/data/month_date_of_transfer=2015-12',
 'warehouse/staging/data/year_date_of_transfer=2016']

In [15]:
fs.ls("warehouse/staging/data/year_date_of_transfer=2016", detail=True)

[{'Key': 'warehouse/staging/data/year_date_of_transfer=2016/00000-0-8203efbb-3f38-4952-9346-6cc1eb908d15.parquet',
  'LastModified': datetime.datetime(2025, 6, 6, 8, 55, 12, 675000, tzinfo=tzlocal()),
  'ETag': '"0d0fe2f7d88f943e36a9bcc02367685a-3"',
  'Size': 23232332,
  'StorageClass': 'STANDARD',
  'type': 'file',
  'size': 23232332,
  'name': 'warehouse/staging/data/year_date_of_transfer=2016/00000-0-8203efbb-3f38-4952-9346-6cc1eb908d15.parquet'}]

Around 23 MB - better. Parquet compresses well, after all, so this is closer to the optimal size. We'll keep this and load the rest.

In [16]:
for filename in files_to_load[2:]:
    # Grab the year from the filename
    year = filename.name[3:7]
    # Read in the CSV
    df = read_house_prices(filename).to_arrow().cast(house_prices_schema.as_arrow())
    print(f"Appending {filename.name} - {len(df):,} rows")
    # Write to Iceberg
    house_prices_t.append(df)
    # Get the new snapshot id
    current_snapshot = house_prices_t.current_snapshot().snapshot_id
    # Tag the new snapshot - retain it for a month
    house_prices_t.manage_snapshots().create_tag(
        current_snapshot, f"{year}_load", max_ref_age_ms=2629746000
    ).commit()
    print(f"Tagged: {year}_load")

Appending pp-2017.csv - 1,067,186 rows
Tagged: 2017_load
Appending pp-2018.csv - 1,037,178 rows
Tagged: 2018_load
Appending pp-2019.csv - 1,011,415 rows
Tagged: 2019_load
Appending pp-2020.csv - 895,643 rows
Tagged: 2020_load
Appending pp-2021.csv - 1,277,810 rows
Tagged: 2021_load
Appending pp-2022.csv - 1,069,660 rows
Tagged: 2022_load
Appending pp-2023.csv - 848,435 rows
Tagged: 2023_load
Appending pp-2024.csv - 766,641 rows
Tagged: 2024_load


In [17]:
fs.ls(f"{house_prices_t.location()}/data", refresh=True)

['warehouse/staging/data/month_date_of_transfer=2015-01',
 'warehouse/staging/data/month_date_of_transfer=2015-02',
 'warehouse/staging/data/month_date_of_transfer=2015-03',
 'warehouse/staging/data/month_date_of_transfer=2015-04',
 'warehouse/staging/data/month_date_of_transfer=2015-05',
 'warehouse/staging/data/month_date_of_transfer=2015-06',
 'warehouse/staging/data/month_date_of_transfer=2015-07',
 'warehouse/staging/data/month_date_of_transfer=2015-08',
 'warehouse/staging/data/month_date_of_transfer=2015-09',
 'warehouse/staging/data/month_date_of_transfer=2015-10',
 'warehouse/staging/data/month_date_of_transfer=2015-11',
 'warehouse/staging/data/month_date_of_transfer=2015-12',
 'warehouse/staging/data/year_date_of_transfer=2016',
 'warehouse/staging/data/year_date_of_transfer=2017',
 'warehouse/staging/data/year_date_of_transfer=2018',
 'warehouse/staging/data/year_date_of_transfer=2019',
 'warehouse/staging/data/year_date_of_transfer=2020',
 'warehouse/staging/data/year_date

Now Iceberg has two different partition specs to keep track of, so it will split scan planning across the two specs, applying to each file the filter derived for the spec it was written under.

![Partition Spec Evolution](images/partition_spec_evolution.png)
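
The split planning can be pictured with a toy model in which every data file remembers the spec that wrote it, and the date filter is projected separately onto each spec. This is a simplified sketch, not Iceberg's actual planner; the `data_files` structure and the spec ids are assumptions for illustration.

```python
from datetime import date

# Toy manifest: each data file remembers which partition spec wrote it.
# Spec 0 = month(date_of_transfer), spec 1 = year(date_of_transfer).
data_files = [
    {"spec_id": 0, "partition": (2015, m)} for m in range(1, 13)
] + [
    {"spec_id": 1, "partition": (y,)} for y in range(2016, 2025)
]


def matches(data_file: dict, start: date, end: date) -> bool:
    """Project the date range onto the spec that wrote this file."""
    if data_file["spec_id"] == 0:
        # Monthly partitions: compare (year, month) tuples
        return (start.year, start.month) <= data_file["partition"] <= (end.year, end.month)
    # Yearly partitions: compare the year only
    return start.year <= data_file["partition"][0] <= end.year


def plan(start: date, end: date) -> list[dict]:
    """Select the files that may contain rows in [start, end]."""
    return [f for f in data_files if matches(f, start, end)]


selected = plan(date(2015, 4, 1), date(2017, 6, 30))
print([f["partition"] for f in selected])
```

For this range, the monthly spec keeps April to December 2015 and the yearly spec keeps 2016 and 2017, so 11 of the 21 toy files are selected.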

Let's re-examine our query plans to see if we can spot the difference

In [18]:
# No partition on 'county'
print(
    pl.read_database(
        "EXPLAIN ANALYZE SELECT max(price) as max_price from housing.staging_prices where county = 'WORCESTERSHIRE'",
        engine,
    ).item(0, 0)
)

Trino version: 475
Queued: 908.67us, Analysis: 108.34ms, Planning: 79.41ms, Execution: 431.94ms
Fragment 1 [SINGLE]
    CPU: 4.37ms, Scheduled: 6.79ms, Blocked 1.02s (Input: 807.85ms, Output: 0.00ns), Input: 24 rows (120B); per task: avg.: 24.00 std.dev.: 0.00, Output: 1 row (5B)
    Peak Memory: 304B, Tasks count: 1; per task: max: 613B
    Output layout: [max]
    Output partitioning: SINGLE []
    Aggregate[type = FINAL]
    │   Layout: [max:integer]
    │   Estimates: {rows: 1 (5B), cpu: 23.92M, memory: 5B, network: 0B}
    │   CPU: 1.00ms (0.16%), Scheduled: 1.00ms (0.05%), Blocked: 0.00ns (0.00%), Output: 1 row (5B)
    │   Input avg.: 24.00 rows, Input std.dev.: 0.00%
    │   max := max(max_0)
    └─ LocalExchange[partitioning = SINGLE]
       │   Layout: [max_0:integer]
       │   Estimates: {rows: 5015418 (23.92MB), cpu: 0, memory: 0B, network: 0B}
       │   CPU: 0.00ns (0.00%), Scheduled: 1.00ms (0.05%), Blocked: 183.00ms (18.47%), Output: 24 rows (120B)
       │   Input avg

In [19]:
# Partition on 'date_of_transfer'
print(
    pl.read_database(
        "EXPLAIN ANALYZE SELECT max(price) as max_price from housing.staging_prices where date_of_transfer between DATE '2015-01-01' AND DATE '2022-12-31'",
        engine,
    ).item(0, 0)
)

Trino version: 475
Queued: 317.52us, Analysis: 55.28ms, Planning: 41.79ms, Execution: 268.97ms
Fragment 1 [SINGLE]
    CPU: 3.77ms, Scheduled: 5.93ms, Blocked 467.57ms (Input: 367.50ms, Output: 0.00ns), Input: 22 rows (110B); per task: avg.: 22.00 std.dev.: 0.00, Output: 1 row (5B)
    Peak Memory: 304B, Tasks count: 1; per task: max: 328B
    Output layout: [max]
    Output partitioning: SINGLE []
    Aggregate[type = FINAL]
    │   Layout: [max:integer]
    │   Estimates: {rows: 1 (5B), cpu: 40.13M, memory: 5B, network: 0B}
    │   CPU: 1.00ms (0.48%), Scheduled: 1.00ms (0.15%), Blocked: 0.00ns (0.00%), Output: 1 row (5B)
    │   Input avg.: 22.00 rows, Input std.dev.: 0.00%
    │   max := max(max_0)
    └─ LocalExchange[partitioning = SINGLE]
       │   Layout: [max_0:integer]
       │   Estimates: {rows: 8415759 (40.13MB), cpu: 0, memory: 0B, network: 0B}
       │   CPU: 0.00ns (0.00%), Scheduled: 0.00ns (0.00%), Blocked: 97.00ms (20.86%), Output: 22 rows (110B)
       │   Input av