In [2]:
import polars as pl
from pyiceberg.catalog.rest import RestCatalog

# Querying the data

A key selling point of Iceberg is that many different query engines can read from the same data storage.
Let's run some simple queries using a few of them.

Many of these engines use Pyiceberg as a jumping-off point,
either interfacing with it directly or using it to find the current metadata.json.

In [3]:
catalog = RestCatalog("lakekeeper", uri="http://lakekeeper:8181/catalog", warehouse="lakehouse")
table = catalog.load_table("house_prices.raw")

## Pyiceberg

Let's see how Pyiceberg handles querying first. For each of these examples, we'll do something simple, like computing the mean house price per month in 2024.

In [9]:
%%time
iceberg_results = table.scan(
    selected_fields=["price", "date_of_transfer"],
    row_filter="date_of_transfer >= '2024-01-01' and date_of_transfer <= '2024-12-31'",
)
(
    iceberg_results.to_polars()
    .group_by(pl.col("date_of_transfer").dt.month())
    .agg(pl.col("price").mean())
    .sort(by="date_of_transfer")
    .style.fmt_number("price", decimals=2)
)

CPU times: user 50.6 ms, sys: 4.42 ms, total: 55 ms
Wall time: 31.5 ms


date_of_transfer,price
1,388804.09
2,374737.75
3,400307.61
4,397024.66
5,382252.88
6,367380.4
7,384130.91
8,377343.48
9,376216.13
10,378261.83
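
The scan above hard-codes the 2024 date range in its `row_filter` string. A minimal sketch of building that same string for any year follows. `year_row_filter` is a hypothetical helper, not part of Pyiceberg, and it assumes the inclusive start-to-end range used above.

```python
from datetime import date


def year_row_filter(column: str, year: int) -> str:
    """Build an inclusive date-range filter string covering one calendar year."""
    start = date(year, 1, 1)
    end = date(year, 12, 31)
    # Same shape as the row_filter string used in the scan above
    return f"{column} >= '{start.isoformat()}' and {column} <= '{end.isoformat()}'"


row_filter = year_row_filter("date_of_transfer", 2024)
print(row_filter)
```

The result can then be passed as `row_filter=row_filter` in the `table.scan(...)` call.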


## Polars
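Pyiceberg gives us only limited filtering and projection capabilities; it provides the building blocks for libraries that build on top of it. We used Polars to finish the job in the previous example, but Polars can read Iceberg directly, with no need for the extra step. Note that the query below does not apply the 2024 filter, so it averages over every row in the table and the figures differ from the Pyiceberg result above.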

In [8]:
%%time
polars_df = (
    pl.scan_iceberg(table)
    .group_by(pl.col("date_of_transfer").dt.month())
    .agg(pl.col("price").mean())
    .sort(by="date_of_transfer")
    .collect()
)
polars_df.style.fmt_number("price", decimals=2)

CPU times: user 523 ms, sys: 90.2 ms, total: 613 ms
Wall time: 509 ms


date_of_transfer,price
1,392772.3
2,381045.37
3,406686.36
4,398630.34
5,393051.71
6,401433.56
7,395147.61
8,393771.96
9,390506.33
10,383768.35


## DuckDB
DuckDB is also an excellent choice for working with Iceberg, especially if you want to stick to SQL.

It does require some setup: DuckDB doesn't yet know how to talk to the REST catalog, so it needs its own credentials for the bucket. The [duckdb-iceberg](https://github.com/duckdb/duckdb-iceberg) extension recently got additional sponsorship from AWS to improve Iceberg compatibility, so keep an eye on that.

In [10]:
import duckdb

In [15]:
# Create a duckdb connection
conn = duckdb.connect()
# Load the Iceberg extension for DuckDB
conn.install_extension('iceberg')
conn.load_extension('iceberg')
conn.load_extension('avro')

# To be able to read the Iceberg metadata, we need credentials for the bucket
conn.sql("""
CREATE OR REPLACE SECRET minio (
TYPE S3,
ENDPOINT 'minio:9000',
KEY_ID 'minio',
SECRET 'minio1234',
USE_SSL false,
URL_STYLE 'path'
)
""")

┌─────────┐
│ Success │
│ boolean │
├─────────┤
│ true    │
└─────────┘

In [17]:
%%time
# We can read the iceberg data using DuckDB
conn.sql(f"""
SELECT month(date_of_transfer) as transfer_month, mean(price) as mean_price
FROM iceberg_scan('{table.metadata_location}')
GROUP BY 1
""").show()

┌────────────────┬────────────────────┐
│ transfer_month │     mean_price     │
│     int64      │       double       │
├────────────────┼────────────────────┤
│              1 │  392772.2983928342 │
│              2 │ 381045.36580027133 │
│              3 │  406686.3579577748 │
│              4 │ 398630.34065730515 │
│              5 │ 393051.71221671335 │
│              6 │ 401433.55739373335 │
│              7 │  395147.6149494054 │
│              8 │  393771.9639059624 │
│              9 │  390506.3259618707 │
│             10 │  383768.3534915215 │
│             11 │  361295.1407759433 │
│             12 │  381363.8323481146 │
├────────────────┴────────────────────┤
│ 12 rows                   2 columns │
└─────────────────────────────────────┘

CPU times: user 99.3 ms, sys: 22 ms, total: 121 ms
Wall time: 101 ms
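
The DuckDB query above averages over every row in the table and has no `ORDER BY`, so the month ordering is not guaranteed. A sketch of the same query restricted to one year, with an explicit sort, could look like this. `monthly_mean_sql` is a hypothetical helper, the metadata path in the example is a placeholder, and the half-open date range assumes `date_of_transfer` is a date or timestamp column.

```python
def monthly_mean_sql(metadata_location: str, year: int) -> str:
    """Return DuckDB SQL for the mean price per month within one calendar year."""
    # Escape single quotes so the path is safe inside a SQL string literal
    location = metadata_location.replace("'", "''")
    return f"""
SELECT month(date_of_transfer) AS transfer_month, mean(price) AS mean_price
FROM iceberg_scan('{location}')
WHERE date_of_transfer >= DATE '{year:04d}-01-01'
  AND date_of_transfer < DATE '{year + 1:04d}-01-01'
GROUP BY 1
ORDER BY 1
""".strip()


# Placeholder location; in the notebook this would be table.metadata_location
sql_2024 = monthly_mean_sql("s3://example-bucket/raw/metadata/v1.metadata.json", 2024)
print(sql_2024)
```

In the notebook this would be run as `conn.sql(monthly_mean_sql(table.metadata_location, 2024)).show()`.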


## Trino
Trino is another popular option, especially since AWS provides it as a serverless query engine through Athena. It is also SQL-based, so the query looks pretty similar, just written in the Trino SQL dialect.

In [None]:
import sqlalchemy as sa

engine = sa.create_engine("trino://trino:@trino:8080/lakekeeper")

sql = """
SELECT month(date_of_transfer) as transfer_month, avg(price) as mean_price 
FROM house_prices.raw
GROUP BY 1
ORDER BY 1
"""

In [19]:
%%time
with engine.connect() as c:
    df = pl.read_database(sa.text(sql), c)
df

CPU times: user 10.5 ms, sys: 192 μs, total: 10.7 ms
Wall time: 265 ms


transfer_month,mean_price
i64,f64
1,392772.298393
2,381045.3658
3,406686.357958
4,398630.340657
5,393051.712217
…,…
8,393771.963906
9,390506.325962
10,383768.353492
11,361295.140776
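
The connection string above packs several settings into one URL. As far as I know, the Trino SQLAlchemy dialect reads it as `trino://<user>:<password>@<host>:<port>/<catalog>`, optionally followed by `/<schema>`. The sketch below spells the parts out; `trino_url` is a hypothetical helper for illustration only.

```python
from urllib.parse import quote


def trino_url(user, host, port, catalog, schema=None, password=""):
    """Assemble a SQLAlchemy-style Trino URL from its parts."""
    # Percent-encode credentials so characters like '@' or ':' don't break the URL
    auth = f"{quote(user, safe='')}:{quote(password, safe='')}"
    path = catalog if schema is None else f"{catalog}/{schema}"
    return f"trino://{auth}@{host}:{port}/{path}"


# Rebuilds the URL used above: user "trino", empty password, catalog "lakekeeper"
url = trino_url("trino", "trino", 8080, "lakekeeper")
print(url)
```

Passing a schema as well, for example `house_prices`, would let the query refer to the table as just `raw`.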


## Daft
Daft is a relatively new player in the DataFrame world. Like Polars it is written in Rust, but it is also designed for scaling out, and it has had early support for Iceberg - let's see if that helps.

In [20]:
import daft

In [None]:
%%time
(
    daft.read_iceberg(table)
    .groupby(daft.col("date_of_transfer").dt.month())
    .agg(daft.col("price").mean())
    .sort(by=daft.col("date_of_transfer"))
    .show(12)
)

# Query engines
We've now toured some of the query engines that are easy to run locally: Python with Pyiceberg, Rust with Polars and Daft, C++ with DuckDB and finally Java with Trino. One important player we've left out is Spark. There is no denying that Iceberg was originally a Java project, and the Java Iceberg libraries, which Spark uses, are the most feature-complete.

In a real enterprise setup, you'll probably have a managed service like Databricks or Snowflake that you can rely on as your main Iceberg driver - but the beauty of Iceberg is that you don't have to. You can mix and match these different query engines depending on the task at hand, without having to move the data anywhere.
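
When mixing engines like this, a cheap sanity check is that they agree on the numbers. The sketch below compares the first three monthly means copied from the DuckDB and Trino outputs above, plus the Pyiceberg figure for January, which was computed with a 2024 filter and so is expected to differ. `first_mismatch` and its tolerance are illustrative choices, not part of any of these libraries.

```python
import math

# First three monthly means copied from the outputs above
duckdb_means = {1: 392772.2983928342, 2: 381045.36580027133, 3: 406686.3579577748}
trino_means = {1: 392772.298393, 2: 381045.3658, 3: 406686.357958}
# The Pyiceberg scan was filtered to 2024, so it answers a different question
pyiceberg_2024_means = {1: 388804.09}


def first_mismatch(left, right, abs_tol=1e-4):
    """Return the first shared month whose means differ by more than abs_tol, else None."""
    for month in sorted(left.keys() & right.keys()):
        if not math.isclose(left[month], right[month], rel_tol=0.0, abs_tol=abs_tol):
            return month
    return None


print(first_mismatch(duckdb_means, trino_means))           # None: same query, same answer
print(first_mismatch(duckdb_means, pyiceberg_2024_means))  # 1: the filters differ
```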

# Exercise

Try running a query using your favourite query engine to calculate the average house price for your county. If you don't live in the UK, pick the funniest-sounding one. (I quite like WORCESTERSHIRE.)