# Data Engineering

Now that we've loaded our data, it's time to do some real data engineering to answer our original question.

At this stage, Iceberg fades into the background, but we're free to pick and choose a query engine for each step. This is the true power of Iceberg.

In [2]:
import sqlalchemy as sa
from utils import engine, catalog
import polars as pl

pl.Config.set_fmt_str_lengths(50)
pl.Config.set_thousands_separator(True)

polars.config.Config

We want a way of identifying a given property, and hashing the address-related fields seems the easiest. We'll create a dimension table for those addresses, so we can move the address columns out of our final data without losing the ability to filter on them later.
```{note} SQL Partitioning
Note that we're adding partitioning to our tables directly through SQL here, using a `WITH` clause on the `CREATE TABLE` statement.
```

In [2]:
dim_address_sql = """
CREATE OR REPLACE TABLE housing.dim_address 
    WITH ( partitioning = ARRAY['bucket(address_id, 10)'] )
    AS (
    SELECT DISTINCT to_hex(md5(cast(
        coalesce(paon, '') ||
        coalesce(saon, '') ||
        coalesce(street, '') ||
        coalesce(locality, '') ||
        coalesce(town, '') ||
        coalesce(district, '') ||
        coalesce(county, '') ||
        coalesce(postcode, '')
    as varbinary))) AS address_id,
      paon,
      saon,
      street,
      locality,
      town,
      district,
      county,
      postcode
FROM housing.staging_prices)
"""

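To make the hashing in the query above concrete, here is a minimal Python sketch of the same idea: replace missing fields with an empty string, concatenate, take the MD5 digest and render it as upper-case hex. It assumes the `varbinary` cast encodes the string as UTF-8, and the two example addresses are made up. It also shows a caveat of concatenating without a separator: different field values can produce the same string, and therefore the same `address_id`.

```python
import hashlib

# Same field order as the SELECT in dim_address_sql
ADDRESS_FIELDS = (
    "paon", "saon", "street", "locality",
    "town", "district", "county", "postcode",
)


def address_id(row: dict) -> str:
    """Coalesce missing fields to '', concatenate, MD5, upper-case hex."""
    joined = "".join(row.get(field) or "" for field in ADDRESS_FIELDS)
    # Assumption: the SQL cast to varbinary yields the UTF-8 bytes
    return hashlib.md5(joined.encode("utf-8")).hexdigest().upper()


# Hypothetical addresses whose fields differ but concatenate identically
a = {"paon": "1", "saon": "2A", "street": "HIGH STREET"}
b = {"paon": "12", "saon": "A", "street": "HIGH STREET"}

print(address_id(a) == address_id(b))  # True: both hash "12AHIGH STREET"
```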
As described in the data dictionary, the monthly files include a `record_status` column, which indicates whether a given record is new or whether it deletes or updates an existing record. In moving from our staging table to our fact table, we clean our data to ensure we respect the `record_status`.

In [3]:
fct_prices_sql = """
CREATE OR REPLACE TABLE housing.fct_house_prices
    WITH ( partitioning = ARRAY['year(date_of_transfer)'] ) AS (
        WITH ranked_records AS (
            SELECT *,
            ROW_NUMBER () OVER (PARTITION BY transaction_id ORDER BY month(date_of_transfer) DESC) AS rn
            FROM housing.staging_prices
    ),
    latest_records AS (
        SELECT *
        FROM ranked_records
        WHERE rn = 1
    ),
    with_address_id AS (
        SELECT to_hex(md5(cast (
                coalesce(paon, '') ||
                coalesce(saon, '') ||
                coalesce(street, '') ||
                coalesce(locality, '') ||
                coalesce(town, '') ||
                coalesce(district, '') ||
                coalesce(county, '') ||
                coalesce(postcode, '')
            as varbinary))) AS address_id,
                transaction_id,
                price,
                date_of_transfer,
                property_type,
                new_property,
                duration,
                ppd_category_type
        FROM latest_records
        WHERE record_status != 'D' and ppd_category_type = 'A'
    )
    SELECT *
    FROM with_address_id
    )
"""

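The clean-up rule in the query above can be sketched in plain Python on a handful of made-up rows. This is only an illustration under stated assumptions: the status codes `A` and `C` for added and changed records are assumed (the query itself only references `D`), and the ranking key mirrors the query's `month(date_of_transfer)` ordering.

```python
from datetime import date
from itertools import groupby
from operator import itemgetter


def row(transaction_id, price, day, status, category="A"):
    """Build one hypothetical staging row."""
    return {
        "transaction_id": transaction_id,
        "price": price,
        "date_of_transfer": day,
        "record_status": status,
        "ppd_category_type": category,
    }


staging = [
    row("t1", 100_000, date(2020, 1, 10), "A"),
    row("t1", 105_000, date(2020, 3, 10), "C"),  # later correction of t1
    row("t2", 200_000, date(2021, 5, 1), "A"),
    row("t2", 200_000, date(2021, 6, 1), "D"),   # t2 is deleted
    row("t3", 300_000, date(2022, 7, 1), "A", category="B"),
]


def latest_valid_records(rows):
    by_id = itemgetter("transaction_id")
    kept = []
    for _, group in groupby(sorted(rows, key=by_id), key=by_id):
        # Equivalent of rn = 1: highest month(date_of_transfer) per transaction
        latest = max(group, key=lambda r: r["date_of_transfer"].month)
        # Equivalent of the WHERE clause in with_address_id
        if latest["record_status"] != "D" and latest["ppd_category_type"] == "A":
            kept.append(latest)
    return kept


result = latest_valid_records(staging)
print([(r["transaction_id"], r["price"]) for r in result])  # [('t1', 105000)]
```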
In [4]:
with engine.begin() as conn:
    num_rows_dim_address = conn.execute(sa.text(dim_address_sql)).fetchone()[0]
    num_rows_fct_prices = conn.execute(sa.text(fct_prices_sql)).fetchone()[0]

print(f"Created dim_address with {num_rows_dim_address:,} rows")
print(f"Created fct_prices with {num_rows_fct_prices:,} rows")

Created dim_address with 7,498,409 rows
Created fct_prices with 7,592,564 rows


Now that the data is loaded, we can create a PyIceberg reference to it.

In [4]:
fct_house_prices_t = catalog.load_table("housing.fct_house_prices")

For a change of pace, let's use `polars` to write our profits calculation. Some things are easier to express in SQL and some are nice to be able to do in Polars. The choice is yours!

In [9]:
polars_result = (
    pl.scan_iceberg(fct_house_prices_t)
    .with_columns(
        pl.col("date_of_transfer").min().over(pl.col("address_id")).alias("first_day"),
        pl.col("date_of_transfer").max().over(pl.col("address_id")).alias("last_day"),
        pl.col("price")
        .sort_by("date_of_transfer")
        .first()
        .over(pl.col("address_id"))
        .alias("first_price"),
        pl.col("price")
        .sort_by("date_of_transfer")
        .last()
        .over(pl.col("address_id"))
        .alias("last_price"),
    )
    .with_columns(
        pl.col("last_day").sub(pl.col("first_day")).dt.total_days().alias("days_held"),
        pl.col("last_price").sub(pl.col("first_price")).alias("profit"),
    )
    .filter(pl.col("days_held") != 0)
    .select(
        pl.col("address_id"),
        pl.col("first_day"),
        pl.col("last_day"),
        pl.col("first_price"),
        pl.col("last_price"),
        pl.col("days_held"),
        pl.col("profit"),
    )
    .unique()
).collect()

polars_result

| address_id | first_day | last_day | first_price | last_price | days_held | profit |
| --- | --- | --- | --- | --- | --- | --- |
| str | date | date | i32 | i32 | i64 | i32 |
| 025F3D34C6EA424D860426BE3D8E8C85 | 2015-04-17 | 2019-10-17 | 164950 | 195000 | 1644 | 30050 |
| F5B8C3D2E08D933E3BFE449D84376273 | 2019-03-22 | 2021-06-11 | 120000 | 170000 | 812 | 50000 |
| 41F6095A7FC8461E7A6EBC0A0EA00023 | 2016-10-07 | 2023-04-12 | 282000 | 320000 | 2378 | 38000 |
| 678BE468391CFE917815E524B2E6534D | 2017-06-27 | 2018-08-23 | 364995 | 400000 | 422 | 35005 |
| 8FAB76FBB2F4C65F37AF3E8904C9CD9B | 2021-03-29 | 2024-08-02 | 348000 | 435000 | 1222 | 87000 |
| … | … | … | … | … | … | … |
| D345D15462E4EBAFC087E97C48C77EE6 | 2017-12-20 | 2020-04-24 | 94500 | 100000 | 856 | 5500 |
| D81EC0D2EF45CA4CE0271546E0351A60 | 2016-02-19 | 2021-01-08 | 102500 | 120000 | 1785 | 17500 |
| 432015977DB4BFDFEFB120E31C3F6B15 | 2019-08-07 | 2023-02-23 | 372500 | 400000 | 1296 | 27500 |
| 25D28388315C28271120AF25751DCA69 | 2018-10-03 | 2021-04-01 | 315000 | 455000 | 911 | 140000 |

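For readers less familiar with window expressions, the same first-sale/last-sale logic can be sketched in plain Python. This is an illustrative equivalent rather than a replacement for the Polars query: the address ids are hypothetical, and the dates and prices for `addr_1` are taken from the first row of the output above.

```python
from collections import defaultdict
from datetime import date

# (address_id, date_of_transfer, price); address ids are hypothetical
sales = [
    ("addr_1", date(2015, 4, 17), 164_950),
    ("addr_1", date(2019, 10, 17), 195_000),
    ("addr_2", date(2018, 1, 5), 250_000),  # sold only once
]


def profits(rows):
    # Group the sales by address, like .over("address_id")
    by_address = defaultdict(list)
    for address_id, day, price in rows:
        by_address[address_id].append((day, price))

    out = {}
    for address_id, history in by_address.items():
        history.sort()  # order each address's sales by date
        (first_day, first_price), (last_day, last_price) = history[0], history[-1]
        days_held = (last_day - first_day).days
        # Same filter as the query: drop addresses with no holding period
        if days_held != 0:
            out[address_id] = {
                "days_held": days_held,
                "profit": last_price - first_price,
            }
    return out


print(profits(sales))  # {'addr_1': {'days_held': 1644, 'profit': 30050}}
```

The `days_held != 0` filter is what removes properties that appear only once, since their first and last sale are the same row.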

Let's store the results in a table for future reference. Since `polars` is Arrow-based, we can also use it to define the schema, as long as we don't care too much about the details of the resulting schema.

In [10]:
profits_t = catalog.create_table_if_not_exists(
    "house_prices.profits", schema=polars_result.to_arrow().schema
)

In [11]:
profits_t.overwrite(polars_result.to_arrow())

To round out the selection of query engines, we can use `daft` to query our newly created table and calculate the mean profit for properties first sold in a given year.

In [12]:
import daft


def query_profits(year: int) -> daft.DataFrame:
    table = catalog.load_table("house_prices.profits")
    df = (
        daft.read_iceberg(table)
        # Filter on the requested year rather than a hard-coded value
        .filter(daft.col("first_day").dt.year() == year)
        .agg(daft.col("first_day").dt.year().max(), daft.col("profit").mean())
    )
    # Materialise the lazy query once and return the result
    return df.collect()

In [13]:
query_profits(2016)

| first_day (Int32) | profit (Float64) |
| --- | --- |
| 2016 | 54167.65682534417 |
