Moving to an Incremental Pipeline in Delta Lake: Change Tracking
================================

This post shows how to use Change Tracking (todo link), known as Change Data Feed in the Delta Lake documentation, to convert a batch pipeline into an incremental one in Delta Lake 2.0. We'll cover two parts:

1. Capturing changes from a Delta Lake `merge` job.
1. Converting a series of `join` operations to `merge` operations to produce a cheaper pipeline using incremental operations.

Setting Up a Scenario: 3 Tables
--------------------------------------

I've set up three tables:

1. Invoice
2. InvoiceItem
3. Product

The ground truth for these tables lives in a production system. It is dumped to the data lake and merged into Delta Lake tables. The logic for this merge is given below.

In [1]:
import pyspark
from delta import *

builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") 

sc = configure_spark_with_delta_pip(builder).getOrCreate()

In [54]:
# Day 0: Read the data and merge. This is just to get our tables set up. See Day 1 for a "normal" day.
products = sc.read.format("csv") \
                .option("header","true") \
                .load("./data/products/updates/day=0/") \
                .drop('_c0')

products.write.format("delta").save("./outputs/products")

invoices = sc.read.format("csv") \
                .option("header","true") \
                .load("./data/invoice/updates/day=0/") \
                .drop('_c0')

invoices.write.format("delta").save("./outputs/invoices")

invoiceitems = sc.read.format("csv") \
                .option("header","true") \
                .load("./data/invoiceitems/updates/day=0/") \
                .drop('_c0')

invoiceitems.write.format("delta").save("./outputs/invoiceitems")

In [55]:
# Day 1: process both updates and deletes, which come in separate files
def read_data(table_location, day, has_deletes):
    updates = sc.read.format("csv") \
                .option("header","true") \
                .load(f"./data/{table_location}/updates/day={day}/") \
                .drop('_c0')
        
    if has_deletes:
        deletes = sc.read.format("csv") \
                .option("header", "true") \
                .load(f"./data/{table_location}/deletes/day={day}/") \
                .drop("_c0")
    else:
        deletes = None

    return updates, deletes

product_updates, _ = read_data("products", day=1, has_deletes=False)
product_base = DeltaTable.forPath(sc, "./outputs/products")
print(f"Updating {product_updates.count()} products and deleting 0 products.")

invoice_updates, invoice_deletes = read_data("invoice", day=1, has_deletes=True)
invoice_base = DeltaTable.forPath(sc, "./outputs/invoices")
print(f"Updating {invoice_updates.count()} invoices and deleting {invoice_deletes.count()} invoices.")

invoiceitem_updates, invoiceitem_deletes = read_data("invoiceitems", day=1, has_deletes=True)
invoiceitem_base = DeltaTable.forPath(sc, "./outputs/invoiceitems")
print(f"Updating {invoiceitem_updates.count()} invoiceitems and deleting {invoiceitem_deletes.count()} invoiceitems.")

Updating 50 products and deleting 0 products.
Updating 444 invoices and deleting 3 invoices.
Updating 1533 invoiceitems and deleting 8 invoiceitems.


In [68]:
# Day 1 continued: merge tables
product_base.alias("oldData") \
  .merge(
    product_updates.alias("newData"),
    "oldData.product_id = newData.product_id") \
  .whenMatchedUpdateAll() \
  .whenNotMatchedInsertAll() \
  .execute()

invoice_base.alias("oldData") \
  .merge(
    invoice_updates.alias("newData"),
    "oldData.invoice_id = newData.invoice_id") \
  .whenMatchedUpdateAll() \
  .whenNotMatchedInsertAll() \
  .execute()

invoice_base.alias("oldData") \
    .merge(invoice_deletes.alias("newData"), "oldData.invoice_id = newData.invoice_id") \
    .whenMatchedDelete() \
    .execute()

invoiceitem_base.alias("oldData") \
    .merge(
        invoiceitem_updates.alias("newData"),
        "oldData.invoice_item_id = newData.invoice_item_id"
    ) \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()

invoiceitem_base.alias("oldData") \
    .merge(
        invoiceitem_deletes.alias("newData"),         
        "oldData.invoice_item_id = newData.invoice_item_id"
    ) \
    .whenMatchedDelete() \
    .execute()

Each day we merge a new batch of data from the production system, which can create, update, or delete rows in any table. (Hard deletes, as opposed to soft deletes, can happen for reasons such as GDPR compliance.) The result is an up-to-date Delta Lake table for each source table. These tables are normally built with merge commands to take advantage of partitions.
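
To make the merge semantics concrete, here is a minimal plain-Python sketch of what one day's upsert and delete steps do to a keyed table. It is a toy model, not how Delta Lake implements `merge`; the helper name `apply_daily_changes` and the sample rows are made up for illustration.

```python
def apply_daily_changes(base, updates, deletes, key):
    """Toy model of the daily merge: upsert `updates`, then remove `deletes`.

    `base`, `updates`, and `deletes` are lists of dict rows; `key` names the
    column that identifies a row (hypothetical helper, not a Delta Lake API).
    """
    table = {row[key]: row for row in base}
    # Mirrors whenMatchedUpdateAll + whenNotMatchedInsertAll: the new row wins.
    for row in updates:
        table[row[key]] = row
    # Mirrors whenMatchedDelete: drop rows whose key appears in the delete file.
    for row in deletes:
        table.pop(row[key], None)
    return sorted(table.values(), key=lambda r: r[key])


# Illustrative data only.
base = [{"invoice_id": 1, "total": 10}, {"invoice_id": 2, "total": 20}]
updates = [{"invoice_id": 2, "total": 25}, {"invoice_id": 3, "total": 30}]
deletes = [{"invoice_id": 1}]

merged = apply_daily_changes(base, updates, deletes, key="invoice_id")
print(merged)
```

Invoice 1 is removed, invoice 2 is overwritten, and invoice 3 is inserted, which is the same outcome the two-step update and delete merges aim for on the real tables.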


Now, assume there is a job that produces a normalized copy of the data by joining all three tables together, with one row per invoice item. We can perform this normalization using a couple of joins. Occasionally we see "hiccups" where an invoice and invoice item exist in our data lake but the product has not yet been downloaded. This kind of delay can happen when the joined tables come from different production systems. So, we'll left join products because they will occasionally be null. Bad things happen in complicated systems.

In [71]:
# build normalized join
product_base = DeltaTable.forPath(sc, "./outputs/products").toDF()
invoice_base = DeltaTable.forPath(sc, "./outputs/invoices").toDF()
invoiceitem_base = DeltaTable.forPath(sc, "./outputs/invoiceitems").toDF()

# Left join with invoice item as the root. This isn't important for invoices and invoice items, but is
# critical for products in this example since products may be pulled at different time cadence and, thus,
# not exist yet.
# Assumes invoiceitems has an invoice_id foreign key column; adjust the name to match your schema.
normalized_view = invoiceitem_base.join(invoice_base, invoiceitem_base.invoice_id == invoice_base.invoice_id, how="left")
normalized_view = normalized_view.join(product_base, normalized_view.product == product_base.product_id, how="left")

normalized_view.write.format("delta").save("./outputs/normalized")

There are two things I hate about this join. First, we have to load every table in full each day to produce it. If we tried to load, say, only the data changed on day=1, then invoice items that reference products unchanged on day 1 would find no match in the join.
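
The first problem can be seen in a small plain-Python model of the left join. This is only a sketch: the `left_join` helper, the product ids, and the prices are invented for illustration, and the `product` and `product_id` column names follow the join above.

```python
def left_join(items, products):
    """Toy left join of invoice items to products on the product id."""
    by_id = {p["product_id"]: p for p in products}
    return [
        # Missing products yield price=None, like a null from a left join.
        {**item, "price": by_id.get(item["product"], {}).get("price")}
        for item in items
    ]


# Illustrative data: product A did not change on day 1, product B did.
unchanged_products = [{"product_id": "A", "price": 5}]
day1_products = [{"product_id": "B", "price": 8}]
day1_items = [
    {"invoice_item_id": 100, "product": "A"},
    {"invoice_item_id": 101, "product": "B"},
]

# Joining against only the day-1 product changes loses product A.
incremental_only = left_join(day1_items, day1_products)
# Joining against the full product table finds both.
full_table = left_join(day1_items, unchanged_products + day1_products)

print(incremental_only)
print(full_table)
```

Item 100 comes back with a null price in the incremental-only join even though product A exists, which is why the batch job reloads everything.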

Second, the normalized data pulls the most recent value for any product, not the value that was active when an invoice item was created. If we change a price in our product table, for instance, the next day's normalized data will apply that new price to all previous invoice items. This can be misleading!
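
A tiny worked example of the second problem, again in plain Python. The `quantity` column, the prices, and the `line_total` helper are assumptions made up for this sketch; the real tables may look different.

```python
# Product table snapshots on two consecutive days (illustrative values).
products_day0 = {"A": {"product_id": "A", "price": 5}}
products_day1 = {"A": {"product_id": "A", "price": 6}}  # price raised on day 1

# An invoice item created on day 0, when the price was 5.
item = {"invoice_item_id": 100, "product": "A", "quantity": 2}


def line_total(item, products):
    """Compute quantity * price using whichever product snapshot is joined."""
    return item["quantity"] * products[item["product"]]["price"]


total_when_sold = line_total(item, products_day0)
total_after_rebuild = line_total(item, products_day1)

print(total_when_sold, total_after_rebuild)
```

Rebuilding the normalized table on day 1 silently changes the total for an item sold on day 0, because the full join only ever sees the latest product row.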


Enabling Change Tracking and Converting to Incremental Jobs
---------------------------------------------------

Delta Lake 2.0 gives us a way to fix this: tables can record row-level changes, so a job can read only what changed.

In [4]:
# Set up a new Spark session with change tracking enabled