Moving to an Incremental Pipeline in Delta Lake: Change Tracking
================================

This post shows how to use Change Tracking (todo link), which Delta Lake itself calls Change Data Feed, in Delta Lake 2.0 to convert a batch pipeline into an incremental one. We'll cover two parts:

1. Capturing changes in a Delta Lake `merge` job.
2. Converting a series of `join` operations to `merge` operations, so the pipeline does cheaper, incremental work.

Setting Up a Scenario: 3 Tables
--------------------------------------

I've set up three tables:

1. Invoice
2. InvoiceItem
3. Product

The ground truth for these tables lives in a production system. Each day the data is dumped to the data lake and merged into a Delta Lake table. The code that loads these dumps is given below.

In [1]:
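import pyspark
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable

# Register the Delta Lake SQL extension and catalog on the session builder.
builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

# Note: `sc` is a SparkSession here, not a SparkContext.
sc = configure_spark_with_delta_pip(builder).getOrCreate()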

In [30]:
# Day 0: Read the data and merge. This is just to get our tables set up. See Day 1 for a "normal" day.
products = sc.read.format("csv") \
                .option("header","true") \
                .load("./data/products/updates/day=0/") \
                .drop('_c0')

products.write.format("delta").save("./outputs/products")

invoices = sc.read.format("csv") \
                .option("header","true") \
                .load("./data/invoices/updates/day=0/") \
                .drop('_c0')

invoices.write.format("delta").save("./outputs/invoices")

invoiceitems = sc.read.format("csv") \
                .option("header","true") \
                .load("./data/invoiceitems/updates/day=0/") \
                .drop('_c0')

invoiceitems.write.format("delta").save("./outputs/invoiceitems")

In [48]:
# Day 1: process both updates and deletes, which come in separate files
def read_data(table_location, day, has_deletes):
    updates = sc.read.format("csv") \
                .option("header","true") \
                .load(f"./data/{table_location}/updates/day={day}/") \
                .drop('_c0')
        
    if has_deletes:
        deletes = sc.read.format("csv") \
                .option("header", "true") \
                .load(f"./data/{table_location}/deletes/day={day}/") \
                .drop("_c0")
    else:
        deletes = None

    return updates, deletes

product_updates, _ = read_data("products", day=1, has_deletes=False)
product_base = DeltaTable.forPath(sc, "./outputs/products")
print(f"Updating {product_updates.count()} products and deleting 0 products.")

invoice_updates, invoice_deletes = read_data("invoices", day=1, has_deletes=True)
print(f"Updating {invoice_updates.count()} invoices and deleting {invoice_deletes.count()} invoices.")

invoiceitem_updates, invoiceitem_deletes = read_data("invoiceitems", day=1, has_deletes=True)
print(f"Updating {invoiceitem_updates.count()} invoiceitems and deleting {invoiceitem_deletes.count()} invoiceitems.")



Updating 50 products and deleting 0 products.
Updating 444 invoices and deleting 3 invoices.
Updating 1533 invoiceitems and deleting 8 invoiceitems.


Every day we merge in a new set of data from a production system. A merge can create, update, or delete rows in any table. (Normally I would advise against deletes in these cases, but GDPR or other constraints may mean we have deletes in addition to creates and updates.) The result is an up-to-date Delta Lake table for each source table, every day.
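
The merge step itself isn't shown in the cells above, so here is a minimal sketch of what a daily merge could look like. It assumes each table has a single key column; the `key` argument and the helper name `merge_day` are made up for illustration, since this post doesn't name the real key columns. It uses Delta Lake's `DeltaTable.merge` builder: upsert the updates first, then remove any rows listed in the deletes file.

```python
def merge_day(base, updates, deletes=None, key="id"):
    """Apply one day's changes to a Delta table.

    base    -- a delta.tables.DeltaTable (e.g. DeltaTable.forPath(sc, path))
    updates -- DataFrame of new and changed rows
    deletes -- DataFrame of rows to remove, or None if there is no deletes file
    key     -- name of the key column (hypothetical; use the table's real key)
    """
    condition = f"base.{key} = src.{key}"

    # Upsert: overwrite rows whose key already exists, insert the rest.
    (base.alias("base")
         .merge(updates.alias("src"), condition)
         .whenMatchedUpdateAll()
         .whenNotMatchedInsertAll()
         .execute())

    # Delete: drop every base row whose key appears in the deletes file.
    if deletes is not None:
        (base.alias("base")
             .merge(deletes.alias("src"), condition)
             .whenMatchedDelete()
             .execute())
```

With the variables from the Day 1 cell, a call might look like `merge_day(product_base, product_updates, key="product_id")`, where `product_id` is again a guess at the column name. One caveat: a merge fails if several source rows match the same target row, so the updates should hold at most one row per key.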


We also have a job that produces a denormalized copy of the data by joining all three tables together, with one row per invoice item. A couple of joins is enough to do this. Occasionally we see "hiccups" where an invoice and invoice item exist in our data lake but the product has not been downloaded yet. This kind of delay can happen when the joined tables come from different production systems. So, we'll left join products, because the product columns will occasionally be null. Bad things happen in complicated systems.

In [3]:
# Build the joined table: one row per invoice item
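
The cell above is still a placeholder, so here is a rough sketch of the batch join as described: invoice items joined to their invoices, then left joined to products so that items with a missing product survive with null product columns. The function name and the key column names `invoice_id` and `product_id` are assumptions for illustration; the real schemas aren't shown in this post.

```python
def build_invoice_item_view(invoices, invoiceitems, products,
                            invoice_key="invoice_id", product_key="product_id"):
    """Return one row per invoice item, enriched with invoice and product columns.

    All three arguments are Spark DataFrames. The key column names are
    hypothetical defaults; pass the real ones for your tables.
    """
    return (
        invoiceitems
        # Every invoice item should belong to an invoice, so an inner join is enough.
        .join(invoices, on=invoice_key, how="inner")
        # Products may lag behind, so keep the item even when no product matches.
        .join(products, on=product_key, how="left")
    )
```

The inputs would be the current Delta tables, read with something like `sc.read.format("delta").load("./outputs/invoiceitems")`. If the tables share non-key column names, those columns would need renaming before the join.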

Trouble with this:
* It's expensive and repetitive: every run re-joins the full tables, even when only a few rows changed.
* Without more care, product changes get reflected in historical invoices. For instance, we switch the quantity of widgets from 6 to 4 per package without changing the SKU. This is very bad: don't ever change the meaning of a SKU. Sigh.

Enabling Change Tracking and Converting to Incremental Jobs
---------------------------------------------------

In [4]:
# Set up a new Spark session with change tracking enabled
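
This cell is also a placeholder. As a minimal sketch, assuming the feature in question is Delta Lake's Change Data Feed, the session could be configured so that newly created tables record row-level changes, and those changes could then be read back starting from a given table version. The helper names `enable_change_feed` and `read_changes` are hypothetical; the config key and reader options are the ones in the Delta Lake documentation.

```python
# Session-level default: tables created in this session get Change Data Feed enabled.
CDF_DEFAULT_KEY = "spark.databricks.delta.properties.defaults.enableChangeDataFeed"


def enable_change_feed(builder):
    """Add the Delta Lake and Change Data Feed settings to a SparkSession builder."""
    return (
        builder
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .config(CDF_DEFAULT_KEY, "true")
    )


def read_changes(spark, path, starting_version):
    """Read the row-level changes of the Delta table at `path`.

    Returns changes committed at `starting_version` or later.
    """
    return (
        spark.read.format("delta")
        .option("readChangeFeed", "true")
        .option("startingVersion", starting_version)
        .load(path)
    )
```

The builder returned by `enable_change_feed` would still go through `configure_spark_with_delta_pip(...).getOrCreate()` as in the first cell. Each row that `read_changes` returns carries `_change_type` (`insert`, `update_preimage`, `update_postimage`, or `delete`), `_commit_version`, and `_commit_timestamp` alongside the table's own columns. The session default only affects tables created afterwards, and changes are only recorded from the point the feed is enabled, so tables written earlier would need the `delta.enableChangeDataFeed` table property set on them separately.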