# Slowly Changing Dimensions in Delta Lake

SCDs (Slowly Changing Dimensions) are fairly easy to implement with Spark 3.0 and Delta Lake.
In the traditional data warehousing world, where everything is stored in SQL tables, a Type 2 SCD typically looks like the example below (from Kimball's [The Data Warehouse Toolkit](https://www.bookdepository.com/The-Data-Warehouse-Toolkit-Ralph-Kimball/9781118530801?redirected=true&utm_medium=Google&utm_campaign=Base1&utm_source=GR&utm_content=The-Data-Warehouse-Toolkit&selectCurrency=EUR&w=AFFMAU9SYY661SA8V9F5)):

![SCD2 Example](img/scd2_ex.png)

Here we have a **Products** dimension table which stores data from one or more source systems. Since this is a Type 2 SCD, every time a record is updated in the source, a new record holding the updated version of the source record is added to the dimension table. The table therefore contains several records for the same source entity, each showing the state of the source record at a different point in time, and this is how historical information is retained. The primary key of the record in the source system (e.g. our ERP) is the **SKU** column (hence the NK, natural key, designation). The extra columns in our SCD 2 Product dimension are the following:
- **Product Key**: this is the PK (primary key) of the Product dimension table itself. As we keep multiple rows for each original record in the source system, one per version, several records can share the same NK while each has a different PK. This PK has no intrinsic meaning, hence it's also called a surrogate key, and in most cases it is an auto-increment field.
- **Row Effective Date**: this shows when this particular record (that is, this version of the source record) was loaded from the source system. The first time we load the dimension table, it is customary to use a date in the distant past, such as 1/1/1900.
- **Row Expiration Date**: this shows when the record/version stops being active. Every time an SCD2 attribute changes in the source system, the expiration date of the currently active record is set to the date (or datetime) of the current loading batch. A new record is then added with that same batch load date as its Effective Date and an Expiration Date such as 12/31/9999, to indicate that it is the currently active record.
- **Current Row Indicator**: this is a flag that shows whether the record is active or has expired. Since an expiration date of 12/31/9999 already shows that the record is active, this field is sometimes omitted. It could also be used to indicate that a record has been deleted from the source, or we could use another field for that.
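
To make the mechanics of these columns concrete, here is a minimal plain-Python sketch of a single Type 2 change. The helper `apply_scd2_change`, the snake_case column names, and the sample SKU and department values are all hypothetical and chosen for illustration only; this is not the Spark/Delta Lake implementation, just the bookkeeping described in the list above.

```python
from datetime import date

# Conventional "still active" expiration date
OPEN_END = date(9999, 12, 31)

def apply_scd2_change(dim_rows, sku, new_attrs, batch_date):
    """Hypothetical helper: expire the active row for `sku`, then append a new version."""
    # Surrogate key: simply the next integer, mimicking an auto-increment column
    next_key = max((r["product_key"] for r in dim_rows), default=0) + 1
    for row in dim_rows:
        if row["sku"] == sku and row["row_expiration_date"] == OPEN_END:
            # Close the currently active version at the batch load date
            row["row_expiration_date"] = batch_date
            row["current_row_indicator"] = "Expired"
    # Add the new version, effective from the same batch load date
    dim_rows.append({
        "product_key": next_key,
        "sku": sku,
        **new_attrs,
        "row_effective_date": batch_date,
        "row_expiration_date": OPEN_END,
        "current_row_indicator": "Current",
    })
    return dim_rows

# Initial load: one product, effective from a date in the distant past (made-up data)
dim_product = [{
    "product_key": 1,
    "sku": "ABC-123",
    "department": "Education",
    "row_effective_date": date(1900, 1, 1),
    "row_expiration_date": OPEN_END,
    "current_row_indicator": "Current",
}]

# The department changes in the source and is picked up by the batch of 2020-06-01
apply_scd2_change(dim_product, "ABC-123", {"department": "Strategy"}, date(2020, 6, 1))

for row in dim_product:
    print(row["product_key"], row["sku"], row["department"],
          row["row_effective_date"], row["row_expiration_date"],
          row["current_row_indicator"])
```

After the change, the dimension holds two rows with the same natural key (`sku`) but different surrogate keys, and only the second one carries the open-ended expiration date.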

We can do the same thing in Delta Lake, although it's not necessary. To start, let's load a sample file, which will play the role of the source system. Note that if you have installed Spark 3.0 and a pyspark-enabled Jupyter on your laptop and want to try this there, you should start the session with something like
> pyspark --packages io.delta:delta-core_2.12:0.7.0

to have the Delta Lake capabilities available in your Jupyter session, as the authors of [Learning Spark 2nd ed.](https://www.oreilly.com/library/view/learning-spark-2nd/9781492050032/) note.

In [7]:
# Define the source file
sourceFile = "files/customers1.csv"

# Configure the Delta Lake table path
deltaPath = "/tmp/dim_customers"

# Read the file into a Spark data frame
df = (spark.read.format("csv")
      .option("header", "true")
      .load(sourceFile))

# Show the contents
df.show()

+---+------------------+---------+
| ID|              Name|     City|
+---+------------------+---------+
|  1| John Papadopoulos|   Athens|
|  2|   Matt Protopapas|   Athens|
|  3|  Michael Georgiou| Salonica|
+---+------------------+---------+



So we have a simple file showing our customers and the city they currently live in. The primary key which uniquely identifies a customer in our source system (the file) is the **ID** column. We now want to store this data in our Delta Lake table **dim_customers**, which will have an SCD Type 2 format. We would therefore like to have an auto-increment primary key along with the effective and expiration dates (we skip the 'current flag' for now). Auto-increment columns are not supported, as the data might reside in different partitions (and in fact on different computers), so we'll use monotonically increasing IDs instead. These are guaranteed to be unique and increasing, though not necessarily consecutive.
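
To see why these IDs are not simply 0, 1, 2, ... across a larger table, the sketch below imitates in plain Python the layout described in the Spark documentation for `monotonically_increasing_id`: the partition index goes in the upper 31 bits and the row number within the partition in the lower 33 bits. The function `simulated_monotonic_id` is a hypothetical illustration, not Spark's own code.

```python
def simulated_monotonic_id(partition_index, row_in_partition):
    # Partition index in the upper bits, row number within the partition in the lower 33 bits
    return (partition_index << 33) + row_in_partition

# Two partitions with three rows each
ids = [simulated_monotonic_id(p, r) for p in range(2) for r in range(3)]
print(ids)  # [0, 1, 2, 8589934592, 8589934593, 8589934594]
```

With a small file that fits in a single partition, as in our example, the keys happen to come out as 0, 1, 2, but that should not be relied upon.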

In [9]:
from pyspark.sql.functions import lit, monotonically_increasing_id, to_date

# Add the surrogate key and the validity dates of the initial load
df1 = (df
       .withColumn("CustomerKey", monotonically_increasing_id())
       .withColumn("ValidFrom", to_date(lit("01/01/1900"), "MM/dd/yyyy"))
       .withColumn("ValidTo", to_date(lit("12/31/9999"), "MM/dd/yyyy")))

Traditionally, the record with PK=0 in the dimension is a 'blank' record which is used to show the absence of a dimension entity. For example, if the customer was not recorded for a sale, we would need to map that sale fact onto the 'blank' record in the dimension. It's not that difficult to add such a record to our data frame, but we'll skip it for now. So, we're ready to write the first batch of the customer dimension into our table.
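
As an illustration of how such a 'blank' record is used, here is a minimal plain-Python sketch of resolving the surrogate key for incoming sales. Everything in it is hypothetical: the constant `UNKNOWN_CUSTOMER_KEY`, the lookup dictionary, the sales rows, and the assumption that surrogate key 0 is reserved for the blank record (which is not the case in the table we build in this walkthrough).

```python
# Assumption: surrogate key 0 is reserved for the 'blank' dimension record
UNKNOWN_CUSTOMER_KEY = 0

# Hypothetical lookup: natural key (source ID) -> surrogate key of the active row
key_by_customer_id = {"1": 1, "2": 2, "3": 3}

# Hypothetical incoming sales; the second one has no customer recorded
sales = [
    {"sale_id": 100, "customer_id": "2", "amount": 25.0},
    {"sale_id": 101, "customer_id": None, "amount": 10.0},
]

def resolve_customer_key(customer_id):
    # Fall back to the blank record when the customer is missing
    return key_by_customer_id.get(customer_id, UNKNOWN_CUSTOMER_KEY)

fact_rows = [
    {"sale_id": s["sale_id"],
     "customer_key": resolve_customer_key(s["customer_id"]),
     "amount": s["amount"]}
    for s in sales
]
print(fact_rows)
```

This way every fact row points at some dimension row, so joins between facts and the dimension never drop sales that lack a customer.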

In [10]:
# Create the table and insert the data frame
df1.write.format("delta").save(deltaPath)
# Create a temp view on the table
spark.read.format("delta").load(deltaPath).createOrReplaceTempView("dimCustomer")

In [11]:
spark.sql("SELECT * FROM dimCustomer").show()

+---+------------------+---------+-----------+----------+----------+
| ID|              Name|     City|CustomerKey| ValidFrom|   ValidTo|
+---+------------------+---------+-----------+----------+----------+
|  1| John Papadopoulos|   Athens|          0|1900-01-01|9999-12-31|
|  2|   Matt Protopapas|   Athens|          1|1900-01-01|9999-12-31|
|  3|  Michael Georgiou| Salonica|          2|1900-01-01|9999-12-31|
+---+------------------+---------+-----------+----------+----------+

