## Glue + Iceberg evaluation

###### Bulk Insert 
###### SCD2
###### Impute deletions
###### Deduplication


## Initialise SparkSession

## Clean up existing resources 

## Create Iceberg tables, insert synthetic data (25K rows)

## Bulk Insert 


In [14]:
%session_id_prefix native-iceberg-dataframe-
%glue_version 3.0
%idle_timeout 60
%%configure 
{
  "--conf": "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
  "--datalake-formats": "iceberg"
}

Setting session ID prefix to native-iceberg-dataframe-


You are already connected to a glueetl session 3305037b-2670-4661-9e2d-dbeab9788877.

No change will be made to the current session that is set as glueetl. The session configuration change will apply to newly created sessions.


Setting Glue version to: 3.0



Current idle_timeout is 60 minutes.
idle_timeout has been set to 60 minutes.



The following configurations have been updated: {'--conf': 'spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions', '--datalake-formats': 'iceberg'}


In [47]:
catalog_name = "glue_catalog"
bucket_name = "sb-test-bucket-ireland"
bucket_prefix = "sb"
database_name = "sb10_iceberg_dataframe"
table_name = "datagensb"
warehouse_path = f"s3://{bucket_name}/{bucket_prefix}"




In [48]:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .config(f"spark.sql.catalog.{catalog_name}", "org.apache.iceberg.spark.SparkCatalog") \
    .config(f"spark.sql.catalog.{catalog_name}.warehouse", f"{warehouse_path}") \
    .config(f"spark.sql.catalog.{catalog_name}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config(f"spark.sql.catalog.{catalog_name}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .config("spark.sql.extensions","org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .getOrCreate()
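The catalog settings in the cell above all hang off the `spark.sql.catalog.{catalog_name}` prefix. As a minimal sketch, they could be collected in a plain dictionary and inspected before a session is built. The helper name `iceberg_catalog_conf` is hypothetical and not part of the notebook; the keys and values are the ones used above.

```python
def iceberg_catalog_conf(catalog_name: str, warehouse_path: str) -> dict:
    """Return the Spark settings that register an Iceberg catalog backed by AWS Glue."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.warehouse": warehouse_path,
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    }


conf = iceberg_catalog_conf("glue_catalog", "s3://sb-test-bucket-ireland/sb")
for key, value in conf.items():
    print(f"{key}={value}")

# In the notebook session the settings could then be applied with:
#   builder = SparkSession.builder
#   for key, value in conf.items():
#       builder = builder.config(key, value)
#   spark = builder.getOrCreate()
```

Keeping the settings in one place makes it easier to check that the catalog name used in the configuration matches the one used in the table identifiers.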




In [49]:
query = f"""
CREATE DATABASE IF NOT EXISTS {catalog_name}.{database_name}
"""
spark.sql(query)

DataFrame[]


In [4]:
input_filepath = "s3://sb-test-bucket-ireland/data-engineering-use-cases/dummy-data/full_load.parquet"
output_directory = f"{catalog_name}.{database_name}.{table_name}"
future_end_datetime = "2050-01-01"
primary_key = "product_id"




In [5]:
# Parquet files carry their own schema; the CSV-style 'header' option has no effect here
full_load = spark.read.parquet(input_filepath)




In [6]:
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, Row
import time


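def bulk_insert(input_filepath, output_directory, future_end_datetime):
    """Load the full extract and create the Iceberg table with the SCD2 curation columns."""
    start = time.time()
    # Parquet files carry their own schema, so no reader options are needed
    full_load = spark.read.parquet(input_filepath)
    full_load = full_load.withColumn("start_datetime", F.col("extraction_timestamp"))
    full_load = full_load.withColumn("end_datetime", F.to_timestamp(F.lit(future_end_datetime), 'yyyy-MM-dd'))
    # The full load has no change operation; it is stored as the literal string "None"
    full_load = full_load.withColumn("op", F.lit("None"))
    full_load = full_load.withColumn("is_current", F.lit(True))
    # create() fails if the table already exists
    full_load.writeTo(output_directory).create()
    # Elapsed wall-clock time in seconds
    print(time.time() - start)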
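Each step in this evaluation is timed by hand with `time.time()`. A small context manager is one possible way to collect those timings in a single place for comparison. This is only a sketch: `timed` and the `timings` dictionary are hypothetical names, and the `sum(...)` call is a stand-in for the real Spark work.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print the wall-clock time of the wrapped block and optionally store it in `results`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.3f} s")


timings = {}
with timed("bulk_insert", timings):
    # Stand-in workload; in the notebook this would be the body of bulk_insert(...)
    sum(range(1000))
```

The elapsed time is recorded even if the wrapped block raises, because the measurement sits in the `finally` clause.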




In [7]:
bulk_insert(input_filepath,output_directory,future_end_datetime)

10.359502792358398


In [8]:
output_directory

'glue_catalog.sb9_iceberg_dataframe.datagensb'


In [9]:
spark.table(f"{catalog_name}.{database_name}.{table_name}") \
    .show()

+----------+------------+-----+--------------------+----+-------------------+-------------------+----------+
|product_id|product_name|price|extraction_timestamp|  op|     start_datetime|       end_datetime|is_current|
+----------+------------+-----+--------------------+----+-------------------+-------------------+----------+
|     00001|      Heater|  250| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2050-01-01 00:00:00|      true|
|     00002|  Thermostat|  400| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2050-01-01 00:00:00|      true|
|     00003|  Television|  600| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2050-01-01 00:00:00|      true|
|     00004|     Blender|  100| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2050-01-01 00:00:00|      true|
|     00005| USB charger|   50| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2050-01-01 00:00:00|      true|
+----------+------------+-----+--------------------+----+-------------------+-------------------+----------+


In [10]:
spark.table(f"{catalog_name}.{database_name}.{table_name}.history") \
    .show()

+--------------------+-------------------+---------+-------------------+
|     made_current_at|        snapshot_id|parent_id|is_current_ancestor|
+--------------------+-------------------+---------+-------------------+
|2023-05-31 15:14:...|9122969471621716419|     null|               true|
+--------------------+-------------------+---------+-------------------+


## Slowly Changing Dimension Type 2 (SCD2)
To keep the test simple, the updates are generated by setting one column (price) to the same value for every record. Soft deletes are not covered because they follow a very similar process from a performance perspective.

Steps:

1. Read the updates
2. Join the full load with the updates on the primary key
3. Set end_datetime to the extraction_timestamp of the updated records
4. Close the existing records
5. Add the curation columns to the updates
6. Append the updated data to the existing data
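The steps above can be sketched on plain Python dictionaries, which makes the intended result easy to check without a Spark session. This is an illustration only: `scd2_apply` is a hypothetical helper, and closing only rows that are still flagged as current is an assumption of the sketch. It also assumes at most one update per primary key in a batch.

```python
from datetime import datetime

FUTURE_END = datetime(2050, 1, 1)


def scd2_apply(existing, updates, primary_key="product_id"):
    """Close current rows matched by an update, then append the updates as the new current rows."""
    # Map each updated key to the timestamp that closes its current row
    update_ts = {u[primary_key]: u["extraction_timestamp"] for u in updates}
    result = []
    for row in existing:
        row = dict(row)
        if row["is_current"] and row[primary_key] in update_ts:
            row["end_datetime"] = update_ts[row[primary_key]]
            row["is_current"] = False
        result.append(row)
    # Add the curation columns to the updates and append them
    for u in updates:
        result.append({
            **u,
            "start_datetime": u["extraction_timestamp"],
            "end_datetime": FUTURE_END,
            "is_current": True,
        })
    return result


existing = [{
    "product_id": "00001", "price": 250,
    "extraction_timestamp": datetime(2022, 1, 1, 1, 1, 1),
    "start_datetime": datetime(2022, 1, 1, 1, 1, 1),
    "end_datetime": FUTURE_END, "is_current": True,
}]
updates = [{"product_id": "00001", "price": 1000,
            "extraction_timestamp": datetime(2023, 1, 1, 1, 1, 1)}]

rows = scd2_apply(existing, updates)
for row in rows:
    print(row["product_id"], row["price"], row["end_datetime"], row["is_current"])
```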

In [11]:
full_load_updates = spark.read.parquet("s3://sb-test-bucket-ireland/data-engineering-use-cases/dummy-data/updates.parquet")




In [12]:
full_load_updates = full_load_updates.withColumn("start_datetime",F.col("extraction_timestamp"))




In [13]:
full_load_updates = full_load_updates.withColumn("end_datetime", F.to_timestamp(F.lit("2050-01-01"), 'yyyy-MM-dd'))




In [14]:
full_load_updates = full_load_updates.withColumn("is_current",F.lit(True))




In [6]:
#full_load_updates = full_load_updates.limit(3)

In [15]:
full_load_updates.show()

+----------+------------+-----+--------------------+---+-------------------+-------------------+----------+
|product_id|product_name|price|extraction_timestamp| op|     start_datetime|       end_datetime|is_current|
+----------+------------+-----+--------------------+---+-------------------+-------------------+----------+
|     00001|      Heater| 1000| 2023-01-01 01:01:01|  U|2023-01-01 01:01:01|2050-01-01 00:00:00|      true|
|     00002|  Thermostat| 1000| 2023-01-01 01:01:01|  U|2023-01-01 01:01:01|2050-01-01 00:00:00|      true|
|     00003|  Television| 1000| 2023-01-01 01:01:01|  U|2023-01-01 01:01:01|2050-01-01 00:00:00|      true|
|     00004|     Blender| 1000| 2023-01-01 01:01:01|  U|2023-01-01 01:01:01|2050-01-01 00:00:00|      true|
|     00005| USB charger| 1000| 2023-01-01 01:01:01|  U|2023-01-01 01:01:01|2050-01-01 00:00:00|      true|
+----------+------------+-----+--------------------+---+-------------------+-------------------+----------+


In [16]:
full_load_updates.schema

StructType(List(StructField(product_id,StringType,true),StructField(product_name,StringType,true),StructField(price,LongType,true),StructField(extraction_timestamp,TimestampType,true),StructField(op,StringType,true),StructField(start_datetime,TimestampType,true),StructField(end_datetime,TimestampType,true),StructField(is_current,BooleanType,false)))


In [18]:

full_load_updates.createOrReplaceTempView(f"tmp_{table_name}_updates")




In [19]:
query = f"""
MERGE INTO {catalog_name}.{database_name}.{table_name} AS f
USING (SELECT * FROM tmp_{table_name}_updates) AS u
ON f.{primary_key} = u.{primary_key}
WHEN MATCHED THEN UPDATE SET f.end_datetime = u.extraction_timestamp, f.is_current = False 

"""
spark.sql(query)

DataFrame[]


In [20]:
spark.table(f"{catalog_name}.{database_name}.{table_name}") \
    .show()

+----------+------------+-----+--------------------+----+-------------------+-------------------+----------+
|product_id|product_name|price|extraction_timestamp|  op|     start_datetime|       end_datetime|is_current|
+----------+------------+-----+--------------------+----+-------------------+-------------------+----------+
|     00001|      Heater|  250| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2023-01-01 01:01:01|     false|
|     00002|  Thermostat|  400| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2023-01-01 01:01:01|     false|
|     00003|  Television|  600| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2023-01-01 01:01:01|     false|
|     00004|     Blender|  100| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2023-01-01 01:01:01|     false|
|     00005| USB charger|   50| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2023-01-01 01:01:01|     false|
+----------+------------+-----+--------------------+----+-------------------+-------------------+----------+


In [21]:
full_load_updates.writeTo(f"{catalog_name}.{database_name}.{table_name}").append()




In [22]:
spark.table(f"{catalog_name}.{database_name}.{table_name}") \
    .show()

+----------+------------+-----+--------------------+----+-------------------+-------------------+----------+
|product_id|product_name|price|extraction_timestamp|  op|     start_datetime|       end_datetime|is_current|
+----------+------------+-----+--------------------+----+-------------------+-------------------+----------+
|     00001|      Heater|  250| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2023-01-01 01:01:01|     false|
|     00002|  Thermostat|  400| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2023-01-01 01:01:01|     false|
|     00003|  Television|  600| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2023-01-01 01:01:01|     false|
|     00004|     Blender|  100| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2023-01-01 01:01:01|     false|
|     00005| USB charger|   50| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2023-01-01 01:01:01|     false|
|     00001|      Heater| 1000| 2023-01-01 01:01:01|   U|2023-01-01 01:01:01|2050-01-01 00:00:00|      true|
|     00002|  Therm

In [24]:
primary_key

'product_id'


In [None]:
def scd2_simple(input_filepath, updates_filepath, output_directory, future_end_datetime, primary_key):
    """Close the records matched by an update, then append the updates as the current records."""
    # input_filepath is unused: the existing data is read from the table at output_directory
    start = time.time()
    full_load_updates = spark.read.parquet(updates_filepath)
    full_load_updates = full_load_updates.withColumn("start_datetime", F.col("extraction_timestamp"))
    full_load_updates = full_load_updates.withColumn("end_datetime", F.to_timestamp(F.lit(future_end_datetime), 'yyyy-MM-dd'))
    full_load_updates = full_load_updates.withColumn("is_current", F.lit(True))

    # Expose the updates to Spark SQL so that MERGE INTO can reference them
    full_load_updates.createOrReplaceTempView(f"tmp_{table_name}_updates")
    # Target the table passed in as output_directory rather than the notebook globals
    query = f"""
    MERGE INTO {output_directory} AS f
    USING (SELECT * FROM tmp_{table_name}_updates) AS u
    ON f.{primary_key} = u.{primary_key}
    WHEN MATCHED THEN UPDATE SET f.end_datetime = u.extraction_timestamp, f.is_current = False
    """
    spark.sql(query)
    # Append the updates as the new current versions
    full_load_updates.writeTo(output_directory).append()
    print(time.time() - start)
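The MERGE condition in `scd2_simple` matches every row that shares the primary key, including rows already closed by an earlier batch, so a second run would overwrite their end_datetime. One possible guard, shown here as a sketch, is to restrict the match to rows that are still current. The helper `build_close_current_sql` and the extra `is_current` predicate are assumptions added for illustration, not part of the notebook.

```python
def build_close_current_sql(target_table, updates_view, primary_key):
    """Build a MERGE statement that closes only the rows still flagged as current."""
    return "\n".join([
        f"MERGE INTO {target_table} AS f",
        f"USING (SELECT * FROM {updates_view}) AS u",
        # The is_current predicate is the assumption added by this sketch
        f"ON f.{primary_key} = u.{primary_key} AND f.is_current = true",
        "WHEN MATCHED THEN UPDATE SET f.end_datetime = u.extraction_timestamp, f.is_current = false",
    ])


query = build_close_current_sql(
    "glue_catalog.sb10_iceberg_dataframe.datagensb",
    "tmp_datagensb_updates",
    "product_id",
)
print(query)
# spark.sql(query)  # would run the statement in the notebook session
```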

## Slowly Changing Dimension Type 2 - Complex
This is a more complex SCD2 process which takes into account:

- Late-arriving records, where an update is processed with an extraction_timestamp that is earlier than the extraction_timestamp of the last processed record
- Batches which contain multiple updates to the same primary key

The process can be summarised as follows:

1. Concatenate (union) the updates with the existing data
2. Sort by primary key and extraction_timestamp
3. Window by primary key and set end_datetime to the next record's extraction_timestamp; if there is no next record, set it to a distant future timestamp

The process could be optimised by separating out records which have not received any updates, but this is left out to make the logic easier to follow.
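The windowing logic described above can be illustrated on plain Python dictionaries, using the Heater record and its two updates from the dummy data. This is a sketch of the logic only, not the Spark implementation: `rebuild_scd2` is a hypothetical helper, and deriving is_current from "last version of the key" is an assumption consistent with the end_datetime rule.

```python
from datetime import datetime
from itertools import groupby

FUTURE_END = datetime(2050, 1, 1)


def rebuild_scd2(rows, primary_key="product_id"):
    """Recompute start_datetime, end_datetime and is_current for every version of every key."""
    # Sort by primary key, then by extraction_timestamp
    ordered = sorted(rows, key=lambda r: (r[primary_key], r["extraction_timestamp"]))
    rebuilt = []
    # Window by primary key: each version ends where the next one starts
    for _, group in groupby(ordered, key=lambda r: r[primary_key]):
        versions = list(group)
        for i, row in enumerate(versions):
            is_last = i == len(versions) - 1
            end = FUTURE_END if is_last else versions[i + 1]["extraction_timestamp"]
            rebuilt.append({
                **row,
                "start_datetime": row["extraction_timestamp"],
                "end_datetime": end,
                "is_current": is_last,
            })
    return rebuilt


existing = [
    {"product_id": "00001", "price": 250, "extraction_timestamp": datetime(2022, 1, 1, 1, 1, 1)},
    {"product_id": "00001", "price": 1000, "extraction_timestamp": datetime(2023, 1, 1, 1, 1, 1)},
]
late_update = [
    {"product_id": "00001", "price": 500, "extraction_timestamp": datetime(2022, 6, 1, 1, 1, 1)},
]

# Union the existing data with the late-arriving update, then rebuild the history
history = rebuild_scd2(existing + late_update)
for row in history:
    print(row["price"], row["start_datetime"], row["end_datetime"], row["is_current"])
```

Because every version of a key is reprocessed, the late record slots in between the two existing versions without any special handling.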

In [23]:
#late_updates = spark.read.option('header','true').parquet("s3://sb-test-bucket-ireland/data-engineering-use-cases/dummy-data/late_updates.parquet")




In [25]:
late_updates = spark.read.parquet("s3://sb-test-bucket-ireland/data-engineering-use-cases/dummy-data/late_updates.parquet")
late_updates = late_updates.withColumn("start_datetime",F.col("extraction_timestamp"))
late_updates = late_updates.withColumn("end_datetime", F.to_timestamp(F.lit("2050-01-01"), 'yyyy-MM-dd'))
late_updates = late_updates.withColumn("is_current",F.lit(True))
primary_key = "product_id"




In [7]:
#primary_key = "product_id"

In [8]:
#late_updates = late_updates.limit(3)

In [26]:
late_updates.show()

+----------+------------+-----+--------------------+---+-------------------+-------------------+----------+
|product_id|product_name|price|extraction_timestamp| op|     start_datetime|       end_datetime|is_current|
+----------+------------+-----+--------------------+---+-------------------+-------------------+----------+
|     00001|      Heater|  500| 2022-06-01 01:01:01|  U|2022-06-01 01:01:01|2050-01-01 00:00:00|      true|
|     00002|  Thermostat|  500| 2022-06-01 01:01:01|  U|2022-06-01 01:01:01|2050-01-01 00:00:00|      true|
|     00003|  Television|  500| 2022-06-01 01:01:01|  U|2022-06-01 01:01:01|2050-01-01 00:00:00|      true|
|     00004|     Blender|  500| 2022-06-01 01:01:01|  U|2022-06-01 01:01:01|2050-01-01 00:00:00|      true|
|     00005| USB charger|  500| 2022-06-01 01:01:01|  U|2022-06-01 01:01:01|2050-01-01 00:00:00|      true|
+----------+------------+-----+--------------------+---+-------------------+-------------------+----------+


In [61]:
spark.table(f"{catalog_name}.{database_name}.{table_name}").show()

+----------+------------+-----+--------------------+----+-------------------+
|product_id|product_name|price|extraction_timestamp|  op|     start_datetime|
+----------+------------+-----+--------------------+----+-------------------+
|     00001|      Heater| 1000| 2023-01-01 01:01:01|   U|2023-01-01 01:01:01|
|     00002|  Thermostat| 1000| 2023-01-01 01:01:01|   U|2023-01-01 01:01:01|
|     00003|  Television| 1000| 2023-01-01 01:01:01|   U|2023-01-01 01:01:01|
|     00004|     Blender| 1000| 2023-01-01 01:01:01|   U|2023-01-01 01:01:01|
|     00005| USB charger| 1000| 2023-01-01 01:01:01|   U|2023-01-01 01:01:01|
|     00004|     Blender|  100| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|
|     00001|      Heater|  250| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|
|     00003|  Television|  600| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|
|     00002|  Thermostat|  400| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|
|     00005| USB charger|   50| 2022-01-01 01:01:01|None|2022-01

In [9]:
#late_updates.createOrReplaceTempView(f"tmp_{table_name}_late_updates")

In [28]:
late_updates.writeTo(f"{catalog_name}.{database_name}.{table_name}").append()




In [14]:
spark.table(f"{catalog_name}.{database_name}.{table_name}").drop("end_datetime","is_current").writeTo(f"{catalog_name}.{database_name}.{table_name}").createOrReplace()




In [33]:
spark.table(f"{catalog_name}.{database_name}.{table_name}").drop("end_datetime","is_current")

DataFrame[product_id: string, product_name: string, price: bigint, extraction_timestamp: timestamp, op: string, start_datetime: timestamp]


In [54]:
#spark.sql(query1).writeTo(f"{catalog_name}.{database_name}.{table_name}").createOrReplace()

AnalysisException: Found duplicate column(s) in the table definition of sb8_iceberg_dataframe.datagensb: `end_datetime`


In [38]:
spark.table(f"{catalog_name}.{database_name}.{table_name}").show()

+----------+------------+-----+--------------------+----+-------------------+
|product_id|product_name|price|extraction_timestamp|  op|     start_datetime|
+----------+------------+-----+--------------------+----+-------------------+
|     00001|      Heater|  500| 2022-06-01 01:01:01|   U|2022-06-01 01:01:01|
|     00002|  Thermostat|  500| 2022-06-01 01:01:01|   U|2022-06-01 01:01:01|
|     00003|  Television|  500| 2022-06-01 01:01:01|   U|2022-06-01 01:01:01|
|     00004|     Blender|  500| 2022-06-01 01:01:01|   U|2022-06-01 01:01:01|
|     00005| USB charger|  500| 2022-06-01 01:01:01|   U|2022-06-01 01:01:01|
|     00001|      Heater|  250| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|
|     00002|  Thermostat|  400| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|
|     00003|  Television|  600| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|
|     00004|     Blender|  100| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|
|     00005| USB charger|   50| 2022-01-01 01:01:01|None|2022-01

In [None]:
#spark.table(output_directory).writeTo(output_directory) .create()

In [30]:
spark.table(f"{catalog_name}.{database_name}.{table_name}").show()

+----------+------------+-----+--------------------+----+-------------------+-------------------+----------+
|product_id|product_name|price|extraction_timestamp|  op|     start_datetime|       end_datetime|is_current|
+----------+------------+-----+--------------------+----+-------------------+-------------------+----------+
|     00001|      Heater|  500| 2022-06-01 01:01:01|   U|2022-06-01 01:01:01|2050-01-01 00:00:00|      true|
|     00002|  Thermostat|  500| 2022-06-01 01:01:01|   U|2022-06-01 01:01:01|2050-01-01 00:00:00|      true|
|     00003|  Television|  500| 2022-06-01 01:01:01|   U|2022-06-01 01:01:01|2050-01-01 00:00:00|      true|
|     00004|     Blender|  500| 2022-06-01 01:01:01|   U|2022-06-01 01:01:01|2050-01-01 00:00:00|      true|
|     00005| USB charger|  500| 2022-06-01 01:01:01|   U|2022-06-01 01:01:01|2050-01-01 00:00:00|      true|
|     00001|      Heater|  250| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2023-01-01 01:01:01|     false|
|     00002|  Therm

In [40]:
query1 = f"""
SELECT *,
LEAD(extraction_timestamp,1,TO_TIMESTAMP('2050-01-01 00:00:00')) OVER(PARTITION BY {primary_key} ORDER BY extraction_timestamp) AS end_datetime

FROM {catalog_name}.{database_name}.{table_name}

ORDER BY {primary_key}, extraction_timestamp
    """
spark.sql(query1)

DataFrame[product_id: string, product_name: string, price: bigint, extraction_timestamp: timestamp, op: string, start_datetime: timestamp, end_datetime: timestamp]


In [41]:
spark.sql(query1).show()

+----------+------------+-----+--------------------+----+-------------------+-------------------+
|product_id|product_name|price|extraction_timestamp|  op|     start_datetime|       end_datetime|
+----------+------------+-----+--------------------+----+-------------------+-------------------+
|     00001|      Heater|  250| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2022-06-01 01:01:01|
|     00001|      Heater|  500| 2022-06-01 01:01:01|   U|2022-06-01 01:01:01|2023-01-01 01:01:01|
|     00001|      Heater| 1000| 2023-01-01 01:01:01|   U|2023-01-01 01:01:01|2050-01-01 00:00:00|
|     00002|  Thermostat|  400| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2022-06-01 01:01:01|
|     00002|  Thermostat|  500| 2022-06-01 01:01:01|   U|2022-06-01 01:01:01|2023-01-01 01:01:01|
|     00002|  Thermostat| 1000| 2023-01-01 01:01:01|   U|2023-01-01 01:01:01|2050-01-01 00:00:00|
|     00003|  Television|  600| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2022-06-01 01:01:01|
|     00003|  Televi

In [42]:
spark.sql(query1).writeTo(f"{catalog_name}.{database_name}.{table_name}").createOrReplace()




In [43]:
query2 = f"""
SELECT *,
CASE WHEN end_datetime = '2050-01-01 00:00:00' THEN True ELSE False END AS is_current

FROM {catalog_name}.{database_name}.{table_name}

ORDER BY {primary_key}, extraction_timestamp
    """
spark.sql(query2)

DataFrame[product_id: string, product_name: string, price: bigint, extraction_timestamp: timestamp, op: string, start_datetime: timestamp, end_datetime: timestamp, is_current: boolean]


In [44]:
spark.sql(query2).show()

+----------+------------+-----+--------------------+----+-------------------+-------------------+----------+
|product_id|product_name|price|extraction_timestamp|  op|     start_datetime|       end_datetime|is_current|
+----------+------------+-----+--------------------+----+-------------------+-------------------+----------+
|     00001|      Heater|  250| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2022-06-01 01:01:01|     false|
|     00001|      Heater|  500| 2022-06-01 01:01:01|   U|2022-06-01 01:01:01|2023-01-01 01:01:01|     false|
|     00001|      Heater| 1000| 2023-01-01 01:01:01|   U|2023-01-01 01:01:01|2050-01-01 00:00:00|      true|
|     00002|  Thermostat|  400| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2022-06-01 01:01:01|     false|
|     00002|  Thermostat|  500| 2022-06-01 01:01:01|   U|2022-06-01 01:01:01|2023-01-01 01:01:01|     false|
|     00002|  Thermostat| 1000| 2023-01-01 01:01:01|   U|2023-01-01 01:01:01|2050-01-01 00:00:00|      true|
|     00003|  Telev

In [45]:
spark.sql(query2).writeTo(f"{catalog_name}.{database_name}.{table_name}").createOrReplace()




In [46]:
spark.table(output_directory).show()

+----------+------------+-----+--------------------+----+-------------------+-------------------+----------+
|product_id|product_name|price|extraction_timestamp|  op|     start_datetime|       end_datetime|is_current|
+----------+------------+-----+--------------------+----+-------------------+-------------------+----------+
|     00001|      Heater|  250| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2023-01-01 01:01:01|     false|
|     00002|  Thermostat|  400| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2023-01-01 01:01:01|     false|
|     00003|  Television|  600| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2023-01-01 01:01:01|     false|
|     00004|     Blender|  100| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2023-01-01 01:01:01|     false|
|     00005| USB charger|   50| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2023-01-01 01:01:01|     false|
|     00001|      Heater|  500| 2022-06-01 01:01:01|   U|2022-06-01 01:01:01|2050-01-01 00:00:00|      true|
|     00002|  Therm

## Test code

In [49]:
# query2 = f"""

# ALTER TABLE {catalog_name}.{database_name}.{table_name}
# DROP COLUMN is_current


#     """
# spark.sql(query2)

DataFrame[]


In [53]:
query1 = f"""
SELECT *,
LEAD(extraction_timestamp,1,TO_TIMESTAMP('2050-01-01 00:00:00')) OVER(PARTITION BY {primary_key} ORDER BY extraction_timestamp) AS end_datetime

FROM {catalog_name}.{database_name}.{table_name}

ORDER BY {primary_key}, extraction_timestamp
    """
spark.sql(query1)

DataFrame[product_id: string, product_name: string, price: bigint, extraction_timestamp: timestamp, op: string, start_datetime: timestamp, end_datetime: timestamp, end_datetime: timestamp]
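The duplicated `end_datetime` in the schema above appears because `SELECT *` already returns the table's existing end_datetime column and the `LEAD(...)` expression adds a second one. One way around this, sketched below, is to list only the base columns and derive both curation columns in a single query. The helper `build_scd2_rebuild_sql` and its parameters are hypothetical; the base column list is taken from the table output shown earlier.

```python
def build_scd2_rebuild_sql(table, primary_key, base_columns, future_end="2050-01-01 00:00:00"):
    """Build one query that recomputes end_datetime and is_current from the base columns only."""
    select_list = ", ".join(base_columns)
    return f"""
WITH versions AS (
    SELECT {select_list},
           LEAD(extraction_timestamp, 1, TO_TIMESTAMP('{future_end}'))
               OVER (PARTITION BY {primary_key} ORDER BY extraction_timestamp) AS end_datetime
    FROM {table}
)
SELECT *,
       end_datetime = TO_TIMESTAMP('{future_end}') AS is_current
FROM versions
ORDER BY {primary_key}, extraction_timestamp
""".strip()


# Base columns only: the derived end_datetime and is_current are deliberately left out
base_columns = ["product_id", "product_name", "price", "extraction_timestamp", "op", "start_datetime"]
query = build_scd2_rebuild_sql(
    "glue_catalog.sb10_iceberg_dataframe.datagensb", "product_id", base_columns
)
print(query)
# spark.sql(query)  # would return a DataFrame with each derived column exactly once
```

Since the derived columns never pass through `SELECT *` on the source table, the query can be rerun after further appends without dropping columns first.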


In [None]:
query1 = f"""
ALTER TABLE {catalog_name}.{database_name}.{table_name}
ALTER end_datetime
SELECT *,
LEAD(extraction_timestamp,1,TO_TIMESTAMP('2050-01-01 00:00:00')) OVER(PARTITION BY {primary_key} ORDER BY extraction_timestamp)

FROM {catalog_name}.{database_name}.{table_name}

ORDER BY {primary_key}, extraction_timestamp
    """
spark.sql(query1)

In [66]:
query1 = f"""
CREATE TEMPORARY TABLE temp SELECT *  
FROM {catalog_name}.{database_name}.{table_name};

ALTER TABLE temp DROP (end_datetime);
SELECT *,
LEAD(extraction_timestamp,1,TO_TIMESTAMP('2050-01-01 00:00:00')) OVER(PARTITION BY {primary_key} ORDER BY extraction_timestamp) END AS end_datetime

FROM temp

ORDER BY {primary_key}, extraction_timestamp
    """
spark.sql(query1)

ParseException: 
extraneous input 'ALTER' expecting {<EOF>, ';'}(line 5, pos 0)

== SQL ==

CREATE TEMPORARY TABLE temp SELECT *  
FROM glue_catalog.sb10_iceberg_dataframe.datagensb;

ALTER TABLE temp DROP (end_datetime);
^^^
SELECT *,
LEAD(extraction_timestamp,1,TO_TIMESTAMP('2050-01-01 00:00:00')) OVER(PARTITION BY product_id ORDER BY extraction_timestamp) END AS end_datetime

FROM temp

ORDER BY product_id, extraction_timestamp
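The ParseException above is raised at the second statement: `spark.sql` parses a single statement per call, so a script with several statements has to be submitted one statement at a time. A naive way to do that is sketched below. `split_sql_statements`, `some_table` and `tmp_versions` are hypothetical names, the split ignores semicolons inside string literals, and each statement would still have to be valid Spark SQL on its own.

```python
def split_sql_statements(script: str) -> list:
    """Split a script on semicolons and drop empty fragments (naive: ignores quoting)."""
    return [part.strip() for part in script.split(";") if part.strip()]


script = """
CREATE OR REPLACE TEMPORARY VIEW tmp_versions AS SELECT * FROM some_table;
SELECT product_id, extraction_timestamp FROM tmp_versions;
"""

statements = split_sql_statements(script)
for statement in statements:
    print(statement)
    # spark.sql(statement)  # one call per statement in the notebook session
```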
    



In [56]:
# query1 = f"""
# MERGE INTO {catalog_name}.{database_name}.{table_name} AS f
# USING (SELECT * FROM tmp_{table_name}_late_updates) AS u
# ON f.{primary_key} = u.{primary_key}
# WHEN MATCHED THEN UPDATE SET f.end_datetime = u.extraction_timestamp, f.is_current = False  
#     """
# spark.sql(query1)

DataFrame[]


In [50]:
spark.sql(query2).show()

AnalysisException: Cannot delete missing field is_current in glue_catalog.sb8_iceberg_dataframe.datagensb schema: root
 |-- product_id: string (nullable = true)
 |-- product_name: string (nullable = true)
 |-- price: long (nullable = true)
 |-- extraction_timestamp: timestamp (nullable = true)
 |-- op: string (nullable = true)
 |-- start_datetime: timestamp (nullable = true)
 |-- end_datetime: timestamp (nullable = true)
; line 3 pos 0;
'AlterTable org.apache.iceberg.spark.SparkCatalog@45000dd6, sb8_iceberg_dataframe.datagensb, RelationV2[product_id#1166, product_name#1167, price#1168L, extraction_timestamp#1169, op#1170, start_datetime#1171, end_datetime#1172] glue_catalog.sb8_iceberg_dataframe.datagensb, [org.apache.spark.sql.connector.catalog.TableChange$DeleteColumn@9c98a783]



In [None]:
MERGE INTO {catalog_name}.{database_name}.{table_name} AS f

In [58]:
late_updates.writeTo(f"{catalog_name}.{database_name}.{table_name}").append()




In [70]:
spark.table(f"{catalog_name}.{database_name}.{table_name}").show()

+----------+------------+-----+--------------------+----+-------------------+-------------------+----------+
|product_id|product_name|price|extraction_timestamp|  op|     start_datetime|       end_datetime|is_current|
+----------+------------+-----+--------------------+----+-------------------+-------------------+----------+
|     00001|      Heater|  250| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2022-06-01 01:01:01|     false|
|     00001|      Heater|  500| 2022-06-01 01:01:01|   U|2022-06-01 01:01:01|2023-01-01 01:01:01|     false|
|     00001|      Heater| 1000| 2023-01-01 01:01:01|   U|2023-01-01 01:01:01|2050-01-01 00:00:00|      true|
|     00002|  Thermostat|  400| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2022-06-01 01:01:01|     false|
|     00002|  Thermostat|  500| 2022-06-01 01:01:01|   U|2022-06-01 01:01:01|2023-01-01 01:01:01|     false|
|     00002|  Thermostat| 1000| 2023-01-01 01:01:01|   U|2023-01-01 01:01:01|2050-01-01 00:00:00|      true|
|     00003|  Telev

In [None]:
# query1 = f"""
# SELECT *,
# LEAD(extraction_timestamp,1,TO_TIMESTAMP('2050-01-01 00:00:00')) OVER(PARTITION BY {primary_key} ORDER BY extraction_timestamp) AS end_datetime

# FROM {catalog_name}.{database_name}.{table_name}

# ORDER BY {primary_key}, extraction_timestamp
#     """
# spark.sql(query1)

In [None]:
spark.sql(query1).writeTo(f"{catalog_name}.{database_name}.{table_name}").createOrReplace()

In [59]:
query2 = f"""
SELECT * EXCEPT[is_current]
CASE WHEN end_datetime = '2050-01-01 00:00:00' THEN True ELSE False END AS is_current

FROM {catalog_name}.{database_name}.{table_name}

ORDER BY product_id, extraction_timestamp
    """
spark.sql(query2)

ParseException: 
mismatched input '[' expecting {<EOF>, ';'}(line 2, pos 15)

== SQL ==

SELECT * EXCEPT[is_current]
---------------^^^
CASE WHEN end_datetime = '2050-01-01 00:00:00' THEN True ELSE False END AS is_current

FROM glue_catalog.sb7_iceberg_dataframe.datagensb

ORDER BY product_id, extraction_timestamp
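The parser in this session rejects `SELECT * EXCEPT[...]`. A workaround that does not depend on that syntax is to build the select list in Python from the table's column names and leave out the unwanted ones. The sketch below uses a hard-coded list; in the notebook the names could come from `spark.table(...).columns`. `columns_except` is a hypothetical helper.

```python
def columns_except(columns, excluded):
    """Return the column names in their original order, without the excluded ones."""
    excluded = set(excluded)
    return [name for name in columns if name not in excluded]


table_columns = [
    "product_id", "product_name", "price", "extraction_timestamp",
    "op", "start_datetime", "end_datetime", "is_current",
]
kept = columns_except(table_columns, ["is_current"])
select_list = ", ".join(kept)
print(select_list)
```

The resulting string can be interpolated into the query in place of `*`, so that the recomputed is_current column is the only one with that name.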
    



In [69]:
query2 = f"""
CREATE TEMPORARY TABLE temp SELECT *  
FROM {catalog_name}.{database_name}.{table_name},
ALTER TABLE temp DROP (is_current),
SELECT * FROM temptable
CASE WHEN end_datetime = '2050-01-01 00:00:00' THEN True ELSE False END AS is_current
SELECT * FROM temptable

    """
spark.sql(query2)

ParseException: 
mismatched input 'temp' expecting {<EOF>, ';'}(line 4, pos 12)

== SQL ==

CREATE TEMPORARY TABLE temp SELECT *  
FROM glue_catalog.sb7_iceberg_dataframe.datagensb,
ALTER TABLE temp DROP (is_current),
------------^^^
SELECT * FROM temptable
CASE WHEN end_datetime = '2050-01-01 00:00:00' THEN True ELSE False END AS is_current
SELECT * FROM temptable

    



In [54]:
query2 = f"""

SELECT * INTO temptable 
FROM {catalog_name}.{database_name}.{table_name}
ALTER TABLE temptable
DROP COLUMN is_current 
CASE WHEN end_datetime = '2050-01-01 00:00:00' THEN True ELSE False END AS is_current
SELECT * FROM temptable
DROP TABLE temptable
    """
spark.sql(query2)

ParseException: 
mismatched input 'temptable' expecting {<EOF>, ';'}(line 3, pos 14)

== SQL ==


SELECT * INTO temptable 
--------------^^^
FROM glue_catalog.sb7_iceberg_dataframe.datagensb
ALTER TABLE temptable
DROP COLUMN is_current 
CASE WHEN end_datetime = '2050-01-01 00:00:00' THEN True ELSE False END AS is_current
SELECT * FROM temptable
DROP TABLE temptable
    



In [None]:
spark.sql(query2).writeTo(f"{catalog_name}.{database_name}.{table_name}").createOrReplace()

In [46]:
spark.sql(query1).writeTo(f"{catalog_name}.{database_name}.{table_name}").createOrReplace()

AnalysisException: Found duplicate column(s) in the table definition of sb2_iceberg_dataframe.datagensb: `end_datetime`


In [14]:
spark.sql(query1).show()

+----------+------------+-----+--------------------+----+-------------------+-------------------+-------------------+
|product_id|product_name|price|extraction_timestamp|  op|     start_datetime|       end_datetime|       end_datetime|
+----------+------------+-----+--------------------+----+-------------------+-------------------+-------------------+
|     00001|      Heater|  250| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2022-06-01 01:01:01|2022-06-01 01:01:01|
|     00001|      Heater|  500| 2022-06-01 01:01:01|   U|2022-06-01 01:01:01|2050-01-01 00:00:00|2050-01-01 00:00:00|
|     00002|  Thermostat|  400| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2022-06-01 01:01:01|2022-06-01 01:01:01|
|     00002|  Thermostat|  500| 2022-06-01 01:01:01|   U|2022-06-01 01:01:01|2050-01-01 00:00:00|2050-01-01 00:00:00|
|     00003|  Television|  600| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2022-06-01 01:01:01|2022-06-01 01:01:01|
|     00003|  Television|  500| 2022-06-01 01:01:01|   U

In [15]:
spark.table(f"{catalog_name}.{database_name}.{table_name}").show()

+----------+------------+-----+--------------------+----+-------------------+-------------------+
|product_id|product_name|price|extraction_timestamp|  op|     start_datetime|       end_datetime|
+----------+------------+-----+--------------------+----+-------------------+-------------------+
|     00001|      Heater|  250| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2022-06-01 01:01:01|
|     00001|      Heater|  500| 2022-06-01 01:01:01|   U|2022-06-01 01:01:01|2050-01-01 00:00:00|
|     00002|  Thermostat|  400| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2022-06-01 01:01:01|
|     00002|  Thermostat|  500| 2022-06-01 01:01:01|   U|2022-06-01 01:01:01|2050-01-01 00:00:00|
|     00003|  Television|  600| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2022-06-01 01:01:01|
|     00003|  Television|  500| 2022-06-01 01:01:01|   U|2022-06-01 01:01:01|2050-01-01 00:00:00|
|     00004|     Blender|  100| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2022-06-01 01:01:01|
|     00004|     Ble

In [60]:
# query2 = f"""
# SELECT *,
# CASE WHEN end_datetime = '2050-01-01 00:00:00' THEN True ELSE False END AS is_current

# FROM {catalog_name}.{database_name}.{table_name}

# ORDER BY {primary_key}, extraction_timestamp
#     """
# spark.sql(query2)

DataFrame[product_id: string, product_name: string, price: bigint, extraction_timestamp: timestamp, op: string, start_datetime: timestamp, end_datetime: timestamp, is_current: boolean, is_current: boolean]


In [61]:
spark.sql(query2).writeTo(f"{catalog_name}.{database_name}.{table_name}").createOrReplace()

AnalysisException: Found duplicate column(s) in the table definition of sb4_iceberg_dataframe.datagensb: `is_current`


In [62]:
spark.table(f"{catalog_name}.{database_name}.{table_name}").show()

+----------+------------+-----+--------------------+----+-------------------+-------------------+----------+
|product_id|product_name|price|extraction_timestamp|  op|     start_datetime|       end_datetime|is_current|
+----------+------------+-----+--------------------+----+-------------------+-------------------+----------+
|     00001|      Heater|  500| 2022-06-01 01:01:01|   U|2022-06-01 01:01:01|2050-01-01 00:00:00|      true|
|     00002|  Thermostat|  500| 2022-06-01 01:01:01|   U|2022-06-01 01:01:01|2050-01-01 00:00:00|      true|
|     00003|  Television|  500| 2022-06-01 01:01:01|   U|2022-06-01 01:01:01|2050-01-01 00:00:00|      true|
|     00004|     Blender|  500| 2022-06-01 01:01:01|   U|2022-06-01 01:01:01|2050-01-01 00:00:00|      true|
|     00005| USB charger|  500| 2022-06-01 01:01:01|   U|2022-06-01 01:01:01|2050-01-01 00:00:00|      true|
|     00001|      Heater| 1000| 2023-01-01 01:01:01|   U|2023-01-01 01:01:01|2022-06-01 01:01:01|     false|
|     00002|  Therm

In [34]:
def scd2_complex(input_filepath, late_updates_filepath, output_directory, primary_key):
    start = time.time()
    late_updates = spark.read.option('header','true').parquet(late_updates_filepath)
    late_updates = late_updates.withColumn("start_datetime",F.col("extraction_timestamp"))
    late_updates.writeTo(output_directory).append()
    spark.table(f"{catalog_name}.{database_name}.{table_name}").drop("end_datetime","is_current").writeTo(f"{catalog_name}.{database_name}.{table_name}").createOrReplace()
    query1 = f"""
    SELECT *,
    LEAD(extraction_timestamp,1,TO_TIMESTAMP('2050-01-01 00:00:00')) OVER(PARTITION BY {primary_key} ORDER BY extraction_timestamp) AS end_datetime

    FROM {catalog_name}.{database_name}.{table_name}

    ORDER BY {primary_key}, extraction_timestamp
    """
    spark.sql(query1)
    spark.sql(query1).writeTo(output_directory).createOrReplace()
    query2 = f"""
    SELECT *,
    CASE WHEN end_datetime = '2050-01-01 00:00:00' THEN True ELSE False END AS is_current

    FROM {catalog_name}.{database_name}.{table_name}

    ORDER BY {primary_key}, extraction_timestamp
    """
    spark.sql(query2)
    spark.sql(query2).writeTo(output_directory).createOrReplace()
    print(time.time()-start)




In [35]:
late_updates_filepath="s3://sb-test-bucket-ireland/data-engineering-use-cases/dummy-data/late_updates.parquet"




In [36]:
primary_key = "product_id"




In [37]:
scd2_complex(input_filepath, late_updates_filepath, output_directory, primary_key)

AnalysisException: Cannot write incompatible data to table 'glue_catalog.sb8_iceberg_dataframe.datagensb':
- Cannot find data for output column 'end_datetime'
- Cannot find data for output column 'is_current'
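
This failure appears to come from the `append()` at the top of the function: the late-update DataFrame has no `end_datetime` or `is_current` column, while the target table still has both at that point. A pre-flight comparison along these lines could surface the mismatch before writing. It is a plain-Python sketch under assumptions: `missing_columns` is a hypothetical helper, and both column lists are hard-coded from the outputs shown in this notebook rather than read from Spark.

```python
def missing_columns(table_columns, frame_columns):
    """List the table columns (in table order) that the incoming frame does not provide."""
    provided = set(frame_columns)
    return [c for c in table_columns if c not in provided]


# Target table schema as shown by .show() before the late updates are applied.
table_columns = [
    "product_id", "product_name", "price", "extraction_timestamp",
    "op", "start_datetime", "end_datetime", "is_current",
]

# Assumed columns of the late-update frame after start_datetime has been added.
late_update_columns = [
    "product_id", "product_name", "price", "extraction_timestamp",
    "op", "start_datetime",
]

gap = missing_columns(table_columns, late_update_columns)
print(gap)
```

If the list is non-empty, either the derived columns need to be dropped from the table before appending, or the incoming frame needs those columns added.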


In [61]:
spark.table(f"{catalog_name}.{database_name}.{table_name}").show()

+----------+------------+-----+--------------------+----+-------------------+-------------------+----------+
|product_id|product_name|price|extraction_timestamp|  op|     start_datetime|       end_datetime|is_current|
+----------+------------+-----+--------------------+----+-------------------+-------------------+----------+
|     00001|      Heater|  250| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2022-06-01 01:01:01|     false|
|     00001|      Heater|  500| 2022-06-01 01:01:01|   U|2022-06-01 01:01:01|2050-01-01 00:00:00|      true|
|     00002|  Thermostat|  400| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2022-06-01 01:01:01|     false|
|     00002|  Thermostat|  500| 2022-06-01 01:01:01|   U|2022-06-01 01:01:01|2050-01-01 00:00:00|      true|
|     00003|  Television|  600| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2022-06-01 01:01:01|     false|
|     00003|  Television|  500| 2022-06-01 01:01:01|   U|2022-06-01 01:01:01|2050-01-01 00:00:00|      true|
|     00004|     Bl

In [None]:
query1 = f"""
SELECT product_id, product_name, price, extraction_timestamp, op, start_datetime,is_current,
LEAD(start_datetime,1,TO_TIMESTAMP('2050-01-01 00:00:00')) OVER(PARTITION BY product_id ORDER BY start_datetime) AS end_datetime

FROM {catalog_name}.{database_name}.{table_name}

ORDER BY product_id, extraction_timestamp
    """
spark.sql(query1)

In [None]:
spark.sql(query1).writeTo(f"{catalog_name}.{database_name}.{table_name}").overwritePartitions()

In [None]:
spark.sql(query1).show()

In [None]:
query2 = f"""
SELECT product_id, product_name, price, extraction_timestamp, op, start_datetime,end_datetime,
CASE WHEN end_datetime = '2050-01-01 00:00:00' THEN True ELSE False END AS is_current

FROM {catalog_name}.{database_name}.{table_name}

ORDER BY product_id, extraction_timestamp
    """
spark.sql(query2)

In [None]:
spark.sql(query2).writeTo(f"{catalog_name}.{database_name}.{table_name}").overwritePartitions()

In [None]:
spark.sql(query2).show()

In [None]:
spark.table(f"{catalog_name}.{database_name}.{table_name}").show()

In [5]:
%session_id_prefix native-iceberg-dataframe-
%glue_version 3.0
%idle_timeout 60
%%configure 
{
  "--conf": "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
  "--datalake-formats": "iceberg"
}

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
Installed kernel version: 0.37.3 
Setting session ID prefix to native-iceberg-dataframe-
Setting Glue version to: 3.0
Current idle_timeout is 2800 minutes.
idle_timeout has been set to 60 minutes.
The following configurations have been updated: {'--conf': 'spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions', '--datalake-formats': 'iceberg'}


In [22]:
catalog_name = "glue_catalog"
bucket_name = "sb-test-bucket-ireland"
bucket_prefix = "sb"
database_name = "sb13_iceberg_dataframe"
table_name = "datagensb"
warehouse_path = f"s3://{bucket_name}/{bucket_prefix}"




In [23]:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .config(f"spark.sql.catalog.{catalog_name}", "org.apache.iceberg.spark.SparkCatalog") \
    .config(f"spark.sql.catalog.{catalog_name}.warehouse", f"{warehouse_path}") \
    .config(f"spark.sql.catalog.{catalog_name}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config(f"spark.sql.catalog.{catalog_name}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .config("spark.sql.extensions","org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .getOrCreate()




In [24]:
query = f"""
CREATE DATABASE IF NOT EXISTS {catalog_name}.{database_name}
"""
spark.sql(query)

DataFrame[]


In [25]:
input_filepath = "s3://sb-test-bucket-ireland/data-engineering-use-cases/dummy-data/full_load.parquet"
output_directory = f"{catalog_name}.{database_name}.{table_name}"
future_end_datetime = "2050-01-01"
updates_filepath ="s3://sb-test-bucket-ireland/data-engineering-use-cases/dummy-data/updates.parquet"
primary_key = "product_id"
late_updates_filepath="s3://sb-test-bucket-ireland/data-engineering-use-cases/dummy-data/late_updates.parquet"




## Functions

In [26]:
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, Row
import time


def bulk_insert(input_filepath, output_directory, future_end_datetime):
    """Create the Iceberg table from the full load, adding the SCD2 columns."""
    start = time.time()
    full_load = spark.read.parquet(input_filepath)
    # Every row of the initial load is current: it opens at extraction time
    # and stays open until the sentinel end date.
    full_load = full_load.withColumn("start_datetime", F.col("extraction_timestamp"))
    full_load = full_load.withColumn("end_datetime", F.to_timestamp(F.lit(future_end_datetime), 'yyyy-MM-dd'))
    full_load = full_load.withColumn("op", F.lit("None"))
    full_load = full_load.withColumn("is_current", F.lit(True))
    full_load.sortWithinPartitions("product_name") \
        .writeTo(output_directory) \
        .create()
    print(time.time() - start)




In [27]:
bulk_insert(input_filepath,output_directory,future_end_datetime)

3.3868606090545654


In [28]:
spark.table(f"{catalog_name}.{database_name}.{table_name}").show()

+----------+------------+-----+--------------------+----+-------------------+-------------------+----------+
|product_id|product_name|price|extraction_timestamp|  op|     start_datetime|       end_datetime|is_current|
+----------+------------+-----+--------------------+----+-------------------+-------------------+----------+
|     00004|     Blender|  100| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2050-01-01 00:00:00|      true|
|     00001|      Heater|  250| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2050-01-01 00:00:00|      true|
|     00003|  Television|  600| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2050-01-01 00:00:00|      true|
|     00002|  Thermostat|  400| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2050-01-01 00:00:00|      true|
|     00005| USB charger|   50| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2050-01-01 00:00:00|      true|
+----------+------------+-----+--------------------+----+-------------------+-------------------+----------+


In [7]:
# updates_filepath ="s3://sb-test-bucket-ireland/data-engineering-use-cases/dummy-data/updates.parquet"
# primary_key = "product_id"

In [29]:
def scd2_simple(input_filepath, updates_filepath, output_directory, future_end_datetime, primary_key):
    start = time.time()
    full_load_updates = spark.read.parquet(updates_filepath)
    # New versions open at extraction time and stay open until the sentinel end date.
    full_load_updates = full_load_updates.withColumn("start_datetime", F.col("extraction_timestamp"))
    full_load_updates = full_load_updates.withColumn("end_datetime", F.to_timestamp(F.lit(future_end_datetime), 'yyyy-MM-dd'))
    full_load_updates = full_load_updates.withColumn("is_current", F.lit(True))

    full_load_updates.createOrReplaceTempView(f"tmp_{table_name}_updates")
    # Close only the version that is still current, so that already-closed
    # history keeps its end_datetime if this function is run more than once.
    query = f"""
    MERGE INTO {output_directory} AS f
    USING (SELECT * FROM tmp_{table_name}_updates) AS u
    ON f.{primary_key} = u.{primary_key}
    WHEN MATCHED AND f.is_current = True THEN UPDATE SET f.end_datetime = u.extraction_timestamp, f.is_current = False
    """
    spark.sql(query)
    full_load_updates.writeTo(output_directory).append()
    print(time.time() - start)
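
The two steps in `scd2_simple` (close the existing version of each updated key, then append the new version as current) can be illustrated without Spark. The following is a plain-Python sketch on lists of dicts, not the Iceberg implementation: `apply_updates` and `FUTURE_END` are hypothetical names, the rows are a reduced version of the Heater example, and the sketch only closes rows that are still current.

```python
from datetime import datetime

# Sentinel end date, matching future_end_datetime = "2050-01-01" in the notebook.
FUTURE_END = datetime(2050, 1, 1)


def apply_updates(history, updates, key="product_id"):
    """Close the current version of each updated key, then append the new versions."""
    incoming = {u[key]: u for u in updates}
    result = []
    for row in history:
        row = dict(row)  # copy, so the input list is left untouched
        update = incoming.get(row[key])
        if update is not None and row["is_current"]:
            # The old version ends where the new one starts.
            row["end_datetime"] = update["extraction_timestamp"]
            row["is_current"] = False
        result.append(row)
    for u in updates:
        result.append({
            **u,
            "start_datetime": u["extraction_timestamp"],
            "end_datetime": FUTURE_END,
            "is_current": True,
        })
    return result


history = [{
    "product_id": "00001", "price": 250,
    "extraction_timestamp": datetime(2022, 1, 1, 1, 1, 1),
    "start_datetime": datetime(2022, 1, 1, 1, 1, 1),
    "end_datetime": FUTURE_END, "is_current": True,
}]
updates = [{
    "product_id": "00001", "price": 1000,
    "extraction_timestamp": datetime(2023, 1, 1, 1, 1, 1),
}]

merged = apply_updates(history, updates)
for row in merged:
    print(row["price"], row["end_datetime"], row["is_current"])
```

This only works when updates arrive in order. A late-arriving version (one that belongs between two existing versions) needs the validity windows rebuilt from the timestamps, which is what the complex variant does.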




In [30]:
scd2_simple(input_filepath, updates_filepath, output_directory, future_end_datetime, primary_key)

11.610728025436401


In [31]:
spark.table(f"{catalog_name}.{database_name}.{table_name}").show()

+----------+------------+-----+--------------------+----+-------------------+-------------------+----------+
|product_id|product_name|price|extraction_timestamp|  op|     start_datetime|       end_datetime|is_current|
+----------+------------+-----+--------------------+----+-------------------+-------------------+----------+
|     00004|     Blender|  100| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2023-01-01 01:01:01|     false|
|     00001|      Heater|  250| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2023-01-01 01:01:01|     false|
|     00003|  Television|  600| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2023-01-01 01:01:01|     false|
|     00002|  Thermostat|  400| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2023-01-01 01:01:01|     false|
|     00005| USB charger|   50| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2023-01-01 01:01:01|     false|
|     00001|      Heater| 1000| 2023-01-01 01:01:01|   U|2023-01-01 01:01:01|2050-01-01 00:00:00|      true|
|     00002|  Therm

In [None]:
# input_filepath = "s3://sb-test-bucket-ireland/data-engineering-use-cases/dummy-data/full_load.parquet"
# output_directory = f"{catalog_name}.{database_name}.{table_name}"
# future_end_datetime = "2050-01-01"
# late_updates_filepath="s3://sb-test-bucket-ireland/data-engineering-use-cases/dummy-data/late_updates.parquet"

In [32]:
spark.table(f"{catalog_name}.{database_name}.{table_name}").drop("end_datetime","is_current").writeTo(f"{catalog_name}.{database_name}.{table_name}").createOrReplace()




In [33]:
def scd2_complex(input_filepath, late_updates_filepath, output_directory, primary_key):
    start = time.time()
    late_updates = spark.read.parquet(late_updates_filepath)
    late_updates = late_updates.withColumn("start_datetime", F.col("extraction_timestamp"))
    # Drop the derived columns first, so that the late rows (which lack them)
    # match the table schema when they are appended.
    spark.table(output_directory).drop("end_datetime", "is_current").writeTo(output_directory).createOrReplace()
    late_updates.writeTo(output_directory).append()
    # Rebuild end_datetime per key: each version ends where the next one starts,
    # and the latest version ends at the sentinel date.
    query1 = f"""
    SELECT *,
    LEAD(extraction_timestamp,1,TO_TIMESTAMP('2050-01-01 00:00:00')) OVER(PARTITION BY {primary_key} ORDER BY extraction_timestamp) AS end_datetime

    FROM {output_directory}

    ORDER BY {primary_key}, extraction_timestamp
    """
    spark.sql(query1).writeTo(output_directory).createOrReplace()
    # Only the version that still ends at the sentinel date is current.
    query2 = f"""
    SELECT *,
    CASE WHEN end_datetime = '2050-01-01 00:00:00' THEN True ELSE False END AS is_current

    FROM {output_directory}

    ORDER BY {primary_key}, extraction_timestamp
    """
    spark.sql(query2).writeTo(output_directory).createOrReplace()
    print(time.time() - start)
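
The `LEAD(...) OVER (PARTITION BY ... ORDER BY ...)` step is what lets late-arriving versions slot into the history: each version's `end_datetime` becomes the next version's `extraction_timestamp`, and the last version gets the sentinel date. The following plain-Python sketch mirrors that logic outside Spark, as an illustration only. `rebuild_validity` and `FUTURE_END` are hypothetical names, and the rows are a reduced version of the Heater example, deliberately given out of order.

```python
from datetime import datetime
from itertools import groupby

# Sentinel end date, matching TO_TIMESTAMP('2050-01-01 00:00:00') in the query.
FUTURE_END = datetime(2050, 1, 1)


def rebuild_validity(rows, key="product_id"):
    """Recompute end_datetime and is_current per key from extraction_timestamp."""
    # Equivalent of PARTITION BY key ORDER BY extraction_timestamp.
    ordered = sorted(rows, key=lambda r: (r[key], r["extraction_timestamp"]))
    result = []
    for _, group in groupby(ordered, key=lambda r: r[key]):
        versions = list(group)
        for i, row in enumerate(versions):
            # Equivalent of LEAD(extraction_timestamp, 1, sentinel).
            if i + 1 < len(versions):
                end = versions[i + 1]["extraction_timestamp"]
            else:
                end = FUTURE_END
            result.append({**row, "end_datetime": end, "is_current": end == FUTURE_END})
    return result


# The 2022-06-01 version plays the role of the late update: it arrives last
# but belongs between the other two.
rows = [
    {"product_id": "00001", "price": 250, "extraction_timestamp": datetime(2022, 1, 1, 1, 1, 1)},
    {"product_id": "00001", "price": 1000, "extraction_timestamp": datetime(2023, 1, 1, 1, 1, 1)},
    {"product_id": "00001", "price": 500, "extraction_timestamp": datetime(2022, 6, 1, 1, 1, 1)},
]

rebuilt = rebuild_validity(rows)
for row in rebuilt:
    print(row["price"], row["end_datetime"], row["is_current"])
```

Because the windows are derived entirely from the timestamps, the result does not depend on the order in which versions were written, at the cost of rewriting the whole table on every run.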




In [34]:
scd2_complex(input_filepath, late_updates_filepath, output_directory, primary_key)

7.640847444534302


In [35]:
spark.table(f"{catalog_name}.{database_name}.{table_name}").show()

+----------+------------+-----+--------------------+----+-------------------+-------------------+----------+
|product_id|product_name|price|extraction_timestamp|  op|     start_datetime|       end_datetime|is_current|
+----------+------------+-----+--------------------+----+-------------------+-------------------+----------+
|     00001|      Heater|  250| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2022-06-01 01:01:01|     false|
|     00001|      Heater|  500| 2022-06-01 01:01:01|   U|2022-06-01 01:01:01|2023-01-01 01:01:01|     false|
|     00001|      Heater| 1000| 2023-01-01 01:01:01|   U|2023-01-01 01:01:01|2050-01-01 00:00:00|      true|
|     00002|  Thermostat|  400| 2022-01-01 01:01:01|None|2022-01-01 01:01:01|2022-06-01 01:01:01|     false|
|     00002|  Thermostat|  500| 2022-06-01 01:01:01|   U|2022-06-01 01:01:01|2023-01-01 01:01:01|     false|
|     00002|  Thermostat| 1000| 2023-01-01 01:01:01|   U|2023-01-01 01:01:01|2050-01-01 00:00:00|      true|
|     00003|  Telev