In [5]:
%session_id_prefix native-hudi-dataframe-
%glue_version 3.0
%idle_timeout 60
%%configure 
{
  "--conf": "spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.hive.convertMetastoreParquet=false",
  "--datalake-formats": "hudi"
}

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
Installed kernel version: 0.37.3 
Setting session ID prefix to native-hudi-dataframe-
Setting Glue version to: 3.0
Current idle_timeout is 2800 minutes.
idle_timeout has been set to 60 minutes.
The following configurations have been updated: {'--conf': 'spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.hive.convertMetastoreParquet=false', '--datalake-formats': 'hudi'}


In [1]:
import boto3
import json




Authenticating with environment variables and user-defined glue_role_arn: arn:aws:iam::684969100054:role/AdminAccessGlueNotebook
Trying to create a Glue session for the kernel.
Worker Type: G.1X
Number of Workers: 5
Session ID: 5d639154-71f0-4558-9337-eda1025a3a4a
Job Type: glueetl
Applying the following default arguments:
--glue_kernel_version 0.37.3
--enable-glue-datacatalog true
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.hive.convertMetastoreParquet=false
--datalake-formats hudi
Waiting for session 5d639154-71f0-4558-9337-eda1025a3a4a to get into ready status...
Session 5d639154-71f0-4558-9337-eda1025a3a4a has been created.



## Bulk Insert and add curation columns

In [2]:
from pyspark.sql import Row
import time
import datetime
from pyspark.sql.functions import col, to_timestamp, lit
import pyspark.sql.functions as f

END_DATETIME = "2250-01-01"

bulk_insert_start_time = time.time()


full_load = spark.read.parquet("s3://sb-test-bucket-ireland/dummy_data/full_load.parquet")





In [3]:
full_load = full_load.withColumn("start_datetime",f.col("extraction_timestamp"))
full_load = full_load.withColumn("end_datetime", f.to_timestamp(f.lit(END_DATETIME), 'yyyy-MM-dd'))
full_load = full_load.withColumn("is_current",f.lit(True))
full_load = full_load.drop("op").withColumn("op", f.lit(None).cast("string"))  # the dummy data stores op as an int; replace it with a null string column
full_load.show(truncate=False)

+----------+------------+-----+--------------------+-------------------+-------------------+----------+----+
|product_id|product_name|price|extraction_timestamp|start_datetime     |end_datetime       |is_current|op  |
+----------+------------+-----+--------------------+-------------------+-------------------+----------+----+
|00001     |Heater      |250  |2022-01-01 01:01:01 |2022-01-01 01:01:01|2250-01-01 00:00:00|true      |null|
|00002     |Thermostat  |400  |2022-01-01 01:01:01 |2022-01-01 01:01:01|2250-01-01 00:00:00|true      |null|
|00003     |Television  |600  |2022-01-01 01:01:01 |2022-01-01 01:01:01|2250-01-01 00:00:00|true      |null|
|00004     |Blender     |100  |2022-01-01 01:01:01 |2022-01-01 01:01:01|2250-01-01 00:00:00|true      |null|
|00005     |USB charger |50   |2022-01-01 01:01:01 |2022-01-01 01:01:01|2250-01-01 00:00:00|true      |null|
+----------+------------+-----+--------------------+-------------------+-------------------+----------+----+


In [4]:
full_load.printSchema()

root
 |-- product_id: string (nullable = true)
 |-- product_name: string (nullable = true)
 |-- price: long (nullable = true)
 |-- extraction_timestamp: timestamp (nullable = true)
 |-- start_datetime: timestamp (nullable = true)
 |-- end_datetime: timestamp (nullable = true)
 |-- is_current: boolean (nullable = false)
 |-- op: string (nullable = true)


In [5]:
bucket_name = "sb-test-bucket-ireland"
bucket_prefix = "tm/dummy"
database_name = "hudi_df"
table_name = "dummy_df"
table_prefix = f"{bucket_prefix}/{database_name}/{table_name}"
table_location = f"s3://{bucket_name}/{table_prefix}"

hudi_options = {
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.storage.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.recordkey.field': 'product_id',
    'hoodie.datasource.write.partitionpath.field': 'extraction_timestamp',
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'extraction_timestamp',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2,
    'path': table_location,
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': database_name,
    'hoodie.datasource.hive_sync.table': table_name,
    'hoodie.datasource.hive_sync.partition_fields': 'extraction_timestamp',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.hive_sync.mode': 'hms'
}
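The options above can also be assembled by a small helper, so that the same dictionary is reused with a different write operation. This is only a sketch: `build_hudi_options` and its parameter names are hypothetical and not part of the notebook, and the keys are copied from the cell above. `bulk_insert` is one of the values Hudi accepts for `hoodie.datasource.write.operation`; note that the notebook itself writes the initial load with `upsert`.

```python
def build_hudi_options(table_name, database_name, table_location,
                       record_key="product_id",
                       partition_field="extraction_timestamp",
                       precombine_field="extraction_timestamp",
                       operation="upsert"):
    """Hypothetical helper: assemble the write and Hive-sync options used above."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.storage.type": "COPY_ON_WRITE",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.table.name": table_name,
        "hoodie.datasource.write.operation": operation,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.hive_style_partitioning": "true",
        "hoodie.upsert.shuffle.parallelism": 2,
        "hoodie.insert.shuffle.parallelism": 2,
        "path": table_location,
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.database": database_name,
        "hoodie.datasource.hive_sync.table": table_name,
        "hoodie.datasource.hive_sync.partition_fields": partition_field,
        "hoodie.datasource.hive_sync.partition_extractor_class":
            "org.apache.hudi.hive.MultiPartKeysValueExtractor",
        "hoodie.datasource.hive_sync.use_jdbc": "false",
        "hoodie.datasource.hive_sync.mode": "hms",
    }


location = "s3://sb-test-bucket-ireland/tm/dummy/hudi_df/dummy_df"
upsert_options = build_hudi_options("dummy_df", "hudi_df", location)
bulk_insert_options = build_hudi_options("dummy_df", "hudi_df", location,
                                         operation="bulk_insert")

# Only the write operation differs between the two dictionaries.
changed = {k for k in upsert_options if upsert_options[k] != bulk_insert_options[k]}
print(changed)
```

Either dictionary can be passed to the writer with `.options(**upsert_options)`, exactly as `hudi_options` is used in the next cell.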




In [6]:
full_load.write.format("hudi")  \
    .options(**hudi_options)  \
    .mode("overwrite")  \
    .save()




In [7]:
bulk_insert_process_time = time.time() - bulk_insert_start_time




In [8]:
print(bulk_insert_process_time)

44.81893348693848
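The figure above is wall-clock time measured from `bulk_insert_start_time` in cell In [2], so it includes the Parquet read, the `show()` calls, and the Hudi write, because Spark evaluates transformations lazily. A minimal sketch of a reusable timer for such measurements follows; `timed` and `timings` are hypothetical names, and `time.sleep` stands in for the Spark work.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the elapsed wall-clock seconds of the with-block under results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the block raises, so failed runs are still timed.
        results[label] = time.perf_counter() - start


timings = {}
with timed("bulk_insert", timings):
    time.sleep(0.01)  # stand-in for the read, transform, and write steps

print(timings)
```

Wrapping only the `write ... save()` call in its own `with timed(...)` block would separate the write cost from the time spent on interactive inspection.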


## Slowly Changing Dimension Type 2 (SCD2)

To keep the test simple, the updates are generated by setting one column (price) to the same value for every record. Soft deletes are not covered, since they follow a very similar process from a performance perspective.

Steps:

- Read updates
- Join full load with updates on primary key
- Set end_datetime to the extraction_timestamp of the updated records
- Close the existing records
- Add curation columns to updates
- Append updated data to existing data
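
The steps above can be modelled without Spark on plain lists of dictionaries, which makes the logic easy to check in isolation. This is a simplified sketch, not the notebook's implementation: `apply_scd2` is a hypothetical name, the single sample row is taken from the dummy data, and the step numbers in the comments refer to the list above.

```python
from datetime import datetime

END_DATETIME = datetime(2250, 1, 1)  # same open-ended sentinel as the notebook


def apply_scd2(current_rows, update_rows, key="product_id"):
    """Return the rows to write: new current versions plus closed old versions."""
    updates_by_key = {u[key]: u for u in update_rows}

    # Steps 2-4: join on the key, then close every current row that has an update.
    closed = []
    for row in current_rows:
        update = updates_by_key.get(row[key])
        if update is not None and row["is_current"]:
            closed.append({**row,
                           "end_datetime": update["extraction_timestamp"],
                           "is_current": False})

    # Step 5: add the curation columns to the incoming updates.
    opened = [{**u,
               "start_datetime": u["extraction_timestamp"],
               "end_datetime": END_DATETIME,
               "is_current": True}
              for u in update_rows]

    # Step 6: append the closed versions to the new ones.
    return opened + closed


full_load = [{"product_id": "00001", "product_name": "Heater", "price": 250,
              "extraction_timestamp": datetime(2022, 1, 1, 1, 1, 1), "op": None,
              "start_datetime": datetime(2022, 1, 1, 1, 1, 1),
              "end_datetime": END_DATETIME, "is_current": True}]
updates = [{"product_id": "00001", "product_name": "Heater", "price": 1000,
            "extraction_timestamp": datetime(2023, 1, 1), "op": "U"}]

output = apply_scd2(full_load, updates)
for row in output:
    print(row["product_id"], row["price"], row["end_datetime"], row["is_current"])
```

The closed row keeps its original `extraction_timestamp`, so in the Hudi table it stays in its original partition and the upsert overwrites it in place, while the new version lands in a new partition.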

In [10]:
scd2_start_time = time.time()




In [11]:
# Read updates
updates = spark.read.parquet("s3://sb-test-bucket-ireland/dummy_data/updates/updates.parquet")
updates.show(truncate=False)

+----------+------------+-----+--------------------+---+
|product_id|product_name|price|extraction_timestamp|op |
+----------+------------+-----+--------------------+---+
|00001     |Heater      |1000 |2023-01-01 00:00:00 |U  |
|00002     |Thermostat  |1000 |2023-01-01 00:00:00 |U  |
|00003     |Television  |1000 |2023-01-01 00:00:00 |U  |
|00004     |Blender     |1000 |2023-01-01 00:00:00 |U  |
|00005     |USB charger |1000 |2023-01-01 00:00:00 |U  |
+----------+------------+-----+--------------------+---+


In [12]:
# Join full load with updates on primary key
join_cond = [full_load.product_id == updates.product_id,
             full_load.is_current == True]

# Find product records to update
rows_to_update_df = full_load.join(updates, join_cond) \
                             .select(full_load.product_id,
                                     full_load.product_name,
                                     full_load.price,
                                     full_load.extraction_timestamp,
                                     full_load.op,
                                     full_load.start_datetime,
                                     # Set end_datetime to the extraction_timestamp of the updated records
                                     updates.extraction_timestamp.alias('end_datetime'))

# Close the existing records
rows_to_update_df = rows_to_update_df.withColumn('is_current', f.lit(False))
rows_to_update_df.show(truncate=False)

+----------+------------+-----+--------------------+----+-------------------+-------------------+----------+
|product_id|product_name|price|extraction_timestamp|op  |start_datetime     |end_datetime       |is_current|
+----------+------------+-----+--------------------+----+-------------------+-------------------+----------+
|00001     |Heater      |250  |2022-01-01 01:01:01 |null|2022-01-01 01:01:01|2023-01-01 00:00:00|false     |
|00002     |Thermostat  |400  |2022-01-01 01:01:01 |null|2022-01-01 01:01:01|2023-01-01 00:00:00|false     |
|00003     |Television  |600  |2022-01-01 01:01:01 |null|2022-01-01 01:01:01|2023-01-01 00:00:00|false     |
|00004     |Blender     |100  |2022-01-01 01:01:01 |null|2022-01-01 01:01:01|2023-01-01 00:00:00|false     |
|00005     |USB charger |50   |2022-01-01 01:01:01 |null|2022-01-01 01:01:01|2023-01-01 00:00:00|false     |
+----------+------------+-----+--------------------+----+-------------------+-------------------+----------+


In [13]:
# Add curation columns to updates
updates = updates.withColumn("start_datetime",f.col("extraction_timestamp"))\
                 .withColumn("end_datetime",f.to_timestamp(f.lit(END_DATETIME)))\
                 .withColumn("is_current", f.lit(True))
updates.show(truncate=False)


+----------+------------+-----+--------------------+---+-------------------+-------------------+----------+
|product_id|product_name|price|extraction_timestamp|op |start_datetime     |end_datetime       |is_current|
+----------+------------+-----+--------------------+---+-------------------+-------------------+----------+
|00001     |Heater      |1000 |2023-01-01 00:00:00 |U  |2023-01-01 00:00:00|2250-01-01 00:00:00|true      |
|00002     |Thermostat  |1000 |2023-01-01 00:00:00 |U  |2023-01-01 00:00:00|2250-01-01 00:00:00|true      |
|00003     |Television  |1000 |2023-01-01 00:00:00 |U  |2023-01-01 00:00:00|2250-01-01 00:00:00|true      |
|00004     |Blender     |1000 |2023-01-01 00:00:00 |U  |2023-01-01 00:00:00|2250-01-01 00:00:00|true      |
|00005     |USB charger |1000 |2023-01-01 00:00:00 |U  |2023-01-01 00:00:00|2250-01-01 00:00:00|true      |
+----------+------------+-----+--------------------+---+-------------------+-------------------+----------+


In [14]:
# Append the closed records to the updates; unionByName matches columns by name rather than by position
output = updates.unionByName(rows_to_update_df)
output.show(truncate=False)

+----------+------------+-----+--------------------+----+-------------------+-------------------+----------+
|product_id|product_name|price|extraction_timestamp|op  |start_datetime     |end_datetime       |is_current|
+----------+------------+-----+--------------------+----+-------------------+-------------------+----------+
|00001     |Heater      |1000 |2023-01-01 00:00:00 |U   |2023-01-01 00:00:00|2250-01-01 00:00:00|true      |
|00002     |Thermostat  |1000 |2023-01-01 00:00:00 |U   |2023-01-01 00:00:00|2250-01-01 00:00:00|true      |
|00003     |Television  |1000 |2023-01-01 00:00:00 |U   |2023-01-01 00:00:00|2250-01-01 00:00:00|true      |
|00004     |Blender     |1000 |2023-01-01 00:00:00 |U   |2023-01-01 00:00:00|2250-01-01 00:00:00|true      |
|00005     |USB charger |1000 |2023-01-01 00:00:00 |U   |2023-01-01 00:00:00|2250-01-01 00:00:00|true      |
|00001     |Heater      |250  |2022-01-01 01:01:01 |null|2022-01-01 01:01:01|2023-01-01 00:00:00|false     |
|00002     |Thermos

In [16]:

# The Hudi options defined for the bulk insert are reused unchanged for this write
print(json.dumps(hudi_options, indent=4))

{
    "hoodie.table.name": "dummy_df",
    "hoodie.datasource.write.storage.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.recordkey.field": "product_id",
    "hoodie.datasource.write.partitionpath.field": "extraction_timestamp",
    "hoodie.datasource.write.table.name": "dummy_df",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.precombine.field": "extraction_timestamp",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.upsert.shuffle.parallelism": 2,
    "hoodie.insert.shuffle.parallelism": 2,
    "path": "s3://sb-test-bucket-ireland/tm/dummy/hudi_df/dummy_df",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.database": "hudi_df",
    "hoodie.datasource.hive_sync.table": "dummy_df",
    "hoodie.datasource.hive_sync.partition_fields": "extraction_timestamp",
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    "hoodie.datasour

In [17]:
output.write.format("hudi") \
    .options(**hudi_options) \
    .mode("append") \
    .save()
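
After the append, a quick sanity check is that every `product_id` has exactly one `is_current` row and that each closed version ends where the next one starts. The sketch below works on rows collected as dictionaries, for example via `[r.asDict() for r in spark.read.format("hudi").load(table_location).collect()]`; `check_scd2` is a hypothetical helper, and the sample rows are trimmed to the columns it inspects.

```python
from collections import defaultdict
from datetime import datetime


def check_scd2(rows, key="product_id"):
    """Return a list of problems found; an empty list means the history is consistent."""
    by_key = defaultdict(list)
    for row in rows:
        by_key[row[key]].append(row)

    problems = []
    for k, versions in by_key.items():
        current = [v for v in versions if v["is_current"]]
        if len(current) != 1:
            problems.append(f"{k}: expected 1 current row, found {len(current)}")
        # Consecutive versions should meet exactly: no gap and no overlap.
        ordered = sorted(versions, key=lambda v: v["start_datetime"])
        for older, newer in zip(ordered, ordered[1:]):
            if older["end_datetime"] != newer["start_datetime"]:
                problems.append(f"{k}: gap or overlap after {older['start_datetime']}")
    return problems


rows = [
    {"product_id": "00001", "start_datetime": datetime(2022, 1, 1, 1, 1, 1),
     "end_datetime": datetime(2023, 1, 1), "is_current": False},
    {"product_id": "00001", "start_datetime": datetime(2023, 1, 1),
     "end_datetime": datetime(2250, 1, 1), "is_current": True},
]
print(check_scd2(rows))
```

For large tables the same two checks would be better expressed as Spark aggregations instead of collecting rows to the driver.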




In [18]:
scd2_process_time = time.time() - scd2_start_time
print(scd2_process_time)

32.12798595428467


## Stop Session

In [8]:
%stop_session

Stopping session: 34ac56c1-8255-47d7-b7e5-5cecec0ac640
Stopped session.
