# Data Engineering Use Cases

This notebook walks through several data engineering use cases implemented with pandas. The idea is to replicate the same use cases in other frameworks, so that we can compare code complexity across frameworks, as well as performance as data volumes increase.

You can run this notebook against either a standard Python kernel locally or a PySpark kernel with a Glue interactive session. You can read and write either locally or to S3 by updating the relevant file paths.
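
As a rough illustration of switching between local and S3 storage, the sketch below builds the notebook's file paths from a pair of base directories. The helper `build_paths`, the local directory names and the bucket `my-bucket` are hypothetical and only serve the example. Reading `s3://` paths with pandas normally also requires the `s3fs` package to be installed.

```python
def build_paths(input_dir: str, output_dir: str) -> dict:
    """Build the file paths used in this notebook from two base directories.

    Hypothetical helper: the notebook itself sets these variables by hand.
    """
    # Normalise so that each base directory ends with exactly one slash.
    input_dir = input_dir.rstrip("/") + "/"
    output_dir = output_dir.rstrip("/") + "/"
    return {
        "full_load": f"{input_dir}full_load/full_load.parquet",
        "updates": f"{input_dir}updates/updates.parquet",
        "late_updates": f"{input_dir}late_updates/late_updates.parquet",
        "output": output_dir,
    }


# Local run (directory names are assumptions).
local_paths = build_paths("dummy-data", "output")

# S3 run (bucket name is an assumption).
s3_paths = build_paths("s3://my-bucket/dummy-data/", "s3://my-bucket/output/")
```

Only the base directories differ between the two runs, so the rest of the notebook can stay unchanged.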

## AWS Credentials

This section is optional and only needed when using the Jupyter extension for VS Code.

In [1]:
%load_ext dotenv
%dotenv

## Setting up Glue interactive session

This section is optional. Skip it when using a Python kernel.

In [13]:
%iam_role arn:aws:iam::684969100054:role/AdminAccessGlueNotebook
%region eu-west-1
%session_id_prefix pandas-
%glue_version 3.0
%idle_timeout 60
%worker_type G.1X
%number_of_workers 2

Current iam_role is arn:aws:iam::684969100054:role/aws-reserved/sso.amazonaws.com/eu-west-2/AWSReservedSSO_AdministratorAccess_ab408ccf26c25b37
iam_role has been set to arn:aws:iam::684969100054:role/AdminAccessGlueNotebook.
Previous region: eu-west-1
Setting new region to: eu-west-1
Reauthenticating Glue client with new region: eu-west-1
IAM role has been set to arn:aws:iam::684969100054:role/AdminAccessGlueNotebook. Reauthenticating.
Authenticating with environment variables and user-defined glue_role_arn: arn:aws:iam::684969100054:role/AdminAccessGlueNotebook
Authentication done.
Region is set to: eu-west-1
Setting session ID prefix to native-hudi-dataframe-
Setting Glue version to: 3.0
Current idle_timeout is 2880 minutes.
idle_timeout has been set to 60 minutes.
Previous worker type: G.1X
Setting new worker type to: G.1X
Previous number of workers: 5
Setting new number of workers to: 2


The following exception was encountered while parsing the configurations provided: invalid syntax (<unknown>, line 7) 
Traceback (most recent call last):
  File "/Users/soumaya.mauthoor/Documents/GitHub/hudi-vs-iceberg/venv/lib/python3.9/site-packages/aws_glue_interactive_sessions_kernel/glue_pyspark/GlueKernel.py", line 444, in configure
    configs = ast.literal_eval(configs_json)
  File "/Users/soumaya.mauthoor/.pyenv/versions/3.9.10/lib/python3.9/ast.py", line 62, in literal_eval
    node_or_string = parse(node_or_string, mode='eval')
  File "/Users/soumaya.mauthoor/.pyenv/versions/3.9.10/lib/python3.9/ast.py", line 50, in parse
    return compile(source, filename, mode, flags,
  File "<unknown>", line 7
    from awsglue.transforms import *
    ^
SyntaxError: invalid syntax


In [None]:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
# args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
# job = Job(glueContext)
# job.init(args['JOB_NAME'], args)
# job.commit()

To import a Python script into the Glue session, first upload it to S3, then add it to the Spark context with `sc.addPyFile`.

In [1]:
! aws s3 cp pandas_functions.py s3://sb-test-bucket-ireland/data-engineering-use-cases/

upload: ./pandas_functions.py to s3://sb-test-bucket-ireland/data-engineering-use-cases/pandas_functions.py


In [1]:
sc.addPyFile(
    "s3://sb-test-bucket-ireland/data-engineering-use-cases/pandas_functions.py"
)

Trying to create a Glue session for the kernel.
Worker Type: G.1X
Number of Workers: 2
Session ID: 3ff9dea8-3101-454d-a4bd-9c4ecc20f49f
Job Type: glueetl
Applying the following default arguments:
--glue_kernel_version 0.37.4
--enable-glue-datacatalog true
Waiting for session 3ff9dea8-3101-454d-a4bd-9c4ecc20f49f to get into ready status...
Session 3ff9dea8-3101-454d-a4bd-9c4ecc20f49f has been created.



## Import python libraries and set variables

This section is required. Update the file paths to match your environment.

In [3]:
import pandas as pd
import time, datetime
from pandas_functions import bulk_insert, scd2_simple, scd2_complex

future_end_datetime = datetime.datetime(2250, 1, 1)
primary_key = "product_id"
input_data_directory = (
    "s3://sb-test-bucket-ireland/data-engineering-use-cases/dummy-data/"
)
full_load_filepath = f"{input_data_directory}full_load/full_load.parquet"
updates_filepath = f"{input_data_directory}updates/updates.parquet"
late_updates_filepath = f"{input_data_directory}late_updates/late_updates.parquet"
output_data_directory = (
    "s3://sb-test-bucket-ireland/soumaya/de-usecases/pandas/pandas-python/"
)

## Bulk Insert

This use case is a simple process that adds three columns to the full load data and saves the result to a parquet file:

1. Set `start_datetime` to `extraction_timestamp`
2. Set `end_datetime` to a distant future timestamp
3. Set `is_current` to `True`
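
The three steps above can be sketched in pandas as follows. This is a minimal illustration on an in-memory DataFrame, not the actual `bulk_insert` implementation from `pandas_functions.py`, which is not shown here and also handles reading and writing parquet. The function name `add_scd2_columns` is hypothetical.

```python
import datetime

import pandas as pd


def add_scd2_columns(
    df: pd.DataFrame, future_end_datetime: datetime.datetime
) -> pd.DataFrame:
    """Return a copy of df with the three SCD2 tracking columns added."""
    out = df.copy()
    # 1. The record becomes valid at its extraction time.
    out["start_datetime"] = out["extraction_timestamp"]
    # 2. Open-ended records get a distant future end date.
    out["end_datetime"] = future_end_datetime
    # 3. Every record in the initial load is the current version.
    out["is_current"] = True
    return out


# Two rows shaped like the full load data shown in this notebook.
full_load = pd.DataFrame(
    {
        "product_id": [1, 2],
        "product_name": ["Heater", "Thermostat"],
        "price": [250, 400],
        "extraction_timestamp": pd.to_datetime(["2022-01-01 01:01:01"] * 2),
        "op": [None, None],
    }
)
result = add_scd2_columns(full_load, datetime.datetime(2250, 1, 1))
```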

In [4]:
pd.read_parquet(full_load_filepath).head()

Unnamed: 0,product_id,product_name,price,extraction_timestamp,op
0,1,Heater,250,2022-01-01 01:01:01,
1,2,Thermostat,400,2022-01-01 01:01:01,
2,3,Television,600,2022-01-01 01:01:01,
3,4,Blender,100,2022-01-01 01:01:01,
4,5,USB charger,50,2022-01-01 01:01:01,


In [5]:
bulk_insert_filepath = bulk_insert(
    full_load_filepath, output_data_directory, future_end_datetime
)
pd.read_parquet(bulk_insert_filepath).head()

Output saved to s3://sb-test-bucket-ireland/soumaya/de-usecases/pandas/pandas-python/bulk_insert.parquet in 0.3274099826812744


Unnamed: 0,product_id,product_name,price,extraction_timestamp,op,start_datetime,end_datetime,is_current
0,1,Heater,250,2022-01-01 01:01:01,,2022-01-01 01:01:01,2250-01-01,True
1,2,Thermostat,400,2022-01-01 01:01:01,,2022-01-01 01:01:01,2250-01-01,True
2,3,Television,600,2022-01-01 01:01:01,,2022-01-01 01:01:01,2250-01-01,True
3,4,Blender,100,2022-01-01 01:01:01,,2022-01-01 01:01:01,2250-01-01,True
4,5,USB charger,50,2022-01-01 01:01:01,,2022-01-01 01:01:01,2250-01-01,True


## Slowly Changing Dimension Type 2 - Simple

This use case is a simplified SCD2 process that closes each updated record by setting its `end_datetime` to the `extraction_timestamp` of the corresponding update. For simplicity, it does not handle deletes, multiple updates to the same primary key, or late-arriving records.

It does this by joining the updates with the full load on the primary key, and then unioning the resulting closed records with the updates.
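
The join-then-union approach could look roughly like the sketch below. It works on in-memory DataFrames and is an assumption about the general technique, not the actual `scd2_simple` code from `pandas_functions.py`. The function name `scd2_simple_sketch` and the temporary column `update_timestamp` are hypothetical.

```python
import pandas as pd


def scd2_simple_sketch(existing, updates, future_end_datetime, primary_key):
    """Close updated records and append the updates as new current records."""
    # Join: look up the update timestamp for each existing record.
    # Assumes at most one update per primary key, as stated for this use case.
    update_times = updates[[primary_key, "extraction_timestamp"]].rename(
        columns={"extraction_timestamp": "update_timestamp"}
    )
    closed = existing.merge(update_times, on=primary_key, how="left")
    has_update = closed["update_timestamp"].notna()
    closed.loc[has_update, "end_datetime"] = closed.loc[has_update, "update_timestamp"]
    closed.loc[has_update, "is_current"] = False
    closed = closed.drop(columns="update_timestamp")

    # The updates become the new current records.
    new_rows = updates.copy()
    new_rows["start_datetime"] = new_rows["extraction_timestamp"]
    new_rows["end_datetime"] = future_end_datetime
    new_rows["is_current"] = True

    # Union: stack the (partly closed) existing records and the new records.
    return pd.concat([closed, new_rows], ignore_index=True)


far_future = pd.Timestamp("2250-01-01 00:00:00")
existing = pd.DataFrame(
    {
        "product_id": [1, 2],
        "price": [250, 400],
        "extraction_timestamp": pd.to_datetime(["2022-01-01 01:01:01"] * 2),
        "start_datetime": pd.to_datetime(["2022-01-01 01:01:01"] * 2),
        "end_datetime": pd.to_datetime(["2250-01-01 00:00:00"] * 2),
        "is_current": [True, True],
    }
)
updates = pd.DataFrame(
    {
        "product_id": [1],
        "price": [1000],
        "extraction_timestamp": pd.to_datetime(["2023-01-01 01:01:01"]),
    }
)
result = scd2_simple_sketch(existing, updates, far_future, "product_id")
```

In this small example only product 1 has an update, so its original record is closed while product 2 stays current.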

In [6]:
pd.read_parquet(updates_filepath).head()

Unnamed: 0,product_id,product_name,price,extraction_timestamp,op
0,1,Heater,1000,2023-01-01 01:01:01,U
1,2,Thermostat,1000,2023-01-01 01:01:01,U
2,3,Television,1000,2023-01-01 01:01:01,U
3,4,Blender,1000,2023-01-01 01:01:01,U
4,5,USB charger,1000,2023-01-01 01:01:01,U


In [7]:
scd2_simple_filepath = scd2_simple(
    bulk_insert_filepath,
    updates_filepath,
    output_data_directory,
    future_end_datetime,
    primary_key,
)
pd.read_parquet(scd2_simple_filepath).head(10)

Output saved to s3://sb-test-bucket-ireland/soumaya/de-usecases/pandas/pandas-python/scd2_simple.parquet in 0.6974961757659912


Unnamed: 0,product_id,product_name,price,extraction_timestamp,op,start_datetime,end_datetime,is_current
0,1,Heater,250,2022-01-01 01:01:01,,2022-01-01 01:01:01,2023-01-01 01:01:01,False
1,2,Thermostat,400,2022-01-01 01:01:01,,2022-01-01 01:01:01,2023-01-01 01:01:01,False
2,3,Television,600,2022-01-01 01:01:01,,2022-01-01 01:01:01,2023-01-01 01:01:01,False
3,4,Blender,100,2022-01-01 01:01:01,,2022-01-01 01:01:01,2023-01-01 01:01:01,False
4,5,USB charger,50,2022-01-01 01:01:01,,2022-01-01 01:01:01,2023-01-01 01:01:01,False
5,1,Heater,1000,2023-01-01 01:01:01,U,2023-01-01 01:01:01,2250-01-01 00:00:00,True
6,2,Thermostat,1000,2023-01-01 01:01:01,U,2023-01-01 01:01:01,2250-01-01 00:00:00,True
7,3,Television,1000,2023-01-01 01:01:01,U,2023-01-01 01:01:01,2250-01-01 00:00:00,True
8,4,Blender,1000,2023-01-01 01:01:01,U,2023-01-01 01:01:01,2250-01-01 00:00:00,True
9,5,USB charger,1000,2023-01-01 01:01:01,U,2023-01-01 01:01:01,2250-01-01 00:00:00,True


## Dedupes

In [8]:
# TODO

## Impute deleted records

In [9]:
# TODO

## Slowly Changing Dimension Type 2 - Complex

This use case is a more complex SCD2 process which takes into account:

- Late-arriving records, where an update is processed with an `extraction_timestamp` earlier than that of the last processed record
- Multiple updates to the same primary key

It does this by unioning the updates with the existing data, windowing by the primary key and setting the `end_datetime` to the next record's `extraction_timestamp` within the window. The process is simplified by assuming that all records need to be updated.
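
One way to express this windowing in pandas is a `groupby` on the primary key combined with `shift(-1)`, as in the sketch below. This is an illustration under that assumption, not the actual `scd2_complex` code from `pandas_functions.py`. The function name `scd2_complex_sketch` is hypothetical.

```python
import pandas as pd


def scd2_complex_sketch(existing, updates, future_end_datetime, primary_key):
    """Recompute the SCD2 columns for every record after adding the updates."""
    derived = ["start_datetime", "end_datetime", "is_current"]
    # All records are recomputed, so drop the old derived columns first.
    base = existing.drop(columns=derived, errors="ignore")
    combined = pd.concat([base, updates], ignore_index=True)
    # Order each key's history by extraction time.
    combined = combined.sort_values(
        [primary_key, "extraction_timestamp"], ignore_index=True
    )

    # Window by primary key: the next record's timestamp closes this record.
    next_ts = combined.groupby(primary_key)["extraction_timestamp"].shift(-1)
    combined["start_datetime"] = combined["extraction_timestamp"]
    combined["end_datetime"] = next_ts.fillna(future_end_datetime)
    # A record with no successor is the current version.
    combined["is_current"] = next_ts.isna()
    return combined


existing = pd.DataFrame(
    {
        "product_id": [1, 1],
        "price": [250, 1000],
        "extraction_timestamp": pd.to_datetime(
            ["2022-01-01 01:01:01", "2023-01-01 01:01:01"]
        ),
        "start_datetime": pd.to_datetime(
            ["2022-01-01 01:01:01", "2023-01-01 01:01:01"]
        ),
        "end_datetime": pd.to_datetime(
            ["2023-01-01 01:01:01", "2250-01-01 00:00:00"]
        ),
        "is_current": [False, True],
    }
)
# A late-arriving update that falls between the two existing records.
late_updates = pd.DataFrame(
    {
        "product_id": [1],
        "price": [500],
        "extraction_timestamp": pd.to_datetime(["2022-06-01 01:01:01"]),
    }
)
result = scd2_complex_sketch(
    existing, late_updates, pd.Timestamp("2250-01-01 00:00:00"), "product_id"
)
```

Because every record's `end_datetime` is derived from its successor, a late-arriving record slots into the history without special handling.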

In [10]:
pd.read_parquet(late_updates_filepath).head(10)

Unnamed: 0,product_id,product_name,price,extraction_timestamp,op
0,1,Heater,500,2022-06-01 01:01:01,U
1,2,Thermostat,500,2022-06-01 01:01:01,U
2,3,Television,500,2022-06-01 01:01:01,U
3,4,Blender,500,2022-06-01 01:01:01,U
4,5,USB charger,500,2022-06-01 01:01:01,U


In [11]:
scd2_complex_filepath = scd2_complex(
    scd2_simple_filepath,
    late_updates_filepath,
    output_data_directory,
    future_end_datetime,
    primary_key,
)
pd.read_parquet(scd2_complex_filepath).head(20).sort_values(
    by=[primary_key, "extraction_timestamp"]
)

Output saved to s3://sb-test-bucket-ireland/soumaya/de-usecases/pandas/pandas-python/scd2_complex.parquet in 0.5793406963348389


Unnamed: 0,product_id,product_name,price,extraction_timestamp,op,start_datetime,end_datetime,is_current
0,1,Heater,250,2022-01-01 01:01:01,,2022-01-01 01:01:01,2022-06-01 01:01:01,False
1,1,Heater,500,2022-06-01 01:01:01,U,2022-06-01 01:01:01,2023-01-01 01:01:01,False
2,1,Heater,1000,2023-01-01 01:01:01,U,2023-01-01 01:01:01,2250-01-01 00:00:00,True
3,2,Thermostat,400,2022-01-01 01:01:01,,2022-01-01 01:01:01,2022-06-01 01:01:01,False
4,2,Thermostat,500,2022-06-01 01:01:01,U,2022-06-01 01:01:01,2023-01-01 01:01:01,False
5,2,Thermostat,1000,2023-01-01 01:01:01,U,2023-01-01 01:01:01,2250-01-01 00:00:00,True
6,3,Television,600,2022-01-01 01:01:01,,2022-01-01 01:01:01,2022-06-01 01:01:01,False
7,3,Television,500,2022-06-01 01:01:01,U,2022-06-01 01:01:01,2023-01-01 01:01:01,False
8,3,Television,1000,2023-01-01 01:01:01,U,2023-01-01 01:01:01,2250-01-01 00:00:00,True
9,4,Blender,100,2022-01-01 01:01:01,,2022-01-01 01:01:01,2022-06-01 01:01:01,False
