# Data Engineering Use Cases

This notebook walks through several data engineering use cases implemented with pandas. You can read from and write to either the local filesystem or S3 by updating the relevant file paths. The aim is to replicate the same use cases in other frameworks, so that we can compare code complexity across frameworks and compare performance as data volumes increase. You can run this notebook against either a standard Python kernel locally or a PySpark kernel with a Glue interactive session.

## Setting up Glue interactive session

This section is optional; skip it when using a Python kernel.

In [13]:
%load_ext dotenv
%dotenv
%iam_role arn:aws:iam::684969100054:role/AdminAccessGlueNotebook
%region eu-west-1
%session_id_prefix native-hudi-dataframe-
%glue_version 3.0
%idle_timeout 60
%worker_type G.1X
%number_of_workers 2
%%configure 
{
  "--conf": "spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.hive.convertMetastoreParquet=false",
  "--datalake-formats": "hudi"
}


from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
  
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

Current iam_role is arn:aws:iam::684969100054:role/aws-reserved/sso.amazonaws.com/eu-west-2/AWSReservedSSO_AdministratorAccess_ab408ccf26c25b37
iam_role has been set to arn:aws:iam::684969100054:role/AdminAccessGlueNotebook.
Previous region: eu-west-1
Setting new region to: eu-west-1
Reauthenticating Glue client with new region: eu-west-1
IAM role has been set to arn:aws:iam::684969100054:role/AdminAccessGlueNotebook. Reauthenticating.
Authenticating with environment variables and user-defined glue_role_arn: arn:aws:iam::684969100054:role/AdminAccessGlueNotebook
Authentication done.
Region is set to: eu-west-1
Setting session ID prefix to native-hudi-dataframe-
Setting Glue version to: 3.0
Current idle_timeout is 2880 minutes.
idle_timeout has been set to 60 minutes.
Previous worker type: G.1X
Setting new worker type to: G.1X
Previous number of workers: 5
Setting new number of workers to: 2


The following exception was encountered while parsing the configurations provided: invalid syntax (<unknown>, line 7) 
Traceback (most recent call last):
  File "/Users/soumaya.mauthoor/Documents/GitHub/hudi-vs-iceberg/venv/lib/python3.9/site-packages/aws_glue_interactive_sessions_kernel/glue_pyspark/GlueKernel.py", line 444, in configure
    configs = ast.literal_eval(configs_json)
  File "/Users/soumaya.mauthoor/.pyenv/versions/3.9.10/lib/python3.9/ast.py", line 62, in literal_eval
    node_or_string = parse(node_or_string, mode='eval')
  File "/Users/soumaya.mauthoor/.pyenv/versions/3.9.10/lib/python3.9/ast.py", line 50, in parse
    return compile(source, filename, mode, flags,
  File "<unknown>", line 7
    from awsglue.transforms import *
    ^
SyntaxError: invalid syntax

The `SyntaxError` above occurs because `%%configure` passes the rest of the cell, including the Python imports, to `ast.literal_eval`. Running the `%%configure` block in its own cell, separate from the Python code, avoids it.


To import a Python script in the Glue session, first upload it to S3, then register it with `sc.addPyFile`.

In [1]:
! aws s3 cp pandas_functions.py s3://sb-test-bucket-ireland/data-engineering-use-cases/

upload: ./pandas_functions.py to s3://sb-test-bucket-ireland/data-engineering-use-cases/pandas_functions.py


In [1]:
sc.addPyFile("s3://sb-test-bucket-ireland/data-engineering-use-cases/pandas_functions.py")

Trying to create a Glue session for the kernel.
Worker Type: G.1X
Number of Workers: 2
Session ID: 3ff9dea8-3101-454d-a4bd-9c4ecc20f49f
Job Type: glueetl
Applying the following default arguments:
--glue_kernel_version 0.37.4
--enable-glue-datacatalog true
Waiting for session 3ff9dea8-3101-454d-a4bd-9c4ecc20f49f to get into ready status...
Session 3ff9dea8-3101-454d-a4bd-9c4ecc20f49f has been created.



## Import Python libraries and set variables

This section is required. Update the file paths below to match your environment.

In [9]:
import pandas as pd
import time, datetime
from pandas_functions import bulk_insert, scd2_simple, scd2_complex

future_end_datetime = datetime.datetime(2250, 1, 1)
input_data_directory = "s3://sb-test-bucket-ireland/data-engineering-use-cases/dummy-data/"
full_load_filepath = f'{input_data_directory}full_load.parquet'
updates_filepath = f'{input_data_directory}updates.parquet'
late_updates_filepath = f'{input_data_directory}late_updates.parquet'
output_data_directory = "s3://sb-test-bucket-ireland/soumaya/de-usecases/pandas/pandas-pyspark/"




## Bulk Insert



This is a simple process that adds three SCD2 columns to the full load data and saves the result to a Parquet file.

1. Set `start_datetime` to `extraction_timestamp`
2. Set `end_datetime` to a distant future timestamp (`future_end_datetime`)
3. Set `is_current` to `True`
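
The three steps above can be sketched in memory as follows. This is a minimal illustration, not the actual `bulk_insert` from `pandas_functions.py`, whose implementation is not shown in this notebook. The helper name `add_scd2_columns` and the two-row sample frame are assumptions made for the example, and reading from and writing to Parquet is left out.

```python
import datetime

import pandas as pd


def add_scd2_columns(df, future_end_datetime):
    """Return a copy of df with the three SCD2 columns added (hypothetical helper)."""
    out = df.copy()
    # Step 1: the record becomes valid at its extraction time
    out["start_datetime"] = out["extraction_timestamp"]
    # Step 2: open-ended records get a distant future end timestamp
    out["end_datetime"] = pd.Timestamp(future_end_datetime)
    # Step 3: every record in a full load is the current version
    out["is_current"] = True
    return out


# Small sample shaped like the full load data shown in this notebook
full_load = pd.DataFrame(
    {
        "product_id": ["00001", "00002"],
        "product_name": ["Heater", "Thermostat"],
        "price": [250, 400],
        "extraction_timestamp": pd.to_datetime(["2022-01-01 01:01:01"] * 2),
        "op": [None, None],
    }
)

bulk = add_scd2_columns(full_load, datetime.datetime(2250, 1, 1))
print(bulk[["product_id", "start_datetime", "end_datetime", "is_current"]])
```

Working on a copy keeps the input frame unchanged, which makes the step easier to rerun while experimenting.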

In [10]:
pd.read_parquet(full_load_filepath).head()

  product_id product_name  price extraction_timestamp    op
0      00001       Heater    250  2022-01-01 01:01:01  None
1      00002   Thermostat    400  2022-01-01 01:01:01  None
2      00003   Television    600  2022-01-01 01:01:01  None
3      00004      Blender    100  2022-01-01 01:01:01  None
4      00005  USB charger     50  2022-01-01 01:01:01  None


In [11]:
bulk_insert_filepath = bulk_insert(full_load_filepath, output_data_directory, future_end_datetime)
pd.read_parquet(bulk_insert_filepath).head()


Output saved to s3://sb-test-bucket-ireland/soumaya/de-usecases/pandas/pandas-pyspark/bulk_insert.parquet in 0.13838529586791992
  product_id product_name  price  ...      start_datetime end_datetime is_current
0      00001       Heater    250  ... 2022-01-01 01:01:01   2250-01-01       True
1      00002   Thermostat    400  ... 2022-01-01 01:01:01   2250-01-01       True
2      00003   Television    600  ... 2022-01-01 01:01:01   2250-01-01       True
3      00004      Blender    100  ... 2022-01-01 01:01:01   2250-01-01       True
4      00005  USB charger     50  ... 2022-01-01 01:01:01   2250-01-01       True

[5 rows x 8 columns]


## Slowly Changing Dimension Type 2 - Simple

This is a simplified SCD2 process that does not take deletes into account.

1. Join the full load with the updates on the primary key
2. Set `end_datetime` of the matched existing records to the `extraction_timestamp` of the updated records
3. Close the existing records by setting `is_current` to `False`
4. Add the SCD2 columns to the updates
5. Append the updated data to the existing data
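
The five steps above can be sketched in memory as follows. This is an illustration only, not the actual `scd2_simple` from `pandas_functions.py`. The function name `scd2_simple_sketch`, the temporary `update_timestamp` column and the sample frames are assumptions made for the example. The sketch also assumes at most one update per primary key, and leaves out Parquet I/O.

```python
import datetime

import pandas as pd


def scd2_simple_sketch(existing, updates, key="product_id",
                       future_end=datetime.datetime(2250, 1, 1)):
    """Hypothetical in-memory SCD2 merge; assumes one update per key."""
    # Step 1: left-join the existing data with the update timestamps on the key
    new_ts = updates[[key, "extraction_timestamp"]].rename(
        columns={"extraction_timestamp": "update_timestamp"}
    )
    merged = existing.merge(new_ts, on=key, how="left")

    # Steps 2-3: close current records that received an update
    to_close = merged["is_current"] & merged["update_timestamp"].notna()
    merged["end_datetime"] = merged["update_timestamp"].where(
        to_close, merged["end_datetime"]
    )
    merged["is_current"] = merged["is_current"] & ~to_close
    closed = merged.drop(columns="update_timestamp")

    # Step 4: add the SCD2 columns to the updates
    new_rows = updates.copy()
    new_rows["start_datetime"] = new_rows["extraction_timestamp"]
    new_rows["end_datetime"] = pd.Timestamp(future_end)
    new_rows["is_current"] = True

    # Step 5: append the updates to the existing data
    return pd.concat([closed, new_rows], ignore_index=True)


# Sample frames shaped like the data shown in this notebook
existing = pd.DataFrame(
    {
        "product_id": ["00001", "00002"],
        "product_name": ["Heater", "Thermostat"],
        "price": [250, 400],
        "extraction_timestamp": pd.to_datetime(["2022-01-01 01:01:01"] * 2),
        "op": [None, None],
        "start_datetime": pd.to_datetime(["2022-01-01 01:01:01"] * 2),
        "end_datetime": pd.to_datetime(["2250-01-01"] * 2),
        "is_current": [True, True],
    }
)
updates = pd.DataFrame(
    {
        "product_id": ["00001"],
        "product_name": ["Heater"],
        "price": [1000],
        "extraction_timestamp": pd.to_datetime(["2023-01-01 01:01:01"]),
        "op": ["U"],
    }
)

result = scd2_simple_sketch(existing, updates)
print(result[["product_id", "price", "end_datetime", "is_current"]])
```

A left join keeps records that received no update untouched, so only matched current records are closed.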

In [12]:
pd.read_parquet(updates_filepath).head()

  product_id product_name  price extraction_timestamp op
0      00001       Heater   1000  2023-01-01 01:01:01  U
1      00002   Thermostat   1000  2023-01-01 01:01:01  U
2      00003   Television   1000  2023-01-01 01:01:01  U
3      00004      Blender   1000  2023-01-01 01:01:01  U
4      00005  USB charger   1000  2023-01-01 01:01:01  U


In [13]:
scd2_simple_filepath = scd2_simple(bulk_insert_filepath, updates_filepath, output_data_directory, future_end_datetime)
pd.read_parquet(scd2_simple_filepath).head(10)

Output saved to s3://sb-test-bucket-ireland/soumaya/de-usecases/pandas/pandas-pyspark/scd2_simple.parquet in 0.2599973678588867
  product_id product_name  ...        end_datetime is_current
0      00001       Heater  ... 2023-01-01 01:01:01      False
1      00002   Thermostat  ... 2023-01-01 01:01:01      False
2      00003   Television  ... 2023-01-01 01:01:01      False
3      00004      Blender  ... 2023-01-01 01:01:01      False
4      00005  USB charger  ... 2023-01-01 01:01:01      False
5      00001       Heater  ... 2250-01-01 00:00:00       True
6      00002   Thermostat  ... 2250-01-01 00:00:00       True
7      00003   Television  ... 2250-01-01 00:00:00       True
8      00004      Blender  ... 2250-01-01 00:00:00       True
9      00005  USB charger  ... 2250-01-01 00:00:00       True

[10 rows x 8 columns]


## Dedupes

In [16]:
# TODO

## Impute deleted records

In [17]:
# TODO

## Slowly Changing Dimension Type 2 - Complex

This is a more complex SCD2 process which takes into account:

- Late arriving records, where an update is processed with an `extraction_timestamp` that is earlier than the `extraction_timestamp` of the last processed record
- Batches which contain multiple updates to the same primary key

The process can be summarised as follows:

1. Concat/union the updates with the existing data
2. Sort by primary key and `extraction_timestamp`
3. Window by primary key and set `end_datetime` to the next record's `extraction_timestamp`; where there is no next record, set it to a distant future timestamp

The process could be optimised by separating records which have not received any updates, but this is left out to make the logic easier to follow.


In [14]:
pd.read_parquet(late_updates_filepath).head(10)

  product_id product_name  price extraction_timestamp op
0      00001       Heater    500  2022-06-01 01:01:01  U
1      00002   Thermostat    500  2022-06-01 01:01:01  U
2      00003   Television    500  2022-06-01 01:01:01  U
3      00004      Blender    500  2022-06-01 01:01:01  U
4      00005  USB charger    500  2022-06-01 01:01:01  U


In [15]:
scd2_complex_filepath = scd2_complex(scd2_simple_filepath, late_updates_filepath, output_data_directory, future_end_datetime)
pd.read_parquet(scd2_complex_filepath).head(20).sort_values(by=["product_id", "extraction_timestamp"])

Output saved to s3://sb-test-bucket-ireland/soumaya/de-usecases/pandas/pandas-pyspark/scd2_complex.parquet in 0.2845125198364258
   product_id product_name  ...        end_datetime is_current
0       00001       Heater  ... 2022-06-01 01:01:01      False
1       00001       Heater  ... 2023-01-01 01:01:01      False
2       00001       Heater  ... 2250-01-01 00:00:00       True
3       00002   Thermostat  ... 2022-06-01 01:01:01      False
4       00002   Thermostat  ... 2023-01-01 01:01:01      False
5       00002   Thermostat  ... 2250-01-01 00:00:00       True
6       00003   Television  ... 2022-06-01 01:01:01      False
7       00003   Television  ... 2023-01-01 01:01:01      False
8       00003   Television  ... 2250-01-01 00:00:00       True
9       00004      Blender  ... 2022-06-01 01:01:01      False
10      00004      Blender  ... 2023-01-01 01:01:01      False
11      00004      Blender  ... 2250-01-01 00:00:00       True
12      00005  USB charger  ... 2022-06-01 01:01:01 