# Data Engineering Use Cases

This notebook walks through several data engineering use cases using pandas and the sample data available in this repository. The idea is to replicate these use cases in different frameworks, reading from and writing to AWS S3. We can then compare the code complexity of each framework, as well as how performance changes as data volumes increase.

In [1]:
import datetime
import time

import pandas as pd

## Bulk Insert



This is a simple process which appends the SCD2 columns to the full load data and saves the result to a Parquet file.

1. Set `start_datetime` to `extraction_timestamp`
2. Set `end_datetime` to a distant future timestamp
3. Set `is_current` to `True`

In [2]:
bulk_insert_start_time = time.time()

In [3]:
full_load = pd.read_parquet('../helpers/dummy_example_creator/full_load.parquet')
full_load

Unnamed: 0,product_id,product_name,price,extraction_timestamp,op
0,1,Heater,250,2022-01-01 01:01:01,
1,2,Thermostat,400,2022-01-01 01:01:01,
2,3,Television,600,2022-01-01 01:01:01,
3,4,Blender,100,2022-01-01 01:01:01,
4,5,USB charger,50,2022-01-01 01:01:01,


In [4]:
future_end_datetime = datetime.datetime(2250, 1, 1)

full_load['start_datetime'] = full_load['extraction_timestamp']
full_load['end_datetime'] = future_end_datetime
full_load['is_current'] = True
full_load.to_parquet('bulk_insert.parquet')
full_load

Unnamed: 0,product_id,product_name,price,extraction_timestamp,op,start_datetime,end_datetime,is_current
0,1,Heater,250,2022-01-01 01:01:01,,2022-01-01 01:01:01,2250-01-01,True
1,2,Thermostat,400,2022-01-01 01:01:01,,2022-01-01 01:01:01,2250-01-01,True
2,3,Television,600,2022-01-01 01:01:01,,2022-01-01 01:01:01,2250-01-01,True
3,4,Blender,100,2022-01-01 01:01:01,,2022-01-01 01:01:01,2250-01-01,True
4,5,USB charger,50,2022-01-01 01:01:01,,2022-01-01 01:01:01,2250-01-01,True


In [5]:
bulk_insert_process_time = time.time() - bulk_insert_start_time
bulk_insert_process_time

0.221832275390625

## Slowly Changing Dimension Type 2 - Simple

This is a simplified SCD2 process which does not take deletes into account.

1. Join full load with updates on primary key
2. Set `end_datetime` to the `extraction_timestamp` of the updated records
3. Close the existing records by setting `is_current` to `False`
4. Add the SCD2 columns to updates
5. Append updated data to existing data
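
The steps above can be sketched on in-memory data, without the Parquet files. This is an illustration under stated assumptions rather than the notebook's implementation: `scd2_simple` and `FUTURE_END` are hypothetical names, at most one update per primary key is assumed, and a left join is used so that existing records without an update are carried through unchanged (an inner join would drop them).

```python
import pandas as pd

# Assumed sentinel for open-ended records; mirrors the distant future timestamp
FUTURE_END = pd.Timestamp("2250-01-01")


def scd2_simple(existing, updates, key="product_id"):
    """Close updated records and append the new versions (hypothetical helper)."""
    # Left join keeps every existing record, with NaT where no update exists
    merged = existing.merge(
        updates[[key, "extraction_timestamp"]],
        on=key,
        how="left",
        suffixes=(None, "_upd"),
    )
    # Only current records that received an update are closed
    to_close = merged["extraction_timestamp_upd"].notna() & merged["is_current"]
    merged.loc[to_close, "end_datetime"] = merged.loc[to_close, "extraction_timestamp_upd"]
    merged.loc[to_close, "is_current"] = False
    merged = merged.drop(columns="extraction_timestamp_upd")

    # Add the SCD2 columns to the updates
    new_rows = updates.assign(
        start_datetime=updates["extraction_timestamp"],
        end_datetime=FUTURE_END,
        is_current=True,
    )
    return pd.concat([merged, new_rows], ignore_index=True)


# Two existing products, only product 1 receives an update
existing = pd.DataFrame({
    "product_id": [1, 2],
    "price": [250, 400],
    "extraction_timestamp": pd.to_datetime(["2022-01-01 01:01:01"] * 2),
})
existing["start_datetime"] = existing["extraction_timestamp"]
existing["end_datetime"] = FUTURE_END
existing["is_current"] = True

updates = pd.DataFrame({
    "product_id": [1],
    "price": [1000],
    "extraction_timestamp": pd.to_datetime(["2023-01-01 01:01:01"]),
})

result = scd2_simple(existing, updates)
print(result)
```

In this sketch product 2 stays current because it has no matching update, while product 1 ends up with one closed record and one current record.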

In [6]:
scd2_start_time = time.time()

In [7]:
updates = pd.read_parquet('../helpers/dummy_example_creator/updates.parquet')
updates

Unnamed: 0,product_id,product_name,price,extraction_timestamp,op
0,1,Heater,1000,2023-01-01 01:01:01,U
1,2,Thermostat,1000,2023-01-01 01:01:01,U
2,3,Television,1000,2023-01-01 01:01:01,U
3,4,Blender,1000,2023-01-01 01:01:01,U
4,5,USB charger,1000,2023-01-01 01:01:01,U


In [8]:
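# Inner join (the pd.merge default): only existing records with a matching update are kept
df = pd.merge(full_load,
              updates[['product_id', 'extraction_timestamp']],
              on='product_id',
              suffixes=(None, "_y")
              )
# Close the existing records at the extraction_timestamp of their update
df['end_datetime'] = df['extraction_timestamp_y']
df.drop(columns=['extraction_timestamp_y'], inplace=True)
df['is_current'] = False
df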


Unnamed: 0,product_id,product_name,price,extraction_timestamp,op,start_datetime,end_datetime,is_current
0,1,Heater,250,2022-01-01 01:01:01,,2022-01-01 01:01:01,2023-01-01 01:01:01,False
1,2,Thermostat,400,2022-01-01 01:01:01,,2022-01-01 01:01:01,2023-01-01 01:01:01,False
2,3,Television,600,2022-01-01 01:01:01,,2022-01-01 01:01:01,2023-01-01 01:01:01,False
3,4,Blender,100,2022-01-01 01:01:01,,2022-01-01 01:01:01,2023-01-01 01:01:01,False
4,5,USB charger,50,2022-01-01 01:01:01,,2022-01-01 01:01:01,2023-01-01 01:01:01,False


In [9]:
updates['start_datetime'] = updates['extraction_timestamp']
updates['end_datetime'] = future_end_datetime
updates['is_current'] = True

first_update = pd.concat([df,updates],ignore_index=True)
first_update.to_parquet('first_update.parquet')
first_update

Unnamed: 0,product_id,product_name,price,extraction_timestamp,op,start_datetime,end_datetime,is_current
0,1,Heater,250,2022-01-01 01:01:01,,2022-01-01 01:01:01,2023-01-01 01:01:01,False
1,2,Thermostat,400,2022-01-01 01:01:01,,2022-01-01 01:01:01,2023-01-01 01:01:01,False
2,3,Television,600,2022-01-01 01:01:01,,2022-01-01 01:01:01,2023-01-01 01:01:01,False
3,4,Blender,100,2022-01-01 01:01:01,,2022-01-01 01:01:01,2023-01-01 01:01:01,False
4,5,USB charger,50,2022-01-01 01:01:01,,2022-01-01 01:01:01,2023-01-01 01:01:01,False
5,1,Heater,1000,2023-01-01 01:01:01,U,2023-01-01 01:01:01,2250-01-01 00:00:00,True
6,2,Thermostat,1000,2023-01-01 01:01:01,U,2023-01-01 01:01:01,2250-01-01 00:00:00,True
7,3,Television,1000,2023-01-01 01:01:01,U,2023-01-01 01:01:01,2250-01-01 00:00:00,True
8,4,Blender,1000,2023-01-01 01:01:01,U,2023-01-01 01:01:01,2250-01-01 00:00:00,True
9,5,USB charger,1000,2023-01-01 01:01:01,U,2023-01-01 01:01:01,2250-01-01 00:00:00,True


In [10]:
scd2_process_time = time.time() - scd2_start_time
scd2_process_time

0.16663599014282227

## Dedupes

In [11]:
# TODO

## Impute deleted records

In [12]:
# TODO

## Slowly Changing Dimension Type 2 - Complex

This is a more complex SCD2 process which takes into account:

- Late-arriving records, where an update is processed with an `extraction_timestamp` that is earlier than the `extraction_timestamp` of the last processed record
- Batches which contain multiple updates to the same primary key

The process can be summarised as follows:

1. Concat/union updates with the existing data
2. Sort by primary key and extraction_timestamp
3. Window by primary key and set the end_datetime to the next record's extraction_timestamp, otherwise set it to a future distant timestamp
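
The three steps above can be illustrated on a small in-memory example in which a single batch carries two updates to the same primary key, one of them late-arriving. This is a sketch under assumptions: `scd2_window` and `FUTURE_END` are hypothetical names, and the history here holds only the base columns because the SCD2 columns are recomputed for every row.

```python
import pandas as pd

# Assumed sentinel for open-ended records
FUTURE_END = pd.Timestamp("2250-01-01")


def scd2_window(history, updates, key="product_id"):
    """Recompute the SCD2 columns over history plus updates (hypothetical helper)."""
    # 1. Concat/union updates with the existing data
    combined = pd.concat([history, updates], ignore_index=True)
    # 2. Sort by primary key and extraction_timestamp
    combined = combined.sort_values([key, "extraction_timestamp"], ignore_index=True)
    # 3. Window by primary key: the next record's timestamp closes the current one
    next_ts = combined.groupby(key)["extraction_timestamp"].shift(-1)
    combined["start_datetime"] = combined["extraction_timestamp"]
    combined["end_datetime"] = next_ts.fillna(FUTURE_END)
    # A record is current when no later record exists for its key
    combined["is_current"] = next_ts.isna()
    return combined


history = pd.DataFrame({
    "product_id": [1, 2],
    "price": [250, 400],
    "extraction_timestamp": pd.to_datetime(["2022-01-01 01:01:01"] * 2),
})

# One batch with two updates to product 1, deliberately out of order
batch = pd.DataFrame({
    "product_id": [1, 1],
    "price": [1000, 500],
    "extraction_timestamp": pd.to_datetime(
        ["2023-01-01 01:01:01", "2022-06-01 01:01:01"]
    ),
})

result = scd2_window(history, batch)
print(result)
```

Because the records are sorted before the window is applied, the order in which updates arrive within the batch does not affect the result.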

The process could be optimised by separating records which have not received any updates, but this is left out to make the logic easier to follow.
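
One possible shape for that optimisation is sketched here, assuming the existing data already carries the SCD2 columns. The names `scd2_window_partitioned` and `FUTURE_END` are hypothetical; the idea is simply to recompute the window only for primary keys that appear in the updates and to pass all other records through untouched.

```python
import pandas as pd

# Assumed sentinel for open-ended records
FUTURE_END = pd.Timestamp("2250-01-01")


def scd2_window_partitioned(existing, updates, key="product_id"):
    """Recompute SCD2 columns only for keys present in updates (hypothetical helper)."""
    touched = existing[key].isin(updates[key])
    untouched_rows = existing[~touched]

    # Same concat, sort and window logic, restricted to the affected keys
    affected = pd.concat([existing[touched], updates], ignore_index=True)
    affected = affected.sort_values([key, "extraction_timestamp"], ignore_index=True)
    next_ts = affected.groupby(key)["extraction_timestamp"].shift(-1)
    affected["start_datetime"] = affected["extraction_timestamp"]
    affected["end_datetime"] = next_ts.fillna(FUTURE_END)
    affected["is_current"] = next_ts.isna()

    return pd.concat([untouched_rows, affected], ignore_index=True)


existing = pd.DataFrame({
    "product_id": [1, 2],
    "price": [250, 400],
    "extraction_timestamp": pd.to_datetime(["2022-01-01 01:01:01"] * 2),
})
existing["start_datetime"] = existing["extraction_timestamp"]
existing["end_datetime"] = FUTURE_END
existing["is_current"] = True

# Only product 1 is updated, so product 2 skips the sort and window entirely
updates = pd.DataFrame({
    "product_id": [1],
    "price": [1000],
    "extraction_timestamp": pd.to_datetime(["2023-01-01 01:01:01"]),
})

result = scd2_window_partitioned(existing, updates)
print(result)
```

Whether this is faster in practice depends on the share of keys touched per batch, so it would need to be measured alongside the other timings in this notebook.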


In [13]:
late_updates_start_time = time.time()

In [14]:
late_updates = pd.read_parquet('../helpers/dummy_example_creator/late_updates.parquet')
late_updates

Unnamed: 0,product_id,product_name,price,extraction_timestamp,op
0,1,Heater,500,2022-06-01 01:01:01,U
1,2,Thermostat,500,2022-06-01 01:01:01,U
2,3,Television,500,2022-06-01 01:01:01,U
3,4,Blender,500,2022-06-01 01:01:01,U
4,5,USB charger,500,2022-06-01 01:01:01,U


In [16]:
df = pd.concat([first_update, late_updates], ignore_index=True)
df.sort_values(
    by=["product_id", "extraction_timestamp"], ignore_index=True, inplace=True
)
df["end_datetime"] = df.groupby(["product_id"])["extraction_timestamp"].shift(
    -1, fill_value=future_end_datetime
)
# A record is current when no later record exists for its primary key
df["is_current"] = df["end_datetime"] == future_end_datetime
df["start_datetime"] = df["extraction_timestamp"]
# Save the recomputed history
df.to_parquet('second_update.parquet')
df

Unnamed: 0,product_id,product_name,price,extraction_timestamp,op,start_datetime,end_datetime,is_current
0,1,Heater,250,2022-01-01 01:01:01,,2022-01-01 01:01:01,2022-06-01 01:01:01,False
1,1,Heater,500,2022-06-01 01:01:01,U,2022-06-01 01:01:01,2023-01-01 01:01:01,False
2,1,Heater,1000,2023-01-01 01:01:01,U,2023-01-01 01:01:01,2250-01-01 00:00:00,True
3,2,Thermostat,400,2022-01-01 01:01:01,,2022-01-01 01:01:01,2022-06-01 01:01:01,False
4,2,Thermostat,500,2022-06-01 01:01:01,U,2022-06-01 01:01:01,2023-01-01 01:01:01,False
5,2,Thermostat,1000,2023-01-01 01:01:01,U,2023-01-01 01:01:01,2250-01-01 00:00:00,True
6,3,Television,600,2022-01-01 01:01:01,,2022-01-01 01:01:01,2022-06-01 01:01:01,False
7,3,Television,500,2022-06-01 01:01:01,U,2022-06-01 01:01:01,2023-01-01 01:01:01,False
8,3,Television,1000,2023-01-01 01:01:01,U,2023-01-01 01:01:01,2250-01-01 00:00:00,True
9,4,Blender,100,2022-01-01 01:01:01,,2022-01-01 01:01:01,2022-06-01 01:01:01,False
