This notebook outlines the design logic behind the `etl_process.py` script. It does not cover the entire ETL process: once the database was set up, further work continued in the script only, to avoid maintaining the same code in two places.

> Once the database was set up, we stopped using this notebook for design to avoid synchronisation errors between the two copies of the ETL script. One copy runs in a Docker container and the other runs locally, and both reference the same servers, so they must remain identical. Any discrepancy between them could cause issues during execution.

# Part 1: Extract Data

Here we load the raw files from the three separate data sources.

In [52]:
import numpy as np  # needed for the np.nan fallback in the Dim_LGA cell
import pandas as pd
from sqlalchemy import create_engine
from datetime import datetime
import os


project_root = "/Users/lainemulvay/Library/CloudStorage/GoogleDrive-laine.mulvay@gmail.com/My Drive/U/CITS5504 Data Warehousing/Project 1 - Fatalities/Project1 Fatalities" 

os.chdir(project_root)
print("Working directory set to:", os.getcwd())

# File paths
data_dir = os.path.join("data", "raw")
fatalities_file = os.path.join(data_dir, "bitre_fatalities_dec2024.xlsx")
crashes_file = os.path.join(data_dir, "bitre_fatal_crashes_dec2024.xlsx")
dwellings_file = os.path.join(data_dir, "LGA (count of dwellings).csv")

# Load data
fatality_df = pd.read_excel(fatalities_file, sheet_name="BITRE_Fatality", skiprows=4)
fatality_count_df = pd.read_excel(fatalities_file, sheet_name="BITRE_Fatality_Count_By_Date", skiprows=2)

crash_df = pd.read_excel(crashes_file, sheet_name="BITRE_Fatal_Crash", skiprows=4)
crash_count_df = pd.read_excel(crashes_file, sheet_name="BITRE_Fatal_Crash_Count_By_Date", skiprows=2)

dwelling_df = pd.read_csv(
    dwellings_file,
    skiprows=7,
    header=None,
    names=["lga_name", "dwelling_count", "extra"],
    usecols=["lga_name", "dwelling_count"]
)

# Clean up dwelling data
dwelling_df = dwelling_df.iloc[2:-5].reset_index(drop=True)

# Display for inspection
display(dwelling_df)

Working directory set to: /Users/lainemulvay/Library/CloudStorage/GoogleDrive-laine.mulvay@gmail.com/My Drive/U/CITS5504 Data Warehousing/Project 1 - Fatalities/Project1 Fatalities


Unnamed: 0,lga_name,dwelling_count
0,Albury,25430
1,Armidale Regional,12955
2,Ballina,20889
3,Balranald,1091
4,Bathurst Regional,18458
...,...,...
551,Migratory - Offshore - Shipping (NT),74
552,Unincorporated ACT,187156
553,Migratory - Offshore - Shipping (ACT),0
554,Unincorp. Other Territories,1376
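
The dwelling file load above relies on `skiprows` to jump past the export preamble and on `iloc` slicing to trim leftover header and footer rows. The sketch below shows the same pattern on a small in-memory CSV. The file contents, the preamble and footer lines, and the name `toy_df` are made up for illustration, and the row offsets differ from those of the real file.

```python
import io

import pandas as pd

# Hypothetical export: two preamble lines, a header row, data rows with a
# trailing comma, then a total row and a footer line.
raw_csv = (
    "Census export (hypothetical preamble)\n"
    "Counting: Dwellings\n"
    "LGA,Count,\n"
    "Albury,25430,\n"
    "Ballina,20889,\n"
    "Total,46319,\n"
    "Data source: hypothetical footer\n"
)

toy_df = pd.read_csv(
    io.StringIO(raw_csv),
    skiprows=2,                                     # skip the preamble
    header=None,                                    # supply our own names
    names=["lga_name", "dwelling_count", "extra"],  # "extra" absorbs the trailing comma
    usecols=["lga_name", "dwelling_count"],
)

# Drop the leftover header row at the top and the total/footer rows at the end.
toy_df = toy_df.iloc[1:-2].reset_index(drop=True)

# The column was read as text because of the header row, so convert it.
toy_df["dwelling_count"] = pd.to_numeric(toy_df["dwelling_count"])
print(toy_df)
```

Because the offsets are hard-coded, a check such as asserting on the first and last `lga_name` after trimming can catch a change in the export layout early.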


# Part 2: Transform

### Filling Missing Values

In [53]:
# Replace the -9 placeholder values with NA
crash_df = crash_df.replace(-9, pd.NA)
fatality_df = fatality_df.replace(-9, pd.NA)

# Convert any remaining missing values (blank cells) to pd.NA
crash_df = crash_df.fillna(pd.NA)
fatality_df = fatality_df.fillna(pd.NA)

# display(crash_df)
# display(fatality_df)
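
To make the effect of the replacement concrete, here is a minimal sketch on a toy frame. It assumes, as the cell above does, that -9 is the dataset's code for an unknown value. The frame `toy_crash_df` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the crash data; -9 marks an unknown value.
toy_crash_df = pd.DataFrame({
    "Crash ID": [1, 2, 3],
    "Speed Limit": [100, -9, 60],
    "Age": [-9, 30, 45],
})

# Count the placeholders before replacing them.
missing_before = int((toy_crash_df == -9).sum().sum())

# Same call as in the notebook cell.
toy_clean_df = toy_crash_df.replace(-9, pd.NA)

# After the replacement the placeholders show up as missing values.
missing_after = toy_clean_df.isna().sum()
print(missing_before)
print(missing_after)
```

Comparing the two counts is a quick way to confirm that every placeholder was converted and that no other values were touched.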

### Join Fatalities and Crashes

First we merge the dataframes and check whether the duplicated columns are identical.

In [54]:
# Join fatality_df and crash_df on the "Crash ID" column
crashxfatality_df = fatality_df.merge(crash_df, on="Crash ID", how="left")
# reset the index
crashxfatality_df = crashxfatality_df.reset_index(drop=True)

# Step 1: Clean the column names (remove \n)
cleaned_cols = crashxfatality_df.columns.str.replace("\n", "", regex=False)

# Step 2: Check for duplicate names
seen = {}
final_cols = []

for idx, col in enumerate(cleaned_cols):
    if col in seen:
        # Compare the new column with the original one
        orig_idx = seen[col]
        col_data_1 = crashxfatality_df.iloc[:, orig_idx]
        col_data_2 = crashxfatality_df.iloc[:, idx]
        
        if col_data_1.equals(col_data_2):
            # If they're the same, mark this one to drop by putting None
            final_cols.append(None)
            print(f"Dropping duplicate identical column: {col}")
        else:
            # If they're different, rename this one by adding a _dup suffix
            new_col = f"{col}_dup"
            print(f"Renaming column '{col}' to '{new_col}' due to conflict")
            final_cols.append(new_col)
    else:
        seen[col] = idx
        final_cols.append(col)

# Step 3: Apply the updated column names
crashxfatality_df.columns = final_cols

# Step 4: Drop any columns that were marked as None
crashxfatality_df = crashxfatality_df.loc[:, crashxfatality_df.columns.notna()]

# Print the cleaned column names
print("Cleaned column names:")
print(crashxfatality_df.columns)

# Verify that the columns in crashxfatality_df are clean duplicates
for col in crashxfatality_df.columns:
    if col.endswith('_x'):
        base = col[:-2]
        col_y = base + '_y'
        if col_y in crashxfatality_df.columns:
            diffs = crashxfatality_df[crashxfatality_df[col] != crashxfatality_df[col_y]]
            if not diffs.empty:
                print(f"\nDifferences found in column: {base}")
                print(diffs[['Crash ID', col, col_y]])

Dropping duplicate identical column: Bus Involvement
Cleaned column names:
Index(['Crash ID', 'State_x', 'Month_x', 'Year_x', 'Dayweek_x', 'Time_x',
       'Crash Type_x', 'Bus Involvement', 'Heavy Rigid Truck Involvement_x',
       'Articulated Truck Involvement_x', 'Speed Limit_x', 'Road User',
       'Gender', 'Age', 'National Remoteness Areas_x', 'SA4 Name 2021_x',
       'National LGA Name 2021_x', 'National Road Type_x',
       'Christmas Period_x', 'Easter Period_x', 'Age Group', 'Day of week_x',
       'Time of day', 'State_y', 'Month_y', 'Year_y', 'Dayweek_y', 'Time_y',
       'Crash Type_y', 'Number Fatalities', 'Heavy Rigid Truck Involvement_y',
       'Articulated Truck Involvement_y', 'Speed Limit_y',
       'National Remoteness Areas_y', 'SA4 Name 2021_y',
       'National LGA Name 2021_y', 'National Road Type_y',
       'Christmas Period_y', 'Easter Period_y', 'Day of week_y',
       'Time of Day'],
      dtype='object')

Differences found in column: Time
       Crash ID

In the comparison of columns between `fatality_df` and `crash_df`, most differences are due to missing (NA) values and can be safely ignored. However, for the National Road Type column, we observed a consistent pattern between row indices 9510 and 10797: the values from `fatality_df` (the `_x` columns) are marked as "Undetermined", while the corresponding values from `crash_df` (the `_y` columns) contain more informative road types.

Given this, we prioritize the data from `crash_df` when merging and retain the `_y` columns. The duplicate `_x` columns from `fatality_df` are dropped so that we keep the most complete and accurate information.
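
Since many of the reported differences come from missing values, one option is a comparison that skips rows where both sides are missing. The sketch below is a possible helper, not part of the original script. The function name `real_differences` and the toy series are hypothetical.

```python
import pandas as pd


def real_differences(left: pd.Series, right: pd.Series) -> pd.Series:
    """Return a boolean mask of rows where the two columns genuinely disagree.

    Rows where both values are missing are not counted as differences.
    """
    both_missing = left.isna() & right.isna()
    return left.ne(right) & ~both_missing


# Hypothetical values standing in for 'National Road Type_x' and '_y'.
road_type_x = pd.Series(["Local Road", None, "Undetermined"])
road_type_y = pd.Series(["Local Road", None, "Arterial Road"])

mask = real_differences(road_type_x, road_type_y)
print(mask.tolist())
```

Rows where only one side is missing are still flagged, which is useful here because they show where one source is more complete than the other.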


In [55]:
# Step 1: Drop all _x columns
cols_to_drop = [col for col in crashxfatality_df.columns if col.endswith('_x')]
crashxfatality_df = crashxfatality_df.drop(columns=cols_to_drop)

# Step 2: Rename all _y columns to remove the suffix
crashxfatality_df.columns = [
    col[:-2] if col.endswith('_y') else col for col in crashxfatality_df.columns
]

# Drop 'Time of day' column because it is a duplicate of 'Time of Day'
crashxfatality_df = crashxfatality_df.drop(columns=['Time of day'])

print(crashxfatality_df.columns)



Index(['Crash ID', 'Bus Involvement', 'Road User', 'Gender', 'Age',
       'Age Group', 'State', 'Month', 'Year', 'Dayweek', 'Time', 'Crash Type',
       'Number Fatalities', 'Heavy Rigid Truck Involvement',
       'Articulated Truck Involvement', 'Speed Limit',
       'National Remoteness Areas', 'SA4 Name 2021', 'National LGA Name 2021',
       'National Road Type', 'Christmas Period', 'Easter Period',
       'Day of week', 'Time of Day'],
      dtype='object')


The dataframes have now been merged and the columns cleaned up.

Let's do the same for the count dataframes.

In [56]:
crashxfatality_count_df = fatality_count_df.merge(crash_count_df, on="Date", how="left")

crashxfatality_count_df.columns = crashxfatality_count_df.columns.str.replace("_x", "")
crashxfatality_count_df.columns = crashxfatality_count_df.columns.str.replace("_y", "")

# remove duplicate columns 
crashxfatality_count_df = crashxfatality_count_df.loc[:, ~crashxfatality_count_df.columns.duplicated()]

columns = crashxfatality_count_df.columns
print(columns)


Index(['Date', 'Number Fatalities', 'Year', 'Month', 'Day Of Week',
       'Number of fatal crashes'],
      dtype='object')
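
The suffix removal above uses a plain substring replacement, which would also alter a column name that happens to contain `_x` or `_y` in the middle. That does not affect the columns shown here, but a suffix-anchored pattern is a safer variant. The sketch below uses a made-up column list; `Max_x_speed` is a hypothetical name chosen to show the difference.

```python
import pandas as pd

# Hypothetical merged column names; "Max_x_speed" contains "_x" mid-name.
merged_cols = pd.Index(["Date", "Year_x", "Year_y", "Max_x_speed"])

# Plain substring replacement also changes the middle of "Max_x_speed".
naive = merged_cols.str.replace("_x", "", regex=False)

# Anchored pattern: only strip "_x" or "_y" at the end of the name.
stripped = merged_cols.str.replace(r"_[xy]$", "", regex=True)

# Keep the first occurrence of each name, as the notebook cell does.
deduped = stripped[~stripped.duplicated()]
print(list(naive))
print(list(deduped))
```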


Now we just need to reset the indices on both tables.

In [57]:
crashxfatality_count_df = crashxfatality_count_df.reset_index(drop=True)
crashxfatality_df = crashxfatality_df.reset_index(drop=True)
display(crashxfatality_count_df)
display(crashxfatality_df)

Unnamed: 0,Date,Number Fatalities,Year,Month,Day Of Week,Number of fatal crashes
0,1989-01-01,8,1989,Jan,Sunday,6
1,1989-01-02,10,1989,Jan,Monday,8
2,1989-01-03,2,1989,Jan,Tuesday,2
3,1989-01-04,5,1989,Jan,Wednesday,5
4,1989-01-05,7,1989,Jan,Thursday,3
...,...,...,...,...,...,...
13144,2024-12-27,3,2024,Dec,Friday,3
13145,2024-12-28,7,2024,Dec,Saturday,5
13146,2024-12-29,1,2024,Dec,Sunday,1
13147,2024-12-30,6,2024,Dec,Monday,5


Unnamed: 0,Crash ID,Bus Involvement,Road User,Gender,Age,Age Group,State,Month,Year,Dayweek,...,Articulated Truck Involvement,Speed Limit,National Remoteness Areas,SA4 Name 2021,National LGA Name 2021,National Road Type,Christmas Period,Easter Period,Day of week,Time of Day
0,20241115,No,Driver,Male,74,65_to_74,NSW,12,2024,Friday,...,No,100,Inner Regional Australia,Riverina,Wagga Wagga,Arterial Road,Yes,No,Weekday,Night
1,20241125,No,Driver,Female,19,17_to_25,NSW,12,2024,Friday,...,No,80,Inner Regional Australia,Sydney - Baulkham Hills and Hawkesbury,Hawkesbury,Local Road,No,No,Weekday,Day
2,20246013,No,Driver,Female,33,26_to_39,Tas,12,2024,Friday,...,No,50,Inner Regional Australia,Launceston and North East,Northern Midlands,Local Road,Yes,No,Weekday,Day
3,20241002,No,Driver,Female,32,26_to_39,NSW,12,2024,Friday,...,No,100,Outer Regional Australia,New England and North West,Armidale Regional,National or State Highway,No,No,Weekday,Day
4,20242261,,Passenger,Male,62,40_to_64,Vic,12,2024,Friday,...,,,Unknown,,,Undetermined,No,No,Weekday,Day
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56869,19896006,No,Passenger,Female,13,0_to_16,Tas,1,1989,Wednesday,...,Yes,100,Unknown,,,Undetermined,No,No,Weekday,Night
56870,19896006,No,Passenger,Male,13,0_to_16,Tas,1,1989,Wednesday,...,Yes,100,Unknown,,,Undetermined,No,No,Weekday,Night
56871,19896006,No,Driver,Male,18,17_to_25,Tas,1,1989,Wednesday,...,Yes,100,Unknown,,,Undetermined,No,No,Weekday,Night
56872,19896006,No,Passenger,Female,14,0_to_16,Tas,1,1989,Wednesday,...,Yes,100,Unknown,,,Undetermined,No,No,Weekday,Night


In [58]:
# Make the 'State' column uppercase
crashxfatality_df['State'] = crashxfatality_df['State'].str.upper()
display(crashxfatality_df)

Unnamed: 0,Crash ID,Bus Involvement,Road User,Gender,Age,Age Group,State,Month,Year,Dayweek,...,Articulated Truck Involvement,Speed Limit,National Remoteness Areas,SA4 Name 2021,National LGA Name 2021,National Road Type,Christmas Period,Easter Period,Day of week,Time of Day
0,20241115,No,Driver,Male,74,65_to_74,NSW,12,2024,Friday,...,No,100,Inner Regional Australia,Riverina,Wagga Wagga,Arterial Road,Yes,No,Weekday,Night
1,20241125,No,Driver,Female,19,17_to_25,NSW,12,2024,Friday,...,No,80,Inner Regional Australia,Sydney - Baulkham Hills and Hawkesbury,Hawkesbury,Local Road,No,No,Weekday,Day
2,20246013,No,Driver,Female,33,26_to_39,TAS,12,2024,Friday,...,No,50,Inner Regional Australia,Launceston and North East,Northern Midlands,Local Road,Yes,No,Weekday,Day
3,20241002,No,Driver,Female,32,26_to_39,NSW,12,2024,Friday,...,No,100,Outer Regional Australia,New England and North West,Armidale Regional,National or State Highway,No,No,Weekday,Day
4,20242261,,Passenger,Male,62,40_to_64,VIC,12,2024,Friday,...,,,Unknown,,,Undetermined,No,No,Weekday,Day
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56869,19896006,No,Passenger,Female,13,0_to_16,TAS,1,1989,Wednesday,...,Yes,100,Unknown,,,Undetermined,No,No,Weekday,Night
56870,19896006,No,Passenger,Male,13,0_to_16,TAS,1,1989,Wednesday,...,Yes,100,Unknown,,,Undetermined,No,No,Weekday,Night
56871,19896006,No,Driver,Male,18,17_to_25,TAS,1,1989,Wednesday,...,Yes,100,Unknown,,,Undetermined,No,No,Weekday,Night
56872,19896006,No,Passenger,Female,14,0_to_16,TAS,1,1989,Wednesday,...,Yes,100,Unknown,,,Undetermined,No,No,Weekday,Night


### Dimension Table Generation

In [59]:
# --- Dim_Date ---
def create_dim_date(df, date_col):
    df = df.copy()
    df['date_id'] = pd.to_datetime(df[date_col])
    df['year'] = df['date_id'].dt.year
    df['month'] = df['date_id'].dt.month
    df['day'] = df['date_id'].dt.day
    df['day_of_week'] = df['date_id'].dt.day_name()
    df['is_weekend'] = df['day_of_week'].isin(['Saturday', 'Sunday'])
    return df[['date_id', 'year', 'month', 'day', 'day_of_week', 'is_weekend']].drop_duplicates()

dim_date = create_dim_date(crashxfatality_count_df, "Date")

dim_date = dim_date.reset_index(drop=True)
display(dim_date)

Unnamed: 0,date_id,year,month,day,day_of_week,is_weekend
0,1989-01-01,1989,1,1,Sunday,True
1,1989-01-02,1989,1,2,Monday,False
2,1989-01-03,1989,1,3,Tuesday,False
3,1989-01-04,1989,1,4,Wednesday,False
4,1989-01-05,1989,1,5,Thursday,False
...,...,...,...,...,...,...
13144,2024-12-27,2024,12,27,Friday,False
13145,2024-12-28,2024,12,28,Saturday,True
13146,2024-12-29,2024,12,29,Sunday,True
13147,2024-12-30,2024,12,30,Monday,False


In [60]:
# --- Dim_State ---
dim_state = crashxfatality_df[["State"]].drop_duplicates()
dim_state = dim_state.rename(columns={"State": "state_id"})
# Set state_name to the full name of the state, e.g. NSW -> New South Wales
state_mapping = {
    "NSW": "New South Wales",
    "VIC": "Victoria",
    "QLD": "Queensland",
    "SA": "South Australia",
    "WA": "Western Australia",
    "TAS": "Tasmania",
    "NT": "Northern Territory",
    "ACT": "Australian Capital Territory"
}
dim_state["state_name"] = dim_state["state_id"].replace(state_mapping)

dim_state = dim_state[["state_id", "state_name"]]

dim_state = dim_state.reset_index(drop=True)

display(dim_state)

Unnamed: 0,state_id,state_name
0,NSW,New South Wales
1,TAS,Tasmania
2,VIC,Victoria
3,QLD,Queensland
4,SA,South Australia
5,WA,Western Australia
6,ACT,Australian Capital Territory
7,NT,Northern Territory
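
With `replace`, a state code that is missing from the mapping is passed through unchanged, so it would silently end up in `state_name`. A possible check, sketched here with a shortened mapping and a made-up code `XX`, is to use `map` and look for codes that came back empty.

```python
import pandas as pd

# Shortened copy of the mapping for illustration.
state_mapping = {"NSW": "New South Wales", "TAS": "Tasmania"}

# "XX" is a hypothetical code that is not in the mapping.
codes = pd.Series(["NSW", "TAS", "XX"], name="state_id")

# replace() leaves unknown codes as they are.
replaced = codes.replace(state_mapping)

# map() returns NaN for unknown codes, which makes them easy to find.
names = codes.map(state_mapping)
unmapped = codes[names.isna()].tolist()
print(replaced.tolist())
print(unmapped)
```

An empty `unmapped` list confirms that every code in the data has a full name.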


In [61]:
# --- Dim_LGA ---
dim_lga = crashxfatality_df[[
    "National LGA Name 2021", "State", "National Remoteness Areas"
]].drop_duplicates()

# Join with state dimension to get state_id
dim_lga = dim_lga.merge(
    dim_state, 
    left_on="State", 
    right_on="state_id", 
    how="left"
)

dim_lga = dim_lga.rename(columns={
    "National LGA Name 2021": "lga_name",
    "National Remoteness Areas": "national_remoteness_area"
})

try:
    dim_lga = pd.merge(dim_lga, dwelling_df, on="lga_name", how="left")
except NameError:
    dim_lga["dwelling_count"] = np.nan

# Remove all rows with NA in lga_name
dim_lga = dim_lga.dropna(subset=["lga_name"])

dim_lga = dim_lga.drop_duplicates(subset=["lga_name"])
dim_lga = dim_lga.reset_index(drop=True)
dim_lga["lga_id"] = range(0, len(dim_lga))
dim_lga = dim_lga[["lga_id", "lga_name", "national_remoteness_area", "dwelling_count", "state_id"]]

display(dim_lga)

Unnamed: 0,lga_id,lga_name,national_remoteness_area,dwelling_count,state_id
0,0,Wagga Wagga,Inner Regional Australia,28244,NSW
1,1,Hawkesbury,Inner Regional Australia,25523,NSW
2,2,Northern Midlands,Inner Regional Australia,6444,TAS
3,3,Armidale Regional,Outer Regional Australia,12955,NSW
4,4,Lockyer Valley,Inner Regional Australia,16162,QLD
...,...,...,...,...,...
506,506,Exmouth,Very Remote Australia,2662,WA
507,507,Sandstone,Very Remote Australia,94,WA
508,508,Boyup Brook,Outer Regional Australia,927,WA
509,509,Cottesloe,Major Cities of Australia,3612,WA


In [62]:
# --- Dim_Crash (crash time attributes) ---
dim_crash = crashxfatality_df[[
    "Crash ID", "Time", "Time of Day"
]].rename(columns={
    "Crash ID": "crash_id",
    "Time": "crash_time",
    "Time of Day": "time_of_day"
}).drop_duplicates()

dim_crash = dim_crash.reset_index(drop=True)

display(dim_crash)

Unnamed: 0,crash_id,crash_time,time_of_day
0,20241115,04:00:00,Night
1,20241125,06:15:00,Day
2,20246013,09:43:00,Day
3,20241002,10:35:00,Day
4,20242261,11:30:00,Day
...,...,...,...
51279,19891246,17:05:00,Day
51280,19892038,18:50:00,Night
51281,19894064,19:00:00,Night
51282,19896006,20:20:00,Night


In [63]:
# --- Dim_Vehicle ---
dim_vehicle = crashxfatality_df[[
    "Crash ID", "Bus Involvement", "Heavy Rigid Truck Involvement", "Articulated Truck Involvement"
]].rename(columns={"Crash ID": "crash_id"}).drop_duplicates()

dim_vehicle = dim_vehicle.reset_index(drop=True)

display(dim_vehicle)

Unnamed: 0,crash_id,Bus Involvement,Heavy Rigid Truck Involvement,Articulated Truck Involvement
0,20241115,No,No,No
1,20241125,No,No,No
2,20246013,No,No,No
3,20241002,No,No,No
4,20242261,,,
...,...,...,...,...
51279,19891246,Yes,,No
51280,19892038,Yes,No,No
51281,19894064,No,,No
51282,19896006,No,,Yes


In [64]:
# --- Dim_Person ---
dim_person = crashxfatality_df[[
    "Crash ID", "Gender", "Age", "Age Group", "Road User"
]].rename(columns={
    "Crash ID": "crash_id",
    "Age Group": "age_group",
    "Road User": "road_user"
}).drop_duplicates()

# Generate a surrogate key for person_id
dim_person["person_id"] = dim_person.apply(
    lambda row: f"{row['crash_id']}_{str(row['Age'])}_{str(row['Gender'])}_{str(row['road_user'])}".replace(" ", "_"), 
    axis=1
)

dim_person = dim_person.rename(columns={
    "Gender": "gender",
    "Age": "age"
})

dim_person = dim_person[["person_id", "crash_id", "gender", "age", "age_group", "road_user"]]

display(dim_person)

Unnamed: 0,person_id,crash_id,gender,age,age_group,road_user
0,20241115_74_Male_Driver,20241115,Male,74,65_to_74,Driver
1,20241125_19_Female_Driver,20241125,Female,19,17_to_25,Driver
2,20246013_33_Female_Driver,20246013,Female,33,26_to_39,Driver
3,20241002_32_Female_Driver,20241002,Female,32,26_to_39,Driver
4,20242261_62_Male_Passenger,20242261,Male,62,40_to_64,Passenger
...,...,...,...,...,...,...
56868,19896006_11_Female_Passenger,19896006,Female,11,0_to_16,Passenger
56869,19896006_13_Female_Passenger,19896006,Female,13,0_to_16,Passenger
56871,19896006_18_Male_Driver,19896006,Male,18,17_to_25,Driver
56872,19896006_14_Female_Passenger,19896006,Female,14,0_to_16,Passenger
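
The `person_id` built above combines crash, age, gender and road user, so two people in the same crash who share all of those attributes receive the same key. The gap in the index of the display above (56870 is missing after `drop_duplicates`) suggests this does occur. One possible way to keep keys unique is to append an occurrence counter. The sketch below uses made-up rows and is not how `etl_process.py` is known to handle it.

```python
import pandas as pd

# Hypothetical rows: the first two people are indistinguishable by attributes.
people = pd.DataFrame({
    "crash_id": [19896006, 19896006, 19896006, 20241115],
    "age": [13, 13, 18, 74],
    "gender": ["Male", "Male", "Male", "Male"],
    "road_user": ["Passenger", "Passenger", "Driver", "Motorcycle rider"],
})

# Same composite key as in the notebook, built with vectorised string ops.
base_key = (
    people["crash_id"].astype(str) + "_"
    + people["age"].astype(str) + "_"
    + people["gender"] + "_"
    + people["road_user"]
).str.replace(" ", "_", regex=False)

# Number repeated keys 0, 1, 2, ... and append the counter as a suffix.
occurrence = people.groupby(base_key).cumcount()
people["person_id"] = base_key + "_" + occurrence.astype(str)
print(people["person_id"].tolist())
```

With this variant the rows must not be de-duplicated first, and the fact table would need to generate the key the same way so the two still join.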


In [65]:
# --- Dim_Event ---
dim_event = crashxfatality_df[[
    "Crash ID", "Christmas Period", "Easter Period"
]].rename(columns={"Crash ID": "crash_id"}).drop_duplicates()

dim_event = dim_event.reset_index(drop=True)

display(dim_event)

Unnamed: 0,crash_id,Christmas Period,Easter Period
0,20241115,Yes,No
1,20241125,No,No
2,20246013,Yes,No
3,20241002,No,No
4,20242261,No,No
...,...,...,...
51279,19891246,No,No
51280,19892038,No,No
51281,19894064,No,No
51282,19896006,No,No


In [66]:
# --- Dim_Road ---
dim_road = crashxfatality_df[[
    "Crash ID", "Speed Limit", "National Road Type"
]].rename(columns={
    "Crash ID": "crash_id",
    "Speed Limit": "speed_limit",
    "National Road Type": "national_road_type"
}).drop_duplicates()


dim_road["national_road_type"] = dim_road["national_road_type"].replace("Undetermined", pd.NA)

dim_road = dim_road.reset_index(drop=True)

display(dim_road)

Unnamed: 0,crash_id,speed_limit,national_road_type
0,20241115,100,Arterial Road
1,20241125,80,Local Road
2,20246013,50,Local Road
3,20241002,100,National or State Highway
4,20242261,,
...,...,...,...
51279,19891246,60,
51280,19892038,60,
51281,19894064,100,
51282,19896006,100,


### Fact Table Generation

In [67]:
# --- Fact_Fatalities ---
# Create individual fatality records
# crashxfatality_df was built by left-joining crash_df onto fatality_df,
# so each of its rows is treated as one fatality record

# First ensure person_id is created correctly in dim_person
if "person_id" not in dim_person.columns:
    dim_person["person_id"] = dim_person.apply(
        lambda row: f"{row['crash_id']}_{str(row['age'])}_{str(row['gender'])}_{str(row['road_user'])}".replace(" ", "_"), 
        axis=1
    )

# Create fact_fatalities by joining with dim_person
fact_fatalities = crashxfatality_df[["Crash ID", "Gender", "Age", "Road User"]].copy()
fact_fatalities = fact_fatalities.rename(columns={"Crash ID": "crash_id"})

# Generate the same person_id to join with dim_person
fact_fatalities["person_id"] = fact_fatalities.apply(
    lambda row: f"{row['crash_id']}_{str(row['Age'])}_{str(row['Gender'])}_{str(row['Road User'])}".replace(" ", "_"), 
    axis=1
)

# Generate a unique fatality_id
fact_fatalities["fatality_id"] = range(0, len(fact_fatalities))

# Final fact_fatalities table
fact_fatalities = fact_fatalities[["fatality_id", "person_id", "crash_id"]]

display(fact_fatalities)

Unnamed: 0,fatality_id,person_id,crash_id
0,0,20241115_74_Male_Driver,20241115
1,1,20241125_19_Female_Driver,20241125
2,2,20246013_33_Female_Driver,20246013
3,3,20241002_32_Female_Driver,20241002
4,4,20242261_62_Male_Passenger,20242261
...,...,...,...
56869,56869,19896006_13_Female_Passenger,19896006
56870,56870,19896006_13_Male_Passenger,19896006
56871,56871,19896006_18_Male_Driver,19896006
56872,56872,19896006_14_Female_Passenger,19896006


In [68]:
# --- Fact_Crashes ---
fact_crashes = crashxfatality_df[[
    'Crash ID', 'Year', 'Month', 'National LGA Name 2021', 'State'
]].rename(columns={
    'Crash ID': 'crash_id',
    'National LGA Name 2021': 'lga_name',
    'State': 'state_id'
}).drop_duplicates()

# Create date_id for joining with dim_date
fact_crashes['date_str'] = fact_crashes['Year'].astype(str) + '-' + fact_crashes['Month'].astype(str) + '-01'
fact_crashes['date_id'] = pd.to_datetime(fact_crashes['date_str'])
fact_crashes.drop(columns=['date_str', 'Year', 'Month'], inplace=True)

# Join with dim_lga to get lga_id
fact_crashes = fact_crashes.merge(
    dim_lga[['lga_id', 'lga_name']], 
    on='lga_name', 
    how='left'
)

# Final fact_crashes table with required columns
fact_crashes = fact_crashes[['crash_id', 'date_id', 'lga_id', 'state_id']]

display(fact_crashes)

Unnamed: 0,crash_id,date_id,lga_id,state_id
0,20241115,2024-12-01,0.0,NSW
1,20241125,2024-12-01,1.0,NSW
2,20246013,2024-12-01,2.0,TAS
3,20241002,2024-12-01,3.0,NSW
4,20242261,2024-12-01,,VIC
...,...,...,...,...
51279,19891246,1989-01-01,,NSW
51280,19892038,1989-01-01,,VIC
51281,19894064,1989-01-01,,SA
51282,19896006,1989-01-01,,TAS
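
In the output above `lga_id` appears as `0.0`, `1.0` and so on, because the left join leaves missing values for crashes without an LGA and pandas stores such a column as float. A possible fix, sketched on toy frames with hypothetical contents, is to cast the column to the nullable `Int64` dtype after the merge.

```python
import pandas as pd

# Hypothetical crashes: the second one has no LGA name.
crashes = pd.DataFrame({
    "crash_id": [20241115, 20242261],
    "lga_name": ["Wagga Wagga", None],
})
lga = pd.DataFrame({"lga_id": [0], "lga_name": ["Wagga Wagga"]})

merged = crashes.merge(lga, on="lga_name", how="left")

# The unmatched row forces the key column to float.
float_dtype = str(merged["lga_id"].dtype)

# Nullable integer dtype keeps whole-number keys and still allows missing values.
merged["lga_id"] = merged["lga_id"].astype("Int64")
print(float_dtype)
print(merged)
```

This keeps the key in `fact_crashes` the same type as the integer `lga_id` in `dim_lga`.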


In [69]:
# --- Fact_Number ---
fact_number = crashxfatality_count_df[[
    "Date", "Number Fatalities", "Number of fatal crashes"
]].copy()

fact_number["date_id"] = pd.to_datetime(fact_number["Date"])
fact_number = fact_number.drop(columns=["Date"])

fact_number = fact_number.rename(columns={
    "Number Fatalities": "total_fatalities",
    "Number of fatal crashes": "total_crashes"
})

# Generate a unique key for the number fact table
fact_number["number_date_id"] = range(0, len(fact_number))

# Final fact_number table
fact_number = fact_number[["number_date_id", "date_id", "total_fatalities", "total_crashes"]]

display(fact_number)

Unnamed: 0,number_date_id,date_id,total_fatalities,total_crashes
0,0,1989-01-01,8,6
1,1,1989-01-02,10,8
2,2,1989-01-03,2,2
3,3,1989-01-04,5,5
4,4,1989-01-05,7,3
...,...,...,...,...
13144,13144,2024-12-27,3,3
13145,13145,2024-12-28,7,5
13146,13146,2024-12-29,1,1
13147,13147,2024-12-30,6,5


# Part 3: Load tables into database

First we will set up the database connection.

In [70]:
from sqlalchemy import create_engine

# Create the database engine to connect to PostgreSQL
DATABASE_URL = "postgresql://postgres:postgres@localhost:5433/datawarehouse"

# Create the SQLAlchemy engine
engine = create_engine(DATABASE_URL)

print("Connected to PostgreSQL database at", DATABASE_URL)

Connected to PostgreSQL database at postgresql://postgres:postgres@localhost:5433/datawarehouse
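
Because the same script is meant to run both locally and in a Docker container, one option is to read the connection string from an environment variable and fall back to the local default. This is a sketch only: the variable name `DATABASE_URL` and the helper `get_database_url` are assumptions, not something taken from `etl_process.py`.

```python
import os

# Local default, copied from the cell above.
DEFAULT_DATABASE_URL = "postgresql://postgres:postgres@localhost:5433/datawarehouse"


def get_database_url(env=None):
    """Return the connection string from the environment, or the local default.

    `env` is any mapping; it defaults to os.environ and exists mainly to make
    the function easy to test.
    """
    env = os.environ if env is None else env
    return env.get("DATABASE_URL", DEFAULT_DATABASE_URL)


database_url = get_database_url()
print("Using database URL:", database_url)
```

The result can be passed to `create_engine` in place of the hard-coded string, which lets both copies of the script stay identical while pointing at different hosts.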


Next we will load the tables.

In [71]:
tables = {
    "dim_date": dim_date,
    "dim_state": dim_state,
    "dim_lga": dim_lga,
    "dim_crash": dim_crash,
    "dim_vehicle": dim_vehicle,
    "dim_person": dim_person,
    "dim_event": dim_event,
    "dim_road": dim_road,
    "fact_fatalities": fact_fatalities,
    "fact_crashes": fact_crashes,
    "fact_number": fact_number
}

# Iterate over the dictionary and load data into each table
for table_name, df in tables.items():
    # Load the dataframe into PostgreSQL
    df.to_sql(table_name, engine, index=False, if_exists='replace')  # 'replace' ensures that the table is replaced if it exists
    print(f"Table {table_name} has been loaded into the database.")


Table dim_date has been loaded into the database.
Table dim_state has been loaded into the database.
Table dim_lga has been loaded into the database.
Table dim_crash has been loaded into the database.
Table dim_vehicle has been loaded into the database.
Table dim_person has been loaded into the database.
Table dim_event has been loaded into the database.
Table dim_road has been loaded into the database.
Table fact_fatalities has been loaded into the database.
Table fact_crashes has been loaded into the database.
Table fact_number has been loaded into the database.
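
The loop above reports success without checking what was written. A simple follow-up, sketched here, is to compare each table's row count in the database with the length of the dataframe. To keep the example runnable without a PostgreSQL server it uses an in-memory SQLite connection as a stand-in, and the two toy tables are hypothetical.

```python
import sqlite3

import pandas as pd

# Hypothetical miniature versions of two of the tables.
tables = {
    "dim_state": pd.DataFrame({"state_id": ["NSW", "TAS"]}),
    "fact_number": pd.DataFrame({"number_date_id": [0, 1, 2]}),
}

# In-memory SQLite stands in for the PostgreSQL engine used in the notebook.
conn = sqlite3.connect(":memory:")

row_counts = {}
for table_name, df in tables.items():
    df.to_sql(table_name, conn, index=False, if_exists="replace")

    # Read the row count back and compare it with the dataframe.
    loaded = pd.read_sql_query(f'SELECT COUNT(*) AS n FROM "{table_name}"', conn)
    row_counts[table_name] = int(loaded["n"].iloc[0])
    assert row_counts[table_name] == len(df), f"Row count mismatch in {table_name}"
    print(f"Table {table_name}: {row_counts[table_name]} rows loaded.")

conn.close()
```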


Note that this does not include adding the relationships between tables. These were added later in `etl_process.py`. To avoid duplicating code across running environments, changes from this point on were made only to `etl_process.py`.
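
The contents of `etl_process.py` are not shown in this notebook, so the following is only a guess at what adding relationships could look like. Tables written with `to_sql(..., if_exists='replace')` carry no keys, so primary and foreign keys would have to be added afterwards with `ALTER TABLE` statements. The helper names and the choice of `dim_lga` and `fact_crashes` as the example pair are assumptions.

```python
def primary_key_sql(table: str, column: str) -> str:
    """Build a PostgreSQL statement that marks `column` as the primary key."""
    return f"ALTER TABLE {table} ADD PRIMARY KEY ({column});"


def foreign_key_sql(table: str, column: str, ref_table: str, ref_column: str) -> str:
    """Build a PostgreSQL statement linking `table.column` to `ref_table.ref_column`."""
    constraint_name = f"fk_{table}_{column}"  # hypothetical naming scheme
    return (
        f"ALTER TABLE {table} ADD CONSTRAINT {constraint_name} "
        f"FOREIGN KEY ({column}) REFERENCES {ref_table} ({ref_column});"
    )


# The primary key must exist before a foreign key can reference it.
statements = [
    primary_key_sql("dim_lga", "lga_id"),
    foreign_key_sql("fact_crashes", "lga_id", "dim_lga", "lga_id"),
]

for statement in statements:
    print(statement)
```

Each statement could then be executed against the database once all tables are loaded. Any key values in the fact table that have no match in the dimension table would cause the foreign key statement to fail, which doubles as a consistency check.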