# CITS3401 Project 1
### Kean Scott - 23850149

## Introduction

In Australia, motor vehicle accidents make up a significant proportion of potentially preventable 

The following python notebook contains all the steps for the analysis of the Australian Road Deaths Database (**ARDD**), also providing the steps taken to build a daata warehouse suitable for ongoing observation of key metrics and statistics. These key metrics will be defined to answer particular questions reguarding road deaths in Australia, in an effort to uncover underlying trends and correlations in these fatalities.  

Here are the original sources of my data:

- [Fatal crashes (updated Feb 2025)](https://catalogue.data.infrastructure.gov.au/dataset/australian-road-deaths-database/resource/457dbf98-419e-4f1e-a45f-4d568ff0ff69?inner_span=True)
- [Fatalities (updated Feb 2025)](https://catalogue.data.infrastructure.gov.au/dataset/australian-road-deaths-database/resource/80091814-9a39-444c-a329-b27561d8fcc6?inner_span=True)

## Step 1: Initial Data Exploration - Identifying the questions to be asked of our data warehouse, and defining it's dimensions

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np 

In [2]:
fatal_crashes = pd.read_csv("./data/fatal_crashes_feb2025.csv")
fatalities = pd.read_csv("./data/fatalities_feb2025.csv")

I want to explore the following questions using the data:
- How did fatalities vary based on the age and vehicle type?
- Which locations have the highest crash rates
- Is there any discrepancy in fatality rates between Male and Female drivers? How does has this changed over time?
- How do fatalities change based the type of road user and the speed limit? How does this vary over time, and between states?
- What time of day do fatal crashes occur the most? Is there a particlar day of the week that has the highest rate?

In [3]:
fatalities.head()

Unnamed: 0,Crash ID,State,Month,Year,Dayweek,Time,Crash Type,Bus Involvement,Heavy Rigid Truck Involvement,Articulated Truck Involvement,Speed Limit,Road User,Gender,Age,National Remoteness Areas 2021,SA4 Name 2021,National LGA Name 2021,National Road Type,Christmas Period,Easter Period
0,120161098270,NSW,4,2016,Monday,15:29:00,Multiple,No,No,No,100,Driver,Male,76,Inner Regional Australia,New England and North West,Tamworth Regional,Arterial Road,No,No
1,120161097596,NSW,4,2016,Tuesday,16:40:00,Multiple,No,No,Yes,110,Driver,Female,49,Inner Regional Australia,Far West and Orana,Dubbo Regional,National or State Highway,No,No
2,120161097596,NSW,4,2016,Tuesday,16:40:00,Multiple,No,No,Yes,110,Passenger,Female,80,Inner Regional Australia,Far West and Orana,Dubbo Regional,National or State Highway,No,No
3,120161098282,NSW,4,2016,Sunday,14:00:00,Single,No,No,No,80,Passenger,Male,13,Inner Regional Australia,Riverina,Lockhart,Local Road,No,No
4,120161098913,NSW,4,2016,Saturday,07:30:00,Single,No,No,No,110,Driver,Male,21,Outer Regional Australia,Far West and Orana,Narromine,National or State Highway,No,No


In [4]:
fatal_crashes.head()

Unnamed: 0,Crash ID,State,Month,Year,Dayweek,Time,Crash Type,Number Fatalities,Bus Involvement,Heavy Rigid Truck Involvement,Articulated Truck Involvement,Speed Limit,National Remoteness Areas 2021,SA4 Name 2021,National LGA Name 2021,National Road Type,Christmas Period,Easter Period
0,620172123,TAS,1,2017,Friday,16:10:00,Multiple,1,No,No,No,60,Inner Regional Australia,Hobart,Hobart,Sub-arterial Road,No,No
1,620172124,TAS,1,2017,Friday,19:00:00,Single,1,No,No,No,100,Outer Regional Australia,Launceston and North East,Northern Midlands,Arterial Road,No,No
2,620172125,TAS,2,2017,Monday,13:50:00,Multiple,1,No,Yes,No,100,Outer Regional Australia,Launceston and North East,Break O'Day,Arterial Road,No,No
3,620172126,TAS,3,2017,Thursday,11:29:00,Single,1,No,No,No,50,Inner Regional Australia,Launceston and North East,West Tamar,Local Road,No,No
4,620172127,TAS,3,2017,Saturday,13:20:00,Multiple,1,No,No,No,100,Remote Australia,West and North West,West Coast,Arterial Road,No,No


In [5]:
fatal_crashes_cols = [
    'Crash ID', 'State', 'Year', 'Month', 'Dayweek', 'Time', 
    'Crash Type', 'Number Fatalities', 'Speed Limit',
    'Christmas Period', 'Easter Period'
]

fatalities_cols = [
    'Crash ID', 'State', 'Year', 'Month', 'Dayweek', 'Time', 
    'Crash Type', 'Road User', 'Gender', 'Age', 
    'Speed Limit', 'Christmas Period', 'Easter Period'
]

fatal_crashes_filtered = fatal_crashes[fatal_crashes_cols].copy()
fatalities_filtered = fatalities[fatalities_cols].copy()


In [6]:

fatal_crashes_filtered2 = fatal_crashes_filtered[fatal_crashes_filtered['Speed Limit'] != -9].copy()
fatalities_filtered2 = fatalities_filtered[fatalities_filtered['Speed Limit'] != -9].copy()

fatal_crashes_filtered2.replace('Unknown', pd.NA, inplace=True)
fatal_crashes_filtered2.dropna(inplace=True)

fatalities_filtered2.replace('Unknown', pd.NA, inplace=True)
fatalities_filtered2.dropna(inplace=True)

print("fatal_crashes_filtered shape:", fatal_crashes_filtered2.shape)
print("fatalities_filtered shape:", fatalities_filtered2.shape)

fatal_crashes_filtered shape: (50192, 11)
fatalities_filtered shape: (55500, 13)


In [7]:
fatalities_filtered2.dtypes

Crash ID             int64
State               object
Year                 int64
Month                int64
Dayweek             object
Time                object
Crash Type          object
Road User           object
Gender              object
Age                  int64
Speed Limit          int64
Christmas Period    object
Easter Period       object
dtype: object

In [8]:
fatal_crashes_filtered2.dtypes

Crash ID              int64
State                object
Year                  int64
Month                 int64
Dayweek              object
Time                 object
Crash Type           object
Number Fatalities     int64
Speed Limit           int64
Christmas Period     object
Easter Period        object
dtype: object

In [9]:
print("Count of 'Unknown' values in fatal_crashes:")
print(((fatal_crashes_filtered2 == 'Unknown') | (fatal_crashes_filtered2.isna())).sum())

print("\nCount of 'Unknown' values in fatalities:")
print(((fatalities_filtered2 == 'Unknown') | (fatal_crashes_filtered2.isna())).sum())


Count of 'Unknown' values in fatal_crashes:
Crash ID             0
State                0
Year                 0
Month                0
Dayweek              0
Time                 0
Crash Type           0
Number Fatalities    0
Speed Limit          0
Christmas Period     0
Easter Period        0
dtype: int64

Count of 'Unknown' values in fatalities:
Age                  0.0
Christmas Period     0.0
Crash ID             0.0
Crash Type           0.0
Dayweek              0.0
Easter Period        0.0
Gender               0.0
Month                0.0
Number Fatalities    0.0
Road User            0.0
Speed Limit          0.0
State                0.0
Time                 0.0
Year                 0.0
dtype: float64


In [10]:
print(sorted(fatal_crashes_filtered2['Time'].unique()))


['00:00:00', '00:01:00', '00:02:00', '00:03:00', '00:04:00', '00:05:00', '00:06:00', '00:07:00', '00:08:00', '00:09:00', '00:10:00', '00:11:00', '00:12:00', '00:13:00', '00:14:00', '00:15:00', '00:16:00', '00:17:00', '00:18:00', '00:20:00', '00:21:00', '00:22:00', '00:23:00', '00:24:00', '00:25:00', '00:26:00', '00:27:00', '00:28:00', '00:29:00', '00:30:00', '00:31:00', '00:32:00', '00:33:00', '00:34:00', '00:35:00', '00:36:00', '00:37:00', '00:38:00', '00:39:00', '00:40:00', '00:41:00', '00:42:00', '00:43:00', '00:44:00', '00:45:00', '00:46:00', '00:47:00', '00:48:00', '00:49:00', '00:50:00', '00:51:00', '00:52:00', '00:53:00', '00:54:00', '00:55:00', '00:56:00', '00:57:00', '00:58:00', '00:59:00', '01:00:00', '01:01:00', '01:02:00', '01:03:00', '01:04:00', '01:05:00', '01:06:00', '01:07:00', '01:08:00', '01:09:00', '01:10:00', '01:11:00', '01:12:00', '01:13:00', '01:14:00', '01:15:00', '01:16:00', '01:17:00', '01:18:00', '01:19:00', '01:20:00', '01:21:00', '01:22:00', '01:23:00', '01

In [11]:
def convert_time(time_str):
    """Convert time string, rounded down to the nearest hour.
    Args:
        time_str (str): Time string in the format 'HH:MM:SS'.
    Returns:
        int: Hour of the day (0-23).
    """
    try:
        hour = int(str(time_str).split(':')[0])
        if 0 <= hour <= 23:
            return int(hour)
        else:
            print(time_str)
            return None
    except (ValueError, IndexError, AttributeError):
        print(time_str)
        return None

In [12]:
fatal_crashes_filtered2['Time']

0        16:10:00
1        19:00:00
2        13:50:00
3        11:29:00
4        13:20:00
           ...   
51474    11:00:00
51475    13:00:00
51476    21:27:00
51477    23:00:00
51478    09:56:00
Name: Time, Length: 50192, dtype: object

In [13]:
fatal_crashes_filtered2['hour'] = fatal_crashes_filtered2['Time'].apply(convert_time)
fatalities_filtered2['hour'] = fatal_crashes_filtered2['Time'].apply(convert_time)

99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99
99:99:99


In [14]:
fatal_crashes_filtered3 = fatal_crashes_filtered2.dropna().copy()
fatalities_filtered3 = fatalities_filtered2.dropna().copy()

fatal_crashes_filtered3['hour'] = fatal_crashes_filtered3['hour'].astype(int)
fatalities_filtered3['hour'] = fatalities_filtered3['hour'].astype(int)

In [15]:
fatal_crashes_filtered3.dtypes

Crash ID              int64
State                object
Year                  int64
Month                 int64
Dayweek              object
Time                 object
Crash Type           object
Number Fatalities     int64
Speed Limit           int64
Christmas Period     object
Easter Period        object
hour                  int32
dtype: object

In [16]:
fatalities_filtered3.dtypes

Crash ID             int64
State               object
Year                 int64
Month                int64
Dayweek             object
Time                object
Crash Type          object
Road User           object
Gender              object
Age                  int64
Speed Limit          int64
Christmas Period    object
Easter Period       object
hour                 int32
dtype: object

In [17]:
fatal_crashes_filtered3.rename(columns={'Crash ID': 'crash_id', 'State': 'state', 'Year':'year', 
                                       'Month':'month', 'Dayweek':'dayweek', 'Time':'time', 
                                       'Crash Type': 'crash_type', 'Number Fatalities': 'num_fatalities',
                                       'Speed Limit': 'speed_limit', 'Christmas Period': 'christmas',
                                       'Easter Period': 'easter'}, inplace=True)

fatalities_filtered3.rename(columns={'Crash ID': 'crash_id', 'State': 'state', 'Year':'year', 
                                       'Month':'month', 'Dayweek':'dayweek', 'Time':'time', 
                                       'Crash Type': 'crash_type', 'Road User': 'road_user',
                                       'Gender': 'gender', 'Age':'age', 'Speed Limit': 'speed_limit', 
                                       'Christmas Period': 'christmas', 'Easter Period': 'easter'}, inplace=True)

In [18]:
fatal_crashes_filtered3.head()

Unnamed: 0,crash_id,state,year,month,dayweek,time,crash_type,num_fatalities,speed_limit,christmas,easter,hour
0,620172123,TAS,2017,1,Friday,16:10:00,Multiple,1,60,No,No,16
1,620172124,TAS,2017,1,Friday,19:00:00,Single,1,100,No,No,19
2,620172125,TAS,2017,2,Monday,13:50:00,Multiple,1,100,No,No,13
3,620172126,TAS,2017,3,Thursday,11:29:00,Single,1,50,No,No,11
4,620172127,TAS,2017,3,Saturday,13:20:00,Multiple,1,100,No,No,13


In [19]:
dimtime = fatal_crashes_filtered3[['year', 'month', 'dayweek', 'hour', 'christmas', 'easter']].drop_duplicates().reset_index(drop=True)

In [20]:
dimtime.head()

Unnamed: 0,year,month,dayweek,hour,christmas,easter
0,2017,1,Friday,16,No,No
1,2017,1,Friday,19,No,No
2,2017,2,Monday,13,No,No
3,2017,3,Thursday,11,No,No
4,2017,3,Saturday,13,No,No


In [21]:
# checking for null values
dimtime.isnull().sum()

year         0
month        0
dayweek      0
hour         0
christmas    0
easter       0
dtype: int64

In [22]:
dimtime['time_id'] = dimtime.index + 1

dimtime = dimtime[['time_id', 'year', 'month', 'dayweek', 'hour', 'christmas', 'easter']]

In [23]:
dimtime.head()

Unnamed: 0,time_id,year,month,dayweek,hour,christmas,easter
0,1,2017,1,Friday,16,No,No
1,2,2017,1,Friday,19,No,No
2,3,2017,2,Monday,13,No,No
3,4,2017,3,Thursday,11,No,No
4,5,2017,3,Saturday,13,No,No


In [24]:
print(dimtime.info())
print(dimtime.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35047 entries, 0 to 35046
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   time_id    35047 non-null  int64 
 1   year       35047 non-null  int64 
 2   month      35047 non-null  int64 
 3   dayweek    35047 non-null  object
 4   hour       35047 non-null  int32 
 5   christmas  35047 non-null  object
 6   easter     35047 non-null  object
dtypes: int32(1), int64(3), object(3)
memory usage: 1.7+ MB
None
            time_id          year         month          hour
count  35047.000000  35047.000000  35047.000000  35047.000000
mean   17524.000000   2005.276600      6.546951     12.467173
std    10117.341779     10.477128      3.514356      6.474411
min        1.000000   1989.000000      1.000000      0.000000
25%     8762.500000   1996.000000      3.000000      7.000000
50%    17524.000000   2005.000000      7.000000     13.000000
75%    26285.500000   2014.000000     10.0

Next I will construct the dim_location table.

In [25]:
dimlocation = fatal_crashes_filtered3[['state']].drop_duplicates().reset_index(drop=True)
dimlocation.columns = ['state']

In [26]:
print(dimlocation['state'].unique())

['TAS' 'WA' 'NT' 'NSW' 'QLD' 'ACT' 'VIC' 'SA']


In [27]:
dimlocation.reset_index()
dimlocation['location_id'] = dimlocation.index + 1

dimlocation = dimlocation[['location_id', 'state']]

In [28]:
dimlocation.head(10)

Unnamed: 0,location_id,state
0,1,TAS
1,2,WA
2,3,NT
3,4,NSW
4,5,QLD
5,6,ACT
6,7,VIC
7,8,SA


Now for the dimperson table.

In [29]:
# function to convert age to age_group
def age_group(age):
    if age < 18:
        return  '0-17'
    elif age <= 24:
        return '18-24'
    elif age <= 29:
        return '25-29'
    elif age <= 34:
        return '30-34'
    elif age <= 39:
        return '35-39'
    elif age <= 44:
        return '40-44'
    elif age <= 49:
        return '45-49'
    elif age <= 54:
        return '50-54'
    elif age <= 59:
        return '55-59'
    elif age <= 64:
        return '60-64'
    elif age <= 69:
        return '65-69'
    else:
        return '70+'

In [30]:
# create age_group column
fatalities_filtered3['age_group'] = fatalities_filtered3['age'].apply(age_group)

# create dimpersondetails dataframe
dimpersondetails = fatalities_filtered3[['gender', 'age_group', 'road_user']].drop_duplicates().reset_index(drop=True)
dimpersondetails.reset_index(drop=True, inplace=True)

In [31]:
print(dimpersondetails['gender'].unique())
print(dimpersondetails['road_user'].unique())
print(dimpersondetails['age_group'].unique())

['Male' 'Female']
['Driver' 'Passenger' 'Motorcycle rider' 'Pedal cyclist' 'Pedestrian'
 'Motorcycle pillion passenger']
['70+' '45-49' '0-17' '18-24' '25-29' '50-54' '65-69' '30-34' '40-44'
 '60-64' '55-59' '35-39']


In [32]:
# add primary key
dimpersondetails['person_id'] = dimpersondetails.index + 1

# reorder columns
dimpersondetails = dimpersondetails[['person_id', 'gender', 'age_group', 'road_user']]

In [33]:
dimpersondetails.head()

Unnamed: 0,person_id,gender,age_group,road_user
0,1,Male,70+,Driver
1,2,Female,45-49,Driver
2,3,Female,70+,Passenger
3,4,Male,0-17,Passenger
4,5,Male,18-24,Driver


In [34]:
dimcrashdetails = fatal_crashes_filtered3[['crash_type', 'speed_limit']].drop_duplicates().reset_index(drop=True)

In [35]:
dimcrashdetails.describe()

Unnamed: 0,speed_limit
count,31.0
mean,58.225806
std,37.717113
min,5.0
25%,25.0
50%,60.0
75%,85.0
max,130.0


In [36]:
print(dimcrashdetails[dimcrashdetails['speed_limit'] <= 0])

Empty DataFrame
Columns: [crash_type, speed_limit]
Index: []


In [37]:
# add primary key
dimcrashdetails['crashdetails_id'] = dimcrashdetails.index + 1

# reorder columns
dimcrashdetails = dimcrashdetails[['crashdetails_id', 'crash_type', 'speed_limit']]

In [38]:
dimcrashdetails.head()

Unnamed: 0,crashdetails_id,crash_type,speed_limit
0,1,Multiple,60
1,2,Single,100
2,3,Multiple,100
3,4,Single,50
4,5,Multiple,80


In [39]:
dimtime.head()

Unnamed: 0,time_id,year,month,dayweek,hour,christmas,easter
0,1,2017,1,Friday,16,No,No
1,2,2017,1,Friday,19,No,No
2,3,2017,2,Monday,13,No,No
3,4,2017,3,Thursday,11,No,No
4,5,2017,3,Saturday,13,No,No


In [40]:
dimtime.dtypes

time_id       int64
year          int64
month         int64
dayweek      object
hour          int32
christmas    object
easter       object
dtype: object

In [41]:
dimlocation.head()

Unnamed: 0,location_id,state
0,1,TAS
1,2,WA
2,3,NT
3,4,NSW
4,5,QLD


In [42]:
dimpersondetails.head()

Unnamed: 0,person_id,gender,age_group,road_user
0,1,Male,70+,Driver
1,2,Female,45-49,Driver
2,3,Female,70+,Passenger
3,4,Male,0-17,Passenger
4,5,Male,18-24,Driver


In [43]:
factfatality = fatalities_filtered3.drop_duplicates(subset='crash_id').copy()
print(factfatality[['year', 'month', 'dayweek', 'hour']].dtypes)

# Create fatality_id
factfatality['fatality_id'] = factfatality.index + 1

# Join with dim_person
factfatality = factfatality.merge(dimpersondetails, how='left', 
    left_on=['gender', 'age_group', 'road_user'], 
    right_on=['gender', 'age_group', 'road_user'])

# Join with dim_location
factfatality = factfatality.merge(dimlocation, how='left', 
    left_on=['state'],
    right_on=['state'])

# Join with dim_time
factfatality = factfatality.merge(dimtime, how='left',
    left_on=['year', 'month', 'dayweek', 'hour', 'christmas', 'easter'],
    right_on=['year', 'month', 'dayweek', 'hour', 'christmas', 'easter'])

# Join with dim_crashdetails
factfatality = factfatality.merge(dimcrashdetails, how='left',
    left_on=['crash_type', 'speed_limit'],
    right_on=['crash_type', 'speed_limit'])


year        int64
month       int64
dayweek    object
hour        int32
dtype: object


In [44]:
factfatality.head()

Unnamed: 0,crash_id,state,year,month,dayweek,time,crash_type,road_user,gender,age,speed_limit,christmas,easter,hour,age_group,fatality_id,person_id,location_id,time_id,crashdetails_id
0,120161098270,NSW,2016,4,Monday,15:29:00,Multiple,Driver,Male,76,100,No,No,16,70+,1,1,4,,3
1,120161097596,NSW,2016,4,Tuesday,16:40:00,Multiple,Driver,Female,49,110,No,No,19,45-49,2,2,4,,9
2,120161098282,NSW,2016,4,Sunday,14:00:00,Single,Passenger,Male,13,80,No,No,11,0-17,4,4,4,1435.0,11
3,120161098913,NSW,2016,4,Saturday,07:30:00,Single,Driver,Male,21,110,No,No,13,18-24,5,5,4,,6
4,120161098283,NSW,2016,4,Sunday,17:00:00,Single,Driver,Male,26,80,No,No,17,25-29,8,7,4,1430.0,11


In [45]:
factfatality = factfatality[[
    'fatality_id', 'crash_id', 'person_id'
]]


In [46]:
print(factfatality.isna().sum())

fatality_id    0
crash_id       0
person_id      0
dtype: int64


In [47]:
factfatality.head()

Unnamed: 0,fatality_id,crash_id,person_id
0,1,120161098270,1
1,2,120161097596,2
2,4,120161098282,4
3,5,120161098913,5
4,8,120161098283,7


In [48]:
factcrash = fatal_crashes_filtered3.drop_duplicates(subset='crash_id').copy()

factcrash = factcrash.merge(dimlocation, how='left',
    left_on=['state'],
    right_on=['state'])

factcrash = factcrash.merge(dimtime, how='left',
    left_on=['year', 'month', 'dayweek', 'hour'],
    right_on=['year', 'month', 'dayweek', 'hour'])

factcrash = factcrash.merge(dimcrashdetails, how='left',
    left_on=['crash_type', 'speed_limit'],
    right_on=['crash_type', 'speed_limit'])




In [49]:
factcrash = factcrash[[
    'crash_id', 'location_id', 'time_id', 'crashdetails_id', 'num_fatalities'
]]

In [50]:
factcrash.head()

Unnamed: 0,crash_id,location_id,time_id,crashdetails_id,num_fatalities
0,620172123,1,1,1,1
1,620172124,1,2,2,1
2,620172125,1,3,3,1
3,620172126,1,4,4,1
4,620172127,1,5,3,1


In [51]:
# export tables to .csv files
dimtime.to_csv('data/dimtime.csv', index=False)
dimlocation.to_csv('data/dimlocation.csv', index=False)
dimpersondetails.to_csv('data/dimpersondetails.csv', index=False)
dimcrashdetails.to_csv('data/dimcrashdetails.csv', index=False)

factfatality.to_csv('data/factfatality.csv', index=False)
factcrash.to_csv('data/factcrash.csv', index=False)
