In this notebook we will design the data warehouse for the datasets using Kimball's four-step design process.

### Step 1: Identify the Process

The primary process being modeled is road **crashes** and **fatalities**. We are interested in understanding:
- **Crashes**: Information about crashes themselves.
- **Fatalities**: Information about individuals involved in crashes who have been fatally injured.

This modeling process will allow us to analyse crash events and fatalities by various factors like time, location, person attributes, vehicle involvement and more.


### Step 2: Determine the grain at which facts can be stored


The **grain** (or level of detail) for the fact tables in this data warehouse is defined as follows:

1. **Fact_Crashes**:
   - One row per crash event, with detailed attributes like the date, location, time, vehicle involvement, and road type.

2. **Fact_Fatalities**:
   - One row per **fatality** (one fatally injured person in a crash), capturing details about the person's age, gender and type of road user.

3. **Fact_Number**:
   - One row per date, containing aggregated metrics about the **total fatalities** and **total crashes** that occurred on that specific day.

The grain of **Fact_Crashes** is at the level of a **single crash event**. This is where each crash's details, such as time, date, LGA, vehicles, and the event, are captured.

The grain of **Fact_Fatalities** is at the level of a **single fatality** associated with a crash. It stores data about the individual involved, including demographics and the fatality count.

The grain of **Fact_Number** is at the **date level**, where the total number of fatalities and crashes for each day are stored for high-level aggregation.
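
To make the three grains concrete, here is a minimal sketch that starts from fatality-level rows and rolls them up to the crash level and then to the date level. The sample rows and the `crash_date` field are invented for illustration and are not taken from the datasets.

```python
from collections import Counter
from datetime import date

# Hypothetical sample rows, invented for illustration only.
# Fact_Fatalities grain: one row per fatality.
fatalities = [
    {"fatality_id": 1, "crash_id": 101, "crash_date": date(2021, 3, 5)},
    {"fatality_id": 2, "crash_id": 101, "crash_date": date(2021, 3, 5)},
    {"fatality_id": 3, "crash_id": 102, "crash_date": date(2021, 3, 5)},
    {"fatality_id": 4, "crash_id": 103, "crash_date": date(2021, 3, 6)},
]

# Fact_Crashes grain: one row per crash, so collapse the fatalities by crash_id.
crashes = {row["crash_id"]: row["crash_date"] for row in fatalities}

# Fact_Number grain: one row per date, holding both daily totals.
total_fatalities = Counter(row["crash_date"] for row in fatalities)
total_crashes = Counter(crashes.values())

fact_number = [
    {
        "date": d,
        "total_fatalities": total_fatalities[d],
        "total_crashes": total_crashes[d],
    }
    for d in sorted(total_fatalities)
]

for row in fact_number:
    print(row)
```

Each roll-up loses detail: the crash level no longer distinguishes individual people, and the date level no longer distinguishes individual crashes.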

### Step 3: Choose the dimensions

**Dim_Date**
| Column Name | Description |
|-------------|-------------|
| `date_id` (PK) | Unique identifier for each date. |
| `year` | Year of the crash/fatality. |
| `month` | Month of the crash/fatality. |
| `day` | Day of the month (1-31). |
| `day_of_week` | Day of the week (e.g., Monday, Tuesday). |
| `is_weekend` | Boolean indicating if the crash occurred on a weekend. |
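
As a rough sketch of how one `Dim_Date` row could be derived from a calendar date, the helper below fills every column in the table above. The function name `make_dim_date_row` and the `YYYYMMDD` encoding of `date_id` are assumptions made for this example; the design itself does not prescribe how `date_id` is generated.

```python
from datetime import date

# Fixed list so the result does not depend on the system locale.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def make_dim_date_row(d: date) -> dict:
    """Build one Dim_Date row from a calendar date (hypothetical helper)."""
    return {
        # Assumption: date_id is the integer YYYYMMDD.
        "date_id": d.year * 10000 + d.month * 100 + d.day,
        "year": d.year,
        "month": d.month,
        "day": d.day,
        # date.weekday() returns 0 for Monday through 6 for Sunday.
        "day_of_week": DAY_NAMES[d.weekday()],
        "is_weekend": d.weekday() >= 5,
    }

print(make_dim_date_row(date(2021, 3, 6)))
```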

**Dim_State**
| Column Name | Description |
|-------------|-------------|
| `state_id` (PK) | Unique identifier for each state. |
| `state_name` | Name of the state. |

**Dim_LGA**
| Column Name | Description |
|-------------|-------------|
| `lga_id` (PK) | Unique identifier for each LGA. |
| `lga_name` | Local Government Area name. |
| `state_id` (FK) | Reference to `Dim_State` table. |
| `national_remoteness_area` | Area classification based on remoteness. |
| `dwelling_count` | Number of dwellings in the LGA. |

**Dim_Time**
| Column Name | Description |
|-------------|-------------|
| `crash_id` (PK) | Crash identifier, used as the unique key for the time record. |
| `crash_time` | Exact time of the crash (in timestamp format). |
| `time_of_day` | Time of day (e.g., Morning, Afternoon, Evening). |
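
The `time_of_day` label can be derived from `crash_time` with a simple bucketing rule. The sketch below is one possible version: the hour boundaries and the extra `Night` bucket are assumptions chosen for illustration, since the table only names Morning, Afternoon and Evening as examples.

```python
from datetime import time

def time_of_day(t: time) -> str:
    """Map a clock time to a coarse label (assumed boundaries)."""
    if 5 <= t.hour < 12:
        return "Morning"
    if 12 <= t.hour < 17:
        return "Afternoon"
    if 17 <= t.hour < 21:
        return "Evening"
    # Assumption: everything from 21:00 to 04:59 is labelled "Night".
    return "Night"

print(time_of_day(time(8, 30)))
print(time_of_day(time(23, 15)))
```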

**Dim_Vehicle**
| Column Name | Description |
|-------------|-------------|
| `crash_id` (PK) | Unique identifier for vehicle data related to a crash. |
| `bus_involvement` | Boolean indicating if a bus was involved. |
| `heavy_rigid_truck_involvement` | Boolean indicating if a heavy rigid truck was involved. |
| `articulated_truck_involvement` | Boolean indicating if an articulated truck was involved. |

**Dim_Person**
| Column Name | Description |
|-------------|-------------|
| `person_id` (PK) | Key derived from the combination of `CrashID`, `Age`, `Gender` and `RoadUser`. |
| `crash_id` (FK) | ID of the crash in which the person was involved. |
| `gender` | Gender of the individual. |
| `age` | Age of the individual. |
| `age_group` | Age group of the individual (e.g., 18-25, 26-40). |
| `road_user` | Type of road user (e.g., Pedestrian, Driver, Passenger). |
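
The two derived columns in `Dim_Person` could be computed along the following lines. This is only a sketch: the band boundaries other than `18-25` and `26-40`, the `Unknown` label, the underscore-separated key format, and the function names are all assumptions made for this example.

```python
# Assumption: illustrative age bands; only "18-25" and "26-40" appear in the table.
AGE_BANDS = [
    (0, 17, "0-17"),
    (18, 25, "18-25"),
    (26, 40, "26-40"),
    (41, 64, "41-64"),
]

def age_group(age: int) -> str:
    """Return the age band label for an age in whole years."""
    for low, high, label in AGE_BANDS:
        if low <= age <= high:
            return label
    # Ages above the last band get an open-ended label; anything else
    # (for example a negative placeholder value) is treated as unknown.
    return "65+" if age >= 65 else "Unknown"

def make_person_id(crash_id, age, gender, road_user) -> str:
    """Combine the four source attributes into one key (assumed format)."""
    return f"{crash_id}_{age}_{gender}_{road_user}"

print(age_group(34))
print(make_person_id(20211234, 34, "Male", "Driver"))
```

A key built this way is unique only if no two people in the same crash share the same age, gender and road-user type, so it is worth checking the result for duplicates before using it as a primary key.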

**Dim_Event**
| Column Name | Description |
|-------------|-------------|
| `crash_id` (PK) | Crash identifier, used as the unique key for the event record. |
| `christmas_period` | Boolean indicating if the crash occurred during the Christmas period. |
| `easter_period` | Boolean indicating if the crash occurred during the Easter period. |

**Dim_Road**
| Column Name | Description |
|-------------|-------------|
| `crash_id` (PK) | Unique identifier for road data. |
| `speed_limit` | Speed limit of the road where the crash occurred. |
| `national_road_type` | Type of road (e.g., highway, local road). |

### Step 4: Identify the numeric measures for the facts



**Fact_Crashes**
| Column Name | Description |
|-------------|-------------|
| `crash_id` (PK) | Unique identifier for the crash event. |
| `date_id` (FK) | Reference to `Dim_Date` table. |
| `lga_id` (FK) | Reference to `Dim_LGA` table, which in turn references `Dim_State`. |
| `state_id` (FK) | Reference to `Dim_State` table. |

**Fact_Fatalities**
| Column Name | Description |
|-------------|-------------|
| `fatality_id` (PK) | Unique identifier for each fatality. |
| `person_id` (FK) | Reference to `Dim_Person` table. |
| `crash_id` (FK) | Reference to `Fact_Crashes` table. |


**Fact_Number**
| Column Name | Description |
|-------------|-------------|
| `number_date_id` (PK) | Unique identifier for the daily summary record. |
| `date_id` (FK) | Reference to `Dim_Date` table. |
| `total_fatalities` | Total number of fatalities on the given date (and optionally by state). |
| `total_crashes` | Total number of crashes on the given date (and optionally by state). |
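
To show how the fact tables relate, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. Only a subset of the columns is created, the sample rows are invented, and deriving `Fact_Number` with a `GROUP BY` over the other two fact tables is one possible loading approach rather than a required one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Reduced schema: only the columns needed to illustrate the relationships.
conn.executescript("""
CREATE TABLE Dim_Date (
    date_id INTEGER PRIMARY KEY,
    year INTEGER, month INTEGER, day INTEGER
);
CREATE TABLE Fact_Crashes (
    crash_id INTEGER PRIMARY KEY,
    date_id INTEGER REFERENCES Dim_Date(date_id)
);
CREATE TABLE Fact_Fatalities (
    fatality_id INTEGER PRIMARY KEY,
    crash_id INTEGER REFERENCES Fact_Crashes(crash_id)
);
CREATE TABLE Fact_Number (
    number_date_id INTEGER PRIMARY KEY,
    date_id INTEGER REFERENCES Dim_Date(date_id),
    total_fatalities INTEGER,
    total_crashes INTEGER
);
""")

# Hypothetical sample data, invented for illustration only.
conn.executemany("INSERT INTO Dim_Date VALUES (?, ?, ?, ?)",
                 [(20210305, 2021, 3, 5), (20210306, 2021, 3, 6)])
conn.executemany("INSERT INTO Fact_Crashes VALUES (?, ?)",
                 [(101, 20210305), (102, 20210305), (103, 20210306)])
conn.executemany("INSERT INTO Fact_Fatalities VALUES (?, ?)",
                 [(1, 101), (2, 101), (3, 102), (4, 103)])

# Populate the date-level summary; number_date_id is assigned automatically.
conn.execute("""
INSERT INTO Fact_Number (date_id, total_fatalities, total_crashes)
SELECT c.date_id, COUNT(f.fatality_id), COUNT(DISTINCT c.crash_id)
FROM Fact_Crashes AS c
LEFT JOIN Fact_Fatalities AS f ON f.crash_id = c.crash_id
GROUP BY c.date_id
""")

rows = conn.execute(
    "SELECT date_id, total_fatalities, total_crashes "
    "FROM Fact_Number ORDER BY date_id"
).fetchall()
print(rows)
```

`COUNT(DISTINCT c.crash_id)` is needed because the join repeats a crash once per fatality, which would otherwise inflate `total_crashes`.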