# CS145: Project 2 | Traffic Fatalities Across the United States

## Collaborators:
Please list the names and SUNet IDs of your collaborators below:
* *Leon MacAlister, leonmac*
* *Jonathan Affeld, jaffeld*

---
## Project Overview
---

---
**MAIN QUESTION**

- Are traffic fatalities consistent across the United States, or do specific states have higher or lower fatality rates? What factors might contribute to these regional variations?

**PROJECT OVERVIEW**

- We are using the nhtsa_traffic_fatalities dataset to determine how differences in contributing factors affect the fatality rate in accidents from state to state. The factors we will explore include the number of vehicles involved in a crash, the number of people involved in a crash, day of the week, hour, month, type of land use, atmospheric conditions, time to scene of crash, time to hospital, and number of drunk drivers.

**SUPPLEMENTARY QUESTIONS**
- Is the number of deaths in a crash determined by the emergency response time? Perhaps the severity of a crash is contingent upon the effectiveness of emergency services or proximity to hospitals where emergency care can be provided. And does the average emergency response time differ from state to state?

- What times/days/months are the most fatal, and what proportion of the crashes during those times involve drunk drivers? Are there more drunk drivers in certain states, and do the hours with the most fatal crashes vary across states?

- What atmospheric conditions cause a higher percentage of fatal accidents? And do differences in unfavourable conditions across states increase the number of fatal crashes in a statistically significant way?

- How dependent is the severity of fatal car crashes on the number of drunk drivers in an accident?

- Does the type of land use affect the fatality rate across states?

- How does the number of vehicles/people involved in fatal accidents affect the fatality rate?
---

In [None]:
%pip install pandas
%pip install pandas-gbq
%pip install matplotlib
%pip install plotly
%pip install plotly-express

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import plotly_express as px

In [None]:
# Run this cell to authenticate yourself to BigQuery
from google.oauth2 import service_account
key_path = './cs145-project2-406000-9a59fc7c0b3d.json'
credential = service_account.Credentials.from_service_account_file(key_path)

In [None]:
# Initialize BigQuery client
from google.cloud import bigquery
%load_ext google.cloud.bigquery
%env GOOGLE_APPLICATION_CREDENTIALS=$key_path
project_id = "cs145-project2-406000"
client = bigquery.Client(credentials=credential, project=project_id)

----
## Analysis of Dataset - **nhtsa_traffic_fatalities**
----

The nhtsa_traffic_fatalities dataset contains 108 tables: 18 different schemas (e.g. impaired driving, damage reports, traffic violations, series of events) for each of 6 years (2015-2020 inclusive). Each table contains one year's worth of traffic-related data for one schema. We will now delve into the schemas.
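Because every table follows a `<schema>_<year>` naming pattern, the fully qualified table names can be generated instead of typed out. The snippet below is a minimal sketch, assuming the pattern seen in the queries in this notebook holds for every schema and year. `table_id` is a hypothetical helper, and the schema list only contains names that are queried in this notebook, not all 18.

```python
DATASET = "bigquery-public-data.nhtsa_traffic_fatalities"
YEARS = range(2015, 2021)  # 2015-2020 inclusive

# Only the schemas previewed in this notebook; the dataset has 18 in total.
SCHEMAS = [
    "accident", "damage", "distract", "drimpair", "factor", "maneuver",
    "nmcrash", "nmimpair", "pbtype", "safetyeq", "violatn", "vision",
]


def table_id(schema: str, year: int) -> str:
    """Build a fully qualified table name (hypothetical helper)."""
    if year not in YEARS:
        raise ValueError(f"year {year} is outside 2015-2020")
    return f"{DATASET}.{schema}_{year}"


all_tables = [table_id(schema, year) for schema in SCHEMAS for year in YEARS]

print(table_id("accident", 2016))
print(len(all_tables))  # 12 schemas x 6 years = 72
```

With all 18 schemas in the list, the same comprehension would yield the 108 tables mentioned above.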


### Taste of the Dataset (just using 2016)

`ACCIDENTS` (~30mb each for 6 years - 180mb)

Column names: (All Primary Keys)

state_number, state_name, consecutive_number, number_of_vehicle_forms_submitted_all, number_of_motor_vehicles_in_transport_mvit, number_of_parked_working_vehicles, number_of_forms_submitted_for_persons_not_in_motor_vehicles, number_of_forms_submitted_for_persons_in_motor_vehicles, number_of_persons_in_motor_vehicles_in_transport_mvit, number_of_persons_not_in_motor_vehicles_in_transport_mvit, county, county_name, city, city_name, day_of_crash, day_name, month_of_crash, month_of_crash_name, year_of_crash, day_of_week, day_of_week_name, hour_of_crash, hour_of_crash_name, minute_of_crash, minute_of_crash_name, national_highway_system, national_highway_system_name, route_signing, route_signing_name, trafficway_identifier, trafficway_identifier_2, land_use, land_use_name, functional_system, functional_system_name, ownership, ownership_name, milepoint, milepoint_name, latitude, latitude_name, longitude, longitude_name, special_jurisdiction, special_jurisdiction_name, first_harmful_event, first_harmful_event_name, manner_of_collision, manner_of_collision_name, relation_to_junction_within_interchange_area, relation_to_junction_within_interchange_area_name, relation_to_junction_specific_location, relation_to_junction_specific_location_name, type_of_intersection, type_of_intersection_name, work_zone, work_zone_name, relation_to_trafficway, relation_to_trafficway_name, light_condition, light_condition_name, atmospheric_conditions_1, atmospheric_conditions_1_name, atmospheric_conditions_2, atmospheric_conditions_2_name, atmospheric_conditions, atmospheric_conditions_name, school_bus_related, school_bus_related_name, rail_grade_crossing_identifier, rail_grade_crossing_identifier_name, hour_of_notification, hour_of_notification_name, minute_of_notification, minute_of_notification_name, hour_of_arrival_at_scene, hour_of_arrival_at_scene_name, minute_of_arrival_at_scene, minute_of_arrival_at_scene_name, hour_of_ems_arrival_at_hospital, hour_of_ems_arrival_at_hospital_name, 
minute_of_ems_arrival_at_hospital, minute_of_ems_arrival_at_hospital_name, related_factors_crash_level_1, related_factors_crash_level_1_name, related_factors_crash_level_2, related_factors_crash_level_2_name, related_factors_crash_level_3, related_factors_crash_level_3_name, number_of_fatalities, number_of_drunk_drivers, timestamp_of_crash

ACCIDENTS is the main table of the dataset. It gives a high-level description of each crash: when and where it happened and what the crash itself involved. It also has a handful of other useful columns, such as atmospheric_conditions, which indicates whether it was raining, cloudy, etc., and light_condition, which gives a numerical code for how bright or dark it was (e.g. dark and lighted, dusk, daylight). The more interesting columns, however, are the crash time and the arrival time of emergency services at the scene, which may give us insight into whether a quicker emergency response helps reduce the number of fatalities in a car accident. Note that consecutive_number is the code for the crash and covers all cars involved in that particular accident. We also use consecutive_number to join the other tables and make the dataset more robust. There are some data inconsistencies from year to year; for example, 2015 does not provide city data.

Details on the columns chosen to be evaluated will be given below.

In [None]:
%%bigquery --project $project_id
SELECT *
FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2016`
LIMIT 5


`DAMAGE` (~11.5mb over 6 years - 69mb)
state_number
state_name
consecutive_number
vehicle_number
damaged_areas
damaged_areas_name

damaged_areas and damaged_areas_name detail which parts of the car were damaged in the fatal accident, e.g. undercarriage. This table is less helpful, as we are focused on contributing factors, not the result of a crash.

In [None]:
%%bigquery --project $project_id
SELECT *
FROM `bigquery-public-data.nhtsa_traffic_fatalities.damage_2016`
LIMIT 5

`DISTRACT` (~3mb over 6 years - 18mb)
Consecutive_number  (secondary)
Vehicle_number  (secondary)
Driver_distracted_by_name (primary)

This is a simple add-on table that we can join to the accidents table on consecutive_number. Here driver_distracted_by_name tells us whether or not the driver was distracted. If the driver was distracted, it gives the source of the distraction, e.g. “distracted by an outside person”; otherwise it gives “not reported” or “unknown”. Distract contains an entry for each driver involved in the crash, so the consecutive_number is not unique.

In [None]:
%%bigquery --project $project_id
SELECT *
FROM `bigquery-public-data.nhtsa_traffic_fatalities.distract_2016`
LIMIT 5

`DRIMPAIR` (~3.5mb over 6 years - 21mb)
Consecutive_number  (secondary)
Vehicle_number  (secondary)
Condition_impairment_at_time_of_crash_driver_name (primary)

This is another add on table where we can join consecutive numbers to the accidents table. Here Condition_impairment_at_time_of_crash_driver_name will give us information on whether or not the driver had a physical impairment that may have contributed to the crash e.g. "Vision impairment" or "Deaf". If not it will give us “None/Apparently Normal” or “unknown”. Drimpair contains perspectives from each driver involved in the crash so the consecutive_number is not unique.

In [None]:
%%bigquery --project $project_id
SELECT *
FROM `bigquery-public-data.nhtsa_traffic_fatalities.drimpair_2016`
LIMIT 5

`FACTOR` (~2.5mb over 6 years - 15mb)
state_number (Secondary)
state_name (Secondary)
consecutive_number (Secondary)
vehicle_number (Secondary)
contributing_circumstances_motor_vehicle (primary)
contributing_circumstances_motor_vehicle_name (primary)

contributing_circumstances_motor_vehicle_name gives the name of a mechanical defect in a vehicle that may have contributed to the fatal crash, such as "faulty brakes". However, this table is not very relevant, as we found that only a very small number of fatal accidents are caused by mechanical failures. As with the tables above, Factor contains an entry for each driver involved in the crash, so the consecutive_number is not unique.


In [None]:
%%bigquery --project $project_id
SELECT *
FROM `bigquery-public-data.nhtsa_traffic_fatalities.factor_2016`
LIMIT 5

`MANEUVER` (~3mb over 6 years - 18mb)
Consecutive_number  (secondary)
Vehicle_number  (secondary)
Driver_maneuvered_to_avoid_name (primary)

Another add-on table that we can join to the accidents table on consecutive_number. Here Driver_maneuvered_to_avoid_name tells us what the driver maneuvered to avoid, e.g. “motor vehicle” or "pedestrian", or gives "did not maneuver" if they did not. We wondered if a person who was drunk was less likely to maneuver in time, thus increasing the severity of accidents. Maneuver contains an entry for each driver involved in the crash, so the consecutive_number is not unique.


In [None]:
%%bigquery --project $project_id
SELECT *
FROM `bigquery-public-data.nhtsa_traffic_fatalities.maneuver_2016`
LIMIT 5

`NMCRASH` (~<1mb over 6 years - 5mb)
Consecutive_number  (secondary)
Vehicle_number  (secondary)
Non_motorist_contributing_circumstances_name (primary)

Another add-on table that we can join to the accidents table on consecutive_number. Here Non_motorist_contributing_circumstances_name tells us how a non-motorist may have contributed to a crash, for example "wearing dark clothes” or "drunk". Nmcrash contains an entry for each non-motorist involved in the crash, so the consecutive_number is not unique. Additionally, not all accidents involve non-motorists, so not all accidents have an entry in nmcrash.

In [None]:
%%bigquery --project $project_id
SELECT *
FROM `bigquery-public-data.nhtsa_traffic_fatalities.nmcrash_2016`
LIMIT 5

`NMIMPAIR` (0.5mb over 6 years - 3mb)
state_number (secondary)
state_name (secondary)
consecutive_number (secondary)
vehicle_number (secondary)
person_number (primary)
condition_impairment_at_time_of_crash_non_motorist (primary)
condition_impairment_at_time_of_crash_non_motorist_name (primary)

Here condition_impairment_at_time_of_crash_non_motorist_name details whether or not a non-motorist's impairment contributed to a crash and what that impairment was, e.g. "drunk/drug use". Nmimpair contains an entry for each non-motorist involved in the crash, so the consecutive_number is not unique. Additionally, not all accidents involve non-motorists, so not all accidents have an entry in nmimpair.

In [None]:
%%bigquery --project $project_id
SELECT *
FROM `bigquery-public-data.nhtsa_traffic_fatalities.nmimpair_2016`
LIMIT 5

`PBTYPE` (~3.5mb over 6 years - 21mb)
state_number
state_name
consecutive_number
vehicle_number
person_number
person_type
person_type_name
age
age_name
sex
sex_name
marked_crosswalk_present
marked_crosswalk_present_name
sidewalk_present
sidewalk_present_name
school_zone
school_zone_name
crash_type_pedestrian
crash_type_pedestrian_name
crash_type_bicycle
crash_type_bicycle_name
crash_location_pedestrian
crash_location_pedestrian_name
crash_location_bicycle
crash_location_bicycle_name
pedestrian_position
pedestrian_position_name
bicyclist_position
bicyclist_position_name
pedestrian_initial_direction_of_travel
pedestrian_initial_direction_of_travel_name
bicyclist_initial_direction_of_travel
bicyclist_initial_direction_of_travel_name
motorist_initial_direction_of_travel
motorist_initial_direction_of_travel_name
motorist_maneuver
motorist_maneuver_name
intersection_leg
intersection_leg_name
pedestrian_scenario
pedestrian_scenario_name
crash_group_pedestrian
crash_group_pedestrian_name
crash_group_bicycle
crash_group_bicycle_name

Information on pedestrians and cyclists involved in the crash. This is less relevant for our question, as we are focused on contributing factors for drivers, not where and how a pedestrian or bicyclist was involved in a fatal crash.

In [None]:
%%bigquery --project $project_id
SELECT *
FROM `bigquery-public-data.nhtsa_traffic_fatalities.pbtype_2016`
LIMIT 5

`SAFETYEQ` (~0.5mb over 6 years - 3mb)
state_number (secondary)
state_name (secondary)
consecutive_number (secondary)
vehicle_number (secondary)
person_number (secondary)
non_motorist_safety_equipment_use (primary)
non_motorist_safety_equipment_use_name (primary)

This table can be joined on consecutive number. Here non_motorist_safety_equipment_use_name details if safety equipment was used by a non motorist and the name of it e.g. seatbelt etc. 

In [None]:
%%bigquery --project $project_id
SELECT *
FROM `bigquery-public-data.nhtsa_traffic_fatalities.safetyeq_2016`
LIMIT 5

`VIOLATN` (~2.5mb over 6 years - 15mb)
Consecutive_number  (secondary)
Vehicle_number  (secondary)
Violations_charged_name (primary)

An add-on table that we can join to the accidents table on consecutive_number. Violations_charged_name delineates the type of road rule violations that were charged, e.g. “no license plate”. Violatn contains entries for each vehicle involved in the crash, so the consecutive_number is not unique.

In [None]:
%%bigquery --project $project_id
SELECT *
FROM `bigquery-public-data.nhtsa_traffic_fatalities.violatn_2016`
LIMIT 5

`VISION` (~3mb over 6 years - 18mb)
Consecutive_number  (secondary)
Vehicle_number  (secondary)
Drivers_vision_obscured_by_name (primary)

An add-on table that we can join to the accidents table on consecutive_number. Drivers_vision_obscured_by_name delineates the type of vision obstruction that occurred and contributed to the car accident, e.g. “reflected glare, bright sunlight”. Vision contains entries for each vehicle involved in the crash, so the consecutive_number is not unique.

In [None]:
%%bigquery --project $project_id
SELECT *
FROM `bigquery-public-data.nhtsa_traffic_fatalities.vision_2016`
LIMIT 5

**Issues/Inconsistencies with the dataset** 
- CEVENT, VSOE, and VEVENT seem to describe very similar things: all include sequence-of-event coding for each crash (consecutive number). The data is very detailed and descriptive with few matching entries, so it is hard to do much quantitative analysis on it. It is also quite unrelated to our question, as it covers what happens when the crash actually occurs rather than contributing factors.
- For the contributing factor tables, although they all have encoded types, the severity of the impairment, distraction, etc. is often not delineated, which makes it hard to implement more gradation in queries.
- The hour columns have inconsistent types, so we had to convert them to strings.
- Consecutive number provides a unique ID code for each accident, but the codes are reused from year to year, so we had to append the year to these codes to ensure all accidents are unique in the aggregated table we made.
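The last point can be illustrated with a few made-up rows. This is a minimal sketch: `make_id` is a hypothetical helper that mirrors the `CONCAT(consecutive_number, year_of_crash)` expression used in the aggregation query, and the consecutive numbers are invented for the example.

```python
def make_id(consecutive_number: int, year_of_crash: int) -> str:
    """Append the year to the consecutive number, as the CONCAT expression does."""
    return f"{consecutive_number}{year_of_crash}"


# Hypothetical (consecutive_number, year_of_crash) pairs:
# the same consecutive number appears in two different years.
rows = [(10001, 2015), (10001, 2016), (10002, 2015)]

raw_ids = {number for number, _ in rows}
combined_ids = {make_id(number, year) for number, year in rows}

print(len(raw_ids))       # 2: the reused number merges two different accidents
print(len(combined_ids))  # 3: every accident keeps its own ID
```

Because the year is always four digits at the end of the string, two different (number, year) pairs cannot produce the same ID.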


**Pivoting Away From Using Other Tables Outside of Accidents**
- We initially joined several tables to ACCIDENTS on consecutive number, including DRIMPAIR, DISTRACT, VIOLATN, VISION, NMIMPAIR, etc. Consecutive number gives us the ID of a specific crash.
- For certain tables such as nmcrash, drimpair, nmimpair, and violatn, we had to account for the fact that an accident with multiple participants has multiple entries under the same accident ID. For example, in a head-on collision the drimpair table has a row for each driver under the same accident ID, stating whether that driver was physically impaired. When we join several tables that each have many entries for the same accident, we get every combination of those rows, which adds extreme bulk to our dataset. Certain crashes ended up with thousands of entries after joining on multiple tables.
- Hence we initially decided to cut the feature tables joined to ACCIDENTS down to just two: DISTRACT and DRIMPAIR. However, we continued to encounter data consistency issues.
- In the end we decided to use just the ACCIDENTS table and combine it across years, as it already had a swathe of interesting features.
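The row explosion described above can be reproduced on a tiny scale. The sketch below uses made-up rows for a single accident and a hypothetical `join_on_accident` helper that joins on consecutive_number only, which is what a SQL join on that one key does.

```python
from itertools import product

# Hypothetical rows: (consecutive_number, vehicle_number, value)
drimpair = [(10001, 1, "None/Apparently Normal"), (10001, 2, "Unknown")]
distract = [(10001, 1, "Not Reported"), (10001, 2, "Unknown")]
violatn = [(10001, 1, "None"), (10001, 2, "None"), (10001, 2, "Speeding")]


def join_on_accident(*tables):
    """Keep every combination of rows that share the same consecutive_number."""
    joined = []
    for combo in product(*tables):
        accident_ids = {row[0] for row in combo}
        if len(accident_ids) == 1:
            joined.append(combo)
    return joined


rows = join_on_accident(drimpair, distract, violatn)
print(len(rows))  # 2 * 2 * 3 = 12 rows for one accident
```

The row count is the product of the per-table row counts, so each extra table multiplies the size of the result for every multi-participant accident.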





### Investigate how we should create our fatal car accident severity label for the classification model

Below is a query that investigates the number of accidents by the number of vehicles in transport. As we can see, most accidents involve one car, and the number of accidents decreases as the number of motor vehicles in transport increases. Since average fatalities generally increase with the number of motor vehicles in transport, number_of_motor_vehicles_in_transport_mvit may be a useful indicator of accident severity.

In [None]:
%%bigquery --project $project_id

SELECT number_of_motor_vehicles_in_transport_mvit, AVG(number_of_fatalities) AS avg_fatalities, COUNT(*) AS count
FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2016`
GROUP BY number_of_motor_vehicles_in_transport_mvit
ORDER BY number_of_motor_vehicles_in_transport_mvit

Next is a query that investigates the number of accidents by the number of parked vehicles. Most accidents do not involve parked vehicles, and the number of accidents decreases as the number of parked vehicles increases. There does not appear to be a correlation between the number of parked vehicles and average fatalities, which is worth noting for when we decide how to calculate accident severity.

In [None]:
%%bigquery --project $project_id

SELECT number_of_parked_working_vehicles, AVG(number_of_fatalities) AS avg_fatalities, COUNT(*) AS count
FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2016`
GROUP BY number_of_parked_working_vehicles
ORDER BY number_of_parked_working_vehicles

Now a query that aims to investigate the number of accidents with different numbers of people in moving vehicles. With the exception of 0 people in moving vehicles, as the number of people in moving vehicles increases, the number of accidents decreases. From this we can also see that there is a slight increase in average fatalities as the number of people in moving vehicles increases.

In [None]:
%%bigquery --project $project_id

SELECT number_of_persons_in_motor_vehicles_in_transport_mvit, AVG(number_of_fatalities) AS avg_fatalities, COUNT(*) AS count
FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2016`
GROUP BY number_of_persons_in_motor_vehicles_in_transport_mvit
ORDER BY number_of_persons_in_motor_vehicles_in_transport_mvit

Next let's explore the number of accidents by the number of people not in moving vehicles. From the table we can gather that most accidents involve 0 or 1 persons not in moving vehicles. Additionally, as the number of people not in moving vehicles increases, average fatalities also generally increase.

In [None]:
%%bigquery --project $project_id

SELECT number_of_persons_not_in_motor_vehicles_in_transport_mvit, AVG(number_of_fatalities) AS avg_fatalities, COUNT(*) AS count
FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2016`
GROUP BY number_of_persons_not_in_motor_vehicles_in_transport_mvit
ORDER BY number_of_persons_not_in_motor_vehicles_in_transport_mvit

We also wanted to see how accidents were stratified by number of fatalities. Like the number of vehicles involved (which makes sense), the number of fatalities in an accident is mostly 1. This suggests that most fatal accidents in the US involve just one driver and no other participants.

In [None]:
%%bigquery --project $project_id

SELECT number_of_fatalities, COUNT(number_of_fatalities) AS count
FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2016`
GROUP BY number_of_fatalities
ORDER BY number_of_fatalities

**Determination of Label**

When determining the label, we were disappointed to find that the dataset only includes accidents with fatalities. We could do so much more with a dataset that included accidents both with and without fatalities. However, this makes sense, as minor accidents are typically not reported, nor do they require detailed forensics and investigations. Datasets that provide numbers on minor and other non-fatal accidents might therefore be inaccurate or provide estimates at best. We found that the vast majority of entries in the dataset had only one fatality, so we realized that determining severity purely from the number of fatalities might not be realistic for a classification model.

We therefore decided to construct a label variable that estimates the severity of a crash and creates a nice distribution between low severity and high severity. We let severity = number of fatalities * (number of people / number of vehicles). We experimented with different ways to calculate severity to get different distributions of data but ultimately landed on this equation. We felt that the more people there are per car, the higher the potential for fatalities. Additionally, multiplying this ratio by the number of fatalities emphasizes the importance of the number of deaths to the severity of the crash, because the more deaths there are, the more severe we feel it is. We set the label to 1 whenever the severity is greater than 1, since the severity is exactly 1 when there is one fatality and the same number of people as cars.
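The severity rule can be written out in a few lines of Python. This is a sketch of the same arithmetic as the SQL expression in the query that follows; the function and argument names are hypothetical, and it assumes every accident has at least one vehicle so the division is defined.

```python
def severity(fatalities, persons_in, persons_not_in, moving_vehicles, parked_vehicles):
    """Number of fatalities times the ratio of total people to total vehicles."""
    people = persons_in + persons_not_in
    vehicles = moving_vehicles + parked_vehicles  # assumed to be at least 1
    return fatalities * people / vehicles


def severity_label(*counts):
    """1 for high severity (severity > 1), otherwise 0."""
    return 1 if severity(*counts) > 1 else 0


# One driver alone in one car, one fatality: 1 * 1 / 1 = 1.0, label 0
print(severity(1, 1, 0, 1, 0), severity_label(1, 1, 0, 1, 0))
# Two occupants in one car, one fatality: 1 * 2 / 1 = 2.0, label 1
print(severity(1, 2, 0, 1, 0), severity_label(1, 2, 0, 1, 0))
# Two cars with one driver each, two fatalities: 2 * 2 / 2 = 2.0, label 1
print(severity(2, 2, 0, 2, 0), severity_label(2, 2, 0, 2, 0))
```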

At first we wanted to create a logistic regression model predicting the number of fatalities. However we ran into issues with the model being highly accurate but with low precision. We theorized that this high accuracy and low precision is caused by the vast number of accidents with only one death. The model learned to mostly predict 1 fatality and as a result had high accuracy but low precision on predicting any more fatalities. As a result we pivoted to the label.
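The accuracy and precision effect is easy to reproduce with an imbalanced toy example. The numbers below are hypothetical and are not results from our model: 95 of 100 accidents belong to the majority class (one fatality, coded 0) and the classifier almost always predicts that class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    return tp, fp, fn, tn


# Hypothetical data: 95 single-fatality accidents (0), 5 multi-fatality ones (1).
y_true = [0] * 95 + [1] * 5
# The model predicts the minority class only twice and is right once.
y_pred = [0] * 94 + [1, 1] + [0] * 4

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(accuracy, precision, recall)  # 0.95 0.5 0.2
```

Accuracy is dominated by the majority class, so it stays high even though the model finds only one of the five multi-fatality accidents.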

In [None]:
%%bigquery --project $project_id

SELECT IF(number_of_fatalities * ((number_of_persons_not_in_motor_vehicles_in_transport_mvit + number_of_persons_in_motor_vehicles_in_transport_mvit) / (number_of_motor_vehicles_in_transport_mvit + number_of_parked_working_vehicles)) > 1, 1, 0) AS label, COUNT(*) AS count
FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2016`
GROUP BY label
ORDER BY label

----
## Exploration of our Dataset
----

### Aggregate the Data we want into one Table (320.87 MB)

MASTER TABLE:
- We unioned the accident tables for all years (2015-2020) to make a master accidents table. As a result we got ~200,000 fatal accidents for a total of ~300 MB of data
- We added the severity-based label described above

ENGINEERED FEATURES:
- ID - we noticed that the consecutive numbers (the unique ID code for a specific accident within a given year) are reused from year to year. To overcome this fact we appended the year number to the consecutive number to ensure that all accidents are unique when we join by year.

- Number_of_vehicles - We want to explore the total number of vehicles involved in a crash and its effects
- Number_of_people - Likewise, we want to explore the total number of people involved in a crash and its effects

- Time_to_scene - the table has columns for when emergency services were notified and when they arrived at the scene. To find the emergency response time we took the difference between the two, accounting for the separate hour and minute columns and for out-of-range values (see time_to_scene below)
- Time_to_hospital - computed similarly from the EMS arrival at hospital columns, but we found this less helpful, as we expect less urgency in fatal accidents where the injured people have already died.

- Label - 1 if the product of the number of fatalities and the ratio of total people to total vehicles is greater than 1, otherwise 0
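For the two time features, one way to compute the elapsed minutes between two clock times is to convert both to minutes since midnight and take the difference modulo one day. The sketch below is an alternative formulation, not the expression used in the query, which handles the hour and minute differences separately. `elapsed_minutes` is a hypothetical helper, and 9999 is the same sentinel the query uses for unusable times.

```python
MINUTES_PER_DAY = 24 * 60
UNKNOWN = 9999  # sentinel for out-of-range hour or minute values


def elapsed_minutes(start_hour, start_minute, end_hour, end_minute):
    """Minutes from the start clock time to the end clock time, wrapping at midnight."""
    times_valid = (
        0 <= start_hour <= 23 and 0 <= end_hour <= 23
        and 0 <= start_minute <= 59 and 0 <= end_minute <= 59
    )
    if not times_valid:
        return UNKNOWN
    start = start_hour * 60 + start_minute
    end = end_hour * 60 + end_minute
    # The modulo maps a negative difference (arrival after midnight) into 0..1439.
    return (end - start) % MINUTES_PER_DAY


print(elapsed_minutes(10, 50, 11, 5))  # 15
print(elapsed_minutes(23, 50, 0, 10))  # 20
print(elapsed_minutes(99, 99, 11, 5))  # 9999
```

Working in total minutes means a case like 10:50 to 11:05, where the arrival minute is smaller than the notification minute, needs no separate borrow from the hour.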

FEATURES:
- state_number - the id number for a US state
- state_name - the name of the state
- number_of_motor_vehicles_in_transport_mvit - details the number of moving vehicles involved in a fatal accident
- number_of_parked_working_vehicles - details the number of parked working vehicles that were hit in a fatal accident
- number_of_persons_in_motor_vehicles_in_transport_mvit - the number of people in moving vehicles that were in the accident
- number_of_persons_not_in_motor_vehicles_in_transport_mvit - the number of people involved in accidents that aren't in motor vehicles
- month_of_crash - numerical value for month 1-12
- month_of_crash_name - name of the month in which the accident occurred
- day_of_week - number for day of week 1-7
- day_of_week_name - the name of the day of the week on which the crash occurred, e.g. Monday
- hour_of_crash - the hour in which an accident occurred, e.g. 23 -> 11pm
- hour_of_crash_name - the name of the hour
- land_use - codes for land use e.g. 1 = "urban", 2 = "rural"
- land_use_name - "urban", "rural" etc.
- atmospheric_conditions - integer code for atmospheric condition
- atmospheric_conditions_1 - integer code for primary atmospheric condition (only included in 2020 data)
- atmospheric_conditions_name - the weather conditions at the time of the crash e.g. "cloudy", "hail", "snow", "rain" etc.
- atmospheric_conditions_1_name - the primary weather conditions at the time of the crash e.g. "cloudy", "hail", "snow", "rain" etc. (only included in 2020 data)
- number_of_drunk_drivers - the number of drunk drivers involved in a fatal accident
- number_of_fatalities - the number of fatalities within a given fatal accident

Other Tables:
- `bigquery-public-data.geo_us_boundaries.states` - gives us information on states, including the state abbreviation. We join it to accidents on state_name to find the state abbreviation, then use the abbreviation to populate the choropleth map.
- `bigquery-public-data.census_bureau_acs.state_2020_5yr` - gives us population for each state so we can calculate per capita metrics
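Per capita metrics matter because raw fatality counts mostly track population size. The sketch below scales counts to a rate per 100,000 residents; that scale, the `per_100k` helper, and the state totals are all assumptions for illustration, not values from the dataset.

```python
def per_100k(fatalities: int, population: int) -> float:
    """Fatalities per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return fatalities * 100_000 / population


# Hypothetical (fatalities, population) totals for two made-up states.
state_totals = {
    "State A": (500, 5_000_000),
    "State B": (300, 1_000_000),
}

rates = {name: per_100k(deaths, pop) for name, (deaths, pop) in state_totals.items()}
print(rates)  # {'State A': 10.0, 'State B': 30.0}
```

State A has more fatalities in absolute terms, but State B has three times the rate once population is taken into account.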

In [None]:
%%bigquery --project $project_id
CREATE OR REPLACE TABLE traffic_fatalities.traffic_features AS
SELECT
  CONCAT(accident2015.consecutive_number, accident2015.year_of_crash) AS id,    
  accident2015.state_number,
  accident2015.state_name,
  accident2015.number_of_motor_vehicles_in_transport_mvit,
  accident2015.number_of_parked_working_vehicles,
  (accident2015.number_of_motor_vehicles_in_transport_mvit + accident2015.number_of_parked_working_vehicles) AS num_vehicles,
  accident2015.number_of_persons_in_motor_vehicles_in_transport_mvit,
  accident2015.number_of_persons_not_in_motor_vehicles_in_transport_mvit,
  (accident2015.number_of_persons_in_motor_vehicles_in_transport_mvit + accident2015.number_of_persons_not_in_motor_vehicles_in_transport_mvit) AS num_people,
  accident2015.month_of_crash,
  accident2015.month_of_crash_name,
  accident2015.day_of_week,
  accident2015.day_of_week_name,
  accident2015.hour_of_crash,
  accident2015.hour_of_crash_name,
  accident2015.land_use,
  accident2015.land_use_name,
  accident2015.atmospheric_conditions,
  accident2015.atmospheric_conditions_name,
  IF(accident2015.hour_of_notification > 23 OR accident2015.minute_of_notification > 59 OR accident2015.hour_of_arrival_at_scene > 23 OR accident2015.minute_of_arrival_at_scene > 59, 9999, (IF(accident2015.hour_of_notification <= accident2015.hour_of_arrival_at_scene, accident2015.hour_of_arrival_at_scene - accident2015.hour_of_notification, accident2015.hour_of_arrival_at_scene - accident2015.hour_of_notification + 24) * 60 + IF(accident2015.minute_of_notification <= accident2015.minute_of_arrival_at_scene, accident2015.minute_of_arrival_at_scene - accident2015.minute_of_notification, accident2015.minute_of_arrival_at_scene - accident2015.minute_of_notification + 60))) AS time_to_scene,
  IF(accident2015.hour_of_notification > 23 OR accident2015.minute_of_notification > 59 OR accident2015.hour_of_ems_arrival_at_hospital > 23 OR accident2015.minute_of_ems_arrival_at_hospital > 59, 9999, (IF(accident2015.hour_of_notification <= accident2015.hour_of_ems_arrival_at_hospital, accident2015.hour_of_ems_arrival_at_hospital - accident2015.hour_of_notification, accident2015.hour_of_ems_arrival_at_hospital - accident2015.hour_of_notification + 24) * 60 + IF(accident2015.minute_of_notification <= accident2015.minute_of_ems_arrival_at_hospital, accident2015.minute_of_ems_arrival_at_hospital - accident2015.minute_of_notification, accident2015.minute_of_ems_arrival_at_hospital - accident2015.minute_of_notification + 60))) AS time_to_hospital,
  accident2015.number_of_drunk_drivers,
  accident2015.number_of_fatalities,
  IF(accident2015.number_of_fatalities * ((accident2015.number_of_persons_not_in_motor_vehicles_in_transport_mvit + accident2015.number_of_persons_in_motor_vehicles_in_transport_mvit) / (accident2015.number_of_motor_vehicles_in_transport_mvit + accident2015.number_of_parked_working_vehicles)) > 1, 1, 0) AS label
FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2015` AS accident2015
UNION ALL
SELECT
  CONCAT(accident2016.consecutive_number, accident2016.year_of_crash) AS id,    
  accident2016.state_number,
  accident2016.state_name,
  accident2016.number_of_motor_vehicles_in_transport_mvit,
  accident2016.number_of_parked_working_vehicles,
  (accident2016.number_of_motor_vehicles_in_transport_mvit + accident2016.number_of_parked_working_vehicles) AS num_vehicles,
  accident2016.number_of_persons_in_motor_vehicles_in_transport_mvit,
  accident2016.number_of_persons_not_in_motor_vehicles_in_transport_mvit,
  (accident2016.number_of_persons_in_motor_vehicles_in_transport_mvit + accident2016.number_of_persons_not_in_motor_vehicles_in_transport_mvit) AS num_people,
  accident2016.month_of_crash,
  accident2016.month_of_crash_name,
  accident2016.day_of_week,
  accident2016.day_of_week_name,
  accident2016.hour_of_crash,
  accident2016.hour_of_crash_name,
  accident2016.land_use,
  accident2016.land_use_name,
  accident2016.atmospheric_conditions,
  accident2016.atmospheric_conditions_name,
  IF(accident2016.hour_of_notification > 23 OR accident2016.minute_of_notification > 59 OR accident2016.hour_of_arrival_at_scene > 23 OR accident2016.minute_of_arrival_at_scene > 59, 9999, (IF(accident2016.hour_of_notification <= accident2016.hour_of_arrival_at_scene, accident2016.hour_of_arrival_at_scene - accident2016.hour_of_notification, accident2016.hour_of_arrival_at_scene - accident2016.hour_of_notification + 24) * 60 + IF(accident2016.minute_of_notification <= accident2016.minute_of_arrival_at_scene, accident2016.minute_of_arrival_at_scene - accident2016.minute_of_notification, accident2016.minute_of_arrival_at_scene - accident2016.minute_of_notification + 60))) AS time_to_scene,
  IF(accident2016.hour_of_notification > 23 OR accident2016.minute_of_notification > 59 OR accident2016.hour_of_ems_arrival_at_hospital > 23 OR accident2016.minute_of_ems_arrival_at_hospital > 59, 9999, (IF(accident2016.hour_of_notification <= accident2016.hour_of_ems_arrival_at_hospital, accident2016.hour_of_ems_arrival_at_hospital - accident2016.hour_of_notification, accident2016.hour_of_ems_arrival_at_hospital - accident2016.hour_of_notification + 24) * 60 + IF(accident2016.minute_of_notification <= accident2016.minute_of_ems_arrival_at_hospital, accident2016.minute_of_ems_arrival_at_hospital - accident2016.minute_of_notification, accident2016.minute_of_ems_arrival_at_hospital - accident2016.minute_of_notification + 60))) AS time_to_hospital,
  accident2016.number_of_drunk_drivers,
  accident2016.number_of_fatalities,
  IF(accident2016.number_of_fatalities * ((accident2016.number_of_persons_not_in_motor_vehicles_in_transport_mvit + accident2016.number_of_persons_in_motor_vehicles_in_transport_mvit) / (accident2016.number_of_motor_vehicles_in_transport_mvit + accident2016.number_of_parked_working_vehicles)) > 1, 1, 0) AS label
FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2016` AS accident2016
UNION ALL
SELECT
  CONCAT(accident2017.consecutive_number, accident2017.year_of_crash) AS id,    
  accident2017.state_number,
  accident2017.state_name,
  accident2017.number_of_motor_vehicles_in_transport_mvit,
  accident2017.number_of_parked_working_vehicles,
  (accident2017.number_of_motor_vehicles_in_transport_mvit + accident2017.number_of_parked_working_vehicles) AS num_vehicles,
  accident2017.number_of_persons_in_motor_vehicles_in_transport_mvit,
  accident2017.number_of_persons_not_in_motor_vehicles_in_transport_mvit,
  (accident2017.number_of_persons_in_motor_vehicles_in_transport_mvit + accident2017.number_of_persons_not_in_motor_vehicles_in_transport_mvit) AS num_people,
  accident2017.month_of_crash,
  accident2017.month_of_crash_name,
  accident2017.day_of_week,
  accident2017.day_of_week_name,
  accident2017.hour_of_crash,
  accident2017.hour_of_crash_name,
  accident2017.land_use,
  accident2017.land_use_name,
  accident2017.atmospheric_conditions,
  accident2017.atmospheric_conditions_name,
  IF(accident2017.hour_of_notification > 23 OR accident2017.minute_of_notification > 59 OR accident2017.hour_of_arrival_at_scene > 23 OR accident2017.minute_of_arrival_at_scene > 59, 9999, (IF(accident2017.hour_of_notification <= accident2017.hour_of_arrival_at_scene, accident2017.hour_of_arrival_at_scene - accident2017.hour_of_notification, accident2017.hour_of_arrival_at_scene - accident2017.hour_of_notification + 24) * 60 + IF(accident2017.minute_of_notification <= accident2017.minute_of_arrival_at_scene, accident2017.minute_of_arrival_at_scene - accident2017.minute_of_notification, accident2017.minute_of_arrival_at_scene - accident2017.minute_of_notification + 60))) AS time_to_scene,
  IF(accident2017.hour_of_notification > 23 OR accident2017.minute_of_notification > 59 OR accident2017.hour_of_ems_arrival_at_hospital > 23 OR accident2017.minute_of_ems_arrival_at_hospital > 59, 9999, (IF(accident2017.hour_of_notification <= accident2017.hour_of_ems_arrival_at_hospital, accident2017.hour_of_ems_arrival_at_hospital - accident2017.hour_of_notification, accident2017.hour_of_ems_arrival_at_hospital - accident2017.hour_of_notification + 24) * 60 + IF(accident2017.minute_of_notification <= accident2017.minute_of_ems_arrival_at_hospital, accident2017.minute_of_ems_arrival_at_hospital - accident2017.minute_of_notification, accident2017.minute_of_ems_arrival_at_hospital - accident2017.minute_of_notification + 60))) AS time_to_hospital,
  accident2017.number_of_drunk_drivers,
  accident2017.number_of_fatalities,
  IF(accident2017.number_of_fatalities * ((accident2017.number_of_persons_not_in_motor_vehicles_in_transport_mvit + accident2017.number_of_persons_in_motor_vehicles_in_transport_mvit) / (accident2017.number_of_motor_vehicles_in_transport_mvit + accident2017.number_of_parked_working_vehicles)) > 1, 1, 0) AS label
FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2017` AS accident2017
UNION ALL
SELECT
  CONCAT(accident2018.consecutive_number, accident2018.year_of_crash) AS id,    
  accident2018.state_number,
  accident2018.state_name,
  accident2018.number_of_motor_vehicles_in_transport_mvit,
  accident2018.number_of_parked_working_vehicles,
  (accident2018.number_of_motor_vehicles_in_transport_mvit + accident2018.number_of_parked_working_vehicles) AS num_vehicles,
  accident2018.number_of_persons_in_motor_vehicles_in_transport_mvit,
  accident2018.number_of_persons_not_in_motor_vehicles_in_transport_mvit,
  (accident2018.number_of_persons_in_motor_vehicles_in_transport_mvit + accident2018.number_of_persons_not_in_motor_vehicles_in_transport_mvit) AS num_people,
  accident2018.month_of_crash,
  accident2018.month_of_crash_name,
  accident2018.day_of_week,
  accident2018.day_of_week_name,
  accident2018.hour_of_crash,
  accident2018.hour_of_crash_name,
  accident2018.land_use,
  accident2018.land_use_name,
  accident2018.atmospheric_conditions,
  accident2018.atmospheric_conditions_name,
  IF(accident2018.hour_of_notification > 23 OR accident2018.minute_of_notification > 59 OR accident2018.hour_of_arrival_at_scene > 23 OR accident2018.minute_of_arrival_at_scene > 59, 9999, (IF(accident2018.hour_of_notification <= accident2018.hour_of_arrival_at_scene, accident2018.hour_of_arrival_at_scene - accident2018.hour_of_notification, accident2018.hour_of_arrival_at_scene - accident2018.hour_of_notification + 24) * 60 + IF(accident2018.minute_of_notification <= accident2018.minute_of_arrival_at_scene, accident2018.minute_of_arrival_at_scene - accident2018.minute_of_notification, accident2018.minute_of_arrival_at_scene - accident2018.minute_of_notification + 60))) AS time_to_scene,
  IF(accident2018.hour_of_notification > 23 OR accident2018.minute_of_notification > 59 OR accident2018.hour_of_ems_arrival_at_hospital > 23 OR accident2018.minute_of_ems_arrival_at_hospital > 59, 9999, (IF(accident2018.hour_of_notification <= accident2018.hour_of_ems_arrival_at_hospital, accident2018.hour_of_ems_arrival_at_hospital - accident2018.hour_of_notification, accident2018.hour_of_ems_arrival_at_hospital - accident2018.hour_of_notification + 24) * 60 + IF(accident2018.minute_of_notification <= accident2018.minute_of_ems_arrival_at_hospital, accident2018.minute_of_ems_arrival_at_hospital - accident2018.minute_of_notification, accident2018.minute_of_ems_arrival_at_hospital - accident2018.minute_of_notification + 60))) AS time_to_hospital,
  accident2018.number_of_drunk_drivers,
  accident2018.number_of_fatalities,
  IF(accident2018.number_of_fatalities * ((accident2018.number_of_persons_not_in_motor_vehicles_in_transport_mvit + accident2018.number_of_persons_in_motor_vehicles_in_transport_mvit) / (accident2018.number_of_motor_vehicles_in_transport_mvit + accident2018.number_of_parked_working_vehicles)) > 1, 1, 0) AS label
FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2018` AS accident2018
UNION ALL
SELECT
  CONCAT(accident2019.consecutive_number, accident2019.year_of_crash) AS id,    
  accident2019.state_number,
  accident2019.state_name,
  accident2019.number_of_motor_vehicles_in_transport_mvit,
  accident2019.number_of_parked_working_vehicles,
  (accident2019.number_of_motor_vehicles_in_transport_mvit + accident2019.number_of_parked_working_vehicles) AS num_vehicles,
  accident2019.number_of_persons_in_motor_vehicles_in_transport_mvit,
  accident2019.number_of_persons_not_in_motor_vehicles_in_transport_mvit,
  (accident2019.number_of_persons_in_motor_vehicles_in_transport_mvit + accident2019.number_of_persons_not_in_motor_vehicles_in_transport_mvit) AS num_people,
  accident2019.month_of_crash,
  accident2019.month_of_crash_name,
  accident2019.day_of_week,
  accident2019.day_of_week_name,
  accident2019.hour_of_crash,
  accident2019.hour_of_crash_name,
  accident2019.land_use,
  accident2019.land_use_name,
  accident2019.atmospheric_conditions,
  accident2019.atmospheric_conditions_name,
  IF(accident2019.hour_of_notification > 23 OR accident2019.minute_of_notification > 59 OR accident2019.hour_of_arrival_at_scene > 23 OR accident2019.minute_of_arrival_at_scene > 59, 9999, (IF(accident2019.hour_of_notification <= accident2019.hour_of_arrival_at_scene, accident2019.hour_of_arrival_at_scene - accident2019.hour_of_notification, accident2019.hour_of_arrival_at_scene - accident2019.hour_of_notification + 24) * 60 + IF(accident2019.minute_of_notification <= accident2019.minute_of_arrival_at_scene, accident2019.minute_of_arrival_at_scene - accident2019.minute_of_notification, accident2019.minute_of_arrival_at_scene - accident2019.minute_of_notification + 60))) AS time_to_scene,
  IF(accident2019.hour_of_notification > 23 OR accident2019.minute_of_notification > 59 OR accident2019.hour_of_ems_arrival_at_hospital > 23 OR accident2019.minute_of_ems_arrival_at_hospital > 59, 9999, (IF(accident2019.hour_of_notification <= accident2019.hour_of_ems_arrival_at_hospital, accident2019.hour_of_ems_arrival_at_hospital - accident2019.hour_of_notification, accident2019.hour_of_ems_arrival_at_hospital - accident2019.hour_of_notification + 24) * 60 + IF(accident2019.minute_of_notification <= accident2019.minute_of_ems_arrival_at_hospital, accident2019.minute_of_ems_arrival_at_hospital - accident2019.minute_of_notification, accident2019.minute_of_ems_arrival_at_hospital - accident2019.minute_of_notification + 60))) AS time_to_hospital,
  accident2019.number_of_drunk_drivers,
  accident2019.number_of_fatalities,
  IF(accident2019.number_of_fatalities * ((accident2019.number_of_persons_not_in_motor_vehicles_in_transport_mvit + accident2019.number_of_persons_in_motor_vehicles_in_transport_mvit) / (accident2019.number_of_motor_vehicles_in_transport_mvit + accident2019.number_of_parked_working_vehicles)) > 1, 1, 0) AS label
FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2019` AS accident2019
UNION ALL
SELECT
  CONCAT(accident2020.consecutive_number, accident2020.year_of_crash) AS id,    
  accident2020.state_number,
  accident2020.state_name,
  accident2020.number_of_motor_vehicles_in_transport_mvit,
  accident2020.number_of_parked_working_vehicles,
  (accident2020.number_of_motor_vehicles_in_transport_mvit + accident2020.number_of_parked_working_vehicles) AS num_vehicles,
  accident2020.number_of_persons_in_motor_vehicles_in_transport_mvit,
  accident2020.number_of_persons_not_in_motor_vehicles_in_transport_mvit,
  (accident2020.number_of_persons_in_motor_vehicles_in_transport_mvit + accident2020.number_of_persons_not_in_motor_vehicles_in_transport_mvit) AS num_people,
  accident2020.month_of_crash,
  accident2020.month_of_crash_name,
  accident2020.day_of_week,
  accident2020.day_of_week_name,
  accident2020.hour_of_crash,
  accident2020.hour_of_crash_name,
  accident2020.land_use,
  accident2020.land_use_name,
  accident2020.atmospheric_conditions_1 AS atmospheric_conditions,
  accident2020.atmospheric_conditions_1_name AS atmospheric_conditions_name,
  IF(accident2020.hour_of_notification > 23 OR accident2020.minute_of_notification > 59 OR accident2020.hour_of_arrival_at_scene > 23 OR accident2020.minute_of_arrival_at_scene > 59, 9999, (IF(accident2020.hour_of_notification <= accident2020.hour_of_arrival_at_scene, accident2020.hour_of_arrival_at_scene - accident2020.hour_of_notification, accident2020.hour_of_arrival_at_scene - accident2020.hour_of_notification + 24) * 60 + IF(accident2020.minute_of_notification <= accident2020.minute_of_arrival_at_scene, accident2020.minute_of_arrival_at_scene - accident2020.minute_of_notification, accident2020.minute_of_arrival_at_scene - accident2020.minute_of_notification + 60))) AS time_to_scene,
  IF(accident2020.hour_of_notification > 23 OR accident2020.minute_of_notification > 59 OR accident2020.hour_of_ems_arrival_at_hospital > 23 OR accident2020.minute_of_ems_arrival_at_hospital > 59, 9999, (IF(accident2020.hour_of_notification <= accident2020.hour_of_ems_arrival_at_hospital, accident2020.hour_of_ems_arrival_at_hospital - accident2020.hour_of_notification, accident2020.hour_of_ems_arrival_at_hospital - accident2020.hour_of_notification + 24) * 60 + IF(accident2020.minute_of_notification <= accident2020.minute_of_ems_arrival_at_hospital, accident2020.minute_of_ems_arrival_at_hospital - accident2020.minute_of_notification, accident2020.minute_of_ems_arrival_at_hospital - accident2020.minute_of_notification + 60))) AS time_to_hospital,
  accident2020.number_of_drunk_drivers,
  accident2020.number_of_fatalities,
  IF(accident2020.number_of_fatalities * ((accident2020.number_of_persons_not_in_motor_vehicles_in_transport_mvit + accident2020.number_of_persons_in_motor_vehicles_in_transport_mvit) / (accident2020.number_of_motor_vehicles_in_transport_mvit + accident2020.number_of_parked_working_vehicles)) > 1, 1, 0) AS label
FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2020` AS accident2020
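The `time_to_scene` and `time_to_hospital` columns above measure the minutes between notification and arrival, with 9999 marking rows whose hour or minute codes are out of range. A minimal Python sketch of that idea follows; the function name `elapsed_minutes` is hypothetical, and it is not a line-by-line port of the SQL. It converts both clock times to minutes since midnight before subtracting, whereas the SQL handles hours and minutes separately, so the two can give different results when the arrival minute is smaller than the notification minute.

```python
SENTINEL = 9999  # same "unknown" marker the query uses


def elapsed_minutes(start_hour, start_minute, end_hour, end_minute):
    """Minutes from start to end on a 24-hour clock, wrapping past midnight.

    Hypothetical helper for illustration; codes above 23 (hours) or
    59 (minutes) are treated as unknown, as in the query.
    """
    if start_hour > 23 or start_minute > 59 or end_hour > 23 or end_minute > 59:
        return SENTINEL
    start = start_hour * 60 + start_minute
    end = end_hour * 60 + end_minute
    # Python's % always returns a non-negative result for a positive modulus,
    # so an arrival after midnight wraps around correctly.
    return (end - start) % (24 * 60)


print(elapsed_minutes(10, 50, 11, 10))  # notified 10:50, arrived 11:10
print(elapsed_minutes(23, 50, 0, 10))   # crosses midnight
print(elapsed_minutes(99, 99, 11, 10))  # unknown notification time
```

Working in total minutes avoids having to borrow an hour when the minute difference is negative.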

### Total Number of Accidents

There were 203,465 fatal crashes in the US from 2015 to 2020, which equates to roughly 34,000 fatal accidents per year.

In [None]:
%%bigquery --project $project_id
SELECT COUNT(*)
FROM `traffic_fatalities.traffic_features`

### Total Number of Fatalities

Traffic accidents caused 221,137 fatalities over these six years, which is almost 37,000 deaths per year. This is quite a saddening statistic.

In [None]:
%%bigquery --project $project_id
SELECT SUM(number_of_fatalities)
FROM `traffic_fatalities.traffic_features`

Since this dataset only contains fatal crashes, each accident has at least one death. Dividing the two values above gives a national average of 1.09 fatalities per fatal accident (2015-2020). Because no crash can count fewer than one death and the mean is barely above one, we conclude that the large majority of fatal accidents involve exactly one death. Nevertheless, we would like to investigate how these base statistics differ between states.
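The arithmetic behind these figures can be checked in a few lines. The sketch below only reuses the two totals reported above; the bound on multi-death crashes is our own added reasoning step: since every crash has at least one death, each multi-death crash adds at least one extra death, so the share of such crashes cannot exceed the mean minus one.

```python
fatal_crashes = 203_465  # COUNT(*) reported above
fatalities = 221_137     # SUM(number_of_fatalities) reported above
years = 6                # 2015 through 2020

crashes_per_year = fatal_crashes / years
deaths_per_year = fatalities / years
deaths_per_crash = fatalities / fatal_crashes

# Upper bound on the fraction of crashes with two or more deaths.
max_share_multi_death = deaths_per_crash - 1

print(f"crashes per year: {crashes_per_year:,.0f}")
print(f"deaths per year: {deaths_per_year:,.0f}")
print(f"deaths per fatal crash: {deaths_per_crash:.2f}")
print(f"share of multi-death crashes is at most {max_share_multi_death:.1%}")
```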

### Preliminary State Investigations

As we investigate state differences, it is very important to account for differences in population, because we expect the number of fatal accidents to correlate strongly with the number of people living in a state. Here we use census data to pull population figures.

In [None]:
%%bigquery --project $project_id

SELECT t.state_name, s.total_pop, b.state
FROM `traffic_fatalities.traffic_features` AS t
JOIN `bigquery-public-data.census_bureau_acs.state_2020_5yr` AS s ON t.state_number = CAST(s.geo_id AS INT64)
JOIN `bigquery-public-data.geo_us_boundaries.states` AS b ON t.state_name = b.state_name
GROUP BY state_name, total_pop, state
ORDER BY state_name

In [None]:
query_pop = """SELECT t.state_name, s.total_pop, b.state
               FROM `traffic_fatalities.traffic_features` AS t
               JOIN `bigquery-public-data.census_bureau_acs.state_2020_5yr` AS s ON t.state_number = CAST(s.geo_id AS INT64)
               JOIN `bigquery-public-data.geo_us_boundaries.states` AS b ON t.state_name = b.state_name
               GROUP BY state_name, total_pop, state
               ORDER BY state_name"""
pop = pd.read_gbq(query_pop, project_id=project_id, dialect='standard')

FATALITIES PER STATE

As we see below, states with larger populations have more fatal crashes. It is therefore no surprise that the most populous states in the US (California, Texas, and Florida) have the highest number of fatalities. The correlation is close: the ranking of states by fatalities is very similar to their ranking by population size.

In [None]:
%%bigquery --project $project_id

WITH state_fatalities AS (
    SELECT SUM(number_of_fatalities) AS num_fatalities, state_name, state_number
    FROM `traffic_fatalities.traffic_features` 
    GROUP BY state_name, state_number
    ) 
SELECT f.num_fatalities AS total_fatalities, f.state_name, s.state 
FROM state_fatalities AS f 
JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON f.state_name = s.state_name
ORDER BY state_name

Let's Visualize this Data

In [None]:
query_total_fatalities_per_state = """ WITH state_fatalities AS (
                                            SELECT SUM(number_of_fatalities) AS num_fatalities, state_name, state_number
                                            FROM `traffic_fatalities.traffic_features` 
                                            GROUP BY state_name, state_number
                                        ) 
                                        SELECT f.num_fatalities AS total_fatalities, f.state_name, s.state 
                                        FROM state_fatalities AS f 
                                        JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON f.state_name = s.state_name
                                        ORDER BY state_name"""
total_fatalities_per_state = pd.read_gbq(query_total_fatalities_per_state, project_id=project_id, dialect='standard')
fig_total_fatalities_per_state = px.choropleth(
    total_fatalities_per_state,
    locations='state',
    locationmode='USA-states',
    color='total_fatalities',
    color_continuous_scale = 'Reds',
)

fig_total_fatalities_per_state.update_layout(
    title_text = 'Number of Fatalities 2015-2020',
    geo_scope='usa', # limit map scope to USA
)

FATALITIES PER CAPITA

Because the number of fatalities depends so heavily on state population, we must normalize by population to find which states have the most fatalities per capita. We take the state population from the 5-year census estimates in `bigquery-public-data.census_bureau_acs.state_2020_5yr`, joining our state_number with the geo_id in the census data.

In [None]:
%%bigquery --project $project_id

WITH state_fatalities AS (
    SELECT SUM(number_of_fatalities) AS num_fatalities, state_name, state_number
    FROM `traffic_fatalities.traffic_features`
    GROUP BY state_name, state_number
)
SELECT f.num_fatalities / p.total_pop / 6 * 100 AS yearly_fatalities_per_capita, f.state_name, s.state
FROM state_fatalities AS f
JOIN `bigquery-public-data.census_bureau_acs.state_2020_5yr` AS p ON f.state_number = CAST(p.geo_id AS INT64)
JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON f.state_name = s.state_name
ORDER BY state_name
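The normalization in the query above divides total fatalities by population and by the six years covered, then multiplies by 100, so the result is a yearly percentage of the state population. A minimal sketch of that formula, plus a conversion to the more common "per 100,000 residents" unit, is shown below. Both function names and the input numbers are hypothetical placeholders, not values from the dataset.

```python
def yearly_fatality_percent(total_fatalities, population, years=6):
    """Yearly fatalities as a percentage of population, as in the query."""
    return total_fatalities / population / years * 100


def percent_to_per_100k(percent):
    """Convert a percentage of the population to a rate per 100,000 people."""
    return percent / 100 * 100_000


# Made-up example: 600 deaths over six years in a state of one million people.
pct = yearly_fatality_percent(total_fatalities=600, population=1_000_000)
print(f"{pct:.4f}% of the population per year")
print(f"{percent_to_per_100k(pct):.1f} deaths per 100,000 residents per year")
```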

Yearly fatalities per capita, expressed as a percentage of state population, vary widely across states, ranging from 0.004060 (District of Columbia) to 0.022944 (Mississippi), or roughly 4.1 to 22.9 deaths per 100,000 residents per year. Moreover, the map below tells a very different story from the raw totals, which were not controlled for population. This indicates significant differences in safety or risk factors among states. At first glance there appear to be regional patterns: several states in the South (e.g., Mississippi, South Carolina) and Northwest (e.g., Wyoming, Montana) have higher fatality rates than states in the Northeast or West. What factors lead these states to have significantly more fatalities per capita? Further analysis is needed to understand the reasons behind these variations. But first, let's visualize the data.

In [None]:
query_percent_fatalities_per_state = """WITH state_fatalities AS (
                                            SELECT SUM(number_of_fatalities) AS num_fatalities, state_name, state_number
                                            FROM `traffic_fatalities.traffic_features`
                                            GROUP BY state_name, state_number
                                        )
                                        SELECT f.num_fatalities / p.total_pop / 6 * 100 AS yearly_fatalities_per_capita, f.state_name, s.state
                                        FROM state_fatalities AS f
                                        JOIN `bigquery-public-data.census_bureau_acs.state_2020_5yr` AS p ON f.state_number = CAST(p.geo_id AS INT64)
                                        JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON f.state_name = s.state_name
                                        ORDER BY state_name"""
percent_fatalities_per_state = pd.read_gbq(query_percent_fatalities_per_state, project_id=project_id, dialect='standard')
fig_percent_fatalities_per_state = px.choropleth(
    percent_fatalities_per_state,
    locations='state',
    locationmode='USA-states',
    color='yearly_fatalities_per_capita',
    color_continuous_scale = 'Blues',
)

fig_percent_fatalities_per_state.update_layout(
    title_text = 'Yearly Fatalities Per Capita 2015-2020',
    geo_scope='usa', # limit map scope to USA
)

FATALITIES PER ACCIDENT 

Alternatively, we can normalize the number of fatalities by the number of fatal accidents that occur in the state.

In [None]:
%%bigquery --project $project_id

WITH state_avg_fatalities AS (
    SELECT AVG(number_of_fatalities) AS avg_fatalities, state_name, state_number 
    FROM `traffic_fatalities.traffic_features` 
    GROUP BY state_name, state_number
    ) 
SELECT f.avg_fatalities, f.state_name, s.state 
FROM state_avg_fatalities AS f 
JOIN `bigquery-public-data.census_bureau_acs.state_2020_5yr` AS p ON f.state_number = CAST(p.geo_id AS INT64) 
JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON f.state_name = s.state_name
ORDER BY state_name

Interestingly, the pattern is somewhat similar to the one for fatalities per capita. However, this time the overall variation is relatively narrow, ranging from 1.042683 (District of Columbia) to 1.146084 (Wyoming), which suggests that, on average, states have similar numbers of fatalities per fatal accident. The District of Columbia yet again has the lowest value, while Wyoming has the highest. This is consistent with the previous analysis, where the District of Columbia had the lowest yearly per-capita fatality rate and Wyoming had one of the highest. Additionally, the geographic trends remain consistent with fatalities per capita, as Southern and Northwestern states have higher average fatalities per accident.

In [None]:
query_avg_fatalities_per_state = """WITH state_avg_fatalities AS (
                                        SELECT AVG(number_of_fatalities) AS avg_fatalities, state_name, state_number 
                                        FROM `traffic_fatalities.traffic_features` 
                                        GROUP BY state_name, state_number
                                    ) 
                                    SELECT f.avg_fatalities, f.state_name, s.state 
                                    FROM state_avg_fatalities AS f 
                                    JOIN `bigquery-public-data.census_bureau_acs.state_2020_5yr` AS p ON f.state_number = CAST(p.geo_id AS INT64) 
                                    JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON f.state_name = s.state_name
                                    ORDER BY state_name"""
avg_fatalities_per_state = pd.read_gbq(query_avg_fatalities_per_state, project_id=project_id, dialect='standard')
fig_avg_fatalities_per_state = px.choropleth(
    avg_fatalities_per_state,
    locations='state',
    locationmode='USA-states',
    color='avg_fatalities',
    color_continuous_scale = 'Greens',
)

fig_avg_fatalities_per_state.update_layout(
    title_text = 'Fatalities Per Accident 2015-2020',
    geo_scope='usa', # limit map scope to USA
)

Plotting the average fatalities per accident in each state against the fatalities per capita shows a strong positive correlation between the two: states with more fatalities per accident also tend to have more fatalities per capita, and vice versa. This phenomenon makes sense given the data, as the number of accidents per capita correlated strongly with the population of states. Therefore, we can infer that there may be common underlying factors within states that increase the rate of fatalities.

In [None]:
plt.scatter(percent_fatalities_per_state["yearly_fatalities_per_capita"], avg_fatalities_per_state["avg_fatalities"])
plt.xlabel("Fatalities Per Capita")
plt.ylabel("Fatalities Per Accident")
plt.title("Fatalities Per Accident vs. Fatalities Per Capita")
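The scatter plot above pairs the two DataFrames row by row, which relies on both queries returning the same states in the same order. A more defensive sketch is to merge on the `state` column first and then quantify the relationship with a Pearson correlation. The values below are made-up placeholders with hypothetical state codes, not query results, and the second frame is deliberately in a different row order to show why the merge matters.

```python
import pandas as pd

# Placeholder stand-ins for percent_fatalities_per_state and
# avg_fatalities_per_state (hypothetical states and values).
per_capita = pd.DataFrame({
    "state": ["AA", "BB", "CC", "DD"],
    "yearly_fatalities_per_capita": [0.005, 0.010, 0.015, 0.020],
})
per_accident = pd.DataFrame({
    "state": ["DD", "CC", "BB", "AA"],  # different row order on purpose
    "avg_fatalities": [1.12, 1.10, 1.07, 1.05],
})

# Align rows by state rather than by position; validate guards against
# duplicate states in either frame.
merged = per_capita.merge(per_accident, on="state", how="inner",
                          validate="one_to_one")

aligned_r = merged["yearly_fatalities_per_capita"].corr(merged["avg_fatalities"])

# Pairing by row position instead mixes up the states.
positional_r = per_capita["yearly_fatalities_per_capita"].corr(
    per_accident["avg_fatalities"]
)

print(f"Pearson r after merging on state: {aligned_r:.3f}")
print(f"Pearson r when pairing by row position: {positional_r:.3f}")
```

With the real query results, `merged` could also be passed to `plt.scatter` so that each point is guaranteed to describe a single state.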

Now that we have identified the relationship between states and fatalities per capita in addition to the relationship between states and fatalities per fatal accident, we can start investigating how certain factors influence these relationships.

### Vehicles

In this section we want to investigate how the number of moving and parked vehicles in a fatal crash relates to the number of deaths. We hypothesize that the number of vehicles involved is positively correlated with the number of deaths in a fatal accident, because more vehicles implies more people, which increases the number of people that could die.

First, let's evaluate the average number of moving vehicles in fatal accidents:

In [None]:
%%bigquery --project $project_id
SELECT AVG(number_of_motor_vehicles_in_transport_mvit) AS avg_moving_vehicles
FROM `traffic_fatalities.traffic_features`

Interestingly, the average number of moving vehicles involved in a fatal accident is more than one but still below two. This means that many traffic fatalities involve only one moving vehicle.

Next, let's evaluate the average number of parked vehicles.

In [None]:
%%bigquery --project $project_id
SELECT AVG(number_of_parked_working_vehicles) AS avg_parked_vehicles
FROM `traffic_fatalities.traffic_features`

The average number of parked vehicles per accident is very small, which means that the majority of accidents do not involve parked vehicles. Possible explanations are that many fatal accidents occur on roads with no parked vehicles, such as highways and motorways, or that collisions with parked vehicles do not commonly result in fatalities.

Combining the two we discover the average number of vehicles involved in fatal accidents.

In [None]:
%%bigquery --project $project_id
SELECT AVG(num_vehicles) AS avg_vehicles
FROM `traffic_fatalities.traffic_features`

An average below 2 goes against our hypothesis, since we expected multicar accidents to account for most fatal accidents. In fact, the average tells us that many fatal accidents involve only one vehicle.

Moving on to multicar crashes:

In [None]:
%%bigquery --project $project_id
WITH multicar AS (
    SELECT IF(num_vehicles > 1, 1, 0) AS multicar, id
    FROM `traffic_fatalities.traffic_features`
)
SELECT m.multicar, COUNT(*)/(SELECT COUNT(*) FROM `traffic_fatalities.traffic_features`) AS percent, COUNT(*) AS count, AVG(f.number_of_fatalities) AS avg_fatalities
FROM `traffic_fatalities.traffic_features` AS f
JOIN multicar AS m ON m.id = f.id
GROUP BY multicar

We find that less than half of fatal accidents are multicar, which contradicts our hypothesis that multicar accidents account for more of the fatal accidents. The higher percentage of non-multicar accidents confirms our inference from the average number of vehicles that the majority of fatal accidents involve only one car. However, among fatal accidents, the multicar ones had higher average fatalities, which supports our hypothesis.
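As a cross-check, the same split can be sketched in pandas once per-crash rows are in a DataFrame. The example below uses a tiny made-up table (not query results) with the `num_vehicles` and `number_of_fatalities` columns from `traffic_features`; the flag is computed in the same pass, so no self-join on `id` is needed.

```python
import pandas as pd

# Made-up per-crash rows standing in for traffic_features.
crashes = pd.DataFrame({
    "num_vehicles": [1, 1, 1, 2, 3],
    "number_of_fatalities": [1, 1, 1, 2, 1],
})

# 1 if more than one vehicle was involved, else 0 (mirrors the SQL IF).
crashes["multicar"] = (crashes["num_vehicles"] > 1).astype(int)

summary = (
    crashes.groupby("multicar")["number_of_fatalities"]
    .agg(count="size", avg_fatalities="mean")
)
# Share of all crashes that fall in each group.
summary["percent"] = summary["count"] / len(crashes)

print(summary)
```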

Let's check whether this trend holds across states:

In [None]:
%%bigquery --project $project_id
WITH multicar AS (
    SELECT IF(num_vehicles > 1, 1, 0) AS multicar, id
    FROM `traffic_fatalities.traffic_features`
), state_accidents AS (
    SELECT state_name, COUNT(*) AS total_accidents
    FROM `traffic_fatalities.traffic_features` 
    GROUP BY state_name
)
SELECT m.multicar, COUNT(*)/t.total_accidents AS percent, COUNT(*) AS count, AVG(f.number_of_fatalities) AS avg_fatalities, f.state_name, s.state
FROM `traffic_fatalities.traffic_features` AS f
JOIN multicar AS m ON m.id = f.id
JOIN state_accidents AS t ON t.state_name = f.state_name
JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON f.state_name = s.state_name
GROUP BY multicar, state_name, state, total_accidents
ORDER BY state_name

We see that the trend holds across states: multicar accidents tend to have higher average fatalities. This strongly supports our hypothesis that multicar accidents yield a higher fatality rate. Graphing this data:

In [None]:
query_multicar_per_state = """  WITH multicar AS (
                                    SELECT IF(num_vehicles > 1, 1, 0) AS multicar, id
                                    FROM `traffic_fatalities.traffic_features`
                                ), state_accidents AS (
                                    SELECT state_name, COUNT(*) AS total_accidents
                                    FROM `traffic_fatalities.traffic_features` 
                                    GROUP BY state_name
                                )
                                SELECT m.multicar, COUNT(*)/t.total_accidents AS percent, COUNT(*) AS count, AVG(f.number_of_fatalities) AS avg_fatalities, f.state_name, s.state
                                FROM `traffic_fatalities.traffic_features` AS f
                                JOIN multicar AS m ON m.id = f.id
                                JOIN state_accidents AS t ON t.state_name = f.state_name
                                JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON f.state_name = s.state_name
                                WHERE multicar = 1
                                GROUP BY multicar, state_name, state, total_accidents
                                ORDER BY state_name"""
multicar_per_state = pd.read_gbq(query_multicar_per_state, project_id=project_id, dialect='standard')
fig_multicar_per_state = px.choropleth(
    multicar_per_state,
    locations='state',
    locationmode='USA-states',
    color='percent',
    color_continuous_scale = 'Reds',
)

fig_multicar_per_state.update_layout(
    title_text = 'Percent Multicar Accidents 2015-2020',
    geo_scope='usa', # limit map scope to USA
)

It's surprising to see the percentage of multicar accidents vary so much from state to state. The states with the lowest percentage of multicar accidents tend to be more rural (e.g., Montana, Wyoming, Maine), where highway and interstate usage may not be as prevalent. On the other hand, Nebraska (NE), Florida (FL), and Texas (TX), which all have expansive highway systems, have relatively high percentages of multicar accidents. The range is wide: in Montana, the lowest-percentage state, just 29% of fatal accidents are multicar, whereas in Nebraska, the highest, 54% are.

In [None]:
plt.scatter(multicar_per_state["avg_fatalities"], multicar_per_state["percent"])
plt.ylabel("Percent Multicar")
plt.xlabel("Average Fatalities")
plt.title("Percent Multicar vs. Fatalities Per Accident")

In [None]:
plt.scatter(percent_fatalities_per_state["yearly_fatalities_per_capita"], multicar_per_state["percent"])
plt.ylabel("Percent Multicar")
plt.xlabel("Fatalities Per Capita")
plt.title("Percent Multicar vs. Fatalities Per Capita")

Plotting the percentage of multicar accidents against both the state average fatalities per accident and the yearly fatalities per capita shows a weak positive correlation in each case. In other words, states with a higher share of multicar accidents tend, on average, to have slightly higher average fatalities and yearly fatalities per capita. While the positive correlation is discernible, the relationship is not particularly strong, indicating that other factors likely contribute significantly to variations in fatality rates. Thus, the percentage of multicar accidents within a state is only weakly related to its fatality rate. Further analysis of additional factors is necessary to understand what influences fatalities on the road.

### People

In this section we want to investigate how the number of people inside and outside vehicles in a fatal crash relates to the number of fatalities. We hypothesize that the number of people involved is positively correlated with the number of deaths in a fatal accident, because more people involved means more people who could die.

First, let's evaluate the average number of people in moving vehicles across all fatal accidents:

In [None]:
%%bigquery --project $project_id
SELECT AVG(number_of_persons_in_motor_vehicles_in_transport_mvit) AS avg_people_in_vehicles
FROM `traffic_fatalities.traffic_features`

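Here we see that the average number of people in moving vehicles is 2.24, which suggests that many fatal accidents involve more than one vehicle occupant. The mean alone does not establish that multiperson accidents are the majority, so we check that share directly further below.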
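
A mean above 2 can come from a few high-occupancy crashes rather than from most crashes being multiperson. The sketch below uses made-up occupant counts (not values from the dataset) and a hypothetical `multiperson_share` helper to show the two measures diverging:

```python
def multiperson_share(people_per_crash):
    """Fraction of crashes that involve more than one person."""
    multi = sum(1 for n in people_per_crash if n > 1)
    return multi / len(people_per_crash)

# Hypothetical occupant counts (NOT from the dataset): six single-occupant
# crashes and one crash with 10 occupants.
people = [1, 1, 1, 1, 1, 1, 10]

mean_people = sum(people) / len(people)
share = multiperson_share(people)
print(f"mean = {mean_people:.2f}, multiperson share = {share:.2%}")
```

Here the mean is above 2 even though only one crash in seven is multiperson, which is why the share is worth computing separately.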

Next, observe the number of persons not in moving vehicles involved in fatal crashes:

In [None]:
%%bigquery --project $project_id
SELECT AVG(number_of_persons_not_in_motor_vehicles_in_transport_mvit) AS avg_people_not_in_vehicles
FROM `traffic_fatalities.traffic_features`

Combining these two terms yields the average number of people per fatal accident:

In [None]:
%%bigquery --project $project_id
SELECT AVG(num_people) AS avg_people_per_crash
FROM `traffic_fatalities.traffic_features`

An average number of persons per crash above 2 is consistent with our hypothesis, since it suggests that fatal accidents often involve more than one person.

Next, let's categorize each accident as either multiperson or not multiperson to understand how multiperson accidents affect the fatality rate:

In [None]:
%%bigquery --project $project_id
WITH multiperson AS (
    SELECT IF(num_people > 1, 1, 0) AS multiperson, id
    FROM `traffic_fatalities.traffic_features`
)
SELECT m.multiperson, COUNT(*)/(SELECT COUNT(*) FROM `traffic_fatalities.traffic_features`) AS percent, COUNT(*) AS count, AVG(f.number_of_fatalities) AS avg_fatalities
FROM `traffic_fatalities.traffic_features` AS f
JOIN multiperson AS m ON m.id = f.id
GROUP BY multiperson

We find that substantially more than half of fatal accidents are multiperson, which confirms the inference we drew from the average number of persons involved: the majority of fatal accidents involve more than one person. Additionally, the multiperson accidents have a higher average number of fatalities, which supports our hypothesis. Note that non-multiperson accidents have an average of exactly one fatality: every accident here is a fatal accident, and an accident involving only one person can have at most one death.

Now, breaking this trend down by state, we find:

In [None]:
%%bigquery --project $project_id
WITH multiperson AS (
    SELECT IF(num_people > 1, 1, 0) AS multiperson, id
    FROM `traffic_fatalities.traffic_features`
), state_accidents AS (
    SELECT state_name, COUNT(*) AS total_accidents
    FROM `traffic_fatalities.traffic_features` 
    GROUP BY state_name
)
SELECT COUNT(*)/t.total_accidents AS percent, COUNT(*) AS count, m.multiperson, AVG(f.number_of_fatalities) AS avg_fatalities, f.state_name, s.state
FROM `traffic_fatalities.traffic_features` AS f
JOIN multiperson AS m ON m.id = f.id
JOIN state_accidents AS t ON t.state_name = f.state_name
JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON f.state_name = s.state_name
GROUP BY multiperson, state_name, state, total_accidents
ORDER BY state_name

Since non-multiperson accidents cannot average more than one fatality, multiperson accidents will have an average at least as high in every state. The comparison therefore upholds our hypothesis that the number of people involved is positively correlated with the number of deaths, although partly by construction.

Graphing the multiperson rate across states yields:

In [None]:
query_multiperson_per_state = """   WITH multiperson AS (
                                        SELECT IF(num_people > 1, 1, 0) AS multiperson, id
                                        FROM `traffic_fatalities.traffic_features`
                                    ), state_accidents AS (
                                        SELECT state_name, COUNT(*) AS total_accidents
                                        FROM `traffic_fatalities.traffic_features` 
                                        GROUP BY state_name
                                    )
                                    SELECT COUNT(*)/t.total_accidents AS percent, COUNT(*) AS count, m.multiperson, AVG(f.number_of_fatalities) AS avg_fatalities, f.state_name, s.state
                                    FROM `traffic_fatalities.traffic_features` AS f
                                    JOIN multiperson AS m ON m.id = f.id
                                    JOIN state_accidents AS t ON t.state_name = f.state_name
                                    JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON f.state_name = s.state_name
                                    WHERE multiperson = 1
                                    GROUP BY multiperson, state_name, state, total_accidents"""
multiperson_per_state = pd.read_gbq(query_multiperson_per_state, project_id=project_id, dialect='standard')
fig_multiperson_per_state = px.choropleth(
    multiperson_per_state,
    locations='state',
    locationmode='USA-states',
    color='percent',
    color_continuous_scale = 'Reds',
)

fig_multiperson_per_state.update_layout(
    title_text = 'Percent Multiperson Accidents 2015-2020',
    geo_scope='usa', # limit map scope to USA
)

The percentage of multiperson accidents varies greatly from state to state, ranging from 59.80% (Maine) to 82.93% (District of Columbia). This indicates substantial disparities in driving behavior across states. Interestingly, states with large populations such as California (80%), Texas (76%), New York (77%), and Florida (81%) tend to have higher percentages of multiperson accidents, which would suggest that the percentage of multiperson accidents is correlated with population size. However, as stated above, the District of Columbia has the highest percentage of multiperson accidents despite having one of the smallest populations in the United States. Another possible explanation is that states with large metropolitan areas have high rates of multiperson accidents, perhaps because of the concentration of people during commutes or inefficient traffic patterns.

In [None]:
plt.scatter(multiperson_per_state["avg_fatalities"], multiperson_per_state["percent"])
plt.ylabel("Percent Multiperson")
plt.xlabel("Average Fatalities")
plt.title("Percent Multiperson vs. Fatalities Per Accident")

In [None]:
plt.scatter(percent_fatalities_per_state["yearly_fatalities_per_capita"], multiperson_per_state["percent"])
plt.ylabel("Percent Multiperson")
plt.xlabel("Fatalities Per Capita")
plt.title("Percent Multiperson vs. Fatalities Per Capita")

Plotting percent multiperson accidents against both average state fatalities and yearly fatalities per capita shows only a weak, negative correlation in each case. In other words, neither state average fatalities nor yearly fatalities per capita is a good predictor of percent multiperson. As noted earlier, we hypothesized that the percentage of multiperson accidents could instead be related to state population.

In [None]:
plt.scatter(pop["total_pop"], multiperson_per_state["percent"])
plt.ylabel("Percent Multiperson")
plt.xlabel("Population")
plt.title("Percent Multiperson vs. Population")

The plot of percent multiperson accidents against state population shows a weak, positive correlation: as population increases, the percentage of multiperson accidents within a state tends to rise slightly. Because the relationship is not strong, other factors likely contribute significantly to the variation in percent multiperson. Additionally, given the weak correlation between percent multiperson and both average state fatalities and yearly fatalities per capita, state fatality rates do not appear to be affected by the percentage of multiperson accidents.
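
One caveat on these scatter plots: they pair columns from separately queried tables by row position, which is only valid if both tables list the states in the same order, and a query without `ORDER BY` does not guarantee any order. The sketch below shows the idea of aligning on the state key first. The `align_by_state` helper and all values are hypothetical; with DataFrames, merging on a shared state column would play the same role.

```python
def align_by_state(x_by_state, y_by_state):
    """Return (states, xs, ys) restricted to states present in both mappings."""
    states = sorted(set(x_by_state) & set(y_by_state))
    xs = [x_by_state[s] for s in states]
    ys = [y_by_state[s] for s in states]
    return states, xs, ys

# Hypothetical values (NOT from the dataset), deliberately in different orders
# and with one state missing from the population table.
population = {"TX": 29_000_000, "CA": 39_000_000, "ME": 1_300_000}
percent_multiperson = {"ME": 0.60, "CA": 0.80, "TX": 0.76, "DC": 0.83}

states, xs, ys = align_by_state(population, percent_multiperson)
print(states)  # xs[i] and ys[i] now refer to the same state
```

Passing `xs` and `ys` to `plt.scatter` then guarantees that each point combines two values from the same state.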

### Day of Week

Now we want to determine whether the occurrence of fatal accidents differs from day to day. Does the day of the week affect the fatality rate? We suspect that weekends, when there are more people on the roads, will increase the fatality rate.

First, let's view the distribution of fatalities by day of week across the United States:

In [None]:
%%bigquery --project $project_id

SELECT SUM(number_of_fatalities) AS num_fatalities, SUM(number_of_fatalities)/(SELECT SUM(number_of_fatalities) FROM `traffic_fatalities.traffic_features`) * 100 AS percent_fatalities, day_of_week_name
FROM `traffic_fatalities.traffic_features`
GROUP BY day_of_week_name
ORDER BY num_fatalities DESC

Interestingly, Friday, Saturday, and Sunday have a noticeably higher number of fatalities than the rest of the week. However, the higher totals may simply reflect a larger number of drivers on the road on those days.
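
When these per-day totals are plotted, each bar has to carry the right weekday label. Pairing a hard-coded label list with query rows by position works only if every day is present and the rows arrive in Monday-to-Sunday order. A minimal sketch of a more defensive approach, using made-up totals and a hypothetical `order_by_day` helper, derives the labels from the day names in the rows themselves:

```python
DAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def order_by_day(rows):
    """Sort (day_name, value) pairs Monday..Sunday; unknown names raise KeyError."""
    rank = {day: i for i, day in enumerate(DAY_ORDER)}
    return sorted(rows, key=lambda row: rank[row[0]])

# Hypothetical query output (NOT real counts), in arbitrary order
# and with some days missing.
rows = [("Saturday", 30), ("Monday", 10), ("Friday", 25), ("Sunday", 28)]

ordered = order_by_day(rows)
labels = [day[:3] for day, _ in ordered]   # "Mon", "Fri", ...
values = [value for _, value in ordered]
print(labels, values)
```

The resulting `labels` and `values` lists can be handed to `plt.bar` together, so a missing or reordered day cannot shift the labels.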

Graphing the distribution we find:

In [None]:
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

In [None]:
query_days_num_fatalities = """ SELECT SUM(number_of_fatalities) AS num_fatalities, SUM(number_of_fatalities)/(SELECT SUM(number_of_fatalities) FROM `traffic_fatalities.traffic_features`) * 100 AS percent_fatalities, day_of_week_name
                                    FROM `traffic_fatalities.traffic_features`
                                    GROUP BY day_of_week_name
                                    ORDER BY 
                                        CASE 
                                            WHEN day_of_week_name = 'Monday' THEN 0 
                                            WHEN day_of_week_name = 'Tuesday' THEN 1 
                                            WHEN day_of_week_name = 'Wednesday' THEN 2
                                            WHEN day_of_week_name = 'Thursday' THEN 3
                                            WHEN day_of_week_name = 'Friday' THEN 4
                                            WHEN day_of_week_name = 'Saturday' THEN 5
                                            WHEN day_of_week_name = 'Sunday' THEN 6
                                        END"""

days_num_fatalities = pd.read_gbq(query_days_num_fatalities, project_id=project_id, dialect='standard')

plt.bar(days, days_num_fatalities["num_fatalities"])
plt.title("Number of Fatalities by Day of Week")
plt.xlabel("Day of Week")
plt.ylabel("Number of Fatalities")

By determining which day of the week has the highest proportion of fatalities in each state, we can determine whether the day of the week affects state fatality rates or whether there are regional variances.

In [None]:
%%bigquery --project $project_id

WITH state_weekdays_fatalities AS (
    SELECT SUM(number_of_fatalities) AS num_fatalities, state_name, day_of_week_name, RANK() OVER (PARTITION BY state_name ORDER BY SUM(number_of_fatalities) DESC) AS rank
    FROM `traffic_fatalities.traffic_features`
    GROUP BY state_name, day_of_week_name
), state_fatalities AS (
    SELECT state_name, SUM(number_of_fatalities) AS total_fatalities
    FROM `traffic_fatalities.traffic_features` 
    GROUP BY state_name
)
SELECT w.num_fatalities, t.total_fatalities AS total_fatalities, w.num_fatalities/t.total_fatalities * 100 AS proportion_fatalities, w.day_of_week_name, w.state_name, s.state
FROM state_weekdays_fatalities AS w
JOIN state_fatalities AS t ON w.state_name = t.state_name
JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON w.state_name = s.state_name
WHERE rank = 1
ORDER BY state_name

As shown above, most states have Saturday as the day of the week with the highest proportion of fatalities. This matches the nationwide distribution we evaluated before. Additionally, the proportion of fatalities accounted for by the top day within each state is similar to that day's share nationwide. This lack of regional variance shows that the day of the week with the most fatalities does not vary much by location.
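
A claim such as "most states have Saturday" can be checked by tallying the top day across states. The sketch below does this with `collections.Counter` on a small, made-up mapping; the state-to-day pairs are placeholders, not the actual query output.

```python
from collections import Counter

# Hypothetical (state -> top day) pairs, NOT the actual query output.
top_day_by_state = {
    "AL": "Saturday", "AK": "Sunday", "AZ": "Saturday",
    "AR": "Saturday", "CA": "Sunday", "CO": "Friday",
}

counts = Counter(top_day_by_state.values())
for day, n in counts.most_common():
    print(f"{day}: {n} states ({n / len(top_day_by_state):.0%})")
```

Applied to the `day_of_week_name` column of the query result, the same tally gives the share of states for each top day.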

Graphing this we see:

In [None]:
days_of_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Create a color set using the 'Pastel1' color scale
days_of_week_color_set = px.colors.qualitative.Pastel1[:len(days_of_week)]

# Create a dictionary mapping days of the week to colors
days_of_week_color_dict = dict(zip(days_of_week, days_of_week_color_set))


In [None]:
query_weekday_num_fatalities_state = """WITH state_weekdays_fatalities AS (
                                                SELECT SUM(number_of_fatalities) AS num_fatalities, state_name, day_of_week_name, RANK() OVER (PARTITION BY state_name ORDER BY SUM(number_of_fatalities) DESC) AS rank
                                                FROM `traffic_fatalities.traffic_features`
                                                GROUP BY state_name, day_of_week_name
                                            ), state_fatalities AS (
                                                SELECT state_name, SUM(number_of_fatalities) AS total_fatalities
                                                FROM `traffic_fatalities.traffic_features` 
                                                GROUP BY state_name
                                            )
                                            SELECT w.num_fatalities, t.total_fatalities AS total_fatalities, w.num_fatalities/t.total_fatalities * 100 AS proportion_fatalities, w.day_of_week_name, w.state_name, s.state
                                            FROM state_weekdays_fatalities AS w
                                            JOIN state_fatalities AS t ON w.state_name = t.state_name
                                            JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON w.state_name = s.state_name
                                            WHERE rank = 1
                                            ORDER BY
                                                CASE 
                                                    WHEN day_of_week_name = 'Monday' THEN 0 
                                                    WHEN day_of_week_name = 'Tuesday' THEN 1 
                                                    WHEN day_of_week_name = 'Wednesday' THEN 2
                                                    WHEN day_of_week_name = 'Thursday' THEN 3
                                                    WHEN day_of_week_name = 'Friday' THEN 4
                                                    WHEN day_of_week_name = 'Saturday' THEN 5
                                                    WHEN day_of_week_name = 'Sunday' THEN 6
                                                END"""
weekday_num_fatalities_state = pd.read_gbq(query_weekday_num_fatalities_state, project_id=project_id, dialect='standard')
fig_weekday_num_fatalities_state = px.choropleth(
    weekday_num_fatalities_state,
    locations='state',
    locationmode='USA-states',
    color='day_of_week_name',
    color_discrete_map = days_of_week_color_dict,
)

fig_weekday_num_fatalities_state.update_layout(
    title_text = 'Day of Week with Highest Number of Fatalities 2015-2020',
    geo_scope='usa', # limit map scope to USA
)

As mentioned earlier, although Friday and the weekend have a higher number of fatalities, this could be due to an increased number of drivers on those days. Normalizing by the number of accidents, that is, taking the average number of fatalities per accident for each day of the week, largely removes the effect of fluctuations in the number of drivers.

In [None]:
%%bigquery --project $project_id

SELECT AVG(number_of_fatalities) as avg_fatalities, day_of_week_name
FROM `traffic_fatalities.traffic_features`
GROUP BY day_of_week_name
ORDER BY avg_fatalities DESC

Surprisingly, the difference in average fatalities per accident between days of the week is slight and practically negligible. This further weakens our hypothesis that the day of the week affects the fatality rate.

The table in graph form:

In [None]:
query_days_avg_fatalities  = """  SELECT AVG(number_of_fatalities) as avg_fatalities, day_of_week_name
                                        FROM `traffic_fatalities.traffic_features`
                                        GROUP BY day_of_week_name
                                        ORDER BY 
                                            CASE 
                                                WHEN day_of_week_name = 'Monday' THEN 0 
                                                WHEN day_of_week_name = 'Tuesday' THEN 1 
                                                WHEN day_of_week_name = 'Wednesday' THEN 2
                                                WHEN day_of_week_name = 'Thursday' THEN 3
                                                WHEN day_of_week_name = 'Friday' THEN 4
                                                WHEN day_of_week_name = 'Saturday' THEN 5
                                                WHEN day_of_week_name = 'Sunday' THEN 6
                                            END"""

days_avg_fatalities = pd.read_gbq(query_days_avg_fatalities, project_id=project_id, dialect='standard')

plt.bar(days, days_avg_fatalities["avg_fatalities"])
plt.title("Fatalities Per Accident By Day of Week")
plt.xlabel("Day of Week")
plt.ylabel("Fatalities per Accident")

Following the same line of reasoning as before, determining the day of the week with the highest average fatality rate in each state will help us learn whether the day of the week affects state fatality rates and whether there are geographic trends.

In [None]:
%%bigquery --project $project_id

WITH state_weekdays_avg_fatalities AS (
    SELECT AVG(number_of_fatalities) AS avg_fatalities, state_name, day_of_week_name, RANK() OVER (PARTITION BY state_name ORDER BY AVG(number_of_fatalities) DESC) AS rank
    FROM `traffic_fatalities.traffic_features`
    GROUP BY state_name, state_number, day_of_week_name
)
SELECT w.avg_fatalities, w.day_of_week_name, w.state_name, s.state
FROM state_weekdays_avg_fatalities AS w
JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON w.state_name = s.state_name
WHERE rank = 1
ORDER BY state_name

The results of the query show a large variation in the day of the week with the highest average fatality rate by state. Graphing the result will help us determine if there are geographic trends:

In [None]:
query_weekday_avg_fatalities_state = """WITH state_weekdays_avg_fatalities AS (
                                            SELECT AVG(number_of_fatalities) AS avg_fatalities, state_name, day_of_week_name, RANK() OVER (PARTITION BY state_name ORDER BY AVG(number_of_fatalities) DESC) AS rank
                                            FROM `traffic_fatalities.traffic_features`
                                            GROUP BY state_name, state_number, day_of_week_name
                                        )
                                        SELECT w.avg_fatalities, w.day_of_week_name, w.state_name, s.state
                                        FROM state_weekdays_avg_fatalities AS w
                                        JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON w.state_name = s.state_name
                                        WHERE rank = 1
                                        ORDER BY 
                                            CASE 
                                                WHEN day_of_week_name = 'Monday' THEN 0 
                                                WHEN day_of_week_name = 'Tuesday' THEN 1 
                                                WHEN day_of_week_name = 'Wednesday' THEN 2
                                                WHEN day_of_week_name = 'Thursday' THEN 3
                                                WHEN day_of_week_name = 'Friday' THEN 4
                                                WHEN day_of_week_name = 'Saturday' THEN 5
                                                WHEN day_of_week_name = 'Sunday' THEN 6
                                            END"""
weekday_avg_fatalities_state = pd.read_gbq(query_weekday_avg_fatalities_state, project_id=project_id, dialect='standard')
fig_weekday_avg_fatalities_state = px.choropleth(
    weekday_avg_fatalities_state,
    locations='state',
    locationmode='USA-states',
    color='day_of_week_name',
    color_discrete_map = days_of_week_color_dict,
)

fig_weekday_avg_fatalities_state.update_layout(
    title_text = 'Day of Week with Highest Fatalities Per Accident 2015-2020',
    geo_scope='usa', # limit map scope to USA
)

The above plot shows that most states have Fri/Sat/Sun as the day with highest average fatalities per accident. We also see that some states have Wednesday as the day with highest average fatalities and very few have Monday and Tuesday. Surprisingly, not a single state has Thursday as the day of the week with the highest fatality rate. 

For some states, the day of the week does appear to have a noticeable effect on the fatality rate. For example, Alaska's highest rate falls on Monday, at 1.275 fatalities per fatal accident, well above the national Monday rate of 1.08. This tells us that the day of the week may play a role in the fatality rate for some states, although we have not tested whether the difference is statistically significant. Without additional state-level factors such as cultural, social, or policy trends, we cannot say what this difference should be attributed to.

Each state's traffic dynamics likely play a role in shaping the day of the week with the highest average fatality rate. States experience different traffic volumes and densities, which can shift accident rates toward specific days. For instance, some regions see heavier weekend traffic from recreational travel, potentially resulting in higher fatality rates, while others face heightened weekday risk from commuter congestion, which in turn depends on work hours and commuting habits. Demographic factors such as age distribution and driving habits within each state may further shape the deadliest days.

Cultural and regional differences may also contribute. States host distinct events or festivities on particular days, which influence travel behavior and accident probabilities; this may be why some states have Wednesday as the day with the highest average number of fatalities per fatal crash. Enforcement of traffic laws and safety measures also varies, affecting driver conduct and accident rates across different days. Understanding how traffic patterns, culture, enforcement policies, and regional factors interact would be key to explaining why states differ in the day with the highest average number of fatalities per accident, but that would require a much more robust dataset than we have here.
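
Whether a gap such as Alaska's 1.275 versus the national 1.08 is statistically meaningful depends on how many crashes sit behind the state figure and how much they vary. As a rough sketch only, the code below computes a z statistic for a sample mean against a fixed reference mean. The sample is invented (it is not Alaska's data), the `z_score_vs_reference` helper is hypothetical, and a z-test is only an approximation for small samples of count data.

```python
import math
import statistics

def z_score_vs_reference(sample, reference_mean):
    """z statistic for the sample mean against a fixed reference mean."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.mean(sample) - reference_mean) / standard_error

# Hypothetical fatalities per crash for one state on one weekday
# (NOT real rows): 40 crashes, mostly single-fatality.
sample = [1] * 32 + [2] * 6 + [3] * 2
national_monday_mean = 1.08  # national Monday average quoted in the text

z = z_score_vs_reference(sample, national_monday_mean)
print(f"mean = {statistics.mean(sample):.3f}, n = {len(sample)}, z = {z:.2f}")
```

Because the day shown for each state is the maximum over seven days, some upward gap is expected by chance alone, so any threshold applied to `z` should be treated conservatively.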

Despite the lack of additional data for each state, there does not appear to be a geographic pattern in the day of the week with the highest average fatality rate by state: the days appear randomly distributed across the US rather than clustering by region. Therefore, the day of the week does not appear to directly affect state fatality rates.

### Hour
We will now take a deep dive into how fatal accidents vary by hour, and by hour on specific days of the week. We expect that fatal accidents will be more common during the night due to low light conditions and an increased incidence of drunk driving. We also expect to see regional variations in the hours with the highest average fatality rate.

To start, let's analyze the distribution of fatalities by hour:

In [None]:
%%bigquery --project $project_id

SELECT SUM(number_of_fatalities) AS num_fatalities, SUM(number_of_fatalities)/(SELECT SUM(number_of_fatalities) FROM `traffic_fatalities.traffic_features`) * 100 AS proportion_fatalities, hour_of_crash_name 
FROM `traffic_fatalities.traffic_features` 
GROUP BY hour_of_crash_name
ORDER BY num_fatalities DESC

In [None]:
query_hours_fatalities  = """SELECT SUM(number_of_fatalities) AS num_fatalities, SUM(number_of_fatalities)/(SELECT SUM(number_of_fatalities) FROM `traffic_fatalities.traffic_features`) * 100 AS proportion_fatalities, hour_of_crash_name 
FROM `traffic_fatalities.traffic_features` 
GROUP BY hour_of_crash_name
ORDER BY 
    CASE 
        WHEN hour_of_crash_name = '0:00am-0:59am' THEN 0 
        WHEN hour_of_crash_name = '1:00am-1:59am' THEN 1 
        WHEN hour_of_crash_name = '2:00am-2:59am' THEN 2 
        WHEN hour_of_crash_name = '3:00am-3:59am' THEN 3 
        WHEN hour_of_crash_name = '4:00am-4:59am' THEN 4 
        WHEN hour_of_crash_name = '5:00am-5:59am' THEN 5 
        WHEN hour_of_crash_name = '6:00am-6:59am' THEN 6 
        WHEN hour_of_crash_name = '7:00am-7:59am' THEN 7 
        WHEN hour_of_crash_name = '8:00am-8:59am' THEN 8 
        WHEN hour_of_crash_name = '9:00am-9:59am' THEN 9 
        WHEN hour_of_crash_name = '10:00am-10:59am' THEN 10 
        WHEN hour_of_crash_name = '11:00am-11:59am' THEN 11 
        WHEN hour_of_crash_name = '12:00pm-12:59pm' THEN 12 
        WHEN hour_of_crash_name = '1:00pm-1:59pm' THEN 13 
        WHEN hour_of_crash_name = '2:00pm-2:59pm' THEN 14 
        WHEN hour_of_crash_name = '3:00pm-3:59pm' THEN 15 
        WHEN hour_of_crash_name = '4:00pm-4:59pm' THEN 16 
        WHEN hour_of_crash_name = '5:00pm-5:59pm' THEN 17 
        WHEN hour_of_crash_name = '6:00pm-6:59pm' THEN 18 
        WHEN hour_of_crash_name = '7:00pm-7:59pm' THEN 19 
        WHEN hour_of_crash_name = '8:00pm-8:59pm' THEN 20 
        WHEN hour_of_crash_name = '9:00pm-9:59pm' THEN 21 
        WHEN hour_of_crash_name = '10:00pm-10:59pm' THEN 22 
        WHEN hour_of_crash_name = '11:00pm-11:59pm' THEN 23 
    END
DESC"""

hours_fatalities = pd.read_gbq(query_hours_fatalities, project_id=project_id, dialect='standard')

plt.barh(hours_fatalities["hour_of_crash_name"], hours_fatalities["num_fatalities"])
plt.title("Fatalities By Hour")
plt.xlabel("Number of Fatalities")
plt.ylabel("Hour")

The hours between 5:00 PM and 9:59 PM have the highest number of fatalities. These times coincide with increased traffic from rush hour and extracurricular activities. In contrast, the hours from 7:00 AM to 2:59 PM have a notably lower proportion of fatalities, possibly because fewer people are on the road during school and work hours.

The chart below visualizes fatalities by hour and day of week. Here we see that the deadliest time of the week is Sunday morning, which we believe is due to drunk driving. It is important to note, however, that fatalities mostly occur in the afternoon and at night regardless of the day, probably due to the factors mentioned above.
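
Arranging hour-by-day totals chronologically needs a single sort key per (day, hour) pair. One way to get it, sketched below, is to parse the hour label and combine it with the weekday's position instead of enumerating every combination by hand. The helper names are hypothetical, and the label format (`'5:00pm-5:59pm'`) is assumed from the queries in this notebook; a label that does not follow it, such as an unknown hour, would raise a `ValueError`.

```python
DAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def hour_from_label(label):
    """Convert a label such as '5:00pm-5:59pm' to an hour in 0..23."""
    start = label.split("-")[0]          # '5:00pm'
    hour = int(start.split(":")[0])      # 5
    suffix = start[-2:].lower()          # 'pm'
    if suffix == "pm" and hour != 12:
        hour += 12
    elif suffix == "am" and hour == 12:
        hour = 0
    return hour

def week_hour_index(day_name, hour_label):
    """Hours since Monday 0:00 (0..167), usable as a sort key."""
    return DAY_ORDER.index(day_name) * 24 + hour_from_label(hour_label)

print(week_hour_index("Monday", "0:00am-0:59am"))
print(week_hour_index("Sunday", "11:00pm-11:59pm"))
```

Sorting the query result by `week_hour_index(day_of_week_name, hour_of_crash_name)` yields Monday midnight first and Sunday 11 PM last.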

In [None]:
%%bigquery --project $project_id

SELECT SUM(number_of_fatalities) AS num_fatalities, SUM(number_of_fatalities)/(SELECT SUM(number_of_fatalities) FROM `traffic_fatalities.traffic_features`) * 100 AS proportion_fatalities, hour_of_crash_name, day_of_week_name 
FROM `traffic_fatalities.traffic_features` 
GROUP BY hour_of_crash_name, day_of_week_name
ORDER BY num_fatalities DESC

In [None]:
query_hours_days_fatalities  = "SELECT SUM(number_of_fatalities) AS num_fatalities, SUM(number_of_fatalities)/(SELECT SUM(number_of_fatalities) FROM `traffic_fatalities.traffic_features`) * 100 AS proportion_fatalities, hour_of_crash_name, day_of_week_name  FROM `traffic_fatalities.traffic_features` GROUP BY hour_of_crash_name, day_of_week_name ORDER BY CASE WHEN hour_of_crash_name = '0:00am-0:59am' AND day_of_week_name = 'Monday' THEN 0 WHEN hour_of_crash_name = '1:00am-1:59am' AND day_of_week_name = 'Monday' THEN 1 WHEN hour_of_crash_name = '2:00am-2:59am' AND day_of_week_name = 'Monday' THEN 2 WHEN hour_of_crash_name = '3:00am-3:59am' AND day_of_week_name = 'Monday' THEN 3 WHEN hour_of_crash_name = '4:00am-4:59am' AND day_of_week_name = 'Monday' THEN 4 WHEN hour_of_crash_name = '5:00am-5:59am' AND day_of_week_name = 'Monday' THEN 5 WHEN hour_of_crash_name = '6:00am-6:59am' AND day_of_week_name = 'Monday' THEN 6 WHEN hour_of_crash_name = '7:00am-7:59am' AND day_of_week_name = 'Monday' THEN 7 WHEN hour_of_crash_name = '8:00am-8:59am' AND day_of_week_name = 'Monday' THEN 8 WHEN hour_of_crash_name = '9:00am-9:59am' AND day_of_week_name = 'Monday' THEN 9 WHEN hour_of_crash_name = '10:00am-10:59am' AND day_of_week_name = 'Monday' THEN 10 WHEN hour_of_crash_name = '11:00am-11:59am' AND day_of_week_name = 'Monday' THEN 11 WHEN hour_of_crash_name = '12:00pm-12:59pm' AND day_of_week_name = 'Monday' THEN 12 WHEN hour_of_crash_name = '1:00pm-1:59pm' AND day_of_week_name = 'Monday' THEN 13 WHEN hour_of_crash_name = '2:00pm-2:59pm' AND day_of_week_name = 'Monday' THEN 14 WHEN hour_of_crash_name = '3:00pm-3:59pm' AND day_of_week_name = 'Monday' THEN 15 WHEN hour_of_crash_name = '4:00pm-4:59pm' AND day_of_week_name = 'Monday' THEN 16 WHEN hour_of_crash_name = '5:00pm-5:59pm' AND day_of_week_name = 'Monday' THEN 17 WHEN hour_of_crash_name = '6:00pm-6:59pm' AND day_of_week_name = 'Monday' THEN 18 WHEN hour_of_crash_name = '7:00pm-7:59pm' AND day_of_week_name = 'Monday' THEN 19 
WHEN hour_of_crash_name = '8:00pm-8:59pm' AND day_of_week_name = 'Monday' THEN 20 WHEN hour_of_crash_name = '9:00pm-9:59pm' AND day_of_week_name = 'Monday' THEN 21 WHEN hour_of_crash_name = '10:00pm-10:59pm' AND day_of_week_name = 'Monday' THEN 22 WHEN hour_of_crash_name = '11:00pm-11:59pm' AND day_of_week_name = 'Monday' THEN 23 WHEN hour_of_crash_name = '0:00am-0:59am' AND day_of_week_name = 'Tuesday' THEN 0 + 24 WHEN hour_of_crash_name = '1:00am-1:59am' AND day_of_week_name = 'Tuesday' THEN 1 + 24 WHEN hour_of_crash_name = '2:00am-2:59am' AND day_of_week_name = 'Tuesday' THEN 2 + 24 WHEN hour_of_crash_name = '3:00am-3:59am' AND day_of_week_name = 'Tuesday' THEN 3 + 24 WHEN hour_of_crash_name = '4:00am-4:59am' AND day_of_week_name = 'Tuesday' THEN 4 + 24 WHEN hour_of_crash_name = '5:00am-5:59am' AND day_of_week_name = 'Tuesday' THEN 5 + 24 WHEN hour_of_crash_name = '6:00am-6:59am' AND day_of_week_name = 'Tuesday' THEN 6 + 24 WHEN hour_of_crash_name = '7:00am-7:59am' AND day_of_week_name = 'Tuesday' THEN 7 + 24 WHEN hour_of_crash_name = '8:00am-8:59am' AND day_of_week_name = 'Tuesday' THEN 8 + 24 WHEN hour_of_crash_name = '9:00am-9:59am' AND day_of_week_name = 'Tuesday' THEN 9 + 24 WHEN hour_of_crash_name = '10:00am-10:59am' AND day_of_week_name = 'Tuesday' THEN 10 + 24 WHEN hour_of_crash_name = '11:00am-11:59am' AND day_of_week_name = 'Tuesday' THEN 11 + 24 WHEN hour_of_crash_name = '12:00pm-12:59pm' AND day_of_week_name = 'Tuesday' THEN 12 + 24 WHEN hour_of_crash_name = '1:00pm-1:59pm' AND day_of_week_name = 'Tuesday' THEN 13 + 24 WHEN hour_of_crash_name = '2:00pm-2:59pm' AND day_of_week_name = 'Tuesday' THEN 14 + 24 WHEN hour_of_crash_name = '3:00pm-3:59pm' AND day_of_week_name = 'Tuesday' THEN 15 + 24 WHEN hour_of_crash_name = '4:00pm-4:59pm' AND day_of_week_name = 'Tuesday' THEN 16 + 24 WHEN hour_of_crash_name = '5:00pm-5:59pm' AND day_of_week_name = 'Tuesday' THEN 17 + 24 WHEN hour_of_crash_name = '6:00pm-6:59pm' AND day_of_week_name = 'Tuesday' THEN 18 + 24 
WHEN hour_of_crash_name = '7:00pm-7:59pm' AND day_of_week_name = 'Tuesday' THEN 19 + 24 WHEN hour_of_crash_name = '8:00pm-8:59pm' AND day_of_week_name = 'Tuesday' THEN 20 + 24 WHEN hour_of_crash_name = '9:00pm-9:59pm' AND day_of_week_name = 'Tuesday' THEN 21 + 24 WHEN hour_of_crash_name = '10:00pm-10:59pm' AND day_of_week_name = 'Tuesday' THEN 22 + 24 WHEN hour_of_crash_name = '11:00pm-11:59pm' AND day_of_week_name = 'Tuesday' THEN 23 + 24 WHEN hour_of_crash_name = '0:00am-0:59am' AND day_of_week_name = 'Wednesday' THEN 0 + 48 WHEN hour_of_crash_name = '1:00am-1:59am' AND day_of_week_name = 'Wednesday' THEN 1 + 48 WHEN hour_of_crash_name = '2:00am-2:59am' AND day_of_week_name = 'Wednesday' THEN 2 + 48 WHEN hour_of_crash_name = '3:00am-3:59am' AND day_of_week_name = 'Wednesday' THEN 3 + 48 WHEN hour_of_crash_name = '4:00am-4:59am' AND day_of_week_name = 'Wednesday' THEN 4 + 48 WHEN hour_of_crash_name = '5:00am-5:59am' AND day_of_week_name = 'Wednesday' THEN 5 + 48 WHEN hour_of_crash_name = '6:00am-6:59am' AND day_of_week_name = 'Wednesday' THEN 6 + 48 WHEN hour_of_crash_name = '7:00am-7:59am' AND day_of_week_name = 'Wednesday' THEN 7 + 48 WHEN hour_of_crash_name = '8:00am-8:59am' AND day_of_week_name = 'Wednesday' THEN 8 + 48 WHEN hour_of_crash_name = '9:00am-9:59am' AND day_of_week_name = 'Wednesday' THEN 9 + 48 WHEN hour_of_crash_name = '10:00am-10:59am' AND day_of_week_name = 'Wednesday' THEN 10 + 48 WHEN hour_of_crash_name = '11:00am-11:59am' AND day_of_week_name = 'Wednesday' THEN 11 + 48 WHEN hour_of_crash_name = '12:00pm-12:59pm' AND day_of_week_name = 'Wednesday' THEN 12 + 48 WHEN hour_of_crash_name = '1:00pm-1:59pm' AND day_of_week_name = 'Wednesday' THEN 13 + 48 WHEN hour_of_crash_name = '2:00pm-2:59pm' AND day_of_week_name = 'Wednesday' THEN 14 + 48 WHEN hour_of_crash_name = '3:00pm-3:59pm' AND day_of_week_name = 'Wednesday' THEN 15 + 48 WHEN hour_of_crash_name = '4:00pm-4:59pm' AND day_of_week_name = 'Wednesday' THEN 16 + 48 WHEN hour_of_crash_name = 
'5:00pm-5:59pm' AND day_of_week_name = 'Wednesday' THEN 17 + 48 WHEN hour_of_crash_name = '6:00pm-6:59pm' AND day_of_week_name = 'Wednesday' THEN 18 + 48 WHEN hour_of_crash_name = '7:00pm-7:59pm' AND day_of_week_name = 'Wednesday' THEN 19 + 48 WHEN hour_of_crash_name = '8:00pm-8:59pm' AND day_of_week_name = 'Wednesday' THEN 20 + 48 WHEN hour_of_crash_name = '9:00pm-9:59pm' AND day_of_week_name = 'Wednesday' THEN 21 + 48 WHEN hour_of_crash_name = '10:00pm-10:59pm' AND day_of_week_name = 'Wednesday' THEN 22 + 48 WHEN hour_of_crash_name = '11:00pm-11:59pm' AND day_of_week_name = 'Wednesday' THEN 23 + 48 WHEN hour_of_crash_name = '0:00am-0:59am' AND day_of_week_name = 'Thursday' THEN 0 + 72 WHEN hour_of_crash_name = '1:00am-1:59am' AND day_of_week_name = 'Thursday' THEN 1 + 72 WHEN hour_of_crash_name = '2:00am-2:59am' AND day_of_week_name = 'Thursday' THEN 2 + 72 WHEN hour_of_crash_name = '3:00am-3:59am' AND day_of_week_name = 'Thursday' THEN 3 + 72 WHEN hour_of_crash_name = '4:00am-4:59am' AND day_of_week_name = 'Thursday' THEN 4 + 72 WHEN hour_of_crash_name = '5:00am-5:59am' AND day_of_week_name = 'Thursday' THEN 5 + 72 WHEN hour_of_crash_name = '6:00am-6:59am' AND day_of_week_name = 'Thursday' THEN 6 + 72 WHEN hour_of_crash_name = '7:00am-7:59am' AND day_of_week_name = 'Thursday' THEN 7 + 72 WHEN hour_of_crash_name = '8:00am-8:59am' AND day_of_week_name = 'Thursday' THEN 8 + 72 WHEN hour_of_crash_name = '9:00am-9:59am' AND day_of_week_name = 'Thursday' THEN 9 + 72 WHEN hour_of_crash_name = '10:00am-10:59am' AND day_of_week_name = 'Thursday' THEN 10 + 72 WHEN hour_of_crash_name = '11:00am-11:59am' AND day_of_week_name = 'Thursday' THEN 11 + 72 WHEN hour_of_crash_name = '12:00pm-12:59pm' AND day_of_week_name = 'Thursday' THEN 12 + 72 WHEN hour_of_crash_name = '1:00pm-1:59pm' AND day_of_week_name = 'Thursday' THEN 13 + 72 WHEN hour_of_crash_name = '2:00pm-2:59pm' AND day_of_week_name = 'Thursday' THEN 14 + 72 WHEN hour_of_crash_name = '3:00pm-3:59pm' AND 
day_of_week_name = 'Thursday' THEN 15 + 72 WHEN hour_of_crash_name = '4:00pm-4:59pm' AND day_of_week_name = 'Thursday' THEN 16 + 72 WHEN hour_of_crash_name = '5:00pm-5:59pm' AND day_of_week_name = 'Thursday' THEN 17 + 72 WHEN hour_of_crash_name = '6:00pm-6:59pm' AND day_of_week_name = 'Thursday' THEN 18 + 72 WHEN hour_of_crash_name = '7:00pm-7:59pm' AND day_of_week_name = 'Thursday' THEN 19 + 72 WHEN hour_of_crash_name = '8:00pm-8:59pm' AND day_of_week_name = 'Thursday' THEN 20 + 72 WHEN hour_of_crash_name = '9:00pm-9:59pm' AND day_of_week_name = 'Thursday' THEN 21 + 72 WHEN hour_of_crash_name = '10:00pm-10:59pm' AND day_of_week_name = 'Thursday' THEN 22 + 72 WHEN hour_of_crash_name = '11:00pm-11:59pm' AND day_of_week_name = 'Thursday' THEN 23 + 72 WHEN hour_of_crash_name = '0:00am-0:59am' AND day_of_week_name = 'Friday' THEN 0 + 96 WHEN hour_of_crash_name = '1:00am-1:59am' AND day_of_week_name = 'Friday' THEN 1 + 96 WHEN hour_of_crash_name = '2:00am-2:59am' AND day_of_week_name = 'Friday' THEN 2 + 96 WHEN hour_of_crash_name = '3:00am-3:59am' AND day_of_week_name = 'Friday' THEN 3 + 96 WHEN hour_of_crash_name = '4:00am-4:59am' AND day_of_week_name = 'Friday' THEN 4 + 96 WHEN hour_of_crash_name = '5:00am-5:59am' AND day_of_week_name = 'Friday' THEN 5 + 96 WHEN hour_of_crash_name = '6:00am-6:59am' AND day_of_week_name = 'Friday' THEN 6 + 96 WHEN hour_of_crash_name = '7:00am-7:59am' AND day_of_week_name = 'Friday' THEN 7 + 96 WHEN hour_of_crash_name = '8:00am-8:59am' AND day_of_week_name = 'Friday' THEN 8 + 96 WHEN hour_of_crash_name = '9:00am-9:59am' AND day_of_week_name = 'Friday' THEN 9 + 96 WHEN hour_of_crash_name = '10:00am-10:59am' AND day_of_week_name = 'Friday' THEN 10 + 96 WHEN hour_of_crash_name = '11:00am-11:59am' AND day_of_week_name = 'Friday' THEN 11 + 96 WHEN hour_of_crash_name = '12:00pm-12:59pm' AND day_of_week_name = 'Friday' THEN 12 + 96 WHEN hour_of_crash_name = '1:00pm-1:59pm' AND day_of_week_name = 'Friday' THEN 13 + 96 WHEN hour_of_crash_name = 
'2:00pm-2:59pm' AND day_of_week_name = 'Friday' THEN 14 + 96 WHEN hour_of_crash_name = '3:00pm-3:59pm' AND day_of_week_name = 'Friday' THEN 15 + 96 WHEN hour_of_crash_name = '4:00pm-4:59pm' AND day_of_week_name = 'Friday' THEN 16 + 96 WHEN hour_of_crash_name = '5:00pm-5:59pm' AND day_of_week_name = 'Friday' THEN 17 + 96 WHEN hour_of_crash_name = '6:00pm-6:59pm' AND day_of_week_name = 'Friday' THEN 18 + 96 WHEN hour_of_crash_name = '7:00pm-7:59pm' AND day_of_week_name = 'Friday' THEN 19 + 96 WHEN hour_of_crash_name = '8:00pm-8:59pm' AND day_of_week_name = 'Friday' THEN 20 + 96 WHEN hour_of_crash_name = '9:00pm-9:59pm' AND day_of_week_name = 'Friday' THEN 21 + 96 WHEN hour_of_crash_name = '10:00pm-10:59pm' AND day_of_week_name = 'Friday' THEN 22 + 96 WHEN hour_of_crash_name = '11:00pm-11:59pm' AND day_of_week_name = 'Friday' THEN 23 + 96 WHEN hour_of_crash_name = '0:00am-0:59am' AND day_of_week_name = 'Saturday' THEN 0 + 120 WHEN hour_of_crash_name = '1:00am-1:59am' AND day_of_week_name = 'Saturday' THEN 1 + 120 WHEN hour_of_crash_name = '2:00am-2:59am' AND day_of_week_name = 'Saturday' THEN 2 + 120 WHEN hour_of_crash_name = '3:00am-3:59am' AND day_of_week_name = 'Saturday' THEN 3 + 120 WHEN hour_of_crash_name = '4:00am-4:59am' AND day_of_week_name = 'Saturday' THEN 4 + 120 WHEN hour_of_crash_name = '5:00am-5:59am' AND day_of_week_name = 'Saturday' THEN 5 + 120 WHEN hour_of_crash_name = '6:00am-6:59am' AND day_of_week_name = 'Saturday' THEN 6 + 120 WHEN hour_of_crash_name = '7:00am-7:59am' AND day_of_week_name = 'Saturday' THEN 7 + 120 WHEN hour_of_crash_name = '8:00am-8:59am' AND day_of_week_name = 'Saturday' THEN 8 + 120 WHEN hour_of_crash_name = '9:00am-9:59am' AND day_of_week_name = 'Saturday' THEN 9 + 120 WHEN hour_of_crash_name = '10:00am-10:59am' AND day_of_week_name = 'Saturday' THEN 10 + 120 WHEN hour_of_crash_name = '11:00am-11:59am' AND day_of_week_name = 'Saturday' THEN 11 + 120 WHEN hour_of_crash_name = '12:00pm-12:59pm' AND day_of_week_name = 'Saturday' 
THEN 12 + 120 WHEN hour_of_crash_name = '1:00pm-1:59pm' AND day_of_week_name = 'Saturday' THEN 13 + 120 WHEN hour_of_crash_name = '2:00pm-2:59pm' AND day_of_week_name = 'Saturday' THEN 14 + 120 WHEN hour_of_crash_name = '3:00pm-3:59pm' AND day_of_week_name = 'Saturday' THEN 15 + 120 WHEN hour_of_crash_name = '4:00pm-4:59pm' AND day_of_week_name = 'Saturday' THEN 16 + 120 WHEN hour_of_crash_name = '5:00pm-5:59pm' AND day_of_week_name = 'Saturday' THEN 17 + 120 WHEN hour_of_crash_name = '6:00pm-6:59pm' AND day_of_week_name = 'Saturday' THEN 18 + 120 WHEN hour_of_crash_name = '7:00pm-7:59pm' AND day_of_week_name = 'Saturday' THEN 19 + 120 WHEN hour_of_crash_name = '8:00pm-8:59pm' AND day_of_week_name = 'Saturday' THEN 20 + 120 WHEN hour_of_crash_name = '9:00pm-9:59pm' AND day_of_week_name = 'Saturday' THEN 21 + 120 WHEN hour_of_crash_name = '10:00pm-10:59pm' AND day_of_week_name = 'Saturday' THEN 22 + 120 WHEN hour_of_crash_name = '11:00pm-11:59pm' AND day_of_week_name = 'Saturday' THEN 23 + 120 WHEN hour_of_crash_name = '0:00am-0:59am' AND day_of_week_name = 'Sunday' THEN 0 + 144 WHEN hour_of_crash_name = '1:00am-1:59am' AND day_of_week_name = 'Sunday' THEN 1 + 144 WHEN hour_of_crash_name = '2:00am-2:59am' AND day_of_week_name = 'Sunday' THEN 2 + 144 WHEN hour_of_crash_name = '3:00am-3:59am' AND day_of_week_name = 'Sunday' THEN 3 + 144 WHEN hour_of_crash_name = '4:00am-4:59am' AND day_of_week_name = 'Sunday' THEN 4 + 144 WHEN hour_of_crash_name = '5:00am-5:59am' AND day_of_week_name = 'Sunday' THEN 5 + 144 WHEN hour_of_crash_name = '6:00am-6:59am' AND day_of_week_name = 'Sunday' THEN 6 + 144 WHEN hour_of_crash_name = '7:00am-7:59am' AND day_of_week_name = 'Sunday' THEN 7 + 144 WHEN hour_of_crash_name = '8:00am-8:59am' AND day_of_week_name = 'Sunday' THEN 8 + 144 WHEN hour_of_crash_name = '9:00am-9:59am' AND day_of_week_name = 'Sunday' THEN 9 + 144 WHEN hour_of_crash_name = '10:00am-10:59am' AND day_of_week_name = 'Sunday' THEN 10 + 144 WHEN hour_of_crash_name = 
'11:00am-11:59am' AND day_of_week_name = 'Sunday' THEN 11 + 144 WHEN hour_of_crash_name = '12:00pm-12:59pm' AND day_of_week_name = 'Sunday' THEN 12 + 144 WHEN hour_of_crash_name = '1:00pm-1:59pm' AND day_of_week_name = 'Sunday' THEN 13 + 144 WHEN hour_of_crash_name = '2:00pm-2:59pm' AND day_of_week_name = 'Sunday' THEN 14 + 144 WHEN hour_of_crash_name = '3:00pm-3:59pm' AND day_of_week_name = 'Sunday' THEN 15 + 144 WHEN hour_of_crash_name = '4:00pm-4:59pm' AND day_of_week_name = 'Sunday' THEN 16 + 144 WHEN hour_of_crash_name = '5:00pm-5:59pm' AND day_of_week_name = 'Sunday' THEN 17 + 144 WHEN hour_of_crash_name = '6:00pm-6:59pm' AND day_of_week_name = 'Sunday' THEN 18 + 144 WHEN hour_of_crash_name = '7:00pm-7:59pm' AND day_of_week_name = 'Sunday' THEN 19 + 144 WHEN hour_of_crash_name = '8:00pm-8:59pm' AND day_of_week_name = 'Sunday' THEN 20 + 144 WHEN hour_of_crash_name = '9:00pm-9:59pm' AND day_of_week_name = 'Sunday' THEN 21 + 144 WHEN hour_of_crash_name = '10:00pm-10:59pm' AND day_of_week_name = 'Sunday' THEN 22 + 144 WHEN hour_of_crash_name = '11:00pm-11:59pm' AND day_of_week_name = 'Sunday' THEN 23 + 144 END DESC"

hours_days_fatalities = pd.read_gbq(query_hours_days_fatalities, project_id=project_id, dialect='standard')

plt.barh(range(0, 168), hours_days_fatalities["num_fatalities"])
plt.yticks(ticks=[0, 12, 24, 36, 48, 60, 72, 84, 96, 108, 120, 132, 144, 156, 168], 
           labels=["11:59pm", "Sunday", "12am", "Saturday", "12am", "Friday", "12am", "Thursday", "12am", "Wednesday", "12am", "Tuesday", "12am", "Monday", "12am"])
plt.title("Number of Fatalities By Day and Hour")
plt.xlabel("Frequency")
plt.ylabel("Time")
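
The long `CASE` expression in the query above maps each (day, hour label) pair to a position from 0 to 167, with Monday first. As a rough sketch of the same mapping in Python, assuming every label follows the `'H:00am-H:59am'` pattern seen in the query, the index could instead be computed after the data is loaded. `hour_from_label` and `hour_of_week` are hypothetical helper names, not part of the original notebook.

```python
import re

# Day order assumed to match the offsets in the query (Monday = +0, ..., Sunday = +144).
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def hour_from_label(label):
    """Convert a label such as '1:00pm-1:59pm' to a 24-hour integer (13)."""
    match = re.match(r"(\d{1,2}):00(am|pm)", label)
    if match is None:
        raise ValueError(f"unrecognised hour label: {label!r}")
    hour, meridiem = int(match.group(1)), match.group(2)
    if meridiem == "pm" and hour != 12:
        hour += 12          # 1pm..11pm -> 13..23
    if meridiem == "am" and hour == 12:
        hour = 0            # defensive: treat a '12:00am' label as midnight
    return hour

def hour_of_week(day_name, hour_label):
    """Position of the (day, hour) pair within the week, from 0 to 167."""
    return DAYS.index(day_name) * 24 + hour_from_label(hour_label)

print(hour_of_week("Tuesday", "0:00am-0:59am"))
```

Any label that does not match the assumed pattern raises a `ValueError`, so such rows would need to be filtered or handled separately before sorting on this index.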

Again, we isolate by state:

In [None]:
%%bigquery --project $project_id

WITH state_hours_fatalities AS (
    SELECT SUM(number_of_fatalities) AS num_fatalities, state_name, hour_of_crash_name, RANK() OVER (PARTITION BY state_name ORDER BY SUM(number_of_fatalities) DESC) AS rank 
    FROM `traffic_fatalities.traffic_features` 
    GROUP BY state_name, hour_of_crash_name 
), state_fatalities AS (
    SELECT state_name, SUM(number_of_fatalities) AS total_fatalities_per_state 
    FROM `traffic_fatalities.traffic_features` 
    GROUP BY state_name
)   
SELECT w.num_fatalities, t.total_fatalities_per_state AS total_fatalities, w.num_fatalities / t.total_fatalities_per_state * 100 AS proportion_fatalities, w.hour_of_crash_name, w.state_name, s.state 
FROM state_hours_fatalities AS w 
JOIN  state_fatalities AS t ON w.state_name = t.state_name
JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON w.state_name = s.state_name 
WHERE rank = 1
ORDER BY state_name

In [None]:
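hours = range(24)
# Pastel1 has only 9 colors; Light24 gives each of the 24 hours its own color
hours_in_day_color_set = px.colors.qualitative.Light24[:len(hours)]

# Key the map by the hour_of_crash_name labels ('0:00am-0:59am', ..., '11:00pm-11:59pm') so it matches the color column
hour_labels = [f"{c}:00{m}-{c}:59{m}" for h in hours for c, m in [(h if h <= 12 else h - 12, "am" if h < 12 else "pm")]]
hours_in_day_color_dict = dict(zip(hour_labels, hours_in_day_color_set))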

In [None]:

query_hours_fatalities_state = """WITH state_hours_fatalities AS (
    SELECT SUM(number_of_fatalities) AS num_fatalities, state_name, hour_of_crash_name, RANK() OVER (PARTITION BY state_name ORDER BY SUM(number_of_fatalities) DESC) AS rank 
    FROM `traffic_fatalities.traffic_features` 
    GROUP BY state_name, hour_of_crash_name 
), state_fatalities AS (
    SELECT state_name, SUM(number_of_fatalities) AS total_fatalities_per_state 
    FROM `traffic_fatalities.traffic_features` 
    GROUP BY state_name
)   
SELECT w.num_fatalities, t.total_fatalities_per_state AS total_fatalities, w.num_fatalities / t.total_fatalities_per_state * 100 AS proportion_fatalities, w.hour_of_crash_name, w.state_name, s.state 
FROM state_hours_fatalities AS w 
JOIN  state_fatalities AS t ON w.state_name = t.state_name
JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON w.state_name = s.state_name 
WHERE rank = 1
ORDER BY 
    CASE 
        WHEN hour_of_crash_name = '0:00am-0:59am' THEN 0 
        WHEN hour_of_crash_name = '1:00am-1:59am' THEN 1 
        WHEN hour_of_crash_name = '2:00am-2:59am' THEN 2 
        WHEN hour_of_crash_name = '3:00am-3:59am' THEN 3 
        WHEN hour_of_crash_name = '4:00am-4:59am' THEN 4 
        WHEN hour_of_crash_name = '5:00am-5:59am' THEN 5 
        WHEN hour_of_crash_name = '6:00am-6:59am' THEN 6 
        WHEN hour_of_crash_name = '7:00am-7:59am' THEN 7 
        WHEN hour_of_crash_name = '8:00am-8:59am' THEN 8 
        WHEN hour_of_crash_name = '9:00am-9:59am' THEN 9 
        WHEN hour_of_crash_name = '10:00am-10:59am' THEN 10 
        WHEN hour_of_crash_name = '11:00am-11:59am' THEN 11 
        WHEN hour_of_crash_name = '12:00pm-12:59pm' THEN 12 
        WHEN hour_of_crash_name = '1:00pm-1:59pm' THEN 13 
        WHEN hour_of_crash_name = '2:00pm-2:59pm' THEN 14 
        WHEN hour_of_crash_name = '3:00pm-3:59pm' THEN 15 
        WHEN hour_of_crash_name = '4:00pm-4:59pm' THEN 16 
        WHEN hour_of_crash_name = '5:00pm-5:59pm' THEN 17 
        WHEN hour_of_crash_name = '6:00pm-6:59pm' THEN 18 
        WHEN hour_of_crash_name = '7:00pm-7:59pm' THEN 19 
        WHEN hour_of_crash_name = '8:00pm-8:59pm' THEN 20 
        WHEN hour_of_crash_name = '9:00pm-9:59pm' THEN 21 
        WHEN hour_of_crash_name = '10:00pm-10:59pm' THEN 22 
        WHEN hour_of_crash_name = '11:00pm-11:59pm' THEN 23 
    END"""
hours_fatalities_state = pd.read_gbq(query_hours_fatalities_state, project_id=project_id, dialect='standard')

fig_hours_fatalities_state = px.choropleth(
    hours_fatalities_state,
    locations='state',
    locationmode='USA-states',
    color='hour_of_crash_name',
    color_discrete_map = hours_in_day_color_dict,
    title='Hour with Most Fatalities',
)
fig_hours_fatalities_state.update_layout(
    title_text = 'Hour with Highest Number of Fatalities 2015-2020',
    geo_scope='usa', # limit map scope to USA
)

Unlike the day of the week with the highest number of fatalities, the hour with the highest number of fatalities is not shared across states, and there does not appear to be any regional pattern in these hours. It appears that geolocation does not affect which hour has the most fatalities in a state. However, looking at the proportion of each state's fatalities that occur in its peak hour, we find that the hour does have a statistically significant impact on the number of fatalities in the state. Take Vermont, for example: its peak hour is 4:00pm-4:59pm, which accounts for 9.6% of the state's fatalities across the day. This is significantly higher than the national proportion of 5.2%. The large difference between these two values indicates that there is a relationship between hour and state fatalities.
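
One way to put a number on a comparison like the Vermont example is a one-sample z-test for a proportion. The sketch below is only illustrative: the counts (48 of 500) are hypothetical values chosen to give a 9.6% share, not figures from the dataset, and `proportion_z_test` is a hypothetical helper name.

```python
import math

def proportion_z_test(successes, trials, p0):
    """Two-sided one-sample z-test of an observed proportion against p0."""
    p_hat = successes / trials
    standard_error = math.sqrt(p0 * (1 - p0) / trials)
    z = (p_hat - p0) / standard_error
    # Two-sided p-value under the normal approximation.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# HYPOTHETICAL counts: 48 peak-hour fatalities out of 500 in the state (9.6%),
# compared against the national share of 5.2% quoted in the text.
z, p_value = proportion_z_test(successes=48, trials=500, p0=0.052)
print(f"z = {z:.2f}, p = {p_value:.2g}")
```

Because the hour being tested was selected as the state's maximum out of 24 candidates, a test like this overstates the evidence; a correction for multiple comparisons would be needed before drawing firm conclusions.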

The heightened proportion of fatalities at certain hours may simply reflect a higher number of drivers on the road. To account for this, we can normalize the number of fatalities by the number of accidents in each hour, which gives the average fatalities per crash.

In [None]:
%%bigquery --project $project_id

SELECT AVG(number_of_fatalities) AS avg_fatalities, hour_of_crash_name 
FROM `traffic_fatalities.traffic_features` 
GROUP BY hour_of_crash_name
ORDER BY avg_fatalities DESC

Looking at the table, we see a slight increase in average fatalities per fatal accident between 1:00am and 3:59am. The increased fatality rate could be caused by confounding factors such as drunk driving at night or decreased illumination. Nevertheless, the differences here may not be statistically significant when compared to the fatality rates across all hours.

The table as a graph:

In [None]:
query_hours_avg_fatalities  = """   SELECT AVG(number_of_fatalities) AS avg_fatalities, hour_of_crash_name 
                                    FROM `traffic_fatalities.traffic_features` 
                                    GROUP BY hour_of_crash_name 
                                    ORDER BY 
                                        CASE 
                                            WHEN hour_of_crash_name = '0:00am-0:59am' THEN 0 
                                            WHEN hour_of_crash_name = '1:00am-1:59am' THEN 1 
                                            WHEN hour_of_crash_name = '2:00am-2:59am' THEN 2 
                                            WHEN hour_of_crash_name = '3:00am-3:59am' THEN 3 
                                            WHEN hour_of_crash_name = '4:00am-4:59am' THEN 4 
                                            WHEN hour_of_crash_name = '5:00am-5:59am' THEN 5 
                                            WHEN hour_of_crash_name = '6:00am-6:59am' THEN 6 
                                            WHEN hour_of_crash_name = '7:00am-7:59am' THEN 7 
                                            WHEN hour_of_crash_name = '8:00am-8:59am' THEN 8 
                                            WHEN hour_of_crash_name = '9:00am-9:59am' THEN 9 
                                            WHEN hour_of_crash_name = '10:00am-10:59am' THEN 10 
                                            WHEN hour_of_crash_name = '11:00am-11:59am' THEN 11 
                                            WHEN hour_of_crash_name = '12:00pm-12:59pm' THEN 12 
                                            WHEN hour_of_crash_name = '1:00pm-1:59pm' THEN 13 
                                            WHEN hour_of_crash_name = '2:00pm-2:59pm' THEN 14 
                                            WHEN hour_of_crash_name = '3:00pm-3:59pm' THEN 15 
                                            WHEN hour_of_crash_name = '4:00pm-4:59pm' THEN 16 
                                            WHEN hour_of_crash_name = '5:00pm-5:59pm' THEN 17 
                                            WHEN hour_of_crash_name = '6:00pm-6:59pm' THEN 18 
                                            WHEN hour_of_crash_name = '7:00pm-7:59pm' THEN 19 
                                            WHEN hour_of_crash_name = '8:00pm-8:59pm' THEN 20 
                                            WHEN hour_of_crash_name = '9:00pm-9:59pm' THEN 21 
                                            WHEN hour_of_crash_name = '10:00pm-10:59pm' THEN 22 
                                            WHEN hour_of_crash_name = '11:00pm-11:59pm' THEN 23 
                                        END
                                    DESC"""

hours_avg_fatalities = pd.read_gbq(query_hours_avg_fatalities, project_id=project_id, dialect='standard')

plt.barh(hours_avg_fatalities["hour_of_crash_name"], hours_avg_fatalities["avg_fatalities"])
plt.title("Fatalities per Accident By Hour")
plt.xlabel("Fatalities per Accident")
plt.ylabel("Hour")

Again, we must view the fatality rate trends across states. Identifying the hour with the highest fatality rate for each state could help us understand regional trends.
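
The `RANK() OVER (PARTITION BY state_name ...)` pattern used in these queries keeps, for each state, the hour whose aggregate is largest. As a minimal sketch of that logic in plain Python, the example below averages fatalities per (state, hour) and then picks the top hour per state. The rows are made up for illustration, and `top_hour_per_state` is a hypothetical helper name.

```python
from collections import defaultdict

# HYPOTHETICAL rows: (state_name, hour_of_crash_name, number_of_fatalities)
crashes = [
    ("Vermont", "4:00pm-4:59pm", 1),
    ("Vermont", "4:00pm-4:59pm", 3),
    ("Vermont", "2:00am-2:59am", 1),
    ("Texas", "2:00am-2:59am", 2),
    ("Texas", "9:00pm-9:59pm", 1),
    ("Texas", "9:00pm-9:59pm", 1),
]

def top_hour_per_state(rows):
    """Return {state: (hour, average fatalities per crash)} for each state's top hour."""
    totals = defaultdict(lambda: [0, 0])  # (state, hour) -> [sum of fatalities, crash count]
    for state, hour, fatalities in rows:
        totals[(state, hour)][0] += fatalities
        totals[(state, hour)][1] += 1

    best = {}
    for (state, hour), (total, count) in totals.items():
        average = total / count
        if state not in best or average > best[state][1]:
            best[state] = (hour, average)
    return best

result = top_hour_per_state(crashes)
print(result)
```

Unlike `RANK()`, which returns every hour tied for first place, this sketch keeps only the first hour it encounters among ties.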

In [None]:
%%bigquery --project $project_id

WITH state_hours_fatalities AS (
    SELECT AVG(number_of_fatalities) AS avg_fatalities, state_name, hour_of_crash_name, RANK() OVER (PARTITION BY state_name ORDER BY AVG(number_of_fatalities) DESC) AS rank 
    FROM `traffic_fatalities.traffic_features` 
    GROUP BY state_name, hour_of_crash_name 
    ) 
SELECT w.avg_fatalities, w.hour_of_crash_name, w.state_name, s.state 
FROM state_hours_fatalities AS w 
JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON w.state_name = s.state_name 
WHERE rank = 1
ORDER BY state_name

In [None]:
query_hours_avg_fatalities_state = """  WITH state_hours_fatalities AS (
                                            SELECT AVG(number_of_fatalities) AS avg_fatalities, state_name, hour_of_crash_name, RANK() OVER (PARTITION BY state_name ORDER BY AVG(number_of_fatalities) DESC) AS rank 
                                            FROM `traffic_fatalities.traffic_features` 
                                            GROUP BY state_name, hour_of_crash_name 
                                            ) 
                                        SELECT w.avg_fatalities, w.hour_of_crash_name, w.state_name, s.state 
                                        FROM state_hours_fatalities AS w 
                                        JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON w.state_name = s.state_name 
                                        WHERE rank = 1
                                        ORDER BY 
                                            CASE 
                                                WHEN hour_of_crash_name = '0:00am-0:59am' THEN 0 
                                                WHEN hour_of_crash_name = '1:00am-1:59am' THEN 1 
                                                WHEN hour_of_crash_name = '2:00am-2:59am' THEN 2 
                                                WHEN hour_of_crash_name = '3:00am-3:59am' THEN 3 
                                                WHEN hour_of_crash_name = '4:00am-4:59am' THEN 4 
                                                WHEN hour_of_crash_name = '5:00am-5:59am' THEN 5 
                                                WHEN hour_of_crash_name = '6:00am-6:59am' THEN 6 
                                                WHEN hour_of_crash_name = '7:00am-7:59am' THEN 7 
                                                WHEN hour_of_crash_name = '8:00am-8:59am' THEN 8 
                                                WHEN hour_of_crash_name = '9:00am-9:59am' THEN 9 
                                                WHEN hour_of_crash_name = '10:00am-10:59am' THEN 10 
                                                WHEN hour_of_crash_name = '11:00am-11:59am' THEN 11 
                                                WHEN hour_of_crash_name = '12:00pm-12:59pm' THEN 12 
                                                WHEN hour_of_crash_name = '1:00pm-1:59pm' THEN 13 
                                                WHEN hour_of_crash_name = '2:00pm-2:59pm' THEN 14 
                                                WHEN hour_of_crash_name = '3:00pm-3:59pm' THEN 15 
                                                WHEN hour_of_crash_name = '4:00pm-4:59pm' THEN 16 
                                                WHEN hour_of_crash_name = '5:00pm-5:59pm' THEN 17 
                                                WHEN hour_of_crash_name = '6:00pm-6:59pm' THEN 18 
                                                WHEN hour_of_crash_name = '7:00pm-7:59pm' THEN 19 
                                                WHEN hour_of_crash_name = '8:00pm-8:59pm' THEN 20 
                                                WHEN hour_of_crash_name = '9:00pm-9:59pm' THEN 21 
                                                WHEN hour_of_crash_name = '10:00pm-10:59pm' THEN 22 
                                                WHEN hour_of_crash_name = '11:00pm-11:59pm' THEN 23 
                                            END"""
hours_avg_fatalities_state = pd.read_gbq(query_hours_avg_fatalities_state, project_id=project_id, dialect='standard')

fig_hours_avg_fatalities_state = px.choropleth(
    hours_avg_fatalities_state,
    locations='state',
    locationmode='USA-states',
    color='hour_of_crash_name',
    color_discrete_map = hours_in_day_color_dict,
)
fig_hours_avg_fatalities_state.update_layout(
    title_text = 'Hour with Highest Average Fatalities per Accident 2015-2020',
    geo_scope='usa', # limit map scope to USA
)

As with the last few maps, there do not appear to be any regional trends in the hour with the highest fatality rate by state. This does not support our hypothesis that geolocation affects state fatality rates. However, the differences in average fatality rate do appear to be statistically significant: the highest hourly average fatality rate in each state is significantly higher than the average fatality rate for that hour across the country. As stated above, without additional cultural, social, or policy data, we cannot determine the causes behind the increased fatality rate.

### Month

Next we explore the fatality rates across months. We expect to see an increase in fatality rate during the winter months due to confounding factors such as undesirable weather and decreased daylight hours. Additionally, we expect the month to have a statistically significant effect on state fatality rates through differences in weather and climate.

First, the distribution of fatalities across the months:

In [None]:
%%bigquery --project $project_id

SELECT SUM(number_of_fatalities) AS num_fatalities, SUM(number_of_fatalities)/(SELECT SUM(number_of_fatalities) FROM `traffic_fatalities.traffic_features`) * 100 as proportion_fatalities, month_of_crash_name
FROM `traffic_fatalities.traffic_features`
GROUP BY month_of_crash_name
ORDER BY num_fatalities DESC

Contrary to our expectation, the summer months and the months that follow have the highest incidence of fatal accidents. We believe this is related to an increased number of drivers on the road during the holidays, and perhaps to less cautious driving in nicer weather. Nevertheless, this is conjecture; several other factors may have influenced these differences. Our hypothesis that the number of fatalities increases during the winter months is not supported by the data.

In [None]:
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

In [None]:
query_months_fatalities  = """  SELECT SUM(number_of_fatalities) AS num_fatalities, SUM(number_of_fatalities)/(SELECT SUM(number_of_fatalities) FROM `traffic_fatalities.traffic_features`) * 100 as proportion_fatalities, month_of_crash_name
                                FROM `traffic_fatalities.traffic_features`
                                GROUP BY month_of_crash_name
                                ORDER BY 
                                    CASE 
                                        WHEN month_of_crash_name = 'January' THEN 1 
                                        WHEN month_of_crash_name = 'February' THEN 2 
                                        WHEN month_of_crash_name = 'March' THEN 3 
                                        WHEN month_of_crash_name = 'April' THEN 4 
                                        WHEN month_of_crash_name = 'May' THEN 5 
                                        WHEN month_of_crash_name = 'June' THEN 6 
                                        WHEN month_of_crash_name = 'July' THEN 7 
                                        WHEN month_of_crash_name = 'August' THEN 8 
                                        WHEN month_of_crash_name = 'September' THEN 9 
                                        WHEN month_of_crash_name = 'October' THEN 10 
                                        WHEN month_of_crash_name = 'November' THEN 11 
                                        WHEN month_of_crash_name = 'December' THEN 12 
                                    END"""

months_fatalities = pd.read_gbq(query_months_fatalities, project_id=project_id, dialect='standard')

plt.bar(months, months_fatalities["num_fatalities"])
plt.title("Fatalities by Month")
plt.xlabel("Month")
plt.ylabel("Frequency")

By state, the months with the highest fatalities fall between May and October. However, there is no clear geographical trend, other than the typically warmer or more arid states having their deadliest months earlier in the year. We believe the clustering of deadly months in the warmer seasons of the year may be due to factors such as:

- Increased Travel: The summer months, especially July and August, coincide with peak vacation periods and holidays like Independence Day. More people tend to travel during this time, leading to increased traffic on roads, highways, and interstates, thereby raising the likelihood of accidents.

- Tourism and Outdoor Activities: Warmer weather in summer encourages outdoor activities and tourism, leading to higher numbers of vehicles on the road, especially in areas known for tourist attractions or outdoor events.

- Longer Daylight Hours: Longer daylight hours during summer months allow for extended periods of driving, potentially increasing the overall exposure to road risks.

- Road Work and Construction: Construction and road repair projects are often more active during summer due to favorable weather conditions, which can lead to altered traffic patterns, detours, and temporary road layouts, contributing to accidents.

- Youth Activities and Events: Summer break for schools and colleges means increased activity among young drivers, potentially leading to more accidents involving inexperienced or youthful drivers.

- Alcohol Consumption: Holidays and warmer weather often lead to more social gatherings and outdoor events where alcohol consumption might be higher, contributing to an increased risk of accidents due to impaired driving.

Next, we find the month with the most fatalities for each state:

In [None]:
%%bigquery --project $project_id
WITH state_months_fatalities AS (
    SELECT SUM(number_of_fatalities) AS num_fatalities, state_name, month_of_crash_name, RANK() OVER (PARTITION BY state_name ORDER BY SUM(number_of_fatalities) DESC) AS rank 
    FROM `traffic_fatalities.traffic_features` 
    GROUP BY state_name, month_of_crash_name
), state_fatalities AS (
    SELECT state_name, SUM(number_of_fatalities) AS total_fatalities_per_state 
    FROM `traffic_fatalities.traffic_features` 
    GROUP BY state_name
)   
SELECT w.num_fatalities, t.total_fatalities_per_state AS total_fatalities, w.num_fatalities / t.total_fatalities_per_state * 100 AS percent_fatalities_per_year, w.month_of_crash_name, w.state_name, s.state 
FROM state_months_fatalities AS w 
JOIN state_fatalities AS t ON w.state_name = t.state_name
JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON w.state_name = s.state_name 
WHERE rank = 1
ORDER BY state_name

In [None]:
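# Pastel1 has only 9 colors; Set3 has 12, one per month
months_in_year_color_set = px.colors.qualitative.Set3[:len(months)]

# Key the map by the full month names stored in month_of_crash_name so it matches the color column
month_names = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]
months_in_year_color_dict = dict(zip(month_names, months_in_year_color_set))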

In [None]:

query_months_fatalities_state = """ WITH state_months_fatalities AS (
                                        SELECT SUM(number_of_fatalities) AS num_fatalities, state_name, month_of_crash_name, RANK() OVER (PARTITION BY state_name ORDER BY SUM(number_of_fatalities) DESC) AS rank 
                                        FROM `traffic_fatalities.traffic_features` 
                                        GROUP BY state_name, month_of_crash_name
                                    ), state_fatalities AS (
                                        SELECT state_name, SUM(number_of_fatalities) AS total_fatalities_per_state 
                                        FROM `traffic_fatalities.traffic_features` 
                                        GROUP BY state_name
                                    )   
                                    SELECT w.num_fatalities, t.total_fatalities_per_state AS total_fatalities, w.num_fatalities / t.total_fatalities_per_state * 100 AS percent_fatalities_per_year, w.month_of_crash_name, w.state_name, s.state 
                                    FROM state_months_fatalities AS w 
                                    JOIN state_fatalities AS t ON w.state_name = t.state_name
                                    JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON w.state_name = s.state_name 
                                    WHERE rank = 1
                                    ORDER BY 
                                        CASE 
                                            WHEN month_of_crash_name = 'January' THEN 1 
                                            WHEN month_of_crash_name = 'February' THEN 2 
                                            WHEN month_of_crash_name = 'March' THEN 3 
                                            WHEN month_of_crash_name = 'April' THEN 4 
                                            WHEN month_of_crash_name = 'May' THEN 5 
                                            WHEN month_of_crash_name = 'June' THEN 6 
                                            WHEN month_of_crash_name = 'July' THEN 7 
                                            WHEN month_of_crash_name = 'August' THEN 8 
                                            WHEN month_of_crash_name = 'September' THEN 9 
                                            WHEN month_of_crash_name = 'October' THEN 10 
                                            WHEN month_of_crash_name = 'November' THEN 11 
                                            WHEN month_of_crash_name = 'December' THEN 12 
                                        END"""
months_fatalities_state = pd.read_gbq(query_months_fatalities_state, project_id=project_id, dialect='standard')

fig_months_fatalities_state = px.choropleth(
    months_fatalities_state,
    locations='state',
    locationmode='USA-states',
    color='month_of_crash_name',
    color_discrete_map = months_in_year_color_dict,
)
fig_months_fatalities_state.update_layout(
    title_text = 'Month with Highest Number of Fatalities 2015-2020',
    geo_scope='usa',  # limit map scope to USA
)

There appear to be some regional trends in the month with the highest fatalities. For example, the southwest has October as the month with the highest proportion of fatalities, the northwest has July, and the midwest has August. Also, with the exception of Hawaii and Florida, every state's top month falls between July and October, that is, in summer or early fall. The appearance of these regional trends supports our hypothesis that geographic location affects state fatalities.

Moreover, judging by the raw percentages, only a few states differ markedly from the nationwide proportion of fatalities in their top month. Some states, like South Dakota, show a large difference (14.6% of fatalities in August in SD vs. 9.1% in August nationwide), while others, like California, do not (9.1% in October in CA vs. 9.3% in October nationwide). This suggests that month plays a role in the fatality rate of some states, which is consistent with our hypothesis.
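
One way to check whether a gap like the South Dakota one is larger than chance would explain is a two-proportion z-test. The sketch below is a minimal illustration under stated assumptions, not the analysis used in this notebook: the function name `two_proportion_z_test` is ours, and the counts in the usage example are hypothetical placeholders rather than values from the data set. Because a state's fatalities are also part of the nationwide total, the second group should ideally be the rest of the country rather than the whole country.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Return (z, two-sided p-value) for the null hypothesis p1 == p2.

    x1 / n1: fatalities in the month / all fatalities for group 1 (e.g. one state)
    x2 / n2: the same counts for group 2 (e.g. the rest of the country)
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts for illustration only (not taken from the data set).
z, p = two_proportion_z_test(x1=146, n1=1000, x2=9100, n2=100000)
print(f"z = {z:.2f}, p = {p:.4g}")
```

A small p-value would indicate that the state's share in that month differs from the comparison group by more than sampling noise alone would suggest.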

We must again normalize the fatalities by the number of accidents in each state to produce the fatalities per accident for each month.

In [None]:
%%bigquery --project $project_id

SELECT AVG(number_of_fatalities) avg_fatalities, month_of_crash_name
FROM `traffic_fatalities.traffic_features`
GROUP BY month_of_crash_name
ORDER BY avg_fatalities DESC

In [None]:
query_months_avg_fatalities  = """  SELECT AVG(number_of_fatalities) avg_fatalities, month_of_crash_name
                                    FROM `traffic_fatalities.traffic_features`
                                    GROUP BY month_of_crash_name
                                    ORDER BY 
                                        CASE 
                                            WHEN month_of_crash_name = 'January' THEN 1 
                                            WHEN month_of_crash_name = 'February' THEN 2 
                                            WHEN month_of_crash_name = 'March' THEN 3 
                                            WHEN month_of_crash_name = 'April' THEN 4 
                                            WHEN month_of_crash_name = 'May' THEN 5 
                                            WHEN month_of_crash_name = 'June' THEN 6 
                                            WHEN month_of_crash_name = 'July' THEN 7 
                                            WHEN month_of_crash_name = 'August' THEN 8 
                                            WHEN month_of_crash_name = 'September' THEN 9 
                                            WHEN month_of_crash_name = 'October' THEN 10 
                                            WHEN month_of_crash_name = 'November' THEN 11 
                                            WHEN month_of_crash_name = 'December' THEN 12 
                                        END"""

months_avg_fatalities = pd.read_gbq(query_months_avg_fatalities, project_id=project_id, dialect='standard')

plt.bar(months, months_avg_fatalities["avg_fatalities"])
plt.title("Fatalities Per Accident by Month")

Yet again, the average number of fatalities per accident is similar across months, which leads us to believe that the month does not have a substantial effect on the fatality rate across the country.

Looking at the average fatalities per accident across the months for each state, we discover:

In [None]:
%%bigquery --project $project_id
WITH state_months_fatalities AS (
    SELECT AVG(number_of_fatalities) AS avg_fatalities, state_name, month_of_crash_name, RANK() OVER (PARTITION BY state_name ORDER BY AVG(number_of_fatalities) DESC) AS rank 
    FROM `traffic_fatalities.traffic_features` 
    GROUP BY state_name, month_of_crash_name
    ) 
SELECT w.avg_fatalities, w.month_of_crash_name, w.state_name, s.state 
FROM state_months_fatalities AS w 
JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON w.state_name = s.state_name 
WHERE rank = 1
ORDER BY state_name

In [None]:
query_months_avg_fatalities_state = """ WITH state_months_fatalities AS (
                                            SELECT AVG(number_of_fatalities) AS avg_fatalities, state_name, month_of_crash_name, RANK() OVER (PARTITION BY state_name ORDER BY AVG(number_of_fatalities) DESC) AS rank 
                                            FROM `traffic_fatalities.traffic_features` 
                                            GROUP BY state_name, month_of_crash_name
                                            ) 
                                        SELECT w.avg_fatalities, w.month_of_crash_name, w.state_name, s.state 
                                        FROM state_months_fatalities AS w 
                                        JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON w.state_name = s.state_name 
                                        WHERE rank = 1
                                        ORDER BY 
                                            CASE 
                                                WHEN month_of_crash_name = 'January' THEN 1 
                                                WHEN month_of_crash_name = 'February' THEN 2 
                                                WHEN month_of_crash_name = 'March' THEN 3 
                                                WHEN month_of_crash_name = 'April' THEN 4 
                                                WHEN month_of_crash_name = 'May' THEN 5 
                                                WHEN month_of_crash_name = 'June' THEN 6 
                                                WHEN month_of_crash_name = 'July' THEN 7 
                                                WHEN month_of_crash_name = 'August' THEN 8 
                                                WHEN month_of_crash_name = 'September' THEN 9 
                                                WHEN month_of_crash_name = 'October' THEN 10 
                                                WHEN month_of_crash_name = 'November' THEN 11 
                                                WHEN month_of_crash_name = 'December' THEN 12 
                                            END"""
months_avg_fatalities_state = pd.read_gbq(query_months_avg_fatalities_state, project_id=project_id, dialect='standard')

fig_months_avg_fatalities_state = px.choropleth(
    months_avg_fatalities_state,
    locations='state',
    locationmode='USA-states',
    color='month_of_crash_name',
    color_discrete_map = months_in_year_color_dict,
)
fig_months_avg_fatalities_state.update_layout(
    title_text = 'Month with Highest Fatalities Per Accident 2015-2020',
    geo_scope='usa',  # limit map scope to USA
)

There appear to be some regional trends in the month with the highest average fatalities per accident. First, the west coast has June as the month with the highest average fatalities per accident. The state rates here (California 1.1, Oregon 1.15, Washington 1.1) are all above the rate across the country (1.09), although the margins are small. This regional trend might be explained by increased tourist travel to the west coast during June. The midwest also shows a regional trend: much of the area's top month is December, while Michigan's is January. These months indicate a higher fatality rate during the winter, when the midwest receives heavy snowfall. The appearance of these regional trends supports our hypothesis that geographic location affects state fatalities. Furthermore, almost all states show a higher fatality rate in their top month than the national rate. Thus, the month of the crash appears to affect the state fatality rate.
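
Differences as small as 1.10 versus 1.09 fatalities per accident are hard to judge without a measure of uncertainty. The sketch below shows one possible check, a Welch-style comparison of two means with a large-sample normal approximation for the p-value. The helper name `welch_z` and both sample lists are hypothetical and exist only to illustrate the calculation; they are not drawn from the data set.

```python
import math
from statistics import NormalDist, mean, variance


def welch_z(sample_a, sample_b):
    """Welch statistic and large-sample two-sided p-value for a difference in means."""
    se = math.sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    stat = (mean(sample_a) - mean(sample_b)) / se
    # Normal approximation; reasonable only when both samples are large.
    p_value = 2 * (1 - NormalDist().cdf(abs(stat)))
    return stat, p_value


# Hypothetical per-accident fatality counts, for illustration only.
state_june = [1] * 90 + [2] * 10          # mean 1.10
rest_of_country = [1] * 910 + [2] * 90    # mean 1.09
stat, p = welch_z(state_june, rest_of_country)
print(f"stat = {stat:.2f}, p = {p:.2f}")
```

With sample sizes like these hypothetical ones, a 0.01 gap is well within sampling noise, so whether a gap of that size matters depends on how many accidents stand behind each mean.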

### Land Use

Here we investigate how different land uses can change the severity of fatal crashes. We expect to see a trend in the number of accidents and fatality rate for each land type across each state.

In [None]:
%%bigquery --project $project_id
SELECT AVG(number_of_fatalities) AS avg_fatalities, COUNT(*) AS num_accidents, land_use_name
FROM `traffic_fatalities.traffic_features` 
GROUP BY land_use_name
ORDER BY avg_fatalities DESC

We notice that even though rural areas typically have much lower population density, fatal accidents in rural areas have more fatalities on average than accidents in urban environments. Moreover, despite rural areas having fewer people, the number of fatal accidents there is comparable to the number in urban areas. The difference is only 20,000, whereas the difference in population between these land use areas is much larger.

We believe this may be because accidents in rural areas are more likely to occur on highways and interstates, where there are typically more cars and higher speed limits, both of which contribute to the severity of crashes. However, this cannot be the only factor contributing to the number of fatal car accidents in rural areas. We believe that the incidence of reckless and drunk driving may be higher, and the quality of roads lower, in rural areas.

In [None]:
query_avg_land  = """   SELECT AVG(number_of_fatalities) AS avg_fatalities, COUNT(*) AS num_accidents, land_use_name
                        FROM `traffic_fatalities.traffic_features` 
                        WHERE land_use_name IN ('Rural', 'Urban')
                        GROUP BY land_use_name
                        """

avg_land = pd.read_gbq(query_avg_land, project_id=project_id, dialect='standard')

plt.bar(avg_land["land_use_name"], avg_land["avg_fatalities"])
plt.ylabel('Average Fatalities')
plt.xlabel('Land Type')
plt.title('Average Fatalities by Land Type')

The plot below is relatively straightforward: states with more urban centres, such as Florida and California, are more likely to have crashes that occur in urban areas.

In [None]:
%%bigquery --project $project_id
WITH state_accidents AS (
    SELECT COUNT(*) AS total_accidents_per_state, state_name 
    FROM `traffic_fatalities.traffic_features` 
    GROUP BY state_name
)
SELECT COUNT(*)/t.total_accidents_per_state * 100 AS percent_urban, COUNT(*) AS num_urban, AVG(w.number_of_fatalities) AS avg_fatalities, w.state_name, w.land_use_name, s.state 
FROM `traffic_fatalities.traffic_features` AS w 
JOIN state_accidents AS t ON w.state_name = t.state_name 
JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON w.state_name = s.state_name 
WHERE land_use_name = 'Urban' 
GROUP BY state_name, land_use_name, state, total_accidents_per_state
ORDER BY state_name

In [None]:
query_urban_per_state = """WITH state_accidents AS (
                                SELECT COUNT(*) AS total_accidents_per_state, state_name 
                                FROM `traffic_fatalities.traffic_features` 
                                GROUP BY state_name
                            )
                            SELECT COUNT(*)/t.total_accidents_per_state * 100 AS percent_urban, COUNT(*) AS num_urban, AVG(w.number_of_fatalities) AS avg_fatalities, w.state_name, w.land_use_name, s.state 
                            FROM `traffic_fatalities.traffic_features` AS w 
                            JOIN state_accidents AS t ON w.state_name = t.state_name 
                            JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON w.state_name = s.state_name 
                            WHERE land_use_name = 'Urban' 
                            GROUP BY state_name, land_use_name, state, total_accidents_per_state
                            ORDER BY state_name"""
urban_per_state = pd.read_gbq(query_urban_per_state, project_id=project_id, dialect='standard')
fig_urban_per_state = px.choropleth(
    urban_per_state,
    locations='state',
    locationmode='USA-states',
    color='percent_urban',
    color_continuous_scale = 'Reds',
)

fig_urban_per_state.update_layout(
    title_text = 'Percent of Accidents that are in Urban Areas by State 2015-2020',
    geo_scope='usa',  # limit map scope to USA
)

Interestingly, the states with the highest average severity of urban crashes are not the states with the highest proportion of urban crashes. This may be due to a multitude of factors. For example, drivers in states with fewer urban centres might be less used to driving in urban areas and thus more prone to severe crashes in them. Much more investigation is required here, but our data set is not robust enough.

In [None]:
fig_urban_avg_fatalities = px.choropleth(
    urban_per_state,
    locations='state',
    locationmode='USA-states',
    color='avg_fatalities',
    color_continuous_scale = 'Blues',
)
fig_urban_avg_fatalities.update_layout(
    title_text = 'Average Fatalities in Accidents that are in Urban areas by State 2015-2020',
    geo_scope='usa',  # limit map scope to USA
)

As in the urban case above, the states with the highest average severity for rural crashes do not have the highest proportion of fatal accidents occurring in rural areas. This may be due to a multitude of factors, as mentioned above. Much more investigation is required.

In [None]:
%%bigquery --project $project_id

WITH state_accidents AS (
    SELECT COUNT(*) AS total_accidents_per_state, state_name 
    FROM `traffic_fatalities.traffic_features` 
    GROUP BY state_name
)
SELECT COUNT(*)/t.total_accidents_per_state * 100 AS percent_rural, COUNT(*) AS num_rural, AVG(w.number_of_fatalities) AS avg_fatalities, w.state_name, w.land_use_name, s.state 
FROM `traffic_fatalities.traffic_features` AS w 
JOIN state_accidents AS t ON w.state_name = t.state_name 
JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON w.state_name = s.state_name 
WHERE land_use_name = 'Rural' 
GROUP BY state_name, land_use_name, state, total_accidents_per_state
ORDER BY state_name

In [None]:
query_rural_per_state = """ WITH state_accidents AS (
                                SELECT COUNT(*) AS total_accidents_per_state, state_name 
                                FROM `traffic_fatalities.traffic_features` 
                                GROUP BY state_name
                            )
                            SELECT COUNT(*)/t.total_accidents_per_state * 100 AS percent_rural, COUNT(*) AS num_rural, AVG(w.number_of_fatalities) AS avg_fatalities, w.state_name, w.land_use_name, s.state 
                            FROM `traffic_fatalities.traffic_features` AS w 
                            JOIN state_accidents AS t ON w.state_name = t.state_name 
                            JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON w.state_name = s.state_name 
                            WHERE land_use_name = 'Rural' 
                            GROUP BY state_name, land_use_name, state, total_accidents_per_state
                            ORDER BY state_name"""
rural_per_state = pd.read_gbq(query_rural_per_state, project_id=project_id, dialect='standard')
fig_rural_per_state = px.choropleth(
    rural_per_state,
    locations='state',
    locationmode='USA-states',
    color='percent_rural',
    color_continuous_scale = 'Reds',
)

fig_rural_per_state.update_layout(
    title_text = 'Percent of Accidents that are in Rural Areas by State 2015-2020',
    geo_scope='usa',  # limit map scope to USA
)

In [None]:
fig_rural_avg_fatalities = px.choropleth(
    rural_per_state,
    locations='state',
    locationmode='USA-states',
    color='avg_fatalities',
    color_continuous_scale = 'Blues',
)
fig_rural_avg_fatalities.update_layout(
    title_text = 'Average Fatalities in Accidents that are in Rural areas by State 2015-2020',
    geo_scope='usa',  # limit map scope to USA
)

The correlation between land type and both state fatalities and average fatality rate is apparent, supporting our initial hypothesis. As a state becomes more rural, the number of rural fatalities increases, the rural average fatality rate decreases, the number of urban fatalities decreases, and the urban average fatality rate increases. Conversely, as a state becomes more urban, the number of urban fatalities increases, the urban average fatality rate decreases, the number of rural fatalities decreases, and the rural average fatality rate increases.

### Atmospheric Conditions

We also wonder how unfavourable atmospheric conditions contribute to the severity of fatal accidents. Do snow, rain, and similar conditions result in higher average fatalities per fatal accident?

First we explore the distribution of all atmospheric conditions with their average fatalities per accident.

In [None]:
%%bigquery --project $project_id
SELECT AVG(number_of_fatalities) AS avg_fatalities, COUNT(*) AS num_accidents, atmospheric_conditions_name
FROM `traffic_fatalities.traffic_features` 
GROUP BY atmospheric_conditions_name
ORDER BY avg_fatalities DESC

It appears that accidents during sandstorms have noticeably more fatalities per accident than accidents in the other conditions. This indicates that the atmospheric condition may influence the fatality rate. However, sandstorms are also the least common of the conditions, so their higher rate rests on a small sample and may not be reliable. In general, the conditions we defined as unfavorable have higher fatalities per accident, which supports our hypothesis that unfavorable conditions increase the fatality rate.
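
To see why a rare condition can show an extreme average, it helps to look at the standard error of a group mean, which shrinks with the square root of the number of accidents in the group. The sketch below is illustrative only: the spread `std_dev` and the two group sizes are hypothetical values chosen to show the effect, not figures from the data set.

```python
import math


def standard_error(std_dev, n):
    """Standard error of a group mean estimated from n accidents."""
    return std_dev / math.sqrt(n)


# Hypothetical spread and group sizes, chosen only to show the effect of n.
std_dev = 0.35
for label, n in [("rare condition", 25), ("common condition", 25_000)]:
    se = standard_error(std_dev, n)
    # Approximate 95% interval half-width around the group mean.
    print(f"{label}: n={n}, SE={se:.4f}, 95% half-width={1.96 * se:.4f}")
```

Under these assumptions the rare group's mean is far less precisely estimated than the common group's, so an unusually high average for a rare condition should be read with caution.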

Next, we can view the percentage of accidents that occur in unfavorable conditions.

In [None]:
%%bigquery --project $project_id

SELECT COUNT(*)/(SELECT COUNT(*) FROM `traffic_fatalities.traffic_features`) * 100 AS percent_unfavorable, COUNT(*) AS num_unfavorable
FROM `traffic_fatalities.traffic_features`
WHERE atmospheric_conditions_name IN ('Blowing Snow', 'Blowing Sand, Soil, Dirt', 'Sleet or Hail', 'Fog, Smog, Smoke', 'Rain', 'Severe Crosswinds', 'Freezing Rain or Drizzle', 'Snow')

As shown by the low percentage, there are not many fatal accidents that occur during unfavorable conditions. This leads us to believe that atmospheric conditions may not be a large factor in contributing to fatal accidents.

Nevertheless, we want to view the percent of unfavorable accidents across states to further develop regional and state trends for fatality rate.

In [None]:
%%bigquery --project $project_id

WITH state_accidents AS (
    SELECT COUNT(*) AS num_accidents, state_name, state_number
    FROM `traffic_fatalities.traffic_features`
    GROUP BY state_name, state_number
)
SELECT COUNT(*)/t.num_accidents * 100 AS percent_unfavorable, AVG(w.number_of_fatalities) AS avg_fatalities, w.state_name, s.state
FROM `traffic_fatalities.traffic_features` AS w
JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON w.state_name = s.state_name 
JOIN state_accidents AS t ON t.state_name = w.state_name
WHERE atmospheric_conditions_name IN ('Blowing Snow', 'Blowing Sand, Soil, Dirt', 'Sleet or Hail', 'Fog, Smog, Smoke', 'Rain', 'Severe Crosswinds', 'Freezing Rain or Drizzle', 'Snow')
GROUP BY state_name, state, num_accidents
ORDER BY state_name

Graphically,

In [None]:
query_unfavorable_per_state =    """WITH state_accidents AS (
                                        SELECT COUNT(*) AS num_accidents, state_name, state_number
                                        FROM `traffic_fatalities.traffic_features`
                                        GROUP BY state_name, state_number
                                    )
                                    SELECT COUNT(*)/t.num_accidents * 100 AS percent_unfavorable, AVG(w.number_of_fatalities) AS avg_fatalities, w.state_name, s.state
                                    FROM `traffic_fatalities.traffic_features` AS w
                                    JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON w.state_name = s.state_name 
                                    JOIN state_accidents AS t ON t.state_name = w.state_name
                                    WHERE atmospheric_conditions_name IN ('Blowing Snow', 'Blowing Sand, Soil, Dirt', 'Sleet or Hail', 'Fog, Smog, Smoke', 'Rain', 'Severe Crosswinds', 'Freezing Rain or Drizzle', 'Snow')
                                    GROUP BY state_name, state, num_accidents
                                    ORDER BY state_name"""
unfavorable_per_state = pd.read_gbq(query_unfavorable_per_state, project_id=project_id, dialect='standard')
fig_unfavorable_avg_fatalities = px.choropleth(
    unfavorable_per_state,
    locations='state',
    locationmode='USA-states',
    color='percent_unfavorable',
    color_continuous_scale = 'Blues',
)
fig_unfavorable_avg_fatalities.update_layout(
    title_text = 'Percent of Accidents that occur with Unfavorable Conditions 2015-2020',
    geo_scope='usa',
)

The data reveals significant variations in fatality rates among states. For instance, Alaska (AK) exhibits a notably high percentage of unfavorable conditions (19.14%) coupled with a relatively low average fatality rate (1.06), while the District of Columbia (DC) has a moderate percentage of unfavorable conditions (10.37%) but the lowest average fatality rate (1.00). Moreover, regional patterns emerge, with some states in the Midwest, like Nebraska (NE) and South Dakota (SD), experiencing both higher percentages of unfavorable conditions and higher average fatality rates, potentially indicating regional vulnerabilities. On the other hand, states in the Northeast, such as Massachusetts (MA) and New York (NY), display varying percentages of unfavorable conditions but relatively lower average fatality rates. 

While the heatmap helps us understand regional trends, it does not do as good a job showing the relationship between the percent of fatal accidents in unfavorable conditions and state fatality rates.

In [None]:
plt.scatter(percent_fatalities_per_state["yearly_fatalities_per_capita"], unfavorable_per_state["percent_unfavorable"])
plt.title("Yearly Fatalities Per Capita vs % of Crashes in Unfavourable Conditions")
plt.xlabel("Fatalities Per Capita")
plt.ylabel("%")

In [None]:
plt.scatter(unfavorable_per_state["avg_fatalities"], unfavorable_per_state["percent_unfavorable"])
plt.title("Fatalities Per Accident vs % of Crashes in Unfavourable Conditions")
plt.xlabel("Fatalities Per Accident")
plt.ylabel("%")

Sadly, there appears to be little to no correlation between fatalities per capita and the % of accidents in unfavourable conditions, or between fatalities per accident and the % of accidents in unfavourable conditions. This tells us that a higher percentage of accidents in unfavorable conditions does not indicate a higher fatality rate. Therefore, our original hypothesis that accidents in unfavorable conditions influence the state fatality rate is not supported.
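
The first scatter plot above takes its x and y values from two DataFrames that were queried separately, so the rows are only comparable if both are in the same state order and cover the same states. The sketch below shows one way to pair values by state and put a number on the correlation. The helper names `aligned_pairs` and `pearson_r` are ours, and the values are hypothetical. In the notebook, each mapping could be built with something like `dict(zip(df["state"], df["some_column"]))`, assuming both DataFrames carry a `state` column.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def aligned_pairs(a_by_state, b_by_state):
    """Pair up values for states present in both mappings, in a fixed order."""
    states = sorted(set(a_by_state) & set(b_by_state))
    return [a_by_state[s] for s in states], [b_by_state[s] for s in states]


# Hypothetical values keyed by state abbreviation, for illustration only.
per_capita = {"AK": 10.0, "CA": 9.0, "SD": 14.0, "NY": 5.0}
pct_unfavorable = {"SD": 12.0, "NY": 11.0, "CA": 6.0, "AK": 19.0, "DC": 10.0}
xs, ys = aligned_pairs(per_capita, pct_unfavorable)
print(f"r = {pearson_r(xs, ys):.2f}")
```

A coefficient near zero would back up the visual impression of little to no correlation, while pairing by state guards against comparing one state's x value with another state's y value.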

Nevertheless, accidents in unfavourable conditions have higher fatality rates than the nationwide mean across all atmospheric conditions, especially in the states at the high end. Therefore, we believe this feature will still be useful in our model.

### Drunk Drivers

Now we will investigate what is, in our opinion, the most potent factor that may increase the number and severity of fatal accidents. We expect that a higher number of drunk drivers increases the fatality rate, and we wonder whether this factor varies much from state to state. Note that we determine the percent of drunk drivers not from the number of accidents that involve at least one drunk driver but from the number of drunk drivers relative to the total number of drivers (each moving vehicle indicates a driver).
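
The two definitions mentioned in the note can give quite different numbers. The sketch below contrasts them on a handful of hypothetical accident records. The function names and the dictionary keys `vehicles` and `drunk_drivers` are illustrative stand-ins for `number_of_motor_vehicles_in_transport_mvit` and `number_of_drunk_drivers`, not part of the data set.

```python
def drunk_driver_share(accidents):
    """Percent of all vehicles in transport whose driver was drunk."""
    drunk = sum(a["drunk_drivers"] for a in accidents)
    vehicles = sum(a["vehicles"] for a in accidents)
    return 100 * drunk / vehicles


def drunk_accident_share(accidents):
    """Percent of accidents involving at least one drunk driver."""
    flagged = sum(1 for a in accidents if a["drunk_drivers"] > 0)
    return 100 * flagged / len(accidents)


# Hypothetical accident records, for illustration only.
accidents = [
    {"vehicles": 1, "drunk_drivers": 1},
    {"vehicles": 2, "drunk_drivers": 0},
    {"vehicles": 3, "drunk_drivers": 1},
    {"vehicles": 2, "drunk_drivers": 0},
]
print(drunk_driver_share(accidents))    # 2 drunk drivers out of 8 vehicles
print(drunk_accident_share(accidents))  # 2 of 4 accidents
```

The per-driver definition used in this section weights multi-vehicle crashes more heavily in the denominator, which is why it tends to come out lower than the per-accident share.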


We will first evaluate the percent of drunk drivers involved in fatal accidents over the course of a day.

In [None]:
%%bigquery --project $project_id

SELECT SUM(number_of_drunk_drivers)/SUM(number_of_motor_vehicles_in_transport_mvit) * 100 as percent_drunk, SUM(number_of_drunk_drivers) AS num_drunk, SUM(number_of_motor_vehicles_in_transport_mvit) AS num_cars, hour_of_crash_name 
FROM `traffic_fatalities.traffic_features` 
GROUP BY hour_of_crash_name 

As expected, the percent of drunk drivers rises sharply later in the night. Moreover, drunk driver rates continue to rise past the hours of the day that have the most fatal accidents, as identified earlier (4:00pm-9:59pm). This is most likely because there are fewer people on the road later in the night, but of these drivers, more are likely to be drunk. Graphically,

In [None]:
query_drunk_hours  = """SELECT SUM(number_of_drunk_drivers)/SUM(number_of_motor_vehicles_in_transport_mvit) * 100 as percent_drunk, SUM(number_of_drunk_drivers) AS num_drunk, SUM(number_of_motor_vehicles_in_transport_mvit) AS num_cars, hour_of_crash_name 
                                FROM `traffic_fatalities.traffic_features` 
                                GROUP BY hour_of_crash_name 
                                ORDER BY 
                                    CASE 
                                        WHEN hour_of_crash_name = '0:00am-0:59am' THEN 0 
                                        WHEN hour_of_crash_name = '1:00am-1:59am' THEN 1 
                                        WHEN hour_of_crash_name = '2:00am-2:59am' THEN 2 
                                        WHEN hour_of_crash_name = '3:00am-3:59am' THEN 3 
                                        WHEN hour_of_crash_name = '4:00am-4:59am' THEN 4 
                                        WHEN hour_of_crash_name = '5:00am-5:59am' THEN 5 
                                        WHEN hour_of_crash_name = '6:00am-6:59am' THEN 6 
                                        WHEN hour_of_crash_name = '7:00am-7:59am' THEN 7 
                                        WHEN hour_of_crash_name = '8:00am-8:59am' THEN 8 
                                        WHEN hour_of_crash_name = '9:00am-9:59am' THEN 9 
                                        WHEN hour_of_crash_name = '10:00am-10:59am' THEN 10 
                                        WHEN hour_of_crash_name = '11:00am-11:59am' THEN 11 
                                        WHEN hour_of_crash_name = '12:00pm-12:59pm' THEN 12 
                                        WHEN hour_of_crash_name = '1:00pm-1:59pm' THEN 13 
                                        WHEN hour_of_crash_name = '2:00pm-2:59pm' THEN 14 
                                        WHEN hour_of_crash_name = '3:00pm-3:59pm' THEN 15 
                                        WHEN hour_of_crash_name = '4:00pm-4:59pm' THEN 16 
                                        WHEN hour_of_crash_name = '5:00pm-5:59pm' THEN 17 
                                        WHEN hour_of_crash_name = '6:00pm-6:59pm' THEN 18 
                                        WHEN hour_of_crash_name = '7:00pm-7:59pm' THEN 19 
                                        WHEN hour_of_crash_name = '8:00pm-8:59pm' THEN 20 
                                        WHEN hour_of_crash_name = '9:00pm-9:59pm' THEN 21 
                                        WHEN hour_of_crash_name = '10:00pm-10:59pm' THEN 22 
                                        WHEN hour_of_crash_name = '11:00pm-11:59pm' THEN 23 
                                    END
                                DESC"""

drunk_hours = pd.read_gbq(query_drunk_hours, project_id=project_id, dialect='standard')

plt.barh(drunk_hours["hour_of_crash_name"], drunk_hours["percent_drunk"])
plt.title("Percent of Drunk Drivers by Hour")
plt.xlabel("%")
plt.ylabel("Hour")

Next, we evaluate the percent of drunk drivers across the week.

In [None]:
%%bigquery --project $project_id

SELECT SUM(number_of_drunk_drivers)/SUM(number_of_motor_vehicles_in_transport_mvit) * 100 as percent_drunk, SUM(number_of_drunk_drivers) AS num_drunk, SUM(number_of_motor_vehicles_in_transport_mvit) AS num_cars, day_of_week_name 
FROM `traffic_fatalities.traffic_features` 
GROUP BY day_of_week_name 

In [None]:
query_drunk_days  = """SELECT SUM(number_of_drunk_drivers)/SUM(number_of_motor_vehicles_in_transport_mvit) * 100 as percent_drunk, SUM(number_of_drunk_drivers) AS num_drunk, SUM(number_of_motor_vehicles_in_transport_mvit) AS num_cars, day_of_week_name 
                        FROM `traffic_fatalities.traffic_features` 
                        GROUP BY day_of_week_name 
                        ORDER BY 
                            CASE 
                                WHEN day_of_week_name = 'Monday' THEN 0 
                                WHEN day_of_week_name = 'Tuesday' THEN 1 
                                WHEN day_of_week_name = 'Wednesday' THEN 2 
                                WHEN day_of_week_name = 'Thursday' THEN 3 
                                WHEN day_of_week_name = 'Friday' THEN 4 
                                WHEN day_of_week_name = 'Saturday' THEN 5 
                                WHEN day_of_week_name = 'Sunday' THEN 6 
                            END"""

drunk_days = pd.read_gbq(query_drunk_days, project_id=project_id, dialect='standard')

plt.bar(days, drunk_days["percent_drunk"])
plt.title("Percent of Drunk Drivers by Day of Week")
plt.ylabel("%")
plt.xlabel("Day of Week")

The proportion of drunk drivers involved in fatal accidents increases drastically on the weekends. This aligns with our initial beliefs, as the weekends are typically when people go out drinking.

Next, we look at the distribution of drunk drivers over the year.

In [None]:
%%bigquery --project $project_id

SELECT SUM(number_of_drunk_drivers)/SUM(number_of_motor_vehicles_in_transport_mvit) * 100 as percent_drunk, SUM(number_of_drunk_drivers) AS num_drunk, SUM(number_of_motor_vehicles_in_transport_mvit) AS num_cars, month_of_crash_name 
FROM `traffic_fatalities.traffic_features` 
GROUP BY month_of_crash_name
ORDER BY percent_drunk DESC

In [None]:
query_drunk_months  = """   SELECT SUM(number_of_drunk_drivers)/SUM(number_of_motor_vehicles_in_transport_mvit) * 100 as percent_drunk, SUM(number_of_drunk_drivers) AS num_drunk, SUM(number_of_motor_vehicles_in_transport_mvit) AS num_cars, month_of_crash_name 
                            FROM `traffic_fatalities.traffic_features` 
                            GROUP BY month_of_crash_name
                            ORDER BY
                                CASE 
                                    WHEN month_of_crash_name = 'January' THEN 1 
                                    WHEN month_of_crash_name = 'February' THEN 2 
                                    WHEN month_of_crash_name = 'March' THEN 3 
                                    WHEN month_of_crash_name = 'April' THEN 4 
                                    WHEN month_of_crash_name = 'May' THEN 5 
                                    WHEN month_of_crash_name = 'June' THEN 6 
                                    WHEN month_of_crash_name = 'July' THEN 7 
                                    WHEN month_of_crash_name = 'August' THEN 8 
                                    WHEN month_of_crash_name = 'September' THEN 9 
                                    WHEN month_of_crash_name = 'October' THEN 10 
                                    WHEN month_of_crash_name = 'November' THEN 11 
                                    WHEN month_of_crash_name = 'December' THEN 12 
                                END"""

drunk_months = pd.read_gbq(query_drunk_months, project_id=project_id, dialect='standard')

plt.bar(months, drunk_months["percent_drunk"])
plt.title("Percent of Drunk Drivers by Month")

We see that the proportion of drunk drivers increases during the summer months. This may be because college-age adults are on summer holidays and because many in the workforce take vacations during this time.

This is perhaps the most interesting statistic in this section. On average, when more drunk drivers are involved in a fatal accident, the average number of fatalities increases drastically. The increase is noticeably larger than the one associated with simply having more vehicles involved. The cumulative effect of impaired judgment, diminished control, increased risk-taking, and reduced ability to respond to hazardous situations potentially elevates the average fatalities in accidents involving drunk drivers. These effects are perhaps compounded when more than one drunk driver is involved, further increasing crash severity.

In [None]:
%%bigquery --project $project_id
SELECT AVG(number_of_fatalities) AS avg_fatalities, number_of_drunk_drivers AS num_dd
FROM `traffic_fatalities.traffic_features` 
GROUP BY num_dd
ORDER BY num_dd

In [None]:
%%bigquery --project $project_id
SELECT AVG(number_of_fatalities) AS avg_fatalities, number_of_motor_vehicles_in_transport_mvit AS num_drivers
FROM `traffic_fatalities.traffic_features` 
GROUP BY num_drivers
ORDER BY num_drivers
LIMIT 5

In [None]:
query_avg_dd  = """ SELECT AVG(number_of_fatalities) AS avg_fatalities, number_of_drunk_drivers AS num_dd
                    FROM `traffic_fatalities.traffic_features` 
                    GROUP BY num_dd
                    ORDER BY num_dd"""

avg_dd = pd.read_gbq(query_avg_dd, project_id=project_id, dialect='standard')

plt.bar(avg_dd["num_dd"], avg_dd["avg_fatalities"])
plt.title("Average Fatalities By Number of Drunk Drivers")

Now we will investigate the percentage of drivers in fatal accidents who are drunk, across different states.

In [None]:
%%bigquery --project $project_id

SELECT SUM(number_of_drunk_drivers)/SUM(number_of_motor_vehicles_in_transport_mvit) * 100 as percent_drunk, SUM(number_of_drunk_drivers) AS num_drunk, SUM(number_of_motor_vehicles_in_transport_mvit) AS num_cars, t.state_name, s.state 
FROM `traffic_fatalities.traffic_features` AS t 
JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON t.state_name = s.state_name 
GROUP BY state_name, state 
ORDER BY state_name

In [None]:
query_percent_drunk_drivers_per_state = """ SELECT SUM(number_of_drunk_drivers)/SUM(number_of_motor_vehicles_in_transport_mvit) * 100 as percent_drunk, SUM(number_of_drunk_drivers) AS num_drunk, SUM(number_of_motor_vehicles_in_transport_mvit) AS num_cars, t.state_name, s.state 
                                            FROM `traffic_fatalities.traffic_features` AS t 
                                            JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON t.state_name = s.state_name 
                                            GROUP BY state_name, state 
                                            ORDER BY state_name"""
percent_drunk_drivers_per_state = pd.read_gbq(query_percent_drunk_drivers_per_state, project_id=project_id, dialect='standard')
fig_percent_drunk_drivers_per_state = px.choropleth(
    percent_drunk_drivers_per_state,
    locations='state',
    locationmode='USA-states',
    color='percent_drunk',
    color_continuous_scale = 'Reds',
)

fig_percent_drunk_drivers_per_state.update_layout(
    title_text = 'Percent Drunk Drivers 2015-2020',
    geo_scope='usa', # limit map scope to USA
)

Interestingly, we see a higher incidence of drunk drivers in the Midwest and the Northeast. This may be due to differences in alcohol-related laws and accessibility. However, that hypothesis is at odds with the high concentration of drunk drivers in the Northeast, where drinking laws are typically stricter. Contrary to our earlier assumption, states with high tourism rates appear to have lower rates of drunk drivers than other states. More analysis is needed to determine the influence drunk drivers have on state fatality rates.

Plotting the number of drunk drivers per capita against the fatalities per capita yields:

In [None]:
query_drunk_per_cap =    """WITH state AS (
                                SELECT SUM(number_of_fatalities) AS num_fatalities, state_name, state_number, SUM(number_of_drunk_drivers) AS num_drunk
                                FROM `traffic_fatalities.traffic_features` 
                                GROUP BY state_name, state_number
                            ), per_cap AS (
                                SELECT f.num_fatalities / p.total_pop / 6 * 100 AS fatalities_per_capita_per_year, f.num_drunk/p.total_pop/ 6 *100 AS drunk_drivers_per_cap, f.state_name, s.state 
                                FROM state AS f 
                                JOIN `bigquery-public-data.census_bureau_acs.state_2020_5yr` AS p ON f.state_number = CAST(p.geo_id AS INT64) 
                                JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON f.state_name = s.state_name 
                                ORDER BY fatalities_per_capita_per_year DESC
                            )
                            SELECT drunk_drivers_per_cap, fatalities_per_capita_per_year, t.state_name, s.state 
                            FROM `traffic_fatalities.traffic_features` AS t 
                            JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON t.state_name = s.state_name 
                            JOIN per_cap AS p ON p.state_name = t.state_name 
                            GROUP BY state_name, state, drunk_drivers_per_cap, fatalities_per_capita_per_year
                            ORDER BY state_name"""
                            
drunk_per_cap = pd.read_gbq(query_drunk_per_cap, project_id=project_id, dialect='standard')
plt.scatter(drunk_per_cap["fatalities_per_capita_per_year"], drunk_per_cap["drunk_drivers_per_cap"])
plt.title("Yearly % of Drunk Drivers Per Cap vs Yearly Fatalities Per Cap")
plt.xlabel("Yearly Fatalities Per Capita")
plt.ylabel("Yearly Drunk Drivers Per Capita")

The number of drunk drivers in fatal accidents per capita has a strong, positive correlation with the number of fatalities per capita on a state-by-state basis. As the number of drunk drivers per capita increases, the fatalities per capita increase too. Although this relationship isn't perfect, it suggests that the number of drunk drivers involved in fatal accidents is a good predictor of a state's fatality rate.
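
One way to put a number on the strength of the relationship in the scatter plot is the Pearson correlation coefficient. The snippet below is a minimal sketch that uses only the standard library; the helper `pearson_r` and the toy values are hypothetical and are not results from the dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (covariance up to a constant factor)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up per-state values, for illustration only (not taken from the data).
toy_drunk_per_cap = [0.002, 0.003, 0.004, 0.005, 0.006]
toy_fatalities_per_cap = [0.008, 0.011, 0.012, 0.016, 0.017]

r = pearson_r(toy_drunk_per_cap, toy_fatalities_per_cap)
print(f"Pearson r = {r:.3f}")
```

In the notebook, the same helper could be applied to `drunk_per_cap["drunk_drivers_per_cap"]` and `drunk_per_cap["fatalities_per_capita_per_year"]`; pandas' `Series.corr` should give the same value.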

Overall, we have found that the number of drunk drivers in fatal accidents is a very strong determinant of state fatality rates. On a statewide basis, the number of drunk drivers in fatal accidents per capita is correlated with the number of fatalities per capita.

### Emergency Response Time

Finally, we believe that time to scene and time to hospital have a direct effect on the number of fatalities in a car accident. We also wonder whether emergency response times vary significantly from state to state, and whether that variation mirrors the differences in average traffic fatalities per capita.

First, we view the number of fatalities in a fatal accident in comparison to the average time to scene

In [None]:
%%bigquery --project $project_id

SELECT number_of_fatalities, COUNT(*) AS count, AVG(time_to_scene) AS avg_time_to_scene
FROM `traffic_fatalities.traffic_features` 
WHERE time_to_scene != 9999
GROUP BY number_of_fatalities
ORDER BY number_of_fatalities

Disregarding the outliers that have little frequency, it appears there is a positive correlation between number of fatalities and average time to scene. That is, as the number of fatalities in a fatal accident increases, so does the average time to scene. This may mean that more people die when the first responders take longer to reach the scene.

Next, let's view this trend among each state

In [None]:
%%bigquery --project $project_id

SELECT AVG(w.time_to_scene) avg_time_to_scene, COUNT(*) AS count, s.state, w.state_name 
FROM `traffic_fatalities.traffic_features` AS w 
JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON w.state_name = s.state_name 
WHERE time_to_scene != 9999 
GROUP BY state, state_name
ORDER BY state_name

In [None]:
query_tts_state =    """SELECT AVG(w.time_to_scene) avg_time_to_scene, COUNT(*) AS count, s.state, w.state_name 
                        FROM `traffic_fatalities.traffic_features` AS w 
                        JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON w.state_name = s.state_name 
                        WHERE time_to_scene != 9999
                        GROUP BY state, state_name
                        ORDER BY state_name"""

tts_state = pd.read_gbq(query_tts_state, project_id=project_id, dialect='standard')

fig_tts_state = px.choropleth(
    tts_state,
    locations='state',
    locationmode='USA-states',
    color='avg_time_to_scene',
    color_continuous_scale = 'Reds',
)

fig_tts_state.update_layout(
    title_text = 'Average Time to Scene 2015-2020',
    geo_scope='usa', # limit map scope to USA
)

Here we see the average emergency response time by state. There is some variation in response time, with no real regional consistency. Perhaps this reflects differences in spending on emergency services, combined with differences in average proximity to a hospital. Surprisingly, the average response time in California was not great. This may be due to the sheer population of the state, where congestion could limit the ability of emergency vehicles to reach the scene quickly. Alternatively, cities there are more spread out than in East Coast states, so on average it takes longer for emergency vehicles to arrive.

In [None]:
fig_avg_fatalities_per_state.update()

The heat maps above show strong similarities between average fatalities per accident and average emergency response time. This suggests that the quality of emergency healthcare may be a strong determinant of the number of fatalities in motor accidents.

In [None]:
plt.scatter(avg_fatalities_per_state["avg_fatalities"], tts_state["avg_time_to_scene"])
plt.title("Average Time to Scene vs Fatalities per Accident")
plt.ylabel("Average Time to Scene (minutes)")
plt.xlabel("Fatalities per Accident")

The strong, positive correlation between Fatalities per Accident and Average time to scene becomes even more apparent when graphed on a scatter plot. As the average time to scene increases, so does the fatality rate in a state. This suggests that the average time to scene is a strong determining factor in the fatality rate of a state.

Now moving onto average time to hospital arrival.

In [None]:
%%bigquery --project $project_id

SELECT number_of_fatalities, COUNT(*) AS count, AVG(time_to_hospital) AS avg_time_to_hospital
FROM `traffic_fatalities.traffic_features` 
WHERE time_to_hospital != 9999
GROUP BY number_of_fatalities
ORDER BY number_of_fatalities

The positive relationship nationwide between the number of fatalities and average time to hospital seems to be even more apparent than the one with average time to scene.

Viewing this trend across states yields:

In [None]:
%%bigquery --project $project_id

SELECT AVG(w.time_to_hospital) avg_time_to_hospital, COUNT(*) AS count, s.state, w.state_name 
FROM `traffic_fatalities.traffic_features` AS w 
JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON w.state_name = s.state_name 
WHERE time_to_hospital != 9999 
GROUP BY state, state_name
ORDER BY state_name

In [None]:
query_avg_tth_state = """ SELECT AVG(w.number_of_fatalities) AS avg_fatalities, AVG(w.time_to_hospital) avg_time_to_hospital, COUNT(*) AS count, s.state, w.state_name 
                    FROM `traffic_fatalities.traffic_features` AS w 
                    JOIN `bigquery-public-data.geo_us_boundaries.states` AS s ON w.state_name = s.state_name 
                    WHERE time_to_hospital != 9999 
                    GROUP BY state, state_name
                    ORDER BY state_name"""

avg_tth_state = pd.read_gbq(query_avg_tth_state, project_id=project_id, dialect='standard')

fig_avg_tth_state = px.choropleth(
    avg_tth_state,
    locations='state',
    locationmode='USA-states',
    color='avg_time_to_hospital',
    color_continuous_scale = 'Reds',
)

fig_avg_tth_state.update_layout(
    title_text = 'Average Time to Hospital 2015-2020',
    geo_scope='usa', # limit map scope to USA
)

States such as California (CA) and Nevada (NV) stand out with remarkably short average times to hospital (50.91 minutes and 61.44 minutes, respectively), suggesting efficient emergency response systems. Conversely, states like Texas (TX) and North Dakota (ND) report longer average times to hospital (76.25 minutes and 87.47 minutes, respectively), indicating potential areas for improvement in emergency medical services.

Regional patterns are discernible, with Southern states such as Louisiana (LA) and Mississippi (MS) experiencing longer average times to hospital, potentially reflecting challenges in emergency response infrastructure. In contrast, states in the Midwest, including Ohio (OH) and Kansas (KS), demonstrate relatively shorter average times to hospital, emphasizing regional disparities in emergency medical service efficiency.

Plotting the average time to hospital against the average state fatalities per accident reveals little correlation between the two variables.

In [None]:
# x: average fatalities per accident (Indiana excluded), y: average time to hospital
plt.scatter(avg_fatalities_per_state[avg_fatalities_per_state["state_name"] != 'Indiana']["avg_fatalities"], avg_tth_state["avg_time_to_hospital"])
plt.title("Average Time to Hospital vs. Average Fatalities per Accident")
plt.ylabel("Average Time to Hospital (minutes)")
plt.xlabel("Average Fatalities per Accident")

The relationship is weakly positive at best. One would assume that the longer it takes to get to a hospital, the higher the fatality rate. We wonder if, much of the time, it is already too late when emergency vehicles arrive at a scene. In that case there might be less urgency to transport victims who are headed for autopsy rather than emergency surgery. We believe this is significantly skewing the data. This is consistent with the bimodal distribution of time_to_hospital, with the earlier hump corresponding to victims who can still be saved and the second to those who cannot.

In [None]:
query = """
SELECT time_to_hospital
FROM `traffic_fatalities.traffic_features`
WHERE time_to_hospital != 9999 AND time_to_hospital < 240
"""
query_job = client.query(query)
results = query_job.result()

# Store time_to_hospital in a list for plotting
time_to_hospital = [row.time_to_hospital for row in results]

# Plotting the distribution
plt.figure(figsize=(8, 6))
plt.hist(time_to_hospital, bins=100, color='skyblue', edgecolor='black')
plt.xlabel('Time to Hospital')
plt.ylabel('Frequency')
plt.title('Distribution of Time to Hospital')
plt.grid(True)
plt.show()

The distribution of time to scene is much less bimodal, since emergency vehicles are dispatched before fatalities can be confirmed. The second hump can be attributed to accidents that are hard to reach or accidents with already confirmed fatalities.

In [None]:
query = """
SELECT time_to_scene
FROM `traffic_fatalities.traffic_features`
WHERE time_to_scene != 9999 AND time_to_scene < 240
"""
query_job = client.query(query)
results = query_job.result()

# Store time_to_scene in a list for plotting
time_to_scene = [row.time_to_scene for row in results]

# Plotting the distribution
plt.figure(figsize=(8, 6))
plt.hist(time_to_scene, bins=100, color='skyblue', edgecolor='black')
plt.xlabel('Time to Scene')
plt.ylabel('Frequency')
plt.title('Distribution of Time to Scene')
plt.grid(True)
plt.show()

Taking all the data into account, there appears to be a moderate positive relationship between time to scene and fatality rate, confirming our hypothesis. On the contrary, there was little to no correlation between time to hospital and fatality rate. This revelation went against our prior hypothesis that longer time to hospital would yield more deaths. Across states, the average time to scene seemed to be a strong predictor of fatality rate while the average time to hospital was not.

---
## Baseline and Data Prediction
---

### Randomize Data and Create Training - Validation - Test Split

Number of rows to be used for 60 - 20 - 20 Training - Validation - Test split
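
As a rough check on the split boundaries, the cutoffs for a 60 - 20 - 20 split can be computed from the row count. This is only a sketch: `split_cutoffs` is a hypothetical helper, and the row count of 203465 is the figure quoted in the Machine Learning section.

```python
def split_cutoffs(n_rows, train_pct=60, validation_pct=20):
    """Return the last rand_id of the training set and of the validation set.

    Integer arithmetic is used to avoid floating-point rounding surprises.
    """
    train_end = n_rows * train_pct // 100
    validation_end = n_rows * (train_pct + validation_pct) // 100
    return train_end, validation_end

n_rows = 203465  # row count quoted in the Machine Learning section
train_end, validation_end = split_cutoffs(n_rows)
print(train_end, validation_end)  # 122079 162772
```

The first cutoff matches the 122079 used in the queries below. The second cutoff in those queries is 136942, which is lower than the 80% mark computed here, so the validation set as written is smaller than 20% of the rows.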

In [None]:
%%bigquery --project $project_id
SELECT COUNT(*)
FROM `traffic_fatalities.traffic_features`

Randomize the order of entries with set seed

In [None]:
%%bigquery --project $project_id

CREATE OR REPLACE TABLE `traffic_fatalities.traffic_rand_data` AS
SELECT
    *,
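    -- Hash a per-row value (seed string + id) so the order is pseudo-random but repeatable;
    -- hashing the constant string alone gives every row the same sort key
    ROW_NUMBER() OVER(ORDER BY FARM_FINGERPRINT(CONCAT("fatalities", CAST(id AS STRING)))) AS rand_id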
FROM `traffic_fatalities.traffic_features`
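
A seeded shuffle of this kind only works if the hashed value differs from row to row. The sketch below illustrates the idea in plain Python, assuming each row has a unique `id`; SHA-256 stands in for `FARM_FINGERPRINT` (which is not in the standard library), and `seeded_order` and the toy ids are hypothetical.

```python
import hashlib

def seeded_order(ids, seed="fatalities"):
    """Sort ids by a hash of seed + id, giving a repeatable pseudo-random order."""
    def key(row_id):
        digest = hashlib.sha256(f"{seed}:{row_id}".encode("utf-8")).digest()
        # Use the first 8 bytes as an integer sort key
        return int.from_bytes(digest[:8], "big")
    return sorted(ids, key=key)

toy_ids = list(range(1, 11))  # hypothetical row ids
shuffled = seeded_order(toy_ids)

# rand_id is the 1-based position in the shuffled order, like ROW_NUMBER()
rand_id = {row_id: pos for pos, row_id in enumerate(shuffled, start=1)}
print(shuffled)
```

Because the sort key depends only on the seed and the id, rerunning the shuffle yields the same `rand_id` for every row, which keeps the training, validation, and test sets stable between runs.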

In [None]:
%%bigquery --project $project_id

CREATE OR REPLACE TABLE `traffic_fatalities.traffic_training_set` AS(
SELECT
    id,
    state_number,
    state_name,
    number_of_motor_vehicles_in_transport_mvit,
    number_of_parked_working_vehicles,
    num_vehicles,
    number_of_persons_in_motor_vehicles_in_transport_mvit,
    number_of_persons_not_in_motor_vehicles_in_transport_mvit,
    num_people,
    month_of_crash,
    month_of_crash_name,
    day_of_week,
    day_of_week_name,
    hour_of_crash,
    hour_of_crash_name,
    land_use,
    land_use_name,
    atmospheric_conditions,
    atmospheric_conditions_name,
    time_to_scene,
    time_to_hospital,
    number_of_drunk_drivers,
    number_of_fatalities,
    label,
FROM `traffic_fatalities.traffic_rand_data`
WHERE rand_id <= 122079
)

In [None]:
%%bigquery --project $project_id

CREATE OR REPLACE TABLE `traffic_fatalities.traffic_validation_set` AS(
SELECT
    id,
    state_number,
    state_name,
    number_of_motor_vehicles_in_transport_mvit,
    number_of_parked_working_vehicles,
    num_vehicles,
    number_of_persons_in_motor_vehicles_in_transport_mvit,
    number_of_persons_not_in_motor_vehicles_in_transport_mvit,
    num_people,
    month_of_crash,
    month_of_crash_name,
    day_of_week,
    day_of_week_name,
    hour_of_crash,
    hour_of_crash_name,
    land_use,
    land_use_name,
    atmospheric_conditions,
    atmospheric_conditions_name,
    time_to_scene,
    time_to_hospital,
    number_of_drunk_drivers,
    number_of_fatalities,
    label,
FROM `traffic_fatalities.traffic_rand_data`
WHERE rand_id <= 136942 AND rand_id > 122079
)

In [None]:
%%bigquery --project $project_id

CREATE OR REPLACE TABLE `traffic_fatalities.traffic_test_set` AS(
SELECT
    id,
    state_number,
    state_name,
    number_of_motor_vehicles_in_transport_mvit,
    number_of_parked_working_vehicles,
    num_vehicles,
    number_of_persons_in_motor_vehicles_in_transport_mvit,
    number_of_persons_not_in_motor_vehicles_in_transport_mvit,
    num_people,
    month_of_crash,
    month_of_crash_name,
    day_of_week,
    day_of_week_name,
    hour_of_crash,
    hour_of_crash_name,
    land_use,
    land_use_name,
    atmospheric_conditions,
    atmospheric_conditions_name,
    time_to_scene,
    time_to_hospital,
    number_of_drunk_drivers,
    number_of_fatalities,
    label,
FROM `traffic_fatalities.traffic_rand_data`
WHERE rand_id > 136942
)

### Baseline

Let's establish a baseline for our machine learning model using K-nearest-neighbours

In [None]:
%%bigquery --project $project_id
WITH training_set AS (
    SELECT
        state_number,
        number_of_motor_vehicles_in_transport_mvit,
        number_of_parked_working_vehicles,
        num_vehicles,
        number_of_persons_in_motor_vehicles_in_transport_mvit,
        number_of_persons_not_in_motor_vehicles_in_transport_mvit,
        num_people,
        month_of_crash,
        day_of_week,
        hour_of_crash,
        land_use,
        atmospheric_conditions,
        time_to_scene,
        time_to_hospital,
        number_of_drunk_drivers,
        number_of_fatalities,
        label
    FROM `traffic_fatalities.traffic_training_set`
), prediction_set AS (
    SELECT
        id,
        state_number,
        number_of_motor_vehicles_in_transport_mvit,
        number_of_parked_working_vehicles,
        num_vehicles,
        number_of_persons_in_motor_vehicles_in_transport_mvit,
        number_of_persons_not_in_motor_vehicles_in_transport_mvit,
        num_people,
        month_of_crash,
        day_of_week,
        hour_of_crash,
        land_use,
        atmospheric_conditions,
        time_to_scene,
        time_to_hospital,
        number_of_drunk_drivers,
        number_of_fatalities,
    FROM `traffic_fatalities.traffic_test_set`
    WHERE num_vehicles > 2 AND num_people > 2 AND number_of_drunk_drivers >= 1 AND time_to_scene >= 5
    LIMIT 10
), generate_score AS (
    SELECT 
        p.id,
        t.label,
        SQRT(
            POWER(t.state_number - p.state_number, 2) + 
            POWER(t.number_of_motor_vehicles_in_transport_mvit - p.number_of_motor_vehicles_in_transport_mvit, 2) +
            POWER(t.number_of_parked_working_vehicles - p.number_of_parked_working_vehicles, 2) +
            POWER(t.num_vehicles - p.num_vehicles, 2) +
            POWER(t.number_of_persons_in_motor_vehicles_in_transport_mvit - p.number_of_persons_in_motor_vehicles_in_transport_mvit, 2) +
            POWER(t.number_of_persons_not_in_motor_vehicles_in_transport_mvit - p.number_of_persons_not_in_motor_vehicles_in_transport_mvit, 2) +
            POWER(t.num_people - p.num_people, 2) +
            POWER(t.month_of_crash - p.month_of_crash, 2) +
            POWER(t.day_of_week - p.day_of_week, 2) +
            POWER(t.hour_of_crash - p.hour_of_crash, 2) +
            POWER(t.land_use - p.land_use, 2) +
            POWER(t.atmospheric_conditions - p.atmospheric_conditions, 2) +
            POWER(t.time_to_scene - p.time_to_scene, 2) +
            POWER(t.time_to_hospital - p.time_to_hospital, 2) +
            POWER(t.number_of_drunk_drivers - p.number_of_drunk_drivers, 2) + 
            POWER(t.number_of_fatalities - p.number_of_fatalities, 2)
        ) AS score
    FROM training_set AS t
    CROSS JOIN prediction_set AS p
), rank_score AS (
    SELECT
        *,
        RANK() OVER (PARTITION BY id ORDER BY score) AS rank
    FROM generate_score
), knn AS (
    SELECT
        id,
        AVG(label) AS avg_knn_label
    FROM rank_score
    WHERE rank <= 50
    GROUP BY id
), predict AS (
    SELECT
        id,
        ROUND(avg_knn_label, 0) AS predicted_label,
        avg_knn_label AS predicted_prob
    FROM knn
), evaluate AS (
    SELECT COUNTIF(prediction.predicted_label = actual.label)/COUNT(*) AS accuracy
    FROM predict AS prediction
    JOIN `traffic_fatalities.traffic_test_set` AS actual ON prediction.id = actual.id
)

SELECT * FROM evaluate

The baseline achieved an 80% accuracy rate for the first 10 accidents in our test set that have more than two vehicles, more than two people, at least one drunk driver, and a time to scene of at least 5 minutes. We expect our logistic regression model to match, if not improve on, that mark.
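
The logic of the KNN query above can be mirrored in a few lines of Python: compute the Euclidean distance from a query row to every training row, keep the k closest, average their labels, and round. This is a minimal sketch on made-up two-feature rows, not the notebook's data; `knn_predict` is a hypothetical helper, and unlike `RANK()` in the SQL it does not keep extra rows when distances tie.

```python
import math

def knn_predict(train_rows, train_labels, query, k=50):
    """Average the labels of the k nearest training rows, then round to 0 or 1."""
    distances = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    )
    nearest = distances[:k]
    avg_label = sum(label for _, label in nearest) / len(nearest)
    # avg_label plays the role of predicted_prob; 0.5 and above rounds up to 1
    return (1 if avg_label >= 0.5 else 0), avg_label

# Toy training data: two well-separated clusters with labels 0 and 1.
toy_rows = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
toy_labels = [0, 0, 0, 1, 1, 1]

low_pred, low_prob = knn_predict(toy_rows, toy_labels, (2, 2), k=3)
high_pred, high_prob = knn_predict(toy_rows, toy_labels, (8, 7), k=3)
print(low_pred, high_pred)
```

Every prediction scans the whole training set, which is why the SQL version needs a `CROSS JOIN` between the training rows and the rows to predict.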

### Machine Learning

Train on 60% of the data: 0.6 * 203465 = 122079 rows.

We chose to train on state_name, number_of_motor_vehicles_in_transport_mvit, number_of_parked_working_vehicles, number_of_persons_in_motor_vehicles_in_transport_mvit, number_of_persons_not_in_motor_vehicles_in_transport_mvit, month_of_crash_name, day_of_week_name, hour_of_crash_name, land_use_name, atmospheric_conditions_name, time_to_scene, time_to_hospital, and number_of_drunk_drivers because these were the variables we analyzed during the data analysis phase. The range of variables and the meaning behind them should allow the model to accurately predict the label, which is derived from the number of fatalities. We chose to include the id so that we can identify the results during the prediction stage and verify how accurate our model is.

In [None]:
%%bigquery --project $project_id

CREATE OR REPLACE MODEL `traffic_fatalities.traffic_model`
OPTIONS(model_type='logistic_reg') AS
SELECT 
    *
FROM `traffic_fatalities.traffic_training_set`

Get training statistics

In [None]:
%%bigquery --project $project_id

# Run cell to view training stats

SELECT
  *
FROM
  ML.TRAINING_INFO(MODEL `traffic_fatalities.traffic_model`)

The training statistics are very good, where we define good as accurate. The model achieved a training loss of only about 0.09. Training loss represents the difference between the model's predicted values and the actual values in the training dataset, so a small training loss suggests that the model's predictions closely align with the actual outcomes. The training loss is so low that we worry the model is overfitted. We will know whether it is when we evaluate the model on the validation set.

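Evaluate the model on the validation set (the next block of randomized data, 122079 < rand_id <= 136942). Note that 0.6 * 203465 = 122079, whereas 0.8 * 203465 = 162772, so the 136942 cutoff used in our queries ends the validation set at roughly 67% of the rows rather than 80%.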

In [None]:
%%bigquery --project $project_id

SELECT
  *
FROM
  ML.EVALUATE(MODEL `traffic_fatalities.traffic_model`, (
      SELECT
        *
      FROM `traffic_fatalities.traffic_validation_set`))

The model did well again, which suggests it was not overfit. Precision is the ratio of true positive predictions to all predicted positives: about 92% of the model's positive predictions were correct, so it predicts severe accidents well. Recall is the ratio of true positive predictions to all actual positives: the model identified approximately 87% of the actual positive instances. Accuracy is the ratio of correctly predicted instances to all instances, and the model reached around 90%. The F1 score, the harmonic mean of precision and recall, is approximately 89%, suggesting a good balance between the two. Log loss measures the performance of a classifier whose output is a probability between 0 and 1; lower values are better. With a log loss of roughly 0.35, the model's predictions are moderately confident but leave room to improve. The area under the Receiver Operating Characteristic curve (ROC AUC) summarizes the trade-off between true positive rate and false positive rate; the score of 0.94 indicates strong discrimination between positive and negative instances. If we were to improve one value, it would be the log loss: 0.35 is serviceable but shows the model has room for improvement.
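
To make the definitions above concrete, the sketch below computes precision, recall, accuracy, F1, and log loss from a handful of made-up labels and predicted probabilities. The function `classification_metrics` and the toy values are hypothetical; they are not the output of `ML.EVALUATE`.

```python
import math

def classification_metrics(y_true, y_prob, threshold=0.5):
    """Compute basic binary classification metrics from labels and probabilities."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / len(y_true)
    f1 = 2 * precision * recall / (precision + recall)
    eps = 1e-15  # clip probabilities so log() never sees 0
    log_loss = -sum(
        t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps))
        for t, p in zip(y_true, y_prob)
    ) / len(y_true)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1, "log_loss": log_loss}

# Toy labels and predicted probabilities, for illustration only.
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_prob = [0.9, 0.8, 0.4, 0.2, 0.6, 0.1, 0.7, 0.3]

metrics = classification_metrics(toy_true, toy_prob)
print(metrics)
```

The sketch assumes at least one predicted positive and one actual positive; otherwise precision or recall would divide by zero.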

We originally got subpar results when using a logistic regression model to classify the number of fatalities directly. The model had high accuracy but low precision, indicating that it was poor at predicting any number of fatalities above one. We therefore used the validation set to tune our hyperparameters and ultimately settled on our new label: the product of the number of fatalities and the ratio of the total number of people to the total number of cars. The problem that plagued the original model is still an issue with this label. Since a large share of the data has one car, one person, and one fatality, the most common severity rating is one. Setting the label to be any severity above one divides the dataset roughly in half, but it also makes the label easy to learn: when the model sees anything with more than one person or more than one car, it can assume that the accident is 'severe'.
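
Read literally, the label described above could be computed as follows. This is a sketch of one interpretation of the prose, not the code that built the `label` column (which is created earlier in the notebook); the function names are hypothetical.

```python
def severity(number_of_fatalities, num_people, num_vehicles):
    """Fatalities multiplied by the people-to-vehicles ratio."""
    if num_vehicles <= 0:
        raise ValueError("num_vehicles must be positive")
    return number_of_fatalities * num_people / num_vehicles

def severe_label(number_of_fatalities, num_people, num_vehicles):
    """1 if the severity is above one, else 0."""
    return 1 if severity(number_of_fatalities, num_people, num_vehicles) > 1 else 0

# One car, one person, one fatality: severity is exactly 1, so not 'severe'.
print(severe_label(1, 1, 1))
# One car, two people, one fatality: severity is 2, so 'severe'.
print(severe_label(1, 2, 1))
```

Under this reading, a single-fatality accident is labeled severe whenever there are more people than vehicles, which illustrates how strongly the label depends on counts the model also sees as features.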

Evaluate the model on the test set (the remaining randomized data, rand_id > 136942).

In [None]:
%%bigquery --project $project_id

SELECT
  *
FROM
  ML.EVALUATE(MODEL `traffic_fatalities.traffic_model`, (
      SELECT
        *
      FROM `traffic_fatalities.traffic_test_set`))

The model achieved very similar results on the test data, showing that it does a good job of predicting severe accidents.

Use the model to predict the label for the first 10 accidents in the test set where num_vehicles > 2, num_people > 2, number_of_drunk_drivers >= 1, and time_to_scene >= 5 minutes.

In [None]:
%%bigquery --project $project_id

SELECT
  *
FROM
  ML.PREDICT(MODEL `traffic_fatalities.traffic_model`, (
  SELECT
    id,
    state_number,
    state_name,
    number_of_motor_vehicles_in_transport_mvit,
    number_of_parked_working_vehicles,
    num_vehicles,
    number_of_persons_in_motor_vehicles_in_transport_mvit,
    number_of_persons_not_in_motor_vehicles_in_transport_mvit,
    num_people,
    month_of_crash,
    month_of_crash_name,
    day_of_week,
    day_of_week_name,
    hour_of_crash,
    hour_of_crash_name,
    land_use,
    land_use_name,
    atmospheric_conditions,
    atmospheric_conditions_name,
    time_to_scene,
    time_to_hospital,
    number_of_drunk_drivers,
    number_of_fatalities,
  FROM `traffic_fatalities.traffic_test_set`
  WHERE num_vehicles > 2 AND num_people > 2 AND number_of_drunk_drivers >= 1 AND time_to_scene >= 5
  LIMIT 10
  ))

Compare the output of the model to the actual values

In [None]:
%%bigquery --project $project_id

SELECT
    id,
    label
FROM `traffic_fatalities.traffic_test_set`
WHERE num_vehicles > 2 AND num_people > 2 AND number_of_drunk_drivers >= 1 AND time_to_scene >= 5
LIMIT 10

### Model Comparisons

The model misidentified two accidents: 4703212017 and 613822017. This gives the model an 80% accuracy rate for these entries. The baseline KNN query also achieved an accuracy of 80% for these 10 values. Therefore the model matched the baseline accuracy that we set out for it to accomplish. The difference between the two prediction methods is in complexity.

KNN has minimal training time because it simply stores the training data. KNN is a lazy learner, meaning it doesn't explicitly "train" on the data during the training phase; the majority of the computation occurs during prediction. Therefore, the prediction time in KNN can be higher, especially as the dataset size increases. Each prediction involves calculating distances between all combinations of training accidents and prediction accidents, so the prediction time grows in proportion to the product of the number of training rows and the number of prediction rows. On the other hand, training a Logistic Regression model involves iterative optimization algorithms (e.g., gradient descent). The training time can vary based on the convergence speed and the size of the dataset, but it will generally be more expensive than KNN's. Logistic Regression generally has lower prediction time than KNN, especially for large datasets, because prediction for a new instance is typically a simple linear operation.
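
The difference in prediction cost can be sketched as follows. Brute-force KNN needs one distance evaluation per pair of training row and query row, while a trained logistic regression model needs one weighted sum per query. Both functions are hypothetical illustrations, and the weights are made up rather than taken from `traffic_model`.

```python
import math

def knn_distance_evaluations(n_train, n_queries):
    """Distances computed by brute-force KNN: one per (training row, query) pair."""
    return n_train * n_queries

def logistic_predict(weights, bias, features):
    """Probability from a trained logistic regression: a dot product plus a sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Baseline setup in this notebook: 122079 training rows, 10 rows to predict.
print(knn_distance_evaluations(122079, 10))  # 1220790

# Made-up weights and features, for illustration only.
print(logistic_predict([0.8, -0.3], 0.1, [2.0, 1.0]))
```

The logistic prediction touches each feature once and never revisits the training data, which is why its prediction cost does not depend on the size of the training set.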

KNN's I/O costs during prediction can be significant, as it needs to compare each query instance against all stored training instances. For large datasets, this can lead to a higher I/O cost. The computational cost of KNN scales with the number of training instances, so larger datasets can result in higher cloud computing costs due to increased computation and I/O operations. Logistic Regression typically has lower I/O costs at prediction time, since it involves a simple linear operation based on learned weights without storing any data besides the weights. Its training cost is influenced by the size of the dataset. So, the I/O and monetary costs of KNN increase in the prediction phase as the dataset gets larger, whereas those of Logistic Regression depend mainly on the training step.

Logistic Regression is generally more scalable for large datasets, as its training and prediction times are less affected by the dataset size compared to KNN. Logistic Regression provides interpretable coefficients, which can be valuable for understanding feature importance. KNN, being a non-parametric method, doesn't offer straightforward interpretability.

## Conclusion

---

*TODO: Final conclusions based on the rest of your project*

---