## Are there specific states with higher or lower fatality rates? What factors might contribute to regional variations?

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Run this cell to authenticate yourself to BigQuery
from google.oauth2 import service_account
key_path = './cs145-project2-406000-9a59fc7c0b3d.json'
credential = service_account.Credentials.from_service_account_file(key_path)

In [None]:
# Initialize BigQuery client
from google.cloud import bigquery
%load_ext google.cloud.bigquery
%env GOOGLE_APPLICATION_CREDENTIALS=$key_path
project_id = "cs145-project2-406000"
client = bigquery.Client(credentials=credential, project=project_id)

## Taste of Dataset (for a particular year)

This is the main table of the dataset. It gives a high-level description of each crash: when and where it happened, plus some detail about the crash itself. It also provides a handful of other useful columns, such as the atmospheric condition, which indicates whether it was raining, cloudy, etc., and light_condition, which encodes how bright or dark it was as a numerical code (e.g. dark and lighted, dusk, daylight). The more interesting columns, however, are the crash time and the arrival time of emergency services at the scene, which may give us insight into how the speed of emergency response relates to the number of fatalities in a car accident. Also note that the consecutive number is the code for the crash, covering all cars involved in that particular accident. We will use this identification number to join the other tables and make the dataset more robust. There are some data inconsistencies from year to year; in particular, we exclude 2015 because it is missing a couple of columns that we will be using. We therefore draw quantitative insights from five years of data, not six.

We also need to keep in mind that a fatal accident can involve multiple vehicles or participants, so the other tables below may contain several rows that refer to the same crash, one per vehicle. For example, in a three-way crash, DISTRACT will include an entry for each of the three drivers. The same holds for other tables such as MANEUVER. It is crucial to account for this when we query the data or train a model.

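To make the one-row-per-driver point concrete, here is a minimal sketch that counts how many rows each crash contributes to a DISTRACT-style table. The rows and the `vehicle_number` field are made up for illustration; only the idea of several rows sharing one consecutive number comes from the text above.

```python
from collections import Counter

# Hypothetical DISTRACT-style rows: one row per driver, keyed by the crash ID.
# The vehicle_number field and all values are illustrative assumptions.
distract_rows = [
    {"consecutive_number": 10001, "vehicle_number": 1, "driver_distracted_by_name": "not reported"},
    {"consecutive_number": 10001, "vehicle_number": 2, "driver_distracted_by_name": "unknown"},
    {"consecutive_number": 10001, "vehicle_number": 3, "driver_distracted_by_name": "distracted by an outside person"},
    {"consecutive_number": 10002, "vehicle_number": 1, "driver_distracted_by_name": "not reported"},
]

# Count how many rows each crash contributes to the per-driver table.
rows_per_crash = Counter(row["consecutive_number"] for row in distract_rows)

for crash, n_rows in sorted(rows_per_crash.items()):
    print(crash, n_rows)
```

A three-way crash shows up three times here, so any per-crash statistic has to group by the consecutive number first.
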
In [None]:
%%bigquery --project $project_id
SELECT *
FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2016`
LIMIT 5


This is an add-on table that we can join to the accident table on the consecutive number. It lists the number and type of vision obstructions that occurred and contributed to the car accident, e.g. “reflected glare, bright sunlight”.

In [None]:
%%bigquery --project $project_id
SELECT *
FROM `bigquery-public-data.nhtsa_traffic_fatalities.vision_2016`
LIMIT 5


This is a simple add-on table that we can join to the accident table on the consecutive number. Here driver_distracted_by_name tells us whether or not the driver was distracted. If not, it reads “not reported” or “unknown”. If the driver was distracted, it gives the source of distraction, e.g. “distracted by an outside person”.

In [None]:
%%bigquery --project $project_id
SELECT *
FROM `bigquery-public-data.nhtsa_traffic_fatalities.distract_2016`
LIMIT 5


This is another add-on table that we can join to the accident table on the consecutive number. It tells us whether or not the driver had a physical impairment that may have contributed to the crash. If not, it reads “None/Apparently Normal” or “unknown”.

In [None]:
%%bigquery --project $project_id
SELECT *
FROM `bigquery-public-data.nhtsa_traffic_fatalities.drimpair_2016`
LIMIT 5


The table below can also be joined to the accident table on the consecutive number. It tells us whether or not the driver maneuvered to avoid an object or another car, e.g. “motor vehicle”.

In [None]:
%%bigquery --project $project_id
SELECT *
FROM `bigquery-public-data.nhtsa_traffic_fatalities.maneuver_2016`
LIMIT 5

ISSUES WE NOTICED:

After a few hours of querying, we realized that the consecutive number is a unique ID for each accident only within a given year; some codes are reused from year to year. We therefore had to append the year to these codes so that every accident is unique in the aggregated table we made.

There is no data for the state of Indiana.

CEVENT, VSOE, and VEVENT, although not included above, seem to describe very similar things: each holds the sequence-of-events coding for each crash (keyed by consecutive number). The data is very detailed and descriptive, with few matching entries, so it is hard to do much quantitative analysis on it. It also covers what happens as the crash occurs rather than contributing factors, so we judged it unrelated to our question and excluded it from our analysis.

Many of the entries in DISTRACT, DRIMPAIR, etc. are recorded as 'unknown' or 'not reported', which may skew the data significantly.

We noticed that the number of rows differs between ACCIDENT, DRIMPAIR, DISTRACT, etc. This is because a single accident ID can have multiple distraction, impairment, and vision obstruction reports, and the number of reports varies from table to table. We realized this would pose several issues when joining these tables to the accident tables. We also had to account for the fact that when an accident has multiple participants, there are multiple entries for that accident ID; e.g. in a head-on collision, the DRIMPAIR table has a row for each driver under the same accident ID stating whether that driver was physically impaired. As a result, some crashes ended up with thousands of rows after joining on several tables, due to the sheer number of permutations.

We therefore decided to cut the feature tables down to just two, DISTRACT and DRIMPAIR. However, the same issues persisted: the mismatch in the number of entries per crash made the data very confusing. The only way to incorporate other tables into the accident data would be to count the number of impairments or maneuvers for each unique accident, but we decided that this would abstract the data so much that it would no longer be meaningful. We are therefore sticking to just combining the accident tables across all years except 2015, because 2015 is missing some useful columns present in other years.

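The row blow-up described above can be reproduced on a toy example. The sketch below uses made-up rows and a hand-written `inner_join` helper (hypothetical, not part of the project code) to show that a crash with 3 distraction rows and 2 impairment rows yields 3 × 2 = 6 joined rows.

```python
from itertools import product

# Hypothetical (crash_id, report) rows; crash 10001 has 3 distraction rows
# and 2 impairment rows, crash 10002 has one of each.
distract = [(10001, "d1"), (10001, "d2"), (10001, "d3"), (10002, "d1")]
drimpair = [(10001, "i1"), (10001, "i2"), (10002, "i1")]


def inner_join(left, right):
    """Pair every left row with every right row that shares the crash ID."""
    return [
        (left_id, left_val, right_val)
        for (left_id, left_val), (right_id, right_val) in product(left, right)
        if left_id == right_id
    ]


joined = inner_join(distract, drimpair)
rows_for_10001 = [row for row in joined if row[0] == 10001]
print(len(rows_for_10001))  # 3 distraction rows x 2 impairment rows
print(len(joined))
```

Each additional per-driver table multiplies the row count again, which is how a single crash can reach thousands of rows.
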





## Investigate how we should create our car severity equation

Below is a query that counts accidents by the number of vehicles in transport. As we can see, most accidents involve one car, and the number of accidents increases as the number of motor vehicles in transport decreases. Ultimately, we want to see whether the number of vehicles in transport involved in an accident should be used to determine crash severity.

In [None]:
%%bigquery --project $project_id

SELECT number_of_motor_vehicles_in_transport_mvit, COUNT(number_of_motor_vehicles_in_transport_mvit) AS count
FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2016`
GROUP BY number_of_motor_vehicles_in_transport_mvit
ORDER BY number_of_motor_vehicles_in_transport_mvit

Below is a query that aims to investigate the number of accidents with different numbers of vehicles that are parked.

In [None]:
%%bigquery --project $project_id

SELECT number_of_parked_working_vehicles, COUNT(number_of_parked_working_vehicles) AS count
FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2016`
GROUP BY number_of_parked_working_vehicles
ORDER BY number_of_parked_working_vehicles

We also wanted to see how accidents are stratified by number of fatalities. As with the number of vehicles involved, the number of fatalities in an accident is mostly 1, which makes sense. This suggests that most fatal accidents in the US do not involve more than just the driver.

In [None]:
%%bigquery --project $project_id

SELECT number_of_fatalities, COUNT(number_of_fatalities) AS count
FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2016`
GROUP BY number_of_fatalities
ORDER BY number_of_fatalities

When determining the label, we were disappointed to find that the dataset only includes accidents with fatalities. We could do much more with a dataset that included accidents both with and without fatalities. Because the vast majority of entries have exactly one fatality, determining severity purely from the number of fatalities would not be realistic for a model.

We therefore initially decided to construct an artificial label variable that estimates the severity of a crash and produces a reasonable distribution between low and high severity. We let severity = LOG(number_of_motor_vehicles_in_transport_mvit) + 2 * LOG(number_of_parked_working_vehicles + 1) + number_of_fatalities, which equals 1 for the baseline crash with one moving vehicle, no parked vehicles, and one fatality. The parked-vehicle term reflects that parked vehicles are less likely to have occupants. Moreover, we think the number of injuries and the damage caused will increase greatly with the number of moving vehicles in the crash. Finally, we weight the number of fatalities in an accident the most, for obvious reasons. Below we see that the data is still slightly biased towards "low severity fatal accidents".

As you can see below, the metric gives us roughly a 50/50 split between high and low severity. Here we define high severity as any severity value other than 1, which means the crash involved more than a single car and a single fatality.

Nevertheless, we decided this number was too abstract and instead opted to classify a fatal crash as severe if it involves more than one car or more than one fatality. Our label is therefore IF(number_of_motor_vehicles_in_transport_mvit + number_of_parked_working_vehicles > 1 OR number_of_fatalities > 1, 1, 0).

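As a rough illustration, the two definitions above can be written as plain Python functions. This is a sketch, not the project's code: the function names are hypothetical, and `math.log` (natural log) is assumed to correspond to the single-argument SQL `LOG`.

```python
import math


def severity_score(moving, parked, fatalities):
    """Initial severity score from the text.

    Assumes at least one moving vehicle, since log(0) is undefined.
    """
    return math.log(moving) + 2 * math.log(parked + 1) + fatalities


def severe_label(moving, parked, fatalities):
    """Final binary label: 1 if more than one vehicle or more than one fatality."""
    return 1 if (moving + parked > 1 or fatalities > 1) else 0


# Baseline crash: one moving vehicle, no parked vehicles, one fatality.
print(severity_score(1, 0, 1), severe_label(1, 0, 1))
# Two moving vehicles, one fatality.
print(severity_score(2, 0, 1), severe_label(2, 0, 1))
```

Under these definitions, the crashes with a score of exactly 1 are the same crashes that receive label 0, so the binary label keeps the same split while being easier to explain.
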
In [None]:
%%bigquery --project $project_id

SELECT IF(number_of_motor_vehicles_in_transport_mvit + number_of_parked_working_vehicles > 1 OR number_of_fatalities > 1, 1, 0) AS label, COUNT(*) AS count
FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2016`
GROUP BY label

## Aggregate the Data we want into one Table

We unioned the accident tables from all years (2016-2020). We noticed that the consecutive numbers (the unique ID code for a specific accident within a given year) are reused from year to year. To overcome this, we appended the year to the consecutive number so that all accidents remain unique once the years are combined.

We excluded the 2015 table as it was missing some crucial columns.

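The ID scheme can be sketched on toy data as follows. The `crash_id` helper and the sample numbers are illustrative assumptions; the sketch only shows why appending the year removes collisions between years.

```python
def crash_id(consecutive_number, year_of_crash):
    """Build a cross-year ID by appending the year to the consecutive number."""
    return f"{consecutive_number}{year_of_crash}"


# Hypothetical (consecutive_number, year_of_crash) pairs; 10001 appears in both years.
rows_2016 = [(10001, 2016), (10002, 2016)]
rows_2017 = [(10001, 2017)]
all_rows = rows_2016 + rows_2017

raw_ids = [number for number, _ in all_rows]
combined_ids = [crash_id(number, year) for number, year in all_rows]

print(len(set(raw_ids)), "distinct raw IDs for", len(all_rows), "crashes")
print(len(set(combined_ids)), "distinct combined IDs for", len(all_rows), "crashes")
```
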
WE ALSO ADDED THREE COLUMNS:

label: 1 if the accident is labelled as severe, i.e. it involved more than one car or more than one fatality, and 0 otherwise.

time_to_scene: the emergency response time in minutes, i.e. the difference between the time of notification and the time of arrival of emergency services at the scene. It is set to 9999 when any hour or minute component is unknown (hour > 23 or minute > 59), and it wraps past midnight when the arrival falls on the next day. Both times are converted to minutes since midnight before subtracting, so that 10:50 to 11:10 counts as 20 minutes:

IF(hour_of_notification > 23 OR minute_of_notification > 59 OR hour_of_arrival_at_scene > 23 OR minute_of_arrival_at_scene > 59, 9999, MOD((hour_of_arrival_at_scene * 60 + minute_of_arrival_at_scene) - (hour_of_notification * 60 + minute_of_notification) + 1440, 1440)) AS time_to_scene

time_to_hospital: the same calculation, from notification to EMS arrival at the hospital.



In [None]:
%%bigquery --project $project_id
CREATE OR REPLACE TABLE traffic_fatalities.traffic_features AS
-- One SELECT per year, combined with UNION ALL; the year is appended to the ID to keep it unique.
SELECT
  CONCAT(accident2016.consecutive_number, accident2016.year_of_crash) AS id,
  accident2016.state_name,
  accident2016.number_of_motor_vehicles_in_transport_mvit,
  accident2016.number_of_parked_working_vehicles,
  accident2016.number_of_persons_in_motor_vehicles_in_transport_mvit,
  accident2016.number_of_persons_not_in_motor_vehicles_in_transport_mvit,
  accident2016.city_name,
  accident2016.day_name,
  accident2016.month_of_crash_name,
  accident2016.day_of_week_name,
  accident2016.hour_of_crash_name,
  accident2016.year_of_crash,
  accident2016.route_signing_name,
  accident2016.land_use_name,
  accident2016.first_harmful_event_name,
  accident2016.work_zone_name,
  accident2016.relation_to_trafficway_name,
  accident2016.light_condition_name,
  accident2016.atmospheric_conditions_name,
  -- Minutes from notification to arrival, 9999 if unknown, wrapping past midnight
  IF(accident2016.hour_of_notification > 23 OR accident2016.minute_of_notification > 59 OR accident2016.hour_of_arrival_at_scene > 23 OR accident2016.minute_of_arrival_at_scene > 59, 9999, MOD((accident2016.hour_of_arrival_at_scene * 60 + accident2016.minute_of_arrival_at_scene) - (accident2016.hour_of_notification * 60 + accident2016.minute_of_notification) + 1440, 1440)) AS time_to_scene,
  IF(accident2016.hour_of_notification > 23 OR accident2016.minute_of_notification > 59 OR accident2016.hour_of_ems_arrival_at_hospital > 23 OR accident2016.minute_of_ems_arrival_at_hospital > 59, 9999, MOD((accident2016.hour_of_ems_arrival_at_hospital * 60 + accident2016.minute_of_ems_arrival_at_hospital) - (accident2016.hour_of_notification * 60 + accident2016.minute_of_notification) + 1440, 1440)) AS time_to_hospital,
  accident2016.number_of_drunk_drivers,
  accident2016.number_of_fatalities,
  IF(number_of_motor_vehicles_in_transport_mvit + number_of_parked_working_vehicles > 1 OR number_of_fatalities > 1, 1, 0) AS label
FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2016` AS accident2016
UNION ALL
SELECT
  CONCAT(accident2017.consecutive_number, accident2017.year_of_crash) AS id,
  accident2017.state_name,
  accident2017.number_of_motor_vehicles_in_transport_mvit,
  accident2017.number_of_parked_working_vehicles,
  accident2017.number_of_persons_in_motor_vehicles_in_transport_mvit,
  accident2017.number_of_persons_not_in_motor_vehicles_in_transport_mvit,
  accident2017.city_name,
  accident2017.day_name,
  accident2017.month_of_crash_name,
  accident2017.day_of_week_name,
  accident2017.hour_of_crash_name,
  accident2017.year_of_crash,
  accident2017.route_signing_name,
  accident2017.land_use_name,
  accident2017.first_harmful_event_name,
  accident2017.work_zone_name,
  accident2017.relation_to_trafficway_name,
  accident2017.light_condition_name,
  accident2017.atmospheric_conditions_name,
  IF(accident2017.hour_of_notification > 23 OR accident2017.minute_of_notification > 59 OR accident2017.hour_of_arrival_at_scene > 23 OR accident2017.minute_of_arrival_at_scene > 59, 9999, MOD((accident2017.hour_of_arrival_at_scene * 60 + accident2017.minute_of_arrival_at_scene) - (accident2017.hour_of_notification * 60 + accident2017.minute_of_notification) + 1440, 1440)) AS time_to_scene,
  IF(accident2017.hour_of_notification > 23 OR accident2017.minute_of_notification > 59 OR accident2017.hour_of_ems_arrival_at_hospital > 23 OR accident2017.minute_of_ems_arrival_at_hospital > 59, 9999, MOD((accident2017.hour_of_ems_arrival_at_hospital * 60 + accident2017.minute_of_ems_arrival_at_hospital) - (accident2017.hour_of_notification * 60 + accident2017.minute_of_notification) + 1440, 1440)) AS time_to_hospital,
  accident2017.number_of_drunk_drivers,
  accident2017.number_of_fatalities,
  IF(number_of_motor_vehicles_in_transport_mvit + number_of_parked_working_vehicles > 1 OR number_of_fatalities > 1, 1, 0) AS label
FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2017` AS accident2017
UNION ALL
SELECT
  CONCAT(accident2018.consecutive_number, accident2018.year_of_crash) AS id,
  accident2018.state_name,
  accident2018.number_of_motor_vehicles_in_transport_mvit,
  accident2018.number_of_parked_working_vehicles,
  accident2018.number_of_persons_in_motor_vehicles_in_transport_mvit,
  accident2018.number_of_persons_not_in_motor_vehicles_in_transport_mvit,
  accident2018.city_name,
  accident2018.day_name,
  accident2018.month_of_crash_name,
  accident2018.day_of_week_name,
  accident2018.hour_of_crash_name,
  accident2018.year_of_crash,
  accident2018.route_signing_name,
  accident2018.land_use_name,
  accident2018.first_harmful_event_name,
  accident2018.work_zone_name,
  accident2018.relation_to_trafficway_name,
  accident2018.light_condition_name,
  accident2018.atmospheric_conditions_name,
  IF(accident2018.hour_of_notification > 23 OR accident2018.minute_of_notification > 59 OR accident2018.hour_of_arrival_at_scene > 23 OR accident2018.minute_of_arrival_at_scene > 59, 9999, MOD((accident2018.hour_of_arrival_at_scene * 60 + accident2018.minute_of_arrival_at_scene) - (accident2018.hour_of_notification * 60 + accident2018.minute_of_notification) + 1440, 1440)) AS time_to_scene,
  IF(accident2018.hour_of_notification > 23 OR accident2018.minute_of_notification > 59 OR accident2018.hour_of_ems_arrival_at_hospital > 23 OR accident2018.minute_of_ems_arrival_at_hospital > 59, 9999, MOD((accident2018.hour_of_ems_arrival_at_hospital * 60 + accident2018.minute_of_ems_arrival_at_hospital) - (accident2018.hour_of_notification * 60 + accident2018.minute_of_notification) + 1440, 1440)) AS time_to_hospital,
  accident2018.number_of_drunk_drivers,
  accident2018.number_of_fatalities,
  IF(number_of_motor_vehicles_in_transport_mvit + number_of_parked_working_vehicles > 1 OR number_of_fatalities > 1, 1, 0) AS label
FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2018` AS accident2018
UNION ALL
SELECT
  CONCAT(accident2019.consecutive_number, accident2019.year_of_crash) AS id,
  accident2019.state_name,
  accident2019.number_of_motor_vehicles_in_transport_mvit,
  accident2019.number_of_parked_working_vehicles,
  accident2019.number_of_persons_in_motor_vehicles_in_transport_mvit,
  accident2019.number_of_persons_not_in_motor_vehicles_in_transport_mvit,
  accident2019.city_name,
  accident2019.day_name,
  accident2019.month_of_crash_name,
  accident2019.day_of_week_name,
  accident2019.hour_of_crash_name,
  accident2019.year_of_crash,
  accident2019.route_signing_name,
  accident2019.land_use_name,
  accident2019.first_harmful_event_name,
  accident2019.work_zone_name,
  accident2019.relation_to_trafficway_name,
  accident2019.light_condition_name,
  accident2019.atmospheric_conditions_name,
  IF(accident2019.hour_of_notification > 23 OR accident2019.minute_of_notification > 59 OR accident2019.hour_of_arrival_at_scene > 23 OR accident2019.minute_of_arrival_at_scene > 59, 9999, MOD((accident2019.hour_of_arrival_at_scene * 60 + accident2019.minute_of_arrival_at_scene) - (accident2019.hour_of_notification * 60 + accident2019.minute_of_notification) + 1440, 1440)) AS time_to_scene,
  IF(accident2019.hour_of_notification > 23 OR accident2019.minute_of_notification > 59 OR accident2019.hour_of_ems_arrival_at_hospital > 23 OR accident2019.minute_of_ems_arrival_at_hospital > 59, 9999, MOD((accident2019.hour_of_ems_arrival_at_hospital * 60 + accident2019.minute_of_ems_arrival_at_hospital) - (accident2019.hour_of_notification * 60 + accident2019.minute_of_notification) + 1440, 1440)) AS time_to_hospital,
  accident2019.number_of_drunk_drivers,
  accident2019.number_of_fatalities,
  IF(number_of_motor_vehicles_in_transport_mvit + number_of_parked_working_vehicles > 1 OR number_of_fatalities > 1, 1, 0) AS label
FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2019` AS accident2019
UNION ALL
SELECT
  CONCAT(accident2020.consecutive_number, accident2020.year_of_crash) AS id,
  accident2020.state_name,
  accident2020.number_of_motor_vehicles_in_transport_mvit,
  accident2020.number_of_parked_working_vehicles,
  accident2020.number_of_persons_in_motor_vehicles_in_transport_mvit,
  accident2020.number_of_persons_not_in_motor_vehicles_in_transport_mvit,
  accident2020.city_name,
  accident2020.day_name,
  accident2020.month_of_crash_name,
  accident2020.day_of_week_name,
  accident2020.hour_of_crash_name,
  accident2020.year_of_crash,
  accident2020.route_signing_name,
  accident2020.land_use_name,
  accident2020.first_harmful_event_name,
  accident2020.work_zone_name,
  accident2020.relation_to_trafficway_name,
  accident2020.light_condition_name,
  -- The 2020 table names this column differently, so alias it to match the other years
  accident2020.atmospheric_conditions_1_name AS atmospheric_conditions_name,
  IF(accident2020.hour_of_notification > 23 OR accident2020.minute_of_notification > 59 OR accident2020.hour_of_arrival_at_scene > 23 OR accident2020.minute_of_arrival_at_scene > 59, 9999, MOD((accident2020.hour_of_arrival_at_scene * 60 + accident2020.minute_of_arrival_at_scene) - (accident2020.hour_of_notification * 60 + accident2020.minute_of_notification) + 1440, 1440)) AS time_to_scene,
  IF(accident2020.hour_of_notification > 23 OR accident2020.minute_of_notification > 59 OR accident2020.hour_of_ems_arrival_at_hospital > 23 OR accident2020.minute_of_ems_arrival_at_hospital > 59, 9999, MOD((accident2020.hour_of_ems_arrival_at_hospital * 60 + accident2020.minute_of_ems_arrival_at_hospital) - (accident2020.hour_of_notification * 60 + accident2020.minute_of_notification) + 1440, 1440)) AS time_to_hospital,
  accident2020.number_of_drunk_drivers,
  accident2020.number_of_fatalities,
  IF(number_of_motor_vehicles_in_transport_mvit + number_of_parked_working_vehicles > 1 OR number_of_fatalities > 1, 1, 0) AS label
FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2020` AS accident2020

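For readers who want to sanity-check the response-time columns outside BigQuery, here is a minimal Python sketch of an elapsed-minutes calculation. It assumes the intent is wall-clock minutes from notification to arrival, that the two times are less than 24 hours apart, and that hour values above 23 or minute values above 59 mean "unknown". The function name and constants are illustrative, not part of the project code.

```python
UNKNOWN = 9999          # sentinel used in the table for unknown times
MINUTES_PER_DAY = 24 * 60


def minutes_between(start_hour, start_minute, end_hour, end_minute):
    """Elapsed minutes from start to end, wrapping past midnight."""
    if start_hour > 23 or end_hour > 23 or start_minute > 59 or end_minute > 59:
        return UNKNOWN
    start = start_hour * 60 + start_minute   # minutes since midnight
    end = end_hour * 60 + end_minute
    # Python's % returns a non-negative result for a positive modulus,
    # so an arrival after midnight wraps around correctly.
    return (end - start) % MINUTES_PER_DAY


print(minutes_between(10, 50, 11, 10))  # crosses an hour boundary
print(minutes_between(23, 50, 0, 5))    # crosses midnight
print(minutes_between(99, 99, 11, 10))  # unknown notification time
```

Working in total minutes since midnight avoids handling the hour and minute parts separately, which is easy to get wrong when the minute value rolls over.
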
## Number of Accidents

In [None]:
%%bigquery --project $project_id
SELECT COUNT(*)
FROM `traffic_fatalities.traffic_features`

## Crashes by Day of week

In [None]:
query_days_not_severe  = "SELECT COUNT(*) as num_crashes, day_of_week_name FROM `traffic_fatalities.traffic_features` WHERE label = 0 GROUP BY day_of_week_name ORDER BY CASE WHEN day_of_week_name = 'Monday' THEN 0 WHEN day_of_week_name = 'Tuesday' THEN 1 WHEN day_of_week_name = 'Wednesday' THEN 2 WHEN day_of_week_name = 'Thursday' THEN 3 WHEN day_of_week_name = 'Friday' THEN 4 WHEN day_of_week_name = 'Saturday' THEN 5 WHEN day_of_week_name = 'Sunday' THEN 6 END"

days_not_severe = pd.read_gbq(query_days_not_severe, project_id=project_id, dialect='standard')

plt.bar(days_not_severe["day_of_week_name"], days_not_severe["num_crashes"])

In [None]:
query_days_severe  = "SELECT COUNT(*) as num_crashes, day_of_week_name FROM `traffic_fatalities.traffic_features` WHERE label = 1 GROUP BY day_of_week_name ORDER BY CASE WHEN day_of_week_name = 'Monday' THEN 0 WHEN day_of_week_name = 'Tuesday' THEN 1 WHEN day_of_week_name = 'Wednesday' THEN 2 WHEN day_of_week_name = 'Thursday' THEN 3 WHEN day_of_week_name = 'Friday' THEN 4 WHEN day_of_week_name = 'Saturday' THEN 5 WHEN day_of_week_name = 'Sunday' THEN 6 END"

days_severe = pd.read_gbq(query_days_severe, project_id=project_id, dialect='standard')

plt.bar(days_severe["day_of_week_name"], days_severe["num_crashes"])

In [None]:
query_days  = "SELECT COUNT(*) as num_crashes, day_of_week_name FROM `traffic_fatalities.traffic_features` GROUP BY day_of_week_name ORDER BY CASE WHEN day_of_week_name = 'Monday' THEN 0 WHEN day_of_week_name = 'Tuesday' THEN 1 WHEN day_of_week_name = 'Wednesday' THEN 2 WHEN day_of_week_name = 'Thursday' THEN 3 WHEN day_of_week_name = 'Friday' THEN 4 WHEN day_of_week_name = 'Saturday' THEN 5 WHEN day_of_week_name = 'Sunday' THEN 6 END"

days = pd.read_gbq(query_days, project_id=project_id, dialect='standard')

plt.bar(days["day_of_week_name"], days["num_crashes"])

In [None]:
query_days_percent_severe  = "SELECT SUM(label)/COUNT(*) as percent_severe, day_of_week_name FROM `traffic_fatalities.traffic_features` GROUP BY day_of_week_name ORDER BY CASE WHEN day_of_week_name = 'Monday' THEN 0 WHEN day_of_week_name = 'Tuesday' THEN 1 WHEN day_of_week_name = 'Wednesday' THEN 2 WHEN day_of_week_name = 'Thursday' THEN 3 WHEN day_of_week_name = 'Friday' THEN 4 WHEN day_of_week_name = 'Saturday' THEN 5 WHEN day_of_week_name = 'Sunday' THEN 6 END"

days_percent_severe = pd.read_gbq(query_days_percent_severe, project_id=project_id, dialect='standard')

plt.bar(days_percent_severe["day_of_week_name"], days_percent_severe["percent_severe"])

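The share-of-severe-crashes calculation and the weekday ordering used in the queries above can also be sketched in plain Python. The sample rows are made up, and the names `WEEKDAYS`, `share_severe`, and `ordered_days` are illustrative only.

```python
from collections import defaultdict

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Hypothetical (day_of_week_name, label) pairs standing in for rows of traffic_features.
rows = [("Sunday", 1), ("Monday", 0), ("Monday", 1), ("Friday", 0), ("Sunday", 1)]

totals = defaultdict(int)
severe = defaultdict(int)
for day, label in rows:
    totals[day] += 1
    severe[day] += label

# Share of severe crashes per day, i.e. SUM(label) / COUNT(*) within each group.
share_severe = {day: severe[day] / totals[day] for day in totals}

# Sort the days that are present by calendar position rather than alphabetically.
ordered_days = sorted(share_severe, key=WEEKDAYS.index)
print([(day, share_severe[day]) for day in ordered_days])
```

Keeping the day order in a single list is one way to avoid repeating a long CASE expression in every query.
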
## Number of Crashes by Hour

In [None]:
# Not Severe
query_hours_not_severe  = "SELECT COUNT(*) as num_crashes, hour_of_crash_name FROM `traffic_fatalities.traffic_features` WHERE label = 0 GROUP BY hour_of_crash_name ORDER BY CASE WHEN hour_of_crash_name = '0:00am-0:59am' THEN 0 WHEN hour_of_crash_name = '1:00am-1:59am' THEN 1 WHEN hour_of_crash_name = '2:00am-2:59am' THEN 2 WHEN hour_of_crash_name = '3:00am-3:59am' THEN 3 WHEN hour_of_crash_name = '4:00am-4:59am' THEN 4 WHEN hour_of_crash_name = '5:00am-5:59am' THEN 5 WHEN hour_of_crash_name = '6:00am-6:59am' THEN 6 WHEN hour_of_crash_name = '7:00am-7:59am' THEN 7 WHEN hour_of_crash_name = '8:00am-8:59am' THEN 8 WHEN hour_of_crash_name = '9:00am-9:59am' THEN 9 WHEN hour_of_crash_name = '10:00am-10:59am' THEN 10 WHEN hour_of_crash_name = '11:00am-11:59am' THEN 11 WHEN hour_of_crash_name = '12:00pm-12:59pm' THEN 12 WHEN hour_of_crash_name = '1:00pm-1:59pm' THEN 13 WHEN hour_of_crash_name = '2:00pm-2:59pm' THEN 14 WHEN hour_of_crash_name = '3:00pm-3:59pm' THEN 15 WHEN hour_of_crash_name = '4:00pm-4:59pm' THEN 16 WHEN hour_of_crash_name = '5:00pm-5:59pm' THEN 17 WHEN hour_of_crash_name = '6:00pm-6:59pm' THEN 18 WHEN hour_of_crash_name = '7:00pm-7:59pm' THEN 19 WHEN hour_of_crash_name = '8:00pm-8:59pm' THEN 20 WHEN hour_of_crash_name = '9:00pm-9:59pm' THEN 21 WHEN hour_of_crash_name = '10:00pm-10:59pm' THEN 22 WHEN hour_of_crash_name = '11:00pm-11:59pm' THEN 23 END"

hours_not_severe = pd.read_gbq(query_hours_not_severe, project_id=project_id, dialect='standard')

x = [i for i in range(24)]
plt.bar(x, hours_not_severe["num_crashes"])
hours_not_severe

In [None]:
# Severe
query_hours_severe  = "SELECT COUNT(*) as num_crashes, hour_of_crash_name FROM `traffic_fatalities.traffic_features` WHERE label = 1 GROUP BY hour_of_crash_name ORDER BY CASE WHEN hour_of_crash_name = '0:00am-0:59am' THEN 0 WHEN hour_of_crash_name = '1:00am-1:59am' THEN 1 WHEN hour_of_crash_name = '2:00am-2:59am' THEN 2 WHEN hour_of_crash_name = '3:00am-3:59am' THEN 3 WHEN hour_of_crash_name = '4:00am-4:59am' THEN 4 WHEN hour_of_crash_name = '5:00am-5:59am' THEN 5 WHEN hour_of_crash_name = '6:00am-6:59am' THEN 6 WHEN hour_of_crash_name = '7:00am-7:59am' THEN 7 WHEN hour_of_crash_name = '8:00am-8:59am' THEN 8 WHEN hour_of_crash_name = '9:00am-9:59am' THEN 9 WHEN hour_of_crash_name = '10:00am-10:59am' THEN 10 WHEN hour_of_crash_name = '11:00am-11:59am' THEN 11 WHEN hour_of_crash_name = '12:00pm-12:59pm' THEN 12 WHEN hour_of_crash_name = '1:00pm-1:59pm' THEN 13 WHEN hour_of_crash_name = '2:00pm-2:59pm' THEN 14 WHEN hour_of_crash_name = '3:00pm-3:59pm' THEN 15 WHEN hour_of_crash_name = '4:00pm-4:59pm' THEN 16 WHEN hour_of_crash_name = '5:00pm-5:59pm' THEN 17 WHEN hour_of_crash_name = '6:00pm-6:59pm' THEN 18 WHEN hour_of_crash_name = '7:00pm-7:59pm' THEN 19 WHEN hour_of_crash_name = '8:00pm-8:59pm' THEN 20 WHEN hour_of_crash_name = '9:00pm-9:59pm' THEN 21 WHEN hour_of_crash_name = '10:00pm-10:59pm' THEN 22 WHEN hour_of_crash_name = '11:00pm-11:59pm' THEN 23 END"

hours_severe = pd.read_gbq(query_hours_severe, project_id=project_id, dialect='standard')

x = [i for i in range(24)]
plt.bar(x, hours_severe["num_crashes"])

In [None]:
# All Severities
query_hours  = "SELECT COUNT(*) as num_crashes, hour_of_crash_name FROM `traffic_fatalities.traffic_features` GROUP BY hour_of_crash_name ORDER BY CASE WHEN hour_of_crash_name = '0:00am-0:59am' THEN 0 WHEN hour_of_crash_name = '1:00am-1:59am' THEN 1 WHEN hour_of_crash_name = '2:00am-2:59am' THEN 2 WHEN hour_of_crash_name = '3:00am-3:59am' THEN 3 WHEN hour_of_crash_name = '4:00am-4:59am' THEN 4 WHEN hour_of_crash_name = '5:00am-5:59am' THEN 5 WHEN hour_of_crash_name = '6:00am-6:59am' THEN 6 WHEN hour_of_crash_name = '7:00am-7:59am' THEN 7 WHEN hour_of_crash_name = '8:00am-8:59am' THEN 8 WHEN hour_of_crash_name = '9:00am-9:59am' THEN 9 WHEN hour_of_crash_name = '10:00am-10:59am' THEN 10 WHEN hour_of_crash_name = '11:00am-11:59am' THEN 11 WHEN hour_of_crash_name = '12:00pm-12:59pm' THEN 12 WHEN hour_of_crash_name = '1:00pm-1:59pm' THEN 13 WHEN hour_of_crash_name = '2:00pm-2:59pm' THEN 14 WHEN hour_of_crash_name = '3:00pm-3:59pm' THEN 15 WHEN hour_of_crash_name = '4:00pm-4:59pm' THEN 16 WHEN hour_of_crash_name = '5:00pm-5:59pm' THEN 17 WHEN hour_of_crash_name = '6:00pm-6:59pm' THEN 18 WHEN hour_of_crash_name = '7:00pm-7:59pm' THEN 19 WHEN hour_of_crash_name = '8:00pm-8:59pm' THEN 20 WHEN hour_of_crash_name = '9:00pm-9:59pm' THEN 21 WHEN hour_of_crash_name = '10:00pm-10:59pm' THEN 22 WHEN hour_of_crash_name = '11:00pm-11:59pm' THEN 23 END"

hours = pd.read_gbq(query_hours, project_id=project_id, dialect='standard')

x = [i for i in range(24)]
plt.bar(x, hours["num_crashes"])
hours

In [None]:
query_hours_percent_severe  = "SELECT SUM(label)/COUNT(*) as percent_severe, hour_of_crash_name FROM `traffic_fatalities.traffic_features` GROUP BY hour_of_crash_name ORDER BY CASE WHEN hour_of_crash_name = '0:00am-0:59am' THEN 0 WHEN hour_of_crash_name = '1:00am-1:59am' THEN 1 WHEN hour_of_crash_name = '2:00am-2:59am' THEN 2 WHEN hour_of_crash_name = '3:00am-3:59am' THEN 3 WHEN hour_of_crash_name = '4:00am-4:59am' THEN 4 WHEN hour_of_crash_name = '5:00am-5:59am' THEN 5 WHEN hour_of_crash_name = '6:00am-6:59am' THEN 6 WHEN hour_of_crash_name = '7:00am-7:59am' THEN 7 WHEN hour_of_crash_name = '8:00am-8:59am' THEN 8 WHEN hour_of_crash_name = '9:00am-9:59am' THEN 9 WHEN hour_of_crash_name = '10:00am-10:59am' THEN 10 WHEN hour_of_crash_name = '11:00am-11:59am' THEN 11 WHEN hour_of_crash_name = '12:00pm-12:59pm' THEN 12 WHEN hour_of_crash_name = '1:00pm-1:59pm' THEN 13 WHEN hour_of_crash_name = '2:00pm-2:59pm' THEN 14 WHEN hour_of_crash_name = '3:00pm-3:59pm' THEN 15 WHEN hour_of_crash_name = '4:00pm-4:59pm' THEN 16 WHEN hour_of_crash_name = '5:00pm-5:59pm' THEN 17 WHEN hour_of_crash_name = '6:00pm-6:59pm' THEN 18 WHEN hour_of_crash_name = '7:00pm-7:59pm' THEN 19 WHEN hour_of_crash_name = '8:00pm-8:59pm' THEN 20 WHEN hour_of_crash_name = '9:00pm-9:59pm' THEN 21 WHEN hour_of_crash_name = '10:00pm-10:59pm' THEN 22 WHEN hour_of_crash_name = '11:00pm-11:59pm' THEN 23 END"

hours_percent_severe = pd.read_gbq(query_hours_percent_severe, project_id=project_id, dialect='standard')

x = [i for i in range(24)]
plt.bar(x, hours_percent_severe["percent_severe"])
hours_percent_severe

## Crashes where Driver is drunk by Hour

In [None]:
# Percent of drunk drivers in fatal accidents by the hour with SEVERITY = 0
query_percent_drunk  = "SELECT SUM(number_of_drunk_drivers)/SUM(number_of_motor_vehicles_in_transport_mvit) * 100 as percent_drunk, COUNT(id) AS num_crashes, hour_of_crash_name FROM `traffic_fatalities.traffic_features` WHERE label = 0 GROUP BY hour_of_crash_name ORDER BY CASE WHEN hour_of_crash_name = '0:00am-0:59am' THEN 0 WHEN hour_of_crash_name = '1:00am-1:59am' THEN 1 WHEN hour_of_crash_name = '2:00am-2:59am' THEN 2 WHEN hour_of_crash_name = '3:00am-3:59am' THEN 3 WHEN hour_of_crash_name = '4:00am-4:59am' THEN 4 WHEN hour_of_crash_name = '5:00am-5:59am' THEN 5 WHEN hour_of_crash_name = '6:00am-6:59am' THEN 6 WHEN hour_of_crash_name = '7:00am-7:59am' THEN 7 WHEN hour_of_crash_name = '8:00am-8:59am' THEN 8 WHEN hour_of_crash_name = '9:00am-9:59am' THEN 9 WHEN hour_of_crash_name = '10:00am-10:59am' THEN 10 WHEN hour_of_crash_name = '11:00am-11:59am' THEN 11 WHEN hour_of_crash_name = '12:00pm-12:59pm' THEN 12 WHEN hour_of_crash_name = '1:00pm-1:59pm' THEN 13 WHEN hour_of_crash_name = '2:00pm-2:59pm' THEN 14 WHEN hour_of_crash_name = '3:00pm-3:59pm' THEN 15 WHEN hour_of_crash_name = '4:00pm-4:59pm' THEN 16 WHEN hour_of_crash_name = '5:00pm-5:59pm' THEN 17 WHEN hour_of_crash_name = '6:00pm-6:59pm' THEN 18 WHEN hour_of_crash_name = '7:00pm-7:59pm' THEN 19 WHEN hour_of_crash_name = '8:00pm-8:59pm' THEN 20 WHEN hour_of_crash_name = '9:00pm-9:59pm' THEN 21 WHEN hour_of_crash_name = '10:00pm-10:59pm' THEN 22 WHEN hour_of_crash_name = '11:00pm-11:59pm' THEN 23 END"

percent_drunk = pd.read_gbq(query_percent_drunk, project_id=project_id, dialect='standard')

x = [i for i in range(24)]
plt.bar(x, percent_drunk["percent_drunk"])
percent_drunk

In [None]:
# Percent of drunk drivers in fatal accidents by the hour with SEVERITY = 1
query_percent_drunk  = "SELECT SUM(number_of_drunk_drivers)/SUM(number_of_motor_vehicles_in_transport_mvit) * 100 as percent_drunk, COUNT(id) AS num_crashes, hour_of_crash_name FROM `traffic_fatalities.traffic_features` WHERE label = 1 GROUP BY hour_of_crash_name ORDER BY CASE WHEN hour_of_crash_name = '0:00am-0:59am' THEN 0 WHEN hour_of_crash_name = '1:00am-1:59am' THEN 1 WHEN hour_of_crash_name = '2:00am-2:59am' THEN 2 WHEN hour_of_crash_name = '3:00am-3:59am' THEN 3 WHEN hour_of_crash_name = '4:00am-4:59am' THEN 4 WHEN hour_of_crash_name = '5:00am-5:59am' THEN 5 WHEN hour_of_crash_name = '6:00am-6:59am' THEN 6 WHEN hour_of_crash_name = '7:00am-7:59am' THEN 7 WHEN hour_of_crash_name = '8:00am-8:59am' THEN 8 WHEN hour_of_crash_name = '9:00am-9:59am' THEN 9 WHEN hour_of_crash_name = '10:00am-10:59am' THEN 10 WHEN hour_of_crash_name = '11:00am-11:59am' THEN 11 WHEN hour_of_crash_name = '12:00pm-12:59pm' THEN 12 WHEN hour_of_crash_name = '1:00pm-1:59pm' THEN 13 WHEN hour_of_crash_name = '2:00pm-2:59pm' THEN 14 WHEN hour_of_crash_name = '3:00pm-3:59pm' THEN 15 WHEN hour_of_crash_name = '4:00pm-4:59pm' THEN 16 WHEN hour_of_crash_name = '5:00pm-5:59pm' THEN 17 WHEN hour_of_crash_name = '6:00pm-6:59pm' THEN 18 WHEN hour_of_crash_name = '7:00pm-7:59pm' THEN 19 WHEN hour_of_crash_name = '8:00pm-8:59pm' THEN 20 WHEN hour_of_crash_name = '9:00pm-9:59pm' THEN 21 WHEN hour_of_crash_name = '10:00pm-10:59pm' THEN 22 WHEN hour_of_crash_name = '11:00pm-11:59pm' THEN 23 END"

percent_drunk = pd.read_gbq(query_percent_drunk, project_id=project_id, dialect='standard')

x = [i for i in range(24)]
plt.bar(x, percent_drunk["percent_drunk"])
percent_drunk

In [None]:
# Percent of drunk drivers in fatal accidents by the hour
query_percent_drunk  = "SELECT SUM(number_of_drunk_drivers)/SUM(number_of_motor_vehicles_in_transport_mvit) * 100 as percent_drunk, COUNT(id) AS num_crashes, hour_of_crash_name FROM `traffic_fatalities.traffic_features` GROUP BY hour_of_crash_name ORDER BY CASE WHEN hour_of_crash_name = '0:00am-0:59am' THEN 0 WHEN hour_of_crash_name = '1:00am-1:59am' THEN 1 WHEN hour_of_crash_name = '2:00am-2:59am' THEN 2 WHEN hour_of_crash_name = '3:00am-3:59am' THEN 3 WHEN hour_of_crash_name = '4:00am-4:59am' THEN 4 WHEN hour_of_crash_name = '5:00am-5:59am' THEN 5 WHEN hour_of_crash_name = '6:00am-6:59am' THEN 6 WHEN hour_of_crash_name = '7:00am-7:59am' THEN 7 WHEN hour_of_crash_name = '8:00am-8:59am' THEN 8 WHEN hour_of_crash_name = '9:00am-9:59am' THEN 9 WHEN hour_of_crash_name = '10:00am-10:59am' THEN 10 WHEN hour_of_crash_name = '11:00am-11:59am' THEN 11 WHEN hour_of_crash_name = '12:00pm-12:59pm' THEN 12 WHEN hour_of_crash_name = '1:00pm-1:59pm' THEN 13 WHEN hour_of_crash_name = '2:00pm-2:59pm' THEN 14 WHEN hour_of_crash_name = '3:00pm-3:59pm' THEN 15 WHEN hour_of_crash_name = '4:00pm-4:59pm' THEN 16 WHEN hour_of_crash_name = '5:00pm-5:59pm' THEN 17 WHEN hour_of_crash_name = '6:00pm-6:59pm' THEN 18 WHEN hour_of_crash_name = '7:00pm-7:59pm' THEN 19 WHEN hour_of_crash_name = '8:00pm-8:59pm' THEN 20 WHEN hour_of_crash_name = '9:00pm-9:59pm' THEN 21 WHEN hour_of_crash_name = '10:00pm-10:59pm' THEN 22 WHEN hour_of_crash_name = '11:00pm-11:59pm' THEN 23 END"

percent_drunk = pd.read_gbq(query_percent_drunk, project_id=project_id, dialect='standard')

x = [i for i in range(24)]
plt.bar(x, percent_drunk["percent_drunk"])
percent_drunk
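
The hour-of-day cells above each spell out the same 24-branch `ORDER BY CASE` by hand. One way to avoid the repetition, sketched here, is to generate the expression in Python. The helper names `hour_label` and `hour_order_case` are hypothetical, and the label format is assumed to match the strings used in the queries above.

```python
def hour_label(hour):
    """Return the hour_of_crash_name string for an hour 0-23, e.g. 13 -> '1:00pm-1:59pm'."""
    suffix = "am" if hour < 12 else "pm"
    # The queries above write midnight as 0 and noon as 12; later hours wrap to 1-11.
    display = hour if hour <= 12 else hour - 12
    return f"{display}:00{suffix}-{display}:59{suffix}"


def hour_order_case(column="hour_of_crash_name"):
    """Build the CASE expression that maps each hour label to its 0-23 sort key."""
    branches = " ".join(
        f"WHEN {column} = '{hour_label(h)}' THEN {h}" for h in range(24)
    )
    return f"CASE {branches} END"


# Same shape as query_hours above, with the ordering clause generated.
query_hours = (
    "SELECT COUNT(*) AS num_crashes, hour_of_crash_name "
    "FROM `traffic_fatalities.traffic_features` "
    "GROUP BY hour_of_crash_name "
    f"ORDER BY {hour_order_case()}"
)
```

The other hourly queries could reuse `hour_order_case()` the same way, changing only the SELECT list and the WHERE clause.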

## Atmospheric Conditions contributing to Crashes Ranked

In [None]:
# Number of crashes and share of severe crashes for each light condition
query_atmos_percent_severe = "SELECT count(id) as num_crashes, SUM(label)/COUNT(*) as percent_severe, light_condition_name FROM `traffic_fatalities.traffic_features` GROUP BY light_condition_name ORDER BY num_crashes DESC"
atmos_percent_severe = pd.read_gbq(query_atmos_percent_severe, project_id=project_id, dialect='standard')
atmos_percent_severe

In [None]:
# What percent severe for atmospheric conditions
query_atmos_percent_severe = "SELECT count(id) as num_crashes, SUM(label)/COUNT(*) as percent_severe, atmospheric_conditions_name FROM `traffic_fatalities.traffic_features` GROUP BY atmospheric_conditions_name ORDER BY num_crashes DESC"
atmos_percent_severe = pd.read_gbq(query_atmos_percent_severe, project_id=project_id, dialect='standard')
atmos_percent_severe

In [None]:
query_atmos_percent_severe = "SELECT count(id) as num_crashes, SUM(label)/COUNT(*) as percent_severe, atmospheric_conditions_name, light_condition_name FROM `traffic_fatalities.traffic_features` GROUP BY atmospheric_conditions_name, light_condition_name ORDER BY num_crashes DESC LIMIT 30"
atmos_percent_severe = pd.read_gbq(query_atmos_percent_severe, project_id=project_id, dialect='standard')
atmos_percent_severe

In [None]:
# For each state, the atmospheric condition with the most severe crashes
query_atmos_severe_state = "WITH atmos_severe_state AS ( SELECT count(*) as num_crashes, SUM(label) as num_severe, state_name, atmospheric_conditions_name, RANK() OVER (PARTITION BY state_name ORDER BY SUM(label) DESC) as rank FROM `traffic_fatalities.traffic_features` GROUP BY atmospheric_conditions_name, state_name ) SELECT num_crashes, num_severe, atmospheric_conditions_name, state_name FROM atmos_severe_state WHERE rank = 1"
atmos_severe_state = pd.read_gbq(query_atmos_severe_state, project_id=project_id, dialect='standard')
atmos_severe_state

In [None]:
# For each state, the atmospheric condition with the highest share of severe crashes
query_atmos_percent_severe_state = "WITH atmos_percent_severe_state AS ( SELECT count(*) as num_crashes, AVG(label) as percent_severe, state_name, atmospheric_conditions_name, RANK() OVER (PARTITION BY state_name ORDER BY AVG(label) DESC) as rank FROM `traffic_fatalities.traffic_features` GROUP BY atmospheric_conditions_name, state_name ) SELECT num_crashes, percent_severe, atmospheric_conditions_name, state_name, rank FROM atmos_percent_severe_state WHERE rank = 1"
atmos_percent_severe_state = pd.read_gbq(query_atmos_percent_severe_state, project_id=project_id, dialect='standard')
atmos_percent_severe_state
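
The two cells above use `RANK() OVER (PARTITION BY state_name ...)` to keep the top-ranked atmospheric condition per state. The same idea can be sketched in pandas on an already-aggregated DataFrame. The function name `top_condition_per_state` and the small frame are made up for illustration and are not notebook output.

```python
import pandas as pd


def top_condition_per_state(df, value_col):
    """Keep, for each state, the row(s) whose value_col ranks first (ties kept, like RANK())."""
    ranks = df.groupby("state_name")[value_col].rank(method="min", ascending=False)
    return df[ranks == 1].reset_index(drop=True)


# Made-up aggregated rows: one row per (state, atmospheric condition).
toy = pd.DataFrame({
    "state_name": ["A", "A", "B", "B"],
    "atmospheric_conditions_name": ["Clear", "Rain", "Clear", "Snow"],
    "num_severe": [10, 3, 4, 7],
})
top = top_condition_per_state(toy, "num_severe")
```

Passing a share column instead of a count column would mirror the second query, which ranks by `AVG(label)`.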

## Nature of Crash

In [None]:
#Relation to first harmful event
query  = "SELECT COUNT(id) as num_crashes, SUM(label)/COUNT(*) as percent_severe, first_harmful_event_name FROM `traffic_fatalities.traffic_features` GROUP BY first_harmful_event_name ORDER BY num_crashes DESC LIMIT 20"
atmos = pd.read_gbq(query, project_id=project_id, dialect='standard')
atmos

In [None]:
#Relation to trafficway name
query  = "SELECT COUNT(id) as num_crashes, SUM(label)/COUNT(*) as percent_severe, relation_to_trafficway_name FROM `traffic_fatalities.traffic_features` GROUP BY relation_to_trafficway_name ORDER BY num_crashes DESC"
atmos = pd.read_gbq(query, project_id=project_id, dialect='standard')
atmos

In [None]:
#Relation to trafficway name & first harmful event
query  = "SELECT COUNT(id) as num_crashes, SUM(label)/COUNT(*) as percent_severe, relation_to_trafficway_name, first_harmful_event_name FROM `traffic_fatalities.traffic_features` GROUP BY relation_to_trafficway_name, first_harmful_event_name ORDER BY num_crashes DESC LIMIT 20"
atmos = pd.read_gbq(query, project_id=project_id, dialect='standard')
atmos

In [None]:
#Which combinations of conditions lead to high percentage of severe crashes 
query  = "SELECT COUNT(id) as num_crashes, SUM(label)/COUNT(*) as percent_severe, number_of_drunk_drivers, relation_to_trafficway_name, first_harmful_event_name FROM `traffic_fatalities.traffic_features` WHERE number_of_drunk_drivers > 0 GROUP BY relation_to_trafficway_name, first_harmful_event_name, number_of_drunk_drivers ORDER BY percent_severe DESC LIMIT 80"
atmos = pd.read_gbq(query, project_id=project_id, dialect='standard')
atmos

Crashes where first_harmful_event_name = Motor Vehicle In-Transport are always severe because a Motor Vehicle In-Transport event always involves more than one vehicle.
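
This claim can be checked against the data rather than taken on trust. A minimal sketch, assuming the feature table has been loaded into a DataFrame with the column names used in the queries above; the function name `event_always_multi_vehicle` and the toy rows are hypothetical.

```python
import pandas as pd


def event_always_multi_vehicle(df, event_name):
    """True if every crash with this first harmful event involves more than one vehicle."""
    rows = df[df["first_harmful_event_name"] == event_name]
    # Note: an event with no matching rows also returns True (vacuously).
    return bool((rows["number_of_motor_vehicles_in_transport_mvit"] > 1).all())


# Made-up rows for illustration only.
toy = pd.DataFrame({
    "first_harmful_event_name": [
        "Motor Vehicle In-Transport",
        "Motor Vehicle In-Transport",
        "Rollover/Overturn",
    ],
    "number_of_motor_vehicles_in_transport_mvit": [2, 3, 1],
})
multi = event_always_multi_vehicle(toy, "Motor Vehicle In-Transport")
single = event_always_multi_vehicle(toy, "Rollover/Overturn")
```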

In [None]:
%%bigquery --project $project_id

SELECT SUM(label)/COUNT(*) AS percent_severe, SUM(label) AS num_severe, atmospheric_conditions_name
FROM (SELECT DISTINCT id, label, atmospheric_conditions_name FROM `traffic_fatalities.traffic_features`)
GROUP BY atmospheric_conditions_name
ORDER BY percent_severe


## Severity Ranked by Average Time to Scene

In [None]:
%%bigquery --project $project_id

# Average time to scene for each severity label, excluding rows where time_to_scene = 9999
SELECT label, AVG(time_to_scene) AS avg_time_to_scene FROM `traffic_fatalities.traffic_features` WHERE time_to_scene != 9999 GROUP BY label
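
The `time_to_scene != 9999` filter matters because a single 9999 row can dominate an average. A small pandas sketch of the same filter, assuming 9999 marks an unknown response time (as the filter implies); the constant `UNKNOWN_TIME`, the function name, and the toy rows are illustrative only.

```python
import pandas as pd

UNKNOWN_TIME = 9999  # assumption: sentinel for a missing time, as implied by the filter above


def avg_time_by_label(df):
    """Average time_to_scene per label, ignoring rows that hold the sentinel value."""
    known = df[df["time_to_scene"] != UNKNOWN_TIME]
    return known.groupby("label")["time_to_scene"].mean()


# Made-up rows: one label-0 crash has an unknown time.
toy = pd.DataFrame({"label": [0, 0, 1, 1], "time_to_scene": [10, 9999, 20, 30]})
result = avg_time_by_label(toy)
```

Without the filter, the label-0 average in this toy frame would be (10 + 9999) / 2 = 5004.5 instead of 10.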

## Number of Fatalities Ranked by Average Time to Scene

In [None]:
%%bigquery --project $project_id

SELECT number_of_fatalities, COUNT(number_of_fatalities) AS count, AVG(time_to_scene) AS avg_time_to_scene
FROM `traffic_fatalities.traffic_features` 
WHERE time_to_scene != 9999
GROUP BY number_of_fatalities
ORDER BY number_of_fatalities

## Number of Fatalities per State Ranked by Average Time to Scene

In [None]:
%%bigquery --project $project_id

SELECT state_name, SUM(number_of_fatalities) AS total_fatalities, COUNT(state_name) AS num_crashes, AVG(time_to_scene) AS avg_time_to_scene
FROM (SELECT DISTINCT id, state_name, number_of_fatalities, time_to_scene FROM `traffic_fatalities.traffic_features` WHERE time_to_scene != 9999)
GROUP BY state_name
ORDER BY avg_time_to_scene

In [None]:
%%bigquery --project $project_id

SELECT state_name, SUM(number_of_fatalities)/(SUM(number_of_persons_in_motor_vehicles_in_transport_mvit) + SUM(number_of_persons_not_in_motor_vehicles_in_transport_mvit)) AS percent_fatalities, COUNT(*) AS num_crashes, AVG(time_to_scene) AS avg_time_to_scene
FROM `traffic_fatalities.traffic_features` 
WHERE time_to_scene != 9999
GROUP BY state_name
ORDER BY avg_time_to_scene

## Number of Fatalities per State Ranked by Average Time to Hospital

In [None]:
%%bigquery --project $project_id

SELECT state_name, SUM(number_of_fatalities) AS total_fatalities, COUNT(state_name) AS num_crashes, AVG(time_to_hospital) AS avg_time_to_hospital
FROM (SELECT DISTINCT id, state_name, number_of_fatalities, time_to_hospital FROM `traffic_fatalities.traffic_features` WHERE time_to_hospital != 9999)
GROUP BY state_name
ORDER BY avg_time_to_hospital
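
The state-level cells above select `DISTINCT id, ...` before aggregating, so a crash that appears in several rows is counted once. A pandas sketch of that step, with a hypothetical function name and made-up rows:

```python
import pandas as pd


def fatalities_by_state(df):
    """Count each crash once before summing fatalities per state."""
    # Mirrors SELECT DISTINCT id, state_name, number_of_fatalities.
    crashes = df.drop_duplicates(subset=["id", "state_name", "number_of_fatalities"])
    return crashes.groupby("state_name")["number_of_fatalities"].sum()


# Made-up rows: crash 1 appears twice.
toy = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "state_name": ["A", "A", "A", "B"],
    "number_of_fatalities": [2, 2, 1, 1],
})
totals = fatalities_by_state(toy)
```

Summing without the de-duplication would give state A a total of 5 rather than 3 in this toy frame, which is the kind of double counting the `DISTINCT` subquery guards against.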

## Geographic Analysis

In [None]:
%%bigquery --project $project_id
# Crash frequency across states, excluding rows where time_to_hospital = 9999
SELECT state_name, COUNT(*) AS num_crashes, AVG(time_to_hospital) AS avg_time_to_hospital
FROM (SELECT DISTINCT id, state_name, number_of_fatalities, time_to_hospital FROM `traffic_fatalities.traffic_features` WHERE time_to_hospital != 9999)
GROUP BY state_name
ORDER BY avg_time_to_hospital

In [None]:
query  = "SELECT COUNT(DISTINCT(id)) as num_crashes, city_name FROM `traffic_fatalities.traffic_features` WHERE city_name != 'NOT APPLICABLE' AND city_name != 'Not Reported' GROUP BY city_name ORDER BY num_crashes DESC LIMIT 10"
city = pd.read_gbq(query, project_id=project_id, dialect='standard')
plt.bar(city["city_name"], city["num_crashes"])
city

## Training Time :D

In [None]:
%%bigquery --project $project_id
SELECT COUNT(*)
FROM `traffic_fatalities.traffic_features`

In [None]:
%%bigquery --project $project_id

CREATE OR REPLACE TABLE `traffic_fatalities.traffic_rand_data` AS
SELECT
    *,
    ROW_NUMBER() OVER(ORDER BY RAND()) AS rand_id
FROM `traffic_fatalities.traffic_features`

In [None]:
%%bigquery --project $project_id

# Train a logistic regression model on the training split (rand_id <= 136942)
CREATE OR REPLACE MODEL `traffic_fatalities.traffic_model`
OPTIONS(model_type='logistic_reg') AS
SELECT 
    state_name,
    number_of_persons_in_motor_vehicles_in_transport_mvit,
    number_of_persons_not_in_motor_vehicles_in_transport_mvit,
    city_name,
    day_name,
    month_of_crash_name,
    day_of_week_name,
    hour_of_crash_name,
    route_signing_name,
    land_use_name,
    first_harmful_event_name,
    work_zone_name,
    relation_to_trafficway_name,
    light_condition_name,
    atmospheric_conditions_name,
    time_to_scene,
    time_to_hospital,
    number_of_drunk_drivers,
    label,
FROM `traffic_fatalities.traffic_rand_data`
WHERE rand_id <= 136942

In [None]:
%%bigquery --project $project_id

# Run cell to view training stats

SELECT
  *
FROM
  ML.TRAINING_INFO(MODEL `traffic_fatalities.traffic_model`)

In [None]:
%%bigquery --project $project_id
# Evaluate the model on the validation split (136942 < rand_id <= 154060)


SELECT
  *
FROM
  ML.EVALUATE(MODEL `traffic_fatalities.traffic_model`, (
      SELECT
        state_name,
        number_of_persons_in_motor_vehicles_in_transport_mvit,
        number_of_persons_not_in_motor_vehicles_in_transport_mvit,
        city_name,
        day_name,
        month_of_crash_name,
        day_of_week_name,
        hour_of_crash_name,
        route_signing_name,
        land_use_name,
        first_harmful_event_name,
        work_zone_name,
        relation_to_trafficway_name,
        light_condition_name,
        atmospheric_conditions_name,
        time_to_scene,
        time_to_hospital,
        number_of_drunk_drivers,
        label,
      FROM `traffic_fatalities.traffic_rand_data`
      WHERE rand_id > 136942 AND rand_id <= 154060))

In [None]:
%%bigquery --project $project_id

# Evaluate the model on the remaining held-out rows (rand_id > 154060)
SELECT
  *
FROM
  ML.EVALUATE(MODEL `traffic_fatalities.traffic_model`, (
      SELECT
        state_name,
        number_of_motor_vehicles_in_transport_mvit,
        number_of_parked_working_vehicles,
        number_of_persons_in_motor_vehicles_in_transport_mvit,
        number_of_persons_not_in_motor_vehicles_in_transport_mvit,
        city_name,
        day_name,
        month_of_crash_name,
        day_of_week_name,
        hour_of_crash_name,
        route_signing_name,
        land_use_name,
        first_harmful_event_name,
        work_zone_name,
        relation_to_trafficway_name,
        light_condition_name,
        atmospheric_conditions_name,
        time_to_scene,
        time_to_hospital,
        number_of_drunk_drivers,
        label,
      FROM `traffic_fatalities.traffic_rand_data`
      WHERE rand_id > 154060
  ))
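
The training and evaluation cells above hard-code the `rand_id` cutoffs 136942 and 154060. A sketch of how such cutoffs could be derived from a row count instead is shown below. The 80/10/10 fractions are an assumption (the notebook does not state them, though the training slice is eight times the size of the validation slice), and the function names are hypothetical.

```python
def split_bounds(n_rows, train_pct=80, val_pct=10):
    """Return (train_end, val_end) rand_id cutoffs for a train/validation/test split."""
    # Integer arithmetic avoids floating-point rounding surprises at the boundaries.
    train_end = n_rows * train_pct // 100
    val_end = n_rows * (train_pct + val_pct) // 100
    return train_end, val_end


def split_filters(n_rows):
    """WHERE clauses selecting each split by rand_id, which runs from 1 to n_rows."""
    train_end, val_end = split_bounds(n_rows)
    return {
        "train": f"rand_id <= {train_end}",
        "validation": f"rand_id > {train_end} AND rand_id <= {val_end}",
        "test": f"rand_id > {val_end}",
    }


# Example with a hypothetical table of 1000 rows.
filters = split_filters(1000)
```

In practice `n_rows` would come from the `COUNT(*)` cell at the start of this section, so the cutoffs stay correct if the feature table is rebuilt.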

In [None]:
%%bigquery --project $project_id

# Predict severity for rollover crashes reached by emergency services within 5 minutes
SELECT
  state_name,
  number_of_persons_in_motor_vehicles_in_transport_mvit,
  number_of_persons_not_in_motor_vehicles_in_transport_mvit,
  city_name,
  day_name,
  month_of_crash_name,
  day_of_week_name,
  hour_of_crash_name,
  route_signing_name,
  land_use_name,
  first_harmful_event_name,
  work_zone_name,
  relation_to_trafficway_name,
  light_condition_name,
  atmospheric_conditions_name,
  time_to_scene,
  time_to_hospital,
  number_of_drunk_drivers,
  predicted_label
FROM
  ML.PREDICT(MODEL `traffic_fatalities.traffic_model`, (
  SELECT
    state_name,
    number_of_persons_in_motor_vehicles_in_transport_mvit,
    number_of_persons_not_in_motor_vehicles_in_transport_mvit,
    city_name,
    day_name,
    month_of_crash_name,
    day_of_week_name,
    hour_of_crash_name,
    route_signing_name,
    land_use_name,
    first_harmful_event_name,
    work_zone_name,
    relation_to_trafficway_name,
    light_condition_name,
    atmospheric_conditions_name,
    time_to_scene,
    time_to_hospital,
    number_of_drunk_drivers,
  FROM `traffic_fatalities.traffic_features`
  WHERE time_to_scene <= 5 AND first_harmful_event_name = 'Rollover/Overturn'))
LIMIT 10
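
The same feature list is typed out in the training, evaluation, and prediction cells above, which makes it easy for the copies to drift apart. One option, sketched here, is to keep the list in a single Python variable and build the SELECT from it. `FEATURES` and `select_features` are hypothetical names; the column names are copied from the training cell.

```python
FEATURES = [
    "state_name",
    "number_of_persons_in_motor_vehicles_in_transport_mvit",
    "number_of_persons_not_in_motor_vehicles_in_transport_mvit",
    "city_name",
    "day_name",
    "month_of_crash_name",
    "day_of_week_name",
    "hour_of_crash_name",
    "route_signing_name",
    "land_use_name",
    "first_harmful_event_name",
    "work_zone_name",
    "relation_to_trafficway_name",
    "light_condition_name",
    "atmospheric_conditions_name",
    "time_to_scene",
    "time_to_hospital",
    "number_of_drunk_drivers",
]


def select_features(table, where, include_label=True):
    """Build a SELECT over the shared feature list, optionally with the label column."""
    cols = FEATURES + (["label"] if include_label else [])
    col_sql = ",\n    ".join(cols)
    return f"SELECT\n    {col_sql}\nFROM `{table}`\nWHERE {where}"


# Training input (with label) and prediction input (without label).
train_sql = select_features("traffic_fatalities.traffic_rand_data", "rand_id <= 136942")
predict_sql = select_features(
    "traffic_fatalities.traffic_features",
    "time_to_scene <= 5 AND first_harmful_event_name = 'Rollover/Overturn'",
    include_label=False,
)
```

The resulting strings could then be embedded in the `CREATE OR REPLACE MODEL`, `ML.EVALUATE`, and `ML.PREDICT` statements and run through the `client` created at the top of the notebook.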