Dataset Proposal

# First Dataset: NHTSA CISS Dataset

## Dataset Link: 
[ftp link](https://www.nhtsa.gov/node/97996/206)

Documentation is found at: <br>
[Overview](https://www.nhtsa.gov/crash-data-systems/crash-investigation-sampling-system)
<br>
[Analytical User Manual](https://crashstats.nhtsa.dot.gov/Api/Public/ViewPublication/812803)
<br>
[Coding Manual](https://crashstats.nhtsa.dot.gov/Api/Public/ViewPublication/812735)

## Description:
The Crash Investigation Sampling System (CISS) is a highly detailed dataset collected by specially trained teams within the National Highway Traffic Safety Administration (NHTSA) for use in improving vehicle safety. NHTSA has stated that the purpose of this database is to support scientists and engineers in analyzing motor vehicle crashes and injuries.

The data are collected using crash reconstruction (on-scene measurements, vehicle measurements, etc.), interviews of the individuals involved in the collision, and medical records. The inclusion criterion is that at least one of the vehicles involved in the collision is towed from the scene. Data are sampled from among 32 strategic areas across the USA, designed to represent the USA as a whole (weighting factors are provided for scaling and generalization of results, where applicable).

![CISS Map](CISS_MAP.png)
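As a rough illustration of how weighting factors change an estimate, the sketch below compares an unweighted share with a weighted one. The `injured` flags, the `case_weight` values, and the helper `weighted_share` are all made up for this example; the actual weight variable should be taken from the Analytical User Manual.

```python
def weighted_share(flags, weights):
    """Weighted proportion of cases whose flag is True."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w for f, w in zip(flags, weights) if f) / total


# Toy sample of four crashes (hypothetical values, not CISS data).
injured = [True, False, False, True]
case_weight = [50.0, 200.0, 150.0, 100.0]

unweighted = sum(injured) / len(injured)
weighted = weighted_share(injured, case_weight)

print(f"unweighted share: {unweighted:.2f}")
print(f"weighted share:   {weighted:.2f}")
```

In this toy case the injured crashes carry smaller weights, so the weighted share is lower than the raw sample share.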

The database has numerous tables at the crash level, vehicle level, person level, and injury level. There are specific keys that are used to link the tables to each other. A simple data model for the CISS database is shown below.

<img src="CISS_Data_Mdl.jpg" width="600">
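Linking tables through their keys might look like the following pandas sketch. The column names (`CASEID`, `VEHNO`, `CRASH_MONTH`, `DISTRACTED`) and the values are assumptions for illustration only; the real key names are listed in the Coding Manual.

```python
import pandas as pd

# Toy stand-ins for a crash-level and a vehicle-level table (hypothetical columns).
crash = pd.DataFrame({
    "CASEID": [1, 2, 3],
    "CRASH_MONTH": [1, 5, 9],
})
distract = pd.DataFrame({
    "CASEID": [1, 1, 2],
    "VEHNO": [1, 2, 1],
    "DISTRACTED": [0, 1, 0],
})

# A left join keeps every crash, even one with no vehicle-level row.
# validate= raises an error if the crash key is not unique, which would
# silently duplicate rows otherwise.
linked = crash.merge(distract, on="CASEID", how="left", validate="one_to_many")
print(linked)
```

Crash 3 has no matching row in the toy `distract` table, so its vehicle-level columns come back as missing values rather than the crash being dropped.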

Only a subset of the available tables will be used. These will likely include:
- Crash
- Event
- Distract
- Avoid
- Seat
- Airbag
- Eject
- EMSCare

The 2017 dataset includes approximately 2,000 crashes. Future CISS data are anticipated to include approximately 5,000 crashes per year.

## Target Variable:
MAIS (Maximum Abbreviated Injury Scale for a person).

## Questions:
- How does distracted driving impact collision severity?
- Which model performs best for predicting injury severity - ordinal logit (using statsmodels) or random forest?
- Can random forest be interpreted at an aggregate level for each predictor (e.g., using partial dependence plots or other methods)? If so, how do the results compare with the ordinal logit interpretation?
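One possible way to approach the aggregate interpretation question is sketched below, using scikit-learn's `partial_dependence` on a random forest. The data are synthetic (a made-up distraction flag, impact speed, and three-level ordinal outcome), not CISS data, and the ordinal logit side of the comparison is not shown here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence

rng = np.random.default_rng(0)
n = 300

# Synthetic predictors (hypothetical): distraction flag and impact speed.
distracted = rng.integers(0, 2, size=n)
speed = rng.uniform(10, 70, size=n)
X = np.column_stack([distracted, speed]).astype(float)

# Synthetic ordinal outcome with three levels (0 < 1 < 2) from a latent score.
latent = 0.8 * distracted + 0.05 * speed + rng.normal(0, 1, size=n)
y = np.digitize(latent, bins=[2.0, 3.5])

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Average predicted class probabilities as the distraction flag (column 0)
# is set to each of its observed values, holding other columns as observed.
pd_result = partial_dependence(forest, X, features=[0], kind="average")
avg = pd_result["average"]  # shape: (n_classes, n_grid_points)

for level, row in enumerate(avg):
    print(f"severity level {level}: {np.round(row, 3)}")
```

Each row gives the averaged predicted probability of one severity level at `distracted = 0` and `distracted = 1`, which is roughly the kind of per-predictor summary that could be set beside the ordinal logit coefficients.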

Note: I have worked with this dataset extensively, including applying ordinal logistic regression to predict injury severity, so the key learning opportunity is in using the random forest. I am particularly interested in the comparisons.

# Second Dataset: NHTSA CRSS Database

## Dataset Link: 
[ftp link](https://www.nhtsa.gov/node/97996/221)

Documentation is found at: <br>
[Overview](https://www.nhtsa.gov/crash-data-systems/crash-report-sampling-system)
<br>
[Analytical User Manual](https://www.nhtsa.gov/filebrowser/download/176916)
<br>
[Coding Manual](https://www.nhtsa.gov/filebrowser/download/176926)

## Description:
The Crash Report Sampling System (CRSS) is a national dataset collected by the National Highway Traffic Safety Administration (NHTSA) for use in improving roadway safety. The data are collected using crash reports. Data are sampled from among 60 strategic areas across the USA, designed to represent the USA as a whole (weighting factors are provided for scaling and generalization of results, where applicable). Approximately 50,000 crashes are included in the database each year. Data from 2016-2018 are currently available, as of June 3rd, 2020.

![CRSS Map](CRSS.jpg)

The database has numerous tables at the crash level, vehicle level, and person level. There are specific keys that are used to link the tables to each other. A simple data model for the CRSS database is shown below.

![CRSS Data_Model](CRSS_Data.jpg)

Only a subset of the available tables will be used. These will likely include, at a minimum:
- Accident
- Damage
- Distract
- Drimpair
- Maneuver
- Safetyeq


## Target Variable:
Severity (on the KABCO scale: K=Fatal, A=Severe Injury, B=Injury, C=Possible Injury, O=No Injury)
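
Because both candidate models need the KABCO levels in severity order, the target could be encoded as an ordered categorical. This is a minimal sketch that assumes severity is stored as the letters above; the actual CRSS files may use different codes, so the mapping here is illustrative.

```python
import pandas as pd

# Least to most severe.
KABCO_ORDER = ["O", "C", "B", "A", "K"]

# Made-up severity values for illustration.
severity = pd.Series(["O", "K", "B", "C", "A", "O"])
ordered = pd.Categorical(severity, categories=KABCO_ORDER, ordered=True)

# Integer codes 0..4 preserve the severity ordering.
codes = pd.Series(ordered.codes, index=severity.index)
print(codes.tolist())  # [0, 4, 2, 1, 3, 0]
```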

## Questions:
- How does distracted driving impact collision severity, based on police-reported distracted driving?
- Which model performs best for predicting injury severity - ordinal logit (using statsmodels) or random forest?

# Third Dataset: Pittsburgh Traffic Count Data

## Dataset Link: 
[Data Link](https://data.wprdc.org/datastore/dump/6dfd4f8f-cbf5-4917-a5eb-fd07f4403167)
<br>
[Data Description](https://data.wprdc.org/dataset/traffic-count-data-city-of-pittsburgh/resource/6dfd4f8f-cbf5-4917-a5eb-fd07f4403167)

## Description:
This traffic-count data is provided by the City of Pittsburgh's Department of Mobility & Infrastructure (DOMI). Counters were deployed as part of traffic studies, including intersection studies, and studies covering where or whether to install speed humps. In some cases, data may have been collected by the Southwestern Pennsylvania Commission (SPC) or BikePGH.

Data is currently available for only the most-recent count at each location.

### Derived Variable
A binary variable will be derived to indicate whether the speed limit at each location is legally posted, based on the most recent observations. This assumes that there are no statutory speed limits. Where there are no statutory speed limits, the legal posted speed limit should be within 5 mph of the 85th percentile speed.
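
The 5 mph rule could be expressed as in the sketch below. The function name `is_legally_posted`, the inclusive treatment of exactly 5 mph, and the sample values are assumptions for illustration, not fields from the Pittsburgh data.

```python
def is_legally_posted(posted_limit_mph, speed_85th_mph, tolerance_mph=5):
    """True when the posted limit is within the tolerance of the 85th percentile speed."""
    return abs(posted_limit_mph - speed_85th_mph) <= tolerance_mph


# Made-up count locations: (posted limit, 85th percentile speed), both in mph.
locations = [(25, 29), (25, 34), (35, 30), (25, 31)]
flags = [is_legally_posted(limit, p85) for limit, p85 in locations]
print(flags)  # [True, False, True, False]
```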

## Target Variables:
- Legal Posted Speed Limit
- Percent of traffic above the speed limit

## Questions:
- Does the probability of a speed limit being legally posted vary by council, ward, police zone, or average traffic volume?
- Does the percentage of traffic traveling above the speed limit vary by council, ward, police zone, or average traffic volume?
- Is there a good way to visualize the data and results?