### Business Problem

According to a 2019 CDC report, road traffic crashes are a leading cause of death in the United States for people aged 1–54 and the leading cause of non-natural death for healthy U.S. citizens residing or traveling abroad.
A 2018 WHO report highlighted the following facts about global road traffic injuries and deaths:
*	Each year, 1.35 million people are killed on roadways around the world.
*	Every day, almost 3,700 people are killed globally in road traffic crashes involving cars, buses, motorcycles, bicycles, trucks, or pedestrians. More than half of those killed are pedestrians, motorcyclists, and cyclists.
*	Road traffic injuries are estimated to be the eighth leading cause of death globally for all age groups and the leading cause of death for children and young people 5–29 years of age.

Car crashes are a public health concern both globally and in the United States, but these injuries and deaths are preventable, and the impact and consequences of accidents can be minimised. Conventional techniques and methodologies have been used in the past to predict the severity of crashes, though they had a number of drawbacks in producing accurate, high-quality inferences. This project aims to predict the severity of accidents from a number of factors and to show how their impact can be minimised. For the purposes of this project, we will use data relating to the city of Seattle.

The solution seeks to aid the Seattle Department of Transportation (SDOT) in planning, building and maintaining its transportation infrastructure, so that it can tailor its road networks to address the rise in accidents.

The solution will also help the Department of Health plan for the equipment and resources required, based on the predicted accident severities. This will help it acquire resources that appropriately address the problems and injuries at a particular accident scene.

Road users in general, that is, pedestrians and motorists, can be advised to take precautionary measures that reduce the severity of accidents, provided the information is made publicly available to Seattle citizens and passers-by.


### Data Description

For the purposes of this project, the existing Seattle City accident CSV dataset will be used to predict the severity of an accident. As such, no further data gathering or scraping was performed to collect the data.

The dataset records the severity of each accident, which we will use as the label to predict with supervised learning. It contains 194,673 unique instances, which we will analyse and use to train and test the model. The model should generalise well enough to be applied in other similar settings, avoiding both overfitting and underfitting.
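
To make the train/test idea concrete, here is a minimal sketch of a stratified hold-out split, so that both parts keep the same mix of severity codes. The toy data, the helper name `stratified_split`, the 20% test fraction and the seed are all assumptions for illustration and are not taken from the project itself.

```python
import pandas as pd

# Toy stand-in for the collisions table: 70 rows with code 1 and 30 with code 2.
# All values are made up; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "SEVERITYCODE": [1] * 70 + [2] * 30,
    "PERSONCOUNT": list(range(100)),
})

def stratified_split(frame, label, test_frac=0.2, seed=42):
    """Hypothetical helper: sample the same fraction from every label value."""
    test = frame.groupby(label).sample(frac=test_frac, random_state=seed)
    train = frame.drop(index=test.index)  # everything not held out
    return train, test

train, test = stratified_split(toy, "SEVERITYCODE")
print(len(train), len(test))
print(test["SEVERITYCODE"].value_counts().to_dict())
```

Evaluating on the held-out rows, rather than on the rows used for fitting, is one common way to spot overfitting: a large gap between training and test scores suggests the model does not generalise.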

The data will be cleaned and sanitized using the procedures below to bring it to an adequate, clean state.

Data pre-processing and feature engineering techniques required:
*   Handling missing values: fill null values or remove the affected records entirely, so that every record is fully represented.
*   Dropping useless columns: avoid noise and keep only adequately informative inputs.
*   Encoding categorical values: convert the information into a more machine-readable form.
*   Changing data types: use appropriate data formats during the analysis and model building.
*   Normalization and variable transformation: feature scaling brings numerical stability to the dataset and helps avoid inaccurate inferences in the model.
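
The five steps above can be sketched on a small toy frame as follows. This is only an illustration under stated assumptions: the `WEATHER` column, every value in the frame, the helper name `preprocess`, and the choice of one-hot encoding and min-max scaling are hypothetical, and the actual notebook may handle each step differently.

```python
import pandas as pd

# Toy rows. SEVERITYCODE, X, Y, PERSONCOUNT, VEHCOUNT and INTKEY are real column
# names from the dataset; WEATHER and all values are assumptions for illustration.
raw = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1, 2, 1],
    "X": [-122.32, -122.35, None, -122.31, -122.33],
    "Y": [47.61, 47.66, None, 47.58, 47.62],
    "PERSONCOUNT": ["2", "3", "2", "4", "2"],   # stored as text on purpose
    "VEHCOUNT": [2, 2, 1, 3, 2],
    "WEATHER": ["Clear", "Raining", "Clear", None, "Overcast"],
    "INTKEY": [None, 29973.0, None, None, 33973.0],
})

def preprocess(frame):
    """Hypothetical helper applying the five steps in order."""
    # 1. Missing values: drop rows without coordinates, fill missing categories.
    out = frame.dropna(subset=["X", "Y"]).copy()
    out["WEATHER"] = out["WEATHER"].fillna("Unknown")
    # 2. Drop a column assumed to carry no predictive signal (a sparse key).
    out = out.drop(columns=["INTKEY"])
    # 3. Encode the categorical column as one-hot indicator columns.
    out = pd.get_dummies(out, columns=["WEATHER"], dtype=int)
    # 4. Change data types: the text counts become integers.
    out["PERSONCOUNT"] = out["PERSONCOUNT"].astype(int)
    # 5. Min-max scale the numeric features to the range [0, 1].
    for col in ["X", "Y", "PERSONCOUNT", "VEHCOUNT"]:
        lo, hi = out[col].min(), out[col].max()
        out[col] = (out[col] - lo) / (hi - lo) if hi > lo else 0.0
    return out.reset_index(drop=True)

clean = preprocess(raw)
print(clean.dtypes)
print(clean)
```

In a real pipeline the scaling bounds would normally be computed on the training split only and then reused on the test split, so that no information leaks from the test data into the model.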

In [14]:
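import pandas as pd
# Load the Seattle collisions CSV from a local path.
df = pd.read_csv(r"C:\Users\leon.vambe.ECONETZW\Documents\seattle\seattle_collisions.csv")
# Count distinct OBJECTID values to get the number of unique records.
print("Unique Records are", df['OBJECTID'].nunique())
# Summary statistics for the numeric columns.
df.describe()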

Unique Records are 194673


|  | SEVERITYCODE | X | Y | OBJECTID | INCKEY | COLDETKEY | INTKEY | SEVERITYCODE.1 | PERSONCOUNT | PEDCOUNT | PEDCYLCOUNT | VEHCOUNT | SDOT_COLCODE | SDOTCOLNUM | SEGLANEKEY | CROSSWALKKEY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 194673.0 | 189339.0 | 189339.0 | 194673.0 | 194673.0 | 194673.0 | 65070.0 | 194673.0 | 194673.0 | 194673.0 | 194673.0 | 194673.0 | 194673.0 | 114936.0 | 194673.0 | 194673.0 |
| mean | 1.298901 | -122.330518 | 47.619543 | 108479.36493 | 141091.45635 | 141298.811381 | 37558.450576 | 1.298901 | 2.444427 | 0.037139 | 0.028391 | 1.92078 | 13.867768 | 7972521.0 | 269.401114 | 9782.452 |
| std | 0.457778 | 0.029976 | 0.056157 | 62649.722558 | 86634.402737 | 86986.54211 | 51745.990273 | 0.457778 | 1.345929 | 0.19815 | 0.167413 | 0.631047 | 6.868755 | 2553533.0 | 3315.776055 | 72269.26 |
| min | 1.0 | -122.419091 | 47.495573 | 1.0 | 1001.0 | 1001.0 | 23807.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1007024.0 | 0.0 | 0.0 |
| 25% | 1.0 | -122.348673 | 47.575956 | 54267.0 | 70383.0 | 70383.0 | 28667.0 | 1.0 | 2.0 | 0.0 | 0.0 | 2.0 | 11.0 | 6040015.0 | 0.0 | 0.0 |
| 50% | 1.0 | -122.330224 | 47.615369 | 106912.0 | 123363.0 | 123363.0 | 29973.0 | 1.0 | 2.0 | 0.0 | 0.0 | 2.0 | 13.0 | 8023022.0 | 0.0 | 0.0 |
| 75% | 2.0 | -122.311937 | 47.663664 | 162272.0 | 203319.0 | 203459.0 | 33973.0 | 2.0 | 3.0 | 0.0 | 0.0 | 2.0 | 14.0 | 10155010.0 | 0.0 | 0.0 |
| max | 2.0 | -122.238949 | 47.734142 | 219547.0 | 331454.0 | 332954.0 | 757580.0 | 2.0 | 81.0 | 6.0 | 2.0 | 12.0 | 69.0 | 13072020.0 | 525241.0 | 5239700.0 |
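
Two things relevant to the cleaning and modelling steps can be read off this summary with simple arithmetic: how many values each column is missing, and roughly how the two severity classes are balanced. The sketch below copies the counts and the mean from the `describe()` output above. It assumes that `SEVERITYCODE` only takes the values 1 and 2, which is consistent with its min, quartiles and max but is not confirmed elsewhere in this section.

```python
# Figures copied from the describe() output above.
TOTAL_ROWS = 194673
non_null = {"X": 189339, "Y": 189339, "INTKEY": 65070, "SDOTCOLNUM": 114936}
severity_mean = 1.298901

# Missing values per column = total rows minus the non-null count.
missing = {col: TOTAL_ROWS - n for col, n in non_null.items()}
missing_share = {col: m / TOTAL_ROWS for col, m in missing.items()}

# Assumption: SEVERITYCODE is either 1 or 2, so mean = 1 + share of code-2 rows.
share_code_2 = severity_mean - 1
est_code_2_rows = round(share_code_2 * TOTAL_ROWS)

for col in missing:
    print(f"{col}: {missing[col]} missing ({missing_share[col]:.1%})")
print(f"Share of SEVERITYCODE == 2: {share_code_2:.1%} (about {est_code_2_rows} rows)")
```

Under that assumption, roughly 30% of the records carry severity code 2, so the label is imbalanced, and columns such as `INTKEY` are missing for most rows. Both points bear on the missing-value handling and column-dropping decisions.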
