# Capstone Project - Car accident severity (Week 3)
***

For this week, you will be required to submit the following:

1. A description of the problem and a discussion of the background. __(15 marks)__
2. A description of the data and how it will be used to solve the problem. __(15 marks)__

For the 3rd week, the final deliverables of the project will be:

1. A link to your Notebook on your Github repository, showing your code. __(15 marks)__
2. A full report consisting of all of the following components __(15 marks)__:
 - Introduction where you discuss the business problem and who would be interested in this project.
 - Data where you describe the data that will be used to solve the problem and the source of the data.
 - Methodology section, which represents the main component of the report, where you discuss and describe any exploratory data analysis that you did, any inferential statistical testing that you performed, and which machine learning methods were used and why.
 - Results section where you discuss the results.
 - Discussion section where you discuss any observations you noted and any recommendations you can make based on the results.
 - Conclusion section where you conclude the report.
3. Your choice of a presentation or blogpost. __(10 marks)__
***

# Code and Analysis
***

### Packages and modules import

In [1]:
# import necessary libraries
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

### Data import and cleansing

In [2]:
# read data in and have an initial look
df_data = pd.read_csv("https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv")
df_data.head()

  interactivity=interactivity, compiler=compiler, result=result)


Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


In [3]:
# the warning above indicates that column 33 has mixed types; look up its name (it turns out to be ST_COLCODE)
df_data.columns[33]

'ST_COLCODE'
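
One possible way to deal with a mixed-type column is to read it as text and convert it explicitly. The sketch below uses a tiny made-up CSV rather than the real file, and it assumes the non-numeric entries are blank strings, which was not verified against the dataset.

```python
import io

import pandas as pd

# Made-up sample: the second row has a blank (whitespace-only) collision code.
csv_text = "SEVERITYCODE,ST_COLCODE\n1,10\n2, \n1,32\n"

# Reading the column as text avoids pandas guessing a type chunk by chunk.
df = pd.read_csv(io.StringIO(csv_text), dtype={"ST_COLCODE": str})

# Convert to numbers explicitly; anything non-numeric becomes NaN.
codes = pd.to_numeric(df["ST_COLCODE"].str.strip(), errors="coerce")
print(codes)
```

Since ST_COLCODE is dropped in the next step of this notebook, the conversion is optional here.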

In [17]:
# choose the columns we will work with; .copy() avoids a SettingWithCopyWarning on the assignments below
data_of_interest = ["SEVERITYCODE", "SPEEDING", "ROADCOND"]
df_data = df_data[data_of_interest].copy()

In [18]:
# take a look at the distinct severity codes and the number of records assigned to each
# SEVERITYCODE MEANING
#   0 - unknown
#   1 - prop damage
#   2 - injury
#   2b - serious injury
#   3 - fatality
#
# only codes 1 and 2 appear in this dataset: no record is coded 2b (serious injury) or 3 (fatality)
print("Distinct values in SEVERITYCODE and their frequency")
print(df_data["SEVERITYCODE"].value_counts())

Distinct values in SEVERITYCODE and their frequency
1    136485
2     58188
Name: SEVERITYCODE, dtype: int64
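
The two classes are imbalanced, which matters when reading accuracy later on. A quick calculation from the counts printed above gives the accuracy a model would get by always predicting the most common class:

```python
# Counts copied from the value_counts() output above.
counts = {"1": 136485, "2": 58188}

total = sum(counts.values())
majority_share = max(counts.values()) / total

print(f"total records: {total}")
print(f"share of the most common class: {majority_share:.3f}")
```

Any model trained on this data should be compared against that share rather than against 50%.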


In [5]:
# clean up the data a bit by replacing the nan's with a "N/A"
df_data["SPEEDING"] = df_data["SPEEDING"].fillna("N/A")

# clean up road condition as well
df_data["ROADCOND"] = df_data["ROADCOND"].fillna("Unknown")

# check data after replacing nan for SPEEDING
print("Distinct values in SPEEDING after cleanup:")
for value in ["SPEEDING"]:
    print(df_data[value].value_counts())
    
print("------------------------------------------")

# check data after replacing nan for ROADCOND
print("Distinct values in ROADCOND after cleanup:")
for value in ["ROADCOND"]:
    print(df_data[value].value_counts())

Distinct values in SPEEDING after cleanup:
N/A    185340
Y        9333
Name: SPEEDING, dtype: int64
------------------------------------------
Distinct values in ROADCOND after cleanup:
Dry               124510
Wet                47474
Unknown            20090
Ice                 1209
Snow/Slush          1004
Other                132
Standing Water       115
Sand/Mud/Dirt         75
Oil                   64
Name: ROADCOND, dtype: int64


In [6]:
# since there are multiple road conditions, collapse them into two categories for use later
# (note: "Unknown" is grouped with BAD and "Other" with GOOD)
df_data["ROADCOND"] = df_data["ROADCOND"].replace(
    to_replace=["Wet","Dry","Unknown","Snow/Slush","Ice","Other","Sand/Mud/Dirt","Standing Water","Oil"],
    value=["BAD","GOOD","BAD","BAD","BAD","GOOD","BAD","BAD","BAD"])

# convert them into binary data: SPEEDING 1 = speeding reported, ROADCOND 1 = good road condition
df_data["SPEEDING"] = df_data["SPEEDING"].replace(to_replace=["N/A", "Y"], value=[0, 1])
df_data["ROADCOND"] = df_data["ROADCOND"].replace(to_replace=["BAD", "GOOD"], value=[0, 1])

# check data after assigning category for SPEEDING
print("Distinct values in SPEEDING after assigning category:")
for value in ["SPEEDING"]:
    print(df_data[value].value_counts())

print("------------------------------------------")

# check data after assigning category for ROADCOND
print("Distinct values in ROADCOND after assigning category:")
for value in ["ROADCOND"]:
    print(df_data[value].value_counts())


Distinct values in SPEEDING after assigning category:
0    185340
1      9333
Name: SPEEDING, dtype: int64
------------------------------------------
Distinct values in ROADCOND after assigning category:
1    124642
0     70031
Name: ROADCOND, dtype: int64


***

In [7]:
# select the two feature columns used as model input (kept under the name testing_dataset) and show them
testing_dataset = df_data[["SPEEDING", "ROADCOND"]]
testing_dataset.head()

Unnamed: 0,SPEEDING,ROADCOND
0,0,0
1,0,0
2,0,1
3,0,1
4,0,0
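
Because both features are binary, every record falls into one of only four feature combinations. A useful exploratory step is to look at the share of each severity code per combination. The sketch below shows the idea on a small made-up frame; the values are illustrative only and are not taken from the collision data.

```python
import pandas as pd

# Made-up toy records using the same column names as the notebook.
toy = pd.DataFrame({
    "SPEEDING":     [0, 0, 0, 0, 1, 1, 1, 1],
    "ROADCOND":     [0, 0, 1, 1, 0, 0, 1, 1],
    "SEVERITYCODE": [1, 2, 1, 1, 2, 2, 1, 2],
})

# Rows: feature combinations; columns: severity codes; cells: share within each row.
table = pd.crosstab(
    index=[toy["SPEEDING"], toy["ROADCOND"]],
    columns=toy["SEVERITYCODE"],
    normalize="index",
)
print(table)
```

Running the same `pd.crosstab` call on `df_data` would show directly how much the injury share differs between the four groups.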


### Preparing model and training

In [9]:
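# features (x) and target (y); the target is cast to string so it is treated as a class label
x = testing_dataset
y = df_data['SEVERITYCODE'].values.astype(str)
# standardize the features, then hold out 20% of the records for testing
x = preprocessing.StandardScaler().fit(x).transform(x)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1234)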

# obtaining data dimensions
print("Training data: ", x_train.shape, y_train.shape)
print("Testing data: ", x_test.shape, y_test.shape)

Training data:  (155738, 2) (155738,)
Testing data:  (38935, 2) (38935,)
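
A variation worth considering: split first, then let a pipeline fit the scaler on the training part only, so nothing from the test records leaks into preprocessing. The sketch below runs on randomly generated stand-in data. The `stratify=y` argument and the pipeline are additions that are not part of the cell above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Random stand-in data: two binary features and an imbalanced string label.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 2)).astype(float)
y = np.where(rng.random(200) < 0.3, "2", "1")

# stratify=y keeps the class proportions the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1234, stratify=y
)

# The pipeline fits the scaler on the training data only.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=0.01, solver="liblinear"),
)
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
print("Training data:", X_train.shape, "Testing data:", X_test.shape)
```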




### Model methodology

In [10]:
# KNN
KNN_model = KNeighborsClassifier(n_neighbors = 4).fit(x_train, y_train)
predicted = KNN_model.predict(x_test)
KNN_f1 = f1_score(y_test, predicted, average='weighted')
KNN_acc = accuracy_score(y_test, predicted)

In [11]:
# logistic regression
LR_model = LogisticRegression(C=0.01, solver='liblinear').fit(x_train, y_train)
predicted = LR_model.predict(x_test)
LR_f1 = f1_score(y_test, predicted, average='weighted')
LR_acc = accuracy_score(y_test, predicted)



In [12]:
# tree
Tree_model = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
Tree_model.fit(x_train, y_train)
predicted = Tree_model.predict(x_test)
Tree_f1 = f1_score(y_test, predicted, average='weighted')
Tree_acc = accuracy_score(y_test, predicted)



### Result

In [14]:
results = {
    "Analysis Method": ["KNN", "Logistic Regression", "Decision Tree"],
    "F1-score": [KNN_f1, LR_f1, Tree_f1],
    "Accuracy": [KNN_acc, LR_acc, Tree_acc]
}

# assign and print out results
results = pd.DataFrame(results)
results


Unnamed: 0,Analysis Method,F1-score,Accuracy
0,KNN,0.591438,0.697085
1,Logistic Regression,0.576051,0.699679
2,Decision Tree,0.576051,0.699679
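
Logistic regression and the decision tree report identical scores, so it is worth checking them against a trivial predictor. If a model predicts the most common class for every record, its accuracy `a` equals that class's share of the test set, and its weighted F1-score works out to `2a² / (1 + a)`. The helper below (a hypothetical name, not a scikit-learn function) does the arithmetic:

```python
def majority_weighted_f1(accuracy: float) -> float:
    """Weighted F1 of a model that always predicts the most common class."""
    precision = accuracy   # every prediction is the majority class
    recall = 1.0           # every majority-class record is found
    f1_majority = 2 * precision * recall / (precision + recall)
    # The minority class has F1 = 0; weights are the class shares.
    return accuracy * f1_majority


value = majority_weighted_f1(0.699679)
print(round(value, 6))
```

The result is about 0.576, which matches the F1-score reported for both logistic regression and the decision tree. That is consistent with those two models predicting severity 1 for every test record, although it does not prove it; a confusion matrix on `y_test` would settle the question.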


In [None]:
compare_results = {
    "Intercept": LR_model.intercept_,
    "SPEEDING ": LR_model.coef_[:,0],
    "ROADCOND ": LR_model.coef_[:,1],
}

# assign and display the comparison results
compare_results = pd.DataFrame(compare_results)
compare_results
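
Logistic regression coefficients are log-odds, so exponentiating them gives odds ratios, which are easier to read. The cell above was not executed, so the coefficients below are made-up placeholders and not results from this notebook.

```python
import math

# Hypothetical coefficients, for illustration only.
coefficients = {"SPEEDING": 0.05, "ROADCOND": -0.10}

# exp(coefficient) = multiplicative change in the odds of the second class
# (here severity "2") per one-unit increase in the feature.
odds_ratios = {name: math.exp(value) for name, value in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio {ratio:.3f}")
```

Because the features were standardized before training, "one unit" here means one standard deviation of the feature, not the step from 0 to 1.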

***



# Presentation
***

## 1. Introduction - Problem and Background

Car accidents occur worldwide and are one of the leading causes of death for people aged 5-29 years. According to the World Health Organization (WHO), roughly 1.35 million people die from traffic collisions each year. More than half of road traffic deaths involve vulnerable road users such as pedestrians, cyclists, and motorcyclists. Road traffic accidents also cost most countries about 3% of their gross domestic product. The goal is to identify relevant factors and derive insight into what events lead to these car accidents and what their severity depends on.

__Problems and Factors__
* What contributes to car accidents?
* What can be done to remedy these factors?

__Source__ 

WHO: https://www.who.int/news-room/fact-sheets/detail/road-traffic-injuries

## 2. Data

The dataset used for analysis was current as of Oct 2020; the version loaded above contains 194,673 records. It covers accidents that took place in Seattle, Washington, from 2004 to 2020. For each car accident, a severity code is assigned (1 = property damage, 2 = injury), along with other relevant information such as:
* location
* speeding involved
* road condition
* collision type
* weather condition
* lighting condition
* driver inattention
* number of cars involved
* number of people involved

These data points will be analyzed to determine the major influences on car collisions and their severity.

__Source__

Dataset: https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv <br>
Metadata: https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Metadata.pdf

## 3. Methodology

Exploratory analysis is used to take a deep dive into the dataset, together with some data cleansing.

The machine learning methods used are KNN (K-Nearest Neighbors), logistic regression, and a decision tree.
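
The comparison of the three methods can also be sketched with cross-validation, which averages the score over several splits instead of relying on a single one. The data below is randomly generated stand-in data, not the collision records, and cross-validation is a suggested extension and not what the notebook above does. The model settings are copied from the code cells.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Random stand-in data: severity "2" is made more likely when the first feature is 1.
rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(300, 2)).astype(float)
p_injury = 0.25 + 0.3 * X[:, 0]
y = np.where(rng.random(300) < p_injury, "2", "1")

models = {
    "KNN": KNeighborsClassifier(n_neighbors=4),
    "Logistic Regression": LogisticRegression(C=0.01, solver="liblinear"),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", max_depth=4),
}

# Mean accuracy over 5 stratified folds for each model.
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```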

## 4. Results

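Based on the three results shown above in [Result](#Result), all models reach an accuracy of about 0.70 and a weighted F1-score between 0.58 and 0.59. I read these scores as an indication that both driver speeding and road conditions play a role in the severity of a car accident. The accuracy is close to the share of severity 1 records in the data (136,485 of 194,673, about 0.70), so the strength of that effect should be interpreted with caution.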

## 5. Discussion

The recommendation to improve the safety of everyone on the road (as well as pedestrians) is for the city of Seattle to enforce that vehicles operate at or below the designated speed limit. Increasing the fines for violations may help with this. <br>
The state of Washington as a whole should also consider investigating the road conditions in areas where these car accidents occur and carrying out any necessary repairs and improvements.

## 6. Conclusion
Based on the above analysis, I have arrived at the conclusion that road conditions and whether the driver is speeding both contribute to the severity of traffic accidents.

***