# __Capstone Project: Predict severity of an accident__
### Applied Data Science Capstone

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## __Introduction: Business Problem__ <a name="introduction"></a>

Traffic accidents are a recurring problem in every corner of the world, perhaps the biggest problem are the injuried people. The first problem is the fatality of the injured, the second problem is the possibility of being seriously injured. Another problem no less important are the costs in repairing damages, costs that insurers assume, costs that governments assume and a long list of affected.

Governments in each country should analyze these types of studies to issue laws that try to reduce the risk of traffic accidents, the frequent driver who goes to their work every day should know the prevention recommendations that come from these reports because they aim to take care of their lives and the lives of their families.

## __Data__ <a name="data"></a>

The dataset has 37 features but we only use the next list: 'INATTENTIONIND', 'UNDERINFL', 'SPEEDING', 'LIGHTCOND', 'ROADCOND', 'WEATHER', 'SEVERITYCODE'.

In [1]:
import pandas as pd
import numpy as np

In [2]:
collisions_base = pd.read_csv('Data-Collisions.csv',low_memory=False)
print('OK')

OK


In [3]:
collisions = collisions_base[['INATTENTIONIND', 'UNDERINFL', 'SPEEDING', 'LIGHTCOND', 'ROADCOND', 'WEATHER', 'SEVERITYCODE']].copy()
collisions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194673 entries, 0 to 194672
Data columns (total 7 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   INATTENTIONIND  29805 non-null   object
 1   UNDERINFL       189789 non-null  object
 2   SPEEDING        9333 non-null    object
 3   LIGHTCOND       189503 non-null  object
 4   ROADCOND        189661 non-null  object
 5   WEATHER         189592 non-null  object
 6   SEVERITYCODE    194673 non-null  int64 
dtypes: int64(1), object(6)
memory usage: 10.4+ MB


The data receive binary processing where the value one (1) represents whether the feature influences the severity of the accident and the other way around (does not influence severity) is given by the value zero (0).

In [4]:
new_values = {'INATTENTIONIND': {'N':'0', 'Y':'1',np.nan:'0'},
              'UNDERINFL': {'N':'0', 'Y':'1',np.nan:'0'},
              'SPEEDING':  {'N':'0', 'Y':'1',np.nan:'0'}}
collisions.replace(new_values,inplace=True)

new_values = {'LIGHTCOND':{'Daylight':'0','Dark - Street Lights On': '1', 'Unknown': '1', 'Dusk': '1', 
                            'Dawn': '1', 'Dark - No Street Lights': '1', 'Dark - Street Lights Off': '1',
                            'Other': '1', 'Dark - Unknown Lighting':'1',np.nan:'1'},
              'ROADCOND':{'Dry':'0', 'Wet': '1', 'Unknown':'1', 'Ice':'1',
                              "Snow/Slush":'1', "Other":'1', "Standing Water":'1',
                              'Sand/Mud/Dirt':'1', 'Oil':'1',np.nan:'1'},
              'WEATHER':{'Clear':'0', 'Raining':'1', 'Overcast':'1', 'Unknown':'1',
                              'Snowing':'1', 'Other':'1', 'Fog/Smog/Smoke':'1', 'Sleet/Hail/Freezing Rain':'1',
                              'Blowing Sand/Dirt':'1', 'Severe Crosswind':'1', 'Partly Cloudy':'1',np.nan:'1'}}
collisions.replace(new_values, inplace=True)
# use 'int' type date for future treatment 
collisions = collisions.astype(int)

collisions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194673 entries, 0 to 194672
Data columns (total 7 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   INATTENTIONIND  194673 non-null  int32
 1   UNDERINFL       194673 non-null  int32
 2   SPEEDING        194673 non-null  int32
 3   LIGHTCOND       194673 non-null  int32
 4   ROADCOND        194673 non-null  int32
 5   WEATHER         194673 non-null  int32
 6   SEVERITYCODE    194673 non-null  int32
dtypes: int32(7)
memory usage: 5.2 MB


Unbalanced data is a feature of such datasets to address such problems, data subsampling is used to balance the two current data classes.

In [5]:
from sklearn.utils import resample

In [6]:
print('---------------------------------------')
print('The actual unbalanced data for SEVERITYCODE is:')
print('---------------------------------------')
print(collisions.SEVERITYCODE.value_counts())
print('---------------------------------------')
df_majority = collisions[collisions.SEVERITYCODE == 1]
df_minority = collisions[collisions.SEVERITYCODE == 2]
df_majority_downsampled = resample(df_majority,replace=False,n_samples=58188,random_state=1)
df_downsampled = pd.concat([df_majority_downsampled,df_minority])
print('The subsample data now is:')
print('---------------------------------------')
print(df_downsampled.SEVERITYCODE.value_counts())
print('---------------------------------------')

---------------------------------------
The actual unbalanced data for SEVERITYCODE is:
---------------------------------------
1    136485
2     58188
Name: SEVERITYCODE, dtype: int64
---------------------------------------
The subsample data now is:
---------------------------------------
2    58188
1    58188
Name: SEVERITYCODE, dtype: int64
---------------------------------------


## __Methodology__ <a name="methodology"></a>

Divide the features in two data sets: internal causes (for those driver's responsability) and external causes (for those related to road, weather conditions)

In [7]:
collisions_internal = df_downsampled[['INATTENTIONIND', 'UNDERINFL', 'SPEEDING','SEVERITYCODE']].copy()
collisions_external = df_downsampled[['LIGHTCOND', 'ROADCOND', 'WEATHER', 'SEVERITYCODE']].copy()

__Modeling__

Let's work with internal causes first applyng Logistic Regression (LR) and Random Forest Classifier (RFC).

For LR we use two options for class_weight: default (with balanced data) and balanced (with inbalanced data)

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import jaccard_similarity_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from sklearn.ensemble import RandomForestClassifier

In [9]:
x = collisions_internal.drop('SEVERITYCODE',axis=1)
y = collisions_internal.SEVERITYCODE
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

# modeling
LR = LogisticRegression(solver='lbfgs')
LR.fit(x_train, y_train)
y_hat = LR.predict(x_test)

# evaluating
print('Model Evaluation:',jaccard_similarity_score(y_test, y_hat).round(4))
y_prob = LR.predict_proba(x_test)
print('Probabilistic Evaluation:', log_loss(y_test, y_prob).round(4))

Model Evaluation: 0.5349
Probabilistic Evaluation: 0.6901


In [10]:
np.unique(y_hat,return_counts=True)

(array([1, 2]), array([17630,  5646], dtype=int64))

In [11]:
x1 = collisions[['INATTENTIONIND', 'UNDERINFL', 'SPEEDING']]
y1 = collisions.SEVERITYCODE
x_train1, x_test1, y_train1, y_test1 = train_test_split(x1, y1, test_size=0.2, random_state=1)

#m modeling
LR1 = LogisticRegression(solver='lbfgs', class_weight='balanced')
LR1.fit(x_train1, y_train1)
y_hat1 = LR1.predict(x_test1)

# evaluating
print('Model Evaluation:',jaccard_similarity_score(y_test1, y_hat1).round(4))
y_prob1 = LR1.predict_proba(x_test1)
print('Probabilistic Evaluation:',log_loss(y_test1, y_prob1).round(4))

Model Evaluation: 0.6343
Probabilistic Evaluation: 0.69


In [12]:
np.unique(y_hat1,return_counts=True)

(array([1, 2]), array([29849,  9086], dtype=int64))

#### In the case of RFC we use three options for class_weight: default (with balanced data), balanced (with inbalanced data), and subsample (with inbalanced data)

In [13]:
# class_weight = 'none'
x2 = collisions_internal.drop('SEVERITYCODE',axis=1)
y2 = collisions_internal.SEVERITYCODE
x_train2, x_test2, y_train2, y_test2 = train_test_split(x2, y2, test_size=0.2, random_state=1)

# modeling
RFC2 = RandomForestClassifier(n_estimators=10)
RFC2.fit(x_train2,y_train2)
y_hat2 = RFC2.predict(x_test2)

# evaluating
print('Model Evaluation:',jaccard_similarity_score(y_test2, y_hat2).round(4))
y_prob2 = RFC2.predict_proba(x_test2)
print('Probabilistic Evaluation:',log_loss(y_test2, y_prob2).round(4))

Model Evaluation: 0.5349
Probabilistic Evaluation: 0.69


In [14]:
np.unique(y_hat2,return_counts=True)

(array([1, 2]), array([17630,  5646], dtype=int64))

In [15]:
# class_weight = 'balanced'
x3 = collisions[['INATTENTIONIND', 'UNDERINFL', 'SPEEDING']]
y3 = collisions.SEVERITYCODE
x_train3, x_test3, y_train3, y_test3 = train_test_split(x3, y3, test_size=0.2, random_state=1)

# modeling
RFC3 = RandomForestClassifier(n_estimators=10, class_weight ='balanced')
RFC3.fit(x_train3,y_train3)
y_hat3 = RFC3.predict(x_test3)

# evaluating
print('Model Evaluation:',jaccard_similarity_score(y_test3, y_hat3).round(4))
y_prob3 = RFC3.predict_proba(x_test3)
print('Probabilistic Evaluation:',log_loss(y_test3, y_prob3).round(4))

Model Evaluation: 0.6343
Probabilistic Evaluation: 0.6899


In [16]:
np.unique(y_hat3,return_counts=True)

(array([1, 2]), array([29849,  9086], dtype=int64))

In [17]:
# class_weight = 'balanced_subsample'
x4 = collisions[['INATTENTIONIND', 'UNDERINFL', 'SPEEDING']]
y4 = collisions.SEVERITYCODE
x_train4, x_test4, y_train4, y_test4 = train_test_split(x4, y4, test_size=0.2, random_state=1)

# modeling
RFC4 = RandomForestClassifier(n_estimators=10,class_weight ='balanced_subsample')
RFC4.fit(x_train4,y_train4)
y_hat4 = RFC4.predict(x_test4)

# evaluating
print('Model Evaluation:',jaccard_similarity_score(y_test4, y_hat4).round(4))
y_prob4 = RFC4.predict_proba(x_test4)
print('Probabilistic Evaluation:',log_loss(y_test4, y_prob4).round(4))

Model Evaluation: 0.6343
Probabilistic Evaluation: 0.69


In [18]:
np.unique(y_hat4,return_counts=True)

(array([1, 2]), array([29849,  9086], dtype=int64))

#### Let's repeat the previous process but this time using the dataframe with external causes

In [19]:
x5 = collisions_external.drop('SEVERITYCODE',axis=1)
y5 = collisions_external.SEVERITYCODE
x_train5, x_test5, y_train5, y_test5 = train_test_split(x5, y5, test_size=0.2, random_state=1)

# modeling
LR5 = LogisticRegression(solver='lbfgs')
LR5.fit(x_train5, y_train5)
y_hat5 = LR5.predict(x_test5)

# evaluating
print('Model Evaluation:',jaccard_similarity_score(y_test5, y_hat5).round(4))
y_prob5 = LR.predict_proba(x_test5)
print('Probabilistic Evaluation:', log_loss(y_test5, y_prob5).round(4))

Model Evaluation: 0.5391
Probabilistic Evaluation: 0.7443


In [20]:
np.unique(y_hat5,return_counts=True)

(array([1, 2]), array([12133, 11143], dtype=int64))

In [21]:
x6 = collisions[['INATTENTIONIND', 'UNDERINFL', 'SPEEDING']]
y6 = collisions.SEVERITYCODE
x_train6, x_test6, y_train6, y_test6 = train_test_split(x6, y6, test_size=0.2, random_state=1)

#m modeling
LR6 = LogisticRegression(solver='lbfgs', class_weight='balanced')
LR6.fit(x_train6, y_train6)
y_hat6 = LR6.predict(x_test6)

# evaluating
print('Model Evaluation:',jaccard_similarity_score(y_test6, y_hat6).round(4))
y_prob6 = LR6.predict_proba(x_test6)
print('Probabilistic Evaluation:',log_loss(y_test6, y_prob6).round(4))

Model Evaluation: 0.6343
Probabilistic Evaluation: 0.69


In [22]:
np.unique(y_hat6,return_counts=True)

(array([1, 2]), array([29849,  9086], dtype=int64))

#### In the case of RFC with external causes ('LIGHTCOND', 'ROADCOND', 'WEATHER')

In [23]:
# class_weight = 'none'
x7 = collisions_external.drop('SEVERITYCODE',axis=1)
y7 = collisions_external.SEVERITYCODE
x_train7, x_test7, y_train7, y_test7 = train_test_split(x7, y7, test_size=0.2, random_state=1)

# modeling
RFC7 = RandomForestClassifier(n_estimators=10)
RFC7.fit(x_train7,y_train7)
y_hat7 = RFC7.predict(x_test7)

# evaluating
print('Model Evaluation:',jaccard_similarity_score(y_test7, y_hat7).round(4))
y_prob7 = RFC7.predict_proba(x_test7)
print('Probabilistic Evaluation:',log_loss(y_test7, y_prob7).round(4))

Model Evaluation: 0.5412
Probabilistic Evaluation: 0.6874


In [24]:
np.unique(y_hat7,return_counts=True)

(array([1, 2]), array([ 8990, 14286], dtype=int64))

In [25]:
# class_weight = 'balanced'
x8 = collisions[['LIGHTCOND', 'ROADCOND', 'WEATHER']]
y8 = collisions.SEVERITYCODE
x_train8, x_test8, y_train8, y_test8 = train_test_split(x8, y8, test_size=0.2, random_state=1)

# modeling
RFC8 = RandomForestClassifier(n_estimators=10,class_weight = 'balanced')
RFC8.fit(x_train8,y_train8)
y_hat8 = RFC8.predict(x_test8)

# evaluating
print('Model Evaluation:',jaccard_similarity_score(y_test8, y_hat8).round(4))
y_prob8 = RFC8.predict_proba(x_test8)
print('Probabilistic Evaluation:',log_loss(y_test8, y_prob8).round(4))

Model Evaluation: 0.4959
Probabilistic Evaluation: 0.6877


In [26]:
np.unique(y_hat8,return_counts=True)

(array([1, 2]), array([15529, 23406], dtype=int64))

In [27]:
# class_weight = 'balanced_subsample'
x9 = collisions[['LIGHTCOND', 'ROADCOND', 'WEATHER']]
y9 = collisions.SEVERITYCODE
x_train9, x_test9, y_train9, y_test9 = train_test_split(x9, y9, test_size=0.2, random_state=1)

# modeling
RFC9 = RandomForestClassifier(n_estimators=10,class_weight = 'balanced_subsample')
RFC9.fit(x_train9,y_train9)
y_hat9 = RFC9.predict(x_test9)

# evaluating
print('Model Evaluation:',jaccard_similarity_score(y_test9, y_hat9).round(4))
y_prob9 = RFC9.predict_proba(x_test9)
print('Probabilistic Evaluation:',log_loss(y_test9, y_prob9).round(4))

Model Evaluation: 0.4959
Probabilistic Evaluation: 0.6876


In [28]:
np.unique(y_hat9,return_counts=True)

(array([1, 2]), array([15529, 23406], dtype=int64))

### Finally we observe the two casuses: internal and external, but this time we only use RFC

In [29]:
# class_wieght = 'none'
x10 = df_downsampled.drop('SEVERITYCODE',axis=1)
y10 = df_downsampled.SEVERITYCODE
x_train10, x_test10, y_train10, y_test10 = train_test_split(x10, y10, test_size=0.2, random_state=1)

# modeling
RFC10 = RandomForestClassifier(n_estimators=10)
RFC10.fit(x_train10,y_train10)
y_hat10 = RFC10.predict(x_test10)

# evaluating
print('Model Evaluation:',jaccard_similarity_score(y_test10, y_hat10).round(4))
y_prob10 = RFC10.predict_proba(x_test10)
print('Probabilistic Evaluation:',log_loss(y_test10, y_prob10).round(4))

Model Evaluation: 0.5507
Probabilistic Evaluation: 0.6864


In [30]:
np.unique(y_hat10,return_counts=True)

(array([1, 2]), array([ 7039, 16237], dtype=int64))

In [31]:
# class_weight = 'balanced'
x11 = collisions[['INATTENTIONIND', 'UNDERINFL', 'SPEEDING','LIGHTCOND', 'ROADCOND', 'WEATHER']]
y11 = collisions.SEVERITYCODE
x_train11, x_test11, y_train11, y_test11 = train_test_split(x11, y11, test_size=0.2, random_state=1)

# modeling
RFC11 = RandomForestClassifier(n_estimators=10,class_weight = 'balanced')
RFC11.fit(x_train11,y_train11)
y_hat11 = RFC11.predict(x_test11)

# evaluating
print('Model Evaluation:',jaccard_similarity_score(y_test11, y_hat11).round(4))
y_prob11 = RFC11.predict_proba(x_test11)
print('Probabilistic Evaluation:',log_loss(y_test11, y_prob11).round(4))

Model Evaluation: 0.4735
Probabilistic Evaluation: 0.6835


In [32]:
np.unique(y_hat11,return_counts=True)

(array([1, 2]), array([12480, 26455], dtype=int64))

In [33]:
# class_weight = 'balanced_subsample'
x12 = collisions[['INATTENTIONIND', 'UNDERINFL', 'SPEEDING','LIGHTCOND', 'ROADCOND', 'WEATHER']]
y12 = collisions.SEVERITYCODE
x_train12, x_test12, y_train12, y_test12 = train_test_split(x12, y12, test_size=0.2, random_state=1)

# modeling
RFC12 = RandomForestClassifier(n_estimators=10, class_weight='balanced_subsample')
RFC12.fit(x_train12,y_train12)
y_hat12 = RFC12.predict(x_test12)

# evaluating
print('Model Evaluation:',jaccard_similarity_score(y_test12, y_hat12).round(4))
y_prob12 = RFC12.predict_proba(x_test12)
print('Probabilistic Evaluation:',log_loss(y_test12, y_prob12).round(4))

Model Evaluation: 0.4779
Probabilistic Evaluation: 0.6833


In [34]:
np.unique(y_hat12,return_counts=True)

(array([1, 2]), array([12893, 26042], dtype=int64))

## __Results and Discussion__ <a name="results"></a>

In general the two models used (Logistic Regression, Random Forest Classifier) receive modest evaluations, both of which have no more than __63%__ accuracy in their evaluations with the Jaccard index. If we evaluate the probability that the model correctly classifies as prop damage or injurie, an improvement is observed that reaches up to 68% and 69%.

The RFC model additionally provides important data that is the feature importance which is very useful for making recommendations.

The results of the model for the internal causes that are attributed to the speed of the car, inattention while driving and the influence of alcohol are:

| Class_weight       | LR      |        | RFC     |        |
|--------------------|---------|--------|---------|--------|
|                    | Jaccard |LogLoss | Jaccard |LogLoss |
| default            | 0.5349  |0.6900  | 0.5349  |0.6900  |        
| balanced           | 0.6343  |0.6900  | 0.6343  |0.6895  |
| subsample          | NA      |NA      | 0.6343  |0.6900  |

The external causes referring to the light conditions, the state of the road, the weather; present the following evaluations:

| Class_weight       | LR      |        | RFC     |        |
|--------------------|---------|--------|---------|--------|
|                    | Jaccard |LogLoss | Jaccard |LogLoss |
| default            | 0.5391  |0.7443  | 0.5412  |0.6874  |        
| balanced           | 0.6343  |0.6900  | 0.4959  |0.6879  |
| subsample          | NA      |NA      | 0.4959  |0.6876  |

when modeling the complete dataset the precision drops to __47%__ but the probability of classification remains at __68%__.

| Class_weight       | RFC     |        |
|--------------------|---------|--------|
|                    | Jaccard |LogLoss |
| default            | 0.5513  |0.6864  |
| balanced           | 0.4741  |0.6842  |
| subsample          | 0.47.35 |0.6843  |

When looking at internal causes, the RFC provides additional information such as feature importance, so using subsampling, the evaluations are as follows:

In [35]:
list(zip(x4.columns,RFC4.feature_importances_.round(2)))

[('INATTENTIONIND', 0.44), ('UNDERINFL', 0.34), ('SPEEDING', 0.22)]

In the case of external causes the RFC with subsample indicates that light conditions of the road is the 62% of the injuries.

In [36]:
list(zip(x9.columns,RFC9.feature_importances_.round(2)))

[('LIGHTCOND', 0.59), ('ROADCOND', 0.21), ('WEATHER', 0.2)]

__When the six features are observed at the same time, a greater importance is clearly attributed to the light conditions (40%), then there is the influence of alcohol (16%) and road conditions(17%). It is striking that weather (6%) and speed (10%) are the features that least influence accidents with injuries to occupants.__

In [37]:
list(zip(x12.columns,RFC12.feature_importances_.round(2)))

[('INATTENTIONIND', 0.11),
 ('UNDERINFL', 0.16),
 ('SPEEDING', 0.1),
 ('LIGHTCOND', 0.4),
 ('ROADCOND', 0.17),
 ('WEATHER', 0.06)]

## __Conclusion__ <a name="conclusion"></a>

### When you go out to drive avoid drinking this can reduce injuries in traffic accidents by up to 34%. Respect the speed limits as you can reduce accidents by up to 22%, and above all do not get distracted while driving because 44% of accidents are caused by being distracted.
### It is possible that on the road the weather is not good and that the track conditions are not favorable, this is out of your control but the most important thing is the light conditions, if you have a lot of difficulty in vision it is better to postpone the trip, or otherwise always use the high beams to have better visibility.