In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline

In [2]:
crash_data = pd.read_csv('Crash_Analysis_System_CAS_data.csv')
print(crash_data.shape)
crash_data.head()

(655697, 89)


Unnamed: 0,X,Y,OBJECTID,crashYear,crashFinancialYear,crashSeverity,fatalCount,seriousInjuryCount,minorInjuryCount,multiVehicle,...,moped,motorCycle,otherVehicleType,schoolBus,suv,taxi,truck,unknownVehicleType,vanOrUtility,pedestrian
0,174.760665,-36.85792,1001,2000,1999/2000,M,0,0,1,Vehicle(s)+Pedestrian(s),...,0,0,0,0,0,1,0,0,0,1
1,174.764136,-36.846918,1002,2000,1999/2000,S,0,1,0,Multi vehicle,...,0,0,0,0,0,0,0,0,0,0
2,174.76991,-36.848337,1003,2000,1999/2000,M,0,0,1,Multi vehicle,...,0,0,0,0,0,0,0,0,0,0
3,174.73634,-36.892313,1004,2000,1999/2000,M,0,0,4,Multi vehicle,...,0,0,0,0,0,0,0,0,0,0
4,174.729205,-36.89486,1005,2000,1999/2000,M,0,0,1,Vehicle(s)+Cyclist(s) only,...,0,0,0,0,0,0,0,0,0,0


In [21]:
# outdated_location = crash_data[crash_data['outdatedLocationDescription'] != 'Current Location']
# print(outdated_location.shape)
counts = crash_data['outdatedLocationDescription'].value_counts()
for k,v in counts.items():
    print(k, '{:.2f}%'.format(v/crash_data.shape[0]*100))

Current location 99.35%
Outdated Location 0.65%
0 0.00%


In [20]:
# Share of each `cornerRoadSideRoad` value across all samples
counts = crash_data['cornerRoadSideRoad'].value_counts()
for k,v in counts.items():
    print(k, '{:.2f}%'.format(v/crash_data.shape[0]*100))

1.0 93.18%
2.0 6.82%
0.0 0.00%


In [18]:
crash_data[crash_data['outdatedLocationDescription'] == '0']

Unnamed: 0,X,Y,OBJECTID,crashYear,crashFinancialYear,crashSeverity,fatalCount,seriousInjuryCount,minorInjuryCount,multiVehicle,...,moped,motorCycle,otherVehicleType,schoolBus,suv,taxi,truck,unknownVehicleType,vanOrUtility,pedestrian
600786,174.752926,-36.845173,596787,2016,2016/2017,N,0,0,0,Multi vehicle,...,0,0,0,0,0,0,0,0,0,0


First, we will remove all the columns that won't be of much use.

We will remove columns `X`, `Y` and `OBJECTID`, since we won't be considering latitude or longitude and the ID of each crash sample carries no information about the crash.

Next, we get rid of `crashFinancialYear`, since `crashYear` already carries the same information more precisely. We also drop `tlaID`, since we will use `tlaName` to better understand the meaning of the data. If needed, we can come back to the respective IDs (for this and any other feature) when training the model for increased performance.

**`Easting` and `northing` could be leveraged within a region or crash location for 'in-cluster analysis'**

`outdatedLocationDescription` adds little value, as it is nearly constant: 99.35% of the samples are `Current location`.

As per the definition documents for the columns, `crashRSRP` doesn't carry any value associated with the crash itself, so we remove it as well.
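The removals listed above could be collected into a single step. The following is a minimal sketch, not the notebook's actual cell: `COLUMNS_TO_DROP` and `drop_unused_columns` are hypothetical names, and the toy frame (including its made-up `tlaName` value) only stands in for `crash_data`.

```python
import pandas as pd

# Columns named in the notes above as candidates for removal.
COLUMNS_TO_DROP = [
    'X', 'Y', 'OBJECTID',
    'crashFinancialYear',
    'tlaID',
    'outdatedLocationDescription',
    'crashRSRP',
]

def drop_unused_columns(df, columns=COLUMNS_TO_DROP):
    """Return a copy of df without the listed columns (hypothetical helper)."""
    # errors='ignore' skips names that are not present in the frame.
    return df.drop(columns=columns, errors='ignore')

# Toy frame standing in for crash_data; the values are made up.
toy = pd.DataFrame({
    'X': [174.7], 'Y': [-36.8], 'OBJECTID': [1001],
    'crashYear': [2000], 'crashFinancialYear': ['1999/2000'],
    'tlaName': ['Auckland'],
})
reduced = drop_unused_columns(toy)
print(list(reduced.columns))
```

Because `drop` returns a new frame, the original data stays available if any of these columns (for example the IDs) is needed again later.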


`cornerRoadSideRoad` is actually `Crash Road Side Road`.

<u>**CHECK FOR MISSING VALUES ON EACH FEATURES AND DECIDE ON COURSE OF ACTION**</u>


**UPLOAD EVERYTHING TO <u>GITHUB REGULARLY</u>**

In [3]:
crash_data = crash_data.drop(crash_data[crash_data.crashYear == 2018].index)
print(crash_data.shape)

(646400, 89)


In [4]:
from operator import itemgetter
missing_values = {}
for column in crash_data.columns:
    missing = crash_data[column].isna().sum()
    if missing:
        missing_values[column] = missing
for column, missing in sorted(missing_values.items(), key=itemgetter(1), reverse=True):
    print('{0} ({2:.2f}%):{1:0,}'.format(column, missing, missing/crash_data.shape[0] * 100))

crashDirectionDescription (37.84%):244,608
trafficControl (21.96%):141,949
crashRPDirectionDescription (14.55%):94,040
crashRPSH (12.10%):78,225
crashRPNewsDescription (6.63%):42,844
roadLane (1.61%):10,407
speedLimit (0.00%):10
cornerRoadSideRoad (0.00%):2


In [5]:
outdated_locations = crash_data['outdatedLocationDescription'].value_counts()['Outdated Location'] / crash_data.shape[0] * 100
print('Samples with outdated Locations: {:.2f}%'.format(outdated_locations))

Samples with outdated Locations: 0.66%


In [6]:
# Distribution of the `ditch` feature, as raw counts and as a share of all samples
ditch_counts = crash_data['ditch'].value_counts(dropna=False)
print(ditch_counts)
ditch_share = ditch_counts / crash_data.shape[0]
print(ditch_share)


0    622537
1     23714
2       147
3         2
Name: ditch, dtype: int64
0    0.963083
1    0.036686
2    0.000227
3    0.000003
Name: ditch, dtype: float64


We treat `crashDirectionDescription` and `directionRoleDescription` as features that describe the mechanics of the crash rather than the conditions under which it happened, so we disregard them.

Or maybe not?

Yes, remove it!
It is supposed to refer to the principal car, but there's no equivalent feature for any other vehicles involved.


REMOVE `outdatedLocationDescription`

**Read all descriptions carefully. There are many features that are redundant or derived by mixing/joining others**

**Check each feature individually: look at its different values and their distribution to understand the feature's usefulness**
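A per-feature check like this could be wrapped in a small helper so it can be repeated for every column. This is a sketch under assumptions: `value_distribution` is a hypothetical name, and the toy frame stands in for `crash_data`.

```python
import pandas as pd

def value_distribution(df, column):
    """Counts and percentage share of each value in a column, NaN included."""
    counts = df[column].value_counts(dropna=False)
    return pd.DataFrame({'count': counts, 'percent': counts / len(df) * 100})

# Toy frame standing in for crash_data.
toy = pd.DataFrame({'ditch': [0, 0, 0, 1]})
summary = value_distribution(toy, 'ditch')
print(summary)
```

On the real data the same call could be looped over `crash_data.columns`, which would also surface columns that look poorly populated.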

It looks like `advisorySpeed` and `temporarySpeedLimit` are not populated properly.

**HOW TO ASSESS FEATURE RELEVANCE WHEN 98% OF SAMPLES SHARE A SINGLE VALUE?**
How to do feature selection?
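One possible first step is simply to list which features are dominated by a single value. The sketch below assumes a hypothetical helper `dominant_value_share` and a toy frame in place of `crash_data`; the 0.98 threshold mirrors the figure in the question above. A high share only flags a column for review, since the rare values may still be informative.

```python
import pandas as pd

def dominant_value_share(df):
    """Share of rows taken by the most frequent value of each column."""
    shares = {
        # value_counts sorts by frequency, so the first entry is the dominant value.
        column: df[column].value_counts(dropna=False, normalize=True).iloc[0]
        for column in df.columns
    }
    return pd.Series(shares).sort_values(ascending=False)

# Toy frame standing in for crash_data: 'ditch' is 99% zeros.
toy = pd.DataFrame({
    'ditch': [0] * 99 + [1],
    'crashSeverity': ['N', 'M'] * 50,
})
shares = dominant_value_share(toy)
near_constant = shares[shares >= 0.98]
print(near_constant)
```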

Idea:<br>
Build basic models, basic in both the algorithm and the features used (i.e. all features), and then do feature selection, dropping features only as long as accuracy is maintained.
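The idea could take the form of a greedy backward elimination. This is only a sketch of one way to do it, not a settled method: `backward_eliminate` and `score_fn` are hypothetical names, and `score_fn` stands for whatever routine trains a basic model on a given feature list and returns its accuracy. The toy score below is made up purely to show the mechanics.

```python
def backward_eliminate(features, score_fn, tolerance=0.0):
    """Greedily drop features while the score stays within `tolerance` of the baseline.

    score_fn is a hypothetical callable: it takes a list of feature names and
    returns a score, for example the accuracy of a basic model.
    """
    kept = list(features)
    baseline = score_fn(kept)  # score with all features
    for feature in list(features):
        if len(kept) == 1:
            break  # always keep at least one feature
        candidate = [f for f in kept if f != feature]
        # Drop the feature only if the score does not fall below the baseline.
        if score_fn(candidate) >= baseline - tolerance:
            kept = candidate
    return kept

# Made-up score: only 'speedLimit' and 'trafficControl' contribute.
def toy_score(feature_list):
    useful = {'speedLimit': 0.25, 'trafficControl': 0.125}
    return 0.5 + sum(useful.get(f, 0.0) for f in feature_list)

selected = backward_eliminate(
    ['speedLimit', 'ditch', 'trafficControl', 'roadLane'], toy_score)
print(selected)
```

The result of a greedy pass depends on the order in which features are tried, so it is a starting point and not a guarantee of the best subset.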

In [7]:
display(crash_data[crash_data['ditch'] > 2])

Unnamed: 0,X,Y,OBJECTID,crashYear,crashFinancialYear,crashSeverity,fatalCount,seriousInjuryCount,minorInjuryCount,multiVehicle,...,moped,motorCycle,otherVehicleType,schoolBus,suv,taxi,truck,unknownVehicleType,vanOrUtility,pedestrian
263340,176.919447,-39.571209,266341,2007,2006/2007,M,0,0,3,Multi vehicle,...,0,0,0,0,0,0,1,0,0,0
296227,170.138316,-45.963865,298228,2007,2006/2007,N,0,0,0,Multi vehicle,...,0,0,0,0,1,0,0,0,0,0


In [8]:
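# Widen the pandas display limits so wide frames are shown in full.
# ('display.height' is not set: the option no longer exists in current pandas.)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 100)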


Check `crashLocation1` and `crashLocation2`. It seems their hierarchy is not clear

`crashDistance` may not be useful, as it indicates the distance from the main reference point for the crash. It doesn't say much about the crash itself.

Check `multiVehicle` and its possible derived values.

In [10]:
crash_data[['crashSeverity', 'fatalCount']]

Unnamed: 0,crashSeverity,fatalCount
0,M,0
1,S,0
2,M,0
3,M,0
4,M,0
5,M,0
6,M,0
7,M,0
8,M,0
9,M,0


In [40]:
crash_data.groupby(['crashSeverity', 'fatalCount']).size().reset_index().sort_values(0, ascending=False)

Unnamed: 0,crashSeverity,fatalCount,0
8,N,0,460771
7,M,0,143932
9,S,0,35756
0,F,1,5355
1,F,2,447
2,F,3,97
3,F,4,34
4,F,5,4
5,F,6,3
6,F,9,1


## <u>Goal:</u>

Try to classify `crashSeverity` and, if the class is *F*, try to predict `fatalCount`.
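In terms of data preparation, this two-stage goal could translate into two targets: `crashSeverity` over all samples, and `fatalCount` over the fatal samples only. The sketch below assumes a hypothetical helper `split_targets` and a toy frame in place of `crash_data`; the choice of models is left open.

```python
import pandas as pd

def split_targets(df):
    """Build the targets for the two stages (hypothetical helper)."""
    # Stage 1: severity class for every sample.
    y_severity = df['crashSeverity']
    # Stage 2: fatality count, restricted to samples of class 'F'.
    fatal_rows = df[df['crashSeverity'] == 'F']
    y_fatal_count = fatal_rows['fatalCount']
    return y_severity, fatal_rows, y_fatal_count

# Toy frame standing in for crash_data.
toy = pd.DataFrame({
    'crashSeverity': ['N', 'M', 'F', 'S', 'F'],
    'fatalCount': [0, 0, 1, 0, 2],
})
y_severity, fatal_rows, y_fatal_count = split_targets(toy)
print(len(y_severity), len(fatal_rows), list(y_fatal_count))
```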