# Capstone Project

In this project, I use public traffic collision data published by the French government.
Three files are involved, describing:
- the people involved in the collision (age, gender, route type)
- the conditions surrounding the collision (weather, light conditions)
- the location where the collision took place (road type, traffic pattern, road surface conditions)

We will retain the following variables; severity is the target, and the others are predictors:
- light: <br/>
    1- Daylight <br/>
    2- Dusk <br/>
    3- Night with no public lights <br/>
    4- Night with public lights turned off <br/>
    5- Night with public lights turned on <br/>
- weather:  <br/>
    1- Normal <br/>
    2- Light rain <br/>
    3- Heavy rain <br/>
    4- Snow or hail <br/>
    5- Fog or smoke <br/>
    6- Strong wind / Storm <br/>
    7- Very sunny <br/>
    8- Overcast <br/>
    9- Other <br/>
- severity <br/>
    1- No injury <br/>
    2- Death <br/>
    3- Serious injuries <br/>
    4- Light injuries <br/>
- Road type <br/>
    1- Highway <br/>
    2- National road <br/>
    3- Departmental road <br/>
    4- Town road <br/>
    5- Non-public road <br/>
    6- Parking lot <br/>
    9- Other <br/>
- traffic pattern <br/>
    1- One way <br/>
    2- Two way, no divider <br/>
    3- Two way, with divider <br/>
    4- Alternate circulation lanes <br/>
- surface conditions <br/>
    1- dry <br/>
    2- wet <br/>
    3- puddles <br/>
    4- flooded <br/>
    5- snowy <br/>
    6- muddy <br/>
    7- icy <br/>
    8- oily <br/>
    9- other <br/>
- person type <br/>
    1- driver <br/>
    2- passenger <br/>
    3- pedestrian <br/>
    4- other <br/>
- sex <br/>
    1- male <br/>
    2- female <br/>
- birth year <br/>
- route <br/>
    1- work commute <br/>
    2- Home - school <br/>
    3- Shopping <br/>
    4- Work usage <br/>
    5- Leisure <br/>
    9- Other <br/>
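
Because every variable is stored as an integer code, a small lookup table can make later outputs easier to read. The sketch below is only an illustration: `LIGHT_LABELS` is a hypothetical dictionary copied from the light codes listed above, and it is applied to a toy Series rather than to the real data.

```python
import pandas as pd

# Hypothetical lookup table, copied from the 'light' codes listed above.
LIGHT_LABELS = {
    1: "Daylight",
    2: "Dusk",
    3: "Night with no public lights",
    4: "Night with public lights turned off",
    5: "Night with public lights turned on",
}

# Toy codes standing in for the real light column.
codes = pd.Series([1, 5, 3, 1, 2])

# map() replaces each code by its label; codes missing from the table become NaN.
labels = codes.map(LIGHT_LABELS)
print(labels.value_counts())
```

The same pattern would apply to the other coded variables, with one dictionary per variable.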
    




Let's import all the libraries we'll need for this project:

In [1]:
import pandas as pd
import pylab as pl
import numpy as np
import scipy.optimize as opt
from sklearn import preprocessing
%matplotlib inline 
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

In [20]:
pd.options.mode.chained_assignment = None  # default='warn'

Let's get the data. There are 3 different files, which we will merge on the Num_Acc key (collision ID).

In [11]:
!wget -O data.csv https://www.data.gouv.fr/en/datasets/r/6eee0852-cbd7-447e-bd70-37c433029405

--2020-10-13 17:50:54--  https://www.data.gouv.fr/en/datasets/r/6eee0852-cbd7-447e-bd70-37c433029405
Resolving www.data.gouv.fr (www.data.gouv.fr)... 37.59.183.93
Connecting to www.data.gouv.fr (www.data.gouv.fr)|37.59.183.93|:443... connected.
HTTP request sent, awaiting response... 302 FOUND
Location: https://static.data.gouv.fr/resources/base-de-donnees-accidents-corporels-de-la-circulation/20191014-111741/caracteristiques-2018.csv [following]
--2020-10-13 17:50:56--  https://static.data.gouv.fr/resources/base-de-donnees-accidents-corporels-de-la-circulation/20191014-111741/caracteristiques-2018.csv
Resolving static.data.gouv.fr (static.data.gouv.fr)... 37.59.183.93
Connecting to static.data.gouv.fr (static.data.gouv.fr)|37.59.183.93|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4638375 (4.4M) [text/csv]
Saving to: ‘data.csv’


2020-10-13 17:50:59 (4.84 MB/s) - ‘data.csv’ saved [4638375/4638375]



In [12]:
df_data = pd.read_csv("data.csv",encoding='latin-1')
df_data.shape

(57783, 16)

In [13]:
df_data.head()

Unnamed: 0,Num_Acc,an,mois,jour,hrmn,lum,agg,int,atm,col,com,adr,gps,lat,long,dep
0,201800000001,18,1,24,1505,1,1,4,1.0,1.0,5,route des Ansereuilles,M,5055737.0,294992.0,590
1,201800000002,18,2,12,1015,1,2,7,7.0,7.0,11,Place du général de Gaul,M,5052936.0,293151.0,590
2,201800000003,18,3,4,1135,1,2,3,1.0,7.0,477,Rue nationale,M,5051243.0,291714.0,590
3,201800000004,18,5,5,1735,1,2,1,7.0,3.0,52,30 rue Jules Guesde,M,5051974.0,289123.0,590
4,201800000005,18,6,26,1605,1,2,1,1.0,3.0,477,72 rue Victor Hugo,M,5051607.0,290605.0,590


Let's get the data on the people involved (drivers, passengers and pedestrians):

In [15]:
!wget -O drivers.csv https://www.data.gouv.fr/en/datasets/r/72b251e1-d5e1-4c46-a1c2-c65f1b26549a

--2020-10-13 17:51:11--  https://www.data.gouv.fr/en/datasets/r/72b251e1-d5e1-4c46-a1c2-c65f1b26549a
Resolving www.data.gouv.fr (www.data.gouv.fr)... 37.59.183.93
Connecting to www.data.gouv.fr (www.data.gouv.fr)|37.59.183.93|:443... connected.
HTTP request sent, awaiting response... 302 FOUND
Location: https://static.data.gouv.fr/resources/base-de-donnees-accidents-corporels-de-la-circulation/20191014-112100/usagers-2018.csv [following]
--2020-10-13 17:51:12--  https://static.data.gouv.fr/resources/base-de-donnees-accidents-corporels-de-la-circulation/20191014-112100/usagers-2018.csv
Resolving static.data.gouv.fr (static.data.gouv.fr)... 37.59.183.93
Connecting to static.data.gouv.fr (static.data.gouv.fr)|37.59.183.93|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5273810 (5.0M) [text/csv]
Saving to: ‘drivers.csv’


2020-10-13 17:51:13 (4.56 MB/s) - ‘drivers.csv’ saved [5273810/5273810]



In [16]:
df_drivers = pd.read_csv("drivers.csv",encoding='latin-1')
df_drivers.head()

Unnamed: 0,Num_Acc,place,catu,grav,sexe,trajet,secu,locp,actp,etatp,an_nais,num_veh
0,201800000001,1.0,1,3,1,0.0,11.0,0.0,0.0,0.0,1928.0,B01
1,201800000001,1.0,1,1,1,5.0,11.0,0.0,0.0,0.0,1960.0,A01
2,201800000002,1.0,1,1,1,0.0,11.0,0.0,0.0,0.0,1947.0,A01
3,201800000002,,3,4,1,0.0,2.0,2.0,3.0,1.0,1959.0,A01
4,201800000003,1.0,1,3,1,5.0,21.0,0.0,0.0,0.0,1987.0,A01


Let's see the severity breakdown:

In [17]:
df_drivers['grav'].value_counts()

1    54248
4    50360
3    22169
2     3392
Name: grav, dtype: int64

We see that serious collisions (codes 2 and 3) are far fewer than non-serious ones (codes 1 and 4). As we want to predict serious collisions, we will need to balance the dataset later on.

Let's merge the first 2 files:

In [23]:
df=pd.merge(df_drivers,df_data,on='Num_Acc',how='outer')
#df_drivers.join(df_data,on='Num_Acc',lsuffix='_left', rsuffix='_right')

In [24]:
df.head()

Unnamed: 0,Num_Acc,place,catu,grav,sexe,trajet,secu,locp,actp,etatp,...,agg,int,atm,col,com,adr,gps,lat,long,dep
0,201800000001,1.0,1,3,1,0.0,11.0,0.0,0.0,0.0,...,1,4,1.0,1.0,5,route des Ansereuilles,M,5055737.0,294992.0,590
1,201800000001,1.0,1,1,1,5.0,11.0,0.0,0.0,0.0,...,1,4,1.0,1.0,5,route des Ansereuilles,M,5055737.0,294992.0,590
2,201800000002,1.0,1,1,1,0.0,11.0,0.0,0.0,0.0,...,2,7,7.0,7.0,11,Place du général de Gaul,M,5052936.0,293151.0,590
3,201800000002,,3,4,1,0.0,2.0,2.0,3.0,1.0,...,2,7,7.0,7.0,11,Place du général de Gaul,M,5052936.0,293151.0,590
4,201800000003,1.0,1,3,1,5.0,21.0,0.0,0.0,0.0,...,2,3,1.0,7.0,477,Rue nationale,M,5051243.0,291714.0,590


In [25]:
!wget -O locations.csv https://www.data.gouv.fr/en/datasets/r/d9d65ca1-16a3-4ea3-b7c8-2412c92b69d9

--2020-10-13 17:54:11--  https://www.data.gouv.fr/en/datasets/r/d9d65ca1-16a3-4ea3-b7c8-2412c92b69d9
Resolving www.data.gouv.fr (www.data.gouv.fr)... 37.59.183.93
Connecting to www.data.gouv.fr (www.data.gouv.fr)|37.59.183.93|:443... connected.
HTTP request sent, awaiting response... 302 FOUND
Location: https://static.data.gouv.fr/resources/base-de-donnees-accidents-corporels-de-la-circulation/20191014-112036/lieux-2018.csv [following]
--2020-10-13 17:54:12--  https://static.data.gouv.fr/resources/base-de-donnees-accidents-corporels-de-la-circulation/20191014-112036/lieux-2018.csv
Resolving static.data.gouv.fr (static.data.gouv.fr)... 37.59.183.93
Connecting to static.data.gouv.fr (static.data.gouv.fr)|37.59.183.93|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2781213 (2.7M) [text/csv]
Saving to: ‘locations.csv’


2020-10-13 17:54:14 (3.34 MB/s) - ‘locations.csv’ saved [2781213/2781213]



In [26]:
df_locations = pd.read_csv("locations.csv",encoding='latin-1')
df_locations.head()

Unnamed: 0,Num_Acc,catr,voie,v1,v2,circ,nbv,pr,pr1,vosp,prof,plan,lartpc,larrout,surf,infra,situ,env1
0,201800000001,3,41.0,,C,2.0,2.0,,,0.0,1.0,3.0,,,1.0,0.0,1.0,0.0
1,201800000002,4,41.0,,D,2.0,2.0,,,0.0,1.0,2.0,,,1.0,0.0,1.0,0.0
2,201800000003,3,39.0,,D,2.0,2.0,,,0.0,1.0,1.0,,,1.0,0.0,1.0,0.0
3,201800000004,3,39.0,,,2.0,2.0,,,0.0,1.0,1.0,,,1.0,0.0,1.0,0.0
4,201800000005,4,,,,1.0,1.0,,,0.0,1.0,1.0,,,1.0,0.0,1.0,0.0


In [27]:
df=pd.merge(df,df_locations,on='Num_Acc',how='outer')
#df_drivers.join(df_data,on='Num_Acc',lsuffix='_left', rsuffix='_right')

In [28]:
df.shape

(130169, 44)
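
The merged frame has one row per person, because the people file holds several rows per collision while the other two files appear to hold one row per `Num_Acc`. A minimal sketch of how that assumption could be checked during the merge, using made-up toy frames (`users` and `collisions` are hypothetical names) and the `validate` argument of `pd.merge`:

```python
import pandas as pd

# Toy stand-ins for the people file (several rows per collision) and the
# characteristics file (one row per collision); the values are made up.
users = pd.DataFrame({
    "Num_Acc": [1, 1, 2, 3],
    "grav": [3, 1, 4, 1],
})
collisions = pd.DataFrame({
    "Num_Acc": [1, 2, 3],
    "lum": [1, 5, 3],
})

# validate="many_to_one" raises MergeError if Num_Acc is duplicated on the
# right-hand side, which would otherwise silently multiply rows.
merged = pd.merge(users, collisions, on="Num_Acc", how="outer",
                  validate="many_to_one")

print(merged.shape)  # one row per person, with the collision columns attached
```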

In [29]:
# .copy() makes `data` an independent frame, so later in-place edits do not touch `df`
data=df[['lum','atm','grav','catr','surf','catu','sexe','an_nais','trajet','circ']].copy()
data.head()

Unnamed: 0,lum,atm,grav,catr,surf,catu,sexe,an_nais,trajet,circ
0,1,1.0,3,3,1.0,1,1,1928.0,0.0,2.0
1,1,1.0,1,3,1.0,1,1,1960.0,5.0,2.0
2,1,7.0,1,4,1.0,1,1,1947.0,0.0,2.0
3,1,7.0,4,4,1.0,3,1,1959.0,0.0,2.0
4,1,1.0,3,3,1.0,1,1,1987.0,5.0,2.0


In [30]:
print(data.columns)

Index(['lum', 'atm', 'grav', 'catr', 'surf', 'catu', 'sexe', 'an_nais',
       'trajet', 'circ'],
      dtype='object')


Let's rename the columns for clarity:

In [33]:
data.columns=['light','weather','severity','road_type','road_condition','driver_type','sex','birth_year','route_type','traffic_pattern']

In [34]:
data.head()

Unnamed: 0,light,weather,severity,road_type,road_condition,driver_type,sex,birth_year,route_type,traffic_pattern
0,1,1.0,3,3,1.0,1,1,1928.0,0.0,2.0
1,1,1.0,1,3,1.0,1,1,1960.0,5.0,2.0
2,1,7.0,1,4,1.0,1,1,1947.0,0.0,2.0
3,1,7.0,4,4,1.0,3,1,1959.0,0.0,2.0
4,1,1.0,3,3,1.0,1,1,1987.0,5.0,2.0


Let's look at the light breakdown:

In [30]:
data['light'].value_counts()

1    87798
5    19777
3    12783
2     8584
4     1227
Name: light, dtype: int64

In [59]:
df1=pd.crosstab(data.severity,data.light,normalize='index')
df1.head()

light,1,2,3,4,5
severity,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,0.697574,0.065551,0.079948,0.008627,0.1483
2,0.577535,0.073408,0.246167,0.011498,0.091392
3,0.651856,0.070008,0.138301,0.009337,0.130498
4,0.666124,0.064079,0.09025,0.010187,0.169361


We can see that the share of collisions in daylight is noticeably lower for deadly collisions (severity=2, about 58%) than for the other severity levels (65% to 70%). We also see that 'night with no public lights' accounts for almost <b>a quarter</b> of deadly collisions.
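
The shares above come from `normalize='index'`, which divides each row of the crosstab by its row total. A small sketch on made-up data (the `toy` frame is hypothetical) shows the effect:

```python
import pandas as pd

# Toy data: severity (rows) against light condition (columns); values are made up.
toy = pd.DataFrame({
    "severity": [1, 1, 1, 1, 2, 2, 2, 2],
    "light":    [1, 1, 1, 3, 1, 1, 3, 3],
})

# normalize='index' divides each row by its row total, so each severity level
# gets its own distribution over light conditions.
shares = pd.crosstab(toy.severity, toy.light, normalize="index")
print(shares)
```

Each row sums to 1, so rows can be compared with each other even when the severity levels have very different counts.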

In [60]:
data['road_condition'].value_counts()

1.0    103559
2.0     22760
0.0       729
9.0       557
7.0       492
5.0       386
8.0       233
3.0       216
4.0       159
6.0        52
Name: road_condition, dtype: int64

We see that most collisions occur on a dry road (code 1), that is, in dry and clement conditions. This is counterintuitive and may suggest that collisions are due to lack of attention rather than to underestimating the risks caused by external conditions. Note that code 0, which is not in the list above, also appears in the data.

Let's recode the severity levels and group 'death' and 'serious injuries' together (coded 1), since these are the outcomes we most want to avoid; 'no injury' and 'light injuries' become 0.

In [61]:
data.loc[data.severity==1,'severity']=0
data.loc[data.severity==4,'severity']=0
data.loc[data.severity==2,'severity']=1
data.loc[data.severity==3,'severity']=1

In [62]:
data['severity'].head(10)

0    1
1    0
2    0
3    0
4    1
5    0
6    1
7    0
8    0
9    0
Name: severity, dtype: int64
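
The four assignments above work because codes 1 and 4 are set to 0 before codes 2 and 3 are set to 1. As an alternative sketch that does not depend on the order of the statements, the same recoding can be written with `isin` (toy values, hypothetical variable names):

```python
import pandas as pd

# Toy severity codes using the original coding:
# 1 = no injury, 2 = death, 3 = serious injuries, 4 = light injuries.
severity = pd.Series([3, 1, 1, 4, 3, 2])

# True for death or serious injury, then cast to 0/1.
serious = severity.isin([2, 3]).astype(int)
print(serious.tolist())  # [1, 0, 0, 0, 1, 1]
```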

Let's remove the rows with empty values:

In [63]:
data.dropna(subset=['light','weather','road_type','road_condition','driver_type','sex','birth_year','route_type','traffic_pattern'],inplace=True)

In [64]:
data['severity'].value_counts()

0    103646
1     25287
Name: severity, dtype: int64

We see that the dataset is imbalanced; we need to remedy that.

In [65]:
g=data.groupby(['severity'])

In [66]:
g.apply(lambda x: x.sample(g.size().min()).reset_index(drop=True))


Unnamed: 0_level_0,Unnamed: 1_level_0,light,weather,severity,road_type,road_condition,driver_type,sex,birth_year,route_type,traffic_pattern
severity,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
0,0,1,2.0,0,4,2.0,1,2,1981.0,9.0,2.0
0,1,1,1.0,0,4,1.0,1,1,1992.0,0.0,1.0
0,2,1,1.0,0,4,1.0,1,1,1978.0,0.0,2.0
0,3,1,1.0,0,2,1.0,1,1,1973.0,4.0,3.0
0,4,1,1.0,0,4,1.0,1,2,1979.0,2.0,0.0
0,5,3,1.0,0,1,1.0,2,1,2004.0,5.0,3.0
0,6,1,1.0,0,3,1.0,1,1,1973.0,1.0,2.0
0,7,1,1.0,0,4,1.0,1,1,1974.0,0.0,1.0
0,8,1,1.0,0,1,1.0,1,1,1968.0,4.0,3.0
0,9,3,2.0,0,2,2.0,1,2,1988.0,9.0,2.0


In [67]:
# Keeps at most 50594 rows per class (the sample drawn above was not assigned)
data=g.head(50594)
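
In the cells above, the result of `g.apply(...)` is displayed but not assigned, and `g.head(50594)` keeps the first rows of each class rather than a random sample of equal size. A minimal sketch of random downsampling to the size of the smallest class, on made-up data (`toy`, `n_min` and `balanced` are hypothetical names), could look like this:

```python
import pandas as pd

# Toy imbalanced data: 6 rows of class 0 and 2 rows of class 1 (made-up values).
toy = pd.DataFrame({
    "severity": [0, 0, 0, 0, 0, 0, 1, 1],
    "light":    [1, 1, 5, 3, 1, 2, 3, 1],
})

# Size of the smallest class.
n_min = toy["severity"].value_counts().min()

# Draw n_min rows at random from each class and keep the result.
balanced = pd.concat(
    [grp.sample(n=n_min, random_state=0) for _, grp in toy.groupby("severity")]
).reset_index(drop=True)

print(balanced["severity"].value_counts())
```

Fixing `random_state` makes the draw reproducible; applying this to the real data would change the row counts and scores reported below.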

Let's define the feature set:

In [68]:
X=np.asarray(data[['light','weather','road_type','road_condition','driver_type','sex','birth_year','route_type','traffic_pattern']])

In [69]:
y=np.asarray(data['severity'])
y[0:5]

array([1, 0, 0, 0, 1])

Let's standardize the data now:

In [71]:
from sklearn import preprocessing
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]

array([[-0.60050067, -0.37577869, -0.27683967, -0.30492995, -0.58656581,
        -0.70374075, -2.52156173, -1.40913175,  0.09101517],
       [-0.60050067, -0.37577869, -0.27683967, -0.30492995, -0.58656581,
        -0.70374075, -0.92686097,  0.56268188,  0.09101517],
       [-0.60050067,  3.15272949,  0.62944926, -0.30492995, -0.58656581,
        -0.70374075, -1.57470815, -1.40913175,  0.09101517],
       [-0.60050067,  3.15272949,  0.62944926, -0.30492995,  2.53360247,
        -0.70374075, -0.97669537, -1.40913175,  0.09101517],
       [-0.60050067, -0.37577869, -0.27683967, -0.30492995, -0.58656581,
        -0.70374075,  0.41866779,  0.56268188,  0.09101517]])
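
`StandardScaler` rescales each column to zero mean and unit variance, i.e. z = (x - mean) / std. The cell above fits it on the full matrix; a common alternative is to compute the mean and standard deviation on the training rows only, so that the test rows do not influence the scaling. The NumPy sketch below illustrates that idea on made-up values and is not the notebook's own procedure:

```python
import numpy as np

# Toy feature matrix: 4 rows, 2 features (made-up values).
X_toy = np.array([[1.0, 1960.0],
                  [1.0, 1980.0],
                  [5.0, 1990.0],
                  [3.0, 2000.0]])

# Use the first 3 rows as "training" data and the last row as "test" data.
X_tr, X_te = X_toy[:3], X_toy[3:]

# Column means and population standard deviations (ddof=0), training rows only.
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)

# The same training statistics are applied to both parts.
X_tr_std = (X_tr - mu) / sigma
X_te_std = (X_te - mu) / sigma

print(X_tr_std.mean(axis=0))  # approximately 0 for each column
print(X_tr_std.std(axis=0))   # approximately 1 for each column
```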

Let's partition the data so that we train the model first, then test it:

In [73]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.3, random_state=4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

Train set: (53116, 9) (53116,)
Test set: (22765, 9) (22765,)


In [74]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
LR

LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)

In [75]:
yhat = LR.predict(X_test)
yhat

array([0, 0, 0, ..., 0, 0, 0])

In [76]:
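# jaccard_similarity_score was deprecated in scikit-learn 0.21 and later removed;
# for binary labels it returned the same value as accuracy_score, used here instead
from sklearn.metrics import accuracy_score
accuracy_score(y_test, yhat)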

0.6777070063694267

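The model scores 0.68, which is not very good: about two thirds of the retained rows (50594 of 75881) belong to class 0, so always predicting 0 would already score about 0.67. Let's see what a decision tree algorithm would produce.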

In [77]:
from sklearn.model_selection import train_test_split

In [78]:
drugTree = DecisionTreeClassifier(criterion="entropy", max_depth = 5)
drugTree # it shows the default parameters

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [79]:
drugTree.fit(X_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [81]:
predTree = drugTree.predict(X_test)

In [82]:
print (predTree [0:5])
print (y_test [0:5])

[0 0 0 0 0]
[0 1 1 0 1]


In [83]:
print("DecisionTrees's Accuracy: ", metrics.accuracy_score(y_test, predTree))

DecisionTrees's Accuracy:  0.6837250164726554


The decision tree is a tiny bit better (0.684 against 0.678), but neither model performs well. This suggests that the factors that determine how serious a collision is are other than those listed in the reports: for instance, lack of attention or cell phone use. Factors such as safe following distance or traffic density could also play a role.
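
Accuracy alone can be hard to interpret when one class dominates. A minimal sketch, on made-up labels, of two complementary checks: a majority-class baseline and the recall on class 1 (`y_true`, `y_pred`, `baseline` and `recall_serious` are hypothetical names, not results from the models above):

```python
import numpy as np

# Toy labels: 6 of 9 rows are class 0, and the model predicts 0 almost everywhere.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1])

# Plain accuracy: share of correct predictions.
accuracy = (y_true == y_pred).mean()

# Baseline: accuracy obtained by always predicting the most frequent class.
baseline = np.bincount(y_true).max() / len(y_true)

# Recall on class 1: share of serious cases that the model actually finds.
recall_serious = (y_pred[y_true == 1] == 1).mean()

print(round(accuracy, 3), round(baseline, 3), round(recall_serious, 3))
```

In this toy case the accuracy looks acceptable while most class 1 rows are missed, which is the kind of gap these two numbers are meant to expose.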