# <font color='darkgreen'>Applied Data Science Capstone: Final Assignment Project</font>

---

## <font color='green'>Introduction</font>

Traffic accidents have an adverse effect at both the individual and the societal level. They cause costs through personal injuries and property damage, and congestion after an accident increases travel times and emissions.

Although the outcome of one accident can differ greatly from another, the underlying causes are often similar. Roads shrouded in dense fog make driving dangerous by limiting visibility of the road ahead. Heavy rain or snowfall can also create treacherous driving conditions that affect a large number of road users.

The question is:

### <i>Given the weather and road conditions, how likely is a person to get into a car accident, and how severe would it be? With that knowledge, drivers could drive more carefully or, where possible, change their travel plans.</i>

#### The target audience for this research is everyone who holds a driving license.

---

## <font color='green'>Data description</font>

#### High level description

The dataset covers all types of collisions recorded from 2004 to the present. Each record includes information such as the severity of the accident, the weather, whether the collision involved hitting a parked car, and whether the driver was under the influence of drugs or alcohol (to distinguish accidents that are not necessarily connected with weather conditions).

#### Features details

<b>SEVERITYCODE</b>- a code that corresponds to the severity of the collision: 3: fatality; 2b: serious injury; 2: injury; 1: property damage; 0: unknown<br>
<b>UNDERINFL</b>- whether or not a driver involved was under the influence of drugs or alcohol<br>
<b>WEATHER</b>- a description of the weather conditions during the time of the collision<br>
<b>ROADCOND</b>- the condition of the road during the collision<br>
<b>LIGHTCOND</b>- the light conditions during the collision<br>
<b>SPEEDING</b>- whether or not speeding was a factor in the collision<br>
<b>HITPARKEDCAR</b>- whether or not the collision involved hitting a parked car<br>
<b>INATTENTIONIND</b>- whether or not the collision was due to inattention
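
The severity codes listed above can be turned into readable labels with a simple lookup. The following is a minimal sketch, not part of the original notebook: the names `SEVERITY_LABELS`, `toy` and `SEVERITY_LABEL` are hypothetical, and the rows are made up for illustration.

```python
import pandas as pd

# Code-to-description lookup, copied from the feature list above
SEVERITY_LABELS = {
    "3": "fatality",
    "2b": "serious injury",
    "2": "injury",
    "1": "property damage",
    "0": "unknown",
}

# Toy rows for illustration only; not taken from the dataset
toy = pd.DataFrame({"SEVERITYCODE": [2, 1, 1, 2]})

# Cast to str first because the '2b' code forces the lookup keys to be strings
toy["SEVERITY_LABEL"] = toy["SEVERITYCODE"].astype(str).map(SEVERITY_LABELS)
print(toy)
```

Keeping the keys as strings is a design choice that lets the same lookup handle both the numeric codes and the mixed code `2b`.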

#### Example

In [91]:
df[4:5]

Unnamed: 0,WEATHER,SEVERITYCODE,ROADCOND,LIGHTCOND,SPEEDING,HITPARKEDCAR,UNDERINFL,INATTENTIONIND
4,Raining,2,Wet,Daylight,N,N,N,N


The example above shows an incident that resulted in injuries. No parked vehicle was involved, and neither speeding nor drugs/alcohol was a factor. The incident occurred in daylight, so the wet road and rainy weather were probably the root cause.

---

## <font color='green'>Data preparation</font>

### Importing necessary modules

In [4]:
import pandas as pd
import pylab as pl
import numpy as np
import scipy.optimize as opt
from sklearn import preprocessing
%matplotlib inline 
import matplotlib.pyplot as plt

### Reading the input dataset

In [112]:
# The code was removed by Watson Studio for sharing.

In [113]:
fullcoll_df = pd.read_csv(body)
fullcoll_df.head()



Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


### Picking needed columns

In [192]:
df=fullcoll_df[['WEATHER','SEVERITYCODE','ROADCOND','LIGHTCOND','SPEEDING','HITPARKEDCAR','UNDERINFL','INATTENTIONIND']]
df.head()

Unnamed: 0,WEATHER,SEVERITYCODE,ROADCOND,LIGHTCOND,SPEEDING,HITPARKEDCAR,UNDERINFL,INATTENTIONIND
0,Overcast,2,Wet,Daylight,,N,N,
1,Raining,1,Wet,Dark - Street Lights On,,N,0,
2,Overcast,1,Dry,Daylight,,N,0,
3,Clear,1,Dry,Daylight,,N,N,
4,Raining,2,Wet,Daylight,,N,0,


### Counting null values

In [193]:
print(df.isnull().sum())
print(df.shape)

WEATHER             5081
SEVERITYCODE           0
ROADCOND            5012
LIGHTCOND           5170
SPEEDING          185340
HITPARKEDCAR           0
UNDERINFL           4884
INATTENTIONIND    164868
dtype: int64
(194673, 8)
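
The raw counts are easier to judge as fractions of the 194,673 rows: roughly 95% of SPEEDING and about 85% of INATTENTIONIND are missing. A minimal sketch of that calculation follows; the `toy` frame and `missing_share` name are hypothetical stand-ins, and only the two ratios at the end use numbers from the output above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df; the values are illustrative only
toy = pd.DataFrame({
    "WEATHER": ["Clear", "Raining", np.nan, "Overcast"],
    "SPEEDING": [np.nan, "Y", np.nan, np.nan],
})

# isnull() yields booleans, so the column mean is the fraction of missing values
missing_share = toy.isnull().mean()
print(missing_share)

# The same ratio computed from the counts reported above
speeding_share = 185340 / 194673
inattention_share = 164868 / 194673
print(round(speeding_share, 3), round(inattention_share, 3))
```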


### Most NULL values are in the SPEEDING and INATTENTIONIND columns. The unique values listed below show that the only non-null value in these columns is 'Y', so the NULLs should be replaced by 'N'.

In [194]:
print('Unique values for the WEATHER:\n%s\n'% df['WEATHER'].unique())
print('Unique values for the SEVERITYCODE:\n%s\n'% df['SEVERITYCODE'].unique())
print('Unique values for the ROADCOND:\n%s\n'% df['ROADCOND'].unique())
print('Unique values for the LIGHTCOND:\n%s\n'% df['LIGHTCOND'].unique())
print('Unique values for the SPEEDING:\n%s\n'% df['SPEEDING'].unique())
print('Unique values for the HITPARKEDCAR:\n%s\n'% df['HITPARKEDCAR'].unique())
print('Unique values for the UNDERINFL:\n%s\n'% df['UNDERINFL'].unique())
print('Unique values for the INATTENTIONIND:\n%s\n'% df['INATTENTIONIND'].unique())

Unique values for the WEATHER:
['Overcast' 'Raining' 'Clear' nan 'Unknown' 'Other' 'Snowing'
 'Fog/Smog/Smoke' 'Sleet/Hail/Freezing Rain' 'Blowing Sand/Dirt'
 'Severe Crosswind' 'Partly Cloudy']

Unique values for the SEVERITYCODE:
[2 1]

Unique values for the ROADCOND:
['Wet' 'Dry' nan 'Unknown' 'Snow/Slush' 'Ice' 'Other' 'Sand/Mud/Dirt'
 'Standing Water' 'Oil']

Unique values for the LIGHTCOND:
['Daylight' 'Dark - Street Lights On' 'Dark - No Street Lights' nan
 'Unknown' 'Dusk' 'Dawn' 'Dark - Street Lights Off' 'Other'
 'Dark - Unknown Lighting']

Unique values for the SPEEDING:
[nan 'Y']

Unique values for the HITPARKEDCAR:
['N' 'Y']

Unique values for the UNDERINFL:
['N' '0' nan '1' 'Y']

Unique values for the INATTENTIONIND:
[nan 'Y']



### The unique labels show that, for severity, we are dealing only with injuries (2) or property damage (1). We need to remove the 'Other' and 'Unknown' road conditions, as they are not informative enough. The same goes for 'nan', 'Unknown' and 'Other' in the 'LIGHTCOND' and 'WEATHER' columns. The 'SPEEDING' column contains only 'nan' or 'Y', so we can assume that 'nan' can be replaced by 'N'; we treat 'INATTENTIONIND' the same way. In the 'UNDERINFL' column we need to remove the 'nan' values and replace '0' with 'N' and '1' with 'Y'.

In [195]:
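# Work on an explicit copy so the assignments below modify df itself
df = df.copy()
# A missing flag means the factor was not recorded, so treat it as 'N'
df['INATTENTIONIND'] = df['INATTENTIONIND'].fillna('N')
df['SPEEDING'] = df['SPEEDING'].fillna('N')
# UNDERINFL mixes '0'/'1' with 'N'/'Y'; normalize everything to 'N'/'Y'
df['UNDERINFL'] = df['UNDERINFL'].replace({'0': 'N', '1': 'Y'})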

In [196]:
df.head()

Unnamed: 0,WEATHER,SEVERITYCODE,ROADCOND,LIGHTCOND,SPEEDING,HITPARKEDCAR,UNDERINFL,INATTENTIONIND
0,Overcast,2,Wet,Daylight,N,N,N,N
1,Raining,1,Wet,Dark - Street Lights On,N,N,N,N
2,Overcast,1,Dry,Daylight,N,N,N,N
3,Clear,1,Dry,Daylight,N,N,N,N
4,Raining,2,Wet,Daylight,N,N,N,N


In [197]:
# Drop rows with any missing value, then rows with uninformative labels
df = df.dropna()
df = df[~df['ROADCOND'].isin(['Other', 'Unknown'])]
df = df[~df['LIGHTCOND'].isin(['Other', 'Unknown'])]
df = df[~df['WEATHER'].isin(['Other', 'Unknown'])]

In [198]:
df.shape

(169957, 8)

### We exclude incidents involving drugs/alcohol, as their main cause is not related to the weather conditions

In [200]:
df=df.loc[df['UNDERINFL'] == 'N']
df=df.drop(['UNDERINFL'],axis=1)

In [201]:
df.shape

(160977, 7)

### We reduce the light condition labels to 'Day' and 'Night', as the finer distinctions do not add much information to the research

In [202]:
# Collapse the detailed light conditions into two labels: 'Day' and 'Night'
df['LIGHTCOND'] = df['LIGHTCOND'].replace({
    'Daylight': 'Day',
    'Dusk': 'Day',
    'Dark - Street Lights On': 'Night',
    'Dark - No Street Lights': 'Night',
    'Dawn': 'Night',
    'Dark - Street Lights Off': 'Night',
    'Dark - Unknown Lighting': 'Night',
})

In [203]:
df.head()

Unnamed: 0,WEATHER,SEVERITYCODE,ROADCOND,LIGHTCOND,SPEEDING,HITPARKEDCAR,INATTENTIONIND
0,Overcast,2,Wet,Day,N,N,N
1,Raining,1,Wet,Night,N,N,N
2,Overcast,1,Dry,Day,N,N,N
3,Clear,1,Dry,Day,N,N,N
4,Raining,2,Wet,Day,N,N,N


In [206]:
print('Unique values for the WEATHER:\n%s\n'% df['WEATHER'].unique())
print('Unique values for the SEVERITYCODE:\n%s\n'% df['SEVERITYCODE'].unique())
print('Unique values for the ROADCOND:\n%s\n'% df['ROADCOND'].unique())
print('Unique values for the LIGHTCOND:\n%s\n'% df['LIGHTCOND'].unique())
print('Unique values for the SPEEDING:\n%s\n'% df['SPEEDING'].unique())
print('Unique values for the HITPARKEDCAR:\n%s\n'% df['HITPARKEDCAR'].unique())
print('Unique values for the INATTENTIONIND:\n%s\n'% df['INATTENTIONIND'].unique())

Unique values for the WEATHER:
['Overcast' 'Raining' 'Clear' 'Snowing' 'Fog/Smog/Smoke'
 'Sleet/Hail/Freezing Rain' 'Blowing Sand/Dirt' 'Severe Crosswind'
 'Partly Cloudy']

Unique values for the SEVERITYCODE:
[2 1]

Unique values for the ROADCOND:
['Wet' 'Dry' 'Snow/Slush' 'Ice' 'Sand/Mud/Dirt' 'Oil' 'Standing Water']

Unique values for the LIGHTCOND:
['Day' 'Night']

Unique values for the SPEEDING:
['N' 'Y']

Unique values for the HITPARKEDCAR:
['N' 'Y']

Unique values for the INATTENTIONIND:
['N' 'Y']



### Now we need to convert the string feature values into numeric labels

# Reformat the columns into numbers and labels
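
One possible way to do this conversion is sketched below. It is an assumption about how the step could look, not the notebook's own code: the `toy` and `encoded` frames are hypothetical, the rows are made up, and the choice of 0/1 for the two-valued columns is arbitrary.

```python
import pandas as pd

# Toy rows mimicking the cleaned frame; illustrative, not real records
toy = pd.DataFrame({
    "WEATHER": ["Overcast", "Raining", "Clear", "Raining"],
    "ROADCOND": ["Wet", "Wet", "Dry", "Wet"],
    "LIGHTCOND": ["Day", "Night", "Day", "Day"],
    "SPEEDING": ["N", "Y", "N", "N"],
})

encoded = toy.copy()

# Two-valued columns: map each label to 0/1 explicitly
encoded["SPEEDING"] = encoded["SPEEDING"].map({"N": 0, "Y": 1})
encoded["LIGHTCOND"] = encoded["LIGHTCOND"].map({"Day": 0, "Night": 1})

# Multi-valued columns: integer codes assigned in sorted label order
for col in ["WEATHER", "ROADCOND"]:
    encoded[col] = encoded[col].astype("category").cat.codes

print(encoded)
```

Integer codes imply an ordering between labels that the data does not have, so for some models one-hot columns (for example via `pd.get_dummies`) may be a better fit. `LabelEncoder` from the already imported `sklearn.preprocessing` module would be another option for the multi-valued columns.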

# Matplotlib plots for some attributes